Benchmarks · last run 2026-04-23

Proof beats prompt vibes.

Mesh is benchmarked on four real code-agent task families, run across five frontier models: needle finding, multi-hop reasoning, budget saturation, and hallucination refusal. Same prompts, same models, same temperature: capsules vs raw source.

See results
compression · More files indexed per token budget.
accuracy · NIAH recall on the standard track.
savings · Fewer tokens for full-repo coverage.
safety · Hallucinated paths in span recovery.
A1 · Needle in a Haystack

Find the right line in 247 files.

A planted constant — SECRET_KEY = "qz7m2p..." — is hidden somewhere in a 247-file workspace. Mesh keeps the structural map in budget, then recovers the exact span on demand. Raw source can fit only a fraction of the workspace in the same budget.
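To make the scoring concrete, here is a minimal sketch of how a needle probe can be planted and graded. The function names and workspace shape are illustrative stand-ins, not the actual mesh bench harness:

```python
import random

def plant_needle(files: dict[str, list[str]], needle: str) -> tuple[str, int]:
    """Insert the needle on a random line of a random file; return its location."""
    path = random.choice(list(files))
    line_no = random.randrange(len(files[path]) + 1)
    files[path].insert(line_no, needle)
    return path, line_no

def score_recall(answers: list[tuple[str, int]], truth: list[tuple[str, int]]) -> float:
    """Fraction of probes where the model named the exact file and line."""
    hits = sum(a == t for a, t in zip(answers, truth))
    return hits / len(truth)

# Toy run: one workspace, one needle, one correct answer.
ws = {"config.py": ["import os", "DEBUG = False"], "app.py": ["def main(): ..."]}
loc = plant_needle(ws, 'SECRET_KEY = "qz7m2p..."')
print(score_recall([loc], [loc]))  # 1.0
```

Recall is all-or-nothing per probe: a near-miss on the line number scores zero, which is what makes the hard and extreme tracks punishing.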

haystack · 247 files · 18,432 LOC · Mesh vs raw source
needle recall ↓
Standard track · 32k ctx · 6k budget: Mesh 100% · Raw 40%
Hard track · 32k ctx · 8k budget: Mesh 92% · Raw 20%
Extreme track · 32k ctx · 4k budget: Mesh 84% · Raw 6%
A2 · Multi-hop reasoning

Trace a bug across 4 files.

Each task requires the model to follow a chain of call edges across 3–5 files to identify the root cause. Capsules expose the structural graph, so the model never loses the trail.

mesh / trace-studio
A3 · Budget Saturation

Full coverage at 1/5 the cost.

Same coverage, same recall — Mesh saturates a 247-file workspace at 4.5k tokens. Raw source needs 24k tokens to cover the same surface area. Same model, same task, same answer.

Source workspace · 247 files · 18,432 LOC
Streaming all symbols + signatures into a fixed 24k token budget. Watch what fits.
Mesh capsules · 19% of budget used
Covers all 247 files · 100% recall. 4.5k tokens ship the structural index for the whole repo; ~19.5k tokens of headroom remain for the model's own reasoning.
Raw source · budget exhausted
Same 247 files · 5.3× more expensive. 24k tokens flood the entire context, leaving zero headroom for reasoning: the model has to choose between seeing the code or thinking about it.
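The headroom and cost figures above fall straight out of the two budgets:

```python
BUDGET = 24_000  # fixed context window for the task (tokens)
MESH = 4_500     # capsule index covering all 247 files
RAW = 24_000     # raw source needed for the same coverage

print(f"{MESH / BUDGET:.0%} of budget used")    # 19% of budget used
print(f"{BUDGET - MESH:,} tokens of headroom")  # 19,500 tokens of headroom
print(f"{RAW / MESH:.1f}x more expensive")      # 5.3x more expensive
```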
A4 · Hallucination refusal

Zero invented paths.

When asked for a file or symbol that doesn't exist, capsules return a structural miss — no fabricated paths, no fake line numbers. Raw context lets the model invent freely.
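Scoring this is mechanical: a response that claims to have found a path that doesn't exist in the workspace counts as a hallucination. A minimal sketch, assuming a hypothetical { found, path } response shape rather than the real probe format:

```python
def hallucination_rate(responses: list[dict], real_paths: set[str]) -> float:
    """Fraction of negative probes answered with a fabricated path instead of a miss."""
    fabricated = sum(
        1 for r in responses
        if r.get("found") and r.get("path") not in real_paths
    )
    return fabricated / len(responses)

probes = [
    {"found": False},                          # structural miss: the correct answer
    {"found": True, "path": "src/ghost.py"},   # fabricated path: counts against
]
print(hallucination_rate(probes, real_paths={"src/app.py"}))  # 0.5
```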

refusal · 200 negative probes · span / symbol / path
Mesh · 0%
Returned { found: false } on every probe. No invented line numbers.
Raw source · 17%
34 of 200 prompts returned a fabricated file path or line number. The rate is even higher on smaller models.
Recovered spans · 100%
Every span Mesh returned matched the real file byte-for-byte, verified against the git SHA.
Refusal precision · 1.00
When Mesh said "not found", it was always correct: zero false refusals across 200 probes.
Full scorecard · 8 axes × 5 models

Mesh wins every axis.

Selected axis · NIAH · standard
Capsules 100% · Raw 40% · ▲ +60 points vs raw

Single-fact recall in a 247-file workspace. Mesh's structural index turns every symbol into an addressable target.

How we tested

No vibes. No cherry-picking.

Every benchmark runs on the same prompt set across the same five frontier models. Capsules vs raw source. Same temperature (0.2), same max-tokens, same retry policy. Results are reproducible — the run scripts ship in mesh bench.
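For illustration, a single benchmark call with those fixed settings might look like the sketch below; `call_model` and the backoff policy are hypothetical stand-ins, not the actual mesh bench internals:

```python
import time

def run_task(call_model, prompt: str, *,
             temperature: float = 0.2, max_tokens: int = 1024, retries: int = 3):
    """One benchmark call with fixed sampling settings and a bounded retry policy."""
    for attempt in range(retries):
        try:
            return call_model(prompt, temperature=temperature, max_tokens=max_tokens)
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries: surface the failure instead of skewing results
            time.sleep(0.1 * 2 ** attempt)  # brief exponential backoff between retries
```

Pinning temperature, max-tokens, and the retry policy in one place is what makes the capsule and raw-source runs directly comparable.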

~/mesh/bench-runner · replaying a real run · 247 tasks

Run them yourself.

Install the CLI and reproduce every number on this page in under three minutes.