Benchmarks · last run 2026-04-23

Proof beats prompt vibes.

Mesh is benchmarked on four real code-agent task families, run across five frontier models: needle finding, multi-hop reasoning, budget saturation, and hallucination refusal. Same prompts, same models, same temperature: capsules vs raw source.

See results
compression · More files indexed per token budget.
accuracy · NIAH recall on the standard track.
savings · Fewer tokens for full-repo coverage.
safety · Hallucinated paths in span recovery.
A1 · Needle in a Haystack

Find the right line in 247 files.

A planted constant — SECRET_KEY = "qz7m2p..." — is hidden somewhere in a 247-file workspace. Mesh keeps the structural map in budget, then recovers the exact span on demand. Raw source can fit only a fraction of the workspace in the same budget.
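To make the scoring concrete, here is a minimal sketch of how a needle probe can be planted and graded. The function names and workspace shape are illustrative stand-ins, not the actual mesh bench harness:

```python
import random

def plant_needle(files: dict[str, list[str]], needle: str) -> tuple[str, int]:
    """Insert the needle on a random line of a random file; return its location."""
    path = random.choice(list(files))
    line_no = random.randrange(len(files[path]) + 1)
    files[path].insert(line_no, needle)
    return path, line_no

def score_recall(answers: list[tuple[str, int]], truth: list[tuple[str, int]]) -> float:
    """Fraction of probes where the model named the exact file and line."""
    hits = sum(a == t for a, t in zip(answers, truth))
    return hits / len(truth)

# Toy run: one workspace, one needle, one correct answer.
ws = {"config.py": ["import os", "DEBUG = False"], "app.py": ["def main(): ..."]}
loc = plant_needle(ws, 'SECRET_KEY = "qz7m2p..."')
print(score_recall([loc], [loc]))  # 1.0
```

Recall is all-or-nothing per probe: a near-miss on the line number scores zero, which is what makes the hard and extreme tracks punishing.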

haystack · 247 files · 18,432 LOC · Mesh vs raw source
needle recall ↓
Standard track · 32k ctx · 6k budget: Mesh 100% · Raw 40%
Hard track · 32k ctx · 8k budget: Mesh 92% · Raw 20%
Extreme track · 32k ctx · 4k budget: Mesh 84% · Raw 6%
A2 · Multi-hop reasoning

Trace a bug across 4 files.

Each task requires the model to follow a chain of call edges across 3–5 files to identify the root cause. Capsules expose the structural graph, so the model never loses the trail.

mesh / trace-studio
A3 · Budget Saturation

Full coverage at 1/5 the cost.

Same coverage, same recall — Mesh saturates a 247-file workspace at 4.5k tokens. Raw source needs 24k tokens to cover the same surface area. Same model, same task, same answer.

Source workspace · 247 files · 18,432 LOC
Streaming all symbols + signatures into a fixed 24k token budget. Watch what fits.
Mesh capsules · 19% of budget used
Covers all 247 files · 100% recall. 4.5k tokens ship the structural index for the whole repo; ~19.5k tokens of headroom remain for the model's own reasoning.
Raw source · budget exhausted
Same 247 files · 5.3× more expensive. 24k tokens flood the entire context, leaving zero headroom for reasoning: the model has to choose between seeing the code or thinking about it.
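The headroom and cost figures above fall straight out of the two budgets:

```python
BUDGET = 24_000  # fixed context window for the task (tokens)
MESH = 4_500     # capsule index covering all 247 files
RAW = 24_000     # raw source needed for the same coverage

print(f"{MESH / BUDGET:.0%} of budget used")    # 19% of budget used
print(f"{BUDGET - MESH:,} tokens of headroom")  # 19,500 tokens of headroom
print(f"{RAW / MESH:.1f}x more expensive")      # 5.3x more expensive
```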
A4 · Hallucination refusal

Zero invented paths.

When asked for a file or symbol that doesn't exist, capsules return a structural miss — no fabricated paths, no fake line numbers. Raw context lets the model invent freely.
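Scoring this is mechanical: a response that claims to have found a path that doesn't exist in the workspace counts as a hallucination. A minimal sketch, assuming a hypothetical { found, path } response shape rather than the real probe format:

```python
def hallucination_rate(responses: list[dict], real_paths: set[str]) -> float:
    """Fraction of negative probes answered with a fabricated path instead of a miss."""
    fabricated = sum(
        1 for r in responses
        if r.get("found") and r.get("path") not in real_paths
    )
    return fabricated / len(responses)

probes = [
    {"found": False},                          # structural miss: the correct answer
    {"found": True, "path": "src/ghost.py"},   # fabricated path: counts against
]
print(hallucination_rate(probes, real_paths={"src/app.py"}))  # 0.5
```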

refusal · 200 negative probes · span / symbol / path
Mesh · 0%
Returned { found: false } on every probe. No invented line numbers.
Raw source · 17%
34 of 200 prompts returned a fabricated file path or line number. The rate is even higher on smaller models.
Recovered spans · 100%
Every span Mesh returned matched the real file byte-for-byte, verified against the git SHA.
Refusal precision · 1.00
When Mesh said "not found", it was always correct: zero false refusals across 200 probes.
Full scorecard · 8 axes × 5 models

Mesh wins every axis.

Selected axis · NIAH · standard
Capsules 100% · Raw 40% · ▲ +60 points vs raw

Single-fact recall in a 247-file workspace. Mesh's structural index turns every symbol into an addressable target.

How we tested

No vibes. No cherry-picking.

Every benchmark runs on the same prompt set across the same five frontier models. Capsules vs raw source. Same temperature (0.2), same max-tokens, same retry policy. Results are reproducible — the run scripts ship in mesh bench.
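For illustration, a single benchmark call with those fixed settings might look like the sketch below; `call_model` and the backoff policy are hypothetical stand-ins, not the actual mesh bench internals:

```python
import time

def run_task(call_model, prompt: str, *,
             temperature: float = 0.2, max_tokens: int = 1024, retries: int = 3):
    """One benchmark call with fixed sampling settings and a bounded retry policy."""
    for attempt in range(retries):
        try:
            return call_model(prompt, temperature=temperature, max_tokens=max_tokens)
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries: surface the failure instead of skewing results
            time.sleep(0.1 * 2 ** attempt)  # brief exponential backoff between retries
```

Pinning temperature, max-tokens, and the retry policy in one place is what makes the capsule and raw-source runs directly comparable.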

~/mesh/bench-runner · replaying a real run · 247 tasks

Run them yourself.

Install the CLI and reproduce every number on this page in under three minutes.