Proof beats prompt vibes.
Mesh is benchmarked on four real code-agent task families across five frontier models: needle finding, multi-hop reasoning, budget saturation, and hallucination refusal. Same prompts, same models, same temperature. Capsules vs raw source.
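To make the controlled-variables claim concrete, here is a minimal sketch of what such a harness config could look like. The interface, field names, and model placeholders are assumptions for illustration; only temperature 0.2, the two conditions, and the four task families come from this page.

```ts
// Sketch of the harness settings held fixed across runs. The interface and
// model placeholders are assumptions; temperature 0.2, the two conditions,
// and the four task families are the ones stated on this page.
interface BenchConfig {
  models: string[];     // the five frontier models under test
  temperature: number;  // held constant for every run
  conditions: string[]; // each task runs once per condition
  tasks: string[];      // the four task families
}

const config: BenchConfig = {
  models: ["model-1", "model-2", "model-3", "model-4", "model-5"], // placeholders
  temperature: 0.2,
  conditions: ["capsules", "raw-source"],
  tasks: [
    "needle-finding",
    "multi-hop-reasoning",
    "budget-saturation",
    "hallucination-refusal",
  ],
};
```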
Find the right line in 247 files.
A planted constant, SECRET_KEY = "qz7m2p...", is hidden somewhere in a 247-file workspace. Mesh keeps the structural map in budget, then recovers the exact span on demand. Raw source fits only a fraction of the workspace in the same budget.
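As an illustration of the mechanism, here is a toy version of that probe: a structural index mapping symbol names to exact spans. The index contents, file path, and lookupSymbol function are invented for this sketch.

```ts
// Toy stand-in for the structural map Mesh keeps in budget. The file path
// and line numbers are hypothetical; the point is that a symbol lookup
// returns an exact span instead of requiring a scan of 247 files.
type Span = { file: string; startLine: number; endLine: number };
type ProbeResult = ({ found: true } & Span) | { found: false };

const index = new Map<string, Span>([
  ["SECRET_KEY", { file: "config/settings.py", startLine: 12, endLine: 12 }],
]);

function lookupSymbol(name: string): ProbeResult {
  const span = index.get(name);
  return span ? { found: true, ...span } : { found: false };
}

console.log(lookupSymbol("SECRET_KEY"));  // exact span, recovered on demand
console.log(lookupSymbol("NO_SUCH_KEY")); // { found: false } -- a structural miss
```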
Trace a bug across 4 files.
Each task requires the model to follow a chain of call edges across 3–5 files to identify the root cause. Capsules expose the structural graph, so the model never loses the trail.
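A minimal sketch of what the model is doing on this task, assuming a capsule exposes call edges as a simple list; the function and file names are hypothetical.

```ts
// Illustrative multi-hop trace: follow call edges from an entry point until
// the chain ends. The graph shape is an assumption; the point is that the
// edges are explicit, so no hop is guessed.
type Edge = { caller: string; callee: string; file: string };

const callGraph: Edge[] = [
  { caller: "handleRequest", callee: "parsePayload", file: "api/routes.ts" },
  { caller: "parsePayload", callee: "decodeField", file: "lib/parse.ts" },
  { caller: "decodeField", callee: "toUtf8", file: "lib/encoding.ts" }, // root cause lives here
];

function trace(entry: string): string[] {
  const path: string[] = [entry];
  let current = entry;
  for (;;) {
    const edge = callGraph.find((e) => e.caller === current);
    if (!edge) break;
    path.push(`${edge.callee} (${edge.file})`);
    current = edge.callee;
  }
  return path;
}

console.log(trace("handleRequest").join(" -> "));
```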
Full coverage at 1/5 the cost.
Same coverage, same recall — Mesh saturates a 247-file workspace at 4.5k tokens. Raw source needs 24k tokens to cover the same surface area. Same model, same task, same answer.
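The cost claim reduces to arithmetic on the numbers stated above:

```ts
// 4.5k tokens (capsules) vs 24k tokens (raw source) over the same 247 files.
const files = 247;
const capsuleTokens = 4_500;
const rawTokens = 24_000;

console.log((capsuleTokens / rawTokens).toFixed(2)); // 0.19 -- about 1/5 the cost
console.log(Math.round(capsuleTokens / files));      // ~18 tokens per file, capsules
console.log(Math.round(rawTokens / files));          // ~97 tokens per file, raw source
```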
Zero invented paths.
When asked for a file or symbol that doesn't exist, capsules return a structural miss — no fabricated paths, no fake line numbers. Raw context lets the model invent freely.
{ found: false } on every probe. No invented line numbers.
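A sketch of how that refusal can work mechanically, assuming probes are validated against the structural index before anything is returned; the paths and helper are hypothetical.

```ts
// Refusal by construction: a probed path is only returned if it exists in
// the structural index. File names here are invented for illustration.
const knownFiles = new Set(["src/auth/session.ts", "src/db/pool.ts"]);

function probeFile(path: string): { found: boolean; path?: string } {
  return knownFiles.has(path)
    ? { found: true, path }
    : { found: false }; // a structural miss, never a fabricated location
}

console.log(probeFile("src/auth/session.ts")); // { found: true, path: "src/auth/session.ts" }
console.log(probeFile("src/made/up/file.ts")); // { found: false }
```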
Mesh wins every axis.
Single-fact recall in a 247-file workspace. Mesh's structural index turns every symbol into an addressable target.
No vibes. No cherry-picking.
Every benchmark runs on the same prompt set across the same five frontier models. Capsules vs raw source. Same temperature (0.2), same max-tokens, same retry policy. Results are reproducible — the run scripts ship in mesh bench.
Run them yourself.
Install the CLI and reproduce every number on this page in under three minutes.
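A minimal driver, assuming only what this page states: the CLI is installed and the run scripts ship behind mesh bench.

```ts
// Reproduction sketch. Assumes the mesh CLI is on PATH and that `mesh bench`
// runs without additional flags -- both are assumptions beyond this page.
import { execSync } from "node:child_process";

const report = execSync("mesh bench", { encoding: "utf8" });
console.log(report); // the numbers on this page, regenerated locally
```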