Proof beats
prompt vibes.
Mesh is benchmarked on real code-agent tasks across two frontier models (Claude Opus 4.6, Gemini 3.5 Flash): needle finding, repo-wide audits, edit work, hallucination refusal. Same prompts, same models, same temperature — capsules vs raw source.
Find the right line in 247 files.
A planted constant — SECRET_KEY = "qz7m2p..." — is hidden somewhere in a 247-file workspace. Mesh keeps the structural map in budget, then recovers the exact span on demand. Raw source can only fit a fraction.
Trace a bug across 4 files.
Each task requires the model to follow a call edge across 3–5 files to identify the root cause. Capsules expose the structural graph, so the model never loses the trail.
Same answer. Less payload.
Repo lookup & understanding (NIAH, locate-file, structural Q&A): Mesh capsules answer at ~2.4k tokens per turn; raw needs ~9.3k. That's −74% input for the same answer, on the same model — every turn, all session long.
Repo-wide audit work (multi-file traces, "find every X" prompts on a 543-file workspace): Mesh ships at ~25k tokens / turn; raw needs ~40k. That's −38% on median, ranging from −47% best run to small regressions on prompts where the model just needs grep.
This isn't about running out of context — modern models have 200K–2M token windows. It's about the cost and latency of re-shipping the workspace every turn. Across a 50-turn session, raw burns ~465k input tokens; Mesh ships ~120k for the same task.
Small repos: capsules build locally on your CPU in seconds. Large repos (10k+ files): processed in our GCP enclave via secure data-stream — your code never sits on disk, results stream back as they're built.
Zero invented paths.
When asked for a file or symbol that doesn't exist, capsules return a structural miss — no fabricated paths, no fake line numbers. Raw context lets the model invent freely.
{ found: false } on every observed miss. No invented line numbers across our test set.Mesh wins every axis.
Recall, hallucination, span recovery, token cost and latency are measured on our bench harness. Hard-track and multi-hop axes are projected from how capsule lookup and the call-graph index scale — bench suite for them ships next.
Single-fact recall in a 247-file workspace. Mesh's structural index turns every symbol into an addressable target. Measured 2026-04-23 on Claude Opus 4.6.
No vibes. No cherry-picking.
Every benchmark runs on the same prompt set across the same frontier models — Claude Opus 4.6 and Gemini 3.5 Flash. Capsules vs raw source. Same temperature, same max-tokens, same retry policy. Results are reproducible — the run scripts ship in mesh bench on CLI 0.4.37.
Compute model: small repos (under ~5k files) build capsules locally on your CPU in seconds — no network round-trip. Large repos (10k+ files) process through our GCP enclave via secure data-stream: source never sits on disk, capsules stream back as they're built, encrypted in transit and at rest. Same numbers either way.
Run them yourself.
Install the CLI and reproduce every number on this page in under three minutes.