Benchmarks · last run 2026-05-27 · CLI 0.4.37

Proof beats
prompt vibes.

Mesh is benchmarked on real code-agent tasks across two frontier models (Claude Opus 4.6, Gemini 3.5 Flash): needle finding, repo-wide audits, edit work, hallucination refusal. Same prompts, same models, same temperature — capsules vs raw source.

See results
−74%
Tokens vs raw on repo lookup & understanding.
compression
100%
NIAH recall on the standard track.
accuracy
−38%
Tokens vs raw on repo-wide audit work.
savings
0%
Hallucinated paths in observed runs.
safety
A1 · Needle in a Haystack

Find the right line in 247 files.

A planted constant — SECRET_KEY = "qz7m2p..." — is hidden somewhere in a 247-file workspace. Mesh keeps the structural map in budget, then recovers the exact span on demand. Raw source can only fit a fraction.

haystack · 247 files · 18,432 LOCMesh vs raw source
needle ↓
Standard track20–180 file workspaces
Mesh
100%
Raw
33%
Hard track500+ files · projected
Mesh
90%
Raw
10%
Extreme1000+ files · projected
Mesh
82%
Raw
4%
A2 · Multi-hop reasoning

Trace a bug across 4 files.

Each task requires the model to follow a call edge across 3–5 files to identify the root cause. Capsules expose the structural graph, so the model never loses the trail.

mesh / trace-studio
A3 · Per-turn payload

Same answer. Less payload.

Repo lookup & understanding (NIAH, locate-file, structural Q&A): Mesh capsules answer at ~2.4k tokens per turn; raw needs ~9.3k. That's −74% input for the same answer, on the same model — every turn, all session long.

Repo-wide audit work (multi-file traces, "find every X" prompts on a 543-file workspace): Mesh ships at ~25k tokens / turn; raw needs ~40k. That's −38% on median, ranging from −47% best run to small regressions on prompts where the model just needs grep.

This isn't about running out of context — modern models have 200K–2M token windows. It's about the cost and latency of re-shipping the workspace every turn. Across a 50-turn session, raw burns ~465k input tokens; Mesh ships ~120k for the same task.

Small repos: capsules build locally on your CPU in seconds. Large repos (10k+ files): processed in our GCP enclave via secure data-stream — your code never sits on disk, results stream back as they're built.

Input tokens per turn — NIAH lookup, same model 0 · · · 5k · · · 10k
Raw source re-ships workspace every turn
0tok
Mesh capsules structural recall, same answer
0tok
−74% per turn
Compounded over a 50-turn session
465kraw input shipped
120kmesh input shipped
·
−345ktokens saved
A4 · Hallucination refusal

Zero invented paths.

When asked for a file or symbol that doesn't exist, capsules return a structural miss — no fabricated paths, no fake line numbers. Raw context lets the model invent freely.

refusal · negative probes in observed runsspan · symbol · path
Mesh
0%
Returned { found: false } on every observed miss. No invented line numbers across our test set.
Raw source
~17%
Raw context returned fabricated file paths or line numbers on ~1 in 6 negative probes. Smaller models drift further.
Recovered spans
100%
Every span Mesh returned matched the real file byte-for-byte. Verified against git SHA.
Refusal precision
100%
When Mesh said "not found", it was always correct in observed runs. Zero false negatives so far.
Full scorecard · 8 axes · capsules vs raw

Mesh wins every axis.

Recall, hallucination, span recovery, token cost and latency are measured on our bench harness. Hard-track and multi-hop axes are projected from how capsule lookup and the call-graph index scale — bench suite for them ships next.

Capsules Raw source
Selected axis
NIAH · standard
100%
▲ +67 vs raw
Capsules
100%
Raw
33%

Single-fact recall in a 247-file workspace. Mesh's structural index turns every symbol into an addressable target. Measured 2026-04-23 on Claude Opus 4.6.

How we tested

No vibes. No cherry-picking.

Every benchmark runs on the same prompt set across the same frontier models — Claude Opus 4.6 and Gemini 3.5 Flash. Capsules vs raw source. Same temperature, same max-tokens, same retry policy. Results are reproducible — the run scripts ship in mesh bench on CLI 0.4.37.

Compute model: small repos (under ~5k files) build capsules locally on your CPU in seconds — no network round-trip. Large repos (10k+ files) process through our GCP enclave via secure data-stream: source never sits on disk, capsules stream back as they're built, encrypted in transit and at rest. Same numbers either way.

~/mesh / bench-runner replaying real run · 60+ tasks
replay

Run them yourself.

Install the CLI and reproduce every number on this page in under three minutes.