Tamp v0.8: Measured Token Savings for Agentic Coding Assistants

Abstract. Tamp v0.8 ships a seventeen-stage HTTP-proxy compression pipeline exposed as a nine-level ladder (L1–L9). In a controlled live sweep of 12 scenarios × 18 configurations = 216 A/B comparisons routed through OpenRouter and judged by Claude Haiku 4.5, every configuration preserves task-completion quality on every task (216/216). At the balanced default (L5) we measure 45.34% bytes / 47.56% tokens saved; the top of the ladder (L9) reaches 45.39% / 47.61%. The v0.5 baseline reaches 45.15% / 47.35%, so L9 improves on it by only +0.24 percentage points in bytes (+0.26 in tokens). That small gap reflects the benchmark rather than the stages: four of v0.8's new stages are session-scoped and invisible to single-turn micro-fixtures.

Quality retention: 100% (216/216 A/B tasks). Every compression config, every scenario, preserves task-completion accuracy under an independent judge.

Level Ladder (L1–L9)

Lossless floor, lossy ceiling, and the L4→L5 inflection.
Level	Headline Stages	Bytes Saved	Tokens Saved	Lossy
L1	minify	18.84%	25.92%	no
L2	+ whitespace, strip-lines	19.37%	26.49%	no
L3	+ cmd-strip	19.57%	26.71%	no
L4	+ dedup, diff	19.57%	26.71%	no
L5	+ read-diff, prune, toon	45.34%	47.56%	no*
L6	+ llmlingua	45.34%	47.56%	yes
L7	+ graph, br-cache	45.34%	47.56%	yes
L8	+ strip-comments, textpress	45.39%	47.61%	yes
L9	+ disclosure, bm25-trim, foundation-models	45.39%	47.61%	yes

Figure 1. Tokens saved by ladder level. The lossless floor (L1–L4) caps at ~27%; L5 unlocks the 47.6% tier by enabling TOON, prune, and read-diff simultaneously.
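The L4→L5 inflection can be read directly off the table by computing each level's marginal gain over the previous one. The sketch below does that with the tokens-saved column copied from the ladder; the helper names (marginalGains, largestStep) are illustrative and are not part of Tamp.

```javascript
// Tokens-saved percentages copied from the ladder table above.
const ladder = [
  { level: "L1", tokensSaved: 25.92 },
  { level: "L2", tokensSaved: 26.49 },
  { level: "L3", tokensSaved: 26.71 },
  { level: "L4", tokensSaved: 26.71 },
  { level: "L5", tokensSaved: 47.56 },
  { level: "L6", tokensSaved: 47.56 },
  { level: "L7", tokensSaved: 47.56 },
  { level: "L8", tokensSaved: 47.61 },
  { level: "L9", tokensSaved: 47.61 },
];

// Marginal gain of each level over the previous one, in percentage points.
// L1 has no predecessor, so its gain is its own value.
function marginalGains(rows) {
  return rows.map((row, i) => ({
    level: row.level,
    gain: i === 0 ? row.tokensSaved : row.tokensSaved - rows[i - 1].tokensSaved,
  }));
}

// Largest step between consecutive levels (L1 is skipped).
function largestStep(rows) {
  return marginalGains(rows)
    .slice(1)
    .reduce((best, cur) => (cur.gain > best.gain ? cur : best));
}

const step = largestStep(ladder);
console.log(`${step.level}: +${step.gain.toFixed(2)} pp`); // prints "L5: +20.85 pp"
```

Every other step on the ladder is below one percentage point, which is why L5 is the natural default.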

Methodology

We evaluated 18 compression configurations (3 presets, 1 v0.5 whitelist baseline, 5 leave-one-out variants, 9 ladder levels) against 12 scenarios in live A/B mode. Each (config, scenario) pair issued two real calls through OpenRouter using anthropic/claude-haiku-4.5 as the judge so that token accounting and quality verdicts come from a model separate from the compression target. Payloads covered small and large JSON, tabular data, source code, line-numbered Read output, errors, multi-turn dialogues, lockfiles, and a duplicate-read fixture. The full protocol, fixtures, and 216 A/B outcomes are committed at bench/results/level-sweep.json.
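As a quick check on the sweep size, the counts above can be multiplied out. This is only arithmetic on the figures stated in this section; the object keys are placeholder labels, and reading the "two real calls" per pair as one control and one treatment call is an assumption based on the A/B framing.

```javascript
// Configuration families as counted in the methodology; keys are placeholder labels.
const families = { presets: 3, v05Baseline: 1, leaveOneOut: 5, ladderLevels: 9 };
const scenarios = 12;
const callsPerPair = 2; // assumed: one control call and one treatment call

const configs = Object.values(families).reduce((sum, n) => sum + n, 0);
const pairs = configs * scenarios;  // (config, scenario) A/B pairs
const calls = pairs * callsPerPair; // individual live requests

console.log({ configs, pairs, calls }); // { configs: 18, pairs: 216, calls: 432 }
```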

Beyond the three presets, the configurations fall into three cuts. The level ladder (L1–L9) builds up stage sets cumulatively. The leave-one-out sweep drops one of the new-in-v0.8 stages at a time from the aggressive preset to measure its marginal contribution. The v0.5 baseline reproduces the v0.5-era stage list as a regression anchor. Quality is scored by a qualityOK match function applied to the control and treatment responses; any failure would be reported here, and none occurred.
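A leave-one-out sweep of this kind can be sketched as follows. The stage names are taken from the ladder table and the leave-one-out findings; the assumption that the aggressive preset enables all seventeen stages, the leaveOneOut helper, and the variant naming scheme are hypothetical and may not match how bench/runner.js builds its configurations.

```javascript
// Assumed stand-in for the aggressive preset: every stage named in the ladder table.
const aggressiveStages = [
  "minify", "whitespace", "strip-lines", "cmd-strip", "dedup", "diff",
  "read-diff", "prune", "toon", "llmlingua", "graph", "br-cache",
  "strip-comments", "textpress", "disclosure", "bm25-trim", "foundation-models",
];

// The five new-in-v0.8 stages named in the leave-one-out findings.
const newInV08 = ["cmd-strip", "read-diff", "br-cache", "disclosure", "bm25-trim"];

// Build one variant per candidate stage, each identical to the base
// configuration except that the candidate has been removed.
function leaveOneOut(baseStages, candidates) {
  return candidates.map((dropped) => ({
    name: `aggressive-minus-${dropped}`, // hypothetical naming scheme
    dropped,
    stages: baseStages.filter((stage) => stage !== dropped),
  }));
}

const variants = leaveOneOut(aggressiveStages, newInV08);
for (const v of variants) {
  console.log(v.name, v.stages.length);
}
```

The marginal contribution of a stage is then the savings of the full preset minus the savings of the variant that drops it.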

Leave-One-Out Findings

Only cmd-strip shows a measurable marginal contribution (−0.20% bytes when dropped from aggressive). read-diff, br-cache, disclosure, and bm25-trim all register exactly 0.00% delta. A zero delta here does not mean these stages do nothing: they are session-scoped and fire only across multiple requests (cache hits, re-reads, progressive reveal, ranked retrieval from history), and single-turn A/B fixtures cannot exercise those pathways.
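The toy stage below illustrates why a session-scoped stage shows a 0.00% delta on a single-turn fixture. It is not Tamp's read-diff implementation; makeRereadStage, the marker text, and the file path are invented for illustration. The point is only that state carried between requests has nothing to act on when each fixture starts a fresh session.

```javascript
// Toy session-scoped stage: replaces an unchanged re-read with a short marker.
function makeRereadStage() {
  const seen = new Map(); // path -> content seen earlier in this session
  return function compress(path, content) {
    if (seen.get(path) === content) {
      return `[unchanged since last read: ${path}]`;
    }
    seen.set(path, content);
    return content; // first read passes through untouched
  };
}

const body = "const x = 1;\n".repeat(200);

// Single-turn fixture: a fresh session sees the file once, so nothing is saved.
const singleTurn = makeRereadStage();
const singleTurnOut = singleTurn("src/app.js", body);

// Session replay: the same session reads the file twice; the second read shrinks.
const session = makeRereadStage();
session("src/app.js", body);
const rereadOut = session("src/app.js", body);

console.log(singleTurnOut.length === body.length); // true
console.log(rereadOut.length < body.length);       // true
```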

Limitations

The 12 scenarios are single-turn and below ~18 KB. Four of the five new v0.8 stages are session-scoped and cannot be measured on this corpus, so the L9 vs v0.5 baseline delta (+0.24 percentage points in bytes saved) is an artifact of that measurement gap, not a claim that the new stages are idle. Session-replay fixtures that exercise re-reads, Brotli-cache hits, and multi-turn disclosure are the top item on the next benchmark iteration. Results also depend on a single judge (Claude Haiku 4.5) and a single provider route (OpenRouter); cross-judge triangulation remains future work.
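For reference, the L9 vs v0.5 baseline delta follows directly from the figures reported in the abstract. The variable names below are illustrative only.

```javascript
// Savings percentages as reported in the abstract.
const l9 = { bytes: 45.39, tokens: 47.61 };
const v05 = { bytes: 45.15, tokens: 47.35 };

// Difference in percentage points, rounded to two decimals for display.
const deltaBytes = (l9.bytes - v05.bytes).toFixed(2);
const deltaTokens = (l9.tokens - v05.tokens).toFixed(2);

console.log(`bytes: +${deltaBytes} pp, tokens: +${deltaTokens} pp`);
// prints "bytes: +0.24 pp, tokens: +0.26 pp"
```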

Full paper: whitepaper.pdf · Raw data: bench/results/level-sweep.json · Reproduce: node bench/runner.js --sweep --live