Over the last week, I implemented a strict measurement framework inside Zaguán Coder Daemon to answer a simple question:
Do context compression and selective retrieval actually reduce real model costs?
Instead of speculating, I built a controlled synthetic test to find out. Here is the raw data, the methodology, and what it actually means for real-world production workloads.
The Methodology: Naive vs Compressed Mode
To get accurate numbers, I ran 250 turns in a controlled environment, comparing two modes:
- A config-driven naive passthrough mode
- A compressed mode (our current production logic)
The testing framework included per-turn provider token accounting plus latency and tool-call tracking. I ran 5 full 50-turn runs per mode using identical A/B synthetic workflows: no mid-test tuning, no parameter changes, the exact same model, seed, and workflow. Even with identical seeds and workflows, model trajectories diverged, which reinforces the need for averaging across multiple runs.
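The accounting layer can be pictured as a small per-turn ledger. The sketch below is illustrative only (all names are hypothetical, not the Zaguán implementation): it records provider-reported token counts, latency, and tool calls for each turn, then aggregates them per run.

```python
from dataclasses import dataclass, field

@dataclass
class TurnRecord:
    # Provider-reported counts for a single turn.
    input_tokens: int
    output_tokens: int
    latency_ms: float
    tool_calls: int

@dataclass
class RunLog:
    mode: str                       # "naive" or "compressed"
    turns: list = field(default_factory=list)

    def record(self, turn: TurnRecord) -> None:
        self.turns.append(turn)

    def totals(self) -> dict:
        # Aggregate the per-turn records into run-level figures.
        return {
            "input_tokens": sum(t.input_tokens for t in self.turns),
            "output_tokens": sum(t.output_tokens for t in self.turns),
            "tool_calls": sum(t.tool_calls for t in self.turns),
            "mean_latency_ms": (
                sum(t.latency_ms for t in self.turns) / len(self.turns)
                if self.turns else 0.0
            ),
        }

# Example: two synthetic turns in one run.
log = RunLog(mode="naive")
log.record(TurnRecord(12_000, 800, 950.0, 2))
log.record(TurnRecord(14_500, 650, 1_100.0, 1))
print(log.totals()["input_tokens"])  # 26500
```

The key design point is that the counts come from the provider's own usage reporting rather than a local tokenizer estimate, so both modes are measured by the same authoritative source.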
Aggregate Results (5 Runs, 250 Turns)
The token reduction was measurable and significant. Here are the raw numbers across the entire test:
🟢 Naive Mode (Total)
- Input tokens: 15,082,481
- Output tokens: 184,992
🔵 Compressed Mode (Total)
- Input tokens: 10,406,237
- Output tokens: 127,229
The Token Reduction
Input tokens dropped from 15.08M to 10.41M, which is a reduction of 4,676,244 input tokens.
To provide full transparency, here is the breakdown across all 5 runs:
| Run | 🟢 Naive Input | 🔵 Compressed Input | Savings | Impact Scale |
|---|---|---|---|---|
| 1 | 3.59M | 2.16M | 40% | ████████░░ |
| 2 | 3.22M | 1.66M | 48% | █████████░ |
| 3 | 2.15M | 2.23M | -3% | ⚠️ lean run |
| 4 | 2.95M | 1.66M | 44% | █████████░ |
| 5 | 3.15M | 2.68M | 15% | ███░░░░░░░ |
Across all runs combined, this amounts to ≈ 31% lower total token cost for the full test set.
Savings were higher in long, context-heavy trajectories (reaching 40–48%). In one lean trajectory (Run 3), compression provided no savings and slightly increased input tokens. This reinforces that the system prevents bloat rather than magically reducing already-lean sessions. On average, the reduction stabilized around one third.
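The per-run figures can be recomputed directly from the table. The short script below uses the rounded per-million values shown above, so individual percentages can differ by a point from the exact token counts (e.g. Run 3 rounds to -4% here versus -3% from the raw counts):

```python
# Per-run input tokens (millions), taken from the table above.
runs = [
    ("Run 1", 3.59, 2.16),
    ("Run 2", 3.22, 1.66),
    ("Run 3", 2.15, 2.23),
    ("Run 4", 2.95, 1.66),
    ("Run 5", 3.15, 2.68),
]

for name, naive, compressed in runs:
    saving = (naive - compressed) / naive
    print(f"{name}: {saving:+.0%}")

total_naive = sum(n for _, n, _ in runs)
total_comp = sum(c for _, _, c in runs)
print(f"Aggregate: {(total_naive - total_comp) / total_naive:.0%}")
```

Note that the aggregate savings (≈31%) is computed on pooled totals, not as a mean of the per-run percentages, so long heavy runs weigh more than the lean one.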
Cost Impact
To put this into perspective, here is the impact using two real pricing models:
GPT-5.3-Codex ($1.75 / MTok input, $14 / MTok output)
- Naive: $28.98
- Compressed: $19.99
- Savings: $8.99 (≈31%)
Claude Opus 4.6 ($5 / MTok input, $25 / MTok output)
- Naive: $80.04
- Compressed: $55.21
- Savings: $24.83 (≈31%)
To see why this matters at scale: at 1,000 similar 50-turn sessions per month, this would reduce GPT-5.3 costs from $5,796 to $3,998, a roughly $1,800 monthly reduction.
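These dollar figures follow mechanically from the token totals and the per-MTok prices quoted above. A minimal sketch of the arithmetic (the `session_cost` helper is illustrative, not part of the daemon):

```python
def session_cost(input_tokens: int, output_tokens: int,
                 in_price: float, out_price: float) -> float:
    """Cost in dollars, with prices quoted per million tokens (MTok)."""
    return input_tokens / 1e6 * in_price + output_tokens / 1e6 * out_price

# Totals from the 250-turn test, priced at the GPT-5.3-Codex rates above.
naive = session_cost(15_082_481, 184_992, 1.75, 14.0)
compressed = session_cost(10_406_237, 127_229, 1.75, 14.0)
print(f"naive=${naive:.2f} compressed=${compressed:.2f} "
      f"savings={1 - compressed / naive:.0%}")
# naive=$28.98 compressed=$19.99 savings=31%
```

Swapping in the Claude Opus 4.6 rates ($5 / MTok input, $25 / MTok output) reproduces the $80.04 / $55.21 pair the same way.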
What This Actually Means
The biggest takeaway is that savings scale with session length.
When naive sessions balloon (re-sending redundant context repeatedly), compression and selective retrieval significantly reduce input tokens. When sessions remain lean, savings are minimal.
This suggests that Zaguán Coder Daemon is not “always cheaper.” Instead, it is protective against context bloat in long-running coding sessions. That is a much more honest—and useful—positioning.
Latency, Stability, and How It Works
Crucially, compression did not introduce systemic instability. Across the 5 runs there was no consistent latency penalty; latency variance appeared more correlated with provider conditions than with compression mode. There was no catastrophic increase in tool failures, and compressed mode sometimes even reduced tool calls.
This works because the system doesn’t rely on lossy summarization. Instead, it uses:
- Context caching
- Artifact referencing
- On-demand retrieval (ZLP — Zaguán Language Protocol)
- Avoiding redundant full-history resends
The longer a session runs, the more redundant context naive mode resends—and the more compression can eliminate.
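To make the mechanism concrete, here is a toy sketch of the resend-avoidance idea, not the actual ZLP implementation: an artifact is sent in full the first time it appears, and subsequent turns send only a short content-digest reference that the retrieval layer can resolve on demand.

```python
import hashlib

class ContextStore:
    """Toy artifact store: full content on first use, a short reference after."""

    def __init__(self):
        self._sent: dict = {}  # digest -> full content, available for retrieval

    def render(self, name: str, content: str) -> str:
        digest = hashlib.sha256(content.encode()).hexdigest()[:12]
        if digest in self._sent:
            # Already in context: emit a tiny reference instead of the body.
            return f"<artifact ref={digest} name={name}>"
        self._sent[digest] = content
        return f"<artifact id={digest} name={name}>\n{content}\n</artifact>"

store = ContextStore()
big_file = "def main():\n    ...\n" * 200      # ~4 KB of stand-in source
first = store.render("main.py", big_file)      # full payload on first use
second = store.render("main.py", big_file)     # tiny reference on reuse
print(len(first), len(second))
```

Every repeated turn in naive mode pays the `len(first)` price again; the reference costs a fixed handful of tokens regardless of artifact size, which is why savings grow with session length.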
Honest Framing
This test was synthetic but controlled. Real-world workloads will naturally vary. Based on these results:
- 10–15% savings seems like a conservative baseline.
- 20–30% in long-running sessions appears realistic.
- Extreme 40%+ savings will depend entirely on the specific trajectory.
The important result is that the cost reduction is measurable and consistently repeatable across 250 turns.
Final Takeaway
My original goal for this system was modest: If I can save 10–15%, that’s worth it.
The measured result showed roughly a one-third cost reduction on average in long synthetic coding sessions.
This transforms the fundamental question from “Does compression help?” to “How do we stabilize and optimize it for real-world production workloads?”
And honestly, that is a far better problem to have. The next phase is validating these results under real production workloads with diverse repositories and team usage patterns.