A drop-in proxy that compresses bloated code context in real-time, cutting LLM API costs by 50–80% on plain-chat coding agents.
TokenTamer is a middleware proxy that sits between an AI coding agent and the LLM API. It intercepts raw payloads, parses code with AST, and replaces "background" files with structural skeletons. The agent still sees signatures, classes, and imports — it just stops paying for function bodies it isn't editing.
⚠️ Alpha software. This is a real project in active development, not a polished SaaS. Please read the support matrix below before installing.
| Client | HTTPS interception | Compression active | Notes |
|---|---|---|---|
Aider (--openai-api-base) |
✅ Not needed | ✅ Full | Best supported. Use the proxy URL directly. |
| Cursor (custom base URL) | ✅ Not needed | ✅ Full | Best supported. |
Plain curl / SDK calls |
✅ Not needed | ✅ Full | Great for testing. |
| Claude Code (hardcoded endpoint) | ✅ Works | ✅ Tool-aware | Stale file reads in tool_result get skeletonized; latest read stays intact. |
| Codex CLI (hardcoded endpoint) | ✅ Works | ✅ Tool-aware | Same engine via /v1/responses. |
How tool-aware compression works. Agents like Claude Code call Read(file) repeatedly. The conversation accumulates the same file dumped multiple times. TokenTamer tracks every tool_use → file mapping, then skeletonizes the older tool_result reads while keeping the most recent read of each file 100% intact. tool_use blocks and tool definitions are never touched.
If something ever breaks, hit the kill switch:
token-tamer --ssl --port 443 --passthrough # disable all compression
# or
token-tamer --ssl --port 443 --no-tool-compression # disable only tool-aware path- Compression depends on re-reads. Single-read sessions get no tool savings (just text compression). Long sessions where the agent re-reads files benefit the most.
- Heuristic file detection. We look for
file_path/path/filenamekeys in tool inputs. Exotic agents with unusual schemas may be missed. - Multi-turn cross-request caching is not yet implemented.
- macOS only for the one-line cert setup. Linux/Windows users need to trust the CA manually.
- No production benchmarks yet. Savings numbers come from unit tests with synthetic payloads, not real long Claude Code sessions.
- v0.2 — Tool-aware compression (✅ shipped)
- v0.3 — Anthropic prompt caching / long-lived session hijacking (✅ shipped)
- v0.4 — Tree-sitter for proper multi-language AST (current C-style support is a brace-balance heuristic)
- v0.5 — Web dashboard with per-file compression heatmap + live cache hit metrics
- 🔌 Drop-in proxy — No changes needed to your coding agent. Just change the API base URL.
- 🔁 Long-lived session hijacking — Injects Anthropic
cache_controlbreakpoints into outbound requests. Long Claude Code sessions see up to 90% off input tokens (cached input is$0.30/Mtokenvs$3.00/Mtokenregular). - 🧠 Smart active file detection — Automatically identifies which files you're working on and leaves them 100% intact.
- 🌳 AST-based compression — Strips function bodies while preserving signatures, imports, and class structures.
- 🔧 Tool-aware compression — Skeletonizes stale
tool_resultreads while preserving the latest read of each file. Safe with agents that use function calling. - 💰 Real-time cost tracking — Beautiful terminal dashboard showing tokens saved and money saved.
- 🔄 Full streaming support — Transparent SSE streaming for both OpenAI and Anthropic APIs.
- ⚡ Zero latency overhead — Compression happens locally in milliseconds.
- Python 3.9 or newer (
python3 --version) - macOS, Linux, or Windows (Windows = manual cert trust step)
openssl(pre-installed on macOS & most Linux)
git clone https://github.com/borhen68/TokenTamer.git
cd TokenTamer
# Recommended: use a virtual environment to avoid messing with system Python
python3 -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
pip install -e .Verify it installed:
token-tamer --version
# → TokenTamer 0.2.0👉 Path A — Aider, Cursor, or your own SDK code (no SSL setup needed):
token-tamer --port 8000 --no-dashboardThen point your tool's API base URL at http://127.0.0.1:8000/v1:
aider --openai-api-base http://127.0.0.1:8000/v1For Cursor: Settings → Models → OpenAI API Base → http://127.0.0.1:8000/v1. Done. ✅
👉 Path B — Claude Code or Codex CLI (SSL setup, one-time):
These tools hardcode the API URL. We use HTTPS interception:
# Step 1 — Generate the local certificate (just runs and exits)
token-tamer --ssl --port 8443 --no-dashboard &
sleep 2 && kill %1
# Step 2 — Trust the certificate (macOS)
sudo security add-trusted-cert -d -r trustRoot \
-k /Library/Keychains/System.keychain \
~/.config/token-tamer/certs/ca-cert.pem
# Step 3 — Redirect API domains to localhost
echo "127.0.0.1 api.openai.com" | sudo tee -a /etc/hosts
echo "127.0.0.1 api.anthropic.com" | sudo tee -a /etc/hosts
# Step 4 — Run TokenTamer on port 443 (sudo required for low ports)
sudo $(which token-tamer) --ssl --port 443 --no-dashboardLeave that terminal open, then in a new terminal:
claude "create a snake game" # or
codex "refactor this module"You're now intercepting + compressing. 🎉
# Path A check:
curl http://127.0.0.1:8000/health
# Path B check:
curl https://api.openai.com/health # Should return TokenTamer's JSON, not OpenAI'sBoth should return:
{"status":"ok","version":"0.2.0","requests_processed":0,"tokens_saved":0}# Remove /etc/hosts entries
sudo sed -i.bak '/api.openai.com/d;/api.anthropic.com/d' /etc/hosts
# Untrust the cert
sudo security remove-trusted-cert -d ~/.config/token-tamer/certs/ca-cert.pem
# Uninstall the package
pip uninstall token-tamer| Symptom | Fix |
|---|---|
command not found: token-tamer |
Activate your venv: source venv/bin/activate |
ModuleNotFoundError: No module named 'uvicorn' |
Same — venv not active |
address already in use on port 8000 |
lsof -ti :8000 | xargs kill -9 |
Permission denied on port 443 |
Use sudo for ports <1024, or pick a higher port |
SSL certificate problem from curl |
Re-run the security add-trusted-cert step, then open a NEW terminal |
| Claude Code hangs / errors | Hit the kill switch: restart with --passthrough |
| Compression broke something | Restart with --no-tool-compression and file an issue |
TokenTamer resolves API keys in this priority order:
- Request headers — Keys sent by your agent (default behavior, zero config needed)
- Environment variables —
OPENAI_API_KEY/ANTHROPIC_API_KEY - Config file —
config.yaml
Your Agent TokenTamer LLM API
│ │ │
│── 100k token payload ──────▶│ │
│ │── Identify active files │
│ │── Skeletonize background │
│ │── 15k token payload ────────▶│
│ │ │
│ │◀──── Streaming response ─────│
│◀── Streaming response ──────│ │
│ │ │
│ │── Dashboard: saved $2.45! 💰 │
def calculate_tax(amount: float, region: str) -> float:
"""Calculates regional tax rates based on complex logic."""
rate = get_base_rate(region)
adjustments = fetch_adjustments(region, amount)
if amount > THRESHOLD:
rate *= 1.05
# ... 50 more lines of complex math ...
return final_tax# [TOKEN-GUARD: Compressed — structural skeleton only]
def calculate_tax(amount: float, region: str) -> float: ...The LLM still knows calculate_tax exists and how to call it, but doesn't waste tokens reading the implementation.
This is TokenTamer's most powerful cost-cutting feature. Anthropic offers a 90% discount on cached input tokens:
| Token type | Price per 1M tokens |
|---|---|
| Regular input | $3.00 |
| Cached input | $0.30 |
The catch: Claude Code (and most agents) don't use it well. They mutate the conversation prefix every turn, so the cache never hits. TokenTamer fixes this.
Without TokenTamer — each turn re-sends the entire conversation:
Turn 5: [sys, tools, msg1, msg2, msg3, msg4, msg5] → $$$$
Turn 6: [sys, tools, msg1, msg2, msg3, msg4, msg5, msg6] → $$$$$
Turn 7: [sys, tools, msg1...msg7] → $$$$$$$
With TokenTamer — stable prefix cached, only new messages billed:
Turn 5: [sys, tools, msg1, msg2, msg3] [msg4, msg5] → cache | $$
Turn 6: [sys, tools, msg1, msg2, msg3] [msg4, msg5, msg6] → cache | $$
Turn 7: [sys, tools, msg1, msg2, msg3] [msg4, msg5, msg6, msg7] → cache | $$
TokenTamer injects cache_control breakpoints at three stable positions:
- After tools array — rarely changes between turns
- After system prompt — fixed for the whole session
- After conversation prefix — everything except the last 2 turns
Result: a 50-turn Claude Code session drops from ~$5.00 to ~$0.50. Same model. Same reasoning. Same output. Just smarter billing.
When active, every Anthropic response includes cache headers:
curl -I http://127.0.0.1:8000/v1/messages ...
# X-TokenTamer-Cache-Breakpoints: 3
# X-TokenTamer-Cache-Tokens: 12400If you ever need to disable it:
token-tamer --no-session-cacheCreate a config.yaml in your working directory:
proxy:
host: "127.0.0.1"
port: 8000
upstream:
openai_url: "https://api.openai.com"
anthropic_url: "https://api.anthropic.com"
context:
repo_path: "/path/to/your/codebase" # Enables semantic active-file detection
skeletonizer:
keep_docstrings: false # Preserve function docstrings?
keep_class_attrs: true # Keep class-level attributes?
pricing: # Per 1M tokens for cost estimation
gpt-4o:
input: 2.50
output: 10.00
claude-sonnet-4-20250514:
input: 3.00
output: 15.00TokenTamer skeletonizes more than just Python:
| Language | Method | Status |
|---|---|---|
| Python | Native AST | ✅ |
| JavaScript | Brace-balance heuristic | ✅ |
| TypeScript | Brace-balance heuristic | ✅ |
| Go | Brace-balance heuristic | ✅ |
| Rust | Brace-balance heuristic | ✅ |
| Java / C# / C / C++ | Brace-balance heuristic | ✅ |
If you provide a repo_path in config.yaml and install sentence-transformers,
TokenTamer uses embeddings to detect which files are semantically relevant to your
query — even if you don't mention them by name.
pip install sentence-transformers scikit-learn| Provider | Endpoint | Status |
|---|---|---|
| OpenAI | /v1/chat/completions |
✅ |
| OpenAI | /v1/completions |
✅ (pass-through) |
| OpenAI | /v1/models |
✅ (pass-through) |
| Anthropic | /v1/messages |
✅ |
pip install -e ".[dev]"
pytest tests/ -v- Python 3.10+
- Dependencies: FastAPI, uvicorn, httpx, tiktoken, rich, pyyaml
MIT