Transparent, org-wide LLM context compression for Aperture.
Coding agents and RAG pipelines send the model the same bulky, low-value payloads over and over: multi-thousand-line tool outputs, search dumps, build logs, file listings. You pay for every one of those tokens on every turn, and they crowd out the context window. tsheadroom strips that waste out before the request reaches the provider, for every team in your org at once, with no client, SDK, or app changes. It plugs into Aperture as a pre_request hook.
It runs Headroom's compress() function, which crushes bulky content while leaving prompts, instructions, and recent turns intact. If there's nothing worth compressing (or anything goes wrong), the request passes through unchanged. tsheadroom can never block or break a request.
If you run Aperture, you already run a Tailscale tailnet. tsheadroom joins that tailnet as a device (using tsnet) and Aperture calls out to it. No public endpoint, no API keys: access is gated by your tailnet's grants.
On a representative tool-heavy request (a 400-entry file listing returned as a tool result):
| Before | After | |
|---|---|---|
| The bulky tool result | ~41 KB | ~25 KB (~40% smaller) |
| Full request body | ~47 KB | ~25 KB |
| The user's prompt, system prompt, recent turns | n/a | untouched |
- Token + cost savings scale with how much tool/log/search output your traffic carries. Chat-only and short requests are left alone (see What gets compressed).
- Transparent. Enforced at the Aperture grant level. Users and apps don't change anything and don't even know it's there.
- Fail-safe by design. If it's unreachable, slow, or crashed, tsheadroom always answers
allowand the original request proceeds. It is structurally incapable of blocking a request.
Decide whether this fits before you wire it in:
- Memory: text compression uses an ML model held resident in each worker (~600 MB/worker; ~4.8 GB at the default pool size of 8). Tool-output-only compression (the base install) needs far less. Size
-pool-sizedeliberately. - Latency: warm requests add single-digit milliseconds. The first request after startup can block while the ML model loads (and, on a fresh host, downloads). That one-time load is ~60s; see ML model loading. You set the ceiling through Aperture's hook
timeout. - Ops: one long-lived process running as a tailnet device, ideally under
systemd. Targets Linux (see Requirements). - Prompt cache: compressing historical context changes the cached prefix, which can cause a prompt-cache miss on the next turn. See the note below.
Aperture --(pre_request hook: POST /)--> tsheadroom (tsnet :80)
│
├─ pool of persistent Python workers
│ each runs: headroom.compress(messages, model, **config)
│
└─ reply: {"action":"modify","request_body":{…}}
or {"action":"allow"}
The tsheadroom Go binary owns a small pool of long-lived Python worker processes (so the Headroom pipeline is built once, not per request) and supervises them (auto-restart on crash, fail-open on timeout). Compression runs out-of-process, so a slow or faulty compress() call can't take down the listener.
For each request Aperture forwards, tsheadroom hands the messages array to compress() and returns a modify action with the rewritten body, or allow if there's nothing to gain or anything goes wrong.
On the Aperture side:
- A running Aperture instance with at least one LLM provider configured, and access to edit its configuration (
config.hujson) and/or your tailnet policy file. - A Tailscale tailnet (you have one if you run Aperture).
On the host that runs tsheadroom:
- Linux, for the cloud hosts this is meant to run on. The binary also builds and runs on macOS, which is convenient for local development and testing.
- Go 1.26.4+ to build the binary (required by the pinned Tailscale dependency).
- Python 3.10 or newer with
headroom-aiinstalled. Since 0.25.0,headroom-aiships a stable-ABI (abi3) wheel built for Python 3.10, so a single prebuilt wheel covers 3.10 and every newer release — 3.14 included — with no Rust toolchain needed. The source-build fallback now only happens below 3.10, or on a platform with no published wheel at all. If unsure, checkRequires: Pythonand the wheel filenames on headroom-ai's PyPI page. - A Tailscale auth key (
TS_AUTHKEY) so the device can join your tailnet unattended. Generate one in the Tailscale admin console under Settings > Keys.
headroom-ai has two compression paths, and which extras you install decides what actually compresses:
| You install | What compresses | Memory | Rust toolchain |
|---|---|---|---|
headroom-ai (base) |
Tool outputs (SmartCrusher, pure Python) | low | not needed |
headroom-ai[ml] |
Tool outputs + text/prose (Kompress, ML) | ~600 MB/worker | not needed |
Install [ml] if you want text/prose compression, or if you want the default configuration to work as advertised: compress_system_messages is on by default and Headroom also uses the ML model as its fallback for mixed content, so under defaults the model is exercised by ordinary traffic. With only the base wheel, the text-targeting knobs (compress_user_messages, target_ratio) change routing but produce no savings: only tool-output compression is active. See Tune compression.
# HTTPS (no SSH key needed):
git clone https://github.com/tailscale/tsheadroom.git
cd tsheadroom
make # produces ./build/tsheadroom (Linux)Cross-compiling from a Mac for a Linux host:
GOOS=linux GOARCH=amd64 go build -o build/tsheadroom . # or GOARCH=arm64Set up the Python interpreter (a virtualenv is recommended). headroom-ai ships a prebuilt abi3 wheel for Python 3.10+, so no Rust toolchain is needed:
python3.13 -m venv /opt/tsheadroom/venv # Python 3.10+, called explicitly (3.13 shown)
/opt/tsheadroom/venv/bin/python --version # confirm it's 3.10 or newer
/opt/tsheadroom/venv/bin/pip install 'headroom-ai[ml]' # or just 'headroom-ai' for tool-output onlyName the interpreter explicitly (python3.13, or whichever 3.10+ version you have) rather than using bare python3, which on a long-term-stable distro can still resolve to a release older than 3.10 — below the abi3 wheel's floor, so the pip install falls back to building headroom-ai from Rust source (see Troubleshooting). If your default python3 is too old, install a newer one with brew install python@3.13, pyenv install 3.13, or your platform's package manager.
Copy the worker script next to wherever you'll run the service:
cp worker.py /opt/tsheadroom/worker.pyYou now have the three things tsheadroom needs at runtime: the binary, a Python interpreter with headroom-ai, and worker.py.
If pip install 'headroom-ai[ml]' starts compiling from source — you'll see Building wheel for headroom-ai, a maturin invocation, or a PyO3 error like:
error: failed to build headroom-ai from source (maturin / PyO3)
instead of downloading a .whl, pip couldn't find a wheel compatible with your interpreter. Since 0.25.0 the abi3 wheel covers Python 3.10 and newer, so the usual cause is an interpreter older than 3.10 (some long-term-stable distros still default to 3.9 or earlier), or a platform/architecture with no published wheel. Confirm the version with /opt/tsheadroom/venv/bin/python --version. To fix it, delete the venv and recreate it on Python 3.10+ (3.13 shown):
rm -rf /opt/tsheadroom/venv
python3.13 -m venv /opt/tsheadroom/venv
/opt/tsheadroom/venv/bin/python --version # confirm 3.10 or newer
/opt/tsheadroom/venv/bin/pip install 'headroom-ai[ml]'A correct install downloads a prebuilt .whl (its name carries the abi3 tag, e.g. cp310-abi3); if you instead see Building wheel for headroom-ai or maturin, your interpreter still has no compatible wheel — most often it's older than 3.10.
headroom-ai[ml] downloads its compression model from the HuggingFace Hub the first time a worker loads it. The models are public, so no token is required — but anonymous Hub requests are rate-limited per IP, and a pool of workers cold-starting together can trip that limit and stall or fail their first model download. Setting HF_TOKEN authenticates those requests and lifts the limit. Set one if you:
- cold-start several workers at once, or run in CI / ephemeral hosts that re-download the model often;
- want workers to keep auto-fetching the newest model after a
headroom-aiupgrade. Without a token, tsheadroom runs already-cached workers offline (see ML model loading) and won't reach the Hub to pull a newer artifact.
Getting one is free:
- Sign in at huggingface.co and open Settings → Access Tokens (direct link).
- New token with the Read scope — enough to download public and gated models. Don't grant write; it's unnecessary privilege.
- Copy the
hf_…value (shown only once) into tsheadroom's environment — e.g.Environment=HF_TOKEN=hf_…in the systemd unit, orHF_TOKEN=hf_…in.env. The Go parent passes its environment through to every worker, so that's all the workers need.
HF_TOKEN is a credential: keep it out of git, scope it to read, and rotate it from that same page if it leaks (revocation is instant). Sanity-check a token with pip install huggingface_hub and HF_TOKEN=hf_… hf auth whoami, which should print your username.
TS_AUTHKEY=tskey-auth-xxxx ./build/tsheadroom \
-hostname tsheadroom \
-python /opt/tsheadroom/venv/bin/python \
-worker /opt/tsheadroom/worker.py \
-state-dir /var/lib/tsheadroom \
-config /var/lib/tsheadroom/tsheadroom.config.jsonThis joins the tailnet as tsheadroom and listens for hook calls on :80 at path /. Other tailnet devices (including Aperture) reach it at http://tsheadroom.<your-tailnet>.ts.net/ (find <your-tailnet>, your tailnet name, in the Tailscale admin console).
First run on a fresh host: with
[ml]installed, the workers load (and on a brand-new host, download ~600 MB of model from HuggingFace) at startup. The very first compression request can block for up to ~60s while that happens; after that, requests are milliseconds. To avoid a slow cold start in production, warm the cache once before serving traffic, or setHF_TOKEN(see HF_TOKEN).
TS_AUTHKEY is only needed on first start. Once -state-dir is populated you can omit it on restarts.
For durability, and for the cloud/Linux hosts this is meant for, wrap it in a service manager. A minimal systemd unit:
[Unit]
Description=tsheadroom (Aperture context-compression hook)
After=network-online.target
Wants=network-online.target
[Service]
ExecStart=/opt/tsheadroom/tsheadroom \
-python /opt/tsheadroom/venv/bin/python \
-worker /opt/tsheadroom/worker.py \
-state-dir /var/lib/tsheadroom \
-config /var/lib/tsheadroom/tsheadroom.config.json
Environment=TS_AUTHKEY=tskey-auth-xxxx
StateDirectory=tsheadroom
Restart=always
[Install]
WantedBy=multi-user.target
StateDirectory=tsheadroom provisions /var/lib/tsheadroom with the right ownership. After the device has authenticated once, you can drop the TS_AUTHKEY line.
If you'd rather not set up a Go toolchain, a virtualenv, and a systemd unit by hand, the repo ships a Dockerfile and docker-compose.yml that build and run tsheadroom for you. You can skip the Install and Run steps above; deploys and updates become a single docker compose up -d --build.
git clone https://github.com/tailscale/tsheadroom.git
cd tsheadroom
cp .env.example .env # set TS_AUTHKEY, variant, pool size
mkdir -p config && cp tsheadroom.config.example.json config/config.json
docker compose up -d --buildThis builds the binary, installs headroom-ai, joins your tailnet as tsheadroom, and serves the hook on :80 — the same result as the manual install, in one command. Settings live in .env (what .env.example ships with):
| Variable | .env.example |
Description |
|---|---|---|
TS_AUTHKEY |
— | Tailscale auth key; only needed until the device authenticates once. |
TS_HOSTNAME |
tsheadroom |
Device name on the tailnet. |
HEADROOM_VARIANT |
ml |
base for tool-output compression, ml for text + tool compression (installs headroom-ai[ml]). See Choose tool-output or text compression. Changing it needs a rebuild. |
POOL_SIZE |
4 |
Number of Python workers; each holds a resident copy of the ML model under ml. |
The device identity and the HuggingFace model cache persist in named volumes (tsheadroom-state, tsheadroom-hf-cache), so restarts never re-authenticate the device or re-download the ~600 MB model. Compression knobs live in the bind-mounted config/config.json: edit it and restart, or change them live with PUT /config (see Tune compression).
Update to a newer version with:
git pull && docker compose up -d --buildMigrating from a manual or
systemdinstall? Before the firstup, copy your existing-state-dirinto thetsheadroom-statevolume (and~/.cache/huggingfaceintotsheadroom-hf-cache) so the container keeps the same device identity and skips the model download. Stop the old service first — the two can't share one tailnet identity.
| Flag | Default | Description |
|---|---|---|
-python |
python3 |
Interpreter with headroom-ai installed (used to launch workers). |
-worker |
worker.py |
Path to the worker script. |
-hostname |
tsheadroom |
Device name on the tailnet. |
-pool-size |
8 |
Number of persistent Python workers. Each holds a resident copy of the ML model (~600 MB) when text compression is active (~4.8 GB at 8), so raise it deliberately. |
-max-compress |
60s |
Hard cap on a single worker call before the worker is recycled; the sole worker-side timeout. Covers one-time ML model loads. |
-addr |
:80 |
Listen address on the tsnet device. |
-state-dir |
tsnet default | tsnet state directory. Persist this: it holds the device identity (tailscaled.state). See note below. |
-config |
tsheadroom.config.json |
Path to the tunable compression configuration file. Loaded at startup, rewritten on PUT /config. |
-local-addr |
off | Serve plain HTTP here instead of tsnet, for local testing only. |
-v |
off | Log a one-line per-request summary to stdout. See Verify. |
-no-affinity |
off | Disable session-affinity worker routing; dispatch every request to any free worker. See Worker affinity. |
State (
-state-dir): must be a stable, writable, persistent path. It containstailscaled.state, the device's private key. Treat it as a secret and keep it on durable storage, or the device re-authenticates as a new device on every restart.Config path (
-config): the default is resolved against the current working directory. In a service, always set an absolute path (for example, under your state dir) so a different launch CWD doesn't silently read/write a different file.
tsheadroom is configured as a pre_request hook in two places in your Aperture configuration (config.hujson).
Add it to the top-level hooks map, pointing at your tsheadroom device:
"hooks": {
"headroom": {
"url": "http://tsheadroom.<your-tailnet>.ts.net/",
"fail_policy": "fail_open", // if tsheadroom is unreachable, send uncompressed
"timeout": "30s", // tsheadroom's only client-facing latency ceiling
},
},| Field | Required | Notes |
|---|---|---|
url |
yes | Your tsheadroom device's tailnet URL, path /. |
fail_policy |
no (default fail_open) |
Keep fail_open. See below. |
timeout |
no (default 5s) |
Raise to 30s. This is the entire latency budget; the 5s default will cut off cold/large compressions. |
preference |
no (default 0) |
Set only if you stack tsheadroom with other hooks. Higher runs first; ties break alphabetically. A modify is visible to later hooks. |
tsheadroom needs no apikey or authorization. It's reached only over your tailnet and gated by your tailnet's grants.
A hook does nothing until a grant references it through send_hooks:
{
"src": ["*"],
"app": {
"tailscale.com/cap/aperture": [
{
"models": "**", // compress all models, or scope, for example "anthropic/**"
"send_hooks": [
{
"name": "headroom",
"events": ["pre_request"],
"send": ["request_body"],
},
],
},
],
},
},Note
Where you put this grant changes the dst rule. The example above is the Aperture configuration file (config.hujson) form, where dst is not used, so omit it. If you instead place the grant in your tailnet policy file, you must add an explicit dst (for example, "dst": ["tag:aperture"], where tag:aperture is a tag you've applied to your Aperture device). A tag gives a non-user device a stable identity at authentication time and doubles as a selector you can target in grants, so it's the durable way to name your Aperture device here. Omitting dst makes the grant silently apply to nothing and your hook never fires.
Notes:
send: ["request_body"]is required and is the only input tsheadroom uses. Amodifyhook replaces the entire request body, so Aperture must send it. (The model name is read from inside the body; tsheadroom doesn't needuser_message; it operates on the wholemessagesarray.)- Keep
fail_open. tsheadroom always answers200withallow/modifyand never blocks;fail_openensures that even if the device is unreachable (or a compression runs past thetimeout), requests still proceed (uncompressed).fail_closedwould instead block with HTTP 503, not what you want here. - The hook
timeoutis the only client-facing latency ceiling. tsheadroom has no soft timeout of its own: it waits for the compression and returns, bounded only by thistimeout(after which Aperture forwards the original) and by its internal-max-compressworker cap. Set it to how long you're willing to make a request wait.30smatches Headroom's own compression budget and comfortably fits large, late-session requests. There's no separate tsheadroom knob to keep in sync. - Scope
models(FQN glob, for example"anthropic/**", or"**"for everything) if you don't want to compress all providers.
Full hook/grant field reference: Aperture configuration documentation.
tsheadroom returns one of two GuardrailResponse actions, always with HTTP 200:
{ "action": "allow" }Nothing changed. The original request proceeds. Returned for short/chat-only requests, bodies with no messages array, errors, or timeouts.
{ "action": "modify", "request_body": { "model": "…", "messages": [ … ] } }The messages array was compressed; every other field of the body is preserved unchanged. The modified body replaces the original wholesale.
tsheadroom never returns "block".
Note
Prompt-cache interaction. tsheadroom compresses older context (it protects the most recent turns), so a modify changes the cached prefix and can trigger a prompt-cache miss on the next turn. For tool/log-heavy traffic the token savings typically dominate; for cache-heavy chat workloads, weigh this and scope models/knobs accordingly.
Run with -local-addr to serve plain HTTP (no tailnet) and -v for per-request logging:
./build/tsheadroom -local-addr 127.0.0.1:8080 -v \
-python /opt/tsheadroom/venv/bin/python -worker ./worker.pySend a request with a large tool result, exactly the shape Headroom is built to compress:
python3 - <<'PY' | curl -s -X POST http://127.0.0.1:8080/ -H 'Content-Type: application/json' -d @-
import json
big = json.dumps([{"id": i, "path": f"/src/m{i}.py", "status": "unchanged",
"hash": "deadbeef"*4} for i in range(400)])
print(json.dumps({"request_body": {"model": "claude-sonnet-4-5-20250929", "messages": [
{"role": "user", "content": "list files"},
{"role": "assistant", "content": [{"type": "tool_use", "id": "t1", "name": "ls", "input": {}}]},
{"role": "user", "content": [{"type": "tool_result", "tool_use_id": "t1", "content": big}]},
{"role": "user", "content": "summarize"}]}}))
PY
# -> {"action":"modify","request_body":{…}}The first request may take ~60s (the
[ml]model is loading/downloading). It isn't hung. Subsequent requests return in milliseconds.
The proof is in the -v log line. Cold first request, then warm:
request in_msgs=4 in_bytes=47227 out_bytes=24885 dur_ms=56956 worker_ms=42 cold=true model_limit=200000 -> modify
request in_msgs=4 in_bytes=47227 out_bytes=24885 dur_ms=12 worker_ms=11 cold=false model_limit=200000 -> modify
| Field | Meaning |
|---|---|
in_bytes / out_bytes |
Request body size before/after. out_bytes < in_bytes on a modify is compression happening. |
dur_ms / worker_ms |
Total handler time / time inside the Python worker. A large gap with cold=true is the one-time model load. |
cold=true |
This request paid the model load. Expect it once per worker. |
-> modify / -> allow(...) |
The action returned, with a reason (modify, allow(noop), allow(error), allow(passthrough), allow(read-error)). |
Confirm the inverse too. A short chat returns allow:
curl -s -X POST http://127.0.0.1:8080/ -H 'Content-Type: application/json' \
-d '{"request_body":{"model":"claude-sonnet-4-5-20250929","messages":[{"role":"user","content":"hi"}]}}'
# -> {"action":"allow"}Once the hook + grant are live:
- Make a real LLM call through your Aperture endpoint that includes a substantial tool result (or run a coding-agent session that does).
- On the tsheadroom device, watch the
-voutput (orjournalctl -u tsheadroom -funder systemd). A request that reaches it and compresses logs-> modifywithout_bytes < in_bytes.
If you see the request arrive and log modify, the hook is firing and compressing in production.
All allow, no modify? Work down this list:
| Symptom | Check |
|---|---|
| No requests reach the device at all | The hook is firing? In a tailnet policy-file grant, is dst set (see the note above)? Does the grant src match your users and models match your traffic? |
Requests arrive but always allow(passthrough) |
The body has no messages array (for example, Gemini contents or embeddings), so those pass through by design. |
Requests arrive but always allow(noop) |
Nothing worth compressing: short chat, prose-only, or no substantial tool result. This is correct behavior. See What gets compressed. |
| Text/prose isn't shrinking | The [ml] extra isn't installed, or compress_user_messages is off. Install headroom-ai[ml] and check GET /config. |
Occasional allow(error) under load or on first request |
A compression exceeded the Aperture hook timeout and failed open. Raise timeout, or pre-warm the model (cold start). |
tsheadroom exposes Headroom's compression knobs as live, persisted configuration, with no restart needed. Settings load from the -config file at startup (defaults if missing), and a small HTTP API on the same device reads and changes them on the fly; every change is written back to the file, so the service restarts with your last state.
# Read the current settings
curl -s http://tsheadroom.<your-tailnet>.ts.net/config
# Change one or more (partial updates merge onto current values)
curl -s -X PUT http://tsheadroom.<your-tailnet>.ts.net/config \
-H 'Content-Type: application/json' \
-d '{"compress_user_messages": true, "target_ratio": 0.3}'The default configuration (what GET /config returns out of the box):
{
"compress_user_messages": false,
"compress_system_messages": true,
"protect_recent": 4,
"protect_analysis_context": true,
"target_ratio": null,
"min_tokens_to_compress": 250,
"kompress_model": null
}Changes take effect on the next request. Invalid values are rejected with 400 and leave the current configuration untouched (for example, target_ratio: 5 → target_ratio must be in (0, 1] or null). You can also edit the -config JSON file directly and restart.
Tunable parameters (these mirror Headroom's CompressConfig):
| field | type | default | effect |
|---|---|---|---|
compress_user_messages |
bool | false |
Also compress user-message content (off = protect it). Turn on for prose/RAG-heavy inputs. (Needs [ml].) |
compress_system_messages |
bool | true |
Compress system messages. |
protect_recent |
int | 4 |
Leave the last N messages untouched. 0 = compress everything. |
protect_analysis_context |
bool | true |
Detect "analyze/review" intent and protect code. |
target_ratio |
float | null | null |
Keep-ratio for text compression: null = model decides (~aggressive), 0.5 = keep 50%. Must be in (0, 1]. (Needs [ml].) |
min_tokens_to_compress |
int | 250 |
Skip messages shorter than this. |
kompress_model |
string | null | null |
Override the Kompress model id; null = default. |
Access: the
/configendpoint is gated only by your tailnet policy file: anyone who can reach the device can read and change its configuration. Lock the device down accordingly (see Security).
tsheadroom exposes a Prometheus endpoint at GET /metrics on the same device. It uses the standard text exposition format, so any existing scraper works:
curl -s http://tsheadroom.<your-tailnet>.ts.net/metricsTwo families are emitted. The headroom_* metrics reuse the names Headroom's own proxy emits, so if you already scrape a Headroom proxy you can point the same dashboards at this endpoint:
| metric | type | meaning |
|---|---|---|
headroom_requests_total |
counter | Compression hook calls processed. |
headroom_requests_failed_total |
counter | Calls where compression errored and we failed open. |
headroom_tokens_input_total |
counter | Input tokens seen (tokens_before). |
headroom_tokens_saved_total |
counter | Tokens saved by compression (tokens_saved). |
headroom_overhead_ms_{sum,count,min,max} |
counter/gauge | tsheadroom processing time per request, in ms (excludes the upstream LLM — that's what Headroom calls "overhead"). |
headroom_requests_by_provider{provider} |
counter | Requests per provider (from Aperture metadata). |
headroom_requests_by_model{model} |
counter | Requests per model (from Aperture metadata). |
headroom_inbound_requests_total / _completed_total / _active |
counter/gauge | Inbound HTTP requests to the hook handler; _active is the live in-flight count. |
The tsheadroom_* metrics have no Headroom analog and describe this service specifically:
| metric | type | meaning |
|---|---|---|
tsheadroom_pool_slots_total / tsheadroom_pool_slots_busy |
gauge | Worker-pool size and how many slots are compressing right now. When busy sits at total, the pool is saturated and new requests queue (see ML model loading and timeouts). |
tsheadroom_affinity_hits_total / tsheadroom_affinity_spills_total |
counter | Session-affinity routing outcomes. A hit is a request sent to its conversation's warm worker; a spill is one whose worker was busy and went elsewhere (a likely cold recompress). A falling hit rate signals worker contention. See Worker affinity. |
tsheadroom_requests_by_outcome{outcome} |
counter | Requests by outcome: modify, noop, error, passthrough, read_error. |
tsheadroom_cold_starts_total |
counter | Worker first-real-call events that paid the one-time ML model load. |
tsheadroom_tokens_after_total |
counter | Output tokens after compression (tokens_after), so you can compute the realized ratio. |
Why a subset? tsheadroom is a
pre_requesthook, not a proxy: it runsheadroom.compress()and never sees the upstream response. So Headroom proxy metrics that depend on the response or on prompt caching — completion tokens, TTFB, end-to-end latency, cache reads/writes and busts, per-transform timing, waste signals — have no source here and are deliberately omitted rather than reported as misleading zeros.
Access: like
/config,/metricsis gated only by your tailnet policy file. Anyone who can reach the device can read it; it exposes aggregate counts and model/provider names, not request contents.
Headroom's compress() keeps a per-process cache of already-compressed content (a process-global pipeline with a content-keyed cache, ~30-minute TTL). So within one worker, a growing conversation does not recompress its unchanged prefix every turn — only genuinely new content runs through the model. In practice the same large tool result costs tens of milliseconds the first time a worker sees it and ~1 ms on later turns.
That cache only helps if a conversation's turns keep landing on the same worker. tsheadroom therefore routes by conversation: each request is dispatched to a worker chosen from its session_id (from Aperture's hook metadata; when absent, a stable hash of the opening messages). Each worker has a one-deep affinity queue, so back-to-back turns reliably reuse the warm worker; if that worker is already busy with a queued job, the request spills to any free worker (a likely cold recompress) rather than blocking. tsheadroom_affinity_hits_total / tsheadroom_affinity_spills_total expose the split.
This is best-effort by design: under sustained load, spills route to cold workers and erode the benefit. The planned next step is bounded per-worker queueing so a turn waits a short, capped time for its warm worker instead of spilling. Pass -no-affinity to disable routing entirely and dispatch every request to any free worker.
The ML text compressor loads a ~600 MB model on first use (one-time, then cached on disk and resident in each worker). This load takes several seconds; a large, late-session conversation can itself take a few seconds to compress. There is one worker timeout, and the client-facing wait is owned by Aperture:
-max-compress(60s) bounds the worker: a single call may run this long before the worker is treated as wedged and recycled. tsheadroom imposes no shorter deadline: a slow call runs to completion (the model finishes loading, the worker stays warm) and the result is used if Aperture is still waiting.- Aperture's hook
timeoutbounds the client: when it fires, Aperture (withfail_open) forwards the original request uncompressed. This is the only client-facing latency ceiling, and it lives in the Aperture configuration. There is no tsheadroom-side deadline to keep in sync.
The result: a cold worker pays at most one slow (or, behind Aperture's timeout, uncompressed allow) request while the model loads, then it works, with no restart needed. To avoid even that, workers preload the model at startup whenever the persisted configuration could route content through the ML compressor. With headroom-ai[ml] installed this is the default: compress_system_messages is on and Headroom uses the ML model as its fallback for tool/mixed content, so it's needed under ordinary traffic, and workers come up warm. (Each preloaded worker holds its own resident copy of the model, which is why -pool-size defaults low.)
The model downloads from HuggingFace on first ever use (cached under ~/.cache/huggingface thereafter). transformers revalidates that cache against the Hub on each cold load; tsheadroom skips that network round-trip (and its anonymous rate-limiting) by running workers offline once the model is cached (HF_HUB_OFFLINE/TRANSFORMERS_OFFLINE, set automatically; your own values are respected). To stay online with higher rate limits (for example, to let a fresh host download the model), set HF_TOKEN in the environment (see HF_TOKEN); it takes precedence over the automatic offline mode. A worker has ~60s to report ready, so a slow cold download on a fresh host can exceed that and make workers retry: warm the cache once (start a worker with the cache empty, or run a single compress under the worker's interpreter) before serving production traffic.
By default Headroom uses a conservative coding-agent profile: it compresses tool outputs (tool_result blocks / role:"tool" messages) and other bulky content, but protects user messages, the system prompt, and the most recent turns. So short chats (even long prose in a user message) return allow. You see modify when a request carries a substantial tool result.
tsheadroom only inspects the messages array (Anthropic/OpenAI shape). A request body without one (for example, Gemini's contents, or an embeddings call) passes through unchanged (allow); only messages is ever rewritten.
- What tsheadroom sees: the full plaintext request body of every matched LLM call, including prompts, tool results, and everything in
messages. It processes them in memory and does not persist request content to disk. - Reachability: tsheadroom has no auth of its own. Anyone who can reach the device over your tailnet can call the hook and read/modify its
/config. Restrict access with grants, and give the device a tag so you can target it: a tag is the device's identity at authentication time and serves as the selector grants match on (for example,tag:tsheadroomin a grant'ssrc/dst). - Egress: the only outbound network call is the one-time HuggingFace model download, which you can eliminate by pre-seeding the cache and running offline (see above). Compression itself makes no network calls.
- Secrets:
-state-dirholds the device's private key, so keep it on durable, protected storage.
make # build ./build/tsheadroom (Linux)
make test-go # Go tests only: fake worker, no Python needed
make test # Go tests + Python tests (Python tests run without headroom-ai;
# the real-headroom integration test self-skips unless installed)
make test PYTHON=/opt/tsheadroom/venv/bin/python # also runs the real-headroom integration testGo tests use a fake worker (no Python needed). Python tests use a fake headroom by default and run the real-headroom integration test only when $PYTHON has headroom-ai installed.
tsheadroom is not affiliated with Headroom. File bugs about the compress() function and its compression behavior there. For bugs in tsheadroom (Aperture integration, request/response parsing, or program behavior), use the tsheadroom issue tracker.
Contributions are welcome. See CONTRIBUTING.md. We require a DCO Signed-off-by line on commits (git commit -s), and every source file carries the Tailscale copyright/SPDX header (enforced by TestLicenseHeaders). Follow the Code of Conduct. Report security issues privately per SECURITY.md.
BSD 3-Clause. See LICENSE.