Lk/fix cold model load and diagnostics by lkosewsk · Pull Request #2 · tailscale/tsheadroom

lkosewsk · 2026-06-09T07:55:34Z

No description provided.

…diagnostics

With headroom-ai[ml] installed, the ML text compressor (Kompress/ModernBERT) is reached under the default config: compress_system_messages defaults on, and headroom also uses Kompress as its fallback for tool/mixed content. But textEnabled() only checked compress_user_messages/target_ratio, so preload was off by default and every worker cold-loaded the ~600MB model on its first live request. Spread across the pool, that caused staggered per-slot loads and, on a large request hitting a still-cold worker, a deadline blow -> fail-open -> uncompressed passthrough.

Fixes:
- config.go: textEnabled() now includes CompressSystemMessages, so workers preload at startup under the default config and come up warm.
- worker.py: configure HF env before importing headroom. Stay online when HF_TOKEN is set (honored only when set); otherwise go offline iff the models are already cached, removing the per-cold-load HF Hub revalidation round-trip (and its anonymous rate-limiting). Operator-set OFFLINE vars are respected.
- main.go: -pool-size now defaults to 4 (was max(4, GOMAXPROCS)); drop the now-unused runtime import.

Diagnostics:
- handler.go: -v line distinguishes allow(noop|error|passthrough|read-error) from modify, and adds dur_ms, worker_ms, and cold.
- pool.go: log slow worker calls (with cold_first_call) at Info, and log the previously-invisible "client deadline elapsed; worker continues warming" fail-open transition.
- worker.py: return elapsed_ms + cold_first_call per response; emit a structured model-preload / warmup-failure line to stderr.

README: pool-size default + memory caveat, preload-by-default behavior, and the HF offline-when-cached / HF_TOKEN opt-in.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Luke Kosewski <lkosewsk@tailscale.com>
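The worker.py rule in the commit message above (operator-set OFFLINE vars win, HF_TOKEN keeps the worker online, otherwise go offline only when the models are cached) can be sketched roughly as follows. This is an illustrative sketch, not the code from the PR: `configure_hf_env`, `models_cached`, and the repo id in the usage comment are hypothetical names. `HF_TOKEN`, `HF_HUB_OFFLINE`, and `TRANSFORMERS_OFFLINE` are the standard Hugging Face environment variables, and the cache check assumes the usual hub layout of `models--<org>--<name>/snapshots/<revision>/`.

```python
from pathlib import Path

# Standard Hugging Face switches that force offline operation.
OFFLINE_VARS = ("HF_HUB_OFFLINE", "TRANSFORMERS_OFFLINE")


def models_cached(repo_ids, hub_cache):
    """Return True only if every repo has at least one snapshot in hub_cache."""
    for repo_id in repo_ids:
        snapshots = (
            Path(hub_cache) / ("models--" + repo_id.replace("/", "--")) / "snapshots"
        )
        if not snapshots.is_dir() or not any(snapshots.iterdir()):
            return False
    return True


def configure_hf_env(env, cached):
    """Decide online/offline mode, mutating env in place.

    Must run before the library that reads these variables is imported.
    Returns a short label for logging: "operator", "online", or "offline".
    """
    # 1. Respect anything the operator already set, whatever its value.
    if any(var in env for var in OFFLINE_VARS):
        return "operator"
    # 2. A token is an explicit opt-in to talking to the Hub.
    if env.get("HF_TOKEN"):
        return "online"
    # 3. No token: skip the Hub round-trip only if nothing needs downloading.
    if cached:
        for var in OFFLINE_VARS:
            env[var] = "1"
        return "offline"
    # 4. Not cached yet: stay online so the first load can download the model.
    return "online"


# Hypothetical usage at the top of a worker, before importing headroom:
#   import os
#   cache = Path.home() / ".cache" / "huggingface" / "hub"
#   mode = configure_hf_env(os.environ, models_cached(["example-org/example-model"], cache))
```

Passing `env` and `cached` in as arguments, rather than reading `os.environ` and the filesystem inside the function, keeps the decision easy to test in isolation.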

headroom.compress() drives its aggressiveness off context_pressure = tokens_before / model_limit, but defaults model_limit to a flat 200000 regardless of the model. So we were over-compressing big-context models (e.g. a 1M-token Gemini) and mis-sizing others.

Resolve the real limit per request and pass it to compress():
- a tsheadroom-side override table of precompiled, case-insensitive, unanchored regexes (e.g. claude-opus-?4.8 -> 1,000,000), consulted first because the bundled Headroom registry lists no current Claude 4.x model and would otherwise default them to 200K;
- then ModelRegistry.get_context_limit(model, default=200000);
- then 200000.

The resolved limit is surfaced back through the worker response, compressResult.ModelLimit, summary.modelLimit, and the -v line for visibility. An explicit model_limit in the runtime config is never overridden.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Luke Kosewski <lkosewsk@tailscale.com>
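The resolution order described above can be sketched as below. This is a minimal sketch under stated assumptions, not the PR's code: `resolve_model_limit`, `context_pressure`, and `_OVERRIDES` are hypothetical names, the only override entry is the one quoted in the commit message, and the Headroom registry is represented by an injected callable (`registry_lookup`) standing in for `ModelRegistry.get_context_limit` so the example runs without headroom installed.

```python
import re

DEFAULT_MODEL_LIMIT = 200_000

# Override table: precompiled, case-insensitive patterns matched with search(),
# so they are unanchored. Only this one entry comes from the commit message.
_OVERRIDES = [
    (re.compile(r"claude-opus-?4.8", re.IGNORECASE), 1_000_000),
]


def resolve_model_limit(model, registry_lookup=None, configured_limit=None):
    """Pick the context limit to pass to compress() for one request."""
    # An explicit model_limit in the runtime config is never overridden.
    if configured_limit is not None:
        return configured_limit
    # 1. Local override table, consulted before the registry.
    for pattern, limit in _OVERRIDES:
        if pattern.search(model):
            return limit
    # 2. Registry lookup (stand-in for ModelRegistry.get_context_limit).
    if registry_lookup is not None:
        try:
            return registry_lookup(model, default=DEFAULT_MODEL_LIMIT)
        except Exception:
            pass  # a broken lookup falls through to the flat default
    # 3. Flat default.
    return DEFAULT_MODEL_LIMIT


def context_pressure(tokens_before, model_limit):
    """Ratio the commit message says drives compression aggressiveness."""
    return tokens_before / model_limit


# A 300,000-token request against the flat default versus a 1M-token limit:
print(context_pressure(300_000, DEFAULT_MODEL_LIMIT))                     # 1.5
print(context_pressure(300_000, resolve_model_limit("claude-opus-4.8")))  # 0.3
```

The two printed ratios show the effect the commit describes: under the flat 200000 default the request looks over-full, while with the resolved 1,000,000 limit the same request sits at 30% of the window.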

…meout

tsheadroom's 4s -deadline was a third timeout layered below two that already do the job: aperture's per-hook `timeout` (the client-facing latency ceiling, owned by the caller, which fail-opens on expiry) and the pool's -max-compress worker cap. Its only unique effect was to abandon compression *earlier* than aperture would have, which hit exactly the slow, large-context requests this tool exists to compress.

Remove it entirely: the handler passes the request context straight to the pool (cancelled when aperture's hook timeout fires or the client disconnects); a slow call runs to completion under -max-compress and leaves the worker warm. compress() has no "halt" signal, so every error still collapses to allow; tsheadroom never blocks. The guardrailResponse doc records why, and the WebSocket-client caveat that can't arise on aperture's request/response hook protocol today.

Also:
- raise the default worker pool 4 -> 8 (more concurrency headroom under the wider budget; ~4.8GB resident model RAM at 8);
- README: recommend a 30s aperture hook timeout, drop the -deadline flag row and the two-timeouts framing, document the single worker cap;
- refresh pool.go comments that still called the request context a "fail-open deadline".

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Luke Kosewski <lkosewsk@tailscale.com>

lkosewsk and others added 3 commits June 9, 2026 04:56

lkosewsk merged commit 6755bcb into main Jun 9, 2026
2 checks passed

lkosewsk deleted the lk/fix-cold-model-load-and-diagnostics branch June 9, 2026 08:05
