
Lk/fix cold model load and diagnostics #2

Merged
lkosewsk merged 3 commits into main from lk/fix-cold-model-load-and-diagnostics
Jun 9, 2026



@lkosewsk lkosewsk commented Jun 9, 2026

Collaborator

No description provided.

lkosewsk and others added 3 commits June 9, 2026 04:56
…diagnostics

With headroom-ai[ml] installed, the ML text compressor (Kompress/ModernBERT) is
reached under the default config: compress_system_messages defaults on, and
headroom also uses Kompress as its fallback for tool/mixed content. But
textEnabled() only checked compress_user_messages/target_ratio, so preload was
off by default and every worker cold-loaded the ~600MB model on its first live
request. Spread across the pool, that caused staggered per-slot loads and, when
a large request hit a still-cold worker, a blown deadline -> fail-open ->
uncompressed passthrough.

Fixes:
- config.go: textEnabled() now includes CompressSystemMessages, so workers
  preload at startup under the default config and come up warm.
- worker.py: configure HF env before importing headroom. Stay online when
  HF_TOKEN is set (honored only when set); otherwise go offline iff the models
  are already cached, removing the per-cold-load HF Hub revalidation round-trip
  (and its anonymous rate-limiting). Operator-set OFFLINE vars are respected.
- main.go: -pool-size now defaults to 4 (was max(4, GOMAXPROCS)); drop the
  now-unused runtime import.
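
A minimal sketch of the HF environment decision described for worker.py. It is
an illustration, not the actual worker code: the function name, the returned
reason strings, and the `models_cached` flag are hypothetical, and it assumes
the "OFFLINE vars" are `HF_HUB_OFFLINE` and `TRANSFORMERS_OFFLINE`.

```python
from typing import MutableMapping

# Assumed names of the operator-controlled offline switches.
OFFLINE_VARS = ("HF_HUB_OFFLINE", "TRANSFORMERS_OFFLINE")


def configure_hf_env(env: MutableMapping[str, str], models_cached: bool) -> str:
    """Pick online/offline mode; call this before importing headroom.

    Returns a short reason string (hypothetical) for logging.
    """
    # Operator-set offline variables are respected as-is, whatever their value.
    if any(var in env for var in OFFLINE_VARS):
        return "operator"
    # A non-empty HF_TOKEN is an explicit opt-in to staying online.
    if env.get("HF_TOKEN"):
        return "online-token"
    # No token: go offline only if the models are already on disk, which
    # skips the Hub revalidation round-trip on each cold load.
    if models_cached:
        for var in OFFLINE_VARS:
            env[var] = "1"
        return "offline-cached"
    # Nothing cached yet, so the first load still has to download.
    return "online-uncached"
```

In a worker this would run as `configure_hf_env(os.environ, models_cached=...)`
ahead of `import headroom`, since the libraries read these variables at import
time.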

Diagnostics:
- handler.go: -v line distinguishes allow(noop|error|passthrough|read-error)
  from modify, and adds dur_ms, worker_ms, and cold.
- pool.go: log slow worker calls (with cold_first_call) at Info, and log the
  previously-invisible "client deadline elapsed; worker continues warming"
  fail-open transition.
- worker.py: return elapsed_ms + cold_first_call per response; emit a structured
  model-preload / warmup-failure line to stderr.
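
One way the per-response timing fields could be produced, as a hedged sketch.
Only `elapsed_ms` and `cold_first_call` come from the commit message; the
`Worker` class, its `preloaded` flag, and the JSON keys in `log_preload` are
hypothetical.

```python
import json
import sys
import time
from typing import Any, Callable


class Worker:
    """Wraps a compress callable and annotates each response with timings."""

    def __init__(self, compress: Callable[[dict], Any], preloaded: bool = False):
        self._compress = compress
        # Warm if the model was preloaded at startup; otherwise the first
        # call pays the model load and is reported as cold.
        self._warm = preloaded

    def handle(self, request: dict) -> dict:
        cold = not self._warm
        start = time.perf_counter()
        try:
            result = self._compress(request)
        finally:
            # After any attempt, later calls are no longer "first" calls.
            self._warm = True
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        return {
            "result": result,
            "elapsed_ms": round(elapsed_ms, 3),
            "cold_first_call": cold,
        }


def log_preload(ok: bool, seconds: float, stream=sys.stderr) -> None:
    """Emit one structured line about model preload (key names are assumed)."""
    record = {"event": "model-preload", "ok": ok, "seconds": round(seconds, 3)}
    print(json.dumps(record), file=stream)
```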

README: pool-size default + memory caveat, preload-by-default behavior, and the
HF offline-when-cached / HF_TOKEN opt-in.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Luke Kosewski <lkosewsk@tailscale.com>
headroom.compress() drives its aggressiveness off
context_pressure = tokens_before / model_limit, but defaults model_limit
to a flat 200000 regardless of the model. So we were over-compressing
big-context models (e.g. a 1M-token Gemini) and mis-sizing others.
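
To make the effect concrete, here is the pressure formula from above applied
to one request size under both limits. The 150,000-token request is an
invented illustrative figure, and `context_pressure` is a stand-in function,
not headroom's internal code.

```python
def context_pressure(tokens_before: int, model_limit: int) -> float:
    """context_pressure = tokens_before / model_limit, as stated above."""
    return tokens_before / model_limit


tokens = 150_000  # illustrative request size

# With the flat 200000 default the request looks 75% full...
flat = context_pressure(tokens, 200_000)
# ...while against a real 1M-token limit it is only 15% full.
actual = context_pressure(tokens, 1_000_000)
```

The same request therefore reports five times the pressure under the flat
default, which is what drives the over-compression.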

Resolve the real limit per request and pass it to compress():
- a tsheadroom-side override table of precompiled, case-insensitive,
  unanchored regexes (e.g. claude-opus-?4.8 -> 1,000,000), consulted
  first because the bundled Headroom registry lists no current Claude
  4.x model and would otherwise default them to 200K;
- then ModelRegistry.get_context_limit(model, default=200000);
- then 200000.

The resolved limit is surfaced back through the worker response,
compressResult.ModelLimit, summary.modelLimit, and the -v line for
visibility. An explicit model_limit in the runtime config is never
overridden.
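
The resolution order above can be sketched as follows. This is an assumption
about shape, not the actual worker code: `resolve_model_limit` and `OVERRIDES`
are hypothetical names, and `registry_lookup` stands in for
`ModelRegistry.get_context_limit(model, default=200000)` so the example runs
without headroom installed.

```python
import re
from typing import Callable, Optional

DEFAULT_LIMIT = 200_000

# Precompiled, case-insensitive patterns; search() keeps them unanchored.
# The single entry mirrors the example given in the commit message.
OVERRIDES = [
    (re.compile(r"claude-opus-?4.8", re.IGNORECASE), 1_000_000),
]


def resolve_model_limit(
    model: str,
    registry_lookup: Callable[[str, int], int],
    explicit_limit: Optional[int] = None,
) -> int:
    """Return the context limit to hand to compress() for this request."""
    # An explicit model_limit from the runtime config is never overridden.
    if explicit_limit is not None:
        return explicit_limit
    # 1. Local override table, consulted first.
    for pattern, limit in OVERRIDES:
        if pattern.search(model):
            return limit
    # 2. Bundled registry, which itself defaults to 200000 for unknown models.
    try:
        return registry_lookup(model, DEFAULT_LIMIT)
    except Exception:
        # 3. Last resort if the registry is unavailable or raises.
        return DEFAULT_LIMIT
```

Because the patterns are unanchored, a prefixed or suffixed model string such
as `provider/Claude-Opus-4.8-preview` still matches the override.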

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Luke Kosewski <lkosewsk@tailscale.com>
…meout

tsheadroom's 4s -deadline was a third timeout layered below two that
already do the job: aperture's per-hook `timeout` (the client-facing
latency ceiling, owned by the caller, which fail-opens on expiry) and
the pool's -max-compress worker cap. Its only unique effect was to
abandon compression *earlier* than aperture would have, on exactly the
slow, large-context requests this tool exists to compress.

Remove it entirely: the handler passes the request context straight to
the pool (cancelled when aperture's hook timeout fires or the client
disconnects); a slow call runs to completion under -max-compress and
leaves the worker warm. compress() has no "halt" signal, so every error
still collapses to allow: tsheadroom never blocks. The guardrailResponse
doc records why, along with the WebSocket-client caveat, which cannot
arise on aperture's request/response hook protocol today.
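
The "every error collapses to allow" rule can be illustrated with a small
sketch. The real handler is Go; this Python version only shows the decision
shape, and the function signature and response dictionary are hypothetical.

```python
from typing import Callable, Optional


def guardrail_response(compress: Callable[[str], Optional[str]], body: str) -> dict:
    """Fail open: anything other than a successful rewrite becomes allow."""
    try:
        compressed = compress(body)
    except Exception as exc:
        # Timeouts, cancellations, and worker errors all land here. There is
        # no "halt" outcome, so the request passes through unmodified.
        return {"action": "allow", "reason": f"error: {exc}"}
    if compressed is None or compressed == body:
        # Nothing to change: also allow, but tagged as a no-op.
        return {"action": "allow", "reason": "noop"}
    return {"action": "modify", "body": compressed}
```

The only non-allow result is `modify`, so a slow or failing worker can delay
compression but cannot block a request.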

Also:
- raise the default worker pool 4 -> 8 (more concurrency headroom under
  the wider budget; ~4.8GB resident model RAM at 8);
- README: recommend a 30s aperture hook timeout, drop the -deadline flag
  row and the two-timeouts framing, document the single worker cap;
- refresh pool.go comments that still called the request context a
  "fail-open deadline".

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Luke Kosewski <lkosewsk@tailscale.com>
@lkosewsk lkosewsk merged commit 6755bcb into main Jun 9, 2026
2 checks passed
@lkosewsk lkosewsk deleted the lk/fix-cold-model-load-and-diagnostics branch June 9, 2026 08:05