Lk/fix cold model load and diagnostics #2
Merged
…diagnostics

With headroom-ai[ml] installed, the ML text compressor (Kompress/ModernBERT) is reached under the default config: compress_system_messages defaults on, and headroom also uses Kompress as its fallback for tool/mixed content. But textEnabled() only checked compress_user_messages/target_ratio, so preload was off by default and every worker cold-loaded the ~600MB model on its first live request. Spread across the pool, that caused staggered per-slot loads and, on a large request hitting a still-cold worker, a deadline blow -> fail-open -> uncompressed passthrough.

Fixes:
- config.go: textEnabled() now includes CompressSystemMessages, so workers preload at startup under the default config and come up warm.
- worker.py: configure HF env before importing headroom. Stay online when HF_TOKEN is set (honored only when set); otherwise go offline iff the models are already cached, removing the per-cold-load HF Hub revalidation round-trip (and its anonymous rate-limiting). Operator-set OFFLINE vars are respected.
- main.go: -pool-size now defaults to 4 (was max(4, GOMAXPROCS)); drop the now-unused runtime import.

Diagnostics:
- handler.go: -v line distinguishes allow(noop|error|passthrough|read-error) from modify, and adds dur_ms, worker_ms, and cold.
- pool.go: log slow worker calls (with cold_first_call) at Info, and log the previously-invisible "client deadline elapsed; worker continues warming" fail-open transition.
- worker.py: return elapsed_ms + cold_first_call per response; emit a structured model-preload / warmup-failure line to stderr.

README: pool-size default + memory caveat, preload-by-default behavior, and the HF offline-when-cached / HF_TOKEN opt-in.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Luke Kosewski <lkosewsk@tailscale.com>
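The HF environment decision described for worker.py can be sketched roughly as below. This is an illustration, not the code from the commit: the function name `hf_env_updates`, the `models_cached` flag, and the choice of `HF_HUB_OFFLINE` / `TRANSFORMERS_OFFLINE` as the "OFFLINE vars" are assumptions. How the worker detects a populated model cache is not shown here.

```python
from __future__ import annotations

from typing import Mapping

# Assumed set of "OFFLINE vars"; the commit message does not name them.
OFFLINE_VARS = ("HF_HUB_OFFLINE", "TRANSFORMERS_OFFLINE")


def hf_env_updates(env: Mapping[str, str], models_cached: bool) -> dict[str, str]:
    """Return the env vars to set before importing headroom (hypothetical helper)."""
    # Operator-set OFFLINE vars are respected: change nothing if any is present.
    if any(var in env for var in OFFLINE_VARS):
        return {}
    # A non-empty HF_TOKEN opts in to staying online.
    if env.get("HF_TOKEN"):
        return {}
    # No token: go offline only if the models are already on disk, which
    # avoids the Hub revalidation round-trip on every cold load.
    if models_cached:
        return {var: "1" for var in OFFLINE_VARS}
    # Nothing cached yet: stay online so the first download can succeed.
    return {}
```

A caller would apply the result with something like `os.environ.update(hf_env_updates(os.environ, models_cached))` before the first `import headroom`, since the commit message notes that the HF env is configured ahead of that import.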
headroom.compress() drives its aggressiveness off context_pressure = tokens_before / model_limit, but defaults model_limit to a flat 200000 regardless of the model. So we were over-compressing big-context models (e.g. a 1M-token Gemini) and mis-sizing others.

Resolve the real limit per request and pass it to compress():
- a tsheadroom-side override table of precompiled, case-insensitive, unanchored regexes (e.g. claude-opus-?4.8 -> 1,000,000), consulted first because the bundled Headroom registry lists no current Claude 4.x model and would otherwise default them to 200K;
- then ModelRegistry.get_context_limit(model, default=200000);
- then 200000.

The resolved limit is surfaced back through the worker response, compressResult.ModelLimit, summary.modelLimit, and the -v line for visibility. An explicit model_limit in the runtime config is never overridden.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Luke Kosewski <lkosewsk@tailscale.com>
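The lookup order above might look like the following minimal sketch. The names `resolve_model_limit`, `OVERRIDES`, `explicit_limit`, and `registry_lookup` are hypothetical. The registry is passed in as a plain callable so the sketch does not depend on Headroom's actual API beyond the `get_context_limit(model, default=200000)` call quoted in the commit message. Treating a registry error as a fall-through to 200000 is also an assumption.

```python
from __future__ import annotations

import re
from typing import Callable, Optional

DEFAULT_LIMIT = 200_000

# Precompiled, case-insensitive patterns; search() keeps them unanchored.
# The single entry is the example given in the commit message.
OVERRIDES = [
    (re.compile(r"claude-opus-?4.8", re.IGNORECASE), 1_000_000),
]


def resolve_model_limit(
    model: str,
    explicit_limit: Optional[int] = None,
    registry_lookup: Optional[Callable[[str, int], int]] = None,
) -> int:
    """Resolve the context limit for one request (illustrative only)."""
    # An explicit model_limit from the runtime config is never overridden.
    if explicit_limit is not None:
        return explicit_limit
    # 1. Override table, consulted before the bundled registry.
    for pattern, limit in OVERRIDES:
        if pattern.search(model):
            return limit
    # 2. Registry lookup, if one was supplied.
    if registry_lookup is not None:
        try:
            return registry_lookup(model, DEFAULT_LIMIT)
        except Exception:
            pass  # assumed behavior: fall through to the flat default
    # 3. Flat default.
    return DEFAULT_LIMIT
```

To see why the limit matters: with tokens_before = 150,000, context_pressure is 0.75 against the flat 200,000 default but 0.15 against a 1,000,000-token limit, so the same request is compressed far less aggressively once the real limit is resolved.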
…meout

tsheadroom's 4s -deadline was a third timeout layered below two that already do the job: aperture's per-hook `timeout` (the client-facing latency ceiling, owned by the caller, which fail-opens on expiry) and the pool's -max-compress worker cap. Its only unique effect was to abandon compression *earlier* than aperture would have, on exactly the slow, large-context requests this tool exists to compress.

Remove it entirely: the handler passes the request context straight to the pool (cancelled when aperture's hook timeout fires or the client disconnects); a slow call runs to completion under -max-compress and leaves the worker warm. compress() has no "halt" signal, so every error still collapses to allow; tsheadroom never blocks. The guardrailResponse doc records why, and the WebSocket-client caveat that can't arise on aperture's request/response hook protocol today.

Also:
- raise the default worker pool 4 -> 8 (more concurrency headroom under the wider budget; ~4.8GB resident model RAM at 8);
- README: recommend a 30s aperture hook timeout, drop the -deadline flag row and the two-timeouts framing, document the single worker cap;
- refresh pool.go comments that still called the request context a "fail-open deadline".

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Luke Kosewski <lkosewsk@tailscale.com>
No description provided.