Idle-stream timeout for remote LLM calls #575

Open
SuuBro wants to merge 6 commits into master from goal/idle-strea-cc5f4d75
SuuBro commented May 13, 2026

Summary

Adds a node --require preload that wraps undici's global dispatcher in agent subprocesses so that remote LLM SSE streams get idle-gap timeouts (defaults: 60s for headers, 120s for body), while preserving the no-timeout behaviour that pi-coding-agent needs for buffered tool-call responses from local vLLM / Ollama / LM Studio backends.

Fixes the silent-stream hang observed on session 035f7442, where the agent stayed pinned in the streaming state for 4+ minutes with no error after an LLM SSE stream went silent.

Root cause

@earendil-works/pi-coding-agent/dist/cli.js runs at startup:

setGlobalDispatcher(new EnvHttpProxyAgent({ bodyTimeout: 0, headersTimeout: 0 }));

This globally disables undici's idle-gap timeout for all outbound HTTP from the agent subprocess, including remote LLM streams. When a stream goes silent there's no built-in protection — the SDK waits forever.

Approach

Bobbit injects a CommonJS preload via NODE_OPTIONS=--require=... into every agent subprocess (direct and Docker-exec). The preload monkey-patches undici.setGlobalDispatcher so when pi installs its EnvHttpProxyAgent, the dispatcher gets wrapped in an IdleTimeoutDispatcher that injects bodyTimeout / headersTimeout per-request — but only when opts.origin is non-local and non-trusted.
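The wrapping step described above could look roughly like the sketch below. It is duck-typed so it runs without undici installed; a real wrapper would likely extend undici's Dispatcher class. The `shouldApplyTimeouts` predicate and the `patchSetGlobalDispatcher` helper are hypothetical names used for illustration, not the PR's actual code.

```javascript
// Sketch only: not the PR's actual IdleTimeoutDispatcher.
// `shouldApplyTimeouts` is a hypothetical predicate that returns true
// for remote, non-trusted origins.
class IdleTimeoutDispatcher {
  constructor(inner, { bodyTimeout, headersTimeout, shouldApplyTimeouts }) {
    this.inner = inner;
    this.bodyTimeout = bodyTimeout;
    this.headersTimeout = headersTimeout;
    this.shouldApplyTimeouts = shouldApplyTimeouts;
  }

  dispatch(opts, handler) {
    // Local or trusted origin: forward the request options untouched.
    if (!this.shouldApplyTimeouts(String(opts.origin))) {
      return this.inner.dispatch(opts, handler);
    }
    const patched = { ...opts };
    // A positive caller-supplied value wins; 0 or undefined gets the default.
    if (!(patched.bodyTimeout > 0)) patched.bodyTimeout = this.bodyTimeout;
    if (!(patched.headersTimeout > 0)) patched.headersTimeout = this.headersTimeout;
    return this.inner.dispatch(patched, handler);
  }

  // Lifecycle calls are forwarded to the wrapped dispatcher.
  close(...args) { return this.inner.close(...args); }
  destroy(...args) { return this.inner.destroy(...args); }
}

// Replace setGlobalDispatcher on a module-like object so that whatever
// dispatcher is installed later ends up wrapped.
function patchSetGlobalDispatcher(undiciLike, wrapOptions) {
  const original = undiciLike.setGlobalDispatcher;
  undiciLike.setGlobalDispatcher = (dispatcher) =>
    original.call(undiciLike, new IdleTimeoutDispatcher(dispatcher, wrapOptions));
}
```

Injecting the values per request, rather than reconfiguring the agent, leaves pi's own `bodyTimeout: 0` / `headersTimeout: 0` settings in force for every origin the predicate rejects.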

Origin classification covers loopback, RFC1918, IPv6 ULA, Tailscale CGNAT, and .local / .localhost hostnames. Public-DNS AI gateways (Anthropic, OpenAI, ai-gateway.c3.zone, Bedrock, Vercel/Cloudflare AI Gateway) are treated as remote — operators opt them out via BOBBIT_TRUSTED_NO_TIMEOUT_ORIGINS if their gateway fronts a buffering on-prem backend.
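As an illustration of the classification above, a minimal sketch of an `isLocalOrigin` check follows. The function name appears in the test description, but the body here is an assumption about how such a check might be written; in particular, treating unparseable origins as remote is a choice made for this sketch only.

```javascript
// Sketch only: the real isLocalOrigin in the preload may differ.
function isLocalOrigin(origin) {
  let host;
  try {
    host = new URL(origin).hostname.toLowerCase();
  } catch {
    return false; // assumption: unparseable origins are treated as remote
  }
  // WHATWG URL keeps the brackets around IPv6 literals; strip them.
  if (host.startsWith("[") && host.endsWith("]")) host = host.slice(1, -1);

  if (host === "localhost" || host.endsWith(".localhost") || host.endsWith(".local")) {
    return true;
  }

  const v4 = host.match(/^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/);
  if (v4) {
    const a = Number(v4[1]);
    const b = Number(v4[2]);
    if (a === 127) return true;                        // loopback 127.0.0.0/8
    if (a === 10) return true;                         // RFC1918 10.0.0.0/8
    if (a === 172 && b >= 16 && b <= 31) return true;  // RFC1918 172.16.0.0/12
    if (a === 192 && b === 168) return true;           // RFC1918 192.168.0.0/16
    if (a === 100 && b >= 64 && b <= 127) return true; // CGNAT 100.64.0.0/10 (Tailscale)
    return false;
  }

  if (host === "::1") return true;                     // IPv6 loopback
  if (/^f[cd][0-9a-f]{2}:/.test(host)) return true;    // IPv6 ULA fc00::/7
  return false;
}
```

Under this sketch `http://192.168.1.20:1234` and `http://[fd12:3456::1]` classify as local, while `https://api.anthropic.com` classifies as remote and would receive the injected timeouts.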

For Docker, the preload is bind-mounted into containers and exec-time --require is gated on a probe (docker exec <id> test -f ...) so stale pre-upgrade containers don't crash. Project / sandbox containers carry a CONTAINER_FEATURE_VERSION=preload-1 label so old containers are treated as not-found and auto-recreated with the new mount.

Public contract

Three env vars (all optional):

  • BOBBIT_REMOTE_BODY_TIMEOUT_MS (default 120000)
  • BOBBIT_REMOTE_HEADERS_TIMEOUT_MS (default 60000)
  • BOBBIT_TRUSTED_NO_TIMEOUT_ORIGINS (comma-separated URL origins, default empty)
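A minimal sketch of how these three variables might be read is shown below. The helper names `parseTimeoutMs` and `readTimeoutConfig` are hypothetical, and the fallback rules (non-positive or non-numeric values revert to the default, malformed origins are skipped) are assumptions rather than documented behaviour. Comparing `URL.origin` strings is one way to get the port-aware matching mentioned in the tests.

```javascript
// Sketch only: parsing and fallback rules here are assumptions.
function parseTimeoutMs(raw, fallback) {
  const n = Number(raw);
  // Assumption: only positive finite numbers are accepted.
  return raw !== undefined && Number.isFinite(n) && n > 0 ? n : fallback;
}

function readTimeoutConfig(env) {
  const trustedOrigins = new Set();
  for (const entry of (env.BOBBIT_TRUSTED_NO_TIMEOUT_ORIGINS || "").split(",")) {
    const trimmed = entry.trim();
    if (!trimmed) continue;
    try {
      // URL.origin is scheme + host + port (default ports are dropped).
      trustedOrigins.add(new URL(trimmed).origin);
    } catch {
      // Assumption: malformed entries are ignored.
    }
  }
  return {
    bodyTimeoutMs: parseTimeoutMs(env.BOBBIT_REMOTE_BODY_TIMEOUT_MS, 120000),
    headersTimeoutMs: parseTimeoutMs(env.BOBBIT_REMOTE_HEADERS_TIMEOUT_MS, 60000),
    trustedOrigins,
  };
}

function isTrustedNoTimeout(origin, trustedOrigins) {
  try {
    return trustedOrigins.has(new URL(origin).origin);
  } catch {
    return false;
  }
}
```

With this approach `https://gw.example.com:8443` and `https://gw.example.com` are distinct entries, so trusting one port does not exempt the other.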

Files changed

  • defaults/agent-preload/undici-idle-timeouts.cjs (new)
  • src/server/agent/rpc-bridge.ts (preload wiring, direct + docker branches)
  • src/server/agent/docker-args.ts (preload mount, CONTAINER_FEATURE_VERSION)
  • src/server/agent/project-sandbox.ts (version-aware container reuse)
  • tests/undici-idle-timeouts.test.ts (new — 39 tests)
  • tests/container-feature-version.test.ts (new)
  • docs/internals.md, docs/debugging.md (documentation)

Validation

  • npm run check — pass
  • npm run test:unit — pass (1044 passed, 1 skipped, including all 39 new tests)
  • E2E + LLM gap-analysis + code-quality + security + QA verification all pass

Out of scope

  • Recovery of the already-hung session 035f7442 — preload only affects newly spawned subprocesses.
  • Gateway-side stall watchdog for in-flight sessions (separate concern).

🤖 Generated with Bobbit

SuuBro and others added 6 commits May 13, 2026 11:54
pi-coding-agent's cli.js disables undici bodyTimeout/headersTimeout globally
to accommodate buffered local vLLM responses, which leaves remote LLM SSE
streams unprotected. When a remote stream goes silent the agent hangs
indefinitely with status: streaming and no error surfaced.

Add a CommonJS preload (defaults/agent-preload/undici-idle-timeouts.cjs)
that monkey-patches undici.setGlobalDispatcher to wrap pi's dispatcher in
an IdleTimeoutDispatcher. The wrapper injects bodyTimeout/headersTimeout
on per-request opts only for non-local, non-trusted origins — preserving
the existing no-timeout behaviour for localhost/RFC1918/Tailscale CGNAT/
.local backends.

Wire the preload via NODE_OPTIONS=--require=... in rpc-bridge.ts for both
direct-spawn and docker-exec branches. docker-args.ts bind-mounts the
.cjs file read-only at /bobbit-preload/. Env vars:
  BOBBIT_REMOTE_BODY_TIMEOUT_MS     default 120000
  BOBBIT_REMOTE_HEADERS_TIMEOUT_MS  default 60000
  BOBBIT_TRUSTED_NO_TIMEOUT_ORIGINS default ''
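
For illustration, the NODE_OPTIONS wiring could be built along these lines. `buildPreloadEnv` is a hypothetical helper, not a function from rpc-bridge.ts, and appending to an existing NODE_OPTIONS value (rather than overwriting it) is an assumption of this sketch.

```javascript
// Sketch only: `buildPreloadEnv` and its arguments are hypothetical names.
function buildPreloadEnv(baseEnv, preloadPath) {
  const requireFlag = `--require=${preloadPath}`;
  const existing = (baseEnv.NODE_OPTIONS || "").trim();
  // Append rather than overwrite so options already set by the operator survive,
  // and skip the append if the flag is already present.
  const nodeOptions = existing.includes(requireFlag)
    ? existing
    : [existing, requireFlag].filter(Boolean).join(" ");
  return { ...baseEnv, NODE_OPTIONS: nodeOptions };
}
```

A preload path containing spaces would need quoting inside NODE_OPTIONS, which this sketch does not handle.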

scripts/copy-defaults.mjs already copies the whole defaults/ tree so
the new agent-preload/ subdir lands in dist automatically.

Adds 39 unit tests covering isLocalOrigin (IPv4/IPv6/RFC1918/Tailscale/
.local), isTrustedNoTimeout (env parsing, port-aware matching), and
IdleTimeoutDispatcher.dispatch (injection, passthrough, trusted-origin
opt-out, caller-supplied positive value wins, close/destroy forwarding).

Co-authored-by: bobbit-ai <bobbit@bobbit.ai>
… labels

Two coordinated fixes for the brittleness in the docker-exec idle-timeout
preload wiring:

A. Probe-gate exec-time --require flag in rpc-bridge.spawnDockerExec.
   Previously the flag was injected unconditionally, but the corresponding
   bind mount in docker-args.buildDockerRunArgs is guarded by fs.statSync —
   so a missing host preload (dev tree pre-build) produced a container
   without /bobbit-preload, then 'node --require=<missing>' exited
   immediately, breaking every session in that container.

   New behaviour: on first exec into each container, probe
   'docker exec <id> test -f /bobbit-preload/undici-idle-timeouts.cjs'
   (5s timeout, cached per containerId). If present → inject --require.
   If absent → emit a one-line warn and OMIT --require (timeout env vars
   are still exported; harmless without the preload). Short-circuits to
   false if the host PRELOAD_PATH is missing.
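
   The probe-and-cache logic described above might be sketched as follows. The `run` function is injected (a hypothetical synchronous runner returning the exit code) so the caching behaviour can be shown without spawning docker; `createPreloadProbe` and `hostPreloadExists` are likewise illustrative names.

```javascript
// Sketch only: not the code in rpc-bridge.spawnDockerExec.
const CONTAINER_PRELOAD_PATH = "/bobbit-preload/undici-idle-timeouts.cjs";

function createPreloadProbe({ run, hostPreloadExists, warn = console.warn }) {
  const cache = new Map(); // containerId -> boolean

  return function hasPreload(containerId) {
    // Short-circuit: if the host file is missing, nothing was mounted.
    if (!hostPreloadExists) return false;
    if (cache.has(containerId)) return cache.get(containerId);

    let present = false;
    try {
      const args = ["exec", containerId, "test", "-f", CONTAINER_PRELOAD_PATH];
      present = run("docker", args, { timeout: 5000 }) === 0;
    } catch {
      present = false; // a failed probe is treated as "absent"
    }
    if (!present) {
      warn(`preload missing in container ${containerId}; omitting --require`);
    }
    cache.set(containerId, present);
    return present;
  };
}
```

   In real use, `run` could be something like `(cmd, args, opts) => spawnSync(cmd, args, opts).status` from node:child_process; a timed-out probe yields a null status, which this sketch treats as absent.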

B. Stamp containers with a feature-version label so stale containers from
   older bobbit get recreated on upgrade. Project-sandbox._initContainer
   was matching solely on 'bobbit-project=<id>' and reusing any hit; pre-
   preload containers therefore got re-exec'd without the bind mount and
   hit the same 'Cannot find module' failure.

   New behaviour: docker-args exports CONTAINER_FEATURE_VERSION
   ('preload-1'). _initContainer passes it as labelVersion to
   buildDockerRunArgs (so new containers get '<prefix>-version=preload-1')
   AND filters _findContainerByLabel on BOTH the project label and the
   version label. Old containers lack the version label, fall through to
   not-found, get a fresh container created.

   _findContainerByLabel signature widened to accept string | string[];
   multiple --filter label=… args are emitted (docker AND-joins them).
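
   A rough sketch of the widened lookup follows. `buildLabelFilterArgs` and `projectContainerLabels` are hypothetical helpers (the real logic lives in _findContainerByLabel and buildDockerRunArgs), and the exact `docker ps` flags are an assumption.

```javascript
// Sketch only: helper names and the docker ps flags are illustrative.
const CONTAINER_FEATURE_VERSION = "preload-1";

// Accepts string | string[] and emits one --filter per label.
function buildLabelFilterArgs(labels) {
  const list = Array.isArray(labels) ? labels : [labels];
  return ["ps", "-a", "-q", ...list.flatMap((label) => ["--filter", `label=${label}`])];
}

// Labels a container must carry to be reused after an upgrade.
function projectContainerLabels(projectId, prefix = "bobbit-project") {
  return [`${prefix}=${projectId}`, `${prefix}-version=${CONTAINER_FEATURE_VERSION}`];
}
```

   Because a pre-upgrade container has no version label, a lookup built from both labels returns nothing for it, which is what triggers the fresh container.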

Tests: tests/container-feature-version.test.ts asserts buildDockerRunArgs
emits the '<prefix>-version=<v>' label when labelVersion is passed and
omits it otherwise, for both 'bobbit-project' and 'bobbit-sandbox' prefixes.

npm run check and npm run test:unit (targeted subset) pass.

Co-authored-by: bobbit-ai <bobbit@bobbit.ai>