claude-code adapter (pre-release / "coming soon"): in-VM Write/Bash/Skill/sub-agent tools stall ~120s — sidecar stays alive, agent ext-turn never completes #1542

@idl3

⚠️ CORRECTION (2026-06-27, see latest comment): A runtime frame-capture + RUST_LOG=debug sidecar-log trace contradicts the stale-response / read-loop-death root cause described below. The sidecar stays alive the whole hang (doing TLS to the LLM endpoint); the in-VM agent's ext turn simply never completes — the sidecar itself even warns "ext request still pending — possible stall before response frame". Treat the analysis below as a superseded hypothesis; the accurate symptom is in the latest comment.


claude agent: every tool routed through the ext extension runtime (Write / Bash / Skill / sub-agent) hangs with "timed out waiting for sidecar protocol frame for ext"

Summary

With a claude session, the agent boots, authenticates, uses the Read tool, and replies normally. But any tool that goes through the ext extension runtime (the Write tool, the Bash tool, the Skill tool, or spawning a sub-agent via Task) hangs for ~120s and then fails with:

timed out waiting for sidecar protocol frame for ext
stderr:
… level=info message="sidecar process started"
… level=info message="vm phase" phase=create_vm elapsed_ms=42
(no further frames; the ext request never returns an `ext_result`)

So a real coding-agent run can't complete (it can't write files, run shell commands, use skills, or fan out). Read-only / trivial sessions work fine, which is why this isn't obvious until the agent first reaches for a write/exec tool.

Confirmed NOT a usage error

  • Reproduces with the canonical setup from examples/docs/agents/claude/: default HOME=/home/agentos, user agentos, createSession("claude", {cwd, env}), no overrides. (Initially suspected my HOME=/root override — ruled out; same failure with the documented home.)
  • Read tool + trivial prompts succeed in the same session config, so auth, VM, and the SDK bridge are healthy. The failure is specific to the ext extension request path.

Reproduces on Node 22 and 25

Identical failure under fresh installs on Node v22.22.3 and v25.9.0 (macOS 15.7.4, arm64). Not a Node-version/ABI artifact.

Reproduction

npm i @rivet-dev/agentos-core @agentos-software/common @agentos-software/claude-code
export ANTHROPIC_API_KEY=sk-...
node repro.mjs

// repro.mjs
import { AgentOs } from "@rivet-dev/agentos-core";
import common from "@agentos-software/common";
import claude from "@agentos-software/claude-code";
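// Reject with "TIMEOUT" if `p` has not settled within `ms`.
// The timer is cleared once the race settles so a fast success does not keep Node alive.
const race = (p, ms) => {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error("TIMEOUT")), ms);
  });
  return Promise.race([p, timeout]).finally(() => clearTimeout(timer));
};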

const vm = await AgentOs.create({ software: [common, claude] });
const { sessionId } = await vm.createSession("claude", {
  cwd: "/home/agentos",
  env: { ANTHROPIC_API_KEY: process.env.ANTHROPIC_API_KEY },
});

// WORKS: Read-only / trivial → returns fine.
console.log(await vm.prompt(sessionId, "Reply with exactly: OK"));

// HANGS ~120s → `timed out waiting for sidecar protocol frame for ext`
try {
  console.log(await race(
    vm.prompt(sessionId, "Write HELLO to /home/agentos/out.txt with your file tool, then reply DONE."),
    150000));
} catch (e) { console.error("FAILED:", e.message); }
await vm.dispose();

What works vs fails

| Works (in-process, no ext) | Fails: ext request times out (~120s) |
| --- | --- |
| AgentOs.create(), vm.exec() (direct) | agent Write tool |
| auth + trivial agent reply | agent Bash tool |
| agent Read tool | Skill tool (using a discovered skill) |
| | sub-agent (Task) spawn |

Root cause (traced end-to-end)

This is a shared-sidecar lifecycle bug in secure-exec-sidecar 0.3.1 that was already fixed in 0.3.2 (secure-exec PR #133 / tag v0.3.2, commit d8a4435) — but 0.3.1 is what @agentos-software/claude-code@0.2.0 drags onto the tool path via its exact pin on @rivet-dev/agentos-core@0.2.0.

Mechanism:

  1. Write/Bash/Skill/Task each provision a VM on the shared secure-exec sidecar and dispose it (phase=create_vm). The ext extension that backs them is dev.rivet.agent-os.acp (agentos crates/agentos-protocol/src/lib.rs:7, AcpExtension in crates/agentos-sidecar/src/acp_extension.rs), baked into the sidecar binary. (An unregistered namespace would reject fast — service.rs:1483-1492 — so this is not a missing-registration / protocol-version issue; those fail immediately, not after 120s.)
  2. When a per-VM sidecar_response arrives after that VM was torn down, SidecarResponseTracker::accept_response returns UnmatchedResponse/DuplicateResponse (secure-exec crates/sidecar/src/protocol.rs:1899-1913).
  3. Pre-0.3.2, that error propagates: accept_wire_sidecar_response (stdio.rs:364) …? → the main read loop's handle_protocol_frame(...).await? (stdio.rs:232,252) exits the read loop, and the sidecar stops servicing all frames.
  4. The host's in-flight ext request never gets ext_result and times out after DEFAULT_SIDECAR_FRAME_TIMEOUT_MS = 120_000 (secure-exec packages/core/src/native-client.ts:17) → the exact error in protocol-client.ts:122-124. Matches the 120s + the started → create_vm → silence signature precisely.
  5. 0.3.2 fixes it by tolerating stale responses (drop + warn instead of dying): secure-exec crates/sidecar/src/service.rs:2720-2742 ("a per-VM sidecar_request can be answered by the host after that VM has been torn down (multiple VMs share one sidecar process)").
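
The difference between steps 3 and 5 can be sketched with a toy model. This is an illustration of the hypothesis as written above (which the correction at the top of this issue marks as superseded), not the secure-exec implementation: `ResponseTracker`, `runReadLoop`, and the `vm1:req1`-style ids are hypothetical names made up for the example.

```javascript
// Toy model of the hypothesised mechanism; names are illustrative, not the secure-exec API.
class ResponseTracker {
  constructor() {
    this.pending = new Set(); // request ids still awaiting a response
  }
  expect(id) {
    this.pending.add(id);
  }
  // Throws when the response matches no pending request (e.g. its VM was torn down).
  accept(id) {
    if (!this.pending.delete(id)) {
      throw new Error(`unmatched response: ${id}`);
    }
  }
  // Forget every pending request that belonged to a disposed VM.
  dropVm(prefix) {
    for (const id of [...this.pending]) {
      if (id.startsWith(prefix)) this.pending.delete(id);
    }
  }
}

// Process response frames in order. With tolerateStale=false the first stale
// response aborts the loop (the pre-0.3.2 behaviour described above); with
// tolerateStale=true it is dropped with a warning and later frames are still handled.
function runReadLoop(tracker, frames, { tolerateStale, warn = () => {} }) {
  const handled = [];
  for (const id of frames) {
    try {
      tracker.accept(id);
      handled.push(id);
    } catch (err) {
      if (!tolerateStale) return { handled, died: true };
      warn(err.message);
    }
  }
  return { handled, died: false };
}
```

In the strict mode a single late response for a disposed VM ends the loop, so a healthy request queued behind it is never answered. In the tolerant mode the same input yields one warning and the healthy request is still handled.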

This is a same-version-lockstep break

Per AGENTS.md: "The protocol has no backwards compatibility. Clients and the sidecar ship in same-version lockstep... the single same-version wire handshake is the only version check." And the package tracks: "agent-os product/API (@rivet-dev/agentos*)... pins compatible secure-exec and registry package versions", while "@agentos-software/* registry packages... [are] versioned independently."

The @agentos-software/claude-code ACP adapter therefore must run in lockstep with the agentos-core/sidecar it talks to. But @agentos-software/claude-code@0.2.0 exact-pins @rivet-dev/agentos-core@0.2.0, so the README install npm i @rivet-dev/agentos-core @agentos-software/claude-code resolves core 0.2.2 (latest, with the fixed secure-exec-sidecar 0.3.2) for the host alongside the adapter's pinned core 0.2.0 (unfixed 0.3.1) — a lockstep break that routes tool execution through the unfixed sidecar.
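
A duplicated core like this can be spotted mechanically. The sketch below assumes the nested `dependencies` shape that `npm ls --json` emits; `collectVersions`, `hasLockstepBreak`, and the abridged `exampleTree` are illustrative helpers, not part of any agent-os tooling.

```javascript
// Walk an `npm ls --json` style tree and record every installed version of `name`
// together with the dependency path that pulled it in.
function collectVersions(tree, name, path = [], found = []) {
  for (const [depName, dep] of Object.entries(tree.dependencies ?? {})) {
    const depPath = [...path, depName];
    if (depName === name && dep.version) {
      found.push({ version: dep.version, path: depPath.join(" > ") });
    }
    collectVersions(dep, name, depPath, found);
  }
  return found;
}

// True when more than one distinct version of `name` is installed.
function hasLockstepBreak(tree, name) {
  const versions = collectVersions(tree, name).map((f) => f.version);
  return new Set(versions).size > 1;
}

// Abridged, hand-written tree shaped like the install described above.
const exampleTree = {
  dependencies: {
    "@rivet-dev/agentos-core": { version: "0.2.2" },
    "@agentos-software/claude-code": {
      version: "0.2.0",
      dependencies: { "@rivet-dev/agentos-core": { version: "0.2.0" } },
    },
  },
};

console.log(collectVersions(exampleTree, "@rivet-dev/agentos-core"));
console.log(hasLockstepBreak(exampleTree, "@rivet-dev/agentos-core")); // true
```

Feeding it the parsed output of `npm ls @rivet-dev/agentos-core --all --json` from a real install should show whether the host and the adapter resolve the same core version.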

The fix (maintainer-side)

Publish a @agentos-software/claude-code that depends on the same @rivet-dev/agentos-core version agent-os ships (0.2.2), so the adapter and sidecar are in lockstep on the fixed secure-exec-sidecar 0.3.2. (That package has no public repo, so this is maintainer action; per AGENTS.md the registry packages are version-managed via just agentos-pkgs-set-version.)

An npm overrides workaround is NOT valid here and is not recommended: forcing agentos-core@0.2.2 under a claude-code@0.2.0 build removes the 120s hang (confirming the root cause — the Write prompt returns end_turn instead of timing out) but leaves a 0.2.0-built adapter running against a 0.2.2 protocol, i.e. exactly the lockstep break AGENTS.md forbids — and in testing tool calls did not cleanly persist in that mixed state. Only a lockstep-rebuilt adapter fixes it.

Secondary: the silent 120s hang is an observability gap (per AGENTS.md)

Independent of the version fix: the AGENTS.md section "Limits, Bounds & Observability" states "The default 120s ACP method timeout is the adapter-stall failure mode — make it observable, not a silent 120s hang," and that ACP timeouts carry data.kind === "acp_timeout" while "the native-sidecar frame timeout... [should] emit a structured near-threshold signal (default ≥80%) and fail with a typed error that names the limit and how to raise it."

This bug is precisely that failure mode: the native-sidecar ext frame timeout (secure-exec packages/core/src/protocol-client.ts:122-124, DEFAULT_SIDECAR_FRAME_TIMEOUT_MS = 120_000) surfaces as a bare 120s silent hang with no near-threshold warning and no typed error/kind. Even after the lockstep fix, giving the native-sidecar frame timeout the same typed-error + warn-on-approach treatment as acp_timeout would have turned this from a 120s silent hang into an immediately actionable error.
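
As a rough sketch of what that treatment could look like on the host side: the wrapper below warns at 80% of the limit and rejects with a typed error at 100%. Everything here is hypothetical, including the `SidecarFrameTimeoutError` class, the `sidecar_frame_timeout` kind (chosen by analogy with `acp_timeout`), and the `withFrameTimeout` helper; none of it is taken from the secure-exec or agent-os sources.

```javascript
// Hypothetical typed error: names the limit so the failure is actionable.
class SidecarFrameTimeoutError extends Error {
  constructor(namespace, limitMs) {
    super(
      `no sidecar protocol frame for ${namespace} within ${limitMs}ms ` +
        `(raise the frame timeout to wait longer)`
    );
    this.name = "SidecarFrameTimeoutError";
    this.kind = "sidecar_frame_timeout"; // assumed kind, by analogy with "acp_timeout"
    this.limitMs = limitMs;
  }
}

// Race `pending` against the frame limit and call `onWarn` once the elapsed
// time crosses `warnRatio` of the limit. Timers are injectable for testing.
function withFrameTimeout(pending, opts) {
  const {
    namespace,
    limitMs = 120_000,
    warnRatio = 0.8,
    onWarn = () => {},
    setTimer = setTimeout,
    clearTimer = clearTimeout,
  } = opts;
  const warnMs = Math.round(limitMs * warnRatio);
  let warnTimer;
  let failTimer;
  const timeout = new Promise((_, reject) => {
    warnTimer = setTimer(
      () => onWarn({ namespace, elapsedMs: warnMs, limitMs }),
      warnMs
    );
    failTimer = setTimer(
      () => reject(new SidecarFrameTimeoutError(namespace, limitMs)),
      limitMs
    );
  });
  // Clear both timers as soon as either side of the race settles.
  return Promise.race([pending, timeout]).finally(() => {
    clearTimer(warnTimer);
    clearTimer(failTimer);
  });
}
```

With the default 120_000 ms limit the warning fires at 96_000 ms, so a caller would see a structured signal about 24 seconds before the request fails, and the rejection carries a `kind` that can be matched in code rather than a bare message string.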

Notes

  • Reproduces on Node 22 and 25 → not a Node/ABI artifact. macOS arm64.
  • The slash-skill Unknown skill behaviour (a /100x:... or personal ~/.claude/skills skill not resolving) is a separate matter from this hang and is not covered here.

Environment

| Component | Version |
| --- | --- |
| OS | macOS 15.7.4, arm64 |
| Node | v25.9.0 and v22.22.3 (same result) |
| @rivet-dev/agentos-core | 0.2.2 (README install); also tried 0.2.0 |
| @agentos-software/claude-code | 0.2.0 (pins agentos-core 0.2.0) |
| @secure-exec/core | 0.3.2 (top) + 0.3.1 (nested under claude-code) |
| @rivet-dev/agentos-sidecar | 0.2.2 |
| @anthropic-ai/claude-agent-sdk | 0.2.141 |
