claude-code adapter (pre-release / "coming soon"): in-VM Write/Bash/Skill/sub-agent tools stall ~120s — sidecar stays alive, agent ext-turn never completes

> **⚠️ CORRECTION (2026-06-27, see latest comment):** A runtime frame-capture + `RUST_LOG=debug` sidecar-log trace **contradicts** the stale-response / read-loop-death root cause described below. The sidecar stays **alive** the whole hang (doing TLS to the LLM endpoint); the in-VM agent's `ext` turn simply never completes — the sidecar itself even warns `"ext request still pending — possible stall before response frame"`. Treat the analysis below as a **superseded hypothesis**; the accurate symptom is in the latest comment.

---

# `claude` agent: every tool routed through the `ext` extension runtime (Write / Bash / Skill / sub-agent) hangs with `timed out waiting for sidecar protocol frame for ext`

## Summary

With a `claude` session, the agent boots, authenticates, and can use the **Read** tool and reply normally. But **any tool that goes through the `ext` extension runtime** — the **Write** tool, **Bash** tool, the **Skill** tool, or spawning a **sub-agent (Task)** — hangs ~120s and fails with:

```
timed out waiting for sidecar protocol frame for ext
stderr:
… level=info message="sidecar process started"
… level=info message="vm phase" phase=create_vm elapsed_ms=42
(no further frames; the ext request never returns an `ext_result`)
```

So a real coding-agent run can't complete (it can't write files, run shell commands, use skills, or fan out). Read-only / trivial sessions work fine, which is why this isn't obvious until the agent first reaches for a write/exec tool.

## Confirmed NOT a usage error

- Reproduces with the **canonical setup** from `examples/docs/agents/claude/`: default `HOME=/home/agentos`, user `agentos`, `createSession("claude", {cwd, env})`, no overrides. (Initially suspected my `HOME=/root` override — ruled out; same failure with the documented home.)
- Read tool + trivial prompts succeed in the same session config, so auth, VM, and the SDK bridge are healthy. The failure is specific to the `ext` extension request path.

## Reproduces on Node 22 and 25

Identical failure under fresh installs on Node v22.22.3 and v25.9.0 (macOS 15.7.4, arm64). Not a Node-version/ABI artifact.

## Reproduction

```bash
npm i @rivet-dev/agentos-core @agentos-software/common @agentos-software/claude-code
export ANTHROPIC_API_KEY=sk-...
node repro.mjs
```

```js
// repro.mjs
import { AgentOs } from "@rivet-dev/agentos-core";
import common from "@agentos-software/common";
import claude from "@agentos-software/claude-code";
const race = (p, ms) => Promise.race([p, new Promise((_, r) => setTimeout(() => r(new Error("TIMEOUT")), ms))]);

const vm = await AgentOs.create({ software: [common, claude] });
const { sessionId } = await vm.createSession("claude", {
  cwd: "/home/agentos",
  env: { ANTHROPIC_API_KEY: process.env.ANTHROPIC_API_KEY },
});

// WORKS: Read-only / trivial → returns fine.
console.log(await vm.prompt(sessionId, "Reply with exactly: OK"));

// HANGS ~120s → `timed out waiting for sidecar protocol frame for ext`
try {
  console.log(await race(
    vm.prompt(sessionId, "Write HELLO to /home/agentos/out.txt with your file tool, then reply DONE."),
    150000));
} catch (e) { console.error("FAILED:", e.message); }
await vm.dispose();
```

## What works vs fails

| Works (in-process, no `ext`) | Fails — `ext` request times out (~120s) |
|---|---|
| `AgentOs.create()`, `vm.exec()` (direct) | agent **Write** tool |
| auth + trivial agent reply | agent **Bash** tool |
| agent **Read** tool | **Skill** tool (using a discovered skill) |
| | **sub-agent (Task)** spawn |

## Root cause (traced end-to-end)

This is a **shared-sidecar lifecycle bug in `secure-exec-sidecar 0.3.1`** that was **already fixed in `0.3.2`** (secure-exec PR #133 / tag `v0.3.2`, commit `d8a4435`) — but `0.3.1` is what `@agentos-software/claude-code@0.2.0` drags onto the tool path via its exact pin on `@rivet-dev/agentos-core@0.2.0`.

Mechanism:
1. Write/Bash/Skill/Task each provision a VM on the *shared* secure-exec sidecar and dispose it (`phase=create_vm`). The `ext` extension that backs them is `dev.rivet.agent-os.acp` (`agentos crates/agentos-protocol/src/lib.rs:7`, `AcpExtension` in `crates/agentos-sidecar/src/acp_extension.rs`), baked into the sidecar binary. (An *unregistered* namespace would reject fast — `service.rs:1483-1492` — so this is not a missing-registration / protocol-version issue; those fail immediately, not after 120s.)
2. When a per-VM `sidecar_response` arrives after that VM was torn down, `SidecarResponseTracker::accept_response` returns `UnmatchedResponse`/`DuplicateResponse` (`secure-exec crates/sidecar/src/protocol.rs:1899-1913`).
3. **Pre-0.3.2**, that error propagates: `accept_wire_sidecar_response` → `stdio.rs:364` `…?` → the main read loop's `handle_protocol_frame(...).await?` (`stdio.rs:232,252`) **exits the read loop**. The sidecar stops servicing *all* frames.
4. The host's in-flight `ext` request never gets `ext_result` and times out after `DEFAULT_SIDECAR_FRAME_TIMEOUT_MS = 120_000` (`secure-exec packages/core/src/native-client.ts:17`) → the exact error in `protocol-client.ts:122-124`. Matches the 120s + the `started → create_vm → silence` signature precisely.
5. **0.3.2 fixes it** by tolerating stale responses (drop + `warn` instead of dying): `secure-exec crates/sidecar/src/service.rs:2720-2742` (*"a per-VM `sidecar_request` can be answered by the host after that VM has been torn down (multiple VMs share one sidecar process)"*).

## This is a same-version-lockstep break

Per `AGENTS.md`: *"The protocol has no backwards compatibility. Clients and the sidecar ship in same-version lockstep... the single same-version wire handshake is the only version check."* And the package tracks: *"agent-os product/API (`@rivet-dev/agentos*`)... pins compatible secure-exec and registry package versions"*, while *"`@agentos-software/*` registry packages... [are] versioned independently."*

The `@agentos-software/claude-code` ACP adapter therefore must run in lockstep with the `agentos-core`/sidecar it talks to. But `@agentos-software/claude-code@0.2.0` exact-pins `@rivet-dev/agentos-core@0.2.0`, so the README install `npm i @rivet-dev/agentos-core @agentos-software/claude-code` resolves core **0.2.2** (latest, with the fixed `secure-exec-sidecar 0.3.2`) for the host alongside the adapter's pinned core **0.2.0** (unfixed `0.3.1`) — a lockstep break that routes tool execution through the unfixed sidecar.

## The fix (maintainer-side)

Publish a `@agentos-software/claude-code` that depends on the **same** `@rivet-dev/agentos-core` version agent-os ships (`0.2.2`), so the adapter and sidecar are in lockstep on the fixed `secure-exec-sidecar 0.3.2`. (That package has no public repo, so this is maintainer action; per `AGENTS.md` the registry packages are version-managed via `just agentos-pkgs-set-version`.)

**An npm `overrides` workaround is NOT valid here** and is not recommended: forcing `agentos-core@0.2.2` under a `claude-code@0.2.0` *build* removes the 120s hang (confirming the root cause — the Write prompt returns `end_turn` instead of timing out) but leaves a 0.2.0-built adapter running against a 0.2.2 protocol, i.e. exactly the lockstep break `AGENTS.md` forbids — and in testing tool calls did not cleanly persist in that mixed state. Only a lockstep-rebuilt adapter fixes it.

## Secondary: the silent 120s hang is an observability gap (per AGENTS.md)

Independent of the version fix: `AGENTS.md` → *Limits, Bounds & Observability* states *"The default 120s ACP method timeout is the adapter-stall failure mode — make it observable, not a silent 120s hang,"* and that ACP timeouts carry `data.kind === "acp_timeout"` while *"the native-sidecar frame timeout... [should] emit a structured near-threshold signal (default ≥80%) and fail with a typed error that names the limit and how to raise it."*

This bug is precisely that failure mode: the native-sidecar `ext` frame timeout (`secure-exec packages/core/src/protocol-client.ts:122-124`, `DEFAULT_SIDECAR_FRAME_TIMEOUT_MS = 120_000`) surfaces as a bare 120s silent hang with no near-threshold warning and no typed error/kind. Even after the lockstep fix, giving the native-sidecar frame timeout the same typed-error + warn-on-approach treatment as `acp_timeout` would have turned this from a 120s silent hang into an immediately actionable error.

## Notes

- Reproduces on Node 22 and 25 → not a Node/ABI artifact. macOS arm64.
- The slash-skill `Unknown skill` behaviour (a `/100x:...` or personal `~/.claude/skills` skill not resolving) is a **separate** matter from this hang and is not covered here.

## Environment

| | |
|---|---|
| OS | macOS 15.7.4, arm64 |
| Node | v25.9.0 and v22.22.3 (same result) |
| `@rivet-dev/agentos-core` | 0.2.2 (README install) — also tried 0.2.0 |
| `@agentos-software/claude-code` | 0.2.0 (pins agentos-core `0.2.0`) |
| `@secure-exec/core` | 0.3.2 (top) + 0.3.1 (nested under claude-code) |
| `@rivet-dev/agentos-sidecar` | 0.2.2 |
| `@anthropic-ai/claude-agent-sdk` | 0.2.141 |


Works (in-process, no `ext`)	Fails — `ext` request times out (~120s)
`AgentOs.create()`, `vm.exec()` (direct)	agent Write tool
auth + trivial agent reply	agent Bash tool
agent Read tool	Skill tool (using a discovered skill)
	sub-agent (Task) spawn

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

claude-code adapter (pre-release / "coming soon"): in-VM Write/Bash/Skill/sub-agent tools stall ~120s — sidecar stays alive, agent ext-turn never completes #1542

`claude` agent: every tool routed through the `ext` extension runtime (Write / Bash / Skill / sub-agent) hangs with `timed out waiting for sidecar protocol frame for ext`

Summary

Confirmed NOT a usage error

Reproduces on Node 22 and 25

Reproduction

What works vs fails

Root cause (traced end-to-end)

This is a same-version-lockstep break

The fix (maintainer-side)

Secondary: the silent 120s hang is an observability gap (per AGENTS.md)

Notes

Environment

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development


OS	macOS 15.7.4, arm64
Node	v25.9.0 and v22.22.3 (same result)
`@rivet-dev/agentos-core`	0.2.2 (README install) — also tried 0.2.0
`@agentos-software/claude-code`	0.2.0 (pins agentos-core `0.2.0`)
`@secure-exec/core`	0.3.2 (top) + 0.3.1 (nested under claude-code)
`@rivet-dev/agentos-sidecar`	0.2.2
`@anthropic-ai/claude-agent-sdk`	0.2.141

Uh oh!

claude-code adapter (pre-release / "coming soon"): in-VM Write/Bash/Skill/sub-agent tools stall ~120s — sidecar stays alive, agent ext-turn never completes #1542

Description

claude agent: every tool routed through the ext extension runtime (Write / Bash / Skill / sub-agent) hangs with timed out waiting for sidecar protocol frame for ext

Summary

Confirmed NOT a usage error

Reproduces on Node 22 and 25

Reproduction

What works vs fails

Root cause (traced end-to-end)

This is a same-version-lockstep break

The fix (maintainer-side)

Secondary: the silent 120s hang is an observability gap (per AGENTS.md)

Notes

Environment

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

`claude` agent: every tool routed through the `ext` extension runtime (Write / Bash / Skill / sub-agent) hangs with `timed out waiting for sidecar protocol frame for ext`