fix(core): shared sidecar must not hang one-shot scripts; se 0.3.3 + agent×pkg-mgr e2e matrix #1551

Merged
NathanFlurry merged 1 commit into main from max-listeners on Jun 27, 2026
Conversation

@NathanFlurry

Member

Summary

Headline fix: the default shared sidecar pool no longer hangs one-shot scripts on exit. AgentOs.create() uses a process-global shared sidecar; vm.dispose() released the VM lease but left the sidecar's child process + stdio referenced, pinning the Node event loop — so every standalone script (all quickstart examples) completed its work and then hung until SIGINT.

How it works now

  • Counted event-loop hold. A hold is taken for the whole create→use→dispose lifetime of each VM lease (taken before VM creation, released on dispose/failure). The sidecar child + stdin/stdout/stderr are ref()'d while holds > 0 and unref()'d at 0. A counter (not a boolean) so concurrent create/dispose can't clobber each other.
  • unref() ≠ kill. The sidecar stays pooled and reusable across VM disposal and is re-ref()'d on the next lease. It's tied to the host-process lifetime, not the VM lifetime — so a long-running server keeps it; a finished one-shot script lets the loop drain.
  • Instant reap on exit. A one-time synchronous process.on("exit") SIGKILLs pooled sidecars, so a clean exit reaps them immediately (no orphan, no stdin-EOF grace wait). No SIGINT/SIGTERM handlers — the host owns signals (SIGINT still flows via the process group; SIGTERM-driven exit still closes stdin).
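
The counted hold described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the `EventLoopHold` class, the `Refable` interface, and the choice to start in the unref'd state are all assumptions made for the example. In Node, a `ChildProcess` exposes `ref()`/`unref()`, and its piped stdio streams are typically socket objects that expose them too.

```typescript
// Anything that can pin or release the event loop: a ChildProcess,
// or (typically) its piped stdin/stdout/stderr sockets.
interface Refable {
  ref(): void;
  unref(): void;
}

// Hypothetical helper: name and shape are assumptions for illustration.
class EventLoopHold {
  private holds = 0;
  private readonly handles: Refable[];

  constructor(handles: Refable[]) {
    this.handles = handles;
    // Assumption: start idle, so nothing pins the loop until a lease exists.
    for (const h of this.handles) h.unref();
  }

  // Taken before VM creation; only the 0 -> 1 transition re-refs the handles.
  acquire(): void {
    this.holds += 1;
    if (this.holds === 1) {
      for (const h of this.handles) h.ref();
    }
  }

  // Released on dispose or failure; only the 1 -> 0 transition unrefs.
  release(): void {
    if (this.holds === 0) {
      // Unbalanced release: warn rather than silently flooring at zero.
      console.warn("EventLoopHold: release without matching acquire");
      return;
    }
    this.holds -= 1;
    if (this.holds === 0) {
      for (const h of this.handles) h.unref();
    }
  }

  get count(): number {
    return this.holds;
  }
}
```

Because only the 0→1 and 1→0 transitions touch the handles, two overlapping leases cannot unref the sidecar while one of them is still in use, which is the failure mode a boolean flag would allow.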

Also in this branch

  • Pin secure-exec 0.3.3 (npm + crates).
  • tests/agent-pkg-matrix.e2e.test.ts — skipped-by-default (gated by AGENTOS_MATRIX_E2E) real-API matrix: 4 package managers × 4 agents, fresh installs, asserts live token streaming. Codifies the issues hit shipping the preview (stale model ids, OpenCode config, permission keys, gap-based streaming, ACP-bootstrap flakiness).
  • tests/shared-sidecar-clean-exit.test.ts — regression for the hang above (spawns a real create()+dispose() script with no process.exit() and asserts it terminates on its own).
  • agentos-sidecar crate packaging fix: AGENTOS_SYSTEM_PROMPT.md moved into the crate (src/) so cargo publish can package it (the prior out-of-crate include_str! broke the isolated package-verify build); dropped the now-empty fixtures dir from @rivet-dev/agentos-core's files.
  • docs(sessions) — document OpenCode agent config (model + provider baseURL ending in /v1 + cwd), the one real "didn't work out of the box" gap.
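
The idea behind the clean-exit regression test (spawn a script that never calls process.exit() and check that it terminates by itself) can be sketched as below. This is not the contents of tests/shared-sidecar-clean-exit.test.ts: the `exitsOnItsOwn` helper is hypothetical, and the inline scripts are stand-ins for a real create()+dispose() script.

```typescript
import { spawnSync } from "node:child_process";

// Hypothetical helper: runs `script` in a fresh Node process and reports
// whether it exited cleanly by itself before `timeoutMs` elapsed.
function exitsOnItsOwn(script: string, timeoutMs: number): boolean {
  const result = spawnSync(process.execPath, ["-e", script], {
    timeout: timeoutMs,    // on timeout the child is killed and `error` is set
    killSignal: "SIGKILL",
    stdio: "ignore",
  });
  return result.error === undefined && result.status === 0;
}

// Stand-in for a script whose event loop drains once its work is done.
const drains = "setTimeout(() => {}, 10)";
// Stand-in for the bug: a referenced handle keeps the loop alive forever.
const pinned = "setInterval(() => {}, 1000)";

console.log(exitsOnItsOwn(drains, 5000));
console.log(exitsOnItsOwn(pinned, 500));
```

A test built this way fails by timing out rather than by hanging the whole suite, which matters when the regression being guarded against is itself a hang.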

Review

Reviewed by two subagents (correctness/concurrency + branch/tests/docs). Addressed: counted (not boolean) hold to survive concurrent create/dispose; clear cached child + reset hold count on full dispose; loud warning if the secure-exec child handle can't be resolved (so the optimization can't silently regress); removed the empty fixtures dir from files; { recursive: true } in the docs mkdir; dist-existence skip guard on the clean-exit test.

Test evidence

  • Clean-exit regression test: pass (was a 60s hang).
  • All 4 quickstart examples exit cleanly verbatim: hello-world ~2.7s, filesystem ~3s, cron ~4s, agent-session ~7s (streamed + responded + exited); no orphaned sidecars.
  • agent-pkg-matrix against the published release: 16/16 live-streaming.
  • Core suite: 265 passed / 9 skipped, no regression.

🤖 Generated with Claude Code

@railway-app railway-app Bot temporarily deployed to agentos / agentos-pr-1551 June 27, 2026 21:45 Destroyed
@railway-app railway-app Bot temporarily deployed to rivet-frontend / agentos-pr-1551 June 27, 2026 21:45 Destroyed
@railway-app

railway-app Bot commented Jun 27, 2026


🚅 Deployed to the agentos-pr-1551 environment in agentos

Service Status Web Updated (UTC)
agentos ✅ Success (View Logs) Web Jun 27, 2026 at 10:33 pm

🚅 Deployed to the agentos-pr-1551 environment in rivet-frontend

Service Status Web Updated (UTC)
agent-os ✅ Success (View Logs) Jun 27, 2026 at 10:33 pm

@railway-app railway-app Bot temporarily deployed to agentos / agentos-pr-1551 June 27, 2026 22:08 Destroyed
@railway-app railway-app Bot temporarily deployed to rivet-frontend / agentos-pr-1551 June 27, 2026 22:08 Destroyed
@NathanFlurry

Member Author

Ran a second adversarial review pass over the counted-hold rewrite (it was written after the first review, so it was itself unreviewed). Fixed everything it surfaced and re-verified:

  • [BUG] rejected-spawn wedge — a failed sidecar spawn left nativeProcess as a rejected promise, permanently wedging the pool. Now cleared on failure so the next create() retries.
  • [BUG] orphan child on failed handshake — the child handle was cached after authenticateAndOpenSession(); a failed handshake left the spawned child untracked/unreapable and pinning the loop. Now cached right after spawn() and SIGKILL'd on failure.
  • [RISK] hold-counter clobber: disposeSharedSidecarNativeProcess no longer force-zeros the shared counter (could clobber a re-acquired generation); release-without-acquire now warns instead of silently flooring.
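
The rejected-spawn wedge in the first item is a general hazard of caching a promise: once the cached promise rejects, every later caller receives the same rejection. A minimal sketch of the fix pattern, assuming a hypothetical `LazyShared` wrapper and a `spawnFn` standing in for the real sidecar spawn (neither name comes from the PR):

```typescript
// Hypothetical wrapper that caches one in-flight or resolved spawn.
class LazyShared<T> {
  private pending: Promise<T> | undefined;
  private readonly spawnFn: () => Promise<T>;

  constructor(spawnFn: () => Promise<T>) {
    this.spawnFn = spawnFn;
  }

  get(): Promise<T> {
    if (this.pending !== undefined) return this.pending;

    const attempt = this.spawnFn();
    this.pending = attempt;
    attempt.catch(() => {
      // Clear the cache on failure so the next get() retries, but only
      // if a newer attempt has not already replaced this one.
      if (this.pending === attempt) this.pending = undefined;
    });
    return attempt;
  }
}
```

The identity check before clearing serves the same purpose as the counter fix in the last item: a stale failure handler should not clobber state that belongs to a newer generation.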

A separate verification pass confirmed the earlier fixes (counted hold, sharedChild clearing, internals-warning, package.json files, docs recursive, test dist guard) all landed correctly and found no further issues.

Re-verified: pnpm --dir packages/core build clean (0 TS errors); core suite 265 passed / 9 skipped; clean-exit regression passes; quickstart examples exit cleanly (~2.7–7s).

@railway-app railway-app Bot temporarily deployed to agentos / agentos-pr-1551 June 27, 2026 22:18 Destroyed
@railway-app railway-app Bot temporarily deployed to rivet-frontend / agentos-pr-1551 June 27, 2026 22:18 Destroyed
The default shared sidecar pool kept its child process + stdio referenced after
vm.dispose() released the last VM lease, so every standalone script (all the
quickstart examples) hung on exit and had to be SIGINT'd.

Fix: a counted event-loop hold on the shared sidecar, taken for the WHOLE
create→use→dispose lifetime of each VM lease. Child + stdio are ref'd while
holds > 0 and unref'd at 0, so in-flight VM work keeps the host alive while an
idle program can exit. unref != kill: the sidecar stays pooled/reusable and is
re-ref'd on the next lease. A one-time synchronous process 'exit' hook SIGKILLs
pooled sidecars so a clean exit reaps them instantly. No SIGINT/SIGTERM handlers
— the host owns signals.
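
The one-time synchronous 'exit' hook can be sketched as follows. The names `trackSidecar`, `reapAll`, and `Killable` are hypothetical and not taken from the PR; a real `ChildProcess` satisfies the `Killable` shape. Listeners for 'exit' must be synchronous, which sending a signal with `kill()` is.

```typescript
// Minimal shape the hook needs; a Node ChildProcess satisfies it.
interface Killable {
  kill(signal: "SIGKILL"): boolean;
}

const pooled = new Set<Killable>();
let exitHookInstalled = false;

// Runs synchronously inside process 'exit': SIGKILL every pooled child.
function reapAll(): void {
  for (const child of pooled) {
    try {
      child.kill("SIGKILL");
    } catch {
      // The child may already be gone; nothing left to reap.
    }
  }
  pooled.clear();
}

// Register a pooled child and install the exit hook exactly once.
function trackSidecar(child: Killable): void {
  pooled.add(child);
  if (!exitHookInstalled) {
    exitHookInstalled = true;
    process.on("exit", reapAll);
  }
}
```

Only 'exit' is hooked here. Installing SIGINT or SIGTERM handlers would change the host program's default signal behavior, which is why the commit leaves signals to the host.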

Hardened after adversarial review:
- counted (not boolean) hold survives concurrent create/dispose
- child handle cached BEFORE the handshake await so a failed
  authenticateAndOpenSession() can still reap it (no orphan / pinned loop)
- rejected spawn clears nativeProcess so a later create() retries
- dispose no longer force-zeros the hold counter; unbalanced release warns
- loud warning if the secure-exec child handle can't be resolved

Also:
- tests/shared-sidecar-clean-exit.test.ts: regression for the hang
- tests/agent-pkg-matrix.e2e.test.ts: skipped-by-default 4 pkg mgr × 4 agent
  real-API streaming matrix (AGENTOS_MATRIX_E2E)
- fix(agentos-sidecar): embed AGENTOS_SYSTEM_PROMPT.md in the crate so cargo
  publish can package it; drop now-empty fixtures dir from agentos-core files
- docs(agents): document OpenCode model config on its per-agent page; Sessions
  points to the per-agent docs for agent-specific configuration
@railway-app railway-app Bot temporarily deployed to agentos / agentos-pr-1551 June 27, 2026 22:33 Destroyed
@railway-app railway-app Bot temporarily deployed to rivet-frontend / agentos-pr-1551 June 27, 2026 22:33 Destroyed
@NathanFlurry NathanFlurry merged commit eefc9f2 into main Jun 27, 2026
2 of 3 checks passed
@NathanFlurry NathanFlurry deleted the max-listeners branch June 27, 2026 22:36
