Thread AbortSignal through chat → tool → nested-spawn chains #289
Open
Vpr99 wants to merge 14 commits into
Tab close / client disconnect during an in-flight POST turn now aborts the chat turn controller via chatTurnRegistry.abort, mirroring the SSE GET pattern. Pre-aborted signals abort synchronously; mid-flight aborts go through a one-shot listener. Gateway for US2: without this the FSM kept running until a follow-up POST superseded it.
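The two abort paths in this commit (pre-aborted versus mid-flight) can be sketched as follows. This is an illustration only: `TurnRegistry` and `linkRequestAbort` are hypothetical stand-ins, since the real `chatTurnRegistry` API is not shown in this PR text.

```typescript
// Hypothetical stand-in for the chat turn registry (assumed shape).
class TurnRegistry {
  private controllers = new Map<string, AbortController>();

  // Register a new turn and hand back the signal its work should observe.
  start(chatId: string): AbortSignal {
    const controller = new AbortController();
    this.controllers.set(chatId, controller);
    return controller.signal;
  }

  // Abort the turn's controller, if one is registered, and forget it.
  abort(chatId: string): void {
    this.controllers.get(chatId)?.abort();
    this.controllers.delete(chatId);
  }
}

// Abort the turn when the request signal fires. A signal that is already
// aborted never dispatches another "abort" event, so that case is handled
// synchronously before any listener is registered.
function linkRequestAbort(
  requestSignal: AbortSignal,
  registry: TurnRegistry,
  chatId: string,
): void {
  if (requestSignal.aborted) {
    registry.abort(chatId);
    return;
  }
  requestSignal.addEventListener("abort", () => registry.abort(chatId), {
    once: true,
  });
}
```

In a route handler, the request signal would be whatever the framework exposes for the raw request (the commit uses `c.req.raw.signal`).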
Add second `opts` arg to the bash tool's `execute` function and forward `opts.abortSignal` into `execFile`'s `signal` option. Node's `execFile` accepts both `signal` and `timeout` natively; the kernel kills the subprocess on whichever fires first, so no AbortSignal.any composition is required. Existing `error.killed` / `error.signal` handling already covers the aborted-subprocess case (exit_code 124). Regression test: start `sleep 30`, abort after 50ms, assert the promise resolves within 1s with a non-zero exit code.
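A minimal sketch of forwarding an abort signal into `execFile`, under stated assumptions: `runCommand`, `BashResult`, and the option names `abortSignal` / `timeoutMs` are hypothetical and are not the PR's bash tool. It only shows that `signal` and `timeout` can be passed together without composing them by hand.

```typescript
import { execFile } from "node:child_process";

interface BashResult {
  stdout: string;
  stderr: string;
  exit_code: number;
}

// Hypothetical helper: runs a command and always resolves (never rejects),
// passing both `signal` and `timeout` straight through to execFile.
function runCommand(
  file: string,
  args: string[],
  opts: { abortSignal?: AbortSignal; timeoutMs?: number } = {},
): Promise<BashResult> {
  return new Promise((resolve) => {
    execFile(
      file,
      args,
      // timeout: 0 is Node's default and means "no timeout".
      { signal: opts.abortSignal, timeout: opts.timeoutMs ?? 0 },
      (error, stdout, stderr) => {
        const out = String(stdout);
        const err = String(stderr);
        if (!error) {
          resolve({ stdout: out, stderr: err, exit_code: 0 });
          return;
        }
        // Any failure, including an abort or a timeout kill, surfaces as a
        // non-zero exit code; finer-grained mapping is out of scope here.
        const code =
          typeof error.code === "number" && error.code !== 0 ? error.code : 1;
        resolve({ stdout: out, stderr: err, exit_code: code });
      },
    );
  });
}
```

A caller would pass its own `opts.abortSignal` through, for example `runCommand("sleep", ["30"], { abortSignal })`, and abort the controller to stop the subprocess early.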
…ecute (#19) Thread `opts.abortSignal` into the direct `tool.execute` call inside `executeCodeAgent`'s `mcpToolCall` closure. Without this, any tool that opts into reading `opts.abortSignal` (e.g. bash-tool after #17) sees `undefined` even though the composed signal is already in scope on the surrounding `executeCodeAgent` call. Adds a focused regression test that stubs `userAdapter.loadAgent`, `createMCPTools`, and the agent executor to drive the closure path without standing up real MCP / agent infrastructure. Verifies the abort signal handed to the stub tool aborts when the parent aborts. Refs: docs/plans/2026-05-12-abort-signal-threading-design.v2.md
Wrap explicitly threads opts.abortSignal into tool.execute so a parent-side
abort propagates through the AI-SDK MCP adapter into client.callTool({ signal }).
The Promise.race against the 15-minute timeout is preserved so the typed
MCPTimeoutError still fires for the warn-log branch in createMCPTools.
Test covers both legs in one case: (a) parent abort at 100ms → wrapper
rejects in ≤500ms via fake-timer clock; (b) hang execute with no parent
abort → wrapper still rejects with MCPTimeoutError after the 15-minute
ceiling.
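The shape of such a wrapper can be sketched as below. This is not the PR's `wrapToolWithTimeout`: `withTimeout`, `Execute`, and `TimeoutError` are hypothetical names, and the sketch only shows the two legs the test describes, a forwarded parent abort and a typed timeout from `Promise.race`.

```typescript
// Typed error so callers can tell a timeout apart from other rejections.
class TimeoutError extends Error {
  constructor(ms: number) {
    super(`tool call exceeded ${ms}ms`);
    this.name = "TimeoutError";
  }
}

type Execute<A, R> = (
  args: A,
  opts: { abortSignal?: AbortSignal },
) => Promise<R>;

// Hypothetical wrapper: forwards the caller's opts (and so its abortSignal)
// untouched, and races the call against a timer that rejects with a
// TimeoutError.
function withTimeout<A, R>(execute: Execute<A, R>, ms: number): Execute<A, R> {
  return (args, opts) => {
    let timer: ReturnType<typeof setTimeout> | undefined;
    const timeout = new Promise<never>((_, reject) => {
      timer = setTimeout(() => reject(new TimeoutError(ms)), ms);
    });
    // Clear the timer once either side settles so it cannot fire later.
    return Promise.race([execute(args, opts), timeout]).finally(() => {
      clearTimeout(timer);
    });
  };
}
```

Keeping the race alongside the forwarded signal means a tool that ignores `abortSignal` is still bounded by the timeout, while a tool that honours it can reject much sooner.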
…tch (#22) Replace the bespoke AbortController + setTimeout pair with AbortSignal.any([opts.abortSignal, AbortSignal.timeout(DEFAULT_TIMEOUT_MS)]). Cancelling the parent chat turn now aborts the in-flight fetch promptly; the 30s ceiling is preserved. filter() keeps AbortSignal.any happy when the parent signal is absent. Establishes the composition pattern reused by the bundled-agents audit (#23).
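The composition described in this commit can be sketched as follows. `composeSignal` and `fetchText` are hypothetical names introduced for illustration; the sketch uses a type-guard filter where the commit uses `filter()`, so that TypeScript narrows the array to real signals.

```typescript
const DEFAULT_TIMEOUT_MS = 30_000;

// Combine an optional parent signal with a timeout ceiling. The filter drops
// `undefined`, so AbortSignal.any only ever receives real signals.
function composeSignal(
  parent: AbortSignal | undefined,
  timeoutMs: number = DEFAULT_TIMEOUT_MS,
): AbortSignal {
  const signals = [parent, AbortSignal.timeout(timeoutMs)].filter(
    (s): s is AbortSignal => s !== undefined,
  );
  return AbortSignal.any(signals);
}

// Illustrative use only: the function name and URL parameter are placeholders.
async function fetchText(url: string, parent?: AbortSignal): Promise<string> {
  const response = await fetch(url, { signal: composeSignal(parent) });
  return response.text();
}
```

The composed signal aborts as soon as either source does, so a cancelled parent stops the request immediately while the timeout still bounds a request that nobody cancels.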
…e manual cancel listener (#20) MCP SDK ≥ 1.28.0 honours `RequestOptions.signal` at the protocol layer — the SDK both rejects the pending request promise and emits the single `notifications/cancelled` frame on abort. Pass `signal: context.abortSignal` into `client.callTool` and delete the in-orchestrator `sendCancellationNotification` helper + `abortListener` wiring + its `finally` cleanup, which would otherwise double-publish the cancel frame. Preserves `activeMCPRequests` bookkeeping — it's also used by `hasActiveSessionWork()` (idle-session cleanup gate) and the `releaseSession` flow. Regression test mocks the MCP client, asserts (a) `callTool` was called with `{ signal, timeout }`, (b) the call rejects on abort, and (c) the orchestrator never calls `client.notification` itself (the double-publish regression guard — the SDK emits the frame internally).
Thread the `abortSignal` already passed into `createJobTools` (and used by the SSE variant at line 320) through `ExecuteJobViaJSONDeps` and into the `$post` call's init bag. The daemon route's existing onClientAbort listener picks up the HTTP-level abort and publishes `signals.cancel.<correlationId>` for the spawned cascade — no daemon changes needed. Pre-aborted parent signal is handled by fetch natively (immediate reject), matching the SSE variant's behavior. Regression test: aborting the parent signal causes the $post promise to reject; verify the init bag carries the parent signal. End-to-end NATS-cancel-frame coverage is left to task #26 to avoid duplication with cb169cb's daemon-side tests.
…ost (#25) The MCP server handler for `workspace_signal_trigger` now reads `extra.signal` (MCP-SDK protocol-layer abort wired to client `notifications/cancelled`) and passes `{ init: { signal: extra.signal } }` to the daemon route's $post call. The daemon's existing `onClientAbort` listener picks up the HTTP-level abort and publishes `signals.cancel.<correlationId>` — closes the External MCP Client cancellation gap (US6). Validation `$get` left unsignaled: cancelling that synchronous read would silently fall through to the warn-log branch and trigger the signal without payload validation. Regression test mocks the Hono client to assert the init bag carries the parent signal verbatim. End-to-end cancel-frame coverage stays in the existing daemon harness and task #26.
End-to-end test for the v2 design's no-`forwardCancelToChild`-helper decision: aborting the parent chat turn cascades through `createJobTools` → `executeJobViaJSON` → mocked `$post` → in-process Hono route → `triggerWorkspaceSignal`'s `abortSignal` arg within 2s. Observable note: the original design doc test #9 (and #26's task description) asserted a `signals.cancel.<correlationId>` NATS frame, but `executeJobViaJSON` always sets `bypassConcurrency: true` (`packages/system/agents/workspace-chat/tools/job-tools.ts:203`). That routes through the bypass branch at `apps/atlasd/routes/workspaces/index.ts:2014-2044`, which calls `triggerWorkspaceSignal(..., c.req.raw.signal, ...)` directly — both `publishSignalCancellation()` sites in the route (lines 1783, 2067) live in the non-bypass branches. So `signals.cancel.*` never fires for a job_tool-spawned signal. Test asserts the honest in-process observable instead: the captured `abortSignal` arg's `aborted` flag flipping to true. The MCP `workspace_signal_trigger` path (#25) does hit the cancel-frame branch — covered separately by index.test.ts:929/:967. Design-doc errata captured in `docs/learnings/2026-05-12-improved-cancelation.md`. No new abstractions introduced — the test proves the existing wiring already aborts a nested cascade end-to-end, justifying the design's no-helper decision.
Three `require-await` violations broke CI on PR #289. The async keyword was unnecessary in each: `$post` returns a Promise from `app.request`, and the two `vi.fn`/`loadAgent` stubs synchronously return objects — wrap them in `Promise.resolve` to preserve the Promise-returning signature. Also reorder `vitest`/`zod` imports to satisfy biome's organizeImports rule (previously masked by deno lint short-circuit).
On AbortSignal-driven kill, Node's execFile sets error.code to the string "ABORT_ERR" with error.killed/error.signal undefined, so the existing mapping fell through to exit_code: 1 — indistinguishable from a generic non-zero exit. stderr also stayed empty because `stderr ?? error.message` treats "" as non-nullish. Detect ABORT_ERR explicitly, map to 137 (Unix SIGKILL convention), and use `stderr || error.message` so the abort message reaches the caller. Timeout (124) and generic-failure (1) paths unchanged.
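A sketch of the mapping this commit describes, under stated assumptions: `mapFailure` and `ExecFailure` are hypothetical names, and treating `killed` or a set `signal` as the timeout case (124) is an assumption about the existing handling. Only the `ABORT_ERR` → 137 rule and the `||` fallback come directly from the commit message.

```typescript
// Fields read from an execFile failure; a locally declared subset of Node's
// error shape so the sketch stands alone.
interface ExecFailure {
  message: string;
  code?: string | number | null;
  killed?: boolean;
  signal?: string | null;
}

function mapFailure(
  error: ExecFailure,
  stderr: string,
): { exit_code: number; stderr: string } {
  // `||` falls back when stderr is "", whereas `??` would keep the empty
  // string and drop the error message.
  const message = stderr || error.message;

  // Abort is checked first: on an AbortSignal-driven kill, `killed` and
  // `signal` are undefined, so the timeout branch below would not match.
  if (error.code === "ABORT_ERR") return { exit_code: 137, stderr: message };

  // Assumed timeout branch: the process was killed or ended by a signal.
  if (error.killed || error.signal) return { exit_code: 124, stderr: message };

  // Generic failure.
  return { exit_code: 1, stderr: message };
}
```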
- chat.test.ts: wire a real ChatTurnRegistry so abort tests observe the
actual AbortController.signal flipping, not just that registry.abort
was called as a mock. Catches wrong-controller / mismatched-id wiring.
- fetch.test.ts: drop the result.toContain("Fetch failed") assertion —
the impl funnels every failure through that string, so it didn't
distinguish abort from anything else. observedSignal.aborted carries
the contract.
- job-tools.test.ts: add sibling test that sets
FRIDAY_INTERNAL_SIGNAL_BYPASS_TOKEN and asserts the
{ headers, init: { signal } } call shape. The existing test guards
the dev-fallback branch; production uses the bypass path.
Summary
- Thread `AbortSignal` through ~9 call sites between chat surfaces and tool execution so user-visible cancellation (Stop button, tab close, external MCP cancel) propagates in ≤2s instead of waiting for internal timeouts (15-min MCP ceiling, bash `timeout_ms`, fetch 30s). Closes US1/US2/US6/US8 from `docs/plans/2026-05-12-abort-signal-threading-design.v2.md`.
- Delete the manual `notifications/cancelled` listener in `agent-orchestrator.ts` (#20): MCP SDK ≥1.28.0 honours `RequestOptions.signal` at the protocol layer (rejects pending request + emits the cancel frame), so the in-orchestrator listener would double-publish.
- The design doc's `signals.cancel.<correlationId>` premise was wrong for the JSON `job_tool` path. `executeJobViaJSON` hardcodes `bypassConcurrency: true`, routing through `apps/atlasd/routes/workspaces/index.ts:2014-2044`, which calls `triggerWorkspaceSignal(..., c.req.raw.signal, ...)` directly; both `publishSignalCancellation()` sites in the route live in the non-bypass branches. Abort propagates in-process via the `abortSignal` arg, never via NATS. Test #26 asserts the honest in-process observable. The MCP `workspace_signal_trigger` path (#25) still hits the cancel-frame branch: different caller, different branch, different observable.

What changed
- `packages/workspace/src/bash-tool.ts`: `execute` reads `opts.abortSignal`, forwards into `execFile`'s native `signal` (kernel races signal vs `timeout`)
- `apps/atlasd/routes/workspaces/chat.ts`: `c.req.raw.signal` wired into `chatTurnRegistry.abort` (gateway for tab-close US2)
- `packages/workspace/src/runtime.ts:3027`: `tool.execute` call receives composed `abortSignal`
- `packages/core/src/orchestrator/agent-orchestrator.ts`: `client.callTool({ signal })`; delete `sendCancellationNotification` / `abortListener` / finally cleanup; preserve `activeMCPRequests` (used by `hasActiveSessionWork()` / `releaseSession`)
- `packages/mcp/src/create-mcp-tools.ts`: `wrapToolWithTimeout` forwards `opts.abortSignal`; `Promise.race` + typed `MCPTimeoutError` preserved
- `packages/bundled-agents/src/web/tools/fetch.ts`: `AbortController` + `setTimeout` replaced with `AbortSignal.any([opts.abortSignal, AbortSignal.timeout(DEFAULT_TIMEOUT_MS)].filter(Boolean))`
- `packages/bundled-agents/src/` audit: one site (`claude-code/agent.ts:312`) already composing parent signal via older listener pattern; no other stray sites; no commit needed
- `packages/system/agents/workspace-chat/tools/job-tools.ts`: `$post` receives `init.signal`, mirrors SSE variant at line 320
- `packages/mcp-server/src/tools/signals/trigger.ts`: handler signature `(args, extra)`; forwards `extra.signal` as `{ init: { signal } }`. Validation `$get` deliberately left unsignaled (cancelling that read would silently bypass payload validation; load-bearing inline comment)
- `apps/atlasd/routes/workspaces/nested-cancel.test.ts`: parent abort → `createJobTools` → `executeJobViaJSON` → mocked `$post` → in-process Hono → `triggerWorkspaceSignal`'s `abortSignal` arg flips within 2s. Justifies the v2 design's no-helper decision

Test Plan
- `deno check`: exit 0 (full repo)
- `deno task lint`: exit 0
- `deno task fmt --check`: exit 0
- … (`manager.test.ts` and `daemon-startup.test.ts` confirmed unrelated by stash-baseline check)
- … `chatTurnRegistry` aborts within ~1s
- … `workspace_signal_trigger`: verify `signals.cancel.<correlationId>` publishes on NATS