Skip to content

Thread AbortSignal through chat → tool → nested-spawn chains#289

Open
Vpr99 wants to merge 14 commits into
mainfrom
improved-cancelation
Open

Thread AbortSignal through chat → tool → nested-spawn chains#289
Vpr99 wants to merge 14 commits into
mainfrom
improved-cancelation

Conversation

@Vpr99
Copy link
Copy Markdown
Contributor

@Vpr99 Vpr99 commented May 12, 2026

Summary

  • Threads AbortSignal through ~9 call sites between chat surfaces and tool execution so user-visible cancellation (Stop button, tab close, external MCP cancel) propagates in ≤2s instead of waiting for internal timeouts (15-min MCP ceiling, bash timeout_ms, fetch 30s). Closes US1/US2/US6/US8 from docs/plans/2026-05-12-abort-signal-threading-design.v2.md.
  • Removes the manual notifications/cancelled listener in agent-orchestrator.ts (deps(deps): bump alpine from 3.23.3 to 3.23.4 in /apps/cypher #20) — MCP SDK ≥1.28.0 honours RequestOptions.signal at the protocol layer (rejects pending request + emits the cancel frame), so the in-orchestrator listener would double-publish.
  • Notable finding mid-implementation: the design doc's signals.cancel.<correlationId> premise was wrong for the JSON job_tool path. executeJobViaJSON hardcodes bypassConcurrency: true, routing through apps/atlasd/routes/workspaces/index.ts:2014-2044 which calls triggerWorkspaceSignal(..., c.req.raw.signal, ...) directly — both publishSignalCancellation() sites in the route live in the non-bypass branches. Abort propagates in-process via the abortSignal arg, never via NATS. Test deps(deps-dev): update prettier requirement from ^3.8.1 to ^3.8.3 #26 asserts the honest in-process observable. The MCP workspace_signal_trigger path (deps(deps-dev): bump eslint from 10.2.0 to 10.2.1 in /apps/friday-website #25) still hits the cancel-frame branch — different caller, different branch, different observable.

What changed

# Site Change
17 packages/workspace/src/bash-tool.ts execute reads opts.abortSignal, forwards into execFile's native signal (kernel races signal vs timeout)
18 apps/atlasd/routes/workspaces/chat.ts POST handler composes c.req.raw.signal into chatTurnRegistry.abort (gateway for tab-close US2)
19 packages/workspace/src/runtime.ts:3027 Direct tool.execute call receives composed abortSignal
20 packages/core/src/orchestrator/agent-orchestrator.ts client.callTool({ signal }); delete sendCancellationNotification/abortListener/finally cleanup; preserve activeMCPRequests (used by hasActiveSessionWork()/releaseSession)
21 packages/mcp/src/create-mcp-tools.ts wrapToolWithTimeout forwards opts.abortSignal; Promise.race + typed MCPTimeoutError preserved
22 packages/bundled-agents/src/web/tools/fetch.ts Bespoke AbortController+setTimeout replaced with AbortSignal.any([opts.abortSignal, AbortSignal.timeout(DEFAULT_TIMEOUT_MS)].filter(Boolean))
23 packages/bundled-agents/src/ audit grep found 1 production site (claude-code/agent.ts:312) already composing parent signal via older listener pattern; no other stray sites; no commit needed
24 packages/system/agents/workspace-chat/tools/job-tools.ts JSON $post receives init.signal, mirrors SSE variant at line 320
25 packages/mcp-server/src/tools/signals/trigger.ts Handler signature (args, extra); forwards extra.signal as { init: { signal } }. Validation $get deliberately left unsignaled (cancelling that read would silently bypass payload validation — load-bearing inline comment)
26 apps/atlasd/routes/workspaces/nested-cancel.test.ts End-to-end test: parent abort → mocked $post → in-process Hono → triggerWorkspaceSignal's abortSignal arg flips within 2s. Justifies the v2 design's no-helper decision

Test Plan

  • deno check — exit 0 (full repo)
  • deno task lint — exit 0
  • deno task fmt --check — exit 0
  • Per-task regression tests across 8 changed test files (442+ tests in the affected packages green; pre-existing flakes in manager.test.ts and daemon-startup.test.ts confirmed unrelated by stash-baseline check)
  • End-to-end nested-cancel test asserts in-process observable within 2s; bypass-token env handshake verified
  • Manual: Stop button mid-chat-turn — verify bash/MCP tools terminate within ~1s instead of running to internal timeout
  • Manual: Tab close mid-POST — verify chatTurnRegistry aborts within ~1s
  • Manual: External MCP client cancels workspace_signal_trigger — verify signals.cancel.<correlationId> publishes on NATS

Vpr99 added 10 commits May 12, 2026 10:27
Tab close / client disconnect during an in-flight POST turn now aborts the
chat turn controller via chatTurnRegistry.abort, mirroring the SSE GET
pattern. Pre-aborted signals abort synchronously; mid-flight aborts via a
one-shot listener.

Gateway for US2 — without this the FSM kept running until a follow-up POST
superseded it.
Add second `opts` arg to the bash tool's `execute` function and forward
`opts.abortSignal` into `execFile`'s `signal` option. Node's `execFile`
accepts both `signal` and `timeout` natively; the kernel kills the
subprocess on whichever fires first, so no AbortSignal.any composition
is required.

Existing `error.killed` / `error.signal` handling already covers the
aborted-subprocess case (exit_code 124).

Regression test: start `sleep 30`, abort after 50ms, assert the promise
resolves within 1s with a non-zero exit code.
…ecute (#19)

Thread `opts.abortSignal` into the direct `tool.execute` call inside
`executeCodeAgent`'s `mcpToolCall` closure. Without this, any tool that
opts into reading `opts.abortSignal` (e.g. bash-tool after #17) sees
`undefined` even though the composed signal is already in scope on the
surrounding `executeCodeAgent` call.

Adds a focused regression test that stubs `userAdapter.loadAgent`,
`createMCPTools`, and the agent executor to drive the closure path
without standing up real MCP / agent infrastructure. Verifies the
abort signal handed to the stub tool aborts when the parent aborts.

Refs: docs/plans/2026-05-12-abort-signal-threading-design.v2.md
Wrap explicitly threads opts.abortSignal into tool.execute so a parent-side
abort propagates through the AI-SDK MCP adapter into client.callTool({ signal }).
The Promise.race against the 15-minute timeout is preserved so the typed
MCPTimeoutError still fires for the warn-log branch in createMCPTools.

Test covers both legs in one case: (a) parent abort at 100ms → wrapper
rejects in ≤500ms via fake-timer clock; (b) hang execute with no parent
abort → wrapper still rejects with MCPTimeoutError after the 15-minute
ceiling.
…tch (#22)

Replace the bespoke AbortController + setTimeout pair with
AbortSignal.any([opts.abortSignal, AbortSignal.timeout(DEFAULT_TIMEOUT_MS)]).
Cancelling the parent chat turn now aborts the in-flight fetch promptly;
the 30s ceiling is preserved. filter() keeps AbortSignal.any happy when
the parent signal is absent.

Establishes the composition pattern reused by the bundled-agents audit
(#23).
…e manual cancel listener (#20)

MCP SDK ≥ 1.28.0 honours `RequestOptions.signal` at the protocol layer —
the SDK both rejects the pending request promise and emits the single
`notifications/cancelled` frame on abort. Pass `signal: context.abortSignal`
into `client.callTool` and delete the in-orchestrator
`sendCancellationNotification` helper + `abortListener` wiring + its
`finally` cleanup, which would otherwise double-publish the cancel frame.

Preserves `activeMCPRequests` bookkeeping — it's also used by
`hasActiveSessionWork()` (idle-session cleanup gate) and the
`releaseSession` flow.

Regression test mocks the MCP client, asserts (a) `callTool` was called
with `{ signal, timeout }`, (b) the call rejects on abort, and (c) the
orchestrator never calls `client.notification` itself (the double-publish
regression guard — the SDK emits the frame internally).
Thread the `abortSignal` already passed into `createJobTools` (and used
by the SSE variant at line 320) through `ExecuteJobViaJSONDeps` and into
the `$post` call's init bag. The daemon route's existing onClientAbort
listener picks up the HTTP-level abort and publishes
`signals.cancel.<correlationId>` for the spawned cascade — no daemon
changes needed.

Pre-aborted parent signal is handled by fetch natively (immediate
reject), matching the SSE variant's behavior.

Regression test: aborting the parent signal causes the $post promise to
reject; verify the init bag carries the parent signal. End-to-end
NATS-cancel-frame coverage is left to task #26 to avoid duplication with
cb169cb's daemon-side tests.
…ost (#25)

The MCP server handler for `workspace_signal_trigger` now reads `extra.signal`
(MCP-SDK protocol-layer abort wired to client `notifications/cancelled`) and
passes `{ init: { signal: extra.signal } }` to the daemon route's $post call.
The daemon's existing `onClientAbort` listener picks up the HTTP-level abort
and publishes `signals.cancel.<correlationId>` — closes the External MCP
Client cancellation gap (US6).

Validation `$get` left unsignaled: cancelling that synchronous read would
silently fall through to the warn-log branch and trigger the signal without
payload validation.

Regression test mocks the Hono client to assert the init bag carries the
parent signal verbatim. End-to-end cancel-frame coverage stays in the
existing daemon harness and task #26.
End-to-end test for the v2 design's no-`forwardCancelToChild`-helper
decision: aborting the parent chat turn cascades through `createJobTools`
→ `executeJobViaJSON` → mocked `$post` → in-process Hono route →
`triggerWorkspaceSignal`'s `abortSignal` arg within 2s.

Observable note: the original design doc test #9 (and #26's task
description) asserted a `signals.cancel.<correlationId>` NATS frame, but
`executeJobViaJSON` always sets `bypassConcurrency: true`
(`packages/system/agents/workspace-chat/tools/job-tools.ts:203`). That
routes through the bypass branch at
`apps/atlasd/routes/workspaces/index.ts:2014-2044`, which calls
`triggerWorkspaceSignal(..., c.req.raw.signal, ...)` directly — both
`publishSignalCancellation()` sites in the route (lines 1783, 2067) live
in the non-bypass branches. So `signals.cancel.*` never fires for a
job_tool-spawned signal. Test asserts the honest in-process observable
instead: the captured `abortSignal` arg's `aborted` flag flipping to
true. The MCP `workspace_signal_trigger` path (#25) does hit the
cancel-frame branch — covered separately by index.test.ts:929/:967.
Design-doc errata captured in
`docs/learnings/2026-05-12-improved-cancelation.md`.

No new abstractions introduced — the test proves the existing wiring
already aborts a nested cascade end-to-end, justifying the design's
no-helper decision.
Format-only changes across the abort-signal threading commits (#17, #19,
#21, #24, #25, #26). No production logic touched.
@Vpr99 Vpr99 requested a review from ljagiello as a code owner May 12, 2026 17:06
Vpr99 added 4 commits May 12, 2026 11:19
Three `require-await` violations broke CI on PR #289. The async keyword
was unnecessary in each: `$post` returns a Promise from `app.request`,
and the two `vi.fn`/`loadAgent` stubs synchronously return objects —
wrap them in `Promise.resolve` to preserve the Promise-returning
signature. Also reorder `vitest`/`zod` imports to satisfy biome's
organizeImports rule (previously masked by deno lint short-circuit).
On AbortSignal-driven kill, Node's execFile sets error.code to the string
"ABORT_ERR" with error.killed/error.signal undefined, so the existing
mapping fell through to exit_code: 1 — indistinguishable from a generic
non-zero exit. stderr also stayed empty because `stderr ?? error.message`
treats "" as non-nullish.

Detect ABORT_ERR explicitly, map to 137 (Unix SIGKILL convention), and use
`stderr || error.message` so the abort message reaches the caller. Timeout
(124) and generic-failure (1) paths unchanged.
- chat.test.ts: wire a real ChatTurnRegistry so abort tests observe the
  actual AbortController.signal flipping, not just that registry.abort
  was called as a mock. Catches wrong-controller / mismatched-id wiring.
- fetch.test.ts: drop the result.toContain("Fetch failed") assertion —
  the impl funnels every failure through that string, so it didn't
  distinguish abort from anything else. observedSignal.aborted carries
  the contract.
- job-tools.test.ts: add sibling test that sets
  FRIDAY_INTERNAL_SIGNAL_BYPASS_TOKEN and asserts the
  { headers, init: { signal } } call shape. The existing test guards
  the dev-fallback branch; production uses the bypass path.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant