fix(orchestrator): session audit remediation — error fidelity, session resume, reconcile visibility#3677
Conversation
ClaudeAdapterV2 buffers post-turn SDK activity and emits turn.wakeup (task_notification / ScheduleWakeup origins); ProviderWakeupService waits for thread quiescence and dispatches attach_wakeup runs that adopt the in-flight turn and replay the buffer. Backgrounded Bash items stay running and complete cross-run via task notifications. Replay fixture claude_provider_wakeup covers the full loop. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Two reliability fixes for the Claude adapter, both found by the session audit (.plans/22): - The SDK reports API-level failures (401 auth, 529 overloaded) as result subtype "success" with is_error: true. The adapter only checked the subtype, so those runs projected as completed with no failure recorded (threads 1156181e, a5a643b2, 47763f5e, ea84f015). Terminal status now honors is_error and the failure keeps the API error text and status code. - The create-vs-resume decision for query.open lived only in the in-memory openedNativeThreads set, which dies with the session runtime. After the 30-minute idle release (or an app restart) the next turn re-created an existing native session and failed in under a second — the first message after every idle gap was burned (11 occurrences across 6 threads). The persisted provider thread's firstRunOrdinal now serves as the durable resume signal, and threads are only marked opened after a successful open. New replay fixtures claude_result_is_error (from thread 47763f5e run 1) and claude_idle_resume (from thread d0fe9018 runs 6-8, using a new advance_clock fixture step that drives the real idle reaper via the test clock). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Adapter errors are tagged wrappers whose message getter is a fixed string
("Claude Agent SDK query failed.", "Failed to start run ...") while the
real defect lives in .cause — makeProviderFailure only read the wrapper,
so every failed run persisted a generic message with the root cause
recorded nowhere (audit plan #2: 8 undebuggable failures across 5
threads). makeProviderFailure now walks the cause chain (bounded depth,
deduped) and joins the messages, and picks up the deepest error code.
The run-execution failure log also prints the Cause pretty-formatted
instead of a depth-elided object.
App-verified: breaking the Claude binary path now projects
"Claude Agent SDK query failed. ← Claude Code native binary not found at
/tmp/nonexistent-claude-binary. ..." as the user-visible error item.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
A failed cursor turn projected only the generic "Provider turn failed." because the adapter read a nonexistent `error` field off RunResult (@cursor/sdk 1.0.22 carries no error text on results; errorCode lives only in the SDK's internal run store). Persist what the result does carry — run id, requestId, and duration — so failures can be matched to Cursor-side logs (audit plan #3, thread c9e72a05 run 2 lost requestId beca30c7 + 440s duration). The speculative `error` read stays as a makeProviderFailure cause for future SDK versions. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Failed read/grep/glob tools projected only {status: failed, pattern} —
the provider's error message was unrecoverable from projections, and
successful outputs were dropped too (audit plan #12, thread 3029dc85
item prt_f15692f3). The file_search contract gains optional output and
error fields; the OpenCode adapter maps part.state.output/error by
terminal status.
App-verified with a real OpenCode agent: a successful package.json read
projects the content in output, and a read of a nonexistent file
projects status failed with error "File not found: ...".
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Locks in per-uuid assistant text segmentation (audit plan #11): the 6d618dc4 MCP session on build fc23be8 merged 5 interleaved assistant text segments into one separator-less end-of-turn item ordered after all tool calls. The current emitAssistantTextArtifacts path already projects one item per SDK message uuid; this fixture asserts text → command → text ordering survives replay. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
A run cancelled by the startup/shutdown reconcile was indistinguishable from a user cancel: the user's message simply never got an answer and no explanation appeared anywhere (audit plan #5, threads 721fc23c and 48663fb7 — cursor crash and server restart both ate the turn silently). The reconcile now appends a "Run interrupted" error item per terminalized run carrying the reason ("Cancelled because the server restarted/shut down before the provider work completed."), ordinal-appended after all projected items to respect the thread-wide position uniqueness. App-verified: killed the backend mid-turn; after restart the thread shows "Run interrupted — Cancelled because the server restarted before the provider work completed." Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
|
Important Review skippedAuto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Repository UI Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
ApprovabilityVerdict: Needs human review Diff is too large for automated approval analysis. A human reviewer should evaluate this PR. You can customize Macroscope's approvability policy. Learn more. |
The native provider logs are the debugging ground truth, but SDK rejections (query open, message stream, prompt offer, agent open, run start/wait) raised errors without any log write — a failed turn left the log ending mid-conversation with no explanation (audit plan #4: neither of thread 721fc23c's failed turns was explainable from its log). The Claude and Cursor runners now tap every fallible SDK boundary and write a runner.error frame carrying the redacted cause chain. App-verified: with a broken Claude binary the native log now records `runner.error messages.stream | Claude Agent SDK query failed. ← Claude Code native binary not found at ...` where it previously went silent. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
One codex app-server process multiplexes many app threads (the opener plus every native subagent child), but the protocol logger froze the opener's threadId at open time, so all traffic landed in the opener's log file — and rotation of that busy file destroyed other threads' native ground truth (audit plan #9, threads c878541b/de5f191a/68f7595b/ af66fc2c had no log file at all). The logger now resolves the target per frame: it extracts the native thread id from each frame and looks it up in a per-session route map seeded when a root turn or subagent thread registers, falling back to the opener for unrouted frames. App-verified with a real codex subagent spawn: the two subagent native threads each got their own dedicated log file containing only their own traffic (94 and 54 frames), where previously everything multiplexed into the parent's file. Unit tests cover native-thread-id extraction across decoded/raw frame shapes and the two-thread routing decision. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
ACP streams a subagent's result per token, and emitSubagentAssistant persisted a full-row message.updated + turn-item.updated event pair per chunk — one 6KB grok subagent result amplified into ~2700 stored events, 14% of a whole session's event table (audit plan #10). Streamed chunks now accumulate and emit at most once per 100ms window; the final text is always flushed immediately on task completion, so the projection stays exact. The grok_subagent_lineage fixture now asserts each child result persists a single coalesced message.updated event (was one per chunk). App-checked: a real streaming ACP result (4190 chars) persisted 4 message.updated events instead of thousands. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes using high effort and found 3 potential issues.
There are 4 total unresolved issues (including 1 from previous review).
Autofix Details
Bugbot Autofix prepared fixes for all 3 issues found in the latest run.
- ✅ Fixed: Scheduled flush flag not cleared
- Added
subagent.streamFlushScheduled = falseinflushSubagentAssistantso that the schedule flag is properly reset on terminal/synchronous flushes, preventing subsequent chunks from hitting the early return.
- Added
- ✅ Fixed: Turn end skips subagent flush
- Added a loop in
finalizeTurnthat iterates overcontext.subagentsand callsflushSubagentAssistantfor any subagent withstreamPendingText === true, ensuring coalesced text is persisted before the turn is marked finalized.
- Added a loop in
- ✅ Resolved by another fix: Concurrent duplicate subagent snapshots
- The Bug 1 fix (clearing
streamFlushScheduledinflushSubagentAssistant) ensures the forked sleep callback seesstreamPendingText === falseafter a synchronous flush, preventing the duplicate emit race.
- The Bug 1 fix (clearing
Or push these changes by commenting:
@cursor push 59569dd592
Preview (59569dd592)
diff --git a/apps/server/src/orchestration-v2/Adapters/AcpAdapterV2.ts b/apps/server/src/orchestration-v2/Adapters/AcpAdapterV2.ts
--- a/apps/server/src/orchestration-v2/Adapters/AcpAdapterV2.ts
+++ b/apps/server/src/orchestration-v2/Adapters/AcpAdapterV2.ts
@@ -915,6 +915,7 @@
subagent.assistantText += finalText;
}
subagent.streamPendingText = false;
+ subagent.streamFlushScheduled = false;
yield* emitSubagentAssistantSnapshot(subagent);
});
@@ -2054,6 +2055,11 @@
if (context.finalized) return;
context.finalized = true;
yield* closeTextStreams(context);
+ for (const subagent of context.subagents.values()) {
+ if (subagent.streamPendingText) {
+ yield* flushSubagentAssistant(subagent);
+ }
+ }
const now = yield* DateTime.now;
const turn = providerTurnPayload(context, status, now);
yield* Ref.update(providerTurns, (current) => {You can send follow-ups to the cloud agent here.
- Decorate attachTurn with the same ensureThreadAttached/markBusy/markIdle pairing as startTurn: a wakeup-attached run left busyCount at 0, so the idle reaper could release the live session mid-turn (macroscope, High). - Guard the turn.wakeup pump branch with the same runtime-liveness check as persistProviderSessionUpdate so stale buffered wakeups drained after releaseEntry cannot dispatch attaches against a released or replaced session (macroscope, Medium). - Dispatch wakeups concurrently, one in flight per thread: quiescence waiting on one busy thread no longer head-of-line-blocks every other thread's wakeup; duplicate wakeups for a thread with one in flight are coalesced since the adapter buffers all activity behind a single attach (macroscope, Medium). - Flush throttled subagent assistant text in ACP finalizeTurn: after the turn ends the run fiber that routes child-thread events may be gone, so relying on the coalescer's timer could drop the stream tail (cursor bugbot, Medium). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes using high effort and found 1 potential issue.
Autofix Details
Bugbot Autofix prepared a fix for the issue found in the latest run.
- ✅ Fixed: Coalesced wakeup requests silently discarded
- Added a per-thread pendingWakeups map that stores coalesced requests and re-offers them to the relay queue when the in-flight dispatch completes, ensuring no wakeup is permanently lost.
Or push these changes by commenting:
@cursor push f013d3849b
Preview (f013d3849b)
diff --git a/apps/server/src/orchestration-v2/ProviderWakeupService.ts b/apps/server/src/orchestration-v2/ProviderWakeupService.ts
--- a/apps/server/src/orchestration-v2/ProviderWakeupService.ts
+++ b/apps/server/src/orchestration-v2/ProviderWakeupService.ts
@@ -171,9 +171,11 @@
// Wakeups dispatch concurrently, one in flight per thread: quiescence
// waiting for one busy thread must not head-of-line-block every other
// thread's wakeup (which could be superseded while it waits). A wakeup
- // arriving while its thread already has one in flight is coalesced away —
- // the adapter buffers all pending activity behind a single attach.
+ // arriving while its thread already has one in flight is held pending;
+ // when the in-flight dispatch finishes it re-offers the pending request
+ // so a failed dispatch does not permanently lose the wakeup.
const inFlightThreads = yield* Ref.make(new Set<ThreadId>());
+ const pendingWakeups = yield* Ref.make(new Map<ThreadId, ProviderWakeupRequest>());
return yield* relay.take.pipe(
Effect.flatMap((input) =>
Ref.modify(inFlightThreads, (current) => {
@@ -192,16 +194,37 @@
const next = new Set(current);
next.delete(input.threadId);
return next;
- }),
+ }).pipe(
+ Effect.andThen(
+ Ref.modify(pendingWakeups, (m) => {
+ const pending = m.get(input.threadId);
+ if (pending === undefined) return [undefined, m] as const;
+ const next = new Map(m);
+ next.delete(input.threadId);
+ return [pending, next] as const;
+ }),
+ ),
+ Effect.flatMap((pending) =>
+ pending !== undefined ? relay.offer(pending) : Effect.void,
+ ),
+ ),
),
Effect.forkScoped,
Effect.asVoid,
)
- : Effect.logInfo("orchestration-v2.provider-wakeup.coalesced", {
- threadId: input.threadId,
- providerThreadId: input.providerThreadId,
- origin: input.origin,
- }),
+ : Ref.update(pendingWakeups, (m) => {
+ const next = new Map(m);
+ next.set(input.threadId, input);
+ return next;
+ }).pipe(
+ Effect.andThen(
+ Effect.logInfo("orchestration-v2.provider-wakeup.coalesced", {
+ threadId: input.threadId,
+ providerThreadId: input.providerThreadId,
+ origin: input.origin,
+ }),
+ ),
+ ),
),
),
),You can send follow-ups to the cloud agent here.
A wakeup arriving while its thread already had a dispatch in flight was logged and discarded — if the in-flight attempt then gave up (quiescence timeout) or failed, the parked request never got another attempt (cursor bugbot, Medium). Each thread now keeps a keep-latest follow-up slot that dispatches after the in-flight attempt finishes regardless of its outcome. Keep-latest is sufficient because the adapter buffers all pending activity behind a single attach. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes using high effort and found 1 potential issue.
Bugbot Autofix prepared a fix for the issue found in the latest run.
- ✅ Fixed: Wakeup drain lacks interrupt cleanup
- Added Effect.onInterrupt handler to the forked drainThread that deletes the thread's entry from threadStates upon fiber interruption, allowing subsequent wakeups to claim and dispatch normally.
Or push these changes by commenting:
@cursor push 1e87251a89
Preview (1e87251a89)
diff --git a/apps/server/src/orchestration-v2/ProviderWakeupService.ts b/apps/server/src/orchestration-v2/ProviderWakeupService.ts
--- a/apps/server/src/orchestration-v2/ProviderWakeupService.ts
+++ b/apps/server/src/orchestration-v2/ProviderWakeupService.ts
@@ -213,7 +213,17 @@
}).pipe(
Effect.flatMap((claimed) =>
claimed
- ? drainThread(input).pipe(Effect.forkScoped, Effect.asVoid)
+ ? drainThread(input).pipe(
+ Effect.onInterrupt(() =>
+ Ref.update(threadStates, (current) => {
+ const next = new Map(current);
+ next.delete(input.threadId);
+ return next;
+ }),
+ ),
+ Effect.forkScoped,
+ Effect.asVoid,
+ )
: Effect.logInfo("orchestration-v2.provider-wakeup.follow-up-parked", {
threadId: input.threadId,
providerThreadId: input.providerThreadId,You can send follow-ups to the cloud agent here.
Reviewed by Cursor Bugbot for commit 2a8052c. Configure here.
An interrupted drainThread fiber left its thread marked in flight, which would park every later wakeup for that thread forever (cursor bugbot, High). Interruption only occurs when the dispatcher scope tears down — taking the state map with it — but the invariant now holds locally via an interrupt-scoped cleanup instead of depending on that lifecycle. Interrupt-only rather than ensuring: on the success path the marker may already belong to a new claim. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>


Stacked on #2829. Remediation work from the full v2 session audit (plan + verified findings in
.plans/22-orchestration-v2-audit-remediation.md/.plans/22-orchestration-v2-audit-findings.json— every finding was adversarially re-verified against raw provider logs, the event store, and adapter source).What's in here
turn.wakeupbuffering in ClaudeAdapterV2,ProviderWakeupServicerelay/dispatcher mintingattach_wakeupruns, cross-run backgrounded-Bash completion. The audit quantified the pre-fix loss: 134 tool calls + all wakeup assistant output across 3 threads, including a GitHub comment posted with no user-visible record. Replay fixtureclaude_provider_wakeup.is_erroron Claude SDK results (plan oxc stack #1): 401/529 results withsubtype: "success"no longer project as completed runs; the failure keeps the API error text and status code. Fixtureclaude_result_is_error(from real 401 chunks).firstRunOrdinal < runOrdinal) instead of only an in-memory set that dies with the session runtime — the first message after every 30-min idle gap or restart used to burn as an instant failure (11 occurrences across 6 threads). Fixtureclaude_idle_resumedrives the real idle reaper via a newadvance_clockstep. App-verified: kill backend mid-thread → first message after restart succeeds withresume:on the wire.makeProviderFailurewalks the wrapper-error cause chain; failed runs no longer persist only "Claude Agent SDK query failed." with the root cause recorded nowhere. App-verified with a broken Claude binary path.file_searchoutput/error (plan Add native context menu to delete threads #12): contract + adapter keep tool outcomes; failed reads no longer project pattern-only. App-verified with a real OpenCode agent.claude_text_segmentsregression fixture (plan Standardized messages #11): locks per-uuid assistant text segmentation (text → tool → text ordering).Verification
Each fix carries replay-fixture coverage built from observed native chunks, plus a real-agent verification round in an isolated dev instance (documented per item in the plan file). Remaining audit items (native-log failure frames, codex collab subagent ingestion, ACP subagent lifecycle, codex shared-log routing, event amplification, steering-latency UX) continue on this branch.
🤖 Generated with Claude Code
Note
Medium Risk
Touches orchestration v2 event persistence and replay fixtures; coalescing changes stored event granularity (not final projections) and runs timed flush fibers on the ACP session scope.
Overview
Adds orchestration v2 audit artifacts (
.plans/22-orchestration-v2-audit-findings.jsonand.plans/22-orchestration-v2-audit-remediation.md) that catalog verified ingestion/projection gaps and track remediation status across adapters.Runtime change in this diff:
AcpAdapterV2no longer persists a fullmessage.updated+turn-item.updatedpair on every subagentagent_message_chunk. Chunks accumulate and flush on a ~100ms coalescing window, with immediate flush when the subagent task completes or the parent turn finalizes (so tail text is not dropped after the run fiber ends).Test/replay:
ClaudeAdapterV2.testkittreats SDKstream_eventframes as valid replay input (for wakeup transcripts wheremessage_startprecedes complete assistant messages).Reviewed by Cursor Bugbot for commit 0e0df04. Bugbot is set up for automated code reviews on this repo. Configure here.
Note
Add provider-initiated session wakeup, improve error fidelity, and keep background work visible in the timeline
turn.wakeupevents, which the newProviderWakeupServicedispatches asattach_wakeupruns once the thread is idle.attachTurnto the session runtime so the orchestrator can adopt a provider-initiated turn without sending a new prompt, gated by a newturnDelivery: "attach"field threaded through effect requests.makeProviderFailurenow walks nested cause chains; Claude SDKis_errorresults map tofailedstatus with API error codes; runner failures emitrunner.errorprotocol log frames.Run interruptedturn-item after reconcile-driven cancellations so interrupted runs are auditable.resumed(notsteer) in theUserMessageIntentBadge.claude_provider_wakeup,claude_idle_resume,claude_result_is_error,claude_text_segments) covering the new and fixed scenarios.Macroscope summarized 0e0df04.