Skip to content

fix: harden ACP peer disconnect error handling#1108

Open
simple-agent-manager[bot] wants to merge 3 commits into
mainfrom
sam/investigate-resolve-recurring-peer-01ksa6
Open

fix: harden ACP peer disconnect error handling#1108
simple-agent-manager[bot] wants to merge 3 commits into
mainfrom
sam/investigate-resolve-recurring-peer-01ksa6

Conversation

@simple-agent-manager
Copy link
Copy Markdown
Contributor

Summary

  • Harden the ACP peer disconnect failure path in the VM agent so that unrecoverable agent crashes produce actionable error messages, crash reports, and correct host status instead of raw JSON-RPC error blobs
  • When crash recovery via LoadSession is unavailable, the host now transitions to HostError status and broadcasts StatusError to prevent the UI from showing "ready" for a dead agent
  • Refactored beginCrashRecovery to return agentType and redacted stderr even on the failure path, eliminating redundant lock acquisitions (TOCTOU window fix)

Root Cause Investigation

Production D1 showed 24 tasks failing with peer disconnected before response between 2026-05-15 and 2026-05-22. The crash recovery feature (LoadSession) merged ~5 hours after the last observed failure. Analysis of the code path confirmed:

  1. Recoverable case (LoadSession available): finishPromptWithErrorbeginCrashRecovery → restart agent → LoadSession → resume. This path was already correct.
  2. Unrecoverable case (LoadSession unavailable): The code fell through to generic prompt failure handling, surfacing a raw JSON-RPC error blob with no crash report, no lifecycle event, and no status correction. This is the gap this PR fixes.

Changes

File Change
session_host_prompt.go New finishUnrecoverableCrashPrompt method with structured logging, crash report, actionable error message, and HostError status transition
session_host_crash.go beginCrashRecovery returns (agentType, stderr, ok) instead of (agentType, ok) — eliminates double-lock on failure path
session_host_test.go 2 new tests: recoverable peer disconnect triggers crash recovery; unrecoverable peer disconnect produces actionable failure + crash report + HostError status

Test Plan

  • TestSessionHost_FinishPromptWithPeerDisconnectBeginsCrashRecovery — verifies recoverable path enters crash recovery
  • TestSessionHost_FinishPromptWithUnrecoverablePeerDisconnectReportsActionableFailure — verifies unrecoverable path produces crash report, actionable error, secret redaction, and HostError status
  • TestSessionHost_BeginCrashRecoveryRequiresLoadSession — updated for 3-return-value signature
  • Full ACP test suite passes

Specialist Review Evidence

Reviewer Status Notes
go-specialist ADDRESSED 2 HIGH (status correction, double-lock) and 2 MEDIUM (double stderr peek, test status assertion) fixed in commit 4ba659d. LOW findings (reqID copy comment, goroutine callback ordering) are style/documentation items.

Do NOT Merge

Per task instructions, this PR is ready for human review. Do not merge without explicit authorization.


Agent Preflight

  • Change class: cross-component-change, business-logic-change
  • Preflight steps: Production D1 queried for evidence (24 failures). Code path traced through session_host_prompt.gosession_host_crash.go. LoadSession merge timestamp verified against failure timestamps. Constitution compliance checked (no hardcoded values).

🤖 Generated with Claude Code

raphaeltm and others added 3 commits May 23, 2026 10:51
…ate double-lock

Address Go specialist review findings:
- Set HostError status and broadcast StatusError after unrecoverable
  agent crash, preventing the UI from showing "ready" for a dead agent
- Refactor beginCrashRecovery to return agentType and redacted stderr
  even on the failure path, eliminating redundant lock acquisitions
- Add status assertion to the unrecoverable crash test

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@sonarqubecloud
Copy link
Copy Markdown

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant