[codex] Fix restart-provider stop crash reporting #60

Draft
brsbl wants to merge 1 commit into ymichael:main from brsbl:bb/investigate-provider-exit-crash-thr_vne66xfui9

brsbl commented Jun 3, 2026

Bug description

When the idle provider watchdog requested thread.stop for a wedged provider turn, the host daemon could report the stop as a provider crash:

  • server logged the watchdog stop request for an idle provider turn
  • host daemon handled a thread.stop command
  • the command failed with Provider "codex" exited unexpectedly
  • the thread timeline received a misleading provider-crash error, even though a restart-provider stop is supposed to tear the provider down

The same log bundle also contained app-data resync Zod errors and daemon reconnects. Those are separate findings and intentionally left out of this crash fix.

Root cause

For restart-provider stop plans, runtime.stopThread sent the provider interrupt request and awaited the JSON-RPC response before treating provider shutdown as expected.

If the provider process exited while the interrupt was still in flight, the pending request rejected from the process-exit path as Provider "<id>" exited unexpectedly. That rejection propagated out of stopThread, so the server marked the stop command as failed and host-daemon emitted an unexpected provider-exit event.

For restart-provider, this race is expected behavior: the provider exiting during or after the interrupt is the intended stop outcome, not a crash.
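
The rejection path described above can be sketched roughly as follows. This is a minimal illustration, not the actual runtime code: the `PendingRequests` class and its method names are hypothetical, and only the error string is taken from the logs.

```typescript
// Hypothetical stand-in for the runtime's JSON-RPC client bookkeeping.
type Pending = {
  method: string;
  resolve: (value: unknown) => void;
  reject: (reason: Error) => void;
};

class PendingRequests {
  private readonly providerId: string;
  private readonly pending = new Map<number, Pending>();
  private nextId = 1;

  constructor(providerId: string) {
    this.providerId = providerId;
  }

  // Number of requests still waiting for a response.
  get inFlight(): number {
    return this.pending.size;
  }

  // Registers a request; a real client would also write it to the provider here.
  send(method: string): Promise<unknown> {
    const id = this.nextId++;
    return new Promise((resolve, reject) => {
      this.pending.set(id, { method, resolve, reject });
    });
  }

  // Called from the process-exit handler: every in-flight request rejects,
  // including an interrupt that was sent precisely to stop the provider.
  onProcessExit(): void {
    const error = new Error(`Provider "${this.providerId}" exited unexpectedly`);
    this.pending.forEach(({ reject }) => reject(error));
    this.pending.clear();
  }
}
```

If the caller simply awaits `send("interrupt")`, the exit surfaces as a thrown error, which matches how the stop command ended up marked as failed.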

Evidence

  • The error string comes from pending JSON-RPC requests being rejected on provider process exit, matching the in-flight interrupt path.
  • A regression test reproduces the baseline failure: provider exits while a restart-provider stop request is in flight, causing stopThread to reject before this fix.
  • Adversarial review found additional edge cases, now covered:
    • signal-based provider exits (signalCode with exitCode === null)
    • JSON-RPC/protocol errors from a still-live provider, which must not leave a stale mark that makes a later exit look expected
    • overlapping restart-provider stops on the same provider process
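
The signal-exit edge case can be illustrated with a small sketch. In Node.js, a child terminated by a signal keeps `exitCode` as `null` and sets `signalCode` instead, so a check on `exitCode` alone misses it. The `ExitState` interface below is a hypothetical stand-in for the two relevant `ChildProcess` fields, and the body is an assumption about how a `hasChildProcessExited()` helper could look, not the PR's actual implementation.

```typescript
// Hypothetical structural type: only the ChildProcess fields that matter here.
interface ExitState {
  exitCode: number | null;
  signalCode: string | null;
}

// Treats both a normal exit and a signal termination as "the provider is gone".
function hasChildProcessExited(child: ExitState): boolean {
  return child.exitCode !== null || child.signalCode !== null;
}
```

For example, `{ exitCode: null, signalCode: "SIGKILL" }` counts as exited, while `{ exitCode: null, signalCode: null }` is still running.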

Proposed fix

  • Add per-caller expected-shutdown tokens to RuntimeProviderProcessManager.
    • markProviderShutdownExpected() mints a token and records it on the live process.
    • clearProviderShutdownExpected() removes only the caller's token, so overlapping stop operations cannot clear each other's expected shutdown marks.
    • process exit consumes all outstanding tokens and reports the exit as expected if at least one token exists.
  • Add hasChildProcessExited() to treat either normal exits or signal terminations as provider exits.
  • Update runtime.stopThread so restart-provider stops:
    • mark shutdown expected before sending the interrupt
    • tolerate the in-flight interrupt rejection only when the child has exited
    • clear only this stop's token and rethrow on non-exit JSON-RPC/protocol failures
  • Add regression tests for normal in-flight exit, signal exit, live request failure cleanup, and overlapping stop concurrency.
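
The token scheme and the stop flow above could look roughly like this. It is a sketch under assumptions: `ExpectedShutdownTokens`, `consumeOnExit`, and `stopWithRestartProvider` are hypothetical names, numeric tokens are an arbitrary choice, and the real `RuntimeProviderProcessManager` and `runtime.stopThread` may be structured differently.

```typescript
// Hypothetical per-process registry of expected-shutdown marks.
class ExpectedShutdownTokens {
  private readonly tokens = new Set<number>();
  private nextToken = 1;

  // Each caller gets its own token, so overlapping stops stay independent.
  markProviderShutdownExpected(): number {
    const token = this.nextToken++;
    this.tokens.add(token);
    return token;
  }

  // Removes only the caller's token; other callers' marks survive.
  clearProviderShutdownExpected(token: number): void {
    this.tokens.delete(token);
  }

  // Called once on process exit: the exit is expected if any token is outstanding.
  consumeOnExit(): boolean {
    const expected = this.tokens.size > 0;
    this.tokens.clear();
    return expected;
  }
}

// Hypothetical stop flow for a restart-provider plan.
async function stopWithRestartProvider(
  tokens: ExpectedShutdownTokens,
  sendInterrupt: () => Promise<unknown>,
  hasExited: () => boolean,
): Promise<void> {
  // Mark first, so an exit that races the interrupt is already expected.
  const token = tokens.markProviderShutdownExpected();
  try {
    await sendInterrupt();
  } catch (error) {
    // The provider died mid-interrupt: that is the intended outcome.
    if (hasExited()) return;
    // The provider is still alive: drop only this stop's mark and surface the failure.
    tokens.clearProviderShutdownExpected(token);
    throw error;
  }
  // On success the mark stays; this sketch assumes the caller tears the provider down next.
}
```

Keeping one token per caller, rather than a single boolean flag, is what lets one stop fail and clean up without erasing the expectation set by another stop that is still in progress.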

Validation

Worker validation:

  • pnpm exec turbo run typecheck --filter=@bb/agent-runtime
  • pnpm --filter @bb/agent-runtime exec vitest run --testTimeout=30000 (548/548 passing across 28 files)

Review validation:

  • Initial adversarial review found 2 P1s; both fixed.
  • Second adversarial review found an overlapping-stop P1; fixed with tokenized expected-shutdown marks.
  • Final adversarial re-review passed with no P0/P1/P2 findings.
  • git diff --check

Out of scope / follow-up

The app-data resync ZodError in the original logs appears to be a separate schema drift in the app data resync hint path. It is not part of this provider stop crash and should be handled in a focused follow-up.

brsbl commented Jun 3, 2026

cc @ymichael
