fix(runner): exec-control ops reject a finished (Done) exec via finis…#610
Open
G4614 wants to merge 2 commits into
Open
fix(runner): exec-control ops reject a finished (Done) exec via finis…#610G4614 wants to merge 2 commits into
G4614 wants to merge 2 commits into
Conversation
…hedLocked Five exec-control operations guarded only on the `closed` flag, missing the Done channel. `Done` is the canonical "exec finished" signal — closed by the wait-task's outermost defer, guaranteed on any goroutine exit incl. panic — while `closed` is set later in a nested defer an abnormal exit can skip. In that race window the op fell through and acted on a dead handle: Signal/Resize silently no-op'd (HTTP returned 204, falsely "delivered"), and the attach (WebSocket) ops wrote to a dead pipe / signalled a gone PID. Add one `finishedLocked()` predicate (Done-closed || closed flag, under handleMu) and call it from all five ops, each keeping its own per-resource nil check (execution vs stdinW) and its own error style: - Signal (HTTP) -> ErrExecClosed sentinel -> 409 via classifyExecError - ResizeTTY (HTTP) -> ErrExecClosed sentinel -> 409 - AttachSignal (WS) -> "execution X is closed" - AttachResize (WS) -> "execution X is closed" - AttachWriteStdin (WS)-> "execution X stdin is closed" The predicate dedups the Done+closed check (one rule, one place) so a new exec-control op can't silently forget it. Kill is intentionally untouched — best-effort idempotent terminate, no 409 contract. Signal's reproducer (TestBoxliteExecSignalClosedReturns409, PR boxlite-ai#505's "Finding 10") was already red since 2026-05-12; the other four get new reproducers. Two-side verified (all 5): pre-fix: Signal->204, Resize->400(not-tty), 3 Attach->nil error (5 FAIL) post-fix: 5 PASS; pkg/api/controllers + pkg/boxlite otherwise green (unrelated TestExecManagerSignalUnsupportedFallsThroughToKill is Bug 2, fixed on a separate branch, not regressed here) Supersedes fix/runner-signal-done-409 (Signal-only). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…pen) Complements the Done-caught reproducers with the inverse teardown window: the inner defer ran (closed=true) but the outer defer hasn't closed Done yet. Verifies finishedLocked still rejects via the `closed` branch, so the Done-first refactor didn't turn it into dead code. Done is left open and execution is non-nil, so `closed` is the sole error source — dropping `return e.closed` makes it fail (got nil). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Inside
ManagedExec, after the box process has exited, the exec-control ops (signal/resize/write-stdin) must consult bothe.Doneande.closed, theManagedExecshould know signal sending failed when either flag occursTest plan
Run with
-tags boxlite_dev.closedandDoneflip at different instants during teardown — the inner defer setsclosed, the outer defer closesDone— so each teardown state is caught by a different signal, and the tests below pin each row:closedDoneDonenot yet broadcastclosedTestAttachSignalClosedFlagNoDoneErrors(new)closednever setDoneDone-caught row (close(exec.Done),closedstays false):TestAttachSignalClosedExecErrors(pkg/boxlite) — newTestAttachResizeClosedExecErrors(pkg/boxlite) — newTestAttachWriteStdinClosedExecErrors(pkg/boxlite) — newTestBoxliteExecResizeClosedReturns409(pkg/api/controllers) — newTestBoxliteExecSignalClosedReturns409(pkg/api/controllers) — pre-existing (PR feat(exec): runner attach controller + env/workdir/timeout plumbing #505 "Finding 10")closed-caught row (closed=true,Doneleft open):TestAttachSignalClosedFlagNoDoneErrors(pkg/boxlite) — newTwo-side verification:
Donerow: revertexec_manager.goto parent → 5 FAIL (Signal→204, Resize→400 not-tty, 3 attach→nil error); restore → 5 PASS.closedrow: the parent already rejected onclosed, so this is not a fail-before reproducer; instead, droppingreturn e.closedfromfinishedLocked→ FAIL (got nil), restore → PASS — pinning thatclosed(notDone) is what rejects in this row.