fix(runner): exec-control ops reject a finished (Done) exec via finis…#610

Open
G4614 wants to merge 2 commits into
boxlite-ai:mainfrom
G4614:fix/runner-exec-ops-done-checks

@G4614 G4614 commented May 28, 2026

Inside ManagedExec, after the box process has exited, the exec-control ops (signal/resize/write-stdin) must consult both e.Done and e.closed. If either one shows that the exec has finished, the op must return an error instead of reporting the signal as delivered.

Test plan

Run with -tags boxlite_dev. closed and Done flip at different instants during teardown — the inner defer sets closed, the outer defer closes Done — so each teardown state is caught by a different signal, and the tests below pin each row:

| Teardown state | closed | Done | Caught by | Test |
| --- | --- | --- | --- | --- |
| Still running | false | open | neither (op proceeds) | none (unchanged proceed path) |
| Handle closed, Done not yet broadcast | true | open | closed | TestAttachSignalClosedFlagNoDoneErrors (new) |
| Inner defer panicked, closed never set | false | closed | Done | the 5 below (new + pre-existing) |
| Fully finished (both defers ran) | true | closed | either | none (redundant) |
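
The teardown ordering behind the table can be reproduced with a small standalone sketch. Everything in it is hypothetical: `managedExec`, `waitTask`, and the `failInner` switch are illustrative stand-ins, not the runner's real types, and the `recover()` call is only there so the demo keeps running after the simulated panic. It shows that the outer defer closes `Done` on every exit path, while a panic in the inner defer leaves `closed` unset.

```go
package main

import (
	"fmt"
	"sync"
)

// managedExec is a hypothetical stand-in for ManagedExec; only the two
// teardown signals discussed above are modelled.
type managedExec struct {
	mu     sync.Mutex
	closed bool
	Done   chan struct{}
}

// waitTask mimics the wait goroutine's teardown. The outer defer is
// registered first, so it runs last and always closes Done. The inner
// defer sets closed, unless failInner makes it panic before doing so.
func (e *managedExec) waitTask(failInner bool) {
	defer func() {
		recover() // assumption: swallow the panic so the demo continues
		close(e.Done)
	}()
	func() {
		defer func() {
			if failInner {
				panic("inner teardown failed") // closed is never set
			}
			e.mu.Lock()
			e.closed = true
			e.mu.Unlock()
		}()
		// ... waiting for the box process to exit would happen here ...
	}()
}

// doneClosed reports whether Done has been closed, without blocking.
func doneClosed(e *managedExec) bool {
	select {
	case <-e.Done:
		return true
	default:
		return false
	}
}

// run executes waitTask synchronously and returns (closed, Done closed).
func run(failInner bool) (bool, bool) {
	e := &managedExec{Done: make(chan struct{})}
	e.waitTask(failInner)
	e.mu.Lock()
	defer e.mu.Unlock()
	return e.closed, doneClosed(e)
}

func main() {
	c, d := run(false)
	fmt.Printf("normal exit:  closed=%v doneClosed=%v\n", c, d)
	c, d = run(true)
	fmt.Printf("inner panics: closed=%v doneClosed=%v\n", c, d)
}
```

The second teardown state in the table (closed set, Done still open) is the instant between the inner and the outer defer in the normal path; the sketch does not pause there, so it only observes the final states.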

Done-caught row (close(exec.Done), closed stays false):

  • TestAttachSignalClosedExecErrors (pkg/boxlite) — new
  • TestAttachResizeClosedExecErrors (pkg/boxlite) — new
  • TestAttachWriteStdinClosedExecErrors (pkg/boxlite) — new
  • TestBoxliteExecResizeClosedReturns409 (pkg/api/controllers) — new
  • TestBoxliteExecSignalClosedReturns409 (pkg/api/controllers) — pre-existing (PR #505, "feat(exec): runner attach controller + env/workdir/timeout plumbing", "Finding 10")

closed-caught row (closed=true, Done left open):

  • TestAttachSignalClosedFlagNoDoneErrors (pkg/boxlite) — new

Two-side verification:

  • Done row: revert exec_manager.go to parent → 5 FAIL (Signal→204, Resize→400 not-tty, 3 attach→nil error); restore → 5 PASS.
  • closed row: the parent already rejected on closed, so this is not a fail-before reproducer; instead, dropping return e.closed from finishedLocked → FAIL (got nil), restore → PASS — pinning that closed (not Done) is what rejects in this row.
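
The two-side verification works like a small mutation test: each row must be rejected by the full predicate and let through by a variant that lacks one branch. The sketch below illustrates that idea only. The `state` type and the three predicate functions are hypothetical and are not taken from exec_manager.go; `finishedClosedOnly` approximates the parent's behaviour and `finishedDoneOnly` approximates dropping `return e.closed`.

```go
package main

import "fmt"

// state models one teardown row: the closed flag plus a Done channel.
// Hypothetical type, not the real ManagedExec.
type state struct {
	closed bool
	done   chan struct{}
}

// newState builds a row; doneClosed controls whether Done is closed.
func newState(closed, doneClosed bool) state {
	s := state{closed: closed, done: make(chan struct{})}
	if doneClosed {
		close(s.done)
	}
	return s
}

// isDone reports whether the Done channel is closed, without blocking.
func isDone(s state) bool {
	select {
	case <-s.done:
		return true
	default:
		return false
	}
}

// finished is the full predicate: Done closed OR closed flag set.
func finished(s state) bool { return isDone(s) || s.closed }

// finishedClosedOnly approximates the parent commit: only the flag.
func finishedClosedOnly(s state) bool { return s.closed }

// finishedDoneOnly is the mutant with the closed branch dropped.
func finishedDoneOnly(s state) bool { return isDone(s) }

func main() {
	doneRow := newState(false, true)   // Done-caught row
	closedRow := newState(true, false) // closed-caught row

	fmt.Println("Done row:  ", finished(doneRow), finishedClosedOnly(doneRow))
	fmt.Println("closed row:", finished(closedRow), finishedDoneOnly(closedRow))
}
```

In both rows the full predicate reports finished while the single-branch variant does not, which is what makes each test a reproducer for exactly one branch.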

Ubuntu and others added 2 commits May 28, 2026 05:41
…hedLocked

Five exec-control operations guarded only on the `closed` flag, missing
the Done channel. `Done` is the canonical "exec finished" signal: it is
closed by the wait-task's outermost defer, guaranteed on any goroutine exit
incl. panic, while `closed` is set later in a nested defer that an abnormal
exit can skip. In that race window the op fell through and acted on a dead
handle:
Signal/Resize silently no-op'd (HTTP returned 204, falsely "delivered"),
and the attach (WebSocket) ops wrote to a dead pipe / signalled a gone PID.

Add one `finishedLocked()` predicate (Done-closed || closed flag, under
handleMu) and call it from all five ops, each keeping its own per-resource
nil check (execution vs stdinW) and its own error style:
- Signal (HTTP)        -> ErrExecClosed sentinel -> 409 via classifyExecError
- ResizeTTY (HTTP)     -> ErrExecClosed sentinel -> 409
- AttachSignal (WS)    -> "execution X is closed"
- AttachResize (WS)    -> "execution X is closed"
- AttachWriteStdin (WS)-> "execution X stdin is closed"
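
A minimal sketch of the shape described above, under stated assumptions: the struct fields, the `handle` type, and the way the nil check is combined with the predicate are all hypothetical, and only the names `finishedLocked`, `handleMu`, `Done`, `closed`, `ErrExecClosed`, and the error strings come from this commit message. Only two of the five ops are shown.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrExecClosed is named in the commit message; this definition is assumed.
var ErrExecClosed = errors.New("execution is closed")

// handle is a hypothetical stand-in for the real execution handle.
type handle struct{ signals []int }

func (h *handle) signal(sig int) error {
	h.signals = append(h.signals, sig)
	return nil
}

// ManagedExec is a reduced, hypothetical version of the real struct.
type ManagedExec struct {
	ID        string
	Done      chan struct{}
	handleMu  sync.Mutex
	closed    bool
	execution *handle
}

// finishedLocked reports whether the exec has finished: Done is closed
// or the closed flag is set. The caller must hold handleMu.
func (e *ManagedExec) finishedLocked() bool {
	select {
	case <-e.Done:
		return true
	default:
	}
	return e.closed
}

// Signal is the HTTP-style op: it returns the sentinel so the controller
// can map it to a status code.
func (e *ManagedExec) Signal(sig int) error {
	e.handleMu.Lock()
	defer e.handleMu.Unlock()
	if e.finishedLocked() || e.execution == nil {
		return ErrExecClosed
	}
	return e.execution.signal(sig)
}

// AttachSignal is the WebSocket-style op: it returns a formatted message.
func (e *ManagedExec) AttachSignal(sig int) error {
	e.handleMu.Lock()
	defer e.handleMu.Unlock()
	if e.finishedLocked() || e.execution == nil {
		return fmt.Errorf("execution %s is closed", e.ID)
	}
	return e.execution.signal(sig)
}

func main() {
	e := &ManagedExec{ID: "x1", Done: make(chan struct{}), execution: &handle{}}
	fmt.Println(e.Signal(15)) // exec still running: op proceeds

	close(e.Done) // Done-caught row: closed stays false
	fmt.Println(e.Signal(15))
	fmt.Println(e.AttachSignal(15))
}
```
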

The predicate dedups the Done+closed check (one rule, one place) so a new
exec-control op can't silently forget it. Kill is intentionally untouched
— best-effort idempotent terminate, no 409 contract.

Signal's reproducer (TestBoxliteExecSignalClosedReturns409, PR boxlite-ai#505's
"Finding 10") was already red since 2026-05-12; the other four get new
reproducers.

Two-side verified (all 5):
  pre-fix:  Signal->204, Resize->400(not-tty), 3 Attach->nil error  (5 FAIL)
  post-fix: 5 PASS; pkg/api/controllers + pkg/boxlite otherwise green
            (unrelated TestExecManagerSignalUnsupportedFallsThroughToKill is
             Bug 2, fixed on a separate branch, not regressed here)

Supersedes fix/runner-signal-done-409 (Signal-only).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…pen)

Complements the Done-caught reproducers with the inverse teardown window:
the inner defer ran (closed=true) but the outer defer hasn't closed Done
yet. Verifies finishedLocked still rejects via the `closed` branch, so the
Done-first refactor didn't turn it into dead code. Done is left open and
execution is non-nil, so `closed` is the sole error source — dropping
`return e.closed` makes it fail (got nil).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>