fix(codexcli): reset kills session + zombie process lingers 30min#576
Conversation
Two P1 defects in codex app-server worker: 1. ResetContext clears worker state but never calls Start(), so the new forwardEvents goroutine (spawned by bridge InPlaceReset) reads from a closed recvCh → immediate error on resumed sessions. Fix: save sessionInfo in Start(), rewrite ResetContext to Terminate → clear state → call Start(ctx, savedSession) to re-establish a fresh thread. 2. Kill() is identical to Terminate() — both just call release() which decrements refs and starts the 30-minute idle drain timer. When a user kills a session, the singleton process lingers for half an hour. Fix: Kill() now calls release() followed by manager.KillIfIdle(), which force-kills the process by PGID when refs == 0, bypassing proc.Kill() to avoid the known proc.mu deadlock between Kill/Wait. New manager field pgid (cached in startProcessLocked) avoids needing proc.PGID() which also contends on proc.mu. Both fixes include integration-style tests with real codex app-server processes. Tests use save/restore pattern for global InitConfig to avoid state leakage to other parallel tests.
hotplex-ai
left a comment
There was a problem hiding this comment.
Code Review — hotplex-ai
Verdict: ✅ APPROVE | P0:0 P1:0 P2:0 P3:3
整体评价:两个 P1 缺陷的修复方案正确。ResetContext 的 Terminate→clear→Start 模式与 claudecode worker 一致;Kill/Terminate 语义区分合理;KillIfIdle 通过 ForceKill(pgid) 规避 proc.mu deadlock 是正当的架构决策。测试覆盖充分(5 个集成测试含 save/restore 防全局状态泄漏)。
P3-1: savedSession 命名与跨 worker 约定不一致 (worker.go:405)
claudecode/worker.go 和 codexcli/worker.go(ExecWorker) 均使用 origSession 作为 "保留首次 Start() 的 SessionInfo" 的字段名。新增的 AppServerWorker.savedSession 语义相同但命名不同。建议统一为 origSession,或在注释中明确说明 savedSession 总是保留 最近一次(非首次)Start 的值。
P3-2: KillIfIdle 跳过 graceful shutdown 缺少设计说明 (manager.go:726)
Shutdown() 使用 proc.Kill() 遵循 SIGTERM→5s→SIGKILL 三层协议,而 KillIfIdle 直接 ForceKill(pgid) 跳到 SIGKILL。opencodeserver 的 Kill() 也不做 force-kill。建议在 KillIfIdle 注释中补充说明跳过 graceful 的设计理由(如:idle 进程无活跃 session,graceful 无意义)。
P3-3: KillIfIdle deadlock 注释措辞可改进 (manager.go:714)
当前注释说 "monitorProcess may already hold proc.mu in its own Wait() call",但实际 monitorProcess 先释放 m.mu 再调 pm.Wait(),Wait() 内部会获取 proc.mu。更准确的说法是:Manager.Kill() 获取 proc.mu 后调 cmd.Wait(),而 monitorProcess 的 Wait() 也在 proc.mu 下调 cmd.Wait()(通过 waitOnce),两者构成 deadlock。ForceKill 不触碰 proc.mu,monitorProcess 自然观察到进程退出后完成 cleanup。
…ards Split #575 tests into: - Unit tests (no codex binary): TestResetContextClearsStateAndResetsOnce, TestKillCallsKillIfIdle, TestKillIfIdleNoOpWhenRefsPositive - Integration tests (require codex binary): guarded by testing.Short() and error-tolerant (t.Logf + return when codex not found) CI runners lack the codex binary, so integration tests gracefully skip. Unit tests validate the core logic without process dependencies.
- P3-1: Rename savedSession → origSession to match claudecode/exec worker convention - P3-2: Expand KillIfIdle doc comment explaining why graceful shutdown is skipped (idle process has no active sessions, zombie GC needs immediate cleanup) - P3-3: Fix deadlock explanation: both Kill() and monitorProcess's Wait() acquire proc.mu and call cmd.Wait() (serialised by waitOnce), not 'monitorProcess may already hold proc.mu'
hotplex-ai
left a comment
There was a problem hiding this comment.
Code Review — hotplex-ai
Verdict:
Two well-targeted fixes for real bugs. The ResetContext re-establish pattern (Terminate→clear→Start) correctly mirrors the claudecode precedent. KillIfIdle's ForceKill-via-PGID approach correctly avoids the proc.mu deadlock documented in the spec. One P1 bug in the error path, plus defense-in-depth items below.
P1 — Must Fix
[P1] ResetContext Start() failure leaves worker in unrecoverable state — worker.go:630-636
After Terminate() + state clear, if Start(saved) fails, the worker has: closed=false, doneCh=new, recvCh=nil, conn=stale-closed. A subsequent Kill() call will re-execute release() via the fresh releaseOnce, calling manager.Release() on already-released refs (refs already 0), producing a spurious warn log and leaving the session in zombie state. Fix: on Start() failure, set w.closed = true and close doneCh to signal the waiting goroutine, or restore releaseOnce to its pre-clear state so the subsequent cleanup is a no-op.
P2 — Should Fix
[P2] pgid not reset to 0 in cleanup paths — manager.go:90
monitorProcess and Shutdown set m.proc = nil but never reset m.pgid = 0. Currently safe because KillIfIdle guards on m.state == stateRunning, but defense-in-depth requires resetting cached state. If the state check logic changes, stale PGID could cause cross-process kill.
[P2] ResetContext doesn't clear w.conn/w.commands — worker.go:627-631
State cleanup clears threadID/recvCh/closed/doneCh/releaseOnce but leaves w.conn pointing to the old closed appConn and w.commands holding the old threadID. Start() overwrites these, but the window between clear and Start is defense-in-depth gap. Add w.conn = nil; w.commands = nil.
[P2] Duplicate godoc paragraph in KillIfIdle — manager.go:712
Line 712 "KillIfIdle immediately kills the singleton process when refs == 0." is a redundant subset of line 708-710. Delete.
[P2] Test pgid=99999 calls real ForceKill — worker_test.go:~1705
TestKillCallsKillIfIdle sets mgr.pgid = 99999 then calls Kill() which triggers ForceKill(99999). In CI, PGID 99999 could match an unrelated process. Use a mock or set mgr.state = stateIdle to avoid the actual ForceKill call while still testing the logic.
[P2] Zombie GC calls Terminate() not Kill() — 30m idle drain may persist for primary zombie path — session/manager.go→bridge_forward.go
The zombie GC path (gcTerminatedSessions) calls worker.Terminate(), not Kill(). Terminate() calls release() which starts the 30m idle drain timer. forwardEvents then exits via doneCh close + Wait() returning immediately, so Kill() (and thus KillIfIdle()) is never invoked. The fix for Bug B may only apply to the bridge_forward.go Wait-timeout path, not the primary zombie GC scenario. Consider calling KillIfIdle() from Terminate()/release() when refs drop to 0, or from the session manager's zombie transition.
P3 — Nice to Have
[P3] Integration tests use t.Logf+return instead of t.Skip — worker_test.go
TestIntegrationStartSavesSessionAndResetRestarts and TestIntegrationKillImmediatelyTerminatesIdleProcess silently pass when codex binary is unavailable. Use t.Skip("codex binary not available") for honest CI reporting.
| w.mu.Unlock() | ||
|
|
||
| // Re-establish a fresh thread so the new forwardEvents goroutine | ||
| // (spawned by bridge.ResetSession) reads from a live recvCh instead |
There was a problem hiding this comment.
[P1] Start() failure leaves inconsistent state: After Terminate+clear, if Start(saved) fails, worker has closed=false, recvCh=nil, conn=stale. A subsequent Kill() re-runs release() via fresh releaseOnce, calling manager.Release() on refs==0 (already released). Fix: on Start failure, set w.closed=true and close doneCh to signal waiting goroutines.
| // KillIfIdle immediately kills the singleton process when no sessions hold | ||
| // references (refs == 0). This is used by AppServerWorker.Kill() to | ||
| // distinguish a forceful termination (zombie GC) from a graceful release | ||
| // (Terminate), avoiding the 30-minute idle drain wait. |
There was a problem hiding this comment.
[P2] Duplicate godoc: This line is a redundant subset of the first paragraph (lines 708-710). Please delete it.
| saved := w.origSession | ||
| w.threadID = "" | ||
| w.recvCh = nil | ||
| w.closed = false |
There was a problem hiding this comment.
[P2] conn/commands not cleared: State cleanup should also include w.conn = nil and w.commands = nil alongside the other field resets, to avoid stale references during the clear-to-Start window.
P1: ResetContext Start() failure leaves worker unrecoverable - Set closed=true, close(doneCh), nil doneCh on Start failure - Prevents double-release via subsequent Kill() P2: pgid not reset in cleanup paths - Add m.pgid = 0 in monitorProcess, Shutdown, handshake failure P2: ResetContext doesn't clear w.conn/w.commands - Add w.conn = nil; w.commands = nil in state cleanup P2: Zombie GC calls Terminate() not Kill() — 30min drain persists - Terminate() now also calls manager.KillIfIdle() so zombie GC path immediately kills idle singleton process P2: TestKillCallsKillIfIdle calls real ForceKill(99999) - Set stateIdle instead of stateRunning to avoid actual ForceKill P2: Duplicate godoc paragraph in KillIfIdle - Merge into single coherent doc comment P3: savedSession naming inconsistent with origSession - Rename saved → origSess in ResetContext P3: KillIfIdle comment improvements (graceful rationale, deadlock wording) P3: Integration tests use t.Logf+return instead of t.Skip - Replace with t.Skipf for honest CI reporting
hotplex-ai
left a comment
There was a problem hiding this comment.
Code Review — hotplex-ai
Verdict: ✅ APPROVE | P0:0 P1:0 P2:1 P3:5
P2 (1)
- [P2]
Kill()注释过时 (worker.go:580) — 注释声称 Kill 和 Terminate 行为不同,但两者现在都调用KillIfIdle(),行为一致。注释应更正以反映实际行为。
P3 (5)
- [P3]
KillIfIdle存在良性 TOCTOU (manager.go:733) —shouldKill检查与ForceKill(pgid)之间有窗口,但对已退出 PGID 发 SIGKILL 返回 ESRCH 无害。 - [P3]
TestResetContextRestartsFromSavedSession导致 refs=-1 (worker_test.go:1671) — Acquire 未调用但 release() 调用了 manager.Release(),refs 从 0 降到 -1。测试应预先 Acquire 或 mock release。 - [P3]
testConfig575命名耦合 issue 编号 (worker_test.go:1648) — 建议改为更通用的名称如testConfigWithDefaults。 - [P3]
KillIfIdledoc comment 死锁描述不够精确 (manager.go:723-727) — 措辞暗示两个路径同时持有 proc.mu,实际是 cmd.Wait() 持有阻塞 Kill 获取。 - [P3]
TestKillCallsKillIfIdle未实际测试 kill 路径 (worker_test.go:1743) — state 为 stateIdle(默认值),shouldKill 条件refs==0 && state==stateRunning永假,测试仅验证了 timer stop。
整体评价:两个缺陷的修复逻辑正确。KillIfIdle 绕过 proc.mu 的 ForceKill 设计合理,ResetContext 的 Terminate→clear→Start 重连链路完整。仅注释需更新。
| // After release(), refs may reach 0. Immediately kill the singleton | ||
| // process instead of waiting for the idle drain timer (30 minutes). | ||
| // This distinguishes Kill() (forceful, from zombie GC) from | ||
| // Terminate() (graceful, keeps process alive for reuse). |
There was a problem hiding this comment.
[P2] 注释过时:"This distinguishes Kill() (forceful, from zombie GC) from Terminate() (graceful, keeps process alive for reuse)" — 但 Terminate() (line 564-569) 也调用了 KillIfIdle(),两者行为已完全一致。建议更新注释反映当前实际行为。
P2: Kill() comment was stale — both Kill and Terminate now call KillIfIdle, update comment to reflect identical behavior. P3: KillIfIdle TOCTOU — add NOTE about benign ESRCH on exited PGID. P3: KillIfIdle deadlock comment — clarify that cmd.Wait() inside proc.mu is the actual blocker, not concurrent mu acquisition. P3: TestResetContextRestartsFromSavedSession refs=-1 — pre-set mgr.refs=1 to avoid negative ref count on release(). P3: testConfig575 → testConfigWithDefaults — decouple from issue #. P3: TestKillCallsKillIfIdle now uses stateRunning+pgid=0 so shouldKill is false via pgid gate (not state gate), actually exercising the running code path. P3: Add TestKillIfIdleKillsWhenIdle to verify the kill path with refs=0+stateRunning+pgid=-1 (ESRCH, harmless). P3: TestKillIfIdleNoOpWhenRefsPositive — replace pgid=99999 with pgid=0 to avoid ForceKill on real PGID.
hotplex-ai
left a comment
There was a problem hiding this comment.
Code Review — hotplex-ai
Verdict: APPROVE | P0:0 P1:0 P2:0 P3:0
No issues found. All changes are well-tested (256 lines of new tests) and the design is sound.
Confirmed correct
KillIfIdle: TOCTOU betweenshouldKillcheck andForceKillis documented as benign — ESRCH on dead PGID is harmless ✓Kill()/Terminate()identity: Both callKillIfIdle()afterrelease(). Comment explains the distinction exists for interface conformance ✓ResetContext: Terminate → clear state → Start(savedSession) flow is correct. Failure path marks worker as safely inert (closed=true, doneCh closed, releaseOnce fresh for subsequent Kill) ✓pgidcaching: Set instartProcessLocked, cleared on all cleanup paths (Shutdown, startProcess error, monitorProcess exit) ✓origSession: Preserved from Start(), used in ResetContext to re-establish thread ✓- Tests: Comprehensive coverage — KillIfIdle (idle/refs>0/with-pgid), ResetContext (state cleanup, restart from saved session), integration (Start→Reset→new threadID) ✓
🤖 Generated with Claude Code
Fixes #575
Defect A — Reset kills session
ResetContext()clears worker state but never callsStart(). When bridgeInPlaceResetspawns a newforwardEventsgoroutine, it reads from the stale closedrecvCh→ immediate error on resumed sessions.Fix: Save
sessionInfoinStart(). RewriteResetContexttoTerminate → clear state → Start(ctx, savedSession)to re-establish a fresh thread.Defect B — Zombie process lingers 30 minutes
Kill()is identical toTerminate()— both just callrelease()which decrements refs and starts the 30-minute idle drain timer. Killed sessions leave the singleton process alive for half an hour.Fix:
Kill()now callsrelease()followed bymanager.KillIfIdle(), which force-kills the process by PGID whenrefs == 0. Usesproc.ForceKill(pgid)instead ofproc.Kill()to avoid the knownproc.mudeadlock between Kill/Wait.Design decisions
KillIfIdleusesForceKill(pgid)notproc.Kill()proc.Kill()holdsproc.muand callscmd.Wait(), deadlocking withmonitorProcessalready inWait()pgidinstartProcessLockedproc.PGID()also needsproc.mu, same deadlock riskKillIfIdlecopies pgid underm.mu, kills after unlockproc.ManagerTesting
testConfig575()helper with save/restore pattern to avoid globalInitConfigstate leakaget.Parallel()on tests that modify global configFiles changed