Skip to content

fix(codexcli): reset kills session + zombie process lingers 30min#576

Merged
hrygo merged 6 commits into
mainfrom
fix/codexcli-reset-zombie-575
May 30, 2026
Merged

fix(codexcli): reset kills session + zombie process lingers 30min#576
hrygo merged 6 commits into
mainfrom
fix/codexcli-reset-zombie-575

Conversation

@hrygo
Copy link
Copy Markdown
Owner

@hrygo hrygo commented May 30, 2026

Fixes #575

Defect A — Reset kills session

ResetContext() clears worker state but never calls Start(). When bridge InPlaceReset spawns a new forwardEvents goroutine, it reads from the stale closed recvCh → immediate error on resumed sessions.

Fix: Save sessionInfo in Start(). Rewrite ResetContext to Terminate → clear state → Start(ctx, savedSession) to re-establish a fresh thread.

Defect B — Zombie process lingers 30 minutes

Kill() is identical to Terminate() — both just call release() which decrements refs and starts the 30-minute idle drain timer. Killed sessions leave the singleton process alive for half an hour.

Fix: Kill() now calls release() followed by manager.KillIfIdle(), which force-kills the process by PGID when refs == 0. Uses proc.ForceKill(pgid) instead of proc.Kill() to avoid the known proc.mu deadlock between Kill/Wait.

Design decisions

Decision Rationale
KillIfIdle uses ForceKill(pgid) not proc.Kill() proc.Kill() holds proc.mu and calls cmd.Wait(), deadlocking with monitorProcess already in Wait()
Cache pgid in startProcessLocked proc.PGID() also needs proc.mu, same deadlock risk
KillIfIdle copies pgid under m.mu, kills after unlock Avoids blocking ops under lock
Do not modify proc.Manager Kill/Wait lock contention is pre-existing, out of scope for #575

Testing

  • 5 new integration-style tests with real codex app-server processes
  • Tests use testConfig575() helper with save/restore pattern to avoid global InitConfig state leakage
  • No t.Parallel() on tests that modify global config
  • Full suite passes ×5 runs (15 total) with zero flakes

Files changed

internal/worker/codexcli/manager.go      |  29 +++++
internal/worker/codexcli/worker.go       |  22 ++++
internal/worker/codexcli/worker_test.go  | 172 +++++++++++++++++++++
3 files changed, 223 insertions(+)

Two P1 defects in codex app-server worker:

1. ResetContext clears worker state but never calls Start(), so the new
   forwardEvents goroutine (spawned by bridge InPlaceReset) reads from
   a closed recvCh → immediate error on resumed sessions.

   Fix: save sessionInfo in Start(), rewrite ResetContext to Terminate
   → clear state → call Start(ctx, savedSession) to re-establish a
   fresh thread.

2. Kill() is identical to Terminate() — both just call release() which
   decrements refs and starts the 30-minute idle drain timer. When a
   user kills a session, the singleton process lingers for half an hour.

   Fix: Kill() now calls release() followed by manager.KillIfIdle(),
   which force-kills the process by PGID when refs == 0, bypassing
   proc.Kill() to avoid the known proc.mu deadlock between Kill/Wait.

   New manager field pgid (cached in startProcessLocked) avoids needing
   proc.PGID() which also contends on proc.mu.

Both fixes include integration-style tests with real codex app-server
processes. Tests use save/restore pattern for global InitConfig to
avoid state leakage to other parallel tests.
hotplex-ai
hotplex-ai previously approved these changes May 30, 2026
Copy link
Copy Markdown
Owner

@hotplex-ai hotplex-ai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review — hotplex-ai

Verdict: ✅ APPROVE | P0:0 P1:0 P2:0 P3:3

整体评价:两个 P1 缺陷的修复方案正确。ResetContext 的 Terminate→clear→Start 模式与 claudecode worker 一致;Kill/Terminate 语义区分合理;KillIfIdle 通过 ForceKill(pgid) 规避 proc.mu deadlock 是正当的架构决策。测试覆盖充分(5 个集成测试含 save/restore 防全局状态泄漏)。


P3-1: savedSession 命名与跨 worker 约定不一致 (worker.go:405)

claudecode/worker.gocodexcli/worker.go(ExecWorker) 均使用 origSession 作为 "保留首次 Start() 的 SessionInfo" 的字段名。新增的 AppServerWorker.savedSession 语义相同但命名不同。建议统一为 origSession,或在注释中明确说明 savedSession 总是保留 最近一次(非首次)Start 的值。

P3-2: KillIfIdle 跳过 graceful shutdown 缺少设计说明 (manager.go:726)

Shutdown() 使用 proc.Kill() 遵循 SIGTERM→5s→SIGKILL 三层协议,而 KillIfIdle 直接 ForceKill(pgid) 跳到 SIGKILL。opencodeserverKill() 也不做 force-kill。建议在 KillIfIdle 注释中补充说明跳过 graceful 的设计理由(如:idle 进程无活跃 session,graceful 无意义)。

P3-3: KillIfIdle deadlock 注释措辞可改进 (manager.go:714)

当前注释说 "monitorProcess may already hold proc.mu in its own Wait() call",但实际 monitorProcess 先释放 m.mu 再调 pm.Wait()Wait() 内部会获取 proc.mu。更准确的说法是:Manager.Kill() 获取 proc.mu 后调 cmd.Wait(),而 monitorProcessWait() 也在 proc.mu 下调 cmd.Wait()(通过 waitOnce),两者构成 deadlock。ForceKill 不触碰 proc.mumonitorProcess 自然观察到进程退出后完成 cleanup。


…ards

Split #575 tests into:
- Unit tests (no codex binary): TestResetContextClearsStateAndResetsOnce,
  TestKillCallsKillIfIdle, TestKillIfIdleNoOpWhenRefsPositive
- Integration tests (require codex binary): guarded by testing.Short()
  and error-tolerant (t.Logf + return when codex not found)

CI runners lack the codex binary, so integration tests gracefully skip.
Unit tests validate the core logic without process dependencies.
- P3-1: Rename savedSession → origSession to match claudecode/exec worker convention
- P3-2: Expand KillIfIdle doc comment explaining why graceful shutdown is skipped
  (idle process has no active sessions, zombie GC needs immediate cleanup)
- P3-3: Fix deadlock explanation: both Kill() and monitorProcess's Wait()
  acquire proc.mu and call cmd.Wait() (serialised by waitOnce), not
  'monitorProcess may already hold proc.mu'
Copy link
Copy Markdown
Owner

@hotplex-ai hotplex-ai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review — hotplex-ai

Verdict: ⚠️ REQUEST_CHANGES | P0:0 P1:1 P2:5 P3:1

Two well-targeted fixes for real bugs. The ResetContext re-establish pattern (Terminate→clear→Start) correctly mirrors the claudecode precedent. KillIfIdle's ForceKill-via-PGID approach correctly avoids the proc.mu deadlock documented in the spec. One P1 bug in the error path, plus defense-in-depth items below.

P1 — Must Fix

[P1] ResetContext Start() failure leaves worker in unrecoverable stateworker.go:630-636
After Terminate() + state clear, if Start(saved) fails, the worker has: closed=false, doneCh=new, recvCh=nil, conn=stale-closed. A subsequent Kill() call will re-execute release() via the fresh releaseOnce, calling manager.Release() on already-released refs (refs already 0), producing a spurious warn log and leaving the session in zombie state. Fix: on Start() failure, set w.closed = true and close doneCh to signal the waiting goroutine, or restore releaseOnce to its pre-clear state so the subsequent cleanup is a no-op.

P2 — Should Fix

[P2] pgid not reset to 0 in cleanup pathsmanager.go:90
monitorProcess and Shutdown set m.proc = nil but never reset m.pgid = 0. Currently safe because KillIfIdle guards on m.state == stateRunning, but defense-in-depth requires resetting cached state. If the state check logic changes, stale PGID could cause cross-process kill.

[P2] ResetContext doesn't clear w.conn/w.commandsworker.go:627-631
State cleanup clears threadID/recvCh/closed/doneCh/releaseOnce but leaves w.conn pointing to the old closed appConn and w.commands holding the old threadID. Start() overwrites these, but the window between clear and Start is defense-in-depth gap. Add w.conn = nil; w.commands = nil.

[P2] Duplicate godoc paragraph in KillIfIdlemanager.go:712
Line 712 "KillIfIdle immediately kills the singleton process when refs == 0." is a redundant subset of line 708-710. Delete.

[P2] Test pgid=99999 calls real ForceKillworker_test.go:~1705
TestKillCallsKillIfIdle sets mgr.pgid = 99999 then calls Kill() which triggers ForceKill(99999). In CI, PGID 99999 could match an unrelated process. Use a mock or set mgr.state = stateIdle to avoid the actual ForceKill call while still testing the logic.

[P2] Zombie GC calls Terminate() not Kill() — 30m idle drain may persist for primary zombie pathsession/manager.gobridge_forward.go
The zombie GC path (gcTerminatedSessions) calls worker.Terminate(), not Kill(). Terminate() calls release() which starts the 30m idle drain timer. forwardEvents then exits via doneCh close + Wait() returning immediately, so Kill() (and thus KillIfIdle()) is never invoked. The fix for Bug B may only apply to the bridge_forward.go Wait-timeout path, not the primary zombie GC scenario. Consider calling KillIfIdle() from Terminate()/release() when refs drop to 0, or from the session manager's zombie transition.

P3 — Nice to Have

[P3] Integration tests use t.Logf+return instead of t.Skipworker_test.go
TestIntegrationStartSavesSessionAndResetRestarts and TestIntegrationKillImmediatelyTerminatesIdleProcess silently pass when codex binary is unavailable. Use t.Skip("codex binary not available") for honest CI reporting.

w.mu.Unlock()

// Re-establish a fresh thread so the new forwardEvents goroutine
// (spawned by bridge.ResetSession) reads from a live recvCh instead
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[P1] Start() failure leaves inconsistent state: After Terminate+clear, if Start(saved) fails, worker has closed=false, recvCh=nil, conn=stale. A subsequent Kill() re-runs release() via fresh releaseOnce, calling manager.Release() on refs==0 (already released). Fix: on Start failure, set w.closed=true and close doneCh to signal waiting goroutines.

Comment thread internal/worker/codexcli/manager.go Outdated
// KillIfIdle immediately kills the singleton process when no sessions hold
// references (refs == 0). This is used by AppServerWorker.Kill() to
// distinguish a forceful termination (zombie GC) from a graceful release
// (Terminate), avoiding the 30-minute idle drain wait.
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[P2] Duplicate godoc: This line is a redundant subset of the first paragraph (lines 708-710). Please delete it.

saved := w.origSession
w.threadID = ""
w.recvCh = nil
w.closed = false
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[P2] conn/commands not cleared: State cleanup should also include w.conn = nil and w.commands = nil alongside the other field resets, to avoid stale references during the clear-to-Start window.

P1: ResetContext Start() failure leaves worker unrecoverable
- Set closed=true, close(doneCh), nil doneCh on Start failure
- Prevents double-release via subsequent Kill()

P2: pgid not reset in cleanup paths
- Add m.pgid = 0 in monitorProcess, Shutdown, handshake failure

P2: ResetContext doesn't clear w.conn/w.commands
- Add w.conn = nil; w.commands = nil in state cleanup

P2: Zombie GC calls Terminate() not Kill() — 30min drain persists
- Terminate() now also calls manager.KillIfIdle() so zombie GC
  path immediately kills idle singleton process

P2: TestKillCallsKillIfIdle calls real ForceKill(99999)
- Set stateIdle instead of stateRunning to avoid actual ForceKill

P2: Duplicate godoc paragraph in KillIfIdle
- Merge into single coherent doc comment

P3: savedSession naming inconsistent with origSession
- Rename saved → origSess in ResetContext

P3: KillIfIdle comment improvements (graceful rationale, deadlock wording)

P3: Integration tests use t.Logf+return instead of t.Skip
- Replace with t.Skipf for honest CI reporting
hotplex-ai
hotplex-ai previously approved these changes May 30, 2026
Copy link
Copy Markdown
Owner

@hotplex-ai hotplex-ai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review — hotplex-ai

Verdict: ✅ APPROVE | P0:0 P1:0 P2:1 P3:5

P2 (1)

  • [P2] Kill() 注释过时 (worker.go:580) — 注释声称 Kill 和 Terminate 行为不同,但两者现在都调用 KillIfIdle(),行为一致。注释应更正以反映实际行为。

P3 (5)

  • [P3] KillIfIdle 存在良性 TOCTOU (manager.go:733) — shouldKill 检查与 ForceKill(pgid) 之间有窗口,但对已退出 PGID 发 SIGKILL 返回 ESRCH 无害。
  • [P3] TestResetContextRestartsFromSavedSession 导致 refs=-1 (worker_test.go:1671) — Acquire 未调用但 release() 调用了 manager.Release(),refs 从 0 降到 -1。测试应预先 Acquire 或 mock release。
  • [P3] testConfig575 命名耦合 issue 编号 (worker_test.go:1648) — 建议改为更通用的名称如 testConfigWithDefaults
  • [P3] KillIfIdle doc comment 死锁描述不够精确 (manager.go:723-727) — 措辞暗示两个路径同时持有 proc.mu,实际是 cmd.Wait() 持有阻塞 Kill 获取。
  • [P3] TestKillCallsKillIfIdle 未实际测试 kill 路径 (worker_test.go:1743) — state 为 stateIdle(默认值),shouldKill 条件 refs==0 && state==stateRunning 永假,测试仅验证了 timer stop。

整体评价:两个缺陷的修复逻辑正确。KillIfIdle 绕过 proc.mu 的 ForceKill 设计合理,ResetContext 的 Terminate→clear→Start 重连链路完整。仅注释需更新。

Comment thread internal/worker/codexcli/worker.go Outdated
// After release(), refs may reach 0. Immediately kill the singleton
// process instead of waiting for the idle drain timer (30 minutes).
// This distinguishes Kill() (forceful, from zombie GC) from
// Terminate() (graceful, keeps process alive for reuse).
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[P2] 注释过时:"This distinguishes Kill() (forceful, from zombie GC) from Terminate() (graceful, keeps process alive for reuse)" — 但 Terminate() (line 564-569) 也调用了 KillIfIdle(),两者行为已完全一致。建议更新注释反映当前实际行为。

P2: Kill() comment was stale — both Kill and Terminate now call
KillIfIdle, update comment to reflect identical behavior.

P3: KillIfIdle TOCTOU — add NOTE about benign ESRCH on exited PGID.
P3: KillIfIdle deadlock comment — clarify that cmd.Wait() inside
proc.mu is the actual blocker, not concurrent mu acquisition.
P3: TestResetContextRestartsFromSavedSession refs=-1 — pre-set
mgr.refs=1 to avoid negative ref count on release().
P3: testConfig575 → testConfigWithDefaults — decouple from issue #.
P3: TestKillCallsKillIfIdle now uses stateRunning+pgid=0 so shouldKill
is false via pgid gate (not state gate), actually exercising the
running code path.
P3: Add TestKillIfIdleKillsWhenIdle to verify the kill path with
refs=0+stateRunning+pgid=-1 (ESRCH, harmless).
P3: TestKillIfIdleNoOpWhenRefsPositive — replace pgid=99999 with
pgid=0 to avoid ForceKill on real PGID.
Copy link
Copy Markdown
Owner

@hotplex-ai hotplex-ai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review — hotplex-ai

Verdict: APPROVE | P0:0 P1:0 P2:0 P3:0

No issues found. All changes are well-tested (256 lines of new tests) and the design is sound.

Confirmed correct

  • KillIfIdle: TOCTOU between shouldKill check and ForceKill is documented as benign — ESRCH on dead PGID is harmless ✓
  • Kill() / Terminate() identity: Both call KillIfIdle() after release(). Comment explains the distinction exists for interface conformance ✓
  • ResetContext: Terminate → clear state → Start(savedSession) flow is correct. Failure path marks worker as safely inert (closed=true, doneCh closed, releaseOnce fresh for subsequent Kill) ✓
  • pgid caching: Set in startProcessLocked, cleared on all cleanup paths (Shutdown, startProcess error, monitorProcess exit) ✓
  • origSession: Preserved from Start(), used in ResetContext to re-establish thread ✓
  • Tests: Comprehensive coverage — KillIfIdle (idle/refs>0/with-pgid), ResetContext (state cleanup, restart from saved session), integration (Start→Reset→new threadID) ✓

🤖 Generated with Claude Code

@hrygo hrygo merged commit 6f5bf39 into main May 30, 2026
6 checks passed
@hrygo hrygo deleted the fix/codexcli-reset-zombie-575 branch May 30, 2026 06:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

fix(worker/codexcli): reset kills session + zombie leaves process running 30m

2 participants