## Summary

When a Codex worker process dies unexpectedly (SIGKILL, OOM, broker socket disconnect, app-server crash), the companion job state file retains `status: "running"` indefinitely. No code path in the companion checks whether the stored PID is actually alive. This is the shared root cause behind #164 (review stuck in "running") and #202 (zombie job blocks subsequent tasks), plus a third unreported symptom: `result <job-id>` returning "No job found" while `status <job-id>` reports the full (stale) job object.
## Three Symptoms, One Root Cause

| Symptom | Affected command | Existing issue |
| --- | --- | --- |
| `--wait` poll loop spins until timeout against a dead process | `status --wait` | #164 |
| "Task X is still running" blocks all subsequent `task` calls | `task` (via `resolveLatestTrackedTaskThread`) | #202 |
| `result <job-id>` says "No job found" while `status <job-id>` returns the job | `result` vs `status` | (new) |
## Root Cause Analysis
### 1. Detached worker spawned with zero crash monitoring

`spawnDetachedTaskWorker()` in `codex-companion.mjs` (lines 613–624):

```js
function spawnDetachedTaskWorker(cwd, jobId) {
  const child = spawn(process.execPath, [...], {
    detached: true,    // own process group
    stdio: "ignore",   // parent never sees output
    windowsHide: true
  });
  child.unref();       // parent drops all references
  return child;
}
```

- No `child.on('exit', ...)` handler
- No `child.on('error', ...)` handler
- Parent has no visibility into the worker lifecycle after spawn
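For contrast, here is a minimal sketch of what crash monitoring could look like while the spawning process is still alive. `spawnMonitoredWorker` and its `onCrash` callback are hypothetical illustrations, not companion APIs; the point is that a detached, `unref()`'d child still emits `'exit'` as long as the parent has not itself exited:

```js
import { spawn } from "node:child_process";

// Hypothetical sketch: spawn a detached worker but keep a crash hook.
// onCrash is a placeholder callback, not a real companion API.
function spawnMonitoredWorker(cmd, args, onCrash) {
  const child = spawn(cmd, args, { detached: true, stdio: "ignore" });
  child.on("error", (err) => onCrash(`spawn failed: ${err.message}`));
  child.on("exit", (code, signal) => {
    // A clean exit is code 0 with no terminating signal; anything else is a crash.
    if (code !== 0 || signal) onCrash(`worker died: code=${code} signal=${signal}`);
  });
  child.unref(); // still lets the parent exit independently
  return child;
}
```

This only helps while the spawning process is alive; once the parent exits, only a PID liveness probe like the one in the Suggested Fix section can detect the crash after the fact.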
### 2. `runTrackedJob()` catch block only handles JS errors

`tracked-jobs.mjs` (lines 142–204): the `try/catch` around `await runner()` handles JavaScript exceptions but cannot fire when the detached worker process is killed at the OS level (SIGKILL, OOM killer, etc.). The promise simply never settles, and the state file remains at `status: "running"`.
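The gap can be reproduced in isolation. In this standalone sketch (not companion code), a child wraps its work in `try/catch/finally` and is then SIGKILLed from outside; neither the `catch` nor the `finally` runs, which is exactly why `runTrackedJob()` never gets a chance to write a terminal status:

```js
import { spawn } from "node:child_process";

// Child: wraps a long-running "runner" in try/catch/finally.
const script = `
  try {
    process.send("started");
    await new Promise((r) => setTimeout(r, 60_000)); // the "runner"
    process.send("done");
  } catch { process.send("caught"); }   // never runs on SIGKILL
  finally { process.send("finally"); }  // never runs on SIGKILL
`;
const child = spawn(process.execPath, ["--input-type=module", "-e", script], {
  stdio: ["ignore", "ignore", "ignore", "ipc"] // fd 3: IPC channel for process.send
});
const received = [];
child.on("message", (msg) => {
  received.push(msg);
  if (msg === "started") child.kill("SIGKILL"); // kill at the OS level
});
await new Promise((resolve) => child.on("exit", resolve));
console.log(received); // only ["started"]: no "caught", no "finally"
```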
### 3. No PID liveness probe in any status/query path

The following functions read the state file and return `status` as-is, without checking whether `job.pid` is alive:

| Function | File | Lines |
| --- | --- | --- |
| `enrichJob()` | `job-control.mjs` | 161–181 |
| `buildSingleJobSnapshot()` | `job-control.mjs` | 242–254 |
| `waitForSingleJobSnapshot()` | `codex-companion.mjs` | 293–309 |
| `resolveLatestTrackedTaskThread()` | `codex-companion.mjs` | 312–320 |
| `resolveResultJob()` | `job-control.mjs` | 256–279 |
Zero occurrences of `kill(pid, 0)` or `ps -p` exist outside `terminateProcessTree()`, which is only called by `handleCancel()`.
### 4. `result` vs `status` inconsistency

`resolveResultJob()` (line 258) filters for `completed | failed | cancelled`. A zombie "running" job is caught at line 269 and throws "Job X is still running." But without a reference argument, it filters to current-session jobs only (line 258: `filterJobsForCurrentSession`). If the stale job belongs to a previous session, `result` says "No finished job found" while `status` (which reads all jobs) returns it as "running", so the two commands contradict each other.
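The split is easiest to see with a toy model. The job shape and session filter below are simplified stand-ins for the real `filterJobsForCurrentSession` logic, not companion code:

```js
// Toy model: one stale "running" job left over from a previous session.
const jobs = [{ id: "review-old", status: "running", sessionId: "s1" }];
const currentSession = "s2";

// status: reads all jobs, so it finds the zombie and reports "running".
const statusView = jobs.find((j) => j.id === "review-old");

// result: session-filters first, then keeps only finished jobs: nothing left.
const finished = jobs
  .filter((j) => j.sessionId === currentSession)
  .filter((j) => ["completed", "failed", "cancelled"].includes(j.status));

console.log(statusView.status); // "running"
console.log(finished.length);   // 0, i.e. "No finished job found"
```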
## Real-World Evidence
### Incident 1: adversarial-review on macOS (2026-04-11)

```
Job:       review-mnugvp3h-uy1ly9
Kind:      adversarial-review
Created:   2026-04-11T15:05:41.837Z
Updated:   2026-04-11T15:07:23.215Z  ← last state write
Cancelled: 2026-04-11T15:27:20.275Z  ← manual cancel 20 min later
```

- Backend PID died silently at ~15:07.
- `status` reported "running" for 20 minutes until a manual `/codex:cancel`.
- `result <job-id>` returned "No job found" during the stale window.
- `ps -p <pid>` confirmed the process was dead (no row returned).
### Incident 2: task broker disconnect (2026-04-07)

```
[2026-04-07T05:40:45.631Z] Codex turn interrupt failed: connect ECONNREFUSED .../broker.sock
[2026-04-07T05:40:45.632Z] Cancelled by user.
```

Broker socket became unreachable. Same pattern: process dead, state file stale, only resolved by manual cancel.
## Suggested Fix

Add an `isProcessAlive(pid)` utility and call it in the status/query path:

```js
// lib/process.mjs
export function isProcessAlive(pid) {
  if (!Number.isFinite(pid)) return false;
  try {
    process.kill(pid, 0); // signal 0 probes existence without delivering a signal
    return true;
  } catch (e) {
    return e.code === 'EPERM'; // process exists but we lack permission to signal it
  }
}
```
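A quick sanity check of the probe. The utility is re-declared here so the snippet runs standalone, and the dead-PID check assumes no immediate PID reuse:

```js
import { spawnSync } from "node:child_process";

// Re-declared from lib/process.mjs so this snippet is self-contained.
function isProcessAlive(pid) {
  if (!Number.isFinite(pid)) return false;
  try {
    process.kill(pid, 0); // signal 0: existence probe, sends nothing
    return true;
  } catch (e) {
    return e.code === "EPERM"; // exists, but owned by another user
  }
}

const dead = spawnSync(process.execPath, ["-e", ""]).pid; // child has already exited
console.log(isProcessAlive(process.pid)); // true
console.log(isProcessAlive(dead));        // false (barring PID reuse)
console.log(isProcessAlive(null));        // false
```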
Then in `enrichJob()` or `buildSingleJobSnapshot()`:

```js
import { isProcessAlive } from "./process.mjs";

export function enrichJob(job, options = {}) {
  // Auto-reap zombie jobs: active status but a dead PID
  if (
    (job.status === "running" || job.status === "queued") &&
    job.pid != null &&
    !isProcessAlive(job.pid)
  ) {
    const reaped = {
      ...job,
      status: "failed",
      phase: "failed",
      pid: null,
      errorMessage: "Worker process exited without updating job status (zombie reap).",
      completedAt: new Date().toISOString()
    };
    // Persist the reap so it doesn't recur on the next read
    upsertJob(job.workspaceRoot ?? resolveWorkspaceRoot(process.cwd()), reaped);
    writeJobFile(job.workspaceRoot ?? resolveWorkspaceRoot(process.cwd()), job.id, reaped);
    return enrichJob(reaped, options); // re-enrich with the corrected status
  }
  // ... existing enrichment logic ...
}
```
This fixes all three symptoms at the source:

- `status --wait` will see the reap and stop polling
- `resolveLatestTrackedTaskThread()` will no longer block on a dead task
- `result` will find the job in `failed` state instead of "still running"
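The reap decision itself is pure and easy to unit-test in isolation. This sketch strips out persistence and injects the liveness check; the job shape follows the patch above:

```js
// Pure core of the reap: no persistence, injectable liveness check.
function reapIfZombie(job, isAlive) {
  const active = job.status === "running" || job.status === "queued";
  if (!active || job.pid == null || isAlive(job.pid)) return job;
  return {
    ...job,
    status: "failed",
    phase: "failed",
    pid: null,
    errorMessage: "Worker process exited without updating job status (zombie reap).",
    completedAt: new Date().toISOString()
  };
}

const zombie = { id: "review-x", status: "running", pid: 12345 };
console.log(reapIfZombie(zombie, () => false).status); // "failed"
console.log(reapIfZombie(zombie, () => true).status);  // "running"
console.log(reapIfZombie({ id: "done", status: "completed" }, () => false).status); // "completed"
```

Separating the decision from the persistence keeps the dead-PID branch testable without touching the workspace state files.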
## Environment
- codex-plugin-cc: v1.0.2 (local cache), v1.0.3 available
- Codex CLI: v0.118.0
- Claude Code: latest
- Platform: macOS Darwin 24.6.0 (Apple Silicon)
- Node.js: v22.21.1
## Related Issues

- #164: review stuck in "running"
- #202: zombie job blocks subsequent tasks