No OS-level process liveness check — stale 'running' state after worker death (root cause for #164, #202) #222

@bowenQT

Summary

When a Codex worker process dies unexpectedly (SIGKILL, OOM, broker socket disconnect, app-server crash), the companion job state file retains status: "running" indefinitely. No code path in the companion checks whether the stored PID is actually alive. This is the shared root cause behind #164 (review stuck in "running") and #202 (zombie job blocks subsequent tasks), plus a third unreported symptom: result <job-id> returning "No job found" while status <job-id> reports the full (stale) job object.

Three Symptoms, One Root Cause

| Symptom | Affected command | Existing issue |
| --- | --- | --- |
| --wait poll loop spins until timeout against a dead process | status --wait | #164 |
| "Task X is still running" blocks all subsequent task calls | task (via resolveLatestTrackedTaskThread) | #202 |
| result <job-id> says "No job found" while status <job-id> returns the job | result vs status | (new) |

Root Cause Analysis

1. Detached worker spawned with zero crash monitoring

spawnDetachedTaskWorker() in codex-companion.mjs (lines 613–624):

function spawnDetachedTaskWorker(cwd, jobId) {
  const child = spawn(process.execPath, [...], {
    detached: true,      // own process group
    stdio: "ignore",     // parent never sees output
    windowsHide: true
  });
  child.unref();         // parent drops all references
  return child;
}
  • No child.on('exit', ...) handler
  • No child.on('error', ...) handler
  • Parent has no visibility into worker lifecycle after spawn

2. runTrackedJob() catch block only handles JS errors

tracked-jobs.mjs (lines 142–204): The try/catch around await runner() handles JavaScript exceptions but cannot fire when the detached worker process is killed at the OS level (SIGKILL, OOM killer, etc.). The promise simply never settles, and the state file remains at status: "running".
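The failure mode can be reproduced in isolation. The following is a minimal sketch, not the companion's code: the worker script, the state file layout, and the variable names are all assumptions made for illustration. A child process writes status "running", awaits work that never finishes, and is then killed with SIGKILL. Neither branch of its try/catch runs, so the file keeps the stale status.

```javascript
import { spawnSync } from "node:child_process";
import { mkdtempSync, readFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Hypothetical worker: marks the job "running", then awaits work that never
// finishes. The catch block stands in for the JS-level failure handler.
const workerSource = `
  const fs = require("node:fs");
  const stateFile = process.argv[1];
  const write = (status) =>
    fs.writeFileSync(stateFile, JSON.stringify({ status, pid: process.pid }));
  write("running");
  setInterval(() => {}, 1000); // keep the event loop alive
  (async () => {
    try {
      await new Promise(() => {}); // simulated runner that never settles
      write("completed");
    } catch {
      write("failed"); // never reached when the process is killed
    }
  })();
`;

const stateFile = join(mkdtempSync(join(tmpdir(), "stale-job-")), "job.json");

// spawnSync kills the child with killSignal once the timeout elapses.
const result = spawnSync(process.execPath, ["-e", workerSource, stateFile], {
  timeout: 1500,
  killSignal: "SIGKILL",
});

// The worker is gone, but the last write it made is still on disk.
const staleState = JSON.parse(readFileSync(stateFile, "utf8"));
console.log(result.signal, staleState.status);
```

Any reader of the state file after this point sees "running" with a PID that no longer exists, which is the situation the status/query paths below need to detect.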

3. No PID liveness probe in any status/query path

The following functions read the state file and return status as-is, without checking if job.pid is alive:

| Function | File | Lines |
| --- | --- | --- |
| enrichJob() | job-control.mjs | 161–181 |
| buildSingleJobSnapshot() | job-control.mjs | 242–254 |
| waitForSingleJobSnapshot() | codex-companion.mjs | 293–309 |
| resolveLatestTrackedTaskThread() | codex-companion.mjs | 312–320 |
| resolveResultJob() | job-control.mjs | 256–279 |

Zero occurrences of kill(pid, 0) or ps -p exist outside terminateProcessTree(), which is only called by handleCancel().
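For reference, here is a small sketch of what a signal-0 probe reports. probePid is a hypothetical helper, not something in the plugin. process.kill(pid, 0) delivers no signal and only runs the existence and permission check, so a dead PID surfaces as ESRCH. One caveat: the OS can reuse PIDs, so a successful probe only shows that some process holds that PID, not that it is the original worker.

```javascript
import { spawnSync } from "node:child_process";

// Hypothetical helper: classify a PID by sending signal 0 to it.
function probePid(pid) {
  try {
    process.kill(pid, 0); // no signal is delivered; only the check runs
    return "alive";
  } catch (err) {
    if (err.code === "ESRCH") return "dead"; // no such process
    if (err.code === "EPERM") return "alive-no-permission"; // exists, owned by someone else
    throw err;
  }
}

// spawnSync returns only after the child has exited, so this PID is dead.
const exited = spawnSync(process.execPath, ["-e", ""]);

console.log(probePid(process.pid));
console.log(probePid(exited.pid));
```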

4. result vs status inconsistency

resolveResultJob() (line 258) filters for jobs whose status is completed, failed, or cancelled. A zombie "running" job is caught at line 269 and throws "Job X is still running." But without a reference argument, it filters to current-session jobs only (line 258: filterJobsForCurrentSession). If the stale job belongs to a previous session, result says "No finished job found" while status (which reads all jobs) returns it as "running". The two commands then appear to contradict each other, even though both are reading the same stale state.
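The split can be shown with a simplified model of the two lookups. This is an illustration only: the job shape, the sessionId field, and both function names are assumptions, and it covers only the no-reference path described above, not the real resolveResultJob().

```javascript
// Simplified model; field and function names are illustrative only.
const FINISHED = new Set(["completed", "failed", "cancelled"]);

const jobs = [
  // Stale job from an earlier session whose worker has died.
  { id: "review-1", sessionId: "session-old", status: "running" },
];

// status-style lookup: searches every job by id.
function lookupForStatus(allJobs, jobId) {
  return allJobs.find((job) => job.id === jobId) ?? null;
}

// result-style lookup without a reference: current session, finished jobs only.
function lookupForResult(allJobs, currentSessionId) {
  const finished = allJobs
    .filter((job) => job.sessionId === currentSessionId)
    .filter((job) => FINISHED.has(job.status));
  if (finished.length === 0) throw new Error("No finished job found");
  return finished[0];
}

const statusView = lookupForStatus(jobs, "review-1");

let resultError = null;
try {
  lookupForResult(jobs, "session-new");
} catch (err) {
  resultError = err.message;
}

// One command reports a running job, the other reports nothing to show.
console.log(statusView.status, "|", resultError);
```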

Real-World Evidence

Incident 1: adversarial-review on macOS (2026-04-11)

Job:       review-mnugvp3h-uy1ly9
Kind:      adversarial-review
Created:   2026-04-11T15:05:41.837Z
Updated:   2026-04-11T15:07:23.215Z  ← last state write
Cancelled: 2026-04-11T15:27:20.275Z  ← manual cancel 20 min later
  • Backend PID died silently at ~15:07.
  • status reported "running" for 20 minutes until manual /codex:cancel.
  • result <job-id> returned "No job found" during the stale window.
  • ps -p <pid> confirmed the process was dead (no row returned).

Incident 2: task broker disconnect (2026-04-07)

[2026-04-07T05:40:45.631Z] Codex turn interrupt failed: connect ECONNREFUSED .../broker.sock
[2026-04-07T05:40:45.632Z] Cancelled by user.

Broker socket became unreachable — same pattern: process dead, state file stale, only resolved by manual cancel.

Suggested Fix

Add an isProcessAlive(pid) utility and call it in the status/query path:

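// lib/process.mjs
export function isProcessAlive(pid) {
  // Reject 0 and negatives: kill(2) treats them as process-group targets.
  if (!Number.isInteger(pid) || pid <= 0) return false;
  try {
    process.kill(pid, 0); // signal 0: existence check only, nothing is delivered
    return true;
  } catch (e) {
    return e.code === 'EPERM'; // exists but no permission
  }
}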

Then in enrichJob() or buildSingleJobSnapshot():

import { isProcessAlive } from "./process.mjs";

export function enrichJob(job, options = {}) {
  // Auto-reap zombie jobs
  if (
    (job.status === "running" || job.status === "queued") &&
    job.pid != null &&
    !isProcessAlive(job.pid)
  ) {
    const reaped = {
      ...job,
      status: "failed",
      phase: "failed",
      pid: null,
      errorMessage: "Worker process exited without updating job status (zombie reap).",
      completedAt: new Date().toISOString()
    };
    // Persist the reap so it doesn't recur
    const workspaceRoot = job.workspaceRoot ?? resolveWorkspaceRoot(process.cwd());
    upsertJob(workspaceRoot, reaped);
    writeJobFile(workspaceRoot, job.id, reaped);
    return enrichJob(reaped, options); // re-enrich with corrected status
  }
  // ... existing enrichment logic ...
}

This fixes all three symptoms at the source:

  • status --wait will see the reap and stop polling
  • resolveLatestTrackedTaskThread() will no longer block on a dead task
  • result will find the job in failed state instead of "still running"
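To make the reap logic easy to unit-test, it could also be factored into a side-effect-free function with the liveness check injected. The sketch below is one possible shape, not a drop-in patch: reapIfDead, the fake liveness check, and the polling loop are all hypothetical, and persistence (upsertJob / writeJobFile) is deliberately left out.

```javascript
const ACTIVE = new Set(["running", "queued"]);

// Hypothetical pure variant: returns a reaped copy, or the job unchanged.
function reapIfDead(job, isAlive, now = new Date()) {
  if (!ACTIVE.has(job.status) || job.pid == null || isAlive(job.pid)) return job;
  return {
    ...job,
    status: "failed",
    phase: "failed",
    pid: null,
    errorMessage: "Worker process exited without updating job status (zombie reap).",
    completedAt: now.toISOString(),
  };
}

// Simulated poll: the fake worker "dies" before the third liveness check.
let checks = 0;
const fakeIsAlive = () => ++checks < 3;

let job = { id: "task-1", status: "running", pid: 4242 };
let polls = 0;
while (ACTIVE.has(job.status) && polls < 10) {
  job = reapIfDead(job, fakeIsAlive);
  polls += 1;
}

// The loop exits as soon as the reap flips the status, not at the poll cap.
console.log(job.status, polls);
```

Injecting isAlive keeps the test independent of real PIDs, and the same function could back enrichJob() with isProcessAlive passed in.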

Environment

  • codex-plugin-cc: v1.0.2 (local cache), v1.0.3 available
  • Codex CLI: v0.118.0
  • Claude Code: latest
  • Platform: macOS Darwin 24.6.0 (Apple Silicon)
  • Node.js: v22.21.1

Related Issues

  • #164: review stuck in "running"
  • #202: zombie job blocks subsequent tasks
