No OS-level process liveness check — stale 'running' state after worker death (root cause for #164, #202) #222

## Summary

When a Codex worker process dies unexpectedly (SIGKILL, OOM, broker socket disconnect, app-server crash), the companion job state file retains `status: "running"` indefinitely. **No code path in the companion checks whether the stored PID is actually alive.** This is the shared root cause behind #164 (review stuck in "running") and #202 (zombie job blocks subsequent tasks), plus a third unreported symptom: `result <job-id>` returning "No job found" while `status <job-id>` reports the full (stale) job object.

## Three Symptoms, One Root Cause

| Symptom | Affected command | Existing issue |
|---------|-----------------|----------------|
| `--wait` poll loop spins until timeout against a dead process | `status --wait` | #164 |
| "Task X is still running" blocks all subsequent task calls | `task` (via `resolveLatestTrackedTaskThread`) | #202 |
| `result <job-id>` says "No job found" while `status <job-id>` returns the job | `result` vs `status` | *(new)* |

## Root Cause Analysis

### 1. Detached worker spawned with zero crash monitoring

`spawnDetachedTaskWorker()` in `codex-companion.mjs` (lines 613–624):

```javascript
function spawnDetachedTaskWorker(cwd, jobId) {
  const child = spawn(process.execPath, [...], {
    detached: true,      // own process group
    stdio: "ignore",     // parent never sees output
    windowsHide: true
  });
  child.unref();         // parent drops all references
  return child;
}
```

- No `child.on('exit', ...)` handler
- No `child.on('error', ...)` handler
- Parent has **no visibility** into worker lifecycle after spawn
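
As a rough illustration of the missing handlers, the sketch below attaches `exit` and `error` listeners to a detached child. `spawnMonitoredWorker` and `onWorkerGone` are hypothetical names, not part of the companion. Because the child is still `unref()`'d, a short-lived parent can exit before either event fires, so listeners like these would only complement a PID probe, not replace it.

```javascript
import { spawn } from "node:child_process";

// Hypothetical variant of spawnDetachedTaskWorker(); onWorkerGone is an assumed
// callback that would mark the job as failed in the state file.
function spawnMonitoredWorker(args, onWorkerGone) {
  const child = spawn(process.execPath, args, {
    detached: true,
    stdio: "ignore",
    windowsHide: true,
  });

  // "exit" may or may not follow "error", so report at most once.
  let reported = false;
  const report = (details) => {
    if (reported) return;
    reported = true;
    onWorkerGone({ pid: child.pid, ...details });
  };

  child.on("error", (error) => report({ error }));
  child.on("exit", (code, signal) => report({ code, signal }));

  child.unref(); // the parent may still exit before either event fires
  return child;
}

// Usage: a worker that exits immediately.
const worker = spawnMonitoredWorker(["-e", ""], (info) => {
  console.log("worker gone:", info);
});
console.log("spawned pid:", worker.pid);
```

Whether the callback runs depends on how long the parent stays alive, which is why the state file also needs a check that works from a later, unrelated invocation.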

### 2. `runTrackedJob()` catch block only handles JS errors

`tracked-jobs.mjs` (lines 142–204): The `try/catch` around `await runner()` handles JavaScript exceptions but **cannot fire** when the detached worker process is killed at the OS level (SIGKILL, OOM killer, etc.). The promise simply never settles, and the state file remains at `status: "running"`.
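
The effect can be reproduced in isolation. The following is a minimal sketch, assuming a POSIX platform: the child script is a stand-in with the same `try/catch/finally` shape, not the real `runTrackedJob()`, and it is killed with `SIGKILL` through the `timeout` and `killSignal` options of `spawnSync`.

```javascript
import { spawnSync } from "node:child_process";

// Stand-in child script with the same try/catch/finally shape around an awaited runner.
const childScript = `
  process.on("exit", () => console.log("exit handler ran"));
  (async () => {
    try {
      await new Promise(() => {});   // stands in for a runner that is still working
    } catch {
      console.log("catch ran");
    } finally {
      console.log("finally ran");
    }
  })();
  setInterval(() => {}, 1000);       // keep the child's event loop alive
`;

// spawnSync sends killSignal to the child once the timeout elapses.
const result = spawnSync(process.execPath, ["-e", childScript], {
  encoding: "utf8",
  timeout: 500,
  killSignal: "SIGKILL",
});

console.log("terminating signal:", result.signal);
console.log("child output:", JSON.stringify(result.stdout));
```

On a POSIX system the child is expected to end with `SIGKILL` and produce no output: neither the `catch`, the `finally`, nor the `exit` listener gets a chance to run, so nothing inside the worker can update the state file.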

### 3. No PID liveness probe in any status/query path

The following functions read the state file and return `status` as-is, **without checking if `job.pid` is alive**:

| Function | File | Lines |
|----------|------|-------|
| `enrichJob()` | `job-control.mjs` | 161–181 |
| `buildSingleJobSnapshot()` | `job-control.mjs` | 242–254 |
| `waitForSingleJobSnapshot()` | `codex-companion.mjs` | 293–309 |
| `resolveLatestTrackedTaskThread()` | `codex-companion.mjs` | 312–320 |
| `resolveResultJob()` | `job-control.mjs` | 256–279 |

**Zero occurrences** of `kill(pid, 0)` or `ps -p` exist outside `terminateProcessTree()`, which is only called by `handleCancel()`.

### 4. `result` vs `status` inconsistency

`resolveResultJob()` only returns jobs whose status is `completed | failed | cancelled` (line 258). A zombie "running" job is caught at line 269 and throws `"Job X is still running."` Without a reference argument, however, the lookup is narrowed to current-session jobs only (line 258: `filterJobsForCurrentSession`). If the stale job belongs to a previous session, `result` reports "No finished job found" while `status`, which reads all jobs, still returns it as "running". The two commands therefore appear to contradict each other.
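
The split can be modelled with two simplified lookups. This is an illustrative sketch only: the record shape, the `sessionId` field, and the function names `findForStatus` and `findLatestResult` are assumptions, not the companion's actual code.

```javascript
// Hypothetical, simplified job records; sessionId is an assumed field name.
const jobs = [
  { id: "task-a", status: "running", sessionId: "session-1" }, // stale: worker died
];

const FINISHED = new Set(["completed", "failed", "cancelled"]);

// status-like lookup: searches every stored job, regardless of session.
function findForStatus(allJobs, jobId) {
  return allJobs.find((job) => job.id === jobId) ?? null;
}

// result-like lookup without a reference: current session, finished jobs only.
function findLatestResult(allJobs, currentSessionId) {
  const finished = allJobs.filter(
    (job) => job.sessionId === currentSessionId && FINISHED.has(job.status)
  );
  return finished.at(-1) ?? null;
}

// A new session looks at the stale job left behind by the previous one.
const statusView = findForStatus(jobs, "task-a");
const resultView = findLatestResult(jobs, "session-2");

console.log(statusView?.status); // the status view still sees the stale record
console.log(resultView);         // the result view finds nothing to report
```

In this model the status view returns the stale `running` record while the result view returns nothing, which mirrors the contradiction described above.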

## Real-World Evidence

### Incident 1: adversarial-review on macOS (2026-04-11)

```
Job:       review-mnugvp3h-uy1ly9
Kind:      adversarial-review
Created:   2026-04-11T15:05:41.837Z
Updated:   2026-04-11T15:07:23.215Z  ← last state write
Cancelled: 2026-04-11T15:27:20.275Z  ← manual cancel 20 min later
```

- Backend PID died silently at ~15:07.
- `status` reported "running" for 20 minutes until manual `/codex:cancel`.
- `result <job-id>` returned "No job found" during the stale window.
- `ps -p <pid>` confirmed the process was dead (no row returned).

### Incident 2: task broker disconnect (2026-04-07)

```
[2026-04-07T05:40:45.631Z] Codex turn interrupt failed: connect ECONNREFUSED .../broker.sock
[2026-04-07T05:40:45.632Z] Cancelled by user.
```

Broker socket became unreachable — same pattern: process dead, state file stale, only resolved by manual cancel.

## Suggested Fix

Add an `isProcessAlive(pid)` utility and call it in the status/query path:

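```javascript
// lib/process.mjs
export function isProcessAlive(pid) {
  // Reject non-integers and pid <= 0: kill() treats 0 and negative values
  // as process groups, which would report "alive" for a missing worker.
  if (!Number.isInteger(pid) || pid <= 0) return false;
  try {
    process.kill(pid, 0); // signal 0 only tests for existence, nothing is delivered
    return true;
  } catch (e) {
    return e.code === 'EPERM'; // exists but no permission
  }
}
```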

Then in `enrichJob()` or `buildSingleJobSnapshot()`:

```javascript
import { isProcessAlive } from "./process.mjs";

// upsertJob, writeJobFile and resolveWorkspaceRoot are the existing state helpers,
// assumed to be in scope here already.
export function enrichJob(job, options = {}) {
  // Auto-reap zombie jobs: marked active, but the recorded worker PID is gone
  if (
    (job.status === "running" || job.status === "queued") &&
    job.pid != null &&
    !isProcessAlive(job.pid)
  ) {
    const reaped = {
      ...job,
      status: "failed",
      phase: "failed",
      pid: null,
      errorMessage: "Worker process exited without updating job status (zombie reap).",
      completedAt: new Date().toISOString()
    };
    // Persist the reap so it doesn't recur
    const workspaceRoot = job.workspaceRoot ?? resolveWorkspaceRoot(process.cwd());
    upsertJob(workspaceRoot, reaped);
    writeJobFile(workspaceRoot, job.id, reaped);
    return enrichJob(reaped, options); // re-enrich with corrected status
  }
  // ... existing enrichment logic ...
}
```

This fixes all three symptoms at the source:
- `status --wait` will see the reap and stop polling
- `resolveLatestTrackedTaskThread()` will no longer block on a dead task
- `result` will find the job in `failed` state instead of "still running"
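
One way to keep the reap decision testable without real processes is to separate it from persistence. The sketch below assumes a hypothetical pure helper, `reapIfDead`, that takes the liveness probe and the clock as parameters; the name and signature are illustrative, not existing code.

```javascript
const ACTIVE = new Set(["running", "queued"]);

// Pure reap decision: returns the job unchanged, or a failed copy if its worker is gone.
// isAlive and now are injected so the logic can be exercised without real processes.
function reapIfDead(job, isAlive, now = () => new Date().toISOString()) {
  if (!ACTIVE.has(job.status) || job.pid == null || isAlive(job.pid)) {
    return job;
  }
  return {
    ...job,
    status: "failed",
    phase: "failed",
    pid: null,
    errorMessage: "Worker process exited without updating job status (zombie reap).",
    completedAt: now(),
  };
}

// Usage with stubbed probes: one worker is gone, one is still alive.
const stale = { id: "task-a", status: "running", pid: 4242 };
const reaped = reapIfDead(stale, () => false, () => "2026-04-11T15:07:23.215Z");
const untouched = reapIfDead(stale, () => true);

console.log(reaped.status);      // status after the probe reports the worker gone
console.log(untouched === stale); // a live worker leaves the record as-is
```

The caller would then persist the returned object only when it differs from the input, which keeps the write path out of read-only queries for healthy jobs.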

## Environment

- codex-plugin-cc: v1.0.2 (local cache), v1.0.3 available
- Codex CLI: v0.118.0
- Claude Code: latest
- Platform: macOS Darwin 24.6.0 (Apple Silicon)
- Node.js: v22.21.1

## Related Issues

- #164 — Review jobs stuck in 'running' after process death (symptom 1)
- #202 — Zombie running job blocks all subsequent task calls (symptom 2)
- #82 — Cancel scoping across sessions (related state management issue)
- #141 — App-server crash on macOS (one of the triggers for unexpected process death)
