companion jobs stuck as status:running when foreground task receives SIGTERM (no signal handler) #228

@estevaoantuness

Description

Summary

The codex-companion.mjs foreground entrypoint registers no signal handlers. When the parent (Claude Code, a shell, a CI runner, etc.) sends SIGTERM/SIGINT/SIGHUP mid-job, Node exits immediately and the job's state.json entry is frozen at status: "running" with a stale pid. The job is never reconciled to failed/terminated, so the workspace accumulates orphan "running" jobs that block status UIs and resume logic. The companion's broker (app-server-broker.mjs) already handles these signals — only the foreground runner is missing equivalent coverage.

Reproduction

  1. From a Claude Code session (or any shell), invoke a foreground job, e.g. /codex:rescue which executes node scripts/codex-companion.mjs task ... in the foreground.
  2. While the job is mid-turn (status running, phase running), cancel the bash tool (ESC in Claude Code) or send kill -TERM <pid> to the companion process.
  3. Inspect ~/.claude/plugins/data/codex-openai-codex/state/<workspace-hash>/state.json.
  4. Observe the job entry still shows "status": "running", "phase": "running", and a "pid" whose process no longer exists (process.kill(pid, 0) throws ESRCH).
  5. Re-run node scripts/codex-companion.mjs status — the orphan continues to be reported as in-flight.
  6. Repeat the cancel cycle a few times and the same workspace accumulates multiple orphan jobs, all from the same sessionId.
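The liveness check in step 4 can be scripted. A minimal standalone probe (no plugin code involved; signal `0` only performs the existence/permission check without delivering anything):

```javascript
// Probe whether a pid refers to a live process. Signal 0 performs the
// existence/permission check without sending an actual signal.
function isPidAlive(pid) {
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    // ESRCH: no such process (the orphan case).
    // EPERM: the process exists but is owned by another user — still alive.
    return err.code === "EPERM";
  }
}

console.log(isPidAlive(process.pid)); // true — the current process is alive
```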

Real evidence

~/.claude/plugins/data/codex-openai-codex/state/superbem-a2d3d65760f87d26/state.json accumulated 3 orphan jobs (task-mnyhf4jl-1mnkv1, task-mnye71qk-5tib84, task-mnydlu24-x1oney), all sharing sessionId: ef83482e-cb59-4e3c-b16a-b8613be08da8, all stuck at status: "running" / phase: "running" with dead PIDs (14225, 12947, ...). Each was launched via /codex:rescue and cancelled mid-turn from the parent Claude Code session.

Expected vs Actual

Expected: When the foreground companion receives SIGTERM/SIGINT/SIGHUP, the in-flight job should transition to status: "failed" (or "cancelled") with phase: "terminated", pid: null, and a completedAt timestamp before the process exits, mirroring the cleanup performed by the try/catch in runTrackedJob and the broker's shutdown.

Actual: Node exits immediately on the default signal disposition. The job record written at runTrackedJob start (status: "running", phase: "starting" → progressed to "running", with the live pid) is never updated. The job remains permanently running from the state's perspective.

Root cause

scripts/lib/tracked-jobs.mjs line 142 — runTrackedJob writes the running record up-front and only reconciles it via try/catch around await runner():

// scripts/lib/tracked-jobs.mjs:142
export async function runTrackedJob(job, runner, options = {}) {
  const runningRecord = {
    ...job,
    status: "running",
    startedAt: nowIso(),
    phase: "starting",
    pid: process.pid,
    logFile: options.logFile ?? job.logFile ?? null
  };
  writeJobFile(job.workspaceRoot, job.id, runningRecord);
  upsertJob(job.workspaceRoot, runningRecord);

  try {
    const execution = await runner();
    // ... writes status: completed | failed
  } catch (error) {
    // ... writes status: failed
  }
}

A try/catch cannot observe an external signal — Node's default disposition for SIGTERM/SIGINT/SIGHUP terminates the process before the catch block can run, so the stored record is never reconciled.

scripts/codex-companion.mjs has zero signal handlers (verified: only process.exitCode at line 608 and process.exitCode = 1 at line 1006). Compare with scripts/app-server-broker.mjs which correctly installs handlers:

// scripts/app-server-broker.mjs:236
process.on("SIGTERM", async () => {
  await shutdown(server);
  process.exit(0);
});

// scripts/app-server-broker.mjs:241
process.on("SIGINT", async () => {
  await shutdown(server);
  process.exit(0);
});

The foreground runner needs the analogous treatment, scoped to the active tracked job.

Suggested fix

Register signal handlers inside runTrackedJob that flush the job record to a terminal state before exiting. Example diff against scripts/lib/tracked-jobs.mjs:

 export async function runTrackedJob(job, runner, options = {}) {
   const runningRecord = {
     ...job,
     status: "running",
     startedAt: nowIso(),
     phase: "starting",
     pid: process.pid,
     logFile: options.logFile ?? job.logFile ?? null
   };
   writeJobFile(job.workspaceRoot, job.id, runningRecord);
   upsertJob(job.workspaceRoot, runningRecord);

+  const terminate = (signal) => {
+    try {
+      const existing = readStoredJobOrNull(job.workspaceRoot, job.id) ?? runningRecord;
+      const completedAt = nowIso();
+      writeJobFile(job.workspaceRoot, job.id, {
+        ...existing,
+        status: "failed",
+        phase: "terminated",
+        errorMessage: `Terminated by ${signal}`,
+        pid: null,
+        completedAt,
+        logFile: options.logFile ?? job.logFile ?? existing.logFile ?? null
+      });
+      upsertJob(job.workspaceRoot, {
+        id: job.id,
+        status: "failed",
+        phase: "terminated",
+        pid: null,
+        errorMessage: `Terminated by ${signal}`,
+        completedAt
+      });
+      appendLogLine(options.logFile ?? job.logFile ?? null, `Terminated by ${signal}.`);
+    } catch {
+      // best-effort flush; never block exit
+    } finally {
+      // Conventional 128 + signal-number exit codes: HUP=1, INT=2, TERM=15.
+      const exitCodes = { SIGHUP: 129, SIGINT: 130, SIGTERM: 143 };
+      process.exit(exitCodes[signal] ?? 1);
+    }
+  };
+
+  const signals = ["SIGTERM", "SIGINT", "SIGHUP"];
+  const handlers = signals.map((sig) => {
+    const handler = () => terminate(sig);
+    process.on(sig, handler);
+    return [sig, handler];
+  });
+
   try {
     const execution = await runner();
     // ...
     return execution;
   } catch (error) {
     // ...
     throw error;
+  } finally {
+    for (const [sig, handler] of handlers) {
+      process.removeListener(sig, handler);
+    }
   }
 }

Notes:

  • Handlers must be registered after the initial writeJobFile/upsertJob so the terminal flush always sees a known record.
  • The handler is intentionally synchronous (writeFileSync is already used throughout state.mjs and tracked-jobs.mjs) so the flush completes before process.exit.
  • Removing handlers in finally avoids leaks if multiple runTrackedJob invocations share a process (e.g. the worker subcommand).
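As a sanity check of the flush-before-exit behavior, here is a standalone sketch — a temp file stands in for the real state.json, the child signals itself for determinism, and 143 follows the 128 + signal-number convention (SIGTERM = 15):

```javascript
import { spawnSync } from "node:child_process";
import { readFileSync, rmSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

const marker = join(tmpdir(), `sigterm-flush-${process.pid}.json`);

// Child: installs a synchronous SIGTERM handler that flushes a terminal
// record with writeFileSync before exiting 143. The interval keeps the
// event loop alive so the self-sent signal is actually delivered.
const childSrc = `
  import { writeFileSync } from "node:fs";
  process.on("SIGTERM", () => {
    writeFileSync(${JSON.stringify(marker)}, JSON.stringify({
      status: "failed", phase: "terminated", pid: null
    }));
    process.exit(143);
  });
  setInterval(() => {}, 1000);          // keep the loop alive, like a real job
  process.kill(process.pid, "SIGTERM"); // simulate the parent's kill -TERM
`;

const result = spawnSync(process.execPath, ["--input-type=module", "-e", childSrc]);
const record = JSON.parse(readFileSync(marker, "utf8"));
rmSync(marker);

console.log(result.status, record.status, record.phase); // 143 failed terminated
```

The synchronous write completes before `process.exit`, so the record survives even though the process is torn down immediately afterwards.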

Also suggested: orphan sweeper at startup

Even with the handler in place, jobs from prior crashes (SIGKILL, OOM, power loss, pre-fix versions) will remain. Add a one-shot sweeper to codex-companion.mjs::main (before the switch (subcommand) dispatch) that walks state.jobs, filters status === "running", and probes liveness with process.kill(pid, 0):

function sweepOrphanJobs(workspaceRoot) {
  const jobs = listJobs(workspaceRoot);
  for (const job of jobs ?? []) {
    if (job.status !== "running" || !job.pid) continue;
    let alive = false;
    try {
      process.kill(job.pid, 0);
      alive = true;
    } catch (err) {
      alive = err && err.code === "EPERM"; // exists but not ours
    }
    if (alive) continue;

    const completedAt = nowIso();
    upsertJob(workspaceRoot, {
      id: job.id,
      status: "failed",
      phase: "orphaned",
      pid: null,
      errorMessage: `Orphaned: pid ${job.pid} no longer alive`,
      completedAt
    });
    const jobFile = resolveJobFile(workspaceRoot, job.id);
    if (fs.existsSync(jobFile)) {
      const stored = readJobFile(jobFile);
      writeJobFile(workspaceRoot, job.id, {
        ...stored,
        status: "failed",
        phase: "orphaned",
        pid: null,
        errorMessage: `Orphaned: pid ${job.pid} no longer alive`,
        completedAt
      });
    }
  }
}

async function main() {
  const [subcommand, ...argv] = process.argv.slice(2);
  // ... usage handling ...
  const workspaceRoot = resolveWorkspaceRoot(process.cwd()); // existing helper from lib/workspace.mjs
  if (workspaceRoot) sweepOrphanJobs(workspaceRoot);
  switch (subcommand) { /* ... */ }
}

This guarantees the workspace self-heals on the next companion invocation regardless of how the previous run died, and it's cheap (one process.kill(pid, 0) per running job).
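The sweep's decision logic can be exercised without the plugin's state layer. In the sketch below the state helpers are replaced by a plain in-memory array (listJobs, upsertJob, etc. are the real helpers — everything here is stand-in scaffolding), and a short-lived child process supplies a genuinely dead pid:

```javascript
import { spawnSync } from "node:child_process";

// A child that exits immediately gives us a pid that is (almost certainly)
// dead by the time spawnSync returns.
const dead = spawnSync(process.execPath, ["-e", ""]);

// In-memory stand-ins for the plugin's job records.
const jobs = [
  { id: "job-live",   status: "running",   pid: process.pid }, // our own pid: alive
  { id: "job-orphan", status: "running",   pid: dead.pid },    // dead child: orphan
  { id: "job-done",   status: "completed", pid: null }         // ignored by the sweep
];

// Same probe-and-mark decision as sweepOrphanJobs, minus the file I/O.
function sweep(jobList) {
  for (const job of jobList) {
    if (job.status !== "running" || !job.pid) continue;
    let alive = false;
    try {
      process.kill(job.pid, 0);
      alive = true;
    } catch (err) {
      alive = err.code === "EPERM"; // exists but owned by another user
    }
    if (!alive) {
      job.status = "failed";
      job.phase = "orphaned";
      job.pid = null;
    }
  }
}

sweep(jobs);
console.log(jobs.map((j) => j.status)); // [ 'running', 'failed', 'completed' ]
```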

Environment

  • Plugin: openai-codex/codex v1.0.0 (installed via Claude Code plugin cache at ~/.claude/plugins/cache/openai-codex/codex/1.0.0/)
  • OS: macOS, Darwin 24.6.0
  • Node: v24.8.0 (host shell node --version; companion inherits process.execPath from the Claude Code CLI, typically system Node / nvm)
  • Host: Claude Code CLI invoking the companion via the /codex:rescue skill (foreground bash tool)
  • Reproducible workspace: ~/.claude/plugins/data/codex-openai-codex/state/superbem-a2d3d65760f87d26/
