companion jobs stuck as status:running when foreground task receives SIGTERM (no signal handler) #228

@estevaoantuness

Description

Summary

The codex-companion.mjs foreground entrypoint registers no signal handlers. When the parent (Claude Code, a shell, a CI runner, etc.) sends SIGTERM/SIGINT/SIGHUP mid-job, Node exits immediately and the job's state.json entry is frozen at status: "running" with a stale pid. The job is never reconciled to failed/terminated, so the workspace accumulates orphan "running" jobs that block status UIs and resume logic. The companion's broker (app-server-broker.mjs) already handles these signals — only the foreground runner is missing equivalent coverage.

Reproduction

  1. From a Claude Code session (or any shell), invoke a foreground job, e.g. /codex:rescue which executes node scripts/codex-companion.mjs task ... in the foreground.
  2. While the job is mid-turn (status running, phase running), cancel the bash tool (ESC in Claude Code) or send kill -TERM <pid> to the companion process.
  3. Inspect ~/.claude/plugins/data/codex-openai-codex/state/<workspace-hash>/state.json.
  4. Observe the job entry still shows "status": "running", "phase": "running", and a "pid" whose process no longer exists (process.kill(pid, 0) throws ESRCH).
  5. Re-run node scripts/codex-companion.mjs status — the orphan continues to be reported as in-flight.
  6. Repeat the cancel cycle a few times and the same workspace accumulates multiple orphan jobs, all from the same sessionId.
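The liveness check in step 4 can be scripted. A minimal standalone probe (no plugin code involved; signal `0` only performs the existence/permission check without delivering anything):

```javascript
// Probe whether a pid refers to a live process. Signal 0 performs the
// existence/permission check without sending an actual signal.
function isPidAlive(pid) {
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    // ESRCH: no such process (the orphan case).
    // EPERM: the process exists but is owned by another user — still alive.
    return err.code === "EPERM";
  }
}

console.log(isPidAlive(process.pid)); // true — the current process is alive
```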

Real evidence

~/.claude/plugins/data/codex-openai-codex/state/superbem-a2d3d65760f87d26/state.json accumulated 3 orphan jobs (task-mnyhf4jl-1mnkv1, task-mnye71qk-5tib84, task-mnydlu24-x1oney), all sharing sessionId: ef83482e-cb59-4e3c-b16a-b8613be08da8, all stuck at status: "running" / phase: "running" with dead PIDs (14225, 12947, ...). Each was launched via /codex:rescue and cancelled mid-turn from the parent Claude Code session.

Expected vs Actual

Expected: When the foreground companion receives SIGTERM/SIGINT/SIGHUP, the in-flight job should transition to status: "failed" (or "cancelled") with phase: "terminated", pid: null, and a completedAt timestamp before the process exits, mirroring the cleanup performed by the try/catch in runTrackedJob and the broker's shutdown.

Actual: Node exits immediately on the default signal disposition. The job record written at runTrackedJob start (status: "running", phase: "starting" → progressed to "running", with the live pid) is never updated. The job remains permanently running from the state's perspective.

Root cause

scripts/lib/tracked-jobs.mjs line 142 — runTrackedJob writes the running record up-front and only reconciles it via try/catch around await runner():

// scripts/lib/tracked-jobs.mjs:142
export async function runTrackedJob(job, runner, options = {}) {
  const runningRecord = {
    ...job,
    status: "running",
    startedAt: nowIso(),
    phase: "starting",
    pid: process.pid,
    logFile: options.logFile ?? job.logFile ?? null
  };
  writeJobFile(job.workspaceRoot, job.id, runningRecord);
  upsertJob(job.workspaceRoot, runningRecord);

  try {
    const execution = await runner();
    // ... writes status: completed | failed
  } catch (error) {
    // ... writes status: failed
  }
}

A try/catch cannot observe an external signal — Node's default disposition for SIGTERM/SIGINT/SIGHUP terminates the process before the catch block can run, so the stored record is never reconciled.

scripts/codex-companion.mjs has zero signal handlers (verified: only process.exitCode at line 608 and process.exitCode = 1 at line 1006). Compare with scripts/app-server-broker.mjs which correctly installs handlers:

// scripts/app-server-broker.mjs:236
process.on("SIGTERM", async () => {
  await shutdown(server);
  process.exit(0);
});

// scripts/app-server-broker.mjs:241
process.on("SIGINT", async () => {
  await shutdown(server);
  process.exit(0);
});

The foreground runner needs the analogous treatment, scoped to the active tracked job.

Suggested fix

Register signal handlers inside runTrackedJob that flush the job record to a terminal state before exiting. Example diff against scripts/lib/tracked-jobs.mjs:

 export async function runTrackedJob(job, runner, options = {}) {
   const runningRecord = {
     ...job,
     status: "running",
     startedAt: nowIso(),
     phase: "starting",
     pid: process.pid,
     logFile: options.logFile ?? job.logFile ?? null
   };
   writeJobFile(job.workspaceRoot, job.id, runningRecord);
   upsertJob(job.workspaceRoot, runningRecord);

+  const terminate = (signal) => {
+    try {
+      const existing = readStoredJobOrNull(job.workspaceRoot, job.id) ?? runningRecord;
+      const completedAt = nowIso();
+      writeJobFile(job.workspaceRoot, job.id, {
+        ...existing,
+        status: "failed",
+        phase: "terminated",
+        errorMessage: `Terminated by ${signal}`,
+        pid: null,
+        completedAt,
+        logFile: options.logFile ?? job.logFile ?? existing.logFile ?? null
+      });
+      upsertJob(job.workspaceRoot, {
+        id: job.id,
+        status: "failed",
+        phase: "terminated",
+        pid: null,
+        errorMessage: `Terminated by ${signal}`,
+        completedAt
+      });
+      appendLogLine(options.logFile ?? job.logFile ?? null, `Terminated by ${signal}.`);
+    } catch {
+      // best-effort flush; never block exit
+    } finally {
+      // Conventional 128 + signal-number exit codes: HUP=1, INT=2, TERM=15.
+      const exitCodes = { SIGHUP: 129, SIGINT: 130, SIGTERM: 143 };
+      process.exit(exitCodes[signal] ?? 1);
+    }
+  };
+
+  const signals = ["SIGTERM", "SIGINT", "SIGHUP"];
+  const handlers = signals.map((sig) => {
+    const handler = () => terminate(sig);
+    process.on(sig, handler);
+    return [sig, handler];
+  });
+
   try {
     const execution = await runner();
     // ...
     return execution;
   } catch (error) {
     // ...
     throw error;
+  } finally {
+    for (const [sig, handler] of handlers) {
+      process.removeListener(sig, handler);
+    }
   }
 }

Notes:

  • Handlers must be registered after the initial writeJobFile/upsertJob so the terminal flush always sees a known record.
  • The handler is intentionally synchronous (writeFileSync is already used throughout state.mjs and tracked-jobs.mjs) so the flush completes before process.exit.
  • Removing handlers in finally avoids leaks if multiple runTrackedJob invocations share a process (e.g. the worker subcommand).
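As a sanity check of the flush-before-exit behavior, here is a standalone sketch — a temp file stands in for the real state.json, the child signals itself for determinism, and 143 follows the 128 + signal-number convention (SIGTERM = 15):

```javascript
import { spawnSync } from "node:child_process";
import { readFileSync, rmSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

const marker = join(tmpdir(), `sigterm-flush-${process.pid}.json`);

// Child: installs a synchronous SIGTERM handler that flushes a terminal
// record with writeFileSync before exiting 143. The interval keeps the
// event loop alive so the self-sent signal is actually delivered.
const childSrc = `
  import { writeFileSync } from "node:fs";
  process.on("SIGTERM", () => {
    writeFileSync(${JSON.stringify(marker)}, JSON.stringify({
      status: "failed", phase: "terminated", pid: null
    }));
    process.exit(143);
  });
  setInterval(() => {}, 1000);          // keep the loop alive, like a real job
  process.kill(process.pid, "SIGTERM"); // simulate the parent's kill -TERM
`;

const result = spawnSync(process.execPath, ["--input-type=module", "-e", childSrc]);
const record = JSON.parse(readFileSync(marker, "utf8"));
rmSync(marker);

console.log(result.status, record.status, record.phase); // 143 failed terminated
```

The synchronous write completes before `process.exit`, so the record survives even though the process is torn down immediately afterwards.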

Also suggested: orphan sweeper at startup

Even with the handler in place, jobs from prior crashes (SIGKILL, OOM, power loss, pre-fix versions) will remain. Add a one-shot sweeper to codex-companion.mjs::main (before the switch (subcommand) dispatch) that walks state.jobs, filters status === "running", and probes liveness with process.kill(pid, 0):

function sweepOrphanJobs(workspaceRoot) {
  const jobs = listJobs(workspaceRoot);
  for (const job of jobs ?? []) {
    if (job.status !== "running" || !job.pid) continue;
    let alive = false;
    try {
      process.kill(job.pid, 0);
      alive = true;
    } catch (err) {
      alive = err && err.code === "EPERM"; // exists but not ours
    }
    if (alive) continue;

    const completedAt = nowIso();
    upsertJob(workspaceRoot, {
      id: job.id,
      status: "failed",
      phase: "orphaned",
      pid: null,
      errorMessage: `Orphaned: pid ${job.pid} no longer alive`,
      completedAt
    });
    const jobFile = resolveJobFile(workspaceRoot, job.id);
    if (fs.existsSync(jobFile)) {
      const stored = readJobFile(jobFile);
      writeJobFile(workspaceRoot, job.id, {
        ...stored,
        status: "failed",
        phase: "orphaned",
        pid: null,
        errorMessage: `Orphaned: pid ${job.pid} no longer alive`,
        completedAt
      });
    }
  }
}

async function main() {
  const [subcommand, ...argv] = process.argv.slice(2);
  // ... usage handling ...
  const workspaceRoot = resolveWorkspaceRoot(process.cwd()); // existing helper from lib/workspace.mjs
  if (workspaceRoot) sweepOrphanJobs(workspaceRoot);
  switch (subcommand) { /* ... */ }
}

This guarantees the workspace self-heals on the next companion invocation regardless of how the previous run died, and it's cheap (one process.kill(pid, 0) per running job).
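The sweep's decision logic can be exercised without the plugin's state layer. In the sketch below the state helpers are replaced by a plain in-memory array (listJobs, upsertJob, etc. are the real helpers — everything here is stand-in scaffolding), and a short-lived child process supplies a genuinely dead pid:

```javascript
import { spawnSync } from "node:child_process";

// A child that exits immediately gives us a pid that is (almost certainly)
// dead by the time spawnSync returns.
const dead = spawnSync(process.execPath, ["-e", ""]);

// In-memory stand-ins for the plugin's job records.
const jobs = [
  { id: "job-live",   status: "running",   pid: process.pid }, // our own pid: alive
  { id: "job-orphan", status: "running",   pid: dead.pid },    // dead child: orphan
  { id: "job-done",   status: "completed", pid: null }         // ignored by the sweep
];

// Same probe-and-mark decision as sweepOrphanJobs, minus the file I/O.
function sweep(jobList) {
  for (const job of jobList) {
    if (job.status !== "running" || !job.pid) continue;
    let alive = false;
    try {
      process.kill(job.pid, 0);
      alive = true;
    } catch (err) {
      alive = err.code === "EPERM"; // exists but owned by another user
    }
    if (!alive) {
      job.status = "failed";
      job.phase = "orphaned";
      job.pid = null;
    }
  }
}

sweep(jobs);
console.log(jobs.map((j) => j.status)); // [ 'running', 'failed', 'completed' ]
```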

Environment

  • Plugin: openai-codex/codex v1.0.0 (installed via Claude Code plugin cache at ~/.claude/plugins/cache/openai-codex/codex/1.0.0/)
  • OS: macOS, Darwin 24.6.0
  • Node: v24.8.0 (host shell node --version; companion inherits process.execPath from the Claude Code CLI, typically system Node / nvm)
  • Host: Claude Code CLI invoking the companion via the /codex:rescue skill (foreground bash tool)
  • Reproducible workspace: ~/.claude/plugins/data/codex-openai-codex/state/superbem-a2d3d65760f87d26/
