companion jobs stuck as status:running when foreground task receives SIGTERM (no signal handler)
Summary
The `codex-companion.mjs` foreground entrypoint registers no signal handlers. When the parent (Claude Code, a shell, a CI runner, etc.) sends `SIGTERM`/`SIGINT`/`SIGHUP` mid-job, Node exits immediately and the job's `state.json` entry is frozen at `status: "running"` with a stale `pid`. The job is never reconciled to `failed`/`terminated`, so the workspace accumulates orphan "running" jobs that block status UIs and resume logic. The companion's broker (`app-server-broker.mjs`) already handles these signals — only the foreground runner is missing equivalent coverage.
Reproduction
- From a Claude Code session (or any shell), invoke a foreground job, e.g. `/codex:rescue`, which executes `node scripts/codex-companion.mjs task ...` in the foreground.
- While the job is mid-turn (status `running`, phase `running`), cancel the bash tool (ESC in Claude Code) or send `kill -TERM <pid>` to the companion process.
- Inspect `~/.claude/plugins/data/codex-openai-codex/state/<workspace-hash>/state.json`.
- Observe the job entry still shows `"status": "running"`, `"phase": "running"`, and a `"pid"` whose process no longer exists (`process.kill(pid, 0)` throws `ESRCH`).
- Re-run `node scripts/codex-companion.mjs status` — the orphan continues to be reported as in-flight.
- Repeat the cancel cycle a few times and the same workspace accumulates multiple orphan jobs, all from the same `sessionId`.
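The dead-pid check in step 4 can be scripted. A minimal sketch of the liveness probe (the helper name `pidAlive` is hypothetical, not project code):

```javascript
// Hypothetical helper mirroring the ESRCH probe described above: signal 0
// performs an existence check without actually delivering a signal.
function pidAlive(pid) {
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    // EPERM means the pid exists but belongs to another user; anything
    // else (ESRCH) means the process is gone.
    return err.code === "EPERM";
  }
}

console.log(pidAlive(process.pid)); // true — our own process is alive
```

Running this against the `pid` recorded in `state.json` for an orphan job returns `false`.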
Real evidence
`~/.claude/plugins/data/codex-openai-codex/state/superbem-a2d3d65760f87d26/state.json` accumulated 3 orphan jobs (`task-mnyhf4jl-1mnkv1`, `task-mnye71qk-5tib84`, `task-mnydlu24-x1oney`), all sharing `sessionId: ef83482e-cb59-4e3c-b16a-b8613be08da8`, all stuck at `status: "running"` / `phase: "running"` with dead PIDs (14225, 12947, ...). Each was launched via `/codex:rescue` and cancelled mid-turn from the parent Claude Code session.
Expected vs Actual
Expected: When the foreground companion receives `SIGTERM`/`SIGINT`/`SIGHUP`, the in-flight job should transition to `status: "failed"` (or `"cancelled"`) with `phase: "terminated"`, `pid: null`, and a `completedAt` timestamp before the process exits, mirroring the cleanup performed by the `try/catch` in `runTrackedJob` and the broker's `shutdown`.
Actual: Node exits immediately on the default signal disposition. The job record written at `runTrackedJob` start (`status: "running"`, `phase: "starting"` → progressed to `"running"`, with the live `pid`) is never updated. The job remains permanently `running` from the state's perspective.
Root cause
`scripts/lib/tracked-jobs.mjs` line 142 — `runTrackedJob` writes the `running` record up-front and only reconciles it via `try/catch` around `await runner()`:
```js
// scripts/lib/tracked-jobs.mjs:142
export async function runTrackedJob(job, runner, options = {}) {
  const runningRecord = {
    ...job,
    status: "running",
    startedAt: nowIso(),
    phase: "starting",
    pid: process.pid,
    logFile: options.logFile ?? job.logFile ?? null
  };
  writeJobFile(job.workspaceRoot, job.id, runningRecord);
  upsertJob(job.workspaceRoot, runningRecord);
  try {
    const execution = await runner();
    // ... writes status: completed | failed
  } catch (error) {
    // ... writes status: failed
  }
}
```
A `try/catch` cannot observe an external signal — the Node default handler for `SIGTERM`/`SIGINT` terminates the process before the `catch` block runs, so the stored record is never reconciled.
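To make this concrete, a standalone demo (not project code) showing that a process with no handler registered dies on `SIGTERM` before any `finally` block can run:

```javascript
// Standalone demo (not project code): the child re-raises SIGTERM against
// itself. With the default disposition in place, the kernel terminates it
// before either console.log executes.
import { spawnSync } from "node:child_process";

const child = spawnSync(process.execPath, ["-e", `
  try {
    process.kill(process.pid, "SIGTERM"); // default action: terminate now
    console.log("after kill");            // never reached
  } finally {
    console.log("cleanup");               // never reached either
  }
`]);

console.log(child.signal);            // "SIGTERM"
console.log(child.stdout.toString()); // "" — no cleanup output was flushed
```

This is exactly the window in which the stored `running` record goes stale.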
`scripts/codex-companion.mjs` has zero signal handlers (verified: only `process.exitCode` at line 608 and `process.exitCode = 1` at line 1006). Compare with `scripts/app-server-broker.mjs`, which correctly installs handlers:
```js
// scripts/app-server-broker.mjs:236
process.on("SIGTERM", async () => {
  await shutdown(server);
  process.exit(0);
});
// scripts/app-server-broker.mjs:241
process.on("SIGINT", async () => {
  await shutdown(server);
  process.exit(0);
});
```
The foreground runner needs the analogous treatment, scoped to the active tracked job.
Suggested fix
Register signal handlers inside `runTrackedJob` that flush the job record to a terminal state before exiting. Example diff against `scripts/lib/tracked-jobs.mjs`:
```diff
 export async function runTrackedJob(job, runner, options = {}) {
   const runningRecord = {
     ...job,
     status: "running",
     startedAt: nowIso(),
     phase: "starting",
     pid: process.pid,
     logFile: options.logFile ?? job.logFile ?? null
   };
   writeJobFile(job.workspaceRoot, job.id, runningRecord);
   upsertJob(job.workspaceRoot, runningRecord);
+  const terminate = (signal) => {
+    try {
+      const existing = readStoredJobOrNull(job.workspaceRoot, job.id) ?? runningRecord;
+      const completedAt = nowIso();
+      writeJobFile(job.workspaceRoot, job.id, {
+        ...existing,
+        status: "failed",
+        phase: "terminated",
+        errorMessage: `Terminated by ${signal}`,
+        pid: null,
+        completedAt,
+        logFile: options.logFile ?? job.logFile ?? existing.logFile ?? null
+      });
+      upsertJob(job.workspaceRoot, {
+        id: job.id,
+        status: "failed",
+        phase: "terminated",
+        pid: null,
+        errorMessage: `Terminated by ${signal}`,
+        completedAt
+      });
+      appendLogLine(options.logFile ?? job.logFile ?? null, `Terminated by ${signal}.`);
+    } catch {
+      // best-effort flush; never block exit
+    } finally {
+      process.exit(signal === "SIGINT" ? 130 : 143);
+    }
+  };
+
+  const signals = ["SIGTERM", "SIGINT", "SIGHUP"];
+  const handlers = signals.map((sig) => {
+    const handler = () => terminate(sig);
+    process.on(sig, handler);
+    return [sig, handler];
+  });
+
   try {
     const execution = await runner();
     // ...
     return execution;
   } catch (error) {
     // ...
     throw error;
+  } finally {
+    for (const [sig, handler] of handlers) {
+      process.removeListener(sig, handler);
+    }
   }
 }
```
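The `130`/`143` exit codes in the sketch follow the shell convention of `128 + signal number`, which can be checked against Node's own signal constants:

```javascript
// 128 + signal number is the conventional exit code for a process killed
// by a signal (SIGINT = 2 → 130, SIGTERM = 15 → 143).
import os from "node:os";

console.log(128 + os.constants.signals.SIGINT);  // 130
console.log(128 + os.constants.signals.SIGTERM); // 143
```

Exiting with these codes lets the parent (e.g. the Claude Code bash tool) distinguish a signal-driven shutdown from an ordinary failure.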
Notes:
- Handlers must be registered after the initial `writeJobFile`/`upsertJob` so the terminal flush always sees a known record.
- The handler is intentionally synchronous (`writeFileSync` is already used throughout `state.mjs` and `tracked-jobs.mjs`) so the flush completes before `process.exit`.
- Removing handlers in `finally` avoids leaks if multiple `runTrackedJob` invocations share a process (e.g. the `worker` subcommand).
Also suggested: orphan sweeper at startup
Even with the handler in place, jobs from prior crashes (`SIGKILL`, OOM, power loss, pre-fix versions) will remain. Add a one-shot sweeper to `codex-companion.mjs::main` (before the `switch (subcommand)` dispatch) that walks `state.jobs`, filters `status === "running"`, and probes liveness with `process.kill(pid, 0)`:
```js
function sweepOrphanJobs(workspaceRoot) {
  const jobs = listJobs(workspaceRoot);
  for (const job of jobs ?? []) {
    if (job.status !== "running" || !job.pid) continue;
    let alive = false;
    try {
      process.kill(job.pid, 0);
      alive = true;
    } catch (err) {
      alive = err && err.code === "EPERM"; // exists but not ours
    }
    if (alive) continue;
    const completedAt = nowIso();
    upsertJob(workspaceRoot, {
      id: job.id,
      status: "failed",
      phase: "orphaned",
      pid: null,
      errorMessage: `Orphaned: pid ${job.pid} no longer alive`,
      completedAt
    });
    const jobFile = resolveJobFile(workspaceRoot, job.id);
    if (fs.existsSync(jobFile)) {
      const stored = readJobFile(jobFile);
      writeJobFile(workspaceRoot, job.id, {
        ...stored,
        status: "failed",
        phase: "orphaned",
        pid: null,
        errorMessage: `Orphaned: pid ${job.pid} no longer alive`,
        completedAt
      });
    }
  }
}

async function main() {
  const [subcommand, ...argv] = process.argv.slice(2);
  // ... usage handling ...
  const workspaceRoot = resolveWorkspaceRoot(process.cwd()); // existing helper from lib/workspace.mjs
  if (workspaceRoot) sweepOrphanJobs(workspaceRoot);
  switch (subcommand) { /* ... */ }
}
```
This guarantees the workspace self-heals on the next companion invocation regardless of how the previous run died, and it's cheap (one `process.kill(pid, 0)` per orphan).
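A quick standalone check (not project code) of the probe the sweeper relies on — once a child has exited and been reaped, probing its pid raises `ESRCH`:

```javascript
// Standalone check: spawn a child that exits immediately; spawnSync reaps it,
// so probing its pid with signal 0 raises ESRCH — the condition the sweeper
// maps to phase: "orphaned". (Assumes the pid has not been reused meanwhile.)
import { spawnSync } from "node:child_process";

const dead = spawnSync(process.execPath, ["-e", "process.exit(0)"]);

let code = null;
try {
  process.kill(dead.pid, 0);
} catch (err) {
  code = err.code;
}
console.log(code); // "ESRCH"
```

Note the `EPERM` branch in the sweeper: a pid that exists but belongs to another user is still considered alive, so the sweeper never marks someone else's live process as orphaned.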
Environment
- Plugin: `openai-codex/codex` v1.0.0 (installed via Claude Code plugin cache at `~/.claude/plugins/cache/openai-codex/codex/1.0.0/`)
- OS: macOS, Darwin 24.6.0
- Node: v24.8.0 (host shell `node --version`; companion inherits `process.execPath` from the Claude Code CLI, typically system Node / nvm)
- Host: Claude Code CLI invoking the companion via the `/codex:rescue` skill (foreground bash tool)
- Reproducible workspace: `~/.claude/plugins/data/codex-openai-codex/state/superbem-a2d3d65760f87d26/`