Skip to content

fix(codex): surface non-zero exits so wrappers stop reading as silent stalls#1467

Open
genisis0x wants to merge 1 commit into
garrytan:mainfrom
genisis0x:fix/codex-exit-surface-1327
Open

fix(codex): surface non-zero exits so wrappers stop reading as silent stalls#1467
genisis0x wants to merge 1 commit into
garrytan:mainfrom
genisis0x:fix/codex-exit-surface-1327

Conversation

@genisis0x
Copy link
Copy Markdown

@genisis0x genisis0x commented May 13, 2026

Summary

Fixes #1327.

The `/codex` Review, Challenge, Consult, and Consult-resume wrappers all inspect `_CODEX_EXIT` for the 124 timeout sentinel only. Any other non-zero exit — parse error from a Codex CLI breaking arg change (exit 2), auth failure that didn't match the existing auth-text regex, network drop returning a partial frame — gets swallowed. The calling agent sees empty stdout, indistinguishable from a model API stall, and proceeds to misdiagnose for 30-60min before someone checks `$TMPERR` by hand.

The reporter's CLAUDE.md #54f from claude-teams-bot makes the same observation across ~5 hits in one week: "Don't trust silent-stall framing... it's exit non-zero with empty stdout, swallowed by a wrapper that only handles the timeout exit code."

Approach

Add an `elif != "0"` branch to each of the four wrappers (Review, Challenge, Consult new-session, Consult-resume). On a non-zero non-timeout exit, the wrapper now:

  • Prints `[codex exit ] ` so the calling agent has an immediate, structured diagnostic line.
  • Prints the first 20 lines of `$TMPERR` indented two spaces so the full parse error / arg-shape rejection is right there in the conversation, not hidden in a temp file.
  • Logs `codex_nonzero_exit::<exit_code>` via the existing telemetry helper so this becomes queryable, distinct from `codex_timeout`.

Behavior on success (exit 0) and timeout (exit 124) is unchanged — the existing branches still fire first. Auth-error detection in the Challenge wrapper (`grep -qiE "auth|login|unauthorized" "$TMPERR"`) also still fires unchanged; the new branch sits between the timeout branch and the auth branch and doesn't shadow either.

Defensive, not curative

This is the layer the reporter explicitly framed as separate from #1209's curative fix for the specific parse-error trigger: even after #1209 lands, future Codex CLI arg-shape breaks will keep masquerading as silent stalls until/unless someone manually inspects the captured stderr. Both fixes are complementary; this one is the cheaper of the two and self-corrects on the next regression.

Files

  • `codex/SKILL.md.tmpl` — canonical source. Four `elif != "0"` branches added next to the existing `if = "124"` checks (Review @ 178, Challenge @ 336, Consult @ 489, Consult-resume @ 517).
  • `codex/SKILL.md` — generated mirror, same four edits applied so the PR is self-consistent without requiring the reviewer to run `bun run gen:skill-docs` to inspect the output.

Both files are kept in sync per the `SKILL.md workflow` section of CLAUDE.md.

Validation

  • `bash -n` on the embedded blocks — clean (the wrappers are markdown code fences but each block is a self-contained bash snippet that runs fine when extracted).
  • No `bun test` runner available locally; the change is additive (new elif branch only; no existing branch touched) so the existing skill-validation suite should remain green.

Fixes #1327


View in Codesmith
Need help on this PR? Tag @codesmith with what you need.

  • Let Codesmith autofix CI failures and bot reviews

… stalls

Fixes garrytan#1327. The /codex Review, Challenge, Consult, and Consult-resume
wrappers all inspect _CODEX_EXIT for the 124 timeout sentinel only.
Any other non-zero exit — parse error from a Codex CLI breaking arg
change (exit 2), auth failure that didn't match the existing auth-text
regex, network drop returning a partial frame — gets swallowed. The
calling agent sees empty stdout, indistinguishable from a model API
stall, and proceeds to misdiagnose for 30-60min before someone checks
$TMPERR by hand.

The reporter's CLAUDE.md #54f from claude-teams-bot makes the same
observation across ~5 hits in one week: "Don't trust silent-stall
framing... it's exit non-zero with empty stdout, swallowed by a wrapper
that only handles the timeout exit code."

Add an `elif != "0"` branch to each of the four wrappers (Review,
Challenge, Consult new-session, Consult-resume). On a non-zero
non-timeout exit, the wrapper now:

  - Prints `[codex exit <code>] <first stderr line>` so the calling
    agent has an immediate, structured diagnostic line.
  - Prints the first 20 lines of $TMPERR indented two spaces so the
    full parse error / arg-shape rejection is right there in the
    conversation, not hidden in a temp file.
  - Logs `codex_nonzero_exit:<mode>:<exit_code>` via the existing
    telemetry helper so this becomes queryable, distinct from
    `codex_timeout`.

Behavior on success (exit 0) and timeout (exit 124) is unchanged — the
existing branches still fire first. Auth-error detection in the
Challenge wrapper (`grep -qiE "auth|login|unauthorized" "$TMPERR"`)
also still fires unchanged; the new branch sits between the timeout
branch and the auth branch and doesn't shadow either.

Defensive — the canonical fix for the original parse-error trigger
(garrytan#1196 / PR garrytan#1209) addresses the specific arg-shape break. This change
addresses the wrapper's blindness to all non-zero exits, including the
next time Codex changes its arg contract. Both fixes are complementary
as the reporter notes; this one is the cheaper of the two.

Edit applied to SKILL.md.tmpl (canonical source) and codex/SKILL.md
(generated) so the PR is self-consistent without requiring the
reviewer to run `bun run gen:skill-docs`.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[codex skill] Surface non-zero exit codes from codex wrappers (currently silent on parse errors)

1 participant