fix(codex): surface non-zero exits so wrappers stop reading as silent stalls#1467
Open
genisis0x wants to merge 1 commit into
Open
fix(codex): surface non-zero exits so wrappers stop reading as silent stalls#1467genisis0x wants to merge 1 commit into
genisis0x wants to merge 1 commit into
Conversation
… stalls Fixes garrytan#1327. The /codex Review, Challenge, Consult, and Consult-resume wrappers all inspect _CODEX_EXIT for the 124 timeout sentinel only. Any other non-zero exit — parse error from a Codex CLI breaking arg change (exit 2), auth failure that didn't match the existing auth-text regex, network drop returning a partial frame — gets swallowed. The calling agent sees empty stdout, indistinguishable from a model API stall, and proceeds to misdiagnose for 30-60min before someone checks $TMPERR by hand. The reporter's CLAUDE.md #54f from claude-teams-bot makes the same observation across ~5 hits in one week: "Don't trust silent-stall framing... it's exit non-zero with empty stdout, swallowed by a wrapper that only handles the timeout exit code." Add an `elif != "0"` branch to each of the four wrappers (Review, Challenge, Consult new-session, Consult-resume). On a non-zero non-timeout exit, the wrapper now: - Prints `[codex exit <code>] <first stderr line>` so the calling agent has an immediate, structured diagnostic line. - Prints the first 20 lines of $TMPERR indented two spaces so the full parse error / arg-shape rejection is right there in the conversation, not hidden in a temp file. - Logs `codex_nonzero_exit:<mode>:<exit_code>` via the existing telemetry helper so this becomes queryable, distinct from `codex_timeout`. Behavior on success (exit 0) and timeout (exit 124) is unchanged — the existing branches still fire first. Auth-error detection in the Challenge wrapper (`grep -qiE "auth|login|unauthorized" "$TMPERR"`) also still fires unchanged; the new branch sits between the timeout branch and the auth branch and doesn't shadow either. Defensive — the canonical fix for the original parse-error trigger (garrytan#1196 / PR garrytan#1209) addresses the specific arg-shape break. This change addresses the wrapper's blindness to all non-zero exits, including the next time Codex changes its arg contract. Both fixes are complementary as the reporter notes; this one is the cheaper of the two. Edit applied to SKILL.md.tmpl (canonical source) and codex/SKILL.md (generated) so the PR is self-consistent without requiring the reviewer to run `bun run gen:skill-docs`.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes #1327.
The `/codex` Review, Challenge, Consult, and Consult-resume wrappers all inspect `_CODEX_EXIT` for the 124 timeout sentinel only. Any other non-zero exit — parse error from a Codex CLI breaking arg change (exit 2), auth failure that didn't match the existing auth-text regex, network drop returning a partial frame — gets swallowed. The calling agent sees empty stdout, indistinguishable from a model API stall, and proceeds to misdiagnose for 30-60min before someone checks `$TMPERR` by hand.
The reporter's CLAUDE.md #54f from claude-teams-bot makes the same observation across ~5 hits in one week: "Don't trust silent-stall framing... it's exit non-zero with empty stdout, swallowed by a wrapper that only handles the timeout exit code."
Approach
Add an `elif != "0"` branch to each of the four wrappers (Review, Challenge, Consult new-session, Consult-resume). On a non-zero non-timeout exit, the wrapper now:
] ` so the calling agent has an immediate, structured diagnostic line.Behavior on success (exit 0) and timeout (exit 124) is unchanged — the existing branches still fire first. Auth-error detection in the Challenge wrapper (`grep -qiE "auth|login|unauthorized" "$TMPERR"`) also still fires unchanged; the new branch sits between the timeout branch and the auth branch and doesn't shadow either.
Defensive, not curative
This is the layer the reporter explicitly framed as separate from #1209's curative fix for the specific parse-error trigger: even after #1209 lands, future Codex CLI arg-shape breaks will keep masquerading as silent stalls until/unless someone manually inspects the captured stderr. Both fixes are complementary; this one is the cheaper of the two and self-corrects on the next regression.
Files
Both files are kept in sync per the `SKILL.md workflow` section of CLAUDE.md.
Validation
Fixes #1327
Need help on this PR? Tag
@codesmithwith what you need.