From 873cd27a51d7945ad292d3219b45ab9ceacd4e54 Mon Sep 17 00:00:00 2001 From: genisis0x Date: Wed, 13 May 2026 14:37:47 +0530 Subject: [PATCH] fix(codex): surface non-zero exits so wrappers stop reading as silent stalls MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Fixes #1327. The /codex Review, Challenge, Consult, and Consult-resume wrappers all inspect _CODEX_EXIT for the 124 timeout sentinel only. Any other non-zero exit — parse error from a Codex CLI breaking arg change (exit 2), auth failure that didn't match the existing auth-text regex, network drop returning a partial frame — gets swallowed. The calling agent sees empty stdout, indistinguishable from a model API stall, and proceeds to misdiagnose for 30-60min before someone checks $TMPERR by hand. The reporter's CLAUDE.md #54f from claude-teams-bot makes the same observation across ~5 hits in one week: "Don't trust silent-stall framing... it's exit non-zero with empty stdout, swallowed by a wrapper that only handles the timeout exit code." Add an `elif != "0"` branch to each of the four wrappers (Review, Challenge, Consult new-session, Consult-resume). On a non-zero non-timeout exit, the wrapper now: - Prints `[codex exit ] ` so the calling agent has an immediate, structured diagnostic line. - Prints the first 20 lines of $TMPERR indented two spaces so the full parse error / arg-shape rejection is right there in the conversation, not hidden in a temp file. - Logs `codex_nonzero_exit::` via the existing telemetry helper so this becomes queryable, distinct from `codex_timeout`. Behavior on success (exit 0) and timeout (exit 124) is unchanged — the existing branches still fire first. Auth-error detection in the Challenge wrapper (`grep -qiE "auth|login|unauthorized" "$TMPERR"`) also still fires unchanged; the new branch sits between the timeout branch and the auth branch and doesn't shadow either. Defensive — the canonical fix for the original parse-error trigger (#1196 / PR #1209) addresses the specific arg-shape break. This change addresses the wrapper's blindness to all non-zero exits, including the next time Codex changes its arg contract. Both fixes are complementary as the reporter notes; this one is the cheaper of the two. Edit applied to SKILL.md.tmpl (canonical source) and codex/SKILL.md (generated) so the PR is self-consistent without requiring the reviewer to run `bun run gen:skill-docs`. --- codex/SKILL.md | 25 +++++++++++++++++++++++++ codex/SKILL.md.tmpl | 25 +++++++++++++++++++++++++ 2 files changed, 50 insertions(+) diff --git a/codex/SKILL.md b/codex/SKILL.md index 0ff1e3e553..ad3f966747 100644 --- a/codex/SKILL.md +++ b/codex/SKILL.md @@ -949,6 +949,13 @@ if [ "$_CODEX_EXIT" = "124" ]; then _gstack_codex_log_event "codex_timeout" "330" _gstack_codex_log_hang "review" "$(wc -c < "$TMPERR" 2>/dev/null || echo 0)" echo "Codex stalled past 5.5 minutes. Common causes: model API stall, long prompt, network issue. Try re-running. If persistent, split the prompt or check ~/.codex/logs/." +elif [ "$_CODEX_EXIT" != "0" ]; then + # Surface non-zero exits (parse errors, arg-shape breaks, etc.) so the + # calling agent doesn't read "no output" as a silent model/API stall and + # burn 30-60min misdiagnosing it. See #1327. + echo "[codex exit $_CODEX_EXIT] $(head -1 "$TMPERR" 2>/dev/null || echo "no stderr captured")" + head -20 "$TMPERR" 2>/dev/null | sed 's/^/ /' || true + _gstack_codex_log_event "codex_nonzero_exit" "review:$_CODEX_EXIT" fi ``` @@ -1187,6 +1194,12 @@ if [ "$_CODEX_EXIT" = "124" ]; then _gstack_codex_log_event "codex_timeout" "600" _gstack_codex_log_hang "challenge" "$(wc -c < "$TMPERR" 2>/dev/null || echo 0)" echo "Codex stalled past 10 minutes. Common causes: model API stall, long prompt, network issue. Try re-running. If persistent, split the prompt or check ~/.codex/logs/." +elif [ "$_CODEX_EXIT" != "0" ]; then + # Surface non-zero exits so the calling agent doesn't read "no output" as + # a silent model/API stall. See #1327. + echo "[codex exit $_CODEX_EXIT] $(head -1 "$TMPERR" 2>/dev/null || echo "no stderr captured")" + head -20 "$TMPERR" 2>/dev/null | sed 's/^/ /' || true + _gstack_codex_log_event "codex_nonzero_exit" "challenge:$_CODEX_EXIT" fi # Fix 2: surface auth errors from captured stderr instead of dropping them if grep -qiE "auth|login|unauthorized" "$TMPERR" 2>/dev/null; then @@ -1334,6 +1347,12 @@ if [ "$_CODEX_EXIT" = "124" ]; then _gstack_codex_log_event "codex_timeout" "600" _gstack_codex_log_hang "consult" "$(wc -c < "$TMPERR" 2>/dev/null || echo 0)" echo "Codex stalled past 10 minutes. Common causes: model API stall, long prompt, network issue. Try re-running. If persistent, split the prompt or check ~/.codex/logs/." +elif [ "$_CODEX_EXIT" != "0" ]; then + # Surface non-zero exits so the calling agent doesn't read "no output" as + # a silent model/API stall. See #1327. + echo "[codex exit $_CODEX_EXIT] $(head -1 "$TMPERR" 2>/dev/null || echo "no stderr captured")" + head -20 "$TMPERR" 2>/dev/null | sed 's/^/ /' || true + _gstack_codex_log_event "codex_nonzero_exit" "consult:$_CODEX_EXIT" fi ``` @@ -1356,6 +1375,12 @@ if [ "$_CODEX_EXIT" = "124" ]; then _gstack_codex_log_event "codex_timeout" "600" _gstack_codex_log_hang "consult-resume" "$(wc -c < "$TMPERR" 2>/dev/null || echo 0)" echo "Codex stalled past 10 minutes. Common causes: model API stall, long prompt, network issue. Try re-running. If persistent, split the prompt or check ~/.codex/logs/." +elif [ "$_CODEX_EXIT" != "0" ]; then + # Surface non-zero exits so the calling agent doesn't read "no output" as + # a silent model/API stall. See #1327. + echo "[codex exit $_CODEX_EXIT] $(head -1 "$TMPERR" 2>/dev/null || echo "no stderr captured")" + head -20 "$TMPERR" 2>/dev/null | sed 's/^/ /' || true + _gstack_codex_log_event "codex_nonzero_exit" "consult-resume:$_CODEX_EXIT" fi 5. Capture session ID from the streamed output. The parser prints `SESSION_ID:` diff --git a/codex/SKILL.md.tmpl b/codex/SKILL.md.tmpl index ed118a1193..69abdfc620 100644 --- a/codex/SKILL.md.tmpl +++ b/codex/SKILL.md.tmpl @@ -175,6 +175,13 @@ if [ "$_CODEX_EXIT" = "124" ]; then _gstack_codex_log_event "codex_timeout" "330" _gstack_codex_log_hang "review" "$(wc -c < "$TMPERR" 2>/dev/null || echo 0)" echo "Codex stalled past 5.5 minutes. Common causes: model API stall, long prompt, network issue. Try re-running. If persistent, split the prompt or check ~/.codex/logs/." +elif [ "$_CODEX_EXIT" != "0" ]; then + # Surface non-zero exits (parse errors, arg-shape breaks, etc.) so the + # calling agent doesn't read "no output" as a silent model/API stall and + # burn 30-60min misdiagnosing it. See #1327. + echo "[codex exit $_CODEX_EXIT] $(head -1 "$TMPERR" 2>/dev/null || echo "no stderr captured")" + head -20 "$TMPERR" 2>/dev/null | sed 's/^/ /' || true + _gstack_codex_log_event "codex_nonzero_exit" "review:$_CODEX_EXIT" fi ``` @@ -329,6 +336,12 @@ if [ "$_CODEX_EXIT" = "124" ]; then _gstack_codex_log_event "codex_timeout" "600" _gstack_codex_log_hang "challenge" "$(wc -c < "$TMPERR" 2>/dev/null || echo 0)" echo "Codex stalled past 10 minutes. Common causes: model API stall, long prompt, network issue. Try re-running. If persistent, split the prompt or check ~/.codex/logs/." +elif [ "$_CODEX_EXIT" != "0" ]; then + # Surface non-zero exits so the calling agent doesn't read "no output" as + # a silent model/API stall. See #1327. + echo "[codex exit $_CODEX_EXIT] $(head -1 "$TMPERR" 2>/dev/null || echo "no stderr captured")" + head -20 "$TMPERR" 2>/dev/null | sed 's/^/ /' || true + _gstack_codex_log_event "codex_nonzero_exit" "challenge:$_CODEX_EXIT" fi # Fix 2: surface auth errors from captured stderr instead of dropping them if grep -qiE "auth|login|unauthorized" "$TMPERR" 2>/dev/null; then @@ -476,6 +489,12 @@ if [ "$_CODEX_EXIT" = "124" ]; then _gstack_codex_log_event "codex_timeout" "600" _gstack_codex_log_hang "consult" "$(wc -c < "$TMPERR" 2>/dev/null || echo 0)" echo "Codex stalled past 10 minutes. Common causes: model API stall, long prompt, network issue. Try re-running. If persistent, split the prompt or check ~/.codex/logs/." +elif [ "$_CODEX_EXIT" != "0" ]; then + # Surface non-zero exits so the calling agent doesn't read "no output" as + # a silent model/API stall. See #1327. + echo "[codex exit $_CODEX_EXIT] $(head -1 "$TMPERR" 2>/dev/null || echo "no stderr captured")" + head -20 "$TMPERR" 2>/dev/null | sed 's/^/ /' || true + _gstack_codex_log_event "codex_nonzero_exit" "consult:$_CODEX_EXIT" fi ``` @@ -498,6 +517,12 @@ if [ "$_CODEX_EXIT" = "124" ]; then _gstack_codex_log_event "codex_timeout" "600" _gstack_codex_log_hang "consult-resume" "$(wc -c < "$TMPERR" 2>/dev/null || echo 0)" echo "Codex stalled past 10 minutes. Common causes: model API stall, long prompt, network issue. Try re-running. If persistent, split the prompt or check ~/.codex/logs/." +elif [ "$_CODEX_EXIT" != "0" ]; then + # Surface non-zero exits so the calling agent doesn't read "no output" as + # a silent model/API stall. See #1327. + echo "[codex exit $_CODEX_EXIT] $(head -1 "$TMPERR" 2>/dev/null || echo "no stderr captured")" + head -20 "$TMPERR" 2>/dev/null | sed 's/^/ /' || true + _gstack_codex_log_event "codex_nonzero_exit" "consult-resume:$_CODEX_EXIT" fi 5. Capture session ID from the streamed output. The parser prints `SESSION_ID:`