Enhance warmup inference handling and error logging #1475

Closed
jayy-77 wants to merge 1 commit into exo-explore:main from jayy-77:runner_crashed_fix

Conversation

jayy-77 commented Feb 15, 2026

Motivation

Changes

#1431

Why It Works

Test Plan

Manual Testing

Automated Testing

Co-authored-by: Cursor <cursoragent@cursor.com>
@AlexCheema
Contributor

PR #1475 Review: "Enhance warmup inference handling and error logging"

References: #1431

Summary of Changes

This PR wraps warmup inference in try/except blocks so that if warmup fails (e.g. MLX CPU JIT compile failure on Linux ARM), the runner continues instead of crashing. It also adds error handling in the runner supervisor for unexpected channel closures.

Note: The PR template is mostly empty — only the Changes section references issue #1431. No motivation, explanation, or test plan provided.


Files Changed (3)

  1. src/exo/worker/engines/mlx/generator/generate.py — Wrap stream_generate() warmup loop in try/except RuntimeError; log warning and continue without warmup on failure
  2. src/exo/worker/runner/runner.py — Wrap entire warmup block (warmup call + check_for_cancel_every calculation + distributed all_gather) in try/except RuntimeError
  3. src/exo/worker/runner/runner_supervisor.py — Add logging for existing ClosedResourceError/BrokenResourceError catch; add new broad except Exception block with string-based exception name matching

Detailed Analysis

1. generate.py — warmup_inference() error handling

  • Wraps the existing stream_generate() warmup loop in try/except RuntimeError
  • On failure, logs a warning and continues without warmup
  • Reports how many warmup tokens were generated before the failure

Assessment: Sound approach. Warmup is a performance optimization, not a correctness requirement — failing gracefully is the right behavior.

Minor concern: Catching only RuntimeError may be too narrow. If MLX raises a different exception type (e.g. OSError, ValueError), it won't be caught.

2. runner.py — warmup caller error handling

  • Wraps the entire warmup block in try/except RuntimeError
  • On failure, logs a warning and continues to runner initialization

Assessment: Good — this is the caller-side complement to the generate.py change. Without this, even if warmup_inference itself caught the error, the subsequent code (time calculation, all_gather) could still fail on partial results.

Concerns:

  • check_for_cancel_every defaults to its initial value when warmup fails. Need to verify this default is sensible — if it's 0 or uninitialized, every token generation would check for cancellation, which could be a performance hit.
  • Same RuntimeError-only catching concern as above.

3. runner_supervisor.py — _forward_events() error handling

This is the most concerning change.

A new except Exception block checks type(e).__name__ in ("ClosedResourceError", "BrokenResourceError") as strings. This suggests exceptions are coming from a different module/import path and aren't caught by the existing except (ClosedResourceError, BrokenResourceError) clause.

Concerns:

  • String-based exception matching is fragile. A better fix would be to identify why the exceptions aren't matching the existing except clause and fix the imports, rather than string-matching exception names. If the issue is that the runner process sends a pickled exception from a different module, the string-matching approach will break if exception names change.
  • The duplicated logic between the two except blocks violates DRY — consider extracting to a helper or restructuring.
  • The else: raise is correct — unknown exceptions are re-raised.
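
To make the fragility concern concrete, here is a minimal sketch of how class-based and name-based matching differ. The two classes below are hypothetical stand-ins (not the real exception types used by the runner supervisor) that share a name but are distinct objects, as would happen if the same exception were reached through two different import paths.

```python
# Two distinct classes that happen to share a name, standing in for the
# same exception reached through two different import paths (hypothetical).
ClosedResourceError = type("ClosedResourceError", (Exception,), {})
OtherClosedResourceError = type("ClosedResourceError", (Exception,), {})


def classify(exc: Exception) -> str:
    """Report which handler would deal with `exc`."""
    try:
        raise exc
    except ClosedResourceError:
        # `except` matches on class identity (and subclasses), not on name.
        return "caught by class"
    except Exception as e:
        # Only reached when the class-based clause above did not match.
        if type(e).__name__ == "ClosedResourceError":
            return "caught by name only"
        raise


same_class = classify(ClosedResourceError())
same_name = classify(OtherClosedResourceError())
```

If the name-only branch is ever reached in practice, that points at an import mismatch, and fixing the import so the class-based clause matches is likely the more robust change.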

Overall Assessment

Positives:

Concerns:

  1. No CI checks — The branch has no reported CI runs
  2. Empty PR description — No motivation, no test plan
  3. String-based exception matching in runner_supervisor.py is fragile and a code smell — should investigate root cause of import mismatch
  4. RuntimeError-only catching may miss other exception types from MLX
  5. Duplicated error handling logic in runner_supervisor.py
  6. No automated tests for the failure paths
  7. check_for_cancel_every default when warmup fails is unverified

Recommendation: The generate.py and runner.py changes are reasonable. The runner_supervisor.py change needs rethinking — the string-based exception name matching should be replaced with a proper fix for the import mismatch.

@AlexCheema
Contributor

Code Review

Thanks for working on this! Graceful handling of warmup failures is the right approach — warmup is a performance optimization, not a correctness requirement, so the runner should continue even if it fails (#1431).

A few issues to address:

1. Remove the duplicate except Exception block in runner_supervisor.py

The new except Exception as e: with type(e).__name__ string matching duplicates the existing except (ClosedResourceError, BrokenResourceError) block above it. The preceding except already catches those types by class — the string-name check would only match if a different class happened to share the same name, which would be a bug:

# This already catches the exceptions:
except (ClosedResourceError, BrokenResourceError) as e:
    ...

# This is redundant and fragile:
except Exception as e:
    if type(e).__name__ in ("ClosedResourceError", "BrokenResourceError"):
        ...  # same handling duplicated

Recommendation: Remove the entire except Exception block.

2. Double try/except — redundant catch in generate.py

RuntimeError is caught in both warmup_inference() (generate.py) AND in the caller (runner.py). If generate.py catches it, runner.py's catch never triggers for warmup failures. Both log different messages ("Continuing without warmup" vs "Warmup failed"), but only one can fire.

Recommendation: Keep the catch in runner.py only (it also covers check_for_cancel_every and all_gather). Let warmup_inference() be a simple function that succeeds or raises.

3. Minor: logging style inconsistency

The new code uses loguru {} style (logger.warning("...: {}", e)) while all surrounding code uses f-strings (logger.info(f"...")). Not a bug, but inconsistent.

4. CI & rebase needed

No CI checks reported — the branch is 12 commits behind main and needs a rebase.

5. PR description

The template sections (motivation, changes, test plan) are blank. Please fill them in.


Overall the core idea is sound — just needs the duplicate/redundant exception handling cleaned up. Looking forward to the next iteration!

@AlexCheema
Contributor

Code Review — PR #1475: Enhance warmup inference handling and error logging

CI: No checks (fork PR — CI doesn't run automatically)

Overview

+68/-35 across 3 files. Attempts to handle MLX CPU JIT compile failures during warmup (e.g., _Float128 redeclaration on Linux ARM / Raspberry Pi, ref #1431) so the runner can still serve requests. Three changes:

  1. generate.py — try/except around stream_generate warmup loop
  2. runner.py — try/except around warmup + check_for_cancel_every computation
  3. runner_supervisor.py — duplicate exception handler in _forward_events

Critical bugs

1. Runner crashes on first inference after failed warmup

When warmup fails, check_for_cancel_every is either 0 or None, and the later assert check_for_cancel_every (line 298) fails:

  • Path A: generate.py catches RuntimeError, returns 0 tokens → runner.py computes check_for_cancel_every = min(math.ceil(0 / ...), 100) = 0 → exits try normally → later assert 0 → AssertionError
  • Path B: Exception propagates to runner.py's except → check_for_cancel_every stays None (initial value, line 144) → later assert None → AssertionError

The PR's stated goal ("runner will still be ready") is not achieved. The first real inference request crashes the runner.

Fix: set a default in the except block: check_for_cancel_every = 10 (or similar).
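
A minimal sketch of what such a default could look like, covering both paths. The function name `cancel_check_interval`, the `run_warmup` callable, the throughput-based formula, and the `target_seconds` parameter are illustrative assumptions; the real computation in runner.py is elided in this review and may differ.

```python
import math

DEFAULT_CHECK_INTERVAL = 10  # assumed fallback; the review suggests "10 (or similar)"
MAX_CHECK_INTERVAL = 100     # upper bound quoted in the review


def cancel_check_interval(run_warmup, target_seconds: float = 0.5) -> int:
    """Return a positive interval even when warmup fails or yields no tokens.

    `run_warmup` is a hypothetical callable returning (tokens, elapsed_seconds).
    """
    try:
        tokens, elapsed = run_warmup()
    except RuntimeError:
        # Path B: the exception reached the caller.
        return DEFAULT_CHECK_INTERVAL
    if tokens <= 0 or elapsed <= 0:
        # Path A: warmup swallowed the error and reported 0 tokens.
        return DEFAULT_CHECK_INTERVAL
    tokens_per_second = tokens / elapsed
    interval = math.ceil(tokens_per_second * target_seconds)
    # Clamp to [1, MAX_CHECK_INTERVAL] so a later `assert interval` holds.
    return max(1, min(interval, MAX_CHECK_INTERVAL))


def failing_warmup():
    raise RuntimeError("JIT compile failed")


interval_after_raise = cancel_check_interval(failing_warmup)
interval_after_zero = cancel_check_interval(lambda: (0, 0.0))
interval_normal = cancel_check_interval(lambda: (40, 1.0))
```

Whatever the real formula is, the point is that the value is never 0 or None by the time the first request arrives.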

2. Dead code in runner_supervisor.py

The second except Exception block name-checks type(e).__name__ in ("ClosedResourceError", "BrokenResourceError"). This can never be True because those exact types are already caught by the preceding except (ClosedResourceError, BrokenResourceError) block. The else: raise path is the only one that ever executes — making the entire block equivalent to just except Exception: raise, which is a no-op.

Remove this entire block.

Design issues

3. Double catch is redundant

RuntimeError is caught in both warmup_inference() (generate.py) and its caller (runner.py). If generate.py catches it and returns 0, runner.py's except never fires for warmup. If the intent is to also catch failures in mx_barrier or all_gather, only the runner.py catch is needed. Pick one location — keep runner.py's, remove generate.py's. warmup_inference() should succeed or raise, not silently swallow errors.

4. Broad RuntimeError catch may mask bugs

Catching all RuntimeError during warmup could hide OOM, tensor shape mismatches, or Metal errors. Consider checking the error message for the specific JIT compile pattern, or at minimum logging with full traceback (logger.opt(exception=e).warning(...)).
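
One way to narrow the catch, sketched with the standard library only. The helper names are hypothetical, and matching on `_Float128` is an assumption based on the failure described in the overview; the real error text may differ. The sketch uses stdlib `logging` with `exc_info=True` in place of loguru's `logger.opt(exception=e)`.

```python
import logging

logger = logging.getLogger("warmup_sketch")

# Assumed marker: the overview mentions a `_Float128` redeclaration error;
# the actual message text may differ.
JIT_FAILURE_MARKERS = ("_Float128",)


def is_known_jit_failure(exc: RuntimeError) -> bool:
    """True if the error message looks like the known JIT compile failure."""
    return any(marker in str(exc) for marker in JIT_FAILURE_MARKERS)


def run_warmup_tolerantly(warmup) -> bool:
    """Run `warmup`; return True on success, False on a known JIT failure.

    Any other RuntimeError (OOM, shape mismatch, ...) is re-raised.
    """
    try:
        warmup()
        return True
    except RuntimeError as e:
        if not is_known_jit_failure(e):
            raise
        # exc_info=True attaches the full traceback to the log record.
        logger.warning("Warmup failed, continuing without it", exc_info=True)
        return False


def _jit_failure():
    raise RuntimeError("error: redeclaration of _Float128")


def _oom():
    raise RuntimeError("out of memory")


tolerated = run_warmup_tolerantly(_jit_failure)
try:
    run_warmup_tolerantly(_oom)
    oom_propagated = False
except RuntimeError:
    oom_propagated = True
```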

Minor

  • Logging style inconsistency — PR uses loguru {} placeholders while surrounding code uses f-strings. Both work, but mixing styles in the same file is inconsistent.
  • Multi-node warmup failure — If warmup fails on one node but succeeds on others, mx_barrier(group) after generate.py's try/except may deadlock.
  • PR description is blank — No description of changes or test plan.

Verdict

Do not merge. The core idea (graceful warmup failure) is correct, but the implementation has a critical bug: the runner crashes on first inference after failed warmup due to assert check_for_cancel_every failing with 0 or None. The runner_supervisor.py change is dead code. Needs rework.

@AlexCheema
Contributor

Code Review: PR #1475 — Enhance warmup inference handling and error logging

Summary

Makes warmup inference failures non-fatal — catches RuntimeError in both the generate function and the runner, logging warnings instead of crashing. Also improves the runner supervisor's event forwarding error handling.

Review

Warmup error handling (generate.py):

try:
    for _r in stream_generate(...):
        tokens_generated += 1
except RuntimeError as e:
    logger.warning("Warmup inference failed: {}", e)

Correctly catches warmup failures (e.g., MLX CPU JIT compile errors on Linux ARM) and continues. The runner can still serve requests — first request may just be slower. ✅

Runner-side error handling (runner.py):
Wraps the entire warmup block in try/except, including the cancel-check-interval calculation. Falls through to RunnerReady status even on warmup failure. ✅

Runner supervisor error handling (runner_supervisor.py):

except Exception as e:
    if type(e).__name__ in ("ClosedResourceError", "BrokenResourceError"):

This is string-matching on exception class names instead of using isinstance(). This suggests the exceptions might come from different modules with the same name. If so, this is a pragmatic workaround, but a comment explaining why isinstance doesn't work would help.

Issues

1. Logger format strings use {} not %s or f-strings

logger.warning("Warmup inference failed (e.g. MLX CPU compile on this platform): {}", e)

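Python's stdlib logging module uses %-style formatting, not {} (that's loguru). With stdlib logging, the {} placeholder is never filled in: because an argument is supplied but the message contains no % specifier, formatting the record fails and logging reports a formatting error instead of the intended message. Verify which logger is being used.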
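
The stdlib behavior can be checked without any third-party dependency. The sketch below builds a `LogRecord` directly to show how stdlib `logging` renders each placeholder style when an argument is supplied; the `render` helper is illustrative and not part of the PR.

```python
import logging


def render(msg: str, *args: object) -> str:
    """Format a message the way stdlib logging does (msg % args)."""
    record = logging.LogRecord(
        name="demo", level=logging.WARNING, pathname="demo.py", lineno=0,
        msg=msg, args=args, exc_info=None,
    )
    return record.getMessage()


err = RuntimeError("jit compile failed")

percent_style = render("Warmup inference failed: %s", err)

try:
    render("Warmup inference failed: {}", err)
    brace_style = "rendered"
except TypeError:
    # "%"-formatting sees an argument but no "%" specifier to consume it.
    brace_style = "TypeError"
```

With loguru the `{}` form is the native style, so this only matters if the module's `logger` turns out to be a stdlib logger.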

2. check_for_cancel_every may be None after failed warmup
If warmup fails, check_for_cancel_every stays at its initial value. Looking at the original code, it's initialized to None. During generation, the code checks tokens_since_last_cancel_check >= check_for_cancel_every. If check_for_cancel_every is None, that comparison cannot succeed: in Python 3, comparing an int with None using >= raises TypeError rather than evaluating to False. Either way, cancellation checks would not work after a failed warmup, and generation itself could fail.

Verdict

Good defensive improvement — warmup failures shouldn't crash the runner. The main concern is the check_for_cancel_every defaulting to None after warmup failure, which could disable cancel checks entirely.

LGTM with the cancel-check concern.

@jayy-77 jayy-77 closed this Mar 4, 2026