fix: add health check and heartbeat to RunnerSupervisor by AlexCheema · Pull Request #1464 · exo-explore/exo

AlexCheema · 2026-02-12T21:05:32Z

Motivation

The RunnerSupervisor had no proactive health monitoring. It only detected runner process death reactively — when _forward_events() got a ClosedResourceError/BrokenResourceError from the multiprocessing event channel. When the runner was OOM-killed or otherwise abruptly terminated, the multiprocessing.Queue.get() call inside _forward_events would block forever because the child process never sent the sentinel _MpEndOfStream. This left the system completely stuck with no recovery.

Manually killing the runner process also produced no reaction from the supervisor.

Changes

1. Process liveness check (`is_alive()` polling)

Added a _health_check coroutine that runs concurrently with _forward_events in a task group
Polls runner_process.is_alive() every 1 second
When the process is found dead, emits RunnerFailed and releases all pending task waiters
Any unexpected exit is treated as failure — including exitcode=0 when the runner wasn't in a shutdown state (the runner process should never die unless explicitly told to via a Shutdown task)

2. Heartbeat mechanism for detecting unresponsive processes

The subprocess runs a daemon thread that writes time.time() to a shared multiprocessing.Value every 0.5 seconds
The supervisor checks this timestamp during health checks — if it goes stale for >10 seconds while the process is still alive, the process is killed and RunnerFailed is emitted
Catches frozen/deadlocked processes (e.g. OS memory pressure suspending the process)

3. Tests

test_health_check_detects_dead_process — non-zero exit code
test_health_check_detects_signal_death — SIGKILL (simulates OOM)
test_health_check_releases_pending_tasks — pending start_task() waiters unblocked
test_clean_exit_no_failure_when_shutdown_status — no false alarm on expected shutdown
test_unexpected_exit_code_zero_emits_failure — rc=0 without shutdown state
test_heartbeat_timeout_detects_unresponsive_process — stale heartbeat detection

Why It Works

The root cause was that _forward_events blocked on queue.get() forever when the child process died without cleanly closing the channel. The health check runs in a separate task and can cancel the stuck _forward_events task via the task group's cancel scope. The abandon_on_cancel=True flag on receive_async ensures the blocked thread is properly abandoned.

The heartbeat mechanism extends this to catch processes that are alive but frozen — a scenario where is_alive() returns True but the process isn't making progress.

Test Plan

Automated Testing

6 new tests covering all failure modes (process death, signal death, pending task release, expected shutdown, unexpected clean exit, heartbeat timeout)
All 193 existing tests continue to pass
basedpyright: 0 errors
ruff: all checks passed

🤖 Generated with Claude Code

Add a Python/asyncio E2E test framework that spins up 2-node exo clusters in Docker Compose and verifies cluster formation, discovery, election, and API health. Includes a no-internet chaos test using DNS blocking. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The runner was running out of disk space during the Docker image build (Rust compilation + Python deps). Remove unused toolchains first. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Clean up Rust target/ and cargo registry after uv sync in the same RUN command so build artifacts aren't committed to the layer (~1-2 GB saved). Also remove more unused toolchains from the CI runner. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Use iptables to block all outbound traffic except private subnets and multicast (for mDNS discovery). Verify internet is blocked by curling huggingface.co from inside each container and checking exo logs for "Internet connectivity: False". Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Launch mlx-community/Qwen3-0.6B-4bit on the cluster, send a chat completion with seed=42 and temperature=0, and verify the output matches a committed snapshot. Tests inference determinism end-to-end. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

MLX CPU inference on x86_64 is too slow for CI runners (~10min+ for a single request). Mark the inference snapshot test as slow so it's skipped by default. Run with --slow or E2E_SLOW=1 on Apple Silicon. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Add proactive monitoring to detect runner process death and unresponsiveness: - Health check loop polls is_alive() every 1s, detects unexpected exits - Counter-based heartbeat detects frozen/unresponsive processes - Emits RunnerFailed event and releases pending task waiters on failure - Add EXO_RUNNER_MUST_DIE debug trigger for testing abrupt process death - Add chaos E2E test that kills runner mid-inference Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…lection Add root conftest.py to exclude tests/start_distributed_test.py from pytest collection (it calls sys.exit at module level). Fix ruff lint issues (import sorting, f-string without placeholders, lambda loop variable capture) and apply nix fmt formatting to e2e files. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

AlexCheema · 2026-02-17T18:18:30Z

Code Review: PR #1464 — Health check and heartbeat for RunnerSupervisor

Overall

Good feature addressing a real reliability gap. The _health_check / _forward_events task group design is clean, the heartbeat mechanism is well thought out, and the 6 new tests cover the key scenarios thoroughly.

Critical: `_forward_events` thread hangs on `cancel_scope.cancel()`

When _health_check detects death and calls cancel_scope.cancel(), the task group cancels _forward_events. But _forward_events is blocked in:

MpReceiver.receive_async() → to_thread.run_sync(queue.get(), abandon_on_cancel=True)

The async side is cancelled, but the thread doing queue.get() is abandoned and blocks forever. Neither _handle_process_exit() nor _handle_unresponsive() unblocks it. shutdown() calls _ev_recv.close() which does buffer.close(), but multiprocessing.Queue.close() does not wake up a pending get().

This is the cause of the 6-hour CI hang on aarch64-darwin: all 193 tests pass in ~28s, then pytest can't exit because the abandoned thread holds the process alive.

Fix: MpReceiver.close() needs to put an _MpEndOfStream sentinel before closing the buffer (mirroring what MpSender.close() already does). See PR #1511 which addresses this.

Medium: `shutdown()` is not idempotent

_handle_process_exit() and _handle_unresponsive() both call self.shutdown(). The _check_runner() path (from _forward_events exception handler) also calls self.shutdown(). While _death_handled prevents double-entry into the handlers, if both code paths somehow race, shutdown() could be called twice.

On the second call:

_cancel_sender.send(TaskId("CANCEL_CURRENT_TASK")) raises ClosedResourceError (sender already closed)
MpSender.close() calls buffer.put(_MpEndOfStream()) on an already-closed queue → ValueError

Suggestion: Add a _shutdown guard flag: if self._shutdown: return at the top of shutdown().

Minor: Pre-heartbeat freeze gap

The if current > 0 guard in _health_check means the heartbeat is ignored until the subprocess starts incrementing. If the process freezes before the heartbeat thread starts (e.g., during import/initialization), the heartbeat check never fires. The is_alive() check won't help either because the process IS alive — just frozen. This may be acceptable as a known limitation, but worth documenting.

Minor: PR scope

The E2E framework (Dockerfile, docker-compose, test runner, 4 E2E tests, snapshot testing, chaos testing) is substantial and well-engineered, but it's a lot of new code mixed into a health-check PR. Consider splitting the E2E framework into its own PR for easier review.

Tests — Well done

The 6 tests cover all key scenarios (dead process, signal death, pending task release, expected shutdown, unexpected clean exit, heartbeat timeout). _build_supervisor correctly clones the event sender to isolate the test's receiver.

Review only — not a merge approval.

AlexCheema · 2026-02-17T23:18:59Z

Code Review — PR #1464: fix: add health check and heartbeat to RunnerSupervisor

CI: aarch64-darwin HANGS (6+ hours, cancelled) | x86_64-linux PASS | aarch64-linux PASS | e2e PASS

Overview

+978/-5 across 17 files. Two major components:

Health check + heartbeat (~160 lines): shared-memory heartbeat counter in the runner subprocess, supervisor-side health check that detects process death and unresponsive freezes
E2E test framework (~500 lines): Docker-based multi-node cluster tests with chaos testing

Plus 6 new unit tests for the supervisor and a conftest.py root-level test ignore fix.

Critical: CI hang caused by `MpReceiver.close()` not unblocking `queue.get()`

Confirmed the author's self-review diagnosis. When _health_check detects death and calls cancel_scope.cancel():

_forward_events is cancelled while blocked in MpReceiver.receive_async() → to_thread.run_sync(queue.get())
to_thread.run_sync default abandon_on_cancel=False means it waits for the thread to finish
queue.get() blocks forever — MpReceiver.close() only sets a flag, does NOT put _MpEndOfStream on the queue (unlike MpSender.close() which does)
The abandoned thread holds the process alive → pytest hangs

Dependency: PR #1511 fixes MpReceiver.close(). Must merge #1511 first or include the fix here.

Should fix

1. shutdown() not idempotent — _handle_process_exit() and _handle_unresponsive() both call self.shutdown(). While _death_handled prevents duplicate handler execution, _check_runner() (called from _forward_events exception handler, line 408 on main) also calls self.shutdown(). If both _forward_events and _health_check detect death in the same event loop turn, _death_handled prevents double handling, but there's still a path through _check_runner() that calls shutdown(). Add a _shutdown guard: if self._shutdown: return at the top.

2. _forward_events exception handler calls _check_runner which calls shutdown(), but _handle_process_exit also calls shutdown() — With the _death_handled guard, only one path runs. But _check_runner is async (uses await self._event_sender.send()), while _handle_process_exit uses send_nowait. If _forward_events's exception handler runs first and _check_runner calls shutdown(), the _event_sender gets closed. Then _handle_process_exit is guarded. But if _handle_process_exit runs first, it calls shutdown() which closes _event_sender, and _forward_events is guarded. The _death_handled flag works correctly here due to cooperative scheduling. No actual bug, but the dual-path design is fragile.

Minor

e2e.yml triggers on ALL pushes — on: push: without branch filter runs the expensive Docker build on every push to every branch. Restrict to push: { branches: [main] }.
Pre-heartbeat freeze gap — if current > 0 guard means freezes during subprocess initialization (before heartbeat thread starts) aren't detected. is_alive() catches death but not live-but-frozen. Acceptable known limitation.
Debug trigger not gated — EXO_RUNNER_MUST_DIE calls os._exit(1) on any prompt containing the magic string. Follows the existing pattern of MUST_FAIL/MUST_OOM/MUST_TIMEOUT which are also ungated. Consistent but worth noting these are accessible to any user who can send prompts.
Root conftest.py — collect_ignore = ["tests/start_distributed_test.py"] is cleaner than the pyproject.toml --ignore approach in PRs fix: enable psutil fallback for memory monitoring when macmon is missing #1478/fix: forcefully terminate runner on Shutdown timeout #1466.

What's good

Heartbeat design is excellent — multiprocessing.Value("Q", 0) with a daemon thread is the right primitive. Monotonic counter avoids clock issues. 0.5s increment with 10s stale threshold (20 missed beats) avoids false positives during heavy computation.
_death_handled flag prevents race between _forward_events and _health_check detecting the same death. Clean cooperative scheduling guard.
Signal identification — _handle_process_exit distinguishes signal deaths (negative exit codes → signal.strsignal) from normal exits, producing actionable error messages like "Terminated (signal=9 (Killed))".
Unit tests are thorough — 6 tests covering dead process, signal death, pending task release, expected shutdown, unexpected clean exit, and heartbeat timeout. _build_supervisor correctly uses event_sender.clone() to isolate test receiver.
E2E framework is well-engineered — Docker Compose cluster, iptables network isolation, snapshot-based deterministic testing. Overlaps with PR feat: Docker-based E2E test framework with chaos testing #1462.

Verdict

The health check and heartbeat mechanism is well-designed and well-tested. Blocked by the MpReceiver.close() thread hang (PR #1511 dependency). Once #1511 is merged/included, the core feature is ready. Make shutdown() idempotent and restrict the e2e.yml trigger scope before merge.

Review only — not a merge approval.

AlexCheema force-pushed the alexcheema/runner-health-check branch from 51ff252 to b428858 Compare February 12, 2026 23:36

AlexCheema mentioned this pull request Feb 12, 2026

fix: forcefully terminate runner on Shutdown timeout #1466

Closed

AlexCheema and others added 8 commits February 16, 2026 10:04

fix: free disk space in CI before Docker build

d477dc9

The runner was running out of disk space during the Docker image build (Rust compilation + Python deps). Remove unused toolchains first. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

AlexCheema force-pushed the alexcheema/runner-health-check branch from 4842edf to 4fa6d05 Compare February 16, 2026 19:11

fix: add missing cancel_sender param to test supervisor builder

e5a8f39

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

This was referenced Feb 17, 2026

feat: MLX distributed model transfer between nodes #1463

Draft

fix: unblock MpReceiver.close() to prevent shutdown hang #1511

Merged

Evanev7 closed this Feb 25, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: add health check and heartbeat to RunnerSupervisor#1464

fix: add health check and heartbeat to RunnerSupervisor#1464
AlexCheema wants to merge 9 commits intomainfrom
alexcheema/runner-health-check

AlexCheema commented Feb 12, 2026

Uh oh!

AlexCheema commented Feb 17, 2026

Uh oh!

AlexCheema commented Feb 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

AlexCheema commented Feb 12, 2026

Motivation

Changes

1. Process liveness check (is_alive() polling)

2. Heartbeat mechanism for detecting unresponsive processes

3. Tests

Why It Works

Test Plan

Automated Testing

Uh oh!

AlexCheema commented Feb 17, 2026

Code Review: PR #1464 — Health check and heartbeat for RunnerSupervisor

Overall

Critical: _forward_events thread hangs on cancel_scope.cancel()

Medium: shutdown() is not idempotent

Minor: Pre-heartbeat freeze gap

Minor: PR scope

Tests — Well done

Uh oh!

AlexCheema commented Feb 17, 2026

Code Review — PR #1464: fix: add health check and heartbeat to RunnerSupervisor

Overview

Critical: CI hang caused by MpReceiver.close() not unblocking queue.get()

Should fix

Minor

What's good

Verdict

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

1. Process liveness check (`is_alive()` polling)

Critical: `_forward_events` thread hangs on `cancel_scope.cancel()`

Medium: `shutdown()` is not idempotent

Critical: CI hang caused by `MpReceiver.close()` not unblocking `queue.get()`