Skip to content

fix: add health check and heartbeat to RunnerSupervisor#1464

Closed
AlexCheema wants to merge 9 commits intomainfrom
alexcheema/runner-health-check
Closed

fix: add health check and heartbeat to RunnerSupervisor#1464
AlexCheema wants to merge 9 commits intomainfrom
alexcheema/runner-health-check

Conversation

@AlexCheema
Copy link
Copy Markdown
Contributor

Motivation

The RunnerSupervisor had no proactive health monitoring. It only detected runner process death reactively — when _forward_events() got a ClosedResourceError/BrokenResourceError from the multiprocessing event channel. When the runner was OOM-killed or otherwise abruptly terminated, the multiprocessing.Queue.get() call inside _forward_events would block forever because the child process never sent the sentinel _MpEndOfStream. This left the system completely stuck with no recovery.

Manually killing the runner process also produced no reaction from the supervisor.

Changes

1. Process liveness check (is_alive() polling)

  • Added a _health_check coroutine that runs concurrently with _forward_events in a task group
  • Polls runner_process.is_alive() every 1 second
  • When the process is found dead, emits RunnerFailed and releases all pending task waiters
  • Any unexpected exit is treated as failure — including exitcode=0 when the runner wasn't in a shutdown state (the runner process should never die unless explicitly told to via a Shutdown task)

2. Heartbeat mechanism for detecting unresponsive processes

  • The subprocess runs a daemon thread that writes time.time() to a shared multiprocessing.Value every 0.5 seconds
  • The supervisor checks this timestamp during health checks — if it goes stale for >10 seconds while the process is still alive, the process is killed and RunnerFailed is emitted
  • Catches frozen/deadlocked processes (e.g. OS memory pressure suspending the process)

3. Tests

  • test_health_check_detects_dead_process — non-zero exit code
  • test_health_check_detects_signal_death — SIGKILL (simulates OOM)
  • test_health_check_releases_pending_tasks — pending start_task() waiters unblocked
  • test_clean_exit_no_failure_when_shutdown_status — no false alarm on expected shutdown
  • test_unexpected_exit_code_zero_emits_failure — rc=0 without shutdown state
  • test_heartbeat_timeout_detects_unresponsive_process — stale heartbeat detection

Why It Works

The root cause was that _forward_events blocked on queue.get() forever when the child process died without cleanly closing the channel. The health check runs in a separate task and can cancel the stuck _forward_events task via the task group's cancel scope. The abandon_on_cancel=True flag on receive_async ensures the blocked thread is properly abandoned.

The heartbeat mechanism extends this to catch processes that are alive but frozen — a scenario where is_alive() returns True but the process isn't making progress.

Test Plan

Automated Testing

  • 6 new tests covering all failure modes (process death, signal death, pending task release, expected shutdown, unexpected clean exit, heartbeat timeout)
  • All 193 existing tests continue to pass
  • basedpyright: 0 errors
  • ruff: all checks passed

🤖 Generated with Claude Code

AlexCheema and others added 8 commits February 16, 2026 10:04
Add a Python/asyncio E2E test framework that spins up 2-node exo clusters
in Docker Compose and verifies cluster formation, discovery, election, and
API health. Includes a no-internet chaos test using DNS blocking.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The runner was running out of disk space during the Docker image build
(Rust compilation + Python deps). Remove unused toolchains first.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Clean up Rust target/ and cargo registry after uv sync in the same RUN
command so build artifacts aren't committed to the layer (~1-2 GB saved).
Also remove more unused toolchains from the CI runner.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use iptables to block all outbound traffic except private subnets and
multicast (for mDNS discovery). Verify internet is blocked by curling
huggingface.co from inside each container and checking exo logs for
"Internet connectivity: False".

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Launch mlx-community/Qwen3-0.6B-4bit on the cluster, send a chat
completion with seed=42 and temperature=0, and verify the output
matches a committed snapshot. Tests inference determinism end-to-end.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
MLX CPU inference on x86_64 is too slow for CI runners (~10min+ for
a single request). Mark the inference snapshot test as slow so it's
skipped by default. Run with --slow or E2E_SLOW=1 on Apple Silicon.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add proactive monitoring to detect runner process death and unresponsiveness:

- Health check loop polls is_alive() every 1s, detects unexpected exits
- Counter-based heartbeat detects frozen/unresponsive processes
- Emits RunnerFailed event and releases pending task waiters on failure
- Add EXO_RUNNER_MUST_DIE debug trigger for testing abrupt process death
- Add chaos E2E test that kills runner mid-inference

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…lection

Add root conftest.py to exclude tests/start_distributed_test.py from
pytest collection (it calls sys.exit at module level). Fix ruff lint
issues (import sorting, f-string without placeholders, lambda loop
variable capture) and apply nix fmt formatting to e2e files.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@AlexCheema AlexCheema force-pushed the alexcheema/runner-health-check branch from 4842edf to 4fa6d05 Compare February 16, 2026 19:11
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@AlexCheema
Copy link
Copy Markdown
Contributor Author

Code Review: PR #1464 — Health check and heartbeat for RunnerSupervisor

Overall

Good feature addressing a real reliability gap. The _health_check / _forward_events task group design is clean, the heartbeat mechanism is well thought out, and the 6 new tests cover the key scenarios thoroughly.

Critical: _forward_events thread hangs on cancel_scope.cancel()

When _health_check detects death and calls cancel_scope.cancel(), the task group cancels _forward_events. But _forward_events is blocked in:

MpReceiver.receive_async() → to_thread.run_sync(queue.get(), abandon_on_cancel=True)

The async side is cancelled, but the thread doing queue.get() is abandoned and blocks forever. Neither _handle_process_exit() nor _handle_unresponsive() unblocks it. shutdown() calls _ev_recv.close() which does buffer.close(), but multiprocessing.Queue.close() does not wake up a pending get().

This is the cause of the 6-hour CI hang on aarch64-darwin: all 193 tests pass in ~28s, then pytest can't exit because the abandoned thread holds the process alive.

Fix: MpReceiver.close() needs to put an _MpEndOfStream sentinel before closing the buffer (mirroring what MpSender.close() already does). See PR #1511 which addresses this.

Medium: shutdown() is not idempotent

_handle_process_exit() and _handle_unresponsive() both call self.shutdown(). The _check_runner() path (from _forward_events exception handler) also calls self.shutdown(). While _death_handled prevents double-entry into the handlers, if both code paths somehow race, shutdown() could be called twice.

On the second call:

  • _cancel_sender.send(TaskId("CANCEL_CURRENT_TASK")) raises ClosedResourceError (sender already closed)
  • MpSender.close() calls buffer.put(_MpEndOfStream()) on an already-closed queue → ValueError

Suggestion: Add a _shutdown guard flag: if self._shutdown: return at the top of shutdown().

Minor: Pre-heartbeat freeze gap

The if current > 0 guard in _health_check means the heartbeat is ignored until the subprocess starts incrementing. If the process freezes before the heartbeat thread starts (e.g., during import/initialization), the heartbeat check never fires. The is_alive() check won't help either because the process IS alive — just frozen. This may be acceptable as a known limitation, but worth documenting.

Minor: PR scope

The E2E framework (Dockerfile, docker-compose, test runner, 4 E2E tests, snapshot testing, chaos testing) is substantial and well-engineered, but it's a lot of new code mixed into a health-check PR. Consider splitting the E2E framework into its own PR for easier review.

Tests — Well done

The 6 tests cover all key scenarios (dead process, signal death, pending task release, expected shutdown, unexpected clean exit, heartbeat timeout). _build_supervisor correctly clones the event sender to isolate the test's receiver.


Review only — not a merge approval.

@AlexCheema
Copy link
Copy Markdown
Contributor Author

Code Review — PR #1464: fix: add health check and heartbeat to RunnerSupervisor

CI: aarch64-darwin HANGS (6+ hours, cancelled) | x86_64-linux PASS | aarch64-linux PASS | e2e PASS

Overview

+978/-5 across 17 files. Two major components:

  1. Health check + heartbeat (~160 lines): shared-memory heartbeat counter in the runner subprocess, supervisor-side health check that detects process death and unresponsive freezes
  2. E2E test framework (~500 lines): Docker-based multi-node cluster tests with chaos testing

Plus 6 new unit tests for the supervisor and a conftest.py root-level test ignore fix.

Critical: CI hang caused by MpReceiver.close() not unblocking queue.get()

Confirmed the author's self-review diagnosis. When _health_check detects death and calls cancel_scope.cancel():

  1. _forward_events is cancelled while blocked in MpReceiver.receive_async()to_thread.run_sync(queue.get())
  2. to_thread.run_sync default abandon_on_cancel=False means it waits for the thread to finish
  3. queue.get() blocks forever — MpReceiver.close() only sets a flag, does NOT put _MpEndOfStream on the queue (unlike MpSender.close() which does)
  4. The abandoned thread holds the process alive → pytest hangs

Dependency: PR #1511 fixes MpReceiver.close(). Must merge #1511 first or include the fix here.

Should fix

1. shutdown() not idempotent_handle_process_exit() and _handle_unresponsive() both call self.shutdown(). While _death_handled prevents duplicate handler execution, _check_runner() (called from _forward_events exception handler, line 408 on main) also calls self.shutdown(). If both _forward_events and _health_check detect death in the same event loop turn, _death_handled prevents double handling, but there's still a path through _check_runner() that calls shutdown(). Add a _shutdown guard: if self._shutdown: return at the top.

2. _forward_events exception handler calls _check_runner which calls shutdown(), but _handle_process_exit also calls shutdown() — With the _death_handled guard, only one path runs. But _check_runner is async (uses await self._event_sender.send()), while _handle_process_exit uses send_nowait. If _forward_events's exception handler runs first and _check_runner calls shutdown(), the _event_sender gets closed. Then _handle_process_exit is guarded. But if _handle_process_exit runs first, it calls shutdown() which closes _event_sender, and _forward_events is guarded. The _death_handled flag works correctly here due to cooperative scheduling. No actual bug, but the dual-path design is fragile.

Minor

  • e2e.yml triggers on ALL pusheson: push: without branch filter runs the expensive Docker build on every push to every branch. Restrict to push: { branches: [main] }.
  • Pre-heartbeat freeze gapif current > 0 guard means freezes during subprocess initialization (before heartbeat thread starts) aren't detected. is_alive() catches death but not live-but-frozen. Acceptable known limitation.
  • Debug trigger not gatedEXO_RUNNER_MUST_DIE calls os._exit(1) on any prompt containing the magic string. Follows the existing pattern of MUST_FAIL/MUST_OOM/MUST_TIMEOUT which are also ungated. Consistent but worth noting these are accessible to any user who can send prompts.
  • Root conftest.pycollect_ignore = ["tests/start_distributed_test.py"] is cleaner than the pyproject.toml --ignore approach in PRs fix: enable psutil fallback for memory monitoring when macmon is missing #1478/fix: forcefully terminate runner on Shutdown timeout #1466.

What's good

  • Heartbeat design is excellentmultiprocessing.Value("Q", 0) with a daemon thread is the right primitive. Monotonic counter avoids clock issues. 0.5s increment with 10s stale threshold (20 missed beats) avoids false positives during heavy computation.
  • _death_handled flag prevents race between _forward_events and _health_check detecting the same death. Clean cooperative scheduling guard.
  • Signal identification_handle_process_exit distinguishes signal deaths (negative exit codes → signal.strsignal) from normal exits, producing actionable error messages like "Terminated (signal=9 (Killed))".
  • Unit tests are thorough — 6 tests covering dead process, signal death, pending task release, expected shutdown, unexpected clean exit, and heartbeat timeout. _build_supervisor correctly uses event_sender.clone() to isolate test receiver.
  • E2E framework is well-engineered — Docker Compose cluster, iptables network isolation, snapshot-based deterministic testing. Overlaps with PR feat: Docker-based E2E test framework with chaos testing #1462.

Verdict

The health check and heartbeat mechanism is well-designed and well-tested. Blocked by the MpReceiver.close() thread hang (PR #1511 dependency). Once #1511 is merged/included, the core feature is ready. Make shutdown() idempotent and restrict the e2e.yml trigger scope before merge.

Review only — not a merge approval.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants