fix: add health check and heartbeat to RunnerSupervisor #1464

AlexCheema wants to merge 9 commits into `main`.
Force-pushed from 51ff252 to b428858.
Add a Python/asyncio E2E test framework that spins up 2-node exo clusters in Docker Compose and verifies cluster formation, discovery, election, and API health. Includes a no-internet chaos test using DNS blocking. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The runner was running out of disk space during the Docker image build (Rust compilation + Python deps). Remove unused toolchains first.

Clean up Rust `target/` and cargo registry after `uv sync` in the same `RUN` command so build artifacts aren't committed to the layer (~1-2 GB saved). Also remove more unused toolchains from the CI runner.

Use iptables to block all outbound traffic except private subnets and multicast (for mDNS discovery). Verify internet is blocked by curling huggingface.co from inside each container and checking exo logs for "Internet connectivity: False".

Launch mlx-community/Qwen3-0.6B-4bit on the cluster, send a chat completion with seed=42 and temperature=0, and verify the output matches a committed snapshot. Tests inference determinism end-to-end.
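A snapshot check like the one in the commit above can be sketched roughly as follows. The OpenAI-style `/v1/chat/completions` path, the prompt, and both helper names are assumptions for illustration, not the test's actual code:

```python
import json
import urllib.request

def build_request(base_url: str) -> urllib.request.Request:
    # Deterministic chat completion: fixed seed, greedy decoding.
    payload = {
        "model": "mlx-community/Qwen3-0.6B-4bit",
        "messages": [{"role": "user", "content": "Say hello."}],  # illustrative prompt
        "seed": 42,
        "temperature": 0,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

def matches_snapshot(output: str, snapshot_path: str) -> bool:
    # The committed snapshot file holds the expected model output verbatim.
    with open(snapshot_path) as f:
        return output == f.read()
```

With seed and temperature pinned, any drift in the decoded text shows up as a snapshot mismatch rather than a flaky assertion.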
MLX CPU inference on x86_64 is too slow for CI runners (~10min+ for a single request). Mark the inference snapshot test as slow so it's skipped by default. Run with --slow or E2E_SLOW=1 on Apple Silicon. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add proactive monitoring to detect runner process death and unresponsiveness:

- Health check loop polls is_alive() every 1s, detects unexpected exits
- Counter-based heartbeat detects frozen/unresponsive processes
- Emits RunnerFailed event and releases pending task waiters on failure
- Add EXO_RUNNER_MUST_DIE debug trigger for testing abrupt process death
- Add chaos E2E test that kills runner mid-inference
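The `EXO_RUNNER_MUST_DIE` debug trigger could look roughly like this inside the runner; the exact value checked and the call site are assumptions, not the PR's actual code:

```python
import os
import signal

def maybe_die_for_test() -> None:
    # Debug-only chaos trigger: when the env var is set, kill this process
    # abruptly (no cleanup, no sentinel) so tests can observe how the
    # supervisor's health check reacts to sudden runner death.
    if os.environ.get("EXO_RUNNER_MUST_DIE"):
        os.kill(os.getpid(), signal.SIGKILL)
```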
…lection

Add root conftest.py to exclude tests/start_distributed_test.py from pytest collection (it calls sys.exit at module level). Fix ruff lint issues (import sorting, f-string without placeholders, lambda loop variable capture) and apply nix fmt formatting to e2e files.
Force-pushed from 4842edf to 4fa6d05.
Code Review: PR #1464 — Health check and heartbeat for RunnerSupervisor

Overall: Good feature addressing a real reliability gap. Critical: …

Code Review — PR #1464: fix: add health check and heartbeat to RunnerSupervisor

CI: aarch64-darwin HANGS (6+ hours, cancelled) | x86_64-linux PASS | aarch64-linux PASS | e2e PASS

Overview: +978/-5 across 17 files. Two major components: … plus 6 new unit tests for the supervisor and a …

Critical: CI hang caused by …
Motivation
The `RunnerSupervisor` had no proactive health monitoring. It only detected runner process death reactively — when `_forward_events()` got a `ClosedResourceError`/`BrokenResourceError` from the multiprocessing event channel. When the runner was OOM-killed or otherwise abruptly terminated, the `multiprocessing.Queue.get()` call inside `_forward_events` would block forever because the child process never sent the `_MpEndOfStream` sentinel. This left the system completely stuck with no recovery. Manually killing the runner process also produced no reaction from the supervisor.
Changes
1. Process liveness check (`is_alive()` polling)

- New `_health_check` coroutine runs concurrently with `_forward_events` in a task group
- Polls `runner_process.is_alive()` every 1 second
- On death, emits `RunnerFailed` and releases all pending task waiters
- Treats `exitcode=0` as a failure when the runner wasn't in a shutdown state (the runner process should never die unless explicitly told to via a `Shutdown` task)

2. Heartbeat mechanism for detecting unresponsive processes

- Runner writes `time.time()` to a shared `multiprocessing.Value` every 0.5 seconds
- When the heartbeat goes stale, `RunnerFailed` is emitted

3. Tests

- `test_health_check_detects_dead_process` — non-zero exit code
- `test_health_check_detects_signal_death` — SIGKILL (simulates OOM)
- `test_health_check_releases_pending_tasks` — pending `start_task()` waiters unblocked
- `test_clean_exit_no_failure_when_shutdown_status` — no false alarm on expected shutdown
- `test_unexpected_exit_code_zero_emits_failure` — rc=0 without shutdown state
- `test_heartbeat_timeout_detects_unresponsive_process` — stale heartbeat detection

Why It Works
The root cause was that `_forward_events` blocked on `queue.get()` forever when the child process died without cleanly closing the channel. The health check runs in a separate task and can cancel the stuck `_forward_events` task via the task group's cancel scope. The `abandon_on_cancel=True` flag on `receive_async` ensures the blocked thread is properly abandoned.

The heartbeat mechanism extends this to catch processes that are alive but frozen — a scenario where `is_alive()` returns `True` but the process isn't making progress.

Test Plan
Automated Testing
🤖 Generated with Claude Code