feat: Docker-based E2E test framework with chaos testing #1462
AlexCheema wants to merge 23 commits into main from
Conversation
Add a Python/asyncio E2E test framework that spins up 2-node exo clusters in Docker Compose and verifies cluster formation, discovery, election, and API health. Includes a no-internet chaos test using DNS blocking. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The runner was running out of disk space during the Docker image build (Rust compilation + Python deps). Remove unused toolchains first. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Clean up Rust target/ and cargo registry after uv sync in the same RUN command so build artifacts aren't committed to the layer (~1-2 GB saved). Also remove more unused toolchains from the CI runner. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use iptables to block all outbound traffic except private subnets and multicast (for mDNS discovery). Verify internet is blocked by curling huggingface.co from inside each container and checking exo logs for "Internet connectivity: False". Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
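A minimal sketch of what such a rule set could look like, expressed as a list of iptables command strings (the exact subnets and rule ordering the test uses are assumptions for illustration, not the committed firewall script):

```python
# Illustrative no-internet rule set: allow outbound traffic to private
# subnets and the mDNS multicast range, drop everything else.
PRIVATE_SUBNETS = ["10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16"]
MDNS_MULTICAST = "224.0.0.0/4"  # covers 224.0.0.251 used by mDNS

def iptables_rules() -> list[str]:
    """Build the ACCEPT rules first, then a default-deny DROP."""
    rules = [
        f"iptables -A OUTPUT -d {net} -j ACCEPT"
        for net in PRIVATE_SUBNETS + [MDNS_MULTICAST]
    ]
    rules.append("iptables -A OUTPUT -j DROP")  # everything else is blocked
    return rules
```

With rules in this order, node-to-node traffic and mDNS discovery keep working while any attempt to reach huggingface.co (or anything else public) is dropped.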
Launch mlx-community/Qwen3-0.6B-4bit on the cluster, send a chat completion with seed=42 and temperature=0, and verify the output matches a committed snapshot. Tests inference determinism end-to-end. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
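The deterministic request can be sketched as follows — the payload follows the OpenAI-compatible chat completions schema, but the helper names and base-URL handling here are illustrative, not exo's actual test code:

```python
# Sketch of the snapshot test's chat request: fixed seed and zero
# temperature so repeated runs produce identical output.
import json
import urllib.request

def build_chat_request(model: str, prompt: str, seed: int = 42,
                       temperature: float = 0.0, max_tokens: int = 32) -> dict:
    """Build an OpenAI-style chat completion payload with a pinned seed."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "seed": seed,
        "temperature": temperature,
        "max_tokens": max_tokens,
    }

def send_chat(base_url: str, payload: dict) -> str:
    """POST the payload to the cluster's OpenAI-compatible endpoint."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

payload = build_chat_request("mlx-community/Qwen3-0.6B-4bit", "What is 2+2?")
```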
MLX CPU inference on x86_64 is too slow for CI runners (~10min+ for a single request). Mark the inference snapshot test as slow so it's skipped by default. Run with --slow or E2E_SLOW=1 on Apple Silicon. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…st collection
The tests/start_distributed_test.py script calls sys.exit() at module level, which crashes pytest collection. Exclude it via collect_ignore.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add e2e/snapshot.py with assert_snapshot() for deterministic regression testing. On first run, saves inference output as the expected snapshot. On subsequent runs, compares against it with unified diff on mismatch. Set UPDATE_SNAPSHOTS=1 or pass --update-snapshots to regenerate. Refactor test_inference_snapshot.py to use the shared infrastructure and drop temperature=0 in favor of seed-only determinism. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…nd edge cases
Expand e2e snapshot coverage beyond the single 'What is 2+2?' test:
- test_snapshot_code_gen.py: code generation prompt (max_tokens=64)
- test_snapshot_reasoning.py: step-by-step math reasoning (max_tokens=64)
- test_snapshot_long_output.py: longer response with max_tokens=128
- test_snapshot_edge.py: single word, special chars, and unicode prompts
All use seed=42 and the shared assert_snapshot() infrastructure.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
MLX already supports x86 CPU via mlx[cpu] and the Dockerfile has the
GCC workaround for CPU JIT. The only barriers were the 'slow' markers
causing tests to be skipped in CI.
Changes:
- Remove 'slow' marker from all snapshot tests so they run by default
- Make snapshots architecture-aware (snapshots/{arch}/{name}.json) since
floating-point results differ between x86_64 and arm64
- Store architecture in snapshot metadata
- Increase CI timeout from 30 to 45 minutes for model download + CPU inference
- Update docstrings to remove Apple Silicon requirement
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
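The architecture-aware path scheme above can be sketched in a few lines (the helper name is illustrative):

```python
# Snapshot path resolution: snapshots/{arch}/{name}.json, because
# floating-point inference results differ between x86_64 and arm64.
import platform
from pathlib import Path

def snapshot_path(name: str, root: Path = Path("snapshots")) -> Path:
    arch = platform.machine()  # e.g. "x86_64", "arm64", or "aarch64"
    return root / arch / f"{name}.json"
```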
Pre-build the Docker image using docker/build-push-action with GitHub Actions cache (type=gha). On cache hit, the image loads from cache instead of rebuilding (~12min → seconds).
Changes:
- CI: set up buildx, build image with --cache-from/--cache-to type=gha
- docker-compose.yml: add image tag (exo-e2e:latest) so compose uses the pre-built image instead of rebuilding
- conftest.py: Cluster.build() skips if exo-e2e:latest already exists (pre-built in CI), falls back to docker compose build for local dev
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add e2e snapshot test that exercises 3 different model architectures
to catch model-specific regressions:
- SmolLM2-135M-Instruct (tiny llama, bf16, ~269MB)
- Llama-3.2-1B-Instruct-4bit (small llama, 4bit, ~730MB)
- gemma-2-2b-it-4bit (gemma2 architecture, 4bit, ~1.5GB)
Each model gets its own snapshot file. All use the same prompt
("What is the capital of France?"), seed=42, max_tokens=32.
Also adds model cards for SmolLM2-135M-Instruct and gemma-2-2b-it-4bit
(Llama-3.2-1B-Instruct-4bit already had one).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Two issues prevented MLX CPU from working on x86_64 in Docker:
1. Missing BLAS/LAPACK libraries: MLX CPU backend requires libblas-dev, liblapack-dev, and liblapacke-dev on Linux. Added to apt-get install.
2. g++ wrapper ordering: The -fpermissive wrapper for GCC 14 was installed AFTER uv sync, but MLX may compile extensions during install. Moved the wrapper BEFORE uv sync so both build-time and runtime JIT compilation benefit from the fix.
MLX publishes manylinux_2_35_x86_64 wheels, so this uses the native CPU backend — no alternative inference framework needed.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add proactive monitoring to detect runner process death and unresponsiveness:
- Health check loop polls is_alive() every 1s, detects unexpected exits
- Counter-based heartbeat detects frozen/unresponsive processes
- Emits RunnerFailed event and releases pending task waiters on failure
- Add EXO_RUNNER_MUST_DIE debug trigger for testing abrupt process death
- Add chaos E2E test that kills runner mid-inference
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…lection
Add root conftest.py to exclude tests/start_distributed_test.py from pytest collection (it calls sys.exit at module level). Fix ruff lint issues (import sorting, f-string without placeholders, lambda loop variable capture) and apply nix fmt formatting to e2e files.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Snapshot tests do MLX inference on x86 CPU in Docker which takes >600s per test, causing the 45-minute CI job to timeout. Only cluster_formation and no_internet (non-inference tests) should run in CI. Inference snapshot tests can be run locally with --slow or E2E_SLOW=1. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Scope e2e workflow to only trigger on pushes to the e2e-tests branch (not every branch push)
- Add temperature=0 to remaining snapshot test chat calls for deterministic output
- Make assert_snapshot fail when no baseline exists instead of silently creating one — baselines must be explicitly generated with UPDATE_SNAPSHOTS=1
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Docker mDNS discovery can be slow on first boot in CI, causing cluster_formation to timeout on "Nodes discovered each other" while subsequent tests pass fine. Retry failed tests once before counting them as real failures. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
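The retry-once policy amounts to a small wrapper — a flaky failure only counts if the test fails twice in a row. Names here are illustrative, not the actual runner code:

```python
# Run a test, retrying once before counting it as a real failure.
from typing import Callable

def run_with_retry(test: Callable[[], None], retries: int = 1) -> bool:
    """Return True if the test passes within retries + 1 attempts."""
    for attempt in range(retries + 1):
        try:
            test()
            return True
        except Exception:
            if attempt == retries:
                return False
    return False

attempts = {"n": 0}

def flaky():
    """Simulates slow first-boot mDNS: fails only on the first attempt."""
    attempts["n"] += 1
    if attempts["n"] == 1:
        raise RuntimeError("nodes not discovered yet")

def always_fails():
    raise RuntimeError("real failure")
```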
After merging main (api cancellation #1276), the RunnerSupervisor dataclass requires a _cancel_sender field. Update the test helper to create and pass this channel. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Code Review: Docker-based E2E test framework with chaos testing

Large, well-engineered PR (24 files, +1399/-5, 23 commits) adding E2E test infrastructure plus a RunnerSupervisor health check.

CI Failures
e2e jobs (FAILING); aarch64-darwin (HANGING ~6 hours): root cause is the

E2E Framework — Well-Designed

E2E Tests — Good Coverage

RunnerSupervisor Health Check — Well-Designed, Minor Concerns
Good:

Concerns:

Security Note
Debug triggers (

Minor

Verdict
Substantial, well-engineered addition to the project. The E2E framework is clean and the RunnerSupervisor health check is sound. Two blockers:

Once those are resolved, recommend squashing the 23 commits and merging. Review only — not a merge approval.
## Summary

- `MpReceiver.close()` did not unblock threads stuck on `queue.get()` in `receive_async()`, causing abandoned threads (via `abandon_on_cancel=True`) to keep the Python process alive indefinitely after tests pass
- This caused the `aarch64-darwin` CI jobs in PR #1462 to hang for ~6 hours until the GitHub Actions timeout killed them
- Sends an `_MpEndOfStream` sentinel before closing the buffer, mirroring what `MpSender.close()` already does

## Test plan

- [x] `uv run basedpyright` — 0 errors
- [x] `uv run ruff check` — clean
- [x] `nix fmt` — 0 changed
- [x] `uv run pytest` — 188 passed, 1 skipped in 12s (no hang)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: rltakashige <rl.takashige@gmail.com>
Co-authored-by: Ryuichi Leo Takashige <leo@exolabs.net>
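The fix can be reproduced with a toy sketch (assuming a plain `queue.Queue`; the `Receiver` class and `_EndOfStream` sentinel are illustrative stand-ins, not exo's actual `MpReceiver`): a thread blocked on `get()` only wakes because `close()` enqueues the sentinel first.

```python
# Toy end-of-stream sentinel: close() pushes a sentinel so any thread
# blocked on queue.get() wakes up instead of hanging forever.
import queue
import threading

class _EndOfStream:
    """Sentinel marking stream close, mirroring the sender side."""

class Receiver:
    def __init__(self) -> None:
        self._q: queue.Queue = queue.Queue()

    def receive(self):
        item = self._q.get()           # blocks until data or sentinel
        if isinstance(item, _EndOfStream):
            return None                # stream closed
        return item

    def close(self) -> None:
        self._q.put(_EndOfStream())    # unblocks any waiting receive()

rx = Receiver()
results = []
t = threading.Thread(target=lambda: results.append(rx.receive()))
t.start()
rx.close()        # without the sentinel, t would block on get() forever
t.join(timeout=5)
```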
**Enabling peers to be discovered in environments where mDNS is unavailable (SSH sessions, headless servers, Docker).**

## Motivation

Exo discovers peers exclusively via mDNS, which works great on a local network but breaks once you move beyond a single L2 broadcast domain:

- SSH sessions on macOS — TCC blocks mDNS multicast from non-GUI sessions (#1488)
- Headless servers/rack machines — #1682 ("DGX Spark does not find other nodes")
- Docker Compose — mDNS is often unavailable across container networks; e.g. #1462 (E2E test framework) needs an alternative

Related work: #1488 (working implementation by @AlexCheema, closed because SSH had a GUI workaround), #1023 (Headscale WAN, closed due to merge conflicts), #1656 (discovery cleanup, open).

This PR introduces an optional bootstrap mechanism for peer discovery while leaving the existing mDNS behavior unchanged.

## Changes

Adds two new CLI flags:

- `--bootstrap-peers` (env: `EXO_BOOTSTRAP_PEERS`) — comma-separated libp2p multiaddrs to dial on startup and retry periodically
- `--libp2p-port` — fixed TCP port for libp2p to listen on (default: OS-assigned). Required when using bootstrap peers, so other nodes know which port to dial.

8 files:

- `rust/networking/src/discovery.rs`: store bootstrap addrs, dial in existing retry loop
- `rust/networking/src/swarm.rs`: thread `bootstrap_peers` parameter to `Behaviour`
- `rust/networking/examples/chatroom.rs`: updated call site for new `create_swarm` signature
- `rust/networking/tests/bootstrap_peers.rs`: integration tests
- `rust/exo_pyo3_bindings/src/networking.rs`: accept optional `bootstrap_peers` in PyO3 constructor
- `rust/exo_pyo3_bindings/exo_pyo3_bindings.pyi`: update type stub
- `src/exo/routing/router.py`: pass peers to `NetworkingHandle`
- `src/exo/main.py`: `--bootstrap-peers` CLI arg + `EXO_BOOTSTRAP_PEERS` env var

## Why It Works

Bootstrap peers are dialed in the existing retry loop — the same path taken by mDNS-discovered peers. The swarm handles connection, Noise handshake, and gossipsub mesh joining from there. A PeerId is intentionally not required in the multiaddr; the Noise handshake discovers it.

Docker Compose example:

```yaml
services:
  exo-1:
    environment:
      EXO_BOOTSTRAP_PEERS: "/ip4/exo-2/tcp/30000"
  exo-2:
    environment:
      EXO_BOOTSTRAP_PEERS: "/ip4/exo-1/tcp/30000"
```

## Test Plan

### Manual Testing

<details>
<summary>Docker Compose config</summary>

```yaml
services:
  exo-node1:
    build:
      context: .
      dockerfile: Dockerfile.bootstrap-test
    container_name: exo-bootstrap-node1
    hostname: exo-node1
    command: ["-q", "--libp2p-port", "30000", "--bootstrap-peers", "/ip4/172.30.20.3/tcp/30000"]
    environment:
      - EXO_LIBP2P_NAMESPACE=bootstrap-test
    ports:
      - "52415:52415"
    networks:
      bootstrap-net:
        ipv4_address: 172.30.20.2
    deploy:
      resources:
        limits:
          memory: 4g
  exo-node2:
    build:
      context: .
      dockerfile: Dockerfile.bootstrap-test
    container_name: exo-bootstrap-node2
    hostname: exo-node2
    command: ["-q", "--libp2p-port", "30000", "--bootstrap-peers", "/ip4/172.30.20.2/tcp/30000"]
    environment:
      - EXO_LIBP2P_NAMESPACE=bootstrap-test
    ports:
      - "52416:52415"
    networks:
      bootstrap-net:
        ipv4_address: 172.30.20.3
    deploy:
      resources:
        limits:
          memory: 4g
networks:
  bootstrap-net:
    driver: bridge
    ipam:
      config:
        - subnet: 172.30.20.0/24
```

</details>

Two containers on a bridge network (`172.30.20.0/24`), fixed IPs, `--libp2p-port 30000`, cross-referencing `--bootstrap-peers`. Both nodes found each other, established a connection, then ran the election protocol.

### Automated Testing

4 Rust integration tests in `rust/networking/tests/bootstrap_peers.rs` (`cargo test -p networking`):

| Test | What it verifies | Result |
|------|------------------|--------|
| `two_nodes_connect_via_bootstrap_peers` | Node B discovers Node A via bootstrap addr (real TCP connection) | PASS |
| `create_swarm_with_empty_bootstrap_peers` | Backward compatibility — no bootstrap peers works | PASS |
| `create_swarm_ignores_invalid_bootstrap_addrs` | Invalid multiaddrs silently filtered | PASS |
| `create_swarm_with_fixed_port` | `listen_port` parameter works | PASS |

All 4 pass. The connection test takes ~6s.

Signed-off-by: DeepZima <deepzima@outlook.com>
Co-authored-by: Evan <evanev7@gmail.com>
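The flag parsing with silent filtering of invalid entries (as exercised by `create_swarm_ignores_invalid_bootstrap_addrs`) can be sketched in Python — the validity check here is a simplified stand-in for real multiaddr parsing, and the function name is illustrative:

```python
# Parse EXO_BOOTSTRAP_PEERS: comma-separated libp2p multiaddrs,
# silently dropping entries that don't look like valid addresses.
def parse_bootstrap_peers(raw: str) -> list[str]:
    addrs = [a.strip() for a in raw.split(",") if a.strip()]
    # A real implementation would parse each entry as a Multiaddr;
    # here we only keep entries shaped like /ip4/.../tcp/... paths.
    return [a for a in addrs if a.startswith("/") and "/tcp/" in a]
```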
Motivation
We had no end-to-end testing for exo clusters. Unit tests can't catch issues in node discovery, master election, or multi-node coordination. We also need a framework for chaos testing (network failures, partitions, etc.) to build confidence in cluster resilience.
Changes
Adds a Python/asyncio E2E test framework that spins up 2-node exo clusters in Docker Compose:
- `e2e/Dockerfile` — Multi-stage build: Node.js dashboard → Rust nightly + Python 3.13. Cleans up Rust build artifacts to keep the image small. Includes a g++ wrapper for MLX CPU JIT compatibility with GCC 14.
- `e2e/conftest.py` — `Cluster` class wrapping docker compose: build, start, stop, logs, exec, `place_model`, `chat`. Async context manager with automatic cleanup.
- `e2e/run_all.py` — Test runner discovering `test_*.py` files. Supports `--slow`/`E2E_SLOW=1` to include inference tests.
- `test_cluster_formation` — Nodes discover each other via mDNS, elect a master, API responds.
- `test_no_internet` — iptables blocks all outbound traffic except private subnets and multicast. Verifies the cluster forms without internet and confirms connectivity is actually blocked (curl + exo's own "Internet connectivity: False" log).
- `test_inference_snapshot` (slow) — Launches `mlx-community/Qwen3-0.6B-4bit`, sends a chat completion with `seed=42, temperature=0`, verifies output matches a committed snapshot. Skipped in CI (x86 MLX CPU too slow); runs on Apple Silicon with `--slow`.
- `.github/workflows/e2e.yml` — CI workflow on push/PR. Frees disk space before the Docker build (Rust compilation is heavy).
Why It Works
- iptables is used for internet blocking rather than Docker's `internal: true` (which blocks multicast, breaking mDNS).
- `NET_ADMIN` capability lets containers set up firewall rules before starting exo.
- `.venv/bin/exo` is called directly instead of `uv run`, avoiding PyPI resolution at container startup.
- `mx.random.seed()` with `temperature=0` for reproducible output.
Test Plan
Manual Testing
All 3 tests pass locally on macOS (Apple Silicon, Docker Desktop):
Automated Testing
CI runs `cluster_formation` and `no_internet` (2/3 passed, 1 skipped): https://github.com/exo-explore/exo/actions/runs/21961324819