feat: Docker-based E2E test framework with chaos testing #1462

Open
AlexCheema wants to merge 23 commits into main from e2e-tests

Conversation

@AlexCheema
Contributor

@AlexCheema AlexCheema commented Feb 12, 2026

Motivation

We had no end-to-end testing for exo clusters. Unit tests can't catch issues in node discovery, master election, or multi-node coordination. We also need a framework for chaos testing (network failures, partitions, etc.) to build confidence in cluster resilience.

Changes

Adds a Python/asyncio E2E test framework that spins up 2-node exo clusters in Docker Compose:

  • e2e/Dockerfile — Multi-stage build: Node.js dashboard → Rust nightly + Python 3.13. Cleans up Rust build artifacts to keep the image small. Includes a g++ wrapper for MLX CPU JIT compatibility with GCC 14.
  • e2e/conftest.py: Cluster class wrapping docker compose (build, start, stop, logs, exec, place_model, chat). Async context manager with automatic cleanup.
  • e2e/run_all.py — Test runner discovering test_*.py files. Supports --slow / E2E_SLOW=1 to include inference tests.
  • 3 tests:
    • test_cluster_formation — Nodes discover each other via mDNS, elect a master, API responds.
    • test_no_internet — iptables blocks all outbound traffic except private subnets and multicast. Verifies cluster forms without internet, confirms connectivity is actually blocked (curl + exo's own "Internet connectivity: False" log).
    • test_inference_snapshot (slow) — Launches mlx-community/Qwen3-0.6B-4bit, sends a chat completion with seed=42, temperature=0, verifies output matches a committed snapshot. Skipped in CI (x86 MLX CPU too slow), runs on Apple Silicon with --slow.
  • .github/workflows/e2e.yml — CI workflow on push/PR. Frees disk space before Docker build (Rust compilation is heavy).
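
The "async context manager with automatic cleanup" idea behind the Cluster class can be sketched roughly as below. This is not the code in e2e/conftest.py: the constructor signature, the injectable `runner` argument, and the exact `docker compose` arguments are assumptions chosen so the sketch can be exercised without Docker.

```python
import asyncio


class Cluster:
    """Hypothetical sketch of an async wrapper around `docker compose`."""

    def __init__(self, compose_file, runner=None):
        self.compose_file = compose_file
        # The runner is injectable so the class can be exercised without Docker.
        self._run = runner or self._run_subprocess

    async def _run_subprocess(self, *args):
        proc = await asyncio.create_subprocess_exec(
            "docker", "compose", "-f", self.compose_file, *args,
            stdout=asyncio.subprocess.PIPE,
            stderr=asyncio.subprocess.STDOUT,
        )
        out, _ = await proc.communicate()
        if proc.returncode != 0:
            raise RuntimeError(f"docker compose {' '.join(args)} failed:\n{out.decode()}")
        return out.decode()

    async def __aenter__(self):
        await self._run("up", "-d")
        return self

    async def __aexit__(self, exc_type, exc, tb):
        # Always tear down, even when the test body raised.
        await self._run("down", "--volumes", "--remove-orphans")

    async def logs(self):
        return await self._run("logs", "--no-color")
```

A test would then use `async with Cluster("e2e/docker-compose.yml") as cluster:` and rely on `__aexit__` to stop the containers whether the body passes or fails.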

Why It Works

  • mDNS discovery works on Docker bridge networks — multicast 224.0.0.251 stays within the bridge.
  • No-internet isolation uses iptables (not Docker internal: true, which blocks multicast). NET_ADMIN capability lets containers set up firewall rules before starting exo.
  • .venv/bin/exo is called directly instead of uv run, avoiding PyPI resolution at container startup.
  • Deterministic inference — MLX respects mx.random.seed() with temperature=0 for reproducible output.
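
As a rough illustration of the no-internet isolation, the sketch below builds an iptables rule list that accepts loopback, private subnets, and multicast, then rejects everything else. The helper names and the exact CIDR ranges are assumptions for illustration, not the rules shipped in this PR.

```python
import ipaddress

# Destinations that stay reachable: RFC 1918 private ranges plus IPv4 multicast
# (mDNS uses 224.0.0.251). These ranges are assumptions for illustration.
ALLOWED_DESTINATIONS = ["10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16", "224.0.0.0/4"]


def no_internet_rules(allowed=ALLOWED_DESTINATIONS):
    """Return iptables argv lists: accept loopback and allowed ranges, reject the rest."""
    rules = [["iptables", "-A", "OUTPUT", "-o", "lo", "-j", "ACCEPT"]]
    for cidr in allowed:
        ipaddress.ip_network(cidr)  # raises ValueError on a malformed range
        rules.append(["iptables", "-A", "OUTPUT", "-d", cidr, "-j", "ACCEPT"])
    # REJECT (rather than DROP) makes blocked connections fail immediately.
    rules.append(["iptables", "-A", "OUTPUT", "-j", "REJECT"])
    return rules


def is_allowed(address, allowed=ALLOWED_DESTINATIONS):
    """Pure-Python check of which destinations the rules above would let through."""
    ip = ipaddress.ip_address(address)
    return ip.is_loopback or any(ip in ipaddress.ip_network(c) for c in allowed)
```

Running the returned commands requires the NET_ADMIN capability inside the container; `is_allowed` only mirrors the intended policy so it can be checked without touching the firewall.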

Test Plan

Manual Testing

All 3 tests pass locally on macOS (Apple Silicon, Docker Desktop):

$ python3 e2e/run_all.py --slow
PASSED: cluster_formation
PASSED: inference_snapshot
PASSED: no_internet
3/3 tests passed

Automated Testing

CI runs cluster_formation and no_internet (2/3 passed, 1 skipped):
https://github.com/exo-explore/exo/actions/runs/21961324819

@AlexCheema AlexCheema marked this pull request as draft February 12, 2026 18:48
@AlexCheema AlexCheema changed the title from "E2E Tests" to "feat: Docker-based E2E test framework with chaos testing" Feb 12, 2026
@AlexCheema AlexCheema marked this pull request as ready for review February 16, 2026 13:41
@AlexCheema AlexCheema enabled auto-merge (squash) February 16, 2026 17:49
AlexCheema and others added 22 commits February 16, 2026 10:00
Add a Python/asyncio E2E test framework that spins up 2-node exo clusters
in Docker Compose and verifies cluster formation, discovery, election, and
API health. Includes a no-internet chaos test using DNS blocking.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The runner was running out of disk space during the Docker image build
(Rust compilation + Python deps). Remove unused toolchains first.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Clean up Rust target/ and cargo registry after uv sync in the same RUN
command so build artifacts aren't committed to the layer (~1-2 GB saved).
Also remove more unused toolchains from the CI runner.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use iptables to block all outbound traffic except private subnets and
multicast (for mDNS discovery). Verify internet is blocked by curling
huggingface.co from inside each container and checking exo logs for
"Internet connectivity: False".

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Launch mlx-community/Qwen3-0.6B-4bit on the cluster, send a chat
completion with seed=42 and temperature=0, and verify the output
matches a committed snapshot. Tests inference determinism end-to-end.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
MLX CPU inference on x86_64 is too slow for CI runners (~10min+ for
a single request). Mark the inference snapshot test as slow so it's
skipped by default. Run with --slow or E2E_SLOW=1 on Apple Silicon.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…st collection

The tests/start_distributed_test.py script calls sys.exit() at module
level, which crashes pytest collection. Exclude it via collect_ignore.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add e2e/snapshot.py with assert_snapshot() for deterministic regression
testing. On first run, saves inference output as the expected snapshot.
On subsequent runs, compares against it with unified diff on mismatch.
Set UPDATE_SNAPSHOTS=1 or pass --update-snapshots to regenerate.
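
A minimal sketch of the behaviour described above (save on first run, unified diff on mismatch, regenerate when UPDATE_SNAPSHOTS=1). The signature, the JSON layout, and the `snapshot_dir` parameter are assumptions; e2e/snapshot.py may differ.

```python
import difflib
import json
import os
from pathlib import Path


def assert_snapshot(name, actual, snapshot_dir, update=None):
    """Compare `actual` text against a stored snapshot; save it on first run or update."""
    if update is None:
        update = os.environ.get("UPDATE_SNAPSHOTS") == "1"
    path = Path(snapshot_dir) / f"{name}.json"
    if update or not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps({"name": name, "output": actual}, indent=2))
        return
    expected = json.loads(path.read_text())["output"]
    if expected != actual:
        diff = "\n".join(
            difflib.unified_diff(
                expected.splitlines(),
                actual.splitlines(),
                fromfile="expected",
                tofile="actual",
                lineterm="",
            )
        )
        raise AssertionError(f"Snapshot mismatch for {name!r}:\n{diff}")
```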

Refactor test_inference_snapshot.py to use the shared infrastructure
and drop temperature=0 in favor of seed-only determinism.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…nd edge cases

Expand e2e snapshot coverage beyond the single 'What is 2+2?' test:
- test_snapshot_code_gen.py: code generation prompt (max_tokens=64)
- test_snapshot_reasoning.py: step-by-step math reasoning (max_tokens=64)
- test_snapshot_long_output.py: longer response with max_tokens=128
- test_snapshot_edge.py: single word, special chars, and unicode prompts

All use seed=42 and the shared assert_snapshot() infrastructure.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
MLX already supports x86 CPU via mlx[cpu] and the Dockerfile has the
GCC workaround for CPU JIT. The only barriers were the 'slow' markers
causing tests to be skipped in CI.

Changes:
- Remove 'slow' marker from all snapshot tests so they run by default
- Make snapshots architecture-aware (snapshots/{arch}/{name}.json) since
  floating-point results differ between x86_64 and arm64
- Store architecture in snapshot metadata
- Increase CI timeout from 30 to 45 minutes for model download + CPU inference
- Update docstrings to remove Apple Silicon requirement
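
The architecture-aware layout could be resolved along these lines. The alias table is an assumption (Linux containers on Apple Silicon typically report "aarch64" rather than "arm64"), and the function name is hypothetical.

```python
import platform
from pathlib import Path

# Assumption: fold common aliases so each architecture maps to one directory.
_ARCH_ALIASES = {"aarch64": "arm64", "amd64": "x86_64"}


def snapshot_path(root, name, arch=None):
    """Return snapshots/{arch}/{name}.json under `root` for the given or current arch."""
    arch = arch or platform.machine().lower()
    arch = _ARCH_ALIASES.get(arch, arch)
    return Path(root) / "snapshots" / arch / f"{name}.json"
```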

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pre-build the Docker image using docker/build-push-action with GitHub
Actions cache (type=gha). On cache hit, the image loads from cache
instead of rebuilding (~12min → seconds).

Changes:
- CI: set up buildx, build image with --cache-from/--cache-to type=gha
- docker-compose.yml: add image tag (exo-e2e:latest) so compose uses
  the pre-built image instead of rebuilding
- conftest.py: Cluster.build() skips if exo-e2e:latest already exists
  (pre-built in CI), falls back to docker compose build for local dev

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add e2e snapshot test that exercises 3 different model architectures
to catch model-specific regressions:
- SmolLM2-135M-Instruct (tiny llama, bf16, ~269MB)
- Llama-3.2-1B-Instruct-4bit (small llama, 4bit, ~730MB)
- gemma-2-2b-it-4bit (gemma2 architecture, 4bit, ~1.5GB)

Each model gets its own snapshot file. All use the same prompt
("What is the capital of France?"), seed=42, max_tokens=32.

Also adds model cards for SmolLM2-135M-Instruct and gemma-2-2b-it-4bit
(Llama-3.2-1B-Instruct-4bit already had one).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Two issues prevented MLX CPU from working on x86_64 in Docker:

1. Missing BLAS/LAPACK libraries: MLX CPU backend requires libblas-dev,
   liblapack-dev, and liblapacke-dev on Linux. Added to apt-get install.

2. g++ wrapper ordering: The -fpermissive wrapper for GCC 14 was installed
   AFTER uv sync, but MLX may compile extensions during install. Moved
   the wrapper BEFORE uv sync so both build-time and runtime JIT
   compilation benefit from the fix.

MLX publishes manylinux_2_35_x86_64 wheels, so this uses the native
CPU backend — no alternative inference framework needed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add proactive monitoring to detect runner process death and unresponsiveness:

- Health check loop polls is_alive() every 1s, detects unexpected exits
- Counter-based heartbeat detects frozen/unresponsive processes
- Emits RunnerFailed event and releases pending task waiters on failure
- Add EXO_RUNNER_MUST_DIE debug trigger for testing abrupt process death
- Add chaos E2E test that kills runner mid-inference
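
The counter-based heartbeat can be sketched as follows. The class name and the `max_stale_polls` threshold are hypothetical; in a real setup the counter would presumably live in shared memory (for example a `multiprocessing.Value`) incremented by the runner process and polled by the supervisor.

```python
class HeartbeatMonitor:
    """Flags a process as unresponsive when its heartbeat counter stops advancing."""

    def __init__(self, max_stale_polls=3):
        self.max_stale_polls = max_stale_polls
        self._last_seen = 0
        self._stale_polls = 0

    def observe(self, current):
        """Record one poll of the counter; return True once it looks frozen."""
        if current != self._last_seen:
            # The counter moved, so the process is making progress.
            self._last_seen = current
            self._stale_polls = 0
            return False
        if current > 0:  # ignore the period before the first heartbeat
            self._stale_polls += 1
        return self._stale_polls >= self.max_stale_polls
```

Comparing counter values instead of timestamps means the check does not depend on the two processes agreeing on a clock.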

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…lection

Add root conftest.py to exclude tests/start_distributed_test.py from
pytest collection (it calls sys.exit at module level). Fix ruff lint
issues (import sorting, f-string without placeholders, lambda loop
variable capture) and apply nix fmt formatting to e2e files.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Snapshot tests do MLX inference on x86 CPU in Docker which takes >600s
per test, causing the 45-minute CI job to timeout. Only cluster_formation
and no_internet (non-inference tests) should run in CI. Inference
snapshot tests can be run locally with --slow or E2E_SLOW=1.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Scope e2e workflow to only trigger on pushes to e2e-tests branch
  (not every branch push)
- Add temperature=0 to remaining snapshot test chat calls for
  deterministic output
- Make assert_snapshot fail when no baseline exists instead of silently
  creating one — baselines must be explicitly generated with
  UPDATE_SNAPSHOTS=1

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Docker mDNS discovery can be slow on first boot in CI, causing
cluster_formation to timeout on "Nodes discovered each other" while
subsequent tests pass fine. Retry failed tests once before counting
them as real failures.
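
A minimal sketch of the retry-once policy, assuming each test is an async callable that raises on failure. The function name and return shape are hypothetical, not taken from run_all.py.

```python
import asyncio


async def run_with_retry(test, retries=1):
    """Run an async test callable; return (passed, attempts_used)."""
    for attempt in range(1, retries + 2):
        try:
            await test()
            return True, attempt
        except Exception:
            if attempt > retries:
                # Out of retries: report a real failure.
                return False, attempt
```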

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
After merging main (api cancellation #1276), the RunnerSupervisor
dataclass requires a _cancel_sender field. Update the test helper
to create and pass this channel.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@AlexCheema
Contributor Author

Code Review: Docker-based E2E test framework with chaos testing

Large, well-engineered PR (24 files, +1399/-5, 23 commits) adding E2E test infrastructure plus RunnerSupervisor health check.

CI Failures

e2e jobs (FAILING): test_inference_snapshot fails — no x86_64 snapshot baseline is committed. The test is NOT marked slow, so it runs in CI. Fix: either generate and commit x86_64 baselines (UPDATE_SNAPSHOTS=1) or mark the test as slow.

aarch64-darwin (HANGING ~6 hours): Root cause is the MpReceiver.close() thread hang bug — _forward_events() → receive_async() → abandoned thread blocked on queue.get() that never receives _MpEndOfStream. Fixed in PR #1511. Once #1511 is merged and this PR is rebased, the hang should be resolved.

E2E Framework — Well-Designed

  • Cluster class (conftest.py): Clean async wrapper around docker compose with automatic cleanup, smart CI optimization (skips build if image exists), good failure output
  • Dockerfile: Multi-stage build, aggressive Rust cleanup, BLAS/LAPACK for MLX CPU on Linux. The g++ -fpermissive wrapper is acceptable for a test image but should note which MLX version requires it
  • Snapshot infrastructure: Architecture-aware (x86_64/arm64), explicit baseline requirement, clear diff output on mismatch
  • docker-compose: Two services, same image, isolated namespace. No-internet override uses iptables REJECT (not DROP) for fast test feedback

E2E Tests — Good Coverage

| Test | Verdict |
|------|---------|
| test_cluster_formation | Good — minimal smoke test |
| test_no_internet | Good — validates iptables blocks + exo detects no connectivity |
| test_inference_snapshot | Needs x86_64 baseline committed |
| test_runner_chaos | Good — validates health check E2E via EXO_RUNNER_MUST_DIE |
| Snapshot variants (5 files) | Good coverage, somewhat repetitive boilerplate |

RunnerSupervisor Health Check — Well-Designed, Minor Concerns

Good:

  • Counter-based heartbeat (avoids clock skew), daemon thread, 0.5s interval
  • _death_handled flag prevents race between health check and event forwarding
  • Comprehensive unit tests (6 test cases, real multiprocessing)

Concerns:

  1. Sync shutdown() in async context: _handle_process_exit() and _handle_unresponsive() are sync methods that call self.shutdown() (which does runner_process.join(1) — blocking for up to 1s). These are called from _health_check() which is an async coroutine, so they block the event loop. Consider wrapping in to_thread.run_sync().

  2. Two overlapping failure-handling paths: _check_runner() (from _forward_events) and _handle_process_exit() (from _health_check) both handle process death. The _death_handled flag coordinates them, but having two code paths for the same failure is fragile. Consider consolidating.

  3. Pre-heartbeat freeze gap: If the process freezes before the heartbeat thread starts (e.g., during import), heartbeat.value stays at 0, and the if current > 0 guard causes the health check to ignore it. The is_alive() check won't detect a live-but-frozen process. Known limitation, acceptable for now.
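
To illustrate concern 1, here is a small sketch of moving a blocking shutdown off the event loop. It uses the standard library's `asyncio.to_thread` as a stand-in for the `to_thread.run_sync()` call suggested above, and `blocking_shutdown` is a hypothetical placeholder for the real `shutdown()`.

```python
import asyncio
import time


def blocking_shutdown(join_timeout=0.2):
    """Stand-in for a sync shutdown that blocks, e.g. on process.join(timeout)."""
    time.sleep(join_timeout)
    return "stopped"


async def handle_process_exit():
    # Offload the blocking call so the event loop keeps servicing other tasks.
    return await asyncio.to_thread(blocking_shutdown)


async def demo():
    ticks = 0

    async def ticker():
        nonlocal ticks
        while True:
            await asyncio.sleep(0.01)
            ticks += 1

    task = asyncio.create_task(ticker())
    result = await handle_process_exit()
    task.cancel()
    return result, ticks
```

In `demo()`, the ticker keeps advancing while the shutdown runs in a worker thread; calling `blocking_shutdown()` directly from the coroutine would stall it for the full join timeout.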

Security Note

Debug triggers (EXO_RUNNER_MUST_DIE, MUST_FAIL, MUST_OOM, MUST_TIMEOUT) are gated on prompt content, not environment variables. Any API user can kill a runner by sending a specific prompt. Consider gating on EXO_ENABLE_DEBUG_TRIGGERS=1 for production safety.
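
One possible shape for that gate is sketched below. The function name is hypothetical, and only the trigger spelled out in full above is listed; the others would be added the same way.

```python
import os

# Only the trigger named in full in this review; extend as needed.
DEBUG_TRIGGERS = ("EXO_RUNNER_MUST_DIE",)


def active_debug_trigger(prompt, environ=os.environ):
    """Return the trigger found in `prompt`, or None unless triggers are enabled."""
    if environ.get("EXO_ENABLE_DEBUG_TRIGGERS") != "1":
        return None
    for trigger in DEBUG_TRIGGERS:
        if trigger in prompt:
            return trigger
    return None
```

With this gate, a prompt containing a trigger string is treated as ordinary text unless the operator has explicitly opted in through the environment.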

Minor

  • Hard-coded port 52415 in conftest.py — extract to a constant
  • run_all.py slow detection parses docstrings — fragile, consider filename convention or pytest markers
  • 23 commits, many incremental CI fixes — squash before merge

Verdict

Substantial, well-engineered addition to the project. The E2E framework is clean and the RunnerSupervisor health check is sound. Two blockers:

  1. Fix e2e CI: Generate x86_64 snapshot baselines or mark test_inference_snapshot as slow
  2. Rebase onto #1511 (fix: unblock MpReceiver.close() to prevent shutdown hang): the aarch64-darwin hang is caused by the MpReceiver.close() bug, which that PR fixes

Once those are resolved, recommend squashing the 23 commits and merging.


Review only — not a merge approval.

rltakashige added a commit that referenced this pull request Feb 18, 2026
## Summary

- `MpReceiver.close()` did not unblock threads stuck on `queue.get()` in
`receive_async()`, causing abandoned threads (via
`abandon_on_cancel=True`) to keep the Python process alive indefinitely
after tests pass
- This caused the `aarch64-darwin` CI jobs in PR #1462 to hang for ~6
hours until the GitHub Actions timeout killed them
- Sends an `_MpEndOfStream` sentinel before closing the buffer,
mirroring what `MpSender.close()` already does
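
The sentinel pattern can be illustrated with a simplified, thread-based sketch. It uses `queue.Queue` instead of the multiprocessing buffer, and the `Receiver` and `_EndOfStream` names are stand-ins rather than the real `MpReceiver` and `_MpEndOfStream`.

```python
import queue
import threading


class _EndOfStream:
    """Sentinel type marking that no more items will arrive."""


class Receiver:
    def __init__(self):
        self._queue = queue.Queue()

    def send(self, item):
        self._queue.put(item)

    def receive(self):
        item = self._queue.get()  # blocks until an item or the sentinel arrives
        if isinstance(item, _EndOfStream):
            raise EOFError("receiver closed")
        return item

    def close(self):
        # Wake any thread blocked in receive() instead of leaving it stuck forever.
        self._queue.put(_EndOfStream())
```

Without the sentinel in `close()`, a thread already blocked in `receive()` has nothing to wake it, which is the shape of the hang described above.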

## Test plan

- [x] `uv run basedpyright` — 0 errors
- [x] `uv run ruff check` — clean
- [x] `nix fmt` — 0 changed
- [x] `uv run pytest` — 188 passed, 1 skipped in 12s (no hang)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: rltakashige <rl.takashige@gmail.com>
Co-authored-by: Ryuichi Leo Takashige <leo@exolabs.net>
Evanev7 added a commit that referenced this pull request Mar 25, 2026
**Enabling peers to be discovered in environments where mDNS is
unavailable (SSH sessions, headless servers, Docker).**

## Motivation
Exo discovers peers exclusively via mDNS, which works great on a local
network but breaks once you move beyond a single L2 broadcast domain:

- SSH sessions on macOS — TCC blocks mDNS multicast from non-GUI
sessions (#1488)
- Headless servers/rack machines — #1682 ("DGX Spark does not find other
nodes")
- Docker Compose — mDNS is often unavailable across container networks;
e.g. #1462 (E2E test framework) needs an alternative

Related work:
#1488 (working implementation by @AlexCheema, closed because SSH
had a GUI workaround),
#1023 (Headscale WAN, later closed due to merge conflicts),
#1656 (discovery cleanup, open).

This PR introduces an optional bootstrap mechanism for peer discovery
while leaving the existing mDNS behavior unchanged.

## Changes
Adds two new CLI flags:

- `--bootstrap-peers` (env: `EXO_BOOTSTRAP_PEERS`) — comma-separated
libp2p multiaddrs to dial on startup and retry periodically
- `--libp2p-port` — fixed TCP port for libp2p to listen on (default:
OS-assigned). Required when using bootstrap peers, so other nodes know
which port to dial.
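
As a rough illustration of how a comma-separated `EXO_BOOTSTRAP_PEERS` value might be split and filtered, here is a Python sketch. It is not the Rust implementation in `discovery.rs`: the function name is hypothetical, and it only recognises the `/ip4/<addr>/tcp/<port>` form, whereas real multiaddrs support many more protocols.

```python
import ipaddress


def parse_bootstrap_peers(value):
    """Split a comma-separated list and keep well-formed /ip4/<addr>/tcp/<port> entries."""
    peers = []
    for raw in value.split(","):
        parts = raw.strip().split("/")
        # A valid entry splits into ["", "ip4", "<addr>", "tcp", "<port>"].
        if len(parts) != 5 or parts[0] != "" or parts[1] != "ip4" or parts[3] != "tcp":
            continue
        try:
            ipaddress.IPv4Address(parts[2])
            port = int(parts[4])
        except ValueError:
            continue  # skip entries with a bad address or port
        if 0 < port < 65536:
            peers.append((parts[2], port))
    return peers
```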

8 files: 
- `rust/networking/src/discovery.rs`: Store bootstrap addrs, dial in
existing retry loop
- `rust/networking/src/swarm.rs`: Thread `bootstrap_peers` parameter to
`Behaviour`
- `rust/networking/examples/chatroom.rs`: Updated call site for new
create_swarm signature
- `rust/networking/tests/bootstrap_peers.rs`: Integration tests
- `rust/exo_pyo3_bindings/src/networking.rs`: Accept optional
`bootstrap_peers` in PyO3 constructor
- `rust/exo_pyo3_bindings/exo_pyo3_bindings.pyi` : Update type stub 
- `src/exo/routing/router.py`: Pass peers to `NetworkingHandle` 
- `src/exo/main.py` : `--bootstrap-peers` CLI arg +
`EXO_BOOTSTRAP_PEERS` env var

## Why It Works

Bootstrap peers are dialed in the existing retry loop — the same path
taken by peers when mDNS-discovered. The swarm handles connection, Noise
handshake, and gossipsub mesh joining from there.

PeerId is intentionally not required in the multiaddr; the Noise
handshake discovers it.

Docker Compose example:

```yaml
services:
  exo-1:
    environment:
      EXO_BOOTSTRAP_PEERS: "/ip4/exo-2/tcp/30000"
  exo-2:
    environment:
      EXO_BOOTSTRAP_PEERS: "/ip4/exo-1/tcp/30000"
```

## Test Plan

### Manual Testing
<details>
<summary>Docker Compose config</summary>

```
services:
  exo-node1:
    build:
      context: .
      dockerfile: Dockerfile.bootstrap-test
    container_name: exo-bootstrap-node1
    hostname: exo-node1
    command: ["-q", "--libp2p-port", "30000", "--bootstrap-peers", "/ip4/172.30.20.3/tcp/30000"]
    environment:
      - EXO_LIBP2P_NAMESPACE=bootstrap-test
    ports:
      - "52415:52415"
    networks:
      bootstrap-net:
        ipv4_address: 172.30.20.2
    deploy:
      resources:
        limits:
          memory: 4g

  exo-node2:
    build:
      context: .
      dockerfile: Dockerfile.bootstrap-test
    container_name: exo-bootstrap-node2
    hostname: exo-node2
    command: ["-q", "--libp2p-port", "30000", "--bootstrap-peers", "/ip4/172.30.20.2/tcp/30000"]
    environment:
      - EXO_LIBP2P_NAMESPACE=bootstrap-test
    ports:
      - "52416:52415"
    networks:
      bootstrap-net:
        ipv4_address: 172.30.20.3
    deploy:
      resources:
        limits:
          memory: 4g

networks:
  bootstrap-net:
    driver: bridge
    ipam:
      config:
        - subnet: 172.30.20.0/24
```
</details> 

Two containers on a bridge network (`172.30.20.0/24`), fixed IPs,
`--libp2p-port 30000`, cross-referencing `--bootstrap-peers`.

Both nodes found each other, established a connection, and then ran the
election protocol.

### Automated Testing

4 Rust integration tests in `rust/networking/tests/bootstrap_peers.rs`
(`cargo test -p networking`):

| Test | What it verifies | Result |
|------|-----------------|--------|
| `two_nodes_connect_via_bootstrap_peers` | Node B discovers Node A via bootstrap addr (real TCP connection) | PASS |
| `create_swarm_with_empty_bootstrap_peers` | Backward compatibility — no bootstrap peers works | PASS |
| `create_swarm_ignores_invalid_bootstrap_addrs` | Invalid multiaddrs silently filtered | PASS |
| `create_swarm_with_fixed_port` | `listen_port` parameter works | PASS |

All 4 pass. The connection test takes ~6s.

---------

Signed-off-by: DeepZima <deepzima@outlook.com>
Co-authored-by: Evan <evanev7@gmail.com>
ttupper92618 pushed a commit to Foxlight-Foundation/Skulk that referenced this pull request Mar 30, 2026