rollout: move engine base_port from 15000 to 17500 (above mooncake RPC range) by DavidBellamy · Pull Request #7 · LLM360/miles

DavidBellamy · 2026-04-18T22:57:53Z

Summary

Change the default base_port in miles/ray/rollout.py from 15000 to 17500. The previous default fully overlapped Mooncake's RPC handshake range (rpc_min_port=15000..rpc_max_port=17000), leading to intermittent EADDRINUSE on SGLang engine boot.

Impact

On PD-disaggregation runs, Mooncake's TransferEngine starts before SGLang's Uvicorn on each engine, so it can grab a port inside the miles-allocated range first. When SGLang then tries to bind its HTTP server on the same port, it gets EADDRINUSE and the engine dies. The containing RolloutManager then waits forever for the missing engine; no train_step fires and the run times out silently.
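The bind failure itself can be reproduced in isolation. The sketch below is not Miles, SGLang, or Mooncake code: it uses two plain TCP sockets on the loopback interface, with an OS-assigned port standing in for the contested one, and the helper name `second_bind_errno` is hypothetical. It only illustrates that whichever process binds second gets EADDRINUSE.

```python
import errno
import socket


def second_bind_errno() -> int:
    """Bind two sockets to one address; return the errno of the second bind (0 if it succeeded)."""
    first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        # Stand-in for the process that starts first and grabs the port.
        first.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
        first.listen(1)
        port = first.getsockname()[1]
        try:
            # Stand-in for the HTTP server that tries to bind second.
            second.bind(("127.0.0.1", port))
        except OSError as exc:
            return exc.errno
        return 0
    finally:
        first.close()
        second.close()


if __name__ == "__main__":
    # Print the symbolic name of the error code, if any.
    print(errno.errorcode.get(second_bind_errno(), "bind succeeded"))
```

On Linux, EADDRINUSE is errno 98, which matches the `[Errno 98]` in the engine-death signature.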

Observed on LLM360/RL360 jobs 1564764 (port 15082) and 1565161 (port 15079). The engine-death signature in the slurm log is:

ERROR: [Errno 98] error while attempting to bind on address ('10.24.1.22', 15079): address already in use
INFO: Application shutdown complete.

followed by 500+ log-spam lines from sibling engines whose _wait_server_healthy poll landed on a mooncake handshake socket:

readString: too large length from socket: 7018130145941931335
SocketHandShakePlugin: failed to receive handshake message, malformed json format:
  * Line 1, Column 1   Syntax error: value, object or array expected.
, json string length: 0, json string content:

7018130145941931335, read as a little-endian 64-bit integer, is the ASCII bytes "GET /hea": the first 8 bytes of GET /health_generate, which the handshake listener consumed as a length prefix.
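The decoding can be checked directly. This is a minimal sketch that assumes the length field is an 8-byte little-endian unsigned integer, as the observed value suggests; `decode_le_u64` is a hypothetical helper name, not part of Mooncake or Miles.

```python
def decode_le_u64(value: int) -> bytes:
    """Return the 8 bytes that encode `value` as a little-endian unsigned 64-bit integer."""
    return value.to_bytes(8, byteorder="little")


observed = 7018130145941931335  # "length" reported in the readString log line
prefix = decode_le_u64(observed)

print(prefix)  # b'GET /hea'
# The same 8 bytes open the health-check request line.
print(b"GET /health_generate".startswith(prefix))  # True
```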

Choice of 17500

17500 satisfies all three port-range constraints simultaneously (two existing, one new):

| Constraint | Source | Check |
| --- | --- | --- |
| `< 32768` | avoid ephemeral range (32768-65535) | 17500 ✓ |
| `> 10002` (with margin) | Ray uses 10002-19999, comment says "avoid near-10002" | 17500 ✓ |
| `> 17000` | NEW: avoid Mooncake RPC range 15000-17000 | 17500 ✓ |

The existing allocation loop reserves roughly 16 engines * (3-8 ports each) ≈ 50-130 ports, which fits comfortably within the 2,500 ports of 17500..19999.
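The constraints and the port budget can be expressed as a quick check. The sketch below is illustrative only: the constants come from this description, the helper names are hypothetical, and it assumes ports are handed out contiguously from base_port, which may not match the real allocation loop in miles/ray/rollout.py.

```python
EPHEMERAL_MIN = 32768          # start of the ephemeral range (32768-65535)
RANGE_CEILING = 19999          # upper end of the 17500..19999 window named above
MOONCAKE_RPC = (15000, 17000)  # rpc_min_port..rpc_max_port


def allocation_span(base_port, engines=16, ports_per_engine=8):
    """Inclusive (first, last) port span, assuming contiguous allocation (an assumption)."""
    return base_port, base_port + engines * ports_per_engine - 1


def overlaps(a, b):
    """True if two inclusive (lo, hi) ranges share at least one port."""
    return a[0] <= b[1] and b[0] <= a[1]


def base_port_ok(base_port, engines=16, ports_per_engine=8):
    """Check the worst-case span against the three constraints listed above."""
    span = allocation_span(base_port, engines, ports_per_engine)
    return (
        span[1] < EPHEMERAL_MIN
        and span[1] <= RANGE_CEILING
        and not overlaps(span, MOONCAKE_RPC)
    )


if __name__ == "__main__":
    for candidate in (15000, 17500):
        print(candidate, allocation_span(candidate), base_port_ok(candidate))
```

With the worst case of 8 ports per engine, 16 engines starting at 17500 would end at 17627, well short of 19999.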

Back-compat

No behavior change for callers that pass base_port explicitly. Only the default moves.

Related

LLM360/RL360#86: downstream issue documenting the overlap.
LLM360/RL360#82: related but distinct port-zombie bug (also caused EADDRINUSE, different root cause).
Prior upstream-relevant Miles PR that followed this same deploy-rebuild path: LLM360/miles#6 (TITO role widening), which landed on deploy within 30 min of merge.

Tested

python -c "import ast; ast.parse(open('miles/ray/rollout.py').read())": parses clean.
Full miles/ray/rollout.py module imports without error (no new imports added).
Full integration test will run on LLM360/RL360's next pd-hicache-l3 + cuda-graph=true iter job once this lands on deploy.
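For the syntax check, note that ast.parse takes source text rather than a path. A path-like string such as 'miles/ray/rollout.py' is itself a valid Python expression (two divisions and an attribute access), so it parses without the file ever being read. A minimal sketch of a file-based check follows; `parses_clean` is a hypothetical helper name.

```python
import ast
from pathlib import Path


def parses_clean(path) -> bool:
    """Return True if the file at `path` contains syntactically valid Python."""
    try:
        # Read the file's contents; passing the path string itself would
        # parse the path as an expression instead of the file.
        ast.parse(Path(path).read_text(), filename=str(path))
    except SyntaxError:
        return False
    return True


if __name__ == "__main__":
    print(parses_clean("miles/ray/rollout.py"))
```

`python -m py_compile miles/ray/rollout.py` is an equivalent one-liner from the standard library.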

Miles allocates SGLang engine ports (server, nccl, dist_init, dp_attention, engine_info_bootstrap) starting at base_port. The previous 15000 default fully overlapped Mooncake's RPC handshake range (rpc_min_port=15000..rpc_max_port=17000, from mooncake-transfer-engine/include/config.h).

On PD-disaggregation runs, Mooncake's TransferEngine starts before SGLang's Uvicorn on each engine, so it can grab a port inside the miles-allocated range first. When sglang then tries to bind its HTTP server on the same port it gets EADDRINUSE and the engine dies.

Downstream symptom: RolloutManager silently waits forever for the missing engine; no train_step fires.

Secondary observable (noisy but benign): miles' _wait_server_healthy 2s poll (GET /health_generate) against the dead port lands on whichever process grabbed it (usually a sibling engine's mooncake handshake listener), producing log spam:

readString: too large length from socket: 7018130145941931335
SocketHandShakePlugin: failed to receive handshake message, malformed json format ... json string length: 0, json string content:

where 7018130145941931335 decodes as 8 LE ASCII bytes = 'GET /hea'.

17500 satisfies all three existing constraints:
- < 32768 (below ephemeral range)
- > 10002 with margin (clear of Ray 10002-19999 racing near 10002)
- > 17000 (NEW: clear of mooncake RPC range)

No behavior change for callers that pass base_port explicitly.

Observed on LLM360/RL360 jobs 1564764 (port 15082) and 1565161 (port 15079).

DavidBellamy wants to merge 1 commit into main from fix/move-base-port-above-mooncake-rpc
