
rollout: move engine base_port from 15000 to 17500 (above mooncake RPC range) #7

Open
DavidBellamy wants to merge 1 commit into main from
fix/move-base-port-above-mooncake-rpc


@DavidBellamy
Collaborator

Summary

Change the default base_port in miles/ray/rollout.py from 15000 to 17500. The previous default fully overlapped Mooncake's RPC handshake range (rpc_min_port=15000..rpc_max_port=17000), leading to intermittent EADDRINUSE on SGLang engine boot.

Impact

On PD-disaggregation runs, Mooncake's TransferEngine starts before SGLang's Uvicorn on each engine, so it can grab a port inside the miles-allocated range first. When SGLang then tries to bind its HTTP server on the same port, it gets EADDRINUSE and the engine dies. The containing RolloutManager then waits forever for the missing engine; no train_step fires and the run times out silently.
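The failure mode can be reproduced in isolation. The sketch below is illustrative only: the helper names `try_bind` and `second_bind_errno` are hypothetical and not part of miles, SGLang, or Mooncake. The first listener stands in for Mooncake's handshake socket and the second for the SGLang HTTP server.

```python
import errno
import socket
from typing import Optional


def try_bind(host: str, port: int) -> socket.socket:
    """Bind a TCP socket to (host, port) and start listening."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        sock.listen()
    except OSError:
        sock.close()
        raise
    return sock


def second_bind_errno(host: str = "127.0.0.1") -> Optional[int]:
    """Return the errno raised when a second listener binds an occupied port."""
    first = try_bind(host, 0)  # port 0: let the OS pick a free port
    port = first.getsockname()[1]
    try:
        second = try_bind(host, port)  # stands in for the late-starting server
    except OSError as exc:
        return exc.errno
    else:
        second.close()
        return None
    finally:
        first.close()


if __name__ == "__main__":
    err = second_bind_errno()
    print(err == errno.EADDRINUSE)
```

Whichever process binds first wins; the loser has no way to recover short of picking another port, which is why moving the default range is the simpler fix.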

Observed on LLM360/RL360 jobs 1564764 (port 15082) and 1565161 (port 15079). The engine-death signature in the slurm log is:

ERROR: [Errno 98] error while attempting to bind on address ('10.24.1.22', 15079): address already in use
INFO: Application shutdown complete.

followed by 500+ log-spam lines from sibling engines whose _wait_server_healthy poll landed on a mooncake handshake socket:

readString: too large length from socket: 7018130145941931335
SocketHandShakePlugin: failed to receive handshake message, malformed json format:
  * Line 1, Column 1   Syntax error: value, object or array expected.
, json string length: 0, json string content:

Read as a little-endian 64-bit integer, 7018130145941931335 is the ASCII bytes "GET /hea", the first 8 bytes of GET /health_generate.
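The decoding can be checked in a few lines. That the handshake listener reads the first 8 bytes of the stream as a little-endian length prefix is inferred from the log message above, not taken from Mooncake's source.

```python
# The "length" reported in the readString log line.
length_field = 7018130145941931335

# Interpret it as 8 little-endian bytes, as a length-prefixed reader would have
# received them off the wire (assumption inferred from the log output).
as_bytes = length_field.to_bytes(8, byteorder="little")
print(as_bytes)

# Going the other way: the first 8 bytes of the health-check request line.
request_prefix = b"GET /health_generate"[:8]
round_trip = int.from_bytes(request_prefix, byteorder="little")
print(round_trip)
```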

Choice of 17500

Satisfies all three port-range constraints simultaneously (two existing, one new):

| Constraint | Source | Check |
|---|---|---|
| < 32768 | avoid ephemeral range (32768-65535) | 17500 ✓ |
| > 10002 (with margin) | Ray uses 10002-19999, comment says "avoid near-10002" | 17500 ✓ |
| > 17000 | NEW: avoid Mooncake RPC range 15000-17000 | 17500 ✓ |

The existing allocation loop reserves roughly 16 engines * (3-8 ports each) ≈ 50-150 ports, which fits comfortably within the 2,500 ports of 17500..19999.
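As a sanity check, the constraints can be written as a small validator. This is a sketch, not code from miles: `base_port_problems` and the constant names are hypothetical, the ranges are the ones quoted in this description, and the unspecified "margin" above 10002 is reduced to a strict inequality.

```python
from typing import List

EPHEMERAL_MIN = 32768                 # start of the ephemeral range (32768-65535)
RAY_RANGE = (10002, 19999)            # Ray's range as quoted in this PR
MOONCAKE_RPC_RANGE = (15000, 17000)   # rpc_min_port..rpc_max_port


def base_port_problems(base_port: int, ports_needed: int) -> List[str]:
    """Return the violated constraints; an empty list means the choice is fine."""
    last_port = base_port + ports_needed - 1
    problems = []
    if last_port >= EPHEMERAL_MIN:
        problems.append("reaches the ephemeral range")
    if base_port <= RAY_RANGE[0]:
        problems.append("starts at or below 10002")
    if last_port > RAY_RANGE[1]:
        problems.append("runs past 19999")
    # Two closed intervals overlap when each one starts before the other ends.
    if base_port <= MOONCAKE_RPC_RANGE[1] and last_port >= MOONCAKE_RPC_RANGE[0]:
        problems.append("overlaps the Mooncake RPC range")
    return problems


if __name__ == "__main__":
    print(base_port_problems(17500, 150))  # new default, upper port estimate
    print(base_port_problems(15000, 150))  # old default
```

The new default passes with the upper estimate of 150 ports, while the old default is flagged for overlapping the Mooncake range.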

Back-compat

No behavior change for callers that pass base_port explicitly. Only the default moves.

Related

  • LLM360/RL360#86 — downstream issue documenting the overlap.
  • LLM360/RL360#82 — related but distinct port-zombie bug (also caused EADDRINUSE, different root cause).
  • Prior upstream-relevant Miles PRs that followed this same deploy-rebuild path: LLM360/miles#6 (TITO role widening) landed on deploy within 30 min of merge.

Tested

  • python -c "import ast; ast.parse(open('miles/ray/rollout.py').read())" parses clean.
  • Full miles/ray/rollout.py module imports without error (no new imports added).
  • Full integration test will run on LLM360/RL360's next pd-hicache-l3 + cuda-graph=true iter job once this lands on deploy.

Miles allocates SGLang engine ports (server, nccl, dist_init, dp_attention,
engine_info_bootstrap) starting at base_port. The previous 15000 default
fully overlapped Mooncake's RPC handshake range
(rpc_min_port=15000..rpc_max_port=17000, from
mooncake-transfer-engine/include/config.h). On PD-disaggregation runs,
Mooncake's TransferEngine starts before SGLang's Uvicorn on each engine,
so it can grab a port inside the miles-allocated range first. When
SGLang then tries to bind its HTTP server on the same port it gets
EADDRINUSE and the engine dies. Downstream symptom: RolloutManager
silently waits forever for the missing engine; no train_step fires.

Secondary observable (noisy but benign): miles' _wait_server_healthy
2s poll (GET /health_generate) against the dead port lands on whichever
process grabbed it (usually a sibling engine's mooncake handshake
listener), producing log spam:

  readString: too large length from socket: 7018130145941931335
  SocketHandShakePlugin: failed to receive handshake message,
  malformed json format ... json string length: 0, json string content:

where 7018130145941931335 decodes as 8 LE ASCII bytes = 'GET /hea'.

17500 satisfies all three existing constraints:
- < 32768 (below ephemeral range)
- > 10002 with margin (Ray uses 10002-19999; stay clear of races near 10002)
- > 17000 (NEW: clear of mooncake RPC range)

No behavior change for callers that pass base_port explicitly.

Observed on LLM360/RL360 jobs 1564764 (port 15082) and 1565161
(port 15079).