rollout: move engine base_port from 15000 to 17500 (above mooncake RPC range) #7
Open
DavidBellamy wants to merge 1 commit into main
Miles allocates SGLang engine ports (server, nccl, dist_init, dp_attention, engine_info_bootstrap) starting at base_port. The previous 15000 default fully overlapped Mooncake's RPC handshake range (rpc_min_port=15000..rpc_max_port=17000, from mooncake-transfer-engine/include/config.h).

On PD-disaggregation runs, Mooncake's TransferEngine starts before SGLang's Uvicorn on each engine, so it can grab a port inside the miles-allocated range first. When sglang then tries to bind its HTTP server on the same port it gets EADDRINUSE and the engine dies.

Downstream symptom: RolloutManager silently waits forever for the missing engine; no train_step fires.

Secondary observable (noisy but benign): miles' _wait_server_healthy 2s poll (GET /health_generate) against the dead port lands on whichever process grabbed it (usually a sibling engine's mooncake handshake listener), producing log spam:

    readString: too large length from socket: 7018130145941931335
    SocketHandShakePlugin: failed to receive handshake message, malformed json format ... json string length: 0, json string content:

where 7018130145941931335 decodes as 8 LE ASCII bytes = 'GET /hea'.

17500 satisfies all three existing constraints:
- < 32768 (below ephemeral range)
- > 10002 with margin (clear of Ray 10002-19999 racing near 10002)
- > 17000 (NEW: clear of mooncake RPC range)

No behavior change for callers that pass base_port explicitly.

Observed on LLM360/RL360 jobs 1564764 (port 15082) and 1565161 (port 15079).
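
The decoding of the logged length value can be reproduced in a few lines of Python. This is a minimal sketch, not code from the PR: the helper name decode_bogus_length is hypothetical, and it assumes the handshake reader interprets the first 8 bytes of the incoming stream as a little-endian unsigned 64-bit length, which is what the quoted value implies.

```python
import struct


def decode_bogus_length(length: int) -> bytes:
    """Return the 8 bytes that a little-endian uint64 reader saw as `length`."""
    return length.to_bytes(8, byteorder="little")


raw = decode_bogus_length(7018130145941931335)
print(raw)  # b'GET /hea'

# Round trip: the first 8 bytes of the health-check request line,
# read as a little-endian uint64, give back the logged value.
request_prefix = b"GET /health_generate"[:8]
(as_length,) = struct.unpack("<Q", request_prefix)
print(as_length)  # 7018130145941931335
```

Seeing an HTTP verb inside a "length" field is a quick way to recognise that an HTTP client has connected to a non-HTTP listener.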
Summary
Change the default base_port in miles/ray/rollout.py from 15000 to 17500. The previous default fully overlapped Mooncake's RPC handshake range (rpc_min_port=15000..rpc_max_port=17000), leading to intermittent EADDRINUSE on SGLang engine boot.

Impact
On PD-disaggregation runs, Mooncake's TransferEngine starts before SGLang's Uvicorn on each engine. It can grab a port inside the miles-allocated range first. When SGLang then tries to bind its HTTP server on the same port, it gets EADDRINUSE and the engine dies. The containing RolloutManager then waits forever for the missing engine; no train_step fires and the run times out silently.

Observed on LLM360/RL360 jobs 1564764 (port 15082) and 1565161 (port 15079). In the slurm log, the engine-death signature is followed by 500+ log-spam lines from sibling engines whose _wait_server_healthy poll landed on a mooncake handshake socket. The length value 7018130145941931335 reported in those lines decodes as LE bytes = "GET /hea", the first 8 bytes of GET /health_generate.

Choice of 17500
Satisfies all three existing port-range constraints simultaneously:
- < 32768
- > 10002 (with margin)
- > 17000

The existing allocation loop reserves roughly 16 engines * (3-8 ports each) ≈ 50-150 ports, which comfortably fits 17500..19999.

Back-compat
No behavior change for callers that pass base_port explicitly. Only the default moves.

Related
- LLM360/RL360#86: downstream issue documenting the overlap.
- LLM360/RL360#82: related but distinct port-zombie bug (also caused EADDRINUSE, different root cause).
- LLM360/miles#6 (TITO role widening) landed on deploy within 30 min of merge.

Tested
- python -c "import ast; ast.parse(open('miles/ray/rollout.py').read())": parses clean.
- miles/ray/rollout.py module imports without error (no new imports added).
- To be validated on LLM360/RL360's next pd-hicache-l3 + cuda-graph=true iter job once this lands on deploy.
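
As a further sanity check, the range reasoning under "Choice of 17500" can be expressed as a few assertions. The sketch below is illustrative only and is not part of miles: the constant names and both helpers are hypothetical, the bounds are the values quoted in this PR, and the worst case assumes 16 engines at 8 ports each, handed out contiguously from base_port.

```python
# Values quoted in this PR; names are illustrative, not miles APIs.
MOONCAKE_RPC_RANGE = (15000, 17000)  # rpc_min_port..rpc_max_port
EPHEMERAL_START = 32768              # allocation must stay below this
WINDOW_END = 19999                   # upper end of the 17500..19999 window
MAX_ENGINES = 16                     # assumption: "roughly 16 engines"
MAX_PORTS_PER_ENGINE = 8             # assumption: upper end of "3-8 ports each"


def ranges_overlap(a: tuple[int, int], b: tuple[int, int]) -> bool:
    """True if the inclusive integer ranges a and b share at least one port."""
    return a[0] <= b[1] and b[0] <= a[1]


def allocation_range(base_port: int) -> tuple[int, int]:
    """Worst-case inclusive span, assuming contiguous allocation from base_port."""
    return (base_port, base_port + MAX_ENGINES * MAX_PORTS_PER_ENGINE - 1)


old_range = allocation_range(15000)
new_range = allocation_range(17500)

print(ranges_overlap(old_range, MOONCAKE_RPC_RANGE))  # True
print(ranges_overlap(new_range, MOONCAKE_RPC_RANGE))  # False
print(new_range[1] <= WINDOW_END < EPHEMERAL_START)   # True
```

Under these assumptions the old default spans 15000..15127, which contains both observed colliding ports (15079 and 15082), while the new default spans 17500..17627.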