Skip to content

feat(sglang_engine): allow PD worker_type on /add_worker registration path#3

Closed
DavidBellamy wants to merge 1 commit intomainfrom
fix/allow-pd-worker-type-on-miles-router
Closed

feat(sglang_engine): allow PD worker_type on /add_worker registration path#3
DavidBellamy wants to merge 1 commit intomainfrom
fix/allow-pd-worker-type-on-miles-router

Conversation

@DavidBellamy
Copy link
Copy Markdown
Collaborator

Problem

SGLangEngine._init_normal asserts worker_type == "regular" before POSTing to the old-style /add_worker?url=... endpoint:

if parse(sglang_router.__version__) <= parse("0.2.1") or self.args.use_miles_router:
    assert (
        self.worker_type == "regular"
    ), "pd disaggregation is not supported in old router or miles router."
    response = requests.post(
        f"http://{self.router_ip}:{self.router_port}/add_worker?url=http://{self.server_host}:{self.server_port}"
    )

Any attempt to stand up prefill/decode workers via the miles-router path (or a gateway that mirrors it, e.g. sgl-model-gateway) fails fast at engine init, even when the receiving router could handle PD if given the information.

This blocks PD disaggregation throughput scaling in Miles-driven rollouts. Observed during LLM360/RL360#76 overnight runs when attempting FAST_ITER_CONFIG=pd-disagg through the gateway's Miles-API shim.

Fix

Relax the assertion. Forward worker_type (and bootstrap_port for prefill workers) as extra query params on the /add_worker URL. Routers that honor them get correct PD registration; routers that only accept the single-arg form ignore the extras and register as regular, with a logger.warning so the fallback behavior is visible to operators.

add_worker_url = (
    f"http://{self.router_ip}:{self.router_port}/add_worker"
    f"?url=http://{self.server_host}:{self.server_port}"
)
if self.worker_type != "regular":
    add_worker_url += f"&worker_type={self.worker_type}"
    if self.worker_type == "prefill":
        bootstrap_port = server_args_dict.get("disaggregation_bootstrap_port")
        if bootstrap_port is not None:
            add_worker_url += f"&bootstrap_port={bootstrap_port}"
    logger.warning("Registering a '%s' worker via /add_worker on the old-style router path...")
response = requests.post(add_worker_url)

Companion server-side change

This PR alone doesn't make PD routing functional through an old-style router that didn't previously understand worker_type; it just removes the fail-fast assert. Operators also need one of:

  • sgl-model-gateway: extend the /add_worker handler to accept worker_type= and bootstrap_port= query params (small serde change; the current shim ignores unknown params)
  • Newer sglang_router (>0.2.1): switch --no-use-miles-router so Miles takes the POST /workers JSON path that already supports PD

Test

Unit-checked URL construction for all four shapes:

build_url(..., 'regular')               == '.../add_worker?url=http://h:p'
build_url(..., 'decode')                == '.../add_worker?url=http://h:p&worker_type=decode'
build_url(..., 'prefill', 8998)         == '.../add_worker?url=http://h:p&worker_type=prefill&bootstrap_port=8998'
build_url(..., 'prefill', None)         == '.../add_worker?url=http://h:p&worker_type=prefill'

Context

LLM360/RL360#76 Track G (job 1559336) proved that full PD+Mooncake KV transfer works on M2's SR-IOV IB via SGLang's native sglang_router.launch_router --mini-lb. This PR unblocks the same flow through Miles-driven rollouts, so agentic RL can use PD scaling without hand-written sbatch test harnesses.

… path

The old sglang_router (<=0.2.1) and the miles-router both use the single-arg
/add_worker?url=... endpoint for engine registration. Previously, the Miles
engine asserted worker_type=='regular' before hitting that endpoint, so any
attempt to stand up prefill/decode workers via the miles-router path
(including the sgl-model-gateway that mirrors it) fail-fasts at engine
init:

  AssertionError: pd disaggregation is not supported in old router or miles router.

This blocks PD disagg throughput scaling in any deployment that uses the
miles-router path, even when the receiving router (e.g. sgl-model-gateway
with a PD-aware shim) can handle worker_type on /add_worker.

Relax the assertion: forward worker_type (and bootstrap_port for prefill)
as extra query params. Routers that honor them get PD registration;
routers that only accept the single-arg form ignore the extras and register
as regular, with a warning logged so the fallback is visible.

The companion server-side change is on the receiving router:
  - sgl-model-gateway must accept ?worker_type=&bootstrap_port= on /add_worker
  - Or deployments can use the newer /workers endpoint (non-miles path).

Context: LLM360/RL360 radixark#76. Track G (job 1559336) showed full PD KV
transfer via mooncake works with SGLang's own mini_lb; this unblocks the
same flow through Miles-driven rollouts.
@DavidBellamy
Copy link
Copy Markdown
Collaborator Author

Re-opening against radixark/miles:main so the LLM360/miles deploy branch auto-builder picks it up. Same branch, same diff.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant