fix(disagg): support per-GPU JSON mapping in --disaggregation-ib-device #9

Open
DavidBellamy wants to merge 1 commit into llm360-main from fix/json-ib-device-passthrough-llm360-main

Conversation

@DavidBellamy
Collaborator

Port of upstream PR sgl-project#23003 applied directly on llm360-main so tonight's octopus-merge into deploy picks it up for the agentic RL pilots. Upstream PR has been open since 2026-04-16 with no reviews (Gemini quota).

Blocking fix

Current llm360-main sglang rejects the launcher's rail-mapping JSON path:

```
ValueError: Invalid IB devices specified: ['/mnt/weka/shrd/k2pta/rl360/rail_mapping_JOBID.json'].
Available devices: ['mlx5_0', ..., 'mlx5_7']
```

Observed on pilot job 1565856; will recur on 1565857 and 1565858 with the same config.

Patch

Adds a 2-line passthrough in `_validate_ib_devices()` for inline JSON content (`{...}`) and `.json` file paths. `get_ib_devices_for_gpu()` in `mooncake_transfer_engine.py` already handles both formats downstream; the validator was rejecting them before they could reach it.
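
For readers without the diff open, the shape of the passthrough can be sketched roughly as below. This is a standalone illustration, not the actual sglang code: the function name `validate_ib_devices`, its signature, the explicit `available` argument, and the sample inline mapping are all assumptions made for the example. Only the error message format and the two passthrough cases (`{...}` content and `.json` paths) come from this PR.

```python
def validate_ib_devices(device_str, available):
    """Hypothetical stand-in for the validator described in this PR.

    Returns device_str unchanged when it is acceptable, raises ValueError otherwise.
    """
    if device_str is None:
        return None

    stripped = device_str.strip()

    # Early return: inline JSON content or a path to a .json mapping file
    # is passed through unchanged and resolved per GPU further downstream.
    if (stripped.startswith("{") and stripped.endswith("}")) or stripped.endswith(".json"):
        return device_str

    # Otherwise treat the value as a comma-separated list of device names
    # and check each one against the devices present on the host.
    devices = [d.strip() for d in stripped.split(",") if d.strip()]
    invalid = [d for d in devices if d not in available]
    if invalid:
        raise ValueError(
            f"Invalid IB devices specified: {invalid}. "
            f"Available devices: {sorted(available)}"
        )
    return device_str


# Illustrative host with eight devices, mirroring the error message above.
AVAILABLE = {f"mlx5_{i}" for i in range(8)}
```

With this sketch, `validate_ib_devices("/some/dir/rail_mapping.json", AVAILABLE)` returns the path untouched instead of raising, while a misspelled device name such as `mlx5_9` still fails validation.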

Port of upstream PR sgl-project#23003 applied directly on
llm360-main so the octopus-merge into deploy picks it up for tonight's
agentic RL pilots. Blocking fix: sglang PD disaggregation currently
rejects the launcher's rail-mapping JSON path.

Adds a 2-line early return in _validate_ib_devices() that passes JSON
content ({...}) and .json file paths through unchanged;
get_ib_devices_for_gpu() in mooncake_transfer_engine.py already handles
those formats downstream.
DavidBellamy added a commit that referenced this pull request Apr 19, 2026
…project#23003

Upstream PR sgl-project#23003 (fix/json-ib-device-passthrough) is
superseded by #9 (fix/json-ib-device-passthrough-llm360-main),
which mirrors the same 7-line patch on top of llm360-main. The upstream
branch was based on sgl-project/sglang:main and line-drifts when merged
into llm360-main, which triggered a merge conflict in every octopus run
after the PR #10 decouple landed.

Port harbor's SKIP_UPSTREAM_BRANCHES pattern: env var + jq `NOT_SKIPPED`
filter applied to both UPSTREAM_PR_STATE/BRANCHES (cross-fork PRs on
LLM360/sglang:llm360-main) and REAL_UPSTREAM_PR_STATE/BRANCHES (upstream
sgl-project/sglang PRs). Removes the skipped branch cleanly from the
state fingerprint too, so re-running doesn't see it as a pending change.
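
The skip pattern described in this commit message can be illustrated with a small Python sketch. The real implementation is an env var plus a jq filter in the merge script; the record layout (`branch`, `sha`), the separator handling, and the fingerprint scheme below are assumptions chosen for the example, not the script's actual format.

```python
import hashlib
import json
import os


def skipped_branches(env=None):
    """Read SKIP_UPSTREAM_BRANCHES as a set (comma or whitespace separated; assumed format)."""
    env = os.environ if env is None else env
    raw = env.get("SKIP_UPSTREAM_BRANCHES", "")
    return {b for b in raw.replace(",", " ").split() if b}


def not_skipped(prs, skip):
    """Keep only PR records whose head branch is not in the skip set."""
    return [pr for pr in prs if pr["branch"] not in skip]


def fingerprint(prs):
    """Hash the (branch, sha) pairs so a rerun can detect pending changes."""
    payload = json.dumps(sorted((pr["branch"], pr["sha"]) for pr in prs))
    return hashlib.sha256(payload.encode()).hexdigest()


# Hypothetical PR state: one skipped upstream branch, one that should still merge.
prs = [
    {"branch": "fix/json-ib-device-passthrough", "sha": "aaa111"},
    {"branch": "feature/other", "sha": "bbb222"},
]
skip = skipped_branches({"SKIP_UPSTREAM_BRANCHES": "fix/json-ib-device-passthrough"})
kept = not_skipped(prs, skip)
```

Because the fingerprint is computed over the filtered list, new commits on a skipped branch do not change it, which matches the stated goal of not treating the skipped branch as a pending change.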
DavidBellamy added a commit that referenced this pull request Apr 19, 2026
…project#23003 (#11)
