
fix(container): pin SGLang nixl_ref to v1.0.0 to match upstream wheel ABI #9283

Open
keivenchang wants to merge 3 commits into main from
keivenchang/DIS-9265__sglang-nixl-v1.0.0-match


@keivenchang keivenchang commented May 7, 2026

Overview:

#9216 added the NIXL C++ SDK to the sglang dev stage so cargo build could link nixl-sys, pinning nixl_ref: 0.10.1. But wheel_builder also builds ai_dynamo*.whl (Rust bindings) against that SDK, and the runtime image installs the wheel on top of lmsysorg/sglang, which preinstalls nixl-cu1{2,3}==1.0.0. ABI mismatch → nixl symbol load failures at runtime.

This PR pins all three NIXL surfaces to the same version (1.0.0) so the SDK ai_dynamo is built against matches the NIXL the runtime already ships:

  • sglang.nixl_ref: v1.0.0 — wheel_builder builds the C++ SDK at v1.0.0
  • nixl-sys = "=1.0.0" (Cargo) — Rust bindings linked against v1.0.0 headers
  • nixl-cu1{2,3}==1.0.0 (already preinstalled by upstream sglang runtime) — runtime Python wheel

This change is SGLang-only; vllm and trtllm stay at 0.10.1.
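
The three pins above spell the same version in three ways: a git-style tag (v1.0.0), a Cargo exact requirement (=1.0.0), and a pip pin (==1.0.0). As a minimal sketch, assuming the values are compared by hand or in a CI step, the helpers below normalize each spelling and check that they agree. The function names normalize_version and versions_match are hypothetical and are not part of this PR.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration; not part of the PR.

# Strip pin syntax so "v1.0.0", "=1.0.0" and "==1.0.0" all become "1.0.0".
normalize_version() {
  local v="$1"
  v="${v#==}"   # pip-style exact pin
  v="${v#=}"    # Cargo-style exact requirement
  v="${v#v}"    # git tag prefix
  printf '%s\n' "$v"
}

# Succeed only if every argument normalizes to the same version string.
versions_match() {
  local first v
  first="$(normalize_version "$1")"
  shift
  for v in "$@"; do
    [ "$(normalize_version "$v")" = "$first" ] || return 1
  done
  return 0
}
```

For example, versions_match v1.0.0 "=1.0.0" "==1.0.0" succeeds, while versions_match 0.10.1 "==1.0.0" fails, which is the kind of SDK-versus-runtime disagreement described above.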

Details:

  • Why both the SDK and the crate need bumping together: Between NIXL 0.10.x and 1.0.0, nixlDescList<T>::operator[] flipped from unsigned int to size_t (and dropped virtual). The nixl-sys crate's vendored wrapper.cpp is generated against a specific NIXL commit's headers, so crate version and SDK version must match. nixl-sys 1.0.0 is published on crates.io, lockstep with NIXL v1.0.0.
  • Why this doesn't change runtime image contents: sglang_runtime.Dockerfile narrows the wheel COPY from wheel_builder to ai_dynamo*.whl and excludes nixl-cu*.whl, so the wheel_builder NIXL stack does not enter the runtime image. Runtime keeps using upstream lmsysorg/sglang's preinstalled NIXL Python stack — only the ABI ai_dynamo is built against changes.
  • Public Rust API of nixl-sys is identical between 0.10.1 and 1.0.x (only wrapper.cpp changed, plus internal nixlAgentConfig field-by-field construction). Zero dynamo Rust source changes needed.
  • Cross-reference comments tying runtime_image_tag to nixl_ref, so future SGLang base-image bumps re-verify the upstream NIXL version (pip show nixl-cu13 recipe included).

Verified locally on cu12.9 and cu13.0 (sglang local-dev images built from this branch):

  • compile.sh --dev (full workspace cargo + maturin) PASSED on both — no rust-lld: undefined symbol from the prior 0.10.1 nixl-sys × v1.0.0 SDK mismatch.
  • GPU tests on cu13.0: aggregated-2, video_agg_qwen-2, embedding_agg-2 — 3/3 PASSED in 3:22.
  • GPU tests on cu12.9: aggregated-2, disaggregated-2, video_agg_qwen-2, embedding_agg-2 — 4/4 PASSED in 3:47.

Where should the reviewer start?

lib/memory/Cargo.toml and lib/llm/Cargo.toml for the crate bump; container/context.yaml for the SDK pin and the cross-reference comments.

Related Issues:

Linear: DIS-9265

Relates to #6671

/coderabbit profile chill



Summary by CodeRabbit

  • Chores
    • Pinned the NIXL version (nixl_ref) used by the SGLang container build to v1.0.0
    • Updated runtime image configurations for CUDA 12.9 and CUDA 13.0 compatibility
    • Added documentation notes clarifying version compatibility requirements

… ABI

PR #9216 set sglang.nixl_ref=0.10.1 so wheel_builder builds the NIXL C++
SDK at 0.10.1, but the upstream lmsysorg/sglang base image already
ships nixl-cu13==1.0.0 in its runtime — the SDK and runtime wheel
disagree on ABI, surfacing as nixl symbol load failures at runtime.

Pin sglang.nixl_ref to v1.0.0 so the SDK build matches the preinstalled
wheel. Sglang-only change; vllm and trtllm stay at 0.10.1.

Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>
…e image bumps

Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>
@keivenchang keivenchang requested review from a team as code owners May 7, 2026 21:23
@keivenchang keivenchang self-assigned this May 7, 2026

coderabbitai Bot commented May 7, 2026


Walkthrough

SGLang nixl_ref version is updated from 0.10.1 to v1.0.0, with corresponding CUDA 12.9 and 13.0 runtime image tag updates, accompanied by documentation comments clarifying the dependency relationship between nixl_ref and preinstalled wheel versions.

Changes

SGLang Configuration Updates

| Layer / File(s) | Summary |
| --- | --- |
| Configuration Version Updates (container/context.yaml) | SGLang nixl_ref updated to v1.0.0; CUDA 12.9 runtime_image_tag set to v0.5.10.post1-runtime; CUDA 13.0 runtime_image_tag set to v0.5.10.post1-cu130-runtime; inline NOTE comments added for nixl_ref wheel compatibility documentation. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Title check | ✅ Passed | The title clearly and specifically describes the main change: pinning SGLang's nixl_ref version to v1.0.0 to resolve an ABI mismatch between the SDK and runtime wheels. |
| Description check | ✅ Passed | The pull request description is comprehensive and well-structured, covering all required template sections with detailed technical rationale. |


@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
container/context.yaml (1)

95-95: ⚡ Quick win

Make the verification recipe explicit for both CUDA wheels.

Line 95 only shows nixl-cu13; that can mislead when validating the cuda12.9 runtime path. Consider documenting both nixl-cu12 and nixl-cu13 checks (or a parametric command) so maintainers verify the correct wheel per image tag.

#!/bin/bash
# Read-only verification recipe for maintainers (run locally where Docker is available)

# CUDA 12.9 image should expose nixl-cu12 at the expected version
docker run --rm lmsysorg/sglang:v0.5.10.post1-runtime \
  python -m pip show nixl-cu12 | sed -n '1,20p'

# CUDA 13.0 image should expose nixl-cu13 at the expected version
docker run --rm lmsysorg/sglang:v0.5.10.post1-cu130-runtime \
  python -m pip show nixl-cu13 | sed -n '1,20p'
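
One possible shape for the parametric command mentioned above, offered as a sketch only. It assumes that runtime tags containing cu130 ship nixl-cu13 and that all other tags ship nixl-cu12; this holds for the two tags named in this review but is not confirmed for any other tag. The function names nixl_wheel_for_tag and show_nixl_version are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration; not part of the PR.

# Map a runtime image tag to the NIXL wheel name it is assumed to ship.
# Assumption: "cu130" in the tag means CUDA 13.0 (nixl-cu13); anything else is CUDA 12.x (nixl-cu12).
nixl_wheel_for_tag() {
  case "$1" in
    *cu130*) printf '%s\n' "nixl-cu13" ;;
    *)       printf '%s\n' "nixl-cu12" ;;
  esac
}

# Print only the Version field reported by pip for the preinstalled NIXL wheel.
# Read-only; needs Docker and pulls lmsysorg/sglang:<tag> if it is not cached.
show_nixl_version() {
  local tag="$1"
  local wheel
  wheel="$(nixl_wheel_for_tag "$tag")"
  docker run --rm "lmsysorg/sglang:${tag}" \
    python -m pip show "$wheel" | sed -n 's/^Version: //p'
}
```

Running show_nixl_version v0.5.10.post1-cu130-runtime would then query nixl-cu13, and show_nixl_version v0.5.10.post1-runtime would query nixl-cu12, so the same command covers both runtime paths.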
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@container/context.yaml` at line 95, Update the verification comment that
currently only references "nixl-cu13" so it documents both CUDA wheel checks or
a parametric command; change the single-line note to show commands for verifying
nixl-cu12 (for CUDA 12.9 runtimes) and nixl-cu13 (for CUDA 13.0 runtimes) or
provide a template command that accepts the wheel name/tag, so maintainers can
run the appropriate docker run + "python -m pip show <wheel>" check for each
runtime image (refer to the existing comment about verifying upstream version
and the example docker run + pip show pattern).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: d1265fbc-9f54-4039-bac4-ef7698af7c0a

📥 Commits

Reviewing files that changed from the base of the PR and between 91516b0 and aa2a0e1.

📒 Files selected for processing (1)
  • container/context.yaml

@keivenchang keivenchang changed the title from "fix(container): pin sglang nixl_ref to v1.0.0 to match upstream wheel ABI" to "fix(container): pin SGLang nixl_ref to v1.0.0 to match upstream wheel ABI" May 7, 2026
Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>
@keivenchang keivenchang marked this pull request as ready for review May 8, 2026 04:30