chore: Cuda13 Support #1786
base: main
Conversation
Signed-off-by: Guyue Huang <guyueh@login-lyris01.lyris.clusters.nvidia.com>
📝 Walkthrough
These changes upgrade the CUDA infrastructure from version 12.9 to 13.0, updating the Docker base image, adding CUDA header configuration, and replacing cu12-versioned dependencies with cu13-versioned packages (vLLM, transformer-engine, NVShmem, PyTorch).
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
🚥 Pre-merge checks | ✅ 3 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (3 passed)
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@docker/Dockerfile`:
- Line 107: The Dockerfile currently runs the command "uv lock" during image
build which regenerates uv.lock and breaks reproducibility; remove that command
from the Dockerfile or replace it with "uv lock --check" so the build fails if
uv.lock is out-of-date (instead of silently rewriting it); if the intent was to
keep lockfile synced, instead enforce the check in CI (or document the CUDA13
workaround) and ensure any reference to "uv lock" in the Dockerfile is updated
accordingly.
In `@pyproject.toml`:
- Line 60: The vLLM direct URL dependency currently pins only the aarch64 wheel
(e.g., the entry with "vllm-0.13.0+cu130-cp38-abi3-manylinux_2_35_aarch64.whl");
update each vllm dependency occurrence so the package spec includes two
conditional markers selecting the correct wheel per architecture: one URL using
the x86_64 wheel filename
(vllm-0.13.0+cu130-cp38-abi3-manylinux_2_35_x86_64.whl) with a marker ;
platform_machine == 'x86_64' and the existing aarch64 URL with a marker ;
platform_machine == 'aarch64', ensuring all vllm entries (the ones matching the
current aarch64-only URL) are replaced with these architecture-conditional specs
so x86_64 builds install the correct wheel.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
uv.lock is excluded by !**/*.lock
📒 Files selected for processing (2)
docker/Dockerfile
pyproject.toml
🧰 Additional context used
📓 Path-based instructions (1)
!(**/tests/**|**/test_*.py|**/test_*.sh)
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Add the NVIDIA copyright header to all Python files and shell scripts (excluding tests). The header should include the current year
Files:
pyproject.toml
docker/Dockerfile
🧠 Learnings (2)
📚 Learning: 2025-11-24T17:24:41.976Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-11-24T17:24:41.976Z
Learning: Applies to Dockerfile.ngc_pytorch : Exception: Dockerfile.ngc_pytorch is exempt from the uv run rule
Applied to files:
docker/Dockerfile
📚 Learning: 2025-09-24T18:36:01.919Z
Learnt from: terrykong
Repo: NVIDIA-NeMo/RL PR: 1024
File: .pre-commit-config.yaml:80-81
Timestamp: 2025-09-24T18:36:01.919Z
Learning: In pre-commit hooks, avoid using `uv run` for standalone Python scripts like tools/config_cli.py because `uv run` triggers project synchronization, which adds unnecessary overhead. Use direct script execution (e.g., `./tools/config_cli.py`) instead when the script is designed to be standalone.
Applied to files:
docker/Dockerfile
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: Lint check
- GitHub Check: Post submodule check comment / Comment on PR
- GitHub Check: Post automodel integration comment / Comment on PR
🔇 Additional comments (8)
docker/Dockerfile (2)
20-22: LGTM! Appropriate header path configuration for CUDA 13.
The CPLUS_INCLUDE_PATH configuration is a reasonable workaround for the CUDA 13 header reorganization where standard library headers moved under include/cccl. The comment clearly explains the purpose.
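For illustration only, such a workaround might look roughly like the following in a Dockerfile; the path and variable value here are assumptions, not the PR's actual lines 20-22, and would need to match the image's CUDA install location:

# CUDA 13 moved the CCCL (Thrust/CUB/libcu++) headers under include/cccl,
# so expose that directory to the host compiler's include search path.
ENV CPLUS_INCLUDE_PATH=/usr/local/cuda/include/cccl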
8-8: Base image tag is valid and confirmed available.
The base image nvcr.io/nvidia/cuda-dl-base:25.11-cuda13.0-devel-ubuntu24.04 follows NVIDIA's documented tag pattern for CUDA DL 25.11 with CUDA 13.0 on Ubuntu 24.04 and is publicly available on NGC.
pyproject.toml (6)
49-49: LGTM! NVShmem updated to CUDA 13 variant.
The dependency correctly switches from nvidia-nvshmem-cu12 to nvidia-nvshmem-cu13 with appropriate platform markers.
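Roughly, such an entry might read as follows in pyproject.toml; the marker shown is an assumption and the actual line 49 may differ:

# CUDA 13 build of NVSHMEM; wheels are Linux-only, so gate on platform
"nvidia-nvshmem-cu13 ; sys_platform == 'linux'",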
71-72: Clarify status of commented-out deep_ep and deep_gemm dependencies.
Multiple deep_ep and deep_gemm git references are commented out. The commit message mentions "still not working with flashinfer," suggesting these are temporarily disabled. Consider:
- Adding a comment explaining why these are disabled and linking to a tracking issue
- If this is expected to be resolved before merge, mark it as a TODO with the issue reference
This helps future maintainers understand the intent and track when these can be re-enabled.
Also applies to: 78-79, 82-83
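Illustrating the suggestion above, a commented-out entry with a tracking note might look like this; the issue reference and URLs are placeholders, not real values:

# TODO(<tracking-issue>): re-enable once deep_ep/deep_gemm work with flashinfer on CUDA 13
# "deep_ep @ git+https://github.com/<org>/<repo>.git@<ref>",
# "deep_gemm @ git+https://github.com/<org>/<repo>.git@<ref>",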
96-96: LGTM! Transformer-engine updated to 2.9.0 with CUDA 13 support.
The transformer-engine[pytorch,core_cu13]==2.9.0 specification correctly pins the version with CUDA 13 core support.
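For context, the pinned spec discussed above would sit among the project dependencies roughly like this (surrounding entries omitted, layout assumed):

dependencies = [
  "transformer-engine[pytorch,core_cu13]==2.9.0",
]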
161-171: LGTM! PyTorch sources correctly updated to cu130 index.
The torch, torchvision, and triton source mappings properly reference the new pytorch-cu130 index for Linux platforms while preserving PyPI for macOS.
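A hedged sketch of what such uv source mappings typically look like; the index name follows the description, but the exact markers in lines 161-171 are assumptions:

[tool.uv.sources]
torch = [{ index = "pytorch-cu130", marker = "sys_platform == 'linux'" }]
torchvision = [{ index = "pytorch-cu130", marker = "sys_platform == 'linux'" }]
triton = [{ index = "pytorch-cu130", marker = "sys_platform == 'linux'" }]
# Platforms that don't match the marker (e.g. macOS) fall back to the default index (PyPI).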
196-197: LGTM! Index URL updated for CUDA 13.
The PyTorch index definition correctly points to the cu130 wheel repository.
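The corresponding index definition would look roughly like this; the URL is assumed to follow PyTorch's standard wheel-index pattern rather than quoted from the diff:

[[tool.uv.index]]
name = "pytorch-cu130"
url = "https://download.pytorch.org/whl/cu130"
explicit = true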
226-227: Clever approach to disable cu12 transformer-engine.
Using sys_platform == 'never' as an impossible condition is an effective way to exclude the transformer-engine-cu12 package from resolution without removing it entirely from overrides. This prevents accidental installation of the CUDA 12 variant.
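A minimal sketch of that override trick, with the surrounding table layout assumed:

[tool.uv]
override-dependencies = [
  # The impossible marker means this requirement never applies,
  # so the cu12 variant is effectively dropped from resolution.
  "transformer-engine-cu12 ; sys_platform == 'never'",
]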
docker/Dockerfile (Outdated)
# Do everything in one layer to prevent large layers.
# The venv is symlinked to avoid bloating the layer size
uv lock
Reconsider running uv lock during the Docker build.
Running uv lock regenerates uv.lock from pyproject.toml, which can produce different results across builds if upstream dependency versions change. This undermines reproducibility—the lockfile committed to the repo should be the source of truth.
If the intent is to ensure the lockfile stays in sync with pyproject.toml, consider either:
- Remove uv lock and rely on CI to verify the lockfile is up-to-date
- Use uv lock --check to fail the build if the lockfile is stale, rather than silently regenerating it
Was this added intentionally to work around a specific issue with the CUDA 13 dependencies?
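For reference, the checking variant would be a one-line change in the Dockerfile; the RUN wrapper here is illustrative, and uv lock --check exits non-zero when uv.lock is out of date instead of rewriting it:

RUN uv lock --check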
🤖 Prompt for AI Agents
In `@docker/Dockerfile` at line 107, The Dockerfile currently runs the command "uv
lock" during image build which regenerates uv.lock and breaks reproducibility;
remove that command from the Dockerfile or replace it with "uv lock --check" so
the build fails if uv.lock is out-of-date (instead of silently rewriting it); if
the intent was to keep lockfile synced, instead enforce the check in CI (or
document the CUDA13 workaround) and ensure any reference to "uv lock" in the
Dockerfile is updated accordingly.
pyproject.toml (Outdated)
    "causal-conv1d",
    # Remove this once https://github.com/NVIDIA-NeMo/RL/issues/501 resolved
    "vllm==0.11.2",
    "vllm @ https://github.com/vllm-project/vllm/releases/download/v0.13.0/vllm-0.13.0+cu130-cp38-abi3-manylinux_2_35_aarch64.whl",
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check available vLLM v0.13.0 wheels on GitHub releases
echo "=== Checking vLLM v0.13.0 release assets ==="
curl -s "https://api.github.com/repos/vllm-project/vllm/releases/tags/v0.13.0" | jq -r '.assets[].name' 2>/dev/null | grep -i "cu130.*\.whl" || echo "No cu130 wheels found or API failed"
Repository: NVIDIA-NeMo/RL
Length of output: 211
🏁 Script executed:
#!/bin/bash
# Examine pyproject.toml to see the dependencies and affected lines
echo "=== Checking pyproject.toml content around line 60, 74, 83, 104 ==="
cat -n pyproject.toml | sed -n '55,110p'
Repository: NVIDIA-NeMo/RL
Length of output: 3690
vLLM wheel URLs only specify aarch64 architecture—x86_64 builds will fail.
All vLLM direct URL dependencies (lines 60, 74, 83, 104) point to manylinux_2_35_aarch64.whl while x86_64 wheels are available for v0.13.0. This breaks x86_64 Linux builds, which are likely the primary target for most users.
Add conditional markers to select the appropriate wheel per architecture:
"vllm @ https://github.com/vllm-project/vllm/releases/download/v0.13.0/vllm-0.13.0+cu130-cp38-abi3-manylinux_2_35_x86_64.whl ; platform_machine == 'x86_64'",
"vllm @ https://github.com/vllm-project/vllm/releases/download/v0.13.0/vllm-0.13.0+cu130-cp38-abi3-manylinux_2_35_aarch64.whl ; platform_machine == 'aarch64'",🤖 Prompt for AI Agents
In `@pyproject.toml` at line 60, The vLLM direct URL dependency currently pins
only the aarch64 wheel (e.g., the entry with
"vllm-0.13.0+cu130-cp38-abi3-manylinux_2_35_aarch64.whl"); update each vllm
dependency occurrence so the package spec includes two conditional markers
selecting the correct wheel per architecture: one URL using the x86_64 wheel
filename (vllm-0.13.0+cu130-cp38-abi3-manylinux_2_35_x86_64.whl) with a marker ;
platform_machine == 'x86_64' and the existing aarch64 URL with a marker ;
platform_machine == 'aarch64', ensuring all vllm entries (the ones matching the
current aarch64-only URL) are replaced with these architecture-conditional specs
so x86_64 builds install the correct wheel.
Signed-off-by: Guyue Huang <guyueh@login-lyris01.lyris.clusters.nvidia.com>
Signed-off-by: Guyue Huang <guyueh@login-lyris01.lyris.clusters.nvidia.com>
@terrykong I think the GitHub CI machine is too old for running the unit tests. We can try to run those tests locally for now, but eventually we will need an H100 CI runner with a newer driver for testing this.
What does this PR do?
CUDA 13 in the base image and torch 2.9.
Issues
List issues that this PR closes (syntax):
Usage
# Add a code snippet demonstrating how to use this
Before your PR is "Ready for review"
Pre checks:
Additional Information
Summary by CodeRabbit