
Conversation

@guyueh1
Contributor

@guyueh1 guyueh1 commented Jan 16, 2026

What does this PR do ?

Moves the base image to CUDA 13 and upgrades torch to 2.9.

Issues

List issues that this PR closes:

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this
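# A rough sketch, not the exact CI invocation: the image tag below is illustrative,
# and the run command is the sanity check reported working later in this thread.
docker build -f docker/Dockerfile -t nemo-rl:cuda13 .
uv run examples/run_grpo_math.py --config examples/configs/grpo_math_1B_megatron.yaml policy.model_name=meta-llama/Llama-3.1-8B-Instruct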

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

Summary by CodeRabbit

  • Chores
    • Updated container base image to CUDA 13.0
    • Upgraded vLLM to version 0.13.0 with CUDA 13 support
    • Upgraded transformer-engine to 2.9.0 with CUDA 13 core support
    • Enhanced build workflow with improved dependency locking


guyueh1 and others added 2 commits January 13, 2026 15:53
Signed-off-by: Guyue Huang <guyueh@nvidia.com>
Signed-off-by: Guyue Huang <guyueh@login-lyris01.lyris.clusters.nvidia.com>
@guyueh1 guyueh1 requested review from a team as code owners January 16, 2026 00:16
@guyueh1 guyueh1 changed the title feat: Cuda13 Support feat: [draft do not merge] Cuda13 Support Jan 16, 2026
@coderabbitai
Contributor

coderabbitai bot commented Jan 16, 2026

📝 Walkthrough


These changes upgrade the CUDA infrastructure from version 12.9 to 13.0, updating the Docker base image, adding CUDA header configuration, and replacing cu12-versioned dependencies with cu13-versioned packages (vLLM, transformer-engine, NVShmem, PyTorch).

Changes

Cohort: CUDA 13.0 Upgrade

  • docker/Dockerfile: Updated base image from CUDA 12.9 to 13.0, added the CPLUS_INCLUDE_PATH environment variable for CUDA headers, and introduced uv lock before uv sync commands for dependency locking.
  • pyproject.toml: Replaced cu12 package versions with cu13 equivalents: NVShmem wheel spec, vLLM (0.13.0), transformer-engine (2.9.0), and PyTorch sources; updated tool.uv.sources entries from cu129 to cu130 for torch, torchvision, and triton.
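As an illustration of the cu12 to cu13 swap, the new pyproject.toml entries follow this pattern (a sketch assembled from the versions named in this review; the surrounding tables, markers, and neighboring entries in the actual PR may differ):

# Illustrative CUDA 13 dependency entries (not the literal diff)
dependencies = [
    "nvidia-nvshmem-cu13",
    "transformer-engine[pytorch,core_cu13]==2.9.0",
    "vllm @ https://github.com/vllm-project/vllm/releases/download/v0.13.0/vllm-0.13.0+cu130-cp38-abi3-manylinux_2_35_aarch64.whl",
]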

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 3 passed | ❌ 1 failed

❌ Failed checks (1 warning)

  • Test Results For Major Changes (⚠️ Warning): PR introduces major changes (CUDA 13.0, PyTorch 2.9, multiple dependency upgrades) affecting numerics and performance, but lacks test results, regression testing, performance benchmarks, and documentation. Resolution: add test results confirming no numerical regressions, performance benchmarks, confirmation that FlashInfer compatibility is resolved, and complete the contributor checklist before merging.

✅ Passed checks (3 passed)

  • Description Check (✅ Passed): Check skipped; CodeRabbit's high-level summary is enabled.
  • Docstring Coverage (✅ Passed): No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
  • Title check (✅ Passed): The title 'chore: Cuda13 Support' accurately reflects the main change: upgrading CUDA from 12.9 to 13.0 across the Docker configuration and all dependencies.



Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@docker/Dockerfile`:
- Line 107: The Dockerfile currently runs the command "uv lock" during image
build which regenerates uv.lock and breaks reproducibility; remove that command
from the Dockerfile or replace it with "uv lock --check" so the build fails if
uv.lock is out-of-date (instead of silently rewriting it); if the intent was to
keep lockfile synced, instead enforce the check in CI (or document the CUDA13
workaround) and ensure any reference to "uv lock" in the Dockerfile is updated
accordingly.

In `@pyproject.toml`:
- Line 60: The vLLM direct URL dependency currently pins only the aarch64 wheel
(e.g., the entry with "vllm-0.13.0+cu130-cp38-abi3-manylinux_2_35_aarch64.whl");
update each vllm dependency occurrence so the package spec includes two
conditional markers selecting the correct wheel per architecture: one URL using
the x86_64 wheel filename
(vllm-0.13.0+cu130-cp38-abi3-manylinux_2_35_x86_64.whl) with a marker ;
platform_machine == 'x86_64' and the existing aarch64 URL with a marker ;
platform_machine == 'aarch64', ensuring all vllm entries (the ones matching the
current aarch64-only URL) are replaced with these architecture-conditional specs
so x86_64 builds install the correct wheel.
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f0b1a91 and 5d09ae4.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (2)
  • docker/Dockerfile
  • pyproject.toml
🧰 Additional context used
📓 Path-based instructions (1)
!(**/tests/**|**/test_*.py|**/test_*.sh)

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Add the NVIDIA copyright header to all Python files and shell scripts (excluding tests). The header should include the current year.

Files:

  • pyproject.toml
  • docker/Dockerfile
🧠 Learnings (2)
📚 Learning: 2025-11-24T17:24:41.976Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-11-24T17:24:41.976Z
Learning: Applies to Dockerfile.ngc_pytorch : Exception: Dockerfile.ngc_pytorch is exempt from the uv run rule

Applied to files:

  • docker/Dockerfile
📚 Learning: 2025-09-24T18:36:01.919Z
Learnt from: terrykong
Repo: NVIDIA-NeMo/RL PR: 1024
File: .pre-commit-config.yaml:80-81
Timestamp: 2025-09-24T18:36:01.919Z
Learning: In pre-commit hooks, avoid using `uv run` for standalone Python scripts like tools/config_cli.py because `uv run` triggers project synchronization, which adds unnecessary overhead. Use direct script execution (e.g., `./tools/config_cli.py`) instead when the script is designed to be standalone.

Applied to files:

  • docker/Dockerfile
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Lint check
  • GitHub Check: Post submodule check comment / Comment on PR
  • GitHub Check: Post automodel integration comment / Comment on PR
🔇 Additional comments (8)
docker/Dockerfile (2)

20-22: LGTM! Appropriate header path configuration for CUDA 13.

The CPLUS_INCLUDE_PATH configuration is a reasonable workaround for the CUDA 13 header reorganization where standard library headers moved under include/cccl. The comment clearly explains the purpose.
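A sketch of what this workaround looks like in a Dockerfile; the cccl path below is inferred from this comment and the exact value in the PR may differ:

# CUDA 13 relocated these headers under include/cccl; expose them to host compilers
ENV CPLUS_INCLUDE_PATH=/usr/local/cuda/include/cccl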


8-8: Base image tag is valid and confirmed available.

The base image nvcr.io/nvidia/cuda-dl-base:25.11-cuda13.0-devel-ubuntu24.04 follows NVIDIA's documented tag pattern for CUDA DL 25.11 with CUDA 13.0 on Ubuntu 24.04 and is publicly available on NGC.

pyproject.toml (6)

49-49: LGTM! NVShmem updated to CUDA 13 variant.

The dependency correctly switches from nvidia-nvshmem-cu12 to nvidia-nvshmem-cu13 with appropriate platform markers.


71-72: Clarify status of commented-out deep_ep and deep_gemm dependencies.

Multiple deep_ep and deep_gemm git references are commented out. The commit message mentions "still not working with flashinfer," suggesting these are temporarily disabled.

Consider:

  1. Adding a comment explaining why these are disabled and linking to a tracking issue
  2. If this is expected to be resolved before merge, mark it as a TODO with the issue reference

This helps future maintainers understand the intent and track when these can be re-enabled.

Also applies to: 78-79, 82-83


96-96: LGTM! Transformer-engine updated to 2.9.0 with CUDA 13 support.

The transformer-engine[pytorch,core_cu13]==2.9.0 specification correctly pins the version with CUDA 13 core support.


161-171: LGTM! PyTorch sources correctly updated to cu130 index.

The torch, torchvision, and triton source mappings properly reference the new pytorch-cu130 index for Linux platforms while preserving PyPI for macOS.


196-197: LGTM! Index URL updated for CUDA 13.

The PyTorch index definition correctly points to the cu130 wheel repository.
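A sketch of what those two pieces look like together; names follow the description above, and details in the actual file may vary:

[tool.uv.sources]
# Linux resolves torch/torchvision/triton from the cu130 index; macOS falls back to PyPI by omission
torch = [{ index = "pytorch-cu130", marker = "sys_platform == 'linux'" }]
torchvision = [{ index = "pytorch-cu130", marker = "sys_platform == 'linux'" }]
triton = [{ index = "pytorch-cu130", marker = "sys_platform == 'linux'" }]

[[tool.uv.index]]
name = "pytorch-cu130"
url = "https://download.pytorch.org/whl/cu130"
explicit = true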


226-227: Clever approach to disable cu12 transformer-engine.

Using sys_platform == 'never' as an impossible condition is an effective way to exclude the transformer-engine-cu12 package from resolution without removing it entirely from overrides. This prevents accidental installation of the CUDA 12 variant.
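A sketch of that override pattern, with surrounding entries omitted:

[tool.uv]
override-dependencies = [
    # sys_platform can never equal 'never', so the cu12 variant is excluded from resolution
    "transformer-engine-cu12 ; sys_platform == 'never'",
]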


# Do everything in one layer to prevent large layers.

# The venv is symlinked to avoid bloating the layer size
uv lock

⚠️ Potential issue | 🟠 Major

Reconsider running uv lock during the Docker build.

Running uv lock regenerates uv.lock from pyproject.toml, which can produce different results across builds if upstream dependency versions change. This undermines reproducibility—the lockfile committed to the repo should be the source of truth.

If the intent is to ensure the lockfile stays in sync with pyproject.toml, consider either:

  1. Remove uv lock and rely on CI to verify the lockfile is up-to-date
  2. Use uv lock --check to fail the build if the lockfile is stale, rather than silently regenerating it (see the sketch below)

Was this added intentionally to work around a specific issue with the CUDA 13 dependencies?
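For reference, option 2 above would look roughly like this in the Dockerfile (a sketch; uv lock --check verifies the committed lockfile instead of rewriting it, and uv sync --locked then installs strictly from it):

# Fail the build if uv.lock is stale, then install from the committed lockfile
RUN uv lock --check && \
    uv sync --locked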


pyproject.toml Outdated
"causal-conv1d",
# Remove this once https://github.com/NVIDIA-NeMo/RL/issues/501 resolved
"vllm==0.11.2",
"vllm @ https://github.com/vllm-project/vllm/releases/download/v0.13.0/vllm-0.13.0+cu130-cp38-abi3-manylinux_2_35_aarch64.whl",

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check available vLLM v0.13.0 wheels on GitHub releases
echo "=== Checking vLLM v0.13.0 release assets ==="
curl -s "https://api.github.com/repos/vllm-project/vllm/releases/tags/v0.13.0" | jq -r '.assets[].name' 2>/dev/null | grep -i "cu130.*\.whl" || echo "No cu130 wheels found or API failed"

Repository: NVIDIA-NeMo/RL

Length of output: 211


🏁 Script executed:

#!/bin/bash
# Examine pyproject.toml to see the dependencies and affected lines
echo "=== Checking pyproject.toml content around line 60, 74, 83, 104 ==="
cat -n pyproject.toml | sed -n '55,110p'

Repository: NVIDIA-NeMo/RL

Length of output: 3690


vLLM wheel URLs only specify aarch64 architecture—x86_64 builds will fail.

All vLLM direct URL dependencies (lines 60, 74, 83, 104) point to manylinux_2_35_aarch64.whl while x86_64 wheels are available for v0.13.0. This breaks x86_64 Linux builds, which are likely the primary target for most users.

Add conditional markers to select the appropriate wheel per architecture:

"vllm @ https://github.com/vllm-project/vllm/releases/download/v0.13.0/vllm-0.13.0+cu130-cp38-abi3-manylinux_2_35_x86_64.whl ; platform_machine == 'x86_64'",
"vllm @ https://github.com/vllm-project/vllm/releases/download/v0.13.0/vllm-0.13.0+cu130-cp38-abi3-manylinux_2_35_aarch64.whl ; platform_machine == 'aarch64'",

Signed-off-by: Guyue Huang <guyueh@login-lyris01.lyris.clusters.nvidia.com>
@guyueh1
Contributor Author

guyueh1 commented Jan 16, 2026

uv run examples/run_grpo_math.py --config examples/configs/grpo_math_1B_megatron.yaml policy.model_name=meta-llama/Llama-3.1-8B-Instruct works now!

Signed-off-by: Guyue Huang <guyueh@login-lyris01.lyris.clusters.nvidia.com>
@guyueh1 guyueh1 requested a review from a team as a code owner January 17, 2026 22:04
@github-actions

✅ Submodule Fast-Forward Check Results

Check based on commit: 2490141 (PR #1786 from cuda13)

✅ Submodules that are properly updated:

Automodel: ✅ PR branch is ahead of main branch (fast-forward)

All submodule changes look good! ✨

Fix
Signed-off-by: Guyue Huang <guyueh@login-lyris01.lyris.clusters.nvidia.com>
@guyueh1 guyueh1 changed the title feat: [draft do not merge] Cuda13 Support chore: Cuda13 Support Jan 18, 2026
@github-actions

✅ Submodule Fast-Forward Check Results

Check based on commit: 90efbeb (PR #1786 from cuda13)

✅ Submodules that are properly updated:

Automodel: ✅ PR branch is ahead of main branch (fast-forward)

All submodule changes look good! ✨

@guyueh1 guyueh1 added the CI:L2 Run doctests, unit tests, functional tests, and convergence tests label Jan 18, 2026
@guyueh1 guyueh1 self-assigned this Jan 18, 2026
@guyueh1
Contributor Author

guyueh1 commented Jan 19, 2026

@terrykong I think the GitHub CI machine is too old to run the unit tests. We can try to run those tests locally for now, but eventually we will need H100 CI runners with a newer driver to test this.
