
Conversation

@sahgerlad
Contributor

@sahgerlad sahgerlad commented Jan 7, 2026

What does this PR do?

Add CUDA Graph configuration support to MegatronPolicyWorker

Issues

N/A - Feature addition

Usage

# In config YAML, under megatron_cfg:
megatron_cfg:
  enabled: true
  enable_cuda_graph: true     # Enable CUDA graph capture
  cuda_graph_scope: "attn"    # Optional: set the scope for CUDA graphs
  # ... other megatron config options

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

N/A

Summary by CodeRabbit

  • New Features

    • Added CUDA Graph configuration support for Megatron policy worker, allowing users to enable CUDA graphs and specify their scope (e.g., full_model).
    • Integrated RNG configuration that automatically enables RNG tracking when CUDA graphs are activated.
  • Tests

    • Added comprehensive test coverage for CUDA graph configuration parsing with various parameter combinations.


@sahgerlad sahgerlad requested review from a team as code owners January 7, 2026 19:56
@sahgerlad sahgerlad changed the title from "Add CUDA Graph configuration support to MegatronPolicyWorker" to "feat: Add CUDA Graph configuration support to MegatronPolicyWorker" Jan 7, 2026
@coderabbitai
Contributor

coderabbitai bot commented Jan 7, 2026

📝 Walkthrough

Walkthrough

Added CUDA Graph and RNG configuration support to MegatronPolicyWorker by importing RNGConfig and conditionally propagating enable_cuda_graph, cuda_graph_scope, and rng settings into the Megatron ConfigContainer. Extended test configuration builder and added test coverage for CUDA graph configuration parsing.

Changes

| Cohort / File(s) | Summary |
|---|---|
| CUDA Graph and RNG Configuration: nemo_rl/models/policy/workers/megatron_policy_worker.py | Imported RNGConfig from megatron.bridge.training.config. Added conditional logic to extract enable_cuda_graph and cuda_graph_scope from megatron_cfg and propagate them into model_cfg. When enable_cuda_graph is true, set use_te_rng_tracker to True. Created RNGConfig instance based on enable_cuda_graph and passed it to Megatron ConfigContainer as rng parameter. |
| Test Configuration and Validation: tests/unit/models/policy/test_megatron_worker.py | Extended create_megatron_test_config signature with optional enable_cuda_graph and cuda_graph_scope parameters. Conditionally merge these parameters into megatron_cfg when provided. Added new test_cuda_graph_config_parsing unit test to verify CUDA graph configuration propagation across default, enabled-only, and enabled-with-scope scenarios. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 2
❌ Failed checks (2 warnings)
| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 75.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Test Results For Major Changes | ⚠️ Warning | PR introduces major CUDA Graph configuration feature but lacks documented test execution results; PR states test execution was not fully confirmed. | Document test execution results for test_cuda_graph_config_parsing and functional tests; include performance benchmarks demonstrating CUDA graph benefits without regressions. |
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main change: adding CUDA Graph configuration support to MegatronPolicyWorker, which matches the core functionality introduced across both modified files. |


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
nemo_rl/models/policy/workers/megatron_policy_worker.py (1)

632-639: Consider more explicit conditional logic for use_te_rng_tracker.

The current code only sets model_cfg.use_te_rng_tracker = True when CUDA graphs are enabled (line 638), but doesn't explicitly set it to False when disabled. While this may be correct if the default is False, the logic could be clearer:

     if model_cfg.enable_cuda_graph:
         model_cfg.use_te_rng_tracker = True
+    else:
+        model_cfg.use_te_rng_tracker = False

Or more concisely:

model_cfg.use_te_rng_tracker = model_cfg.enable_cuda_graph

Additionally, there's no validation that cuda_graph_scope (lines 635-636) is only set when enable_cuda_graph=True. While this may be acceptable, it could lead to confusion if someone configures a scope without enabling CUDA graphs.
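One way to surface the scope-without-enable case early is a small validation helper. The sketch below is only an illustration: `validate_cuda_graph_cfg` is a hypothetical name, and treating an absent `enable_cuda_graph` as disabled is an assumption rather than something the PR confirms.

```python
from typing import Any


def validate_cuda_graph_cfg(megatron_cfg: dict[str, Any]) -> None:
    """Raise if a CUDA graph scope is configured while CUDA graphs are off."""
    # Assumption: a missing enable_cuda_graph key means CUDA graphs are disabled.
    enabled = megatron_cfg.get("enable_cuda_graph", False)
    scope = megatron_cfg.get("cuda_graph_scope")
    if scope is not None and not enabled:
        raise ValueError(
            f"cuda_graph_scope={scope!r} is set but enable_cuda_graph is not true"
        )
```

Whether this should raise or merely log a warning is a design choice for the maintainers; raising makes a misconfigured YAML fail fast instead of silently ignoring the scope.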

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2a39bd6 and 13751df.

📒 Files selected for processing (2)
  • nemo_rl/models/policy/workers/megatron_policy_worker.py
  • tests/unit/models/policy/test_megatron_worker.py
🧰 Additional context used
📓 Path-based instructions (4)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Conform code to Python 3.12+
Indent code with 4 spaces. Do not use tabs
Use snake_case for file names
Use PascalCase for class names
Use snake_case for function and method names
Use snake_case for local variables
Prefix variable names that start with a number with 'k' (e.g., k_99th_percentile)
Use upper snake_case with 'G' prefix for global variables (e.g., G_MY_GLOBAL)
Use upper snake_case for constants
Avoid shadowing variables declared in an outer scope
Initialize all externally visible members of a class in the constructor
Prefer docstrings over comments for interfaces that may be used outside a file
Reserve comments for code within a function or interfaces that are local to a file
If a piece of code is commented out, include a comment describing its usage and why it's commented out. Remove debug comments before merging
Use Google style docstrings for classes and functions in Python, which can be parsed by Sphinx
Avoid using reflection when functionality can be easily achieved without reflection
When using try-except blocks, limit the except clause to the smallest set of specific errors possible
When using try-except blocks for duck-typing, keep the body of the try as small as possible and use the else block for logic
YAML is the single source of truth for configuration defaults. Do not set non-None defaults in code for configuration values
For required configuration attributes, access config directly and expect presence (e.g., policy_cfg['precision']) without hidden defaults
Use typing.NotRequired to mark optional attributes in TypedDict for configuration
When adding a new config key to a TypedDict subclass, document the key's purpose, valid values/types, and recommended default, and reflect the default in exemplar YAMLs under examples/configs/*.yaml
Follow the Google Python Style Guide for Python code

Files:

  • tests/unit/models/policy/test_megatron_worker.py
  • nemo_rl/models/policy/workers/megatron_policy_worker.py
!(**/tests/**|**/test_*.py|**/test_*.sh)

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Add the NVIDIA copyright header to all Python files and shell scripts (excluding tests). The header should include the current year

Files:

  • tests/unit/models/policy/test_megatron_worker.py
  • nemo_rl/models/policy/workers/megatron_policy_worker.py
**/*.{py,sh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

The NVIDIA copyright header should appear at the top of all Python files and shell scripts (excluding tests)

Files:

  • tests/unit/models/policy/test_megatron_worker.py
  • nemo_rl/models/policy/workers/megatron_policy_worker.py
nemo_rl/**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

For any source file under nemo_rl/*.py that defines a class or function decorated with @ray.remote, add a coverage pragma (# pragma: no cover) because these run in separate Ray processes

Files:

  • nemo_rl/models/policy/workers/megatron_policy_worker.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Post submodule check comment / Comment on PR
  • GitHub Check: Post automodel integration comment / Comment on PR
🔇 Additional comments (5)
nemo_rl/models/policy/workers/megatron_policy_worker.py (2)

717-719: Verify consistency between model_cfg.use_te_rng_tracker and rng_config.te_rng_tracker.

The rng_config.te_rng_tracker is set based on enable_cuda_graph config (defaulting to False if not present), while model_cfg.use_te_rng_tracker at line 638 is only set to True conditionally. Ensure these two settings remain consistent:

  • When enable_cuda_graph=True: Both should be True
  • When enable_cuda_graph=False: rng_config.te_rng_tracker=False, but model_cfg.use_te_rng_tracker is not explicitly set
  • When key absent: rng_config.te_rng_tracker=False, but model_cfg.use_te_rng_tracker is not explicitly set

This should work correctly if model_cfg.use_te_rng_tracker defaults to False, but please verify this assumption holds in the Megatron Bridge configuration.
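The consistency requirement above could be checked with a helper along these lines. This is a sketch only: `rng_settings_consistent` is a hypothetical function, and the `SimpleNamespace` objects are stand-ins for the real `model_cfg` and `RNGConfig` instances.

```python
from types import SimpleNamespace


def rng_settings_consistent(model_cfg, rng_config) -> bool:
    """Return True when both TE RNG tracker flags agree."""
    return bool(model_cfg.use_te_rng_tracker) == bool(rng_config.te_rng_tracker)


# Stand-ins for the enabled case: both flags are True.
enabled_model_cfg = SimpleNamespace(use_te_rng_tracker=True)
enabled_rng_cfg = SimpleNamespace(te_rng_tracker=True)

# Stand-ins for the disabled or absent case, assuming the model default is False.
default_model_cfg = SimpleNamespace(use_te_rng_tracker=False)
default_rng_cfg = SimpleNamespace(te_rng_tracker=False)

assert rng_settings_consistent(enabled_model_cfg, enabled_rng_cfg)
assert rng_settings_consistent(default_model_cfg, default_rng_cfg)
```

If the Megatron Bridge default for `use_te_rng_tracker` were ever `True`, the disabled case would fail this check, which is exactly the situation the comment asks to rule out.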


632-639: Verify documentation and example YAML updates per coding guidelines.

Based on the coding guidelines: "When adding a new config key to a TypedDict subclass, document the key's purpose, valid values/types, and recommended default, and reflect the default in exemplar YAMLs under examples/configs/*.yaml"

Please confirm that:

  1. The new config keys (enable_cuda_graph and cuda_graph_scope) are documented with their purpose, valid values, and defaults
  2. Example YAMLs under examples/configs/ have been updated to reflect these new options
  3. If these keys are added to a TypedDict for PolicyConfig, they use typing.NotRequired to mark them as optional

As per coding guidelines.

tests/unit/models/policy/test_megatron_worker.py (3)

68-69: LGTM! Parameter declarations are well-typed.

The new optional parameters are properly declared with appropriate types and defaults.


137-138: Correct handling of optional config parameters.

The conditional dictionary unpacking correctly uses is not None checks, which ensures:

  • enable_cuda_graph=False is added to the config (correct, since False is not None)
  • enable_cuda_graph=None omits the key (correct)
  • cuda_graph_scope="value" adds the value (correct)
  • cuda_graph_scope=None omits the key (correct)

2593-2624: Good test coverage for CUDA graph config parsing.

The test validates the four main scenarios:

  1. Default config (keys absent)
  2. CUDA graph enabled (key present, scope absent)
  3. CUDA graph enabled with scope (both present)
  4. CUDA graph disabled (key present with False value)

This provides solid coverage for the config parsing logic. The test correctly verifies that optional config keys are only included when explicitly provided.

Signed-off-by: Sahger Lad <lad.sahger@gmail.com>
@sahgerlad sahgerlad force-pushed the feat/cuda-graph-support branch from 304fb1b to 240f6cf January 7, 2026 20:05
@terrykong
Contributor

@shanmugamr1992 to review
