-
Notifications
You must be signed in to change notification settings - Fork 219
feat: Add CUDA Graph configuration support to MegatronPolicyWorker #1736
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
📝 WalkthroughWalkthroughAdded CUDA Graph and RNG configuration support to MegatronPolicyWorker by importing RNGConfig and conditionally propagating enable_cuda_graph, cuda_graph_scope, and rng settings into the Megatron ConfigContainer. Extended test configuration builder and added test coverage for CUDA graph configuration parsing. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes 🚥 Pre-merge checks | ✅ 2 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
nemo_rl/models/policy/workers/megatron_policy_worker.py (1)
632-639: Consider more explicit conditional logic foruse_te_rng_tracker.The current code only sets
model_cfg.use_te_rng_tracker = Truewhen CUDA graphs are enabled (line 638), but doesn't explicitly set it toFalsewhen disabled. While this may be correct if the default isFalse, the logic could be clearer:if model_cfg.enable_cuda_graph: model_cfg.use_te_rng_tracker = True + else: + model_cfg.use_te_rng_tracker = FalseOr more concisely:
model_cfg.use_te_rng_tracker = model_cfg.enable_cuda_graphAdditionally, there's no validation that
cuda_graph_scope(lines 635-636) is only set whenenable_cuda_graph=True. While this may be acceptable, it could lead to confusion if someone configures a scope without enabling CUDA graphs.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
nemo_rl/models/policy/workers/megatron_policy_worker.pytests/unit/models/policy/test_megatron_worker.py
🧰 Additional context used
📓 Path-based instructions (4)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Conform code to Python 3.12+
Indent code with 4 spaces. Do not use tabs
Use snake_case for file names
Use PascalCase for class names
Use snake_case for function and method names
Use snake_case for local variables
Prefix variable names that start with a number with 'k' (e.g., k_99th_percentile)
Use upper snake_case with 'G' prefix for global variables (e.g., G_MY_GLOBAL)
Use upper snake_case for constants
Avoid shadowing variables declared in an outer scope
Initialize all externally visible members of a class in the constructor
Prefer docstrings over comments for interfaces that may be used outside a file
Reserve comments for code within a function or interfaces that are local to a file
If a piece of code is commented out, include a comment describing its usage and why it's commented out. Remove debug comments before merging
Use Google style docstrings for classes and functions in Python, which can be parsed by Sphinx
Avoid using reflection when functionality can be easily achieved without reflection
When using try-except blocks, limit the except clause to the smallest set of specific errors possible
When using try-except blocks for duck-typing, keep the body of the try as small as possible and use the else block for logic
YAML is the single source of truth for configuration defaults. Do not set non-None defaults in code for configuration values
For required configuration attributes, access config directly and expect presence (e.g., policy_cfg['precision']) without hidden defaults
Use typing.NotRequired to mark optional attributes in TypedDict for configuration
When adding a new config key to a TypedDict subclass, document the key's purpose, valid values/types, and recommended default, and reflect the default in exemplar YAMLs under examples/configs/*.yaml
Follow the Google Python Style Guide for Python code
Files:
tests/unit/models/policy/test_megatron_worker.pynemo_rl/models/policy/workers/megatron_policy_worker.py
!(**/tests/**|**/test_*.py|**/test_*.sh)
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Add the NVIDIA copyright header to all Python files and shell scripts (excluding tests). The header should include the current year
Files:
tests/unit/models/policy/test_megatron_worker.pynemo_rl/models/policy/workers/megatron_policy_worker.py
**/*.{py,sh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
The NVIDIA copyright header should appear at the top of all Python files and shell scripts (excluding tests)
Files:
tests/unit/models/policy/test_megatron_worker.pynemo_rl/models/policy/workers/megatron_policy_worker.py
nemo_rl/**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
For any source file under nemo_rl/*.py that defines a class or function decorated with @ray.remote, add a coverage pragma (# pragma: no cover) because these run in separate Ray processes
Files:
nemo_rl/models/policy/workers/megatron_policy_worker.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Post submodule check comment / Comment on PR
- GitHub Check: Post automodel integration comment / Comment on PR
🔇 Additional comments (5)
nemo_rl/models/policy/workers/megatron_policy_worker.py (2)
717-719: Verify consistency betweenmodel_cfg.use_te_rng_trackerandrng_config.te_rng_tracker.The
rng_config.te_rng_trackeris set based onenable_cuda_graphconfig (defaulting toFalseif not present), whilemodel_cfg.use_te_rng_trackerat line 638 is only set toTrueconditionally. Ensure these two settings remain consistent:
- When
enable_cuda_graph=True: Both should beTrue✓- When
enable_cuda_graph=False:rng_config.te_rng_tracker=False, butmodel_cfg.use_te_rng_trackeris not explicitly set- When key absent:
rng_config.te_rng_tracker=False, butmodel_cfg.use_te_rng_trackeris not explicitly setThis should work correctly if
model_cfg.use_te_rng_trackerdefaults toFalse, but please verify this assumption holds in the Megatron Bridge configuration.
632-639: Verify documentation and example YAML updates per coding guidelines.Based on the coding guidelines: "When adding a new config key to a TypedDict subclass, document the key's purpose, valid values/types, and recommended default, and reflect the default in exemplar YAMLs under examples/configs/.yaml"*
Please confirm that:
- The new config keys (
enable_cuda_graphandcuda_graph_scope) are documented with their purpose, valid values, and defaults- Example YAMLs under
examples/configs/have been updated to reflect these new options- If these keys are added to a TypedDict for PolicyConfig, they use
typing.NotRequiredto mark them as optionalAs per coding guidelines.
tests/unit/models/policy/test_megatron_worker.py (3)
68-69: LGTM! Parameter declarations are well-typed.The new optional parameters are properly declared with appropriate types and defaults.
137-138: Correct handling of optional config parameters.The conditional dictionary unpacking correctly uses
is not Nonechecks, which ensures:
enable_cuda_graph=Falseis added to the config (correct, sinceFalse is not None)enable_cuda_graph=Noneomits the key (correct)cuda_graph_scope="value"adds the value (correct)cuda_graph_scope=Noneomits the key (correct)
2593-2624: Good test coverage for CUDA graph config parsing.The test validates the four main scenarios:
- Default config (keys absent)
- CUDA graph enabled (key present, scope absent)
- CUDA graph enabled with scope (both present)
- CUDA graph disabled (key present with False value)
This provides solid coverage for the config parsing logic. The test correctly verifies that optional config keys are only included when explicitly provided.
13751df to
304fb1b
Compare
Signed-off-by: Sahger Lad <lad.sahger@gmail.com>
304fb1b to
240f6cf
Compare
|
@shanmugamr1992 to review |
What does this PR do ?
Add CUDA Graph configuration support to MegatronPolicyWorker
Issues
N/A - Feature addition
Usage
Before your PR is "Ready for review"
Pre checks:
Additional Information
N/A
Summary by CodeRabbit
New Features
Tests
✏️ Tip: You can customize this high-level summary in your review settings.