feat: NeMo Gym GRPO on-policy fix params; Per-agent group-level rewards #1779
base: main
Conversation
📝 Walkthrough

This PR extends the GRPO rollout pipeline with generation-per-prompt control, validation-mode support, and conditional on-policy fixes for the vLLM HTTP server. The changes propagate through rollout orchestration, policy generation configuration, and worker-level behavior to enable flexible policy-server corrections during training and validation phases.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant GRPO as GRPO Algorithm
    participant Rollout as Async Rollout
    participant VLLMGen as VLLMGeneration
    participant VLLMWorker as VLLMWorker
    participant Env as NeMo-Gym Environment

    GRPO->>Rollout: run_async_nemo_gym_rollout(is_validation, num_generations_per_prompt)
    alt is_validation == True
        Rollout->>VLLMGen: prepare_http_server_for_validation()
        VLLMGen->>VLLMWorker: prepare_http_server_for_validation()
        VLLMWorker->>VLLMWorker: set do_on_policy_fixes from validation config
    else is_validation == False
        Rollout->>VLLMGen: prepare_http_server_for_training()
        VLLMGen->>VLLMWorker: prepare_http_server_for_training()
        VLLMWorker->>VLLMWorker: set do_on_policy_fixes from training config
    end
    Rollout->>Env: run_rollouts(do_on_policy_fixes)
    Env->>Env: _postprocess_nemo_gym_to_nemo_rl_result(do_on_policy_fixes)
    alt do_on_policy_fixes == True
        Env->>Env: enforce token-id contiguity assertion
    else do_on_policy_fixes == False
        Env->>Env: skip token-id contiguity assertion
    end
    Env-->>Rollout: processed results
    alt num_generations_per_prompt is set
        Rollout->>Rollout: compute group_level_rewards per agent
        Rollout->>Rollout: generate histogram & percentage metrics
    end
    Rollout-->>GRPO: AsyncNemoGymRolloutResult with metrics
```
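A rough, self-contained sketch of the control flow the diagram describes (illustrative only; `policy_generation` and `env` stand in for the VllmGeneration handle and the NeMo-Gym environment, and the real signatures in the PR may differ):

```python
from typing import Any


def run_async_nemo_gym_rollout(
    policy_generation: Any,
    env: Any,
    is_validation: bool,
    num_generations_per_prompt: int | None = None,
) -> tuple[Any, dict[str, float]]:
    """Sketch of the rollout control flow from the diagram above (not the PR's actual code)."""
    # Toggle the HTTP server's on-policy fixes for this rollout phase.
    if is_validation:
        do_on_policy_fixes = policy_generation.prepare_http_server_for_validation()
    else:
        do_on_policy_fixes = policy_generation.prepare_http_server_for_training()

    # The environment enforces (or skips) the token-id contiguity assertion
    # depending on the flag it receives.
    results = env.run_rollouts(do_on_policy_fixes=do_on_policy_fixes)

    metrics: dict[str, float] = {}
    if num_generations_per_prompt is not None:
        # Per-agent group-level rewards would be aggregated here into
        # histogram and pct_0 / pct_1 / pct_mixed metrics (omitted in this sketch).
        pass

    return results, metrics
```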
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 2 failed

❌ Failed checks (2 warnings)
✅ Passed checks (2 passed)
Actionable comments posted: 1
🧹 Nitpick comments (4)
nemo_rl/models/generation/vllm/vllm_generation.py (1)
691-707: Add docstrings and make HTTP-server prep results robust.
`prepare_http_server_for_validation` lacks a docstring, and `_prepare_http_server_for_helper` returns the first result without checking for empty or mismatched responses. As per coding guidelines, please add docstrings and make the helper resilient.

♻️ Proposed refactor
```diff
 def prepare_http_server_for_validation(self) -> bool:
+    """Returns whether or not to do on-policy fixes during validation."""
     return self._prepare_http_server_for_helper(
         "prepare_http_server_for_validation"
     )

 def _prepare_http_server_for_helper(self, method_name: str) -> bool:
+    """Run the worker prep hook and verify consistent responses."""
     # Use run_all_workers_single_data for methods that don't need data
     futures = self.worker_group.run_all_workers_single_data(
         method_name, run_rank_0_only_axes=["tensor_parallel", "pipeline_parallel"]
     )
     # Wait for all futures to complete
     results = ray.get(futures)
-    return results[0]
+    if not results:
+        raise RuntimeError("No workers responded to HTTP-server prep.")
+    first = results[0]
+    if any(r != first for r in results if r is not None):
+        raise RuntimeError(
+            "Inconsistent HTTP-server prep responses across workers."
+        )
+    return first
```

nemo_rl/models/generation/vllm/vllm_worker_async.py (1)
283-292: Avoid code-side defaults for the new HTTP-server flags.
YAML should remain the source of truth for defaults, and these new public methods should have docstrings. As per coding guidelines, please avoid non-None defaults in code and document the methods.

♻️ Proposed refactor
```diff
 def prepare_http_server_for_training(self) -> bool:
-    self.do_on_policy_fixes = self.cfg["vllm_cfg"].get(
-        "http_server_performs_on_policy_fixes_during_training", True
-    )
+    """Configure on-policy fixes for training requests."""
+    self.do_on_policy_fixes = self.cfg["vllm_cfg"][
+        "http_server_performs_on_policy_fixes_during_training"
+    ]
     return self.do_on_policy_fixes

 def prepare_http_server_for_validation(self) -> bool:
-    self.do_on_policy_fixes = self.cfg["vllm_cfg"].get(
-        "http_server_performs_on_policy_fixes_during_validation", False
-    )
+    """Configure on-policy fixes for validation requests."""
+    self.do_on_policy_fixes = self.cfg["vllm_cfg"][
+        "http_server_performs_on_policy_fixes_during_validation"
+    ]
     return self.do_on_policy_fixes
```

nemo_rl/experience/rollouts.py (2)
1103-1111: Add an explicit `strict=` to `zip` for Ruff B905.

Given Python 3.12+, `strict=True` both satisfies the lint and asserts the ordering assumption.

♻️ Proposed refactor
```diff
- for nemo_gym_row, result in zip(nemo_gym_rows, results):
+ for nemo_gym_row, result in zip(nemo_gym_rows, results, strict=True):
```

Please confirm the runtime baseline remains Python 3.10+ (`strict` is available).
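For illustration, a tiny standalone example (hypothetical data, not from the PR) of what `strict=True` buys: a length mismatch raises instead of silently truncating.

```python
nemo_gym_rows = ["prompt_a", "prompt_b", "prompt_c"]  # hypothetical data
results = ["result_a", "result_b"]  # one result missing

try:
    for row, result in zip(nemo_gym_rows, results, strict=True):
        print(row, result)
except ValueError as err:
    # Raised on the mismatch instead of silently dropping "prompt_c".
    print(f"length mismatch: {err}")
```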
1128-1182: Guard pct_0/1/mixed against non-binary rewards.
These percentages are only meaningful for {0,1} rewards; skip them when values are non-binary to avoid misleading metrics.

♻️ Proposed refactor
```diff
 per_agent_metrics[f"{agent_name}/group_level_reward/histogram"] = (
     Histogram(group_level_rewards)
 )
-per_agent_metrics[f"{agent_name}/group_level_reward/pct_0"] = (
-    100
-    * sum(r == 0 for r in group_level_rewards)
-    / len(group_level_rewards)
-)
-per_agent_metrics[f"{agent_name}/group_level_reward/pct_1"] = (
-    100
-    * sum(r == 1 for r in group_level_rewards)
-    / len(group_level_rewards)
-)
-per_agent_metrics[f"{agent_name}/group_level_reward/pct_mixed"] = (
-    100
-    - per_agent_metrics[f"{agent_name}/group_level_reward/pct_0"]
-    - per_agent_metrics[f"{agent_name}/group_level_reward/pct_1"]
-)
+if set(group_level_rewards) <= {0, 1}:
+    per_agent_metrics[f"{agent_name}/group_level_reward/pct_0"] = (
+        100
+        * sum(r == 0 for r in group_level_rewards)
+        / len(group_level_rewards)
+    )
+    per_agent_metrics[f"{agent_name}/group_level_reward/pct_1"] = (
+        100
+        * sum(r == 1 for r in group_level_rewards)
+        / len(group_level_rewards)
+    )
+    per_agent_metrics[f"{agent_name}/group_level_reward/pct_mixed"] = (
+        100
+        - per_agent_metrics[f"{agent_name}/group_level_reward/pct_0"]
+        - per_agent_metrics[f"{agent_name}/group_level_reward/pct_1"]
+    )
```
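For context, a toy example (hypothetical numbers, assuming each entry is one prompt group's mean reward over binary per-sample rewards) showing what the three percentages express:

```python
# Each value is one group's mean reward across its generations.
group_level_rewards = [0.0, 1.0, 0.5, 1.0]

pct_0 = 100 * sum(r == 0 for r in group_level_rewards) / len(group_level_rewards)
pct_1 = 100 * sum(r == 1 for r in group_level_rewards) / len(group_level_rewards)
# Groups where some generations succeeded and some failed.
pct_mixed = 100 - pct_0 - pct_1

print(pct_0, pct_1, pct_mixed)  # 25.0 50.0 25.0
```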
```python
# Whether or not to do the on-policy fixes when using the HTTP server. See nemo_rl/models/generation/vllm/vllm_worker_async.py::_replace_prefix_tokens
# By default, this is opt-out during training since the default behavior is to ensure that RL training is on-policy.
http_server_performs_on_policy_fixes_during_training: NotRequired[bool]
# This defaults to False during validation so that we perform validation in the same setting as outside of NeMo RL.
http_server_performs_on_policy_fixes_during_validation: NotRequired[bool]
```
🧩 Analysis chain
🏁 Script executed:
```bash
#!/bin/bash
# First, check if examples/configs directory exists
fd -t f "\.yaml$" examples/configs

# Search for the specific keys in YAML files
echo "=== Searching for http_server_performs_on_policy_fixes keys ==="
rg -n "http_server_performs_on_policy_fixes_during_(training|validation)" examples/configs -g '*.yaml' || echo "No matches found"

# Also check the config file to understand the TypedDict definition
echo "=== Checking config.py file ==="
wc -l nemo_rl/models/generation/vllm/config.py
```

Repository: NVIDIA-NeMo/RL
Length of output: 10066
🏁 Script executed:
```bash
cat -n nemo_rl/models/generation/vllm/config.py
```

Repository: NVIDIA-NeMo/RL
Length of output: 3311
Update exemplar YAMLs with the new HTTP-server on-policy fix flags.
These new NotRequired[bool] keys must be documented with their defaults in exemplar YAMLs under examples/configs/*.yaml to comply with the coding guidelines (YAML is the single source of truth for configuration defaults). Based on the code comments, the defaults are:
- `http_server_performs_on_policy_fixes_during_training: True` (opt-out during training)
- `http_server_performs_on_policy_fixes_during_validation: False`
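A possible shape for the exemplar YAML addition (a sketch only; the nesting under `policy.generation.vllm_cfg` is an assumption based on the `self.cfg["vllm_cfg"]` access shown above, so adjust it to match the existing exemplar layout):

```yaml
policy:
  generation:
    vllm_cfg:
      # Whether the vLLM HTTP server applies the on-policy fixes
      # (see nemo_rl/models/generation/vllm/vllm_worker_async.py::_replace_prefix_tokens).
      # Opt-out during training: the default keeps RL training on-policy.
      http_server_performs_on_policy_fixes_during_training: true
      # Off during validation so evaluation matches behavior outside NeMo RL.
      http_server_performs_on_policy_fixes_during_validation: false
```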
🤖 Prompt for AI Agents
In `@nemo_rl/models/generation/vllm/config.py` around lines 42 - 46, Add the two
new config keys http_server_performs_on_policy_fixes_during_training and
http_server_performs_on_policy_fixes_during_validation to the exemplar YAMLs in
examples/configs/*.yaml and document their defaults: set
http_server_performs_on_policy_fixes_during_training: true (opt-out during
training) and http_server_performs_on_policy_fixes_during_validation: false;
ensure the YAMLs include these keys with the stated default values and a brief
comment mirroring the in-code comments in
nemo_rl/models/generation/vllm/config.py so the YAMLs remain the single source
of truth.
What does this PR do?
Issues
List issues that this PR closes (syntax):
Usage
```python
# Add a code snippet demonstrating how to use this
```

Before your PR is "Ready for review"
Pre checks:
Additional Information
Summary by CodeRabbit
New Features
Bug Fixes