
Conversation


@bxyu-nvidia bxyu-nvidia commented Jan 15, 2026

What does this PR do?

  1. Parameterize and pipe through whether to apply NeMo Gym on-policy fixes during the GRPO training and validation stages:
    1. nemo_rl/environments/nemo_gym.py
    2. nemo_rl/experience/rollouts.py
    3. nemo_rl/algorithms/grpo.py
    4. nemo_rl/models/generation/vllm/config.py
    5. nemo_rl/models/generation/vllm/vllm_generation.py
    6. nemo_rl/models/generation/vllm/vllm_worker_async.py
  2. Log per-agent group-level mixed reward metrics
    1. nemo_rl/experience/rollouts.py

Issues

List issues that this PR closes (syntax):

Usage

  • Usage: the sketch below shows the two new config flags this PR introduces.
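A minimal sketch, shown as a plain-dict view of the vllm_cfg section; the key names and stated defaults come from this PR's VllmSpecificArgs comments, while the surrounding structure is an illustrative assumption:

```python
# Illustrative only: the two new VllmSpecificArgs flags and their documented
# defaults. The surrounding config structure is an assumption, not the
# repo's actual exemplar YAML.
vllm_cfg = {
    # Opt-out during training: fixes stay on by default so RL training
    # remains on-policy.
    "http_server_performs_on_policy_fixes_during_training": True,
    # Off by default during validation so validation matches the setting
    # outside of NeMo RL.
    "http_server_performs_on_policy_fixes_during_validation": False,
}
```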

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

Summary by CodeRabbit

  • New Features

    • Added generation-per-prompt control for rollout configuration
    • Added validation mode support for NeMo-Gym rollouts
    • Added configurable on-policy fixes for training and validation modes
    • Added group-level reward metrics when using multiple generations per prompt
  • Bug Fixes

    • Improved handling of non-numeric metrics in aggregation


@github-actions

✅ Submodule Fast-Forward Check Results

Check based on commit: dd33205 (PR #1779 from bxyu/gym-infra-20260114)

✅ Submodules that are properly updated:

Gym: ✅ PR branch is ahead of bxyu/gym-validation-skips-on-policy-correction branch (fast-forward)

All submodule changes look good! ✨

@github-actions

✅ Submodule Fast-Forward Check Results

Check based on commit: c8c15f2 (PR #1779 from bxyu/gym-infra-20260114)

✅ Submodules that are properly updated:

Gym: ✅ PR branch is ahead of bxyu/gym-validation-skips-on-policy-correction branch (fast-forward)

All submodule changes look good! ✨

@bxyu-nvidia bxyu-nvidia changed the base branch from bxyu/gym-validation-skips-on-policy-correction to bxyu/nemo-gym-refresh-20260113 January 17, 2026 00:36
@bxyu-nvidia bxyu-nvidia marked this pull request as ready for review January 17, 2026 02:46
@bxyu-nvidia bxyu-nvidia requested review from a team as code owners January 17, 2026 02:46
@bxyu-nvidia bxyu-nvidia added the CI:L1 Run doctests, unit tests, and functional tests label Jan 17, 2026
@bxyu-nvidia bxyu-nvidia requested a review from yfw January 17, 2026 02:46
@bxyu-nvidia bxyu-nvidia changed the title feat: Bxyu/gym infra 20260114 feat: Parameterize NeMo Gym on-policy fixes for GRPO; log per-agent group-level mixed reward metrics Jan 17, 2026
@bxyu-nvidia bxyu-nvidia force-pushed the bxyu/gym-infra-20260114 branch from d510b92 to a13e7c3 Compare January 17, 2026 20:11
@bxyu-nvidia bxyu-nvidia changed the title feat: Parameterize NeMo Gym on-policy fixes for GRPO; log per-agent group-level mixed reward metrics feat: Parameterize NeMo Gym on-policy fixes for GRPO; log per-agent group-level reward metrics Jan 17, 2026
@bxyu-nvidia bxyu-nvidia changed the title feat: Parameterize NeMo Gym on-policy fixes for GRPO; log per-agent group-level reward metrics feat: NeMo Gym GRPO on-policy fix params; Per-agent group-level rewards Jan 17, 2026
@bxyu-nvidia bxyu-nvidia removed the CI:L1 Run doctests, unit tests, and functional tests label Jan 17, 2026
Base automatically changed from bxyu/nemo-gym-refresh-20260113 to main January 18, 2026 03:43
@yfw yfw requested review from a team as code owners January 18, 2026 03:43
@bxyu-nvidia bxyu-nvidia added the CI:L1 Run doctests, unit tests, and functional tests label Jan 18, 2026

coderabbitai bot commented Jan 18, 2026

📝 Walkthrough


This PR extends the GRPO rollout pipeline with generation-per-prompt control, validation mode support, and conditional on-policy fixes for the vLLM HTTP server. Changes propagate through rollout orchestration, policy generation configuration, and worker-level behavior to enable flexible policy server corrections during training and validation phases.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **GRPO Training Algorithm**<br>`nemo_rl/algorithms/grpo.py` | Added `num_generations_per_prompt` parameter to the NeMo-Gym rollout call, enabling generation-per-prompt control. Broadened metrics aggregation to skip non-numeric values. Added `is_validation=True` flag to the validation rollout invocation. |
| **Environment Rollout Interface**<br>`nemo_rl/environments/nemo_gym.py` | Added optional `do_on_policy_fixes: bool = True` parameter to `run_rollouts()`. Updated `_postprocess_nemo_gym_to_nemo_rl_result()` signature to accept `do_on_policy_fixes` and made the token-id contiguity assertion conditional on this flag. |
| **Async Rollout Orchestration**<br>`nemo_rl/experience/rollouts.py` | Added `is_validation` and `num_generations_per_prompt` parameters to `run_async_nemo_gym_rollout()`. Introduced conditional policy server setup (validation vs. training). Added per-agent row tracking and group-level reward computation when `num_generations_per_prompt` is set, including histogram and percentage metrics. |
| **vLLM Configuration**<br>`nemo_rl/models/generation/vllm/config.py` | Added two new optional boolean fields to `VllmSpecificArgs`: `http_server_performs_on_policy_fixes_during_training` and `http_server_performs_on_policy_fixes_during_validation`, for granular control over on-policy fix behavior. |
| **vLLM Generation Interface**<br>`nemo_rl/models/generation/vllm/vllm_generation.py` | Added three new methods: `prepare_http_server_for_training()`, `prepare_http_server_for_validation()`, and `_prepare_http_server_for_helper()` to coordinate on-policy fix preparation across workers. |
| **vLLM Worker Implementation**<br>`nemo_rl/models/generation/vllm/vllm_worker_async.py` | Added `prepare_http_server_for_training()` and `prepare_http_server_for_validation()` methods to set `do_on_policy_fixes` based on mode-specific config defaults. Broadened the early-return condition in `_preprocess_chat()` to respect the `do_on_policy_fixes` flag. |
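As a rough illustration of the group-level reward computation described for `nemo_rl/experience/rollouts.py` above: a hedged sketch, assuming the group-level reward is the mean of each prompt's per-generation rewards and that rewards arrive grouped by prompt; the helper name is hypothetical.

```python
from statistics import mean

def compute_group_level_rewards(
    rewards: list[float], num_generations_per_prompt: int
) -> list[float]:
    """Collapse per-generation rewards into one reward per prompt group."""
    # Consecutive blocks of num_generations_per_prompt entries are assumed
    # to belong to the same prompt.
    assert len(rewards) % num_generations_per_prompt == 0
    return [
        mean(rewards[i : i + num_generations_per_prompt])
        for i in range(0, len(rewards), num_generations_per_prompt)
    ]

# Two prompts, four generations each: an all-zero group and a mixed group.
print(compute_group_level_rewards([0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0], 4))
# -> [0.0, 0.75]
```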

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant GRPO as GRPO Algorithm
    participant Rollout as Async Rollout
    participant VLLMGen as VLLMGeneration
    participant VLLMWorker as VLLMWorker
    participant Env as NeMo-Gym Environment

    GRPO->>Rollout: run_async_nemo_gym_rollout(is_validation, num_generations_per_prompt)

    alt is_validation == True
        Rollout->>VLLMGen: prepare_http_server_for_validation()
        VLLMGen->>VLLMWorker: prepare_http_server_for_validation()
        VLLMWorker->>VLLMWorker: set do_on_policy_fixes from validation config
    else is_validation == False
        Rollout->>VLLMGen: prepare_http_server_for_training()
        VLLMGen->>VLLMWorker: prepare_http_server_for_training()
        VLLMWorker->>VLLMWorker: set do_on_policy_fixes from training config
    end

    Rollout->>Env: run_rollouts(do_on_policy_fixes)
    Env->>Env: _postprocess_nemo_gym_to_nemo_rl_result(do_on_policy_fixes)

    alt do_on_policy_fixes == True
        Env->>Env: enforce token-id contiguity assertion
    else do_on_policy_fixes == False
        Env->>Env: skip token-id contiguity assertion
    end

    Env-->>Rollout: processed results

    alt num_generations_per_prompt is set
        Rollout->>Rollout: compute group_level_rewards per agent
        Rollout->>Rollout: generate histogram & percentage metrics
    end

    Rollout-->>GRPO: AsyncNemoGymRolloutResult with metrics
```
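In code terms, the dispatch at the top of the diagram amounts to something like the following sketch; `prepare_http_server_for_training`/`prepare_http_server_for_validation` are the methods this PR adds to VllmGeneration, while the wrapper itself is an illustrative assumption:

```python
def prepare_policy_server(policy_generation, is_validation: bool) -> bool:
    """Return the resulting do_on_policy_fixes setting for this rollout."""
    if is_validation:
        # Defaults to skipping on-policy fixes so validation matches
        # inference outside NeMo RL.
        return policy_generation.prepare_http_server_for_validation()
    # Defaults to applying on-policy fixes so training stays on-policy.
    return policy_generation.prepare_http_server_for_training()
```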

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes


Suggested labels

asyncRL

Suggested reviewers

  • yfw
  • terrykong
🚥 Pre-merge checks | ✅ 2 | ❌ 2
❌ Failed checks (2 warnings)
| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 40.00%, which is below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Test Results For Major Changes | ⚠️ Warning | The PR contains major breaking changes and new features but lacks comprehensive testing documentation in the description. | Add test execution results, convergence verification, metric comparisons, performance analysis, and conditional assertion validation to the PR description. |
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled. |
| Title Check | ✅ Passed | The title clearly identifies the main changes: adding on-policy fix parameters for NeMo Gym GRPO and implementing per-agent group-level rewards tracking. |





@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@nemo_rl/models/generation/vllm/config.py`:
- Around line 42-46: Add the two new config keys
http_server_performs_on_policy_fixes_during_training and
http_server_performs_on_policy_fixes_during_validation to the exemplar YAMLs in
examples/configs/*.yaml and document their defaults: set
http_server_performs_on_policy_fixes_during_training: true (opt-out during
training) and http_server_performs_on_policy_fixes_during_validation: false;
ensure the YAMLs include these keys with the stated default values and a brief
comment mirroring the in-code comments in
nemo_rl/models/generation/vllm/config.py so the YAMLs remain the single source
of truth.
🧹 Nitpick comments (4)
nemo_rl/models/generation/vllm/vllm_generation.py (1)

691-707: Add docstrings and make HTTP-server prep results robust.
prepare_http_server_for_validation lacks a docstring, and _prepare_http_server_for_helper returns the first result without checking empty/mismatched responses. As per coding guidelines, please add docstrings and make the helper resilient.

♻️ Proposed refactor
```diff
 def prepare_http_server_for_validation(self) -> bool:
+    """Returns whether or not to do on-policy fixes during validation."""
     return self._prepare_http_server_for_helper(
         "prepare_http_server_for_validation"
     )

 def _prepare_http_server_for_helper(self, method_name: str) -> bool:
+    """Run the worker prep hook and verify consistent responses."""
     # Use run_all_workers_single_data for methods that don't need data
     futures = self.worker_group.run_all_workers_single_data(
         method_name, run_rank_0_only_axes=["tensor_parallel", "pipeline_parallel"]
     )
     # Wait for all futures to complete
     results = ray.get(futures)
-    return results[0]
+    if not results:
+        raise RuntimeError("No workers responded to HTTP-server prep.")
+    first = results[0]
+    if any(r != first for r in results if r is not None):
+        raise RuntimeError(
+            "Inconsistent HTTP-server prep responses across workers."
+        )
+    return first
```
nemo_rl/models/generation/vllm/vllm_worker_async.py (1)

283-292: Avoid code-side defaults for the new HTTP-server flags.
YAML should remain the source of truth for defaults, and these new public methods should have docstrings. As per coding guidelines, please avoid non-None defaults in code and document the methods.

♻️ Proposed refactor
```diff
 def prepare_http_server_for_training(self) -> bool:
-    self.do_on_policy_fixes = self.cfg["vllm_cfg"].get(
-        "http_server_performs_on_policy_fixes_during_training", True
-    )
+    """Configure on-policy fixes for training requests."""
+    self.do_on_policy_fixes = self.cfg["vllm_cfg"][
+        "http_server_performs_on_policy_fixes_during_training"
+    ]
     return self.do_on_policy_fixes

 def prepare_http_server_for_validation(self) -> bool:
-    self.do_on_policy_fixes = self.cfg["vllm_cfg"].get(
-        "http_server_performs_on_policy_fixes_during_validation", False
-    )
+    """Configure on-policy fixes for validation requests."""
+    self.do_on_policy_fixes = self.cfg["vllm_cfg"][
+        "http_server_performs_on_policy_fixes_during_validation"
+    ]
     return self.do_on_policy_fixes
```
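The bracket-access version fails fast with a KeyError when the exemplar YAMLs omit a flag, which is the point of the suggestion: defaults live in YAML, not in code.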
nemo_rl/experience/rollouts.py (2)

1103-1111: Add explicit strict= to zip for Ruff B905.
Given Python 3.12+, strict=True both satisfies the lint and asserts the ordering assumption.

♻️ Proposed refactor
```diff
-        for nemo_gym_row, result in zip(nemo_gym_rows, results):
+        for nemo_gym_row, result in zip(nemo_gym_rows, results, strict=True):
```

Please confirm the runtime baseline remains Python 3.10+ (strict is available).
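For reference, a quick standard-library demo of the flag (available since Python 3.10):

```python
# strict=True raises instead of silently truncating on length mismatch,
# which is what makes it a useful ordering/pairing assertion here.
pairs = list(zip([1, 2], ["a", "b"], strict=True))
print(pairs)  # [(1, 'a'), (2, 'b')]

try:
    list(zip([1, 2, 3], ["a", "b"], strict=True))
except ValueError as err:
    print(err)  # zip() argument 2 is shorter than argument 1
```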


1128-1182: Guard pct_0/1/mixed against non-binary rewards.
These percentages are only meaningful for {0,1} rewards; skip them when values are non-binary to avoid misleading metrics.

♻️ Proposed refactor
                 per_agent_metrics[f"{agent_name}/group_level_reward/histogram"] = (
                     Histogram(group_level_rewards)
                 )
-                per_agent_metrics[f"{agent_name}/group_level_reward/pct_0"] = (
-                    100
-                    * sum(r == 0 for r in group_level_rewards)
-                    / len(group_level_rewards)
-                )
-                per_agent_metrics[f"{agent_name}/group_level_reward/pct_1"] = (
-                    100
-                    * sum(r == 1 for r in group_level_rewards)
-                    / len(group_level_rewards)
-                )
-                per_agent_metrics[f"{agent_name}/group_level_reward/pct_mixed"] = (
-                    100
-                    - per_agent_metrics[f"{agent_name}/group_level_reward/pct_0"]
-                    - per_agent_metrics[f"{agent_name}/group_level_reward/pct_1"]
-                )
+                if set(group_level_rewards) <= {0, 1}:
+                    per_agent_metrics[f"{agent_name}/group_level_reward/pct_0"] = (
+                        100
+                        * sum(r == 0 for r in group_level_rewards)
+                        / len(group_level_rewards)
+                    )
+                    per_agent_metrics[f"{agent_name}/group_level_reward/pct_1"] = (
+                        100
+                        * sum(r == 1 for r in group_level_rewards)
+                        / len(group_level_rewards)
+                    )
+                    per_agent_metrics[f"{agent_name}/group_level_reward/pct_mixed"] = (
+                        100
+                        - per_agent_metrics[f"{agent_name}/group_level_reward/pct_0"]
+                        - per_agent_metrics[f"{agent_name}/group_level_reward/pct_1"]
+                    )
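A toy check of the percentage arithmetic, assuming group-level rewards are per-group means of binary per-sample rewards (note that a fractional value such as 0.5 below would cause the metrics to be skipped entirely under the suggested guard):

```python
group_level_rewards = [0.0, 0.0, 1.0, 0.5, 1.0]  # 0.5 marks a mixed group
n = len(group_level_rewards)

pct_0 = 100 * sum(r == 0 for r in group_level_rewards) / n  # 40.0
pct_1 = 100 * sum(r == 1 for r in group_level_rewards) / n  # 40.0
pct_mixed = 100 - pct_0 - pct_1                             # 20.0
assert pct_0 + pct_1 + pct_mixed == 100
```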

Comment on lines +42 to +46
```python
# Whether or not to do the on-policy fixes when using the HTTP server. See nemo_rl/models/generation/vllm/vllm_worker_async.py::_replace_prefix_tokens
# By default, this is opt-out during training since the default behavior is to ensure that RL training is on-policy.
http_server_performs_on_policy_fixes_during_training: NotRequired[bool]
# This defaults to False during validation so that we perform validation in the same setting as outside of NeMo RL.
http_server_performs_on_policy_fixes_during_validation: NotRequired[bool]
```

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

```bash
#!/bin/bash
# First, check if examples/configs directory exists
fd -t f "\.yaml$" examples/configs

# Search for the specific keys in YAML files
echo "=== Searching for http_server_performs_on_policy_fixes keys ==="
rg -n "http_server_performs_on_policy_fixes_during_(training|validation)" examples/configs -g '*.yaml' || echo "No matches found"

# Also check the config file to understand the TypedDict definition
echo "=== Checking config.py file ==="
wc -l nemo_rl/models/generation/vllm/config.py
```

Repository: NVIDIA-NeMo/RL

Length of output: 10066


🏁 Script executed:

```bash
cat -n nemo_rl/models/generation/vllm/config.py
```

Repository: NVIDIA-NeMo/RL

Length of output: 3311


Update exemplar YAMLs with the new HTTP-server on-policy fix flags.

These new NotRequired[bool] keys must be documented with their defaults in exemplar YAMLs under examples/configs/*.yaml to comply with the coding guidelines (YAML is the single source of truth for configuration defaults). Based on the code comments, the defaults are:

  • http_server_performs_on_policy_fixes_during_training: True (opt-out during training)
  • http_server_performs_on_policy_fixes_during_validation: False

