feat: NeMo Gym GRPO on-policy fix params; Per-agent group-level rewards #1779
base: main
Conversation
📝 Walkthrough

This PR extends the GRPO rollout pipeline with generation-per-prompt control, validation-mode support, and conditional on-policy fixes for the vLLM HTTP server. The changes propagate through rollout orchestration, policy generation configuration, and worker-level behavior to enable flexible policy-server corrections during training and validation phases.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant GRPO as GRPO Algorithm
    participant Rollout as Async Rollout
    participant VLLMGen as VLLMGeneration
    participant VLLMWorker as VLLMWorker
    participant Env as NeMo-Gym Environment

    GRPO->>Rollout: run_async_nemo_gym_rollout(is_validation, num_generations_per_prompt)
    alt is_validation == True
        Rollout->>VLLMGen: prepare_http_server_for_validation()
        VLLMGen->>VLLMWorker: prepare_http_server_for_validation()
        VLLMWorker->>VLLMWorker: set do_on_policy_fixes from validation config
    else is_validation == False
        Rollout->>VLLMGen: prepare_http_server_for_training()
        VLLMGen->>VLLMWorker: prepare_http_server_for_training()
        VLLMWorker->>VLLMWorker: set do_on_policy_fixes from training config
    end
    Rollout->>Env: run_rollouts(do_on_policy_fixes)
    Env->>Env: _postprocess_nemo_gym_to_nemo_rl_result(do_on_policy_fixes)
    alt do_on_policy_fixes == True
        Env->>Env: enforce token-id contiguity assertion
    else do_on_policy_fixes == False
        Env->>Env: skip token-id contiguity assertion
    end
    Env-->>Rollout: processed results
    alt num_generations_per_prompt is set
        Rollout->>Rollout: compute group_level_rewards per agent
        Rollout->>Rollout: generate histogram & percentage metrics
    end
    Rollout-->>GRPO: AsyncNemoGymRolloutResult with metrics
```
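A rough, self-contained sketch of the control flow the diagram describes (illustrative only; `policy_generation` and `env` stand in for the VllmGeneration handle and the NeMo-Gym environment, and the real signatures in the PR may differ):

```python
from typing import Any


def run_async_nemo_gym_rollout(
    policy_generation: Any,
    env: Any,
    is_validation: bool,
    num_generations_per_prompt: int | None = None,
) -> tuple[Any, dict[str, float]]:
    """Sketch of the rollout control flow from the diagram above (not the PR's actual code)."""
    # Toggle the HTTP server's on-policy fixes for this rollout phase.
    if is_validation:
        do_on_policy_fixes = policy_generation.prepare_http_server_for_validation()
    else:
        do_on_policy_fixes = policy_generation.prepare_http_server_for_training()

    # The environment enforces (or skips) the token-id contiguity assertion
    # depending on the flag it receives.
    results = env.run_rollouts(do_on_policy_fixes=do_on_policy_fixes)

    metrics: dict[str, float] = {}
    if num_generations_per_prompt is not None:
        # Per-agent group-level rewards would be aggregated here into
        # histogram and pct_0 / pct_1 / pct_mixed metrics (omitted in this sketch).
        pass

    return results, metrics
```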
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 2 failed

❌ Failed checks (2 warnings)
✅ Passed checks (2 passed)
Actionable comments posted: 1
🧹 Nitpick comments (4)
nemo_rl/models/generation/vllm/vllm_generation.py (1)
691-707: Add docstrings and make HTTP-server prep results robust.
`prepare_http_server_for_validation` lacks a docstring, and `_prepare_http_server_for_helper` returns the first result without checking for empty or mismatched responses. As per coding guidelines, please add docstrings and make the helper resilient.

♻️ Proposed refactor
```diff
 def prepare_http_server_for_validation(self) -> bool:
+    """Returns whether or not to do on-policy fixes during validation."""
     return self._prepare_http_server_for_helper(
         "prepare_http_server_for_validation"
     )

 def _prepare_http_server_for_helper(self, method_name: str) -> bool:
+    """Run the worker prep hook and verify consistent responses."""
     # Use run_all_workers_single_data for methods that don't need data
     futures = self.worker_group.run_all_workers_single_data(
         method_name, run_rank_0_only_axes=["tensor_parallel", "pipeline_parallel"]
     )
     # Wait for all futures to complete
     results = ray.get(futures)
-    return results[0]
+    if not results:
+        raise RuntimeError("No workers responded to HTTP-server prep.")
+    first = results[0]
+    if any(r != first for r in results if r is not None):
+        raise RuntimeError(
+            "Inconsistent HTTP-server prep responses across workers."
+        )
+    return first
```

nemo_rl/models/generation/vllm/vllm_worker_async.py (1)
283-292: Avoid code-side defaults for the new HTTP-server flags.
YAML should remain the source of truth for defaults, and these new public methods should have docstrings. As per coding guidelines, please avoid non-None defaults in code and document the methods.

♻️ Proposed refactor
```diff
 def prepare_http_server_for_training(self) -> bool:
-    self.do_on_policy_fixes = self.cfg["vllm_cfg"].get(
-        "http_server_performs_on_policy_fixes_during_training", True
-    )
+    """Configure on-policy fixes for training requests."""
+    self.do_on_policy_fixes = self.cfg["vllm_cfg"][
+        "http_server_performs_on_policy_fixes_during_training"
+    ]
     return self.do_on_policy_fixes

 def prepare_http_server_for_validation(self) -> bool:
-    self.do_on_policy_fixes = self.cfg["vllm_cfg"].get(
-        "http_server_performs_on_policy_fixes_during_validation", False
-    )
+    """Configure on-policy fixes for validation requests."""
+    self.do_on_policy_fixes = self.cfg["vllm_cfg"][
+        "http_server_performs_on_policy_fixes_during_validation"
+    ]
     return self.do_on_policy_fixes
```

nemo_rl/experience/rollouts.py (2)
1103-1111: Add an explicit `strict=` to `zip` for Ruff B905.

Given Python 3.12+, `strict=True` both satisfies the lint and asserts the ordering assumption.

♻️ Proposed refactor
```diff
- for nemo_gym_row, result in zip(nemo_gym_rows, results):
+ for nemo_gym_row, result in zip(nemo_gym_rows, results, strict=True):
```

Please confirm the runtime baseline remains Python 3.10+ (`strict` is available).
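For illustration, a tiny standalone example (hypothetical data, not from the PR) of what `strict=True` buys: a length mismatch raises instead of silently truncating.

```python
nemo_gym_rows = ["prompt_a", "prompt_b", "prompt_c"]  # hypothetical data
results = ["result_a", "result_b"]  # one result missing

try:
    for row, result in zip(nemo_gym_rows, results, strict=True):
        print(row, result)
except ValueError as err:
    # Raised on the mismatch instead of silently dropping "prompt_c".
    print(f"length mismatch: {err}")
```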
1128-1182: Guard pct_0/1/mixed against non-binary rewards.
These percentages are only meaningful for {0,1} rewards; skip them when values are non-binary to avoid misleading metrics.

♻️ Proposed refactor
```diff
 per_agent_metrics[f"{agent_name}/group_level_reward/histogram"] = (
     Histogram(group_level_rewards)
 )
-per_agent_metrics[f"{agent_name}/group_level_reward/pct_0"] = (
-    100
-    * sum(r == 0 for r in group_level_rewards)
-    / len(group_level_rewards)
-)
-per_agent_metrics[f"{agent_name}/group_level_reward/pct_1"] = (
-    100
-    * sum(r == 1 for r in group_level_rewards)
-    / len(group_level_rewards)
-)
-per_agent_metrics[f"{agent_name}/group_level_reward/pct_mixed"] = (
-    100
-    - per_agent_metrics[f"{agent_name}/group_level_reward/pct_0"]
-    - per_agent_metrics[f"{agent_name}/group_level_reward/pct_1"]
-)
+if set(group_level_rewards) <= {0, 1}:
+    per_agent_metrics[f"{agent_name}/group_level_reward/pct_0"] = (
+        100
+        * sum(r == 0 for r in group_level_rewards)
+        / len(group_level_rewards)
+    )
+    per_agent_metrics[f"{agent_name}/group_level_reward/pct_1"] = (
+        100
+        * sum(r == 1 for r in group_level_rewards)
+        / len(group_level_rewards)
+    )
+    per_agent_metrics[f"{agent_name}/group_level_reward/pct_mixed"] = (
+        100
+        - per_agent_metrics[f"{agent_name}/group_level_reward/pct_0"]
+        - per_agent_metrics[f"{agent_name}/group_level_reward/pct_1"]
+    )
```
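For context, a toy example (hypothetical numbers, assuming each entry is one prompt group's mean reward over binary per-sample rewards) showing what the three percentages express:

```python
# Each value is one group's mean reward across its generations.
group_level_rewards = [0.0, 1.0, 0.5, 1.0]

pct_0 = 100 * sum(r == 0 for r in group_level_rewards) / len(group_level_rewards)
pct_1 = 100 * sum(r == 1 for r in group_level_rewards) / len(group_level_rewards)
# Groups where some generations succeeded and some failed.
pct_mixed = 100 - pct_0 - pct_1

print(pct_0, pct_1, pct_mixed)  # 25.0 50.0 25.0
```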
```python
# Whether or not to do the on-policy fixes when using the HTTP server. See nemo_rl/models/generation/vllm/vllm_worker_async.py::_replace_prefix_tokens
# By default, this is opt-out during training since the default behavior is to ensure that RL training is on-policy.
http_server_performs_on_policy_fixes_during_training: NotRequired[bool]
# This defaults to False during validation so that we perform validation in the same setting as outside of NeMo RL.
http_server_performs_on_policy_fixes_during_validation: NotRequired[bool]
```
🧩 Analysis chain
🏁 Script executed:
```bash
#!/bin/bash
# First, check if examples/configs directory exists
fd -t f "\.yaml$" examples/configs

# Search for the specific keys in YAML files
echo "=== Searching for http_server_performs_on_policy_fixes keys ==="
rg -n "http_server_performs_on_policy_fixes_during_(training|validation)" examples/configs -g '*.yaml' || echo "No matches found"

# Also check the config file to understand the TypedDict definition
echo "=== Checking config.py file ==="
wc -l nemo_rl/models/generation/vllm/config.py
```

Repository: NVIDIA-NeMo/RL
Length of output: 10066
🏁 Script executed:
```bash
cat -n nemo_rl/models/generation/vllm/config.py
```

Repository: NVIDIA-NeMo/RL
Length of output: 3311
Update exemplar YAMLs with the new HTTP-server on-policy fix flags.
These new NotRequired[bool] keys must be documented with their defaults in exemplar YAMLs under examples/configs/*.yaml to comply with the coding guidelines (YAML is the single source of truth for configuration defaults). Based on the code comments, the defaults are:
- `http_server_performs_on_policy_fixes_during_training: True` (opt-out during training)
- `http_server_performs_on_policy_fixes_during_validation: False`
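A possible shape for the exemplar YAML addition (a sketch only; the nesting under `policy.generation.vllm_cfg` is an assumption based on the `self.cfg["vllm_cfg"]` access shown above, so adjust it to match the existing exemplar layout):

```yaml
policy:
  generation:
    vllm_cfg:
      # Whether the vLLM HTTP server applies the on-policy fixes
      # (see nemo_rl/models/generation/vllm/vllm_worker_async.py::_replace_prefix_tokens).
      # Opt-out during training: the default keeps RL training on-policy.
      http_server_performs_on_policy_fixes_during_training: true
      # Off during validation so evaluation matches behavior outside NeMo RL.
      http_server_performs_on_policy_fixes_during_validation: false
```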
🤖 Prompt for AI Agents
In `@nemo_rl/models/generation/vllm/config.py` around lines 42 - 46, Add the two
new config keys http_server_performs_on_policy_fixes_during_training and
http_server_performs_on_policy_fixes_during_validation to the exemplar YAMLs in
examples/configs/*.yaml and document their defaults: set
http_server_performs_on_policy_fixes_during_training: true (opt-out during
training) and http_server_performs_on_policy_fixes_during_validation: false;
ensure the YAMLs include these keys with the stated default values and a brief
comment mirroring the in-code comments in
nemo_rl/models/generation/vllm/config.py so the YAMLs remain the single source
of truth.
What does this PR do?
Issues
List issues that this PR closes (syntax):
Usage
```python
# Add a code snippet demonstrating how to use this
```

Before your PR is "Ready for review"
Pre checks:
Additional Information
Summary by CodeRabbit
New Features
Bug Fixes