1 change: 1 addition & 0 deletions .claude/skills/nemo-gym-reward-profiling
88 changes: 88 additions & 0 deletions skills/nemo-gym-reward-profiling/BENCHMARK.md
@@ -0,0 +1,88 @@
# Evaluation Report

Evaluation of the `nemo-gym-reward-profiling` skill before publication through NVSkills-Eval.

This benchmark summarizes the 3-tier NVSkills-Eval results for the skill. The goal is to document whether the skill is safe, discoverable, effective, and useful for agents before it is published for broader workflow use.

## Evaluation Summary

- Skill: `nemo-gym-reward-profiling`
- Evaluation date: 2026-05-30
- NVSkills-Eval profile: `external`
- Environment: `local`
- Dataset: 5 evaluation tasks
- Attempts per task: 2
- Pass threshold: 50%
- Overall verdict: PASS

## Agents Used

- `claude-code`
- `codex`

## Metrics Used

Reported benchmark dimensions:

- Security: checks whether skill-assisted execution avoids unsafe behavior such as secret leakage, destructive commands, or unauthorized access.
- Correctness: checks whether the agent follows the expected workflow and produces the correct final output.
- Discoverability: checks whether the agent loads the skill when relevant and avoids using it when irrelevant.
- Effectiveness: checks whether the agent performs measurably better with the skill than without it.
- Efficiency: checks whether the agent uses fewer tokens and avoids redundant work.

Underlying evaluation signals used in this run:

- `security` (Security): checks for unsafe operations, secret leakage, and unauthorized access.
- `skill_execution` (Skill Execution): verifies that the agent loaded the expected skill and workflow.
- `skill_efficiency` (Efficiency): checks routing quality, decoy avoidance, and redundant tool usage.
- `accuracy` (Accuracy): grades final-answer correctness against the reference answer.
- `goal_accuracy` (Goal Accuracy): checks whether the overall user task completed successfully.
- `behavior_check` (Behavior Check): verifies expected behavior steps, including safety expectations.
- `token_efficiency` (Token Efficiency): compares token usage with and without the skill.

## Test Tasks

The benchmark dataset contained 5 evaluation tasks:

- Positive tasks: 3 tasks where the skill was expected to activate.
- Negative tasks: 2 tasks where no skill was expected.
- Unlabeled tasks: 0 tasks where positive/negative intent could not be inferred.

Task composition is derived from the evaluation dataset when possible. Entries with `expected_skill` set are treated as positive skill-activation cases, while entries with `expected_skill: null` are treated as negative activation cases.
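
The labeling rule above can be sketched in a few lines of Python. This is a minimal illustration, not the NVSkills-Eval implementation: the `classify` and `composition` helpers are hypothetical names, and treating an entry with no `expected_skill` key at all as "unlabeled" is an assumption.

```python
import json

# Entries shaped like the evaluation dataset; only the fields used here are shown.
# The third entry is hypothetical and exists only to exercise the unlabeled case.
EVALS_JSON = """
[
  {"id": "nemo-gym-reward-profiling-positive-001",
   "expected_skill": "nemo-gym-reward-profiling"},
  {"id": "nemo-gym-reward-profiling-negative-001",
   "expected_skill": null, "should_trigger": false},
  {"id": "hypothetical-unlabeled-001"}
]
"""


def classify(entry: dict) -> str:
    """Return 'positive', 'negative', or 'unlabeled' for one dataset entry."""
    if "expected_skill" not in entry:
        # Assumption: a missing key means intent cannot be inferred.
        return "unlabeled"
    return "positive" if entry["expected_skill"] is not None else "negative"


def composition(entries: list) -> dict:
    """Count entries per label."""
    counts = {"positive": 0, "negative": 0, "unlabeled": 0}
    for entry in entries:
        counts[classify(entry)] += 1
    return counts
```

Applied to the 5-task dataset in this benchmark, the same rule would yield the 3 positive, 2 negative, 0 unlabeled split reported above.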

## Results

| Dimension | Num | `claude-code` | `codex` |
|---|---:|---:|---:|
| Security | 8 | 100% (+5%) | 100% (+20%) |
| Correctness | 8 | 72% (-13%) | 90% (+2%) |
| Discoverability | 8 | 64% (-13%) | 88% (+5%) |
| Effectiveness | 8 | 64% (-14%) | 80% (-5%) |
| Efficiency | 8 | 57% (-6%) | 75% (+5%) |

Score values show skill-assisted performance. Values in parentheses show uplift versus the no-skill baseline when baseline data is available.
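
To read a cell such as `72% (-13%)`, it can help to recover the baseline it implies. The sketch below assumes the uplift is a percentage-point difference (skill-assisted score minus no-skill baseline); the report does not state this explicitly, and `implied_baseline` is a hypothetical helper name.

```python
import re

# Matches cells like "72% (-13%)" or "100% (+20%)".
CELL = re.compile(r"^\s*(\d+)%\s*\(([+-]\d+)%\)\s*$")


def implied_baseline(cell: str) -> int:
    """Return the no-skill baseline implied by a results-table cell.

    Assumption: uplift = skill-assisted score - baseline, in percentage points.
    """
    match = CELL.match(cell)
    if match is None:
        raise ValueError(f"unrecognized cell: {cell!r}")
    score = int(match.group(1))
    uplift = int(match.group(2))
    return score - uplift
```

Under that assumption, a negative uplift means the agent scored higher without the skill on that dimension.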

## Tier 1: Static Validation Summary

Tier 1 validation passed. NVSkills-Eval ran 9 checks and found 0 total findings.

Notable observations:

- SECURITY: No security vulnerabilities detected (secrets, API keys, credentials)
- SCHEMA: Found skill manifest: SKILL.md
- VERSION: No semantic version label present; resource will use commit-hash history (opting back out of an existing label is allowed)
- PII: Scanning 3 files for PII
- LICENSE: no findings reported.

## Tier 2: Deduplication Summary

Tier 2 validation passed. NVSkills-Eval ran 2 checks and found 0 total findings.

Notable observations:

- Context Deduplication: Collected 3 file(s)
- Inter-Skill Deduplication: Parsed skill 'nemo-gym-reward-profiling': 139 char description

## Publication Recommendation

The skill is suitable to proceed toward NVSkills-Eval publication based on this benchmark. Skill owners should keep this file with the skill and refresh it when the evaluation dataset, skill behavior, or target agents materially change.
@@ -1,15 +1,33 @@
---
name: nemo-gym-reward-profiling
license: Apache-2.0
description: >-
Get started with Nemo Gym reward profiling: ng_run, ng_collect_rollouts, and
ng_reward_profile. For failed jobs, prefer nemo-gym-debugging.
metadata:
author: NVIDIA <nemo-gym@nvidia.com>
tags:
- reward-profiling
- rollouts
- evaluation
- metrics
---

# Nemo Gym Reward Profiling

## Purpose

Run and understand Nemo Gym reward profiling: start servers with `ng_run`,
collect rollout artifacts with `ng_collect_rollouts`, and produce profiling
output with `ng_reward_profile`, then inspect the resulting rows and metrics.

## Prerequisites

- NeMo Gym installed with the `ng_run`, `ng_collect_rollouts`, and `ng_reward_profile` CLIs.
- An environment config bundle and an input JSONL dataset.
- A reachable model server (an OpenAI-compatible endpoint or a local vLLM model server).
- Enough disk for rollout and materialized-input artifacts.

## Invocation Check

Use this skill when the user wants to run, understand, or lightly modify Nemo Gym reward profiling. Keep the answer oriented around the normal workflow:
@@ -18,7 +36,12 @@ Use this skill when the user wants to run, understand, or lightly modify Nemo Gy

If the user is primarily debugging a failed job or stack trace, use the `nemo-gym-debugging` skill first.

Do not activate this skill for these adjacent tasks:

- Debugging a failed or crashed run (Ray/vLLM stack traces, empty output). Use `nemo-gym-debugging`.
- Adding or scaffolding a new benchmark, evaluation, or training environment. Use `add-benchmark`.

## Instructions

1. Identify the environment config paths and input JSONL.
2. Start Gym servers with `ng_run`.
@@ -51,3 +74,28 @@ Load references only when the user needs that detail:
- Treat `ng_reward_profile` as the reward profiling step; rollout collection does not write reward profile files.
- Run strict profiling by default. If rollout collection stopped early, use `++allow_partial_rollouts=True` to profile completed rollouts and drop original input rows with no completed rollout.
- Trust the target checkout's CLI help and `nemo_gym/reward_profile.py` over memory if flags differ.

## Examples

Profiling a single config: run `ng_run` for the environment, collect rollouts
with `+num_repeats` greater than one so per-task averages and variance are
meaningful, then run `ng_reward_profile` on the materialized inputs and rollout
JSONL and compare line counts across the artifacts.
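
The line-count comparison can be sketched as below. This is a hedged illustration: it assumes the materialized inputs hold one row per task per repeat, the rollout file holds one row per materialized input, and the profile holds one row per original task. `count_rows` and `check_counts` are hypothetical helper names, not part of the NeMo Gym CLI.

```python
from pathlib import Path


def count_rows(path: Path) -> int:
    """Count non-empty lines in a JSONL file."""
    with path.open(encoding="utf-8") as handle:
        return sum(1 for line in handle if line.strip())


def check_counts(num_tasks: int, num_repeats: int,
                 materialized: int, rollouts: int, profile: int) -> list:
    """Return human-readable mismatches; an empty list means the counts agree."""
    problems = []
    expected = num_tasks * num_repeats  # assumed repeat expansion
    if materialized != expected:
        problems.append(
            f"materialized inputs: expected {expected}, found {materialized}")
    if rollouts != materialized:
        problems.append(f"rollouts: expected {materialized}, found {rollouts}")
    if profile != num_tasks:
        problems.append(f"profile rows: expected {num_tasks}, found {profile}")
    return problems
```

A strict run should report no mismatches. After an interrupted collection, fewer rollout rows than materialized rows is the expected signal.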

Recovering from an interrupted collection: rerun `ng_reward_profile` with
`++allow_partial_rollouts=True` to profile completed rollouts and drop original
input rows that have no completed rollout.

## Limitations

- Per-task averages and variance are only meaningful with multiple rollouts per task; single-repeat runs give point estimates.
- This step summarizes existing rollout artifacts; it does not collect rollouts or fix failed runs.
- Reward semantics are defined by the resource server, not by this workflow.
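
The first limitation can be seen in a small sketch that groups rollouts by task and computes a mean and variance per task. It is a stand-in for illustration only, not what `ng_reward_profile` does internally: the `reward` field name is an assumption, `_ng_task_index` is assumed to be present on each rollout row, and population variance is an arbitrary choice.

```python
from collections import defaultdict
from statistics import mean, pvariance


def per_task_stats(rollouts: list) -> dict:
    """Group rollout rows by task and summarize their rewards.

    Assumptions: each row carries `_ng_task_index` and a numeric `reward`.
    """
    rewards = defaultdict(list)
    for row in rollouts:
        rewards[row["_ng_task_index"]].append(float(row["reward"]))
    return {
        task: {
            "n": len(values),
            "mean": mean(values),
            # Population variance; it is 0.0 for a single rollout, which says
            # nothing about the real spread.
            "variance": pvariance(values),
        }
        for task, values in sorted(rewards.items())
    }
```

With one rollout per task, the mean is just that rollout's reward and the variance collapses to zero, which is why single-repeat runs only give point estimates.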

## Troubleshooting

| Symptom | Likely cause | Resolution |
|---|---|---|
| No reward profile file produced | The profile was expected from rollout collection, which does not write it | Reward profiling is a separate step; run it on the materialized inputs and rollout JSONL |
| Profile rows fewer than input tasks | Rollout collection stopped early | Rerun profiling with partial rollouts allowed (see Practical Defaults) |
| CLI flags differ from this guide | Target checkout version differs | Trust the checkout's CLI help and `nemo_gym/reward_profile.py` |
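
For the second symptom, a short script can show which tasks are affected before rerunning with partial rollouts allowed. This is a sketch under assumptions: it supposes both the materialized inputs and the rollout rows carry `_ng_task_index` and `_ng_rollout_index`, and the helper names are hypothetical.

```python
def tasks_without_rollouts(materialized: list, rollouts: list) -> list:
    """Return task indices that have no completed rollout at all."""
    expected = {row["_ng_task_index"] for row in materialized}
    completed = {row["_ng_task_index"] for row in rollouts}
    return sorted(expected - completed)


def missing_rollouts(materialized: list, rollouts: list) -> list:
    """Return (task, rollout) index pairs that were planned but never completed."""
    def key(row):
        return (row["_ng_task_index"], row["_ng_rollout_index"])

    return sorted({key(r) for r in materialized} - {key(r) for r in rollouts})
```

Tasks returned by `tasks_without_rollouts` correspond to the input rows that partial profiling drops; pairs returned only by `missing_rollouts` belong to tasks that are profiled with fewer repeats than requested.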
60 changes: 60 additions & 0 deletions skills/nemo-gym-reward-profiling/evals/evals.json
@@ -0,0 +1,60 @@
[
{
"id": "nemo-gym-reward-profiling-positive-001",
"question": "How do I run reward profiling on the math benchmark? I want pass@5 numbers.",
"expected_skill": "nemo-gym-reward-profiling",
"ground_truth": "The agent loads the nemo-gym-reward-profiling skill, walks through the ng_run / ng_collect_rollouts / ng_reward_profile sequence, sets num_repeats=5 for pass@5, and explains that ng_reward_profile is the step that writes the *_reward_profiling.jsonl.",
"expected_behavior": [
"The agent read nemo-gym-reward-profiling/SKILL.md before acting",
"The agent walked through the ng_run → ng_collect_rollouts → ng_reward_profile sequence",
"The agent set num_repeats=5 to match the pass@5 ask",
"The agent identified ng_reward_profile (not ng_collect_rollouts) as the step that produces the reward profiling JSONL"
]
},
{
"id": "nemo-gym-reward-profiling-positive-002",
"question": "Can you explain what the *_materialized_inputs.jsonl file is and how it relates to rollouts.jsonl and the reward profile output?",
"expected_skill": "nemo-gym-reward-profiling",
"ground_truth": "The agent loads the nemo-gym-reward-profiling skill and explains materialized_inputs as the expanded collection inputs after repeat expansion plus agent defaults and task/rollout id assignment, rollouts.jsonl as one completed rollout per materialized input row, and *_reward_profiling.jsonl as one summarized profile row per original task with rollout_infos.",
"expected_behavior": [
"The agent read nemo-gym-reward-profiling/SKILL.md before acting",
"The agent explained that *_materialized_inputs.jsonl holds the expanded collection inputs after repeat expansion",
"The agent explained the role of _ng_task_index and _ng_rollout_index for keying analysis",
"The agent mentioned rollout_infos as the compact per-rollout info inside each task profile row"
]
},
{
"id": "nemo-gym-reward-profiling-positive-003",
"question": "Rollout collection stopped early so I have partial output. How do I still run reward profiling on the rollouts that did complete?",
"expected_skill": "nemo-gym-reward-profiling",
"ground_truth": "The agent loads the nemo-gym-reward-profiling skill and uses ++allow_partial_rollouts=True on ng_reward_profile to profile only the completed rollouts and drop original input rows with no completed rollout.",
"expected_behavior": [
"The agent read nemo-gym-reward-profiling/SKILL.md before acting",
"The agent passed ++allow_partial_rollouts=True to ng_reward_profile",
"The agent confirmed that ng_reward_profile (not ng_collect_rollouts) is the step being modified",
"The agent noted the default is strict profiling"
]
},
{
"id": "nemo-gym-reward-profiling-negative-001",
"question": "Add a new GSM8K-style benchmark to NeMo-Gym, including data prep and a verify() implementation.",
"expected_skill": null,
"should_trigger": false,
"ground_truth": "The agent should not activate the nemo-gym-reward-profiling skill for a benchmark integration task. It should use the add-benchmark skill instead.",
"expected_behavior": [
"The agent did not read or activate nemo-gym-reward-profiling/SKILL.md",
"The agent recognized this as a benchmark integration task"
]
},
{
"id": "nemo-gym-reward-profiling-negative-002",
"question": "My ng_collect_rollouts crashed with a Ray actor stack trace and no rollouts.jsonl was written. Help me find the root cause.",
"expected_skill": null,
"should_trigger": false,
"ground_truth": "The agent should not activate the nemo-gym-reward-profiling skill for a failed-job debugging task. It should use the nemo-gym-debugging skill instead.",
"expected_behavior": [
"The agent did not read or activate nemo-gym-reward-profiling/SKILL.md",
"The agent recognized this as a debugging task and deferred to nemo-gym-debugging"
]
}
]
@@ -4,12 +4,20 @@ Substitute environment-specific config paths, input data, model endpoint, and ou

## Minimal Flow

Provide the policy endpoint key through the environment rather than on the
command line. Export it from your shell session or a secrets manager before
running. It is read at runtime via the `${oc.env:...}` resolver, so the value
never appears in the process arguments (visible via `ps`) or in shell history.
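
The effect of deferring resolution can be illustrated with a small stand-in written in plain Python. It is not OmegaConf's implementation of the `oc.env` resolver (which supports more, such as default values); `resolve_env_refs` is a hypothetical helper that only shows why the literal placeholder, rather than the secret, is what ends up in the argument list.

```python
import os
import re

# Matches placeholders of the form ${oc.env:NAME}.
ENV_REF = re.compile(r"\$\{oc\.env:([A-Za-z_][A-Za-z0-9_]*)\}")


def resolve_env_refs(value: str, env=None) -> str:
    """Replace ${oc.env:NAME} placeholders with values from the environment.

    Illustrative stand-in only; raises KeyError if a variable is not set.
    """
    env = os.environ if env is None else env

    def _substitute(match):
        name = match.group(1)
        if name not in env:
            raise KeyError(f"environment variable {name} is not set")
        return env[name]

    return ENV_REF.sub(_substitute, value)


# What the process sees on its command line: the placeholder, not the key.
ARGV = ["ng_run", "++policy_api_key=${oc.env:POLICY_ENDPOINT_KEY}"]
```

Resolution happens inside the process after startup, so tools that list process arguments only ever see the placeholder text.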

```bash
CONFIG_PATHS="your_model_config_paths,your_env_config_paths"

POLICY_MODEL_NAME="your_policy_model_name"
POLICY_BASE_URL="your_policy_base_url"

# Require the key from the environment; this validates presence without echoing
# the value. Set it beforehand, e.g. with `export POLICY_ENDPOINT_KEY=...`.
: "${POLICY_ENDPOINT_KEY:?export POLICY_ENDPOINT_KEY before running}"

DATA_JSONL="/path/to/your_input.jsonl"
ROLLOUTS_JSONL="/path/to/your_rollouts.jsonl"
@@ -22,7 +30,7 @@ NUM_SAMPLES_IN_PARALLEL=8
ng_run "+config_paths=[$CONFIG_PATHS]" \
+policy_model_name="$POLICY_MODEL_NAME" \
+policy_base_url="$POLICY_BASE_URL" \
'++policy_api_key=${oc.env:POLICY_ENDPOINT_KEY}' &
NG_RUN_PID=$!
trap 'kill "$NG_RUN_PID" 2>/dev/null || true' EXIT
./scripts/wait_for_servers.sh "$NG_RUN_PID"
76 changes: 76 additions & 0 deletions skills/nemo-gym-reward-profiling/skill-card.md
@@ -0,0 +1,76 @@
## Description: <br>
Get started with Nemo Gym reward profiling: ng_run, ng_collect_rollouts, and ng_reward_profile. <br>

This skill is ready for commercial/non-commercial use. <br>

## Owner
NVIDIA <br>

### License/Terms of Use: <br>
Apache 2.0 <br>
## Use Case: <br>
Developers and engineers running Nemo Gym reward profiling workflows to evaluate and profile model performance across tasks using ng_run, ng_collect_rollouts, and ng_reward_profile. <br>

### Deployment Geography for Use: <br>
Global <br>

## Known Risks and Mitigations: <br>
Risk: Proposed changes to the skill could introduce incorrect or misleading guidance, so review them before execution. <br>
Mitigation: Review and scan skill before deployment. <br>

## Reference(s): <br>
- [Quick Start](references/quick-start.md) <br>
- [Output Format](references/output-format.md) <br>


## Skill Output: <br>
**Output Type(s):** [Shell commands, Analysis] <br>
**Output Format:** [Markdown with inline bash code blocks] <br>
**Output Parameters:** [1D] <br>
**Other Properties Related to Output:** [None] <br>

## Evaluation Agents Used: <br>
- `claude-code` <br>
- `codex` <br>



## Evaluation Tasks: <br>
Evaluated against 5 evaluation tasks (3 positive skill-activation tasks, 2 negative tasks) with 2 attempts per task. <br>

## Evaluation Metrics Used: <br>
Reported benchmark dimensions: <br>
- Security: Checks whether skill-assisted execution avoids unsafe behavior such as secret leakage, destructive commands, or unauthorized access. <br>
- Correctness: Checks whether the agent follows the expected workflow and produces the correct final output. <br>
- Discoverability: Checks whether the agent loads the skill when relevant and avoids using it when irrelevant. <br>
- Effectiveness: Checks whether the agent performs measurably better with the skill than without it. <br>
- Efficiency: Checks whether the agent uses fewer tokens and avoids redundant work. <br>

Underlying evaluation signals used in this run: <br>
- `security`: Checks for unsafe operations, secret leakage, and unauthorized access. <br>
- `skill_execution`: Verifies that the agent loaded the expected skill and workflow. <br>
- `skill_efficiency`: Checks routing quality, decoy avoidance, and redundant tool usage. <br>
- `accuracy`: Grades final-answer correctness against the reference answer. <br>
- `goal_accuracy`: Checks whether the overall user task completed successfully. <br>
- `behavior_check`: Verifies expected behavior steps, including safety expectations. <br>
- `token_efficiency`: Compares token usage with and without the skill. <br>



## Evaluation Results: <br>
| Dimension | Num | `claude-code` | `codex` |
|---|---:|---:|---:|
| Security | 8 | 100% (+5%) | 100% (+20%) |
| Correctness | 8 | 72% (-13%) | 90% (+2%) |
| Discoverability | 8 | 64% (-13%) | 88% (+5%) |
| Effectiveness | 8 | 64% (-14%) | 80% (-5%) |
| Efficiency | 8 | 57% (-6%) | 75% (+5%) |

## Skill Version(s): <br>
0583bf68 (source: git SHA, committed 2026-05-29) <br>

## Ethical Considerations: <br>
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal team to ensure this skill meets requirements for the relevant industry and use case and addresses unforeseen product misuse. <br>

(For Release on NVIDIA Platforms Only) <br>
Please report quality, risk, security vulnerabilities or NVIDIA AI Concerns [here](https://app.intigriti.com/programs/nvidia/nvidiavdp/detail). <br>