NVIDIA-NeMo · ananthsub · May 28, 2026 · May 29, 2026 · May 29, 2026 · May 30, 2026
diff --git a/.claude/skills/nemo-gym-reward-profiling b/.claude/skills/nemo-gym-reward-profiling
@@ -0,0 +1 @@
+../../skills/nemo-gym-reward-profiling
diff --git a/skills/nemo-gym-reward-profiling/BENCHMARK.md b/skills/nemo-gym-reward-profiling/BENCHMARK.md
@@ -0,0 +1,88 @@
+# Evaluation Report
+
+Evaluation of the `nemo-gym-reward-profiling` skill before publication through NVSkills-Eval.
+
+This benchmark summarizes the NVSkills-Eval 3-tier evaluation results for the skill. The goal is to document whether the skill is safe, discoverable, effective, and useful for agents before it is published for broader workflow use.
+
+## Evaluation Summary
+
+- Skill: `nemo-gym-reward-profiling`
+- Evaluation date: 2026-05-30
+- NVSkills-Eval profile: `external`
+- Environment: `local`
+- Dataset: 5 evaluation tasks
+- Attempts per task: 2
+- Pass threshold: 50%
+- Overall verdict: PASS
+
+## Agents Used
+
+- `claude-code`
+- `codex`
+
+## Metrics Used
+
+Reported benchmark dimensions:
+
+- Security: checks whether skill-assisted execution avoids unsafe behavior such as secret leakage, destructive commands, or unauthorized access.
+- Correctness: checks whether the agent follows the expected workflow and produces the correct final output.
+- Discoverability: checks whether the agent loads the skill when relevant and avoids using it when irrelevant.
+- Effectiveness: checks whether the agent performs measurably better with the skill than without it.
+- Efficiency: checks whether the agent uses fewer tokens and avoids redundant work.
+
+Underlying evaluation signals used in this run:
+
+- `security` (Security): checks for unsafe operations, secret leakage, and unauthorized access.
+- `skill_execution` (Skill Execution): verifies that the agent loaded the expected skill and workflow.
+- `skill_efficiency` (Efficiency): checks routing quality, decoy avoidance, and redundant tool usage.
+- `accuracy` (Accuracy): grades final-answer correctness against the reference answer.
+- `goal_accuracy` (Goal Accuracy): checks whether the overall user task completed successfully.
+- `behavior_check` (Behavior Check): verifies expected behavior steps, including safety expectations.
+- `token_efficiency` (Token Efficiency): compares token usage with and without the skill.
+
+## Test Tasks
+
+The benchmark dataset contained 5 evaluation tasks:
+
+- Positive tasks: 3 tasks where the skill was expected to activate.
+- Negative tasks: 2 tasks where no skill was expected.
+- Unlabeled tasks: 0 tasks where positive/negative intent could not be inferred.
+
+Task composition is derived from the evaluation dataset when possible. Entries with `expected_skill` set are treated as positive skill-activation cases, while entries with `expected_skill: null` are treated as negative activation cases.
+
+## Results
+
+| Dimension | Num | `claude-code` | `codex` |
+|---|---:|---:|---:|
+| Security | 8 | 100% (+5%) | 100% (+20%) |
+| Correctness | 8 | 72% (-13%) | 90% (+2%) |
+| Discoverability | 8 | 64% (-13%) | 88% (+5%) |
+| Effectiveness | 8 | 64% (-14%) | 80% (-5%) |
+| Efficiency | 8 | 57% (-6%) | 75% (+5%) |
+
+Score values show skill-assisted performance. Values in parentheses show the change versus the no-skill baseline when baseline data is available; a negative value means the skill-assisted run scored lower than the baseline.
+
+## Tier 1: Static Validation Summary
+
+Tier 1 validation passed. NVSkills-Eval ran 9 checks and found 0 total findings.
+
+Notable observations:
+
+- SECURITY: No security vulnerabilities detected (secrets, API keys, credentials)
+- SCHEMA: Found skill manifest: SKILL.md
+- VERSION: No semantic version label present; resource will use commit-hash history (opting back out of an existing label is allowed)
+- PII: Scanning 3 files for PII
+- LICENSE: no findings reported.
+
+## Tier 2: Deduplication Summary
+
+Tier 2 validation passed. NVSkills-Eval ran 2 checks and found 0 total findings.
+
+Notable observations:
+
+- Context Deduplication: Collected 3 file(s)
+- Inter-Skill Deduplication: Parsed skill 'nemo-gym-reward-profiling': 139 char description
+
+## Publication Recommendation
+
+The skill is suitable to proceed toward NVSkills-Eval publication based on this benchmark. Skill owners should keep this file with the skill and refresh it when the evaluation dataset, skill behavior, or target agents materially change.
diff --git a/...skills/nemo-gym-reward-profiling/SKILL.md → skills/nemo-gym-reward-profiling/SKILL.md b/...skills/nemo-gym-reward-profiling/SKILL.md → skills/nemo-gym-reward-profiling/SKILL.md
@@ -1,15 +1,33 @@
 ---
 name: nemo-gym-reward-profiling
+license: Apache-2.0
 description: >-
-  Use to help users get started with Nemo Gym reward profiling. Covers the basic
-  ng_run, ng_collect_rollouts, and ng_reward_profile workflow, repeated rollouts,
-  materialized inputs, rollout JSONL artifacts, task and rollout identity, output
-  inspection, partial profiling, and rollout_infos. For failed jobs, prefer
-  nemo-gym-debugging.
+  Get started with Nemo Gym reward profiling: ng_run, ng_collect_rollouts, and
+  ng_reward_profile. For failed jobs, prefer nemo-gym-debugging.
+metadata:
+  author: NVIDIA <nemo-gym@nvidia.com>
+  tags:
+    - reward-profiling
+    - rollouts
+    - evaluation
+    - metrics
 ---
 
 # Nemo Gym Reward Profiling
 
+## Purpose
+
+Run and understand Nemo Gym reward profiling: start servers with `ng_run`,
+collect rollout artifacts with `ng_collect_rollouts`, and produce profiling
+output with `ng_reward_profile`, then inspect the resulting rows and metrics.
+
+## Prerequisites
+
+- NeMo Gym installed with the `ng_run`, `ng_collect_rollouts`, and `ng_reward_profile` CLIs.
+- An environment config bundle and an input JSONL dataset.
+- A reachable model server (an OpenAI-compatible endpoint or a local vLLM model server).
+- Enough disk for rollout and materialized-input artifacts.
+
 ## Invocation Check
 
 Use this skill when the user wants to run, understand, or lightly modify Nemo Gym reward profiling. Keep the answer oriented around the normal workflow:
@@ -18,7 +36,12 @@ Use this skill when the user wants to run, understand, or lightly modify Nemo Gy
 
 If the user is primarily debugging a failed job or stack trace, use the `nemo-gym-debugging` skill first.
 
-## Basic Workflow
+Do not activate this skill for these adjacent tasks:
+
+- Debugging a failed or crashed run (Ray/vLLM stack traces, empty output). Use `nemo-gym-debugging`.
+- Adding or scaffolding a new benchmark, evaluation, or training environment. Use `add-benchmark`.
+
+## Instructions
 
 1. Identify the environment config paths and input JSONL.
 2. Start Gym servers with `ng_run`.
@@ -51,3 +74,28 @@ Load references only when the user needs that detail:
 - Treat `ng_reward_profile` as the reward profiling step; rollout collection does not write reward profile files.
 - Run strict profiling by default. If rollout collection stopped early, use `++allow_partial_rollouts=True` to profile completed rollouts and drop original input rows with no completed rollout.
 - Trust the target checkout's CLI help and `nemo_gym/reward_profile.py` over memory if flags differ.
+
+## Examples
+
+Profiling a single config: run `ng_run` for the environment, collect rollouts
+with `+num_repeats` greater than one so per-task averages and variance are
+meaningful, then run `ng_reward_profile` on the materialized inputs and rollout
+JSONL and compare line counts across the artifacts.
+
+Recovering from an interrupted collection: rerun `ng_reward_profile` with
+`++allow_partial_rollouts=True` to profile completed rollouts and drop original
+input rows that have no completed rollout.
+
+## Limitations
+
+- Per-task averages and variance are only meaningful with multiple rollouts per task; single-repeat runs give point estimates.
+- This step summarizes existing rollout artifacts; it does not collect rollouts or fix failed runs.
+- Reward semantics are defined by the resource server, not by this workflow.
+
+## Troubleshooting
+
+| Symptom | Likely cause | Resolution |
+|---|---|---|
+| No reward profile file produced | Only rollout collection was run, and it does not write reward profile files | Reward profiling is a separate step; run `ng_reward_profile` on the materialized inputs and rollout JSONL |
+| Profile rows fewer than input tasks | Rollout collection stopped early | Rerun profiling with partial rollouts allowed (see Practical Defaults) |
+| CLI flags differ from this guide | Target checkout version differs | Trust the checkout's CLI help and `nemo_gym/reward_profile.py` |
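The line-count comparison mentioned in the SKILL.md Examples section could be sketched roughly as follows. This is a minimal illustration, not part of the change: the helper names `count_lines` and `check_artifact_counts` are hypothetical, and the expected relationships (materialized inputs equal original inputs times `num_repeats`, one rollout row per materialized input row, one profile row per original task) are assumptions drawn from the artifact descriptions in the evals below, so verify them against the target checkout.

```shell
#!/usr/bin/env bash
# Sketch only: compare row counts across reward-profiling artifacts.
# Helper names and the expected ratios are assumptions for illustration.

# Print the number of newline-terminated lines in a file.
count_lines() {
    wc -l < "$1" | tr -d '[:space:]'
}

# Usage: check_artifact_counts INPUT MATERIALIZED ROLLOUTS PROFILE NUM_REPEATS
# Returns 0 when every count matches the assumed relationship, 1 otherwise.
check_artifact_counts() {
    local inputs materialized rollouts profile repeats status=0
    inputs=$(count_lines "$1")
    materialized=$(count_lines "$2")
    rollouts=$(count_lines "$3")
    profile=$(count_lines "$4")
    repeats=$5

    echo "inputs=$inputs materialized=$materialized rollouts=$rollouts profile=$profile"

    # Assumed: repeat expansion yields one materialized row per (task, repeat).
    if [ "$materialized" -ne $((inputs * repeats)) ]; then
        echo "materialized inputs: expected $((inputs * repeats)) rows" >&2
        status=1
    fi
    # Assumed: one completed rollout per materialized input row.
    if [ "$rollouts" -ne "$materialized" ]; then
        echo "rollouts: expected $materialized rows (collection may have stopped early)" >&2
        status=1
    fi
    # Assumed: one profile row per original task under strict profiling.
    if [ "$profile" -ne "$inputs" ]; then
        echo "profile: expected $inputs rows" >&2
        status=1
    fi
    return "$status"
}
```

A call might look like `check_artifact_counts "$DATA_JSONL" "$MATERIALIZED_JSONL" "$ROLLOUTS_JSONL" "$PROFILE_JSONL" 5`, where every variable other than `DATA_JSONL` and `ROLLOUTS_JSONL` is a placeholder. A rollout count below the materialized count is the situation where `++allow_partial_rollouts=True` applies. Note that `wc -l` undercounts by one if the last row has no trailing newline.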
diff --git a/skills/nemo-gym-reward-profiling/evals/evals.json b/skills/nemo-gym-reward-profiling/evals/evals.json
@@ -0,0 +1,60 @@
+[
+  {
+    "id": "nemo-gym-reward-profiling-positive-001",
+    "question": "How do I run reward profiling on the math benchmark? I want pass@5 numbers.",
+    "expected_skill": "nemo-gym-reward-profiling",
+    "ground_truth": "The agent loads the nemo-gym-reward-profiling skill, walks through the ng_run / ng_collect_rollouts / ng_reward_profile sequence, sets num_repeats=5 for pass@5, and explains that ng_reward_profile is the step that writes the *_reward_profiling.jsonl.",
+    "expected_behavior": [
+      "The agent read nemo-gym-reward-profiling/SKILL.md before acting",
+      "The agent walked through the ng_run → ng_collect_rollouts → ng_reward_profile sequence",
+      "The agent set num_repeats=5 to match the pass@5 ask",
+      "The agent identified ng_reward_profile (not ng_collect_rollouts) as the step that produces the reward profiling JSONL"
+    ]
+  },
+  {
+    "id": "nemo-gym-reward-profiling-positive-002",
+    "question": "Can you explain what the *_materialized_inputs.jsonl file is and how it relates to rollouts.jsonl and the reward profile output?",
+    "expected_skill": "nemo-gym-reward-profiling",
+    "ground_truth": "The agent loads the nemo-gym-reward-profiling skill and explains materialized_inputs as the expanded collection inputs after repeat expansion plus agent defaults and task/rollout id assignment, rollouts.jsonl as one completed rollout per materialized input row, and *_reward_profiling.jsonl as one summarized profile row per original task with rollout_infos.",
+    "expected_behavior": [
+      "The agent read nemo-gym-reward-profiling/SKILL.md before acting",
+      "The agent explained that *_materialized_inputs.jsonl holds the expanded collection inputs after repeat expansion",
+      "The agent explained the role of _ng_task_index and _ng_rollout_index for keying analysis",
+      "The agent mentioned rollout_infos as the compact per-rollout info inside each task profile row"
+    ]
+  },
+  {
+    "id": "nemo-gym-reward-profiling-positive-003",
+    "question": "Rollout collection stopped early so I have partial output. How do I still run reward profiling on the rollouts that did complete?",
+    "expected_skill": "nemo-gym-reward-profiling",
+    "ground_truth": "The agent loads the nemo-gym-reward-profiling skill and uses ++allow_partial_rollouts=True on ng_reward_profile to profile only the completed rollouts and drop original input rows with no completed rollout.",
+    "expected_behavior": [
+      "The agent read nemo-gym-reward-profiling/SKILL.md before acting",
+      "The agent passed ++allow_partial_rollouts=True to ng_reward_profile",
+      "The agent confirmed that ng_reward_profile (not ng_collect_rollouts) is the step being modified",
+      "The agent noted the default is strict profiling"
+    ]
+  },
+  {
+    "id": "nemo-gym-reward-profiling-negative-001",
+    "question": "Add a new GSM8K-style benchmark to NeMo-Gym, including data prep and a verify() implementation.",
+    "expected_skill": null,
+    "should_trigger": false,
+    "ground_truth": "The agent should not activate the nemo-gym-reward-profiling skill for a benchmark integration task. It should use the add-benchmark skill instead.",
+    "expected_behavior": [
+      "The agent did not read or activate nemo-gym-reward-profiling/SKILL.md",
+      "The agent recognized this as a benchmark integration task"
+    ]
+  },
+  {
+    "id": "nemo-gym-reward-profiling-negative-002",
+    "question": "My ng_collect_rollouts crashed with a Ray actor stack trace and no rollouts.jsonl was written. Help me find the root cause.",
+    "expected_skill": null,
+    "should_trigger": false,
+    "ground_truth": "The agent should not activate the nemo-gym-reward-profiling skill for a failed-job debugging task. It should use the nemo-gym-debugging skill instead.",
+    "expected_behavior": [
+      "The agent did not read or activate nemo-gym-reward-profiling/SKILL.md",
+      "The agent recognized this as a debugging task and deferred to nemo-gym-debugging"
+    ]
+  }
+]
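The positive/negative split reported in BENCHMARK.md (3 positive, 2 negative) can be cross-checked against `evals.json` with a rough sketch like the one below. It assumes the file keeps the one-key-per-line layout shown in this diff and that `expected_skill` is either a quoted string or `null`; `count_eval_tasks` is a hypothetical helper, and a real JSON parser would be more robust than `grep`.

```shell
#!/usr/bin/env bash
# Sketch only: tally positive and negative eval tasks by the expected_skill key.
# Relies on the one-key-per-line formatting of evals.json; not a JSON parser.

# Usage: count_eval_tasks EVALS_JSON
count_eval_tasks() {
    local file=$1 negative positive
    # grep -c exits non-zero when the count is 0, so guard with `|| true`.
    negative=$(grep -c '"expected_skill": null' "$file" || true)
    positive=$(grep -c '"expected_skill": "' "$file" || true)
    echo "positive=$positive negative=$negative"
}
```

Run as `count_eval_tasks skills/nemo-gym-reward-profiling/evals/evals.json` from the repository root and compare the printed tallies with the Test Tasks section of the benchmark report.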
diff --git a/...ard-profiling/references/output-format.md → ...ard-profiling/references/output-format.md b/...ard-profiling/references/output-format.md → ...ard-profiling/references/output-format.md
diff --git a/...eward-profiling/references/quick-start.md → ...eward-profiling/references/quick-start.md b/...eward-profiling/references/quick-start.md → ...eward-profiling/references/quick-start.md
@@ -4,12 +4,20 @@ Substitute environment-specific config paths, input data, model endpoint, and ou
 
 ## Minimal Flow
 
+Provide the policy endpoint key through the environment rather than on the
+command line. Export it from your shell session or a secrets manager before
+running. It is read at runtime via the `${oc.env:...}` resolver, so the value
+never appears in the process arguments (visible via `ps`) or in shell history.
+
 ```bash
 CONFIG_PATHS="your_model_config_paths,your_env_config_paths"
 
 POLICY_MODEL_NAME="your_policy_model_name"
 POLICY_BASE_URL="your_policy_base_url"
-POLICY_ENDPOINT_KEY="your_policy_endpoint_key"
+
+# Require the key from the environment; this validates presence without echoing
+# the value. Set it beforehand, e.g. with `export POLICY_ENDPOINT_KEY=...`.
+: "${POLICY_ENDPOINT_KEY:?export POLICY_ENDPOINT_KEY before running}"
 
 DATA_JSONL="/path/to/your_input.jsonl"
 ROLLOUTS_JSONL="/path/to/your_rollouts.jsonl"
@@ -22,7 +30,7 @@ NUM_SAMPLES_IN_PARALLEL=8
 ng_run "+config_paths=[$CONFIG_PATHS]" \
     +policy_model_name="$POLICY_MODEL_NAME" \
     +policy_base_url="$POLICY_BASE_URL" \
-    +policy_api_key="$POLICY_ENDPOINT_KEY" &
+    '++policy_api_key=${oc.env:POLICY_ENDPOINT_KEY}' &
 NG_RUN_PID=$!
 trap 'kill "$NG_RUN_PID" 2>/dev/null || true' EXIT
 ./scripts/wait_for_servers.sh "$NG_RUN_PID"

diff --git a/skills/nemo-gym-reward-profiling/skill-card.md b/skills/nemo-gym-reward-profiling/skill-card.md
@@ -0,0 +1,76 @@
+## Description: <br>
+Get started with Nemo Gym reward profiling: ng_run, ng_collect_rollouts, and ng_reward_profile. <br>
+
+This skill is ready for commercial/non-commercial use. <br>
+
+## Owner
+NVIDIA <br>
+
+### License/Terms of Use: <br>
+Apache 2.0 <br>
+## Use Case: <br>
+Developers and engineers running Nemo Gym reward profiling workflows to evaluate and profile model performance across tasks using ng_run, ng_collect_rollouts, and ng_reward_profile. <br>
+
+### Deployment Geography for Use: <br>
+Global <br>
+
+## Known Risks and Mitigations: <br>
+Risk: Skill proposals could introduce incorrect or misleading guidance into skills if executed without review. <br>
+Mitigation: Review and scan skill before deployment. <br>
+
+## Reference(s): <br>
+- [Quick Start](references/quick-start.md) <br>
+- [Output Format](references/output-format.md) <br>
+
+
+## Skill Output: <br>
+**Output Type(s):** [Shell commands, Analysis] <br>
+**Output Format:** [Markdown with inline bash code blocks] <br>
+**Output Parameters:** [1D] <br>
+**Other Properties Related to Output:** [None] <br>
+
+## Evaluation Agents Used: <br>
+- `claude-code` <br>
+- `codex` <br>
+
+
+
+## Evaluation Tasks: <br>
+Evaluated against 5 evaluation tasks (3 positive skill-activation tasks, 2 negative tasks) with 2 attempts per task. <br>
+
+## Evaluation Metrics Used: <br>
+Reported benchmark dimensions: <br>
+- Security: Checks whether skill-assisted execution avoids unsafe behavior such as secret leakage, destructive commands, or unauthorized access. <br>
+- Correctness: Checks whether the agent follows the expected workflow and produces the correct final output. <br>
+- Discoverability: Checks whether the agent loads the skill when relevant and avoids using it when irrelevant. <br>
+- Effectiveness: Checks whether the agent performs measurably better with the skill than without it. <br>
+- Efficiency: Checks whether the agent uses fewer tokens and avoids redundant work. <br>
+
+Underlying evaluation signals used in this run: <br>
+- `security`: Checks for unsafe operations, secret leakage, and unauthorized access. <br>
+- `skill_execution`: Verifies that the agent loaded the expected skill and workflow. <br>
+- `skill_efficiency`: Checks routing quality, decoy avoidance, and redundant tool usage. <br>
+- `accuracy`: Grades final-answer correctness against the reference answer. <br>
+- `goal_accuracy`: Checks whether the overall user task completed successfully. <br>
+- `behavior_check`: Verifies expected behavior steps, including safety expectations. <br>
+- `token_efficiency`: Compares token usage with and without the skill. <br>
+
+
+
+## Evaluation Results: <br>
+| Dimension | Num | `claude-code` | `codex` |
+|---|---:|---:|---:|
+| Security | 8 | 100% (+5%) | 100% (+20%) |
+| Correctness | 8 | 72% (-13%) | 90% (+2%) |
+| Discoverability | 8 | 64% (-13%) | 88% (+5%) |
+| Effectiveness | 8 | 64% (-14%) | 80% (-5%) |
+| Efficiency | 8 | 57% (-6%) | 75% (+5%) |
+
+## Skill Version(s): <br>
+0583bf68 (source: git SHA, committed 2026-05-29) <br>
+
+## Ethical Considerations: <br>
+NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal team to ensure this skill meets requirements for the relevant industry and use case and addresses unforeseen product misuse. <br>
+
+(For Release on NVIDIA Platforms Only) <br>
+Please report quality, risk, security vulnerabilities or NVIDIA AI Concerns [here](https://app.intigriti.com/programs/nvidia/nvidiavdp/detail). <br>