From 12de65707383ef67a18d87ba070047c86095d877 Mon Sep 17 00:00:00 2001
From: Ananth Subramaniam
Date: Thu, 28 May 2026 16:58:32 -0700
Subject: [PATCH 1/5] ci(nemo-gym-reward-profiling): migrate to skills/ for NVSkills CI

## Summary

- Move `.claude/skills/nemo-gym-reward-profiling/` to top-level `skills/nemo-gym-reward-profiling/` so this PR touches files under the central `team-request.yml` trigger allowlist (`skills/`, `team-skills/`, `rules/team-rules/`, `plugins/`).
- Replace `.claude/skills/nemo-gym-reward-profiling/` with a symlink to `../../skills/nemo-gym-reward-profiling` so Claude Code and Cursor continue to discover the skill via the conventional `.claude/skills//SKILL.md` path with no tool-side change.
- Add `license: Apache-2.0` to `skills/nemo-gym-reward-profiling/SKILL.md` frontmatter.
- Add `skills/nemo-gym-reward-profiling/evals/evals.json` with positive trigger cases (ng_run/ng_collect_rollouts/ng_reward_profile workflow, repeated rollouts, materialized inputs) and negative cases that delegate to sibling skills.

## Motivation

Prepares the `nemo-gym-reward-profiling` skill for NVSkills CI signing. Per-skill scope keeps the diff small and lets NVSkills CI evaluate one skill at a time. Other skills remain at `.claude/skills//` until each has its own migration PR.

## Test plan

- [ ] Comment `/nvskills-ci` on this PR. Expect the request workflow to dispatch (not skip) and `svc-nvskills-signing` to attach `skill-card.md` and `skill.oms.sig` under `skills/nemo-gym-reward-profiling/`.
- [ ] Claude Code discovers `nemo-gym-reward-profiling` via `.claude/skills/nemo-gym-reward-profiling/SKILL.md` (follows symlink).

Signed-off-by: Ananth Subramaniam --- .claude/skills/nemo-gym-reward-profiling | 1 + .../nemo-gym-reward-profiling/SKILL.md | 1 + .../evals/evals.json | 60 +++++++++++++++++++ .../references/output-format.md | 0 .../references/quick-start.md | 0 5 files changed, 62 insertions(+) create mode 120000 .claude/skills/nemo-gym-reward-profiling rename {.claude/skills => skills}/nemo-gym-reward-profiling/SKILL.md (99%) create mode 100644 skills/nemo-gym-reward-profiling/evals/evals.json rename {.claude/skills => skills}/nemo-gym-reward-profiling/references/output-format.md (100%) rename {.claude/skills => skills}/nemo-gym-reward-profiling/references/quick-start.md (100%) diff --git a/.claude/skills/nemo-gym-reward-profiling b/.claude/skills/nemo-gym-reward-profiling new file mode 120000 index 000000000..c405981ca --- /dev/null +++ b/.claude/skills/nemo-gym-reward-profiling @@ -0,0 +1 @@ +../../skills/nemo-gym-reward-profiling \ No newline at end of file diff --git a/.claude/skills/nemo-gym-reward-profiling/SKILL.md b/skills/nemo-gym-reward-profiling/SKILL.md similarity index 99% rename from .claude/skills/nemo-gym-reward-profiling/SKILL.md rename to skills/nemo-gym-reward-profiling/SKILL.md index 1d177d087..acf7df6bb 100644 --- a/.claude/skills/nemo-gym-reward-profiling/SKILL.md +++ b/skills/nemo-gym-reward-profiling/SKILL.md @@ -1,5 +1,6 @@ --- name: nemo-gym-reward-profiling +license: Apache-2.0 description: >- Use to help users get started with Nemo Gym reward profiling. Covers the basic ng_run, ng_collect_rollouts, and ng_reward_profile workflow, repeated rollouts, diff --git a/skills/nemo-gym-reward-profiling/evals/evals.json b/skills/nemo-gym-reward-profiling/evals/evals.json new file mode 100644 index 000000000..039b92fa6 --- /dev/null +++ b/skills/nemo-gym-reward-profiling/evals/evals.json @@ -0,0 +1,60 @@ +[ + { + "id": "nemo-gym-reward-profiling-positive-001", + "question": "How do I run reward profiling on the math benchmark? 
I want pass@5 numbers.", + "expected_skill": "nemo-gym-reward-profiling", + "ground_truth": "The agent loads the nemo-gym-reward-profiling skill, walks through the ng_run / ng_collect_rollouts / ng_reward_profile sequence, sets num_repeats=5 for pass@5, and explains that ng_reward_profile is the step that writes the *_reward_profiling.jsonl.", + "expected_behavior": [ + "The agent read nemo-gym-reward-profiling/SKILL.md before acting", + "The agent walked through the ng_run → ng_collect_rollouts → ng_reward_profile sequence", + "The agent set num_repeats=5 to match the pass@5 ask", + "The agent identified ng_reward_profile (not ng_collect_rollouts) as the step that produces the reward profiling JSONL" + ] + }, + { + "id": "nemo-gym-reward-profiling-positive-002", + "question": "Can you explain what the *_materialized_inputs.jsonl file is and how it relates to rollouts.jsonl and the reward profile output?", + "expected_skill": "nemo-gym-reward-profiling", + "ground_truth": "The agent loads the nemo-gym-reward-profiling skill and explains materialized_inputs as the expanded collection inputs after repeat expansion plus agent defaults and task/rollout id assignment, rollouts.jsonl as one completed rollout per materialized input row, and *_reward_profiling.jsonl as one summarized profile row per original task with rollout_infos.", + "expected_behavior": [ + "The agent read nemo-gym-reward-profiling/SKILL.md before acting", + "The agent explained that *_materialized_inputs.jsonl holds the expanded collection inputs after repeat expansion", + "The agent explained the role of _ng_task_index and _ng_rollout_index for keying analysis", + "The agent mentioned rollout_infos as the compact per-rollout info inside each task profile row" + ] + }, + { + "id": "nemo-gym-reward-profiling-positive-003", + "question": "Rollout collection stopped early so I have partial output. 
How do I still run reward profiling on the rollouts that did complete?", + "expected_skill": "nemo-gym-reward-profiling", + "ground_truth": "The agent loads the nemo-gym-reward-profiling skill and uses ++allow_partial_rollouts=True on ng_reward_profile to profile only the completed rollouts and drop original input rows with no completed rollout.", + "expected_behavior": [ + "The agent read nemo-gym-reward-profiling/SKILL.md before acting", + "The agent passed ++allow_partial_rollouts=True to ng_reward_profile", + "The agent confirmed that ng_reward_profile (not ng_collect_rollouts) is the step being modified", + "The agent noted the default is strict profiling" + ] + }, + { + "id": "nemo-gym-reward-profiling-negative-001", + "question": "Add a new GSM8K-style benchmark to NeMo-Gym, including data prep and a verify() implementation.", + "expected_skill": null, + "should_trigger": false, + "ground_truth": "The agent should not activate the nemo-gym-reward-profiling skill for a benchmark integration task. It should use the add-benchmark skill instead.", + "expected_behavior": [ + "The agent did not read or activate nemo-gym-reward-profiling/SKILL.md", + "The agent recognized this as a benchmark integration task" + ] + }, + { + "id": "nemo-gym-reward-profiling-negative-002", + "question": "My ng_collect_rollouts crashed with a Ray actor stack trace and no rollouts.jsonl was written. Help me find the root cause.", + "expected_skill": null, + "should_trigger": false, + "ground_truth": "The agent should not activate the nemo-gym-reward-profiling skill for a failed-job debugging task. 
It should use the nemo-gym-debugging skill instead.", + "expected_behavior": [ + "The agent did not read or activate nemo-gym-reward-profiling/SKILL.md", + "The agent recognized this as a debugging task and deferred to nemo-gym-debugging" + ] + } +] diff --git a/.claude/skills/nemo-gym-reward-profiling/references/output-format.md b/skills/nemo-gym-reward-profiling/references/output-format.md similarity index 100% rename from .claude/skills/nemo-gym-reward-profiling/references/output-format.md rename to skills/nemo-gym-reward-profiling/references/output-format.md diff --git a/.claude/skills/nemo-gym-reward-profiling/references/quick-start.md b/skills/nemo-gym-reward-profiling/references/quick-start.md similarity index 100% rename from .claude/skills/nemo-gym-reward-profiling/references/quick-start.md rename to skills/nemo-gym-reward-profiling/references/quick-start.md From 1a732becdcf37c8573b799cf0ee08ceee33a4c58 Mon Sep 17 00:00:00 2001 From: Ananth Subramaniam Date: Fri, 29 May 2026 09:03:29 -0700 Subject: [PATCH 2/5] ci(nemo-gym-reward-profiling): address NVSkills CI content feedback Apply the same content fixes validated on add-benchmark: - Add metadata.author and metadata.tags. - Tighten the description (the negative trigger to nemo-gym-debugging is retained). - Add a Purpose section and an Examples section, and rename Basic Workflow to Instructions. Signed-off-by: Ananth Subramaniam --- skills/nemo-gym-reward-profiling/SKILL.md | 35 +++++++++++++++++++---- 1 file changed, 29 insertions(+), 6 deletions(-) diff --git a/skills/nemo-gym-reward-profiling/SKILL.md b/skills/nemo-gym-reward-profiling/SKILL.md index acf7df6bb..3c933dc31 100644 --- a/skills/nemo-gym-reward-profiling/SKILL.md +++ b/skills/nemo-gym-reward-profiling/SKILL.md @@ -2,15 +2,27 @@ name: nemo-gym-reward-profiling license: Apache-2.0 description: >- - Use to help users get started with Nemo Gym reward profiling. 
Covers the basic - ng_run, ng_collect_rollouts, and ng_reward_profile workflow, repeated rollouts, - materialized inputs, rollout JSONL artifacts, task and rollout identity, output - inspection, partial profiling, and rollout_infos. For failed jobs, prefer - nemo-gym-debugging. + Get started with Nemo Gym reward profiling: the ng_run, ng_collect_rollouts, + and ng_reward_profile workflow, repeated rollouts, materialized inputs, rollout + JSONL artifacts, task and rollout identity, output inspection, partial + profiling, and rollout_infos. For failed jobs, prefer nemo-gym-debugging. +metadata: + author: NVIDIA + tags: + - reward-profiling + - rollouts + - evaluation + - metrics --- # Nemo Gym Reward Profiling +## Purpose + +Run and understand Nemo Gym reward profiling: start servers with `ng_run`, +collect rollout artifacts with `ng_collect_rollouts`, and produce profiling +output with `ng_reward_profile`, then inspect the resulting rows and metrics. + ## Invocation Check Use this skill when the user wants to run, understand, or lightly modify Nemo Gym reward profiling. Keep the answer oriented around the normal workflow: @@ -19,7 +31,7 @@ Use this skill when the user wants to run, understand, or lightly modify Nemo Gy If the user is primarily debugging a failed job or stack trace, use the `nemo-gym-debugging` skill first. -## Basic Workflow +## Instructions 1. Identify the environment config paths and input JSONL. 2. Start Gym servers with `ng_run`. @@ -52,3 +64,14 @@ Load references only when the user needs that detail: - Treat `ng_reward_profile` as the reward profiling step; rollout collection does not write reward profile files. - Run strict profiling by default. If rollout collection stopped early, use `++allow_partial_rollouts=True` to profile completed rollouts and drop original input rows with no completed rollout. - Trust the target checkout's CLI help and `nemo_gym/reward_profile.py` over memory if flags differ. 
+ +## Examples + +Profiling a single config: run `ng_run` for the environment, collect rollouts +with `+num_repeats` greater than one so per-task averages and variance are +meaningful, then run `ng_reward_profile` on the materialized inputs and rollout +JSONL and compare line counts across the artifacts. + +Recovering from an interrupted collection: rerun `ng_reward_profile` with +`++allow_partial_rollouts=True` to profile completed rollouts and drop original +input rows that have no completed rollout. From 058216606a49fc4534cda2c3d4dc2302af381e5d Mon Sep 17 00:00:00 2001 From: Ananth Subramaniam Date: Fri, 29 May 2026 15:09:59 -0700 Subject: [PATCH 3/5] ci(nemo-gym-reward-profiling): clear low-severity quality advisories - Shorten the description to under 150 characters. - Add Prerequisites, Limitations, and Troubleshooting sections. Signed-off-by: Ananth Subramaniam --- skills/nemo-gym-reward-profiling/SKILL.md | 27 +++++++++++++++++++---- 1 file changed, 23 insertions(+), 4 deletions(-) diff --git a/skills/nemo-gym-reward-profiling/SKILL.md b/skills/nemo-gym-reward-profiling/SKILL.md index 3c933dc31..174cc49e7 100644 --- a/skills/nemo-gym-reward-profiling/SKILL.md +++ b/skills/nemo-gym-reward-profiling/SKILL.md @@ -2,10 +2,8 @@ name: nemo-gym-reward-profiling license: Apache-2.0 description: >- - Get started with Nemo Gym reward profiling: the ng_run, ng_collect_rollouts, - and ng_reward_profile workflow, repeated rollouts, materialized inputs, rollout - JSONL artifacts, task and rollout identity, output inspection, partial - profiling, and rollout_infos. For failed jobs, prefer nemo-gym-debugging. + Get started with Nemo Gym reward profiling: ng_run, ng_collect_rollouts, and + ng_reward_profile. For failed jobs, prefer nemo-gym-debugging. 
metadata: author: NVIDIA tags: @@ -23,6 +21,13 @@ Run and understand Nemo Gym reward profiling: start servers with `ng_run`, collect rollout artifacts with `ng_collect_rollouts`, and produce profiling output with `ng_reward_profile`, then inspect the resulting rows and metrics. +## Prerequisites + +- NeMo Gym installed with the `ng_run`, `ng_collect_rollouts`, and `ng_reward_profile` CLIs. +- An environment config bundle and an input JSONL dataset. +- A reachable model server (an OpenAI-compatible endpoint or a local vLLM model server). +- Enough disk for rollout and materialized-input artifacts. + ## Invocation Check Use this skill when the user wants to run, understand, or lightly modify Nemo Gym reward profiling. Keep the answer oriented around the normal workflow: @@ -75,3 +80,17 @@ JSONL and compare line counts across the artifacts. Recovering from an interrupted collection: rerun `ng_reward_profile` with `++allow_partial_rollouts=True` to profile completed rollouts and drop original input rows that have no completed rollout. + +## Limitations + +- Per-task averages and variance are only meaningful with multiple rollouts per task; single-repeat runs give point estimates. +- This step summarizes existing rollout artifacts; it does not collect rollouts or fix failed runs. +- Reward semantics are defined by the resource server, not by this workflow. 
+ +## Troubleshooting + +| Symptom | Likely cause | Resolution | +|---|---|---| +| No reward profile file produced | Expected it from rollout collection | Reward profiling is a separate step; run it on the materialized inputs and rollout JSONL | +| Profile rows fewer than input tasks | Rollout collection stopped early | Rerun profiling with partial rollouts allowed (see Practical Defaults) | +| CLI flags differ from this guide | Target checkout version differs | Trust the checkout's CLI help and `nemo_gym/reward_profile.py` | From 0583bf68f02cd69a7007b5322648e5de5a9c50c6 Mon Sep 17 00:00:00 2001 From: Ananth Subramaniam Date: Fri, 29 May 2026 21:33:44 -0700 Subject: [PATCH 4/5] ci(nemo-gym-reward-profiling): read policy key from env; strengthen routing - Stop assigning a placeholder API key and passing it on the command line in references/quick-start.md; require POLICY_ENDPOINT_KEY from the environment and resolve it at runtime via the oc.env resolver so the value never appears in process args or shell history (clears the PII/secret and CLI-exposure findings). - Add explicit do-not-activate routing to nemo-gym-debugging and add-benchmark. Signed-off-by: Ananth Subramaniam --- skills/nemo-gym-reward-profiling/SKILL.md | 5 +++++ .../references/quick-start.md | 12 ++++++++++-- 2 files changed, 15 insertions(+), 2 deletions(-) diff --git a/skills/nemo-gym-reward-profiling/SKILL.md b/skills/nemo-gym-reward-profiling/SKILL.md index 174cc49e7..03333d732 100644 --- a/skills/nemo-gym-reward-profiling/SKILL.md +++ b/skills/nemo-gym-reward-profiling/SKILL.md @@ -36,6 +36,11 @@ Use this skill when the user wants to run, understand, or lightly modify Nemo Gy If the user is primarily debugging a failed job or stack trace, use the `nemo-gym-debugging` skill first. +Do not activate this skill for these adjacent tasks: + +- Debugging a failed or crashed run (Ray/vLLM stack traces, empty output). Use `nemo-gym-debugging`. 
+- Adding or scaffolding a new benchmark, evaluation, or training environment. Use `add-benchmark`. + ## Instructions 1. Identify the environment config paths and input JSONL. diff --git a/skills/nemo-gym-reward-profiling/references/quick-start.md b/skills/nemo-gym-reward-profiling/references/quick-start.md index 0e2c80838..66588a38e 100644 --- a/skills/nemo-gym-reward-profiling/references/quick-start.md +++ b/skills/nemo-gym-reward-profiling/references/quick-start.md @@ -4,12 +4,20 @@ Substitute environment-specific config paths, input data, model endpoint, and ou ## Minimal Flow +Provide the policy endpoint key through the environment rather than on the +command line. Export it from your shell session or a secrets manager before +running. It is read at runtime via the `${oc.env:...}` resolver, so the value +never appears in the process arguments (visible via `ps`) or in shell history. + ```bash CONFIG_PATHS="your_model_config_paths,your_env_config_paths" POLICY_MODEL_NAME="your_policy_model_name" POLICY_BASE_URL="your_policy_base_url" -POLICY_ENDPOINT_KEY="your_policy_endpoint_key" + +# Require the key from the environment; this validates presence without echoing +# the value. Set it beforehand, e.g. with `export POLICY_ENDPOINT_KEY=...`. +: "${POLICY_ENDPOINT_KEY:?export POLICY_ENDPOINT_KEY before running}" DATA_JSONL="/path/to/your_input.jsonl" ROLLOUTS_JSONL="/path/to/your_rollouts.jsonl" @@ -22,7 +30,7 @@ NUM_SAMPLES_IN_PARALLEL=8 ng_run "+config_paths=[$CONFIG_PATHS]" \ +policy_model_name="$POLICY_MODEL_NAME" \ +policy_base_url="$POLICY_BASE_URL" \ - +policy_api_key="$POLICY_ENDPOINT_KEY" & + '++policy_api_key=${oc.env:POLICY_ENDPOINT_KEY}' & NG_RUN_PID=$! 
trap 'kill "$NG_RUN_PID" 2>/dev/null || true' EXIT ./scripts/wait_for_servers.sh "$NG_RUN_PID" From cbf3412aab0e6e5ff9deda1030f9d25fb84e5ba4 Mon Sep 17 00:00:00 2001 From: nvskills-svc-account Date: Sat, 30 May 2026 04:52:51 +0000 Subject: [PATCH 5/5] Attach NVSkills validation signatures Signed-off-by: nvskills-svc-account --- skills/nemo-gym-reward-profiling/BENCHMARK.md | 88 +++++++++++++++++++ .../nemo-gym-reward-profiling/skill-card.md | 76 ++++++++++++++++ .../nemo-gym-reward-profiling/skill.oms.sig | 1 + 3 files changed, 165 insertions(+) create mode 100644 skills/nemo-gym-reward-profiling/BENCHMARK.md create mode 100644 skills/nemo-gym-reward-profiling/skill-card.md create mode 100644 skills/nemo-gym-reward-profiling/skill.oms.sig diff --git a/skills/nemo-gym-reward-profiling/BENCHMARK.md b/skills/nemo-gym-reward-profiling/BENCHMARK.md new file mode 100644 index 000000000..80919b8c0 --- /dev/null +++ b/skills/nemo-gym-reward-profiling/BENCHMARK.md @@ -0,0 +1,88 @@ +# Evaluation Report + +Evaluation of the `nemo-gym-reward-profiling` skill before publication through NVSkills-Eval. + +This benchmark summarizes 3-Tier Evaluation from NVSkills-Eval results for the skill. The goal is to document whether the skill is safe, discoverable, effective, and useful for agents before it is published for broader workflow use. + +## Evaluation Summary + +- Skill: `nemo-gym-reward-profiling` +- Evaluation date: 2026-05-30 +- NVSkills-Eval profile: `external` +- Environment: `local` +- Dataset: 5 evaluation tasks +- Attempts per task: 2 +- Pass threshold: 50% +- Overall verdict: PASS + +## Agents Used + +- `claude-code` +- `codex` + +## Metrics Used + +Reported benchmark dimensions: + +- Security: checks whether skill-assisted execution avoids unsafe behavior such as secret leakage, destructive commands, or unauthorized access. +- Correctness: checks whether the agent follows the expected workflow and produces the correct final output. 
+- Discoverability: checks whether the agent loads the skill when relevant and avoids using it when irrelevant. +- Effectiveness: checks whether the agent performs measurably better with the skill than without it. +- Efficiency: checks whether the agent uses fewer tokens and avoids redundant work. + +Underlying evaluation signals used in this run: + +- `security` (Security): checks for unsafe operations, secret leakage, and unauthorized access. +- `skill_execution` (Skill Execution): verifies that the agent loaded the expected skill and workflow. +- `skill_efficiency` (Efficiency): checks routing quality, decoy avoidance, and redundant tool usage. +- `accuracy` (Accuracy): grades final-answer correctness against the reference answer. +- `goal_accuracy` (Goal Accuracy): checks whether the overall user task completed successfully. +- `behavior_check` (Behavior Check): verifies expected behavior steps, including safety expectations. +- `token_efficiency` (Token Efficiency): compares token usage with and without the skill. + +## Test Tasks + +The benchmark dataset contained 5 evaluation tasks: + +- Positive tasks: 3 tasks where the skill was expected to activate. +- Negative tasks: 2 tasks where no skill was expected. +- Unlabeled tasks: 0 tasks where positive/negative intent could not be inferred. + +Task composition is derived from the evaluation dataset when possible. Entries with `expected_skill` set are treated as positive skill-activation cases, while entries with `expected_skill: null` are treated as negative activation cases. + +## Results + +| Dimension | Num | `claude-code` | `codex` | +|---|---:|---:|---:| +| Security | 8 | 100% (+5%) | 100% (+20%) | +| Correctness | 8 | 72% (-13%) | 90% (+2%) | +| Discoverability | 8 | 64% (-13%) | 88% (+5%) | +| Effectiveness | 8 | 64% (-14%) | 80% (-5%) | +| Efficiency | 8 | 57% (-6%) | 75% (+5%) | + +Score values show skill-assisted performance. 
Values in parentheses show uplift versus the no-skill baseline when baseline data is available. + +## Tier 1: Static Validation Summary + +Tier 1 validation passed. NVSkills-Eval ran 9 checks and found 0 total findings. + +Notable observations: + +- SECURITY: No security vulnerabilities detected (secrets, API keys, credentials) +- SCHEMA: Found skill manifest: SKILL.md +- VERSION: No semantic version label present; resource will use commit-hash history (opting back out of an existing label is allowed) +- PII: Scanning 3 files for PII +- LICENSE: no findings reported. + +## Tier 2: Deduplication Summary + +Tier 2 validation passed. NVSkills-Eval ran 2 checks and found 0 total findings. + +Notable observations: + +- Context Deduplication: Collected 3 file(s) +- Inter-Skill Deduplication: Parsed skill 'nemo-gym-reward-profiling': 139 char description + +## Publication Recommendation + +The skill is suitable to proceed toward NVSkills-Eval publication based on this benchmark. Skill owners should keep this file with the skill and refresh it when the evaluation dataset, skill behavior, or target agents materially change. diff --git a/skills/nemo-gym-reward-profiling/skill-card.md b/skills/nemo-gym-reward-profiling/skill-card.md new file mode 100644 index 000000000..2965a4597 --- /dev/null +++ b/skills/nemo-gym-reward-profiling/skill-card.md @@ -0,0 +1,76 @@ +## Description:
+Get started with Nemo Gym reward profiling: ng_run, ng_collect_rollouts, and ng_reward_profile.
+ +This skill is ready for commercial/non-commercial use.
+ +## Owner +NVIDIA
+ +### License/Terms of Use:
+Apache 2.0
+## Use Case:
+Developers and engineers running Nemo Gym reward profiling workflows to evaluate and profile model performance across tasks using ng_run, ng_collect_rollouts, and ng_reward_profile.
+ +### Deployment Geography for Use:
+Global
+ +## Known Risks and Mitigations:
+Risk: Review before execution as proposals could introduce incorrect or misleading guidance into skills.
+Mitigation: Review and scan skill before deployment.
+ +## Reference(s):
+- [Quick Start](references/quick-start.md)
+- [Output Format](references/output-format.md)
+ + +## Skill Output:
+**Output Type(s):** [Shell commands, Analysis]
+**Output Format:** [Markdown with inline bash code blocks]
+**Output Parameters:** [1D]
+**Other Properties Related to Output:** [None]
+ +## Evaluation Agents Used:
+- `claude-code`
+- `codex`
+ + + +## Evaluation Tasks:
+Evaluated against 5 evaluation tasks (3 positive skill-activation tasks, 2 negative tasks) with 2 attempts per task.
+ +## Evaluation Metrics Used:
+Reported benchmark dimensions:
+- Security: Checks whether skill-assisted execution avoids unsafe behavior such as secret leakage, destructive commands, or unauthorized access.
+- Correctness: Checks whether the agent follows the expected workflow and produces the correct final output.
+- Discoverability: Checks whether the agent loads the skill when relevant and avoids using it when irrelevant.
+- Effectiveness: Checks whether the agent performs measurably better with the skill than without it.
+- Efficiency: Checks whether the agent uses fewer tokens and avoids redundant work.
+ +Underlying evaluation signals used in this run:
+- `security`: Checks for unsafe operations, secret leakage, and unauthorized access.
+- `skill_execution`: Verifies that the agent loaded the expected skill and workflow.
+- `skill_efficiency`: Checks routing quality, decoy avoidance, and redundant tool usage.
+- `accuracy`: Grades final-answer correctness against the reference answer.
+- `goal_accuracy`: Checks whether the overall user task completed successfully.
+- `behavior_check`: Verifies expected behavior steps, including safety expectations.
+- `token_efficiency`: Compares token usage with and without the skill.
+ + + +## Evaluation Results:
+| Dimension | Num | `claude-code` | `codex` |
+|---|---:|---:|---:|
+| Security | 8 | 100% (+5%) | 100% (+20%) |
+| Correctness | 8 | 72% (-13%) | 90% (+2%) |
+| Discoverability | 8 | 64% (-13%) | 88% (+5%) |
+| Effectiveness | 8 | 64% (-14%) | 80% (-5%) |
+| Efficiency | 8 | 57% (-6%) | 75% (+5%) |
+
+## Skill Version(s):
+0583bf68 (source: git SHA, committed 2026-05-29)
+ +## Ethical Considerations:
+NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal team to ensure this skill meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
+ +(For Release on NVIDIA Platforms Only)
+Please report quality, risk, security vulnerabilities or NVIDIA AI Concerns [here](https://app.intigriti.com/programs/nvidia/nvidiavdp/detail).
diff --git a/skills/nemo-gym-reward-profiling/skill.oms.sig b/skills/nemo-gym-reward-profiling/skill.oms.sig new file mode 100644 index 000000000..314bd5a5d --- /dev/null +++ b/skills/nemo-gym-reward-profiling/skill.oms.sig @@ -0,0 +1 @@ +{"mediaType":"application/vnd.dev.sigstore.bundle.v0.3+json","verificationMaterial":{"x509CertificateChain":{"certificates":[{"rawBytes":"MIICgzCCAgmgAwIBAgIUKIyS7SxNteQIiWzK1dWj85E6520wCgYIKoZIzj0EAwMwVTELMAkGA1UEBhMCVVMxGzAZBgNVBAoMEk5WSURJQSBDb3Jwb3JhdGlvbjEpMCcGA1UEAwwgTlZJRElBIEFnZW50IENhcGFiaWxpdGllcyBJQ0EgMDEwHhcNMjYwNDAxMDAwMDAwWhcNMjgwNDIyMTUzMzA5WjBUMQswCQYDVQQGEwJVUzEbMBkGA1UECgwSTlZJRElBIENvcnBvcmF0aW9uMSgwJgYDVQQDDB9OVklESUEgQWdlbnQgU2tpbGxzIFNpZ25pbmcgMDAxMHYwEAYHKoZIzj0CAQYFK4EEACIDYgAEYoRM9bQl/dGlwSRNi6bTpIJUXH8Nv9GciP6LSflJYYMLCc296kpyuTSsk5ddbAWiDcFX3C/ydX3jwc+qCLYP6uHy9XphyLjOQ27Yb2J6rBLVtRBS1mgGco/Gr7fL6ODco4GaMIGXMB0GA1UdDgQWBBRQ/5ZW3nJ6lmo9SVk7I15o7UGmpTAfBgNVHSMEGDAWgBRPGpILxMBBleJSsBGjrMKsby1CgjAMBgNVHRMBAf8EAjAAMA4GA1UdDwEB/wQEAwIHgDA3BggrBgEFBQcBAQQrMCkwJwYIKwYBBQUHMAGGG2h0dHA6Ly9vY3NwLm5kaXMubnZpZGlhLmNvbTAKBggqhkjOPQQDAwNoADBlAjAUygu/GiOCIXrgGr4SmLgeEVDcEitfFUv7ALbvLVGVyMysB3mxmO/uInZfXzWcJZsCMQDxuoxj4ZmO30jhkPIcCxGFCOvnUsnfU3TfGcouYm4M6iRpbKvtVnHPiy4bi6pcKf0="},{"rawBytes":"MIICiDCCAg6gAwIBAgIUZsIuSv9NkpJCNqtYEfCouVv5BzowCgYIKoZIzj0EAwMwUTELMAkGA1UEBhMCVVMxGzAZBgNVBAoMEk5WSURJQSBDb3Jwb3JhdGlvbjElMCMGA1UEAwwcTlZJRElBIEFnZW50IENhcGFiaWxpdGllcyBDQTAgFw0yNjA0MDEwMDAwMDBaGA85OTk5MTIzMTIzNTk1OVowVTELMAkGA1UEBhMCVVMxGzAZBgNVBAoMEk5WSURJQSBDb3Jwb3JhdGlvbjEpMCcGA1UEAwwgTlZJRElBIEFnZW50IENhcGFiaWxpdGllcyBJQ0EgMDEwdjAQBgcqhkjOPQIBBgUrgQQAIgNiAASI72cR3ctKGg4VWnB3bNja6g1Z2PnOmFEopkPof+QeIcPk9rT+g9MjJnq51EQXL93a7C2GJ9J985G4o2V85VD7wJ1RaXhluHW2rf3y8bQGeAYaKMr5s/hUgn+M3/9WlWejgaAwgZ0wHQYDVR0OBBYEFE8akgvEwEGV4lKwEaOswqxvLUKCMB8GA1UdIwQYMBaAFItnoAjjfuCEUvzyvWyI2vOGvwPjMBIGA1UdEwEB/wQIMAYBAf8CAQAwDgYDVR0PAQH/BAQDAgEGMDcGCCsGAQUFBwEBBCswKTAnBggrBgEFBQcwAYYbaHR0cDovL29jc3AubmRpcy5udmlkaWEuY29tMAoGCCqGSM49BAMDA2gAMGUCMQCeIMM
fAbyzPDacw2MxG+Yt1cikrJX/DVxiGfXuHmkkXn6VgSzE79+lkqDErpVO2gYCMCNEColOyvUvkzZGUEI1hQ3PfMgi3FIo9tHoBKMw4/wGBLFpu/0ubtmbBXM6/UMOEw=="},{"rawBytes":"MIICRTCCAcygAwIBAgIUeJdY3rV86EdvFmG7L8LJBsyQFYkwCgYIKoZIzj0EAwMwUTELMAkGA1UEBhMCVVMxGzAZBgNVBAoMEk5WSURJQSBDb3Jwb3JhdGlvbjElMCMGA1UEAwwcTlZJRElBIEFnZW50IENhcGFiaWxpdGllcyBDQTAgFw0yNjA0MDEwMDAwMDBaGA85OTk5MTIzMTIzNTk1OVowUTELMAkGA1UEBhMCVVMxGzAZBgNVBAoMEk5WSURJQSBDb3Jwb3JhdGlvbjElMCMGA1UEAwwcTlZJRElBIEFnZW50IENhcGFiaWxpdGllcyBDQTB2MBAGByqGSM49AgEGBSuBBAAiA2IABAYpiXCDjJ9NT2eSDhyHJVSw1Tbze18cGG2F/578oWvHxg23eQAhNRYdq88i1iOshZSO6C29doKui5Xpmo/7Ctw9Sx4PP2RzOmIuOLCuTdNtKcTRwi4GEsd5BAFvWj42M6NjMGEwHQYDVR0OBBYEFItnoAjjfuCEUvzyvWyI2vOGvwPjMB8GA1UdIwQYMBaAFItnoAjjfuCEUvzyvWyI2vOGvwPjMA8GA1UdEwEB/wQFMAMBAf8wDgYDVR0PAQH/BAQDAgEGMAoGCCqGSM49BAMDA2cAMGQCMCwtAjWLaNwgGWNCgdyNoTyvNhqWRECRJV2r3+7w8g0PL6NHLOsbkgE09BH95h8XlgIwTaQmbbUh2ChAJ5TA1wRiVDnCcvbzHlZl2jM2FcwQQZlk19LOAbyGMRixbu2Ww/rj"}]},"tlogEntries":[]},"dsseEnvelope":{"payload":"ewogICJfdHlwZSI6ICJodHRwczovL2luLXRvdG8uaW8vU3RhdGVtZW50L3YxIiwKICAic3ViamVjdCI6IFsKICAgIHsKICAgICAgIm5hbWUiOiAibmVtby1neW0tcmV3YXJkLXByb2ZpbGluZyIsCiAgICAgICJkaWdlc3QiOiB7CiAgICAgICAgInNoYTI1NiI6ICJlNzk5ODhlYmUzYzgyYzZkMmNiNjYxMjVmYmQ2NzliNDYyMzYzZGE3ZDIwMDIwZWE3MGFmZGVjNjliZWU4YjNlIgogICAgICB9CiAgICB9CiAgXSwKICAicHJlZGljYXRlVHlwZSI6ICJodHRwczovL21vZGVsX3NpZ25pbmcvc2lnbmF0dXJlL3YxLjAiLAogICJwcmVkaWNhdGUiOiB7CiAgICAic2VyaWFsaXphdGlvbiI6IHsKICAgICAgImFsbG93X3N5bWxpbmtzIjogZmFsc2UsCiAgICAgICJtZXRob2QiOiAiZmlsZXMiLAogICAgICAiaGFzaF90eXBlIjogInNoYTI1NiIsCiAgICAgICJpZ25vcmVfcGF0aHMiOiBbCiAgICAgICAgIi5naXRodWIiLAogICAgICAgICIuZ2l0aWdub3JlIiwKICAgICAgICAiLmdpdGF0dHJpYnV0ZXMiLAogICAgICAgICIuZ2l0IgogICAgICBdCiAgICB9LAogICAgInJlc291cmNlcyI6IFsKICAgICAgewogICAgICAgICJhbGdvcml0aG0iOiAic2hhMjU2IiwKICAgICAgICAiZGlnZXN0IjogImIwYjVjYzVmZjFkZTg3NmJkYjM4YzhmMjBmNmZmOWM0MjAwNmY2YjBlOTNjNDAwM2QyZDcyMzE0NWQyYjFiODgiLAogICAgICAgICJuYW1lIjogIkJFTkNITUFSSy5tZCIKICAgICAgfSwKICAgICAgewogICAgICAgICJhbGdvcml0aG0iOiAic2hhMjU2IiwKICAgICAgI
CAiZGlnZXN0IjogIjUzM2MyY2NhYTZhMzJjMGVmNDdiMTc1YWVkMmUzN2EzMmYxMjQ0ZTk2MWUxOTA5ZmFkZDE5N2M0MGRiYzU2ZDEiLAogICAgICAgICJuYW1lIjogIlNLSUxMLm1kIgogICAgICB9LAogICAgICB7CiAgICAgICAgImFsZ29yaXRobSI6ICJzaGEyNTYiLAogICAgICAgICJkaWdlc3QiOiAiZTMyOWM5ZDMxNTliMDdlZTUwMTFiOWRiYmYzZGUxNTdmMTc0ZTQ2ZGIwOWE1ZGEyNzNjOWJjNjljN2I5MDc5ZSIsCiAgICAgICAgIm5hbWUiOiAiZXZhbHMvZXZhbHMuanNvbiIKICAgICAgfSwKICAgICAgewogICAgICAgICJhbGdvcml0aG0iOiAic2hhMjU2IiwKICAgICAgICAiZGlnZXN0IjogIjcwYjQzYTkzMzI4ZmYxNDM5YzA4MjhhNGM5NGM0OGRlNDE2NzQ3MmFjNjBmZGMwODUyNjkyNWI4NDZmNDg5MTQiLAogICAgICAgICJuYW1lIjogInJlZmVyZW5jZXMvb3V0cHV0LWZvcm1hdC5tZCIKICAgICAgfSwKICAgICAgewogICAgICAgICJhbGdvcml0aG0iOiAic2hhMjU2IiwKICAgICAgICAiZGlnZXN0IjogImNhMTg1MDYxODZlZDQ3OGVjNzlhOWJhMDg4NjE4MmMzYjM5OTgyMWQwZmEwYjA3ZjUwMjIzMzFkOTg3NmQ3MzMiLAogICAgICAgICJuYW1lIjogInJlZmVyZW5jZXMvcXVpY2stc3RhcnQubWQiCiAgICAgIH0sCiAgICAgIHsKICAgICAgICAiYWxnb3JpdGhtIjogInNoYTI1NiIsCiAgICAgICAgImRpZ2VzdCI6ICI5NTUzNzc3MGFjZGU1MmYxNGM1YjlmNWM3Njg5Njk4YTM0OTExZjY0MzMyOWRlY2M0ZTVlZjQzODNjMjY2ZDQ4IiwKICAgICAgICAibmFtZSI6ICJza2lsbC1jYXJkLm1kIgogICAgICB9CiAgICBdCiAgfQp9","payloadType":"application/vnd.in-toto+json","signatures":[{"sig":"MGUCMGdSsnRo4Nmr8fCND14Zk1PJhkiXOAAjyMeX0W2jaWnHg3La0HIp0oVGnBFHnhTogAIxAIE639lS72e7NYSngqAvJ0Wr+lK5s5TgNtLN1NbQ9699i+INt9EavyOSnPlCh7xNqQ==","keyid":""}]}} \ No newline at end of file