1 change: 1 addition & 0 deletions .claude/skills/nemo-gym-debugging
86 changes: 86 additions & 0 deletions skills/nemo-gym-debugging/BENCHMARK.md
@@ -0,0 +1,86 @@
# Evaluation Report

Evaluation of the `nemo-gym-debugging` skill before publication through NVSkills-Eval.

This benchmark summarizes the results of the NVSkills-Eval 3-tier evaluation of the skill. The goal is to document whether the skill is safe, discoverable, effective, and useful for agents before it is published for broader workflow use.

## Evaluation Summary

- Skill: `nemo-gym-debugging`
- Evaluation date: 2026-05-30
- NVSkills-Eval profile: `external`
- Environment: `local`
- Dataset: 6 evaluation tasks
- Attempts per task: 2
- Pass threshold: 50%
- Overall verdict: PASS

## Agents Used

- `claude-code`
- `codex`

## Metrics Used

Reported benchmark dimensions:

- Security: checks whether skill-assisted execution avoids unsafe behavior such as secret leakage, destructive commands, or unauthorized access.
- Correctness: checks whether the agent follows the expected workflow and produces the correct final output.
- Discoverability: checks whether the agent loads the skill when relevant and avoids using it when irrelevant.
- Effectiveness: checks whether the agent performs measurably better with the skill than without it.
- Efficiency: checks whether the agent uses fewer tokens and avoids redundant work.

Underlying evaluation signals used in this run:

- `security` (Security): checks for unsafe operations, secret leakage, and unauthorized access.
- `skill_execution` (Skill Execution): verifies that the agent loaded the expected skill and workflow.
- `skill_efficiency` (Efficiency): checks routing quality, decoy avoidance, and redundant tool usage.
- `accuracy` (Accuracy): grades final-answer correctness against the reference answer.
- `goal_accuracy` (Goal Accuracy): checks whether the overall user task completed successfully.
- `behavior_check` (Behavior Check): verifies expected behavior steps, including safety expectations.
- `token_efficiency` (Token Efficiency): compares token usage with and without the skill.

## Test Tasks

The benchmark dataset contained 6 evaluation tasks:

- Positive tasks: 4 tasks where the skill was expected to activate.
- Negative tasks: 2 tasks where no skill was expected.
- Unlabeled tasks: 0 tasks where positive/negative intent could not be inferred.

Task composition is derived from the evaluation dataset when possible. Entries with `expected_skill` set are treated as positive skill-activation cases, while entries with `expected_skill: null` are treated as negative activation cases.
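
The labeling rule above can be sketched in a few lines of Python. This is an illustration only, not the NVSkills-Eval implementation: the function name `classify_tasks` is hypothetical, and treating an entry with no `expected_skill` key as "unlabeled" is an assumption about how intent fails to be inferred.

```python
import json
from collections import Counter


def classify_tasks(entries):
    """Count evaluation entries by activation intent.

    Assumption: an entry that lacks the `expected_skill` key entirely is
    counted as unlabeled; the report does not spell out that rule.
    """
    counts = Counter()
    for entry in entries:
        if "expected_skill" not in entry:
            counts["unlabeled"] += 1
        elif entry["expected_skill"] is None:
            counts["negative"] += 1  # `expected_skill: null` in the dataset
        else:
            counts["positive"] += 1
    return counts


# Two abbreviated entries in the shape used by evals/evals.json.
sample = json.loads("""
[
  {"id": "nemo-gym-debugging-positive-001", "expected_skill": "nemo-gym-debugging"},
  {"id": "nemo-gym-debugging-negative-001", "expected_skill": null}
]
""")
print(dict(classify_tasks(sample)))
```

Run against the full dataset, a count like this should reproduce the 4 positive / 2 negative / 0 unlabeled split reported above.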

## Results

| Dimension | Num | `claude-code` | `codex` |
|---|---:|---:|---:|
| Security | 8 | 100% (+0%) | 92% (+0%) |
| Correctness | 8 | 87% (-1%) | 86% (+3%) |
| Discoverability | 8 | 98% (+1%) | 76% (-2%) |
| Effectiveness | 8 | 74% (+1%) | 85% (+11%) |
| Efficiency | 8 | 84% (+2%) | 61% (-2%) |

Score values show skill-assisted performance. Values in parentheses show uplift versus the no-skill baseline when baseline data is available.
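
To read a cell such as `85% (+11%)`, a minimal sketch follows. It assumes the uplift is an absolute difference in percentage points (skill-assisted score minus baseline), which the report does not state explicitly; `parse_cell` is a hypothetical helper, not part of NVSkills-Eval.

```python
import re

# Matches cells like "85% (+11%)" or "87% (-1%)".
CELL = re.compile(r"(\d+)%\s*\(([+-]\d+)%\)")


def parse_cell(cell):
    """Return (score, uplift, implied_baseline) for one results cell.

    Assumption: uplift is expressed in percentage points, so the implied
    no-skill baseline is score - uplift.
    """
    match = CELL.fullmatch(cell.strip())
    if match is None:
        raise ValueError(f"unrecognized cell format: {cell!r}")
    score = int(match.group(1))
    uplift = int(match.group(2))
    return score, uplift, score - uplift


print(parse_cell("85% (+11%)"))  # (85, 11, 74)
print(parse_cell("87% (-1%)"))   # (87, -1, 88)
```

Under that assumption, the `codex` Effectiveness score of 85% corresponds to a 74% no-skill baseline.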

## Tier 1: Static Validation Summary

Tier 1 validation passed with observations. NVSkills-Eval ran 9 checks and found 3 total findings.

Top findings:

- MEDIUM QUALITY/quality_correctness: No documented scripts in table format (`skills/nemo-gym-debugging/SKILL.md`)
- MEDIUM QUALITY/quality_correctness: Instructions don't mention 'run_script' (`skills/nemo-gym-debugging/SKILL.md`)
- LOW SCRIPT_LINT/magic_numbers: check_tool_call_jsonl.py contains magic numbers (`skills/nemo-gym-debugging/scripts/check_tool_call_jsonl.py`)

## Tier 2: Deduplication Summary

Tier 2 validation passed. NVSkills-Eval ran 2 checks and found 0 total findings.

Notable observations:

- Context Deduplication: Collected 6 file(s)
- Inter-Skill Deduplication: Parsed skill 'nemo-gym-debugging': 132 char description

## Publication Recommendation

The skill is suitable to proceed toward NVSkills-Eval publication based on this benchmark. Skill owners should keep this file with the skill and refresh it when the evaluation dataset, skill behavior, or target agents materially change.
@@ -1,17 +1,37 @@
---
name: nemo-gym-debugging
license: Apache-2.0
description: >-
Use when debugging a Nemo Gym run or reward profiling job. Covers rollout collection failures,
empty or partial JSONL outputs, stale materialized inputs, verifier/schema errors, Ray or Slurm
issues, vLLM readiness, judge failures, tool/sandbox failures, cache problems, and throughput
bottlenecks.
Debug a Nemo Gym run or reward-profiling job by classifying the failing layer.
Not for adding benchmarks or routine profiling setup.
metadata:
author: NVIDIA <nemo-gym@nvidia.com>
tags:
- debugging
- rollouts
- reward-profiling
- troubleshooting
- observability
---

# Nemo Gym Debugging

## Purpose

Diagnose and resolve failures in a Nemo Gym run or reward-profiling job by
classifying the failing layer (infra, model serving, config, data/schema,
verifier/runtime, cache/resume, or throughput) before changing code or data.

## Prerequisites

- Access to the failing run's Slurm or Ray logs, config bundle, and output directory.
- The same Nemo Gym checkout and config used for the run.
- Read access to materialized inputs, rollout JSONL, and profiling output.
- Reachable vLLM or model-server endpoints for readiness checks.

## Invocation Check

Use this skill when something failed or looks suspicious in a Nemo Gym run. If the task is adding a new env, use the `nemo-gym-env-integration` skill; if it is changing profiling behavior, use the `nemo-gym-reward-profiling` skill.
Use this skill when something failed or looks suspicious in a Nemo Gym run. If the task is adding a new benchmark or environment, use the `add-benchmark` skill; if it is changing profiling behavior, use the `nemo-gym-reward-profiling` skill.

Debug by classification, not by guessing. The first goal is to decide whether the issue is:

@@ -23,7 +43,9 @@ Debug by classification, not by guessing. The first goal is to decide whether th
- cache/resume: stale materialized inputs or partial rollout output
- throughput/resources: concurrency too high, judge bottleneck, tool/sandbox latency

## Debug Order
## Instructions

Work through these checks in order:

1. Check Slurm/Ray job state and logs.
2. Check vLLM readiness and `/models` availability.
@@ -50,6 +72,33 @@ Debug by classification, not by guessing. The first goal is to decide whether th
- Read `references/vllm-tool-call-schema-checks.md` when a tool-call dataset may be rejected by vLLM/Outlines grammar compilation before any meaningful generation happens.
- Read `references/request-boundary-visibility.md` when `/run` 500s hide row identity or nested Gym 500s hide the inner model/verifier/provider error. It covers the existing Gym debug flag, shipped request-boundary markers, empty provider bodies, and vLLM provider-side escalation.

## Examples

Empty reward-profiling output with a populated rollouts file: confirm the
`rollouts.jsonl` row count, then inspect the first real verifier exception rather
than shutdown noise. If the data changed and `resume_from_cache` was enabled,
suspect stale materialized inputs and compare source-data and materialized-input
timestamps before rerunning.
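
A minimal sketch of those two checks, using only the Python standard library. The helper names are hypothetical, the file paths depend on the run's output layout, and comparing modification times is only a first-pass signal of staleness (it is not a content comparison).

```python
from pathlib import Path


def count_jsonl_rows(path):
    """Count non-blank lines in a JSONL file (one JSON object per line)."""
    with Path(path).open(encoding="utf-8") as handle:
        return sum(1 for line in handle if line.strip())


def source_newer_than_materialized(source_path, materialized_path):
    """Return True when the source data was modified after the materialized input.

    Assumption: file modification times reflect when each artifact was last
    written; copies or syncs that reset mtimes will defeat this check.
    """
    source_mtime = Path(source_path).stat().st_mtime
    materialized_mtime = Path(materialized_path).stat().st_mtime
    return source_mtime > materialized_mtime
```

For example, compare `count_jsonl_rows("rollouts.jsonl")` against the number of input tasks, and treat a `True` result from `source_newer_than_materialized(...)` as a reason to rematerialize before rerunning.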

Tool-call rows failing before generation: run the static tool-schema check in
`references/vllm-tool-call-schema-checks.md` before modifying Gym wrappers, since
vLLM and Outlines reject malformed tool schemas during grammar compilation, ahead
of any meaningful generation.
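
To make the idea of a static check concrete, here is a hedged sketch. It is not the shipped `scripts/check_tool_call_jsonl.py` and does not reproduce the checks in the reference file; it assumes OpenAI-style tool definitions (`type: function` with a `function.parameters` JSON Schema object), and `tool_schema_problems` is a hypothetical name.

```python
def tool_schema_problems(tool):
    """Return a list of human-readable problems found in one tool definition.

    Assumption: tools follow the OpenAI-style shape
    {"type": "function", "function": {"name": ..., "parameters": {...}}}.
    """
    if not isinstance(tool, dict):
        return ["tool is not a JSON object"]

    problems = []
    if tool.get("type") != "function":
        problems.append("type must be 'function'")

    function = tool.get("function")
    if not isinstance(function, dict):
        problems.append("missing 'function' object")
        return problems

    name = function.get("name")
    if not isinstance(name, str) or not name:
        problems.append("function.name must be a non-empty string")

    params = function.get("parameters", {})
    if not isinstance(params, dict):
        problems.append("function.parameters must be a JSON object")
        return problems
    if params and params.get("type") != "object":
        problems.append("function.parameters.type must be 'object'")

    properties = params.get("properties", {})
    required = params.get("required", [])
    if not isinstance(properties, dict):
        problems.append("function.parameters.properties must be a JSON object")
    elif isinstance(required, list):
        # Every required argument should be declared under properties.
        for key in required:
            if key not in properties:
                problems.append(f"required argument {key!r} is not declared")
    else:
        problems.append("function.parameters.required must be a list")
    return problems
```

Running a check of this kind over every row's tool list before launching a job localizes schema problems to specific rows, without touching Gym wrappers or model settings.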

## Limitations

- Diagnostic only; it localizes failures but does not add or modify benchmarks, configs, or datasets.
- Assumes the run's logs and artifacts are still available; discarded state cannot be reconstructed.
- Tool and sandbox guidance applies only when the environment actually configures them.

## Troubleshooting

| Symptom | Likely cause | Resolution |
|---|---|---|
| Job exits immediately with no rollouts | Scheduler or container startup failure | Inspect Slurm/Ray job state and the earliest startup logs first |
| Model server never becomes ready | Model load or port binding failure | Verify the `/models` endpoint and the configured port before suspecting Gym |
| Profiling output has fewer rows than tasks | Partial rollouts or strict-mode dropping | Confirm completed-rollout counts; allow partial rollouts only when intended |
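
The model-server row can be checked with a short readiness probe. This is a sketch under assumptions: the server is taken to expose an OpenAI-compatible model listing (a JSON body with a `data` array of objects carrying `id`), the base URL and port are placeholders for whatever the run's config specifies, and `served_models` is a hypothetical helper.

```python
import json
import urllib.request


def parse_model_ids(body):
    """Extract model ids from an OpenAI-compatible model-list response body."""
    payload = json.loads(body)
    if not isinstance(payload, dict):
        return []
    return [
        item["id"]
        for item in payload.get("data", [])
        if isinstance(item, dict) and "id" in item
    ]


def served_models(base_url, timeout=5.0):
    """Return the model ids served at `<base_url>/models`, or [] if not ready.

    Assumption: `base_url` already includes any API prefix the server uses
    (for example a `/v1` suffix); adjust it to match the configured endpoint.
    """
    url = base_url.rstrip("/") + "/models"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return parse_model_ids(response.read())
    except (OSError, ValueError):
        # Connection refused, HTTP errors, timeouts, or a non-JSON body.
        return []
```

An empty list means the endpoint is unreachable or serving no models, which points at model loading or port binding rather than at Gym itself.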

## Communication Pattern

When reporting back, state:
73 changes: 73 additions & 0 deletions skills/nemo-gym-debugging/evals/evals.json
@@ -0,0 +1,73 @@
[
{
"id": "nemo-gym-debugging-positive-001",
"question": "My ng_collect_rollouts job finished but rollouts.jsonl only has 2 rows even though my input has 100 tasks. The reward profile is empty. What's going on?",
"expected_skill": "nemo-gym-debugging",
"ground_truth": "The agent loads the nemo-gym-debugging skill, classifies the symptom as a partial-output + verifier issue, inspects the first real verifier exception (not shutdown noise), checks for stale materialized inputs if resume_from_cache was on, and compares the failing row schema against the resource-server request model.",
"expected_behavior": [
"The agent read nemo-gym-debugging/SKILL.md before acting",
"The agent classified the failing layer (data/schema, verifier, or cache) rather than guessing",
"The agent inspected the first real verifier exception rather than shutdown noise",
"The agent considered stale materialized inputs as a first-class suspect if resume_from_cache was enabled",
"The agent checked rollout output and profiling row counts"
]
},
{
"id": "nemo-gym-debugging-positive-002",
"question": "All my Gym servers are up and vLLM is ready, but the resources server keeps returning 422 errors from the verifier endpoint. Help me figure out why.",
"expected_skill": "nemo-gym-debugging",
"ground_truth": "The agent loads the nemo-gym-debugging skill, recognizes that ready servers + 422 means a data/schema mismatch, compares the request body against the resources server's Pydantic request model, and inspects the first real verifier exception before touching infra or model settings.",
"expected_behavior": [
"The agent read nemo-gym-debugging/SKILL.md before acting",
"The agent classified the failure as data/schema rather than infra or model serving",
"The agent compared the failing row schema against the resources-server request model",
"The agent did not chase infra or vLLM tuning before checking the request body"
]
},
{
"id": "nemo-gym-debugging-positive-003",
"question": "My tool-call rollouts hang almost immediately with vLLM grammar/schema errors before any generation happens. What should I check?",
"expected_skill": "nemo-gym-debugging",
"ground_truth": "The agent loads the nemo-gym-debugging skill, reads references/vllm-tool-call-schema-checks.md, and runs a static tool-schema check against the dataset before changing Gym wrappers or model settings.",
"expected_behavior": [
"The agent read nemo-gym-debugging/SKILL.md before acting",
"The agent loaded references/vllm-tool-call-schema-checks.md for the tool-call schema check",
"The agent ran a static tool-schema check before changing Gym wrappers",
"The agent did not change model settings as the first move"
]
},
{
"id": "nemo-gym-debugging-positive-004",
"question": "I'm only seeing nested 'inner server' 500s in my logs and I can't tell which row or provider call actually failed. How do I get the real error?",
"expected_skill": "nemo-gym-debugging",
"ground_truth": "The agent loads the nemo-gym-debugging skill, enables ++global_aiohttp_client_request_debug=True to surface request-boundary visibility, and reads references/request-boundary-visibility.md before changing code.",
"expected_behavior": [
"The agent read nemo-gym-debugging/SKILL.md before acting",
"The agent enabled ++global_aiohttp_client_request_debug=True instead of editing code first",
"The agent loaded references/request-boundary-visibility.md",
"The agent did not start changing wrappers or provider code as a first move"
]
},
{
"id": "nemo-gym-debugging-negative-001",
"question": "Add a new benchmark called wmt-comet to NeMo-Gym, including data prep and the resources server.",
"expected_skill": null,
"should_trigger": false,
"ground_truth": "The agent should not activate the nemo-gym-debugging skill for a benchmark integration task. It should use the add-benchmark skill instead.",
"expected_behavior": [
"The agent did not read or activate nemo-gym-debugging/SKILL.md",
"The agent recognized this as a benchmark integration task"
]
},
{
"id": "nemo-gym-debugging-negative-002",
"question": "Document the new ng_reward_profile --pass_threshold flag in the Fern docs under the reward profiling section.",
"expected_skill": null,
"should_trigger": false,
"ground_truth": "The agent should not activate the nemo-gym-debugging skill for a documentation task. It should use the nemo-gym-docs skill instead.",
"expected_behavior": [
"The agent did not read or activate nemo-gym-debugging/SKILL.md",
"The agent recognized this as a docs edit"
]
}
]
78 changes: 78 additions & 0 deletions skills/nemo-gym-debugging/skill-card.md
@@ -0,0 +1,78 @@
## Description: <br>
Debug a Nemo Gym run or reward-profiling job by classifying the failing layer. <br>

This skill is ready for commercial/non-commercial use. <br>

## Owner
NVIDIA <br>

### License/Terms of Use: <br>
Apache 2.0 <br>
## Use Case: <br>
Developers and engineers debugging failures in Nemo Gym evaluation or reward-profiling runs by classifying the failing layer (infra, model serving, config, data/schema, verifier/runtime, cache/resume, or throughput) before changing code or data. <br>

### Deployment Geography for Use: <br>
Global <br>

## Known Risks and Mitigations: <br>
Risk: The skill's proposals could introduce incorrect or misleading guidance, so review them before execution. <br>
Mitigation: Review and scan skill before deployment. <br>

## Reference(s): <br>
- [Error Profiles](references/error-profiles.md) <br>
- [Diagnostic Snippets](references/diagnostic-snippets.md) <br>
- [Request Boundary Visibility](references/request-boundary-visibility.md) <br>
- [vLLM Tool-Call Schema Checks](references/vllm-tool-call-schema-checks.md) <br>


## Skill Output: <br>
**Output Type(s):** [Analysis, Shell commands] <br>
**Output Format:** [Markdown with inline bash code blocks] <br>
**Output Parameters:** [1D] <br>
**Other Properties Related to Output:** [None] <br>

## Evaluation Agents Used: <br>
- Claude Code (`claude-code`) <br>
- Codex (`codex`) <br>



## Evaluation Tasks: <br>
Evaluated against 6 evaluation tasks (4 positive skill-activation cases, 2 negative cases) with 2 attempts per task. <br>

## Evaluation Metrics Used: <br>
Reported benchmark dimensions: <br>
- Security: Checks whether skill-assisted execution avoids unsafe behavior such as secret leakage, destructive commands, or unauthorized access. <br>
- Correctness: Checks whether the agent follows the expected workflow and produces the correct final output. <br>
- Discoverability: Checks whether the agent loads the skill when relevant and avoids using it when irrelevant. <br>
- Effectiveness: Checks whether the agent performs measurably better with the skill than without it. <br>
- Efficiency: Checks whether the agent uses fewer tokens and avoids redundant work. <br>

Underlying evaluation signals used in this run: <br>
- `security`: Checks for unsafe operations, secret leakage, and unauthorized access. <br>
- `skill_execution`: Verifies that the agent loaded the expected skill and workflow. <br>
- `skill_efficiency`: Checks routing quality, decoy avoidance, and redundant tool usage. <br>
- `accuracy`: Grades final-answer correctness against the reference answer. <br>
- `goal_accuracy`: Checks whether the overall user task completed successfully. <br>
- `behavior_check`: Verifies expected behavior steps, including safety expectations. <br>
- `token_efficiency`: Compares token usage with and without the skill. <br>



## Evaluation Results: <br>
| Dimension | Num | `claude-code` | `codex` |
|---|---:|---:|---:|
| Security | 8 | 100% (+0%) | 92% (+0%) |
| Correctness | 8 | 87% (-1%) | 86% (+3%) |
| Discoverability | 8 | 98% (+1%) | 76% (-2%) |
| Effectiveness | 8 | 74% (+1%) | 85% (+11%) |
| Efficiency | 8 | 84% (+2%) | 61% (-2%) |

## Skill Version(s): <br>
6be42228 (source: git SHA, committed 2026-05-29) <br>

## Ethical Considerations: <br>
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal team to ensure this skill meets requirements for the relevant industry and use case and addresses unforeseen product misuse. <br>

(For Release on NVIDIA Platforms Only) <br>
Please report quality, risk, security vulnerabilities or NVIDIA AI Concerns [here](https://app.intigriti.com/programs/nvidia/nvidiavdp/detail). <br>