Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions .claude/skills/nemo-gym-pivot-datasets
Original file line number Diff line number Diff line change
@@ -1,14 +1,34 @@
---
name: nemo-gym-pivot-datasets
license: Apache-2.0
description: >-
Use when creating, validating, or documenting Nemo Gym pivot datasets from rollout,
trajectory, chat-completion, Responses API, or tool-call artifacts. Covers Gym
Responses-style row conversion, pivot selection, single-step tool-use configs,
agent_ref alignment, verifier knobs, expected-action row contracts, and train/eval usage.
Create and validate Nemo Gym single-step pivot datasets from trajectory or
rollout artifacts. Not for reward profiling or debugging runs.
metadata:
author: NVIDIA <nemo-gym@nvidia.com>
tags:
- pivot-dataset
- dataset-conversion
- reinforcement-learning
- single-step
- trajectory
---

# Nemo Gym Pivot Datasets

## Purpose

Convert agent trajectories and rollout artifacts into single-step Nemo Gym pivot
datasets for local RL or evaluation, and validate that a pivot JSONL and its Gym
config can be used together.

## Prerequisites

- Source artifacts to convert: rollout, trajectory, chat-completion, Responses API, or tool-call data.
- Python to run `scripts/validate_pivot_dataset.py` and the reference converters.
- The target Gym config (agent and resource-server names) the pivot rows must align with.
- Optionally a Gym checkout (`--gym-repo`) to validate against resource-server Pydantic models.

## Paper Reference

This skill operationalizes [PivotRL](https://arxiv.org/html/2603.21383v1): create local
Expand All @@ -21,11 +41,17 @@ Use this skill when the task is to turn existing agent trajectories or rollout a
Nemo Gym pivot dataset, or to validate whether a pivot JSONL/config pair can be used for
single-step local RL or evaluation.

Do not activate this skill for these adjacent tasks:

- Running or profiling rewards on an existing dataset. Use `nemo-gym-reward-profiling`.
- Debugging a failed or crashed run (Ray/vLLM stack traces, empty output). Use `nemo-gym-debugging`.
- Adding or scaffolding a new benchmark or training environment. Use `add-benchmark`.

Before writing a converter, inspect representative source rows and the target resource server.
Do not assume the source field names are the contract. Convert by reconstructing the semantic
pieces needed by Gym's Responses-style row format.

## Core Workflow
## Instructions

1. Inspect the source data shape and count the candidate assistant decision points.
2. Identify the semantic fields needed for each pivot:
Expand Down Expand Up @@ -117,3 +143,31 @@ resource-server request model.

The validator accepts both supported expected-action types by default (`function_call` and `message`)
and prints an end summary split between tool-call and message pivots.

## Examples

Converting chat-completion logs: inspect representative rows, identify each
assistant decision point, and reconstruct `responses_create_params`,
`expected_action` (a single `function_call` or `message`), and `agent_ref` for
each accepted pivot. Route turns with more than one tool call into a skipped-row
audit. Borrow from
`scripts/reference/chat_messages_to_pivot_dataset_reference.py` rather than
running it unchanged.

Validating a finished dataset: run `scripts/validate_pivot_dataset.py` with the
expected `--agent-ref`, and add `--gym-repo` when the Gym checkout is available
to also validate against the resource-server Pydantic models.

## Limitations

- `expected_action` is singular; source turns with more than one tool call are filtered out, not split.
- Reference converters under `scripts/reference/` are dataset-specific examples, not commands to run unchanged.
- A valid JSONL file can still be unusable if the agent and resource-server names do not line up.

## Troubleshooting

| Symptom | Likely cause | Resolution |
|---|---|---|
| Validator rejects rows | `agent_ref.name` does not match the config's agent block | Align `agent_ref.name` with the agent used by the generated config |
| Tool-argument matches fail | String-argument threshold too strict | Tune `word_count_similarity_threshold` for the single-step tool-use verifier |
| Structured-decoding path taken unexpectedly | `tool_choice: "required"` routes some engines there | Use `tool_choice: "auto"` for these rows |
62 changes: 62 additions & 0 deletions skills/nemo-gym-pivot-datasets/evals/evals.json
Original file line number Diff line number Diff line change
@@ -0,0 +1,62 @@
[
{
"id": "nemo-gym-pivot-datasets-positive-001",
"question": "Convert these tool-call trajectories in a JSONL file into a NeMo Gym pivot dataset I can use for single-step training.",
"expected_skill": "nemo-gym-pivot-datasets",
"ground_truth": "The agent loads the nemo-gym-pivot-datasets skill, inspects representative source rows and the target resource server before writing a converter, identifies the semantic fields needed per pivot, converts each accepted decision point to one row with responses_create_params, expected_action, and agent_ref, and runs the bundled validator against the output.",
"expected_behavior": [
"The agent read nemo-gym-pivot-datasets/SKILL.md before acting",
"The agent inspected representative source rows before writing a converter",
"The agent emitted one pivot row per accepted decision point with responses_create_params, expected_action, and agent_ref",
"The agent filtered out source turns with more than one tool call rather than emitting multi-action rows",
"The agent ran scripts/validate_pivot_dataset.py on the output"
]
},
{
"id": "nemo-gym-pivot-datasets-positive-002",
"question": "Validate this pivot.jsonl file against the resources-server request models and the agent_ref I expect — is it usable for training?",
"expected_skill": "nemo-gym-pivot-datasets",
"ground_truth": "The agent loads the nemo-gym-pivot-datasets skill and runs the bundled validator with both --agent-ref and --gym-repo, checks that agent_ref matches the config's agent block, and confirms the row contract for single_step_tool_use_with_argument_comparison.",
"expected_behavior": [
"The agent read nemo-gym-pivot-datasets/SKILL.md before acting",
"The agent ran scripts/validate_pivot_dataset.py with --agent-ref",
"The agent passed --gym-repo to validate against the resources-server Pydantic models when the Gym repo is available",
"The agent confirmed agent_ref.name matches the agent block used by the config"
]
},
{
"id": "nemo-gym-pivot-datasets-positive-003",
"question": "I have a batch of chat-completion rollouts from a different framework. Build me a NeMo Gym pivot dataset and the matching Gym YAML config.",
"expected_skill": "nemo-gym-pivot-datasets",
"ground_truth": "The agent loads the nemo-gym-pivot-datasets skill, normalizes the chat-completion rows into Gym Responses-style pivot rows (one expected_action per row), uses tool_choice: auto in the generated config, points the train dataset entry at the pivot JSONL, and aligns agent_ref with the agent block before validating.",
"expected_behavior": [
"The agent read nemo-gym-pivot-datasets/SKILL.md before acting",
"The agent normalized the chat-completion rows into Responses-style pivot rows",
"The agent set tool_choice: auto in the generated config rather than required",
"The agent pointed the config's train dataset entry directly at the pivot JSONL",
"The agent ensured row-level agent_ref matches the config's agent block"
]
},
{
"id": "nemo-gym-pivot-datasets-negative-001",
"question": "Add the cuOpt vehicle-routing benchmark to NeMo-Gym, including data prep and the resources server.",
"expected_skill": null,
"should_trigger": false,
"ground_truth": "The agent should not activate the nemo-gym-pivot-datasets skill for a new-benchmark integration task. It should use the add-benchmark skill instead.",
"expected_behavior": [
"The agent did not read or activate nemo-gym-pivot-datasets/SKILL.md",
"The agent recognized this as a benchmark integration task"
]
},
{
"id": "nemo-gym-pivot-datasets-negative-002",
"question": "My ng_reward_profile job is producing empty profile rows for half the tasks. Help me figure out what's wrong.",
"expected_skill": null,
"should_trigger": false,
"ground_truth": "The agent should not activate the nemo-gym-pivot-datasets skill for a debugging task on an existing reward profiling run. It should use the nemo-gym-debugging skill instead.",
"expected_behavior": [
"The agent did not read or activate nemo-gym-pivot-datasets/SKILL.md",
"The agent recognized this as a debugging task"
]
}
]
Loading