diff --git a/.claude/skills/add-benchmark/SKILL.md b/.claude/skills/add-benchmark/SKILL.md deleted file mode 100644 index 385666e38..000000000 --- a/.claude/skills/add-benchmark/SKILL.md +++ /dev/null @@ -1,252 +0,0 @@ ---- -name: add-benchmark -description: > - Guide for adding a new benchmark or training environment to NeMo-Gym. - Use when the user asks to add, create, or integrate a benchmark, evaluation, - training environment, or resources server into NeMo-Gym. Also use when wrapping - an existing 3rd-party benchmark library. Covers the full workflow: data preparation, - resources server implementation, agent wiring, YAML config, testing, and reward - profiling (baselining). Triggered by: "add benchmark", "new resources server", - "integrate benchmark", "wrap benchmark", "add training environment", "add eval". ---- - -# Add Benchmark to NeMo-Gym - -## Determine Integration Type - -Before starting, determine which type of benchmark you're adding: - -**Native benchmark** — verification logic implemented directly in a Gym resources server: -- Resources server implements `verify()` with reward logic -- Agent server orchestrates model calls (use `simple_agent` for single-turn, or custom agent for multi-turn) -- Example: `code_gen`, `instruction_following`, `math_with_judge` - -**External benchmark** — wrapping a 3rd-party library that has its own orchestration: -- Integrate at the agent server level (not resources server) -- Agent's `/run` endpoint wraps the external library -- Pre-process from Gym schema to library input, post-process back to `BaseVerifyResponse` -- Reproduce publicly reported numbers with the original repo first, then reproduce again after Gym integration -- Add the dependency in `requirements.txt` - -## Workflow - -### Step 1: Scaffold the server - -Run `ng_init_resources_server` to generate the directory structure: - -```bash -ng_init_resources_server +entrypoint=resources_servers/my_benchmark -``` - -This creates: -``` 
-resources_servers/my_benchmark/ -├── app.py # Server template -├── configs/my_benchmark.yaml -├── data/.gitignore -├── tests/test_app.py -├── requirements.txt -└── README.md -``` - -For external benchmarks, create the agent server manually under `responses_api_agents/my_agent/` with the same structure. - -### Step 2: Prepare data - -Convert your source dataset to Gym JSONL format. Each line must have `responses_create_params.input` (OpenAI message format). Task-specific verification data goes in `verifier_metadata`. - -```json -{ - "responses_create_params": { - "input": [ - {"role": "system", "content": "System prompt"}, - {"role": "user", "content": "Problem statement"} - ] - }, - "verifier_metadata": { - "test_cases": [{"input": "...", "expected_output": "..."}], - "task_id": "unique_id" - } -} -``` - -**Data conversion**: Write conversion scripts in the **source repo** (e.g. your dataset repository), not in NeMo-Gym. Prompt files also belong in the source repo. Exception: when there is no external source repo. See `references/patterns.md` § "Data Conversion Script Pattern". - -**`example.jsonl`**: Generate 5 entries for smoke testing. This file is committed directly to git in `data/example.jsonl`. - -**`train`/`validation` datasets**: Upload to the GitLab dataset registry — these must NOT be committed to git. - -```bash -ng_upload_dataset_to_gitlab \ - +dataset_name=my_benchmark \ - +version=0.0.1 \ - +input_jsonl_fpath=resources_servers/my_benchmark/data/my_dataset.jsonl -``` - -Requires MLflow credentials in `env.yaml` (or passed via CLI): -```yaml -mlflow_tracking_uri: -mlflow_tracking_token: -``` - -**`data/.gitignore`**: The scaffold generates default patterns (`*train.jsonl`, `*validation.jsonl`, etc.). If your filename doesn't match (e.g. `my_eval.jsonl`), add a custom pattern (e.g. `*eval.jsonl`). If data was previously tracked, run `git rm --cached `. 
- -**Validate** your data: -```bash -# Validate example data (for PR submission) -ng_prepare_data "+config_paths=[resources_servers/my_benchmark/configs/my_benchmark.yaml]" \ - +output_dirpath=/tmp/prepare +mode=example_validation - -# Download and prepare train/validation from GitLab -ng_prepare_data "+config_paths=[resources_servers/my_benchmark/configs/my_benchmark.yaml]" \ - +output_dirpath=data/my_benchmark +mode=train_preparation +should_download=true +data_source=gitlab -``` - -### Step 3: Implement verify() - -Edit `app.py`. The `verify()` method receives model output + `verifier_metadata`, returns reward. - -For code execution benchmarks, see `references/patterns.md` § "Subprocess Execution with Ray" and "Resources Server Pattern". - -Critical rules: -- Return `reward` as 0.0 or 1.0 (binary) -- Handle empty/missing model output gracefully — return 0.0, don't crash -- Must handle 4k-65k concurrent requests without crashing -- Use `asyncio.Semaphore` for subprocess concurrency control -- For Ray remote tasks: `result = await future` (Ray futures are directly awaitable). Never call `ray.get()` in async context. -- Decode subprocess output with `errors="replace"` -- Strip ``/`` blocks before parsing model output (thinking models emit these) -- Tests should `pytest.mark.skipif` when external tools aren't installed -- If the benchmark auto-installs its tool (see Step 3b), add a `pytest_configure` hook in `conftest.py` to run the install before test collection — `skipif` evaluates at import time, before fixtures run - -### Step 3b: Auto-install external tools (if applicable) - -If the benchmark requires an external tool (compiler, runtime, etc.), auto-install it on server startup so users don't need manual setup. See `references/patterns.md` § "External Tool Auto-Install Pattern". 
- -Key points: -- Create `setup_.py` with `ensure_()` — checks PATH, forks on `sys.platform` (brew on macOS, build from source on Linux) -- Call it in `model_post_init()` before semaphore init -- Build scripts should be idempotent and install into a local gitignored prefix -- Add a `pytest_configure` hook in `tests/conftest.py` that calls `ensure_()` before collection - -### Step 4: Wire YAML config - -Edit `configs/my_benchmark.yaml`. Define the resources server instance and agent pairing(s). See `references/patterns.md` § "YAML Config Pattern". - -Key points: -- `verified: false` is auto-added by pre-commit hook (set to `true` after baselining) -- `license` is required for `train` and `validation` datasets -- Agent references resources server and model server by instance name - -For multi-turn benchmarks, either use `proof_refinement_agent` or create a custom agent. See `references/patterns.md` § "Agent Patterns". - -For `train`/`validation` datasets, add `gitlab_identifier` alongside `jsonl_fpath`: -```yaml -datasets: -- name: my_dataset - type: train - jsonl_fpath: resources_servers/my_benchmark/data/my_dataset.jsonl - gitlab_identifier: - dataset_name: my_benchmark - version: 0.0.1 - artifact_fpath: my_dataset.jsonl - license: MIT -- name: example - type: example - jsonl_fpath: resources_servers/my_benchmark/data/example.jsonl -``` - -Both fields must coexist: `jsonl_fpath` is the local download destination, `gitlab_identifier` tells the system where to fetch from. `example` datasets don't need `gitlab_identifier` — they're committed to git directly. - -### Step 5: Test - -```bash -# Run server tests (creates isolated .venv, slow on first run) -ng_test +entrypoint=resources_servers/my_benchmark - -# Run core library tests to check nothing broke -pytest tests/unit_tests/ -x -``` - -Test coverage must be >= 95%. 
Write tests for: verify pass, verify fail (wrong output), verify fail (no code extracted), verify fail (compilation error if applicable), verify timeout. - -### Step 6: Smoke test end-to-end - -```bash -# Start servers -ng_run "+config_paths=[resources_servers/my_benchmark/configs/my_benchmark.yaml,responses_api_models/openai_model/configs/openai_model.yaml]" - -# Quick test with example data -ng_collect_rollouts +agent_name=my_benchmark_simple_agent \ - +input_jsonl_fpath=resources_servers/my_benchmark/data/example.jsonl \ - +output_jsonl_fpath=results/example_rollouts.jsonl \ - +num_repeats=1 \ - "+responses_create_params={max_output_tokens: 16384, temperature: 1.0}" - -# Inspect results -``` - -### Step 7: Baseline (reward profiling) - -Run against multiple models to validate correctness. Recommended suite: -- Your policy model of interest -- At least one open-source instruct model (e.g. Qwen 3 30B A3B Instruct) -- At least one open-source thinking model (e.g. Qwen 3 30B A3B Thinking) -- At least one closed-source model (e.g. GPT-5 Nano or GPT-5) - -```bash -# Collect rollouts -ng_collect_rollouts +agent_name=my_benchmark_simple_agent \ - +input_jsonl_fpath=resources_servers/my_benchmark/data/my_dataset.jsonl \ - +output_jsonl_fpath=results/rollouts.jsonl \ - +num_repeats=5 \ - "+responses_create_params={max_output_tokens: 16384, temperature: 1.0}" - -# Compute per-task pass rates -ng_reward_profile +input_jsonl_fpath=resources_servers/my_benchmark/data/my_dataset.jsonl \ - +rollouts_jsonl_fpath=results/rollouts.jsonl \ - +output_jsonl_fpath=results/profiled.jsonl \ - +pass_threshold=1.0 - -# Aggregate metrics (pass@1 = avg_reward, pass@k from max_reward) -python scripts/print_aggregate_results.py +jsonl_fpath=results/profiled.jsonl -``` - -Increase `num_repeats` until variance < 1% across runs on the same model. - -Closed-source models should score at or above open-source models. If not, investigate for bugs. 
Inspect actual failure cases in the rollout JSONL, not just aggregate numbers. - -For external benchmarks: reproduce the original repo's published numbers first. Then reproduce after Gym integration. Scores should match. - -### Step 8: Pre-commit and PR - -```bash -pre-commit run --all-files -``` - -First run may fail as hooks auto-modify files (`verified: false` flag, README table). Stage changes and run again. - -Set `verified: true` in YAML config after successful baselining. Include W&B links and screenshots of results in the PR description. - -To avoid committing unrelated auto-fixes from other servers, scope pre-commit to your files: -```bash -pre-commit run --files resources_servers/my_benchmark/**/* -``` -If hooks modify files in other directories, discard those changes: -```bash -git checkout -- resources_servers/other_server/ -``` - -## Constraints - -- Use NeMo Gym's OpenAI client (`nemo_gym/openai_utils.py`), not LiteLLM/Anthropic/other -- **Use aiohttp, not httpx, for async HTTP.** All async HTTP calls must go through `nemo_gym.server_utils.request()` (aiohttp). httpx has O(n^2) connection pooling that hangs at high concurrency. When wrapping external libraries that use httpx internally, replace their HTTP transport with an aiohttp adapter — see `resources_servers/tavily_search/app.py` (`TavilySearchAIOHTTPClient`) for the pattern and `docs/infrastructure/engineering-notes/aiohttp-vs-httpx.md` for the rationale. -- Pass configuration through Gym config (YAML), not environment variables -- Code must run on Linux -- `/run` endpoint must be async -- Errors from tool execution or bad model output must return error responses, not crash -- All commits require DCO sign-off (`-s`) and cryptographic signature (`-S`) - -## Reference - -For detailed code patterns, schemas, and examples: see [references/patterns.md](references/patterns.md). 
diff --git a/.claude/skills/create-benchmark/SKILL.md b/.claude/skills/create-benchmark/SKILL.md new file mode 100644 index 000000000..d472a9978 --- /dev/null +++ b/.claude/skills/create-benchmark/SKILL.md @@ -0,0 +1,9 @@ +--- +name: create-benchmark +description: > + TODO +--- +**Native benchmark** — verification logic implemented directly in a Gym resources server: +- Resources server implements `verify()` with reward logic +- Agent server orchestrates model calls (use `simple_agent` for single-turn, or custom agent for multi-turn) +- Example: `code_gen`, `instruction_following`, `math_with_judge` \ No newline at end of file diff --git a/.claude/skills/evaluate-environments/SKILL.md b/.claude/skills/evaluate-environments/SKILL.md new file mode 100644 index 000000000..6892161a2 --- /dev/null +++ b/.claude/skills/evaluate-environments/SKILL.md @@ -0,0 +1,111 @@ +--- +name: evaluate-environments +description: > + Guide for running evaluations for NeMo Gym environments/benchmarks. + This should not be used for creating a new environment or integrating a new + evaluation/environment. + Use this skill when a model, agent, or benchmark needs to be run or compared. + It should also be used for collecting rollouts/rewards. + Triggered by: + "evaluate model", "evaluate agent", "run benchmark", "collect rollouts", + "reward profiling", "benchmark results", "compare models", "compare agents", + "analyze results", "pass@k", "why is reward 0" +--- + +# Evaluate Environments +This is for running reliable evaluations and generating rollouts/getting rewards. + +Always confirm that a single evaluation run works before scaling up. + +## Pre-requisites +1. Install NeMo Gym, or set up the repo: run `uv venv && uv sync` from the project root if working in the GitHub repo +2. You need a policy model. This can be a model endpoint or a self-hosted model. 
+env.yaml` at project root with model endpoint: + ```yaml + policy_base_url: https://api.openai.com/v1 + policy_api_key: + policy_model_name: gpt-4.1-2025-04-14 + ``` + For self-hosted / vLLM / Fireworks / OpenRouter, see [Configure Model docs](https://docs.nvidia.com/nemo/gym/latest/model-server). + +## Running Evals/Rollouts +**Step 1 — Start servers.** NeMo Gym runs three coordinated server types; the agent name in `ng_collect_rollouts` must match the top-level instance key declared in the +environment config you load here. + +```bash +ng_run "+config_paths=[,]" +``` + +Verify with `ng_status` in another terminal. You should see the resources server, the agent server, and the model server. + +**Step 2 — Smoke test on `example.jsonl` (5 tasks, committed to git).** + +```bash +ng_collect_rollouts \ ++agent_name=_simple_agent \ ++input_jsonl_fpath=resources_servers//data/example.jsonl \ ++output_jsonl_fpath=results/smoke_rollouts.jsonl \ ++limit=5 \ ++num_repeats=1 +``` + +If smoke fails, do **not** scale up. Inspect `results/smoke_rollouts.jsonl` directly — a completed-with-reward-0 task is very different from a server/runtime error. + +**Step 3 — Scale.** Use validation/train data (downloaded via `ng_prepare_data` if not local — see [Dataset +Management](https://docs.nvidia.com/nemo/gym/latest/about/concepts/datasets)). Bump `num_repeats` for variance reduction. + +```bash +ng_collect_rollouts \ ++agent_name=_simple_agent \ ++input_jsonl_fpath= \ ++output_jsonl_fpath=results/rollouts.jsonl \ ++num_repeats=5 \ ++num_samples_in_parallel=10 \ +"+responses_create_params={max_output_tokens: 16384, temperature: 1.0}" +``` + +`_aggregate_metrics.json` is written alongside `rollouts.jsonl` automatically. Headline numbers (`mean/reward`, `pass@1/accuracy`, etc.) print to stdout. 
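When a smoke run or scaled run needs a closer look, the rollout file can be read directly. The following is a minimal sketch, not part of NeMo Gym: it assumes each JSONL line is an object carrying `reward` and `task_index` fields, and the helper name `summarize_rollouts` is hypothetical.

```python
import json
from collections import defaultdict


def summarize_rollouts(lines):
    """Group rewards by task_index and report count, mean, and max per task.

    `lines` is any iterable of JSONL strings (an open file handle works).
    Assumption: every row has `task_index` and `reward` keys.
    """
    rewards_by_task = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        row = json.loads(line)
        rewards_by_task[row["task_index"]].append(float(row["reward"]))
    return {
        task: {"n": len(r), "mean": sum(r) / len(r), "max": max(r)}
        for task, r in sorted(rewards_by_task.items())
    }


# Tiny in-memory sample standing in for results/rollouts.jsonl.
sample = [
    '{"task_index": 0, "reward": 1.0}',
    '{"task_index": 0, "reward": 0.0}',
    '{"task_index": 1, "reward": 0.0}',
]
summary = summarize_rollouts(sample)
# Tasks that scored zero on every repeat are the first ones to inspect by hand.
never_solved = [task for task, stats in summary.items() if stats["max"] == 0.0]
print(summary)
print(never_solved)
```

With a real run, the same function could be fed an open file handle for `results/rollouts.jsonl`. A task that is missing from the summary entirely points at a runtime failure rather than a low-reward completion.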
+ +## Per-task pass rates & pass@k + +```bash +ng_reward_profile \ ++input_jsonl_fpath= \ ++rollouts_jsonl_fpath=results/rollouts.jsonl \ ++output_jsonl_fpath=results/profiled.jsonl \ ++pass_threshold=1.0 + +python scripts/print_aggregate_results.py +jsonl_fpath=results/profiled.jsonl +``` + +- `pass@1 = avg_reward` across rollouts of a task. +- `pass@k` is derived from `max_reward` across `k` repeats — only meaningful when reward is binary. +- For continuous reward, ignore pass@k and report distribution shifts and per-task means. + +## Common Evaluation Patterns +**Compare models on the same env+agent:** chain multiple model configs and run `ng_collect_rollouts` once per `agent_name` that points at each. The agent's resources server +stays identical so any score delta is attributable to the model. +**Compare agents on the same env+model:** swap the agent config in `config_paths` and re-run. Hold dataset, `num_repeats`, and `responses_create_params` constant. +**No matter what, change only one knob at a time.** Mixing model and agent changes makes deltas uninterpretable. + +## Inspect saved results +- `results/_rollouts.jsonl` — one line per (task, repeat) with `reward`, `response`, `task_index`, and any custom `VerifyResponse` fields. +- `results/_aggregate_metrics.json` — array, one object per agent: `agent_ref`, `agent_metrics`, `key_metrics`, `group_level_metrics`. +- `results/_materialized_inputs.jsonl` — the fully resolved inputs sent to the agent (useful for diffing prompts). +For benchmark-specific headline metrics, override `compute_metrics()` / `get_key_metrics()` on the resources server or agent — see [Aggregate Metrics + docs](https://docs.nvidia.com/nemo/gym/latest/environment-tutorials/aggregate-metrics). When debugging an unexpected score, read the rollout JSONL directly before re-running. + +## Metrics interpretation +1. **Binary vs continuous reward** — pass@k is only meaningful when reward is effectively {0, 1}. 
For continuous rewards, focus on distribution shifts and per-task means. +2. **Variance reduction** — keep increasing `num_repeats` until variance across runs of the same model is < 1%. Anything noisier and small score deltas are noise. +3. **Inspect samples before claiming regressions.** Aggregate numbers can hide a single broken task type swamping the average. +4. **Distinguish "completed rollout with low reward" from "runtime/server failure."** The latter shows up as exceptions in server logs and/or missing rollouts; the former is a +model/agent quality issue. + +## Output format + +When summarizing an evaluation run, return: + +1. **Run configuration table** — env, agent_name, model, dataset, num_repeats, exact command line. +2. **Aggregate metrics** — `mean/reward`, `pass@1`, `pass@k` (if binary), per-task variance. +3. **Sample-level failure themes** — group the 0-reward rollouts by failure mode (parsing error, wrong answer, tool failure, timeout, etc.). Cite specific `task_index` values. \ No newline at end of file diff --git a/.claude/skills/integrate-benchmark/SKILL.md b/.claude/skills/integrate-benchmark/SKILL.md new file mode 100644 index 000000000..6cc98f2b1 --- /dev/null +++ b/.claude/skills/integrate-benchmark/SKILL.md @@ -0,0 +1,268 @@ +--- +name: integrate-benchmark +description: > + Guide for adding a new benchmark or training environment to NeMo-Gym. This should + only be used when a benchmark or training environment ALREADY exists but is not in + NeMo Gym yet. You can also use this when wrapping an existing 3rd-party benchmark + library. + If the benchmark/training environment doesn't already exist, for example a brand + new benchmark or environment that they are defining for the first time, use the + `create-benchmark` skill instead. 
+ Triggered by: "integrate benchmark", "wrap benchmark", + "port benchmark", "add existing benchmark", "integrate X into Gym", "wrap X library", + "add X benchmark to Gym" +--- + +# Add Benchmark to NeMo-Gym + +## Ask a few simple questions for documentation and code structure: + +- Environment name +- Source repo/location of benchmark or environment +- Paper/reference (if applicable) +- License ("I don't know" is fine) +- Brief description: What does this environment evaluate? (e.g. web navigation, code generation, tool use) + +If the user is unsure about any of these questions, skip it. +Where you can determine an answer with confidence from the benchmark/environment they supply, +fill it in yourself. + +## Information Gathering and Implementation Planning + +To determine what kind of environment/benchmark this is, ask the user questions +to learn how the agent interacts with the environment and what other dependencies the environment has. + +You can refer to "Background Information about Benchmarks" in this file for additional context. + +Use the information already supplied by the user (paper, reference, source repo, etc.) to answer the +questions below as far as possible. After filling in what you can, ask the user the same questions, and use +discrepancies between the two sources to clarify the environment/benchmark. + +Ask the user to define how the agent interacts with the environment. +Common points to consider and challenge the user on: +- Does the agent receive a natural language prompt and return an answer? +- Does the model use tools (function calling, code execution, web browsing)? +- Is it single-turn or multi-turn (does the model get feedback and retry)? + +Then, ask the user about how verification works. +What's the reward signal? Is it binary pass/fail, a score, or multiple +metrics? How is correctness determined? 
(exact match, test cases, judge model, human eval)? + +Ask about external dependencies. +Does this environment require external tools, specific runtimes, or sandboxes (e.g. compilers, browsers, Docker, VMs)? +If so, list them and note whether they can be auto-installed on server startup. + +Ask about data. +- Dataset source (e.g. HuggingFace, custom): +- Approximate size (number of tasks): +- Splits available (train/validation/test): +If they didn't already provide paper/reference/source repo then ask for this. +We're looking for published or known results to use as a reference. +Link to leaderboards, papers, or repos with reported numbers. + +Lastly, note anything an engineer should know about running this environment: +- Does it need specific hardware (GPUs, large memory)? +- Does it require network access, Docker, or a VM? +- Are there known limitations on parallelism or throughput? +- Any OS or platform restrictions? + +## Build + +Use the information from information gathering from both the user +and the benchmark/environment source to properly design implementation +according to the guidelines below: + +No matter what kind of external benchmark/environment you are integrating, +you will integrate at the agent server level and not in resources server. +In short, you will wrap the benchmark in the agent server's `/run` endpoint. + +- Integrate at the agent server level (not resources server) +- Agent's `/run` endpoint wraps the external library +- Pre-process from Gym schema to library input, post-process back to `BaseVerifyResponse` +- Reproduce publicly reported numbers with the original repo first, then reproduce again after Gym integration +- Add the dependency in `requirements.txt` + +## Workflow +TODO: worried this is overfit to tau2. + +### Step 1: Scaffold the agent server +TODO: it'd be great if we had a cli command to scaffold - do we have this? +Follow the structure of `responses_api_agents/tau2` to start. 
+ + Required structure: + + responses_api_agents// + ├── app.py + ├── configs/_agent.yaml + ├── data/example.jsonl # 5 entries, committed to git + ├── tests/__init__.py + ├── tests/test_app.py + ├── requirements.txt + └── README.md + +requirements.txt content: + + -e nemo-gym[dev] @ ../../ + == # or: git+https://... @ + +Per the docs, this is the only place upstream dependencies are declared — do not vendor them into nemo_gym/ + +### Step 2: Define request/response schemas +Subclass BaseRunRequest with any extra fields your library's task runner needs (task spec, seed, run config, etc.). Subclass BaseVerifyResponse for the agent's reply — include +both reward and any per-task metrics you want logged downstream (duration, step counts, token usage, finish reasons). + +Reference: `responses_api_agents/tau2/app.py` + +### Step 3: Implement `/run` - wrap the upstream library +Subclass SimpleResponsesAPIAgent. Per the docs, leave responses() as raise NotImplementedError — external integrations only need /run. + +In run(): + +1. Preprocess — translate BaseRunRequest + responses_create_params into whatever shape your library's entrypoint expects. (See responses_api_agents/tau2/app.py:126-152.) +2. Point the upstream LLM client at Gym model servers. For each model role the library needs, expose a ModelServerRef field in the agent config (model_server for policy, plus +extras like user_model_server for simulators). At runtime, set the library's api_base = f"{get_server_url(self.config..name)}/v1" and a dummy API key. Tau2 does this for +both policy and user-sim models at app.py:131-148. +3. Await the library's task entrypoint. Example: result = await run_single_task(**body_dict) (app.py:152). +4. Postprocess the trajectory for RL. Convert the library's message list → OpenAI responses items via VLLMConverter.chat_completions_messages_to_responses_items, then split with + split_responses_input_output_items. This is what makes the trajectory consumable by Gym's training loop. 
(app.py:154-169.) +5. Return your *VerifyResponse with reward set from the library's result object plus any metrics you computed. + +### Step 4: Auto-install external tools (if applicable) + +If the benchmark requires an external tool (compiler, runtime, etc.), auto-install it on server startup so users don't need manual setup. See `references/patterns.md` § "External Tool Auto-Install Pattern". + +Key points: +- Create `setup_.py` with `ensure_()` — checks PATH, forks on `sys.platform` (brew on macOS, build from source on Linux) +- Call it in `model_post_init()` before semaphore init +- Build scripts should be idempotent and install into a local gitignored prefix +- Add a `pytest_configure` hook in `tests/conftest.py` that calls `ensure_()` before collection + +### Step 5: Write YAML configs + +1. Agent config — `responses_api_agents//configs/_agent.yaml` + +Declares the agent server: entrypoint, every ModelServerRef the library needs, library-specific settings (max steps, concurrency knobs, debug flags), and an +example dataset. Reference: `responses_api_agents/tau2/configs/tau2_agent.yaml`. + +2. Benchmark config — `benchmarks//config.yaml` + +Chains to the agent config via config_paths and uses _inherit_from to override per-variant knobs (which model serves the agent, which model serves the simulator, num_repeats, +dataset path). This is what isolates one benchmark variant from another so the agent config stays generic. Reference: benchmarks/tau2/config.yaml. + +### Step 6: Test + +```bash +# Run server tests (creates isolated .venv, slow on first run) +ng_test +entrypoint=resources_servers/my_benchmark + +# Run core library tests to check nothing broke +pytest tests/unit_tests/ -x +``` + +Test coverage must be >= 95%. Write tests for: verify pass, verify fail (wrong output), verify fail (no code extracted), verify fail (compilation error if applicable), verify timeout. 
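The test cases listed above can be sketched as plain pytest functions. This is an illustration only: `toy_verify` is a hypothetical stand-in that extracts a fenced code block and compares it to an expected string, not a NeMo Gym API. Real tests would exercise the server's own verification path, and the compilation-error and timeout cases are omitted because they depend on the benchmark's tooling.

```python
import re
from typing import Optional

FENCE = "`" * 3  # built at runtime so this example contains no literal nested fence
CODE_BLOCK = re.compile(FENCE + r"(?:python)?\n(.*?)" + FENCE, re.DOTALL)


def toy_verify(model_output: Optional[str], expected: str) -> float:
    """Hypothetical verifier: 1.0 if the extracted code equals `expected`, else 0.0."""
    if not model_output:
        return 0.0  # empty or missing output must not crash
    match = CODE_BLOCK.search(model_output)
    if match is None:
        return 0.0  # no code could be extracted
    return 1.0 if match.group(1).strip() == expected.strip() else 0.0


def test_verify_pass():
    assert toy_verify(f"{FENCE}python\nprint(1)\n{FENCE}", "print(1)") == 1.0


def test_verify_fail_wrong_output():
    assert toy_verify(f"{FENCE}python\nprint(2)\n{FENCE}", "print(1)") == 0.0


def test_verify_fail_no_code_extracted():
    assert toy_verify("I think the answer is print(1)", "print(1)") == 0.0


def test_verify_fail_empty_output():
    assert toy_verify("", "print(1)") == 0.0
    assert toy_verify(None, "print(1)") == 0.0
```

Keeping one test per failure mode makes it easier to see which path regressed when coverage drops.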
+ +### Step 7: Smoke test end-to-end + +```bash +# Start servers +ng_run "+config_paths=[resources_servers/my_benchmark/configs/my_benchmark.yaml,responses_api_models/openai_model/configs/openai_model.yaml]" + +# Quick test with example data +ng_collect_rollouts +agent_name=my_benchmark_simple_agent \ + +input_jsonl_fpath=resources_servers/my_benchmark/data/example.jsonl \ + +output_jsonl_fpath=results/example_rollouts.jsonl \ + +num_repeats=1 \ + "+responses_create_params={max_output_tokens: 16384, temperature: 1.0}" + +# Inspect results +``` + +### Step 8: Baseline (reward profiling) + +Run against multiple models to validate correctness. Recommended suite: +- Your policy model of interest +- At least one open-source instruct model (e.g. Qwen 3 30B A3B Instruct) +- At least one open-source thinking model (e.g. Qwen 3 30B A3B Thinking) +- At least one closed-source model (e.g. GPT-5 Nano or GPT-5) + +```bash +# Collect rollouts +ng_collect_rollouts +agent_name=my_benchmark_simple_agent \ + +input_jsonl_fpath=resources_servers/my_benchmark/data/my_dataset.jsonl \ + +output_jsonl_fpath=results/rollouts.jsonl \ + +num_repeats=5 \ + "+responses_create_params={max_output_tokens: 16384, temperature: 1.0}" + +# Compute per-task pass rates +ng_reward_profile +input_jsonl_fpath=resources_servers/my_benchmark/data/my_dataset.jsonl \ + +rollouts_jsonl_fpath=results/rollouts.jsonl \ + +output_jsonl_fpath=results/profiled.jsonl \ + +pass_threshold=1.0 + +# Aggregate metrics (pass@1 = avg_reward, pass@k from max_reward) +python scripts/print_aggregate_results.py +jsonl_fpath=results/profiled.jsonl +``` + +Increase `num_repeats` until variance < 1% across runs on the same model. + +Closed-source models should score at or above open-source models. If not, investigate for bugs. Inspect actual failure cases in the rollout JSONL, not just aggregate numbers. + +For external benchmarks: reproduce the original repo's published numbers first. Then reproduce after Gym integration. 
Scores should match. + +### Step 9: Pre-commit and PR + +Use `.github/ISSUE_TEMPLATE/environment-integration.md` to make sure an issue is created for the integrated environment. + +```bash +pre-commit run --all-files +``` + +First run may fail as hooks auto-modify files (`verified: false` flag, README table). Stage changes and run again. + +Set `verified: true` in YAML config after successful baselining. Include W&B links and screenshots of results in the PR description. + +To avoid committing unrelated auto-fixes from other servers, scope pre-commit to your files: +```bash +pre-commit run --files resources_servers/my_benchmark/**/* +``` +If hooks modify files in other directories, discard those changes: +```bash +git checkout -- resources_servers/other_server/ +``` + +## Constraints + +- Use NeMo Gym's OpenAI client (`nemo_gym/openai_utils.py`), not LiteLLM/Anthropic/other +- **Use aiohttp, not httpx, for async HTTP.** All async HTTP calls must go through `nemo_gym.server_utils.request()` (aiohttp). httpx has O(n^2) connection pooling that hangs at high concurrency. When wrapping external libraries that use httpx internally, replace their HTTP transport with an aiohttp adapter — see `resources_servers/tavily_search/app.py` (`TavilySearchAIOHTTPClient`) for the pattern and `docs/infrastructure/engineering-notes/aiohttp-vs-httpx.md` for the rationale. +- Pass configuration through Gym config (YAML), not environment variables +- Code must run on Linux +- `/run` endpoint must be async +- Errors from tool execution or bad model output must return error responses, not crash +- All commits require DCO sign-off (`-s`) and cryptographic signature (`-S`) +- An issue for the integrated environment must be created from `.github/ISSUE_TEMPLATE/environment-integration.md` + +## Reference + +For detailed code patterns, schemas, and examples: see [references/patterns.md](references/patterns.md). 
+ +## Background Information about Benchmarks +TODO: we could tell the agent to go read fern/versions/latest/pages/about/architecture.mdx +to learn about architecture of environments. +Benchmarks are fundamentally synonymous with environments, so understanding how +environments work will help you understand how benchmarks work. + +External benchmarks/environments generally come in three structures, +distinguished by what information comes "with" the benchmark that's going to be integrated: + +1) Benchmarks/environments that define tasks and a verifier. These notably don't have an agent harness. +A good example of this is MMLU. Users will define the model that they want to improve on this benchmark. +TODO: do these have action and state? I don't think so probably? + +2) Benchmarks/environments that define tasks, the agent/agent harness, and the state, action, and verifier. +A good example of this kind of environment is tau2. There is a specific tau2 agent harness, which is used +for tool calling in this benchmark. Users will define the model that they want to improve on this benchmark. + +3) Benchmarks/environments that define tasks, the verifier, the state, and actions. +A good example of this kind of environment is SWE-bench. Users can define the model +and/or agent that they want to improve on this benchmark. +TODO: include state and actions? diff --git a/.claude/skills/add-benchmark/references/patterns.md b/.claude/skills/integrate-benchmark/references/patterns.md similarity index 100% rename from .claude/skills/add-benchmark/references/patterns.md rename to .claude/skills/integrate-benchmark/references/patterns.md diff --git a/Agents.md b/Agents.md new file mode 100644 index 000000000..a837940dd --- /dev/null +++ b/Agents.md @@ -0,0 +1,362 @@ +# CLAUDE.md + +This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository. 
+ +## What This Is + +NeMo-Gym is NVIDIA's library for building RL training environments for LLMs (RLVR). It uses a microservice architecture with three composable FastAPI server types that communicate over async HTTP. + +## Common Commands + +```bash +# Setup +uv venv && uv sync --extra dev --group docs +pre-commit install + +# Run servers +ng_run "+config_paths=[resources_servers/example_single_tool_call/configs/example_single_tool_call.yaml,responses_api_models/vllm_model/configs/vllm_model.yaml]" + +# Run tests for a specific server (creates .venv per server, installs deps, runs pytest) +# First run is slow. Use skip_venv_if_present config or place a .venv to skip venv creation. +ng_test +entrypoint=resources_servers/example_single_tool_call + +# Run all server tests +ng_test_all + +# Run core library unit tests +pytest tests/unit_tests/ -x + +# Run a single test file +pytest tests/unit_tests/test_openai_utils.py -x + +# Lint and format +ruff check --fix . +ruff format . + +# Pre-commit (runs ruff, formatting, custom hooks) +pre-commit run --all-files + +# Collect rollouts +ng_collect_rollouts +agent_name= +input_jsonl_fpath= +output_jsonl_fpath= +num_repeats=5 "+responses_create_params={max_output_tokens: 16384, temperature: 1.0}" + +# Profile results (compute per-task pass rates) +ng_reward_profile +input_jsonl_fpath= +rollouts_jsonl_fpath= +output_jsonl_fpath= +pass_threshold=1.0 + +# Check server health +ng_status + +# Dev test (runs pytest directly in server dir, no venv isolation) +ng_dev_test +entrypoint=resources_servers/example_single_tool_call + +# Dump merged config +ng_dump_config "+config_paths=[...]" + +# Dataset management (HF) +ng_upload_dataset_to_hf +dataset_name= +version= +input_jsonl_fpath= +hf_repo_id= +ng_download_dataset_from_hf +dataset_name= +version= +output_jsonl_fpath= +hf_repo_id= +``` + +## Architecture + +Three server types, all FastAPI apps communicating via aiohttp: + +- **Resources Servers** (`resources_servers/`): Implement 
`verify()` — task verification and reward computation. Return reward 0.0 or 1.0. +- **Response API Models** (`responses_api_models/`): Implement `chat_completions()` and `responses()` — LLM inference. Four variants: openai, azure_openai, vllm, local_vllm. +- **Response API Agents** (`responses_api_agents/`): Implement `responses()` and `run()` — orchestrate model-tool call loops. `simple_agent` is the default single-turn agent; others include `proof_refinement_agent` (multi-turn correction), `verifiers_agent`, `swe_agents`, etc. + +A **HeadServer** coordinates all server lifecycles, config, and Ray cluster init. + +### Base Class Hierarchy + +``` +BaseServer (Pydantic model with config + server_client) +└── SimpleServer (FastAPI app setup, middleware stack) + ├── SimpleResourcesServer → implement verify() + ├── SimpleResponsesAPIModel → implement chat_completions(), responses() + └── SimpleResponsesAPIAgent → implement responses(), run() +``` + +### Data Flow + +JSONL input → agent `/run` → model `/v1/responses` → (tool calls if any) → resources server `/verify` → reward → JSONL output + +### Inter-Server Communication + +`ServerClient` wraps aiohttp with retry logic (3 tries, exponential backoff). Session cookies propagate through the call stack for stateful environments. The global aiohttp client is a singleton with connection pooling. + +## Configuration + +Hydra + OmegaConf for hierarchical YAML composition. CLI overrides use `+key=value` syntax. + +Each server instance is a top-level key in YAML that maps to a server type + config: +```yaml +my_server_instance: + resources_servers: # server type directory + my_server: # server subdirectory name + entrypoint: app.py + domain: coding + # ... 
server-specific config fields +``` + +Agent configs reference their resource and model servers: +```yaml +my_agent_instance: + responses_api_agents: + simple_agent: + entrypoint: app.py + resources_server: + type: resources_servers + name: my_server + model_server: + type: responses_api_models + name: policy_model + datasets: + - name: my_dataset + type: train + jsonl_fpath: path/to/data.jsonl +``` + +Model endpoint config goes in `env.yaml` at project root: +```yaml +policy_base_url: http://localhost:8000/v1 +policy_api_key: your-key +policy_model_name: your-model +``` + +## JSONL Data Schema + +Each line in input JSONL: +```json +{ + "responses_create_params": { + "input": [ + {"role": "system", "content": "..."}, + {"role": "user", "content": "..."} + ] + }, + "verifier_metadata": { ... } +} +``` + +`responses_create_params.input` follows OpenAI message format. `verifier_metadata` is passed through to the resources server's `verify()` for task-specific validation data (test cases, expected answers, etc.). + +Output JSONL (from `ng_collect_rollouts`) contains the full verify response per rollout, including at minimum: +```json +{ + "reward": 1.0, + "response": { "output_text": "..." }, + "task_index": 0 +} +``` +Additional fields depend on the resources server's `VerifyResponse` class. + +## Dataset Management + +### Dataset types and where they live + +- **`example`** datasets (5 entries for smoke testing) are committed directly to git in `data/example.jsonl`. +- **`train`** and **`validation`** datasets are hosted in the GitLab dataset registry. They must NOT be committed to git. 
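A minimal sketch of producing and sanity-checking a small `data/example.jsonl` in the input schema described above. It assumes nothing beyond the two documented top-level keys. The helper names (`make_entry`, `write_jsonl`, `count_valid_rows`) and the `expected_answer` metadata field are hypothetical, chosen only for illustration.

```python
import json
from pathlib import Path
from typing import Any


def make_entry(system_prompt: str, problem: str, metadata: dict[str, Any]) -> dict[str, Any]:
    """Build one input row: OpenAI-style messages plus opaque verifier metadata."""
    return {
        "responses_create_params": {
            "input": [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": problem},
            ]
        },
        "verifier_metadata": metadata,
    }


def write_jsonl(path: Path, entries: list[dict[str, Any]]) -> None:
    """Write one JSON object per line."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        for entry in entries:
            f.write(json.dumps(entry, ensure_ascii=False) + "\n")


def count_valid_rows(path: Path) -> int:
    """Check that every row has non-empty, well-formed input messages."""
    count = 0
    with path.open(encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            row = json.loads(line)
            messages = row["responses_create_params"]["input"]
            if not messages or any("role" not in m or "content" not in m for m in messages):
                raise ValueError(f"line {line_no}: malformed input messages")
            count += 1
    return count


# Five toy arithmetic tasks; "expected_answer" is a made-up metadata field.
entries = [
    make_entry("Answer with a single integer.", f"What is {i} + {i}?", {"expected_answer": str(2 * i)})
    for i in range(1, 6)
]
```

Calling `write_jsonl(Path("data/example.jsonl"), entries)` and then `count_valid_rows` on the same path gives a quick local check before running `ng_prepare_data` in `example_validation` mode.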
+ +### GitLab dataset registry + +Upload a JSONL dataset: +```bash +ng_upload_dataset_to_gitlab \ + +dataset_name=my_benchmark \ + +version=0.0.1 \ + +input_jsonl_fpath=resources_servers/my_benchmark/data/my_dataset.jsonl +``` + +Requires MLflow credentials in `env.yaml` (or passed via CLI): +```yaml +mlflow_tracking_uri: +mlflow_tracking_token: +``` + +The tracking URI format is `https:///api/v4/projects//ml/mlflow`. + +### YAML config: gitlab_identifier + jsonl_fpath + +Both fields coexist. `jsonl_fpath` is the local download destination; `gitlab_identifier` tells the system where to fetch from: +```yaml +- name: my_dataset + type: validation + jsonl_fpath: resources_servers/my_benchmark/data/my_dataset.jsonl + gitlab_identifier: + dataset_name: my_benchmark + version: 0.0.1 + artifact_fpath: my_dataset.jsonl + license: MIT +``` + +### data/.gitignore + +Every resources server has `data/.gitignore` (generated by `ng_init_resources_server`): +``` +*train.jsonl +*validation.jsonl +*train_prepare.jsonl +*validation_prepare.jsonl +*example_prepare.jsonl +``` + +If your filename doesn't match these patterns (e.g. `my_eval.jsonl`), add a custom pattern (e.g. `*eval.jsonl`). If data was previously tracked, run `git rm --cached `. + +### ng_prepare_data + +Validate example data (for PR submission): +```bash +ng_prepare_data "+config_paths=[...]" +output_dirpath=/tmp/prepare +mode=example_validation +``` + +Download and prepare train/validation from GitLab: +```bash +ng_prepare_data "+config_paths=[...]" +output_dirpath=data/my_benchmark \ + +mode=train_preparation +should_download=true +data_source=gitlab +``` + +## Adding a New Benchmark (Resources Server + Agent) + +For wrapping an existing 3rd-party benchmark library, integrate at the agent server level: wrap the library in `/run`, pre-process from Gym schema to library input, post-process back to `BaseVerifyResponse`. 
Reproduce publicly reported numbers with the original repo first, then reproduce again after Gym integration. + +For native benchmarks, follow these steps: + +### 1. Create the resources server + +Copy an existing server as template: +- `example_single_tool_call` — simplest example +- `code_gen` — subprocess execution with Ray (good for compilation/execution benchmarks) + +Required structure: +``` +resources_servers/my_server/ +├── app.py # Server class extending SimpleResourcesServer +├── configs/my_server.yaml +├── data/example.jsonl # 5 examples for quick testing +├── tests/__init__.py +├── tests/test_app.py +├── requirements.txt # just: -e nemo-gym[dev] @ ../../ +└── README.md +``` + +The `verify()` method receives the model output and `verifier_metadata`, returns a response with `reward` field. The `verifier_metadata` dict is opaque to the framework — define whatever fields your benchmark needs (test cases, expected answers, task IDs, etc.) and pass them through the JSONL data. + +### 2. Create or reuse an agent + +- `simple_agent` — single-turn, works for most benchmarks. Just pair it with your resources server in the YAML config. +- `proof_refinement_agent` — multi-turn correction loop (model gets error feedback and retries). Copy this if your benchmark benefits from iterative refinement. + +Agent structure: +``` +responses_api_agents/my_agent/ +├── app.py # Server class extending SimpleResponsesAPIAgent +├── configs/my_agent.yaml +├── tests/__init__.py +├── tests/test_app.py +└── requirements.txt +``` + +For multi-turn agents, propagate cookies from the incoming request through all downstream calls: `cookies=request.cookies`. Also propagate token IDs (`prompt_token_ids`, `generation_token_ids`, `generation_log_probs`) from model responses when constructing the next turn's input — these are needed for RL training. + +### 3. Wire up the YAML config + +A single YAML file in `configs/` typically defines both the resources server and its agent pairings. 
The agent references the resources server and model server by name. + +### 4. Prepare data + +Input JSONL has one problem per line. System prompt goes in the `input` messages. Task-specific verification data goes in `verifier_metadata`. + +If converting from another format, write the conversion script in the source repo (e.g. your dataset source repo); conversion scripts and prompt files do not belong in the NeMo-Gym repo. Upload only the converted JSONL to the GitLab registry. + +Generate `data/example.jsonl` with 5 entries (committed to git). Upload `train`/`validation` datasets with `ng_upload_dataset_to_gitlab`. Add `gitlab_identifier` to the YAML config. See "Dataset Management" above for the full workflow. + +Validate your data: +```bash +ng_prepare_data "+config_paths=[...]" +output_dirpath=/tmp/prepare +mode=example_validation +ng_prepare_data "+config_paths=[...]" +output_dirpath=data/my_benchmark +mode=train_preparation +should_download=true +data_source=gitlab +``` + +### 5. Baseline (reward profiling) + +Run against multiple models to validate correctness: + +```bash +# Start servers +ng_run "+config_paths=[resources_servers/my_server/configs/my_server.yaml,responses_api_models/vllm_model/configs/vllm_model.yaml]" + +# Collect rollouts (start with example.jsonl for quick smoke test) +ng_collect_rollouts +agent_name=my_agent +input_jsonl_fpath= +output_jsonl_fpath=results/rollouts.jsonl +num_repeats=5 "+responses_create_params={max_output_tokens: 16384, temperature: 1.0}" + +# Compute per-task pass rates +ng_reward_profile +input_jsonl_fpath= +rollouts_jsonl_fpath=results/rollouts.jsonl +output_jsonl_fpath=results/profiled.jsonl +pass_threshold=1.0 + +# Aggregate metrics (pass@1 = avg_reward, pass@k from max_reward) +python scripts/print_aggregate_results.py +jsonl_fpath=results/profiled.jsonl +``` + +Run on both instruct and thinking models. Thinking models emit `<think>`/`</think>` blocks in their output, so your code extraction logic must strip these before parsing. 
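The stripping step mentioned above might look like the following sketch. It assumes thinking blocks are delimited by `<think>` and `</think>` tags and that the answer code sits in a Markdown fence. The helper names `strip_thinking` and `extract_code` are hypothetical, and a real benchmark may need different delimiters or extraction rules.

```python
import re
from typing import Optional

# Assumption: reasoning is wrapped in <think>...</think> tags.
_THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)

# Build the fence marker programmatically to avoid a literal triple backtick here.
FENCE = "`" * 3
_CODE_FENCE = re.compile(FENCE + r"[\w+-]*\n(.*?)" + FENCE, re.DOTALL)


def strip_thinking(text: str) -> str:
    """Remove thinking blocks, including unbalanced ones, from model output."""
    text = _THINK_BLOCK.sub("", text)
    if "</think>" in text:
        # Closing tag without an opener: keep only what follows it.
        text = text.rsplit("</think>", 1)[1]
    if "<think>" in text:
        # Opener without a close (e.g. truncated output): drop the unfinished reasoning.
        text = text.split("<think>", 1)[0]
    return text.strip()


def extract_code(text: str) -> Optional[str]:
    """Return the last fenced code block outside any thinking block, if present."""
    matches = _CODE_FENCE.findall(strip_thinking(text))
    return matches[-1].strip() if matches else None
```

Stripping before extraction matters because draft code inside the reasoning would otherwise be picked up as if it were the final answer.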
+ +Use `openai_model` for endpoints supporting `/v1/responses`, `vllm_model` for `/v1/chat/completions`. + +### 6. Important constraints + +- Use NeMo Gym's OpenAI client (`nemo_gym/openai_utils.py`), not LiteLLM/Anthropic/other clients. It's `openai<=2.6.1` for schema compatibility. +- Pass all configuration through Gym config (YAML), not environment variables. This includes model URLs, API keys, etc. +- Environments must handle errors gracefully — tool failures and bad model outputs should return meaningful error responses, not crash the server. Must handle 4k-65k concurrent requests without crashing. +- The `/run` endpoint must be async. Use `asyncio.Semaphore` for concurrency control if shelling out to external processes. +- Tests should skip gracefully if external tools aren't installed (e.g. `pytest.mark.skipif(shutil.which("tool") is None, ...)`). +- If a benchmark auto-installs its tool dependency (see "External Tool Auto-Install" below), add a `pytest_configure` hook in `conftest.py` to run the install before test collection — `skipif` markers evaluate at import time, before fixtures run. +- Executables must run on Linux. +- Increase num_repeats until variance is < 1% across runs on the same model. + +## Code Style + +- Line length: 119 +- Python 3.12+, async-first +- Ruff for linting and formatting (double quotes, isort) +- Test coverage must be >= 95% +- All commits require DCO sign-off (`-s`) and cryptographic signature (`-S`) + +## Pre-commit Hooks + +Notable custom hooks that auto-modify files: +- `add-verified-flag`: Adds `verified: false` to new resources server YAML configs (`verified: true` means the benchmark has been baselined and reviewed; new servers start as `false`) +- `update-readme-table`: Updates the resources server table in root README.md +- `ruff-format`: Auto-formats code + +First run may fail as hooks modify files. Stage the changes and commit again. 
+ +To avoid committing unrelated auto-fixes from other servers, scope pre-commit to your files: +```bash +pre-commit run --files resources_servers/my_benchmark/**/* +``` +If hooks modify files in other directories, discard those changes: +```bash +git checkout -- resources_servers/other_server/ +``` + +## External Tool Auto-Install + +When a benchmark requires an external tool (compiler, runtime, etc.), auto-install it on server startup so users don't need manual setup: + +1. Create a `setup_.py` module with an `ensure_()` function that: + - Checks `shutil.which("tool")` — returns early if already on PATH + - Forks on `sys.platform`: macOS (brew), Linux (build from source via bash script) + - Updates `os.environ["PATH"]` and `os.environ["LD_LIBRARY_PATH"]` for the current process + - Verifies the tool runs successfully after install +2. Call `ensure_()` in the server's `model_post_init()` (runs once at startup) +3. For tests: add a `pytest_configure` hook in `conftest.py` that calls `ensure_()` before collection, so `skipif(shutil.which("tool") is None)` markers see the installed tool +4. Build-from-source scripts should be idempotent (skip if artifacts exist) and install into a local prefix (e.g. `./` in the server dir, gitignored) + +## Cluster / HPC Gotchas + +- **Ray socket path length**: On systems with long working directory paths (e.g. Lustre mounts), Ray's AF_UNIX socket paths can exceed the 107-byte Linux limit. Fix: `RAY_TMPDIR=/tmp` before running tests or `ray.init()`. +- **`ng_test` venv isolation**: `ng_test` creates isolated venvs per resources server. `os.environ` changes in Python don't propagate — set env vars externally (e.g. `RAY_TMPDIR=/tmp ng_test ...`). + +## Async Patterns + +- **Use aiohttp, not httpx, for async HTTP.** All async HTTP calls must go through NeMo Gym's global aiohttp client (`nemo_gym.server_utils.request()`). 
Do not use `httpx.AsyncClient` — httpx/httpcore has O(n^2) connection pooling that causes hangs at high concurrency (16k+ requests). When wrapping external libraries that use httpx internally, replace their HTTP transport with an aiohttp adapter. See `docs/infrastructure/engineering-notes/aiohttp-vs-httpx.md` for the full writeup and `resources_servers/tavily_search/app.py` (`TavilySearchAIOHTTPClient`) for the adapter pattern. +- Use `asyncio.Semaphore` to bound concurrent subprocess/external calls +- For Ray remote tasks in async code: `result = await future` (Ray futures are directly awaitable). Never call `ray.get()` directly in async context. +- Decode all subprocess output with `errors="replace"` to handle non-UTF8 +- Guard optional nested fields: `(body.field or {}).get("key", default)` diff --git a/CLAUDE.md b/CLAUDE.md index a837940dd..79c98d8d8 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -1,362 +1,2 @@ # CLAUDE.md - -This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository. - -## What This Is - -NeMo-Gym is NVIDIA's library for building RL training environments for LLMs (RLVR). It uses a microservice architecture with three composable FastAPI server types that communicate over async HTTP. - -## Common Commands - -```bash -# Setup -uv venv && uv sync --extra dev --group docs -pre-commit install - -# Run servers -ng_run "+config_paths=[resources_servers/example_single_tool_call/configs/example_single_tool_call.yaml,responses_api_models/vllm_model/configs/vllm_model.yaml]" - -# Run tests for a specific server (creates .venv per server, installs deps, runs pytest) -# First run is slow. Use skip_venv_if_present config or place a .venv to skip venv creation. 
-ng_test +entrypoint=resources_servers/example_single_tool_call - -# Run all server tests -ng_test_all - -# Run core library unit tests -pytest tests/unit_tests/ -x - -# Run a single test file -pytest tests/unit_tests/test_openai_utils.py -x - -# Lint and format -ruff check --fix . -ruff format . - -# Pre-commit (runs ruff, formatting, custom hooks) -pre-commit run --all-files - -# Collect rollouts -ng_collect_rollouts +agent_name= +input_jsonl_fpath= +output_jsonl_fpath= +num_repeats=5 "+responses_create_params={max_output_tokens: 16384, temperature: 1.0}" - -# Profile results (compute per-task pass rates) -ng_reward_profile +input_jsonl_fpath= +rollouts_jsonl_fpath= +output_jsonl_fpath= +pass_threshold=1.0 - -# Check server health -ng_status - -# Dev test (runs pytest directly in server dir, no venv isolation) -ng_dev_test +entrypoint=resources_servers/example_single_tool_call - -# Dump merged config -ng_dump_config "+config_paths=[...]" - -# Dataset management (HF) -ng_upload_dataset_to_hf +dataset_name= +version= +input_jsonl_fpath= +hf_repo_id= -ng_download_dataset_from_hf +dataset_name= +version= +output_jsonl_fpath= +hf_repo_id= -``` - -## Architecture - -Three server types, all FastAPI apps communicating via aiohttp: - -- **Resources Servers** (`resources_servers/`): Implement `verify()` — task verification and reward computation. Return reward 0.0 or 1.0. -- **Response API Models** (`responses_api_models/`): Implement `chat_completions()` and `responses()` — LLM inference. Four variants: openai, azure_openai, vllm, local_vllm. -- **Response API Agents** (`responses_api_agents/`): Implement `responses()` and `run()` — orchestrate model-tool call loops. `simple_agent` is the default single-turn agent; others include `proof_refinement_agent` (multi-turn correction), `verifiers_agent`, `swe_agents`, etc. - -A **HeadServer** coordinates all server lifecycles, config, and Ray cluster init. 
- -### Base Class Hierarchy - -``` -BaseServer (Pydantic model with config + server_client) -└── SimpleServer (FastAPI app setup, middleware stack) - ├── SimpleResourcesServer → implement verify() - ├── SimpleResponsesAPIModel → implement chat_completions(), responses() - └── SimpleResponsesAPIAgent → implement responses(), run() -``` - -### Data Flow - -JSONL input → agent `/run` → model `/v1/responses` → (tool calls if any) → resources server `/verify` → reward → JSONL output - -### Inter-Server Communication - -`ServerClient` wraps aiohttp with retry logic (3 tries, exponential backoff). Session cookies propagate through the call stack for stateful environments. The global aiohttp client is a singleton with connection pooling. - -## Configuration - -Hydra + OmegaConf for hierarchical YAML composition. CLI overrides use `+key=value` syntax. - -Each server instance is a top-level key in YAML that maps to a server type + config: -```yaml -my_server_instance: - resources_servers: # server type directory - my_server: # server subdirectory name - entrypoint: app.py - domain: coding - # ... server-specific config fields -``` - -Agent configs reference their resource and model servers: -```yaml -my_agent_instance: - responses_api_agents: - simple_agent: - entrypoint: app.py - resources_server: - type: resources_servers - name: my_server - model_server: - type: responses_api_models - name: policy_model - datasets: - - name: my_dataset - type: train - jsonl_fpath: path/to/data.jsonl -``` - -Model endpoint config goes in `env.yaml` at project root: -```yaml -policy_base_url: http://localhost:8000/v1 -policy_api_key: your-key -policy_model_name: your-model -``` - -## JSONL Data Schema - -Each line in input JSONL: -```json -{ - "responses_create_params": { - "input": [ - {"role": "system", "content": "..."}, - {"role": "user", "content": "..."} - ] - }, - "verifier_metadata": { ... } -} -``` - -`responses_create_params.input` follows OpenAI message format. 
`verifier_metadata` is passed through to the resources server's `verify()` for task-specific validation data (test cases, expected answers, etc.). - -Output JSONL (from `ng_collect_rollouts`) contains the full verify response per rollout, including at minimum: -```json -{ - "reward": 1.0, - "response": { "output_text": "..." }, - "task_index": 0 -} -``` -Additional fields depend on the resources server's `VerifyResponse` class. - -## Dataset Management - -### Dataset types and where they live - -- **`example`** datasets (5 entries for smoke testing) are committed directly to git in `data/example.jsonl`. -- **`train`** and **`validation`** datasets are hosted in the GitLab dataset registry. They must NOT be committed to git. - -### GitLab dataset registry - -Upload a JSONL dataset: -```bash -ng_upload_dataset_to_gitlab \ - +dataset_name=my_benchmark \ - +version=0.0.1 \ - +input_jsonl_fpath=resources_servers/my_benchmark/data/my_dataset.jsonl -``` - -Requires MLflow credentials in `env.yaml` (or passed via CLI): -```yaml -mlflow_tracking_uri: -mlflow_tracking_token: -``` - -The tracking URI format is `https:///api/v4/projects//ml/mlflow`. - -### YAML config: gitlab_identifier + jsonl_fpath - -Both fields coexist. `jsonl_fpath` is the local download destination; `gitlab_identifier` tells the system where to fetch from: -```yaml -- name: my_dataset - type: validation - jsonl_fpath: resources_servers/my_benchmark/data/my_dataset.jsonl - gitlab_identifier: - dataset_name: my_benchmark - version: 0.0.1 - artifact_fpath: my_dataset.jsonl - license: MIT -``` - -### data/.gitignore - -Every resources server has `data/.gitignore` (generated by `ng_init_resources_server`): -``` -*train.jsonl -*validation.jsonl -*train_prepare.jsonl -*validation_prepare.jsonl -*example_prepare.jsonl -``` - -If your filename doesn't match these patterns (e.g. `my_eval.jsonl`), add a custom pattern (e.g. `*eval.jsonl`). If data was previously tracked, run `git rm --cached `. 
- -### ng_prepare_data - -Validate example data (for PR submission): -```bash -ng_prepare_data "+config_paths=[...]" +output_dirpath=/tmp/prepare +mode=example_validation -``` - -Download and prepare train/validation from GitLab: -```bash -ng_prepare_data "+config_paths=[...]" +output_dirpath=data/my_benchmark \ - +mode=train_preparation +should_download=true +data_source=gitlab -``` - -## Adding a New Benchmark (Resources Server + Agent) - -For wrapping an existing 3rd-party benchmark library, integrate at the agent server level: wrap the library in `/run`, pre-process from Gym schema to library input, post-process back to `BaseVerifyResponse`. Reproduce publicly reported numbers with the original repo first, then reproduce again after Gym integration. - -For native benchmarks, follow these steps: - -### 1. Create the resources server - -Copy an existing server as template: -- `example_single_tool_call` — simplest example -- `code_gen` — subprocess execution with Ray (good for compilation/execution benchmarks) - -Required structure: -``` -resources_servers/my_server/ -├── app.py # Server class extending SimpleResourcesServer -├── configs/my_server.yaml -├── data/example.jsonl # 5 examples for quick testing -├── tests/__init__.py -├── tests/test_app.py -├── requirements.txt # just: -e nemo-gym[dev] @ ../../ -└── README.md -``` - -The `verify()` method receives the model output and `verifier_metadata`, returns a response with `reward` field. The `verifier_metadata` dict is opaque to the framework — define whatever fields your benchmark needs (test cases, expected answers, task IDs, etc.) and pass them through the JSONL data. - -### 2. Create or reuse an agent - -- `simple_agent` — single-turn, works for most benchmarks. Just pair it with your resources server in the YAML config. -- `proof_refinement_agent` — multi-turn correction loop (model gets error feedback and retries). Copy this if your benchmark benefits from iterative refinement. 
- -Agent structure: -``` -responses_api_agents/my_agent/ -├── app.py # Server class extending SimpleResponsesAPIAgent -├── configs/my_agent.yaml -├── tests/__init__.py -├── tests/test_app.py -└── requirements.txt -``` - -For multi-turn agents, propagate cookies from the incoming request through all downstream calls: `cookies=request.cookies`. Also propagate token IDs (`prompt_token_ids`, `generation_token_ids`, `generation_log_probs`) from model responses when constructing the next turn's input — these are needed for RL training. - -### 3. Wire up the YAML config - -A single YAML file in `configs/` typically defines both the resources server and its agent pairings. The agent references the resources server and model server by name. - -### 4. Prepare data - -Input JSONL has one problem per line. System prompt goes in the `input` messages. Task-specific verification data goes in `verifier_metadata`. - -If converting from another format, write the conversion script in the source repo (e.g. your dataset source repo) — conversion scripts and prompt files do not belong in the NeMo-Gym repo. Upload only the converted JSONL to the GitLab registry. - -Generate `data/example.jsonl` with 5 entries (committed to git). Upload `train`/`validation` datasets with `ng_upload_dataset_to_gitlab`. Add `gitlab_identifier` to the YAML config. See "Dataset Management" above for the full workflow. - -Validate your data: -```bash -ng_prepare_data "+config_paths=[...]" +output_dirpath=/tmp/prepare +mode=example_validation -ng_prepare_data "+config_paths=[...]" +output_dirpath=data/my_benchmark +mode=train_preparation +should_download=true +data_source=gitlab -``` - -### 5. 
Baseline (reward profiling) - -Run against multiple models to validate correctness: - -```bash -# Start servers -ng_run "+config_paths=[resources_servers/my_server/configs/my_server.yaml,responses_api_models/vllm_model/configs/vllm_model.yaml]" - -# Collect rollouts (start with example.jsonl for quick smoke test) -ng_collect_rollouts +agent_name=my_agent +input_jsonl_fpath= +output_jsonl_fpath=results/rollouts.jsonl +num_repeats=5 "+responses_create_params={max_output_tokens: 16384, temperature: 1.0}" - -# Compute per-task pass rates -ng_reward_profile +input_jsonl_fpath= +rollouts_jsonl_fpath=results/rollouts.jsonl +output_jsonl_fpath=results/profiled.jsonl +pass_threshold=1.0 - -# Aggregate metrics (pass@1 = avg_reward, pass@k from max_reward) -python scripts/print_aggregate_results.py +jsonl_fpath=results/profiled.jsonl -``` - -Run on both instruct and thinking models. Thinking models emit ``/`` blocks in their output — your code extraction logic must strip these before parsing. - -Use `openai_model` for endpoints supporting `/v1/responses`, `vllm_model` for `/v1/chat/completions`. - -### 6. Important constraints - -- Use NeMo Gym's OpenAI client (`nemo_gym/openai_utils.py`), not LiteLLM/Anthropic/other clients. It's `openai<=2.6.1` for schema compatibility. -- Pass all configuration through Gym config (YAML), not environment variables. This includes model URLs, API keys, etc. -- Environments must handle errors gracefully — tool failures and bad model outputs should return meaningful error responses, not crash the server. Must handle 4k-65k concurrent requests without crashing. -- The `/run` endpoint must be async. Use `asyncio.Semaphore` for concurrency control if shelling out to external processes. -- Tests should skip gracefully if external tools aren't installed (e.g. `pytest.mark.skipif(shutil.which("tool") is None, ...)`). 
-- If a benchmark auto-installs its tool dependency (see "External Tool Auto-Install" below), add a `pytest_configure` hook in `conftest.py` to run the install before test collection — `skipif` markers evaluate at import time, before fixtures run. -- Executables must run on Linux. -- Increase num_repeats until variance is < 1% across runs on the same model. - -## Code Style - -- Line length: 119 -- Python 3.12+, async-first -- Ruff for linting and formatting (double quotes, isort) -- Test coverage must be >= 95% -- All commits require DCO sign-off (`-s`) and cryptographic signature (`-S`) - -## Pre-commit Hooks - -Notable custom hooks that auto-modify files: -- `add-verified-flag`: Adds `verified: false` to new resources server YAML configs (`verified: true` means the benchmark has been baselined and reviewed; new servers start as `false`) -- `update-readme-table`: Updates the resources server table in root README.md -- `ruff-format`: Auto-formats code - -First run may fail as hooks modify files. Stage the changes and commit again. - -To avoid committing unrelated auto-fixes from other servers, scope pre-commit to your files: -```bash -pre-commit run --files resources_servers/my_benchmark/**/* -``` -If hooks modify files in other directories, discard those changes: -```bash -git checkout -- resources_servers/other_server/ -``` - -## External Tool Auto-Install - -When a benchmark requires an external tool (compiler, runtime, etc.), auto-install it on server startup so users don't need manual setup: - -1. Create a `setup_.py` module with an `ensure_()` function that: - - Checks `shutil.which("tool")` — returns early if already on PATH - - Forks on `sys.platform`: macOS (brew), Linux (build from source via bash script) - - Updates `os.environ["PATH"]` and `os.environ["LD_LIBRARY_PATH"]` for the current process - - Verifies the tool runs successfully after install -2. Call `ensure_()` in the server's `model_post_init()` (runs once at startup) -3. 
For tests: add a `pytest_configure` hook in `conftest.py` that calls `ensure_()` before collection, so `skipif(shutil.which("tool") is None)` markers see the installed tool -4. Build-from-source scripts should be idempotent (skip if artifacts exist) and install into a local prefix (e.g. `./` in the server dir, gitignored) - -## Cluster / HPC Gotchas - -- **Ray socket path length**: On systems with long working directory paths (e.g. Lustre mounts), Ray's AF_UNIX socket paths can exceed the 107-byte Linux limit. Fix: `RAY_TMPDIR=/tmp` before running tests or `ray.init()`. -- **`ng_test` venv isolation**: `ng_test` creates isolated venvs per resources server. `os.environ` changes in Python don't propagate — set env vars externally (e.g. `RAY_TMPDIR=/tmp ng_test ...`). - -## Async Patterns - -- **Use aiohttp, not httpx, for async HTTP.** All async HTTP calls must go through NeMo Gym's global aiohttp client (`nemo_gym.server_utils.request()`). Do not use `httpx.AsyncClient` — httpx/httpcore has O(n^2) connection pooling that causes hangs at high concurrency (16k+ requests). When wrapping external libraries that use httpx internally, replace their HTTP transport with an aiohttp adapter. See `docs/infrastructure/engineering-notes/aiohttp-vs-httpx.md` for the full writeup and `resources_servers/tavily_search/app.py` (`TavilySearchAIOHTTPClient`) for the adapter pattern. -- Use `asyncio.Semaphore` to bound concurrent subprocess/external calls -- For Ray remote tasks in async code: `result = await future` (Ray futures are directly awaitable). Never call `ray.get()` directly in async context. -- Decode all subprocess output with `errors="replace"` to handle non-UTF8 -- Guard optional nested fields: `(body.field or {}).get("key", default)` +Before working read `Agents.md` and follow guidance. \ No newline at end of file