diff --git a/.claude/skills/add-benchmark/SKILL.md b/.claude/skills/add-benchmark/SKILL.md
deleted file mode 100644
index 385666e38..000000000
--- a/.claude/skills/add-benchmark/SKILL.md
+++ /dev/null
@@ -1,252 +0,0 @@
----
-name: add-benchmark
-description: >
-  Guide for adding a new benchmark or training environment to NeMo-Gym.
-  Use when the user asks to add, create, or integrate a benchmark, evaluation,
-  training environment, or resources server into NeMo-Gym. Also use when wrapping
-  an existing 3rd-party benchmark library. Covers the full workflow: data preparation,
-  resources server implementation, agent wiring, YAML config, testing, and reward
-  profiling (baselining). Triggered by: "add benchmark", "new resources server",
-  "integrate benchmark", "wrap benchmark", "add training environment", "add eval".
----
-
-# Add Benchmark to NeMo-Gym
-
-## Determine Integration Type
-
-Before starting, determine which type of benchmark you're adding:
-
-**Native benchmark** — verification logic implemented directly in a Gym resources server:
-- Resources server implements `verify()` with reward logic
-- Agent server orchestrates model calls (use `simple_agent` for single-turn, or custom agent for multi-turn)
-- Example: `code_gen`, `instruction_following`, `math_with_judge`
-
-**External benchmark** — wrapping a 3rd-party library that has its own orchestration:
-- Integrate at the agent server level (not resources server)
-- Agent's `/run` endpoint wraps the external library
-- Pre-process from Gym schema to library input, post-process back to `BaseVerifyResponse`
-- Reproduce publicly reported numbers with the original repo first, then reproduce again after Gym integration
-- Add the dependency in `requirements.txt`
-
-## Workflow
-
-### Step 1: Scaffold the server
-
-Run `ng_init_resources_server` to generate the directory structure:
-
-```bash
-ng_init_resources_server +entrypoint=resources_servers/my_benchmark
-```
-
-This creates:
-```
-resources_servers/my_benchmark/
-├── app.py              # Server template
-├── configs/my_benchmark.yaml
-├── data/.gitignore
-├── tests/test_app.py
-├── requirements.txt
-└── README.md
-```
-
-For external benchmarks, create the agent server manually under `responses_api_agents/my_agent/` with the same structure.
-
-### Step 2: Prepare data
-
-Convert your source dataset to Gym JSONL format. Each line must have `responses_create_params.input` (OpenAI message format). Task-specific verification data goes in `verifier_metadata`.
-
-```json
-{
-  "responses_create_params": {
-    "input": [
-      {"role": "system", "content": "System prompt"},
-      {"role": "user", "content": "Problem statement"}
-    ]
-  },
-  "verifier_metadata": {
-    "test_cases": [{"input": "...", "expected_output": "..."}],
-    "task_id": "unique_id"
-  }
-}
-```
-
-**Data conversion**: Write conversion scripts in the **source repo** (e.g. your dataset repository), not in NeMo-Gym. Prompt files also belong in the source repo. Exception: when there is no external source repo. See `references/patterns.md` § "Data Conversion Script Pattern".
-
-**`example.jsonl`**: Generate 5 entries for smoke testing. This file is committed directly to git in `data/example.jsonl`.
-
-**`train`/`validation` datasets**: Upload to the GitLab dataset registry — these must NOT be committed to git.
-
-```bash
-ng_upload_dataset_to_gitlab \
-    +dataset_name=my_benchmark \
-    +version=0.0.1 \
-    +input_jsonl_fpath=resources_servers/my_benchmark/data/my_dataset.jsonl
-```
-
-Requires MLflow credentials in `env.yaml` (or passed via CLI):
-```yaml
-mlflow_tracking_uri: <your-gitlab-mlflow-tracking-uri>
-mlflow_tracking_token: <your-gitlab-api-token>
-```
-
-**`data/.gitignore`**: The scaffold generates default patterns (`*train.jsonl`, `*validation.jsonl`, etc.). If your filename doesn't match (e.g. `my_eval.jsonl`), add a custom pattern (e.g. `*eval.jsonl`). If data was previously tracked, run `git rm --cached <file>`.
-
-**Validate** your data:
-```bash
-# Validate example data (for PR submission)
-ng_prepare_data "+config_paths=[resources_servers/my_benchmark/configs/my_benchmark.yaml]" \
-    +output_dirpath=/tmp/prepare +mode=example_validation
-
-# Download and prepare train/validation from GitLab
-ng_prepare_data "+config_paths=[resources_servers/my_benchmark/configs/my_benchmark.yaml]" \
-    +output_dirpath=data/my_benchmark +mode=train_preparation +should_download=true +data_source=gitlab
-```
-
-### Step 3: Implement verify()
-
-Edit `app.py`. The `verify()` method receives model output + `verifier_metadata`, returns reward.
-
-For code execution benchmarks, see `references/patterns.md` § "Subprocess Execution with Ray" and "Resources Server Pattern".
-
-Critical rules:
-- Return `reward` as 0.0 or 1.0 (binary)
-- Handle empty/missing model output gracefully — return 0.0, don't crash
-- Must handle 4k-65k concurrent requests without crashing
-- Use `asyncio.Semaphore` for subprocess concurrency control
-- For Ray remote tasks: `result = await future` (Ray futures are directly awaitable). Never call `ray.get()` in async context.
-- Decode subprocess output with `errors="replace"`
-- Strip `<think>`/`<thinking>` blocks before parsing model output (thinking models emit these)
-- Tests should `pytest.mark.skipif` when external tools aren't installed
-- If the benchmark auto-installs its tool (see Step 3b), add a `pytest_configure` hook in `conftest.py` to run the install before test collection — `skipif` evaluates at import time, before fixtures run
-
-### Step 3b: Auto-install external tools (if applicable)
-
-If the benchmark requires an external tool (compiler, runtime, etc.), auto-install it on server startup so users don't need manual setup. See `references/patterns.md` § "External Tool Auto-Install Pattern".
-
-Key points:
-- Create `setup_<tool>.py` with `ensure_<tool>()` — checks PATH, forks on `sys.platform` (brew on macOS, build from source on Linux)
-- Call it in `model_post_init()` before semaphore init
-- Build scripts should be idempotent and install into a local gitignored prefix
-- Add a `pytest_configure` hook in `tests/conftest.py` that calls `ensure_<tool>()` before collection
-
-### Step 4: Wire YAML config
-
-Edit `configs/my_benchmark.yaml`. Define the resources server instance and agent pairing(s). See `references/patterns.md` § "YAML Config Pattern".
-
-Key points:
-- `verified: false` is auto-added by pre-commit hook (set to `true` after baselining)
-- `license` is required for `train` and `validation` datasets
-- Agent references resources server and model server by instance name
-
-For multi-turn benchmarks, either use `proof_refinement_agent` or create a custom agent. See `references/patterns.md` § "Agent Patterns".
-
-For `train`/`validation` datasets, add `gitlab_identifier` alongside `jsonl_fpath`:
-```yaml
-datasets:
-- name: my_dataset
-  type: train
-  jsonl_fpath: resources_servers/my_benchmark/data/my_dataset.jsonl
-  gitlab_identifier:
-    dataset_name: my_benchmark
-    version: 0.0.1
-    artifact_fpath: my_dataset.jsonl
-  license: MIT
-- name: example
-  type: example
-  jsonl_fpath: resources_servers/my_benchmark/data/example.jsonl
-```
-
-Both fields must coexist: `jsonl_fpath` is the local download destination, `gitlab_identifier` tells the system where to fetch from. `example` datasets don't need `gitlab_identifier` — they're committed to git directly.
-
-### Step 5: Test
-
-```bash
-# Run server tests (creates isolated .venv, slow on first run)
-ng_test +entrypoint=resources_servers/my_benchmark
-
-# Run core library tests to check nothing broke
-pytest tests/unit_tests/ -x
-```
-
-Test coverage must be >= 95%. Write tests for: verify pass, verify fail (wrong output), verify fail (no code extracted), verify fail (compilation error if applicable), verify timeout.
-
-### Step 6: Smoke test end-to-end
-
-```bash
-# Start servers
-ng_run "+config_paths=[resources_servers/my_benchmark/configs/my_benchmark.yaml,responses_api_models/openai_model/configs/openai_model.yaml]"
-
-# Quick test with example data
-ng_collect_rollouts +agent_name=my_benchmark_simple_agent \
-  +input_jsonl_fpath=resources_servers/my_benchmark/data/example.jsonl \
-  +output_jsonl_fpath=results/example_rollouts.jsonl \
-  +num_repeats=1 \
-  "+responses_create_params={max_output_tokens: 16384, temperature: 1.0}"
-
-# Inspect results
-```
-
-### Step 7: Baseline (reward profiling)
-
-Run against multiple models to validate correctness. Recommended suite:
-- Your policy model of interest
-- At least one open-source instruct model (e.g. Qwen 3 30B A3B Instruct)
-- At least one open-source thinking model (e.g. Qwen 3 30B A3B Thinking)
-- At least one closed-source model (e.g. GPT-5 Nano or GPT-5)
-
-```bash
-# Collect rollouts
-ng_collect_rollouts +agent_name=my_benchmark_simple_agent \
-  +input_jsonl_fpath=resources_servers/my_benchmark/data/my_dataset.jsonl \
-  +output_jsonl_fpath=results/rollouts.jsonl \
-  +num_repeats=5 \
-  "+responses_create_params={max_output_tokens: 16384, temperature: 1.0}"
-
-# Compute per-task pass rates
-ng_reward_profile +input_jsonl_fpath=resources_servers/my_benchmark/data/my_dataset.jsonl \
-  +rollouts_jsonl_fpath=results/rollouts.jsonl \
-  +output_jsonl_fpath=results/profiled.jsonl \
-  +pass_threshold=1.0
-
-# Aggregate metrics (pass@1 = avg_reward, pass@k from max_reward)
-python scripts/print_aggregate_results.py +jsonl_fpath=results/profiled.jsonl
-```
-
-Increase `num_repeats` until variance < 1% across runs on the same model.
-
-Closed-source models should score at or above open-source models. If not, investigate for bugs. Inspect actual failure cases in the rollout JSONL, not just aggregate numbers.
-
-For external benchmarks: reproduce the original repo's published numbers first. Then reproduce after Gym integration. Scores should match.
-
-### Step 8: Pre-commit and PR
-
-```bash
-pre-commit run --all-files
-```
-
-First run may fail as hooks auto-modify files (`verified: false` flag, README table). Stage changes and run again.
-
-Set `verified: true` in YAML config after successful baselining. Include W&B links and screenshots of results in the PR description.
-
-To avoid committing unrelated auto-fixes from other servers, scope pre-commit to your files:
-```bash
-pre-commit run --files resources_servers/my_benchmark/**/*
-```
-If hooks modify files in other directories, discard those changes:
-```bash
-git checkout -- resources_servers/other_server/
-```
-
-## Constraints
-
-- Use NeMo Gym's OpenAI client (`nemo_gym/openai_utils.py`), not LiteLLM/Anthropic/other
-- **Use aiohttp, not httpx, for async HTTP.** All async HTTP calls must go through `nemo_gym.server_utils.request()` (aiohttp). httpx has O(n^2) connection pooling that hangs at high concurrency. When wrapping external libraries that use httpx internally, replace their HTTP transport with an aiohttp adapter — see `resources_servers/tavily_search/app.py` (`TavilySearchAIOHTTPClient`) for the pattern and `docs/infrastructure/engineering-notes/aiohttp-vs-httpx.md` for the rationale.
-- Pass configuration through Gym config (YAML), not environment variables
-- Code must run on Linux
-- `/run` endpoint must be async
-- Errors from tool execution or bad model output must return error responses, not crash
-- All commits require DCO sign-off (`-s`) and cryptographic signature (`-S`)
-
-## Reference
-
-For detailed code patterns, schemas, and examples: see [references/patterns.md](references/patterns.md).
diff --git a/.claude/skills/create-benchmark/SKILL.md b/.claude/skills/create-benchmark/SKILL.md
new file mode 100644
index 000000000..d472a9978
--- /dev/null
+++ b/.claude/skills/create-benchmark/SKILL.md
@@ -0,0 +1,9 @@
+---
+name: create-benchmark
+description: >
+  TODO
+---
+**Native benchmark** — verification logic implemented directly in a Gym resources server:
+- Resources server implements `verify()` with reward logic
+- Agent server orchestrates model calls (use `simple_agent` for single-turn, or custom agent for multi-turn)
+- Example: `code_gen`, `instruction_following`, `math_with_judge` 
\ No newline at end of file
diff --git a/.claude/skills/evaluate-environments/SKILL.md b/.claude/skills/evaluate-environments/SKILL.md
new file mode 100644
index 000000000..6892161a2
--- /dev/null
+++ b/.claude/skills/evaluate-environments/SKILL.md
@@ -0,0 +1,111 @@
+---
+name: evaluate-environments
+description: >
+  Guide for running evaluations for NeMo Gym environments/benchmarks. 
+  This should not be used for creating a new environment or integrating a new
+  evaluation/environment. 
+  Use this skill when a model, agent, or benchmark needs to be run or compared. 
+  It also should be used for collecting rollouts/rewards. 
+  Triggered by:
+  "evaluate model", "evaluate agent", "run benchmark", "collect rollouts", 
+  "reward profiling", "benchmark results", "compare models", "compare agents", 
+  "analyze results", "pass@k", "why is reward 0"
+---
+
+# Evaluate Environments
+This is for running reliable evaluations and generating rollouts/getting rewards.
+
+Always confirm that a single evaluation run works before scaling up.
+
+## Pre-requisites
+1. Install NeMo Gym, or, if working in the GitHub repo, set it up with `uv venv && uv sync` from the project root.
+2. You need a policy model. This can be a model endpoint or a self-hosted model.
+   Define the model endpoint in `env.yaml` at the project root:
+     ```yaml
+     policy_base_url: https://api.openai.com/v1
+     policy_api_key: <key>
+     policy_model_name: gpt-4.1-2025-04-14
+     ```
+     For self-hosted / vLLM / Fireworks / OpenRouter, see [Configure Model docs](https://docs.nvidia.com/nemo/gym/latest/model-server).
+
+## Running Evals/Rollouts
+**Step 1 — Start servers.** NeMo Gym runs three coordinated server types; the agent name in `ng_collect_rollouts` must match the top-level instance key declared in the
+environment config you load here.
+
+```bash
+ng_run "+config_paths=[<env_config>,<model_config>]"
+```
+
+Verify with `ng_status` in another terminal. You should see the resources server, the agent server, and the model server.
+
+**Step 2 — Smoke test on `example.jsonl` (5 tasks, committed to git).**
+
+```bash
+ng_collect_rollouts \
++agent_name=<env>_simple_agent \
++input_jsonl_fpath=resources_servers/<env>/data/example.jsonl \
++output_jsonl_fpath=results/smoke_rollouts.jsonl \
++limit=5 \
++num_repeats=1
+```
+
+If the smoke test fails, do **not** scale up. Inspect `results/smoke_rollouts.jsonl` directly: a completed-with-reward-0 task is very different from a server/runtime error.
+
+**Step 3 — Scale.** Use validation/train data (downloaded via `ng_prepare_data` if not local — see [Dataset 
+Management](https://docs.nvidia.com/nemo/gym/latest/about/concepts/datasets)). Bump `num_repeats` for variance reduction.
+
+```bash
+ng_collect_rollouts \
++agent_name=<env>_simple_agent \
++input_jsonl_fpath=<full_dataset.jsonl> \
++output_jsonl_fpath=results/rollouts.jsonl \
++num_repeats=5 \
++num_samples_in_parallel=10 \
+"+responses_create_params={max_output_tokens: 16384, temperature: 1.0}"
+```
+
+`_aggregate_metrics.json` is written alongside `rollouts.jsonl` automatically. Headline numbers (`mean/reward`, `pass@1/accuracy`, etc.) print to stdout.
+
+## Per-task pass rates & pass@k
+
+```bash
+ng_reward_profile \
++input_jsonl_fpath=<full_dataset.jsonl> \
++rollouts_jsonl_fpath=results/rollouts.jsonl \
++output_jsonl_fpath=results/profiled.jsonl \
++pass_threshold=1.0
+
+python scripts/print_aggregate_results.py +jsonl_fpath=results/profiled.jsonl
+```
+
+- `pass@1 = avg_reward` across rollouts of a task.
+- `pass@k` derived from `max_reward` across `k` repeats — only meaningful when reward is binary.
+- For continuous reward, ignore pass@k and report distribution shifts + per-task means.
+
+## Common Evaluation Patterns
+- **Compare models on the same env+agent:** chain multiple model configs and run `ng_collect_rollouts` once per `agent_name` that points at each. The agent's resources server
+  stays identical, so any score delta is attributable to the model.
+- **Compare agents on the same env+model:** swap the agent config in `config_paths` and re-run. Hold dataset, `num_repeats`, and `responses_create_params` constant.
+- **Always change only one knob at a time.** Mixing model and agent changes makes deltas uninterpretable.
+
+## Inspect saved results
+- `results/<run>_rollouts.jsonl` — one line per (task, repeat) with `reward`, `response`, `task_index`, and any custom `VerifyResponse` fields.
+- `results/<run>_aggregate_metrics.json` — array, one object per agent: `agent_ref`, `agent_metrics`, `key_metrics`, `group_level_metrics`.
+- `results/<run>_materialized_inputs.jsonl` — the fully resolved inputs sent to the agent (useful for diffing prompts).
+
+For benchmark-specific headline metrics, override `compute_metrics()` / `get_key_metrics()` on the resources server or agent; see [Aggregate Metrics docs](https://docs.nvidia.com/nemo/gym/latest/environment-tutorials/aggregate-metrics). When debugging an unexpected score, read the rollout JSONL directly before re-running.
+
+## Metrics interpretation 
+1. **Binary vs continuous reward.** pass@k is only meaningful when reward is effectively {0, 1}. For continuous rewards, focus on distribution shifts and per-task means.
+2. **Variance reduction.** Keep increasing `num_repeats` until variance across runs of the same model is < 1%. At higher variance, small score deltas cannot be distinguished from noise.
+3. **Inspect samples before claiming regressions.** Aggregate numbers can hide a single broken task type that swamps the average.
+4. **Distinguish "completed rollout with low reward" from "runtime/server failure."** The latter shows up as exceptions in server logs and/or missing rollouts; the former is a
+model/agent quality issue.
+
+## Output format
+
+When summarizing an evaluation run, return:
+
+1. **Run configuration table** — env, agent_name, model, dataset, num_repeats, exact command line.
+2. **Aggregate metrics** — `mean/reward`, `pass@1`, `pass@k` (if binary), per-task variance.
+3. **Sample-level failure themes** — group the 0-reward rollouts by failure mode (parsing error, wrong answer, tool failure, timeout, etc.). Cite specific `task_index` values.
\ No newline at end of file
diff --git a/.claude/skills/integrate-benchmark/SKILL.md b/.claude/skills/integrate-benchmark/SKILL.md
new file mode 100644
index 000000000..6cc98f2b1
--- /dev/null
+++ b/.claude/skills/integrate-benchmark/SKILL.md
@@ -0,0 +1,268 @@
+---
+name: integrate-benchmark
+description: >
+  Guide for adding a new benchmark or training environment to NeMo-Gym. This should
+  only be used when a benchmark or training environment ALREADY exists but is not in 
+  NeMo Gym yet. You can also use this when wrapping an existing 3rd-party benchmark
+  library. 
+  If the benchmark/training environment doesn't already exist, for example a brand
+  new benchmark or environment that the user is defining for the first time, use the
+  `create-benchmark` skill instead. 
+  Triggered by: "integrate benchmark", "wrap benchmark", 
+  "port benchmark", "add existing benchmark", "integrate X into Gym", "wrap X library", 
+  "add X benchmark to Gym"
+---
+
+# Add Benchmark to NeMo-Gym
+
+## Ask a few simple questions for documentation and code structure:
+
+- Environment name
+- Source repo/location of benchmark or environment
+- Paper/reference (if applicable)
+- License ("I don't know" is fine)
+- Brief description: What does this environment evaluate? (e.g. web navigation, code generation, tool use)
+
+If the user is not sure about any of these questions, skip it.
+Try to determine whatever you can with confidence from the benchmark/environment they supply,
+and fill in that information yourself.
+
+## Information Gathering and Implementation Planning
+
+To figure out what kind of environment/benchmark this is, ask the user questions
+about how the agent interacts with the environment and what other dependencies the environment has.
+
+You can refer to "Background Information about Benchmarks" in this file for additional context.
+
+Use the information the user has already supplied (paper, reference, source repo, etc.) to answer the
+questions below as far as possible. Once you have filled in what you can, ask the user the same questions,
+then compare the two sets of answers and use any discrepancies to clarify the environment/benchmark.
+
+Ask the user to define how the agent interacts with the environment. Here are
+some common points to consider and to challenge the user on:
+- Does the agent receive a natural language prompt and return an answer?
+- Does the model use tools (function calling, code execution, web browsing)?
+- Is it single-turn or multi-turn (does the model get feedback and retry)?
+
+Then, ask the user how verification works.
+What's the reward signal? Is it binary pass/fail, a score, or multiple
+metrics? How is correctness determined (exact match, test cases, judge model, human eval)?
+
+Ask about external dependencies.
+Does this environment require external tools, specific runtimes, or sandboxes (e.g. compilers, browsers, Docker, VMs)?
+If so, list them and note whether they can be auto-installed on server startup. 
+
+Ask about data.
+- Dataset source (e.g. HuggingFace, custom):
+- Approximate size (number of tasks):
+- Splits available (train/validation/test):
+
+If the user has not already provided a paper, reference, or source repo, ask for one.
+We need published or known results to use as a reference: links to leaderboards, papers, or repos with reported numbers.
+
+Lastly, note anything an engineer should know about running this environment:
+- Does it need specific hardware (GPUs, large memory)?
+- Does it require network access, Docker, or a VM?
+- Are there known limitations on parallelism or throughput?
+- Any OS or platform restrictions?
+
+## Build
+
+Use the information gathered from both the user
+and the benchmark/environment source to design the implementation
+according to the guidelines below.
+
+No matter what kind of external benchmark/environment you are integrating,
+you will integrate at the agent server level and not in the resources server.
+In short, you will wrap the benchmark in the agent server's `/run` endpoint.
+
+- Integrate at the agent server level (not resources server) 
+- Agent's `/run` endpoint wraps the external library 
+- Pre-process from Gym schema to library input, post-process back to `BaseVerifyResponse`
+- Reproduce publicly reported numbers with the original repo first, then reproduce again after Gym integration
+- Add the dependency in `requirements.txt`
+
+## Workflow
+TODO: worried this is overfit to tau2. 
+
+### Step 1: Scaffold the agent server
+TODO: it'd be great if we had a cli command to scaffold - do we have this? 
+Follow the structure of `responses_api_agents/tau2` to start.
+
+Required structure:
+
+    responses_api_agents/<name>/
+    ├── app.py
+    ├── configs/<name>_agent.yaml
+    ├── data/example.jsonl          # 5 entries, committed to git
+    ├── tests/__init__.py
+    ├── tests/test_app.py
+    ├── requirements.txt
+    └── README.md
+
+`requirements.txt` content:
+
+    -e nemo-gym[dev] @ ../../
+    <upstream-library>==<pin>     # or: git+https://... @ <sha>
+
+Per the docs, this is the only place upstream dependencies are declared; do not vendor them into `nemo_gym/`.
+
+### Step 2: Define request/response schemas
+Subclass `BaseRunRequest` with any extra fields your library's task runner needs (task spec, seed, run config, etc.). Subclass `BaseVerifyResponse` for the agent's reply, and include
+both `reward` and any per-task metrics you want logged downstream (duration, step counts, token usage, finish reasons).
+
+Reference: `responses_api_agents/tau2/app.py`
+
+### Step 3: Implement `/run` - wrap the upstream library 
+Subclass `SimpleResponsesAPIAgent`. Per the docs, leave `responses()` as `raise NotImplementedError`; external integrations only need `/run`.
+
+In `run()`:
+
+1. Preprocess: translate `BaseRunRequest` + `responses_create_params` into whatever shape your library's entrypoint expects. (See `responses_api_agents/tau2/app.py:126-152`.)
+2. Point the upstream LLM client at Gym model servers. For each model role the library needs, expose a `ModelServerRef` field in the agent config (`model_server` for policy, plus
+   extras like `user_model_server` for simulators). At runtime, set the library's `api_base = f"{get_server_url(self.config.<ref>.name)}/v1"` and a dummy API key. Tau2 does this for
+   both policy and user-sim models at `app.py:131-148`.
+3. Await the library's task entrypoint. Example: `result = await run_single_task(**body_dict)` (`app.py:152`).
+4. Postprocess the trajectory for RL. Convert the library's message list to OpenAI responses items via `VLLMConverter.chat_completions_messages_to_responses_items`, then split with
+   `split_responses_input_output_items`. This is what makes the trajectory consumable by Gym's training loop. (`app.py:154-169`.)
+5. Return your `*VerifyResponse` with `reward` set from the library's result object, plus any metrics you computed.
+
+### Step 4: Auto-install external tools (if applicable)
+
+If the benchmark requires an external tool (compiler, runtime, etc.), auto-install it on server startup so users don't need manual setup. See `references/patterns.md` § "External Tool Auto-Install Pattern".
+
+Key points:
+- Create `setup_<tool>.py` with `ensure_<tool>()` — checks PATH, forks on `sys.platform` (brew on macOS, build from source on Linux)
+- Call it in `model_post_init()` before semaphore init
+- Build scripts should be idempotent and install into a local gitignored prefix
+- Add a `pytest_configure` hook in `tests/conftest.py` that calls `ensure_<tool>()` before collection
+
+### Step 5: Write YAML configs
+
+1. Agent config — `responses_api_agents/<name>/configs/<name>_agent.yaml`
+
+Declares the agent server: entrypoint, every `ModelServerRef` the library needs, library-specific settings (max steps, concurrency knobs, debug flags), and an
+example dataset. Reference: `responses_api_agents/tau2/configs/tau2_agent.yaml`.
+
+2. Benchmark config — `benchmarks/<name>/config.yaml`
+
+Chains to the agent config via `config_paths` and uses `_inherit_from` to override per-variant knobs (which model serves the agent, which model serves the simulator, `num_repeats`,
+dataset path). This isolates one benchmark variant from another, so the agent config stays generic. Reference: `benchmarks/tau2/config.yaml`.
+
+### Step 6: Test
+
+```bash
+# Run server tests (creates isolated .venv, slow on first run)
+ng_test +entrypoint=resources_servers/my_benchmark
+
+# Run core library tests to check nothing broke
+pytest tests/unit_tests/ -x
+```
+
+Test coverage must be >= 95%. Write tests for: verify pass, verify fail (wrong output), verify fail (no code extracted), verify fail (compilation error if applicable), verify timeout.
+
+### Step 6b: Smoke test end-to-end
+
+```bash
+# Start servers
+ng_run "+config_paths=[resources_servers/my_benchmark/configs/my_benchmark.yaml,responses_api_models/openai_model/configs/openai_model.yaml]"
+
+# Quick test with example data
+ng_collect_rollouts +agent_name=my_benchmark_simple_agent \
+  +input_jsonl_fpath=resources_servers/my_benchmark/data/example.jsonl \
+  +output_jsonl_fpath=results/example_rollouts.jsonl \
+  +num_repeats=1 \
+  "+responses_create_params={max_output_tokens: 16384, temperature: 1.0}"
+
+# Inspect results
+```
+
+### Step 7: Baseline (reward profiling)
+
+Run against multiple models to validate correctness. Recommended suite:
+- Your policy model of interest
+- At least one open-source instruct model (e.g. Qwen 3 30B A3B Instruct)
+- At least one open-source thinking model (e.g. Qwen 3 30B A3B Thinking)
+- At least one closed-source model (e.g. GPT-5 Nano or GPT-5)
+
+```bash
+# Collect rollouts
+ng_collect_rollouts +agent_name=my_benchmark_simple_agent \
+  +input_jsonl_fpath=resources_servers/my_benchmark/data/my_dataset.jsonl \
+  +output_jsonl_fpath=results/rollouts.jsonl \
+  +num_repeats=5 \
+  "+responses_create_params={max_output_tokens: 16384, temperature: 1.0}"
+
+# Compute per-task pass rates
+ng_reward_profile +input_jsonl_fpath=resources_servers/my_benchmark/data/my_dataset.jsonl \
+  +rollouts_jsonl_fpath=results/rollouts.jsonl \
+  +output_jsonl_fpath=results/profiled.jsonl \
+  +pass_threshold=1.0
+
+# Aggregate metrics (pass@1 = avg_reward, pass@k from max_reward)
+python scripts/print_aggregate_results.py +jsonl_fpath=results/profiled.jsonl
+```
+
+Increase `num_repeats` until variance < 1% across runs on the same model.
+
+Closed-source models should score at or above open-source models. If not, investigate for bugs. Inspect actual failure cases in the rollout JSONL, not just aggregate numbers.
+
+For external benchmarks: reproduce the original repo's published numbers first. Then reproduce after Gym integration. Scores should match.
+
+### Step 8: Pre-commit and PR
+
+Use `.github/ISSUE_TEMPLATE/environment-integration.md` to make sure an issue is created for the integrated environment.
+
+```bash
+pre-commit run --all-files
+```
+
+First run may fail as hooks auto-modify files (`verified: false` flag, README table). Stage changes and run again.
+
+Set `verified: true` in YAML config after successful baselining. Include W&B links and screenshots of results in the PR description.
+
+To avoid committing unrelated auto-fixes from other servers, scope pre-commit to your files:
+```bash
+pre-commit run --files resources_servers/my_benchmark/**/*
+```
+If hooks modify files in other directories, discard those changes:
+```bash
+git checkout -- resources_servers/other_server/
+```
+
+## Constraints
+
+- Use NeMo Gym's OpenAI client (`nemo_gym/openai_utils.py`), not LiteLLM/Anthropic/other
+- **Use aiohttp, not httpx, for async HTTP.** All async HTTP calls must go through `nemo_gym.server_utils.request()` (aiohttp). httpx has O(n^2) connection pooling that hangs at high concurrency. When wrapping external libraries that use httpx internally, replace their HTTP transport with an aiohttp adapter — see `resources_servers/tavily_search/app.py` (`TavilySearchAIOHTTPClient`) for the pattern and `docs/infrastructure/engineering-notes/aiohttp-vs-httpx.md` for the rationale.
+- Pass configuration through Gym config (YAML), not environment variables
+- Code must run on Linux
+- `/run` endpoint must be async
+- Errors from tool execution or bad model output must return error responses, not crash
+- All commits require DCO sign-off (`-s`) and cryptographic signature (`-S`)
+- An issue for the integrated environment must be created from `.github/ISSUE_TEMPLATE/environment-integration.md`
+
+## Reference
+
+For detailed code patterns, schemas, and examples: see [references/patterns.md](references/patterns.md).
+
+## Background Information about Benchmarks 
+TODO: we could tell the agent to go read fern/versions/latest/pages/about/architecture.mdx 
+to learn about architecture of environments. 
+Benchmarks are fundamentally synonymous with environments, so understanding how
+environments work will help you understand how benchmarks also work. 
+
+There are generally three kinds of external benchmark/environment structure,
+distinguished by what comes "with" the benchmark that's going to be integrated:
+
+1) Benchmarks/environments that define tasks and a verifier. These notably don't have an agent harness.
+A good example of this is MMLU. Users will define the model that they want to improve on this benchmark. 
+TODO: do these have action and state? I don't think so probably? 
+
+2) Benchmarks/environments that define tasks, the agent/agent harness, and the state, action, and verifier. 
+A good example of this kind of environment is tau2. There is a specific tau2 agent harness, which is used
+for doing tool calling for this benchmark. Users will define the model that they want to improve on this benchmark. 
+
+3) Benchmarks/environments that define tasks, the verifier, the state, and actions. 
+A good example of this kind of environment is SWEBench. Users can define the model 
+and/or agent that they want to improve on this benchmark. 
+TODO: include state and actions? 
diff --git a/.claude/skills/add-benchmark/references/patterns.md b/.claude/skills/integrate-benchmark/references/patterns.md
similarity index 100%
rename from .claude/skills/add-benchmark/references/patterns.md
rename to .claude/skills/integrate-benchmark/references/patterns.md
diff --git a/Agents.md b/Agents.md
new file mode 100644
index 000000000..a837940dd
--- /dev/null
+++ b/Agents.md
@@ -0,0 +1,362 @@
+# CLAUDE.md
+
+This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
+
+## What This Is
+
+NeMo-Gym is NVIDIA's library for building RL training environments for LLMs (RLVR). It uses a microservice architecture with three composable FastAPI server types that communicate over async HTTP.
+
+## Common Commands
+
+```bash
+# Setup
+uv venv && uv sync --extra dev --group docs
+pre-commit install
+
+# Run servers
+ng_run "+config_paths=[resources_servers/example_single_tool_call/configs/example_single_tool_call.yaml,responses_api_models/vllm_model/configs/vllm_model.yaml]"
+
+# Run tests for a specific server (creates .venv per server, installs deps, runs pytest)
+# First run is slow. Use skip_venv_if_present config or place a .venv to skip venv creation.
+ng_test +entrypoint=resources_servers/example_single_tool_call
+
+# Run all server tests
+ng_test_all
+
+# Run core library unit tests
+pytest tests/unit_tests/ -x
+
+# Run a single test file
+pytest tests/unit_tests/test_openai_utils.py -x
+
+# Lint and format
+ruff check --fix .
+ruff format .
+
+# Pre-commit (runs ruff, formatting, custom hooks)
+pre-commit run --all-files
+
+# Collect rollouts
+ng_collect_rollouts +agent_name=<agent> +input_jsonl_fpath=<data.jsonl> +output_jsonl_fpath=<output.jsonl> +num_repeats=5 "+responses_create_params={max_output_tokens: 16384, temperature: 1.0}"
+
+# Profile results (compute per-task pass rates)
+ng_reward_profile +input_jsonl_fpath=<data.jsonl> +rollouts_jsonl_fpath=<rollouts.jsonl> +output_jsonl_fpath=<profiled.jsonl> +pass_threshold=1.0
+
+# Check server health
+ng_status
+
+# Dev test (runs pytest directly in server dir, no venv isolation)
+ng_dev_test +entrypoint=resources_servers/example_single_tool_call
+
+# Dump merged config
+ng_dump_config "+config_paths=[...]"
+
+# Dataset management (HF)
+ng_upload_dataset_to_hf +dataset_name=<name> +version=<ver> +input_jsonl_fpath=<path> +hf_repo_id=<repo>
+ng_download_dataset_from_hf +dataset_name=<name> +version=<ver> +output_jsonl_fpath=<path> +hf_repo_id=<repo>
+```
+
+## Architecture
+
+Three server types, all FastAPI apps communicating via aiohttp:
+
+- **Resources Servers** (`resources_servers/`): Implement `verify()` — task verification and reward computation. Return reward 0.0 or 1.0.
+- **Response API Models** (`responses_api_models/`): Implement `chat_completions()` and `responses()` — LLM inference. Four variants: openai, azure_openai, vllm, local_vllm.
+- **Response API Agents** (`responses_api_agents/`): Implement `responses()` and `run()` — orchestrate model-tool call loops. `simple_agent` is the default single-turn agent; others include `proof_refinement_agent` (multi-turn correction), `verifiers_agent`, `swe_agents`, etc.
+
+A **HeadServer** coordinates all server lifecycles, config, and Ray cluster init.
+
+### Base Class Hierarchy
+
+```
+BaseServer (Pydantic model with config + server_client)
+└── SimpleServer (FastAPI app setup, middleware stack)
+    ├── SimpleResourcesServer  →  implement verify()
+    ├── SimpleResponsesAPIModel  →  implement chat_completions(), responses()
+    └── SimpleResponsesAPIAgent  →  implement responses(), run()
+```
+
+### Data Flow
+
+JSONL input → agent `/run` → model `/v1/responses` → (tool calls if any) → resources server `/verify` → reward → JSONL output
+
+### Inter-Server Communication
+
+`ServerClient` wraps aiohttp with retry logic (3 tries, exponential backoff). Session cookies propagate through the call stack for stateful environments. The global aiohttp client is a singleton with connection pooling.
+
+## Configuration
+
+Hydra + OmegaConf for hierarchical YAML composition. CLI overrides use `+key=value` syntax.
+
+Each server instance is a top-level key in YAML that maps to a server type + config:
+```yaml
+my_server_instance:
+  resources_servers:        # server type directory
+    my_server:              # server subdirectory name
+      entrypoint: app.py
+      domain: coding
+      # ... server-specific config fields
+```
+
+Agent configs reference their resource and model servers:
+```yaml
+my_agent_instance:
+  responses_api_agents:
+    simple_agent:
+      entrypoint: app.py
+      resources_server:
+        type: resources_servers
+        name: my_server
+      model_server:
+        type: responses_api_models
+        name: policy_model
+      datasets:
+      - name: my_dataset
+        type: train
+        jsonl_fpath: path/to/data.jsonl
+```
+
+Model endpoint config goes in `env.yaml` at project root:
+```yaml
+policy_base_url: http://localhost:8000/v1
+policy_api_key: your-key
+policy_model_name: your-model
+```
+
+## JSONL Data Schema
+
+Each line in input JSONL:
+```json
+{
+  "responses_create_params": {
+    "input": [
+      {"role": "system", "content": "..."},
+      {"role": "user", "content": "..."}
+    ]
+  },
+  "verifier_metadata": { ... }
+}
+```
+
+`responses_create_params.input` follows OpenAI message format. `verifier_metadata` is passed through to the resources server's `verify()` for task-specific validation data (test cases, expected answers, etc.).
+
+Output JSONL (from `ng_collect_rollouts`) contains the full verify response per rollout, including at minimum:
+```json
+{
+  "reward": 1.0,
+  "response": { "output_text": "..." },
+  "task_index": 0
+}
+```
+Additional fields depend on the resources server's `VerifyResponse` class.
+
+## Dataset Management
+
+### Dataset types and where they live
+
+- **`example`** datasets (5 entries for smoke testing) are committed directly to git in `data/example.jsonl`.
+- **`train`** and **`validation`** datasets are hosted in the GitLab dataset registry. They must NOT be committed to git.
+
+### GitLab dataset registry
+
+Upload a JSONL dataset:
+```bash
+ng_upload_dataset_to_gitlab \
+    +dataset_name=my_benchmark \
+    +version=0.0.1 \
+    +input_jsonl_fpath=resources_servers/my_benchmark/data/my_dataset.jsonl
+```
+
+Requires MLflow credentials in `env.yaml` (or passed via CLI):
+```yaml
+mlflow_tracking_uri: <your-gitlab-mlflow-tracking-uri>
+mlflow_tracking_token: <your-gitlab-api-token>
+```
+
+The tracking URI format is `https://<gitlab-host>/api/v4/projects/<PROJECT_ID>/ml/mlflow`.
+
+### YAML config: gitlab_identifier + jsonl_fpath
+
+Both fields coexist. `jsonl_fpath` is the local download destination; `gitlab_identifier` tells the system where to fetch from:
+```yaml
+- name: my_dataset
+  type: validation
+  jsonl_fpath: resources_servers/my_benchmark/data/my_dataset.jsonl
+  gitlab_identifier:
+    dataset_name: my_benchmark
+    version: 0.0.1
+    artifact_fpath: my_dataset.jsonl
+  license: MIT
+```
+
+### data/.gitignore
+
+Every resources server has `data/.gitignore` (generated by `ng_init_resources_server`):
+```
+*train.jsonl
+*validation.jsonl
+*train_prepare.jsonl
+*validation_prepare.jsonl
+*example_prepare.jsonl
+```
+
+If your filename doesn't match these patterns (e.g. `my_eval.jsonl`), add a custom pattern (e.g. `*eval.jsonl`). If data was previously tracked, run `git rm --cached <file>`.
+
+### ng_prepare_data
+
+Validate example data (for PR submission):
+```bash
+ng_prepare_data "+config_paths=[...]" +output_dirpath=/tmp/prepare +mode=example_validation
+```
+
+Download and prepare train/validation from GitLab:
+```bash
+ng_prepare_data "+config_paths=[...]" +output_dirpath=data/my_benchmark \
+    +mode=train_preparation +should_download=true +data_source=gitlab
+```
+
+## Adding a New Benchmark (Resources Server + Agent)
+
+For wrapping an existing 3rd-party benchmark library, integrate at the agent server level: wrap the library in `/run`, pre-process from Gym schema to library input, post-process back to `BaseVerifyResponse`. Reproduce publicly reported numbers with the original repo first, then reproduce again after Gym integration.
+
+For native benchmarks, follow these steps:
+
+### 1. Create the resources server
+
+Copy an existing server as template:
+- `example_single_tool_call` — simplest example
+- `code_gen` — subprocess execution with Ray (good for compilation/execution benchmarks)
+
+Required structure:
+```
+resources_servers/my_server/
+├── app.py              # Server class extending SimpleResourcesServer
+├── configs/my_server.yaml
+├── data/example.jsonl  # 5 examples for quick testing
+├── tests/__init__.py
+├── tests/test_app.py
+├── requirements.txt    # just: -e nemo-gym[dev] @ ../../
+└── README.md
+```
+
+The `verify()` method receives the model output and `verifier_metadata`, returns a response with `reward` field. The `verifier_metadata` dict is opaque to the framework — define whatever fields your benchmark needs (test cases, expected answers, task IDs, etc.) and pass them through the JSONL data.
+
+### 2. Create or reuse an agent
+
+- `simple_agent` — single-turn, works for most benchmarks. Just pair it with your resources server in the YAML config.
+- `proof_refinement_agent` — multi-turn correction loop (model gets error feedback and retries). Copy this if your benchmark benefits from iterative refinement.
+
+Agent structure:
+```
+responses_api_agents/my_agent/
+├── app.py              # Server class extending SimpleResponsesAPIAgent
+├── configs/my_agent.yaml
+├── tests/__init__.py
+├── tests/test_app.py
+└── requirements.txt
+```
+
+For multi-turn agents, propagate cookies from the incoming request through all downstream calls: `cookies=request.cookies`. Also propagate token IDs (`prompt_token_ids`, `generation_token_ids`, `generation_log_probs`) from model responses when constructing the next turn's input — these are needed for RL training.
+
+### 3. Wire up the YAML config
+
+A single YAML file in `configs/` typically defines both the resources server and its agent pairings. The agent references the resources server and model server by name.
+
+### 4. Prepare data
+
+Input JSONL has one problem per line. System prompt goes in the `input` messages. Task-specific verification data goes in `verifier_metadata`.
+
+If converting from another format, write the conversion script in the source repo (e.g. your dataset source repo) — conversion scripts and prompt files do not belong in the NeMo-Gym repo. Upload only the converted JSONL to the GitLab registry.
+
+Generate `data/example.jsonl` with 5 entries (committed to git). Upload `train`/`validation` datasets with `ng_upload_dataset_to_gitlab`. Add `gitlab_identifier` to the YAML config. See "Dataset Management" above for the full workflow.
+
+Validate your data:
+```bash
+ng_prepare_data "+config_paths=[...]" +output_dirpath=/tmp/prepare +mode=example_validation
+ng_prepare_data "+config_paths=[...]" +output_dirpath=data/my_benchmark +mode=train_preparation +should_download=true +data_source=gitlab
+```
+
+### 5. Baseline (reward profiling)
+
+Run against multiple models to validate correctness:
+
+```bash
+# Start servers
+ng_run "+config_paths=[resources_servers/my_server/configs/my_server.yaml,responses_api_models/vllm_model/configs/vllm_model.yaml]"
+
+# Collect rollouts (start with example.jsonl for quick smoke test)
+ng_collect_rollouts +agent_name=my_agent +input_jsonl_fpath=<data.jsonl> +output_jsonl_fpath=results/rollouts.jsonl +num_repeats=5 "+responses_create_params={max_output_tokens: 16384, temperature: 1.0}"
+
+# Compute per-task pass rates
+ng_reward_profile +input_jsonl_fpath=<data.jsonl> +rollouts_jsonl_fpath=results/rollouts.jsonl +output_jsonl_fpath=results/profiled.jsonl +pass_threshold=1.0
+
+# Aggregate metrics (pass@1 = avg_reward, pass@k from max_reward)
+python scripts/print_aggregate_results.py +jsonl_fpath=results/profiled.jsonl
+```
+
+Run on both instruct and thinking models. Thinking models emit `<think>`/`<thinking>` blocks in their output — your code extraction logic must strip these before parsing.
+
+Use `openai_model` for endpoints supporting `/v1/responses`, `vllm_model` for `/v1/chat/completions`.
+
+### 6. Important constraints
+
+- Use NeMo Gym's OpenAI client (`nemo_gym/openai_utils.py`), not LiteLLM/Anthropic/other clients. It's `openai<=2.6.1` for schema compatibility.
+- Pass all configuration through Gym config (YAML), not environment variables. This includes model URLs, API keys, etc.
+- Environments must handle errors gracefully — tool failures and bad model outputs should return meaningful error responses, not crash the server. Must handle 4k-65k concurrent requests without crashing.
+- The `/run` endpoint must be async. Use `asyncio.Semaphore` for concurrency control if shelling out to external processes.
+- Tests should skip gracefully if external tools aren't installed (e.g. `pytest.mark.skipif(shutil.which("tool") is None, ...)`).
+- If a benchmark auto-installs its tool dependency (see "External Tool Auto-Install" below), add a `pytest_configure` hook in `conftest.py` to run the install before test collection — `skipif` markers evaluate at import time, before fixtures run.
+- Executables must run on Linux.
+- Increase num_repeats until variance is < 1% across runs on the same model.
+
+## Code Style
+
+- Line length: 119
+- Python 3.12+, async-first
+- Ruff for linting and formatting (double quotes, isort)
+- Test coverage must be >= 95%
+- All commits require DCO sign-off (`-s`) and cryptographic signature (`-S`)
+
+## Pre-commit Hooks
+
+Notable custom hooks that auto-modify files:
+- `add-verified-flag`: Adds `verified: false` to new resources server YAML configs (`verified: true` means the benchmark has been baselined and reviewed; new servers start as `false`)
+- `update-readme-table`: Updates the resources server table in root README.md
+- `ruff-format`: Auto-formats code
+
+First run may fail as hooks modify files. Stage the changes and commit again.
+
+To avoid committing unrelated auto-fixes from other servers, scope pre-commit to your files:
+```bash
+pre-commit run --files resources_servers/my_benchmark/**/*
+```
+If hooks modify files in other directories, discard those changes:
+```bash
+git checkout -- resources_servers/other_server/
+```
+
+## External Tool Auto-Install
+
+When a benchmark requires an external tool (compiler, runtime, etc.), auto-install it on server startup so users don't need manual setup:
+
+1. Create a `setup_<tool>.py` module with an `ensure_<tool>()` function that:
+   - Checks `shutil.which("tool")` — returns early if already on PATH
+   - Forks on `sys.platform`: macOS (brew), Linux (build from source via bash script)
+   - Updates `os.environ["PATH"]` and `os.environ["LD_LIBRARY_PATH"]` for the current process
+   - Verifies the tool runs successfully after install
+2. Call `ensure_<tool>()` in the server's `model_post_init()` (runs once at startup)
+3. For tests: add a `pytest_configure` hook in `conftest.py` that calls `ensure_<tool>()` before collection, so `skipif(shutil.which("tool") is None)` markers see the installed tool
+4. Build-from-source scripts should be idempotent (skip if artifacts exist) and install into a local prefix (e.g. `.<tool_name>/` in the server dir, gitignored)
+
+## Cluster / HPC Gotchas
+
+- **Ray socket path length**: On systems with long working directory paths (e.g. Lustre mounts), Ray's AF_UNIX socket paths can exceed the 107-byte Linux limit. Fix: `RAY_TMPDIR=/tmp` before running tests or `ray.init()`.
+- **`ng_test` venv isolation**: `ng_test` creates isolated venvs per resources server. `os.environ` changes in Python don't propagate — set env vars externally (e.g. `RAY_TMPDIR=/tmp ng_test ...`).
+
+## Async Patterns
+
+- **Use aiohttp, not httpx, for async HTTP.** All async HTTP calls must go through NeMo Gym's global aiohttp client (`nemo_gym.server_utils.request()`). Do not use `httpx.AsyncClient` — httpx/httpcore has O(n^2) connection pooling that causes hangs at high concurrency (16k+ requests). When wrapping external libraries that use httpx internally, replace their HTTP transport with an aiohttp adapter. See `docs/infrastructure/engineering-notes/aiohttp-vs-httpx.md` for the full writeup and `resources_servers/tavily_search/app.py` (`TavilySearchAIOHTTPClient`) for the adapter pattern.
+- Use `asyncio.Semaphore` to bound concurrent subprocess/external calls
+- For Ray remote tasks in async code: `result = await future` (Ray futures are directly awaitable). Never call `ray.get()` directly in async context.
+- Decode all subprocess output with `errors="replace"` to handle non-UTF8
+- Guard optional nested fields: `(body.field or {}).get("key", default)`
diff --git a/CLAUDE.md b/CLAUDE.md
index a837940dd..79c98d8d8 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -1,362 +1,2 @@
 # CLAUDE.md
-
-This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
-
-## What This Is
-
-NeMo-Gym is NVIDIA's library for building RL training environments for LLMs (RLVR). It uses a microservice architecture with three composable FastAPI server types that communicate over async HTTP.
-
-## Common Commands
-
-```bash
-# Setup
-uv venv && uv sync --extra dev --group docs
-pre-commit install
-
-# Run servers
-ng_run "+config_paths=[resources_servers/example_single_tool_call/configs/example_single_tool_call.yaml,responses_api_models/vllm_model/configs/vllm_model.yaml]"
-
-# Run tests for a specific server (creates .venv per server, installs deps, runs pytest)
-# First run is slow. Use skip_venv_if_present config or place a .venv to skip venv creation.
-ng_test +entrypoint=resources_servers/example_single_tool_call
-
-# Run all server tests
-ng_test_all
-
-# Run core library unit tests
-pytest tests/unit_tests/ -x
-
-# Run a single test file
-pytest tests/unit_tests/test_openai_utils.py -x
-
-# Lint and format
-ruff check --fix .
-ruff format .
-
-# Pre-commit (runs ruff, formatting, custom hooks)
-pre-commit run --all-files
-
-# Collect rollouts
-ng_collect_rollouts +agent_name=<agent> +input_jsonl_fpath=<data.jsonl> +output_jsonl_fpath=<output.jsonl> +num_repeats=5 "+responses_create_params={max_output_tokens: 16384, temperature: 1.0}"
-
-# Profile results (compute per-task pass rates)
-ng_reward_profile +input_jsonl_fpath=<data.jsonl> +rollouts_jsonl_fpath=<rollouts.jsonl> +output_jsonl_fpath=<profiled.jsonl> +pass_threshold=1.0
-
-# Check server health
-ng_status
-
-# Dev test (runs pytest directly in server dir, no venv isolation)
-ng_dev_test +entrypoint=resources_servers/example_single_tool_call
-
-# Dump merged config
-ng_dump_config "+config_paths=[...]"
-
-# Dataset management (HF)
-ng_upload_dataset_to_hf +dataset_name=<name> +version=<ver> +input_jsonl_fpath=<path> +hf_repo_id=<repo>
-ng_download_dataset_from_hf +dataset_name=<name> +version=<ver> +output_jsonl_fpath=<path> +hf_repo_id=<repo>
-```
-
-## Architecture
-
-Three server types, all FastAPI apps communicating via aiohttp:
-
-- **Resources Servers** (`resources_servers/`): Implement `verify()` — task verification and reward computation. Return reward 0.0 or 1.0.
-- **Response API Models** (`responses_api_models/`): Implement `chat_completions()` and `responses()` — LLM inference. Four variants: openai, azure_openai, vllm, local_vllm.
-- **Response API Agents** (`responses_api_agents/`): Implement `responses()` and `run()` — orchestrate model-tool call loops. `simple_agent` is the default single-turn agent; others include `proof_refinement_agent` (multi-turn correction), `verifiers_agent`, `swe_agents`, etc.
-
-A **HeadServer** coordinates all server lifecycles, config, and Ray cluster init.
-
-### Base Class Hierarchy
-
-```
-BaseServer (Pydantic model with config + server_client)
-└── SimpleServer (FastAPI app setup, middleware stack)
-    ├── SimpleResourcesServer  →  implement verify()
-    ├── SimpleResponsesAPIModel  →  implement chat_completions(), responses()
-    └── SimpleResponsesAPIAgent  →  implement responses(), run()
-```
-
-### Data Flow
-
-JSONL input → agent `/run` → model `/v1/responses` → (tool calls if any) → resources server `/verify` → reward → JSONL output
-
-### Inter-Server Communication
-
-`ServerClient` wraps aiohttp with retry logic (3 tries, exponential backoff). Session cookies propagate through the call stack for stateful environments. The global aiohttp client is a singleton with connection pooling.
-
-## Configuration
-
-Hydra + OmegaConf for hierarchical YAML composition. CLI overrides use `+key=value` syntax.
-
-Each server instance is a top-level key in YAML that maps to a server type + config:
-```yaml
-my_server_instance:
-  resources_servers:        # server type directory
-    my_server:              # server subdirectory name
-      entrypoint: app.py
-      domain: coding
-      # ... server-specific config fields
-```
-
-Agent configs reference their resource and model servers:
-```yaml
-my_agent_instance:
-  responses_api_agents:
-    simple_agent:
-      entrypoint: app.py
-      resources_server:
-        type: resources_servers
-        name: my_server
-      model_server:
-        type: responses_api_models
-        name: policy_model
-      datasets:
-      - name: my_dataset
-        type: train
-        jsonl_fpath: path/to/data.jsonl
-```
-
-Model endpoint config goes in `env.yaml` at project root:
-```yaml
-policy_base_url: http://localhost:8000/v1
-policy_api_key: your-key
-policy_model_name: your-model
-```
-
-## JSONL Data Schema
-
-Each line in input JSONL:
-```json
-{
-  "responses_create_params": {
-    "input": [
-      {"role": "system", "content": "..."},
-      {"role": "user", "content": "..."}
-    ]
-  },
-  "verifier_metadata": { ... }
-}
-```
-
-`responses_create_params.input` follows OpenAI message format. `verifier_metadata` is passed through to the resources server's `verify()` for task-specific validation data (test cases, expected answers, etc.).
-
-Output JSONL (from `ng_collect_rollouts`) contains the full verify response per rollout, including at minimum:
-```json
-{
-  "reward": 1.0,
-  "response": { "output_text": "..." },
-  "task_index": 0
-}
-```
-Additional fields depend on the resources server's `VerifyResponse` class.
-
-## Dataset Management
-
-### Dataset types and where they live
-
-- **`example`** datasets (5 entries for smoke testing) are committed directly to git in `data/example.jsonl`.
-- **`train`** and **`validation`** datasets are hosted in the GitLab dataset registry. They must NOT be committed to git.
-
-### GitLab dataset registry
-
-Upload a JSONL dataset:
-```bash
-ng_upload_dataset_to_gitlab \
-    +dataset_name=my_benchmark \
-    +version=0.0.1 \
-    +input_jsonl_fpath=resources_servers/my_benchmark/data/my_dataset.jsonl
-```
-
-Requires MLflow credentials in `env.yaml` (or passed via CLI):
-```yaml
-mlflow_tracking_uri: <your-gitlab-mlflow-tracking-uri>
-mlflow_tracking_token: <your-gitlab-api-token>
-```
-
-The tracking URI format is `https://<gitlab-host>/api/v4/projects/<PROJECT_ID>/ml/mlflow`.
-
-### YAML config: gitlab_identifier + jsonl_fpath
-
-Both fields coexist. `jsonl_fpath` is the local download destination; `gitlab_identifier` tells the system where to fetch from:
-```yaml
-- name: my_dataset
-  type: validation
-  jsonl_fpath: resources_servers/my_benchmark/data/my_dataset.jsonl
-  gitlab_identifier:
-    dataset_name: my_benchmark
-    version: 0.0.1
-    artifact_fpath: my_dataset.jsonl
-  license: MIT
-```
-
-### data/.gitignore
-
-Every resources server has `data/.gitignore` (generated by `ng_init_resources_server`):
-```
-*train.jsonl
-*validation.jsonl
-*train_prepare.jsonl
-*validation_prepare.jsonl
-*example_prepare.jsonl
-```
-
-If your filename doesn't match these patterns (e.g. `my_eval.jsonl`), add a custom pattern (e.g. `*eval.jsonl`). If data was previously tracked, run `git rm --cached <file>`.
-
-### ng_prepare_data
-
-Validate example data (for PR submission):
-```bash
-ng_prepare_data "+config_paths=[...]" +output_dirpath=/tmp/prepare +mode=example_validation
-```
-
-Download and prepare train/validation from GitLab:
-```bash
-ng_prepare_data "+config_paths=[...]" +output_dirpath=data/my_benchmark \
-    +mode=train_preparation +should_download=true +data_source=gitlab
-```
-
-## Adding a New Benchmark (Resources Server + Agent)
-
-For wrapping an existing 3rd-party benchmark library, integrate at the agent server level: wrap the library in `/run`, pre-process from Gym schema to library input, post-process back to `BaseVerifyResponse`. Reproduce publicly reported numbers with the original repo first, then reproduce again after Gym integration.
-
-For native benchmarks, follow these steps:
-
-### 1. Create the resources server
-
-Copy an existing server as template:
-- `example_single_tool_call` — simplest example
-- `code_gen` — subprocess execution with Ray (good for compilation/execution benchmarks)
-
-Required structure:
-```
-resources_servers/my_server/
-├── app.py              # Server class extending SimpleResourcesServer
-├── configs/my_server.yaml
-├── data/example.jsonl  # 5 examples for quick testing
-├── tests/__init__.py
-├── tests/test_app.py
-├── requirements.txt    # just: -e nemo-gym[dev] @ ../../
-└── README.md
-```
-
-The `verify()` method receives the model output and `verifier_metadata`, returns a response with `reward` field. The `verifier_metadata` dict is opaque to the framework — define whatever fields your benchmark needs (test cases, expected answers, task IDs, etc.) and pass them through the JSONL data.
-
-### 2. Create or reuse an agent
-
-- `simple_agent` — single-turn, works for most benchmarks. Just pair it with your resources server in the YAML config.
-- `proof_refinement_agent` — multi-turn correction loop (model gets error feedback and retries). Copy this if your benchmark benefits from iterative refinement.
-
-Agent structure:
-```
-responses_api_agents/my_agent/
-├── app.py              # Server class extending SimpleResponsesAPIAgent
-├── configs/my_agent.yaml
-├── tests/__init__.py
-├── tests/test_app.py
-└── requirements.txt
-```
-
-For multi-turn agents, propagate cookies from the incoming request through all downstream calls: `cookies=request.cookies`. Also propagate token IDs (`prompt_token_ids`, `generation_token_ids`, `generation_log_probs`) from model responses when constructing the next turn's input — these are needed for RL training.
-
-### 3. Wire up the YAML config
-
-A single YAML file in `configs/` typically defines both the resources server and its agent pairings. The agent references the resources server and model server by name.
-
-### 4. Prepare data
-
-Input JSONL has one problem per line. System prompt goes in the `input` messages. Task-specific verification data goes in `verifier_metadata`.
-
-If converting from another format, write the conversion script in the source repo (e.g. your dataset source repo) — conversion scripts and prompt files do not belong in the NeMo-Gym repo. Upload only the converted JSONL to the GitLab registry.
-
-Generate `data/example.jsonl` with 5 entries (committed to git). Upload `train`/`validation` datasets with `ng_upload_dataset_to_gitlab`. Add `gitlab_identifier` to the YAML config. See "Dataset Management" above for the full workflow.
-
-Validate your data:
-```bash
-ng_prepare_data "+config_paths=[...]" +output_dirpath=/tmp/prepare +mode=example_validation
-ng_prepare_data "+config_paths=[...]" +output_dirpath=data/my_benchmark +mode=train_preparation +should_download=true +data_source=gitlab
-```
-
-### 5. Baseline (reward profiling)
-
-Run against multiple models to validate correctness:
-
-```bash
-# Start servers
-ng_run "+config_paths=[resources_servers/my_server/configs/my_server.yaml,responses_api_models/vllm_model/configs/vllm_model.yaml]"
-
-# Collect rollouts (start with example.jsonl for quick smoke test)
-ng_collect_rollouts +agent_name=my_agent +input_jsonl_fpath=<data.jsonl> +output_jsonl_fpath=results/rollouts.jsonl +num_repeats=5 "+responses_create_params={max_output_tokens: 16384, temperature: 1.0}"
-
-# Compute per-task pass rates
-ng_reward_profile +input_jsonl_fpath=<data.jsonl> +rollouts_jsonl_fpath=results/rollouts.jsonl +output_jsonl_fpath=results/profiled.jsonl +pass_threshold=1.0
-
-# Aggregate metrics (pass@1 = avg_reward, pass@k from max_reward)
-python scripts/print_aggregate_results.py +jsonl_fpath=results/profiled.jsonl
-```
-
-Run on both instruct and thinking models. Thinking models emit `<think>`/`<thinking>` blocks in their output — your code extraction logic must strip these before parsing.
-
-Use `openai_model` for endpoints supporting `/v1/responses`, `vllm_model` for `/v1/chat/completions`.
-
-### 6. Important constraints
-
-- Use NeMo Gym's OpenAI client (`nemo_gym/openai_utils.py`), not LiteLLM/Anthropic/other clients. It's `openai<=2.6.1` for schema compatibility.
-- Pass all configuration through Gym config (YAML), not environment variables. This includes model URLs, API keys, etc.
-- Environments must handle errors gracefully — tool failures and bad model outputs should return meaningful error responses, not crash the server. Must handle 4k-65k concurrent requests without crashing.
-- The `/run` endpoint must be async. Use `asyncio.Semaphore` for concurrency control if shelling out to external processes.
-- Tests should skip gracefully if external tools aren't installed (e.g. `pytest.mark.skipif(shutil.which("tool") is None, ...)`).
-- If a benchmark auto-installs its tool dependency (see "External Tool Auto-Install" below), add a `pytest_configure` hook in `conftest.py` to run the install before test collection — `skipif` markers evaluate at import time, before fixtures run.
-- Executables must run on Linux.
-- Increase num_repeats until variance is < 1% across runs on the same model.
-
-## Code Style
-
-- Line length: 119
-- Python 3.12+, async-first
-- Ruff for linting and formatting (double quotes, isort)
-- Test coverage must be >= 95%
-- All commits require DCO sign-off (`-s`) and cryptographic signature (`-S`)
-
-## Pre-commit Hooks
-
-Notable custom hooks that auto-modify files:
-- `add-verified-flag`: Adds `verified: false` to new resources server YAML configs (`verified: true` means the benchmark has been baselined and reviewed; new servers start as `false`)
-- `update-readme-table`: Updates the resources server table in root README.md
-- `ruff-format`: Auto-formats code
-
-First run may fail as hooks modify files. Stage the changes and commit again.
-
-To avoid committing unrelated auto-fixes from other servers, scope pre-commit to your files:
-```bash
-pre-commit run --files resources_servers/my_benchmark/**/*
-```
-If hooks modify files in other directories, discard those changes:
-```bash
-git checkout -- resources_servers/other_server/
-```
-
-## External Tool Auto-Install
-
-When a benchmark requires an external tool (compiler, runtime, etc.), auto-install it on server startup so users don't need manual setup:
-
-1. Create a `setup_<tool>.py` module with an `ensure_<tool>()` function that:
-   - Checks `shutil.which("tool")` — returns early if already on PATH
-   - Forks on `sys.platform`: macOS (brew), Linux (build from source via bash script)
-   - Updates `os.environ["PATH"]` and `os.environ["LD_LIBRARY_PATH"]` for the current process
-   - Verifies the tool runs successfully after install
-2. Call `ensure_<tool>()` in the server's `model_post_init()` (runs once at startup)
-3. For tests: add a `pytest_configure` hook in `conftest.py` that calls `ensure_<tool>()` before collection, so `skipif(shutil.which("tool") is None)` markers see the installed tool
-4. Build-from-source scripts should be idempotent (skip if artifacts exist) and install into a local prefix (e.g. `.<tool_name>/` in the server dir, gitignored)
-
-## Cluster / HPC Gotchas
-
-- **Ray socket path length**: On systems with long working directory paths (e.g. Lustre mounts), Ray's AF_UNIX socket paths can exceed the 107-byte Linux limit. Fix: `RAY_TMPDIR=/tmp` before running tests or `ray.init()`.
-- **`ng_test` venv isolation**: `ng_test` creates isolated venvs per resources server. `os.environ` changes in Python don't propagate — set env vars externally (e.g. `RAY_TMPDIR=/tmp ng_test ...`).
-
-## Async Patterns
-
-- **Use aiohttp, not httpx, for async HTTP.** All async HTTP calls must go through NeMo Gym's global aiohttp client (`nemo_gym.server_utils.request()`). Do not use `httpx.AsyncClient` — httpx/httpcore has O(n^2) connection pooling that causes hangs at high concurrency (16k+ requests). When wrapping external libraries that use httpx internally, replace their HTTP transport with an aiohttp adapter. See `docs/infrastructure/engineering-notes/aiohttp-vs-httpx.md` for the full writeup and `resources_servers/tavily_search/app.py` (`TavilySearchAIOHTTPClient`) for the adapter pattern.
-- Use `asyncio.Semaphore` to bound concurrent subprocess/external calls
-- For Ray remote tasks in async code: `result = await future` (Ray futures are directly awaitable). Never call `ray.get()` directly in async context.
-- Decode all subprocess output with `errors="replace"` to handle non-UTF8
-- Guard optional nested fields: `(body.field or {}).get("key", default)`
+Before working read `Agents.md` and follow guidance.
\ No newline at end of file