adding readme for skills-eval #2055
Open: jperez999 wants to merge 3 commits into NVIDIA:main from jperez999:readme-skills-add
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,299 @@ | ||
| # `retriever skill-eval` — benchmarking the `/nemo-retriever` skill on ViDoRe v3 | ||
|
|
||
| `retriever skill-eval run` measures whether wiring the **`/nemo-retriever`** Claude Code skill into an agent improves retrieval and answer quality over a folder of PDFs, compared to a baseline agent that has neither the `retriever` CLI nor the skill available. | ||
|
|
||
| Each invocation runs the same set of questions through Claude Code under three conditions: | ||
|
|
||
| | Condition | Retriever CLI on `$PATH` | Skill loaded into `.claude/` | Prompt style | | ||
| |---------------------|--------------------------|------------------------------|------------------------------------| | ||
| | `c1_base` | No (shimmed + denied) | No | Natural-language ("Set up search…")| | ||
| | `c2_retriever` | Yes | Yes | Natural-language | | ||
| | `c3_retriever_skill`| Yes | Yes | Explicit `/nemo-retriever …` slash | | ||
|
|
||
| This README assumes you are targeting the **ViDoRe v3** corpus and have a copy of the per-domain PDF tree on disk (e.g. on NVIDIA infra at `/datasets/nv-ingest/vidore_v3_corpus_pdf/`, or a private mirror at any path of your choice). | ||
|
|
||
| --- | ||
|
|
||
| ## Table of Contents | ||
|
|
||
| - [Prerequisites](#prerequisites) | ||
| - [Inputs at a glance](#inputs-at-a-glance) | ||
| - [1. Make the PDF tree reachable](#1-make-the-pdf-tree-reachable) | ||
| - [2. Supply an agent-eval manifest](#2-supply-an-agent-eval-manifest) | ||
| - [3. Author your `skill_eval.yaml`](#3-author-your-skill_evalyaml) | ||
| - [4. Run the benchmark](#4-run-the-benchmark) | ||
| - [CLI reference](#cli-reference) | ||
| - [Output layout](#output-layout) | ||
| - [Interpreting the summary](#interpreting-the-summary) | ||
| - [Troubleshooting](#troubleshooting) | ||
|
|
||
| --- | ||
|
|
||
| ## Prerequisites | ||
|
|
||
| - **`retriever` CLI** — install the project (`uv pip install -e ./nemo_retriever`) so `retriever skill-eval` is on `$PATH`. | ||
| - **`claude` CLI** — the Claude Code binary must be on `$PATH`. The runner exits with code `2` if `shutil.which("claude")` returns `None`. | ||
| - **A Claude account / API access** — `claude --print` will negotiate auth on first use. | ||
| - **Claude Code autorun / permission access** — the runner launches non-interactive Claude Code subprocesses with `--permission-mode bypassPermissions` and, for `c2_retriever` / `c3_retriever_skill`, `--allow-dangerously-skip-permissions`. Make sure your environment is allowed to run those commands before starting a sweep, and run a small `claude --print` smoke test first so any first-use auth or permission prompts are already resolved. | ||
| - **Disk** — each `(condition, domain)` builds a scratch workdir under `/tmp/skill_eval/` containing a `pdfs/` symlink farm, a `.claude/` sandbox, and any retrieval artifacts the agent creates (e.g. `lancedb/`). The workdir is deleted after the session completes, so only one LanceDB is on disk at a time. | ||
| - **(Optional) `NVIDIA_API_KEY`** — if set, the LLM-as-judge scores each `final_answer` against the manifest's ground-truth `answer` on a 0–5 scale. If unset, judging is skipped with a console note and no error; recall numbers are still produced. | ||
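
The prerequisites above can be checked before starting a sweep. The following is a minimal sketch, not part of the `retriever` CLI: the helper name `preflight` and its message strings are hypothetical, and it relies only on what this README states (both binaries must resolve on `$PATH`, and a missing `NVIDIA_API_KEY` merely disables judging).

```python
import os
import shutil


def preflight(env=None, which=shutil.which):
    """Return a list of human-readable problems; an empty list means ready.

    Hypothetical helper for illustration, not part of the retriever CLI.
    """
    env = os.environ if env is None else env
    problems = []
    for binary in ("claude", "retriever"):
        if which(binary) is None:
            problems.append(f"`{binary}` is not on PATH")
    if not env.get("NVIDIA_API_KEY"):
        # Not fatal per the README: judging is skipped, recall is still produced.
        problems.append("NVIDIA_API_KEY is unset (judge will be skipped)")
    return problems


if __name__ == "__main__":
    for problem in preflight():
        print("warning:", problem)
```

The `env` and `which` parameters exist only so the check can be exercised without touching the real environment.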
|
|
||
| --- | ||
|
|
||
| ## Inputs at a glance | ||
|
|
||
| `skill-eval` needs three things you must supply: | ||
|
|
||
| 1. A folder of **PDFs per domain** (e.g. `vidore_v3_finance_en/*.pdf`). | ||
| 2. An **agent-eval manifest** (JSON list) describing the queries, paraphrased prompts, ground-truth pages, and ground-truth answers — see [§2](#2-supply-an-agent-eval-manifest). | ||
| 3. A **`skill_eval.yaml`** config binding the manifest to the PDF directories — see [§3](#3-author-your-skill_evalyaml). | ||
|
|
||
| Everything else (model, budget, timeout, conditions, judge endpoint) has working defaults in the packaged config at `src/nemo_retriever/skill_eval/configs/skill_eval.yaml`. | ||
|
|
||
| --- | ||
|
|
||
| ## 1. Make the PDF tree reachable | ||
|
|
||
| ViDoRe v3 is split per-domain. The eight domains the harness recognises are: | ||
|
|
||
| ``` | ||
| vidore_v3_computer_science | ||
| vidore_v3_energy | ||
| vidore_v3_finance_en | ||
| vidore_v3_finance_fr | ||
| vidore_v3_hr | ||
| vidore_v3_industrial | ||
| vidore_v3_pharmaceuticals | ||
| vidore_v3_physics | ||
| ``` | ||
|
|
||
| Each one resolves to a directory containing only the relevant PDFs. On NVIDIA infra those live at `/datasets/nv-ingest/vidore_v3_corpus_pdf/<domain>`. If your copy lives elsewhere, substitute that path everywhere `<VIDORE_ROOT>` appears below; e.g. `<VIDORE_ROOT> = /raid/datasets/vidore_v3_corpus_pdf`. | ||
|
|
||
| The runner does **not** copy the PDFs — it builds per-trial symlinks into `<workdir>/pdfs/`. The PDF roots can stay on a read-only mount. | ||
|
|
||
| > **Tip:** if your `domain` strings in the manifest are bare (e.g. `finance_en` rather than `vidore_v3_finance_en`), the `pdf_dirs` keys in the config must match exactly what's in the manifest, not what's on the filesystem. See [§3](#3-author-your-skill_evalyaml). | ||
|
|
||
| --- | ||
|
|
||
| ## 2. Supply an agent-eval manifest | ||
|
|
||
| The manifest is a **JSON list**; each item describes one query. It is produced by an upstream SDG pipeline; the skill-eval loader is dataset-agnostic and only enforces the schema described in `dataset.py:load_eval_manifest`. The minimum required keys per entry are: | ||
|
|
||
| | Field | Type | Purpose | | ||
| |---------------------------------------------|----------------------|-----------------------------------------------------------------------------------------------| | ||
| | `original_query` | string | The raw user question. Used by the `c3_retriever_skill` slash-command prompt. | | ||
| | `sdg_prompt_candidates.candidates` | list of `{variant_id, prompt}` | Paraphrased prompt variants. The runner uses the one matching `sdg_prompt_validation.selected_variant_id`, falling back to the first. | | ||
| | `sdg_prompt_validation.selected_variant_id` | int (optional) | Chosen variant. | | ||
| | `relevant_pages` | list of `{doc_id, page_number_in_doc, score}` | Ground-truth pages. `doc_id` is the PDF basename without `.pdf`; `page_number_in_doc` is 0-indexed. | | ||
| | `answer` | string | Ground-truth answer, used by the LLM judge. | | ||
| | `domain` | string | Joins the entry to a `pdf_dirs` key in the config. | | ||
| | `prompt_taxonomy.domain_label` | string | Human-readable domain name injected into the setup-turn prompt (e.g. "energy industry reports"). | | ||
| | `primary_eval_id` | string (optional) | Stable per-query id (else `eval_base_id`, else 1-indexed position). | | ||
|
|
||
| The newer scenario-format keys (`scenario_prompt_candidates`, `scenario_prompt_validation`) are accepted as aliases. | ||
|
|
||
| Entries with `prompt_export_status` not in `(None, "exported")` are skipped, as are entries with no usable paraphrased prompt. | ||
|
|
||
| **Example entry** (one item from the manifest list): | ||
|
|
||
| ```json | ||
| { | ||
| "primary_eval_id": "vidore_v3_finance_en:42:variant-1", | ||
| "domain": "vidore_v3_finance_en", | ||
| "prompt_taxonomy": { "domain_label": "English-language corporate finance filings" }, | ||
| "original_query": "What was Acme Corp's free cash flow in FY2024?", | ||
| "sdg_prompt_candidates": { | ||
| "candidates": [ | ||
| { "variant_id": 1, "prompt": "Look at the PDFs at ./pdfs/ and tell me Acme Corp's FY2024 free cash flow." } | ||
| ] | ||
| }, | ||
| "sdg_prompt_validation": { "selected_variant_id": 1 }, | ||
| "relevant_pages": [ | ||
| { "doc_id": "Acme_10K_2024", "page_number_in_doc": 47, "score": 1 } | ||
| ], | ||
| "answer": "$3.2B (per the FY2024 cash flow statement)." | ||
| } | ||
| ``` | ||
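
The selection and skip rules described in this section can be sketched as follows. This is an illustration under stated assumptions, not the code in `dataset.py`: the helper name `select_prompt` is hypothetical, and the real loader may handle edge cases differently.

```python
def select_prompt(entry):
    """Return the paraphrased prompt for one manifest entry, or None to skip it.

    Hypothetical helper mirroring the rules in this README; the real loader
    lives in dataset.py and may differ in detail.
    """
    if entry.get("prompt_export_status") not in (None, "exported"):
        return None
    # Scenario-format keys are accepted as aliases for the sdg_* keys.
    cands = (entry.get("sdg_prompt_candidates")
             or entry.get("scenario_prompt_candidates") or {})
    valid = (entry.get("sdg_prompt_validation")
             or entry.get("scenario_prompt_validation") or {})
    candidates = [c for c in cands.get("candidates", []) if c.get("prompt")]
    if not candidates:
        return None  # no usable paraphrased prompt
    selected = valid.get("selected_variant_id")
    if selected is not None:
        for cand in candidates:
            if cand.get("variant_id") == selected:
                return cand["prompt"]
    return candidates[0]["prompt"]  # fall back to the first variant
```

Applied to the example entry above, this returns the single `variant_id: 1` prompt.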
|
|
||
| Note any path references inside paraphrased prompts (e.g. `"the PDFs at test-data/vidore_v3/.../pdfs/"`) — they may need rewriting via `testdata_prefixes`; see [§3](#3-author-your-skill_evalyaml). | ||
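
As a rough illustration of what such a rewrite could look like, the sketch below replaces each configured prefix with `./pdfs/`. The function name `rewrite_prefixes` and the exact replacement semantics are assumptions; this page does not show how the runner applies `testdata_prefixes`.

```python
def rewrite_prefixes(prompt, prefixes, target="./pdfs/"):
    """Replace each configured dataset path prefix in `prompt` with `target`.

    Hypothetical sketch; the runner's real rewriting logic is not shown here.
    """
    # Longest prefixes first so a short prefix does not pre-empt a longer one.
    for prefix in sorted(prefixes, key=len, reverse=True):
        prompt = prompt.replace(prefix, target)
    return prompt
```

With the prefix `test-data/vidore_v3/finance_en/pdfs/`, a prompt mentioning `test-data/vidore_v3/finance_en/pdfs/Acme.pdf` would come out referring to `./pdfs/Acme.pdf`.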
|
|
||
| --- | ||
|
|
||
| ## 3. Author your `skill_eval.yaml` | ||
|
|
||
| Copy the packaged config next to your dataset checkout and edit it: | ||
|
|
||
| ```bash | ||
| cp nemo_retriever/src/nemo_retriever/skill_eval/configs/skill_eval.yaml \ | ||
| ~/datasets/vidore_v3/skill_eval.yaml | ||
| ``` | ||
|
|
||
| A complete ViDoRe v3 config looks like: | ||
|
|
||
| ```yaml | ||
| # ~/datasets/vidore_v3/skill_eval.yaml | ||
|
|
||
| # Absolute path to your agent-eval manifest (JSON list). | ||
| eval_manifest_path: ~/datasets/vidore_v3/agent_eval_manifest.json | ||
|
|
||
| # Per-domain PDF roots. KEY = manifest "domain" field; VALUE = directory of PDFs. | ||
| # Substitute <VIDORE_ROOT> with wherever your ViDoRe v3 corpus lives. | ||
| pdf_dirs: | ||
| vidore_v3_computer_science: <VIDORE_ROOT>/vidore_v3_computer_science | ||
| vidore_v3_energy: <VIDORE_ROOT>/vidore_v3_energy | ||
| vidore_v3_finance_en: <VIDORE_ROOT>/vidore_v3_finance_en | ||
| vidore_v3_finance_fr: <VIDORE_ROOT>/vidore_v3_finance_fr | ||
| vidore_v3_hr: <VIDORE_ROOT>/vidore_v3_hr | ||
| vidore_v3_industrial: <VIDORE_ROOT>/vidore_v3_industrial | ||
| vidore_v3_pharmaceuticals: <VIDORE_ROOT>/vidore_v3_pharmaceuticals | ||
| vidore_v3_physics: <VIDORE_ROOT>/vidore_v3_physics | ||
|
|
||
| # OPTIONAL — rewrite dataset-source path prefixes in paraphrased prompts to ./pdfs. | ||
| # Add one entry per prefix the manifest hard-codes. | ||
| testdata_prefixes: | ||
| - test-data/vidore_v3/ | ||
|
|
||
| # Agent + per-trial limits (defaults shown). | ||
| agent_model: claude-opus-4-7 | ||
| per_trial_budget_usd: 5.0 | ||
| per_trial_timeout_s: 600 | ||
| per_trial_workdir_root: /tmp/skill_eval | ||
|
|
||
| # Conditions to run, in order. Each (condition, domain) workdir is deleted after | ||
| # it finishes, so only one LanceDB exists on disk at a time. | ||
| conditions: | ||
| - c1_base | ||
| - c2_retriever | ||
| - c3_retriever_skill | ||
|
|
||
| # LLM-as-judge. Skipped (with a console note) if $NVIDIA_API_KEY is unset. | ||
| judge: | ||
| enabled: true | ||
| model: nvidia_nim/mistralai/mixtral-8x22b-instruct-v0.1 | ||
| api_base: https://integrate.api.nvidia.com/v1 | ||
| api_key_env: NVIDIA_API_KEY | ||
| ``` | ||
|
|
||
| **Things to double-check:** | ||
|
|
||
| - `pdf_dirs` keys must **exactly match** the `domain` field on each manifest entry. Mismatches cause the runner to exit with `pdf_dirs is missing an entry for domain '…'`. | ||
| - Each path under `pdf_dirs` must be a directory (not a glob). The runner symlinks every `*.pdf` inside. | ||
| - If your manifest references PDFs by paths like `test-data/vidore_v3/finance_en/pdfs/Acme.pdf`, add `test-data/vidore_v3/<domain>/pdfs/` (or the common prefix) to `testdata_prefixes` so the prompt text resolves to `./pdfs/Acme.pdf` inside the trial workdir. | ||
| - The single-path key `pdf_dir` is still honored as a fallback if you only have one domain. | ||
|
|
||
| --- | ||
|
|
||
| ## 4. Run the benchmark | ||
|
|
||
| Smoke-test on one domain first to validate config + manifest binding before paying for the whole sweep: | ||
|
|
||
| ```bash | ||
| retriever skill-eval run \ | ||
| --config ~/datasets/vidore_v3/skill_eval.yaml \ | ||
| --domains vidore_v3_finance_en \ | ||
| --conditions c2_retriever | ||
| ``` | ||
|
|
||
| This runs one condition × one domain — one Claude session, with one setup turn followed by N query turns (one per manifest entry tagged `vidore_v3_finance_en`). | ||
|
|
||
| Once that succeeds, run the full sweep: | ||
|
|
||
| ```bash | ||
| retriever skill-eval run --config ~/datasets/vidore_v3/skill_eval.yaml | ||
| ``` | ||
|
|
||
| Conditions and domains execute sequentially — three conditions × eight domains = 24 sessions in the default ViDoRe v3 setup. Each session is one Claude Code subprocess holding state across turns via `--resume <session-id>`. | ||
|
|
||
| The runner prints per-turn status, token usage, cost, and recall per `(condition, domain)`. Example: | ||
|
|
||
| ``` | ||
| Loaded 412 dataset entries. | ||
| Domains in this run: ['vidore_v3_computer_science', …, 'vidore_v3_physics'] (412 entries total) | ||
| Session dir: /raid/.../nemo_retriever/artifacts/skilleval_20260518_141200 | ||
| Starting session for c1_base/vidore_v3_finance_en — setup + 52 query turns (pdfs=/datasets/.../vidore_v3_finance_en) | ||
| turn 1 [vidore_v3_finance_en] setup: status=ok tokens(in/out/cache_r)=… cost=$0.041 retrieved=0 | ||
| turn 2 [vidore_v3_finance_en] entry_id=1 query_id=vidore_v3_finance_en:1:variant-1: status=ok … judge=4 | ||
| … | ||
| Recall for c1_base/vidore_v3_finance_en: recall@1=0.115 recall@5=0.327 recall@10=0.481 | ||
| Cleaned up workdir for c1_base/vidore_v3_finance_en | ||
| ``` | ||
|
|
||
| --- | ||
|
|
||
| ## CLI reference | ||
|
|
||
| ``` | ||
| retriever skill-eval run [OPTIONS] | ||
| ``` | ||
|
|
||
| | Option | Default | Notes | | ||
| |---------------------|------------------------------------------------------|------------------------------------------------------------------------------------------------| | ||
| | `--config PATH` | packaged `skill_eval.yaml` (errors without `pdf_dirs`) | The YAML described in [§3](#3-author-your-skill_evalyaml). Supplying your own is strongly recommended. | ||
| | `--eval-manifest PATH` | `cfg.eval_manifest_path` | Overrides the config's manifest path. | | ||
| | `--conditions LIST` | `c1_base,c2_retriever,c3_retriever_skill` | Comma-separated, in execution order. Unknown values exit with code `2`. | | ||
| | `--domains LIST` | all domains present in the manifest | Comma-separated subset (e.g. `vidore_v3_finance_en,vidore_v3_finance_fr`). Unknowns exit with code `2`. | | ||
| | `--artifacts-root PATH` | `<repo>/nemo_retriever/artifacts/` | Where the session directory is created. | | ||
|
|
||
| The CLI exits with code `2` for any configuration error (missing `claude` binary, missing `eval_manifest_path`, malformed `pdf_dirs`, unknown condition/domain, missing PDF directory). | ||
|
|
||
| --- | ||
|
|
||
| ## Output layout | ||
|
|
||
| Each run writes a timestamped session directory: | ||
|
|
||
| ``` | ||
| <artifacts-root>/skilleval_<timestamp>/ | ||
| ├── config.yaml # Snapshot of the resolved config | ||
| ├── session_summary.json # Machine-readable per-(condition, domain) metrics | ||
| ├── session_summary.md # Human-readable markdown report | ||
| └── trials/ | ||
| ├── c1_base/ | ||
| │ └── vidore_v3_finance_en/ | ||
| │ ├── c1_base_vidore_v3_finance_en_setup_t1.json | ||
| │ ├── c1_base_vidore_v3_finance_en_e1_t2.json | ||
| │ └── … | ||
| ├── c2_retriever/ | ||
| │ └── vidore_v3_finance_en/… | ||
| └── c3_retriever_skill/ | ||
| └── vidore_v3_finance_en/… | ||
| ``` | ||
|
|
||
| **Per-trial JSON** (`trials/<cond>/<domain>/<trial_id>.json`) is the `TrialResult` dataclass serialized: status, duration, token usage, cost, `final_answer`, `ranked_retrieved`, judge score, and `retriever_used_ever` / `skill_fired` diagnostics. | ||
|
|
||
| **Trial workdirs** under `per_trial_workdir_root` (`/tmp/skill_eval/` by default) are **deleted** after each `(condition, domain)` session finishes — only the session directory above survives. If you want to inspect a workdir mid-run (e.g. examine the agent's `lancedb/`), kill the run before the cleanup, or set a breakpoint. | ||
|
|
||
| --- | ||
|
|
||
| ## Interpreting the summary | ||
|
|
||
| `session_summary.md` contains, per condition (rolled up across domains) and per `(condition, domain)`: | ||
|
|
||
| - `success_rate` — fraction of turns that exited cleanly. | ||
| - `retr_used` — fraction of turns whose Claude Code transcript contains a Bash invocation of the `retriever` CLI. Should be near 0 for `c1_base` and near 1 for `c2/c3`. | ||
| - `recall@1 / @5 / @10` — macro-averaged recall@k over the `(doc_id, page_number)` pairs the agent wrote into `ranked_retrieved`. Comparable to `retriever harness` BEIR output. | ||
| - `judge` — mean LLM judge score on the 0–5 scale, with sample size. `—` when the judge was disabled or unreachable. | ||
| - `q_input / q_output / q_cache_read / q_cache_create` — mean per-query-turn token usage on the agent session (not the underlying retrieval pipeline's embedding/VLM calls — those aren't instrumented here). | ||
| - `q_cost` — mean per-query-turn USD cost. | ||
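
For readers who want to recompute the recall columns from the per-trial JSON, here is a minimal sketch of macro-averaged recall@k over `(doc_id, page_number)` pairs. The definition used (hits in the top k divided by the number of ground-truth pages, then averaged over queries, skipping queries with no ground truth) is an assumption; the harness may differ in details such as deduplication.

```python
def recall_at_k(ranked, relevant, k):
    """Fraction of ground-truth (doc_id, page) pairs found in the top-k results."""
    relevant = set(relevant)
    if not relevant:
        return None  # undefined without ground truth
    top_k = set(ranked[:k])
    return len(top_k & relevant) / len(relevant)


def macro_recall(queries, k):
    """Mean of per-query recall@k; `queries` is a list of (ranked, relevant) pairs."""
    scores = [recall_at_k(ranked, relevant, k) for ranked, relevant in queries]
    scores = [s for s in scores if s is not None]
    return sum(scores) / len(scores) if scores else 0.0
```

Macro-averaging gives every query equal weight regardless of how many ground-truth pages it has.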
|
|
||
| A separate **"Setup turns"** table sums the one-time setup-turn cost across all domains for each condition. For `c2/c3` this captures the cost of running `retriever ingest ./pdfs/` over a domain; for `c1` it captures the cost of whatever ad-hoc scaffolding the agent invents (typically expensive and noisy). | ||
|
|
||
| The **"Diagnostics"** section reports `skill_fired_rate` for `c2/c3`: the fraction of turns where the agent invoked `retriever` within the first two turns (a proxy for "did the skill description auto-discover correctly"). | ||
|
|
||
| --- | ||
|
|
||
| ## Troubleshooting | ||
|
|
||
| **`Error: \`claude\` CLI is not on PATH`** — install Claude Code and confirm `which claude` resolves before re-running. | ||
|
|
||
| **`config 'pdf_dirs' is missing an entry for domain '<X>'`** — your manifest contains a `domain` value that has no key in `pdf_dirs`. Either add the key, or use `--domains` to skip that subset. | ||
|
|
||
| **`PDF directory '…' for domain '…' does not exist or is not a directory`** — the value under `pdf_dirs.<domain>` was unset (`~` expansion failed, typo, etc.). Resolve the path manually with `ls "/your/configured/pdf_dirs/path"` and update the config. | ||
|
|
||
| **Judge prints `Judge disabled: $NVIDIA_API_KEY is not set` and exits cleanly** — that is by design. Recall and other metrics still land in the summary; only the `judge` column shows `—`. Export `NVIDIA_API_KEY` and re-run if you want the score. | ||
|
|
||
| **`c1_base` shows `retr_used` > 0** — the `_C1_BASH_DENY_PATTERNS` in `runner.py` are deny-globs against the assembled command line. If the agent invented a new path that those globs don't catch, the call goes through. File an issue with the offending command from the trial JSON's session-log path and extend the list. | ||
|
|
||
| **Per-domain run times look too long** — pass `--conditions c2_retriever,c3_retriever_skill` for a quick recall-only sweep against the skill (skipping `c1_base`), or use `--domains` to run a subset. Each condition × domain is independent; you can re-run any subset and the session directories don't collide (each has its own timestamped name). | ||
|
|
||
| **Agent failed to write `./output.json`** — the per-trial JSON will have `status="extraction_failed"` and `extraction_method` in `("missing", "invalid_json")`. The Claude Code session log path (under `~/.claude/projects/`) is reconstructable from the workdir, but the workdir has been deleted — re-run that single trial with `--conditions <cond> --domains <domain>` to capture the transcript fresh. | ||
**`~/` paths in YAML example won't expand**

The comment on line 136 says "Absolute path to your agent-eval manifest," but `~/datasets/...` is not an absolute path: `~` is a shell shorthand that standard YAML loaders (PyYAML's `safe_load`) pass through verbatim. Unless `runner.py` explicitly calls `os.path.expanduser()` on every path value after loading, the runner will look for a literal directory named `~` and fail. The troubleshooting section (line 291) already hints at this: "the value under `pdf_dirs.<domain>` was unset (`~` expansion failed …)", but users who hit this error after following the example config will find it confusing. The example should use a real absolute path (e.g. `/home/user/datasets/vidore_v3/agent_eval_manifest.json`) or add a note that `~` must be pre-expanded before writing it into the YAML.
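
One way to address this is to expand path-valued fields right after the config is loaded. The sketch below is an illustration only: `expand_config_paths` is a hypothetical helper, the field names are the ones shown in the example config, and whether `runner.py` already performs an equivalent step is not visible from this page.

```python
import os


def expand_config_paths(cfg):
    """Return a copy of a loaded config dict with `~` and $VARS expanded in path fields.

    Hypothetical helper; field names are taken from the example config.
    """
    def expand(value):
        return os.path.abspath(os.path.expandvars(os.path.expanduser(value)))

    out = dict(cfg)
    for key in ("eval_manifest_path", "pdf_dir", "per_trial_workdir_root"):
        if isinstance(out.get(key), str):
            out[key] = expand(out[key])
    if isinstance(out.get("pdf_dirs"), dict):
        out["pdf_dirs"] = {d: expand(p) for d, p in out["pdf_dirs"].items()}
    return out
```

Non-path fields such as `agent_model` pass through untouched, and the input dict is not modified.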