diff --git a/.claude/skills/nemo-retriever/SKILL.md b/.claude/skills/nemo-retriever/SKILL.md
index 0a9e102482..7c9322b929 100644
--- a/.claude/skills/nemo-retriever/SKILL.md
+++ b/.claude/skills/nemo-retriever/SKILL.md
@@ -13,9 +13,36 @@ If no arguments are provided, run `retriever --help` and summarize the available
 
 ## Subcommand references
 
-For per-subcommand details (when to use it, canonical invocations, inputs/outputs, flags, common failure modes), read the matching file in `references/` *before* running anything non-trivial:
+For per-subcommand details (when to use it, canonical invocations, inputs/outputs, flags, common failure modes), read the matching file in `references/` *before* running anything non-trivial.
 
-- `references/ingest.md` — `retriever ingest`: PDFs → LanceDB (full pipeline).
+End-to-end / search:
+
+- `references/ingest.md` — `retriever ingest`: docs → LanceDB (full pipeline, defaults).
 - `references/query.md` — `retriever query`: text query → top-k LanceDB hits.
+- `references/pipeline.md` — `retriever pipeline run`: graph-based end-to-end with per-stage knobs.
+- `references/service.md` — `retriever service`: long-running ingest service + client.
+- `references/local.md` — `retriever local stage{1..7}`: non-distributed per-stage runner.
+
+Per-input-type extractors:
+
+- `references/pdf.md` — `retriever pdf stage page-elements`: PDF → primitives JSON.
+- `references/chart.md` — `retriever chart stage run` / `graphic-elements`: chart enrichment.
+- `references/audio.md` — `retriever audio extract` / `discover`: chunk + ASR.
+- `references/txt.md` — `retriever txt run`: plain-text chunking.
+- `references/html.md` — `retriever html run`: HTML → markdown → chunks.
+- `references/image.md` — `retriever image render`: detection overlay visualization.
+
+Storage and evaluation:
+
+- `references/vector-store.md` — `retriever vector-store stage run`: embeddings → LanceDB.
+- `references/recall.md` — `retriever recall vdb-recall run`: recall@k over a query CSV.
+- `references/eval.md` — `retriever eval run` / `export` / `build-page-index`: QA evaluation.
+- `references/benchmark.md` — `retriever benchmark <stage> run`: per-stage rows/sec.
+- `references/harness.md` — `retriever harness run` / `sweep` / `nightly` / `portal` / …: sessioned orchestration.
+- `references/compare.md` — `retriever compare`: JSON / results-bundle diffs.
+
+Cross-cutting:
+
+- `references/pipeline-stages.md` — map of the internal pipeline stages (page-elements, ocr, table-structure, graphic-elements, embed, caption, dedup, store, …) → which CLI command exposes each.
 
-Additional per-stage references (`pdf`, `chart`, `image`, `audio`, `txt`, `html`, `pipeline`, `vector-store`, `recall`, `eval`, `benchmark`, `service`, `local`, `compare`, `harness`) will be added as those stages stabilize. Until then, fall back to `retriever <subcommand> --help` for any subcommand not listed above.
+If a subcommand isn't listed above, fall back to `retriever <subcommand> --help`.
diff --git a/.claude/skills/nemo-retriever/references/audio.md b/.claude/skills/nemo-retriever/references/audio.md
new file mode 100644
index 0000000000..ebd700e345
--- /dev/null
+++ b/.claude/skills/nemo-retriever/references/audio.md
@@ -0,0 +1,100 @@
+# retriever audio
+
+Audio / video extraction stage: chunk media files, run ASR (Parakeet locally
+or a remote Riva/NIM endpoint), and write extraction JSON sidecars in the
+same primitives shape as [[pdf]].
+
+If flags below look stale, re-check `retriever audio extract --help`.
+
+## When to use this
+
+- You have audio (`.mp3`, `.wav`) or video files and want ASR transcripts
+  fed into the rest of the retriever pipeline.
+- You want to verify mount/path layout before kicking off a long ASR run →
+  use `retriever audio discover` (no ASR, just lists what would be
+  processed).
+
+**Use a different command when:**
+
+- You want full ingest including audio → [[pipeline]] with
+  `--input-type audio` or [[ingest]] once it accepts audio inputs.
+- You want to benchmark ASR throughput → [[benchmark]] (`audio-extract`).
+
+## Canonical invocations
+
+Dry-run discovery:
+
+```bash
+retriever audio discover --input-dir data/audio/
+```
+
+Local Parakeet ASR over `*.mp3`/`*.wav` (default globs):
+
+```bash
+retriever audio extract --input-dir data/audio/
+```
+
+Cloud ASR via NIM env vars:
+
+```bash
+export NGC_API_KEY=...
+export AUDIO_FUNCTION_ID=...
+retriever audio extract --input-dir data/audio/ --use-env-asr
+```
+
+Override the gRPC endpoint explicitly:
+
+```bash
+retriever audio extract \
+  --input-dir data/audio/ \
+  --audio-grpc-endpoint riva-asr:50051 \
+  --auth-token "$NVIDIA_API_KEY"
+```
+
+Process video files, extracting the audio track first:
+
+```bash
+retriever audio extract --input-dir data/media/ --glob "*.mp4" --audio-only
+```
+
+## Inputs
+
+- **`--input-dir DIR`** — required; scanned non-recursively for files
+  matching `--glob`.
+- **`--glob`** — comma-separated patterns. Default `*.mp3,*.wav`.
+
+## Outputs
+
+- One `<file>.audio_extraction.json` sidecar per source file (default; toggle
+  with `--write-json/--no-write-json`).
+- Sidecar shape mirrors PDF primitives (`text`, `source_id`, `metadata`),
+  with `metadata.content_metadata.type == "text"` per ASR chunk.
+
+## Key flags
+
+| Flag | Default | Notes |
+|---|---|---|
+| `--split-type` | `size` | `size` (bytes), `time` (seconds), or `frame`. |
+| `--split-interval` | `450` | Chunk size in the chosen units. |
+| `--audio-only` | off | Extract audio track from video first, then chunk. |
+| `--video-audio-separate` | off | Emit the extracted MP3 as its own item. |
+| `--use-env-asr` | on | Build ASR params from `AUDIO_GRPC_ENDPOINT`/`NGC_API_KEY`/`AUDIO_FUNCTION_ID`. |
+| `--audio-grpc-endpoint` | — | Override env; sets remote ASR. Wins over `--use-env-asr`. |
+| `--auth-token` | — | Bearer for cloud ASR (also `$NVIDIA_API_KEY`). |
+| `--limit` | — | Cap files processed. |
+
+## Common failure modes
+
+- **`No files matched glob`** — default globs are `*.mp3,*.wav`. Pass
+  `--glob "*.mp4"` for video, etc.
+- **Falls back to local Parakeet unexpectedly** — `--use-env-asr` is on but
+  none of `AUDIO_GRPC_ENDPOINT` / `NGC_API_KEY` / `AUDIO_FUNCTION_ID` are
+  set. Either set them or pass `--audio-grpc-endpoint`.
+- **Local Parakeet OOM on long files** — lower `--split-interval` (smaller
+  chunks) or switch to a remote NIM.
+
+## Related
+
+- [[pipeline]] with `--input-type audio` — full ingest including embedding +
+  VDB.
+- [[benchmark]] `audio-extract` — throughput benchmarks.
diff --git a/.claude/skills/nemo-retriever/references/benchmark.md b/.claude/skills/nemo-retriever/references/benchmark.md
new file mode 100644
index 0000000000..10d07ba968
--- /dev/null
+++ b/.claude/skills/nemo-retriever/references/benchmark.md
@@ -0,0 +1,93 @@
+# retriever benchmark
+
+Throughput micro-benchmarks for individual Ray actors in the ingest
+pipeline. Each subcommand isolates one stage and reports rows/sec.
+
+Subcommands:
+
+| Stage | Subcommand | Actor benchmarked |
+|---|---|---|
+| Split | `retriever benchmark split run` | `PDFSplitActor` |
+| Extract | `retriever benchmark extract run` | `PDFExtractionActor` |
+| Page elements | `retriever benchmark page-elements run` | `PageElementDetectionActor` |
+| OCR | `retriever benchmark ocr run` | `OCRActor` |
+| Audio extract | `retriever benchmark audio-extract run` | `MediaChunkActor` + `ASRActor` |
+| All | `retriever benchmark all run` | runs the above in sequence |
+
+If flags below look stale, re-check `retriever benchmark <stage> run --help`.
+
+## When to use this
+
+- You suspect a specific pipeline stage is the bottleneck and want
+  rows/sec numbers under controlled load.
+- You're sizing Ray actor counts / GPU fractions for [[pipeline]] / [[ingest]]
+  and need empirical numbers per stage.
+- You want a regression-style benchmark across machines or releases (pair
+  with [[harness]] for orchestration).
+
+**Use a different command when:**
+
+- You want end-to-end ingest, not stage-isolated numbers → [[ingest]] or
+  [[pipeline]] with a stopwatch.
+- You want recall/QA quality, not throughput → [[recall]] / [[eval]].
+
+## Canonical invocations
+
+Benchmark the page-element detector alone:
+
+```bash
+retriever benchmark page-elements run --help   # see options
+retriever benchmark page-elements run
+```
+
+Benchmark OCR (v2 by default; pair with [[pipeline]]'s `--ocr-version`):
+
+```bash
+retriever benchmark ocr run
+```
+
+Run all stage benchmarks in sequence and print a summary:
+
+```bash
+retriever benchmark all run --num-gpus 0.5 --num-cpus 1.0
+```
+
+## Inputs
+
+- Each `run` command takes its own flag set (run `--help` on the
+  individual subcommand). Common shape: row count, batch size, GPU/CPU
+  fractions per actor, optional remote NIM URL.
+
+## Outputs
+
+- Stdout report with per-actor throughput in rows/sec, plus headers per
+  stage (e.g. `=== benchmark: page-elements ===`).
+
+## Key flags (`all run`)
+
+| Flag | Default | Notes |
+|---|---|---|
+| `--num-gpus` | `1.0` | GPUs reserved per page-elements / OCR actor. |
+| `--num-cpus` | `1.0` | CPUs reserved per actor. |
+| `--rows-page-elements` etc. | per-stage | Synthetic rows per stage benchmark. |
+
+## Reading the results
+
+- Numbers come from a synthetic Ray Dataset; they're representative of the
+  stage in isolation, not of end-to-end throughput.
+- To convert to [[pipeline]] tuning: take the slowest stage's per-actor
+  rows/sec and divide your target rate by it (rounding up) → actors needed.
+
+## Common failure modes
+
+- **Page-elements benchmark stalls** — needs YOLOX weights or a remote
+  endpoint. Pass the URL flags or pre-cache weights.
+- **Benchmark numbers don't match [[pipeline]]** — micro-benchmarks exclude
+  inter-stage queues / batching overhead. Treat as upper bounds.
+- **`CUDA OOM`** — drop `--num-gpus` (fractional) or `*-batch-size` per
+  stage.
+
+## Related
+
+- [[pipeline]] — apply the actor counts derived from these benchmarks.
+- [[harness]] — runs benchmarks across configs/datasets and stores results.
diff --git a/.claude/skills/nemo-retriever/references/chart.md b/.claude/skills/nemo-retriever/references/chart.md
new file mode 100644
index 0000000000..0adc5319c0
--- /dev/null
+++ b/.claude/skills/nemo-retriever/references/chart.md
@@ -0,0 +1,85 @@
+# retriever chart
+
+Chart-specific enrichment over already-extracted primitives — parses chart
+images (titles, axes, series, values) and adds them as structured text to
+each chart primitive. Two related subcommands:
+
+- `retriever chart stage run` — enrich an existing primitives DataFrame.
+- `retriever chart stage graphic-elements` — run the extract+detect path
+  starting from PDFs, with chart extraction enabled.
+
+If flags below look stale, re-check `retriever chart stage --help`.
+
+## When to use this
+
+- You already ran [[pdf]] (or another extractor) and want to add chart
+  parsing on top of the primitives without re-extracting.
+- You're iterating on chart parsing parameters and don't want to rerun the
+  whole pipeline.
+
+**Use a different command when:**
+
+- You want full ingest with charts → [[ingest]] / [[pipeline]] with
+  `--extract-charts`.
+- You want only PDF extraction (no chart parsing) → [[pdf]].
+
+## Canonical invocations
+
+Enrich a primitives parquet with chart parsing:
+
+```bash
+retriever chart stage run \
+  --input out/extractions.parquet \
+  --output out/extractions.+chart.parquet
+```
+
+Extract from PDFs with charts enabled:
+
+```bash
+retriever chart stage graphic-elements \
+  --input-dir data/pdfs/ \
+  --extract-charts \
+  --yolox-http-endpoint http://page-elements:8000/v1/infer
+```
+
+## Inputs
+
+- **`run`**: `--input` parquet/jsonl/json with a `metadata` column.
+- **`graphic-elements`**: `--input-dir` of PDFs (same shape as `retriever pdf
+  stage page-elements`).
+
+## Outputs
+
+- **`run`**: enriched DataFrame at `--output` (defaults to
+  `<input>.+chart<ext>`). Chart primitives gain parsed structured text in
+  their `text` field.
+- **`graphic-elements`**: per-PDF `*.pdf_extraction.json` sidecars including
+  chart primitives.
+
+## Key flags (`chart stage run`)
+
+| Flag | Default | Notes |
+|---|---|---|
+| `--input` | — | Required. `.parquet`, `.jsonl`, or `.json` with `metadata`. |
+| `--output` | `<input>.+chart<ext>` | Output path. |
+| `--config` | auto-discover | YAML config (section: `chart`). |
+
+## Key flags (`chart stage graphic-elements`)
+
+Same as `retriever pdf stage page-elements` plus `--extract-charts` toggled
+on by default. See [[pdf]] for the full flag table.
+
+## Common failure modes
+
+- **`KeyError: 'metadata'`** — input DataFrame is missing the `metadata`
+  column. Make sure you fed it primitives JSON/parquet from
+  `retriever pdf stage` or [[pipeline]].
+- **No chart rows in output** — the input has no rows with
+  `metadata.content_metadata.type == "structured"` and a chart subtype. Run
+  extraction with `--extract-charts` first.
+
+## Related
+
+- [[pdf]] — generate the primitives that `chart stage run` consumes.
+- [[pipeline]] — wraps chart extraction into the graph pipeline.
+- [[ingest]] — end-to-end including charts when enabled.
diff --git a/.claude/skills/nemo-retriever/references/compare.md b/.claude/skills/nemo-retriever/references/compare.md
new file mode 100644
index 0000000000..4fa7c4aef6
--- /dev/null
+++ b/.claude/skills/nemo-retriever/references/compare.md
@@ -0,0 +1,46 @@
+# retriever compare
+
+Comparison utilities. Optional subcommands are registered lazily — if the
+relevant module is installed, you'll see:
+
+- `retriever compare json` — diff two JSON files (extraction sidecars, eval
+  outputs, recall outputs).
+- `retriever compare results` — diff two retrieval/eval result bundles.
+
+Run `retriever compare --help` to see which subcommands are present in your
+install.
+
+## When to use this
+
+- You changed an extraction flag, ran the pipeline twice, and want a
+  semantic diff of the outputs (not a textual diff).
+- You ran [[recall]] or [[eval]] twice and want to know which queries
+  regressed / improved.
+
+**Use a different command when:**
+
+- You want a single-number metric, not a diff → [[recall]] / [[eval]].
+- You want a UI / portal for sweep comparison → [[harness]] (`portal` /
+  `compare`).
+
+## Canonical invocations
+
+```bash
+retriever compare json before.json after.json
+retriever compare results runs/baseline/ runs/candidate/
+```
+
+Run `--help` on each subcommand for the exact flag set; the modules are
+optional and may expose different options across releases.
+
+## Common failure modes
+
+- **`retriever compare json` not found** — the `compare_json` module isn't
+  installed. Install the extras (or upgrade the package).
+- **Diff shows everything different** — files have non-stable key order or
+  embedded timestamps; the subcommand normalises common cases but not all.
+
+## Related
+
+- [[recall]] / [[eval]] — produce the artifacts this command compares.
+- [[harness]] `compare` — session-level comparison with summaries.
diff --git a/.claude/skills/nemo-retriever/references/eval.md b/.claude/skills/nemo-retriever/references/eval.md
new file mode 100644
index 0000000000..88bfcc9769
--- /dev/null
+++ b/.claude/skills/nemo-retriever/references/eval.md
@@ -0,0 +1,107 @@
+# retriever eval
+
+End-to-end QA evaluation: retrieval + generation + judge. Three
+subcommands:
+
+- `retriever eval run` — run a configured QA sweep.
+- `retriever eval export` — turn a LanceDB table into FileRetriever JSON for
+  use as a static retriever in an eval config.
+- `retriever eval build-page-index` — build a page-level markdown index for
+  full-page eval mode.
+
+If flags below look stale, re-check `retriever eval <subcmd> --help`.
+
+## When to use this
+
+- You want a single number for "is this retrieval+generation setup good?"
+  (judge score, per-question answers, etc.).
+- You're comparing models or chunking strategies and need a controlled QA
+  benchmark.
+
+**Use a different command when:**
+
+- You only need retrieval recall metrics → [[recall]].
+- You want a single ad-hoc query → [[query]].
+- You're tuning extraction quality, not QA → [[pipeline]] / [[pdf]].
+
+## Canonical invocations
+
+Run a sweep from a config file:
+
+```bash
+retriever eval run --config evaluation/eval_sweep.yaml
+```
+
+Run a sweep from environment (Docker/CI pattern):
+
+```bash
+export RETRIEVAL_FILE=out/retrieval.json
+export QA_DATASET=path/to/qa.json
+export GEN_MODEL=...
+export JUDGE_MODEL=...
+retriever eval run --from-env
+```
+
+Export LanceDB → FileRetriever JSON so eval can consume it:
+
+```bash
+retriever eval export \
+  --lancedb-uri ./lancedb --lancedb-table nv-ingest \
+  --query-csv evaluation/queries.csv \
+  --output out/retrieval.json \
+  --top-k 5
+```
+
+Build a page index for full-page eval mode:
+
+```bash
+retriever eval build-page-index \
+  --parquet-dir out/extractions/ \
+  --output out/page_index.json
+```
+
+## Inputs / Outputs
+
+- **`run`** — config (YAML/JSON) or env vars; emits per-question results +
+  aggregated metrics.
+- **`export`** — needs a populated LanceDB + a query CSV; emits a
+  FileRetriever JSON.
+- **`build-page-index`** — needs a directory of extraction Parquets; emits
+  a JSON mapping `(pdf, page) → markdown`.
+
+## Key flags
+
+`eval run`:
+
+| Flag | Notes |
+|---|---|
+| `--config FILE` | YAML/JSON sweep config (exclusive with `--from-env`). |
+| `--from-env` | Build config from env vars (`RETRIEVAL_FILE`, `QA_DATASET`, `GEN_MODEL`, `JUDGE_MODEL`, …). |
+
+`eval export`:
+
+| Flag | Default | Notes |
+|---|---|---|
+| `--lancedb-uri` | `lancedb` | DB path. |
+| `--lancedb-table` | `nv-ingest` | Source table. **Note**: this is `--lancedb-table` (with `lancedb-` prefix), unlike [[ingest]] / [[query]] / [[recall]] / [[vector-store]] which use `--table-name`. Must point at the same table either way. |
+| `--query-csv` | — | Required. `query` (+ optional `answer`) columns. |
+| `--output` | — | Required output JSON path. |
+| `--top-k` | `5` | Chunks per query. |
+| `--embedder` | `nvidia/llama-nemotron-embed-1b-v2` | Must match ingest embedder. |
+| `--page-index FILE` | — | Enables full-page mode using `build-page-index` output. |
+
+## Common failure modes
+
+- **`run --from-env` errors with "RETRIEVAL_FILE not set"** — set every env
+  var the loader requires; `--from-env` is all-or-nothing.
+- **`export` writes an empty file** — embedder mismatch with the LanceDB
+  table (different dimension) or `--query-csv` lacks a `query` column.
+- **`build-page-index` is slow / OOM** — parquet directory is huge. Run on
+  a subset and merge JSONs, or run in a higher-memory environment.
+
+## Related
+
+- [[recall]] — retrieval-only metrics.
+- [[harness]] — orchestrates `eval`/`recall` sweeps with sessions, tags, and
+  Slack reporting.
+- [[compare]] — diff two eval runs.
diff --git a/.claude/skills/nemo-retriever/references/harness.md b/.claude/skills/nemo-retriever/references/harness.md
new file mode 100644
index 0000000000..c9d2c4e57d
--- /dev/null
+++ b/.claude/skills/nemo-retriever/references/harness.md
@@ -0,0 +1,143 @@
+# retriever harness
+
+Benchmark / eval orchestration. Wraps [[recall]] / [[eval]] /
+[[benchmark]] / [[pipeline]] runs into named *sessions* with tags,
+artifacts, and (optionally) a web portal + history DB + Slack reporting.
+
+Subcommands:
+
+| Subcommand | What it does |
+|---|---|
+| `run` | One configured run against a dataset. |
+| `sweep` | Multiple runs from a sweep YAML. |
+| `nightly` | Curated nightly sweep; can post results to Slack. |
+| `summary` | Print summary for a session. |
+| `compare` | Diff two sessions. |
+| `portal` | Launch the web portal. |
+| `backfill` | Import existing `results.json` artifacts into the history DB. |
+| `runner` | Runner agent (registers with a portal manager). |
+
+If flags below look stale, re-check `retriever harness <subcmd> --help`.
+
+## When to use this
+
+- You want reproducible, tagged eval/benchmark sessions you can come back
+  to later.
+- You're triaging nightly regressions and want the session+Slack flow.
+- You want to compare two sessions visually or via CLI.
+
+**Use a different command when:**
+
+- One-off run, no session bookkeeping → [[recall]] / [[eval]] /
+  [[benchmark]].
+- You're tuning extraction directly → [[pipeline]].
+
+## Canonical invocations
+
+Single run against a named dataset (preset from the config):
+
+```bash
+retriever harness run \
+  --dataset bo767 \
+  --config nemo_retriever/harness/test-config.yaml \
+  --run-name "baseline-2026-05-13" \
+  --tag dataset=bo767 --tag model=llama-nemotron-embed-1b-v2
+```
+
+Sweep:
+
+```bash
+retriever harness sweep \
+  --config nemo_retriever/harness/test-config.yaml \
+  --runs-config nemo_retriever/harness/sweep-runs.yaml \
+  --session-prefix sweep
+```
+
+Nightly with Slack:
+
+```bash
+retriever harness nightly \
+  --config nemo_retriever/harness/test-config.yaml \
+  --runs-config nemo_retriever/harness/nightly-runs.yaml
+```
+
+Replay a previous run to Slack without rerunning:
+
+```bash
+retriever harness nightly --replay runs/2026-05-12/session_summary.json
+```
+
+Compare two sessions:
+
+```bash
+retriever harness compare runs/baseline/ runs/candidate/
+```
+
+Print a session summary:
+
+```bash
+retriever harness summary runs/2026-05-13/
+```
+
+Launch the portal:
+
+```bash
+retriever harness portal --host 0.0.0.0 --port 8100
+```
+
+Backfill old artifacts into the history DB:
+
+```bash
+retriever harness backfill --artifacts-dir runs/ --db harness-history.db
+```
+
+## Key flags
+
+`harness run`:
+
+| Flag | Notes |
+|---|---|
+| `--dataset` | Required. Dataset name (from config) or direct path. |
+| `--preset` | Override the preset selection. |
+| `--config` | Harness test config YAML. |
+| `--run-name` | Label persisted in artifacts. |
+| `--override KEY=VALUE` | Per-run config override (repeatable). |
+| `--tag` | Tag persisted in artifacts (repeatable). |
+| `--recall-required/--no-recall-required` | Override the recall-required gate. |
+
+`harness sweep` / `nightly`:
+
+| Flag | Notes |
+|---|---|
+| `--runs-config` | YAML listing the runs to execute. |
+| `--preset` | Force preset for all runs. |
+| `--session-prefix` | Directory prefix (sweep only). |
+| `--tag` | Session-level tag (repeatable). |
+| `--dry-run` | Print the plan, don't execute. |
+| `--skip-slack` | Don't post to Slack (nightly only). |
+| `--replay PATH` | Replay an existing session to Slack (nightly only). |
+
+## Outputs
+
+- Session directory containing per-run subdirectories, each with
+  `results.json`, configs, and logs.
+- `session_summary.json` aggregating metrics.
+- Optional rows in the history DB (`backfill` / `portal`).
+- Optional Slack post (`nightly`).
+
+## Common failure modes
+
+- **`--dataset` not found** — name doesn't resolve in `--config`'s dataset
+  registry. Pass an absolute path or fix the name.
+- **`Slack post failed`** — env vars missing; pass `--skip-slack` or
+  configure the webhook.
+- **`portal` shows no runs** — history DB is empty. Run `backfill` once
+  against an artifacts root.
+- **`recall-required` gate fails** — a run's recall@k dropped below
+  threshold; the session is marked failed. Investigate before overriding
+  with `--no-recall-required`.
+
+## Related
+
+- [[recall]] / [[eval]] / [[benchmark]] — the underlying runners.
+- [[compare]] — non-harness JSON-level diff.
diff --git a/.claude/skills/nemo-retriever/references/html.md b/.claude/skills/nemo-retriever/references/html.md
new file mode 100644
index 0000000000..3e85f8d547
--- /dev/null
+++ b/.claude/skills/nemo-retriever/references/html.md
@@ -0,0 +1,73 @@
+# retriever html
+
+HTML extraction: `markitdown` converts HTML → Markdown, which is then split
+by tokenizer into chunks. Writes `<stem>.html_extraction.json` sidecars in
+the standard primitives shape.
+
+If flags below look stale, re-check `retriever html run --help`.
+
+## When to use this
+
+- You scraped a set of HTML pages and want them in the retriever pipeline.
+- You want the same downstream contract as [[txt]] but for HTML inputs.
+
+**Use a different command when:**
+
+- Input is plain text → [[txt]].
+- You want to run full ingest end-to-end on HTML → [[pipeline]] with
+  `--input-type html`.
+
+## Canonical invocations
+
+Default chunking:
+
+```bash
+retriever html run --input-dir data/html/
+```
+
+Smaller chunks with overlap:
+
+```bash
+retriever html run --input-dir data/html/ --max-tokens 256 --overlap 32
+```
+
+## Inputs
+
+- **`--input-dir DIR`** — required, scanned for `*.html`.
+
+## Outputs
+
+- `<stem>.html_extraction.json` per file (next to source by default, or in
+  `--output-dir`).
+- Same primitives-like shape as stage5 input.
+
+## Downstream
+
+```bash
+retriever local stage5 run --input-dir <dir> --pattern "*.html_extraction.json"
+retriever local stage6 run --input-dir <dir>
+```
+
+Or [[pipeline]] with `--input-type html`.
+
+## Key flags
+
+| Flag | Default | Notes |
+|---|---|---|
+| `--max-tokens` | `512` | Per-chunk cap. |
+| `--overlap` | `0` | Tokens of overlap. |
+| `--encoding` | `utf-8` | HTML file encoding. |
+| `--limit` | — | Cap number of files processed. |
+
+## Common failure modes
+
+- **Heavy boilerplate in chunks (nav menus, footers)** — `markitdown` is
+  intentionally low-magic. Strip nav/footer in a pre-step if it pollutes
+  retrieval.
+- **JS-rendered pages produce near-empty output** — `markitdown` doesn't run
+  JS. Pre-render with a headless browser before feeding here.
+
+## Related
+
+- [[txt]] — sibling for plain-text inputs.
+- [[pipeline]] — full extract → embed → VDB for HTML.
diff --git a/.claude/skills/nemo-retriever/references/image.md b/.claude/skills/nemo-retriever/references/image.md
new file mode 100644
index 0000000000..c4fecf079b
--- /dev/null
+++ b/.claude/skills/nemo-retriever/references/image.md
@@ -0,0 +1,64 @@
+# retriever image
+
+Visualization helpers: render YOLOX page-element / chart-element detection
+overlays on page images so you can sanity-check the detector by eye.
+
+If flags below look stale, re-check `retriever image render --help`.
+
+## When to use this
+
+- A page-element or chart detector returned suspect boxes and you want to
+  see them overlaid on the source page image.
+- You're tuning thresholds and need a quick visual diff.
+
+**Use a different command when:**
+
+- You need the actual extraction output, not a picture → [[pdf]] or
+  [[chart]].
+- You want benchmarks over the detector → [[benchmark]] (`page-elements`).
+
+## Canonical invocations
+
+Overlay a single page:
+
+```bash
+retriever image render image \
+  page_001.png \
+  page_001.detections.json \
+  --output-path page_001.overlay.png
+```
+
+Overlay every page in a directory:
+
+```bash
+retriever image render dir \
+  pages/ detections/ overlays/
+```
+
+## Inputs
+
+- **`render image`**: a PNG/JPEG `image_path` plus a `detections_path` JSON
+  (YOLOX-shaped output).
+- **`render dir`**: parallel `input_dir` / `detections_dir`, output written
+  per-image to `output_dir`. Files are matched by basename.
+
+## Outputs
+
+- A single composite image with bounding boxes + class labels drawn on top
+  of the source. **Not** a side-by-side / split layout; if you want
+  original-vs-overlay panels, compose them yourself (e.g. with ffmpeg's
+  `hstack` filter or `PIL`). `render image` writes to `--output-path`;
+  `render dir` writes into `output_dir`.
+
+## Common failure modes
+
+- **No boxes appear** — the detections JSON shape doesn't match what
+  `render` expects. Use the JSON that `retriever pdf stage page-elements`
+  (or [[pipeline]]) emitted, not a hand-rolled file.
+- **Mismatched coordinates** — detections were produced against a different
+  page render scale than the image you're overlaying on. Re-render at the
+  same DPI/`render-mode` you ran the detector with.
+
+## Related
+
+- [[pdf]] — produce the detections JSON that this command renders.
diff --git a/.claude/skills/nemo-retriever/references/ingest.md b/.claude/skills/nemo-retriever/references/ingest.md
index 2427a7d856..88d604dfc6 100644
--- a/.claude/skills/nemo-retriever/references/ingest.md
+++ b/.claude/skills/nemo-retriever/references/ingest.md
@@ -93,7 +93,11 @@ The default `ingest` runs 8 stages, in order:
   CUDA-graph capture for the embedder. Subsequent runs in the same process
   are fast; one-shot CLI invocations always pay this cost.
 - **`No existing dataset at …/nv-ingest.lance, it will be created`** — expected
-  on the first ingest into a new DB. Subsequent ingests append.
+  on the first ingest into a new DB. Subsequent ingests **always append** —
+  there is no `--overwrite` flag on `retriever ingest`. To start fresh,
+  `rm -rf <lancedb-uri>/<table-name>.lance` before running. Alternatively,
+  run [[vector-store]] (`vector-store stage run --overwrite`) on the
+  embeddings produced by the [[local]] flow.
 - **HuggingFace download on first run** — the embedder and page-element
   detector pull weights to `~/.cache/huggingface`. Needs network the first
   time; cached afterwards.
diff --git a/.claude/skills/nemo-retriever/references/local.md b/.claude/skills/nemo-retriever/references/local.md
new file mode 100644
index 0000000000..2d9eb9bf87
--- /dev/null
+++ b/.claude/skills/nemo-retriever/references/local.md
@@ -0,0 +1,89 @@
+# retriever local
+
+Non-distributed, pandas-based runner that exposes the pipeline as discrete
+numbered stages (`stage1` … `stage7`, plus `stage999` for post-mortem).
+Stages are intentionally separable so you can rerun one without touching
+the others.
+
+> The top-level group is registered as a placeholder; subcommands are
+> contributed by per-stage modules. Run `retriever local --help` (or the
+> per-stage `--help`) to see what's currently wired up in your install.
+
+## When to use this
+
+- You're iterating on a single stage (e.g. tweak chunking, rerun stage5,
+  re-upload stage6) without redoing extraction.
+- You want to debug a specific stage with `pdb` / breakpoints — no Ray, no
+  actors, deterministic ordering.
+- You need the intermediate sidecar files (per-stage JSON/parquet) for
+  inspection.
+
+**Use a different command when:**
+
+- You want full ingest in one command → [[ingest]] or [[pipeline]].
+- You need parallelism on a cluster → [[pipeline]] in batch mode.
+- You want a long-running endpoint → [[service]].
+
+## Pipeline stages (mapped to files)
+
+Stages live in `nemo_retriever/src/nemo_retriever/local/stages/`:
+
+| Stage | File | What it does |
+|---|---|---|
+| `stage1` | `stage1_pdf_extraction.py` | PDF extraction (same idea as [[pdf]]). |
+| `stage2` | `stage2_infographic_extraction.py` | Infographic enrichment. |
+| `stage3` | `stage3_table_extractor.py` | Table structure / OCR. |
+| `stage4` | `stage4_chart_extractor.py` | Chart enrichment (same idea as [[chart]]). |
+| `stage5` | `stage5_text_embeddings.py` | Text embedding → `*.text_embeddings.json`. |
+| `stage6` | `stage6_vdb_upload.py` | LanceDB upload (same idea as [[vector-store]]). |
+| `stage7` | `stage7_vdb_query.py` | Single-query lookup against LanceDB. |
+| `stage999` | `stage999_post_mortem_analysis.py` | Post-run analysis. |
+
+Each stage's `run` reads sidecars matching a pattern (e.g.
+`*.pdf_extraction.json` for stage5) and writes the next sidecar type.
+
+## Canonical flow
+
+```bash
+# 1. extract
+retriever local stage1 run --input-dir data/pdfs/
+
+# 2. enrich (optional)
+retriever local stage3 run --input-dir data/pdfs/   # tables
+retriever local stage4 run --input-dir data/pdfs/   # charts
+
+# 3. embed
+retriever local stage5 run --input-dir data/pdfs/ --pattern "*.pdf_extraction.json"
+
+# 4. upload to LanceDB
+retriever local stage6 run --input-dir data/pdfs/
+
+# 5. query
+retriever local stage7 run --query "what is in chart 1?"
+```
+
+For txt/html, swap stage1 for [[txt]] / [[html]] and adjust stage5's
+`--pattern`.
+
+## Inputs / outputs
+
+Each stage takes `--input-dir` (and stage-specific flags) and writes
+sidecars next to source files. The pattern is consistent: a stage reads the
+sidecars an earlier stage wrote (for example, stage5 reads
+`*.pdf_extraction.json` by default) and writes its own type.
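+
+To see which sidecars a stage could pick up, you can glob for them
+yourself. The sketch below is a hypothetical helper, not part of the CLI. It
+assumes sidecars live somewhere under the input directory and only knows the
+suffixes named in these references.
+
+```python
+from collections import Counter
+from pathlib import Path
+
+# Sidecar suffixes mentioned in these references; extend as needed.
+SIDECAR_SUFFIXES = (
+    ".pdf_extraction.json",
+    ".txt_extraction.json",
+    ".html_extraction.json",
+    ".text_embeddings.json",
+)
+
+
+def count_sidecars(input_dir):
+    """Count sidecar files under input_dir, grouped by sidecar suffix."""
+    counts = Counter()
+    for path in Path(input_dir).rglob("*.json"):
+        for suffix in SIDECAR_SUFFIXES:
+            if path.name.endswith(suffix):
+                counts[suffix] += 1
+                break  # a file belongs to at most one sidecar type
+    return counts
+```
+
+A zero count for the suffix a stage expects (say `.pdf_extraction.json`
+before stage5) means the previous stage has not written its output there.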
+
+## Common failure modes
+
+- **`stage5: no files matched pattern`** — `--pattern` defaults to
+  `*.pdf_extraction.json`; pass `*.txt_extraction.json` /
+  `*.html_extraction.json` for those inputs.
+- **`stage6` overwrites a table I wanted to append to** — pass the
+  stage-appropriate flag, or use [[vector-store]] which has explicit
+  `--append`.
+- **First `stage5` run is slow** — model load. Same trade-off as the
+  one-shot CLIs; reuse the process for multiple inputs in research scripts.
+
+## Related
+
+- [[pdf]] / [[chart]] / [[vector-store]] — standalone equivalents of
+  individual stages.
+- [[pipeline]] — distributed graph version of the same flow.
diff --git a/.claude/skills/nemo-retriever/references/pdf.md b/.claude/skills/nemo-retriever/references/pdf.md
new file mode 100644
index 0000000000..dc4ef270f7
--- /dev/null
+++ b/.claude/skills/nemo-retriever/references/pdf.md
@@ -0,0 +1,122 @@
+# retriever pdf
+
+Single-stage PDF extraction: scan a directory of PDFs and write per-PDF
+primitives JSON sidecars (text / table / chart / image / page-image rows),
+without running embedding or vector-DB stages.
+
+If flags below look stale, re-check `retriever pdf stage page-elements --help`.
+
+## When to use this
+
+- You only need extraction output (primitives JSON) — no embeddings, no
+  LanceDB. Useful for debugging, comparing extraction methods, or feeding a
+  custom downstream pipeline.
+- You want to swap extraction *methods* (pdfium, pdfium_hybrid, ocr,
+  nemotron_parse, tika) without rebuilding the whole pipeline.
+- You need to point at a remote YOLOX / Nemotron Parse NIM rather than the
+  bundled embedded models.
+
+**Use a different command when:**
+
+- You want the full extract → embed → ingest flow → [[ingest]] or
+  [[pipeline]].
+- You want only chart enrichment over already-extracted primitives →
+  [[chart]].
+- You want to inspect extraction overlays visually → [[image]].
+- You want to benchmark extraction throughput → [[benchmark]] (`split` /
+  `extract` / `page-elements`).
+
+## Canonical invocations
+
+Default extraction (pdfium, text only) on a directory:
+
+```bash
+retriever pdf stage page-elements --input-dir data/pdfs/
+```
+
+Extract everything (text + tables + charts + images) via pdfium + remote
+YOLOX:
+
+```bash
+retriever pdf stage page-elements \
+  --input-dir data/pdfs/ \
+  --method pdfium \
+  --yolox-http-endpoint http://page-elements:8000/v1/infer \
+  --extract-text --extract-tables --extract-charts --extract-images
+```
+
+Use NemotronParse instead of pdfium+YOLOX:
+
+```bash
+retriever pdf stage page-elements \
+  --input-dir data/pdfs/ \
+  --method nemotron_parse \
+  --nemotron-parse-http-endpoint http://nemotron-parse:8000/v1/infer
+```
+
+Write all sidecars to a single output directory:
+
+```bash
+retriever pdf stage page-elements \
+  --input-dir data/pdfs/ \
+  --json-output-dir out/extractions/
+```
+
+## Inputs
+
+- **`--input-dir DIR`** — recursively scanned for `*.pdf`. Required (or via
+  `--config`).
+- **`--config FILE`** — optional ingest YAML. Auto-discovered from
+  `./ingest-config.yaml` then `$HOME/.ingest-config.yaml`. CLI flags override
+  YAML values.
+
+## Outputs
+
+- One `<pdf>.pdf_extraction.json` sidecar per input PDF, written next to the
+  PDF unless `--json-output-dir` is set.
+- Each sidecar is a list of primitives. Per primitive: `text`,
+  `source_id`/`path`, `page_number`, `metadata` (type, bbox, render info).
+
+These sidecars are the canonical input for the later stages of the
+non-distributed `local stage*` flow (`stage5` embed, `stage6` VDB upload).
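+
+For a quick look at what one sidecar contains, a short script is usually
+enough. This is a minimal sketch under stated assumptions: the file's top
+level is a JSON list of objects, and the primitive type sits under a `type`
+key inside `metadata`. Neither detail is confirmed beyond the field list
+above, and `summarize_sidecar` is a hypothetical name.
+
+```python
+import json
+from collections import Counter
+
+
+def summarize_sidecar(path):
+    """Tally primitives in one sidecar by (page_number, metadata type)."""
+    with open(path, encoding="utf-8") as fh:
+        primitives = json.load(fh)  # assumed: a JSON list of dicts
+
+    tally = Counter()
+    for prim in primitives:
+        meta = prim.get("metadata") or {}
+        # "type" inside metadata is an assumption; fall back to "unknown".
+        tally[(prim.get("page_number"), meta.get("type", "unknown"))] += 1
+    return tally
+```
+
+Comparing the tallies from two `--method` runs over the same PDF shows at a
+glance which pages gained or lost tables, charts, or images.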
+
+## Key flags
+
+| Flag | Default | Notes |
+|---|---|---|
+| `--method` | `pdfium` | `pdfium`, `pdfium_hybrid`, `ocr`, `nemotron_parse`, `tika`. |
+| `--yolox-grpc-endpoint` / `--yolox-http-endpoint` | — | Required for `pdfium` family when extracting page elements. |
+| `--nemotron-parse-grpc-endpoint` / `--nemotron-parse-http-endpoint` | — | Required for `method=nemotron_parse`. |
+| `--extract-text/--extract-tables/--extract-charts/--extract-images/--extract-infographics/--extract-page-as-image` | text only | Toggle which primitives are written. |
+| `--text-depth` | `page` | `page` or `document`. |
+| `--render-mode` | `fit_to_model` | `full_dpi` (DPI-then-resize) or `fit_to_model` (≈93 DPI for US Letter). |
+| `--limit` | — | Cap number of PDFs processed (debugging). |
+
+## Method cheat-sheet
+
+- **`pdfium`** — fast, native text + YOLOX-driven element detection. Default.
+- **`pdfium_hybrid`** — pdfium text + OCR fallback per page where text
+  extraction was empty/sparse.
+- **`ocr`** — render each page, OCR everything. Use for scanned PDFs.
+- **`nemotron_parse`** — NemotronParse end-to-end (text + tables + charts +
+  layout) via a single NIM call.
+- **`tika`** — Apache Tika fallback (no element detection).
+
+## Common failure modes
+
+- **`YOLOX endpoint is required for method='pdfium'`** — pass
+  `--yolox-grpc-endpoint` or `--yolox-http-endpoint`. Without it, only
+  `--extract-text` works.
+- **Empty primitives for scanned PDFs with `--method pdfium`** — there's no
+  embedded text. Switch to `--method ocr` or `pdfium_hybrid`.
+- **No sidecars written** — `--write-json-outputs/--no-write-json-outputs`
+  toggles output. Default is on; check you didn't disable it via `--config`.
+- **`auth-token` errors against NGC NIMs** — set `--auth-token` or
+  `NVIDIA_API_KEY` in the environment.
+
+## Related
+
+- [[chart]] — enrich the primitives from this stage with chart parsing.
+- [[ingest]] — full pipeline that wraps this stage end-to-end.
+- [[pipeline]] — graph-based pipeline exposing per-stage knobs.
+- [[benchmark]] — measure throughput of this stage.
diff --git a/.claude/skills/nemo-retriever/references/pipeline-stages.md b/.claude/skills/nemo-retriever/references/pipeline-stages.md
new file mode 100644
index 0000000000..25ae629992
--- /dev/null
+++ b/.claude/skills/nemo-retriever/references/pipeline-stages.md
@@ -0,0 +1,65 @@
+# pipeline stages
+
+Cross-reference for the **internal pipeline stages** (page-elements, ocr,
+table-structure, graphic-elements, embed, caption, dedup, store). These are
+not top-level CLI commands of their own; they're surfaced as:
+
+1. Flag groups under [[pipeline]] (`pipeline run`).
+2. Stand-alone benchmark subcommands under [[benchmark]].
+3. In some cases, dedicated subcommands under other groups (e.g.
+   page-elements lives under `retriever pdf stage page-elements`).
+
+Use this page to figure out *which* command to reach for when you want to
+exercise or tune a specific stage.
+
+## Stage map
+
+| Stage | What it does | Tuned via | Benchmarked via | Standalone CLI |
+|---|---|---|---|---|
+| **pdf-split** | Split PDFs into per-page tasks | `--pdf-split-batch-size` | `retriever benchmark split run` | — |
+| **pdf-extract** | Native PDF text/structure extraction | `--method`, `--pdf-extract-*` | `retriever benchmark extract run` | [[pdf]] |
+| **page-elements** | YOLOX text/table/chart/image detection | `--page-elements-invoke-url`, `--page-elements-actors`, `--page-elements-batch-size`, `--page-elements-{cpus,gpus}-per-actor` | `retriever benchmark page-elements run` | `retriever pdf stage page-elements` (see [[pdf]]) |
+| **ocr** | OCR for sparse text regions | `--ocr-invoke-url`, `--ocr-version` (`v1`/`v2`), `--ocr-{actors,batch-size,cpus-per-actor,gpus-per-actor}` | `retriever benchmark ocr run` | — |
+| **table-structure** | Structured OCR over detected tables | `--use-table-structure`, `--table-structure-invoke-url`, `--table-output-format` | — | (`nemo_retriever.table.commands` exposes `run-structure-ocr` under the table sub-app where wired) |
+| **graphic-elements** | Chart parsing | `--use-graphic-elements`, `--graphic-elements-invoke-url`, `--extract-charts` | — | [[chart]] |
+| **infographic** | Infographic parsing | `--extract-infographics` | — | (`retriever local stage2`) |
+| **dedup** | IoU-based primitive dedup | `--dedup/--no-dedup`, `--dedup-iou-threshold` | — | — |
+| **caption** | VLM caption for image primitives | `--caption/--no-caption`, `--caption-invoke-url`, `--caption-model-name`, `--caption-temperature`, `--caption-top-p`, `--caption-max-tokens` | — | — |
+| **udf** | User-defined transforms (passthrough by default) | (code) | — | — |
+| **embed** | Embed primitives | `--embed-invoke-url`, `--embed-model-name`, `--embed-modality`, `--embed-granularity`, `--embed-{actors,batch-size,cpus-per-actor,gpus-per-actor}`, `--local-ingest-embed-backend` | — | `retriever local stage5` |
+| **audio-extract** | Chunk media + ASR | `--segment-audio`, `--audio-split-type`, `--audio-split-interval`, `--audio-match-tolerance`, audio NIM env | `retriever benchmark audio-extract run` | [[audio]] |
+| **store (VDB)** | Write embeddings to LanceDB | `--store-actors`, `--lancedb-uri`, `--table-name` (set on [[ingest]] / [[vector-store]]) | — | [[vector-store]] |
+| **query** | Embed query + search | (read side) | — | [[query]] / [[recall]] |
+
+## Choosing the right entry point
+
+- **"I want to ingest a corpus end-to-end"** → [[ingest]] (defaults) or
+  [[pipeline]] (per-stage control).
+- **"I only want this one stage's output"** → the *Standalone CLI* column.
+- **"I want to know how fast this stage is on this machine"** → the
+  *Benchmarked via* column.
+- **"I want to route this stage through a NIM"** → set the matching
+  `--*-invoke-url` on [[pipeline]] (and `--api-key`).
+- **"I want to size Ray actors for this stage"** → tune the
+  `--<stage>-actors` / `--<stage>-batch-size` /
+  `--<stage>-{cpus,gpus}-per-actor` quartet on [[pipeline]].
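+
+When sweeping actor settings from a script, it helps to generate the sizing
+flags rather than hand-type them. The helper below is a hypothetical sketch.
+It only assembles the flag names documented above and does not check that a
+given stage actually accepts them.
+
+```python
+def actor_flags(stage, actors=None, batch_size=None, cpus=None, gpus=None):
+    """Build the per-stage sizing flags for `retriever pipeline run`."""
+    pairs = [
+        (f"--{stage}-actors", actors),
+        (f"--{stage}-batch-size", batch_size),
+        (f"--{stage}-cpus-per-actor", cpus),
+        (f"--{stage}-gpus-per-actor", gpus),
+    ]
+    args = []
+    for flag, value in pairs:
+        if value is not None:  # leave unset knobs at the CLI default
+            args += [flag, str(value)]
+    return args
+
+
+cmd = ["retriever", "pipeline", "run", "data/pdfs/"]
+cmd += actor_flags("page-elements", actors=4, gpus=0.5)
+cmd += actor_flags("embed", actors=1, batch_size=64)
+print(" ".join(cmd))
+```
+
+The resulting list can be passed straight to `subprocess.run(cmd)`.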
+
+## Stage ordering
+
+Default order for a PDF input under [[pipeline]] / [[ingest]]:
+
+```
+pdf-split → pdf-extract → page-elements → ocr
+         → (table-structure)  (graphic-elements)  (infographic)
+         → dedup → (caption) → udf → embed → store
+```
+
+Audio swaps the head: `audio-extract` (chunk + ASR) replaces
+pdf-split/pdf-extract/page-elements/ocr; the tail (embed, store) is the
+same. Txt/html similarly replace the head with [[txt]] / [[html]].
+
+## Related
+
+- [[pipeline]] — the command that wires every stage above together.
+- [[benchmark]] — per-stage rows/sec.
+- [[local]] — non-distributed, file-per-stage version of the same flow.
diff --git a/.claude/skills/nemo-retriever/references/pipeline.md b/.claude/skills/nemo-retriever/references/pipeline.md
new file mode 100644
index 0000000000..6de2ae6780
--- /dev/null
+++ b/.claude/skills/nemo-retriever/references/pipeline.md
@@ -0,0 +1,148 @@
+# retriever pipeline
+
+Graph-based end-to-end ingestion pipeline. Same outcome as [[ingest]]
+(documents → LanceDB) but exposes per-stage knobs for extraction methods,
+NIM endpoints, Ray actor counts, embedding model, dedup/caption, audio/video
+options, and storage.
+
+Use `retriever pipeline run --help` to see *all* flag groups — there are
+many. This page covers the groups and the most-used flags within each.
+
+## When to use this
+
+- You need fine-grained control over a pipeline stage (e.g. swap the OCR
+  model, set per-actor GPU fractions, route through a remote NIM, use a
+  different embedder).
+- You're tuning throughput on a Ray cluster and need actor / batch-size
+  knobs.
+- You want to ingest non-PDF inputs (audio, video, txt, html, image)
+  through the same graph.
+
+**Use a different command when:**
+
+- Defaults are fine → [[ingest]] (one flag, same outcome).
+- You only need a single stage's output → [[pdf]], [[chart]], [[audio]],
+  [[txt]], [[html]].
+- You want long-running service mode → [[service]].
+- You want a non-Ray local debug runner → [[local]].
+- You want throughput numbers per stage → [[benchmark]].
+
+## Canonical invocations
+
+Default batch ingest of a PDF directory:
+
+```bash
+retriever pipeline run data/pdfs/
+```
+
+In-process (no Ray) for quick local runs:
+
+```bash
+retriever pipeline run data/pdfs/ --run-mode inprocess
+```
+
+Ingest audio:
+
+```bash
+retriever pipeline run data/audio/ --input-type audio
+```
+
+Route through remote NIMs (no local GPU). Note: `--use-table-structure` and
+`--use-graphic-elements` default to **off** — passing the matching
+`--*-invoke-url` alone is not enough; the `--use-*` flag must also be set
+to enable that stage.
+
+```bash
+retriever pipeline run data/pdfs/ \
+  --page-elements-invoke-url http://page-elements:8000/v1/infer \
+  --ocr-invoke-url http://ocr:8000/v1/infer \
+  --use-table-structure \
+  --table-structure-invoke-url http://table-structure:8000/v1/infer \
+  --use-graphic-elements \
+  --graphic-elements-invoke-url http://graphic-elements:8000/v1/infer \
+  --embed-invoke-url http://embed:8000/v1/embed \
+  --api-key "$NVIDIA_API_KEY"
+```
+
+Tune Ray actor counts for a busy stage:
+
+```bash
+retriever pipeline run data/pdfs/ \
+  --page-elements-actors 4 --page-elements-gpus-per-actor 0.5 \
+  --ocr-actors 2 --ocr-gpus-per-actor 1.0 \
+  --embed-actors 1 --embed-batch-size 64
+```
+
+## Inputs
+
+- **Positional `INPUT_PATH`** — file or directory of documents. Required.
+- **`--input-type`** — `pdf` (default) / `doc` / `txt` / `html` / `image` /
+  `audio`.
+
+## Outputs
+
+- LanceDB table populated by the `IngestVdbOperator` sink (default:
+  `lancedb/nv-ingest.lance`). See [[query]] for reading.
+- If `--store-images-uri` is set, extracted images are also persisted there.
+
+## Flag groups (from `--help`)
+
+| Group | What it controls |
+|---|---|
+| **I/O and Execution** | `--run-mode` (`batch` / `inprocess` / `service`), `--input-type`, `--debug`, `--log-file`. |
+| **PDF / Document Extraction** | `--method`, `--dpi`, `--extract-text/--extract-tables/--extract-charts/--extract-infographics/--extract-page-as-image`, `--use-graphic-elements`, `--use-table-structure`, `--table-output-format`. |
+| **Remote NIM Endpoints** | `--api-key`, plus `--*-invoke-url` for `page-elements`, `ocr`, `graphic-elements`, `table-structure`, `embed`. `--ocr-version v1/v2`. |
+| **Embedding** | `--embed-model-name`, `--embed-modality`, `--embed-granularity`, `--local-ingest-embed-backend` (`vllm`/`hf`), `--text-elements-modality`, `--structured-elements-modality`. |
+| **Dedup and Caption** | `--dedup/--no-dedup`, `--dedup-iou-threshold`, `--caption/--no-caption`, `--caption-invoke-url`, `--caption-model-name`, GPU fractions, `--caption-temperature`/`--caption-top-p`/`--caption-max-tokens`. |
+| **Storage and Text Chunking** | `--store-images-uri`, `--text-chunk`, `--text-chunk-max-tokens`, `--text-chunk-overlap-tokens`. |
+| **Ray / Batch Tuning** | `--ray-address`, per-stage `*-actors`/`*-batch-size`/`*-cpus-per-actor`/`*-gpus-per-actor` for `page-elements`, `ocr`, `embed`, `nemotron-parse`, plus `--store-actors`, `--pdf-split-batch-size`, `--pdf-extract-*`. |
+| **Audio** | `--segment-audio`, `--audio-split-type`/`--audio-split-interval`, `--audio-match-tolerance`. |
+| **Video** | `--video-extract-audio`, video-specific split/sampling flags. |
+
+## Pipeline stages (what runs end-to-end)
+
+For a PDF input with all defaults, the graph runs roughly:
+
+1. **PDFSplitActor** — split into per-page tasks.
+2. **PDFExtractionActor** — native text/structure extraction.
+3. **PageElementDetectionActor** — YOLOX detects text/table/chart/image
+   regions. Tunable via `--page-elements-*` flags.
+4. **OCRV2Actor** / OCRActor — OCR text where extraction is sparse. Tunable
+   via `--ocr-*` flags; `--ocr-version v1` for the legacy engine.
+5. **(optional) TableStructureActor** — structured-OCR on detected tables
+   when `--use-table-structure` is set; route via
+   `--table-structure-invoke-url`.
+6. **(optional) GraphicElementsActor** — chart enrichment when
+   `--use-graphic-elements`; route via `--graphic-elements-invoke-url`.
+7. **(optional) CaptionActor** — VLM captioning when `--caption`.
+8. **UDFOperator** — user-defined transforms (passthrough by default).
+9. **EmbedActor** — embed primitives. Tunable via `--embed-*` flags.
+10. **IngestVdbOperator (StoreOperator)** — write to LanceDB.
+
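+Model-backed stages (page-elements, ocr, table-structure, graphic-elements,
+caption, embed) each have a `--*-invoke-url` for routing to a NIM. In batch
+mode, the stages listed under **Ray / Batch Tuning** also take `--*-actors` /
+`--*-batch-size` / `--*-cpus-per-actor` / `--*-gpus-per-actor` for resource
+sizing.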
+
+## Common failure modes
+
+- **Stage saturates and stalls** — bump `--<stage>-actors` and/or
+  `--<stage>-batch-size`. Use [[benchmark]] to find the bottleneck stage
+  first.
+- **"No GPU available" with `--run-mode batch`** — set
+  `--<stage>-gpus-per-actor 0` for stages you want on CPU, or pass
+  `--*-invoke-url` to offload to a NIM.
+- **Embedding mismatch on read** — `--embed-model-name` differs from what
+  [[query]] uses. Keep ingest and query embedders aligned.
+- **Output table empty** — input matched no files for `--input-type`. Check
+  globs and file extensions.
+- **Tables / charts not appearing in output despite `--*-invoke-url` set**
+  — `--use-table-structure` / `--use-graphic-elements` default to off.
+  Setting the invoke URL alone does *not* enable the stage; pass the
+  `--use-*` flag too.
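+
+The last failure mode is easy to catch before launching a long run. The
+check below is a hypothetical pre-flight sketch, not something the CLI
+provides. It scans an argument list for an invoke URL whose matching
+`--use-*` flag is missing.
+
+```python
+# Stage name -> (enable flag, URL flag), as documented on this page.
+GATED_STAGES = {
+    "table-structure": ("--use-table-structure", "--table-structure-invoke-url"),
+    "graphic-elements": ("--use-graphic-elements", "--graphic-elements-invoke-url"),
+}
+
+
+def missing_use_flags(argv):
+    """Return enable flags that are absent although the matching URL is set."""
+    # Accept both "--flag value" and "--flag=value" spellings.
+    present = {arg.split("=", 1)[0] for arg in argv if arg.startswith("--")}
+    return [
+        use_flag
+        for use_flag, url_flag in GATED_STAGES.values()
+        if url_flag in present and use_flag not in present
+    ]
+```
+
+An empty list means every URL you set belongs to a stage that is enabled.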
+
+## Related
+
+- [[ingest]] — defaults-only wrapper around this command.
+- [[local]] — non-distributed runner for debugging stages.
+- [[service]] — long-running pipeline behind an HTTP API.
+- [[benchmark]] — per-stage throughput numbers.
diff --git a/.claude/skills/nemo-retriever/references/recall.md b/.claude/skills/nemo-retriever/references/recall.md
new file mode 100644
index 0000000000..75d24815cf
--- /dev/null
+++ b/.claude/skills/nemo-retriever/references/recall.md
@@ -0,0 +1,87 @@
+# retriever recall
+
+Batch query + recall@k evaluation. Reads a CSV of ground-truth queries,
+embeds each query, searches a LanceDB table, prints per-query hits, and
+computes recall@1 / @5 / @10.
+
+If flags below look stale, re-check `retriever recall vdb-recall run --help`.
+
+## When to use this
+
+- You have labelled `(query, pdf, page)` ground truth and want recall
+  metrics for a retrieval setup.
+- Sweeping embedding models / chunking / top-k against a fixed query set.
+
+**Use a different command when:**
+
+- You want a single ad-hoc lookup → [[query]].
+- You want full QA quality (answer grading), not just retrieval recall →
+  [[eval]].
+- You want to compare two recall runs → [[compare]].
+
+## Canonical invocations
+
+Default recall against the project query set:
+
+```bash
+retriever recall vdb-recall run
+```
+
+Custom query CSV + custom table:
+
+```bash
+retriever recall vdb-recall run \
+  --query-csv my-queries.csv \
+  --top-k 10 \
+  --lancedb-uri ./my-lancedb \
+  --table-name my-corpus
+```
+
+Route embedding through a remote NIM:
+
+```bash
+retriever recall vdb-recall run \
+  --query-csv my-queries.csv \
+  --embedding-http-endpoint http://embed:8000/v1/embed
+```
+
+## Inputs
+
+- **`--query-csv FILE`** — CSV with `query,pdf_page` or `query,pdf,page`
+  columns. Default `bo767_query_gt.csv`.
+
+## Outputs
+
+- Per-query top-k hits printed to stdout.
+- A summary line with `recall@1 / @5 / @10`.
+
+The search always runs with `search_k = max(top_k, 10)`, so `recall@10`
+remains valid even when `--top-k` displays fewer hits.
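+
+The metric itself is small enough to sketch. The code below is an
+illustration of recall@k, not the command's implementation. The hit keys
+(`a_1`, `e_5`, ...) are made-up stand-ins for whatever identifies a
+`(pdf, page)` pair, and `recall_at_k` is a hypothetical name.
+
+```python
+KS = (1, 5, 10)
+
+
+def recall_at_k(ranked_hits, ground_truth, ks=KS):
+    """ranked_hits: one ranked list of hit keys per query.
+    ground_truth: the expected key for each query, in the same order."""
+    if len(ranked_hits) != len(ground_truth):
+        raise ValueError("need one ground-truth key per query")
+    total = len(ground_truth)
+    result = {}
+    for k in ks:
+        found = sum(gt in hits[:k] for hits, gt in zip(ranked_hits, ground_truth))
+        result[k] = found / total if total else 0.0
+    return result
+
+
+top_k = 5                        # hits the user asked to display
+search_k = max(top_k, max(KS))   # hits fetched, so recall@10 can be scored
+
+# Made-up results for two queries.
+hits = [["a_1", "b_2"], ["c_3", "d_4", "e_5"]]
+truth = ["a_1", "e_5"]
+print(recall_at_k(hits, truth))  # {1: 0.5, 5: 1.0, 10: 1.0}
+```
+
+Fetching `search_k` results matters because scoring recall@10 against a
+list truncated to five hits would undercount matches ranked six to ten.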
+
+## Key flags
+
+| Flag | Default | Notes |
+|---|---|---|
+| `--query-csv` | `bo767_query_gt.csv` | Ground-truth CSV. |
+| `--top-k` | `5` | Hits shown per query (recall@10 still computed). |
+| `--lancedb-uri` | `lancedb` | Must match [[ingest]] / [[vector-store]]. |
+| `--table-name` | `nv-ingest` | Same. |
+| `--vector-column` | `vector` | Column to search. |
+| `--embedding-endpoint` / `--embedding-http-endpoint` / `--embedding-grpc-endpoint` | — | Remote query embedder. Falls back to local HF if all unset. |
+| `--limit` | — | Cap queries (debug). |
+
+## Common failure modes
+
+- **`recall@10 = 0.0`** — query embedder doesn't match the ingest embedder
+  (different model / dim). Re-ingest with the same embedder or pass the
+  matching `--embedding-*-endpoint`.
+- **`KeyError: 'pdf_page'`** — the CSV header matches neither accepted
+  schema (`query,pdf_page` or `query,pdf,page`), usually because of a typo
+  in a column name.
+- **Slow first run** — local HF embedder cold-start. Reuse a single process
+  or hit a warm NIM.
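+
+Header problems can be caught before any embedding work starts. This is a
+hypothetical validator, assuming the two column layouts listed under Inputs
+and assuming extra columns are harmless. The command's own parsing may be
+stricter.
+
+```python
+import csv
+
+# The two header layouts this page says the command accepts.
+ACCEPTED_SCHEMAS = ({"query", "pdf_page"}, {"query", "pdf", "page"})
+
+
+def detect_schema(csv_path):
+    """Return the matching column set, or raise ValueError with the header seen."""
+    with open(csv_path, newline="", encoding="utf-8") as fh:
+        header = next(csv.reader(fh), [])  # [] for an empty file
+    columns = {name.strip() for name in header}
+    for schema in ACCEPTED_SCHEMAS:
+        if schema <= columns:  # all required columns are present
+            return schema
+    raise ValueError(f"unrecognized query CSV header: {header}")
+```
+
+Run it on `my-queries.csv` first, since a misspelled column name is cheaper
+to find here than in a traceback.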
+
+## Related
+
+- [[query]] — ad-hoc retrieval against the same table.
+- [[eval]] — adds answer-quality grading on top of retrieval.
+- [[compare]] — diff two retrieval runs.
diff --git a/.claude/skills/nemo-retriever/references/service.md b/.claude/skills/nemo-retriever/references/service.md
new file mode 100644
index 0000000000..e02b0b3886
--- /dev/null
+++ b/.claude/skills/nemo-retriever/references/service.md
@@ -0,0 +1,100 @@
+# retriever service
+
+Long-running ingest service: an HTTP/SSE server that accepts document
+uploads and runs the pipeline behind the scenes. Two subcommands:
+
+- `retriever service start` — boot the server.
+- `retriever service ingest` — client that uploads files to a running
+  server.
+
+If flags below look stale, re-check `retriever service <subcmd> --help`.
+
+## When to use this
+
+- You want a single warm process serving many ingest requests (avoids the
+  one-shot CLI startup cost — vLLM load, CUDA-graph capture).
+- You want to ingest from a remote machine / orchestrator without copying
+  files onto a GPU host every time.
+- You want to point [[pipeline]] at a running service via
+  `--run-mode service`.
+
+**Use a different command when:**
+
+- One-shot ingest → [[ingest]] / [[pipeline]].
+- Local debugging / no service → [[local]].
+
+## Canonical invocations
+
+Start with a YAML config:
+
+```bash
+retriever service start --config deploy/retriever-service.yaml
+```
+
+Start with inline flags (they override any YAML values):
+
+```bash
+retriever service start \
+  --host 0.0.0.0 --port 7670 \
+  --gpu-devices 0,1 \
+  --nim-api-key "$NVIDIA_API_KEY" \
+  --api-token "$NEMO_RETRIEVER_API_TOKEN"
+```
+
+Upload files to a running server (SSE streaming progress):
+
+```bash
+retriever service ingest --server-url http://localhost:7670 data/pdfs/*.pdf
+```
+
+Polling instead of SSE (firewalled environments):
+
+```bash
+retriever service ingest --no-sse --poll-interval 5.0 data/pdfs/foo.pdf
+```
+
+## Inputs / outputs
+
+- **`start`** — no inputs; serves until killed.
+- **`ingest`** — one or more file paths, streamed/polled to completion.
+  Prints per-file status.
+
+## Key flags
+
+`service start`:
+
+| Flag | Notes |
+|---|---|
+| `--config -c` | Path to `retriever-service.yaml`. |
+| `--host` / `--port -p` | Bind address. Default per YAML. |
+| `--log-level` / `--log-file` | Logging overrides. |
+| `--nim-api-key` | NIM bearer (also `$NVIDIA_API_KEY`). |
+| `--gpu-devices` | CSV GPU IDs. |
+| `--api-token` | Bearer required on every request (also `$NEMO_RETRIEVER_API_TOKEN`). Unset = no auth. |
+
+`service ingest`:
+
+| Flag | Default | Notes |
+|---|---|---|
+| `--server-url -s` | `http://localhost:7670` | Server base URL. |
+| `--sse / --no-sse` | `sse` | Stream progress or poll. |
+| `--poll-interval` | `2.0` s | Polling cadence when `--no-sse`. |
+| `--concurrency` | `8` | Max concurrent uploads. |
+| `--api-token` | from `$NEMO_RETRIEVER_API_TOKEN` | Auto-falls back to the env var; pass the flag only to override. |
+
+## Common failure modes
+
+- **`401 Unauthorized`** — server has `--api-token` set; the client must
+  match (`--api-token` or `$NEMO_RETRIEVER_API_TOKEN`).
+- **Hangs on first request after boot** — model warmup. First request can
+  take 30–60s; subsequent ones are sub-second.
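+- **`Connection refused`** — nothing is reachable at `--server-url`: the
+  server is not running, the host/port is wrong, or a firewall rejects the
+  port. Check that the server is up, then tunnel or open the port.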
+- **CUDA OOM under concurrency** — drop client `--concurrency`, or reduce
+  per-stage actor counts in the server YAML.
+
+## Related
+
+- [[pipeline]] with `--run-mode service` — pipeline CLI that delegates to a
+  running service.
+- [[ingest]] — local one-shot equivalent.
diff --git a/.claude/skills/nemo-retriever/references/txt.md b/.claude/skills/nemo-retriever/references/txt.md
new file mode 100644
index 0000000000..b81867daa6
--- /dev/null
+++ b/.claude/skills/nemo-retriever/references/txt.md
@@ -0,0 +1,78 @@
+# retriever txt
+
+Plain-text extraction: scan a directory for `*.txt`, tokenizer-split each
+file into chunks, and write `<stem>.txt_extraction.json` sidecars in the
+same primitives shape as the rest of the pipeline.
+
+If flags below look stale, re-check `retriever txt run --help`.
+
+## When to use this
+
+- You have plain-text corpora (logs, scraped articles, transcripts) and want
+  to feed them into embed → VDB downstream stages.
+- Quick way to seed a LanceDB table for retrieval experiments without going
+  through PDF rendering.
+
+**Use a different command when:**
+
+- Input is HTML → [[html]].
+- Input is PDF/audio/etc → [[pdf]], [[audio]], or the unified [[pipeline]]
+  with the matching `--input-type`.
+
+## Canonical invocations
+
+Default chunking (512 tokens, no overlap):
+
+```bash
+retriever txt run --input-dir data/text/
+```
+
+Smaller chunks with overlap:
+
+```bash
+retriever txt run --input-dir data/text/ --max-tokens 256 --overlap 32
+```
+
+## Inputs
+
+- **`--input-dir DIR`** — required, scanned for `*.txt`.
+
+## Outputs
+
+- `<stem>.txt_extraction.json` per file (next to source by default, or in
+  `--output-dir` if set).
+- Same primitives-like shape as stage5 input: `text`, `path`, `page_number`
+  (always 0 for txt), `metadata`.
+
+## Downstream
+
+After this, run (as the `--help` text instructs):
+
+```bash
+retriever local stage5 run --input-dir <dir> --pattern "*.txt_extraction.json"
+retriever local stage6 run --input-dir <dir>
+```
+
+Or pipe straight through [[pipeline]] with `--input-type txt`.
+
+## Key flags
+
+| Flag | Default | Notes |
+|---|---|---|
+| `--max-tokens` | `512` | Hard cap per chunk. |
+| `--overlap` | `0` | Token overlap between consecutive chunks. |
+| `--encoding` | `utf-8` | File read encoding. |
+| `--limit` | — | Cap number of files processed. |
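+
+How `--max-tokens` and `--overlap` interact is easiest to see on a toy
+input. The function below is a sliding-window sketch, not the command's
+code. It takes an already tokenized list, and the whitespace split in the
+example is a stand-in for the real tokenizer, which this page does not name.
+
+```python
+def chunk_tokens(tokens, max_tokens=512, overlap=0):
+    """Split tokens into windows of at most max_tokens sharing `overlap` tokens."""
+    if max_tokens <= 0:
+        raise ValueError("max_tokens must be positive")
+    if not 0 <= overlap < max_tokens:
+        raise ValueError("overlap must be in [0, max_tokens)")
+    step = max_tokens - overlap  # how far each window advances
+    chunks = []
+    for start in range(0, len(tokens), step):
+        chunks.append(tokens[start:start + max_tokens])
+        if start + max_tokens >= len(tokens):
+            break  # this window already reached the end
+    return chunks
+
+
+words = "one two three four five".split()  # stand-in tokenization
+for chunk in chunk_tokens(words, max_tokens=2, overlap=1):
+    print(chunk)
+```
+
+An empty token list yields no chunks, which corresponds to the empty-output
+case under Common failure modes.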
+
+## Common failure modes
+
+- **Empty output files** — input `.txt` is empty or all-whitespace; the
+  tokenizer produced 0 chunks.
+- **Mojibake in extracted text** — wrong `--encoding`; try `latin-1` or
+  `utf-16` for legacy files.
+
+## Related
+
+- [[html]] — sibling command for HTML inputs.
+- [[pipeline]] — wraps txt extraction + embed + VDB in one command.
+- [[vector-store]] — upload the resulting embeddings.
diff --git a/.claude/skills/nemo-retriever/references/vector-store.md b/.claude/skills/nemo-retriever/references/vector-store.md
new file mode 100644
index 0000000000..391ae3d17f
--- /dev/null
+++ b/.claude/skills/nemo-retriever/references/vector-store.md
@@ -0,0 +1,85 @@
+# retriever vector-store
+
+LanceDB upload stage: take a directory of `*.text_embeddings.json` files
+(produced by the local `stage5` embedder) and load them into a LanceDB
+table, optionally creating an IVF index.
+
+If flags below look stale, re-check `retriever vector-store stage run --help`.
+
+## When to use this
+
+- You ran embedding offline (e.g. via [[local]] stage5 or a custom embed
+  job) and now want the vectors searchable.
+- You want to (re)build a LanceDB index over existing embedding sidecars.
+
+**Use a different command when:**
+
+- You want full ingest in one shot → [[ingest]] or [[pipeline]] (their last
+  stage already does this).
+- You want to *query* an existing table → [[query]] / [[recall]].
+
+## Canonical invocations
+
+Upload + index with defaults (overwrites the table):
+
+```bash
+retriever vector-store stage run --input-dir out/embeddings/
+```
+
+Append rather than overwrite, into a custom DB/table:
+
+```bash
+retriever vector-store stage run \
+  --input-dir out/embeddings/ \
+  --lancedb-uri ./my-lancedb \
+  --table-name my-corpus \
+  --append
+```
+
+Skip indexing (faster upload, but slower searches afterwards):
+
+```bash
+retriever vector-store stage run --input-dir out/embeddings/ --no-create-index
+```
+
+## Inputs
+
+- **`--input-dir DIR`** — required. Contains `*.text_embeddings.json` files.
+  `--recursive` to scan subdirectories.
+
+## Outputs
+
+- LanceDB table at `<lancedb-uri>/<table-name>.lance`. Defaults to
+  `lancedb/nv-ingest.lance`, matching the [[ingest]] / [[query]] defaults.
+- Each row carries `vector`, `pdf_basename`, `page_number`, `path`,
+  `source_id`, and the original primitive metadata.
+
+## Key flags
+
+| Flag | Default | Notes |
+|---|---|---|
+| `--recursive` | off | Walk subdirectories of `--input-dir`. |
+| `--lancedb-uri` | `lancedb` | DB path/URI. |
+| `--table-name` | `nv-ingest` | Table name (must match [[query]]). |
+| `--overwrite/--append` | `overwrite` | Replace or extend existing table. |
+| `--create-index/--no-create-index` | `create-index` | Build vector index after upload. |
+| `--index-type` | `IVF_HNSW_SQ` | LanceDB index type. |
+| `--metric` | `l2` | Distance metric (must match how you'll search). |
+| `--num-partitions` | `16` | IVF partitions. Clamped down for tiny tables. |
+| `--num-sub-vectors` | `256` | PQ sub-vectors. |
+
+## Common failure modes
+
+- **`Clamping num_partitions from 16 to N`** — informational; index needs
+  partitions < row count. Happens on small uploads.
+- **Duplicate or unexpected rows after `--append` to an existing table** —
+  `--append` does not dedupe. Run [[query]] / inspect the table if you
+  suspect duplicates.
+- **Query results look bad after upload** — metric mismatch between this
+  stage's `--metric` and what [[query]] uses (`l2` everywhere by default).
+
+## Related
+
+- [[query]] — search the table this command writes.
+- [[recall]] — batch query + recall metrics over a CSV of ground truth.
+- [[pipeline]] — full ingest that uses this stage as its sink.