NVIDIA · edknv · May 13, 2026
@@ -13,9 +13,36 @@ If no arguments are provided, run `retriever --help` and summarize the available
 
 ## Subcommand references
 
-For per-subcommand details (when to use it, canonical invocations, inputs/outputs, flags, common failure modes), read the matching file in `references/` *before* running anything non-trivial:
+For per-subcommand details (when to use it, canonical invocations, inputs/outputs, flags, common failure modes), read the matching file in `references/` *before* running anything non-trivial.
 
-- `references/ingest.md` — `retriever ingest`: PDFs → LanceDB (full pipeline).
+End-to-end / search:
+
+- `references/ingest.md` — `retriever ingest`: docs → LanceDB (full pipeline, defaults).
 - `references/query.md` — `retriever query`: text query → top-k LanceDB hits.
+- `references/pipeline.md` — `retriever pipeline run`: graph-based end-to-end with per-stage knobs.
+- `references/service.md` — `retriever service`: long-running ingest service + client.
+- `references/local.md` — `retriever local stage{1..7}`: non-distributed per-stage runner.
+
+Per-input-type extractors:
+
+- `references/pdf.md` — `retriever pdf stage page-elements`: PDF → primitives JSON.
+- `references/chart.md` — `retriever chart stage run` / `graphic-elements`: chart enrichment.
+- `references/audio.md` — `retriever audio extract` / `discover`: chunk + ASR.
+- `references/txt.md` — `retriever txt run`: plain-text chunking.
+- `references/html.md` — `retriever html run`: HTML → markdown → chunks.
+- `references/image.md` — `retriever image render`: detection overlay visualization.
+
+Storage and evaluation:
+
+- `references/vector-store.md` — `retriever vector-store stage run`: embeddings → LanceDB.
+- `references/recall.md` — `retriever recall vdb-recall run`: recall@k over a query CSV.
+- `references/eval.md` — `retriever eval run` / `export` / `build-page-index`: QA evaluation.
+- `references/benchmark.md` — `retriever benchmark <stage> run`: per-stage rows/sec.
+- `references/harness.md` — `retriever harness run` / `sweep` / `nightly` / `portal` / …: sessioned orchestration.
+- `references/compare.md` — `retriever compare`: JSON / results-bundle diffs.
+
+Cross-cutting:
+
+- `references/pipeline-stages.md` — map of the internal pipeline stages (page-elements, ocr, table-structure, graphic-elements, embed, caption, dedup, store, …) → which CLI command exposes each.
 
-Additional per-stage references (`pdf`, `chart`, `image`, `audio`, `txt`, `html`, `pipeline`, `vector-store`, `recall`, `eval`, `benchmark`, `service`, `local`, `compare`, `harness`) will be added as those stages stabilize. Until then, fall back to `retriever <subcommand> --help` for any subcommand not listed above.
+If a subcommand isn't listed above, fall back to `retriever <subcommand> --help`.
@@ -0,0 +1,100 @@
+# retriever audio
+
+Audio / video extraction stage: chunk media files, run ASR (Parakeet locally
+or a remote Riva/NIM endpoint), and write extraction JSON sidecars in the
+same primitives shape as [[pdf]].
+
+If flags below look stale, re-check `retriever audio extract --help`.
+
+## When to use this
+
+- You have audio (`.mp3`, `.wav`) or video files and want ASR transcripts
+  fed into the rest of the retriever pipeline.
+- You want to verify mount/path layout before kicking off a long ASR run →
+  use `retriever audio discover` (no ASR, just lists what would be
+  processed).
+
+**Use a different command when:**
+
+- You want full ingest including audio → [[pipeline]] with
+  `--input-type audio` or [[ingest]] once it accepts audio inputs.
+- You want to benchmark ASR throughput → [[benchmark]] (`audio-extract`).
+
+## Canonical invocations
+
+Dry-run discovery:
+
+```bash
+retriever audio discover --input-dir data/audio/
+```
+
+Local Parakeet ASR over `*.mp3`/`*.wav` (default globs):
+
+```bash
+retriever audio extract --input-dir data/audio/
+```
+
+Cloud ASR via NIM env vars:
+
+```bash
+export NGC_API_KEY=...
+export AUDIO_FUNCTION_ID=...
+retriever audio extract --input-dir data/audio/ --use-env-asr
+```
+
+Override the gRPC endpoint explicitly:
+
+```bash
+retriever audio extract \
+  --input-dir data/audio/ \
+  --audio-grpc-endpoint riva-asr:50051 \
+  --auth-token "$NVIDIA_API_KEY"
+```
+
+Process video too, extracting audio first:
+
+```bash
+retriever audio extract --input-dir data/media/ --glob "*.mp4" --audio-only
+```
+
+## Inputs
+
+- **`--input-dir DIR`** — required, scanned (non-recursive) for files
+  matching `--glob`.
+- **`--glob`** — comma-separated patterns. Default `*.mp3,*.wav`.
+
+## Outputs
+
+- One `<file>.audio_extraction.json` sidecar per source file (default; toggle
+  with `--write-json/--no-write-json`).
+- Sidecar shape mirrors PDF primitives (`text`, `source_id`, `metadata`),
+  with `metadata.content_metadata.type == "text"` per ASR chunk.
+
+## Key flags
+
+| Flag | Default | Notes |
+|---|---|---|
+| `--split-type` | `size` | `size` (bytes), `time` (seconds), or `frame`. |
+| `--split-interval` | `450` | Chunk size in the chosen units. |
+| `--audio-only` | off | Extract audio track from video first, then chunk. |
+| `--video-audio-separate` | off | Emit the extracted MP3 as its own item. |
+| `--use-env-asr` | on | Build ASR params from `AUDIO_GRPC_ENDPOINT`/`NGC_API_KEY`/`AUDIO_FUNCTION_ID`. |
+| `--audio-grpc-endpoint` | — | Override env; sets remote ASR. Wins over `--use-env-asr`. |
+| `--auth-token` | — | Bearer for cloud ASR (also `$NVIDIA_API_KEY`). |
+| `--limit` | — | Cap files processed. |
+
+## Common failure modes
+
+- **`No files matched glob`** — default globs are `*.mp3,*.wav`. Pass
+  `--glob "*.mp4"` for video, etc.
+- **Falls back to local Parakeet unexpectedly** — `--use-env-asr` is on but
+  none of `AUDIO_GRPC_ENDPOINT` / `NGC_API_KEY` / `AUDIO_FUNCTION_ID` are
+  set. Either set them or pass `--audio-grpc-endpoint`.
+- **Local Parakeet OOM on long files** — drop `--split-interval` (smaller
+  chunks) or switch to a remote NIM.
+
+## Related
+
+- [[pipeline]] with `--input-type audio` — full ingest including embedding +
+  VDB.
+- [[benchmark]] `audio-extract` — throughput benchmarks.
@@ -0,0 +1,93 @@
+# retriever benchmark
+
+Throughput micro-benchmarks for individual Ray actors in the ingest
+pipeline. Each subcommand isolates one stage and reports rows/sec.
+
+Subcommands:
+
+| Stage | Subcommand | Actor benchmarked |
+|---|---|---|
+| Split | `retriever benchmark split run` | `PDFSplitActor` |
+| Extract | `retriever benchmark extract run` | `PDFExtractionActor` |
+| Page elements | `retriever benchmark page-elements run` | `PageElementDetectionActor` |
+| OCR | `retriever benchmark ocr run` | `OCRActor` |
+| Audio extract | `retriever benchmark audio-extract run` | `MediaChunkActor + ASRActor` |
+| All | `retriever benchmark all run` | runs the above in sequence |
+
+If flags below look stale, re-check `retriever benchmark <stage> run --help`.
+
+## When to use this
+
+- You suspect a specific pipeline stage is the bottleneck and want
+  rows/sec numbers under controlled load.
+- You're sizing Ray actor counts / GPU fractions for [[pipeline]] / [[ingest]]
+  and need empirical numbers per stage.
+- You want a regression-style benchmark across machines or releases (pair
+  with [[harness]] for orchestration).
+
+**Use a different command when:**
+
+- You want end-to-end ingest, not stage-isolated numbers → [[ingest]] or
+  [[pipeline]] with a stopwatch.
+- You want recall/QA quality, not throughput → [[recall]] / [[eval]].
+
+## Canonical invocations
+
+Benchmark the page-element detector alone:
+
+```bash
+retriever benchmark page-elements run --help   # see options
+retriever benchmark page-elements run
+```
+
+Benchmark OCR (v2 by default; pair with [[pipeline]]'s `--ocr-version`):
+
+```bash
+retriever benchmark ocr run
+```
+
+Run all stage benchmarks in sequence and print a summary:
+
+```bash
+retriever benchmark all run --num-gpus 0.5 --num-cpus 1.0
+```
+
+## Inputs
+
+- All `run` commands take their own flag set (run `--help` on the
+  individual subcommand). Common shape: rows count, batch size, GPU/CPU
+  fractions per actor, optional remote NIM URL.
+
+## Outputs
+
+- Stdout report with per-actor throughput in rows/sec, plus headers per
+  stage (e.g. `=== benchmark: page-elements ===`).
+
+## Key flags (`all run`)
+
+| Flag | Default | Notes |
+|---|---|---|
+| `--num-gpus` | `1.0` | GPUs reserved per page-elements / OCR actor. |
+| `--num-cpus` | `1.0` | CPUs reserved per actor. |
+| `--rows-page-elements` etc. | per-stage | Synthetic rows per stage benchmark. |
+
+## Reading the results
+
+- Numbers come from a synthetic Ray Dataset; they're representative of the
+  stage in isolation, not of end-to-end throughput.
+- To convert to [[pipeline]] tuning: pick the slowest stage's rows/sec,
+  divide your target rate by it → number of actors needed.
+
+## Common failure modes
+
+- **Page-elements benchmark stalls** — needs YOLOX weights or a remote
+  endpoint. Pass the URL flags or pre-cache weights.
+- **Benchmark numbers don't match [[pipeline]]** — micro-benchmarks exclude
+  inter-stage queues / batching overhead. Treat as upper bounds.
+- **`CUDA OOM`** — drop `--num-gpus` (fractional) or `*-batch-size` per
+  stage.
+
+## Related
+
+- [[pipeline]] — apply the actor counts derived from these benchmarks.
+- [[harness]] — runs benchmarks across configs/datasets and stores results.
@@ -0,0 +1,85 @@
+# retriever chart
+
+Chart-specific enrichment over already-extracted primitives — parses chart
+images (titles, axes, series, values) and adds them as structured text to
+each chart primitive. Two related subcommands:
+
+- `retriever chart stage run` — enrich an existing primitives DataFrame.
+- `retriever chart stage graphic-elements` — run the extract+detect path
+  starting from PDFs, with chart extraction enabled.
+
+If flags below look stale, re-check `retriever chart stage --help`.
+
+## When to use this
+
+- You already ran [[pdf]] (or another extractor) and want to add chart
+  parsing on top of the primitives without re-extracting.
+- You're iterating on chart parsing parameters and don't want to rerun the
+  whole pipeline.
+
+**Use a different command when:**
+
+- You want full ingest with charts → [[ingest]] / [[pipeline]] with
+  `--extract-charts`.
+- You want only PDF extraction (no chart parsing) → [[pdf]].
+
+## Canonical invocations
+
+Enrich a primitives parquet with chart parsing:
+
+```bash
+retriever chart stage run \
+  --input out/extractions.parquet \
+  --output out/extractions.+chart.parquet
+```
+
+Extract from PDFs with charts enabled:
+
+```bash
+retriever chart stage graphic-elements \
+  --input-dir data/pdfs/ \
+  --extract-charts \
+  --yolox-http-endpoint http://page-elements:8000/v1/infer
+```
+
+## Inputs
+
+- **`run`**: `--input` parquet/jsonl/json with a `metadata` column.
+- **`graphic-elements`**: `--input-dir` of PDFs (same shape as `retriever pdf
+  stage page-elements`).
+
+## Outputs
+
+- **`run`**: enriched DataFrame at `--output` (defaults to
+  `<input>.+chart<ext>`). Chart primitives gain parsed structured text in
+  their `text` field.
+- **`graphic-elements`**: per-PDF `*.pdf_extraction.json` sidecars including
+  chart primitives.
+
+## Key flags (`chart stage run`)
+
+| Flag | Default | Notes |
+|---|---|---|
+| `--input` | — | Required. `.parquet`, `.jsonl`, or `.json` with `metadata`. |
+| `--output` | `<input>.+chart<ext>` | Output path. |
+| `--config` | auto-discover | YAML config (section: `chart`). |
+
+## Key flags (`chart stage graphic-elements`)
+
+Same as `retriever pdf stage page-elements` plus `--extract-charts` toggled
+on by default. See [[pdf]] for the full flag table.
+
+## Common failure modes
+
+- **`KeyError: 'metadata'`** — input DataFrame is missing the `metadata`
+  column. Make sure you fed it primitives JSON/parquet from
+  `retriever pdf stage` or [[pipeline]].
+- **No chart rows in output** — the input has no rows with
+  `metadata.content_metadata.type == "structured"` and chart subtype. Run
+  extraction with `--extract-charts` first.
+
+## Related
+
+- [[pdf]] — generate the primitives that `chart stage run` consumes.
+- [[pipeline]] — wraps chart extraction into the graph pipeline.
+- [[ingest]] — end-to-end including charts when enabled.
@@ -0,0 +1,46 @@
+# retriever compare
+
+Comparison utilities. Optional subcommands are registered lazily — if the
+relevant module is installed, you'll see:
+
+- `retriever compare json` — diff two JSON files (extraction sidecars, eval
+  outputs, recall outputs).
+- `retriever compare results` — diff two retrieval/eval result bundles.
+
+Run `retriever compare --help` to see which subcommands are present in your
+install.
+
+## When to use this
+
+- You changed an extraction flag, ran the pipeline twice, and want a
+  semantic diff of the outputs (not a textual diff).
+- You ran [[recall]] or [[eval]] twice and want to know which queries
+  regressed / improved.
+
+**Use a different command when:**
+
+- You want a single-number metric, not a diff → [[recall]] / [[eval]].
+- You want a UI / portal for sweep comparison → [[harness]] (`portal` /
+  `compare`).
+
+## Canonical invocations
+
+```bash
+retriever compare json before.json after.json
+retriever compare results runs/baseline/ runs/candidate/
+```
+
+Run `--help` on each subcommand for the exact flag set; the modules are
+optional and may expose different options across releases.
+
+## Common failure modes
+
+- **`retriever compare json` not found** — the `compare_json` module isn't
+  installed. Install the extras (or upgrade the package).
+- **Diff shows everything different** — files have non-stable key order or
+  embedded timestamps; the subcommand normalises common cases but not all.
+
+## Related
+
+- [[recall]] / [[eval]] — produce the artifacts this command compares.
+- [[harness]] `compare` — session-level comparison with summaries.