diff --git a/.claude/skills/nemo-retriever/SKILL.md b/.claude/skills/nemo-retriever/SKILL.md index 0a9e102482..7c9322b929 100644 --- a/.claude/skills/nemo-retriever/SKILL.md +++ b/.claude/skills/nemo-retriever/SKILL.md @@ -13,9 +13,36 @@ If no arguments are provided, run `retriever --help` and summarize the available ## Subcommand references -For per-subcommand details (when to use it, canonical invocations, inputs/outputs, flags, common failure modes), read the matching file in `references/` *before* running anything non-trivial: +For per-subcommand details (when to use it, canonical invocations, inputs/outputs, flags, common failure modes), read the matching file in `references/` *before* running anything non-trivial. -- `references/ingest.md` — `retriever ingest`: PDFs → LanceDB (full pipeline). +End-to-end / search: + +- `references/ingest.md` — `retriever ingest`: docs → LanceDB (full pipeline, defaults). - `references/query.md` — `retriever query`: text query → top-k LanceDB hits. +- `references/pipeline.md` — `retriever pipeline run`: graph-based end-to-end with per-stage knobs. +- `references/service.md` — `retriever service`: long-running ingest service + client. +- `references/local.md` — `retriever local stage{1..7}`: non-distributed per-stage runner. + +Per-input-type extractors: + +- `references/pdf.md` — `retriever pdf stage page-elements`: PDF → primitives JSON. +- `references/chart.md` — `retriever chart stage run` / `graphic-elements`: chart enrichment. +- `references/audio.md` — `retriever audio extract` / `discover`: chunk + ASR. +- `references/txt.md` — `retriever txt run`: plain-text chunking. +- `references/html.md` — `retriever html run`: HTML → markdown → chunks. +- `references/image.md` — `retriever image render`: detection overlay visualization. + +Storage and evaluation: + +- `references/vector-store.md` — `retriever vector-store stage run`: embeddings → LanceDB. 
+- `references/recall.md` — `retriever recall vdb-recall run`: recall@k over a query CSV. +- `references/eval.md` — `retriever eval run` / `export` / `build-page-index`: QA evaluation. +- `references/benchmark.md` — `retriever benchmark run`: per-stage rows/sec. +- `references/harness.md` — `retriever harness run` / `sweep` / `nightly` / `portal` / …: sessioned orchestration. +- `references/compare.md` — `retriever compare`: JSON / results-bundle diffs. + +Cross-cutting: + +- `references/pipeline-stages.md` — map of the internal pipeline stages (page-elements, ocr, table-structure, graphic-elements, embed, caption, dedup, store, …) → which CLI command exposes each. -Additional per-stage references (`pdf`, `chart`, `image`, `audio`, `txt`, `html`, `pipeline`, `vector-store`, `recall`, `eval`, `benchmark`, `service`, `local`, `compare`, `harness`) will be added as those stages stabilize. Until then, fall back to `retriever --help` for any subcommand not listed above. +If a subcommand isn't listed above, fall back to `retriever --help`. diff --git a/.claude/skills/nemo-retriever/references/audio.md b/.claude/skills/nemo-retriever/references/audio.md new file mode 100644 index 0000000000..ebd700e345 --- /dev/null +++ b/.claude/skills/nemo-retriever/references/audio.md @@ -0,0 +1,100 @@ +# retriever audio + +Audio / video extraction stage: chunk media files, run ASR (Parakeet locally +or a remote Riva/NIM endpoint), and write extraction JSON sidecars in the +same primitives shape as [[pdf]]. + +If flags below look stale, re-check `retriever audio extract --help`. + +## When to use this + +- You have audio (`.mp3`, `.wav`) or video files and want ASR transcripts + fed into the rest of the retriever pipeline. +- You want to verify mount/path layout before kicking off a long ASR run → + use `retriever audio discover` (no ASR, just lists what would be + processed). 
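The Inputs section of this reference describes the `--input-dir` scan as non-recursive and driven by comma-separated globs. A quick shell pre-check can show how many files a glob list would pick up before a long ASR run starts. This is a plain-shell approximation, not the CLI's own matcher: `count_matches` is a hypothetical helper, and it assumes shell glob semantics are close to what `--glob` does. `retriever audio discover` remains the authoritative check.

```bash
# Hypothetical helper: count files directly inside DIR (no recursion) that
# match a comma-separated glob list. The default mirrors the documented
# `*.mp3,*.wav`. Write the list without spaces around the commas.
count_matches() {
  dir="${1%/}"                      # tolerate a trailing slash
  globs="${2:-*.mp3,*.wav}"
  printf '%s\n' "$globs" | tr ',' '\n' | while IFS= read -r pat; do
    # $pat is left unquoted on purpose so the shell expands the glob.
    for f in "$dir"/$pat; do
      if [ -f "$f" ]; then printf '%s\n' "$f"; fi
    done
  done | sort -u | wc -l | tr -d ' '
}

# Example calls (paths taken from the invocations in this reference):
#   count_matches data/audio/
#   count_matches data/media/ "*.mp4"
```

A count of `0` here suggests the `No files matched glob` failure mode is likely, which is a cue to adjust `--glob` before running `retriever audio extract`.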
+ +**Use a different command when:** + +- You want full ingest including audio → [[pipeline]] with + `--input-type audio` or [[ingest]] once it accepts audio inputs. +- You want to benchmark ASR throughput → [[benchmark]] (`audio-extract`). + +## Canonical invocations + +Dry-run discovery: + +```bash +retriever audio discover --input-dir data/audio/ +``` + +Local Parakeet ASR over `*.mp3`/`*.wav` (default globs): + +```bash +retriever audio extract --input-dir data/audio/ +``` + +Cloud ASR via NIM env vars: + +```bash +export NGC_API_KEY=... +export AUDIO_FUNCTION_ID=... +retriever audio extract --input-dir data/audio/ --use-env-asr +``` + +Override the gRPC endpoint explicitly: + +```bash +retriever audio extract \ + --input-dir data/audio/ \ + --audio-grpc-endpoint riva-asr:50051 \ + --auth-token "$NVIDIA_API_KEY" +``` + +Process video too, extracting audio first: + +```bash +retriever audio extract --input-dir data/media/ --glob "*.mp4" --audio-only +``` + +## Inputs + +- **`--input-dir DIR`** — required, scanned (non-recursive) for files + matching `--glob`. +- **`--glob`** — comma-separated patterns. Default `*.mp3,*.wav`. + +## Outputs + +- One `.audio_extraction.json` sidecar per source file (default; toggle + with `--write-json/--no-write-json`). +- Sidecar shape mirrors PDF primitives (`text`, `source_id`, `metadata`), + with `metadata.content_metadata.type == "text"` per ASR chunk. + +## Key flags + +| Flag | Default | Notes | +|---|---|---| +| `--split-type` | `size` | `size` (bytes), `time` (seconds), or `frame`. | +| `--split-interval` | `450` | Chunk size in the chosen units. | +| `--audio-only` | off | Extract audio track from video first, then chunk. | +| `--video-audio-separate` | off | Emit the extracted MP3 as its own item. | +| `--use-env-asr` | on | Build ASR params from `AUDIO_GRPC_ENDPOINT`/`NGC_API_KEY`/`AUDIO_FUNCTION_ID`. | +| `--audio-grpc-endpoint` | — | Override env; sets remote ASR. Wins over `--use-env-asr`. 
| +| `--auth-token` | — | Bearer for cloud ASR (also `$NVIDIA_API_KEY`). | +| `--limit` | — | Cap files processed. | + +## Common failure modes + +- **`No files matched glob`** — default globs are `*.mp3,*.wav`. Pass + `--glob "*.mp4"` for video, etc. +- **Falls back to local Parakeet unexpectedly** — `--use-env-asr` is on but + none of `AUDIO_GRPC_ENDPOINT` / `NGC_API_KEY` / `AUDIO_FUNCTION_ID` are + set. Either set them or pass `--audio-grpc-endpoint`. +- **Local Parakeet OOM on long files** — drop `--split-interval` (smaller + chunks) or switch to a remote NIM. + +## Related + +- [[pipeline]] with `--input-type audio` — full ingest including embedding + + VDB. +- [[benchmark]] `audio-extract` — throughput benchmarks. diff --git a/.claude/skills/nemo-retriever/references/benchmark.md b/.claude/skills/nemo-retriever/references/benchmark.md new file mode 100644 index 0000000000..10d07ba968 --- /dev/null +++ b/.claude/skills/nemo-retriever/references/benchmark.md @@ -0,0 +1,93 @@ +# retriever benchmark + +Throughput micro-benchmarks for individual Ray actors in the ingest +pipeline. Each subcommand isolates one stage and reports rows/sec. + +Subcommands: + +| Stage | Subcommand | Actor benchmarked | +|---|---|---| +| Split | `retriever benchmark split run` | `PDFSplitActor` | +| Extract | `retriever benchmark extract run` | `PDFExtractionActor` | +| Page elements | `retriever benchmark page-elements run` | `PageElementDetectionActor` | +| OCR | `retriever benchmark ocr run` | `OCRActor` | +| Audio extract | `retriever benchmark audio-extract run` | `MediaChunkActor + ASRActor` | +| All | `retriever benchmark all run` | runs the above in sequence | + +If flags below look stale, re-check `retriever benchmark run --help`. + +## When to use this + +- You suspect a specific pipeline stage is the bottleneck and want + rows/sec numbers under controlled load. 
+- You're sizing Ray actor counts / GPU fractions for [[pipeline]] / [[ingest]] + and need empirical numbers per stage. +- You want a regression-style benchmark across machines or releases (pair + with [[harness]] for orchestration). + +**Use a different command when:** + +- You want end-to-end ingest, not stage-isolated numbers → [[ingest]] or + [[pipeline]] with a stopwatch. +- You want recall/QA quality, not throughput → [[recall]] / [[eval]]. + +## Canonical invocations + +Benchmark the page-element detector alone: + +```bash +retriever benchmark page-elements run --help # see options +retriever benchmark page-elements run +``` + +Benchmark OCR (v2 by default; pair with [[pipeline]]'s `--ocr-version`): + +```bash +retriever benchmark ocr run +``` + +Run all stage benchmarks in sequence and print a summary: + +```bash +retriever benchmark all run --num-gpus 0.5 --num-cpus 1.0 +``` + +## Inputs + +- All `run` commands take their own flag set (run `--help` on the + individual subcommand). Common shape: rows count, batch size, GPU/CPU + fractions per actor, optional remote NIM URL. + +## Outputs + +- Stdout report with per-actor throughput in rows/sec, plus headers per + stage (e.g. `=== benchmark: page-elements ===`). + +## Key flags (`all run`) + +| Flag | Default | Notes | +|---|---|---| +| `--num-gpus` | `1.0` | GPUs reserved per page-elements / OCR actor. | +| `--num-cpus` | `1.0` | CPUs reserved per actor. | +| `--rows-page-elements` etc. | per-stage | Synthetic rows per stage benchmark. | + +## Reading the results + +- Numbers come from a synthetic Ray Dataset; they're representative of the + stage in isolation, not of end-to-end throughput. +- To convert to [[pipeline]] tuning: pick the slowest stage's rows/sec, + divide your target rate by it → number of actors needed. + +## Common failure modes + +- **Page-elements benchmark stalls** — needs YOLOX weights or a remote + endpoint. Pass the URL flags or pre-cache weights. 
+- **Benchmark numbers don't match [[pipeline]]** — micro-benchmarks exclude + inter-stage queues / batching overhead. Treat as upper bounds. +- **`CUDA OOM`** — drop `--num-gpus` (fractional) or `*-batch-size` per + stage. + +## Related + +- [[pipeline]] — apply the actor counts derived from these benchmarks. +- [[harness]] — runs benchmarks across configs/datasets and stores results. diff --git a/.claude/skills/nemo-retriever/references/chart.md b/.claude/skills/nemo-retriever/references/chart.md new file mode 100644 index 0000000000..0adc5319c0 --- /dev/null +++ b/.claude/skills/nemo-retriever/references/chart.md @@ -0,0 +1,85 @@ +# retriever chart + +Chart-specific enrichment over already-extracted primitives — parses chart +images (titles, axes, series, values) and adds them as structured text to +each chart primitive. Two related subcommands: + +- `retriever chart stage run` — enrich an existing primitives DataFrame. +- `retriever chart stage graphic-elements` — run the extract+detect path + starting from PDFs, with chart extraction enabled. + +If flags below look stale, re-check `retriever chart stage --help`. + +## When to use this + +- You already ran [[pdf]] (or another extractor) and want to add chart + parsing on top of the primitives without re-extracting. +- You're iterating on chart parsing parameters and don't want to rerun the + whole pipeline. + +**Use a different command when:** + +- You want full ingest with charts → [[ingest]] / [[pipeline]] with + `--extract-charts`. +- You want only PDF extraction (no chart parsing) → [[pdf]]. 
+ +## Canonical invocations + +Enrich a primitives parquet with chart parsing: + +```bash +retriever chart stage run \ + --input out/extractions.parquet \ + --output out/extractions.+chart.parquet +``` + +Extract from PDFs with charts enabled: + +```bash +retriever chart stage graphic-elements \ + --input-dir data/pdfs/ \ + --extract-charts \ + --yolox-http-endpoint http://page-elements:8000/v1/infer +``` + +## Inputs + +- **`run`**: `--input` parquet/jsonl/json with a `metadata` column. +- **`graphic-elements`**: `--input-dir` of PDFs (same shape as `retriever pdf + stage page-elements`). + +## Outputs + +- **`run`**: enriched DataFrame at `--output` (defaults to + `.+chart`). Chart primitives gain parsed structured text in + their `text` field. +- **`graphic-elements`**: per-PDF `*.pdf_extraction.json` sidecars including + chart primitives. + +## Key flags (`chart stage run`) + +| Flag | Default | Notes | +|---|---|---| +| `--input` | — | Required. `.parquet`, `.jsonl`, or `.json` with `metadata`. | +| `--output` | `.+chart` | Output path. | +| `--config` | auto-discover | YAML config (section: `chart`). | + +## Key flags (`chart stage graphic-elements`) + +Same as `retriever pdf stage page-elements` plus `--extract-charts` toggled +on by default. See [[pdf]] for the full flag table. + +## Common failure modes + +- **`KeyError: 'metadata'`** — input DataFrame is missing the `metadata` + column. Make sure you fed it primitives JSON/parquet from + `retriever pdf stage` or [[pipeline]]. +- **No chart rows in output** — the input has no rows with + `metadata.content_metadata.type == "structured"` and chart subtype. Run + extraction with `--extract-charts` first. + +## Related + +- [[pdf]] — generate the primitives that `chart stage run` consumes. +- [[pipeline]] — wraps chart extraction into the graph pipeline. +- [[ingest]] — end-to-end including charts when enabled. 
diff --git a/.claude/skills/nemo-retriever/references/compare.md b/.claude/skills/nemo-retriever/references/compare.md new file mode 100644 index 0000000000..4fa7c4aef6 --- /dev/null +++ b/.claude/skills/nemo-retriever/references/compare.md @@ -0,0 +1,46 @@ +# retriever compare + +Comparison utilities. Optional subcommands are registered lazily — if the +relevant module is installed, you'll see: + +- `retriever compare json` — diff two JSON files (extraction sidecars, eval + outputs, recall outputs). +- `retriever compare results` — diff two retrieval/eval result bundles. + +Run `retriever compare --help` to see which subcommands are present in your +install. + +## When to use this + +- You changed an extraction flag, ran the pipeline twice, and want a + semantic diff of the outputs (not a textual diff). +- You ran [[recall]] or [[eval]] twice and want to know which queries + regressed / improved. + +**Use a different command when:** + +- You want a single-number metric, not a diff → [[recall]] / [[eval]]. +- You want a UI / portal for sweep comparison → [[harness]] (`portal` / + `compare`). + +## Canonical invocations + +```bash +retriever compare json before.json after.json +retriever compare results runs/baseline/ runs/candidate/ +``` + +Run `--help` on each subcommand for the exact flag set; the modules are +optional and may expose different options across releases. + +## Common failure modes + +- **`retriever compare json` not found** — the `compare_json` module isn't + installed. Install the extras (or upgrade the package). +- **Diff shows everything different** — files have non-stable key order or + embedded timestamps; the subcommand normalises common cases but not all. + +## Related + +- [[recall]] / [[eval]] — produce the artifacts this command compares. +- [[harness]] `compare` — session-level comparison with summaries. 
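For the "Diff shows everything different" failure mode, one option is to normalise key order yourself and see whether the two files still differ. The sketch uses Python's standard `json.tool` module, not anything from `retriever`, and assumes `python3` is on `PATH`. `same_after_normalising` is a hypothetical helper name, and embedded timestamps are not stripped.

```bash
# Hypothetical helper: succeed (exit 0) when two JSON files are identical
# once their keys are sorted; exit 1 when they differ; exit 2 when either
# file is not valid JSON.
same_after_normalising() {
  a="$(python3 -m json.tool --sort-keys "$1")" || return 2
  b="$(python3 -m json.tool --sort-keys "$2")" || return 2
  [ "$a" = "$b" ]
}

# Example call (file names from the invocation in this reference):
#   same_after_normalising before.json after.json && echo "only key order changed"
```

If the files match after sorting, the original diff was noise from key ordering. If they still differ, `retriever compare json` is the better tool for seeing what changed.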
diff --git a/.claude/skills/nemo-retriever/references/eval.md b/.claude/skills/nemo-retriever/references/eval.md new file mode 100644 index 0000000000..88bfcc9769 --- /dev/null +++ b/.claude/skills/nemo-retriever/references/eval.md @@ -0,0 +1,107 @@ +# retriever eval + +End-to-end QA evaluation: retrieval + generation + judge. Three +subcommands: + +- `retriever eval run` — run a configured QA sweep. +- `retriever eval export` — turn a LanceDB table into FileRetriever JSON for + use as a static retriever in an eval config. +- `retriever eval build-page-index` — build a page-level markdown index for + full-page eval mode. + +If flags below look stale, re-check `retriever eval --help`. + +## When to use this + +- You want a single number for "is this retrieval+generation setup good?" + (judge score, per-question answers, etc.). +- You're comparing models or chunking strategies and need a controlled QA + benchmark. + +**Use a different command when:** + +- You only need retrieval recall metrics → [[recall]]. +- You want a single ad-hoc query → [[query]]. +- You're tuning extraction quality, not QA → [[pipeline]] / [[pdf]]. + +## Canonical invocations + +Run a sweep from a config file: + +```bash +retriever eval run --config evaluation/eval_sweep.yaml +``` + +Run a sweep from environment (Docker/CI pattern): + +```bash +export RETRIEVAL_FILE=out/retrieval.json +export QA_DATASET=path/to/qa.json +export GEN_MODEL=... +export JUDGE_MODEL=... 
+retriever eval run --from-env +``` + +Export LanceDB → FileRetriever JSON so eval can consume it: + +```bash +retriever eval export \ + --lancedb-uri ./lancedb --lancedb-table nv-ingest \ + --query-csv evaluation/queries.csv \ + --output out/retrieval.json \ + --top-k 5 +``` + +Build a page index for full-page eval mode: + +```bash +retriever eval build-page-index \ + --parquet-dir out/extractions/ \ + --output out/page_index.json +``` + +## Inputs / Outputs + +- **`run`** — config (YAML/JSON) or env vars; emits per-question results + + aggregated metrics. +- **`export`** — needs a populated LanceDB + a query CSV; emits a + FileRetriever JSON. +- **`build-page-index`** — needs a directory of extraction Parquets; emits + a JSON mapping `(pdf, page) → markdown`. + +## Key flags + +`eval run`: + +| Flag | Notes | +|---|---| +| `--config FILE` | YAML/JSON sweep config (exclusive with `--from-env`). | +| `--from-env` | Build config from env vars (`RETRIEVAL_FILE`, `QA_DATASET`, `GEN_MODEL`, `JUDGE_MODEL`, …). | + +`eval export`: + +| Flag | Default | Notes | +|---|---|---| +| `--lancedb-uri` | `lancedb` | DB path. | +| `--lancedb-table` | `nv-ingest` | Source table. **Note**: this is `--lancedb-table` (with `lancedb-` prefix), unlike [[ingest]] / [[query]] / [[recall]] / [[vector-store]] which use `--table-name`. Must point at the same table either way. | +| `--query-csv` | — | Required. `query` (+ optional `answer`) columns. | +| `--output` | — | Required output JSON path. | +| `--top-k` | `5` | Chunks per query. | +| `--embedder` | `nvidia/llama-nemotron-embed-1b-v2` | Must match ingest embedder. | +| `--page-index FILE` | — | Enables full-page mode using `build-page-index` output. | + +## Common failure modes + +- **`run --from-env` errors with "RETRIEVAL_FILE not set"** — set every env + var the loader requires; `--from-env` is all-or-nothing. 
+- **`export` writes empty file** — embedder mismatch with the LanceDB table + (different dim) or `--query-csv` lacks a `query` column. +- **`build-page-index` is slow / OOM** — parquet directory is huge. Run on + a subset and merge JSONs, or run in a higher-memory environment. + +## Related + +- [[recall]] — retrieval-only metrics. +- [[harness]] — orchestrates `eval`/`recall` sweeps with sessions, tags, and + Slack reporting. +- [[compare]] — diff two eval runs. diff --git a/.claude/skills/nemo-retriever/references/harness.md b/.claude/skills/nemo-retriever/references/harness.md new file mode 100644 index 0000000000..c9d2c4e57d --- /dev/null +++ b/.claude/skills/nemo-retriever/references/harness.md @@ -0,0 +1,143 @@ +# retriever harness + +Benchmark / eval orchestration. Wraps [[recall]] / [[eval]] / +[[benchmark]] / [[pipeline]] runs into named *sessions* with tags, +artifacts, and (optionally) a web portal + history DB + Slack reporting. + +Subcommands: + +| Subcommand | What it does | +|---|---| +| `run` | One configured run against a dataset. | +| `sweep` | Multiple runs from a sweep YAML. | +| `nightly` | Curated nightly sweep; can post results to Slack. | +| `summary` | Print summary for a session. | +| `compare` | Diff two sessions. | +| `portal` | Launch the web portal. | +| `backfill` | Import existing `results.json` artifacts into the history DB. | +| `runner` | Runner agent (registers with a portal manager). | + +If flags below look stale, re-check `retriever harness --help`. + +## When to use this + +- You want reproducible, tagged eval/benchmark sessions you can come back + to later. +- You're triaging nightly regressions and want the session+Slack flow. +- You want to compare two sessions visually or via CLI. + +**Use a different command when:** + +- One-off run, no session bookkeeping → [[recall]] / [[eval]] / + [[benchmark]]. +- You're tuning extraction directly → [[pipeline]]. 
+ +## Canonical invocations + +Single run against a named dataset (preset from the config): + +```bash +retriever harness run \ + --dataset bo767 \ + --config nemo_retriever/harness/test-config.yaml \ + --run-name "baseline-2026-05-13" \ + --tag dataset=bo767 --tag model=llama-nemotron-embed-1b-v2 +``` + +Sweep: + +```bash +retriever harness sweep \ + --config nemo_retriever/harness/test-config.yaml \ + --runs-config nemo_retriever/harness/sweep-runs.yaml \ + --session-prefix sweep +``` + +Nightly with Slack: + +```bash +retriever harness nightly \ + --config nemo_retriever/harness/test-config.yaml \ + --runs-config nemo_retriever/harness/nightly-runs.yaml +``` + +Replay a previous run to Slack without rerunning: + +```bash +retriever harness nightly --replay runs/2026-05-12/session_summary.json +``` + +Compare two sessions: + +```bash +retriever harness compare runs/baseline/ runs/candidate/ +``` + +Print a session summary: + +```bash +retriever harness summary runs/2026-05-13/ +``` + +Launch the portal: + +```bash +retriever harness portal --host 0.0.0.0 --port 8100 +``` + +Backfill old artifacts into the history DB: + +```bash +retriever harness backfill --artifacts-dir runs/ --db harness-history.db +``` + +## Key flags + +`harness run`: + +| Flag | Notes | +|---|---| +| `--dataset` | Required. Dataset name (from config) or direct path. | +| `--preset` | Override the preset selection. | +| `--config` | Harness test config YAML. | +| `--run-name` | Label persisted in artifacts. | +| `--override KEY=VALUE` | Per-run config override (repeatable). | +| `--tag` | Tag persisted in artifacts (repeatable). | +| `--recall-required/--no-recall-required` | Override the recall-required gate. | + +`harness sweep` / `nightly`: + +| Flag | Notes | +|---|---| +| `--runs-config` | YAML listing the runs to execute. | +| `--preset` | Force preset for all runs. | +| `--session-prefix` | Directory prefix (sweep only). | +| `--tag` | Session-level tag (repeatable). 
| +| `--dry-run` | Print the plan, don't execute. | +| `--skip-slack` | Don't post to Slack (nightly only). | +| `--replay PATH` | Replay an existing session to Slack (nightly only). | + +## Outputs + +- Session directory containing per-run subdirectories, each with + `results.json`, configs, and logs. +- `session_summary.json` aggregating metrics. +- Optional rows in the history DB (`backfill` / `portal`). +- Optional Slack post (`nightly`). + +## Common failure modes + +- **`--dataset` not found** — name doesn't resolve in `--config`'s dataset + registry. Pass an absolute path or fix the name. +- **`Slack post failed`** — env vars missing; pass `--skip-slack` or + configure the webhook. +- **`portal` shows no runs** — history DB is empty. Run `backfill` once + against an artifacts root. +- **`recall-required` gate fails** — a run's recall@k dropped below + threshold; the session is marked failed. Investigate before overriding + with `--no-recall-required`. + +## Related + +- [[recall]] / [[eval]] / [[benchmark]] — the underlying runners. +- [[compare]] — non-harness JSON-level diff. diff --git a/.claude/skills/nemo-retriever/references/html.md b/.claude/skills/nemo-retriever/references/html.md new file mode 100644 index 0000000000..3e85f8d547 --- /dev/null +++ b/.claude/skills/nemo-retriever/references/html.md @@ -0,0 +1,73 @@ +# retriever html + +HTML extraction: `markitdown` converts HTML → Markdown, then tokenizer-split +into chunks. Writes `.html_extraction.json` sidecars in the standard +primitives shape. + +If flags below look stale, re-check `retriever html run --help`. + +## When to use this + +- You scraped a set of HTML pages and want them in the retriever pipeline. +- You want the same downstream contract as [[txt]] but for HTML inputs. + +**Use a different command when:** + +- Input is plain text → [[txt]]. +- You want to run full ingest end-to-end on HTML → [[pipeline]] with + `--input-type html`. 
+ +## Canonical invocations + +Default chunking: + +```bash +retriever html run --input-dir data/html/ +``` + +Smaller chunks with overlap: + +```bash +retriever html run --input-dir data/html/ --max-tokens 256 --overlap 32 +``` + +## Inputs + +- **`--input-dir DIR`** — required, scanned for `*.html`. + +## Outputs + +- `.html_extraction.json` per file (next to source by default, or in + `--output-dir`). +- Same primitives-like shape as stage5 input. + +## Downstream + +```bash +retriever local stage5 run --input-dir --pattern "*.html_extraction.json" +retriever local stage6 run --input-dir +``` + +Or [[pipeline]] with `--input-type html`. + +## Key flags + +| Flag | Default | Notes | +|---|---|---| +| `--max-tokens` | `512` | Per-chunk cap. | +| `--overlap` | `0` | Tokens of overlap. | +| `--encoding` | `utf-8` | HTML file encoding. | +| `--limit` | — | Cap number of files processed. | + +## Common failure modes + +- **Heavy boilerplate in chunks (nav menus, footers)** — `markitdown` is + intentionally low-magic. Strip nav/footer in a pre-step if it pollutes + retrieval. +- **JS-rendered pages produce near-empty output** — `markitdown` doesn't run + JS. Pre-render with a headless browser before feeding here. + +## Related + +- [[txt]] — sibling for plain-text inputs. +- [[pipeline]] — full extract → embed → VDB for HTML. diff --git a/.claude/skills/nemo-retriever/references/image.md b/.claude/skills/nemo-retriever/references/image.md new file mode 100644 index 0000000000..c4fecf079b --- /dev/null +++ b/.claude/skills/nemo-retriever/references/image.md @@ -0,0 +1,64 @@ +# retriever image + +Visualization helpers: render YOLOX page-element / chart-element detection +overlays on page images so you can sanity-check the detector by eye. + +If flags below look stale, re-check `retriever image render --help`. + +## When to use this + +- A page-element or chart detector returned suspect boxes and you want to + see them overlaid on the source page image. 
+- You're tuning thresholds and need a quick visual diff. + +**Use a different command when:** + +- You need the actual extraction output, not a picture → [[pdf]] or + [[chart]]. +- You want benchmarks over the detector → [[benchmark]] (`page-elements`). + +## Canonical invocations + +Overlay a single page: + +```bash +retriever image render image \ + page_001.png \ + page_001.detections.json \ + --output-path page_001.overlay.png +``` + +Overlay every page in a directory: + +```bash +retriever image render dir \ + pages/ detections/ overlays/ +``` + +## Inputs + +- **`render image`**: a PNG/JPEG `image_path` plus a `detections_path` JSON + (YOLOX-shaped output). +- **`render dir`**: parallel `input_dir` / `detections_dir`, output written + per-image to `output_dir`. Files are matched by basename. + +## Outputs + +- A single composite image with bounding boxes + class labels drawn on top + of the source. **Not** a side-by-side / split layout; if you want + original-vs-overlay panels, compose them yourself (e.g. via `ffmpeg + hstack` or `PIL`). `render image` writes to `--output-path`; `render dir` + writes into `output_dir`. + +## Common failure modes + +- **No boxes appear** — the detections JSON shape doesn't match what + `render` expects. Use the JSON that `retriever pdf stage page-elements` + (or [[pipeline]]) emitted, not a hand-rolled file. +- **Mismatched coordinates** — detections were produced against a different + page render scale than the image you're overlaying on. Re-render at the + same DPI/`render-mode` you ran the detector with. + +## Related + +- [[pdf]] — produce the detections JSON that this command renders. 
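Because `render dir` matches files by basename, a pre-check that lists images with no detections file can save a confusing partial run. The sketch assumes detections are named `<stem>.detections.json`, as in the single-image example in this reference. That naming is an assumption about `render dir`, so pass a different suffix as the third argument if your files differ. `missing_detections` is a hypothetical helper, not part of the CLI.

```bash
# Hypothetical helper: print the basename of every PNG/JPEG in IMG_DIR that
# has no "<stem><suffix>" file in DET_DIR.
missing_detections() {
  img_dir="$1"
  det_dir="$2"
  suffix="${3:-.detections.json}"   # assumed naming; override if needed
  for img in "$img_dir"/*.png "$img_dir"/*.jpg "$img_dir"/*.jpeg; do
    [ -f "$img" ] || continue       # skip patterns that matched nothing
    base="$(basename "$img")"
    stem="${base%.*}"
    if [ ! -f "$det_dir/$stem$suffix" ]; then
      printf '%s\n' "$base"
    fi
  done
}

# Example call (directory names from the invocation in this reference):
#   missing_detections pages/ detections/
```

Empty output means every image has a counterpart under the assumed naming.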
diff --git a/.claude/skills/nemo-retriever/references/ingest.md b/.claude/skills/nemo-retriever/references/ingest.md index 2427a7d856..88d604dfc6 100644 --- a/.claude/skills/nemo-retriever/references/ingest.md +++ b/.claude/skills/nemo-retriever/references/ingest.md @@ -93,7 +93,11 @@ The default `ingest` runs 8 stages, in order: CUDA-graph capture for the embedder. Subsequent runs in the same process are fast; one-shot CLI invocations always pay this cost. - **`No existing dataset at …/nv-ingest.lance, it will be created`** — expected - on the first ingest into a new DB. Subsequent ingests append. + on the first ingest into a new DB. Subsequent ingests **always append** — + there is no `--overwrite` flag on `retriever ingest`. To start fresh, + `rm -rf /.lance` before running. Alternatively, + use [[vector-store]] (`vector-store stage run --overwrite`) on the + embeddings stage of the [[local]] flow. - **HuggingFace download on first run** — the embedder and page-element detector pull weights to `~/.cache/huggingface`. Needs network the first time; cached afterwards. diff --git a/.claude/skills/nemo-retriever/references/local.md b/.claude/skills/nemo-retriever/references/local.md new file mode 100644 index 0000000000..2d9eb9bf87 --- /dev/null +++ b/.claude/skills/nemo-retriever/references/local.md @@ -0,0 +1,89 @@ +# retriever local + +Non-distributed, pandas-based runner that exposes the pipeline as discrete +numbered stages (`stage1` … `stage7`, plus `stage999` for post-mortem). +Stages are intentionally separable so you can rerun one without touching +the others. + +> The top-level group is registered as a placeholder; subcommands are +> contributed by per-stage modules. Run `retriever local --help` (or the +> per-stage `--help`) to see what's currently wired up in your install. + +## When to use this + +- You're iterating on a single stage (e.g. tweak chunking, rerun stage5, + re-upload stage6) without redoing extraction. 
+- You want to debug a specific stage with `pdb` / breakpoints — no Ray, no + actors, deterministic ordering. +- You need the intermediate sidecar files (per-stage JSON/parquet) for + inspection. + +**Use a different command when:** + +- You want full ingest in one command → [[ingest]] or [[pipeline]]. +- You need parallelism on a cluster → [[pipeline]] in batch mode. +- You want a long-running endpoint → [[service]]. + +## Pipeline stages (mapped to files) + +Stages live in `nemo_retriever/src/nemo_retriever/local/stages/`: + +| Stage | File | What it does | +|---|---|---| +| `stage1` | `stage1_pdf_extraction.py` | PDF extraction (same idea as [[pdf]]). | +| `stage2` | `stage2_infographic_extraction.py` | Infographic enrichment. | +| `stage3` | `stage3_table_extractor.py` | Table structure / OCR. | +| `stage4` | `stage4_chart_extractor.py` | Chart enrichment (same idea as [[chart]]). | +| `stage5` | `stage5_text_embeddings.py` | Text embedding → `*.text_embeddings.json`. | +| `stage6` | `stage6_vdb_upload.py` | LanceDB upload (same idea as [[vector-store]]). | +| `stage7` | `stage7_vdb_query.py` | Single-query lookup against LanceDB. | +| `stage999` | `stage999_post_mortem_analysis.py` | Post-run analysis. | + +Each stage's `run` reads sidecars matching a pattern (e.g. +`*.pdf_extraction.json` for stage5) and writes the next sidecar type. + +## Canonical flow + +```bash +# 1. extract +retriever local stage1 run --input-dir data/pdfs/ + +# 2. enrich (optional) +retriever local stage3 run --input-dir data/pdfs/ # tables +retriever local stage4 run --input-dir data/pdfs/ # charts + +# 3. embed +retriever local stage5 run --input-dir data/pdfs/ --pattern "*.pdf_extraction.json" + +# 4. upload to LanceDB +retriever local stage6 run --input-dir data/pdfs/ + +# 5. query +retriever local stage7 run --query "what is in chart 1?" +``` + +For txt/html, swap stage1 for [[txt]] / [[html]] and adjust stage5's +`--pattern`. 
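The HTML variant of the flow can be sketched as a small dry-run script. It only combines commands that already appear in these references (`retriever html run`, then `stage5` with the HTML pattern, then `stage6`). The `run` wrapper, the `html_flow` function, and the `RETRIEVER_EXECUTE` variable are hypothetical conveniences, not part of the CLI. The script prints the plan by default, so it is safe to try without models loaded.

```bash
# Print each command by default; set RETRIEVER_EXECUTE=1 to actually run it.
run() {
  if [ "${RETRIEVER_EXECUTE:-0}" = "1" ]; then
    "$@"
  else
    printf '%s\n' "$*"
  fi
}

# Hypothetical wrapper: extract HTML, embed the resulting sidecars, upload.
html_flow() {
  dir="$1"
  run retriever html run --input-dir "$dir"
  run retriever local stage5 run --input-dir "$dir" --pattern "*.html_extraction.json"
  run retriever local stage6 run --input-dir "$dir"
}

html_flow data/html/
```

This relies on the HTML sidecars being written next to the source files, which is the documented default. If you redirect them with `--output-dir`, point `stage5` at that directory instead.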
+ +## Inputs / outputs + +Each stage takes `--input-dir` (and stage-specific flags) and writes +sidecars next to source files. The pattern is consistent: stage N reads +stage N-1's output and writes its own type. + +## Common failure modes + +- **`stage5: no files matched pattern`** — `--pattern` defaults to + `*.pdf_extraction.json`; pass `*.txt_extraction.json` / + `*.html_extraction.json` for those inputs. +- **`stage6` overwrites a table I wanted to append to** — pass the + stage-appropriate flag, or use [[vector-store]] which has explicit + `--append`. +- **First `stage5` run is slow** — model load. Same trade-off as the + one-shot CLIs; reuse the process for multiple inputs in research scripts. + +## Related + +- [[pdf]] / [[chart]] / [[vector-store]] — standalone equivalents of + individual stages. +- [[pipeline]] — distributed graph version of the same flow. diff --git a/.claude/skills/nemo-retriever/references/pdf.md b/.claude/skills/nemo-retriever/references/pdf.md new file mode 100644 index 0000000000..dc4ef270f7 --- /dev/null +++ b/.claude/skills/nemo-retriever/references/pdf.md @@ -0,0 +1,122 @@ +# retriever pdf + +Single-stage PDF extraction: scan a directory of PDFs and write per-PDF +primitives JSON sidecars (text / table / chart / image / page-image rows), +without running embedding or vector-DB stages. + +If flags below look stale, re-check `retriever pdf stage page-elements --help`. + +## When to use this + +- You only need extraction output (primitives JSON) — no embeddings, no + LanceDB. Useful for debugging, comparing extraction methods, or feeding a + custom downstream pipeline. +- You want to swap extraction *methods* (pdfium, pdfium_hybrid, ocr, + nemotron_parse, tika) without rebuilding the whole pipeline. +- You need to point at a remote YOLOX / Nemotron Parse NIM rather than the + bundled embedded models. + +**Use a different command when:** + +- You want the full extract → embed → ingest flow → [[ingest]] or + [[pipeline]]. 
+- You want only chart enrichment over already-extracted primitives → + [[chart]]. +- You want to inspect extraction overlays visually → [[image]]. +- You want to benchmark extraction throughput → [[benchmark]] (`split` / + `extract` / `page-elements`). + +## Canonical invocations + +Default extraction (pdfium, text only) on a directory: + +```bash +retriever pdf stage page-elements --input-dir data/pdfs/ +``` + +Extract everything (text + tables + charts + images) via pdfium + remote +YOLOX: + +```bash +retriever pdf stage page-elements \ + --input-dir data/pdfs/ \ + --method pdfium \ + --yolox-http-endpoint http://page-elements:8000/v1/infer \ + --extract-text --extract-tables --extract-charts --extract-images +``` + +Use NemotronParse instead of pdfium+YOLOX: + +```bash +retriever pdf stage page-elements \ + --input-dir data/pdfs/ \ + --method nemotron_parse \ + --nemotron-parse-http-endpoint http://nemotron-parse:8000/v1/infer +``` + +Write all sidecars to a single output directory: + +```bash +retriever pdf stage page-elements \ + --input-dir data/pdfs/ \ + --json-output-dir out/extractions/ +``` + +## Inputs + +- **`--input-dir DIR`** — recursively scanned for `*.pdf`. Required (or via + `--config`). +- **`--config FILE`** — optional ingest YAML. Auto-discovered from + `./ingest-config.yaml` then `$HOME/.ingest-config.yaml`. CLI flags override + YAML values. + +## Outputs + +- One `.pdf_extraction.json` sidecar per input PDF, written next to the + PDF unless `--json-output-dir` is set. +- Each sidecar is a list of primitives. Per primitive: `text`, + `source_id`/`path`, `page_number`, `metadata` (type, bbox, render info). + +These sidecars are the canonical stage-1 input for the rest of the +non-distributed `local stage*` flow (`stage5` embed, `stage6` VDB upload). + +## Key flags + +| Flag | Default | Notes | +|---|---|---| +| `--method` | `pdfium` | `pdfium`, `pdfium_hybrid`, `ocr`, `nemotron_parse`, `tika`. 
| +| `--yolox-grpc-endpoint` / `--yolox-http-endpoint` | — | Required for `pdfium` family when extracting page elements. | +| `--nemotron-parse-grpc-endpoint` / `--nemotron-parse-http-endpoint` | — | Required for `method=nemotron_parse`. | +| `--extract-text/--extract-tables/--extract-charts/--extract-images/--extract-infographics/--extract-page-as-image` | text only | Toggle which primitives are written. | +| `--text-depth` | `page` | `page` or `document`. | +| `--render-mode` | `fit_to_model` | `full_dpi` (DPI-then-resize) or `fit_to_model` (≈93 DPI for US Letter). | +| `--limit` | — | Cap number of PDFs processed (debugging). | + +## Method cheat-sheet + +- **`pdfium`** — fast, native text + YOLOX-driven element detection. Default. +- **`pdfium_hybrid`** — pdfium text + OCR fallback per page where text + extraction was empty/sparse. +- **`ocr`** — render each page, OCR everything. Use for scanned PDFs. +- **`nemotron_parse`** — NemotronParse end-to-end (text + tables + charts + + layout) via a single NIM call. +- **`tika`** — Apache Tika fallback (no element detection). + +## Common failure modes + +- **`YOLOX endpoint is required for method='pdfium'`** — pass + `--yolox-grpc-endpoint` or `--yolox-http-endpoint`. Without it, only + `--extract-text` works. +- **Empty primitives for scanned PDFs with `--method pdfium`** — there's no + embedded text. Switch to `--method ocr` or `pdfium_hybrid`. +- **No sidecars written** — `--write-json-outputs/--no-write-json-outputs` + toggles output. Default is on; check you didn't disable it via `--config`. +- **`auth-token` errors against NGC NIMs** — set `--auth-token` or + `NVIDIA_API_KEY` in the environment. + +## Related + +- [[chart]] — enrich the primitives from this stage with chart parsing. +- [[ingest]] — full pipeline that wraps this stage end-to-end. +- [[pipeline]] — graph-based pipeline exposing per-stage knobs. +- [[benchmark]] — measure throughput of this stage. 
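To compare extraction methods side by side (one of the use cases listed under "When to use this"), each method can write into its own output directory. This is a sketch only: `print_method_sweep` is a hypothetical helper that prints the commands rather than running them, it uses just the `--input-dir`, `--method`, and `--json-output-dir` flags documented above, and it assumes paths without spaces.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print one extraction command per method so that each
# method's sidecars land in a separate directory. Nothing is executed here;
# review the output, then pipe it to `bash` to actually run the commands.
# Assumes the paths contain no spaces (they are printed unquoted).
print_method_sweep() {
  local input_dir="$1" out_root="$2"
  shift 2
  local method
  for method in "$@"; do
    printf 'retriever pdf stage page-elements --input-dir %s --method %s --json-output-dir %s\n' \
      "$input_dir" "$method" "$out_root/$method/"
  done
}

print_method_sweep data/pdfs/ out/extractions pdfium pdfium_hybrid ocr
```

Methods that need an endpoint (for example `nemotron_parse`) would also need the matching endpoint flag from the Key flags table, which this sketch leaves out.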
diff --git a/.claude/skills/nemo-retriever/references/pipeline-stages.md b/.claude/skills/nemo-retriever/references/pipeline-stages.md new file mode 100644 index 0000000000..25ae629992 --- /dev/null +++ b/.claude/skills/nemo-retriever/references/pipeline-stages.md @@ -0,0 +1,65 @@ +# pipeline stages + +Cross-reference for the **internal pipeline stages** (page-elements, ocr, +table-structure, graphic-elements, embed, caption, dedup, store). These are +not top-level CLI commands of their own; they're surfaced as: + +1. Flag groups under [[pipeline]] (`pipeline run`). +2. Stand-alone benchmark subcommands under [[benchmark]]. +3. In some cases, dedicated subcommands under other groups (e.g. + page-elements lives under `retriever pdf stage page-elements`). + +Use this page to figure out *which* command to reach for when you want to +exercise or tune a specific stage. + +## Stage map + +| Stage | What it does | Tuned via | Benchmarked via | Standalone CLI | +|---|---|---|---|---| +| **pdf-split** | Split PDFs into per-page tasks | `--pdf-split-batch-size` | `retriever benchmark split run` | — | +| **pdf-extract** | Native PDF text/structure extraction | `--method`, `--pdf-extract-*` | `retriever benchmark extract run` | [[pdf]] | +| **page-elements** | YOLOX text/table/chart/image detection | `--page-elements-invoke-url`, `--page-elements-actors`, `--page-elements-batch-size`, `--page-elements-{cpus,gpus}-per-actor` | `retriever benchmark page-elements run` | `retriever pdf stage page-elements` (see [[pdf]]) | +| **ocr** | OCR for sparse text regions | `--ocr-invoke-url`, `--ocr-version` (`v1`/`v2`), `--ocr-{actors,batch-size,cpus-per-actor,gpus-per-actor}` | `retriever benchmark ocr run` | — | +| **table-structure** | Structured OCR over detected tables | `--use-table-structure`, `--table-structure-invoke-url`, `--table-output-format` | — | (`nemo_retriever.table.commands` exposes `run-structure-ocr` under the table sub-app where wired) | +| **graphic-elements** | 
Chart parsing | `--use-graphic-elements`, `--graphic-elements-invoke-url`, `--extract-charts` | — | [[chart]] | +| **infographic** | Infographic parsing | `--extract-infographics` | — | (`retriever local stage2`) | +| **dedup** | IoU-based primitive dedup | `--dedup/--no-dedup`, `--dedup-iou-threshold` | — | — | +| **caption** | VLM caption for image primitives | `--caption/--no-caption`, `--caption-invoke-url`, `--caption-model-name`, `--caption-temperature`, `--caption-top-p`, `--caption-max-tokens` | — | — | +| **udf** | User-defined transforms (passthrough by default) | (code) | — | — | +| **embed** | Embed primitives | `--embed-invoke-url`, `--embed-model-name`, `--embed-modality`, `--embed-granularity`, `--embed-{actors,batch-size,cpus-per-actor,gpus-per-actor}`, `--local-ingest-embed-backend` | — | `retriever local stage5` | +| **audio-extract** | Chunk media + ASR | `--segment-audio`, `--audio-split-type`, `--audio-split-interval`, `--audio-match-tolerance`, audio NIM env | `retriever benchmark audio-extract run` | [[audio]] | +| **store (VDB)** | Write embeddings to LanceDB | `--store-actors`, `--lancedb-uri`, `--table-name` (set on [[ingest]] / [[vector-store]]) | — | [[vector-store]] | +| **query** | Embed query + search | (read side) | — | [[query]] / [[recall]] | + +## Choosing the right entry point + +- **"I want to ingest a corpus end-to-end"** → [[ingest]] (defaults) or + [[pipeline]] (per-stage control). +- **"I only want this one stage's output"** → the *Standalone CLI* column. +- **"I want to know how fast this stage is on this machine"** → the + *Benchmarked via* column. +- **"I want to route this stage through a NIM"** → set the matching + `--*-invoke-url` on [[pipeline]] (and `--api-key`). +- **"I want to size Ray actors for this stage"** → tune the + `--*-actors` / `--*-batch-size` / + `--*-{cpus,gpus}-per-actor` quartet on [[pipeline]]. 
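The "Benchmarked via" column of the stage map can be expressed as a small lookup, which may be handy in scripts that profile one stage at a time. This is a sketch: `benchmark_cmd_for_stage` is a hypothetical helper, and it covers only the stages for which the table above lists a benchmark subcommand.

```bash
#!/usr/bin/env bash
# Hypothetical helper: map a stage name from the stage map to the benchmark
# subcommand listed for it. Stages with no benchmark entry in the table
# return a non-zero status and print a note on stderr.
benchmark_cmd_for_stage() {
  case "$1" in
    pdf-split)     echo "retriever benchmark split run" ;;
    pdf-extract)   echo "retriever benchmark extract run" ;;
    page-elements) echo "retriever benchmark page-elements run" ;;
    ocr)           echo "retriever benchmark ocr run" ;;
    audio-extract) echo "retriever benchmark audio-extract run" ;;
    *)
      echo "no benchmark subcommand listed for stage: $1" >&2
      return 1
      ;;
  esac
}

benchmark_cmd_for_stage page-elements
```

The helper only prints the subcommand; any flags it needs should come from that subcommand's `--help`.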
+ +## Stage ordering + +Default order for a PDF input under [[pipeline]] / [[ingest]]: + +``` +pdf-split → pdf-extract → page-elements → ocr + → (table-structure) (graphic-elements) (infographic) + → dedup → (caption) → udf → embed → store +``` + +Audio swaps the head: `audio-extract` (chunk + ASR) replaces +pdf-split/pdf-extract/page-elements/ocr; the tail (embed, store) is the +same. Txt/html similarly replace the head with [[txt]] / [[html]]. + +## Related + +- [[pipeline]] — the command that wires every stage above together. +- [[benchmark]] — per-stage rows/sec. +- [[local]] — non-distributed, file-per-stage version of the same flow. diff --git a/.claude/skills/nemo-retriever/references/pipeline.md b/.claude/skills/nemo-retriever/references/pipeline.md new file mode 100644 index 0000000000..6de2ae6780 --- /dev/null +++ b/.claude/skills/nemo-retriever/references/pipeline.md @@ -0,0 +1,148 @@ +# retriever pipeline + +Graph-based end-to-end ingestion pipeline. Same outcome as [[ingest]] +(documents → LanceDB) but exposes per-stage knobs for extraction methods, +NIM endpoints, Ray actor counts, embedding model, dedup/caption, audio/video +options, and storage. + +Use `retriever pipeline run --help` to see *all* flag groups — there are +many. This page covers the groups and the most-used flags within each. + +## When to use this + +- You need fine-grained control over a pipeline stage (e.g. swap the OCR + model, set per-actor GPU fractions, route through a remote NIM, use a + different embedder). +- You're tuning throughput on a Ray cluster and need actor / batch-size + knobs. +- You want to ingest non-PDF inputs (audio, video, txt, html, image) + through the same graph. + +**Use a different command when:** + +- Defaults are fine → [[ingest]] (one flag, same outcome). +- You only need a single stage's output → [[pdf]], [[chart]], [[audio]], + [[txt]], [[html]]. +- You want long-running service mode → [[service]]. 
+- You want a non-Ray local debug runner → [[local]]. +- You want throughput numbers per stage → [[benchmark]]. + +## Canonical invocations + +Default batch ingest of a PDF directory: + +```bash +retriever pipeline run data/pdfs/ +``` + +In-process (no Ray) for quick local runs: + +```bash +retriever pipeline run data/pdfs/ --run-mode inprocess +``` + +Ingest audio: + +```bash +retriever pipeline run data/audio/ --input-type audio +``` + +Route through remote NIMs (no local GPU). Note: `--use-table-structure` and +`--use-graphic-elements` default to **off** — passing the matching +`--*-invoke-url` alone is not enough; the `--use-*` flag must also be set +to enable that stage. + +```bash +retriever pipeline run data/pdfs/ \ + --page-elements-invoke-url http://page-elements:8000/v1/infer \ + --ocr-invoke-url http://ocr:8000/v1/infer \ + --use-table-structure \ + --table-structure-invoke-url http://table-structure:8000/v1/infer \ + --use-graphic-elements \ + --graphic-elements-invoke-url http://graphic-elements:8000/v1/infer \ + --embed-invoke-url http://embed:8000/v1/embed \ + --api-key "$NVIDIA_API_KEY" +``` + +Tune Ray actor counts for a busy stage: + +```bash +retriever pipeline run data/pdfs/ \ + --page-elements-actors 4 --page-elements-gpus-per-actor 0.5 \ + --ocr-actors 2 --ocr-gpus-per-actor 1.0 \ + --embed-actors 1 --embed-batch-size 64 +``` + +## Inputs + +- **Positional `INPUT_PATH`** — file or directory of documents. Required. +- **`--input-type`** — `pdf` (default) / `doc` / `txt` / `html` / `image` / + `audio`. + +## Outputs + +- LanceDB table populated by the `IngestVdbOperator` sink (defaults + `lancedb/nv-ingest.lance`). See [[query]] for reading. +- If `--store-images-uri` is set, extracted images are also persisted there. + +## Flag groups (from `--help`) + +| Group | What it controls | +|---|---| +| **I/O and Execution** | `--run-mode` (`batch` / `inprocess` / `service`), `--input-type`, `--debug`, `--log-file`. 
| +| **PDF / Document Extraction** | `--method`, `--dpi`, `--extract-text/--extract-tables/--extract-charts/--extract-infographics/--extract-page-as-image`, `--use-graphic-elements`, `--use-table-structure`, `--table-output-format`. | +| **Remote NIM Endpoints** | `--api-key`, plus `--*-invoke-url` for `page-elements`, `ocr`, `graphic-elements`, `table-structure`, `embed`. `--ocr-version v1/v2`. | +| **Embedding** | `--embed-model-name`, `--embed-modality`, `--embed-granularity`, `--local-ingest-embed-backend` (`vllm`/`hf`), `--text-elements-modality`, `--structured-elements-modality`. | +| **Dedup and Caption** | `--dedup/--no-dedup`, `--dedup-iou-threshold`, `--caption/--no-caption`, `--caption-invoke-url`, `--caption-model-name`, GPU fractions, `--caption-temperature`/`--caption-top-p`/`--caption-max-tokens`. | +| **Storage and Text Chunking** | `--store-images-uri`, `--text-chunk`, `--text-chunk-max-tokens`, `--text-chunk-overlap-tokens`. | +| **Ray / Batch Tuning** | `--ray-address`, per-stage `*-actors`/`*-batch-size`/`*-cpus-per-actor`/`*-gpus-per-actor` for `page-elements`, `ocr`, `embed`, `nemotron-parse`, plus `--store-actors`, `--pdf-split-batch-size`, `--pdf-extract-*`. | +| **Audio** | `--segment-audio`, `--audio-split-type`/`--audio-split-interval`, `--audio-match-tolerance`. | +| **Video** | `--video-extract-audio`, video-specific split/sampling flags. | + +## Pipeline stages (what runs end-to-end) + +For a PDF input with all defaults, the graph runs roughly: + +1. **PDFSplitActor** — split into per-page tasks. +2. **PDFExtractionActor** — native text/structure extraction. +3. **PageElementDetectionActor** — YOLOX detects text/table/chart/image + regions. Tunable via `--page-elements-*` flags. +4. **OCRV2Actor** / OCRActor — OCR text where extraction is sparse. Tunable + via `--ocr-*` flags; `--ocr-version v1` for the legacy engine. +5. 
**(optional) TableStructureActor** — structured-OCR on detected tables + when `--use-table-structure` is set; route via + `--table-structure-invoke-url`. +6. **(optional) GraphicElementsActor** — chart enrichment when + `--use-graphic-elements`; route via `--graphic-elements-invoke-url`. +7. **(optional) CaptionActor** — VLM captioning when `--caption`. +8. **UDFOperator** — user-defined transforms (passthrough by default). +9. **EmbedActor** — embed primitives. Tunable via `--embed-*` flags. +10. **IngestVdbOperator (StoreOperator)** — write to LanceDB. + +Each stage has its own `--*-invoke-url` for routing to a NIM, and (in batch +mode) `--*-actors` / `--*-batch-size` / `--*-cpus-per-actor` / +`--*-gpus-per-actor` for resource sizing. + +## Common failure modes + +- **Stage saturates and stalls** — bump `--*-actors` and/or + `--*-batch-size`. Use [[benchmark]] to find the bottleneck stage + first. +- **"No GPU available" with `--run-mode batch`** — set + `--*-gpus-per-actor 0` for stages you want on CPU, or pass + `--*-invoke-url` to offload to a NIM. +- **Embedding mismatch on read** — `--embed-model-name` differs from what + [[query]] uses. Keep ingest and query embedders aligned. +- **Output table empty** — input matched no files for `--input-type`. Check + globs and file extensions. +- **Tables / charts not appearing in output despite `--*-invoke-url` set** + — `--use-table-structure` / `--use-graphic-elements` default to off. + Setting the invoke URL alone does *not* enable the stage; pass the + `--use-*` flag too. + +## Related + +- [[ingest]] — defaults-only wrapper around this command. +- [[local]] — non-distributed runner for debugging stages. +- [[service]] — long-running pipeline behind an HTTP API. +- [[benchmark]] — per-stage throughput numbers. 
diff --git a/.claude/skills/nemo-retriever/references/recall.md b/.claude/skills/nemo-retriever/references/recall.md new file mode 100644 index 0000000000..75d24815cf --- /dev/null +++ b/.claude/skills/nemo-retriever/references/recall.md @@ -0,0 +1,87 @@ +# retriever recall + +Batch query + recall@k evaluation. Reads a CSV of ground-truth queries, +embeds each query, searches a LanceDB table, prints per-query hits, and +computes recall@1 / @5 / @10. + +If flags below look stale, re-check `retriever recall vdb-recall run --help`. + +## When to use this + +- You have labelled `(query, pdf, page)` ground truth and want recall + metrics for a retrieval setup. +- Sweeping embedding models / chunking / top-k against a fixed query set. + +**Use a different command when:** + +- You want a single ad-hoc lookup → [[query]]. +- You want full QA quality (answer grading), not just retrieval recall → + [[eval]]. +- You want to compare two recall runs → [[compare]]. + +## Canonical invocations + +Default recall against the project query set: + +```bash +retriever recall vdb-recall run +``` + +Custom query CSV + custom table: + +```bash +retriever recall vdb-recall run \ + --query-csv my-queries.csv \ + --top-k 10 \ + --lancedb-uri ./my-lancedb \ + --table-name my-corpus +``` + +Route embedding through a remote NIM: + +```bash +retriever recall vdb-recall run \ + --query-csv my-queries.csv \ + --embedding-http-endpoint http://embed:8000/v1/embed +``` + +## Inputs + +- **`--query-csv FILE`** — CSV with `query,pdf_page` or `query,pdf,page` + columns. Default `bo767_query_gt.csv`. + +## Outputs + +- Per-query top-k hits printed to stdout. +- A summary line with `recall@1 / @5 / @10`. + +`recall@10` always queries with `search_k = max(top_k, 10)` so the metric +remains valid even when you display fewer hits. + +## Key flags + +| Flag | Default | Notes | +|---|---|---| +| `--query-csv` | `bo767_query_gt.csv` | Ground-truth CSV. 
| +| `--top-k` | `5` | Hits shown per query (recall@10 still computed). | +| `--lancedb-uri` | `lancedb` | Must match [[ingest]] / [[vector-store]]. | +| `--table-name` | `nv-ingest` | Same. | +| `--vector-column` | `vector` | Column to search. | +| `--embedding-endpoint` / `--embedding-http-endpoint` / `--embedding-grpc-endpoint` | — | Remote query embedder. Falls back to local HF if all unset. | +| `--limit` | — | Cap queries (debug). | + +## Common failure modes + +- **`recall@10 = 0.0`** — query embedder doesn't match the ingest embedder + (different model / dim). Re-ingest with the same embedder or pass the + matching `--embedding-*-endpoint`. +- **`KeyError: 'pdf_page'`** — CSV uses `pdf,page` instead. The command + accepts either schema, but typos in column names break both. +- **Slow first run** — local HF embedder cold-start. Reuse a single process + or hit a warm NIM. + +## Related + +- [[query]] — ad-hoc retrieval against the same table. +- [[eval]] — adds answer-quality grading on top of retrieval. +- [[compare]] — diff two retrieval runs. diff --git a/.claude/skills/nemo-retriever/references/service.md b/.claude/skills/nemo-retriever/references/service.md new file mode 100644 index 0000000000..e02b0b3886 --- /dev/null +++ b/.claude/skills/nemo-retriever/references/service.md @@ -0,0 +1,100 @@ +# retriever service + +Long-running ingest service: an HTTP/SSE server that accepts document +uploads and runs the pipeline behind the scenes. Two subcommands: + +- `retriever service start` — boot the server. +- `retriever service ingest` — client that uploads files to a running + server. + +If flags below look stale, re-check `retriever service --help`. + +## When to use this + +- You want a single warm process serving many ingest requests (avoids the + one-shot CLI startup cost — vLLM load, CUDA-graph capture). +- You want to ingest from a remote machine / orchestrator without copying + files onto a GPU host every time. 
+- You want to point [[pipeline]] at a remote pipeline via + `--run-mode service`. + +**Use a different command when:** + +- One-shot ingest → [[ingest]] / [[pipeline]]. +- Local debugging / no service → [[local]]. + +## Canonical invocations + +Start with a YAML config: + +```bash +retriever service start --config deploy/retriever-service.yaml +``` + +Start with inline flags (overrides any YAML): + +```bash +retriever service start \ + --host 0.0.0.0 --port 7670 \ + --gpu-devices 0,1 \ + --nim-api-key "$NVIDIA_API_KEY" \ + --api-token "$NEMO_RETRIEVER_API_TOKEN" +``` + +Upload files to a running server (SSE streaming progress): + +```bash +retriever service ingest --server-url http://localhost:7670 data/pdfs/*.pdf +``` + +Polling instead of SSE (firewalled environments): + +```bash +retriever service ingest --no-sse --poll-interval 5.0 data/pdfs/foo.pdf +``` + +## Inputs / outputs + +- **`start`** — no inputs; serves until killed. +- **`ingest`** — one or more file paths, streamed/polled to completion. + Prints per-file status. + +## Key flags + +`service start`: + +| Flag | Notes | +|---|---| +| `--config -c` | Path to `retriever-service.yaml`. | +| `--host` / `--port -p` | Bind address. Default per YAML. | +| `--log-level` / `--log-file` | Logging overrides. | +| `--nim-api-key` | NIM bearer (also `$NVIDIA_API_KEY`). | +| `--gpu-devices` | CSV GPU IDs. | +| `--api-token` | Bearer required on every request (also `$NEMO_RETRIEVER_API_TOKEN`). Unset = no auth. | + +`service ingest`: + +| Flag | Default | Notes | +|---|---|---| +| `--server-url -s` | `http://localhost:7670` | Server base URL. | +| `--sse / --no-sse` | `sse` | Stream progress or poll. | +| `--poll-interval` | `2.0` s | Polling cadence when `--no-sse`. | +| `--concurrency` | `8` | Max concurrent uploads. | +| `--api-token` | from `$NEMO_RETRIEVER_API_TOKEN` | Auto-falls back to the env var; pass the flag only to override. 
| + +## Common failure modes + +- **`401 Unauthorized`** — server has `--api-token` set; the client must + match (`--api-token` or `$NEMO_RETRIEVER_API_TOKEN`). +- **Hangs on first request after boot** — model warmup. First request can + take 30–60s; subsequent ones are sub-second. +- **`Connection refused`** — server binds `0.0.0.0` but firewall blocks the + port. Tunnel or open the port. +- **CUDA OOM under concurrency** — drop client `--concurrency`, or reduce + per-stage actor counts in the server YAML. + +## Related + +- [[pipeline]] with `--run-mode service` — pipeline CLI that delegates to a + running service. +- [[ingest]] — local one-shot equivalent. diff --git a/.claude/skills/nemo-retriever/references/txt.md b/.claude/skills/nemo-retriever/references/txt.md new file mode 100644 index 0000000000..b81867daa6 --- /dev/null +++ b/.claude/skills/nemo-retriever/references/txt.md @@ -0,0 +1,78 @@ +# retriever txt + +Plain-text extraction: scan a directory for `*.txt`, tokenizer-split each +file into chunks, and write `.txt_extraction.json` sidecars in the +same primitives shape as the rest of the pipeline. + +If flags below look stale, re-check `retriever txt run --help`. + +## When to use this + +- You have plain-text corpora (logs, scraped articles, transcripts) and want + to feed them into embed → VDB downstream stages. +- Quick way to seed a LanceDB table for retrieval experiments without going + through PDF rendering. + +**Use a different command when:** + +- Input is HTML → [[html]]. +- Input is PDF/audio/etc → [[pdf]], [[audio]], or the unified [[pipeline]] + with `--input-type txt`. + +## Canonical invocations + +Default chunking (512 tokens, no overlap): + +```bash +retriever txt run --input-dir data/text/ +``` + +Smaller chunks with overlap: + +```bash +retriever txt run --input-dir data/text/ --max-tokens 256 --overlap 32 +``` + +## Inputs + +- **`--input-dir DIR`** — required, scanned for `*.txt`. 
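To get a rough feel for how `--max-tokens` and `--overlap` interact, the chunk count for a file can be estimated with integer arithmetic. This is a sketch under a loud assumption: it models the chunker as a sliding window that advances by `max-tokens - overlap` tokens per chunk, which these pages do not confirm, so the real tokenizer-based splitter may produce different counts. `estimate_chunks` is a hypothetical helper, not a `retriever` command.

```bash
#!/usr/bin/env bash
# Hypothetical helper: estimate how many chunks a file of TOTAL tokens yields.
# ASSUMPTION: a sliding window of MAX tokens that advances by (MAX - OVERLAP)
# tokens per chunk. The actual splitter may behave differently.
estimate_chunks() {
  local total="$1" max="$2" overlap="$3"
  local stride=$(( max - overlap ))
  if [ "$stride" -le 0 ]; then
    echo "overlap must be smaller than max tokens" >&2
    return 1
  fi
  if [ "$total" -le 0 ]; then
    echo 0            # empty input: no chunks
    return 0
  fi
  if [ "$total" -le "$max" ]; then
    echo 1            # everything fits in a single chunk
    return 0
  fi
  # One full first window, then ceil((total - max) / stride) further windows.
  echo $(( 1 + (total - max + stride - 1) / stride ))
}

estimate_chunks 1000 512 0
estimate_chunks 1000 256 32
```

Under this model, adding overlap shrinks the stride, so the same file produces more chunks and more rows to embed downstream.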
+ +## Outputs + +- `.txt_extraction.json` per file (next to source by default, or in + `--output-dir` if set). +- Same primitives-like shape as stage5 input: `text`, `path`, `page_number` + (always 0 for txt), `metadata`. + +## Downstream + +After this, run (as the `--help` text instructs): + +```bash +retriever local stage5 run --input-dir DIR --pattern "*.txt_extraction.json" +retriever local stage6 run --input-dir DIR +``` + +Or pipe straight through [[pipeline]] with `--input-type txt`. + +## Key flags + +| Flag | Default | Notes | +|---|---|---| +| `--max-tokens` | `512` | Hard cap per chunk. | +| `--overlap` | `0` | Token overlap between consecutive chunks. | +| `--encoding` | `utf-8` | File read encoding. | +| `--limit` | — | Cap number of files processed. | + +## Common failure modes + +- **Empty output files** — input `.txt` is empty or all-whitespace; the + tokenizer produced 0 chunks. +- **Mojibake in extracted text** — wrong `--encoding`; try `latin-1` or + `utf-16` for legacy files. + +## Related + +- [[html]] — sibling command for HTML inputs. +- [[pipeline]] — wraps txt extraction + embed + VDB in one command. +- [[vector-store]] — upload the resulting embeddings. diff --git a/.claude/skills/nemo-retriever/references/vector-store.md b/.claude/skills/nemo-retriever/references/vector-store.md new file mode 100644 index 0000000000..391ae3d17f --- /dev/null +++ b/.claude/skills/nemo-retriever/references/vector-store.md @@ -0,0 +1,85 @@ +# retriever vector-store + +LanceDB upload stage: take a directory of `*.text_embeddings.json` files +(produced by the local `stage5` embedder) and load them into a LanceDB +table, optionally creating an IVF index. + +If flags below look stale, re-check `retriever vector-store stage run --help`. + +## When to use this + +- You ran embedding offline (e.g. via [[local]] stage5 or a custom embed + job) and now want the vectors searchable. +- You want to (re)build a LanceDB index over existing embedding sidecars. 
+ +**Use a different command when:** + +- You want full ingest in one shot → [[ingest]] or [[pipeline]] (their last + stage already does this). +- You want to *query* an existing table → [[query]] / [[recall]]. + +## Canonical invocations + +Upload + index with defaults (overwrites the table): + +```bash +retriever vector-store stage run --input-dir out/embeddings/ +``` + +Append rather than overwrite, into a custom DB/table: + +```bash +retriever vector-store stage run \ + --input-dir out/embeddings/ \ + --lancedb-uri ./my-lancedb \ + --table-name my-corpus \ + --append +``` + +Skip indexing (faster upload, but slower searches afterwards): + +```bash +retriever vector-store stage run --input-dir out/embeddings/ --no-create-index +``` + +## Inputs + +- **`--input-dir DIR`** — required. Contains `*.text_embeddings.json` files. + `--recursive` to scan subdirectories. + +## Outputs + +- LanceDB table at `LANCEDB_URI/TABLE_NAME.lance`. Defaults + `lancedb/nv-ingest.lance` — matches [[ingest]] / [[query]] defaults. +- Each row carries `vector`, `pdf_basename`, `page_number`, `path`, + `source_id`, and the original primitive metadata. + +## Key flags + +| Flag | Default | Notes | +|---|---|---| +| `--recursive` | off | Walk subdirectories of `--input-dir`. | +| `--lancedb-uri` | `lancedb` | DB path/URI. | +| `--table-name` | `nv-ingest` | Table name (must match [[query]]). | +| `--overwrite/--append` | `overwrite` | Replace or extend existing table. | +| `--create-index/--no-create-index` | `create-index` | Build vector index after upload. | +| `--index-type` | `IVF_HNSW_SQ` | LanceDB index type. | +| `--metric` | `l2` | Distance metric (must match how you'll search). | +| `--num-partitions` | `16` | IVF partitions. Clamped down for tiny tables. | +| `--num-sub-vectors` | `256` | PQ sub-vectors. | + +## Common failure modes + +- **`Clamping num_partitions from 16 to N`** — informational; index needs + partitions < row count. Happens on small uploads. 
+- **`Table already exists`** with `--append` returning unexpected rows — + `--append` does not dedupe. Run [[query]] / inspect the table if you + suspect duplicates. +- **Query results look bad after upload** — metric mismatch between this + stage's `--metric` and what [[query]] uses (`l2` everywhere by default). + +## Related + +- [[query]] — search the table this command writes. +- [[recall]] — batch query + recall metrics over a CSV of ground truth. +- [[pipeline]] — full ingest that uses this stage as its sink.