Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
33 changes: 30 additions & 3 deletions .claude/skills/nemo-retriever/SKILL.md
Original file line number Diff line number Diff line change
Expand Up @@ -13,9 +13,36 @@ If no arguments are provided, run `retriever --help` and summarize the available

## Subcommand references

For per-subcommand details (when to use it, canonical invocations, inputs/outputs, flags, common failure modes), read the matching file in `references/` *before* running anything non-trivial:
For per-subcommand details (when to use it, canonical invocations, inputs/outputs, flags, common failure modes), read the matching file in `references/` *before* running anything non-trivial.

- `references/ingest.md` — `retriever ingest`: PDFs → LanceDB (full pipeline).
End-to-end / search:

- `references/ingest.md` — `retriever ingest`: docs → LanceDB (full pipeline, defaults).
- `references/query.md` — `retriever query`: text query → top-k LanceDB hits.
- `references/pipeline.md` — `retriever pipeline run`: graph-based end-to-end with per-stage knobs.
- `references/service.md` — `retriever service`: long-running ingest service + client.
- `references/local.md` — `retriever local stage{1..7}`: non-distributed per-stage runner.

Per-input-type extractors:

- `references/pdf.md` — `retriever pdf stage page-elements`: PDF → primitives JSON.
- `references/chart.md` — `retriever chart stage run` / `graphic-elements`: chart enrichment.
- `references/audio.md` — `retriever audio extract` / `discover`: chunk + ASR.
- `references/txt.md` — `retriever txt run`: plain-text chunking.
- `references/html.md` — `retriever html run`: HTML → markdown → chunks.
- `references/image.md` — `retriever image render`: detection overlay visualization.

Storage and evaluation:

- `references/vector-store.md` — `retriever vector-store stage run`: embeddings → LanceDB.
- `references/recall.md` — `retriever recall vdb-recall run`: recall@k over a query CSV.
- `references/eval.md` — `retriever eval run` / `export` / `build-page-index`: QA evaluation.
- `references/benchmark.md` — `retriever benchmark <stage> run`: per-stage rows/sec.
- `references/harness.md` — `retriever harness run` / `sweep` / `nightly` / `portal` / …: sessioned orchestration.
- `references/compare.md` — `retriever compare`: JSON / results-bundle diffs.

Cross-cutting:

- `references/pipeline-stages.md` — map of the internal pipeline stages (page-elements, ocr, table-structure, graphic-elements, embed, caption, dedup, store, …) → which CLI command exposes each.

Additional per-stage references (`pdf`, `chart`, `image`, `audio`, `txt`, `html`, `pipeline`, `vector-store`, `recall`, `eval`, `benchmark`, `service`, `local`, `compare`, `harness`) will be added as those stages stabilize. Until then, fall back to `retriever <subcommand> --help` for any subcommand not listed above.
If a subcommand isn't listed above, fall back to `retriever <subcommand> --help`.
100 changes: 100 additions & 0 deletions .claude/skills/nemo-retriever/references/audio.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,100 @@
# retriever audio

Audio / video extraction stage: chunk media files, run ASR (Parakeet locally
or a remote Riva/NIM endpoint), and write extraction JSON sidecars in the
same primitives shape as [[pdf]].

If flags below look stale, re-check `retriever audio extract --help`.

## When to use this

- You have audio (`.mp3`, `.wav`) or video files and want ASR transcripts
fed into the rest of the retriever pipeline.
- You want to verify mount/path layout before kicking off a long ASR run →
use `retriever audio discover` (no ASR, just lists what would be
processed).

**Use a different command when:**

- You want full ingest including audio → [[pipeline]] with
`--input-type audio` or [[ingest]] once it accepts audio inputs.
- You want to benchmark ASR throughput → [[benchmark]] (`audio-extract`).

## Canonical invocations

Dry-run discovery:

```bash
retriever audio discover --input-dir data/audio/
```

Local Parakeet ASR over `*.mp3`/`*.wav` (default globs):

```bash
retriever audio extract --input-dir data/audio/
```

Cloud ASR via NIM env vars:

```bash
export NGC_API_KEY=...
export AUDIO_FUNCTION_ID=...
retriever audio extract --input-dir data/audio/ --use-env-asr
```

Override the gRPC endpoint explicitly:

```bash
retriever audio extract \
--input-dir data/audio/ \
--audio-grpc-endpoint riva-asr:50051 \
--auth-token "$NVIDIA_API_KEY"
```

Process video too, extracting audio first:

```bash
retriever audio extract --input-dir data/media/ --glob "*.mp4" --audio-only
```

## Inputs

- **`--input-dir DIR`** — required, scanned (non-recursive) for files
matching `--glob`.
- **`--glob`** — comma-separated patterns. Default `*.mp3,*.wav`.

## Outputs

- One `<file>.audio_extraction.json` sidecar per source file (default; toggle
with `--write-json/--no-write-json`).
- Sidecar shape mirrors PDF primitives (`text`, `source_id`, `metadata`),
with `metadata.content_metadata.type == "text"` per ASR chunk.

## Key flags

| Flag | Default | Notes |
|---|---|---|
| `--split-type` | `size` | `size` (bytes), `time` (seconds), or `frame`. |
| `--split-interval` | `450` | Chunk size in the chosen units. |
| `--audio-only` | off | Extract audio track from video first, then chunk. |
| `--video-audio-separate` | off | Emit the extracted MP3 as its own item. |
| `--use-env-asr` | on | Build ASR params from `AUDIO_GRPC_ENDPOINT`/`NGC_API_KEY`/`AUDIO_FUNCTION_ID`. |
| `--audio-grpc-endpoint` | — | Override env; sets remote ASR. Wins over `--use-env-asr`. |
| `--auth-token` | — | Bearer for cloud ASR (also `$NVIDIA_API_KEY`). |
| `--limit` | — | Cap files processed. |

## Common failure modes

- **`No files matched glob`** — default globs are `*.mp3,*.wav`. Pass
`--glob "*.mp4"` for video, etc.
- **Falls back to local Parakeet unexpectedly** — `--use-env-asr` is on but
none of `AUDIO_GRPC_ENDPOINT` / `NGC_API_KEY` / `AUDIO_FUNCTION_ID` are
set. Either set them or pass `--audio-grpc-endpoint`.
- **Local Parakeet OOM on long files** — drop `--split-interval` (smaller
chunks) or switch to a remote NIM.

## Related

- [[pipeline]] with `--input-type audio` — full ingest including embedding +
VDB.
- [[benchmark]] `audio-extract` — throughput benchmarks.
93 changes: 93 additions & 0 deletions .claude/skills/nemo-retriever/references/benchmark.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,93 @@
# retriever benchmark

Throughput micro-benchmarks for individual Ray actors in the ingest
pipeline. Each subcommand isolates one stage and reports rows/sec.

Subcommands:

| Stage | Subcommand | Actor benchmarked |
|---|---|---|
| Split | `retriever benchmark split run` | `PDFSplitActor` |
| Extract | `retriever benchmark extract run` | `PDFExtractionActor` |
| Page elements | `retriever benchmark page-elements run` | `PageElementDetectionActor` |
| OCR | `retriever benchmark ocr run` | `OCRActor` |
| Audio extract | `retriever benchmark audio-extract run` | `MediaChunkActor + ASRActor` |
| All | `retriever benchmark all run` | runs the above in sequence |

If flags below look stale, re-check `retriever benchmark <stage> run --help`.

## When to use this

- You suspect a specific pipeline stage is the bottleneck and want
rows/sec numbers under controlled load.
- You're sizing Ray actor counts / GPU fractions for [[pipeline]] / [[ingest]]
and need empirical numbers per stage.
- You want a regression-style benchmark across machines or releases (pair
with [[harness]] for orchestration).

**Use a different command when:**

- You want end-to-end ingest, not stage-isolated numbers → [[ingest]] or
[[pipeline]] with a stopwatch.
- You want recall/QA quality, not throughput → [[recall]] / [[eval]].

## Canonical invocations

Benchmark the page-element detector alone:

```bash
retriever benchmark page-elements run --help # see options
retriever benchmark page-elements run
```

Benchmark OCR (v2 by default; pair with [[pipeline]]'s `--ocr-version`):

```bash
retriever benchmark ocr run
```

Run all stage benchmarks in sequence and print a summary:

```bash
retriever benchmark all run --num-gpus 0.5 --num-cpus 1.0
```

## Inputs

- All `run` commands take their own flag set (run `--help` on the
individual subcommand). Common shape: rows count, batch size, GPU/CPU
fractions per actor, optional remote NIM URL.

## Outputs

- Stdout report with per-actor throughput in rows/sec, plus headers per
stage (e.g. `=== benchmark: page-elements ===`).

## Key flags (`all run`)

| Flag | Default | Notes |
|---|---|---|
| `--num-gpus` | `1.0` | GPUs reserved per page-elements / OCR actor. |
| `--num-cpus` | `1.0` | CPUs reserved per actor. |
| `--rows-page-elements` etc. | per-stage | Synthetic rows per stage benchmark. |

## Reading the results

- Numbers come from a synthetic Ray Dataset; they're representative of the
stage in isolation, not of end-to-end throughput.
- To convert to [[pipeline]] tuning: pick the slowest stage's rows/sec,
divide your target rate by it → number of actors needed.

## Common failure modes

- **Page-elements benchmark stalls** — needs YOLOX weights or a remote
endpoint. Pass the URL flags or pre-cache weights.
- **Benchmark numbers don't match [[pipeline]]** — micro-benchmarks exclude
inter-stage queues / batching overhead. Treat as upper bounds.
- **`CUDA OOM`** — drop `--num-gpus` (fractional) or `*-batch-size` per
stage.

## Related

- [[pipeline]] — apply the actor counts derived from these benchmarks.
- [[harness]] — runs benchmarks across configs/datasets and stores results.
85 changes: 85 additions & 0 deletions .claude/skills/nemo-retriever/references/chart.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,85 @@
# retriever chart

Chart-specific enrichment over already-extracted primitives — parses chart
images (titles, axes, series, values) and adds them as structured text to
each chart primitive. Two related subcommands:

- `retriever chart stage run` — enrich an existing primitives DataFrame.
- `retriever chart stage graphic-elements` — run the extract+detect path
starting from PDFs, with chart extraction enabled.

If flags below look stale, re-check `retriever chart stage --help`.

## When to use this

- You already ran [[pdf]] (or another extractor) and want to add chart
parsing on top of the primitives without re-extracting.
- You're iterating on chart parsing parameters and don't want to rerun the
whole pipeline.

**Use a different command when:**

- You want full ingest with charts → [[ingest]] / [[pipeline]] with
`--extract-charts`.
- You want only PDF extraction (no chart parsing) → [[pdf]].

## Canonical invocations

Enrich a primitives parquet with chart parsing:

```bash
retriever chart stage run \
--input out/extractions.parquet \
--output out/extractions.+chart.parquet
```

Extract from PDFs with charts enabled:

```bash
retriever chart stage graphic-elements \
--input-dir data/pdfs/ \
--extract-charts \
--yolox-http-endpoint http://page-elements:8000/v1/infer
```

## Inputs

- **`run`**: `--input` parquet/jsonl/json with a `metadata` column.
- **`graphic-elements`**: `--input-dir` of PDFs (same shape as `retriever pdf
stage page-elements`).

## Outputs

- **`run`**: enriched DataFrame at `--output` (defaults to
`<input>.+chart<ext>`). Chart primitives gain parsed structured text in
their `text` field.
- **`graphic-elements`**: per-PDF `*.pdf_extraction.json` sidecars including
chart primitives.

## Key flags (`chart stage run`)

| Flag | Default | Notes |
|---|---|---|
| `--input` | — | Required. `.parquet`, `.jsonl`, or `.json` with `metadata`. |
| `--output` | `<input>.+chart<ext>` | Output path. |
| `--config` | auto-discover | YAML config (section: `chart`). |

## Key flags (`chart stage graphic-elements`)

Same as `retriever pdf stage page-elements` plus `--extract-charts` toggled
on by default. See [[pdf]] for the full flag table.

## Common failure modes

- **`KeyError: 'metadata'`** — input DataFrame is missing the `metadata`
column. Make sure you fed it primitives JSON/parquet from
`retriever pdf stage` or [[pipeline]].
- **No chart rows in output** — the input has no rows with
`metadata.content_metadata.type == "structured"` and chart subtype. Run
extraction with `--extract-charts` first.

## Related

- [[pdf]] — generate the primitives that `chart stage run` consumes.
- [[pipeline]] — wraps chart extraction into the graph pipeline.
- [[ingest]] — end-to-end including charts when enabled.
46 changes: 46 additions & 0 deletions .claude/skills/nemo-retriever/references/compare.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,46 @@
# retriever compare

Comparison utilities. Optional subcommands are registered lazily — if the
relevant module is installed, you'll see:

- `retriever compare json` — diff two JSON files (extraction sidecars, eval
outputs, recall outputs).
- `retriever compare results` — diff two retrieval/eval result bundles.

Run `retriever compare --help` to see which subcommands are present in your
install.

## When to use this

- You changed an extraction flag, ran the pipeline twice, and want a
semantic diff of the outputs (not a textual diff).
- You ran [[recall]] or [[eval]] twice and want to know which queries
regressed / improved.

**Use a different command when:**

- You want a single-number metric, not a diff → [[recall]] / [[eval]].
- You want a UI / portal for sweep comparison → [[harness]] (`portal` /
`compare`).

## Canonical invocations

```bash
retriever compare json before.json after.json
retriever compare results runs/baseline/ runs/candidate/
```

Run `--help` on each subcommand for the exact flag set; the modules are
optional and may expose different options across releases.

## Common failure modes

- **`retriever compare json` not found** — the `compare_json` module isn't
installed. Install the extras (or upgrade the package).
- **Diff shows everything different** — files have non-stable key order or
embedded timestamps; the subcommand normalises common cases but not all.

## Related

- [[recall]] / [[eval]] — produce the artifacts this command compares.
- [[harness]] `compare` — session-level comparison with summaries.
Loading
Loading