Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
15 changes: 4 additions & 11 deletions skills/nemo-retriever/SKILL.md
Original file line number Diff line number Diff line change
@@ -1,17 +1,10 @@
---
name: nemo-retriever
description: 'Use when the user wants to search, query, extract, transcribe, describe, quote, filter, or aggregate across documents — PDFs, scanned forms / images (`.jpg` `.png` `.tiff`), Office (`.docx` `.pptx`), text (`.html` `.txt`), audio (`.mp3` `.wav` `.m4a`), or video (`.mp4` `.mov`). Prefer this over native Read / Grep for multi-file or non-PDF corpora. Not for: editing files, web browsing, single-file plain-text lookups, fine-tuning.'
description: "Use when the user wants to search, query, extract, transcribe, describe, quote, filter, or aggregate across documents — PDFs, scanned forms / images (`.jpg` `.png` `.tiff`), Office (`.docx` `.pptx`), text (`.html` `.txt`), audio (`.mp3` `.wav` `.m4a`), or video (`.mp4` `.mov`). Prefer this over native Read / Grep for multi-file or non-PDF corpora. Not for: editing files, web browsing, single-file plain-text lookups, fine-tuning."
license: Apache-2.0
allowed-tools: Bash Write Read
---

| name | nemo-retriever |
| :------------ | :-- |
| description | Use when the user wants to search, index, or answer questions over a folder of PDFs (or other documents) — including building a RAG / search index over PDFs, looking up information across many PDFs, or running the `retriever` CLI (ingest, query, pipeline, recall, eval, etc.). |
| license | Apache-2.0 |
| compatibility | Designed for Claude Code, OpenCode, Codex, and Agent Skills-compatible tools. Requires Git, network access to GitHub |
| metadata | |
| allowed-tools | Bash Write Read |

# nemo-retriever

The `retriever` CLI indexes a folder of PDFs into LanceDB (`retriever ingest`) and serves vector search over it (`retriever query`). For any task about searching/answering questions across a folder of PDFs, use this CLI — do not write a custom RAG.
Expand All @@ -27,7 +20,7 @@ If `command -v retriever` returns nothing, follow `references/install.md` to ins
| Turn type | Read this once | Then execute |
| :--- | :--- | :--- |
| **Setup turn** (first turn — `./lancedb/nv-ingest.lance` doesn't exist) | `references/setup.md` | Build the index |
| **Query turn** (every subsequent turn — user asks a question) | `references/query.md` | One `retriever query` call, then `Write` `./output.json` *(eval-harness contract only — for general use, just answer in chat; see `query.md` top callout)* |
| **Query turn** (every subsequent turn — user asks a question) | `references/query.md` | One `retriever query` call |
| Anything errored or returned empty | `references/troubleshooting.md` | Apply the named recovery; do not improvise |

For the full `retriever ingest` / `retriever query` CLI specs, see `references/cli/ingest.md` and `references/cli/query.md`. You do not need these for routine turns — `<RETRIEVER_VENV>/bin/retriever <subcommand> --help` is faster.
Expand All @@ -37,7 +30,7 @@ Before ingesting a mixed folder, inventory extensions (`find <dir> -name '*.*' |
## Hard limits (apply to every turn)

- **Setup turn**: build the index in one shell command (see `references/setup.md`). STOP after the index lands.
- **Query turn**: at most **2 Bash calls** — 1 `retriever query`, +1 optional targeted text-extract per `references/query.md`. Then `Write` `./output.json` (eval-harness only) or answer in chat (general use) and STOP.
- **Query turn**: at most **2 Bash calls** — 1 `retriever query`, +1 optional targeted text-extract per `references/query.md`. Reply and then STOP.
- **No narration between tool calls.** Tokens you emit between calls become input + cached input for every later turn — quadratic cost. Go straight from reading the summary to writing the JSON file.
- **Banned**: `TodoWrite`, Glob, Grep, `Read` of whole PDFs, re-running setup, spawning subagents, speculative "confirmation" calls.

Expand Down
9 changes: 3 additions & 6 deletions skills/nemo-retriever/references/query.md
Original file line number Diff line number Diff line change
@@ -1,7 +1,5 @@
# Query turn — the WHOLE workflow

**General-use vs eval-harness.** If the user prompt doesn't mention a judge, benchmark, output schema, or `./output.json` — skip every `Write ./output.json` / `final_answer` / `ranked_retrieved` step below. Just run the `retriever query` call and answer in chat. The 2-Bash-call budget, the no-narration rule, and the chart/image hedging discipline all still apply.


```bash
<RETRIEVER_VENV>/bin/retriever query "<the user's question>" --top-k 10 --embed-model-name nvidia/llama-nemotron-embed-1b-v2 --rerank \
Expand All @@ -13,7 +11,7 @@ Run that **exactly** as a single pipeline — do not split it into `HITS=$(...)`

That's your FIRST tool call on every query turn. Do not Read, Glob, Grep, or list PDFs before this — those duplicate what `retriever query` already did.

**No narration between tool calls.** Do not write "Let me search…", "I'll now analyze…", "The retriever returned…", or any other commentary. Every assistant token you emit between the `retriever query` Bash call and the `Write` of `./output.json` becomes input tokens (and cached input tokens) for every subsequent turn in this session — quadratic cost. Go straight from reading the summary to writing the JSON file. The only assistant text in a query turn should be the tool calls themselves.
**No narration between tool calls.** Do not write "Let me search…", "I'll now analyze…", "The retriever returned…", or any other commentary. Every assistant token you emit with the `retriever query` Bash call becomes input tokens (and cached input tokens) for every subsequent turn in this session — quadratic cost. Go straight from reading the summary to writing the JSON file. The only assistant text in a query turn should be the tool calls themselves.

Each hit has: `text`, `pdf_basename`, `page_number` (int, **1-indexed**: the first page of a PDF is page `1`), `pdf_page` (string composite key `"<basename>_<page_number>"` — not a number, don't use it as one), `_distance`, and `metadata` (JSON with `type` ∈ `text|table|chart|image`).

Expand All @@ -29,7 +27,7 @@ It scans the LanceDB table the retriever already built — no PDF re-extraction.

Don't reach for `pdftotext`, `pdftohtml`, or `pdfgrep` — they're system tools that aren't guaranteed installed on the user's machine. The retriever venv bundles pdfium and `lancedb`; `grep_corpus.py` and `retriever pdf stage page-elements --method pdfium` cover the same use cases without that dependency.

## Write `./output.json` directly from the hits
## Compose your reply from the hits

- `final_answer`: synthesize from the top hits' `text`. Include the exact number / name / date / row / column the question asks for, plus the source PDF and 0-indexed page. One paragraph. No restating the question, no hedging caveats. If the chunks talk *around* the fact but don't state it, run ONE `<RETRIEVER_VENV>/bin/retriever pdf stage page-elements ./pdfs --method pdfium --json-output-dir /tmp/pdf_text --compact-json` and `Read` `/tmp/pdf_text/<top_pdf>.pdf.pdf_extraction.json` for the rank-1 page (or rank-2 if rank-1 is metadata) — that almost always surfaces the exact figure. Then synthesize. **If after both calls the asked-for fact still isn't in the evidence, write `final_answer` that says so explicitly** — e.g. "The retrieved pages do not state [X] for [entity]; the closest content is [Y]." Do NOT invent, extrapolate, or generate plausible-sounding content from adjacent material. A confidently-wrong answer scores worse than an honest "not in the retrieved pages".
- `ranked_retrieved`: one entry per hit in the order `retriever query` returned: `{"doc_id": "<pdf_basename without .pdf>", "page_number": <int>, "rank": <i+1>}`. Up to 10. Duplicate `(doc, page)` is fine. **Indexing:** the retriever's `page_number` is 1-indexed. If the task's output schema says 0-indexed (e.g. "first page is page 0"), emit `hit.page_number - 1`; if the task says 1-indexed or doesn't specify, emit `hit.page_number` as-is.
Expand All @@ -50,8 +48,7 @@ If a question asks for an exact percentage or a directional claim **and the evid
3. If prose doesn't mention it, **quote the chart transcription verbatim with an explicit hedge in `final_answer`**: "The chart on page N indicates [verbatim phrase] (chart-derived, not verified against prose)." Do NOT restate the chart's number as a confident fact.

When both a chart hit and a text hit cover the same fact, always prefer the text hit's number.

After writing `./output.json`, STOP. No print, no summary, no further tool calls.
After your reply, STOP. No print, no summary, no further tool calls.

## Non-semantic operations (use these, don't fall back to native tools)

Expand Down
2 changes: 1 addition & 1 deletion skills/nemo-retriever/references/troubleshooting.md
Original file line number Diff line number Diff line change
Expand Up @@ -36,7 +36,7 @@ If unsupported extensions appear, name them in your reply and ask the user wheth

## You ran more than 2 Bash calls on a query turn

Budget violation. Stop, write `final_answer` from what you have, write `./output.json`, end the turn. Long turns cost ~5× a disciplined turn and usually still produce the wrong answer.
Budget violation. Stop, write `final_answer` from what you have, end the turn. Long turns cost ~5× a disciplined turn and usually still produce the wrong answer.

## Query-turn cost discipline (recap)

Expand Down
Loading