diff --git a/.claude/skills/nemo-retriever/SKILL.md b/.claude/skills/nemo-retriever/SKILL.md
index 75d4b5f77..5975ac0bd 100644
--- a/.claude/skills/nemo-retriever/SKILL.md
+++ b/.claude/skills/nemo-retriever/SKILL.md
@@ -30,7 +30,7 @@ Don't pre-OCR, don't pre-chunk, don't write Python wrappers — the CLI handles
 ## Query turn — the WHOLE workflow

 ```bash
-retriever query "" --top-k 10 --embed-model-name nvidia/llama-nemotron-embed-1b-v2 --rerank \
+retriever query "" --top-k 20 --embed-model-name nvidia/llama-nemotron-embed-1b-v2 --rerank \
 | tee /tmp/hits.json \
 | jq -r '.[] | "rank=\(.rank // 0) page=\(.page_number) pdf=\(.pdf_basename) type=\(.metadata.type // "?") text=\(.text[:200])"'
 ```
@@ -46,7 +46,7 @@ Each hit has: `text`, `pdf_basename`, `page_number` (int, **1-indexed**: the fir

 **Then write `./output.json` directly from $HITS:**
 - `final_answer`: synthesize from the top hits' `text`. Include the exact number / name / date / row / column the question asks for, plus the source PDF and 0-indexed page. One paragraph. No restating the question, no hedging caveats. If the chunks talk *around* the fact but don't state it, run ONE `retriever pdf stage page-elements ./pdfs --method pdfium --json-output-dir /tmp/pdf_text --compact-json` and read `/tmp/pdf_text/.pdf.pdf_extraction.json` for the rank-1 page (or rank-2 if rank-1 is metadata) — that almost always surfaces the exact figure. Then synthesize. **If after both calls the asked-for fact still isn't in the evidence, write `final_answer` that says so explicitly** — e.g. "The retrieved pages do not state [X] for [entity]; the closest content is [Y]." Do NOT invent, extrapolate, or generate plausible-sounding content from adjacent material. A confidently-wrong answer scores worse than an honest "not in the retrieved pages".
-- `ranked_retrieved`: one entry per hit in the order `retriever query` returned: `{"doc_id": "", "page_number": , "rank": }`. Up to 10. Duplicate `(doc, page)` is fine. **Indexing:** the retriever's `page_number` is 1-indexed. If the task's output schema says 0-indexed (e.g. "first page is page 0"), emit `hit.page_number - 1`; if the task says 1-indexed or doesn't specify, emit `hit.page_number` as-is.
+- `ranked_retrieved`: emit **up to 10 entries** `{"doc_id": "", "page_number": , "rank": }` chosen from the reranked pool. `--top-k 20 --rerank` returns 20 reranked candidates (rerank's `refine_factor=4` means retriever pulled 80 candidates internally then the cross-encoder reranked them); the wider pool exists so you can surface 10 distinct pages even when multimodal indexes produce text + table + chart chunks on the same page. **Rule (deterministic):** walk the 20 reranked hits in order and emit each hit whose `(doc_id, page_number)` you have not emitted yet, with `rank` = 1, 2, 3, ... in emit order. Stop at 10 distinct pairs. Only if the full 20-candidate pool runs out before you reach 10 may you fall back to emitting a duplicate — and only enough to fill the remaining slots. **Indexing:** the retriever's `page_number` is 1-indexed. If the task's output schema says 0-indexed (e.g. "first page is page 0"), emit `hit.page_number - 1`; if the task says 1-indexed or doesn't specify, emit `hit.page_number` as-is.

 **Before writing `final_answer`, re-read the question.** If it lists multiple entities, years, or categories, your answer must address each one explicitly — even if for some of them the chunks say "not provided" or contain no data. Missing entities lose more judge points than imprecise numbers.
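
Review note: the new `ranked_retrieved` rule is easiest to check against a small reference sketch. The Python below is illustrative only and is not something the skill asks the agent to run. It assumes `doc_id` is filled from each hit's `pdf_basename` (the placeholder is empty in the diff), that fallback duplicates are appended in reranked order after all distinct pages, and that the function name `select_ranked_retrieved` and the `zero_indexed` flag are hypothetical.

```python
from typing import Any


def select_ranked_retrieved(
    hits: list[dict[str, Any]],
    limit: int = 10,
    zero_indexed: bool = False,
) -> list[dict[str, Any]]:
    """Pick up to `limit` entries from reranked hits, distinct pages first."""
    # The retriever's page_number is 1-indexed; shift only for 0-indexed schemas.
    offset = 1 if zero_indexed else 0

    seen: set[tuple[str, int]] = set()
    distinct: list[dict[str, Any]] = []
    duplicates: list[dict[str, Any]] = []

    # Walk the hits in reranked order, separating first sightings of a
    # (doc, page) pair from repeats of a pair that was already seen.
    for hit in hits:
        key = (hit["pdf_basename"], hit["page_number"])  # assumed doc_id source
        if key in seen:
            duplicates.append(hit)
        else:
            seen.add(key)
            distinct.append(hit)

    # Stop at `limit` distinct pairs.
    chosen = distinct[:limit]

    # Fallback: only when the whole pool holds fewer than `limit` distinct
    # pairs, fill the remaining slots with duplicates (assumed order: reranked).
    if len(chosen) < limit:
        chosen += duplicates[: limit - len(chosen)]

    # Rank is assigned in emit order, starting at 1.
    return [
        {
            "doc_id": h["pdf_basename"],
            "page_number": h["page_number"] - offset,
            "rank": rank,
        }
        for rank, h in enumerate(chosen, start=1)
    ]
```

Called on the parsed contents of `/tmp/hits.json`, this would return at most 10 entries, with duplicates appearing only when the 20-candidate pool holds fewer than 10 distinct pages.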