(retriever) add pdfium_hybrid/ocr extract methods for scanned pages by edknv · Pull Request #1521 · NVIDIA/NeMo-Retriever

edknv · 2026-03-09T21:18:57Z

Description

This PR adds pdfium_hybrid and ocr text extraction methods to the nemo_retriever library, matching the nv-ingest pipeline's existing support.

--method pdfium (default): no behavioral change, native text extraction only
--method pdfium_hybrid: scanned pages get OCR'd text, non-scanned pages keep native text
--method ocr: all pages get OCR'd text
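
The three modes above can be sketched as a per-page selection rule. This is a minimal illustration, not the code in this PR: `ExtractMethod`, `select_page_text`, and the `run_ocr` callback are hypothetical names, and treating a whitespace-only page as "scanned" is an assumption.

```python
from enum import Enum
from typing import Callable


class ExtractMethod(str, Enum):
    """Hypothetical enum mirroring the three --method values."""

    PDFIUM = "pdfium"
    PDFIUM_HYBRID = "pdfium_hybrid"
    OCR = "ocr"


def select_page_text(
    method: ExtractMethod,
    native_text: str,
    run_ocr: Callable[[], str],
) -> str:
    """Pick the text for one page according to the extraction method."""
    if method is ExtractMethod.OCR:
        # ocr: every page goes through OCR, native text is ignored.
        return run_ocr()
    if method is ExtractMethod.PDFIUM_HYBRID and not native_text.strip():
        # Assumption: a page counts as scanned when it has no native text.
        return run_ocr()
    # pdfium, or a hybrid page that already has native text.
    return native_text
```

Passing OCR as a callback keeps it lazy, so the `pdfium` path never pays for an OCR call it does not use.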

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.
If adjusting docker-compose.yaml environment variables, have you ensured those are mimicked in the Helm values.yaml file?

jdye64 · 2026-03-10T16:41:19Z

        help="LanceDB URI/path for this run.",
    ),
+    method: str = typer.Option(
+        "pdfium",


Should pdfium or pdfium_hybrid be the default? Asking because I really don't know.

edknv · 2026-03-10

I changed the default to pdfium_hybrid in 4c7e837.

It behaves identically to pdfium for text-based pages but for scanned pages it automatically switches to OCR instead of returning empty strings. There's no downside for non-scanned PDFs since the OCR path only activates when the page has zero native text.

This should give us a 1% recall boost on bo767.
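
To make the "no downside for non-scanned PDFs" argument concrete, here is a rough sketch that counts how many pages each method would send to OCR. `count_ocr_pages` is a hypothetical helper, not part of nemo_retriever, and it assumes "zero native text" means empty after stripping whitespace.

```python
from typing import List


def count_ocr_pages(method: str, native_texts: List[str]) -> int:
    """Count pages that would be sent to OCR for a given method."""
    if method == "ocr":
        # Every page is OCR'd regardless of native text.
        return len(native_texts)
    if method == "pdfium_hybrid":
        # Assumption: only pages with no native text trigger OCR.
        return sum(1 for text in native_texts if not text.strip())
    if method == "pdfium":
        # Native extraction only, OCR is never invoked.
        return 0
    raise ValueError(f"unknown method: {method!r}")


text_only = ["Intro", "Results"]      # every page has native text
mixed = ["Intro", "", "Results"]      # the middle page is scanned

print(count_ocr_pages("pdfium_hybrid", text_only))
print(count_ocr_pages("pdfium_hybrid", mixed))
```

Under these assumptions, a text-only document triggers no OCR calls with `pdfium_hybrid`, so its cost matches `pdfium`; only the scanned page in the mixed document adds OCR work.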

jdye64 · 2026-03-10T16:41:49Z

        help="Embedding model name passed to .embed().",
    ),
+    method: str = typer.Option(
+        "pdfium",


same question about default


edknv and others added 2 commits March 9, 2026 14:15

(retriever) add pdfium_hybrid/ocr extract methods for scanned pages

8d4c5d0

Merge branch 'main' into edwardk/retriever-scanned

933327a

edknv requested review from jdye64 and jperez999 March 10, 2026 16:38

edknv marked this pull request as ready for review March 10, 2026 16:38

edknv requested a review from a team as a code owner March 10, 2026 16:38

Merge branch 'main' into edwardk/retriever-scanned

bc84310

jdye64 reviewed Mar 10, 2026

jdye64 and others added 7 commits March 10, 2026 13:20

Merge branch 'main' into edwardk/retriever-scanned

8e3f197

change default to pdfium_hybrid

4c7e837

Merge branch 'main' into edwardk/retriever-scanned

3e56c47

Merge branch 'edwardk/retriever-scanned' of github.com:edknv/nv-inges…

d684666

…t into edwardk/retriever-scanned

Merge branch 'main' into edwardk/retriever-scanned

e102eed

Merge branch 'main' into edwardk/retriever-scanned

2537a53

switch back default to

b037053

jdye64 merged commit c4537ca into NVIDIA:main Mar 10, 2026
8 checks passed

jdye64 merged 10 commits into NVIDIA:main from edknv:edwardk/retriever-scanned
