(retriever) add pdfium_hybrid/ocr extract methods for scanned pages #1521

Merged
jdye64 merged 10 commits into NVIDIA:main from edknv:edwardk/retriever-scanned
Mar 10, 2026

Collaborator

@edknv edknv commented Mar 9, 2026

Description

This PR adds pdfium_hybrid and ocr text extraction methods to the nemo_retriever library, matching the nv-ingest pipeline's existing support.

  • --method pdfium (default): no behavioral change, native text extraction only
  • --method pdfium_hybrid: scanned pages get OCR'd text, non-scanned pages keep native text
  • --method ocr: all pages get OCR'd text
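
The three modes above can be read as a per-page decision about whether to run OCR. The following is a minimal sketch of that decision, not the nemo_retriever implementation: `ExtractMethod` and `needs_ocr` are hypothetical names, and treating an empty string as "no native text" is an assumption.

```python
from enum import Enum


class ExtractMethod(str, Enum):
    # Hypothetical enum mirroring the three --method values listed above.
    PDFIUM = "pdfium"
    PDFIUM_HYBRID = "pdfium_hybrid"
    OCR = "ocr"


def needs_ocr(method: ExtractMethod, native_text: str) -> bool:
    """Return True when a page should be sent to OCR under the given method."""
    if method is ExtractMethod.OCR:
        return True  # every page is OCR'd
    if method is ExtractMethod.PDFIUM_HYBRID:
        # Assumption: a page counts as scanned when it has no native text.
        return not native_text
    return False  # pdfium: native text only, even when it is empty


# Print the decision for a text-based page and for a scanned (empty) page.
for m in ExtractMethod:
    print(m.value, needs_ocr(m, "native text"), needs_ocr(m, ""))
```

Only `pdfium_hybrid` gives a different answer for the two kinds of page; the other two modes ignore the native text when deciding.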

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • If adjusting docker-compose.yaml environment variables, have you ensured those are mimicked in the Helm values.yaml file?

@edknv edknv requested review from jdye64 and jperez999 March 10, 2026 16:38
@edknv edknv marked this pull request as ready for review March 10, 2026 16:38
@edknv edknv requested a review from a team as a code owner March 10, 2026 16:38
help="LanceDB URI/path for this run.",
),
method: str = typer.Option(
"pdfium",
Collaborator

Should pdfium or pdfium_hybrid be the default? Asking because I really don't know.

Collaborator Author

I changed the default to pdfium_hybrid in 4c7e837.

It behaves identically to pdfium for text-based pages but for scanned pages it automatically switches to OCR instead of returning empty strings. There's no downside for non-scanned PDFs since the OCR path only activates when the page has zero native text.

This should give us a 1% recall boost on bo767.
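
The fallback described in this comment can be illustrated with a document-level loop in which OCR is a callback that runs only for pages without native text. This is a sketch under stated assumptions, not the code from the PR: `extract_hybrid` and `fake_ocr` are hypothetical names, and the comment does not say whether a whitespace-only page counts as "zero native text" (here only an empty string does).

```python
from typing import Callable, List


def extract_hybrid(native_pages: List[str], ocr_page: Callable[[int], str]) -> List[str]:
    """Keep native text per page; call OCR only for pages with no native text."""
    texts = []
    for index, native_text in enumerate(native_pages):
        if native_text:
            # Text-based page: same result as native-only extraction.
            texts.append(native_text)
        else:
            # Scanned page: use OCR instead of returning an empty string.
            texts.append(ocr_page(index))
    return texts


ocr_calls: List[int] = []


def fake_ocr(index: int) -> str:
    # Stand-in for a real OCR call; records which pages were sent to OCR.
    ocr_calls.append(index)
    return f"<ocr text for page {index}>"


pages = ["Intro paragraph", "", "Conclusion"]
result = extract_hybrid(pages, fake_ocr)
print(result)
print(ocr_calls)
```

Because the callback is invoked lazily, a document made up entirely of text-based pages never triggers it, which is the "no downside" argument made above.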

help="Embedding model name passed to .embed().",
),
method: str = typer.Option(
"pdfium",
Collaborator

same question about default

@jdye64 jdye64 merged commit c4537ca into NVIDIA:main Mar 10, 2026
8 checks passed