From 6a8fbc6de6feaafcec969b38f9191599f1222a1c Mon Sep 17 00:00:00 2001
From: Kurt Heiss
Date: Fri, 22 May 2026 09:30:57 -0700
Subject: [PATCH 1/2] docs(extraction): document nemo-retriever[nemotron-parse] extra (NVBugs 6170950)

Tell users to install the nemotron-parse PyPI extra before
extract_method=nemotron_parse so open_clip is available. Packaging already
declares the extra on main; this is documentation only.
---
 docs/docs/extraction/faq.md                   |  1 +
 docs/docs/extraction/overview.md              |  1 +
 .../prerequisites-support-matrix.md           |  5 ++++
 docs/docs/extraction/troubleshoot.md          | 24 +++++++++++++++++++
 4 files changed, 31 insertions(+)

diff --git a/docs/docs/extraction/faq.md b/docs/docs/extraction/faq.md
index 7014cb4ee..f0e4357bd 100644
--- a/docs/docs/extraction/faq.md
+++ b/docs/docs/extraction/faq.md
@@ -29,6 +29,7 @@ For more information, refer to [Extract Captions from Images](nemo-retriever-api

 For scanned documents, or documents with complex layouts, you can use [nemotron-parse](https://build.nvidia.com/nvidia/nemotron-parse)
 as an alternate PDF extraction method by setting `extract_method="nemotron_parse"`.
+Install the Python client dependencies first with `pip install "nemo-retriever[nemotron-parse]"` (or combine with `[local]` as `nemo-retriever[local,nemotron-parse]` when you also run models on your GPU).
 For more information, refer to [Nemotron Parse](https://build.nvidia.com/nvidia/nemotron-parse).

 ## Why are the environment variables different between library mode and self-hosted mode?
diff --git a/docs/docs/extraction/overview.md b/docs/docs/extraction/overview.md
index 07ef2c83d..0af0827ed 100644
--- a/docs/docs/extraction/overview.md
+++ b/docs/docs/extraction/overview.md
@@ -15,6 +15,7 @@ NeMo Retriever Library does the following:

 - Accept directories of input files and a series of configurable ingestion tasks to perform on that input
 - Allow the extracted content be retrieved from a VDB containing discrete metadata element
+- Support multiple extraction methods per document type to balance throughput and accuracy—for example, PDFs can use **pdfium** or [Nemotron Parse](https://build.nvidia.com/nvidia/nemotron-parse) (`extract_method="nemotron_parse"`)
 - Support various types of pre- and post- processing operations, including text splitting and chunking, transform and filtering, embedding generation, and image offloading to storage.

 !!! note
diff --git a/docs/docs/extraction/prerequisites-support-matrix.md b/docs/docs/extraction/prerequisites-support-matrix.md
index 7d9df5911..cdbb2427a 100644
--- a/docs/docs/extraction/prerequisites-support-matrix.md
+++ b/docs/docs/extraction/prerequisites-support-matrix.md
@@ -13,6 +13,11 @@ Before you begin using [NeMo Retriever Library](overview.md), confirm your softw
   `ffmpeg-python` and `nemo-retriever[multimedia]` do not install these binaries. On Helm
   with package-repo access, set `service.installFfmpeg=true`. For air-gapped clusters, see
   [Air-gapped and disconnected deployment](deployment-options.md#air-gapped-deployment).
+- For PDF extraction with `extract_method="nemotron_parse"`, install the Nemotron Parse
+  client dependencies with `pip install "nemo-retriever[nemotron-parse]"` (pulls
+  `open-clip-torch`, which provides the `open_clip` module required by the Nemotron Parse
+  NIM client). The base `nemo-retriever` install and `[local]` extra do not include this
+  package.

 !!! note

diff --git a/docs/docs/extraction/troubleshoot.md b/docs/docs/extraction/troubleshoot.md
index fa415cd7e..10c4d457f 100644
--- a/docs/docs/extraction/troubleshoot.md
+++ b/docs/docs/extraction/troubleshoot.md
@@ -100,6 +100,30 @@ You can set the variable in your .env file or directly in your environment.



+## ModuleNotFoundError: No module named open_clip when using nemotron_parse { #modulenotfounderror-no-module-named-open-clip-when-using-nemotron-parse }
+
+When you run PDF extraction with `extract_method="nemotron_parse"`, you might see an error similar to the following:
+
+```text
+ModuleNotFoundError: No module named 'open_clip'
+```
+
+The Nemotron Parse NIM client requires the `open_clip` Python module from `open-clip-torch==3.2.0`. That package is not part of the default `nemo-retriever` install or the `[local]` extra.
+
+Install the dedicated PyPI extra before running Nemotron Parse extraction:
+
+```bash
+pip install "nemo-retriever[nemotron-parse]"
+```
+
+For local GPU inference with Nemotron Parse, combine extras:
+
+```bash
+pip install "nemo-retriever[local,nemotron-parse]"
+```
+
+See also [What is NeMo Retriever Library?](overview.md) and [Pre-Requisites & Support Matrix](prerequisites-support-matrix.md#software-requirements).
+
 ## Extract method nemotron-parse doesn't support image files

 Currently, extraction with Nemotron parse doesn't support image files, only scanned PDFs.

From f3f93c2f2ec0cceef2ecbe2c1646546d93a78a06 Mon Sep 17 00:00:00 2001
From: Kurt Heiss
Date: Fri, 22 May 2026 09:38:18 -0700
Subject: [PATCH 2/2] docs(extraction): address PR review for nemotron-parse install text

Quote the combined local+nemotron-parse pip command in the FAQ so shell
copy-paste works, and drop the hard-pinned open-clip-torch version from
troubleshoot prose.
---
 docs/docs/extraction/faq.md          | 2 +-
 docs/docs/extraction/troubleshoot.md | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/docs/extraction/faq.md b/docs/docs/extraction/faq.md
index f0e4357bd..822443d2d 100644
--- a/docs/docs/extraction/faq.md
+++ b/docs/docs/extraction/faq.md
@@ -29,7 +29,7 @@ For more information, refer to [Extract Captions from Images](nemo-retriever-api

 For scanned documents, or documents with complex layouts, you can use [nemotron-parse](https://build.nvidia.com/nvidia/nemotron-parse)
 as an alternate PDF extraction method by setting `extract_method="nemotron_parse"`.
-Install the Python client dependencies first with `pip install "nemo-retriever[nemotron-parse]"` (or combine with `[local]` as `nemo-retriever[local,nemotron-parse]` when you also run models on your GPU).
+Install the Python client dependencies first with `pip install "nemo-retriever[nemotron-parse]"` (or combine extras as `pip install "nemo-retriever[local,nemotron-parse]"` when you also run models on your GPU).
 For more information, refer to [Nemotron Parse](https://build.nvidia.com/nvidia/nemotron-parse).

 ## Why are the environment variables different between library mode and self-hosted mode?
diff --git a/docs/docs/extraction/troubleshoot.md b/docs/docs/extraction/troubleshoot.md
index 10c4d457f..26a90bab1 100644
--- a/docs/docs/extraction/troubleshoot.md
+++ b/docs/docs/extraction/troubleshoot.md
@@ -108,7 +108,7 @@ When you run PDF extraction with `extract_method="nemotron_parse"`, you might se
 ModuleNotFoundError: No module named 'open_clip'
 ```

-The Nemotron Parse NIM client requires the `open_clip` Python module from `open-clip-torch==3.2.0`. That package is not part of the default `nemo-retriever` install or the `[local]` extra.
+The Nemotron Parse NIM client requires the `open_clip` Python module, provided by `open-clip-torch`. That package is not part of the default `nemo-retriever` install or the `[local]` extra.

 Install the dedicated PyPI extra before running Nemotron Parse extraction: