From 6a8fbc6de6feaafcec969b38f9191599f1222a1c Mon Sep 17 00:00:00 2001
From: Kurt Heiss <kheiss@nvidia.com>
Date: Fri, 22 May 2026 09:30:57 -0700
Subject: [PATCH 1/2] docs(extraction): document nemo-retriever[nemotron-parse]
 extra (NVBugs 6170950)

Tell users to install the nemotron-parse PyPI extra before using extract_method=nemotron_parse so that open_clip is importable. Packaging already declares the extra on main; this is documentation only.
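
Reviewers who want to confirm the failure mode locally can use a small preflight check like the one below. It is only a sketch: the helper name missing_modules is hypothetical and is not part of nemo-retriever. It checks whether open_clip can be found without importing it.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment.

    Hypothetical helper for illustration; not part of nemo-retriever.
    """
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # open_clip is the module named in the ModuleNotFoundError.
    missing = missing_modules(["open_clip"])
    if missing:
        print(f"Missing modules: {', '.join(missing)}")
        print('Try: pip install "nemo-retriever[nemotron-parse]"')
    else:
        print("open_clip is importable")
```

Using find_spec instead of a plain import keeps the check cheap, because the module is located but not loaded.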
---
 docs/docs/extraction/faq.md                   |  1 +
 docs/docs/extraction/overview.md              |  1 +
 .../prerequisites-support-matrix.md           |  5 ++++
 docs/docs/extraction/troubleshoot.md          | 24 +++++++++++++++++++
 4 files changed, 31 insertions(+)

diff --git a/docs/docs/extraction/faq.md b/docs/docs/extraction/faq.md
index 7014cb4ee..f0e4357bd 100644
--- a/docs/docs/extraction/faq.md
+++ b/docs/docs/extraction/faq.md
@@ -29,6 +29,7 @@ For more information, refer to [Extract Captions from Images](nemo-retriever-api
 
 For scanned documents, or documents with complex layouts, 
 you can use [nemotron-parse](https://build.nvidia.com/nvidia/nemotron-parse) as an alternate PDF extraction method by setting `extract_method="nemotron_parse"`. 
+Install the Python client dependencies first with `pip install "nemo-retriever[nemotron-parse]"` (or combine with `[local]` as `nemo-retriever[local,nemotron-parse]` when you also run models on your GPU).
 For more information, refer to [Nemotron Parse](https://build.nvidia.com/nvidia/nemotron-parse).
 
 ## Why are the environment variables different between library mode and self-hosted mode?
diff --git a/docs/docs/extraction/overview.md b/docs/docs/extraction/overview.md
index 07ef2c83d..0af0827ed 100644
--- a/docs/docs/extraction/overview.md
+++ b/docs/docs/extraction/overview.md
@@ -15,6 +15,7 @@ NeMo Retriever Library does the following:
 
 - Accept directories of input files and a series of configurable ingestion tasks to perform on that input
 - Allow the extracted content be retrieved from a VDB containing discrete metadata element
+- Support multiple extraction methods per document type to balance throughput and accuracy. For example, PDFs can use **pdfium** or [Nemotron Parse](https://build.nvidia.com/nvidia/nemotron-parse) (`extract_method="nemotron_parse"`)
 - Support various types of pre- and post- processing operations, including text splitting and chunking, transform and filtering, embedding generation, and image offloading to storage.
 
 !!! note
diff --git a/docs/docs/extraction/prerequisites-support-matrix.md b/docs/docs/extraction/prerequisites-support-matrix.md
index 7d9df5911..cdbb2427a 100644
--- a/docs/docs/extraction/prerequisites-support-matrix.md
+++ b/docs/docs/extraction/prerequisites-support-matrix.md
@@ -13,6 +13,11 @@ Before you begin using [NeMo Retriever Library](overview.md), confirm your softw
   `ffmpeg-python` and `nemo-retriever[multimedia]` do not install these binaries.
   On Helm with package-repo access, set `service.installFfmpeg=true`. For
   air-gapped clusters, see [Air-gapped and disconnected deployment](deployment-options.md#air-gapped-deployment).
+- For PDF extraction with `extract_method="nemotron_parse"`, install the Nemotron Parse
+  client dependencies with `pip install "nemo-retriever[nemotron-parse]"`. This extra pulls
+  in `open-clip-torch`, which provides the `open_clip` module required by the Nemotron Parse
+  NIM client. The base `nemo-retriever` install and the `[local]` extra do not include this
+  package.
 
 !!! note
 
diff --git a/docs/docs/extraction/troubleshoot.md b/docs/docs/extraction/troubleshoot.md
index fa415cd7e..10c4d457f 100644
--- a/docs/docs/extraction/troubleshoot.md
+++ b/docs/docs/extraction/troubleshoot.md
@@ -100,6 +100,30 @@ You can set the variable in your .env file or directly in your environment.
 
 
 
+## ModuleNotFoundError: No module named 'open_clip' when using nemotron_parse { #modulenotfounderror-no-module-named-open-clip-when-using-nemotron-parse }
+
+When you run PDF extraction with `extract_method="nemotron_parse"`, you might see an error similar to the following:
+
+```text
+ModuleNotFoundError: No module named 'open_clip'
+```
+
+The Nemotron Parse NIM client requires the `open_clip` Python module from `open-clip-torch==3.2.0`. That package is not part of the default `nemo-retriever` install or the `[local]` extra.
+
+Install the dedicated PyPI extra before running Nemotron Parse extraction:
+
+```bash
+pip install "nemo-retriever[nemotron-parse]"
+```
+
+For local GPU inference with Nemotron Parse, combine extras:
+
+```bash
+pip install "nemo-retriever[local,nemotron-parse]"
+```
+
+See also [What is NeMo Retriever Library?](overview.md) and [Pre-Requisites & Support Matrix](prerequisites-support-matrix.md#software-requirements).
+
 ## Extract method nemotron-parse doesn't support image files
 
 Currently, extraction with Nemotron parse doesn't support image files, only scanned PDFs. 

From f3f93c2f2ec0cceef2ecbe2c1646546d93a78a06 Mon Sep 17 00:00:00 2001
From: Kurt Heiss <kheiss@nvidia.com>
Date: Fri, 22 May 2026 09:38:18 -0700
Subject: [PATCH 2/2] docs(extraction): address PR review for nemotron-parse
 install text

Quote the combined local+nemotron-parse pip command in the FAQ so shell copy-paste works, and drop the hard-pinned open-clip-torch version from troubleshoot prose.
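
Background on the quoting change: square brackets are glob characters in many shells, so an unquoted requirement with extras can be expanded or rejected before pip sees it. The sketch below uses the standard library's shlex.quote to show that the requirement string needs quoting. The helper name pip_install_command is hypothetical.

```python
import shlex


def pip_install_command(requirement):
    """Build a copy-paste-safe pip command for a requirement string.

    Hypothetical helper for illustration only.
    """
    # shlex.quote adds quotes whenever the string contains shell-special
    # characters such as "[" and "]".
    return f"pip install {shlex.quote(requirement)}"


if __name__ == "__main__":
    print(pip_install_command("nemo-retriever[local,nemotron-parse]"))
```

shlex.quote emits single quotes, while the docs use double quotes. Either form keeps the shell from treating the brackets as a glob pattern.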
---
 docs/docs/extraction/faq.md          | 2 +-
 docs/docs/extraction/troubleshoot.md | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/docs/extraction/faq.md b/docs/docs/extraction/faq.md
index f0e4357bd..822443d2d 100644
--- a/docs/docs/extraction/faq.md
+++ b/docs/docs/extraction/faq.md
@@ -29,7 +29,7 @@ For more information, refer to [Extract Captions from Images](nemo-retriever-api
 
 For scanned documents, or documents with complex layouts, 
 you can use [nemotron-parse](https://build.nvidia.com/nvidia/nemotron-parse) as an alternate PDF extraction method by setting `extract_method="nemotron_parse"`. 
-Install the Python client dependencies first with `pip install "nemo-retriever[nemotron-parse]"` (or combine with `[local]` as `nemo-retriever[local,nemotron-parse]` when you also run models on your GPU).
+Install the Python client dependencies first with `pip install "nemo-retriever[nemotron-parse]"` (or combine extras as `pip install "nemo-retriever[local,nemotron-parse]"` when you also run models on your GPU).
 For more information, refer to [Nemotron Parse](https://build.nvidia.com/nvidia/nemotron-parse).
 
 ## Why are the environment variables different between library mode and self-hosted mode?
diff --git a/docs/docs/extraction/troubleshoot.md b/docs/docs/extraction/troubleshoot.md
index 10c4d457f..26a90bab1 100644
--- a/docs/docs/extraction/troubleshoot.md
+++ b/docs/docs/extraction/troubleshoot.md
@@ -108,7 +108,7 @@ When you run PDF extraction with `extract_method="nemotron_parse"`, you might se
 ModuleNotFoundError: No module named 'open_clip'
 ```
 
-The Nemotron Parse NIM client requires the `open_clip` Python module from `open-clip-torch==3.2.0`. That package is not part of the default `nemo-retriever` install or the `[local]` extra.
+The Nemotron Parse NIM client requires the `open_clip` Python module, provided by `open-clip-torch`. That package is not part of the default `nemo-retriever` install or the `[local]` extra.
 
 Install the dedicated PyPI extra before running Nemotron Parse extraction: