diff --git a/docs/docs/extraction/faq.md b/docs/docs/extraction/faq.md
index 7014cb4ee..822443d2d 100644
--- a/docs/docs/extraction/faq.md
+++ b/docs/docs/extraction/faq.md
@@ -29,6 +29,7 @@ For more information, refer to [Extract Captions from Images](nemo-retriever-api
 For scanned documents, or documents with complex layouts, you can use [nemotron-parse](https://build.nvidia.com/nvidia/nemotron-parse) as an alternate PDF extraction method by setting `extract_method="nemotron_parse"`.
+Install the Python client dependencies first with `pip install "nemo-retriever[nemotron-parse]"` (or combine extras as `pip install "nemo-retriever[local,nemotron-parse]"` when you also run models on your GPU).
 For more information, refer to [Nemotron Parse](https://build.nvidia.com/nvidia/nemotron-parse).
 ## Why are the environment variables different between library mode and self-hosted mode?
diff --git a/docs/docs/extraction/overview.md b/docs/docs/extraction/overview.md
index 07ef2c83d..0af0827ed 100644
--- a/docs/docs/extraction/overview.md
+++ b/docs/docs/extraction/overview.md
@@ -15,6 +15,7 @@ NeMo Retriever Library does the following:
 - Accept directories of input files and a series of configurable ingestion tasks to perform on that input
 - Allow the extracted content be retrieved from a VDB containing discrete metadata element
+- Support multiple extraction methods per document type to balance throughput and accuracy—for example, PDFs can use **pdfium** or [Nemotron Parse](https://build.nvidia.com/nvidia/nemotron-parse) (`extract_method="nemotron_parse"`)
 - Support various types of pre- and post- processing operations, including text splitting and chunking, transform and filtering, embedding generation, and image offloading to storage.
 !!! note
diff --git a/docs/docs/extraction/prerequisites-support-matrix.md b/docs/docs/extraction/prerequisites-support-matrix.md
index 7d9df5911..cdbb2427a 100644
--- a/docs/docs/extraction/prerequisites-support-matrix.md
+++ b/docs/docs/extraction/prerequisites-support-matrix.md
@@ -13,6 +13,11 @@ Before you begin using [NeMo Retriever Library](overview.md), confirm your softw
 `ffmpeg-python` and `nemo-retriever[multimedia]` do not install these binaries. On Helm with package-repo access, set `service.installFfmpeg=true`. For air-gapped clusters, see [Air-gapped and disconnected deployment](deployment-options.md#air-gapped-deployment).
+- For PDF extraction with `extract_method="nemotron_parse"`, install the Nemotron Parse
+  client dependencies with `pip install "nemo-retriever[nemotron-parse]"` (pulls
+  `open-clip-torch`, which provides the `open_clip` module required by the Nemotron Parse
+  NIM client). The base `nemo-retriever` install and `[local]` extra do not include this
+  package.
 !!! note
diff --git a/docs/docs/extraction/troubleshoot.md b/docs/docs/extraction/troubleshoot.md
index fa415cd7e..26a90bab1 100644
--- a/docs/docs/extraction/troubleshoot.md
+++ b/docs/docs/extraction/troubleshoot.md
@@ -100,6 +100,30 @@ You can set the variable in your .env file or directly in your environment.
+## ModuleNotFoundError: No module named open_clip when using nemotron_parse { #modulenotfounderror-no-module-named-open-clip-when-using-nemotron-parse }
+
+When you run PDF extraction with `extract_method="nemotron_parse"`, you might see an error similar to the following:
+
+```text
+ModuleNotFoundError: No module named 'open_clip'
+```
+
+The Nemotron Parse NIM client requires the `open_clip` Python module, provided by `open-clip-torch`. That package is not part of the default `nemo-retriever` install or the `[local]` extra.
+
+Install the dedicated PyPI extra before running Nemotron Parse extraction:
+
+```bash
+pip install "nemo-retriever[nemotron-parse]"
+```
+
+For local GPU inference with Nemotron Parse, combine extras:
+
+```bash
+pip install "nemo-retriever[local,nemotron-parse]"
+```
+
+See also [What is NeMo Retriever Library?](overview.md) and [Pre-Requisites & Support Matrix](prerequisites-support-matrix.md#software-requirements).
+
 ## Extract method nemotron-parse doesn't support image files
 Currently, extraction with Nemotron parse doesn't support image files, only scanned PDFs.
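
For the `ModuleNotFoundError` entry in `troubleshoot.md`, a preflight check can confirm that `open_clip` is importable before running extraction with `extract_method="nemotron_parse"`. The following is a minimal sketch, not part of `nemo-retriever`: `REQUIRED_MODULES` and `missing_modules` are hypothetical names introduced for illustration, and the only real API used is the standard-library `importlib.util.find_spec`.

```python
import importlib.util

# Hypothetical mapping for illustration: import name -> pip requirement.
# The open_clip / nemo-retriever[nemotron-parse] pairing is taken from the
# troubleshooting entry; nothing here is a nemo-retriever API.
REQUIRED_MODULES = {"open_clip": "nemo-retriever[nemotron-parse]"}


def missing_modules(required):
    """Return {module: requirement} for each top-level module that cannot be found."""
    return {
        name: requirement
        for name, requirement in required.items()
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(name) is None
    }


if __name__ == "__main__":
    for name, requirement in missing_modules(REQUIRED_MODULES).items():
        print(f'Missing module {name!r}; try: pip install "{requirement}"')
```

The check only locates the module and does not import it, so it stays fast and avoids loading `torch` just to report a missing dependency.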