1 change: 1 addition & 0 deletions docs/docs/extraction/faq.md
@@ -29,6 +29,7 @@ For more information, refer to [Extract Captions from Images](nemo-retriever-api

For scanned documents, or documents with complex layouts,
you can use [nemotron-parse](https://build.nvidia.com/nvidia/nemotron-parse) as an alternate PDF extraction method by setting `extract_method="nemotron_parse"`.
Install the Python client dependencies first with `pip install "nemo-retriever[nemotron-parse]"` (or combine extras as `pip install "nemo-retriever[local,nemotron-parse]"` when you also run models on your GPU).
For more information, refer to [Nemotron Parse](https://build.nvidia.com/nvidia/nemotron-parse).

## Why are the environment variables different between library mode and self-hosted mode?
1 change: 1 addition & 0 deletions docs/docs/extraction/overview.md
@@ -15,6 +15,7 @@ NeMo Retriever Library does the following:

- Accept directories of input files and a series of configurable ingestion tasks to perform on that input
- Allow the extracted content to be retrieved from a VDB containing discrete metadata elements
- Support multiple extraction methods per document type to balance throughput and accuracy—for example, PDFs can use **pdfium** or [Nemotron Parse](https://build.nvidia.com/nvidia/nemotron-parse) (`extract_method="nemotron_parse"`)
- Support various types of pre- and post-processing operations, including text splitting and chunking, transforms and filtering, embedding generation, and image offloading to storage.

!!! note
5 changes: 5 additions & 0 deletions docs/docs/extraction/prerequisites-support-matrix.md
@@ -13,6 +13,11 @@ Before you begin using [NeMo Retriever Library](overview.md), confirm your softw
`ffmpeg-python` and `nemo-retriever[multimedia]` do not install these binaries.
On Helm with package-repo access, set `service.installFfmpeg=true`. For
air-gapped clusters, see [Air-gapped and disconnected deployment](deployment-options.md#air-gapped-deployment).
- For PDF extraction with `extract_method="nemotron_parse"`, install the Nemotron Parse
client dependencies with `pip install "nemo-retriever[nemotron-parse]"` (pulls
`open-clip-torch`, which provides the `open_clip` module required by the Nemotron Parse
NIM client). The base `nemo-retriever` install and `[local]` extra do not include this
package.

!!! note

24 changes: 24 additions & 0 deletions docs/docs/extraction/troubleshoot.md
@@ -100,6 +100,30 @@ You can set the variable in your .env file or directly in your environment.



## ModuleNotFoundError: No module named open_clip when using nemotron_parse { #modulenotfounderror-no-module-named-open-clip-when-using-nemotron-parse }

When you run PDF extraction with `extract_method="nemotron_parse"`, you might see an error similar to the following:

```text
ModuleNotFoundError: No module named 'open_clip'
```

The Nemotron Parse NIM client requires the `open_clip` Python module, provided by `open-clip-torch`. That package is not part of the default `nemo-retriever` install or the `[local]` extra.

Install the dedicated PyPI extra before running Nemotron Parse extraction:

```bash
pip install "nemo-retriever[nemotron-parse]"
```

For local GPU inference with Nemotron Parse, combine extras:

```bash
pip install "nemo-retriever[local,nemotron-parse]"
```
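
To confirm that the install resolved the error before starting a long extraction job, a small preflight check can help. The following is a minimal sketch that uses only the Python standard library; the helper name `missing_modules` and the constant `REQUIRED_FOR_NEMOTRON_PARSE` are hypothetical names chosen for illustration and are not part of `nemo-retriever`.

```python
import importlib.util


def missing_modules(module_names):
    """Return the top-level module names that cannot be located for import."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


# `open_clip` is the module named in the error above; `open-clip-torch` provides it.
# This list is an illustrative assumption, not an official dependency manifest.
REQUIRED_FOR_NEMOTRON_PARSE = ["open_clip"]

if __name__ == "__main__":
    missing = missing_modules(REQUIRED_FOR_NEMOTRON_PARSE)
    if missing:
        print(f"Missing modules: {', '.join(missing)}")
        print('Try: pip install "nemo-retriever[nemotron-parse]"')
    else:
        print("open_clip is importable; the Nemotron Parse client dependency is present.")
```

Run the script in the same virtual environment that runs the extraction. The check only locates the module and does not import it, so it does not verify that `torch` or other transitive dependencies load correctly.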

See also [What is NeMo Retriever Library?](overview.md) and [Pre-Requisites & Support Matrix](prerequisites-support-matrix.md#software-requirements).

## Extract method nemotron-parse doesn't support image files

Currently, extraction with Nemotron Parse doesn't support image files, only scanned PDFs.