11 changes: 5 additions & 6 deletions docs/docs/extraction/audio-video.md
Original file line number Diff line number Diff line change
Expand Up @@ -63,7 +63,7 @@ This pipeline enables retrieval at the speech segment level when you enable segm

Use the following procedure to run the NIM on your own infrastructure. Self-hosted Parakeet runs on Kubernetes via the [NeMo Retriever Helm chart](https://github.com/NVIDIA/NeMo-Retriever/blob/main/nemo_retriever/helm/README.md). Enable the ASR NIM per [Optional Helm NIMs](prerequisites-support-matrix.md#optional-helm-nims-not-auto-wired-by-default) and the [Helm chart — NIM operator sub-stack](https://github.com/NVIDIA/NeMo-Retriever/blob/main/nemo_retriever/helm/README.md#nim-operator-sub-stack); pin the workload to a dedicated GPU and wire the ASR endpoint in your pipeline.

!!! important
After deploy, call the pipeline from Python:

Pin the Parakeet workload to the dedicated GPU with your Helm values or the [NIM Operator](https://docs.nvidia.com/nim-operator/latest/index.html) (for example, node selectors, resource limits, or device requests appropriate to your cluster).
Comment on lines +66 to 68
P1 GPU pinning note silently rendered as a code block

After the !!! important admonition was removed, the paragraph at line 68 (Pin the Parakeet workload…) is now indented by 4 spaces with no enclosing list item. In Markdown (including MkDocs Material), a 4-space-indented paragraph outside a list context is treated as an indented code block, so this critical deployment warning will render as <pre><code> text rather than readable prose — readers following the setup steps will miss the GPU pinning requirement entirely.

Prompt To Fix With AI
This is a comment left during a code review.
Path: docs/docs/extraction/audio-video.md
Line: 66-68

Comment:
**GPU pinning note silently rendered as a code block**

After the `!!! important` admonition was removed, the paragraph at line 68 (`Pin the Parakeet workload…`) is now indented by 4 spaces with no enclosing list item. In Markdown (including MkDocs Material), a 4-space-indented paragraph outside a list context is treated as an **indented code block**, so this critical deployment warning will render as `<pre><code>` text rather than readable prose — readers following the setup steps will miss the GPU pinning requirement entirely.

How can I resolve this? If you propose a fix, please make it concise.
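
A check along these lines could catch the problem before the page is published. This is a minimal sketch, not a full Markdown parser: `find_stray_indented_lines` is a hypothetical helper, and it assumes that list items and `!!!`/`???` admonitions are the only containers that legitimately hold 4-space-indented prose.

```python
import re
from typing import List

LIST_ITEM = re.compile(r"^\s*(?:[-*+]|\d+[.)])\s+")
ADMONITION = re.compile(r"^\s*(?:!!!|\?\?\?\+?)\s+\w+")
FENCE = re.compile(r"^\s*(`{3,}|~{3,})")


def find_stray_indented_lines(markdown: str) -> List[int]:
    """Return 1-based line numbers that start an indented block which has
    no enclosing list item or admonition (heuristic only)."""
    hits = []
    in_fence = False
    in_container = False  # True while inside a list item or admonition body
    prev_blank = True
    for number, line in enumerate(markdown.splitlines(), start=1):
        if FENCE.match(line):
            in_fence = not in_fence
            prev_blank = False
            continue
        if in_fence:
            continue
        if not line.strip():
            prev_blank = True
            continue
        indented = line.startswith("    ") or line.startswith("\t")
        if LIST_ITEM.match(line) or ADMONITION.match(line):
            in_container = True
        elif not indented:
            in_container = False  # a flush-left paragraph ends the container
        elif prev_blank and not in_container:
            hits.append(number)  # would render as an indented code block
        prev_blank = False
    return hits
```

Running it over `audio-video.md` in CI and failing on a non-empty result would flag the GPU pinning paragraph; nested or unusual constructs may still need a manual look.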


Expand All @@ -87,15 +87,14 @@ Use the following procedure to run the NIM on your own infrastructure. Self-host
asr_params=ASRParams(segment_audio=True),
)
)
```
)
```
Comment on lines 88 to +91
P1 Stray ) produces a SyntaxError in the code sample

The closing ) at line 90 is placed outside the code fence, making it part of the rendered code content. The code block therefore ends with two consecutive ) characters — one closing extract_audio(...) and an extra one below ingestor = (...). Anyone copying this snippet will get a SyntaxError immediately.

Suggested change
)
)
```
)
```
)
)
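
Unbalanced parentheses inside a sample can be caught mechanically by parsing every Python fence in the page. The following is a rough sketch under stated assumptions: `python_fence_errors` is a hypothetical helper, it only looks at backtick fences labeled `python`, and it will not notice stray text that sits between two fences.

```python
import ast
import re
import textwrap
from typing import List

TICKS = "`" * 3  # built at runtime so this sample does not contain a literal fence
PY_FENCE = re.compile(
    r"^[ \t]*" + TICKS + r"python[^\n]*\n(.*?)^[ \t]*" + TICKS + r"[ \t]*$",
    re.DOTALL | re.MULTILINE,
)


def python_fence_errors(markdown: str) -> List[str]:
    """Parse each python fence and report the ones that fail to compile."""
    errors = []
    for index, match in enumerate(PY_FENCE.finditer(markdown), start=1):
        # Fences nested under list items are indented, so dedent before parsing.
        code = textwrap.dedent(match.group(1))
        try:
            ast.parse(code)
        except SyntaxError as exc:
            errors.append(
                f"python block {index}: {exc.msg} (line {exc.lineno} of the block)"
            )
    return errors
```

`ast.parse` only checks syntax, so it does not import `nemo_retriever` or execute the sample.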

<details><summary>Prompt To Fix With AI</summary>

`````markdown
This is a comment left during a code review.
Path: docs/docs/extraction/audio-video.md
Line: 88-91

Comment:
**Stray `)` produces a `SyntaxError` in the code sample**

The closing `)` at line 90 is placed outside the code fence, making it part of the rendered code content. The code block therefore ends with two consecutive `)` characters — one closing `extract_audio(...)` and an extra one below `ingestor = (...)`. Anyone copying this snippet will get a `SyntaxError` immediately.

```suggestion
        )
    )
```

How can I resolve this? If you propose a fix, please make it concise.
`````

</details>


To generate one extracted element for each sentence-like ASR segment, include `extract_audio_params={"segment_audio": True}` when calling `.extract(...)`. This option applies when audio extraction runs with a self-hosted Parakeet NIM or using build.nvidia.com hosted inference, but has no effect when using the local Hugging Face Parakeet model.

To generate one extracted element for each sentence-like ASR segment, pass `asr_params=ASRParams(segment_audio=True)` to `.extract_audio(...)`. This option applies when audio extraction runs with a self-hosted Parakeet NIM or using build.nvidia.com hosted inference, but has no effect when using the local Hugging Face Parakeet model.


!!! tip

For more Python examples, refer to [Python Quick Start Guide](https://github.com/NVIDIA/NeMo-Retriever/blob/main/client/client_examples/examples/python_client_usage.ipynb).
For more Python examples, refer to [Python Quick Start Guide](https://github.com/NVIDIA/NeMo-Retriever/blob/main/client/client_examples/examples/python_client_usage.ipynb).
Comment on lines +93 to +97
P2 Duplicate near-identical segment_audio paragraphs with conflicting API names

Line 93 (unindented) says to use extract_audio_params={"segment_audio": True} with .extract(...), while line 95 (indented continuation of step 3) says to use asr_params=ASRParams(segment_audio=True) with .extract_audio(...). These look like two different API call styles that both appeared after the admonition block was removed. One of them should be removed, or it should be clarified which applies to library mode vs. the service ingestor.

Prompt To Fix With AI
This is a comment left during a code review.
Path: docs/docs/extraction/audio-video.md
Line: 93-97

Comment:
**Duplicate near-identical `segment_audio` paragraphs with conflicting API names**

Line 93 (unindented) says to use `extract_audio_params={"segment_audio": True}` with `.extract(...)`, while line 95 (indented continuation of step 3) says to use `asr_params=ASRParams(segment_audio=True)` with `.extract_audio(...)`. These look like two different API call styles that both appeared after the admonition block was removed. One of them should be removed, or it should be clarified which applies to library mode vs. the service ingestor.

How can I resolve this? If you propose a fix, please make it concise.


## Parakeet with hosted inference (build.nvidia.com) { #parakeet-hosted-inference-build-nvidia }

Expand Down
2 changes: 1 addition & 1 deletion docs/docs/extraction/concepts.md
Original file line number Diff line number Diff line change
Expand Up @@ -36,6 +36,6 @@ Token-based splitting uses the Llama 3.2 1B tokenizer (default `meta-llama/Llama

- **Library mode** — Run without the full container stack where appropriate; see [Deployment options](deployment-options.md).
- **Kubernetes / Helm (self-hosted)** — See [Deploy (Helm chart)](https://github.com/NVIDIA/NeMo-Retriever/blob/main/nemo_retriever/helm/README.md) and [deployment options](deployment-options.md) for running the full microservices pipeline on your infrastructure.
- **Notebooks** — [Jupyter examples](notebooks.md) for experimentation and RAG demos.
- **Notebooks** — [Jupyter examples](notebooks/index.md) for experimentation and RAG demos.

For a concise comparison, refer to [Deployment options](deployment-options.md).
80 changes: 25 additions & 55 deletions docs/docs/extraction/custom-metadata.md
Original file line number Diff line number Diff line change
@@ -1,74 +1,42 @@
# Use Custom Metadata to Filter Search Results
# Custom metadata and filtering

You can upload custom metadata for documents during ingestion.
By uploading custom metadata you can attach additional information to documents,
and use it for filtering results during retrieval operations.
For example, you can add author metadata to your documents, and filter by author when you retrieve results.
To create filters at query time, use predicates supported by [LanceDB SQL](https://lancedb.github.io/lancedb/sql/) against your table schema (custom fields are serialized into the `metadata` column with your ingested chunks). For a worked example, see the repository notebook linked at the end of this page.
Use this documentation to attach per-document metadata during ingestion and to narrow [LanceDB](vdbs.md) search results in [NeMo Retriever Library](overview.md). Implementation details live in the package [Vector DB operators and LanceDB](https://github.com/NVIDIA/NeMo-Retriever/tree/main/nemo_retriever/src/nemo_retriever/vdb#metadata-filtering) README.

Use this documentation to use custom metadata to filter search results when you work with [NeMo Retriever Library](overview.md).
## On this page { #on-this-page }

- [Attach metadata at ingestion](#attach-metadata-at-ingestion)
- [How metadata is stored](#how-metadata-is-stored)
- [Filter results at query time](#filter-results-at-query-time)
- [Writing `where` predicates](#writing-where-predicates)
- [Server-side vs client-side filters](#server-side-vs-client-side-filters)
- [Inspect hit metadata](#inspect-hit-metadata)
- [Limitations](#limitations)
- [Related content](#related-content)
Comment on lines +5 to +14
P1 "On this page" TOC contains 6 broken anchor links

The table of contents added in this PR references #filter-results-at-query-time, #writing-where-predicates, #server-side-vs-client-side-filters, #inspect-hit-metadata, #limitations, and #related-content. None of these section headings exist in the current file body (128 lines). The body still contains the old 26.05 structure (## Best Practices, ## Use Custom Metadata to Filter Results During Retrieval, etc.) rather than the restructured sections the TOC was written for. Clicking any of these six links in the published docs will silently scroll to the top of the page.

Prompt To Fix With AI
This is a comment left during a code review.
Path: docs/docs/extraction/custom-metadata.md
Line: 5-14

Comment:
**"On this page" TOC contains 6 broken anchor links**

The table of contents added in this PR references `#filter-results-at-query-time`, `#writing-where-predicates`, `#server-side-vs-client-side-filters`, `#inspect-hit-metadata`, `#limitations`, and `#related-content`. None of these section headings exist in the current file body (128 lines). The body still contains the old 26.05 structure (`## Best Practices`, `## Use Custom Metadata to Filter Results During Retrieval`, etc.) rather than the restructured sections the TOC was written for. Clicking any of these six links in the published docs will silently scroll to the top of the page.

How can I resolve this? If you propose a fix, please make it concise.
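
Broken in-page anchors like these can be detected by comparing every `](#anchor)` link against the headings in the same file. The sketch below is an approximation, not the MkDocs implementation: `broken_anchors` and `slugify` are hypothetical helpers, explicit `{ #id }` attributes are honored as written, and headings without one get a simplified slug that may differ from the one the site generator produces.

```python
import re
from typing import List

TICKS = "`" * 3
FENCED_BLOCK = re.compile(
    r"^[ \t]*" + TICKS + r".*?^[ \t]*" + TICKS + r"[ \t]*$",
    re.DOTALL | re.MULTILINE,
)
HEADING = re.compile(r"^#{1,6}[ \t]+(.*?)[ \t]*$", re.MULTILINE)
EXPLICIT_ID = re.compile(r"\{\s*#([\w-]+)\s*\}$")
LINK_ANCHOR = re.compile(r"\]\(#([^)\s]+)\)")


def slugify(title: str) -> str:
    """Approximate a heading slug: lowercase, drop punctuation, hyphenate spaces."""
    title = re.sub(r"[`*]", "", title).lower()
    title = re.sub(r"[^\w\s-]", "", title)
    return re.sub(r"\s+", "-", title.strip())


def broken_anchors(markdown: str) -> List[str]:
    """Return in-page anchors that are linked but never defined by a heading."""
    # Drop fenced code first so "# comment" lines are not mistaken for headings.
    body = FENCED_BLOCK.sub("", markdown)
    defined = set()
    for match in HEADING.finditer(body):
        title = match.group(1)
        explicit = EXPLICIT_ID.search(title)
        defined.add(explicit.group(1) if explicit else slugify(title))
    return [anchor for anchor in LINK_ANCHOR.findall(body) if anchor not in defined]
```

A non-empty result for `custom-metadata.md` would list the anchors the TOC references but the body never defines.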


## Limitations
## Attach metadata at ingestion { #attach-metadata-at-ingestion }

The following are limitation when you use custom metadata:
Pass a **sidecar metadata table** on `vdb_upload` so selected columns are merged into each chunk's `content_metadata` before LanceDB upload. All three parameters must be set together:

- Metadata fields must be consistent across documents in the same collection.
- Complex filter expressions may impact retrieval performance.
- If you update your custom metadata, you must ingest your documents again to use the new metadata.
| Parameter | Purpose |
|-----------|---------|
| `meta_dataframe` | Path to CSV, JSON, or Parquet, or an in-memory `pandas.DataFrame` |
| `meta_source_field` | Column that identifies each document (must match ingest paths or basenames per `meta_join_key`) |
| `meta_fields` | Non-empty list of column names to copy into `content_metadata` |
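
The "all three together" rule in the table can be illustrated with a small pre-flight check. This is only a sketch of the constraint as the text states it: `check_sidecar_args` is a hypothetical helper, not part of `nemo_retriever`, and the library's own validation may behave differently.

```python
from typing import Any, Optional, Sequence


def check_sidecar_args(
    meta_dataframe: Optional[Any],
    meta_source_field: Optional[str],
    meta_fields: Optional[Sequence[str]],
) -> bool:
    """Return True if sidecar metadata is fully configured, False if fully unset.

    Raises ValueError when only some parameters are set or meta_fields is empty.
    """
    values = {
        "meta_dataframe": meta_dataframe,
        "meta_source_field": meta_source_field,
        "meta_fields": meta_fields,
    }
    # Compare against None so a pandas.DataFrame is never truth-tested.
    provided = [name for name, value in values.items() if value is not None]
    if not provided:
        return False
    missing = [name for name in values if name not in provided]
    if missing:
        raise ValueError(f"set all three together; missing: {', '.join(missing)}")
    if len(meta_fields) == 0:
        raise ValueError("meta_fields must be a non-empty list of column names")
    return True
```

Calling this before `vdb_upload` would surface a partial configuration as an explicit error message.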



## Add Custom Metadata During Ingestion

You can add custom metadata during the document ingestion process.
You can specify metadata for each file,
and you can specify different metadata for different documents in the same ingestion batch.


### Metadata Structure

You specify custom metadata as a dataframe or a file (json, csv, or parquet).

The following example contains metadata fields for category, department, and timestamp.
You can create whatever metadata is helpful for your scenario.
Optional `meta_join_key` controls how rows are matched to documents: `auto` (try full path then basename), `source_id` (full path), or `source_name` (basename only).

```python
import pandas as pd
from nemo_retriever import create_ingestor

meta_df = pd.DataFrame(
{
"source": ["data/woods_frost.pdf", "data/multimodal_test.pdf"],
"category": ["Alpha", "Bravo"],
"department": ["Language", "Engineering"],
"timestamp": ["2025-05-01T00:00:00", "2025-05-02T00:00:00"]
"meta_a": ["alpha", "bravo"],
"meta_b": [10, 20],
}
)

# Convert the dataframe to a csv file,
# to demonstrate how to ingest a metadata file in a later step.

file_path = "./meta_file.csv"
meta_df.to_csv(file_path)
```
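
The three `meta_join_key` modes described above can be pictured as a lookup against the values in the `meta_source_field` column. The sketch below is an illustration of that description only: `match_row` is a hypothetical function, and it assumes that `source_name` compares the document's basename against the sidecar value as written.

```python
import posixpath
from typing import Dict, Optional


def match_row(
    doc_path: str,
    rows_by_source: Dict[str, dict],
    join_key: str = "auto",
) -> Optional[dict]:
    """Find the sidecar row for one ingested document.

    rows_by_source maps each meta_source_field value to its metadata columns.
    """
    basename = posixpath.basename(doc_path)
    if join_key == "source_id":    # full path only
        return rows_by_source.get(doc_path)
    if join_key == "source_name":  # basename only
        return rows_by_source.get(basename)
    if join_key == "auto":         # try the full path, then fall back to the basename
        row = rows_by_source.get(doc_path)
        return row if row is not None else rows_by_source.get(basename)
    raise ValueError(f"unknown meta_join_key: {join_key!r}")
```

With a sidecar keyed by `woods_frost.pdf`, the document `data/woods_frost.pdf` matches under `auto` and `source_name` but not under `source_id`.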


### Example: Add Custom Metadata During Ingestion

The following example adds custom metadata during ingestion.
For more information about `create_ingestor` and run modes, refer to [Use the Python API](nemo-retriever-api-reference.md).
For more information about the `vdb_upload` method, refer to [Upload Data](vdbs.md).

```python
from nemo_retriever import create_ingestor

# Service-backed pipeline: point `base_url` at your running retriever service.
# For local graph execution instead, see [Use the Python API](nemo-retriever-api-reference.md).

hostname = "localhost"
table_name = "nemo_retriever_collection"
lancedb_uri = "./lancedb_data"

ingestor = (
create_ingestor(run_mode="service", base_url=f"http://{hostname}:7670")
.files(["data/woods_frost.pdf", "data/multimodal_test.pdf"])
Comment on lines 40 to 42
P1 Undefined variables make the code example un-runnable

The diff removes the hostname, table_name, and lancedb_uri variable definitions that previously preceded the ingestor = (...) block, but the create_ingestor(...) call still references all three. Copying this snippet results in a NameError on hostname. The variable definitions need to be restored.

Suggested change
ingestor = (
create_ingestor(run_mode="service", base_url=f"http://{hostname}:7670")
.files(["data/woods_frost.pdf", "data/multimodal_test.pdf"])
hostname = "localhost"
table_name = "nemo_retriever_collection"
lancedb_uri = "./lancedb_data"
ingestor = (
create_ingestor(run_mode="service", base_url=f"http://{hostname}:7670")
.files(["data/woods_frost.pdf", "data/multimodal_test.pdf"])
Prompt To Fix With AI
This is a comment left during a code review.
Path: docs/docs/extraction/custom-metadata.md
Line: 40-42

Comment:
**Undefined variables make the code example un-runnable**

The diff removes the `hostname`, `table_name`, and `lancedb_uri` variable definitions that previously preceded the `ingestor = (...)` block, but the `create_ingestor(...)` call still references all three. Copying this snippet results in a `NameError` on `hostname`. The variable definitions need to be restored.

```suggestion
hostname = "localhost"
table_name = "nemo_retriever_collection"
lancedb_uri = "./lancedb_data"

ingestor = (
    create_ingestor(run_mode="service", base_url=f"http://{hostname}:7670")
        .files(["data/woods_frost.pdf", "data/multimodal_test.pdf"])
```

How can I resolve this? If you propose a fix, please make it concise.

Expand Down Expand Up @@ -150,9 +118,11 @@ hits = retriever.query(
)
```

For a runnable end-to-end flow (ingest, `Retriever.query`, and both filter modes), see [nemo_retriever_retriever_query_metadata_filter.ipynb](https://github.com/NVIDIA/NeMo-Retriever/blob/main/examples/nemo_retriever_retriever_query_metadata_filter.ipynb).

When you ingest through the **retriever service**, upload the sidecar with [`POST /v1/ingest/sidecar`](https://github.com/NVIDIA/NeMo-Retriever/blob/main/nemo_retriever/src/nemo_retriever/service/routers/ingest.py#L1040-L1129) (multipart file; response [`SidecarUploadResponse`](https://github.com/NVIDIA/NeMo-Retriever/blob/main/nemo_retriever/src/nemo_retriever/service/models/responses.py#L60-L68)), then pass the returned `sidecar_id` as `meta_dataframe_id` with `meta_source_field` and `meta_fields` in `pipeline.vdb_upload_params` on [`POST /v1/ingest`](https://github.com/NVIDIA/NeMo-Retriever/blob/main/nemo_retriever/src/nemo_retriever/service/models/requests.py#L15-L32) ([`PipelineSpec`](https://github.com/NVIDIA/NeMo-Retriever/blob/main/nemo_retriever/src/nemo_retriever/service/models/pipeline_spec.py#L55-L78)). Request and response shapes, form fields, and auth headers are in the service OpenAPI UI at `/docs` (or `/openapi.json`) on your retriever base URL (for example `http://localhost:7670/docs` after `retriever service start`). Do not send a raw local path as `meta_dataframe` on the service spec.
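
The second step of that flow can be sketched as building the `pipeline.vdb_upload_params` fragment from the returned `sidecar_id`. Treat this as an illustration under assumptions: the helper name is hypothetical, only the keys named in the paragraph above are shown, and the remaining fields of the `POST /v1/ingest` body should be taken from the service OpenAPI UI at `/docs`.

```python
import json
from typing import Dict, Sequence


def build_pipeline_fragment(
    sidecar_id: str,
    meta_source_field: str,
    meta_fields: Sequence[str],
) -> Dict[str, dict]:
    """Build the pipeline.vdb_upload_params part of an ingest request.

    sidecar_id is the value returned by the sidecar upload call; a raw local
    path must not be sent in its place.
    """
    if not meta_fields:
        raise ValueError("meta_fields must name at least one column")
    return {
        "pipeline": {
            "vdb_upload_params": {
                "meta_dataframe_id": sidecar_id,
                "meta_source_field": meta_source_field,
                "meta_fields": list(meta_fields),
            }
        }
    }


if __name__ == "__main__":
    # "sidecar-123" is a placeholder, not a real identifier format.
    fragment = build_pipeline_fragment("sidecar-123", "source", ["meta_a", "meta_b"])
    print(json.dumps(fragment, indent=2))
```

The fragment would be merged into the rest of the request body before it is sent to the service.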

## Related Content
## How metadata is stored { #how-metadata-is-stored }

- [Vector databases](vdbs.md) — canonical LanceDB upload and retrieval guide
- [metadata_and_filtered_search.ipynb](https://github.com/NVIDIA/NeMo-Retriever/blob/main/examples/metadata_and_filtered_search.ipynb) — CLI and graph ingest with sidecar metadata
Comment on lines +125 to 128
P1 Section heading "How metadata is stored" contains only cross-reference bullets

The heading at line 125 was renamed from ## Related Content to ## How metadata is stored, but its body was not updated — it still contains just two reference links. Readers navigating to this section via the TOC will find no explanation of how metadata is persisted (e.g., serialized into the metadata column, how content_metadata fields are mapped). Either restore the "Related content" heading or replace the bullets with the intended storage explanation.

Prompt To Fix With AI
This is a comment left during a code review.
Path: docs/docs/extraction/custom-metadata.md
Line: 125-128

Comment:
**Section heading "How metadata is stored" contains only cross-reference bullets**

The heading at line 125 was renamed from `## Related Content` to `## How metadata is stored`, but its body was not updated — it still contains just two reference links. Readers navigating to this section via the TOC will find no explanation of how metadata is persisted (e.g., serialized into the `metadata` column, how `content_metadata` fields are mapped). Either restore the "Related content" heading or replace the bullets with the intended storage explanation.

How can I resolve this? If you propose a fix, please make it concise.

2 changes: 1 addition & 1 deletion docs/docs/extraction/deployment-options.md
Original file line number Diff line number Diff line change
Expand Up @@ -32,7 +32,7 @@ environments), use a custom service image that already contains `ffmpeg` and

### I want examples and notebooks

1. [Jupyter Notebooks](notebooks.md)
1. [Jupyter Notebooks](notebooks/index.md)
2. [Integrate with LangChain, LlamaIndex, Haystack](integrations-langchain-llamaindex-haystack.md)

### I need API details and keys
Expand Down
6 changes: 4 additions & 2 deletions docs/docs/extraction/faq.md
Original file line number Diff line number Diff line change
Expand Up @@ -20,10 +20,12 @@ For more information, refer to [Vector databases](vdbs.md).

For images that `nemoretriever-page-elements-v3` does not classify as tables, charts, or infographics,
you can use our VLM caption task to create a dense caption of the detected image.
That caption is then be embedded along with the rest of your content.
For more information, refer to [Extract Captions from Images](nemo-retriever-api-reference.md).
That caption is then embedded along with the rest of your content.
For chart-labeled PDF regions and other caption scope limits, see [Are PDF chart or figure regions captioned when Omni is enabled?](#are-pdf-chart-or-figure-regions-captioned-when-omni-is-enabled). For more information, refer to [Extract Captions from Images](nemo-retriever-api-reference.md).

## Are PDF chart or figure regions captioned when Omni is enabled?

No. Chart-labeled PDF regions are not routed through Omni captioning. See [Image captioning](prerequisites-support-matrix.md#image-captioning-2605) for scope, validation, and what the caption stage covers.

## When should I consider advanced visual parsing?

Expand Down
2 changes: 1 addition & 1 deletion docs/docs/extraction/getting-started-about.md
Original file line number Diff line number Diff line change
Expand Up @@ -10,6 +10,6 @@ Typical order:
- [Deployment options](deployment-options.md) for how to run NeMo Retriever Library
- **Supported:** [Helm chart](https://github.com/NVIDIA/NeMo-Retriever/blob/main/nemo_retriever/helm/README.md) for Kubernetes, plus [NeMo Retriever Library install docs](https://docs.nvidia.com/nemo/retriever/latest/extraction/overview/) for the published charts
- **Unsupported (developer-only):** [Docker Compose (local)](https://github.com/NVIDIA/NeMo-Retriever/blob/main/nemo_retriever/docker.md) — not a supported NIM deployment path
4. Explore [Jupyter Notebooks](notebooks.md) for end-to-end examples.
4. Explore [Jupyter Notebooks](notebooks/index.md) for end-to-end examples.

If you are new to the product, read [What is NeMo Retriever Library?](overview.md) and [Concepts](concepts.md) under **Introduction** first.
Original file line number Diff line number Diff line change
Expand Up @@ -9,7 +9,7 @@ The repository includes notebooks that demonstrate multimodal RAG patterns:
- [Multimodal RAG with LangChain](https://github.com/NVIDIA/NeMo-Retriever/blob/main/examples/langchain_multimodal_rag.ipynb)
- [Multimodal RAG with LlamaIndex](https://github.com/NVIDIA/NeMo-Retriever/blob/main/examples/llama_index_multimodal_rag.ipynb)

These are also linked from [Jupyter Notebooks](notebooks.md) and the [FAQ](faq.md).
These are also linked from [Jupyter Notebooks](notebooks/index.md) and the [FAQ](faq.md).

## Haystack

Expand Down
5 changes: 3 additions & 2 deletions docs/docs/extraction/multimodal-extraction.md
Original file line number Diff line number Diff line change
Expand Up @@ -49,8 +49,9 @@ NeMo Retriever Library detects tables as structured page elements, processes the

Charts and infographic regions are classified with other page layout elements (tables, text blocks, titles) and processed through layout detection and OCR. `extract_charts` and `extract_infographics` are enabled by default. Outputs use the same metadata schema as other extracted objects.

Chart-labeled PDF regions are **not** routed through the Omni caption stage; they remain on the layout-and-OCR path. For scope and validation guidance, see [Image captioning](prerequisites-support-matrix.md#image-captioning-2605).

For natural-language infographic descriptions, optionally enable [image captioning](#image-captioning).
For natural-language infographic descriptions, optionally enable [image captioning](#image-captioning) and set `caption_infographics=True` when you need VLM captions on infographic regions.

**Related**

Expand All @@ -62,7 +63,7 @@ For natural-language infographic descriptions, optionally enable [image captioni

Scanned PDFs and image-only pages rely on OCR and hybrid paths that combine native text extraction with OCR when needed. For extract methods such as `ocr` and `pdfium_hybrid`, refer to the [Python API reference](nemo-retriever-api-reference.md).

The default OCR engine is **Nemotron OCR v2**. When you run extraction **locally with HuggingFace models**, v2 operates in **multilingual** mode by default. For CLI flags and API parameters, see [Nemotron OCR v2 — language mode](https://github.com/NVIDIA/NeMo-Retriever/blob/main/nemo_retriever/docs/cli/README.md#nemotron-ocr-v2-language-mode). For Kubernetes installs, see [Nemotron OCR v2 — language mode](prerequisites-support-matrix.md#nemotron-ocr-v2-language-mode) in the support matrix.
OCR artifacts depend on how you deploy. **Helm / NIM:** the production chart uses **Nemotron OCR v1** (`nvcr.io/nim/nvidia/nemotron-ocr-v1:1.3.0`). **Local Hugging Face inference:** the default engine is **Nemotron OCR v2**, which operates in **multilingual** mode by default. For CLI flags and API parameters, see [Nemotron OCR v2 — language mode](https://github.com/NVIDIA/NeMo-Retriever/blob/main/nemo_retriever/docs/cli/README.md#nemotron-ocr-v2-language-mode). For Kubernetes defaults and the Helm-vs-local split, see [OCR artifacts (Helm vs local Hugging Face)](prerequisites-support-matrix.md#nemotron-ocr-v2-language-mode) in the support matrix.

**Related**

Expand Down
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
# Notebooks for NeMo Retriever Library

To get started using [NeMo Retriever Library](overview.md), you can try one of the ready-made notebooks that are available.
To get started using [NeMo Retriever Library](../overview.md), you can try one of the ready-made notebooks that are available.
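
The `../` prefix follows from how relative links resolve against the page that contains them. The sketch below assumes the page now lives at `extraction/notebooks/index.md`, which is inferred from the `notebooks/index.md` link updates elsewhere in this diff rather than stated in a file header; `resolve_link` is a hypothetical helper.

```python
import posixpath


def resolve_link(page: str, link: str) -> str:
    """Resolve a relative Markdown link against the page that contains it."""
    return posixpath.normpath(posixpath.join(posixpath.dirname(page), link))


if __name__ == "__main__":
    page = "extraction/notebooks/index.md"  # assumed new location of this page
    print(resolve_link(page, "overview.md"))     # old link, resolved from the new location
    print(resolve_link(page, "../overview.md"))  # updated link
```

Under that assumption the old `overview.md` link would resolve inside `notebooks/`, while `../overview.md` resolves to the sibling page in `extraction/`. Any other relative links kept on the moved page would need the same adjustment.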

## Dataset Downloads for Benchmarking

Expand All @@ -23,11 +23,3 @@ For more advanced scenarios, try one of the following notebooks:
- [Evaluate bo767 retrieval recall accuracy with NeMo Retriever Library](https://github.com/NVIDIA/NeMo-Retriever/blob/main/evaluation/bo767_recall.ipynb)
- [Multimodal RAG with LangChain](https://github.com/NVIDIA/NeMo-Retriever/blob/main/examples/langchain_multimodal_rag.ipynb)
- [Multimodal RAG with LlamaIndex](https://github.com/NVIDIA/NeMo-Retriever/blob/main/examples/llama_index_multimodal_rag.ipynb)



## Related Topics

- [Pre-Requisites & Support Matrix](prerequisites-support-matrix.md)
- [Deployment options](deployment-options.md)
- [Deploy with Helm](https://github.com/NVIDIA/NeMo-Retriever/blob/main/nemo_retriever/helm/README.md)
3 changes: 2 additions & 1 deletion docs/docs/extraction/overview.md
Original file line number Diff line number Diff line change
Expand Up @@ -15,6 +15,7 @@ NeMo Retriever Library does the following:

- Accept directories of input files and a series of configurable ingestion tasks to perform on that input
- Allow the extracted content be retrieved from a VDB containing discrete metadata element
- Support multiple extraction methods per document type—for example, PDFs can use **pdfium** or [Nemotron Parse](https://build.nvidia.com/nvidia/nemotron-parse) as an alternate method (`extract_method="nemotron_parse"`)
- Support various types of pre- and post- processing operations, including text splitting and chunking, transform and filtering, embedding generation, and image offloading to storage.

!!! note
Expand Down Expand Up @@ -49,5 +50,5 @@ NeMo Retriever Library supports the following file types:
- [Deploy on Kubernetes with Helm](https://github.com/NVIDIA/NeMo-Retriever/blob/main/nemo_retriever/helm/README.md)
- [NeMo Retriever Library — prerequisites / deployment](https://docs.nvidia.com/nemo/retriever/latest/extraction/overview/) (supported Helm charts)
- [Docker Compose (unsupported, developer)](https://github.com/NVIDIA/NeMo-Retriever/blob/main/nemo_retriever/docker.md)
- [Notebooks](notebooks.md)
- [Notebooks](notebooks/index.md)
- [NVIDIA AI Blueprints catalog](https://build.nvidia.com/explore/discover) — solution cards, enterprise RAG blueprints, and end-to-end patterns (including [Enterprise RAG — multimodal PDF data extraction](https://build.nvidia.com/nvidia/multimodal-pdf-data-extraction-for-enterprise-rag)); for integration pathways, refer to [Integrations](integrations-langchain-llamaindex-haystack.md).