11 changes: 6 additions & 5 deletions docs/docs/extraction/audio-video.md
Original file line number Diff line number Diff line change
Expand Up @@ -63,7 +63,7 @@ This pipeline enables retrieval at the speech segment level when you enable segm

Use the following procedure to run the NIM on your own infrastructure. Self-hosted Parakeet runs on Kubernetes via the [NeMo Retriever Helm chart](https://github.com/NVIDIA/NeMo-Retriever/blob/main/nemo_retriever/helm/README.md). Enable the ASR NIM per [Optional Helm NIMs](prerequisites-support-matrix.md#optional-helm-nims-not-auto-wired-by-default) and the [Helm chart — NIM operator sub-stack](https://github.com/NVIDIA/NeMo-Retriever/blob/main/nemo_retriever/helm/README.md#nim-operator-sub-stack); pin the workload to a dedicated GPU and wire the ASR endpoint in your pipeline.

After deploy, call the pipeline from Python:
!!! important

Pin the Parakeet workload to the dedicated GPU with your Helm values or the [NIM Operator](https://docs.nvidia.com/nim-operator/latest/index.html) (for example, node selectors, resource limits, or device requests appropriate to your cluster).

Expand All @@ -87,14 +87,15 @@ After deploy, call the pipeline from Python:
asr_params=ASRParams(segment_audio=True),
)
)
)
```
```

To generate one extracted element for each sentence-like ASR segment, include `extract_audio_params={"segment_audio": True}` when calling `.extract(...)`. This option applies when audio extraction runs with a self-hosted Parakeet NIM or using build.nvidia.com hosted inference, but has no effect when using the local Hugging Face Parakeet model.

To generate one extracted element for each sentence-like ASR segment, pass `asr_params=ASRParams(segment_audio=True)` to `.extract_audio(...)`. This option applies when audio extraction runs with a self-hosted Parakeet NIM or using build.nvidia.com hosted inference, but has no effect when using the local Hugging Face Parakeet model.

For more Python examples, refer to [Python Quick Start Guide](https://github.com/NVIDIA/NeMo-Retriever/blob/main/client/client_examples/examples/python_client_usage.ipynb).

!!! tip

For more Python examples, refer to [Python Quick Start Guide](https://github.com/NVIDIA/NeMo-Retriever/blob/main/client/client_examples/examples/python_client_usage.ipynb).

## Parakeet with hosted inference (build.nvidia.com) { #parakeet-hosted-inference-build-nvidia }

2 changes: 1 addition & 1 deletion docs/docs/extraction/concepts.md
Original file line number Diff line number Diff line change
Expand Up @@ -36,6 +36,6 @@ Token-based splitting uses the Llama 3.2 1B tokenizer (default `meta-llama/Llama

- **Library mode** — Run without the full container stack where appropriate; see [Deployment options](deployment-options.md).
- **Kubernetes / Helm (self-hosted)** — See [Deploy (Helm chart)](https://github.com/NVIDIA/NeMo-Retriever/blob/main/nemo_retriever/helm/README.md) and [deployment options](deployment-options.md) for running the full microservices pipeline on your infrastructure.
- **Notebooks** — [Jupyter examples](notebooks/index.md) for experimentation and RAG demos.
- **Notebooks** — [Jupyter examples](notebooks.md) for experimentation and RAG demos.

For a concise comparison, refer to [Deployment options](deployment-options.md).
80 changes: 55 additions & 25 deletions docs/docs/extraction/custom-metadata.md
Original file line number Diff line number Diff line change
@@ -1,42 +1,74 @@
# Custom metadata and filtering
# Use Custom Metadata to Filter Search Results

Use this documentation to attach per-document metadata during ingestion and to narrow [LanceDB](vdbs.md) search results in [NeMo Retriever Library](overview.md). Implementation details live in the package [Vector DB operators and LanceDB](https://github.com/NVIDIA/NeMo-Retriever/tree/main/nemo_retriever/src/nemo_retriever/vdb#metadata-filtering) README.
You can upload custom metadata for documents during ingestion.
By uploading custom metadata you can attach additional information to documents,
and use it for filtering results during retrieval operations.
For example, you can add author metadata to your documents, and filter by author when you retrieve results.
To create filters at query time, use predicates supported by [LanceDB SQL](https://lancedb.github.io/lancedb/sql/) against your table schema (custom fields are serialized into the `metadata` column with your ingested chunks). For a worked example, see the repository notebook linked at the end of this page.
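As a rough illustration of what filtering on custom fields means, the following sketch filters a list of hits on the client side after retrieval. The hit shape (a dict with a `metadata` entry that is either a dict or a JSON string) and the helper name `filter_hits` are assumptions made for this example; they are not taken from the library, so inspect your own hits before relying on this layout.

```python
import json


def filter_hits(hits, **expected):
    """Hypothetical helper: keep hits whose metadata matches every key/value pair."""
    kept = []
    for hit in hits:
        meta = hit.get("metadata") or {}
        # Assumption: serialized metadata arrives as a JSON string.
        if isinstance(meta, str):
            meta = json.loads(meta)
        if all(meta.get(key) == value for key, value in expected.items()):
            kept.append(hit)
    return kept


# Made-up hits, used only to exercise the helper.
sample_hits = [
    {"text": "chunk one", "metadata": json.dumps({"category": "Alpha"})},
    {"text": "chunk two", "metadata": {"category": "Bravo"}},
]
alpha_hits = filter_hits(sample_hits, category="Alpha")
```

A server-side `where` predicate avoids fetching rows that are discarded afterwards, so a client-side pass like this is mainly useful for checks that are awkward to express as a predicate.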

## On this page { #on-this-page }
Use this documentation to use custom metadata to filter search results when you work with [NeMo Retriever Library](overview.md).

- [Attach metadata at ingestion](#attach-metadata-at-ingestion)
- [How metadata is stored](#how-metadata-is-stored)
- [Filter results at query time](#filter-results-at-query-time)
- [Writing `where` predicates](#writing-where-predicates)
- [Server-side vs client-side filters](#server-side-vs-client-side-filters)
- [Inspect hit metadata](#inspect-hit-metadata)
- [Limitations](#limitations)
- [Related content](#related-content)

## Attach metadata at ingestion { #attach-metadata-at-ingestion }
## Limitations

Pass a **sidecar metadata table** on `vdb_upload` so selected columns are merged into each chunk's `content_metadata` before LanceDB upload. All three parameters must be set together:
The following are limitations when you use custom metadata:

| Parameter | Purpose |
|-----------|---------|
| `meta_dataframe` | Path to CSV, JSON, or Parquet, or an in-memory `pandas.DataFrame` |
| `meta_source_field` | Column that identifies each document (must match ingest paths or basenames per `meta_join_key`) |
| `meta_fields` | Non-empty list of column names to copy into `content_metadata` |
- Metadata fields must be consistent across documents in the same collection.
- Complex filter expressions may impact retrieval performance.
- If you update your custom metadata, you must ingest your documents again to use the new metadata.

Optional `meta_join_key` controls how rows are matched to documents: `auto` (try full path then basename), `source_id` (full path), or `source_name` (basename only).
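The three matching modes can be pictured with a small pandas sketch. This is not the library's implementation: `match_sidecar_row` is a hypothetical helper, and comparing basenames on both sides for `source_name` is an assumption about how "basename only" behaves.

```python
import os

import pandas as pd


def match_sidecar_row(doc_path, meta_df, source_field, join_key="auto"):
    """Hypothetical helper: return the matching sidecar row as a dict, or None."""
    by_path = meta_df[meta_df[source_field] == doc_path]
    # Assumption: basename matching compares basenames on both sides.
    by_name = meta_df[
        meta_df[source_field].map(os.path.basename) == os.path.basename(doc_path)
    ]
    if join_key == "source_id":
        candidates = by_path
    elif join_key == "source_name":
        candidates = by_name
    elif join_key == "auto":
        # Try the full path first, then fall back to the basename.
        candidates = by_name if by_path.empty else by_path
    else:
        raise ValueError(f"unknown join key: {join_key!r}")
    return None if candidates.empty else candidates.iloc[0].to_dict()


# Made-up sidecar table, used only to exercise the helper.
sidecar = pd.DataFrame(
    {
        "source": ["data/woods_frost.pdf", "multimodal_test.pdf"],
        "meta_a": ["alpha", "bravo"],
    }
)
exact = match_sidecar_row("data/woods_frost.pdf", sidecar, "source")
fallback = match_sidecar_row("/tmp/multimodal_test.pdf", sidecar, "source")
strict = match_sidecar_row("/tmp/multimodal_test.pdf", sidecar, "source", "source_id")
```

In this sketch `auto` resolves the second document through its basename, while `source_id` finds no match for the same path because the full paths differ.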


## Add Custom Metadata During Ingestion

You can add custom metadata during the document ingestion process.
You can specify metadata for each file,
and you can specify different metadata for different documents in the same ingestion batch.


### Metadata Structure

You specify custom metadata as a dataframe or a file (JSON, CSV, or Parquet).

The following example contains metadata fields for category, department, and timestamp.
You can create whatever metadata is helpful for your scenario.

```python
import pandas as pd
from nemo_retriever import create_ingestor

meta_df = pd.DataFrame(
{
"source": ["data/woods_frost.pdf", "data/multimodal_test.pdf"],
"meta_a": ["alpha", "bravo"],
"meta_b": [10, 20],
"category": ["Alpha", "Bravo"],
"department": ["Language", "Engineering"],
"timestamp": ["2025-05-01T00:00:00", "2025-05-02T00:00:00"]
}
)

# Convert the dataframe to a csv file,
# to demonstrate how to ingest a metadata file in a later step.

file_path = "./meta_file.csv"
meta_df.to_csv(file_path)
```
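Before handing a sidecar table to ingestion, it can help to check that it is internally consistent. The following is a minimal sketch of such a pre-flight check, assuming a pandas DataFrame; `check_sidecar` is a hypothetical helper and not part of the library, and the rules it enforces (no duplicate sources, no missing values) are one reasonable reading of "consistent metadata" rather than documented requirements.

```python
import pandas as pd


def check_sidecar(meta_df, source_field, meta_fields):
    """Hypothetical pre-flight check: return a list of problems (empty if none)."""
    problems = []
    if not meta_fields:
        problems.append("meta_fields must be a non-empty list")
    missing = [c for c in [source_field, *meta_fields] if c not in meta_df.columns]
    if missing:
        problems.append(f"missing columns: {missing}")
        return problems  # The remaining checks need these columns.
    if meta_df[source_field].duplicated().any():
        problems.append(f"duplicate values in {source_field!r}")
    incomplete = [c for c in meta_fields if meta_df[c].isna().any()]
    if incomplete:
        problems.append(f"fields with missing values: {incomplete}")
    return problems


# Made-up table, used only to exercise the helper.
example_df = pd.DataFrame(
    {
        "source": ["data/woods_frost.pdf", "data/multimodal_test.pdf"],
        "category": ["Alpha", "Bravo"],
        "department": ["Language", None],
    }
)
ok = check_sidecar(example_df, "source", ["category"])
bad = check_sidecar(example_df, "source", ["category", "department"])
```

Returning a list of problems instead of raising on the first one makes it easier to fix a sidecar file in a single pass.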


### Example: Add Custom Metadata During Ingestion

The following example adds custom metadata during ingestion.
For more information about `create_ingestor` and run modes, refer to [Use the Python API](nemo-retriever-api-reference.md).
For more information about the `vdb_upload` method, refer to [Upload Data](vdbs.md).

```python
from nemo_retriever import create_ingestor

# Service-backed pipeline: point `base_url` at your running retriever service.
# For local graph execution instead, see [Use the Python API](nemo-retriever-api-reference.md).

hostname = "localhost"
table_name = "nemo_retriever_collection"
lancedb_uri = "./lancedb_data"

ingestor = (
create_ingestor(run_mode="service", base_url=f"http://{hostname}:7670")
.files(["data/woods_frost.pdf", "data/multimodal_test.pdf"])
@@ -118,11 +150,9 @@ hits = retriever.query(
)
```

For a runnable end-to-end flow (ingest, `Retriever.query`, and both filter modes), see [nemo_retriever_retriever_query_metadata_filter.ipynb](https://github.com/NVIDIA/NeMo-Retriever/blob/main/examples/nemo_retriever_retriever_query_metadata_filter.ipynb).

When you ingest through the **retriever service**, upload the sidecar with [`POST /v1/ingest/sidecar`](https://github.com/NVIDIA/NeMo-Retriever/blob/main/nemo_retriever/src/nemo_retriever/service/routers/ingest.py#L1040-L1129) (multipart file; response [`SidecarUploadResponse`](https://github.com/NVIDIA/NeMo-Retriever/blob/main/nemo_retriever/src/nemo_retriever/service/models/responses.py#L60-L68)), then pass the returned `sidecar_id` as `meta_dataframe_id` with `meta_source_field` and `meta_fields` in `pipeline.vdb_upload_params` on [`POST /v1/ingest`](https://github.com/NVIDIA/NeMo-Retriever/blob/main/nemo_retriever/src/nemo_retriever/service/models/requests.py#L15-L32) ([`PipelineSpec`](https://github.com/NVIDIA/NeMo-Retriever/blob/main/nemo_retriever/src/nemo_retriever/service/models/pipeline_spec.py#L55-L78)). Request and response shapes, form fields, and auth headers are in the service OpenAPI UI at `/docs` (or `/openapi.json`) on your retriever base URL (for example `http://localhost:7670/docs` after `retriever service start`). Do not send a raw local path as `meta_dataframe` on the service spec.

## How metadata is stored { #how-metadata-is-stored }
## Related Content

- [Vector databases](vdbs.md) — canonical LanceDB upload and retrieval guide
- [metadata_and_filtered_search.ipynb](https://github.com/NVIDIA/NeMo-Retriever/blob/main/examples/metadata_and_filtered_search.ipynb) — CLI and graph ingest with sidecar metadata
2 changes: 1 addition & 1 deletion docs/docs/extraction/deployment-options.md
Original file line number Diff line number Diff line change
Expand Up @@ -32,7 +32,7 @@ environments), use a custom service image that already contains `ffmpeg` and

### I want examples and notebooks

1. [Jupyter Notebooks](notebooks/index.md)
1. [Jupyter Notebooks](notebooks.md)
2. [Integrate with LangChain, LlamaIndex, Haystack](integrations-langchain-llamaindex-haystack.md)

### I need API details and keys
4 changes: 1 addition & 3 deletions docs/docs/extraction/faq.md
Original file line number Diff line number Diff line change
Expand Up @@ -21,11 +21,9 @@ For more information, refer to [Vector databases](vdbs.md).
For images that `nemoretriever-page-elements-v3` does not classify as tables, charts, or infographics,
you can use our VLM caption task to create a dense caption of the detected image.
That caption is then embedded along with the rest of your content.
For chart-labeled PDF regions and other caption scope limits, see [Are PDF chart or figure regions captioned when Omni is enabled?](#are-pdf-chart-or-figure-regions-captioned-when-omni-is-enabled). For more information, refer to [Extract Captions from Images](nemo-retriever-api-reference.md).
For more information, refer to [Extract Captions from Images](nemo-retriever-api-reference.md).

## Are PDF chart or figure regions captioned when Omni is enabled?

No. Chart-labeled PDF regions are not routed through Omni captioning. See [Image captioning](prerequisites-support-matrix.md#image-captioning-2605) for scope, validation, and what the caption stage covers.

## When should I consider advanced visual parsing?

2 changes: 1 addition & 1 deletion docs/docs/extraction/getting-started-about.md
Original file line number Diff line number Diff line change
Expand Up @@ -10,6 +10,6 @@ Typical order:
- [Deployment options](deployment-options.md) for how to run NeMo Retriever Library
- **Supported:** [Helm chart](https://github.com/NVIDIA/NeMo-Retriever/blob/main/nemo_retriever/helm/README.md) for Kubernetes, plus [NeMo Retriever Library install docs](https://docs.nvidia.com/nemo/retriever/latest/extraction/overview/) for the published charts
- **Unsupported (developer-only):** [Docker Compose (local)](https://github.com/NVIDIA/NeMo-Retriever/blob/main/nemo_retriever/docker.md) — not a supported NIM deployment path
4. Explore [Jupyter Notebooks](notebooks/index.md) for end-to-end examples.
4. Explore [Jupyter Notebooks](notebooks.md) for end-to-end examples.

If you are new to the product, read [What is NeMo Retriever Library?](overview.md) and [Concepts](concepts.md) under **Introduction** first.
Original file line number Diff line number Diff line change
Expand Up @@ -9,7 +9,7 @@ The repository includes notebooks that demonstrate multimodal RAG patterns:
- [Multimodal RAG with LangChain](https://github.com/NVIDIA/NeMo-Retriever/blob/main/examples/langchain_multimodal_rag.ipynb)
- [Multimodal RAG with LlamaIndex](https://github.com/NVIDIA/NeMo-Retriever/blob/main/examples/llama_index_multimodal_rag.ipynb)

These are also linked from [Jupyter Notebooks](notebooks/index.md) and the [FAQ](faq.md).
These are also linked from [Jupyter Notebooks](notebooks.md) and the [FAQ](faq.md).

## Haystack

5 changes: 2 additions & 3 deletions docs/docs/extraction/multimodal-extraction.md
Original file line number Diff line number Diff line change
Expand Up @@ -49,9 +49,8 @@ NeMo Retriever Library detects tables as structured page elements, processes the

Charts and infographic regions are classified with other page layout elements (tables, text blocks, titles) and processed through layout detection and OCR. `extract_charts` and `extract_infographics` are enabled by default. Outputs use the same metadata schema as other extracted objects.

Chart-labeled PDF regions are **not** routed through the Omni caption stage; they remain on the layout-and-OCR path. For scope and validation guidance, see [Image captioning](prerequisites-support-matrix.md#image-captioning-2605).

For natural-language infographic descriptions, optionally enable [image captioning](#image-captioning) and set `caption_infographics=True` when you need VLM captions on infographic regions.
For natural-language infographic descriptions, optionally enable [image captioning](#image-captioning).

**Related**

Expand All @@ -63,7 +62,7 @@ For natural-language infographic descriptions, optionally enable [image captioni

Scanned PDFs and image-only pages rely on OCR and hybrid paths that combine native text extraction with OCR when needed. For extract methods such as `ocr` and `pdfium_hybrid`, refer to the [Python API reference](nemo-retriever-api-reference.md).

OCR artifacts depend on how you deploy. **Helm / NIM:** the production chart uses **Nemotron OCR v1** (`nvcr.io/nim/nvidia/nemotron-ocr-v1:1.3.0`). **Local Hugging Face inference:** the default engine is **Nemotron OCR v2**, which operates in **multilingual** mode by default. For CLI flags and API parameters, see [Nemotron OCR v2 — language mode](https://github.com/NVIDIA/NeMo-Retriever/blob/main/nemo_retriever/docs/cli/README.md#nemotron-ocr-v2-language-mode). For Kubernetes defaults and the Helm-vs-local split, see [OCR artifacts (Helm vs local Hugging Face)](prerequisites-support-matrix.md#nemotron-ocr-v2-language-mode) in the support matrix.
The default OCR engine is **Nemotron OCR v2**. When you run extraction **locally with HuggingFace models**, v2 operates in **multilingual** mode by default. For CLI flags and API parameters, see [Nemotron OCR v2 — language mode](https://github.com/NVIDIA/NeMo-Retriever/blob/main/nemo_retriever/docs/cli/README.md#nemotron-ocr-v2-language-mode). For Kubernetes installs, see [Nemotron OCR v2 — language mode](prerequisites-support-matrix.md#nemotron-ocr-v2-language-mode) in the support matrix.

**Related**

Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
# Notebooks for NeMo Retriever Library

To get started using [NeMo Retriever Library](../overview.md), you can try one of the ready-made notebooks that are available.
To get started using [NeMo Retriever Library](overview.md), you can try one of the ready-made notebooks that are available.

## Dataset Downloads for Benchmarking

Expand All @@ -23,3 +23,11 @@ For more advanced scenarios, try one of the following notebooks:
- [Evaluate bo767 retrieval recall accuracy with NeMo Retriever Library](https://github.com/NVIDIA/NeMo-Retriever/blob/main/evaluation/bo767_recall.ipynb)
- [Multimodal RAG with LangChain](https://github.com/NVIDIA/NeMo-Retriever/blob/main/examples/langchain_multimodal_rag.ipynb)
- [Multimodal RAG with LlamaIndex](https://github.com/NVIDIA/NeMo-Retriever/blob/main/examples/llama_index_multimodal_rag.ipynb)



## Related Topics

- [Pre-Requisites & Support Matrix](prerequisites-support-matrix.md)
- [Deployment options](deployment-options.md)
- [Deploy with Helm](https://github.com/NVIDIA/NeMo-Retriever/blob/main/nemo_retriever/helm/README.md)
3 changes: 1 addition & 2 deletions docs/docs/extraction/overview.md
Original file line number Diff line number Diff line change
Expand Up @@ -15,7 +15,6 @@ NeMo Retriever Library does the following:

- Accept directories of input files and a series of configurable ingestion tasks to perform on that input
- Allow the extracted content to be retrieved from a VDB containing discrete metadata elements
- Support multiple extraction methods per document type—for example, PDFs can use **pdfium** or [Nemotron Parse](https://build.nvidia.com/nvidia/nemotron-parse) as an alternate method (`extract_method="nemotron_parse"`)
- Support various types of pre- and post- processing operations, including text splitting and chunking, transform and filtering, embedding generation, and image offloading to storage.

!!! note
Expand Down Expand Up @@ -50,5 +49,5 @@ NeMo Retriever Library supports the following file types:
- [Deploy on Kubernetes with Helm](https://github.com/NVIDIA/NeMo-Retriever/blob/main/nemo_retriever/helm/README.md)
- [NeMo Retriever Library — prerequisites / deployment](https://docs.nvidia.com/nemo/retriever/latest/extraction/overview/) (supported Helm charts)
- [Docker Compose (unsupported, developer)](https://github.com/NVIDIA/NeMo-Retriever/blob/main/nemo_retriever/docker.md)
- [Notebooks](notebooks/index.md)
- [Notebooks](notebooks.md)
- [NVIDIA AI Blueprints catalog](https://build.nvidia.com/explore/discover) — solution cards, enterprise RAG blueprints, and end-to-end patterns (including [Enterprise RAG — multimodal PDF data extraction](https://build.nvidia.com/nvidia/multimodal-pdf-data-extraction-for-enterprise-rag)); for integration pathways, refer to [Integrations](integrations-langchain-llamaindex-haystack.md).