Merged
2 changes: 1 addition & 1 deletion docs/accuracy_perf.md
@@ -37,7 +37,7 @@ Change the setting if you want different behavior.
| `ENABLE_REFLECTION` | `false` | Set to `true` to enable self-reflection. For details, refer to [Self-Reflection Support](self-reflection.md). | - Can improve the response quality by refining intermediate retrieval and final LLM output. <br/> | - Significantly higher latency due to multiple iterations of LLM calls. <br/> - You might need to deploy a separate judge LLM, increasing GPU requirements. <br/> |
| `ENABLE_RERANKER` | `true` | Set to `true` to use the reranking model. | - Improves accuracy by selecting better documents for response generation. <br/> | - Increases latency due to additional processing. <br/> - Additional hardware requirements for self-hosted, on-premises deployments. <br/> |
| `ENABLE_VLM_INFERENCE` | `false` | Set to `true` to use the Vision-Language Model (VLM) for response generation. For details, refer to [VLM for Generation](vlm.md). | - Enables analysis of retrieved images alongside text for richer, multimodal responses. <br/> - Can process up to 4 images per citation. <br/> - Useful for document Q&A, visual search, and multimodal chatbots. <br/> | - Requires additional GPU resources for VLM model deployment. <br/> - Increases latency due to image processing. <br/> |
- | `LLM_ENABLE_THINKING` | `false` | Set to `true` to enable reasoning for Nemotron 3 models. Use `LLM_REASONING_BUDGET` and `LLM_LOW_EFFORT` for fine-grained control. For Nemotron 1.5 models, use the `/think` system prompt instead. For details, refer to [Enable Reasoning](enable-nemotron-thinking.md). | - Improves response quality through enhanced reasoning capabilities. <br/> - Yields more precise responses. <br/> | - Can increase response latency due to additional thinking process. <br/> - Can increase token usage and computational overhead. <br/> |
+ | `LLM_ENABLE_THINKING` <br/> `LLM_REASONING_BUDGET` <br/> `LLM_LOW_EFFORT` | `true` <br/> `256` <br/> `true` | The v2.6.0 deployment defaults enable low-effort reasoning for Nemotron 3 Super. Set `LLM_ENABLE_THINKING=false` to disable reasoning, or tune the budget and effort mode for latency and accuracy. For Nemotron 1.5 models, use the `/think` system prompt instead. For details, refer to [Enable Reasoning](enable-nemotron-thinking.md). | - Improves response quality through enhanced reasoning capabilities. <br/> - Yields more precise responses. <br/> | - Can increase response latency due to additional thinking process. <br/> - Can increase token usage and computational overhead. <br/> |
| `RERANKER_SCORE_THRESHOLD` | `0.0` | Filters out retrieved chunks if reranker relevance is lower than this threshold. We recommend that you set this value between `0.3` and `0.5` to balance quality and coverage. For details, refer to [Use the Python Package](python-client.md). | - Faster retrieval by processing fewer documents. <br/> - Can improve accuracy by excluding low-relevance documents. <br/> | - Requires `ENABLE_RERANKER` set to `true` for effective filtering. <br/> - Might filter out too many chunks if the threshold is set high, causing no response from the RAG server. <br/> |
| `RERANKER TOP K` | 10 | Increase `reranker TOP K` to increase the probability of relevant context being part of the top-k contexts. | Increasing the value can improve accuracy. | Increasing the value can increase latency. |
| `VDB TOP K` | 100 | Increase `VDB TOP K` to provide a larger candidate pool for reranking. | Increasing the value can improve accuracy. | Increasing the value can increase latency. |
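
As a rough illustration of what `RERANKER_SCORE_THRESHOLD` does, the sketch below keeps only the chunks whose reranker score is at or above the threshold. It is not the RAG server's implementation: the `filter_by_score` helper, the `score label` line format, and the sample scores are all assumptions made up for this example.

```bash
# Hypothetical helper, not part of the blueprint: reads "score label" lines on
# stdin and prints only those whose score is at or above the given threshold.
filter_by_score() {
  awk -v threshold="$1" '($1 + 0) >= (threshold + 0) { print }'
}

# Made-up reranker scores for three retrieved chunks; 0.3 is the low end of
# the recommended range in the table above.
printf '0.91 chunk-a\n0.12 chunk-b\n0.35 chunk-c\n' | filter_by_score 0.3
```

With a threshold of `0.3`, only `chunk-b` is dropped. Raising the threshold above every score would leave no chunks at all, which corresponds to the "no response" risk noted in the table.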
9 changes: 5 additions & 4 deletions docs/deploy-docker-self-hosted.md
@@ -112,7 +112,7 @@ Use the following procedure to start all containers needed for this blueprint.
USERID=$(id -u) docker compose -f deploy/compose/nims.yaml up -d
```

- 5. Check the status of the deployment by running the following code. Wait until all services are up and the `nemotron-ranking-ms`, `nemotron-embedding-ms` and `nim-llm-ms` NIMs are in a healthy state before proceeding further.
+ 5. Check the status of the deployment by running the following code. Wait until all services are up and the `nemotron-ranking-ms`, `nemotron-vlm-embedding-ms`, and `nim-llm-ms` NIMs are in a healthy state before proceeding further.

```bash
watch -n 2 'docker ps --format "table {{.Names}}\t{{.Status}}"'
@@ -126,7 +126,7 @@ Use the following procedure to start all containers needed for this blueprint.
nemotron-ranking-ms Up 4 minutes (healthy)
compose-graphic-elements-1 Up 4 minutes
compose-page-elements-1 Up 4 minutes
- nemotron-embedding-ms Up 4 minutes (healthy)
+ nemotron-vlm-embedding-ms Up 4 minutes (healthy)
compose-nemotron-ocr-1 Up 4 minutes
compose-table-structure-1 Up 4 minutes
```
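
The manual `watch` check above could also be scripted. The following is a minimal sketch, assuming the three NIM container names from the step above and that each one reports `(healthy)` in its `docker ps` status. The `all_nims_healthy` helper and the `REQUIRED_NIMS` variable are hypothetical names introduced for illustration, not part of the blueprint.

```bash
# Containers that must report "(healthy)" before proceeding.
REQUIRED_NIMS="nemotron-ranking-ms nemotron-vlm-embedding-ms nim-llm-ms"

# Hypothetical helper: reads "name status" lines on stdin, as printed by
# `docker ps --format '{{.Names}} {{.Status}}'`, and returns 0 only when every
# required container has a status containing "(healthy)".
all_nims_healthy() {
  local ps_output name
  ps_output="$(cat)"
  for name in $REQUIRED_NIMS; do
    if ! printf '%s\n' "$ps_output" | grep -Eq "^${name}[[:space:]].*\(healthy\)"; then
      echo "still waiting for ${name}" >&2
      return 1
    fi
  done
  return 0
}
```

One way to use it is a polling loop such as `until docker ps --format '{{.Names}} {{.Status}}' | all_nims_healthy; do sleep 5; done`. Because the helper only parses text, it can be tried against saved `docker ps` output without a running deployment.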
@@ -259,7 +259,7 @@ Use the following procedure to start all containers needed for this blueprint.
fe2751bfa734 nemotron-ranking-ms Up 10 minutes (healthy)
7b5ddabf8be7 compose-graphic-elements-1 Up 10 minutes
ecfaa5190302 compose-page-elements-1 Up 10 minutes
- ea8c7fdf20d1 nemotron-embedding-ms Up 10 minutes (healthy)
+ ea8c7fdf20d1 nemotron-vlm-embedding-ms Up 10 minutes (healthy)
6d62008a9b42 compose-nemotron-ocr-1 Up 10 minutes
969b9f5c987c compose-table-structure-1 Up 10 minutes
```
@@ -340,10 +340,11 @@ By default, Elasticsearch is deployed as the vector database (`vectordb.yaml` wi

- For advanced users who need direct filesystem access to extraction results, refer to [Ingestor Server Volume Mounting](mount-ingestor-volume.md).

- - A single NVIDIA A100-80GB, H100-80GB, or B200 GPU can be used to start non-LLM NIMs (nemotron-embedding-ms, nemotron-ranking-ms, and ingestion services like page-elements, ocr, graphic-elements, and table-structure) for ingestion and RAG workflows. You can control which GPU is used for each service by setting these environment variables in the `deploy/compose/.env` file before launching. For a complete list of all services and their default GPU assignments, see [Service Port and GPU Reference](service-port-gpu-reference.md).
+ - A single NVIDIA A100-80GB, H100-80GB, or B200 GPU can be used to start non-LLM NIMs (nemotron-vlm-embedding-ms, nemotron-ranking-ms, and ingestion services like page-elements, ocr, graphic-elements, and table-structure) for ingestion and RAG workflows. You can control which GPU is used for each service by setting these environment variables in the `deploy/compose/.env` file before launching. For a complete list of all services and their default GPU assignments, see [Service Port and GPU Reference](service-port-gpu-reference.md).

```bash
EMBEDDING_MS_GPU_ID=0
+ VLM_EMBEDDING_MS_GPU_ID=0
RANKING_MS_GPU_ID=0
YOLOX_MS_GPU_ID=0
YOLOX_GRAPHICS_MS_GPU_ID=0
6 changes: 3 additions & 3 deletions docs/enable-nemotron-thinking.md
@@ -30,13 +30,13 @@ Nemotron 3 models (such as `nvidia/nemotron-3-nano-30b-a3b`) use environment var
Set the following environment variables on the RAG server container (via Docker Compose, Helm values, or shell export):

**`LLM_ENABLE_THINKING`**
- : Enable or disable the reasoning phase. When `true`, the model emits reasoning tokens before the final answer. Default: `false`.
+ : Enable or disable the reasoning phase. When `true`, the model emits reasoning tokens before the final answer. Default: `true` in the v2.6.0 deployment files for Nemotron 3 Super.

**`LLM_REASONING_BUDGET`**
- : Maximum number of tokens allocated for reasoning. Only used when `LLM_ENABLE_THINKING` is `true`. Default: `0`.
+ : Maximum number of tokens allocated for reasoning. Only used when `LLM_ENABLE_THINKING` is `true`. Default: `256` in the v2.6.0 deployment files.

**`LLM_LOW_EFFORT`**
- : Low-effort reasoning mode for faster, cheaper responses with shorter reasoning. Only used when `LLM_ENABLE_THINKING` is `true`. Default: `false`.
+ : Low-effort reasoning mode for faster, cheaper responses with shorter reasoning. Only used when `LLM_ENABLE_THINKING` is `true`. Default: `true` in the v2.6.0 deployment files.

**`FILTER_THINK_TOKENS`**
: Filter reasoning out of the user-facing `content` stream. Reasoning emitted
6 changes: 3 additions & 3 deletions docs/index.md
@@ -87,7 +87,7 @@ After you deploy the RAG blueprint, you can customize it for your use cases.
- [Continuous Ingestion from Object Storage](continuous-ingestion-object-storage.md)
- [Custom Metadata Support](custom-metadata.md)
- [File System Access to Extraction Results](mount-ingestor-volume.md)
- - [Multimodal Retriever — VLM Embedding & VLM Reranker (Early Access)](multimodal-retriever.md)
+ - [Multimodal Retriever — VLM Embedding & VLM Reranker](multimodal-retriever.md)
- [OCR Configuration Guide](nemoretriever-ocr.md)
- [Enhanced PDF Extraction](nemotron-parse-extraction.md)
- [Text-Only Ingestion](text_only_ingest.md)
@@ -225,7 +225,7 @@ After you deploy the RAG blueprint, you can customize it for your use cases.
Custom metadata Support <custom-metadata.md>
Data Catalog for Collections and Documents <data-catalog.md>
File System Access to Results <mount-ingestor-volume.md>
- Multimodal Retriever — VLM Embedding & VLM Reranker (Early Access) <multimodal-retriever.md>
+ Multimodal Retriever — VLM Embedding & VLM Reranker <multimodal-retriever.md>
OCR Configuration Guide <nemoretriever-ocr.md>
Enhanced PDF Extraction <nemotron-parse-extraction.md>
Standalone NeMo Retriever Library <nv-ingest-standalone.md>
@@ -268,7 +268,7 @@ After you deploy the RAG blueprint, you can customize it for your use cases.

Evaluate Your RAG System <evaluate.md>
RAG Accuracy Benchmarks <accuracy-benchmarks.md>
- RAG Performance Benchmarks <performance-benchmarking.md>
+ Benchmark RAG Performance <performance-benchmarking.md>
```


21 changes: 21 additions & 0 deletions docs/migration_guide.md
@@ -8,6 +8,27 @@ This documentation contains the information to upgrade [NVIDIA RAG Blueprint](re

:::{tip}
To navigate this page more easily, click the outline button at the top of the page. [outline-button](assets/outline-button.png)
:::


## Migration Guide: v2.5.1 to v2.6.0

This guide summarizes the default changes and new capabilities introduced in [NVIDIA RAG Blueprint](readme.md) v2.6.0. Review these items before upgrading an existing deployment.

### Default Deployment Changes

- **Vector database:** Elasticsearch is now the default vector database. Milvus remains available as an optional backend. If you need to keep using Milvus, set `APP_VECTORSTORE_NAME=milvus`, point `APP_VECTORSTORE_URL` to Milvus in both RAG and ingestor services, and follow [Vector Database Configuration](change-vectordb.md).
- **Object store:** SeaweedFS is now the default S3-compatible object store. Docker Compose deployments use named `rag-vol-*` volumes for persistent data. If you are upgrading from a deployment that used host-mounted data under `deploy/compose/volumes/`, follow [Manage Persistent Data Volumes](troubleshooting.md#manage-persistent-data-volumes).
- **LLM:** The default LLM is now `nvidia/nemotron-3-super-120b-a12b`. The v2.6.0 deployment files enable low-effort reasoning by default with `LLM_ENABLE_THINKING=true`, `LLM_REASONING_BUDGET=256`, and `LLM_LOW_EFFORT=true`. For latency-sensitive deployments, see [Enable Reasoning](enable-nemotron-thinking.md) for how to disable or tune reasoning.
- **Embedding model:** The default embedding model is now `nvidia/llama-nemotron-embed-vl-1b-v2`. The text-only `nvidia/llama-nemotron-embed-1b-v2` model remains available as an optional configuration. If you switch embedding models or dimensions, re-ingest your documents so the stored vectors match the retrieval embedder.
- **OCR naming:** OCR endpoint names now use `nemotron-ocr-v1` instead of `nemoretriever-ocr-v1`.
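
The environment variables named in the list above could be set along these lines before launching. This is only a sketch under stated assumptions: the Milvus URL is a placeholder rather than a value from this guide, and the variables are shown as shell exports, although your deployment may pass them through Docker Compose or Helm values instead.

```bash
# Keep Milvus instead of the new Elasticsearch default. Per the guide, apply
# the same values to both the RAG and the ingestor services.
export APP_VECTORSTORE_NAME=milvus
# Assumption: placeholder endpoint; replace it with your own Milvus URL.
export APP_VECTORSTORE_URL="http://milvus:19530"

# Reasoning settings that the v2.6.0 deployment files enable by default.
export LLM_ENABLE_THINKING=true
export LLM_REASONING_BUDGET=256
export LLM_LOW_EFFORT=true

# For latency-sensitive deployments, disable reasoning instead:
# export LLM_ENABLE_THINKING=false
```

If you change the vector database or the embedding model, the re-ingestion note in the list above still applies; the exports alone do not migrate existing vectors.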

### New Optional Features

- **Agentic RAG:** v2.6.0 adds an Agentic RAG plan-and-execute pipeline. It is disabled by default and can be enabled per request with the `agentic` field or by setting `ENABLE_AGENTIC_RAG=true`. For details, see [Agentic RAG](agentic-rag.md).
- **VLM reranker:** `nvidia/llama-nemotron-rerank-vl-1b-v2` is available as an opt-in reranker for image-heavy corpora. For details, see [Change the LLM or Embedding Model](change-model.md#switch-to-the-vlm-reranker).
- **OpenShift Helm deployment:** Red Hat OpenShift and OKD deployment is now documented for Helm. For details, see [Deploy on OpenShift with Helm](deploy-helm-openshift.md).
- **Evaluation and performance tooling:** v2.6.0 adds the filesystem evaluation CLI under `scripts/eval/` and the `rag-perf` performance benchmarking CLI under `scripts/rag-perf/`. For details, see [Evaluate Your NVIDIA RAG Blueprint System](evaluate.md) and [Benchmark the Performance of Your RAG System](performance-benchmarking.md).


## Migration Guide: v2.2.0 to v2.3.0
16 changes: 6 additions & 10 deletions docs/multimodal-retriever.md
@@ -6,7 +6,7 @@

The multimodal retriever has two independently switchable components that together let the [NVIDIA RAG Blueprint](readme.md) embed and re-rank documents with awareness of their visual content rather than text alone:

- 1. **VLM Embedding for Ingestion** — replaces the text embedder with `nvidia/llama-nemotron-embed-vl-1b-v2` so PDF pages, tables, charts, and image elements are embedded by a multimodal model.
+ 1. **VLM Embedding for Ingestion** — uses the default `nvidia/llama-nemotron-embed-vl-1b-v2` embedder so text passages, PDF pages, tables, charts, and image elements can be embedded by a multimodal model.
2. **VLM Reranker** — replaces the text reranker with `nvidia/llama-nemotron-rerank-vl-1b-v2` so retrieved passages are scored using both their text and the cited images.

Both components plug into the same retrieval pipeline and can be enabled independently or together. Pair them with [VLM-based generation](vlm.md) for a fully multimodal RAG pipeline; see [Enabling Full VLM Multimodal RAG Pipeline](vlm.md#enabling-full-vlm-multimodal-rag-pipeline) for the end-to-end picture, and [Multimodal Query Support](multimodal-query.md) for the user-facing image+text query flow.
@@ -15,9 +15,9 @@ Requirements: an NVIDIA GPU per enabled component (H100/A100 recommended) and a

---

- # Part 1 — VLM Embedding for Ingestion (Early Access)
+ # Part 1 — VLM Embedding for Ingestion

- This part shows how to enable and use the multimodal embedding model `nvidia/llama-nemotron-embed-vl-1b-v2` in the ingestion pipeline.
+ The multimodal embedding model `nvidia/llama-nemotron-embed-vl-1b-v2` is the default embedding model in v2.6.0. The setup steps in this section are useful when you need to start only the VLM embedding service, confirm the active endpoint, switch back from the optional text-only embedder, or enable image-modality ingestion.

In this section you do the following:

@@ -26,17 +26,13 @@ In this section you do the following:
- Point the ingestor to the VLM embedding service and model

:::{note}
- **Early Access**: Currently, `nvidia/llama-nemotron-embed-vl-1b-v2` is in early access preview.
- :::
-
- :::{note}
- **PDF Support Only**: The VLM embedding feature is currently only supported for PDF documents. Other document formats (Word, PowerPoint, etc.) are not supported with VLM embedding.
+ **Image-modality PDF support:** The default v2.6.0 configuration uses the VLM embedding service while keeping extracted text, tables, and charts in text modality. Advanced image-modality ingestion, such as embedding structured elements or whole pages as images, is currently supported for PDF workflows.
:::

## Limitations

- - The VLM embedding feature is experimental and responses may not be accurate.
- - Summary generation doesn't work when this feature is enabled.
+ - Advanced image-modality ingestion is experimental and responses may not be accurate.
+ - Summary generation does not work with image-modality ingestion configurations such as whole-page image extraction.

## 1. Start the VLM Embedding NIM locally

3 changes: 2 additions & 1 deletion docs/readme.md
@@ -87,7 +87,7 @@ After you deploy the RAG blueprint, you can customize it for your use cases.
- [Audio Ingestion Support](audio_ingestion.md)
- [Custom Metadata Support](custom-metadata.md)
- [File System Access to Extraction Results](mount-ingestor-volume.md)
- - [Multimodal Retriever — VLM Embedding & VLM Reranker (Early Access)](multimodal-retriever.md)
+ - [Multimodal Retriever — VLM Embedding & VLM Reranker](multimodal-retriever.md)
- [OCR Configuration Guide](nemoretriever-ocr.md)
- [Enhanced PDF Extraction](nemotron-parse-extraction.md)
- [Text-Only Ingestion](text_only_ingest.md)
@@ -116,6 +116,7 @@ After you deploy the RAG blueprint, you can customize it for your use cases.

- [Evaluate Your NVIDIA RAG Blueprint System](evaluate.md)
- [RAG Accuracy Benchmarks](accuracy-benchmarks.md)
+ - [Benchmark the Performance of Your RAG System](performance-benchmarking.md)


- Governance
2 changes: 1 addition & 1 deletion docs/release-notes.md
@@ -8,7 +8,7 @@ This documentation contains the release notes for [NVIDIA RAG Blueprint](readme.



- ## Release 2.6.0 (TBD)
+ ## Release 2.6.0 (2026-05-30)

This release adds [Agentic RAG](./agentic-rag.md) support with plan-and-execute pipelines, streaming responses, and UI integration; changes the default vector database to Elasticsearch and the default object store to SeaweedFS; adds [Red Hat OpenShift](./deploy-helm-openshift.md) support for Helm-based deployment; and introduces new [agent skills](../skill-source/README.md) for deployment, evaluation, and performance tooling.

15 changes: 9 additions & 6 deletions docs/support-matrix.md
@@ -59,12 +59,14 @@ You can also modify the RAG Blueprint to use [NVIDIA-hosted](deploy-docker-nvidi

## Hardware Requirements (Kubernetes)

- To install the RAG Blueprint on Kubernetes, you need one of the following:
+ To install the default RAG Blueprint Helm chart on Kubernetes, you need one of the following:

- - 9 x H100-80GB
- - 9 x B200
- - 9 x RTX PRO 6000
- - 3 x H100 (with [Multi-Instance GPU](./mig-deployment.md))
+ - 8 x H100-80GB
+ - 8 x B200
+ - 8 x RTX PRO 6000
+ - 5 x H100-80GB (with [Multi-Instance GPU](./mig-deployment.md))

+ Optional GPU-backed services increase the requirement. Plan for one additional GPU for each optional service that you enable, such as VLM generation, VLM captioning, VLM reranking, Nemotron Parse, or audio processing, unless you use MIG slicing or another explicit sharing strategy.
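
The planning rule above amounts to simple arithmetic, sketched here for the non-MIG case. The base of 8 GPUs comes from the updated list; the `gpus_needed` helper and the service labels passed to it are hypothetical names used only for illustration, and the sketch assumes no MIG slicing or other GPU sharing.

```bash
# Base GPU count for the default Helm chart without MIG, from the list above.
BASE_GPUS=8

# Hypothetical helper: prints the base count plus one GPU for each optional
# service name passed as an argument.
gpus_needed() {
  echo $(( BASE_GPUS + $# ))
}

# Example: enabling three optional services (labels are illustrative only).
gpus_needed vlm-generation vlm-reranking nemotron-parse
```

With three optional services enabled, the estimate is 8 + 3 = 11 GPUs; with none, it stays at 8.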



@@ -74,8 +76,9 @@ The following are requirements and recommendations for the individual components

- **Pipeline operation** – 1x L40 GPU or similar recommended. This is required if you use Milvus (optional) as the vector database with GPU acceleration. The default Elasticsearch VDB does not require a GPU. If you change the vector backend or enable optional GPU acceleration for Elasticsearch vector indexing, refer to [Elasticsearch Configuration](elasticsearch-configuration.md) and confirm GPU requirements for that configuration.
- **LLM NIM (nemotron-3-super-120b-a12b)** – Refer to the [Support Matrix](https://docs.nvidia.com/nim/large-language-models/latest/supported-models.html).
- - **Embedding NIM (llama-nemotron-embed-vl-1b-v2)** – Refer to the embedding model support matrix for your deployment target.
+ - **Embedding NIM (llama-nemotron-embed-vl-1b-v2)** – Refer to the [Support Matrix](https://docs.nvidia.com/nim/nemo-retriever/text-embedding/latest/support-matrix.html) for your deployment target.
- **Reranking NIM (llama-nemotron-rerank-1b-v2)**: Refer to the [Support Matrix](https://docs.nvidia.com/nim/nemo-retriever/text-reranking/latest/support-matrix.html).
+ - **VLM Reranking NIM (llama-nemotron-rerank-vl-1b-v2, optional)**: Refer to the [Support Matrix](https://docs.nvidia.com/nim/nemo-retriever/text-reranking/latest/support-matrix.html).
- **Nemotron OCR (Default)**: Refer to the [Support Matrix](https://docs.nvidia.com/nim/ingestion/image-ocr/1.3.0/support-matrix.html).
- **NVIDIA NIMs for Object Detection**:
- Nemotron Page Elements v3 [Support Matrix](https://docs.nvidia.com/nim/ingestion/object-detection/latest/support-matrix.html#nemo-retriever-page-elements-v3)