Releases: NVIDIA-AI-Blueprints/rag
v2.5.0
Release 2.5.0 (2026-03-17)
This release introduces support for the Nemotron-super-3 model, updates NIMs to the latest versions, upgrades NV-Ingest, and adds continuous ingestion along with RTX 6000 MIG support.
Highlights
This release includes the following key updates:
- Nemotron-super-3 model support. You can now integrate the Nemotron-super-3 model by following the steps outlined in this document.
- NIMs updated to latest versions.
The following model updates are included:
- `nvidia/llama-3.2-nv-embedqa-1b-v2` → `nvidia/llama-nemotron-embed-1b-v2`
- `nvidia/llama-3.2-nv-rerankqa-1b-v2` → `nvidia/llama-nemotron-rerank-1b-v2`
- `nemoretriever-page-elements-v3` → `nemotron-page-elements-v3`
- `nemoretriever-graphic-elements-v1` → `nemotron-graphic-elements-v1`
- `nemoretriever-table-structure-v1` → `nemotron-table-structure-v1`
- `nvidia/llama-3.2-nemoretriever-1b-vlm-embed-v1` → `nvidia/llama-nemotron-embed-vl-1b-v2`
- Updated NVIngest to version 26.1.2.
- Added an example demonstrating the continuous ingestion pipeline. For more information, see rag_event_ingest.ipynb.
- Added MIG support for RTX 6000. For details, refer to MIG Deployment and use `values-mig-rtx6000.yaml` and `mig-config-rtx6000.yaml`.
- Added documentation for the experimental Nemotron-parse-only ingestion pipeline. This configuration allows you to perform extraction using only Nemotron Parse through NV-Ingest, without relying on the OCR, page-elements, graphic-elements, or table-structure NIMs. For more information, refer to nemotron-parse-extraction.md.
- Several bug fixes, including frontend CVE resolutions, improved multimodal content concatenation for VLM embeddings, enhanced VDB serialization for high-concurrency parallel ingestion, and updates to observability and NeMo Guardrails configurations.
- Added agentic skills support: the `rag-blueprint` skill enables AI coding assistants (Claude Code, Cursor, Codex, and others) to deploy, configure, troubleshoot, and manage the RAG Blueprint autonomously. For details, refer to RAG Blueprint Agent Skill.
- Added accuracy benchmark results across seven public datasets (RagBattlepacket, KG-RAG, Financebench, DC767, HotPotQA, Google Frames, and Vidore), comparing LLM and VLM configurations with reasoning on and off. Benchmarks use the NVIDIA Answer Accuracy metric from RAGAS.
- Added a notebook showcasing the LangChain connector for the NVIDIA RAG Blueprint.
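The NIM renames in this release amount to a simple old-to-new name mapping. The following is a minimal, illustrative migration sketch; the `migrate_model_names` helper and the flat config-dict shape are assumptions for the example, not part of the blueprint's API:

```python
# Old -> new model names, as listed in the 2.5.0 release notes.
MODEL_RENAMES = {
    "nvidia/llama-3.2-nv-embedqa-1b-v2": "nvidia/llama-nemotron-embed-1b-v2",
    "nvidia/llama-3.2-nv-rerankqa-1b-v2": "nvidia/llama-nemotron-rerank-1b-v2",
    "nemoretriever-page-elements-v3": "nemotron-page-elements-v3",
    "nemoretriever-graphic-elements-v1": "nemotron-graphic-elements-v1",
    "nemoretriever-table-structure-v1": "nemotron-table-structure-v1",
    "nvidia/llama-3.2-nemoretriever-1b-vlm-embed-v1": "nvidia/llama-nemotron-embed-vl-1b-v2",
}

def migrate_model_names(config: dict) -> dict:
    """Return a copy of a flat config dict with old model names replaced."""
    return {
        key: MODEL_RENAMES.get(value, value) if isinstance(value, str) else value
        for key, value in config.items()
    }
```

Values that are not renamed model strings pass through unchanged, so the helper is safe to run on a mixed configuration.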
Fixed Known Issues
The following known issues have been resolved in this release:
- Addressed frontend CVEs.
- Resolved VDB indexing issues during high-concurrency batch parallel ingestion by implementing VDB serialization.
v2.4.0
Release 2.4.0 (2026-02-20)
This release adds new features to the RAG pipeline to support agent workflows and enhances generation with VLMs that augment multimodal input.
Highlights
This release contains the following key changes:
- Updated NIMs and code to support NVIDIA Ingest 26.01 release.
- Added support for non-NIM models including OpenAI, models hosted on AWS and Azure, OSS models, and others. Supported through service-specific API keys. For details, refer to Get an API Key.
- The RAG Blueprint now uses nemoretriever-ocr-v1 as the default OCR model. For details, refer to NeMo Retriever OCR Configuration Guide.
- Improved VLM based generation support. The Vision-Language Model (VLM) inference feature now uses the model nemotron-nano-12b-v2-vl. For details, refer to VLM for Generation.
- User interface improvements including catalog display, image and text query, and others. For details, refer to User Interface.
- Added ingestion metrics endpoint support with OpenTelemetry (OTEL) for monitoring document uploads, elements ingested, and pages processed. For details, refer to Observability.
- Added support for image and text as an input query. For details, refer to Multimodal Query Support.
- Added Nemotron-3-Nano model support with a reasoning budget. For details, refer to Enable Reasoning.
- Vector Database enhancements including secure database access. For details, refer to Milvus Configuration and Elasticsearch Configuration.
- You can now access RAG functionality from a Model Context Protocol (MCP) server for tool integration. For details, refer to MCP Server and Client Usage.
- Added OpenAI-compatible search endpoint for integration with OpenAI tools. For details, refer to API - RAG Server Schema.
- Added support for collection-level data catalog, descriptions, and metadata. For details, refer to Data Catalog.
- Enhanced `/status` endpoint publishing ingestion metrics and status information. For details, refer to the ingestion notebook.
- Multi-turn conversation support is no longer the default for either the retrieval or the generation stage in the pipeline. Refer to Multi-Turn Conversation Support for details.
- Improved document processing and element extraction.
- Enhancements to RAG library mode including the following. For details, refer to Use the NVIDIA RAG Blueprint Python Package.
- Independent multi-instance support for the RAG Server and the ingestion server
- Configuration support through function arguments
- Async interface for RAG methods
- Compatibility with the NVIDIA NeMo Agent Toolkit (NAT)
- Summarization enhancements including the following. For details, refer to Document Summarization Customization Guide.
- Shallow summarization support
- Easy model switches and dedicated configurations
- Ease of prompt changes
- Reserved the field names `type`, `subtype`, and `location` for NV-Ingest exclusive use in metadata schemas.
- Added rag_library_lite_usage.ipynb, which demonstrates containerless deployment of the NVIDIA RAG Python package in lite mode.
- Added example showcasing NeMo Agent Toolkit integration with NVIDIA RAG.
- Added weighted hybrid search support with configurable weights.
- RAG server logging improvements
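Because `type`, `subtype`, and `location` are now reserved for NV-Ingest, custom metadata must avoid those keys. A minimal validation sketch, assuming a plain dict of user metadata (the `check_custom_metadata` helper is illustrative, not a blueprint API):

```python
# Field names reserved for NV-Ingest exclusive use, per the 2.4.0 notes.
RESERVED_FIELDS = {"type", "subtype", "location"}

def check_custom_metadata(metadata: dict) -> None:
    """Raise ValueError if user-supplied metadata uses a reserved field name."""
    clashes = RESERVED_FIELDS & set(metadata)
    if clashes:
        raise ValueError(f"Reserved metadata fields used: {sorted(clashes)}")
```

Running this before ingestion surfaces schema clashes early instead of failing inside the pipeline.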
Fixed Known Issues
The following are the known issues that are fixed in this version:
- Fixed issue in NIM LLM for automatic profile selection. For details, refer to Model Profiles.
Known limitations
The following are the known limitations in this version:
- DRA support using NIM operator based helm chart is not available in this release.
v2.3.2
Version 2.3.2 (2025-12-25)
This release is a hotfix for RAG v2.3.0.
Changed
- Bumped the embedqa image to version 1.10.1 and nim-llm to version 1.14.0.
- Aligned Helm values and all referenced tags with the new embedqa and nim-llm versions.
v2.3.0
Version 2.3.0 (2025-10-14)
This release adds RTX6000 platform support, adds deployment by using NIM operator, improves vector database pluggability with the blueprint, and other changes.
Added
- Support deploying the blueprint on RTX6000 platform.
- Migrated to `llama-3.3-nemotron-super-49b-v1.5` as the default LLM model.
- Added support to deploy the helm chart by using NVIDIA NIM operator. For details, refer to Deploy NVIDIA RAG Blueprint with NIM Operator.
- Updated all NIMs, NVIDIA Ingest and third party dependencies to latest versions.
- Refactored the code to support custom third-party vector DB integration in a streamlined manner.
- Interactive notebook showcasing integration with library mode here.
- Added support for Elasticsearch vector DB as an alternative to Milvus.
- Added opt-in query decomposition support.
- Added opt-in nemoretriever-ocr support.
- Added opt-in VLM embedding support
- Custom metadata enhancements. Detailed doc here.
- Added support for more datatypes.
- Added opt-in support to generate filters using LLM yielding better accuracy.
- Added an interactive notebook showcasing new features.
- Added dependency check support for ingestor server /health API.
- Added support for a configurable confidence threshold for retrieval at the API layer.
- Added support to store NV-Ingest extraction results directly from the filesystem.
- Logging enhancements
- Added better latency data reporting for RAG server
- API level enhancements for component level latency
- Added dedicated Prometheus metric endpoint
- Added independent script to showcase batch ingestion
- Enabled support for GPU indexing with CPU search
- Exposed `APP_VECTORSTORE_EF` as a configurable parameter
- Added environment variables to control llm parameters LLM_MAX_TOKENS, LLM_TEMPERATURE and LLM_TOP_P
- Added notebooks for showcasing RAG evaluation using common metrics
- Added unit tests and pre-commit hooks for maintaining code quality.
- Optimized container sizes by removing unnecessary packages and improving security.
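The new `LLM_MAX_TOKENS`, `LLM_TEMPERATURE`, and `LLM_TOP_P` environment variables can be read into a parameter dict before invoking the LLM. A minimal sketch; the variable names come from the release notes, but the fallback defaults and the helper name are illustrative placeholders:

```python
import os

def llm_params_from_env() -> dict:
    """Read LLM sampling parameters from the environment.

    The variable names match the release notes; the defaults used
    when a variable is unset are assumptions for this example.
    """
    return {
        "max_tokens": int(os.environ.get("LLM_MAX_TOKENS", "1024")),
        "temperature": float(os.environ.get("LLM_TEMPERATURE", "0.2")),
        "top_p": float(os.environ.get("LLM_TOP_P", "0.7")),
    }
```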
Changed
- Migrated the default LLM model for reflection to `llama-3.3-nemotron-super-49b` instead of `mixtral-8x22b-instruct-v01`.
- Refactored rag-playground code
- Use React end to end. Next.js dependencies were deprecated.
- More developer friendly and intuitive look and feel.
- The `rag-playground` service is renamed to `rag-frontend`
- Refactored helm chart support
- Expanded and reorganized Helm chart configuration, enabling granular control over service components, resource settings, and observability (tracing, metrics).
- Introduced ConfigMap and service definitions to facilitate improved application deployment flexibility.
- Implemented refined service account and secret management in Helm templates.
- Added a new Helm values file for nim-operator to configure LLM model environment and component toggles.
Fixed
- Fixed support for long audio file ingestion.
- Fixed support to ingest images without charts/tables.
- Fixed the requirement of rebuilding the rag frontend container when the LLM model name was changed.
Removed
- Removed consistency level configuration support for Milvus.
- Removed the `EMBEDDING_NIM_ENDPOINT` and `EMBEDDING_NIM_MODEL_NAME` environment variables for nvingest.
- Removed the unused `ENABLE_MULTITURN` environment variable from rag-server.
- Removed the `ENABLE_NEMOTRON_THINKING` environment variable from rag-server.
v2.2.1
This minor patch release updates to the latest nv-ingest client version 25.6.3 to fix breaking changes introduced by pypdfium.
Details mentioned here:
https://github.com/NVIDIA/nv-ingest/releases/tag/25.6.3
All existing prebuilt containers should work.
Released corresponding pypi library:
https://pypi.org/project/nvidia-rag/2.2.1/
v2.2.0
This release adds B200 platform support, a native Python API, and major enhancements for multimodal and metadata features. It also improves deployment flexibility and customization across the RAG blueprint.
Added
- Support deploying the blueprint on B200 platform.
- Support for native python API
- Refactoring code and directory to support python API
- Better modularization for easier customization
- Moved to `uv` as the package manager for this project
- Added support for configurable vector store consistency levels (Bounded/Strong/Session) to optimize retrieval performance vs accuracy trade-offs.
- Capability to add custom metadata for files and metadata based filtering
- Documentation for using Multi-Instance GPU (MIG). Reduces the minimum GPU requirement for helm charts to 3xH100.
- Multi collection based retrieval support
- Audio files (.mp3 and .wav) support
- Support of using Vision Language Model based generation for charts and images
- Support for generating summaries of uploaded files
- Sample user interface enhancements
- Support for non-blocking file upload
- More efficient error reporting for ingestion failures
- Prompt customization support without rebuilding images
- Added support to enable infographics, which improves accuracy for documents containing text in image format.
- See this guide for details
- New customizations
- How to support non nvingest based ingestion + retrieval
- How to enable CPU based milvus
- How to enable nemoretriever-parse as an alternate PDF parser
- How to use standalone nv-ingest python client to do ingestion
- Nvidia AI Workbench support
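The custom-metadata and metadata-based filtering features above can be pictured as simple equality matching over per-file metadata. A hedged sketch only; the blueprint's actual filter expression syntax and document shape may differ, and `filter_by_metadata` is an illustrative name:

```python
def filter_by_metadata(documents, filters):
    """Keep documents whose metadata matches every key/value pair in filters.

    `documents` is assumed to be a list of dicts carrying a "metadata"
    dict; only exact-equality matching is shown here.
    """
    return [
        doc for doc in documents
        if all(doc.get("metadata", {}).get(k) == v for k, v in filters.items())
    ]
```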
Changed
- Changed API schema to support newly added features
- POST /collections to be deprecated in favour of POST /collection for ingestor-server
- New endpoint GET /summary added for rag-server
- Metadata information available as part of GET /collections and GET /documents API
- Check out migration guide for detailed changes at API level
- Optimized batch-mode ingestion support to improve performance for multi-user concurrent file upload.
Known Issues
Check out this section to understand the known issues present for this release.
v2.1.0
This release reduces overall GPU requirement for the deployment of the blueprint. It also improves the performance and stability for both docker and helm based deployments.
Added
- Added non-blocking async support to upload documents API
- Added a new field `blocking: bool` to control this behaviour from the client side. Default is set to `true`.
- Added a new API `/status` to monitor the state or completion status of uploaded docs.
- Helm chart is published on NGC Public registry.
- Helm chart customization guide is now available for all optional features under documentation.
- Issues with very large file uploads have been fixed.
- Security enhancements and stability improvements.
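The non-blocking upload plus `/status` flow above amounts to a poll-until-done loop. A minimal sketch; the `get_status` callable stands in for a real `/status` request, and the `{"state": ...}` response shape is an assumption for this example, not the documented schema:

```python
import time

def wait_for_ingestion(get_status, poll_interval=2.0, timeout=3600.0,
                       _sleep=time.sleep):
    """Poll an upload-status callable until ingestion finishes.

    `get_status` represents a GET /status call and is assumed to return
    a dict like {"state": "PENDING" | "FINISHED" | "FAILED"}.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = get_status()
        if status.get("state") in ("FINISHED", "FAILED"):
            return status
        _sleep(poll_interval)
    raise TimeoutError("ingestion did not complete before the timeout")
```

Injecting the sleep function (`_sleep`) keeps the helper testable without real delays.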
Changed
- Overall GPU requirement reduced to 2xH100/3xA100.
- Changed default LLM model to llama-3_3-nemotron-super-49b-v1. This reduces overall GPU needed to deploy LLM model to 1xH100/2xA100
- Changed default GPU needed for all other NIMs (ingestion and reranker NIMs) to 1xH100/1xA100
- Changed default chunk size to 512 in order to reduce LLM context size and in turn reduce RAG server response latency.
- Exposed a config to split PDFs post-chunking, controlled using the `APP_NVINGEST_ENABLEPDFSPLITTER` environment variable in ingestor-server. The default value is set to `True`.
- Added batch-based ingestion which can help manage memory usage of `ingestor-server` more effectively. Controlled using the `ENABLE_NV_INGEST_BATCH_MODE` and `NV_INGEST_FILES_PER_BATCH` variables. Default values are `True` and `100` respectively.
- Removed `extract_options` from the API level of `ingestor-server`.
- Resolved an issue during bulk ingestion where the ingestion job failed if ingestion of a single file failed.
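Batch-based ingestion, as described above, bounds memory by processing files in fixed-size groups. A sketch of the splitting step, assuming the documented default of 100 files per batch; the helper name is illustrative:

```python
def split_into_batches(files, files_per_batch=100):
    """Split a file list into batches of at most `files_per_batch` items,
    mirroring how NV_INGEST_FILES_PER_BATCH (default 100) bounds the
    number of files the ingestor processes at once."""
    return [files[i:i + files_per_batch]
            for i in range(0, len(files), files_per_batch)]
```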
Known Issues
- The `rag-playground` container needs to be rebuilt if the `APP_LLM_MODELNAME`, `APP_EMBEDDINGS_MODELNAME`, or `APP_RANKING_MODELNAME` environment variable values are changed.
- While trying to upload multiple files at the same time, there may be a timeout error: `Error uploading documents: [Error: aborted] { code: 'ECONNRESET' }`. Developers are encouraged to use the APIs directly for bulk uploading instead of the sample rag-playground. The default timeout is set to 1 hour on the UI side while uploading.
- In case of failure while uploading files, error messages may not be shown in the user interface of rag-playground. Developers are encouraged to check the `ingestor-server` logs for details.
A detailed guide is available here to ease the developer experience when migrating from older versions.
v2.0.0
[2.0.0] - 2025-03-18
This release adds support for multimodal documents using Nvidia Ingest, including support for parsing PDF, Word, and PowerPoint documents. It also significantly improves accuracy and performance by refactoring the APIs and architecture, and adds a new developer-friendly UI.
Added
- Integration with Nvingest for the ingestion pipeline; the unstructured.io-based pipeline is now deprecated.
- OTEL compatible observability and telemetry support.
- API refactoring. Updated schemas here.
- Support runtime configuration of all common parameters.
- Multimodal citation support.
- New dedicated endpoints for deleting collections, creating collections, and reingesting documents
- New react + nodeJS based UI showcasing runtime configurations
- Added optional features to improve accuracy and reliability of the pipeline, turned off by default. Best practices here
- Brev dev compatible notebook
- Security enhancements and stability improvements
Changed
- In RAG v1.0.0, a single server managed both ingestion and retrieval/generation APIs. In RAG v2.0.0, the architecture has evolved to utilize two separate microservices.
- Helm charts are now modularized; separate helm charts are provided for each distinct microservice.
- Default settings configured to achieve a balance between accuracy and perf.
- Default flow uses on-prem models with option to switch to API catalog endpoints for docker based flow.
- Query rewriting uses a smaller llama3.1-8b-instruct and is turned off by default.
- Support to use conversation history during retrieval for low-latency multiturn support.
Known Issues
- The `rag-playground` container needs to be rebuilt if the `APP_LLM_MODELNAME`, `APP_EMBEDDINGS_MODELNAME`, or `APP_RANKING_MODELNAME` environment variable values are changed.
- Optional features reflection, nemoguardrails, and image captioning are not available in helm-based deployment.
- Uploading large files with a .txt extension may fail during ingestion. We recommend splitting such files into smaller parts to avoid this issue.
A detailed guide is available here to ease the developer experience when migrating from older versions.
v1.0.0
This is the first release of the NVIDIA AI RAG blueprint which serves as a reference solution for a foundational Retrieval Augmented Generation (RAG) pipeline. This blueprint demonstrates how to set up a RAG solution that uses NVIDIA NIM and GPU-accelerated components.
By default, this blueprint leverages the NVIDIA-hosted models available in the NVIDIA API Catalog.
However, you can replace these models with your own locally-deployed NIMs to meet specific data governance and latency requirements.
For more details, check out the readme.