From dc8d6078afed5044fbf197bcc042dee693e8f5d9 Mon Sep 17 00:00:00 2001 From: Shubhadeep Das Date: Sat, 30 May 2026 01:04:25 +0530 Subject: [PATCH 1/2] docs: refresh v2.6 support guidance Signed-off-by: Shubhadeep Das --- docs/accuracy_perf.md | 2 +- docs/deploy-docker-self-hosted.md | 9 +++++---- docs/enable-nemotron-thinking.md | 6 +++--- docs/index.md | 6 +++--- docs/migration_guide.md | 21 +++++++++++++++++++++ docs/multimodal-retriever.md | 16 ++++++---------- docs/readme.md | 3 ++- docs/release-notes.md | 2 +- docs/support-matrix.md | 13 ++++++++----- 9 files changed, 50 insertions(+), 28 deletions(-) diff --git a/docs/accuracy_perf.md b/docs/accuracy_perf.md index 5eb03a9e9..c987c1b5f 100644 --- a/docs/accuracy_perf.md +++ b/docs/accuracy_perf.md @@ -37,7 +37,7 @@ Change the setting if you want different behavior. | `ENABLE_REFLECTION` | `false` | Set to `true` to enable self-reflection. For details, refer to [Self-Reflection Support](self-reflection.md). | - Can improve the response quality by refining intermediate retrieval and final LLM output.
| - Significantly higher latency due to multiple iterations of LLM model call.
- You might need to deploy a separate judge LLM model, increasing GPU requirement.
| | `ENABLE_RERANKER` | `true` | Set to `true` to use the reranking model. | - Improves accuracy by selecting better documents for response generation.
| - Increases latency due to additional processing.
- Additional hardware requirements for self-hosted on premises deployment.
| | `ENABLE_VLM_INFERENCE` | `false` | Set to `true` to use the Vision-Language Model (VLM) for response generation. For details, refer to [VLM for Generation](vlm.md). | - Enables analysis of retrieved images alongside text for richer, multimodal responses.
- Can process up to 4 images per citation.
- Useful for document Q&A, visual search, and multimodal chatbots.
| - Requires additional GPU resources for VLM model deployment.
- Increases latency due to image processing.
| -| `LLM_ENABLE_THINKING` | `false` | Set to `true` to enable reasoning for Nemotron 3 models. Use `LLM_REASONING_BUDGET` and `LLM_LOW_EFFORT` for fine-grained control. For Nemotron 1.5 models, use the `/think` system prompt instead. For details, refer to [Enable Reasoning](enable-nemotron-thinking.md). | - Improves response quality through enhanced reasoning capabilities.
- Yields more precise responses.
| - Can increase response latency due to additional thinking process.
- Can increase token usage and computational overhead.
| +| `LLM_ENABLE_THINKING`
`LLM_REASONING_BUDGET`
`LLM_LOW_EFFORT` | `true`
`256`
`true` | The v2.6.0 deployment defaults enable low-effort reasoning for Nemotron 3 Super. Set `LLM_ENABLE_THINKING=false` to disable reasoning, or tune the budget and effort mode for latency and accuracy. For Nemotron 1.5 models, use the `/think` system prompt instead. For details, refer to [Enable Reasoning](enable-nemotron-thinking.md). | - Improves response quality through enhanced reasoning capabilities.
- Yields more precise responses.
| - Can increase response latency due to the additional thinking process.
- Can increase token usage and computational overhead.
| | `RERANKER_SCORE_THRESHOLD` | `0.0` | Filters out retrieved chunks if reranker relevance is lower than this threshold. We recommend that you set this value between `0.3` and `0.5` to balance quality and coverage. For details, refer to [Use the Python Package](python-client.md). | - Faster retrieval by processing fewer documents.
- Can improve accuracy by excluding low-relevance documents.
| - Requires `ENABLE_RERANKER` set to `true` for effective filtering.
- Might filter out too many chunks if the threshold is set high, causing no response from the RAG server.
| | `RERANKER TOP K` | 10 | Increase `reranker TOP K` to increase the probability of relevant context being part of the top-k contexts. | Increasing the value can improve accuracy. | Increasing the value can increase latency. | | `VDB TOP K` | 100 | Increase `VDB TOP K` to provide a larger candidate pool for reranking. | Increasing the value can improve accuracy. | Increasing the value can increase latency. | diff --git a/docs/deploy-docker-self-hosted.md b/docs/deploy-docker-self-hosted.md index 94f6ec0fa..0867571ba 100644 --- a/docs/deploy-docker-self-hosted.md +++ b/docs/deploy-docker-self-hosted.md @@ -112,7 +112,7 @@ Use the following procedure to start all containers needed for this blueprint. USERID=$(id -u) docker compose -f deploy/compose/nims.yaml up -d ``` -5. Check the status of the deployment by running the following code. Wait until all services are up and the `nemotron-ranking-ms`, `nemotron-embedding-ms` and `nim-llm-ms` NIMs are in healthy state before proceeding further. +5. Check the status of the deployment by running the following code. Wait until all services are up and the `nemotron-ranking-ms`, `nemotron-vlm-embedding-ms`, and `nim-llm-ms` NIMs are in healthy state before proceeding further. ```bash watch -n 2 'docker ps --format "table {{.Names}}\t{{.Status}}"' @@ -126,7 +126,7 @@ Use the following procedure to start all containers needed for this blueprint. nemotron-ranking-ms Up 4 minutes (healthy) compose-graphic-elements-1 Up 4 minutes compose-page-elements-1 Up 4 minutes - nemotron-embedding-ms Up 4 minutes (healthy) + nemotron-vlm-embedding-ms Up 4 minutes (healthy) compose-nemotron-ocr-1 Up 4 minutes compose-table-structure-1 Up 4 minutes ``` @@ -259,7 +259,7 @@ Use the following procedure to start all containers needed for this blueprint. 
fe2751bfa734 nemotron-ranking-ms Up 10 minutes (healthy) 7b5ddabf8be7 compose-graphic-elements-1 Up 10 minutes ecfaa5190302 compose-page-elements-1 Up 10 minutes - ea8c7fdf20d1 nemotron-embedding-ms Up 10 minutes (healthy) + ea8c7fdf20d1 nemotron-vlm-embedding-ms Up 10 minutes (healthy) 6d62008a9b42 compose-nemotron-ocr-1 Up 10 minutes 969b9f5c987c compose-table-structure-1 Up 10 minutes ``` @@ -340,10 +340,11 @@ By default, Elasticsearch is deployed as the vector database (`vectordb.yaml` wi - For advanced users who need direct filesystem access to extraction results, refer to [Ingestor Server Volume Mounting](mount-ingestor-volume.md). -- A single NVIDIA A100-80GB or H100-80GB, B200 GPU can be used to start non-LLM NIMs (nemotron-embedding-ms, nemotron-ranking-ms, and ingestion services like page-elements, ocr, graphic-elements, and table-structure) for ingestion and RAG workflows. You can control which GPU is used for each service by setting these environment variables in `deploy/compose/.env` file before launching. For a complete list of all services and their default GPU assignments, see [Service Port and GPU Reference](service-port-gpu-reference.md). +- A single NVIDIA A100-80GB, H100-80GB, or B200 GPU can be used to start non-LLM NIMs (nemotron-vlm-embedding-ms, nemotron-ranking-ms, and ingestion services like page-elements, ocr, graphic-elements, and table-structure) for ingestion and RAG workflows. You can control which GPU is used for each service by setting these environment variables in the `deploy/compose/.env` file before launching. For a complete list of all services and their default GPU assignments, see [Service Port and GPU Reference](service-port-gpu-reference.md). 
```bash EMBEDDING_MS_GPU_ID=0 + VLM_EMBEDDING_MS_GPU_ID=0 RANKING_MS_GPU_ID=0 YOLOX_MS_GPU_ID=0 YOLOX_GRAPHICS_MS_GPU_ID=0 diff --git a/docs/enable-nemotron-thinking.md b/docs/enable-nemotron-thinking.md index 94e898d53..cc0b292a5 100644 --- a/docs/enable-nemotron-thinking.md +++ b/docs/enable-nemotron-thinking.md @@ -30,13 +30,13 @@ Nemotron 3 models (such as `nvidia/nemotron-3-nano-30b-a3b`) use environment var Set the following environment variables on the RAG server container (via Docker Compose, Helm values, or shell export): **`LLM_ENABLE_THINKING`** -: Enable or disable the reasoning phase. When `true`, the model emits reasoning tokens before the final answer. Default: `false`. +: Enable or disable the reasoning phase. When `true`, the model emits reasoning tokens before the final answer. The v2.6.0 deployment files set this to `true` for Nemotron 3 Super. Library and custom deployments that do not set the environment variable use the application default, `false`. **`LLM_REASONING_BUDGET`** -: Maximum number of tokens allocated for reasoning. Only used when `LLM_ENABLE_THINKING` is `true`. Default: `0`. +: Maximum number of tokens allocated for reasoning. Only used when `LLM_ENABLE_THINKING` is `true`. The v2.6.0 deployment default is `256`; the application default is `0`. **`LLM_LOW_EFFORT`** -: Low-effort reasoning mode for faster, cheaper responses with shorter reasoning. Only used when `LLM_ENABLE_THINKING` is `true`. Default: `false`. +: Low-effort reasoning mode for faster, cheaper responses with shorter reasoning. Only used when `LLM_ENABLE_THINKING` is `true`. The v2.6.0 deployment default is `true`; the application default is `false`. **`FILTER_THINK_TOKENS`** : Filter reasoning out of the user-facing `content` stream. Reasoning emitted diff --git a/docs/index.md b/docs/index.md index c1c495141..46a4c211f 100644 --- a/docs/index.md +++ b/docs/index.md @@ -87,7 +87,7 @@ After you deploy the RAG blueprint, you can customize it for your use cases. 
- [Continuous Ingestion from Object Storage](continuous-ingestion-object-storage.md) - [Custom Metadata Support](custom-metadata.md) - [File System Access to Extraction Results](mount-ingestor-volume.md) - - [Multimodal Retriever — VLM Embedding & VLM Reranker (Early Access)](multimodal-retriever.md) + - [Multimodal Retriever — VLM Embedding & VLM Reranker](multimodal-retriever.md) - [OCR Configuration Guide](nemoretriever-ocr.md) - [Enhanced PDF Extraction](nemotron-parse-extraction.md) - [Text-Only Ingestion](text_only_ingest.md) @@ -225,7 +225,7 @@ After you deploy the RAG blueprint, you can customize it for your use cases. Custom metadata Support Data Catalog for Collections and Documents File System Access to Results - Multimodal Retriever — VLM Embedding & VLM Reranker (Early Access) + Multimodal Retriever — VLM Embedding & VLM Reranker OCR Configuration Guide Enhanced PDF Extraction Standalone NeMo Retriever Library @@ -268,7 +268,7 @@ After you deploy the RAG blueprint, you can customize it for your use cases. Evaluate Your RAG System RAG Accuracy Benchmarks - RAG Performance Benchmarks + Benchmark RAG Performance ``` diff --git a/docs/migration_guide.md b/docs/migration_guide.md index 87bde24be..74138f5fd 100644 --- a/docs/migration_guide.md +++ b/docs/migration_guide.md @@ -8,6 +8,27 @@ This documentation contains the information to upgrade [NVIDIA RAG Blueprint](re :::{tip} To navigate this page more easily, click the outline button at the top of the page. [outline-button](assets/outline-button.png) +::: + + +## Migration Guide: v2.5.1 to v2.6.0 + +This guide summarizes the default changes and new capabilities introduced in [NVIDIA RAG Blueprint](readme.md) v2.6.0. Review these items before upgrading an existing deployment. + +### Default Deployment Changes + +- **Vector database:** Elasticsearch is now the default vector database. Milvus remains available as an optional backend. 
If you need to keep using Milvus, set `APP_VECTORSTORE_NAME=milvus`, point `APP_VECTORSTORE_URL` to Milvus in both RAG and ingestor services, and follow [Vector Database Configuration](change-vectordb.md). +- **Object store:** SeaweedFS is now the default S3-compatible object store. Docker Compose deployments use named `rag-vol-*` volumes for persistent data. If you are upgrading from a deployment that used host-mounted data under `deploy/compose/volumes/`, follow [Manage Persistent Data Volumes](troubleshooting.md#manage-persistent-data-volumes). +- **LLM:** The default LLM is now `nvidia/nemotron-3-super-120b-a12b`. The v2.6.0 deployment files enable low-effort reasoning by default with `LLM_ENABLE_THINKING=true`, `LLM_REASONING_BUDGET=256`, and `LLM_LOW_EFFORT=true`. For latency-sensitive deployments, see [Enable Reasoning](enable-nemotron-thinking.md) for how to disable or tune reasoning. +- **Embedding model:** The default embedding model is now `nvidia/llama-nemotron-embed-vl-1b-v2`. The text-only `nvidia/llama-nemotron-embed-1b-v2` model remains available as an optional configuration. If you switch embedding models or dimensions, re-ingest your documents so the stored vectors match the retrieval embedder. +- **OCR naming:** OCR endpoint names now use `nemotron-ocr-v1` instead of `nemoretriever-ocr-v1`. + +### New Optional Features + +- **Agentic RAG:** v2.6.0 adds an Agentic RAG plan-and-execute pipeline. It is disabled by default and can be enabled per request with the `agentic` field or by setting `ENABLE_AGENTIC_RAG=true`. For details, see [Agentic RAG](agentic-rag.md). +- **VLM reranker:** `nvidia/llama-nemotron-rerank-vl-1b-v2` is available as an opt-in reranker for image-heavy corpora. For details, see [Change the LLM or Embedding Model](change-model.md#switch-to-the-vlm-reranker). +- **OpenShift Helm deployment:** Red Hat OpenShift and OKD deployment is now documented for Helm. 
For details, see [Deploy on OpenShift with Helm](deploy-helm-openshift.md). +- **Evaluation and performance tooling:** v2.6.0 adds the filesystem evaluation CLI under `scripts/eval/` and the `rag-perf` performance benchmarking CLI under `scripts/rag-perf/`. For details, see [Evaluate Your NVIDIA RAG Blueprint System](evaluate.md) and [Benchmark the Performance of Your RAG System](performance-benchmarking.md). ## Migration Guide: v2.2.0 to v2.3.0 diff --git a/docs/multimodal-retriever.md b/docs/multimodal-retriever.md index 32d75d9fe..0a8c72f19 100644 --- a/docs/multimodal-retriever.md +++ b/docs/multimodal-retriever.md @@ -6,7 +6,7 @@ The multimodal retriever has two independently switchable components that together let the [NVIDIA RAG Blueprint](readme.md) embed and re-rank documents with awareness of their visual content rather than text alone: -1. **VLM Embedding for Ingestion** — replaces the text embedder with `nvidia/llama-nemotron-embed-vl-1b-v2` so PDF pages, tables, charts, and image elements are embedded by a multimodal model. +1. **VLM Embedding for Ingestion** — uses the default `nvidia/llama-nemotron-embed-vl-1b-v2` embedder so text passages, PDF pages, tables, charts, and image elements can be embedded by a multimodal model. 2. **VLM Reranker** — replaces the text reranker with `nvidia/llama-nemotron-rerank-vl-1b-v2` so retrieved passages are scored using both their text and the cited images. Both components plug into the same retrieval pipeline and can be enabled independently or together. Pair them with [VLM-based generation](vlm.md) for a fully multimodal RAG pipeline; see [Enabling Full VLM Multimodal RAG Pipeline](vlm.md#enabling-full-vlm-multimodal-rag-pipeline) for the end-to-end picture, and [Multimodal Query Support](multimodal-query.md) for the user-facing image+text query flow. 
@@ -15,9 +15,9 @@ Requirements: an NVIDIA GPU per enabled component (H100/A100 recommended) and a --- -# Part 1 — VLM Embedding for Ingestion (Early Access) +# Part 1 — VLM Embedding for Ingestion -This part shows how to enable and use the multimodal embedding model `nvidia/llama-nemotron-embed-vl-1b-v2` in the ingestion pipeline. +The multimodal embedding model `nvidia/llama-nemotron-embed-vl-1b-v2` is the default embedding model in v2.6.0. The setup steps in this section are useful when you need to start only the VLM embedding service, confirm the active endpoint, switch back from the optional text-only embedder, or enable image-modality ingestion. In this section you do the following: @@ -26,17 +26,13 @@ In this section you do the following: - Point the ingestor to the VLM embedding service and model :::{note} -**Early Access**: Currently, `nvidia/llama-nemotron-embed-vl-1b-v2` is in early access preview. -::: - -:::{note} -**PDF Support Only**: The VLM embedding feature is currently only supported for PDF documents. Other document formats (Word, PowerPoint, etc.) are not supported with VLM embedding. +**Image-modality PDF support:** The default v2.6.0 configuration uses the VLM embedding service while keeping extracted text, tables, and charts in text modality. Advanced image-modality ingestion, such as embedding structured elements or whole pages as images, is currently supported only for PDF workflows. ::: ## Limitations -- The VLM embedding feature is experimental and responses may not be accurate. -- Summary generation doesn't work when this feature is enabled. +- Advanced image-modality ingestion is experimental, and responses might not be accurate. +- Summary generation does not work with image-modality ingestion configurations such as whole-page image extraction. ## 1. 
Start the VLM Embedding NIM locally diff --git a/docs/readme.md b/docs/readme.md index ad3e0b4e9..57315803d 100644 --- a/docs/readme.md +++ b/docs/readme.md @@ -87,7 +87,7 @@ After you deploy the RAG blueprint, you can customize it for your use cases. - [Audio Ingestion Support](audio_ingestion.md) - [Custom Metadata Support](custom-metadata.md) - [File System Access to Extraction Results](mount-ingestor-volume.md) - - [Multimodal Retriever — VLM Embedding & VLM Reranker (Early Access)](multimodal-retriever.md) + - [Multimodal Retriever — VLM Embedding & VLM Reranker](multimodal-retriever.md) - [OCR Configuration Guide](nemoretriever-ocr.md) - [Enhanced PDF Extraction](nemotron-parse-extraction.md) - [Text-Only Ingestion](text_only_ingest.md) @@ -116,6 +116,7 @@ After you deploy the RAG blueprint, you can customize it for your use cases. - [Evaluate Your NVIDIA RAG Blueprint System](evaluate.md) - [RAG Accuracy Benchmarks](accuracy-benchmarks.md) + - [Benchmark the Performance of Your RAG System](performance-benchmarking.md) - Governance diff --git a/docs/release-notes.md b/docs/release-notes.md index fae923eef..6632bf532 100644 --- a/docs/release-notes.md +++ b/docs/release-notes.md @@ -8,7 +8,7 @@ This documentation contains the release notes for [NVIDIA RAG Blueprint](readme. -## Release 2.6.0 (TBD) +## Release 2.6.0 (2026-05-30) This release adds [Agentic RAG](./agentic-rag.md) support with plan-and-execute pipelines, streaming responses, and UI integration; changes the default vector database to Elasticsearch and the default object store to SeaweedFS; adds [Red Hat OpenShift](./deploy-helm-openshift.md) support for Helm-based deployment; and introduces new [agent skills](../skill-source/README.md) for deployment, evaluation, and performance tooling. 
diff --git a/docs/support-matrix.md b/docs/support-matrix.md index fabfef732..6a0e740b5 100644 --- a/docs/support-matrix.md +++ b/docs/support-matrix.md @@ -59,13 +59,15 @@ You can also modify the RAG Blueprint to use [NVIDIA-hosted](deploy-docker-nvidi ## Hardware Requirements (Kubernetes) -To install the RAG Blueprint on Kubernetes, you need one of the following: +To install the default RAG Blueprint Helm chart on Kubernetes, you need one of the following: -- 9 x H100-80GB -- 9 x B200 -- 9 x RTX PRO 6000 +- 8 x H100-80GB +- 8 x B200 +- 8 x RTX PRO 6000 - 3 x H100 (with [Multi-Instance GPU](./mig-deployment.md)) +Optional GPU-backed services increase the requirement. Plan for one additional GPU for each optional service that you enable, such as VLM generation, VLM captioning, VLM reranking, Nemotron Parse, or audio processing, unless you use MIG slicing or another explicit sharing strategy. + ## Hardware requirements for self-hosting all NVIDIA NIM microservices @@ -74,8 +76,9 @@ The following are requirements and recommendations for the individual components - **Pipeline operation** – 1x L40 GPU or similar recommended. This is required if you use Milvus (optional) as the vector database with GPU acceleration. The default Elasticsearch VDB does not require a GPU. If you change the vector backend or enable optional GPU acceleration for Elasticsearch vector indexing, refer [Elasticsearch Configuration](elasticsearch-configuration.md) and confirm GPU requirements for that configuration. - **LLM NIM (nemotron-3-super-120b-a12b)** – Refer to the [Support Matrix](https://docs.nvidia.com/nim/large-language-models/latest/supported-models.html). -- **Embedding NIM (llama-nemotron-embed-vl-1b-v2)** – Refer to the embedding model support matrix for your deployment target. +- **Embedding NIM (llama-nemotron-embed-vl-1b-v2)** – Refer to the [Support Matrix](https://docs.nvidia.com/nim/nemo-retriever/text-embedding/latest/support-matrix.html) for your deployment target. 
- **Reranking NIM (llama-nemotron-rerank-1b-v2)**: Refer to the [Support Matrix](https://docs.nvidia.com/nim/nemo-retriever/text-reranking/latest/support-matrix.html). +- **VLM Reranking NIM (llama-nemotron-rerank-vl-1b-v2, optional)**: Refer to the [Support Matrix](https://docs.nvidia.com/nim/nemo-retriever/text-reranking/latest/support-matrix.html). - **Nemotron OCR (Default)**: Refer to the [Support Matrix](https://docs.nvidia.com/nim/ingestion/image-ocr/1.3.0/support-matrix.html). - **NVIDIA NIMs for Object Detection**: - Nemotron Page Elements v3 [Support Matrix](https://docs.nvidia.com/nim/ingestion/object-detection/latest/support-matrix.html#nemo-retriever-page-elements-v3) From 67d4c64d6794149f5b683a5f6bdb81592ebacce8 Mon Sep 17 00:00:00 2001 From: Shubhadeep Das Date: Sat, 30 May 2026 01:15:38 +0530 Subject: [PATCH 2/2] docs: tighten reasoning and mig guidance Signed-off-by: Shubhadeep Das --- docs/enable-nemotron-thinking.md | 6 +++--- docs/support-matrix.md | 2 +- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/docs/enable-nemotron-thinking.md b/docs/enable-nemotron-thinking.md index cc0b292a5..36ddcf5dd 100644 --- a/docs/enable-nemotron-thinking.md +++ b/docs/enable-nemotron-thinking.md @@ -30,13 +30,13 @@ Nemotron 3 models (such as `nvidia/nemotron-3-nano-30b-a3b`) use environment var Set the following environment variables on the RAG server container (via Docker Compose, Helm values, or shell export): **`LLM_ENABLE_THINKING`** -: Enable or disable the reasoning phase. When `true`, the model emits reasoning tokens before the final answer. The v2.6.0 deployment files set this to `true` for Nemotron 3 Super. Library and custom deployments that do not set the environment variable use the application default, `false`. +: Enable or disable the reasoning phase. When `true`, the model emits reasoning tokens before the final answer. Default: `true` in the v2.6.0 deployment files for Nemotron 3 Super. 
**`LLM_REASONING_BUDGET`** -: Maximum number of tokens allocated for reasoning. Only used when `LLM_ENABLE_THINKING` is `true`. The v2.6.0 deployment default is `256`; the application default is `0`. +: Maximum number of tokens allocated for reasoning. Only used when `LLM_ENABLE_THINKING` is `true`. Default: `256` in the v2.6.0 deployment files. **`LLM_LOW_EFFORT`** -: Low-effort reasoning mode for faster, cheaper responses with shorter reasoning. Only used when `LLM_ENABLE_THINKING` is `true`. The v2.6.0 deployment default is `true`; the application default is `false`. +: Low-effort reasoning mode for faster, cheaper responses with shorter reasoning. Only used when `LLM_ENABLE_THINKING` is `true`. Default: `true` in the v2.6.0 deployment files. **`FILTER_THINK_TOKENS`** : Filter reasoning out of the user-facing `content` stream. Reasoning emitted diff --git a/docs/support-matrix.md b/docs/support-matrix.md index 6a0e740b5..bcfa09c08 100644 --- a/docs/support-matrix.md +++ b/docs/support-matrix.md @@ -64,7 +64,7 @@ To install the default RAG Blueprint Helm chart on Kubernetes, you need one of t - 8 x H100-80GB - 8 x B200 - 8 x RTX PRO 6000 -- 3 x H100 (with [Multi-Instance GPU](./mig-deployment.md)) +- 5 x H100-80GB (with [Multi-Instance GPU](./mig-deployment.md)) Optional GPU-backed services increase the requirement. Plan for one additional GPU for each optional service that you enable, such as VLM generation, VLM captioning, VLM reranking, Nemotron Parse, or audio processing, unless you use MIG slicing or another explicit sharing strategy.
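
As a rough planning aid, the GPU guidance above can be sketched as a small shell helper. The function name `estimate_gpus` and the service labels are hypothetical and not part of the blueprint. The sketch assumes the non-MIG baseline of 8 GPUs for the default Helm chart and one extra GPU per optional service, and it does not model MIG slicing or other sharing strategies.

```bash
# Hypothetical helper (not shipped with the blueprint): estimate the GPU count
# for the default Helm chart plus optional GPU-backed services.
# Assumption: 8 GPUs for the default chart without MIG, plus one GPU for each
# optional service passed as an argument. The labels are free-form names.
estimate_gpus() {
  local base=8                 # default chart requirement without MIG
  local optional_services=$#   # number of optional services to enable
  echo $(( base + optional_services ))
}

# Example: enabling VLM generation and VLM reranking on top of the default chart.
estimate_gpus vlm-generation vlm-reranking
```

With no arguments the helper returns the baseline. Each argument stands for one optional service, so the result is only an upper bound when services share a GPU.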