diff --git a/docs/docs/extraction/audio-video.md b/docs/docs/extraction/audio-video.md index 6cbf67a569..c85df6e598 100644 --- a/docs/docs/extraction/audio-video.md +++ b/docs/docs/extraction/audio-video.md @@ -63,7 +63,7 @@ This pipeline enables retrieval at the speech segment level when you enable segm Use the following procedure to run the NIM on your own infrastructure. Self-hosted Parakeet runs on Kubernetes via the [NeMo Retriever Helm chart](https://github.com/NVIDIA/NeMo-Retriever/blob/main/nemo_retriever/helm/README.md). Enable the ASR NIM per [Optional Helm NIMs](prerequisites-support-matrix.md#optional-helm-nims-not-auto-wired-by-default) and the [Helm chart — NIM operator sub-stack](https://github.com/NVIDIA/NeMo-Retriever/blob/main/nemo_retriever/helm/README.md#nim-operator-sub-stack); pin the workload to a dedicated GPU and wire the ASR endpoint in your pipeline. -!!! important +After deploy, call the pipeline from Python: Pin the Parakeet workload to the dedicated GPU with your Helm values or the [NIM Operator](https://docs.nvidia.com/nim-operator/latest/index.html) (for example, node selectors, resource limits, or device requests appropriate to your cluster). @@ -87,15 +87,14 @@ Use the following procedure to run the NIM on your own infrastructure. Self-host asr_params=ASRParams(segment_audio=True), ) ) - ``` +) +``` +To generate one extracted element for each sentence-like ASR segment, include `extract_audio_params={"segment_audio": True}` when calling `.extract(...)`. This option applies when audio extraction runs with a self-hosted Parakeet NIM or using build.nvidia.com hosted inference, but has no effect when using the local Hugging Face Parakeet model. To generate one extracted element for each sentence-like ASR segment, pass `asr_params=ASRParams(segment_audio=True)` to `.extract_audio(...)`. 
This option applies when audio extraction runs with a self-hosted Parakeet NIM or using build.nvidia.com hosted inference, but has no effect when using the local Hugging Face Parakeet model. - - !!! tip - - For more Python examples, refer to [Python Quick Start Guide](https://github.com/NVIDIA/NeMo-Retriever/blob/main/client/client_examples/examples/python_client_usage.ipynb). + For more Python examples, refer to [Python Quick Start Guide](https://github.com/NVIDIA/NeMo-Retriever/blob/main/client/client_examples/examples/python_client_usage.ipynb). ## Parakeet with hosted inference (build.nvidia.com) { #parakeet-hosted-inference-build-nvidia } diff --git a/docs/docs/extraction/concepts.md b/docs/docs/extraction/concepts.md index 57d42065a2..4418682052 100644 --- a/docs/docs/extraction/concepts.md +++ b/docs/docs/extraction/concepts.md @@ -36,6 +36,6 @@ Token-based splitting uses the Llama 3.2 1B tokenizer (default `meta-llama/Llama - **Library mode** — Run without the full container stack where appropriate; see [Deployment options](deployment-options.md). - **Kubernetes / Helm (self-hosted)** — See [Deploy (Helm chart)](https://github.com/NVIDIA/NeMo-Retriever/blob/main/nemo_retriever/helm/README.md) and [deployment options](deployment-options.md) for running the full microservices pipeline on your infrastructure. -- **Notebooks** — [Jupyter examples](notebooks.md) for experimentation and RAG demos. +- **Notebooks** — [Jupyter examples](notebooks/index.md) for experimentation and RAG demos. For a concise comparison, refer to [Deployment options](deployment-options.md). diff --git a/docs/docs/extraction/custom-metadata.md b/docs/docs/extraction/custom-metadata.md index 41033c645a..bf47dcd1c5 100644 --- a/docs/docs/extraction/custom-metadata.md +++ b/docs/docs/extraction/custom-metadata.md @@ -1,74 +1,42 @@ -# Use Custom Metadata to Filter Search Results +# Custom metadata and filtering -You can upload custom metadata for documents during ingestion. 
-By uploading custom metadata you can attach additional information to documents, -and use it for filtering results during retrieval operations. -For example, you can add author metadata to your documents, and filter by author when you retrieve results. -To create filters at query time, use predicates supported by [LanceDB SQL](https://lancedb.github.io/lancedb/sql/) against your table schema (custom fields are serialized into the `metadata` column with your ingested chunks). For a worked example, see the repository notebook linked at the end of this page. +Use this documentation to attach per-document metadata during ingestion and to narrow [LanceDB](vdbs.md) search results in [NeMo Retriever Library](overview.md). Implementation details live in the package [Vector DB operators and LanceDB](https://github.com/NVIDIA/NeMo-Retriever/tree/main/nemo_retriever/src/nemo_retriever/vdb#metadata-filtering) README. -Use this documentation to use custom metadata to filter search results when you work with [NeMo Retriever Library](overview.md). +## On this page { #on-this-page } +- [Attach metadata at ingestion](#attach-metadata-at-ingestion) +- [How metadata is stored](#how-metadata-is-stored) +- [Filter results at query time](#filter-results-at-query-time) +- [Writing `where` predicates](#writing-where-predicates) +- [Server-side vs client-side filters](#server-side-vs-client-side-filters) +- [Inspect hit metadata](#inspect-hit-metadata) +- [Limitations](#limitations) +- [Related content](#related-content) -## Limitations +## Attach metadata at ingestion { #attach-metadata-at-ingestion } -The following are limitation when you use custom metadata: +Pass a **sidecar metadata table** on `vdb_upload` so selected columns are merged into each chunk's `content_metadata` before LanceDB upload. All three parameters must be set together: -- Metadata fields must be consistent across documents in the same collection. -- Complex filter expressions may impact retrieval performance. 
-- If you update your custom metadata, you must ingest your documents again to use the new metadata. +| Parameter | Purpose | +|-----------|---------| +| `meta_dataframe` | Path to CSV, JSON, or Parquet, or an in-memory `pandas.DataFrame` | +| `meta_source_field` | Column that identifies each document (must match ingest paths or basenames per `meta_join_key`) | +| `meta_fields` | Non-empty list of column names to copy into `content_metadata` | - - -## Add Custom Metadata During Ingestion - -You can add custom metadata during the document ingestion process. -You can specify metadata for each file, -and you can specify different metadata for different documents in the same ingestion batch. - - -### Metadata Structure - -You specify custom metadata as a dataframe or a file (json, csv, or parquet). - -The following example contains metadata fields for category, department, and timestamp. -You can create whatever metadata is helpful for your scenario. +Optional `meta_join_key` controls how rows are matched to documents: `auto` (try full path then basename), `source_id` (full path), or `source_name` (basename only). ```python import pandas as pd +from nemo_retriever import create_ingestor meta_df = pd.DataFrame( { "source": ["data/woods_frost.pdf", "data/multimodal_test.pdf"], - "category": ["Alpha", "Bravo"], - "department": ["Language", "Engineering"], - "timestamp": ["2025-05-01T00:00:00", "2025-05-02T00:00:00"] + "meta_a": ["alpha", "bravo"], + "meta_b": [10, 20], } ) -# Convert the dataframe to a csv file, -# to demonstrate how to ingest a metadata file in a later step. - -file_path = "./meta_file.csv" -meta_df.to_csv(file_path) -``` - - -### Example: Add Custom Metadata During Ingestion - -The following example adds custom metadata during ingestion. -For more information about `create_ingestor` and run modes, refer to [Use the Python API](nemo-retriever-api-reference.md). -For more information about the `vdb_upload` method, refer to [Upload Data](vdbs.md). 
- -```python -from nemo_retriever import create_ingestor - -# Service-backed pipeline: point `base_url` at your running retriever service. -# For local graph execution instead, see [Use the Python API](nemo-retriever-api-reference.md). - -hostname = "localhost" -table_name = "nemo_retriever_collection" -lancedb_uri = "./lancedb_data" - ingestor = ( create_ingestor(run_mode="service", base_url=f"http://{hostname}:7670") .files(["data/woods_frost.pdf", "data/multimodal_test.pdf"]) @@ -150,9 +118,11 @@ hits = retriever.query( ) ``` +For a runnable end-to-end flow (ingest, `Retriever.query`, and both filter modes), see [nemo_retriever_retriever_query_metadata_filter.ipynb](https://github.com/NVIDIA/NeMo-Retriever/blob/main/examples/nemo_retriever_retriever_query_metadata_filter.ipynb). +When you ingest through the **retriever service**, upload the sidecar with [`POST /v1/ingest/sidecar`](https://github.com/NVIDIA/NeMo-Retriever/blob/main/nemo_retriever/src/nemo_retriever/service/routers/ingest.py#L1040-L1129) (multipart file; response [`SidecarUploadResponse`](https://github.com/NVIDIA/NeMo-Retriever/blob/main/nemo_retriever/src/nemo_retriever/service/models/responses.py#L60-L68)), then pass the returned `sidecar_id` as `meta_dataframe_id` with `meta_source_field` and `meta_fields` in `pipeline.vdb_upload_params` on [`POST /v1/ingest`](https://github.com/NVIDIA/NeMo-Retriever/blob/main/nemo_retriever/src/nemo_retriever/service/models/requests.py#L15-L32) ([`PipelineSpec`](https://github.com/NVIDIA/NeMo-Retriever/blob/main/nemo_retriever/src/nemo_retriever/service/models/pipeline_spec.py#L55-L78)). Request and response shapes, form fields, and auth headers are in the service OpenAPI UI at `/docs` (or `/openapi.json`) on your retriever base URL (for example `http://localhost:7670/docs` after `retriever service start`). Do not send a raw local path as `meta_dataframe` on the service spec. 
-## Related Content +## How metadata is stored { #how-metadata-is-stored } - [Vector databases](vdbs.md) — canonical LanceDB upload and retrieval guide - [metadata_and_filtered_search.ipynb](https://github.com/NVIDIA/NeMo-Retriever/blob/main/examples/metadata_and_filtered_search.ipynb) — CLI and graph ingest with sidecar metadata diff --git a/docs/docs/extraction/deployment-options.md b/docs/docs/extraction/deployment-options.md index 9f53687ffe..e1b030b7da 100644 --- a/docs/docs/extraction/deployment-options.md +++ b/docs/docs/extraction/deployment-options.md @@ -32,7 +32,7 @@ environments), use a custom service image that already contains `ffmpeg` and ### I want examples and notebooks -1. [Jupyter Notebooks](notebooks.md) +1. [Jupyter Notebooks](notebooks/index.md) 2. [Integrate with LangChain, LlamaIndex, Haystack](integrations-langchain-llamaindex-haystack.md) ### I need API details and keys diff --git a/docs/docs/extraction/faq.md b/docs/docs/extraction/faq.md index 7014cb4eef..b3b45307dd 100644 --- a/docs/docs/extraction/faq.md +++ b/docs/docs/extraction/faq.md @@ -20,10 +20,12 @@ For more information, refer to [Vector databases](vdbs.md). For images that `nemoretriever-page-elements-v3` does not classify as tables, charts, or infographics, you can use our VLM caption task to create a dense caption of the detected image. -That caption is then be embedded along with the rest of your content. -For more information, refer to [Extract Captions from Images](nemo-retriever-api-reference.md). +That caption is then embedded along with the rest of your content. +For chart-labeled PDF regions and other caption scope limits, see [Are PDF chart or figure regions captioned when Omni is enabled?](#are-pdf-chart-or-figure-regions-captioned-when-omni-is-enabled). For more information, refer to [Extract Captions from Images](nemo-retriever-api-reference.md). +## Are PDF chart or figure regions captioned when Omni is enabled? +No. 
Chart-labeled PDF regions are not routed through Omni captioning. See [Image captioning](prerequisites-support-matrix.md#image-captioning-2605) for scope, validation, and what the caption stage covers. ## When should I consider advanced visual parsing? diff --git a/docs/docs/extraction/getting-started-about.md b/docs/docs/extraction/getting-started-about.md index 2305003523..bc7f6eb02b 100644 --- a/docs/docs/extraction/getting-started-about.md +++ b/docs/docs/extraction/getting-started-about.md @@ -10,6 +10,6 @@ Typical order: - [Deployment options](deployment-options.md) for how to run NeMo Retriever Library - **Supported:** [Helm chart](https://github.com/NVIDIA/NeMo-Retriever/blob/main/nemo_retriever/helm/README.md) for Kubernetes, plus [NeMo Retriever Library install docs](https://docs.nvidia.com/nemo/retriever/latest/extraction/overview/) for the published charts - **Unsupported (developer-only):** [Docker Compose (local)](https://github.com/NVIDIA/NeMo-Retriever/blob/main/nemo_retriever/docker.md) — not a supported NIM deployment path -4. Explore [Jupyter Notebooks](notebooks.md) for end-to-end examples. +4. Explore [Jupyter Notebooks](notebooks/index.md) for end-to-end examples. If you are new to the product, read [What is NeMo Retriever Library?](overview.md) and [Concepts](concepts.md) under **Introduction** first. 
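Note on the `meta_join_key` modes added in the `custom-metadata.md` changes above (`auto`, `source_id`, `source_name`): the matching order they describe can be sketched in plain Python. This is an illustration only and not the library's implementation. The helper name `match_metadata_row`, its arguments, and the assumption that `source_name` compares the basename of both the sidecar value and the ingest path are hypothetical.

```python
from pathlib import PurePosixPath


def match_metadata_row(source_path, rows, source_field="source", join_key="auto"):
    """Return the first sidecar row that matches source_path, or None.

    Hypothetical helper: it mirrors the documented join modes and does not
    call any nemo_retriever API.
    """
    if join_key not in {"auto", "source_id", "source_name"}:
        raise ValueError(f"unsupported join_key: {join_key!r}")

    # source_id (and the first attempt of auto): compare the full path.
    if join_key in ("auto", "source_id"):
        for row in rows:
            if row[source_field] == source_path:
                return row

    # source_name (and the fallback of auto): compare basenames only.
    # Assumption: the basename is taken on both sides of the comparison.
    if join_key in ("auto", "source_name"):
        basename = PurePosixPath(source_path).name
        for row in rows:
            if PurePosixPath(row[source_field]).name == basename:
                return row

    return None


# Sidecar rows as plain dicts; the values reuse the example columns from the diff.
rows = [
    {"source": "data/woods_frost.pdf", "meta_a": "alpha", "meta_b": 10},
    {"source": "multimodal_test.pdf", "meta_a": "bravo", "meta_b": 20},
]

full_path_hit = match_metadata_row("data/woods_frost.pdf", rows)
basename_fallback = match_metadata_row("/mnt/in/multimodal_test.pdf", rows)
strict_miss = match_metadata_row("/mnt/in/multimodal_test.pdf", rows, join_key="source_id")
```

Under these assumptions, `auto` is the forgiving choice when ingest paths and sidecar paths are written differently, while `source_id` avoids accidental matches between files that share a basename in different directories.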
diff --git a/docs/docs/extraction/integrations-langchain-llamaindex-haystack.md b/docs/docs/extraction/integrations-langchain-llamaindex-haystack.md index ec1abe48c2..7ee0dda650 100644 --- a/docs/docs/extraction/integrations-langchain-llamaindex-haystack.md +++ b/docs/docs/extraction/integrations-langchain-llamaindex-haystack.md @@ -9,7 +9,7 @@ The repository includes notebooks that demonstrate multimodal RAG patterns: - [Multimodal RAG with LangChain](https://github.com/NVIDIA/NeMo-Retriever/blob/main/examples/langchain_multimodal_rag.ipynb) - [Multimodal RAG with LlamaIndex](https://github.com/NVIDIA/NeMo-Retriever/blob/main/examples/llama_index_multimodal_rag.ipynb) -These are also linked from [Jupyter Notebooks](notebooks.md) and the [FAQ](faq.md). +These are also linked from [Jupyter Notebooks](notebooks/index.md) and the [FAQ](faq.md). ## Haystack diff --git a/docs/docs/extraction/multimodal-extraction.md b/docs/docs/extraction/multimodal-extraction.md index e021a7098e..5e6a5a4fb5 100644 --- a/docs/docs/extraction/multimodal-extraction.md +++ b/docs/docs/extraction/multimodal-extraction.md @@ -49,8 +49,9 @@ NeMo Retriever Library detects tables as structured page elements, processes the Charts and infographic regions are classified with other page layout elements (tables, text blocks, titles) and processed through layout detection and OCR. `extract_charts` and `extract_infographics` are enabled by default. Outputs use the same metadata schema as other extracted objects. +Chart-labeled PDF regions are **not** routed through the Omni caption stage; they remain on the layout-and-OCR path. For scope and validation guidance, see [Image captioning](prerequisites-support-matrix.md#image-captioning-2605). -For natural-language infographic descriptions, optionally enable [image captioning](#image-captioning). 
+For natural-language infographic descriptions, optionally enable [image captioning](#image-captioning) and set `caption_infographics=True` when you need VLM captions on infographic regions. **Related** @@ -62,7 +63,7 @@ For natural-language infographic descriptions, optionally enable [image captioni Scanned PDFs and image-only pages rely on OCR and hybrid paths that combine native text extraction with OCR when needed. For extract methods such as `ocr` and `pdfium_hybrid`, refer to the [Python API reference](nemo-retriever-api-reference.md). -The default OCR engine is **Nemotron OCR v2**. When you run extraction **locally with HuggingFace models**, v2 operates in **multilingual** mode by default. For CLI flags and API parameters, see [Nemotron OCR v2 — language mode](https://github.com/NVIDIA/NeMo-Retriever/blob/main/nemo_retriever/docs/cli/README.md#nemotron-ocr-v2-language-mode). For Kubernetes installs, see [Nemotron OCR v2 — language mode](prerequisites-support-matrix.md#nemotron-ocr-v2-language-mode) in the support matrix. +OCR artifacts depend on how you deploy. **Helm / NIM:** the production chart uses **Nemotron OCR v1** (`nvcr.io/nim/nvidia/nemotron-ocr-v1:1.3.0`). **Local Hugging Face inference:** the default engine is **Nemotron OCR v2**, which operates in **multilingual** mode by default. For CLI flags and API parameters, see [Nemotron OCR v2 — language mode](https://github.com/NVIDIA/NeMo-Retriever/blob/main/nemo_retriever/docs/cli/README.md#nemotron-ocr-v2-language-mode). For Kubernetes defaults and the Helm-vs-local split, see [OCR artifacts (Helm vs local Hugging Face)](prerequisites-support-matrix.md#nemotron-ocr-v2-language-mode) in the support matrix. 
**Related** diff --git a/docs/docs/extraction/notebooks.md b/docs/docs/extraction/notebooks/index.md similarity index 82% rename from docs/docs/extraction/notebooks.md rename to docs/docs/extraction/notebooks/index.md index 916cde203b..54a9787503 100644 --- a/docs/docs/extraction/notebooks.md +++ b/docs/docs/extraction/notebooks/index.md @@ -1,6 +1,6 @@ # Notebooks for NeMo Retriever Library -To get started using [NeMo Retriever Library](overview.md), you can try one of the ready-made notebooks that are available. +To get started using [NeMo Retriever Library](../overview.md), you can try one of the ready-made notebooks that are available. ## Dataset Downloads for Benchmarking @@ -23,11 +23,3 @@ For more advanced scenarios, try one of the following notebooks: - [Evaluate bo767 retrieval recall accuracy with NeMo Retriever Library](https://github.com/NVIDIA/NeMo-Retriever/blob/main/evaluation/bo767_recall.ipynb) - [Multimodal RAG with LangChain](https://github.com/NVIDIA/NeMo-Retriever/blob/main/examples/langchain_multimodal_rag.ipynb) - [Multimodal RAG with LlamaIndex](https://github.com/NVIDIA/NeMo-Retriever/blob/main/examples/llama_index_multimodal_rag.ipynb) - - - -## Related Topics - -- [Pre-Requisites & Support Matrix](prerequisites-support-matrix.md) -- [Deployment options](deployment-options.md) -- [Deploy with Helm](https://github.com/NVIDIA/NeMo-Retriever/blob/main/nemo_retriever/helm/README.md) diff --git a/docs/docs/extraction/overview.md b/docs/docs/extraction/overview.md index 07ef2c83dc..5128585233 100644 --- a/docs/docs/extraction/overview.md +++ b/docs/docs/extraction/overview.md @@ -15,6 +15,7 @@ NeMo Retriever Library does the following: - Accept directories of input files and a series of configurable ingestion tasks to perform on that input - Allow the extracted content be retrieved from a VDB containing discrete metadata element +- Support multiple extraction methods per document type—for example, PDFs can use **pdfium** or [Nemotron 
Parse](https://build.nvidia.com/nvidia/nemotron-parse) as an alternate method (`extract_method="nemotron_parse"`) - Support various types of pre- and post- processing operations, including text splitting and chunking, transform and filtering, embedding generation, and image offloading to storage. !!! note @@ -49,5 +50,5 @@ NeMo Retriever Library supports the following file types: - [Deploy on Kubernetes with Helm](https://github.com/NVIDIA/NeMo-Retriever/blob/main/nemo_retriever/helm/README.md) - [NeMo Retriever Library — prerequisites / deployment](https://docs.nvidia.com/nemo/retriever/latest/extraction/overview/) (supported Helm charts) - [Docker Compose (unsupported, developer)](https://github.com/NVIDIA/NeMo-Retriever/blob/main/nemo_retriever/docker.md) -- [Notebooks](notebooks.md) +- [Notebooks](notebooks/index.md) - [NVIDIA AI Blueprints catalog](https://build.nvidia.com/explore/discover) — solution cards, enterprise RAG blueprints, and end-to-end patterns (including [Enterprise RAG — multimodal PDF data extraction](https://build.nvidia.com/nvidia/multimodal-pdf-data-extraction-for-enterprise-rag)); for integration pathways, refer to [Integrations](integrations-langchain-llamaindex-haystack.md). 
diff --git a/docs/docs/extraction/prerequisites-support-matrix.md b/docs/docs/extraction/prerequisites-support-matrix.md index 46fce79ed4..b8910f2ea8 100644 --- a/docs/docs/extraction/prerequisites-support-matrix.md +++ b/docs/docs/extraction/prerequisites-support-matrix.md @@ -5,7 +5,7 @@ Before you begin using [NeMo Retriever Library](overview.md), confirm your softw ## Software Requirements - Linux operating systems (Ubuntu 22.04 or later recommended) -- [CUDA Toolkit](https://developer.nvidia.com/cuda-downloads) (NVIDIA Driver >= `535`, CUDA >= `12.2`) +- [CUDA Toolkit](https://developer.nvidia.com/cuda-downloads) (NVIDIA Driver >= `580`, CUDA >= `13.0`) - [Python](https://www.python.org/downloads/) `3.12` — required to install and run the NeMo Retriever Library Python API, CLI, and related packages from PyPI (for example `pip` or `uv`). Older Python versions will fail dependency resolution without a clear error. - [UV Python package and environment manager](https://docs.astral.sh/uv/getting-started/installation/) (optional; recommended for creating isolated environments) - For audio and video, `ffmpeg` and `ffprobe` must be on `PATH` (for example @@ -13,6 +13,11 @@ Before you begin using [NeMo Retriever Library](overview.md), confirm your softw `ffmpeg-python` and `nemo-retriever[multimedia]` do not install these binaries. On Helm with package-repo access, set `service.installFfmpeg=true`. For air-gapped clusters, see [Air-gapped and disconnected deployment](deployment-options.md#air-gapped-deployment). +- For PDF extraction with `extract_method="nemotron_parse"`, install the Nemotron Parse + client dependencies with `pip install "nemo-retriever[nemotron-parse]"` (pulls + `open-clip-torch`, which provides the `open_clip` module required by the Nemotron Parse + NIM client). The base `nemo-retriever` install and `[local]` extra do not include this + package. !!! 
note @@ -67,16 +72,20 @@ The production Helm chart enables these NIM microservices **by default** (for ex |-----------|-----|------| | `page_elements` | [nemotron-page-elements-v3](https://huggingface.co/nvidia/nemotron-page-elements-v3) | Page layout and element detection | | `table_structure` | [nemotron-table-structure-v1](https://huggingface.co/nvidia/nemotron-table-structure-v1) | Table structure extraction | -| `ocr` | [nemotron-ocr-v2](https://huggingface.co/nvidia/nemotron-ocr-v2) | Image OCR | +| `ocr` | [nemotron-ocr-v1](https://huggingface.co/nvidia/nemotron-ocr-v1) | Image OCR | | `vlm_embed` | [llama-nemotron-embed-vl-1b-v2](https://huggingface.co/nvidia/llama-nemotron-embed-vl-1b-v2) | Multimodal (VL) embedding | -### Nemotron OCR v2 language mode { #nemotron-ocr-v2-language-mode } +### OCR artifacts (Helm vs local Hugging Face) { #nemotron-ocr-v2-language-mode } !!! note + **Helm / NIM:** The production chart deploys **Nemotron OCR v1** under `nimOperator.ocr` (`nvcr.io/nim/nvidia/nemotron-ocr-v1:1.3.0`). For image defaults and upgrade notes, see [OCR NIM configuration](https://github.com/NVIDIA/NeMo-Retriever/blob/26.05/nemo_retriever/helm/README.md#ocr-nim-configuration) in the Helm chart README. + **Local Hugging Face inference:** When you deploy locally with HuggingFace model weights (for example `pip install "nemo-retriever[local]"` and GPU inference without remote OCR NIM URLs), the default OCR engine is **Nemotron OCR v2**, which runs in **multilingual** mode by default. For CLI flags and API parameters, see [Nemotron OCR v2 — language mode](https://github.com/NVIDIA/NeMo-Retriever/blob/main/nemo_retriever/docs/cli/README.md#nemotron-ocr-v2-language-mode). Remote OCR NIM endpoints use their own model and language behavior; local OCR language selectors are not sent on remote requests. - **Helm / NIM:** The chart deploys the core OCR NIM under `nimOperator.ocr`. 
For image defaults, multilingual behavior, and upgrade notes, see [Nemotron OCR v2 — language mode](https://github.com/NVIDIA/NeMo-Retriever/blob/26.05/nemo_retriever/helm/README.md#nemotron-ocr-v2-language-mode) in the Helm chart README. +Default OCR NIM container for release Helm deployments: + +- **Image:** `nvcr.io/nim/nvidia/nemotron-ocr-v1:1.3.0` Default VL embedder container and model for release deployments: @@ -98,6 +107,17 @@ These NIM microservices are **optional** for the default extraction pipeline. Th For 26.05, use **`nemotron_3_nano_omni_30b_a3b_reasoning`** when you enable the caption stage (hosted model ID `nvidia/nemotron-3-nano-omni-30b-a3b-reasoning`). The Helm key is in the [optional NIMs](#optional-helm-nims-not-auto-wired-by-default) table above. +!!! important "PDF chart regions are not captioned by Omni" + + When **nemotron-page-elements-v3** classifies a PDF region as **chart**, that region is processed through layout detection and OCR—not the Omni caption stage. Enabling the caption NIM and the `caption` pipeline stage does **not** send chart-labeled figures to `/v1/chat/completions`. + + The caption stage covers: + + - Unstructured content in the `images` column (standalone image files and page-element regions **not** classified as table, chart, or infographic) + - Optional infographic regions when you set `caption_infographics=True` on `CaptionParams` (the VLM caption is stored in `caption`, separate from OCR `text`) + + To validate caption traffic during ingest, inspect metadata such as `page_elements_v3_counts_by_label`. If the figure is labeled `chart`, expect no Omni chat-completions requests for that region even when captioning is enabled. + Optional features listed in the table above require additional GPU support, disk space, and feature-specific system dependencies beyond the four default NIMs. 
For published NIM model IDs and deployment-specific constraints, use the product support matrices linked under [Related Topics](#related-topics) below. @@ -111,6 +131,8 @@ NeMo Retriever Library supports the following GPU hardware given system constrai Model repositories and NIM references are linked in [Core and Advanced Pipeline Features](#core-and-advanced-pipeline-features) above. +**B200 and audio/video extraction (26.05):** The [audio and video](audio-video.md) transcription path (self-hosted Parakeet ASR via `nimOperator.audio`) is **not supported on B200** or other Blackwell GPUs. Core PDF and multimodal extraction on B200 is unchanged. See footnote ⁴ below. + | Feature | HF Model Weights | GPU Option | [RTX Pro 6000](https://www.nvidia.com/en-us/data-center/rtx-pro-6000-blackwell-server-edition/) | [B200](https://www.nvidia.com/en-us/data-center/dgx-b200/) | [H200 NVL](https://www.nvidia.com/en-us/data-center/h200/) | [H100](https://www.nvidia.com/en-us/data-center/h100/) | [A100 80GB](https://www.nvidia.com/en-us/data-center/a100/) | A100 40GB | [A10G](https://aws.amazon.com/ec2/instance-types/g5/) | L40S | [RTX PRO 4500 Blackwell](https://www.nvidia.com/en-us/products/workstations/professional-desktop-gpus/rtx-pro-4500/) | |---------|------------------|------------|--------|--------|--------|--------|--------|--------|--------|--------|------------------------| | GPU | — | Memory | 96GB | 180GB | 141GB | 80GB | 80GB | 40GB | 24GB | 48GB | 32GB GDDR7 (GB203) | diff --git a/docs/docs/extraction/releasenotes.md b/docs/docs/extraction/releasenotes.md index 86e1cd421f..a06d20fa7c 100644 --- a/docs/docs/extraction/releasenotes.md +++ b/docs/docs/extraction/releasenotes.md @@ -4,57 +4,80 @@ This documentation contains the release notes for [NeMo Retriever Library](overv ## 26.05 Release Notes (26.5.0) -NVIDIA® NeMo Retriever Library version **26.05** (PyPI **26.5.0** at GA) continues the 26.05 release line on the 
[`26.05`](https://github.com/NVIDIA/NeMo-Retriever/tree/26.05) branch. Pre-release builds are tagged **`26.05-RC1`**, **`26.05-RC2`**, and so on; install and deploy using the RC tag that matches your build. +NVIDIA® NeMo Retriever Library version 26.05 builds on the 26.03 foundation with a graph-based ingest architecture, expanded multimodal and tabular capabilities, production-oriented service deployment, and documentation aligned to a Helm-first supported path. -To upgrade the Helm charts for this release, refer to the [NeMo Retriever Helm chart README](https://github.com/NVIDIA/NeMo-Retriever/blob/26.05/nemo_retriever/helm/README.md) and pin chart version **`26.05-RC1`** (or the RC you are validating). +To upgrade the Helm charts for this release, refer to the [NeMo Retriever Library Helm Charts](https://github.com/NVIDIA/NeMo-Retriever/blob/26.05/nemo_retriever/helm/README.md). -Highlights for the 26.05 release line include everything in [26.03](#2603-release-notes-2630) plus changes on `main` merged into the `26.05` branch. See the [Git compare view](https://github.com/NVIDIA/NeMo-Retriever/compare/26.03...26.05) for the full commit list. +Highlights for the 26.05 release include: -**Migration note:** Direct `Retriever(...)` construction uses grouped configuration dictionaries. Replace flat `lancedb_uri=`, `lancedb_table=`, `embedder=`, `embedding_endpoint=`, `local_query_embed_backend=`, and `reranker=` arguments with `vdb_kwargs={...}`, `embed_kwargs={...}`, and `rerank=...`. For example, `local_query_embed_backend="hf"` maps to `embed_kwargs={"local_ingest_embed_backend": "hf"}`. Helper APIs that document their own flat kwargs keep their own compatibility layer. 
+### Upgrade notes -**Install (RC1 example):** +- Text splitting for graph and library ingest moved into `.extract(split_config=...)` instead of standalone `.split()` on the graph ingest path (the service ingestor API may still expose `.split()` separately) +- Direct `Retriever(...)` construction uses `vdb_kwargs`, `embed_kwargs`, and `rerank` instead of flat `lancedb_uri`, `lancedb_table`, `embedder`, `embedding_endpoint`, `local_query_embed_backend`, and `reranker` arguments +- For Helm audio and video extraction, set `service.installFfmpeg: true` in `values.yaml` (or pass `--set service.installFfmpeg=true`) when images no longer bundle `ffmpeg` and `ffprobe` by default +- `nemo_retriever` requires Python 3.12 -```bash -uv pip install nemo-retriever==26.05-RC1 -``` +### Pipeline and ingestion -Use your organization's Artifactory or PyPI index URL when installing published wheels from CI (see the Perform Release workflow summary for the exact index). +- Legacy `nv-ingest` code paths removed; `graph_pipeline` and the graph stage registry are the canonical ingestion path +- Manifest-based ingest routing replaces input-type routing; `retriever ingest` is input-aware for PDF, image, audio, video, text, HTML, DOCX/PPTX, SVG, and related types +- `allow_no_gpu` option to skip GPU requirement during ingest for CPU-only experimentation -## 26.03 Release Notes (26.3.0) +### CLI -NVIDIA® NeMo Retriever Library version 26.03 adds broader hardware and software support along with many pipeline, evaluation, and deployment enhancements. 
+- Root CLI adds `retriever ingest` and `retriever query` with NIM URL flags, batch tuning, and LanceDB overwrite/append controls, plus `retriever pipeline` for graph execution +- For product use, only `retriever ingest`, `retriever query`, and `retriever pipeline` (for example `retriever pipeline run`) are supported; other top-level subcommands—including `pdf`, `html`, `eval`, `benchmark`, `harness`, `online`, `compare`, `image`, and `skill-eval`—are development and experimental -To upgrade the Helm charts for this release, refer to the [NeMo Retriever Library Helm Charts](https://github.com/NVIDIA/NeMo-Retriever/blob/main/nemo_retriever/helm/README.md). +### Retriever Service and deployment -Highlights for the 26.03 release include: +- Retriever Service v2 adds a scalable multi-pod architecture with gateway, process isolation, and VectorDB integration +- OpenTelemetry basic support for pipeline and service observability +- Expanded air-gapped deployment guidance in [deployment options](deployment-options.md) and the Helm chart README -- Legacy ingestion repository consolidated under NeMo-Retriever -- NeMo Retriever Extraction pipeline renamed to NeMo Retriever Library -- NeMo Retriever Library now supports two deployment options: - - A new no-container, pip-installable in-process library for development (available on PyPI) - - Existing production-ready Helm chart with NIMs -- Added documentation notes on Air-gapped deployment support -- Added documentation notes on OpenShift support -- Added support for RTX4500 Pro Blackwell SKU -- Added support for llama-nemotron-embed-vl-v2 in text and text+image modes -- New extract methods `pdfium_hybrid` and `ocr` target scanned PDFs to improve text and layout extraction from image-based pages -- VLM-based image caption enhancements: - - Infographics can be captioned - - Reasoning mode is configurable -- Enabled hybrid search with Lancedb -- Added retrieval_bench subfolder with generalizable agentic retrieval pipeline -- The 
project now uses UV as the primary environment and package manager instead of Conda, resulting in faster installs and simpler dependency handling -- Default TTL for long-running pipeline job state increased from 1–2 hours to 48 hours so long-running jobs (for example, VLM captioning) do not expire before completion -- NeMo Retriever Library currently does not support image captioning via VLM; this feature will be added in the next release -- Documentation: multimodal extraction is covered on one page with an in-page table of contents and redirects from the former per-topic URLs -- Container images built from this repository no longer install `ffmpeg` and - `ffprobe` by default. Audio and video extraction require these binaries on - `PATH`; for Helm deployments set `service.installFfmpeg=true`, or install - system FFmpeg manually in non-container environments. +### Models, OCR, and captioning + +- Nemotron OCR v2 is the default OCR engine for HuggingFace, with CLI language selectors and unified OCR actors. For Helm NIM deployments, Nemotron OCR v1 is the default. 
+- Nemotron Parse is available as an alternate PDF extraction method (v1.2 HTTP interface; optional Helm NIM; local inference via vLLM where configured) +- VLM image captioning via vLLM (including Omni caption model profiles) addresses the capability deferred in 26.03 +- vLLM-backed text and vision-language embedders, multimodal VL reranker, and torch 2.11 for local GPU installs + +### Multimodal extraction + +- Video retrieval pipeline with frame extraction, OCR, audio-visual fusion, and text deduplication +- Long-audio Parakeet chunking with time-aligned segments; punctuation-based audio segmenting; ASR batch/streaming improvements + +### Retrieval and RAG + +- Live RAG SDK with `Retriever.retrieve()`, reference answer generation `Retriever.answer()`, and optional batch operator graphs via LiteLLM (`[llm]` extra) + +### Vector database + +- Vector database operators integrated directly in the pipeline; custom metadata support; LanceDB hybrid search guidance updated +- LanceDB is documented as the first-party vector path for new deployments; Milvus/MinIO guidance removed from the primary extraction doc set + +### Evaluation + +- BEIR-centric evaluation overhaul and `retriever skill-eval` benchmark CLI for the NeMo Retriever skill (experimental) + + +- Text-to-SQL agent graph and tabular tooling for structured data retrieval, including tabular data ingestion + +### Packaging and platform + +- Optional install extras (`[local]`, `[multimedia]`, `[llm]`, `[tabular]`, `[nemotron-parse]`, `[service]`, and others), including slim remote/NIM-only installs on Mac and Windows + +### Helm chart + +- Helm chart refresh under `nemo_retriever/helm/` with GA VL embedder defaults and optional Nemotron Parse and Omni caption NIMs + +### Documentation + +- Documentation aligned to a Helm-first supported path; [Docker Compose for local development](https://github.com/NVIDIA/NeMo-Retriever/blob/26.05/nemo_retriever/docker.md) documented as unsupported developer tooling (not a 
production NIM deployment path) +- Documentation consolidates extraction concepts, ingest workflow, embeddings, audio/video guides, prerequisites and support matrix, and UDF/custom stages in the [graph README](https://github.com/NVIDIA/NeMo-Retriever/tree/26.05/nemo_retriever/src/nemo_retriever/graph#nemo-retriever-graph) ## Release Notes for Previous Versions -| [26.03](https://docs.nvidia.com/nemo/retriever/26.03/extraction/releasenotes/) +| [26.03](https://docs.nvidia.com/nemo/retriever/26.3.0/extraction/releasenotes/) | [26.1.2](https://docs.nvidia.com/nemo/retriever/26.1.2/extraction/releasenotes/) | [26.1.1](https://docs.nvidia.com/nemo/retriever/26.1.1/extraction/releasenotes/) | [25.9.0](https://docs.nvidia.com/nemo/retriever/25.9.0/extraction/releasenotes/) @@ -69,4 +92,4 @@ Highlights for the 26.03 release include: - [Pre-Requisites & Support Matrix](prerequisites-support-matrix.md) - [Deployment options](deployment-options.md) -- [Deploy with Helm](https://github.com/NVIDIA/NeMo-Retriever/blob/main/nemo_retriever/helm/README.md) +- [NeMo Retriever Library Helm Charts](https://github.com/NVIDIA/NeMo-Retriever/blob/26.05/nemo_retriever/helm/README.md) diff --git a/docs/docs/extraction/troubleshoot.md b/docs/docs/extraction/troubleshoot.md index fa415cd7e5..26a90bab16 100644 --- a/docs/docs/extraction/troubleshoot.md +++ b/docs/docs/extraction/troubleshoot.md @@ -100,6 +100,30 @@ You can set the variable in your .env file or directly in your environment. +## ModuleNotFoundError: No module named open_clip when using nemotron_parse { #modulenotfounderror-no-module-named-open-clip-when-using-nemotron-parse } + +When you run PDF extraction with `extract_method="nemotron_parse"`, you might see an error similar to the following: + +```text +ModuleNotFoundError: No module named 'open_clip' +``` + +The Nemotron Parse NIM client requires the `open_clip` Python module, provided by `open-clip-torch`. 
That package is not part of the default `nemo-retriever` install or the `[local]` extra. + +Install the dedicated PyPI extra before running Nemotron Parse extraction: + +```bash +pip install "nemo-retriever[nemotron-parse]" +``` + +For local GPU inference with Nemotron Parse, combine extras: + +```bash +pip install "nemo-retriever[local,nemotron-parse]" +``` + +See also [What is NeMo Retriever Library?](overview.md) and [Pre-Requisites & Support Matrix](prerequisites-support-matrix.md#software-requirements). + ## Extract method nemotron-parse doesn't support image files Currently, extraction with Nemotron parse doesn't support image files, only scanned PDFs. diff --git a/docs/mkdocs.yml b/docs/mkdocs.yml index 208b7b1544..4dcabca015 100644 --- a/docs/mkdocs.yml +++ b/docs/mkdocs.yml @@ -104,7 +104,7 @@ nav: - "NimClient and custom NIM endpoints": extraction/nimclient.md - "10. Integrations & ecosystem": - "Framework integrations": extraction/integrations-langchain-llamaindex-haystack.md - - "Starter kits": extraction/notebooks.md + - "Starter kits": extraction/notebooks/index.md - "11. Evaluation & benchmarks": - "Evaluate on your own documents": extraction/evaluate-on-your-data.md - "12. Reference": @@ -159,6 +159,7 @@ plugins: extraction/hosted-nims-when-to-use.md: extraction/deployment-options.md extraction/releasenotes-nv-ingest.md: extraction/releasenotes.md extraction/ngc-api-key.md: extraction/api-keys.md + extraction/notebooks.md: extraction/notebooks/index.md extraction/data-store.md: extraction/vdbs.md extraction/nemoretriever-parse.md: extraction/multimodal-extraction.md#text-and-layout-extraction extraction/supported-file-types.md: extraction/multimodal-extraction.md#supported-file-types-and-formats @@ -208,8 +209,9 @@ markdown_extensions: # MkDocs 1.6+: exclude suite landing and legacy duplicate pages (still in repo for parity). # extraction/chunking.md — removed from nav; content is under concepts.md (redirect_maps keeps old URLs). 
+# Use /index.md (docs root only); bare index.md would exclude every index.md (e.g. extraction/notebooks/index.md). exclude_docs: | - index.md + /index.md extraction/chunking.md extraction/helm.md extraction/choose-your-path.md