8 changes: 2 additions & 6 deletions docs/docs/extraction/audio.md
@@ -7,17 +7,13 @@ to extract speech from audio files.
- Run the NIM locally by using Docker Compose
- Use NVIDIA Cloud Functions (NVCF) endpoints for cloud-based inference

!!! note

NeMo Retriever extraction is also known as NVIDIA Ingest and nv-ingest.
For product naming, see [What is NeMo Retriever Extraction?](overview.md).

Currently, you can extract speech from the following file types:

- `mp3`
- `wav`



## Overview

[NeMo Retriever extraction](overview.md) supports extracting speech from audio files for Retrieval Augmented Generation (RAG) applications.
2 changes: 1 addition & 1 deletion docs/docs/extraction/benchmarking.md
@@ -90,7 +90,7 @@ active:
text_depth: page
table_output_format: markdown

# Pipeline (optional steps)
# Pipeline (optional steps); defaults match [Split Documents](chunking.md)
enable_caption: false
enable_split: false
split_chunk_size: 1024
73 changes: 36 additions & 37 deletions docs/docs/extraction/chunking.md
@@ -44,30 +44,46 @@ If you want chunks smaller than `page`, use token-based splitting as described i

## Token-Based Splitting

The `split` task uses a tokenizer to count the number of tokens in the document,
and splits the document based on the desired maximum chunk size and chunk overlap.
We recommend that you use the `meta-llama/Llama-3.2-1B` tokenizer,
because it's the same tokenizer as the llama-3.2 embedding model that we use for embedding.
However, you can use any tokenizer from any HuggingFace model that includes a tokenizer file.
The `split` task uses a tokenizer to count tokens and splits the document based on the configured chunk size and overlap. For the default tokenizer, the optional Llama tokenizer, and related environment variables, refer to the canonical subsection below.
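The windowing arithmetic behind `chunk_size` and `chunk_overlap` can be sketched as follows. This is illustrative only, not the NV-Ingest implementation: the service tokenizes with a real Hugging Face tokenizer, and `split_tokens` is a hypothetical helper that operates on an already-tokenized sequence.

```python
# Illustrative sketch of token-based splitting with overlap.
# Each chunk holds at most chunk_size tokens; consecutive chunks
# share chunk_overlap tokens, so the window advances by their difference.
def split_tokens(tokens: list, chunk_size: int = 1024, chunk_overlap: int = 150) -> list:
    """Split a token sequence into overlapping chunks."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the final window already reaches the end
    return chunks
```

With the recommended `chunk_size=1024` and `chunk_overlap=150`, each new chunk starts 874 tokens after the previous one, so the last 150 tokens of one chunk reappear at the start of the next.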

Use the `split` method to chunk large documents as shown in the following code.
### Token-based splitting and tokenizers

!!! note

This section is the **canonical reference** for tokenizer behavior in NV-Ingest. When tokenizer behavior, gating, or environment variables change, update this section first; then update or link from any other doc that mentions tokenizers (see [Documentation maintenance](contributing.md#documentation-maintenance)).

The default tokenizer (`meta-llama/Llama-3.2-1B`) requires a [Hugging Face access token](https://huggingface.co/docs/hub/en/security-tokens). You must set `params={"hf_access_token": "hf_***"}` to authenticate.
**Default tokenizer behavior**

- When you do not specify a tokenizer, the service uses a pre-downloaded tokenizer if present: if the container was built with `DOWNLOAD_LLAMA_TOKENIZER=True`, it uses `meta-llama/Llama-3.2-1B` from the container; otherwise it uses `intfloat/e5-large-unsupervised`. If no tokenizer is pre-downloaded, the runtime default is `intfloat/e5-large-unsupervised`.
- **Recommended for embedding alignment:** Use `meta-llama/Llama-3.2-1B` with `chunk_size=1024` and `chunk_overlap=150` so split boundaries align with the default embedding model. You can use any Hugging Face model that provides a tokenizer.

**Optional Llama tokenizer (`meta-llama/Llama-3.2-1B`)**

- The model is **gated on Hugging Face**. You must accept the [license](https://huggingface.co/meta-llama/Llama-3.2-1B) and [request access](https://huggingface.co/meta-llama/Llama-3.2-1B).
- **At runtime:** If the Llama tokenizer is not pre-downloaded in the container, you must provide a Hugging Face access token via `params={"hf_access_token": "hf_***"}` in the split task or set the `HF_ACCESS_TOKEN` environment variable for the ingest service. See [Hugging Face access tokens](https://huggingface.co/docs/hub/en/security-tokens).
- **At build time:** To pre-download the Llama tokenizer into the image (no runtime token needed), set `DOWNLOAD_LLAMA_TOKENIZER=True` and provide `HF_ACCESS_TOKEN` (or the equivalent build secret) during the Docker build.
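A build that pre-downloads the Llama tokenizer might look like the following. This is a sketch only: `DOWNLOAD_LLAMA_TOKENIZER` and `HF_ACCESS_TOKEN` are the documented variables, but the image tag, build context, and whether your setup passes the token as a build arg or a build secret are assumptions.

```shell
# Sketch: build the NV-Ingest container with the Llama tokenizer pre-downloaded.
# The image tag and build-arg mechanism here are assumptions; adapt to your build.
docker build \
  --build-arg DOWNLOAD_LLAMA_TOKENIZER=True \
  --build-arg HF_ACCESS_TOKEN="hf_***" \
  -t nv-ingest:local .
```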

**Environment variables**

| Variable | When it applies | What it does |
|----------|------------------|--------------|
| `DOWNLOAD_LLAMA_TOKENIZER` | Build time only | When `True`, pre-downloads `meta-llama/Llama-3.2-1B` into the container. Requires `HF_ACCESS_TOKEN` (or build secret) during build. When `False` (default in docker-compose), the container pre-downloads `intfloat/e5-large-unsupervised` instead. |
| `HF_ACCESS_TOKEN` | Build and/or runtime | Hugging Face access token. **Required at build** when `DOWNLOAD_LLAMA_TOKENIZER=True`. **Required at runtime** when the split task uses `meta-llama/Llama-3.2-1B` and the tokenizer is not pre-downloaded in the container. |

All other tokenizer and split-default details (for example, in the environment variable reference and client examples) should link here. See the [Support Matrix](support-matrix.md) for the default embedding model.

---

**Examples**

```python
# Recommended: Llama tokenizer aligned with default embedder (provide hf_access_token if not pre-downloaded)
ingestor = ingestor.split(
tokenizer="meta-llama/Llama-3.2-1B",
chunk_size=1024,
chunk_overlap=150,
params={"split_source_types": ["text", "PDF"], "hf_access_token": "hf_***"}
)
```

To use a different tokenizer, such as `intfloat/e5-large-unsupervised`, modify the `split` call as shown in the following example.

```python
# Alternative tokenizer
ingestor = ingestor.split(
tokenizer="intfloat/e5-large-unsupervised",
chunk_size=1024,
```

@@ -78,34 +94,17 @@ ingestor = ingestor.split(

### Split Parameters

The following table contains the `split` parameters.

| Parameter | Description | Default |
| ------ | ----------- | -------- |
| `tokenizer` | HuggingFace Tokenizer identifier or path. | `meta-llama/Llama-3.2-1B`|
| `chunk_size` | Maximum number of tokens per chunk. | `1024` |
| `chunk_overlap` | Number of tokens to overlap between chunks. | `150` |
| `params` | A sub-dictionary that can contain `split_source_types` and `hf_access_token` | `{}` |
| `hf_access_token` | Your Hugging Face access token. | — |
| `split_source_types` | The source types to split on (only splits on text by default). | — |



### Pre-download the Tokenizer

By default, the NV Ingest container comes with the `meta-llama/Llama-3.2-1B` tokenizer pre-downloaded
so that it doesn't have to download a tokenizer at runtime.
If you are building the container yourself and want to pre-download this model, do the following:

- Review the [license agreement](https://huggingface.co/meta-llama/Llama-3.2-1B).
- [Request access](https://huggingface.co/meta-llama/Llama-3.2-1B).
- Set the `DOWNLOAD_LLAMA_TOKENIZER` environment variable to `True`.
- Set the `HF_ACCESS_TOKEN` environment variable to your HuggingFace access token.


| Parameter | Description | Default |
| --------- | ----------- | -------- |
| `tokenizer` | HuggingFace tokenizer identifier or path. | See [Token-based splitting and tokenizers](#token-based-splitting-and-tokenizers) above. |
| `chunk_size` | Maximum number of tokens per chunk. | `1024` |
| `chunk_overlap` | Number of tokens to overlap between chunks. | `150` |
| `params` | Can include `split_source_types` and `hf_access_token`. | `{}` |
| `hf_access_token` | Hugging Face access token (required for gated Llama tokenizer when not pre-downloaded). | — |
| `split_source_types` | Source types to split on (text only by default). | — |

## Related Topics

- [Use the Python API](nv-ingest-python-api.md)
- [NeMo Retriever Extraction V2 API Guide](v2-api-guide.md)
- [Environment Variables](environment-variables.md)
- [Environment Variables](environment-config.md)
6 changes: 2 additions & 4 deletions docs/docs/extraction/content-metadata.md
@@ -8,9 +8,7 @@ The definitions used in this documentation are the following:

Metadata can be extracted from a source or content, or generated by using models, heuristics, or other methods.

!!! note

NeMo Retriever extraction is also known as NVIDIA Ingest and nv-ingest.
For product naming, see [What is NeMo Retriever Extraction?](overview.md).



@@ -374,4 +372,4 @@ For the full file, refer to the [data folder](https://github.com/NVIDIA/nv-inges

## Related Topics

- [Environment Variables](environment-variables.md)
- [Environment Variables](environment-config.md)
65 changes: 65 additions & 0 deletions docs/docs/extraction/contributing.md
@@ -2,3 +2,68 @@

External contributions to NV-Ingest will be welcome soon, and they are greatly appreciated!
For more information, refer to [Contributing to NV-Ingest](https://github.com/NVIDIA/nv-ingest/blob/main/CONTRIBUTING.md).

## Documentation maintenance

Prefer **centralization over repetition** for:

- **Model names, tokenizer/config defaults, environment variables, build-time flags, and job schema fields** — keep one canonical description and link from elsewhere.
- **Architecture diagrams and descriptions of service interactions** — one canonical overview; other pages summarize or link.
- **Constraints, caveats, and “gotchas” that may change over time** — e.g. known issues in [Release Notes](releasenotes-nv-ingest.md), hardware/feature limits in [Support Matrix](support-matrix.md).

When adding or updating information, always ask:

1. **Where should the canonical description live?**
Use or create the appropriate reference page (see below).
2. **Are there other places that already mention this and should now link to or reference the canonical description instead?**
Update them to point to the single source of truth.

**Canonical reference pages:**

| Topic | Canonical page |
|-------|----------------|
| Product naming | [What is NeMo Retriever Extraction?](overview.md) |
| Environment variables | [Environment Variables](environment-config.md) |
| Pipeline scaling env vars | [Resource Scaling Modes](scaling-modes.md) |
| Tokenizer and split defaults | [Split Documents](chunking.md) — canonical subsection: [Token-based splitting and tokenizers](chunking.md#token-based-splitting-and-tokenizers) |
| Pipeline NIMs and hardware | [Support Matrix](support-matrix.md) |
| Metadata / job schema fields | [Metadata Reference](content-metadata.md) |
| `vdb_upload` and `dense_dim` | [Data Store](data-store.md) |
| Known issues, deprecations, NIM caveats | [Release Notes](releasenotes-nv-ingest.md) |

### Pattern for any architecture or code change

For **any** future change in architecture or code that affects the docs (not just tokenizers), follow this pattern:

1. **Identify the concept**
Decide what concept changed (e.g. tokenizer, embedding model, environment variable, microservice, job schema field, API parameter).

2. **Choose the canonical home**
Decide which doc section or page should be the single canonical home for this concept (e.g. [Environment Variables](environment-config.md), Architecture Overview, [Token-based splitting and tokenizers](chunking.md#token-based-splitting-and-tokenizers), job configuration schema). If no suitable place exists, create a new subsection and make it discoverable from the main index or [overview](overview.md).

3. **Update the canonical description**
Update that canonical section first with: current behavior; defaults and configuration options; external requirements (e.g. NGC or Hugging Face account, licensing, tokens).

4. **De-duplicate and re-point**
Search the repo for all mentions of that concept (names, env vars, model IDs, API fields). For each mention: replace copied detailed explanations with concise text that defers to the canonical section; ensure any remaining text is fully consistent with the canonical description.

**Example — tokenizer changes:** Canonical home is [Token-based splitting and tokenizers](chunking.md#token-based-splitting-and-tokenizers). After updating it, search for: `llama-tokenizer`, `Llama tokenizer`, `meta-llama`, `DOWNLOAD_LLAMA_TOKENIZER`, `HF_ACCESS_TOKEN`, `token-based splitting`, and any model names that were previously the default; then shorten or link each match to the canonical section.

### Guardrails

- **Never** introduce a new detailed explanation of a previously documented concept without either:
- Moving that explanation into the concept’s canonical section, or
- Explicitly updating the canonical section and linking back to it from the new text.
- **Avoid** having two places where a reader could reasonably believe they are both “the main description” of the same behavior. If in doubt, keep one canonical description and have the other place summarize in 1–2 sentences and link.

### Style and linking conventions

When pointing to a canonical section:

- **Use consistent phrasing**, for example:
- “For the most up-to-date tokenizer configuration and requirements, see [Token-based splitting and tokenizers](chunking.md#token-based-splitting-and-tokenizers).”
- “For full details, see the [Environment Variables](environment-config.md) reference.”
- “For full details and the latest [concept] behavior, see [canonical section](link).” (Replace [concept] with the topic, e.g. tokenizer, environment variables.)
- **Keep cross-references stable:**
- Prefer relative links and section headings that are unlikely to change (e.g. `chunking.md#token-based-splitting-and-tokenizers`).
- If you rename or move a canonical section, update all inbound links in the same commit.
9 changes: 4 additions & 5 deletions docs/docs/extraction/data-store.md
@@ -2,9 +2,7 @@

Use this documentation to learn how [NeMo Retriever extraction](overview.md) handles and uploads data.

!!! note

NeMo Retriever extraction is also known as NVIDIA Ingest and nv-ingest.
For product naming, see [What is NeMo Retriever Extraction?](overview.md).


## Overview
@@ -47,6 +45,8 @@ You can delete all collections by deleting that volume, and then restarting the

When you use the `vdb_upload` method, the behavior of the upload depends on the `return_failures` parameter of the `ingest` method. For details, refer to [Capture Job Failures](nv-ingest-python-api.md#capture-job-failures).

**`dense_dim` must match your embedding model:** use `dense_dim=1024` for the default llama-3.2 embedder and `dense_dim=2048` for e5-v5. The default embedder is listed in [Support Matrix](support-matrix.md) and [Use the Python API](nv-ingest-python-api.md).
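As a guard against mismatches, the pairing can be captured in a small helper. This is a sketch: the dictionary keys are the shorthand model names used in this doc, not exact NIM model identifiers.

```python
# Sketch: map an embedding model to the Milvus dense vector dimension it produces.
# Keys are shorthand names from this doc, not exact NIM model identifiers.
EMBEDDER_DENSE_DIMS = {
    "llama-3.2": 1024,  # default embedder (see Support Matrix)
    "e5-v5": 2048,
}

def dense_dim_for(embedder: str) -> int:
    """Return the dense_dim to pass to vdb_upload for a known embedder."""
    try:
        return EMBEDDER_DENSE_DIMS[embedder]
    except KeyError:
        raise ValueError(f"unknown embedder {embedder!r}; set dense_dim explicitly")
```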

To upload to Milvus, use code similar to the following to define your `Ingestor`.

```python
Ingestor(client=client)
collection_name=collection_name,
milvus_uri=milvus_uri,
sparse=sparse,
# for llama-3.2 embedder, use 1024 for e5-v5
dense_dim=2048,
dense_dim=2048, # Use 1024 for default llama-3.2 embedder; see note above
stream=False,
recreate=False
)
```
10 changes: 4 additions & 6 deletions docs/docs/extraction/environment-config.md
@@ -3,17 +3,15 @@
The following are the environment variables that you can use to configure [NeMo Retriever extraction](overview.md).
You can specify these in your .env file or directly in your environment.

!!! note

NeMo Retriever extraction is also known as NVIDIA Ingest and nv-ingest.
For product naming, see [What is NeMo Retriever Extraction?](overview.md).


## General Environment Variables

| Name | Example | Description |
|----------------------------------|--------------------------------|-----------------------------------------------------------------------|
| `DOWNLOAD_LLAMA_TOKENIZER` | - | The Llama tokenizer is now pre-downloaded at build time. For details, refer to [Token-Based Splitting](chunking.md#token-based-splitting). |
| `HF_ACCESS_TOKEN` | - | A token to access HuggingFace models. For details, refer to [Token-Based Splitting](chunking.md#token-based-splitting). |
| `DOWNLOAD_LLAMA_TOKENIZER` | `True` / `False` | Build-time: when `True`, pre-downloads the Llama tokenizer. For when it is required and how it interacts with `HF_ACCESS_TOKEN`, see [Token-based splitting and tokenizers](chunking.md#token-based-splitting-and-tokenizers). |
| `HF_ACCESS_TOKEN` | | Hugging Face access token; required for gated Llama tokenizer at build or runtime depending on setup. For full details, see [Token-based splitting and tokenizers](chunking.md#token-based-splitting-and-tokenizers). |
| `INGEST_LOG_LEVEL` | - `DEBUG` <br/> - `INFO` <br/> - `WARNING` <br/> - `ERROR` <br/> - `CRITICAL` <br/> | The log level for the ingest service, which controls the verbosity of the logging output. |
| `MESSAGE_CLIENT_HOST` | - `redis` <br/> - `localhost` <br/> - `192.168.1.10` <br/> | Specifies the hostname or IP address of the message broker used for communication between services. |
| `MESSAGE_CLIENT_PORT` | - `7670` <br/> - `6379` <br/> | Specifies the port number on which the message broker is listening. |
@@ -25,7 +23,6 @@ You can specify these in your .env file or directly in your environment.
| `IMAGE_STORAGE_URI` | `s3://nv-ingest/artifacts/store/images` <br/> | Default fsspec-compatible URI for the `store` task. Supports `s3://`, `file://`, `gs://`, etc. See [Store Extracted Images](nv-ingest-python-api.md#store-extracted-images). |
| `IMAGE_STORAGE_PUBLIC_BASE_URL` | `https://assets.example.com/images` <br/> | Optional HTTP(S) base URL for serving stored images. |


## Library Mode Environment Variables

These environment variables apply specifically when running NV-Ingest in library mode.
@@ -38,4 +35,5 @@

## Related Topics

- [Resource Scaling Modes](scaling-modes.md) — pipeline scaling variables (`INGEST_DISABLE_DYNAMIC_SCALING`, `INGEST_DYNAMIC_MEMORY_THRESHOLD`, `INGEST_STATIC_MEMORY_THRESHOLD`)
- [Configure Ray Logging](https://docs.nvidia.com/nemo/retriever/latest/extraction/ray-logging/)
4 changes: 1 addition & 3 deletions docs/docs/extraction/faq.md
@@ -2,9 +2,7 @@

This documentation contains the Frequently Asked Questions (FAQ) for [NeMo Retriever extraction](overview.md).

!!! note

NeMo Retriever extraction is also known as NVIDIA Ingest and nv-ingest.
For product naming, see [What is NeMo Retriever Extraction?](overview.md).



4 changes: 1 addition & 3 deletions docs/docs/extraction/nemoretriever-parse.md
@@ -10,9 +10,7 @@ to run [NeMo Retriever extraction](overview.md) with nemotron-parse.
- Run the NIM locally by using Docker Compose
- Use NVIDIA Cloud Functions (NVCF) endpoints for cloud-based inference

!!! note

NeMo Retriever extraction is also known as NVIDIA Ingest and nv-ingest.
For product naming, see [What is NeMo Retriever Extraction?](overview.md).


## Limitations
8 changes: 4 additions & 4 deletions docs/docs/extraction/nimclient.md
@@ -3,9 +3,7 @@
The `NimClient` class provides a unified interface for connecting to and interacting with NVIDIA NIM Microservices.
This documentation demonstrates how to create custom NIM integrations for use in [NeMo Retriever extraction](overview.md) pipelines and User Defined Functions (UDFs).

!!! note

NeMo Retriever extraction is also known as NVIDIA Ingest and nv-ingest.
For product naming, see [What is NeMo Retriever Extraction?](overview.md).

The NimClient architecture consists of two main components:

@@ -490,7 +488,9 @@ def batch_image_analysis_udf(control_message: IngestControlMessage) -> IngestControlMessage:

### Environment Variables

Set these environment variables for your NIM endpoints:
For extraction service environment variables (e.g. `NGC_API_KEY`, `MESSAGE_CLIENT_HOST`), see [Environment Variables for NeMo Retriever Extraction](environment-config.md).

For custom NIM integrations, you can set these environment variables for your NIM endpoints:

```bash
# NIM endpoints
```
4 changes: 1 addition & 3 deletions docs/docs/extraction/notebooks.md
@@ -2,9 +2,7 @@

To get started using [NeMo Retriever extraction](overview.md), you can try one of the ready-made notebooks that are available.

!!! note

NeMo Retriever extraction is also known as NVIDIA Ingest and nv-ingest.
For product naming, see [What is NeMo Retriever Extraction?](overview.md).


## Dataset Downloads for Benchmarking