docs: sync 26.05 docs/docs with main #2179
base: 26.05
| Original file line number | Diff line number | Diff line change | ||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
@@ -63,7 +63,7 @@ This pipeline enables retrieval at the speech segment level when you enable segm | |||||||||||||||
|
|
||||||||||||||||
| Use the following procedure to run the NIM on your own infrastructure. Self-hosted Parakeet runs on Kubernetes via the [NeMo Retriever Helm chart](https://github.com/NVIDIA/NeMo-Retriever/blob/main/nemo_retriever/helm/README.md). Enable the ASR NIM per [Optional Helm NIMs](prerequisites-support-matrix.md#optional-helm-nims-not-auto-wired-by-default) and the [Helm chart — NIM operator sub-stack](https://github.com/NVIDIA/NeMo-Retriever/blob/main/nemo_retriever/helm/README.md#nim-operator-sub-stack); pin the workload to a dedicated GPU and wire the ASR endpoint in your pipeline. | ||||||||||||||||
|
|
||||||||||||||||
| !!! important | ||||||||||||||||
| After deploy, call the pipeline from Python: | ||||||||||||||||
|
|
||||||||||||||||
| Pin the Parakeet workload to the dedicated GPU with your Helm values or the [NIM Operator](https://docs.nvidia.com/nim-operator/latest/index.html) (for example, node selectors, resource limits, or device requests appropriate to your cluster). | ||||||||||||||||
|
|
||||||||||||||||
|
|
@@ -87,15 +87,14 @@ Use the following procedure to run the NIM on your own infrastructure. Self-host | |||||||||||||||
| asr_params=ASRParams(segment_audio=True), | ||||||||||||||||
| ) | ||||||||||||||||
| ) | ||||||||||||||||
| ``` | ||||||||||||||||
| ) | ||||||||||||||||
| ``` | ||||||||||||||||
|
Comment on lines 88 to +91 (Contributor):
The closing
Suggested change
|
||||||||||||||||
|
|
||||||||||||||||
| To generate one extracted element for each sentence-like ASR segment, include `extract_audio_params={"segment_audio": True}` when calling `.extract(...)`. This option applies when audio extraction runs with a self-hosted Parakeet NIM or using build.nvidia.com hosted inference, but has no effect when using the local Hugging Face Parakeet model. | ||||||||||||||||
|
|
||||||||||||||||
| To generate one extracted element for each sentence-like ASR segment, pass `asr_params=ASRParams(segment_audio=True)` to `.extract_audio(...)`. This option applies when audio extraction runs with a self-hosted Parakeet NIM or using build.nvidia.com hosted inference, but has no effect when using the local Hugging Face Parakeet model. | ||||||||||||||||
|
|
||||||||||||||||
|
|
||||||||||||||||
| !!! tip | ||||||||||||||||
|
|
||||||||||||||||
| For more Python examples, refer to [Python Quick Start Guide](https://github.com/NVIDIA/NeMo-Retriever/blob/main/client/client_examples/examples/python_client_usage.ipynb). | ||||||||||||||||
| For more Python examples, refer to [Python Quick Start Guide](https://github.com/NVIDIA/NeMo-Retriever/blob/main/client/client_examples/examples/python_client_usage.ipynb). | ||||||||||||||||
|
Comment on lines +93 to +97 (Contributor):
Path: docs/docs/extraction/audio-video.md
Line: 93-97
**Duplicate near-identical `segment_audio` paragraphs with conflicting API names**
Line 93 (unindented) says to use `extract_audio_params={"segment_audio": True}` with `.extract(...)`, while line 95 (indented continuation of step 3) says to use `asr_params=ASRParams(segment_audio=True)` with `.extract_audio(...)`. These look like two different API call styles that both appeared after the admonition block was removed. One of them should be removed, or it should be clarified which applies to library mode vs. the service ingestor.
||||||||||||||||
|
|
||||||||||||||||
| ## Parakeet with hosted inference (build.nvidia.com) { #parakeet-hosted-inference-build-nvidia } | ||||||||||||||||
|
|
||||||||||||||||
|
|
||||||||||||||||
| Original file line number | Diff line number | Diff line change | ||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| @@ -1,74 +1,42 @@ | ||||||||||||||||||||||
| # Use Custom Metadata to Filter Search Results | ||||||||||||||||||||||
| # Custom metadata and filtering | ||||||||||||||||||||||
|
|
||||||||||||||||||||||
| You can upload custom metadata for documents during ingestion. | ||||||||||||||||||||||
| By uploading custom metadata you can attach additional information to documents, | ||||||||||||||||||||||
| and use it for filtering results during retrieval operations. | ||||||||||||||||||||||
| For example, you can add author metadata to your documents, and filter by author when you retrieve results. | ||||||||||||||||||||||
| To create filters at query time, use predicates supported by [LanceDB SQL](https://lancedb.github.io/lancedb/sql/) against your table schema (custom fields are serialized into the `metadata` column with your ingested chunks). For a worked example, see the repository notebook linked at the end of this page. | ||||||||||||||||||||||
| Use this documentation to attach per-document metadata during ingestion and to narrow [LanceDB](vdbs.md) search results in [NeMo Retriever Library](overview.md). Implementation details live in the package [Vector DB operators and LanceDB](https://github.com/NVIDIA/NeMo-Retriever/tree/main/nemo_retriever/src/nemo_retriever/vdb#metadata-filtering) README. | ||||||||||||||||||||||
|
|
||||||||||||||||||||||
| Use this documentation to use custom metadata to filter search results when you work with [NeMo Retriever Library](overview.md). | ||||||||||||||||||||||
| ## On this page { #on-this-page } | ||||||||||||||||||||||
|
|
||||||||||||||||||||||
| - [Attach metadata at ingestion](#attach-metadata-at-ingestion) | ||||||||||||||||||||||
| - [How metadata is stored](#how-metadata-is-stored) | ||||||||||||||||||||||
| - [Filter results at query time](#filter-results-at-query-time) | ||||||||||||||||||||||
| - [Writing `where` predicates](#writing-where-predicates) | ||||||||||||||||||||||
| - [Server-side vs client-side filters](#server-side-vs-client-side-filters) | ||||||||||||||||||||||
| - [Inspect hit metadata](#inspect-hit-metadata) | ||||||||||||||||||||||
| - [Limitations](#limitations) | ||||||||||||||||||||||
| - [Related content](#related-content) | ||||||||||||||||||||||
|
Comment on lines +5 to +14 (Contributor):
Path: docs/docs/extraction/custom-metadata.md
Line: 5-14
**"On this page" TOC contains 6 broken anchor links**
The table of contents added in this PR references `#filter-results-at-query-time`, `#writing-where-predicates`, `#server-side-vs-client-side-filters`, `#inspect-hit-metadata`, `#limitations`, and `#related-content`. None of these section headings exist in the current file body (128 lines). The body still contains the old 26.05 structure (`## Best Practices`, `## Use Custom Metadata to Filter Results During Retrieval`, etc.) rather than the restructured sections the TOC was written for. Clicking any of these six links in the published docs will silently scroll to the top of the page.
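A check like the one described in this comment can be automated before merge. The following is a minimal sketch using only the Python standard library; the function names are hypothetical, and the slug rule only approximates how MkDocs derives heading anchors (lowercase, punctuation dropped, spaces hyphenated), so treat its output as a hint rather than a definitive verdict.

```python
import re

# Headings such as "## Title" or "## Title { #explicit-id }".
HEADING = re.compile(r"^#{1,6}\s+(.*)$")
# Explicit anchors written as "{ #some-id }" at the end of a heading.
EXPLICIT_ID = re.compile(r"\{\s*#([A-Za-z0-9_-]+)\s*\}\s*$")
# In-page links such as "[Limitations](#limitations)".
IN_PAGE_LINK = re.compile(r"\]\(#([^)\s]+)\)")


def slugify(title):
    """Approximate a heading slug (assumption: not the exact MkDocs rule)."""
    title = re.sub(r"[`*]", "", title).strip().lower()
    title = re.sub(r"[^\w\s-]", "", title)
    return re.sub(r"[-\s]+", "-", title)


def heading_anchors(markdown_text):
    """Collect the anchors that headings outside code fences would define."""
    anchors = set()
    in_fence = False
    for line in markdown_text.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
            continue
        match = None if in_fence else HEADING.match(line)
        if not match:
            continue
        title = match.group(1).strip()
        explicit = EXPLICIT_ID.search(title)
        anchors.add(explicit.group(1) if explicit else slugify(title))
    return anchors


def broken_in_page_links(markdown_text):
    """Return in-page link targets that no heading defines, sorted."""
    anchors = heading_anchors(markdown_text)
    targets = IN_PAGE_LINK.findall(markdown_text)
    return sorted({t for t in targets if t not in anchors})


sample = """# Custom metadata and filtering

- [Attach metadata at ingestion](#attach-metadata-at-ingestion)
- [Limitations](#limitations)

## Attach metadata at ingestion { #attach-metadata-at-ingestion }

## Best Practices
"""
print(broken_in_page_links(sample))
```

Run against `docs/docs/extraction/custom-metadata.md`, a script of this shape should list the six anchors named in the comment, assuming the file body is as the reviewer describes.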
||||||||||||||||||||||
|
|
||||||||||||||||||||||
| ## Limitations | ||||||||||||||||||||||
| ## Attach metadata at ingestion { #attach-metadata-at-ingestion } | ||||||||||||||||||||||
|
|
||||||||||||||||||||||
| The following are limitation when you use custom metadata: | ||||||||||||||||||||||
| Pass a **sidecar metadata table** on `vdb_upload` so selected columns are merged into each chunk's `content_metadata` before LanceDB upload. All three parameters must be set together: | ||||||||||||||||||||||
|
|
||||||||||||||||||||||
| - Metadata fields must be consistent across documents in the same collection. | ||||||||||||||||||||||
| - Complex filter expressions may impact retrieval performance. | ||||||||||||||||||||||
| - If you update your custom metadata, you must ingest your documents again to use the new metadata. | ||||||||||||||||||||||
| | Parameter | Purpose | | ||||||||||||||||||||||
| |-----------|---------| | ||||||||||||||||||||||
| | `meta_dataframe` | Path to CSV, JSON, or Parquet, or an in-memory `pandas.DataFrame` | | ||||||||||||||||||||||
| | `meta_source_field` | Column that identifies each document (must match ingest paths or basenames per `meta_join_key`) | | ||||||||||||||||||||||
| | `meta_fields` | Non-empty list of column names to copy into `content_metadata` | | ||||||||||||||||||||||
|
|
||||||||||||||||||||||
|
|
||||||||||||||||||||||
|
|
||||||||||||||||||||||
| ## Add Custom Metadata During Ingestion | ||||||||||||||||||||||
|
|
||||||||||||||||||||||
| You can add custom metadata during the document ingestion process. | ||||||||||||||||||||||
| You can specify metadata for each file, | ||||||||||||||||||||||
| and you can specify different metadata for different documents in the same ingestion batch. | ||||||||||||||||||||||
|
|
||||||||||||||||||||||
|
|
||||||||||||||||||||||
| ### Metadata Structure | ||||||||||||||||||||||
|
|
||||||||||||||||||||||
| You specify custom metadata as a dataframe or a file (json, csv, or parquet). | ||||||||||||||||||||||
|
|
||||||||||||||||||||||
| The following example contains metadata fields for category, department, and timestamp. | ||||||||||||||||||||||
| You can create whatever metadata is helpful for your scenario. | ||||||||||||||||||||||
| Optional `meta_join_key` controls how rows are matched to documents: `auto` (try full path then basename), `source_id` (full path), or `source_name` (basename only). | ||||||||||||||||||||||
|
|
||||||||||||||||||||||
| ```python | ||||||||||||||||||||||
| import pandas as pd | ||||||||||||||||||||||
| from nemo_retriever import create_ingestor | ||||||||||||||||||||||
|
|
||||||||||||||||||||||
| meta_df = pd.DataFrame( | ||||||||||||||||||||||
| { | ||||||||||||||||||||||
| "source": ["data/woods_frost.pdf", "data/multimodal_test.pdf"], | ||||||||||||||||||||||
| "category": ["Alpha", "Bravo"], | ||||||||||||||||||||||
| "department": ["Language", "Engineering"], | ||||||||||||||||||||||
| "timestamp": ["2025-05-01T00:00:00", "2025-05-02T00:00:00"] | ||||||||||||||||||||||
| "meta_a": ["alpha", "bravo"], | ||||||||||||||||||||||
| "meta_b": [10, 20], | ||||||||||||||||||||||
| } | ||||||||||||||||||||||
| ) | ||||||||||||||||||||||
|
|
||||||||||||||||||||||
| # Convert the dataframe to a csv file, | ||||||||||||||||||||||
| # to demonstrate how to ingest a metadata file in a later step. | ||||||||||||||||||||||
|
|
||||||||||||||||||||||
| file_path = "./meta_file.csv" | ||||||||||||||||||||||
| meta_df.to_csv(file_path) | ||||||||||||||||||||||
| ``` | ||||||||||||||||||||||
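The `meta_join_key` modes described in this hunk (`auto`, `source_id`, `source_name`) can be illustrated with a small standalone sketch. This is not the library's implementation: `match_metadata_row` is a hypothetical helper, a list of dicts stands in for the sidecar table, and comparing basenames on both sides for `source_name` is an assumption about how "basename only" behaves.

```python
import os


def match_metadata_row(doc_path, rows, source_field, join_key="auto"):
    """Return the first sidecar row that matches doc_path, or None.

    Illustrative only; the real matching logic lives in the library.
    """
    if join_key not in {"auto", "source_id", "source_name"}:
        raise ValueError(f"unsupported join key: {join_key}")

    def by_full_path():
        return next((r for r in rows if r[source_field] == doc_path), None)

    def by_basename():
        name = os.path.basename(doc_path)
        return next(
            (r for r in rows if os.path.basename(r[source_field]) == name),
            None,
        )

    if join_key == "source_id":
        return by_full_path()
    if join_key == "source_name":
        return by_basename()
    # "auto": try the full path first, then fall back to the basename.
    row = by_full_path()
    return row if row is not None else by_basename()


rows = [
    {"source": "data/woods_frost.pdf", "meta_a": "alpha", "meta_b": 10},
    {"source": "data/multimodal_test.pdf", "meta_a": "bravo", "meta_b": 20},
]

# The ingest path differs from the sidecar path, so only the basename matches.
print(match_metadata_row("/mnt/ingest/woods_frost.pdf", rows, "source"))
print(match_metadata_row("/mnt/ingest/woods_frost.pdf", rows, "source", "source_id"))
```

The design point is that `auto` tolerates sidecar files written with relative paths, while `source_id` fails closed when the paths differ.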
|
|
||||||||||||||||||||||
|
|
||||||||||||||||||||||
| ### Example: Add Custom Metadata During Ingestion | ||||||||||||||||||||||
|
|
||||||||||||||||||||||
| The following example adds custom metadata during ingestion. | ||||||||||||||||||||||
| For more information about `create_ingestor` and run modes, refer to [Use the Python API](nemo-retriever-api-reference.md). | ||||||||||||||||||||||
| For more information about the `vdb_upload` method, refer to [Upload Data](vdbs.md). | ||||||||||||||||||||||
|
|
||||||||||||||||||||||
| ```python | ||||||||||||||||||||||
| from nemo_retriever import create_ingestor | ||||||||||||||||||||||
|
|
||||||||||||||||||||||
| # Service-backed pipeline: point `base_url` at your running retriever service. | ||||||||||||||||||||||
| # For local graph execution instead, see [Use the Python API](nemo-retriever-api-reference.md). | ||||||||||||||||||||||
|
|
||||||||||||||||||||||
| hostname = "localhost" | ||||||||||||||||||||||
| table_name = "nemo_retriever_collection" | ||||||||||||||||||||||
| lancedb_uri = "./lancedb_data" | ||||||||||||||||||||||
|
|
||||||||||||||||||||||
| ingestor = ( | ||||||||||||||||||||||
| create_ingestor(run_mode="service", base_url=f"http://{hostname}:7670") | ||||||||||||||||||||||
| .files(["data/woods_frost.pdf", "data/multimodal_test.pdf"]) | ||||||||||||||||||||||
|
Comment on lines 40 to 42 (Contributor):
Path: docs/docs/extraction/custom-metadata.md
Line: 40-42
**Undefined variables make the code example un-runnable**
The diff removes the `hostname`, `table_name`, and `lancedb_uri` variable definitions that previously preceded the `ingestor = (...)` block, but the `create_ingestor(...)` call still references all three. Copying this snippet results in a `NameError` on `hostname`. The variable definitions need to be restored.
```suggestion
hostname = "localhost"
table_name = "nemo_retriever_collection"
lancedb_uri = "./lancedb_data"
ingestor = (
    create_ingestor(run_mode="service", base_url=f"http://{hostname}:7670")
    .files(["data/woods_frost.pdf", "data/multimodal_test.pdf"])
```
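A `NameError` of this kind can be caught statically before a docs snippet ships. The sketch below uses the standard-library `ast` module; `undefined_names` is a hypothetical helper, and it is deliberately coarse: it ignores ordering and scoping and only reports names that are read but never assigned, imported, or defined anywhere in the snippet.

```python
import ast
import builtins


def undefined_names(source):
    """Return names that are read in source but never defined in it."""
    tree = ast.parse(source)
    defined = set(dir(builtins))
    loaded = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            for alias in node.names:
                # "import a.b" binds "a"; "import a as x" binds "x".
                defined.add((alias.asname or alias.name).split(".")[0])
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            defined.add(node.name)
        elif isinstance(node, ast.arg):
            defined.add(node.arg)
        elif isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                defined.add(node.id)
            else:
                loaded.append(node.id)
    return sorted({name for name in loaded if name not in defined})


# A complete variant of the snippet under review, without the variable definitions.
snippet = '''
from nemo_retriever import create_ingestor

ingestor = (
    create_ingestor(run_mode="service", base_url=f"http://{hostname}:7670")
    .files(["data/woods_frost.pdf", "data/multimodal_test.pdf"])
)
'''
print(undefined_names(snippet))
```

The snippet is only parsed, never executed, so the check does not need `nemo_retriever` to be installed.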
||||||||||||||||||||||
|
|
@@ -150,9 +118,11 @@ hits = retriever.query( | |||||||||||||||||||||
| ) | ||||||||||||||||||||||
| ``` | ||||||||||||||||||||||
|
|
||||||||||||||||||||||
| For a runnable end-to-end flow (ingest, `Retriever.query`, and both filter modes), see [nemo_retriever_retriever_query_metadata_filter.ipynb](https://github.com/NVIDIA/NeMo-Retriever/blob/main/examples/nemo_retriever_retriever_query_metadata_filter.ipynb). | ||||||||||||||||||||||
|
|
||||||||||||||||||||||
| When you ingest through the **retriever service**, upload the sidecar with [`POST /v1/ingest/sidecar`](https://github.com/NVIDIA/NeMo-Retriever/blob/main/nemo_retriever/src/nemo_retriever/service/routers/ingest.py#L1040-L1129) (multipart file; response [`SidecarUploadResponse`](https://github.com/NVIDIA/NeMo-Retriever/blob/main/nemo_retriever/src/nemo_retriever/service/models/responses.py#L60-L68)), then pass the returned `sidecar_id` as `meta_dataframe_id` with `meta_source_field` and `meta_fields` in `pipeline.vdb_upload_params` on [`POST /v1/ingest`](https://github.com/NVIDIA/NeMo-Retriever/blob/main/nemo_retriever/src/nemo_retriever/service/models/requests.py#L15-L32) ([`PipelineSpec`](https://github.com/NVIDIA/NeMo-Retriever/blob/main/nemo_retriever/src/nemo_retriever/service/models/pipeline_spec.py#L55-L78)). Request and response shapes, form fields, and auth headers are in the service OpenAPI UI at `/docs` (or `/openapi.json`) on your retriever base URL (for example `http://localhost:7670/docs` after `retriever service start`). Do not send a raw local path as `meta_dataframe` on the service spec. | ||||||||||||||||||||||
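The second step of the service flow described in this row (passing the returned `sidecar_id` as `meta_dataframe_id`) could be assembled along these lines. This is a sketch under stated assumptions: `build_vdb_upload_spec` is a hypothetical helper, the nesting is inferred only from the dotted path `pipeline.vdb_upload_params`, and the `"sidecar-123"` value is a placeholder. The authoritative request shape is the OpenAPI document served at `/docs` on the retriever base URL.

```python
import json


def build_vdb_upload_spec(sidecar_id, meta_source_field, meta_fields):
    """Build the metadata part of an ingest request body (assumed shape)."""
    if not sidecar_id:
        raise ValueError("sidecar_id is required; upload the sidecar first")
    if not meta_fields:
        raise ValueError("meta_fields must be a non-empty list")
    return {
        "pipeline": {
            "vdb_upload_params": {
                # The ID returned by the sidecar upload, never a local file path.
                "meta_dataframe_id": sidecar_id,
                "meta_source_field": meta_source_field,
                "meta_fields": list(meta_fields),
            }
        }
    }


spec = build_vdb_upload_spec("sidecar-123", "source", ["meta_a", "meta_b"])
print(json.dumps(spec, indent=2))
```

Keeping the three metadata parameters in one builder mirrors the rule that they must be set together.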
|
|
||||||||||||||||||||||
| ## Related Content | ||||||||||||||||||||||
| ## How metadata is stored { #how-metadata-is-stored } | ||||||||||||||||||||||
|
|
||||||||||||||||||||||
| - [Vector databases](vdbs.md) — canonical LanceDB upload and retrieval guide | ||||||||||||||||||||||
| - [metadata_and_filtered_search.ipynb](https://github.com/NVIDIA/NeMo-Retriever/blob/main/examples/metadata_and_filtered_search.ipynb) — CLI and graph ingest with sidecar metadata | ||||||||||||||||||||||
|
Comment on lines +125 to 128 (Contributor):
Path: docs/docs/extraction/custom-metadata.md
Line: 125-128
**Section heading "How metadata is stored" contains only cross-reference bullets**
The heading at line 125 was renamed from `## Related Content` to `## How metadata is stored`, but its body was not updated; it still contains just two reference links. Readers navigating to this section via the TOC will find no explanation of how metadata is persisted (e.g., serialized into the `metadata` column, how `content_metadata` fields are mapped). Either restore the "Related content" heading or replace the bullets with the intended storage explanation.
||||||||||||||||||||||
Comment (Contributor):
After the `!!! important` admonition was removed, the paragraph at line 68 (`Pin the Parakeet workload…`) is now indented by 4 spaces with no enclosing list item. In Markdown (including MkDocs Material), a 4-space-indented paragraph outside a list context is treated as an indented code block, so this critical deployment warning will render as `<pre><code>` text rather than readable prose. Readers following the setup steps will miss the GPU pinning requirement entirely.
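Stray indentation of this kind is easy to miss in review, so a rough lint can help. The sketch below is a heuristic rather than a Markdown parser: `stray_indented_lines` is a hypothetical helper that flags a line indented by four spaces (or a tab) when it follows a blank line and no list item or admonition is open above it.

```python
import re

LIST_ITEM = re.compile(r"^\s*(?:[-*+]|\d+[.)])\s+")
ADMONITION = re.compile(r"^\s*(?:!!!|\?\?\?)")


def stray_indented_lines(markdown_text):
    """Return 1-based numbers of lines likely to render as indented code."""
    flagged = []
    in_fence = False
    container_open = False  # inside a list item or admonition body
    previous_blank = True   # the start of the document counts as blank
    for number, line in enumerate(markdown_text.splitlines(), start=1):
        stripped = line.strip()
        if stripped.startswith("```"):
            in_fence = not in_fence
            previous_blank = False
            continue
        if in_fence:
            continue
        if not stripped:
            previous_blank = True
            continue
        indented = line.startswith("    ") or line.startswith("\t")
        if LIST_ITEM.match(line) or ADMONITION.match(line):
            container_open = True
        elif not indented:
            container_open = False
        elif not container_open and previous_blank:
            flagged.append(number)
        previous_blank = False
    return flagged


sample = (
    "Use the following procedure.\n"
    "\n"
    "    Pin the Parakeet workload to the dedicated GPU.\n"
    "\n"
    "1. Deploy the chart.\n"
    "\n"
    "    Wire the ASR endpoint.\n"
)
print(stray_indented_lines(sample))
```

In the sample, the indented paragraph under the numbered step is accepted as a list continuation, while the one after plain prose is flagged.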