align metadata docs with VDB filtering guide #2108
Open: kheiss-uwzoo wants to merge 3 commits into NVIDIA:main from kheiss-uwzoo:kheiss/nuke
Commits:
- f34c93f docs(extraction): align custom-metadata with VDB filtering guide (kheiss-uwzoo)
- 79b7f6d docs(extraction): point vdbs.md metadata to custom-metadata and VDB R… (kheiss-uwzoo)
- a61a137 docs(extraction): link service sidecar ingest to OpenAPI and models (kheiss-uwzoo)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,175 +1,142 @@ | ||
| # Use Custom Metadata to Filter Search Results | ||
| # Custom metadata and filtering | ||
|
|
||
| You can upload custom metadata for documents during ingestion. | ||
| By uploading custom metadata you can attach additional information to documents, | ||
| and use it for filtering results during retrieval operations. | ||
| For example, you can add author metadata to your documents, and filter by author when you retrieve results. | ||
| To create filters at query time, use predicates supported by [LanceDB SQL](https://lancedb.github.io/lancedb/sql/) against your table schema (custom fields are serialized into the `metadata` column with your ingested chunks). For a worked example, see the repository notebook linked at the end of this page. | ||
| Use this documentation to attach per-document metadata during ingestion and to narrow [LanceDB](vdbs.md) search results in [NeMo Retriever Library](overview.md). Implementation details live in the package [Vector DB operators and LanceDB](https://github.com/NVIDIA/NeMo-Retriever/tree/main/nemo_retriever/src/nemo_retriever/vdb#metadata-filtering) README. | ||
|
|
||
| Use this documentation to use custom metadata to filter search results when you work with [NeMo Retriever Library](overview.md). | ||
| ## On this page { #on-this-page } | ||
|
|
||
| - [Attach metadata at ingestion](#attach-metadata-at-ingestion) | ||
| - [How metadata is stored](#how-metadata-is-stored) | ||
| - [Filter results at query time](#filter-results-at-query-time) | ||
| - [Writing `where` predicates](#writing-where-predicates) | ||
| - [Server-side vs client-side filters](#server-side-vs-client-side-filters) | ||
| - [Inspect hit metadata](#inspect-hit-metadata) | ||
| - [Limitations](#limitations) | ||
| - [Related content](#related-content) | ||
|
|
||
| ## Limitations | ||
| ## Attach metadata at ingestion { #attach-metadata-at-ingestion } | ||
|
|
||
| The following are limitation when you use custom metadata: | ||
| Pass a **sidecar metadata table** on `vdb_upload` so selected columns are merged into each chunk's `content_metadata` before LanceDB upload. All three parameters must be set together: | ||
|
|
||
| - Metadata fields must be consistent across documents in the same collection. | ||
| - Complex filter expressions may impact retrieval performance. | ||
| - If you update your custom metadata, you must ingest your documents again to use the new metadata. | ||
| | Parameter | Purpose | | ||
| |-----------|---------| | ||
| | `meta_dataframe` | Path to CSV, JSON, or Parquet, or an in-memory `pandas.DataFrame` | | ||
| | `meta_source_field` | Column that identifies each document (must match ingest paths or basenames per `meta_join_key`) | | ||
| | `meta_fields` | Non-empty list of column names to copy into `content_metadata` | | ||
|
|
||
|
|
||
|
|
||
| ## Add Custom Metadata During Ingestion | ||
|
|
||
| You can add custom metadata during the document ingestion process. | ||
| You can specify metadata for each file, | ||
| and you can specify different metadata for different documents in the same ingestion batch. | ||
|
|
||
|
|
||
| ### Metadata Structure | ||
|
|
||
| You specify custom metadata as a dataframe or a file (json, csv, or parquet). | ||
|
|
||
| The following example contains metadata fields for category, department, and timestamp. | ||
| You can create whatever metadata is helpful for your scenario. | ||
| Optional `meta_join_key` controls how rows are matched to documents: `auto` (try full path then basename), `source_id` (full path), or `source_name` (basename only). | ||
|
|
||
| ```python | ||
| import pandas as pd | ||
| from nemo_retriever import create_ingestor | ||
|
|
||
| meta_df = pd.DataFrame( | ||
| { | ||
| "source": ["data/woods_frost.pdf", "data/multimodal_test.pdf"], | ||
| "category": ["Alpha", "Bravo"], | ||
| "department": ["Language", "Engineering"], | ||
| "timestamp": ["2025-05-01T00:00:00", "2025-05-02T00:00:00"] | ||
| "meta_a": ["alpha", "bravo"], | ||
| "meta_b": [10, 20], | ||
| } | ||
| ) | ||
|
|
||
| # Convert the dataframe to a csv file, | ||
| # to demonstrate how to ingest a metadata file in a later step. | ||
|
|
||
| file_path = "./meta_file.csv" | ||
| meta_df.to_csv(file_path) | ||
| ingestor = ( | ||
| create_ingestor(run_mode="batch") | ||
| .files(["data/woods_frost.pdf", "data/multimodal_test.pdf"]) | ||
| .extract(extract_text=True, text_depth="page") | ||
| .embed() | ||
| .vdb_upload( | ||
| vdb_op="lancedb", | ||
| uri="./lancedb_data", | ||
| table_name="nemo-retriever", | ||
| meta_dataframe=meta_df, | ||
| meta_source_field="source", | ||
| meta_fields=["meta_a", "meta_b"], | ||
| ) | ||
| ) | ||
| ingestor.ingest() | ||
| ``` | ||
|
|
||
| For a runnable end-to-end flow (ingest, `Retriever.query`, and both filter modes), see [nemo_retriever_retriever_query_metadata_filter.ipynb](https://github.com/NVIDIA/NeMo-Retriever/blob/main/examples/nemo_retriever_retriever_query_metadata_filter.ipynb). | ||
|
|
||
| ### Example: Add Custom Metadata During Ingestion | ||
| When you ingest through the **retriever service**, upload the sidecar with [`POST /v1/ingest/sidecar`](https://github.com/NVIDIA/NeMo-Retriever/blob/main/nemo_retriever/src/nemo_retriever/service/routers/ingest.py#L1040-L1129) (multipart file; response [`SidecarUploadResponse`](https://github.com/NVIDIA/NeMo-Retriever/blob/main/nemo_retriever/src/nemo_retriever/service/models/responses.py#L60-L68)), then pass the returned `sidecar_id` as `meta_dataframe_id` with `meta_source_field` and `meta_fields` in `pipeline.vdb_upload_params` on [`POST /v1/ingest`](https://github.com/NVIDIA/NeMo-Retriever/blob/main/nemo_retriever/src/nemo_retriever/service/models/requests.py#L15-L32) ([`PipelineSpec`](https://github.com/NVIDIA/NeMo-Retriever/blob/main/nemo_retriever/src/nemo_retriever/service/models/pipeline_spec.py#L55-L78)). Request and response shapes, form fields, and auth headers are in the service OpenAPI UI at `/docs` (or `/openapi.json`) on your retriever base URL (for example `http://localhost:7670/docs` after `retriever service start`). Do not send a raw local path as `meta_dataframe` on the service spec. | ||
|
|
||
| The following example adds custom metadata during ingestion. | ||
| For more information about `create_ingestor` and run modes, refer to [Use the Python API](nemo-retriever-api-reference.md). | ||
| For more information about the `vdb_upload` method, refer to [Upload Data](vdbs.md). | ||
| ## How metadata is stored { #how-metadata-is-stored } | ||
|
|
||
| ```python | ||
| from nemo_retriever import create_ingestor | ||
| During ingestion, each chunk's `content_metadata` is serialized as a **compact JSON string** (no spaces after `:` or `,`) in the LanceDB `metadata` column. Sidecar columns are merged into that JSON object before upload, so custom keys live in the same string — not in separate table columns. SQL filters on custom fields therefore use `LIKE` against JSON substrings rather than a dedicated JSON operator. | ||
|
|
||
| # Service-backed pipeline: point `base_url` at your running retriever service. | ||
| # For local graph execution instead, see [Use the Python API](nemo-retriever-api-reference.md). | ||
| The `source` column stores the document path separately from the metadata JSON. | ||
|
|
||
| hostname = "localhost" | ||
| table_name = "nemo_retriever_collection" | ||
| lancedb_uri = "./lancedb_data" | ||
| ## Filter results at query time { #filter-results-at-query-time } | ||
|
|
||
| ingestor = ( | ||
| create_ingestor(run_mode="service", base_url=f"http://{hostname}:7670") | ||
| .files(["data/woods_frost.pdf", "data/multimodal_test.pdf"]) | ||
| .extract( | ||
| extract_text=True, | ||
| extract_tables=True, | ||
| extract_charts=True, | ||
| extract_images=True, | ||
| text_depth="page" | ||
| ) | ||
| .embed() | ||
| .vdb_upload( | ||
| vdb_op="lancedb", | ||
| uri=lancedb_uri, | ||
| table_name=table_name, | ||
| hybrid=False, | ||
| ) | ||
| ) | ||
| results = ingestor.ingest_async().result() | ||
| ``` | ||
| Two complementary mechanisms narrow `Retriever.query` results: | ||
|
|
||
| Merge values from `meta_df` (or `file_path`) into each document's `content_metadata` before `vdb_upload`, or follow the step-by-step pattern in [metadata_and_filtered_search.ipynb](https://github.com/NVIDIA/NeMo-Retriever/blob/main/examples/metadata_and_filtered_search.ipynb), so category, department, and timestamp are present on the chunks LanceDB indexes. | ||
| 1. **Server-side (`where`)** — Pass a Lance / DataFusion SQL predicate in `vdb_kwargs` per call (or as defaults on the `Retriever`). LanceDB applies it as a `.where(...)` clause on vector search. **`_filter`** is accepted as an alias for `where`. | ||
| 2. **Client-side** — Use `filter_hits_by_content_metadata(hits, predicate)` after retrieval to keep rows whose parsed `content_metadata` satisfies arbitrary Python logic. | ||
|
|
||
| ## Best Practices | ||
|
|
||
| The following are the best practices when you work with custom metadata: | ||
| ```python | ||
| from nemo_retriever.retriever import Retriever | ||
|
|
||
| - Plan metadata structure before ingestion. | ||
| - Test filter expressions with small datasets first. | ||
| - Consider performance implications of complex filters. | ||
| - Validate metadata during ingestion. | ||
| - Handle missing metadata fields gracefully. | ||
| - Log invalid filter expressions. | ||
| retriever = Retriever( | ||
| vdb="lancedb", | ||
| vdb_kwargs={"uri": "./lancedb_data", "table_name": "nemo-retriever"}, | ||
| embedder="nvidia/llama-nemotron-embed-1b-v2", | ||
| ) | ||
|
|
||
| hits = retriever.query( | ||
| "budget assumptions", | ||
| top_k=16, | ||
| vdb_kwargs={"where": "metadata LIKE '%\"meta_a\":\"bravo\"%'"}, | ||
| ) | ||
| ``` | ||
|
|
||
| ## Writing `where` predicates { #writing-where-predicates } | ||
|
|
||
| ## Use Custom Metadata to Filter Results During Retrieval | ||
| LanceDB evaluates `where` as DataFusion SQL over columns `vector`, `text`, `metadata`, and `source`: | ||
|
|
||
| You can use custom metadata to filter documents during retrieval operations. | ||
| For **predicate pushdown**, use [LanceDB SQL](https://lancedb.github.io/lancedb/sql/) on an opened table (see the native query sketch below). The **`lancedb_retrieval` helper does not accept a server-side filter**: it always returns up to `top_k` hits from the index, so any list comprehension over those hits is **application-side only**—raise `top_k` if your matches might sit outside the first `top_k` neighbors, or use a native `table.search(...).where(...)` query instead. | ||
| ```python | ||
| # Match a sidecar string field (compact JSON: "key":"value") | ||
| where = "metadata LIKE '%\"meta_a\":\"alpha\"%'" | ||
|
|
||
| # Match a numeric metadata field — numbers serialize without quotes | ||
| where = "metadata LIKE '%\"meta_b\":10%'" | ||
|
|
||
| ### Example filter ideas | ||
| # Combine predicates | ||
| where = "metadata LIKE '%\"meta_a\":\"bravo\"%' AND metadata LIKE '%\"meta_b\":10%'" | ||
|
|
||
| Typical keys to filter on include `category`, `department`, `priority`, and `timestamp` (use comparable ISO-8601 strings for time ranges). Encode predicates in LanceDB SQL against your table columns (often the serialized `metadata` string), or inspect `hit["entity"]["content_metadata"]` after search as in the `lancedb_retrieval` example below. | ||
| # Filter on the source column directly | ||
| where = "source LIKE '%annual_report%'" | ||
| ``` | ||
|
|
||
| ### Example: Use a Filter Expression in Search | ||
| Escape single quotes in SQL strings by doubling them (`''`). Because matching is substring-based, include the JSON key (`"meta_a":` rather than only `alpha`) to avoid false positives. | ||
|
|
||
| After ingestion is complete, and documents are uploaded to LanceDB with metadata, | ||
| you can narrow results in the database with a **`where`** clause, or in Python on the returned hits. | ||
| ## Server-side vs client-side filters { #server-side-vs-client-side-filters } | ||
|
|
||
| **Native LanceDB (SQL pushdown):** connect, embed the query yourself (same model as ingestion), then chain `.where("<LanceDB SQL predicate>")` on `table.search(...)` so filtering happens before the `limit`. Exact SQL depends on how `metadata` is stored; see [LanceDB SQL](https://lancedb.github.io/lancedb/sql/). | ||
| Use **`where`** when the predicate fits SQL and you want LanceDB to prune candidates before vector ranking. Use **`filter_hits_by_content_metadata`** when the predicate is easier in Python (combined numeric ranges, set membership, or fields that need parsing). They compose: run a wider `top_k` with `where`, then post-filter for finer logic. | ||
|
|
||
| ```python | ||
| import lancedb | ||
| from nemo_retriever.vdb import filter_hits_by_content_metadata | ||
|
|
||
| # Pseudocode sketch — replace YOUR_VECTOR and YOUR_PREDICATE with real values. | ||
| db = lancedb.connect("./lancedb_data") | ||
| table = db.open_table("nemo_retriever_collection") | ||
| # table.search(YOUR_VECTOR, vector_column_name="vector").where(YOUR_PREDICATE).limit(10).to_list() | ||
| hits = retriever.query( | ||
| "budget assumptions", | ||
| top_k=16, | ||
| vdb_kwargs={"where": "metadata LIKE '%\"meta_a\":\"bravo\"%'"}, | ||
| ) | ||
| hits = filter_hits_by_content_metadata( | ||
| hits, lambda m: m.get("meta_b", 0) >= 10 | ||
| ) | ||
| ``` | ||
|
|
||
| **`lancedb_retrieval` + post-filter:** the helper only returns `top_k` rows with no `where` argument; filtering in Python is for illustration and does **not** change what the database evaluates. | ||
| ## Inspect hit metadata { #inspect-hit-metadata } | ||
|
|
||
| ```python | ||
| Use the lancedb_retrieval helper from the same LanceDB module you use with create_ingestor (see Python API). | ||
|
|
||
| hostname = "localhost" | ||
| table_name = "nemo_retriever_collection" | ||
| lancedb_uri = "./lancedb_data" | ||
| top_k = 5 | ||
| model_name = "nvidia/llama-nemotron-embed-vl-1b-v2" | ||
|
|
||
| queries = ["this is expensive"] | ||
| q_results = [] | ||
| for que in queries: | ||
| batch = lancedb_retrieval( | ||
| [que], | ||
| table_path=lancedb_uri, | ||
| table_name=table_name, | ||
| embedding_endpoint=f"http://{hostname}:8012/v1", | ||
| top_k=top_k, | ||
| model_name=model_name, | ||
| ) | ||
| # Application-side only: fewer than top_k hits if Engineering rows are not in this batch | ||
| filtered = [ | ||
| hit | ||
| for hit in batch[0] | ||
| if hit.get("entity", {}) | ||
| .get("content_metadata", {}) | ||
| .get("department") | ||
| == "Engineering" | ||
| ] | ||
| q_results.append(filtered) | ||
|
|
||
| print(f"{q_results}") | ||
| ``` | ||
| Each hit's `metadata` field is a JSON string. Use **`parse_hit_content_metadata(hit)`** to obtain a `dict` (the same helper `filter_hits_by_content_metadata` uses). Both helpers are exported from `nemo_retriever.vdb`. | ||
|
|
||
| ## Limitations { #limitations } | ||
|
|
||
| - **Hybrid search** — Metadata filters on the precomputed-vector retrieval path apply to **dense vector search only**. `LanceDB.retrieval` raises `NotImplementedError` when `hybrid=True`; see [Vector databases](vdbs.md#hybrid-search-lancedb). | ||
| - **Predicate shape** — `where` uses substring `LIKE` on compact JSON in `metadata`; design keys and values accordingly. | ||
| - **Sidecar updates** — Changing sidecar data requires re-ingesting affected documents so LanceDB rows pick up new metadata. | ||
|
|
||
| ## Related Content | ||
| ## Related content { #related-content } | ||
|
|
||
| - For a notebook that uses the CLI to add custom metadata and filter query results, refer to [metadata_and_filtered_search.ipynb | ||
| ](https://github.com/NVIDIA/NeMo-Retriever/blob/main/examples/metadata_and_filtered_search.ipynb). | ||
| - [Vector databases](vdbs.md) — LanceDB upload, retrieval, and hybrid notes | ||
| - [nemo_retriever_retriever_query_metadata_filter.ipynb](https://github.com/NVIDIA/NeMo-Retriever/blob/main/examples/nemo_retriever_retriever_query_metadata_filter.ipynb) — end-to-end metadata filtering with `Retriever` | ||
| - [nemo_retriever_metadata_and_filtered_search.ipynb](https://github.com/NVIDIA/NeMo-Retriever/blob/main/examples/nemo_retriever_metadata_and_filtered_search.ipynb) — graph ingest with sidecar metadata | ||
| - [Vector DB operators (source)](https://github.com/NVIDIA/NeMo-Retriever/tree/main/nemo_retriever/src/nemo_retriever/vdb#metadata-filtering) — canonical developer reference for this page | ||
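
The quoting rules in the new "Writing `where` predicates" section (compact JSON fragments inside `LIKE`, single quotes doubled) are easy to get wrong by hand. As a hedged illustration only, a small helper along the following lines could build the predicate string. `metadata_like` is a hypothetical name and is not part of `nemo_retriever`; the sketch assumes the `metadata` column holds compact JSON exactly as the page describes.

```python
import json


def metadata_like(key: str, value) -> str:
    """Build a `metadata LIKE '%"key":value%'` predicate (hypothetical helper).

    Assumes `metadata` stores compact JSON, so the fragment is rendered with
    no space after ':'; strings get JSON quotes, numbers do not.
    """
    fragment = f"{json.dumps(key)}:{json.dumps(value)}"
    # Escape single quotes for the SQL string literal by doubling them.
    fragment = fragment.replace("'", "''")
    return f"metadata LIKE '%{fragment}%'"


print(metadata_like("meta_a", "alpha"))
print(metadata_like("meta_b", 10))
print(" AND ".join([metadata_like("meta_a", "bravo"), metadata_like("meta_b", 20)]))
```

The helper does not escape the `LIKE` wildcards `%` and `_`, so a key such as `meta_a` still treats its underscore as "any single character". That is usually harmless but worth keeping in mind when keys differ by one character.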
Review comment: The combined-predicate example requires `meta_a == "bravo"` AND `meta_b == 10`, but the sample DataFrame defined earlier maps `"bravo"` to `meta_b=20` and `"alpha"` to `meta_b=10`. This predicate would match zero rows against the example data, which is likely to mislead users who test it as written or base their own predicates on this template.
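
The mismatch can be checked without a LanceDB instance. The sketch below is a stand-in under stated assumptions: it serializes the two sample sidecar rows as compact JSON with `json.dumps(..., separators=(",", ":"))` and treats each `LIKE '%…%'` pattern as a plain substring test. The helper name `matches_all` is hypothetical, and pairing `bravo` with `20` is only one possible correction.

```python
import json

# Sample sidecar rows from the DataFrame in the diff.
rows = [
    {"meta_a": "alpha", "meta_b": 10},
    {"meta_a": "bravo", "meta_b": 20},
]

# Compact JSON (no spaces after ':' or ','), mirroring how the page says
# content_metadata is stored in the `metadata` column.
serialized = [json.dumps(r, separators=(",", ":")) for r in rows]


def matches_all(metadata: str, fragments: list[str]) -> bool:
    """Approximate `metadata LIKE '%frag%' AND ...` as substring checks."""
    return all(fragment in metadata for fragment in fragments)


# Predicate as written in the docs: bravo AND 10.
as_written = [m for m in serialized
              if matches_all(m, ['"meta_a":"bravo"', '"meta_b":10'])]
# One possible correction: pair bravo with its actual value, 20.
corrected = [m for m in serialized
             if matches_all(m, ['"meta_a":"bravo"', '"meta_b":20'])]

print(len(as_written), len(corrected))
```

Under these assumptions the predicate as written selects no rows, while the corrected pairing selects the `bravo` row. A related caveat of substring matching: the fragment `"meta_b":10` would also match a stored value such as `100`, because nothing in the pattern terminates the number.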