align metadata docs with VDB filtering guide #2108
Replace outdated ingestion and lancedb_retrieval examples with sidecar metadata, Retriever where filters, and links to the canonical vdb README.
Greptile Summary
Replaces the old
| Filename | Overview |
|---|---|
| docs/docs/extraction/custom-metadata.md | Complete rewrite of the custom metadata guide: replaces legacy notebook-first flow with a focused sidecar-ingest (meta_dataframe/meta_source_field/meta_fields) and Retriever.query filter pattern. A numeric LIKE predicate example may silently over-match (e.g., meta_b:10 matches rows with meta_b:100), and the linked notebook name diverges from what notebooks.md still references. |
| docs/docs/extraction/vdbs.md | Adds a new "Metadata and filtering" ToC entry and section stub that defers to custom-metadata.md, removes duplicate filter guidance from the "Semantic and hybrid retrieval" bullet list, and adds two cross-reference links in "More information" and "Related Topics". Changes are additive and internally consistent. |
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[Ingest documents] -->|meta_dataframe\nmeta_source_field\nmeta_fields| B[vdb_upload]
B -->|compact JSON merged\ninto content_metadata| C[(LanceDB\nmetadata column)]
D[Retriever.query] -->|vdb_kwargs: where / _filter| C
C -->|LanceDB server-side\nDATAFUSION SQL LIKE| E[Filtered hits]
E -->|filter_hits_by_content_metadata\nlambda predicate| F[Client-side refined hits]
E -->|parse_hit_content_metadata| G[Metadata dict]
style C fill:#f0f4ff,stroke:#4466cc
style F fill:#e8f5e9,stroke:#4caf50
style G fill:#fff8e1,stroke:#ff9800
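The last two stages of the flowchart (server-side `LIKE`, then a client-side lambda predicate) can be sketched as below. This is a minimal illustration, not the library's implementation: `parse_metadata` and `refine_hits` are hypothetical stand-ins for `parse_hit_content_metadata` and `filter_hits_by_content_metadata`, and the hit shape (a dict with a JSON string under `metadata`) is an assumption.

```python
import json
from typing import Callable


def parse_metadata(hit: dict) -> dict:
    """Hypothetical stand-in: decode the hit's JSON metadata string."""
    raw = hit.get("metadata") or "{}"
    return json.loads(raw)


def refine_hits(hits: list[dict], predicate: Callable[[dict], bool]) -> list[dict]:
    """Hypothetical stand-in: keep hits whose parsed metadata passes the predicate."""
    return [hit for hit in hits if predicate(parse_metadata(hit))]


# Assumed output of a coarse server-side filter such as
# metadata LIKE '%"meta_b":10%', which also lets meta_b=100 through.
server_side_hits = [
    {"text": "doc 1", "metadata": '{"meta_a":"alpha","meta_b":10}'},
    {"text": "doc 2", "metadata": '{"meta_a":"charlie","meta_b":100}'},
]

# Client-side refinement compares the parsed value, not a substring.
refined = refine_hits(server_side_hits, lambda meta: meta.get("meta_b") == 10)
print([hit["text"] for hit in refined])
```

Comparing parsed values on the client is what removes the substring false positives that a server-side `LIKE` can leave behind.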
Prompt To Fix All With AI
Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 2
docs/docs/extraction/custom-metadata.md:99-102
**Numeric LIKE pattern can produce false positives**
The predicate `"meta_b":10` is a substring match, so it also matches rows where `meta_b` serializes to `100`, `1000`, `10.5`, etc. The note at the end of the section warns about false positives for string fields but the numeric example is the one that silently over-matches. Consider anchoring the numeric match with a trailing comma or closing brace to avoid this: `"meta_b":10,` (key in the middle of the object) or `"meta_b":10}` (key at the end), or guard both with an OR pattern.
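To make the over-match concrete, here is a small sketch that runs the unanchored and anchored patterns against a few compact JSON strings. It assumes SQLite's `LIKE` behaves the same as LanceDB's filter engine for these patterns; the table name `docs` and the sample rows are made up for illustration.

```python
import sqlite3

# Assumption: SQLite stands in for LanceDB's SQL filter engine; only the
# substring semantics of LIKE are being illustrated.
rows = [
    ('{"meta_a":"alpha","meta_b":10}',),     # key at the end of the object
    ('{"meta_b":10,"meta_a":"alpha"}',),     # key in the middle of the object
    ('{"meta_a":"charlie","meta_b":100}',),  # must not match meta_b == 10
    ('{"meta_a":"delta","meta_b":10.5}',),   # must not match meta_b == 10
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (metadata TEXT)")
conn.executemany("INSERT INTO docs VALUES (?)", rows)


def count(where: str) -> int:
    """Count rows matching a WHERE clause (illustration only)."""
    return conn.execute(f"SELECT COUNT(*) FROM docs WHERE {where}").fetchone()[0]


unanchored = "metadata LIKE '%\"meta_b\":10%'"
anchored = "(metadata LIKE '%\"meta_b\":10,%' OR metadata LIKE '%\"meta_b\":10}%')"

unanchored_count = count(unanchored)  # matches 10, 100, and 10.5
anchored_count = count(anchored)      # matches only the exact value 10
print(unanchored_count, anchored_count)
```

The OR of the two anchored patterns covers both key positions, which is the guard suggested above.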
### Issue 2 of 2
docs/docs/extraction/custom-metadata.md:141
**`notebooks.md` not updated to match the renamed notebook**
`docs/docs/extraction/notebooks.md` line 15 still links to `metadata_and_filtered_search.ipynb` (the old name) while this guide now points readers to `nemo_retriever_metadata_and_filtered_search.ipynb`. Both files currently exist in `examples/`, so no link is broken, but a reader following the "Getting Started" notebooks page lands on a different notebook than the one described here as the "graph ingest with sidecar metadata" reference. If the old notebook is considered superseded, `notebooks.md` should be updated to reference the renamed file.
Reviews (3): Last reviewed commit: "docs(extraction): link service sidecar i..."
# Combine predicates
where = "metadata LIKE '%\"meta_a\":\"bravo\"%' AND metadata LIKE '%\"meta_b\":10%'"
The combined-predicate example uses `meta_a == "bravo" AND meta_b == 10`, but the sample DataFrame defined earlier maps `"bravo"` to `meta_b=20` and `"alpha"` to `meta_b=10`. This predicate would match zero rows against the example data, which is likely to mislead users who test it as written or base their own predicates on this template.
- # Combine predicates
- where = "metadata LIKE '%\"meta_a\":\"bravo\"%' AND metadata LIKE '%\"meta_b\":10%'"
+ # Combine predicates (meta_a="bravo" maps to meta_b=20 in the sample DataFrame)
+ where = "metadata LIKE '%\"meta_a\":\"bravo\"%' AND metadata LIKE '%\"meta_b\":20%'"
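A quick way to confirm the zero-row claim is to emulate both predicates in plain Python. This sketch assumes the sample data contains only the two mappings named in the comment (alpha to 10, bravo to 20) and that metadata is stored as compact JSON; `like_all` is a hypothetical helper that treats every character literally, so the SQL `_` wildcard is ignored.

```python
import json

# Assumption: rows reconstructed from the review comment; other columns omitted.
sample = [
    {"meta_a": "alpha", "meta_b": 10},
    {"meta_a": "bravo", "meta_b": 20},
]

# Compact JSON (no spaces after separators), as described for the metadata column.
serialized = [json.dumps(row, separators=(",", ":")) for row in sample]


def like_all(text: str, fragments: list[str]) -> bool:
    """Emulate `LIKE '%a%' AND LIKE '%b%'` as literal substring checks."""
    return all(fragment in text for fragment in fragments)


original = ['"meta_a":"bravo"', '"meta_b":10']   # predicate as written in the guide
suggested = ['"meta_a":"bravo"', '"meta_b":20']  # predicate from the suggestion

original_hits = [s for s in serialized if like_all(s, original)]
suggested_hits = [s for s in serialized if like_all(s, suggested)]
print(original_hits)
print(suggested_hits)
```

The original predicate returns no rows because no single row contains both fragments, while the suggested one returns the bravo row.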
…EADME Add a Metadata and filtering section that defers to the published page and the canonical vdb README anchor instead of duplicating guidance here.
Summary
- Replaces `docs/docs/extraction/custom-metadata.md` with a published guide aligned to VDB metadata filtering: sidecar ingest (`meta_dataframe` / `meta_source_field` / `meta_fields`), compact JSON in LanceDB, server-side `where` on `Retriever.query`, and client-side `filter_hits_by_content_metadata` / `parse_hit_content_metadata`. Removes outdated `lancedb_retrieval` and legacy notebook-first flows.
- Updates `docs/docs/extraction/vdbs.md` so metadata is not duplicated on the LanceDB page: new Metadata and filtering section defers to `custom-metadata.md` (MkDocs) and the canonical `nemo_retriever/src/nemo_retriever/vdb/README.md` anchor.
- Links `POST /v1/ingest/sidecar`, `SidecarUploadResponse`, `PipelineSpec.vdb_upload_params`, and the retriever OpenAPI UI (`/docs`, `/openapi.json`) so users can look up request/response shapes and auth headers (review fix).

Test plan
- `custom-metadata.md` and `vdbs.md#metadata-and-filtering`
- `vdbs.md`, `workflow-agentic-retrieval.md`, and `integrations-langchain-llamaindex-haystack.md` still resolve
- `examples/nemo_retriever_retriever_query_metadata_filter.ipynb` `/docs` on a running retriever) open and match the described flow