
align metadata docs with VDB filtering guide #2108

Open
kheiss-uwzoo wants to merge 3 commits into NVIDIA:main from kheiss-uwzoo:kheiss/nuke

Collaborator

@kheiss-uwzoo kheiss-uwzoo commented May 22, 2026

Summary

  • Replace docs/docs/extraction/custom-metadata.md with a published guide aligned to VDB metadata filtering: sidecar ingest (meta_dataframe / meta_source_field / meta_fields), compact JSON in LanceDB, server-side where on Retriever.query, and client-side filter_hits_by_content_metadata / parse_hit_content_metadata. Removes outdated lancedb_retrieval and legacy notebook-first flows.
  • Update docs/docs/extraction/vdbs.md so metadata is not duplicated on the LanceDB page: new Metadata and filtering section defers to custom-metadata.md (MkDocs) and the canonical nemo_retriever/src/nemo_retriever/vdb/README.md anchor.
  • Service-mode sidecar — Link POST /v1/ingest/sidecar, SidecarUploadResponse, PipelineSpec.vdb_upload_params, and the retriever OpenAPI UI (/docs, /openapi.json) so users can look up request/response shapes and auth headers (review fix).
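As a rough illustration of the sidecar idea in the first bullet, the sketch below selects a few fields per source document and serializes them as compact JSON. It does not call the real ingest API: the `compact_metadata` helper, the file names, and the use of plain dicts in place of a DataFrame are all assumptions made for illustration, and only the parameter names `meta_source_field` and `meta_fields` come from the PR description.

```python
import json

# Stand-in for meta_dataframe: one row per source document.
# Plain dicts are used here instead of a DataFrame (assumption).
sidecar_rows = [
    {"source": "a.pdf", "meta_a": "alpha", "meta_b": 10, "note": "not selected"},
    {"source": "b.pdf", "meta_a": "bravo", "meta_b": 20, "note": "not selected"},
]
meta_source_field = "source"        # column that identifies the document
meta_fields = ["meta_a", "meta_b"]  # columns to carry into the stored metadata


def compact_metadata(rows, source_field, fields):
    """Hypothetical helper: map each source to compact JSON of the chosen fields."""
    return {
        row[source_field]: json.dumps(
            {name: row[name] for name in fields},
            separators=(",", ":"),  # no spaces after ',' or ':'
        )
        for row in rows
    }


by_source = compact_metadata(sidecar_rows, meta_source_field, meta_fields)
print(by_source["b.pdf"])
```

The compact form matters for the server-side `where` filters discussed later in this thread, because substring predicates such as `"meta_b":20` only match when no whitespace separates the key from the value.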

Test plan

  • MkDocs build / link check for custom-metadata.md and vdbs.md#metadata-and-filtering
  • Confirm inbound links from vdbs.md, workflow-agentic-retrieval.md, and integrations-langchain-llamaindex-haystack.md still resolve
  • Spot-check examples against examples/nemo_retriever_retriever_query_metadata_filter.ipynb
  • Verify service-sidecar links (GitHub route/models + /docs on a running retriever) open and match the described flow

Replace outdated ingestion and lancedb_retrieval examples with sidecar
metadata, Retriever where filters, and links to the canonical vdb README.
@kheiss-uwzoo kheiss-uwzoo requested review from a team as code owners May 22, 2026 21:36
@kheiss-uwzoo kheiss-uwzoo requested a review from drobison00 May 22, 2026 21:36
@kheiss-uwzoo kheiss-uwzoo changed the title docs(extraction): align custom-metadata with VDB filtering guide align custom-metadata with VDB filtering guide May 22, 2026
Contributor

greptile-apps Bot commented May 22, 2026

Greptile Summary

Replaces the old custom-metadata.md notebook-first guide with a structured reference aligned to the VDB metadata-filtering implementation, and adds a lightweight "Metadata and filtering" redirect section to vdbs.md to eliminate duplication.

  • custom-metadata.md — Rewrites the page around the sidecar ingest API (meta_dataframe / meta_source_field / meta_fields), compact-JSON storage in LanceDB, server-side where / _filter predicates on Retriever.query, and client-side filter_hits_by_content_metadata / parse_hit_content_metadata helpers; removes the legacy lancedb_retrieval post-filter example and the service-mode code block that lacked API links.
  • vdbs.md — Adds [Metadata and filtering](#metadata-and-filtering) to the ToC and section body (which points to custom-metadata.md and the VDB README anchor), removes the redundant [Custom metadata and filtering](custom-metadata.md) bullet from the "Semantic and hybrid retrieval" list, and adds two cross-reference links in "More information" and "Related Topics".

Confidence Score: 5/5

Documentation-only rewrite with no executable code changes; safe to merge.

Both files are Markdown documentation. The rewrite removes legacy, potentially misleading content and replaces it with a more accurate guide. No logic paths, APIs, or runtime behavior are changed. The two observations flagged are documentation quality notes that do not block correct use of the feature.

No files require special attention, though docs/docs/extraction/notebooks.md (outside this PR) may benefit from a follow-up update to align its notebook link with the renamed file referenced in the new guide.

Important Files Changed

| Filename | Overview |
| --- | --- |
| docs/docs/extraction/custom-metadata.md | Complete rewrite of the custom metadata guide: replaces legacy notebook-first flow with a focused sidecar-ingest (meta_dataframe/meta_source_field/meta_fields) and Retriever.query filter pattern. A numeric LIKE predicate example may silently over-match (e.g., meta_b:10 matches rows with meta_b:100), and the linked notebook name diverges from what notebooks.md still references. |
| docs/docs/extraction/vdbs.md | Adds a new "Metadata and filtering" ToC entry and section stub that defers to custom-metadata.md, removes duplicate filter guidance from the "Semantic and hybrid retrieval" bullet list, and adds two cross-reference links in "More information" and "Related Topics". Changes are additive and internally consistent. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Ingest documents] -->|meta_dataframe\nmeta_source_field\nmeta_fields| B[vdb_upload]
    B -->|compact JSON merged\ninto content_metadata| C[(LanceDB\nmetadata column)]

    D[Retriever.query] -->|vdb_kwargs: where / _filter| C
    C -->|LanceDB server-side\nDATAFUSION SQL LIKE| E[Filtered hits]
    E -->|filter_hits_by_content_metadata\nlambda predicate| F[Client-side refined hits]
    E -->|parse_hit_content_metadata| G[Metadata dict]

    style C fill:#f0f4ff,stroke:#4466cc
    style F fill:#e8f5e9,stroke:#4caf50
    style G fill:#fff8e1,stroke:#ff9800
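
The client-side branch of the flowchart can be sketched as follows. The two functions below are hypothetical stand-ins written for this illustration, not the library's `parse_hit_content_metadata` or `filter_hits_by_content_metadata`, and the hit shape (a dict with a JSON string under `metadata`) is an assumption.

```python
import json


def parse_metadata(hit):
    """Hypothetical stand-in: return the hit's metadata as a dict."""
    raw = hit.get("metadata") or "{}"
    return raw if isinstance(raw, dict) else json.loads(raw)


def filter_hits(hits, predicate):
    """Hypothetical stand-in: keep hits whose parsed metadata satisfies predicate."""
    return [hit for hit in hits if predicate(parse_metadata(hit))]


# Assumed hit shape, as it might look after a server-side LIKE filter.
hits = [
    {"text": "first", "metadata": '{"meta_a":"alpha","meta_b":10}'},
    {"text": "second", "metadata": '{"meta_a":"bravo","meta_b":20}'},
    {"text": "third", "metadata": '{"meta_a":"charlie","meta_b":100}'},
]

# A typed comparison on the parsed value, unlike a substring match on the raw JSON.
exact = filter_hits(hits, lambda meta: meta.get("meta_b") == 10)
print([hit["text"] for hit in exact])
```

Because the predicate runs on parsed values, `meta_b == 10` does not match `100`, which is one reason to refine server-side results on the client.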
Prompt To Fix All With AI
Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 2
docs/docs/extraction/custom-metadata.md:99-102
**Numeric LIKE pattern can produce false positives**

The predicate `"meta_b":10` is a substring match, so it also matches rows where `meta_b` serializes to `100`, `1000`, `10.5`, etc. The note at the end of the section warns about false positives for string fields but the numeric example is the one that silently over-matches. Consider anchoring the numeric match with a trailing comma or closing brace to avoid this: `"meta_b":10,` (key in the middle of the object) or `"meta_b":10}` (key at the end), or guard both with an OR pattern.
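
The over-match described in Issue 1 can be reproduced with a small sketch. It uses SQLite from the Python standard library as a stand-in for LanceDB's SQL filter, on the assumption that `LIKE` substring semantics are the same for this purpose. The table name, the rows, and the `matching` helper are hypothetical.

```python
import json
import sqlite3

# Hypothetical rows; "delta" puts meta_b in the middle of the object.
rows = [
    {"meta_a": "alpha", "meta_b": 10},
    {"meta_a": "bravo", "meta_b": 20},
    {"meta_a": "charlie", "meta_b": 100},
    {"meta_b": 10, "meta_a": "delta"},
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hits (metadata TEXT)")
conn.executemany(
    "INSERT INTO hits VALUES (?)",
    [(json.dumps(row, separators=(",", ":")),) for row in rows],
)


def matching(where_sql, params):
    """Return the sorted meta_a values of rows that satisfy the predicate."""
    cur = conn.execute("SELECT metadata FROM hits WHERE " + where_sql, params)
    return sorted(json.loads(metadata)["meta_a"] for (metadata,) in cur)


# Unanchored: also matches meta_b = 100.
loose = matching("metadata LIKE ?", ('%"meta_b":10%',))

# Anchored with a trailing ',' or '}' and combined with OR.
anchored = matching(
    "metadata LIKE ? OR metadata LIKE ?",
    ('%"meta_b":10,%', '%"meta_b":10}%'),
)

print(loose)     # includes "charlie"
print(anchored)  # excludes "charlie"
```

Note that `_` is itself a single-character wildcard in SQL `LIKE`, so the pattern is slightly looser than it looks. That does not change the result here.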

### Issue 2 of 2
docs/docs/extraction/custom-metadata.md:141
**`notebooks.md` not updated to match the renamed notebook**

`docs/docs/extraction/notebooks.md` line 15 still links to `metadata_and_filtered_search.ipynb` (the old name) while this guide now points readers to `nemo_retriever_metadata_and_filtered_search.ipynb`. Both files currently exist in `examples/`, so no link is broken, but a reader following the "Getting Started" notebooks page lands on a different notebook than the one described here as the "graph ingest with sidecar metadata" reference. If the old notebook is considered superseded, `notebooks.md` should be updated to reference the renamed file.

Reviews (3): Last reviewed commit: "docs(extraction): link service sidecar i..."

Comment on lines +101 to +102
# Combine predicates
where = "metadata LIKE '%\"meta_a\":\"bravo\"%' AND metadata LIKE '%\"meta_b\":10%'"

P1 The combined-predicate example uses meta_a == "bravo" AND meta_b == 10, but the sample DataFrame defined earlier maps "bravo" to meta_b=20 and "alpha" to meta_b=10. This predicate would match zero rows against the example data, which is likely to mislead users who test it as written or base their own predicates on this template.

Suggested change
- # Combine predicates
- where = "metadata LIKE '%\"meta_a\":\"bravo\"%' AND metadata LIKE '%\"meta_b\":10%'"
+ # Combine predicates (meta_a="bravo" maps to meta_b=20 in the sample DataFrame)
+ where = "metadata LIKE '%\"meta_a\":\"bravo\"%' AND metadata LIKE '%\"meta_b\":20%'"
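
A quick way to check the point of this review comment is to run both predicates against the two sample pairs it names (`alpha` with 10, `bravo` with 20). The sketch below uses an in-memory SQLite table as a stand-in for the LanceDB metadata column. The table name and the `count` helper are hypothetical, and any other rows in the guide's sample DataFrame are not reproduced here.

```python
import json
import sqlite3

# Only the two pairs named in the review comment.
sample = [
    {"meta_a": "alpha", "meta_b": 10},
    {"meta_a": "bravo", "meta_b": 20},
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hits (metadata TEXT)")
conn.executemany(
    "INSERT INTO hits VALUES (?)",
    [(json.dumps(row, separators=(",", ":")),) for row in sample],
)


def count(where):
    """Count rows that satisfy the given WHERE clause."""
    return conn.execute("SELECT COUNT(*) FROM hits WHERE " + where).fetchone()[0]


original = "metadata LIKE '%\"meta_a\":\"bravo\"%' AND metadata LIKE '%\"meta_b\":10%'"
suggested = "metadata LIKE '%\"meta_a\":\"bravo\"%' AND metadata LIKE '%\"meta_b\":20%'"

print(count(original))   # no row pairs bravo with 10
print(count(suggested))  # the bravo row pairs with 20
```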

Comment thread docs/docs/extraction/custom-metadata.md Outdated
…EADME

Add a Metadata and filtering section that defers to the published page
and the canonical vdb README anchor instead of duplicating guidance here.
@kheiss-uwzoo kheiss-uwzoo changed the title align custom-metadata with VDB filtering guide docs(extraction): align metadata docs with VDB filtering guide May 22, 2026
@kheiss-uwzoo kheiss-uwzoo changed the title docs(extraction): align metadata docs with VDB filtering guide align metadata docs with VDB filtering guide May 22, 2026