Update embeddings for table and column by id (with verified) by DinaLaptii · Pull Request #2049 · NVIDIA/NeMo-Retriever

DinaLaptii · 2026-05-18T09:50:36Z

Description

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.
If adjusting docker-compose.yaml environment variables have you ensured those are mimicked in the Helm values.yaml file.

greptile-apps · 2026-05-18T09:55:10Z

Greptile Summary

This PR eliminates the Neo4j round-trip in the tabular embedding pipeline by threading (tables_df, columns_df) — carrying post-ingest UUIDs — from TabularSchemaExtractOp directly into TabularFetchEmbeddingsOp, and adds PutVdbOperator / LanceDB.put() for strict in-place updates of existing VDB rows keyed by "id".

TabularSchemaExtractOp.process() now returns tuple[pd.DataFrame, pd.DataFrame] instead of an empty DataFrame; store_relational_db_in_neo4j and populate_tabular_data propagate the {schema_name: Schema} dict so UUIDs assigned during Neo4j ingest are immediately available downstream.
TabularFetchEmbeddingsOp gains an inline row-builder (_build_rows) that produces embedding-ready rows from the DataFrames, with a fallback to the old Neo4j query path when the input is not a 2-tuple.
PutVdbOperator (with LanceDB.put()) implements strict update-only semantics via merge_insert, raising KeyError for missing or unrecognised IDs and FileNotFoundError when the target table does not yet exist; three new tests cover the guard, happy path, and sidecar-merge path.

Confidence Score: 5/5

Safe to merge; all changes are additive and the core put/embed pipeline logic is structurally sound.

The functional changes are well-contained: the Neo4j round-trip elimination is a straightforward data-threading refactor, and PutVdbOperator/LanceDB.put() both have explicit contracts and meaningful guard checks. The three new PutVdbOperator tests cover the key behavioral paths. The open items (missing type annotations and absent operator-layer tests for TabularFetchEmbeddingsOp) do not affect correctness at runtime.

tabular_fetch_embeddings_operator.py and tabular_schema_extract_operator.py — significant logic changes without operator-level unit tests.

Important Files Changed

Filename	Overview
nemo_retriever/src/nemo_retriever/graph/tabular_fetch_embeddings_operator.py	Major refactor: operator now consumes (tables_df, columns_df) from upstream instead of querying Neo4j; adds schema-keyed column bucketing and _build_rows row builder. Logic appears correct but lacks unit tests.
nemo_retriever/src/nemo_retriever/graph/tabular_schema_extract_operator.py	Return type changed from pd.DataFrame to tuple[pd.DataFrame, pd.DataFrame]; now passes post-ingest UUIDs downstream without a Neo4j round-trip. Missing unit tests for the new tuple output.
nemo_retriever/src/nemo_retriever/tabular_data/ingestion/extract_data.py	store_relational_db_in_neo4j now propagates the schemas dict from populate_tabular_data. Missing return type annotation on the changed function signature.
nemo_retriever/src/nemo_retriever/tabular_data/ingestion/write_to_graph.py	populate_tabular_data now returns all_schemas dict instead of []. Dead assignment (all_schemas = {}) removed cleanly. Missing return type annotation.
nemo_retriever/src/nemo_retriever/vdb/adt_vdb.py	Adds non-abstract put() with NotImplementedError stub; design rationale for omitting @AbstractMethod is documented. Base class signature annotated correctly.
nemo_retriever/src/nemo_retriever/vdb/lancedb.py	Adds id field to schema, extracts row_id from content/source metadata, and implements put() with strict update-only semantics via merge_insert. records parameter missing type annotation.
nemo_retriever/src/nemo_retriever/vdb/operators.py	Adds PutVdbOperator extending IngestVdbOperator; constructor guard correctly detects un-overridden put() stubs. Sidecar path reused from parent cleanly.
nemo_retriever/src/nemo_retriever/vdb/init.py	PutVdbOperator added to public exports and all. Straightforward additive change.
nemo_retriever/tests/test_nv_ingest_vdb_operator.py	Adds three meaningful PutVdbOperator tests: guard-rejects-stub, happy-path delegation, and sidecar-merge path. _StubPutVDB correctly verifies VDB.put is not @AbstractMethod.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[TabularSchemaExtractOp.process] --> B[store_relational_db_in_neo4j]
    B --> C[populate_tabular_data returns schemas dict]
    C --> D[Concat tables_df + columns_df with UUIDs]
    D --> E[Return tuple to downstream]
    E --> F[TabularFetchEmbeddingsOp.process]
    F --> G{Input is 2-tuple?}
    G -- Yes --> H[_build_rows: schema-keyed column buckets]
    G -- No --> I[Fallback: fetch from database]
    H --> J[DataFrame with text + metadata]
    I --> J
    J --> K[Embedding step]
    K --> L[PutVdbOperator.process]
    L --> M[LanceDB.put: validate all rows exist]
    M --> N[merge_insert when_matched_update_all]

Prompt To Fix All With AI

Fix the following 3 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 3
nemo_retriever/src/nemo_retriever/vdb/lancedb.py:540-545
**Missing type annotation on `records` parameter**

`LanceDB.put()` drops the `records: list` annotation that the base class declares (`VDB.put(self, records: list, **kwargs: Any)`). Any type checker will treat `records` as `Any` here, defeating the coverage the base-class annotation was meant to provide. Add `records: list` to match the parent signature.

### Issue 2 of 3
nemo_retriever/src/nemo_retriever/tabular_data/ingestion/extract_data.py:57
**Missing return type annotation on `store_relational_db_in_neo4j` and `populate_tabular_data`**

Both functions now return a meaningful `dict[str, Any]` (the `{schema_name_lower: Schema}` map) instead of nothing, but neither has a return type annotation. This affects `store_relational_db_in_neo4j` here and `populate_tabular_data` in `write_to_graph.py`. Without annotations, type checkers can't validate how callers use the returned dict, and the contract is invisible to readers. Add `-> dict[str, Any]` (or the concrete `Schema` type if it's exported) to both signatures.

### Issue 3 of 3
nemo_retriever/src/nemo_retriever/graph/tabular_fetch_embeddings_operator.py:63-79
**No unit tests for the new operator logic in `TabularFetchEmbeddingsOp` or `TabularSchemaExtractOp`**

`TabularFetchEmbeddingsOp.process()` gained a large new code path — including tuple unpacking, the `_build_rows` row-builder, schema-keyed column bucketing, `pd.isna` guards, and the fallback to fetching from the DB. `TabularSchemaExtractOp.process()` changed its return type from `pd.DataFrame` to `tuple[pd.DataFrame, pd.DataFrame]`. Neither operator has tests in the PR. `test_tabular_pipeline.py` covers extraction helpers but not the operator layer. The column-bucketing logic (unique `table_schema + table_name` keys), the empty-input sentinel, and the fallback-to-DB path are all meaningful edge cases that belong in tests.

_{Reviews (10): Last reviewed commit: "style(vdb): apply pre-commit formatting ..." | Re-trigger Greptile}

greptile-apps · 2026-05-18T09:55:16Z

+        if records and any(batch for batch in records):
+            self._vdb.upsert(records, table_name=self._table_name, key=self._key)
+        return data


The return value from vdb.upsert() carries the per-call counts (upserted, skipped_no_key, created_table, etc.) that are documented in LanceDB.upsert. Silently discarding them makes it impossible to observe incremental-update behaviour without attaching a debugger. At minimum the result should be logged at DEBUG level.

Suggested change

if records and any(batch for batch in records):

self._vdb.upsert(records, table_name=self._table_name, key=self._key)

return data

if records and any(batch for batch in records):

result = self._vdb.upsert(records, table_name=self._table_name, key=self._key)

logger.debug("UpsertVdbOperator: upsert result=%s", result)

return data

Prompt To Fix With AI

This is a comment left during a code review. Path: nemo_retriever/src/nemo_retriever/vdb/operators.py Line: 187-189 Comment: The return value from `vdb.upsert()` carries the per-call counts (`upserted`, `skipped_no_key`, `created_table`, etc.) that are documented in `LanceDB.upsert`. Silently discarding them makes it impossible to observe incremental-update behaviour without attaching a debugger. At minimum the result should be logged at DEBUG level. ```suggestion if records and any(batch for batch in records): result = self._vdb.upsert(records, table_name=self._table_name, key=self._key) logger.debug("UpsertVdbOperator: upsert result=%s", result) return data ``` How can I resolve this? If you propose a fix, please make it concise.

not sure that i should to add logs here

- TabularFetchEmbeddingsOp: replace TypeError on non-tuple input with a call to fetch_tabular_embedding_dataframe so the pipeline can still produce embedding rows from Neo4j when upstream did not pass (tables_df, columns_df). - Add nemo_retriever.tabular_data.ingestion.embeddings module that queries Neo4j for Table/Column docs and returns an embedding-ready DataFrame matching the unstructured pipeline format. - lancedb: drop stale comment in _create_lancedb_results.

greptile-apps · 2026-05-19T08:26:20Z

            pa.field("source", pa.string()),
+            # Stable entity id (e.g. Neo4j node UUID for tabular ingest).
+            # Empty string when the upstream payload has no stable identity;
+            # used as the merge key by ``LanceDB.upsert``.
+            pa.field("id", pa.string()),


Schema migration required for existing LanceDB tables

Adding "id" to _lancedb_arrow_schema means every call to create_index with overwrite=False (the default) will hit _validate_append_schema, which raises ValueError: "is missing required field 'id'; use overwrite=True to replace the table." for any table created before this PR. Any existing deployment that does not pass overwrite=True will fail on its next ingestion run. There is no migration path or backward-compat guard provided here.

Prompt To Fix With AI

This is a comment left during a code review. Path: nemo_retriever/src/nemo_retriever/vdb/lancedb.py Line: 121-125 Comment: **Schema migration required for existing LanceDB tables** Adding `"id"` to `_lancedb_arrow_schema` means every call to `create_index` with `overwrite=False` (the default) will hit `_validate_append_schema`, which raises `ValueError: "is missing required field 'id'; use overwrite=True to replace the table."` for any table created before this PR. Any existing deployment that does not pass `overwrite=True` will fail on its next ingestion run. There is no migration path or backward-compat guard provided here. How can I resolve this? If you propose a fix, please make it concise.

it can be an issue. we must check it

greptile-apps · 2026-05-19T12:49:55Z

    all_schemas = populate_db(tables_df, columns_df, database, num_workers)

    if "fks" in data:
        populate_fks(fks=data["fks"], database_name=database)
+
    if "pks" in data:
        populate_pks(pks=data["pks"], database_name=database)

    if "queries" in data:
        populate_queries(all_schemas, data["queries"], num_workers, dialect)

-    return []
+    return all_schemas


Existing tables get wrong UUIDs on re-ingestion, causing LanceDB duplicates

In the DB-update path (when existing_db_id is not None and loaded), populate_db returns schemas whose tables_df["id"] entries for already-existing tables still hold the freshly generated UUIDs produced by the Schema constructor — they are never backfilled with the original Neo4j node IDs. Only columns_to_update (columns in the intersection with changed properties) get their id replaced with the graph UUID inside accumulate_updated_column; no equivalent correction happens for existing tables.

As a result, TabularSchemaExtractOp now propagates these wrong IDs into tables_df, TabularFetchEmbeddingsOp embeds them as the id field, and UpsertVdbOperator uses them as the LanceDB merge key. Because LanceDB holds the original UUIDs from the first ingest, the upsert path inserts new rows for every already-existing table instead of updating the existing entries, creating duplicates on every re-ingestion run.

The existing-db path in populate_db needs to backfill new_schema.tables_df["id"] (and consistently, columns_df["id"]) from existing_schema before returning, so the UUIDs in the returned schemas match what Neo4j and LanceDB already know.

Prompt To Fix With AI

This is a comment left during a code review. Path: nemo_retriever/src/nemo_retriever/tabular_data/ingestion/write_to_graph.py Line: 46-57 Comment: **Existing tables get wrong UUIDs on re-ingestion, causing LanceDB duplicates** In the DB-update path (when `existing_db_id is not None and loaded`), `populate_db` returns `schemas` whose `tables_df["id"]` entries for **already-existing tables** still hold the freshly generated UUIDs produced by the `Schema` constructor — they are never backfilled with the original Neo4j node IDs. Only `columns_to_update` (columns in the intersection with changed properties) get their `id` replaced with the graph UUID inside `accumulate_updated_column`; no equivalent correction happens for existing tables. As a result, `TabularSchemaExtractOp` now propagates these wrong IDs into `tables_df`, `TabularFetchEmbeddingsOp` embeds them as the `id` field, and `UpsertVdbOperator` uses them as the LanceDB merge key. Because LanceDB holds the original UUIDs from the first ingest, the upsert path inserts **new rows** for every already-existing table instead of updating the existing entries, creating duplicates on every re-ingestion run. The existing-db path in `populate_db` needs to backfill `new_schema.tables_df["id"]` (and consistently, `columns_df["id"]`) from `existing_schema` before returning, so the UUIDs in the returned `schemas` match what Neo4j and LanceDB already know. How can I resolve this? If you propose a fix, please make it concise.

…mn-by-id

…into feature/update-emmdedding-for-table-and-column-by-id

tomer-levin-nv · 2026-05-21T07:29:23Z

            pa.field("source", pa.string()),
+            # Stable entity id (e.g. Neo4j node UUID for tabular ingest).
+            # Empty string when the upstream payload has no stable identity;
+            # used as the merge key by ``LanceDB.upsert``.
+            pa.field("id", pa.string()),


it can be an issue. we must check it

Rename `VDB.upsert` / `UpsertVdbOperator` to `VDB.put` / `PutVdbOperator` and tighten the contract to an in-place replace: missing keys and rows not already present in the target table now raise `KeyError` instead of being inserted, and `put` no longer creates tables on the fly. Updates the LanceDB implementation and the operator tests accordingly.

Add PutVdbOperator to the public API alongside IngestVdbOperator and RetrieveVdbOperator so callers using the standard package import path (`from nemo_retriever.vdb import PutVdbOperator`) can access it.

Collapse the multi-line NotImplementedError raise into a single line to match what the pre-commit hook produces, unblocking CI.

Update embeddings for table and column by id

fb49b1f

DinaLaptii requested review from a team as code owners May 18, 2026 09:50

DinaLaptii requested a review from edknv May 18, 2026 09:50

DinaLaptii marked this pull request as draft May 18, 2026 09:51

greptile-apps Bot reviewed May 18, 2026

View reviewed changes

yuvalshkolar reviewed May 18, 2026

View reviewed changes

Comment thread nemo_retriever/src/nemo_retriever/graph/tabular_fetch_embeddings_operator.py

resolve comments

004fecd

DinaLaptii marked this pull request as ready for review May 19, 2026 08:08

greptile-apps Bot reviewed May 19, 2026

View reviewed changes

Comment thread nemo_retriever/tests/test_nv_ingest_vdb_operator.py Outdated

fix test

200e2ad

greptile-apps Bot reviewed May 19, 2026

View reviewed changes

DinaLaptii added 2 commits May 19, 2026 15:37

fix

2b3ca75

fix

b68d7bd

greptile-apps Bot reviewed May 19, 2026

View reviewed changes

DinaLaptii added 2 commits May 20, 2026 16:00

Merge branch 'main' into feature/update-emmdedding-for-table-and-colu…

197e1e8

…mn-by-id

Merge branch 'main' of https://github.com/ftatiana-nv/NeMo-Retriever …

d5b5534

…into feature/update-emmdedding-for-table-and-column-by-id

tomer-levin-nv reviewed May 21, 2026

View reviewed changes

DinaLaptii added 3 commits May 21, 2026 13:35

fix(vdb): export PutVdbOperator from package init

c4dc2dc

Add PutVdbOperator to the public API alongside IngestVdbOperator and RetrieveVdbOperator so callers using the standard package import path (`from nemo_retriever.vdb import PutVdbOperator`) can access it.

style(vdb): apply pre-commit formatting to PutVdbOperator

1ebe6e6

Collapse the multi-line NotImplementedError raise into a single line to match what the pre-commit hook produces, unblocking CI.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Update embeddings for table and column by id (with verified)#2049

Update embeddings for table and column by id (with verified)#2049
DinaLaptii wants to merge 11 commits into
NVIDIA:mainfrom
ftatiana-nv:feature/update-emmdedding-for-table-and-column-by-id

DinaLaptii commented May 18, 2026

Uh oh!

greptile-apps Bot commented May 18, 2026 •

edited

Loading

Confidence Score: 5/5

Important Files Changed

Flowchart

Uh oh!

Uh oh!

Uh oh!

greptile-apps Bot May 18, 2026

Uh oh!

DinaLaptii May 19, 2026

Uh oh!

Uh oh!

Uh oh!

greptile-apps Bot May 19, 2026

Uh oh!

tomer-levin-nv May 21, 2026

Uh oh!

greptile-apps Bot May 19, 2026

Uh oh!

Uh oh!

Uh oh!

tomer-levin-nv May 21, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

DinaLaptii commented May 18, 2026

Description

Checklist

Uh oh!

greptile-apps Bot commented May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Flowchart

Uh oh!

Uh oh!

Uh oh!

greptile-apps Bot May 18, 2026

Choose a reason for hiding this comment

Uh oh!

DinaLaptii May 19, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

greptile-apps Bot May 19, 2026

Choose a reason for hiding this comment

Uh oh!

tomer-levin-nv May 21, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot May 19, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

tomer-levin-nv May 21, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

greptile-apps Bot commented May 18, 2026 •

edited

Loading