-
Notifications
You must be signed in to change notification settings - Fork 322
Update embeddings for table and column by id (with verified) #2049
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
fb49b1f
637e561
004fecd
200e2ad
2b3ca75
b68d7bd
197e1e8
d5b5534
729a9d4
c4dc2dc
1ebe6e6
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -38,23 +38,23 @@ def populate_tabular_data(data, num_workers, dialect): | |
|
|
||
| if tables_df is None or tables_df.empty: | ||
| logger.warning("No tables found in source database; skipping graph population.") | ||
| return | ||
| return {} | ||
|
|
||
| database = data["database_name"] | ||
| logger.info(f"Started parsing db {database}.") | ||
|
|
||
| all_schemas = {} | ||
| all_schemas = populate_db(tables_df, columns_df, database, num_workers) | ||
|
|
||
| if "fks" in data: | ||
| populate_fks(fks=data["fks"], database_name=database) | ||
|
|
||
| if "pks" in data: | ||
| populate_pks(pks=data["pks"], database_name=database) | ||
|
|
||
| if "queries" in data: | ||
| populate_queries(all_schemas, data["queries"], num_workers, dialect) | ||
|
|
||
| return [] | ||
| return all_schemas | ||
|
Comment on lines
46
to
+57
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
In the DB-update path (when As a result, The existing-db path in Prompt To Fix With AIThis is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/tabular_data/ingestion/write_to_graph.py
Line: 46-57
Comment:
**Existing tables get wrong UUIDs on re-ingestion, causing LanceDB duplicates**
In the DB-update path (when `existing_db_id is not None and loaded`), `populate_db` returns `schemas` whose `tables_df["id"]` entries for **already-existing tables** still hold the freshly generated UUIDs produced by the `Schema` constructor — they are never backfilled with the original Neo4j node IDs. Only `columns_to_update` (columns in the intersection with changed properties) get their `id` replaced with the graph UUID inside `accumulate_updated_column`; no equivalent correction happens for existing tables.
As a result, `TabularSchemaExtractOp` now propagates these wrong IDs into `tables_df`, `TabularFetchEmbeddingsOp` embeds them as the `id` field, and `UpsertVdbOperator` uses them as the LanceDB merge key. Because LanceDB holds the original UUIDs from the first ingest, the upsert path inserts **new rows** for every already-existing table instead of updating the existing entries, creating duplicates on every re-ingestion run.
The existing-db path in `populate_db` needs to backfill `new_schema.tables_df["id"]` (and consistently, `columns_df["id"]`) from `existing_schema` before returning, so the UUIDs in the returned `schemas` match what Neo4j and LanceDB already know.
How can I resolve this? If you propose a fix, please make it concise. |
||
|
|
||
|
|
||
| def populate_db(tables_df, columns_df, database, num_workers): | ||
|
|
||
Uh oh!
There was an error while loading. Please reload this page.