-
Notifications
You must be signed in to change notification settings - Fork 1.3k
[core] Validate lookup hits against the current deletion vector in LookupLevels #8206
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -168,7 +168,9 @@ private void newLookupLevels(BinaryRow partition, int bucket, List<DataFileMeta> | |
| .getPathFile(), | ||
| lookupStoreFactory, | ||
| bfGenerator(options), | ||
| lookupFileCache); | ||
| lookupFileCache, | ||
| // TODO pass DeletionVector factory (see reader factory above) | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. LocalTableQuery still passes DeletionVector. mptyFactory() to both reader and LookupLevels, and uses PersistValueProcessor with no position information available for new validation. Long term cache paths such as Flink lookup service/partial lookup may still return deleted cached values even after DV updates the same data file. The TODO here is not a small issue. The PR title says that fixing LookupLevels respects the current DV, but this public lookup path still completely disregards DV. |
||
| DeletionVector.emptyFactory()); | ||
|
|
||
| // Optimization - download lookup files if already persisted to object store | ||
| // We download these files if three conditions are met | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This check only covers the situation where 'deleted rows are hit in the cache lookup file', but when the cold cache/cache is retired, it may still retrieve older versions at a deeper level. Create SstFileFromDataMile will use a reader with dvFactory to build a lookup file, and the reader will first filter deleted rows using DV; So the latest record at the current level is filtered and returns null, and LookupUtils.lookup continues to search the next layer, possibly returning an old version. This is semantically inconsistent with the comment 'deleted hit means newest version is gone; must not continue deeper'. Suggest not applying DV first when building lookup SST. Keep the position and have LookupLevels verify and short-circuit it, or explicitly distinguish between "hit but deleted by DV" and "current file has no key"