[core] Validate lookup hits against the current deletion vector in LookupLevels#8206
[core] Validate lookup hits against the current deletion vector in LookupLevels#8206yugeeklab wants to merge 1 commit into
Conversation
…okupLevels Lookup files are cached per data file name and freeze the deletion state of their build time. Data files are immutable but their deletion vectors are not, so a cached lookup file can keep serving a row that has since been marked deleted - and with deletion vectors enabled, delete records are dropped from compaction output at any non-zero output level (MergeTreeCompactManager), so no tombstone remains in the levels to shadow the stale hit. Such a stale hit corrupts every consumer of the lookup. In particular, the lookup changelog producer uses it as the changelog BEFORE image: a re-insert with content identical to the pre-delete row (modulo changelog-producer.row-deduplicate-ignore-fields) is judged "no change" and produces no changelog, although a DELETE changelog was already emitted by an earlier compaction. Downstream CDC consumers end up permanently diverged: the table holds a live row while the changelog stream says it was deleted. Validate the hit's position against the current deletion vector before returning it. A deleted hit means the newest version of the key in the searched levels is gone; deeper levels only hold older versions, so the key is reported as absent rather than continuing the search. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
| @Nullable | ||
| public T lookup(InternalRow key, int startLevel) throws IOException { | ||
| return LookupUtils.lookup(levels, key, startLevel, this::lookup, this::lookupLevel0); | ||
| T result = LookupUtils.lookup(levels, key, startLevel, this::lookup, this::lookupLevel0); |
There was a problem hiding this comment.
This check only covers the situation where 'deleted rows are hit in the cache lookup file', but when the cold cache/cache is retired, it may still retrieve older versions at a deeper level. Create SstFileFromDataMile will use a reader with dvFactory to build a lookup file, and the reader will first filter deleted rows using DV; So the latest record at the current level is filtered and returns null, and LookupUtils.lookup continues to search the next layer, possibly returning an old version. This is semantically inconsistent with the comment 'deleted hit means newest version is gone; must not continue deeper'. Suggest not applying DV first when building lookup SST. Keep the position and have LookupLevels verify and short-circuit it, or explicitly distinguish between "hit but deleted by DV" and "current file has no key"
| bfGenerator(options), | ||
| lookupFileCache); | ||
| lookupFileCache, | ||
| // TODO pass DeletionVector factory (see reader factory above) |
There was a problem hiding this comment.
LocalTableQuery still passes DeletionVector. mptyFactory() to both reader and LookupLevels, and uses PersistValueProcessor with no position information available for new validation. Long term cache paths such as Flink lookup service/partial lookup may still return deleted cached values even after DV updates the same data file. The TODO here is not a small issue. The PR title says that fixing LookupLevels respects the current DV, but this public lookup path still completely disregards DV.
Purpose
Linked issue: close #8204
With deletion vectors enabled, delete records are dropped from compaction output at any non-zero output level, so the deletion of a key only lives in the deletion vector of the file holding the old row.
LookupLevelscaches lookup files per data file name and freezes the deletion state of build time — data files are immutable but their deletion vectors are not — so a cached lookup file can keep serving a row that has since been marked deleted.Such a stale hit corrupts every consumer of the lookup. In particular the lookup changelog producer uses it as the changelog BEFORE image: a re-insert with content identical to the pre-delete row (modulo
changelog-producer.row-deduplicate-ignore-fields) is judged "no change" and produces no changelog, although a-Dwas already emitted by an earlier compaction. Downstream CDC consumers end up permanently diverged: the table holds a live row while the changelog stream says it was deleted.This PR validates the hit's position against the current deletion vector before returning it from
LookupLevels.lookup. A deleted hit means the newest version of the key in the searched levels is gone; deeper levels only hold older versions, so the key is reported as absent rather than continuing the search. Hits without position information (value-only processors) keep the previous behaviour.Tests
LookupLevelsTest#testLookupRespectsDeletionVectorUpdatesexercises the real lookup-file cache:The test fails without the fix and passes with it. Full paimon-core suite: 0 failures (remaining errors are environmental — docker-dependent and a pre-existing JDK/Hadoop
Subject.getSubjectincompatibility, identical on master).🤖 Generated with Claude Code