Skip to content

[core] Validate lookup hits against the current deletion vector in LookupLevels#8206

Open
yugeeklab wants to merge 1 commit into
apache:masterfrom
yugeeklab:fix/lookup-levels-respect-dv
Open

[core] Validate lookup hits against the current deletion vector in LookupLevels#8206
yugeeklab wants to merge 1 commit into
apache:masterfrom
yugeeklab:fix/lookup-levels-respect-dv

Conversation

@yugeeklab

Copy link
Copy Markdown

Purpose

Linked issue: close #8204

With deletion vectors enabled, delete records are dropped from compaction output at any non-zero output level, so the deletion of a key only lives in the deletion vector of the file holding the old row. LookupLevels caches lookup files per data file name and freezes the deletion state of build time — data files are immutable but their deletion vectors are not — so a cached lookup file can keep serving a row that has since been marked deleted.

Such a stale hit corrupts every consumer of the lookup. In particular the lookup changelog producer uses it as the changelog BEFORE image: a re-insert with content identical to the pre-delete row (modulo changelog-producer.row-deduplicate-ignore-fields) is judged "no change" and produces no changelog, although a -D was already emitted by an earlier compaction. Downstream CDC consumers end up permanently diverged: the table holds a live row while the changelog stream says it was deleted.

This PR validates the hit's position against the current deletion vector before returning it from LookupLevels.lookup. A deleted hit means the newest version of the key in the searched levels is gone; deeper levels only hold older versions, so the key is reported as absent rather than continuing the search. Hits without position information (value-only processors) keep the previous behaviour.

Tests

LookupLevelsTest#testLookupRespectsDeletionVectorUpdates exercises the real lookup-file cache:

  1. control: lookup returns the live row and warms the cache,
  2. a deletion of an unrelated position in the same file does not affect the live row,
  3. after marking the returned position deleted (cache not rebuilt), the same lookup returns null.

The test fails without the fix and passes with it. Full paimon-core suite: 0 failures (remaining errors are environmental — docker-dependent and a pre-existing JDK/Hadoop Subject.getSubject incompatibility, identical on master).

🤖 Generated with Claude Code

…okupLevels

Lookup files are cached per data file name and freeze the deletion state of
their build time. Data files are immutable but their deletion vectors are
not, so a cached lookup file can keep serving a row that has since been
marked deleted - and with deletion vectors enabled, delete records are
dropped from compaction output at any non-zero output level
(MergeTreeCompactManager), so no tombstone remains in the levels to shadow
the stale hit.

Such a stale hit corrupts every consumer of the lookup. In particular, the
lookup changelog producer uses it as the changelog BEFORE image: a re-insert
with content identical to the pre-delete row (modulo
changelog-producer.row-deduplicate-ignore-fields) is judged "no change" and
produces no changelog, although a DELETE changelog was already emitted by an
earlier compaction. Downstream CDC consumers end up permanently diverged:
the table holds a live row while the changelog stream says it was deleted.

Validate the hit's position against the current deletion vector before
returning it. A deleted hit means the newest version of the key in the
searched levels is gone; deeper levels only hold older versions, so the key
is reported as absent rather than continuing the search.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@Nullable
public T lookup(InternalRow key, int startLevel) throws IOException {
return LookupUtils.lookup(levels, key, startLevel, this::lookup, this::lookupLevel0);
T result = LookupUtils.lookup(levels, key, startLevel, this::lookup, this::lookupLevel0);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check only covers the situation where 'deleted rows are hit in the cache lookup file', but when the cold cache/cache is retired, it may still retrieve older versions at a deeper level. Create SstFileFromDataMile will use a reader with dvFactory to build a lookup file, and the reader will first filter deleted rows using DV; So the latest record at the current level is filtered and returns null, and LookupUtils.lookup continues to search the next layer, possibly returning an old version. This is semantically inconsistent with the comment 'deleted hit means newest version is gone; must not continue deeper'. Suggest not applying DV first when building lookup SST. Keep the position and have LookupLevels verify and short-circuit it, or explicitly distinguish between "hit but deleted by DV" and "current file has no key"

bfGenerator(options),
lookupFileCache);
lookupFileCache,
// TODO pass DeletionVector factory (see reader factory above)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LocalTableQuery still passes DeletionVector. mptyFactory() to both reader and LookupLevels, and uses PersistValueProcessor with no position information available for new validation. Long term cache paths such as Flink lookup service/partial lookup may still return deleted cached values even after DV updates the same data file. The TODO here is not a small issue. The PR title says that fixing LookupLevels respects the current DV, but this public lookup path still completely disregards DV.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] Lookup changelog producer with deletion vectors can permanently diverge CDC consumers: re-insert after delete produces no changelog

2 participants