Skip to content

[Bug] Lookup changelog producer with deletion vectors can permanently diverge CDC consumers: re-insert after delete produces no changelog #8204

@yugeeklab

Description

@yugeeklab

Search before asking

  • I searched in the issues and found nothing similar.

Paimon version

master (1.5-SNAPSHOT), also present in release-1.4

Compute Engine

Observed with Flink writer + Spark streaming consumer, but the bug is engine-independent (paimon-core compaction path).

Minimal reproduce step

Unit test in the linked PR reproduces it deterministically: a lookup hit whose position is already marked deleted in the current deletion vector is used as the changelog BEFORE image; a re-insert with identical content then produces no changelog at all (fails without the fix, passes with it).

Conceptual sequence on a table with changelog-producer = lookup, deletion-vectors.enabled = true, changelog-producer.row-deduplicate = true:

  1. Key K exists with content C (row in a high-level file F, alive).
  2. K is deleted. Compaction window 1 emits -D (correct), marks F's row position in the deletion vector, and drops the delete record from the output (with DVs enabled dropDelete is true for any non-zero output level, see MergeTreeCompactManager).
  3. K is re-created with the same content C (only fields listed in row-deduplicate-ignore-fields differ).
  4. Compaction window 2: pickHighLevel finds nothing (the tombstone was dropped). The lookup is served by a cached lookup file built before the DV update (LookupLevels caches per data file name; data files are immutable, so the cache is never rebuilt and the only invalidation hook is file drop). It returns the pre-delete row C as BEFORE.
  5. LookupChangelogMergeFunctionWrapper#setChangelog: BEFORE=C (add), AFTER=C (add), valueEqualiser.equals is true → no changelog emitted.

Net changelog stream for K: ... , -D while the table holds a live row — downstream CDC consumers are permanently diverged and there is no later event that repairs them.

What doesn't meet your expectations?

The re-insert in step 3 must emit +I (or -U/+U), because a -D was already emitted for the same key in an earlier compaction. Row-level dedup compares against the stale pre-delete row instead of the current (deleted) state.

Observed in production on a ~300k-key table with periodic delete/re-create churn: a steady drip of keys whose changelog ends with -D while the table row is alive. Verified by reading $audit_log with incremental-between over the suspect window together with a control rowkind count (control non-empty, suspect keys zero events).

Anything else?

Proposed fix (PR follows): validate lookup hits against the current deletion vector before returning them from LookupLevels — positions are already available via PositionedKeyValue/FilePosition. A deleted hit is reported as absent; deeper levels only hold older versions of the key, so continuing the search would be wrong as well.

Related: LocalTableQuery carries a // TODO pass DeletionVector factory, the same integration gap on the read path.

Are you willing to submit a PR?

  • I'm willing to submit a PR!

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions