Search before asking
Paimon version
master (1.5-SNAPSHOT), also present in release-1.4
Compute Engine
Observed with Flink writer + Spark streaming consumer, but the bug is engine-independent (paimon-core compaction path).
Minimal reproduce step
Unit test in the linked PR reproduces it deterministically: a lookup hit whose position is already marked deleted in the current deletion vector is used as the changelog BEFORE image; a re-insert with identical content then produces no changelog at all (fails without the fix, passes with it).
Conceptual sequence on a table with changelog-producer = lookup, deletion-vectors.enabled = true, changelog-producer.row-deduplicate = true:
- Key K exists with content C (row in a high-level file F, alive).
- K is deleted. Compaction window 1 emits
-D (correct), marks F's row position in the deletion vector, and drops the delete record from the output (with DVs enabled dropDelete is true for any non-zero output level, see MergeTreeCompactManager).
- K is re-created with the same content C (only fields listed in
row-deduplicate-ignore-fields differ).
- Compaction window 2:
pickHighLevel finds nothing (the tombstone was dropped). The lookup is served by a cached lookup file built before the DV update (LookupLevels caches per data file name; data files are immutable, so the cache is never rebuilt and the only invalidation hook is file drop). It returns the pre-delete row C as BEFORE.
LookupChangelogMergeFunctionWrapper#setChangelog: BEFORE=C (add), AFTER=C (add), valueEqualiser.equals is true → no changelog emitted.
Net changelog stream for K: ... , -D while the table holds a live row — downstream CDC consumers are permanently diverged and there is no later event that repairs them.
What doesn't meet your expectations?
The re-insert in step 3 must emit +I (or -U/+U), because a -D was already emitted for the same key in an earlier compaction. Row-level dedup compares against the stale pre-delete row instead of the current (deleted) state.
Observed in production on a ~300k-key table with periodic delete/re-create churn: a steady drip of keys whose changelog ends with -D while the table row is alive. Verified by reading $audit_log with incremental-between over the suspect window together with a control rowkind count (control non-empty, suspect keys zero events).
Anything else?
Proposed fix (PR follows): validate lookup hits against the current deletion vector before returning them from LookupLevels — positions are already available via PositionedKeyValue/FilePosition. A deleted hit is reported as absent; deeper levels only hold older versions of the key, so continuing the search would be wrong as well.
Related: LocalTableQuery carries a // TODO pass DeletionVector factory, the same integration gap on the read path.
Are you willing to submit a PR?
Search before asking
Paimon version
master (1.5-SNAPSHOT), also present in release-1.4
Compute Engine
Observed with Flink writer + Spark streaming consumer, but the bug is engine-independent (paimon-core compaction path).
Minimal reproduce step
Unit test in the linked PR reproduces it deterministically: a lookup hit whose position is already marked deleted in the current deletion vector is used as the changelog BEFORE image; a re-insert with identical content then produces no changelog at all (fails without the fix, passes with it).
Conceptual sequence on a table with
changelog-producer = lookup,deletion-vectors.enabled = true,changelog-producer.row-deduplicate = true:-D(correct), marks F's row position in the deletion vector, and drops the delete record from the output (with DVs enableddropDeleteis true for any non-zero output level, seeMergeTreeCompactManager).row-deduplicate-ignore-fieldsdiffer).pickHighLevelfinds nothing (the tombstone was dropped). The lookup is served by a cached lookup file built before the DV update (LookupLevelscaches per data file name; data files are immutable, so the cache is never rebuilt and the only invalidation hook is file drop). It returns the pre-delete row C as BEFORE.LookupChangelogMergeFunctionWrapper#setChangelog: BEFORE=C (add), AFTER=C (add),valueEqualiser.equalsis true → no changelog emitted.Net changelog stream for K:
... , -Dwhile the table holds a live row — downstream CDC consumers are permanently diverged and there is no later event that repairs them.What doesn't meet your expectations?
The re-insert in step 3 must emit
+I(or-U/+U), because a-Dwas already emitted for the same key in an earlier compaction. Row-level dedup compares against the stale pre-delete row instead of the current (deleted) state.Observed in production on a ~300k-key table with periodic delete/re-create churn: a steady drip of keys whose changelog ends with
-Dwhile the table row is alive. Verified by reading$audit_logwithincremental-betweenover the suspect window together with a control rowkind count (control non-empty, suspect keys zero events).Anything else?
Proposed fix (PR follows): validate lookup hits against the current deletion vector before returning them from
LookupLevels— positions are already available viaPositionedKeyValue/FilePosition. A deleted hit is reported as absent; deeper levels only hold older versions of the key, so continuing the search would be wrong as well.Related:
LocalTableQuerycarries a// TODO pass DeletionVector factory, the same integration gap on the read path.Are you willing to submit a PR?