feat(table/write): merge partial-update rows at flush by TheR1sing3un · Pull Request #380 · apache/paimon-rust

TheR1sing3un · 2026-06-11T17:01:55Z

Purpose

Previously, the partial-update writer kept every row of a key in the flushed data file and deferred all merging to the read side. Java's MergeTreeWriter#flushWriteBuffer instead runs the merge function over the write buffer before flushing, so a data file never holds two rows of one key. Split planning and statistics rely on that invariant: a file's physical row count equals its logical row count.
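The invariant can be phrased as a check over a key-sorted file. This is a minimal sketch that assumes rows are sorted by primary key, as they are at flush. `one_row_per_key` and `logical_row_count` are hypothetical helpers for illustration, not paimon-rust functions.

```rust
/// Hypothetical helper: true when a key-sorted file holds at most one row
/// per key, i.e. its physical row count equals its logical row count.
fn one_row_per_key<K: PartialEq>(sorted_keys: &[K]) -> bool {
    // Rows are sorted by key, so any duplicate key sits next to its twin.
    sorted_keys.windows(2).all(|pair| pair[0] != pair[1])
}

/// Hypothetical helper: logical (post-merge) row count of a key-sorted file,
/// which is the number of distinct keys.
fn logical_row_count<K: PartialEq>(sorted_keys: &[K]) -> usize {
    if sorted_keys.is_empty() {
        return 0;
    }
    // One row for the first key, plus one for every key change.
    1 + sorted_keys.windows(2).filter(|pair| pair[0] != pair[1]).count()
}
```

Because the rows are already sorted, comparing adjacent keys is enough; no hash set is needed.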

The missing invariant surfaced in a comment on #374: a single-file partial-update split marked raw-convertible reported its physical row count as exact. Because several physical rows of one key collapse into a single logical row on read, exact scan statistics inflated COUNT(*) and starved LIMIT pushdown. PR #374 works around this by keeping partial-update splits non-raw-convertible; this PR fixes the root cause so that the gating can be relaxed in a follow-up.
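To see how the overcount arises, assume (for illustration only) that COUNT(*) is answered from statistics by summing the exact row counts of raw-convertible splits, and that a split is raw-convertible when it can be read without merging. `SplitStats` and `count_from_stats` are hypothetical names, not the paimon-rust planner API.

```rust
/// Hypothetical split summary used only for this illustration.
struct SplitStats {
    /// Rows physically stored in the split's files.
    physical_rows: u64,
    /// Whether the split may be read without merging.
    raw_convertible: bool,
}

/// Answers COUNT(*) from statistics only when every split is raw-convertible,
/// trusting each physical row count as the logical one. Returns None when a
/// merging scan is required instead.
fn count_from_stats(splits: &[SplitStats]) -> Option<u64> {
    if splits.iter().all(|s| s.raw_convertible) {
        Some(splits.iter().map(|s| s.physical_rows).sum())
    } else {
        None
    }
}
```

In this sketch, a file holding three partial updates of one key reports 3 although the table has 1 logical row. Marking the split non-raw-convertible (the #374 workaround) forces the merging path, while a file merged at flush (1 physical row) makes the statistic exact again.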

Brief change log

- For merge-engine=partial-update, KeyValueFileWriter::flush merges each key group down to one row, mirroring Java's MergeTreeWriter#flushWriteBuffer and matching the semantics of the read-side PartialUpdateMergeFunction:
  - every column keeps its latest non-null value ordered by (sequence fields, system sequence); an all-null column stays null
  - the merged row carries the group's highest sequence number
  - DELETE / UPDATE_BEFORE rows are rejected, matching the read-side error
- Changelog files (changelog-producer=input) still record the pre-merge rows, matching Java's rawConsumer.
- Deduplicate / first-row flush behavior is unchanged; the key-grouping helper is now shared.
- Cross-commit merging is unchanged: files from different commits still overlap on key range and go through the sort-merge reader.
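The flush-time merge in the change log can be sketched row-by-row as follows. Everything here is hypothetical: `KvRow`, `RowKind`, `group_by_key`, `merge_group` and `flush` are illustrative names rather than the paimon-rust API, and the real writer works on record batches, not row structs. The sketch assumes the buffer can be sorted by (key, sequence field, system sequence) and that merged rows are emitted as INSERT.

```rust
/// Change kind of a buffered row.
#[allow(dead_code)]
#[derive(Clone, Copy, Debug, PartialEq)]
enum RowKind {
    Insert,
    UpdateBefore,
    UpdateAfter,
    Delete,
}

/// Simplified buffered row (hypothetical layout).
#[derive(Clone, Debug, PartialEq)]
struct KvRow {
    key: i64,
    seq_field: i64,        // user sequence field, compared first
    sequence_number: u64,  // system sequence (_SEQUENCE_NUMBER)
    kind: RowKind,
    values: Vec<Option<i64>>,
}

/// Split key-sorted rows into runs of equal keys (the shared grouping step).
fn group_by_key(rows: &[KvRow]) -> Vec<&[KvRow]> {
    let mut groups = Vec::new();
    let mut start = 0;
    for i in 1..=rows.len() {
        if i == rows.len() || rows[i].key != rows[start].key {
            groups.push(&rows[start..i]);
            start = i;
        }
    }
    groups
}

/// Merge one key group into a single partial-update row.
fn merge_group(group: &[KvRow]) -> Result<KvRow, String> {
    // Reject retractions, as the read side does.
    if let Some(r) = group
        .iter()
        .find(|r| matches!(r.kind, RowKind::Delete | RowKind::UpdateBefore))
    {
        return Err(format!("partial-update got {:?} for key {}", r.kind, r.key));
    }
    let mut ordered: Vec<&KvRow> = group.iter().collect();
    ordered.sort_by_key(|r| (r.seq_field, r.sequence_number));
    let last = *ordered.last().ok_or("empty key group")?;
    let mut merged = last.clone();
    merged.kind = RowKind::Insert; // assumption: merged rows are emitted as INSERT
    merged.values = vec![None; last.values.len()];
    for row in &ordered {
        // A later non-null value overwrites; a null never erases an earlier one.
        for (slot, value) in merged.values.iter_mut().zip(&row.values) {
            if value.is_some() {
                *slot = *value;
            }
        }
    }
    // The merged row carries the group's highest system sequence number.
    merged.sequence_number = group.iter().map(|r| r.sequence_number).max().unwrap_or(0);
    Ok(merged)
}

/// Returns (data-file rows, changelog rows): the changelog keeps every
/// pre-merge row, the data file gets exactly one merged row per key.
fn flush(mut buffer: Vec<KvRow>) -> Result<(Vec<KvRow>, Vec<KvRow>), String> {
    buffer.sort_by_key(|r| (r.key, r.seq_field, r.sequence_number));
    let data_file = group_by_key(&buffer)
        .into_iter()
        .map(merge_group)
        .collect::<Result<Vec<_>, _>>()?;
    Ok((data_file, buffer))
}
```

The highest sequence number is computed separately from the ordering because, with user sequence fields, the row that sorts last is not necessarily the one with the largest system sequence.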

Tests

- test_merge_partial_update_rows_latest_non_null_per_column: checks per-column latest-non-null selection across a key group, that an all-null column stays null, and that the merged _SEQUENCE_NUMBER is the group maximum
- test_merge_partial_update_rows_rejects_retract: DELETE rows fail at flush, as they do on the read side
- e2e test_pk_partial_update_merges_within_single_commit: three partial updates of one key in one INSERT produce a single physical row, and SELECT and COUNT(*) agree
- test_flush_merge_matches_read_side_partial_update_merge: feeds the same key groups through the flush-time merge and the read-side PartialUpdateMergeFunction and asserts identical output, locking the two implementations together. Java reuses one MergeFunction for write flush, compaction, and reads; this test provides the equivalent guarantee for the vectorized write-side implementation.
- the existing partial-update e2e test (cross-commit field-wise merge) is unchanged and passes
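The differential test in the list above can be illustrated in miniature. This sketch simplifies heavily: rows in a group arrive already ordered by a single sequence number, `StreamingPartialUpdate` loosely imitates a reset/add/get-result merge function, `batch_merge` stands in for a column-at-a-time write-side merge, and the LCG input generator is an arbitrary choice. None of these are paimon-rust names.

```rust
/// Simplified row: system sequence number plus nullable columns.
#[derive(Clone, Debug, PartialEq)]
struct Row {
    seq: u64,
    values: Vec<Option<i64>>,
}

/// Row-at-a-time merge in the style of a read-side merge function.
struct StreamingPartialUpdate {
    acc: Option<Row>,
}

impl StreamingPartialUpdate {
    fn new() -> Self {
        Self { acc: None }
    }
    fn reset(&mut self) {
        self.acc = None;
    }
    fn add(&mut self, row: &Row) {
        // First row of the group: start from an all-null accumulator.
        let acc = self.acc.get_or_insert_with(|| Row {
            seq: row.seq,
            values: vec![None; row.values.len()],
        });
        for (slot, value) in acc.values.iter_mut().zip(&row.values) {
            if value.is_some() {
                *slot = *value;
            }
        }
        acc.seq = acc.seq.max(row.seq);
    }
    fn get_result(&self) -> Option<Row> {
        self.acc.clone()
    }
}

fn streaming_merge(f: &mut StreamingPartialUpdate, group: &[Row]) -> Option<Row> {
    f.reset();
    for row in group {
        f.add(row);
    }
    f.get_result()
}

/// Column-at-a-time merge: per column, scan from the newest row backwards
/// and take the first non-null value.
fn batch_merge(group: &[Row]) -> Option<Row> {
    let width = group.first()?.values.len();
    let values = (0..width)
        .map(|c| group.iter().rev().find_map(|r| r.values[c]))
        .collect();
    let seq = group.iter().map(|r| r.seq).max()?;
    Some(Row { seq, values })
}

/// Deterministic generator so the check is reproducible.
struct Lcg(u64);

impl Lcg {
    fn next(&mut self) -> u64 {
        self.0 = self.0.wrapping_mul(6364136223846793005).wrapping_add(1442695040888963407);
        self.0 >> 33
    }
}

fn random_group(rng: &mut Lcg, width: usize) -> Vec<Row> {
    let len = 1 + (rng.next() % 4) as usize;
    let mut group = Vec::with_capacity(len);
    for seq in 0..len as u64 {
        let mut values = Vec::with_capacity(width);
        for _ in 0..width {
            // Roughly half the cells are null, exercising latest-non-null.
            values.push(if rng.next() % 2 == 0 { None } else { Some((rng.next() % 100) as i64) });
        }
        group.push(Row { seq, values });
    }
    group
}

/// Differential check: both implementations must agree on every group.
fn assert_implementations_agree(iterations: usize, seed: u64) {
    let mut rng = Lcg(seed);
    let mut streaming = StreamingPartialUpdate::new();
    for _ in 0..iterations {
        let group = random_group(&mut rng, 3);
        assert_eq!(streaming_merge(&mut streaming, &group), batch_merge(&group), "diverged on {:?}", group);
    }
}
```

Reusing one streaming instance across groups also exercises `reset`, which is where a stateful merge function most easily leaks state between keys.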

API and Format

No API change. Data files written for partial-update tables now contain one row per key per flush (same physical schema). Files written by older versions keep working because the reader still sort-merges every split.
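The compatibility argument rests on the reader never assuming one row per key. A sketch of that read path, assuming each file is sorted by (key, sequence) and using a min-heap k-way merge; `Row` and `sort_merge_read` are hypothetical, and the real reader handles batches and user sequence fields.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Simplified row (hypothetical): key, system sequence, nullable columns.
#[derive(Clone, Debug, PartialEq)]
struct Row {
    key: i64,
    seq: u64,
    values: Vec<Option<i64>>,
}

/// K-way sort-merge over key-sorted files, applying the partial-update rule
/// (latest non-null per column, max sequence) per key. It gives the same
/// answer whether a file holds one row per key or several.
fn sort_merge_read(files: &[Vec<Row>]) -> Vec<Row> {
    // Entries are (key, seq, file index, row index); Reverse gives a min-heap.
    let mut heap = BinaryHeap::new();
    for (f, file) in files.iter().enumerate() {
        if let Some(r) = file.first() {
            heap.push(Reverse((r.key, r.seq, f, 0usize)));
        }
    }
    let mut out: Vec<Row> = Vec::new();
    while let Some(Reverse((key, seq, f, i))) = heap.pop() {
        let row = &files[f][i];
        // Advance the cursor of the file just consumed.
        if let Some(next) = files[f].get(i + 1) {
            heap.push(Reverse((next.key, next.seq, f, i + 1)));
        }
        let same_key = out.last().map_or(false, |acc| acc.key == key);
        if same_key {
            let acc = out.last_mut().expect("checked non-empty");
            for (slot, value) in acc.values.iter_mut().zip(&row.values) {
                if value.is_some() {
                    *slot = *value;
                }
            }
            acc.seq = acc.seq.max(seq);
        } else {
            out.push(row.clone());
        }
    }
    out
}
```

Rows of one key pop in ascending sequence order across all files, so an older file with duplicate keys and a newer file with one merged row per key combine into the same result.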

Documentation

No documentation change needed.

The partial-update writer kept every row of a key in the flushed file, deferring all merging to the read side. Java's MergeTreeWriter runs the merge function over the write buffer before flushing, so a data file never holds two rows of one key — an invariant split planning and statistics can rely on (a file's physical row count equals its logical row count). Mirror that: at flush, group sorted rows by primary key and emit one row per group with the same semantics as the read-side PartialUpdateMergeFunction — every column keeps its latest non-null value ordered by (sequence fields, system sequence), the merged row carries the group's highest sequence number, and retract rows are rejected. Changelog files (input producer) still record the pre-merge rows, matching Java's rawConsumer. Cross-commit merging is unchanged: files from different commits still overlap on key range and go through the sort-merge reader.

…tics Java reuses one MergeFunction across write flush, compaction, and reads, giving engine semantics a single source of truth. The Rust write side is a vectorized re-implementation of the read-side streaming merge, so feed the same key groups through merge_partial_update_rows and the read-side PartialUpdateMergeFunction and assert identical output, preventing the two implementations from drifting.


TheR1sing3un added 3 commits June 12, 2026 00:48

Merge remote-tracking branch 'origin/main' into feat/pu-writer-merge-…

abb0d61

…on-flush # Conflicts: # crates/paimon/src/table/kv_file_writer.rs

TheR1sing3un wants to merge 3 commits into apache:main from TheR1sing3un:feat/pu-writer-merge-on-flush