feat(table/write): merge partial-update rows at flush#380

Open
TheR1sing3un wants to merge 3 commits into apache:main from TheR1sing3un:feat/pu-writer-merge-on-flush

TheR1sing3un (Member) commented Jun 11, 2026

Purpose

The partial-update writer kept every row of a key in the flushed data file, deferring all merging to the read side. Java's MergeTreeWriter#flushWriteBuffer runs the merge function over the write buffer before flushing, so a data file never holds two rows of one key — an invariant that split planning and statistics rely on (a file's physical row count equals its logical row count).

The missing invariant surfaced in #374 (comment): a single-file partial-update split marked raw convertible reported its physical row count as exact, inflating COUNT(*) through exact scan statistics and starving LIMIT pushdown. PR #374 works around it by keeping partial-update splits non-raw-convertible; this PR fixes the root cause so that gating can be relaxed in a follow-up.
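
To make the invariant concrete, here is a minimal sketch of the difference between a file's physical and logical row counts. It is only an illustration: `logical_row_count` and `physical_count_is_exact` are hypothetical helpers, keys are simplified to `i64`, and nothing here is taken from the planner or the statistics code.

```rust
/// Number of distinct keys in a file whose rows are sorted by primary key.
/// Hypothetical helper for illustration; keys are simplified to i64.
pub fn logical_row_count(sorted_keys: &[i64]) -> usize {
    let mut count = 0;
    let mut prev: Option<i64> = None;
    for &key in sorted_keys {
        // A new key group starts whenever the key differs from the previous row.
        if prev != Some(key) {
            count += 1;
            prev = Some(key);
        }
    }
    count
}

/// The physical row count may be reported as exact only when no key repeats.
pub fn physical_count_is_exact(sorted_keys: &[i64]) -> bool {
    logical_row_count(sorted_keys) == sorted_keys.len()
}
```

Three partial updates of one key flushed unmerged give a physical count of 3 against a logical count of 1, which is the gap that inflated COUNT(*).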

Brief change log

  • KeyValueFileWriter::flush merges each key group down to one row for merge-engine=partial-update, mirroring Java MergeTreeWriter#flushWriteBuffer with the same semantics as the read-side PartialUpdateMergeFunction:
    • every column keeps its latest non-null value ordered by (sequence fields, system sequence); an all-null column stays null
    • the merged row carries the group's highest sequence number
    • DELETE / UPDATE_BEFORE rows are rejected, matching the read-side error
  • Changelog files (changelog-producer=input) still record the pre-merge rows, matching Java's rawConsumer.
  • Deduplicate / first-row flush behavior is unchanged; the key-grouping helper is now shared.
  • Cross-commit merging is unchanged: files from different commits still overlap on key range and go through the sort-merge reader.
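
The flush-time merge described above can be sketched roughly as follows. This is a simplified row-based illustration, not the code in this PR: the `Row` type and `merge_partial_update` are hypothetical, columns are plain `Option<i64>`, and user-defined sequence fields are ignored, so rows are assumed to arrive already sorted by (key, system sequence).

```rust
/// One buffered row. Hypothetical type for illustration only; the real writer
/// works on columnar batches and may also order by user-defined sequence fields.
#[derive(Debug, Clone, PartialEq)]
pub struct Row {
    pub key: i64,
    pub seq: u64,
    pub is_retract: bool, // stands in for DELETE / UPDATE_BEFORE row kinds
    pub cols: Vec<Option<i64>>,
}

/// Merge rows sorted by (key, seq) down to one row per key.
/// Assumes every row has the same number of columns.
pub fn merge_partial_update(rows: &[Row]) -> Result<Vec<Row>, String> {
    let mut out: Vec<Row> = Vec::new();
    for row in rows {
        if row.is_retract {
            return Err(format!("partial-update cannot accept retract row, key {}", row.key));
        }
        let same_key = out.last().map_or(false, |acc| acc.key == row.key);
        if same_key {
            let acc = out.last_mut().expect("checked non-empty above");
            // Later rows win per column, but only where they are non-null.
            for (dst, src) in acc.cols.iter_mut().zip(&row.cols) {
                if src.is_some() {
                    *dst = *src;
                }
            }
            // The merged row carries the highest sequence number of the group.
            acc.seq = acc.seq.max(row.seq);
        } else {
            // First row of a new key group becomes the accumulator.
            out.push(row.clone());
        }
    }
    Ok(out)
}
```

A column that is null in every row of a group is never overwritten, so it stays null in the merged row.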

Tests

  • test_merge_partial_update_rows_latest_non_null_per_column: per-column latest-non-null across a key group, all-null column stays null, merged _SEQUENCE_NUMBER is the group max
  • test_merge_partial_update_rows_rejects_retract: DELETE rows error at flush like the read side
  • e2e test_pk_partial_update_merges_within_single_commit: three partial updates of one key in one INSERT produce a single physical row — SELECT and COUNT(*) agree
  • test_flush_merge_matches_read_side_partial_update_merge: feeds the same key groups through the flush-time merge and the read-side PartialUpdateMergeFunction, asserting identical output — locks the two implementations together (Java reuses one MergeFunction for write flush, compaction, and reads; this test provides the equivalent guarantee for the vectorized write-side implementation)
  • existing partial-update e2e (cross-commit field-wise merge) unchanged and green
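
The idea behind the equivalence test can be sketched as a small differential check. Everything below is hypothetical and simplified: `StreamingMerge` stands in for a read-side streaming merge function, `columnar_merge` for a vectorized write-side one, and each key group is a slice of rows already ordered by sequence.

```rust
type Cols = Vec<Option<i64>>;

/// Streaming-style merger: rows of one key group are added one at a time.
/// Hypothetical stand-in for a read-side merge function.
#[derive(Default)]
pub struct StreamingMerge {
    acc: Option<Cols>,
}

impl StreamingMerge {
    pub fn add(&mut self, row: &Cols) {
        let acc = self.acc.get_or_insert_with(|| vec![None; row.len()]);
        // Overwrite a column only when the incoming value is non-null.
        for (dst, src) in acc.iter_mut().zip(row) {
            if src.is_some() {
                *dst = *src;
            }
        }
    }

    pub fn finish(self) -> Option<Cols> {
        self.acc
    }
}

/// Feed a whole key group through the streaming merger.
pub fn streaming_merge(group: &[Cols]) -> Option<Cols> {
    let mut merger = StreamingMerge::default();
    for row in group {
        merger.add(row);
    }
    merger.finish()
}

/// Column-at-a-time merger: for each column, take the last non-null value.
/// Hypothetical stand-in for a vectorized write-side implementation.
pub fn columnar_merge(group: &[Cols]) -> Option<Cols> {
    let width = group.first()?.len();
    Some(
        (0..width)
            .map(|c| group.iter().rev().find_map(|row| row[c]))
            .collect(),
    )
}
```

Asserting `streaming_merge(g) == columnar_merge(g)` over a set of key groups is what keeps two independently written implementations from drifting apart.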

API and Format

No API change. Data files written for partial-update tables now contain one row per key per flush (same physical schema); files written by older versions keep working — the reader still sort-merges every split.

Documentation

No documentation change needed.

The partial-update writer kept every row of a key in the flushed file,
deferring all merging to the read side. Java's MergeTreeWriter runs the
merge function over the write buffer before flushing, so a data file
never holds two rows of one key — an invariant split planning and
statistics can rely on (a file's physical row count equals its logical
row count).

Mirror that: at flush, group sorted rows by primary key and emit one row
per group with the same semantics as the read-side
PartialUpdateMergeFunction — every column keeps its latest non-null
value ordered by (sequence fields, system sequence), the merged row
carries the group's highest sequence number, and retract rows are
rejected. Changelog files (input producer) still record the pre-merge
rows, matching Java's rawConsumer.

Cross-commit merging is unchanged: files from different commits still
overlap on key range and go through the sort-merge reader.
…tics

Java reuses one MergeFunction across write flush, compaction, and reads,
giving engine semantics a single source of truth. The Rust write side is
a vectorized re-implementation of the read-side streaming merge, so feed
the same key groups through merge_partial_update_rows and the read-side
PartialUpdateMergeFunction and assert identical output, preventing the
two implementations from drifting.
…on-flush

# Conflicts:
#	crates/paimon/src/table/kv_file_writer.rs