feat(parquet): RowSelection can be backed by a BooleanBuffer #10141

Open
haohuaijin wants to merge 4 commits into apache:main from haohuaijin:export-mask

haohuaijin (Contributor) commented Jun 14, 2026

Which issue does this PR close?

Rationale for this change

RowSelection currently stores selections as Vec<RowSelector>, at 16 bytes per selector. This is compact for long runs, but expensive for scattered matches. With ~35% isolated single-row hits, each hit needs one select selector and roughly one skip selector (about 0.7 selectors per input row), so the selection uses about 11.2 bytes per input row. A BooleanBuffer uses 1 bit per input row, about 90x less memory in that case.

The reader can also choose the Mask strategy, which converts selectors back into a bitmap. When the caller already had a bitmap, this conversion round-trip is unnecessary.
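The two points above (memory cost and the conversion round-trip) can be illustrated with a small, self-contained sketch. It does not use the parquet crate: `RowSelector` here is a local stand-in with the same two fields, a `Vec<bool>` stands in for the bitmap, and the helper names (`mask_to_selectors`, `selectors_to_mask`, `selector_bytes`, `bitmap_bytes`) are hypothetical.

```rust
use std::mem::size_of;

/// Local stand-in mirroring the shape of parquet's `RowSelector`
/// (a run length plus a skip flag). Not the real type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct RowSelector {
    pub row_count: usize,
    pub skip: bool,
}

/// Run-length encode a per-row filter result into selectors.
pub fn mask_to_selectors(mask: &[bool]) -> Vec<RowSelector> {
    let mut out: Vec<RowSelector> = Vec::new();
    for &selected in mask {
        let skip = !selected;
        // Extend the current run when this row has the same state.
        if let Some(last) = out.last_mut() {
            if last.skip == skip {
                last.row_count += 1;
                continue;
            }
        }
        // Otherwise start a new run of length 1.
        out.push(RowSelector { row_count: 1, skip });
    }
    out
}

/// Expand selectors back into one flag per row (the "round-trip").
pub fn selectors_to_mask(selectors: &[RowSelector]) -> Vec<bool> {
    let mut mask = Vec::new();
    for s in selectors {
        mask.extend(std::iter::repeat(!s.skip).take(s.row_count));
    }
    mask
}

/// Bytes used by the selector representation.
pub fn selector_bytes(selectors: &[RowSelector]) -> usize {
    selectors.len() * size_of::<RowSelector>()
}

/// Bytes used by a packed bitmap with one bit per row.
pub fn bitmap_bytes(rows: usize) -> usize {
    (rows + 7) / 8
}
```

For a mask where every third row matches, 300 rows produce 200 selectors but only 38 bytes of bitmap, and `selectors_to_mask(&mask_to_selectors(&mask))` returns the original mask, which is the work a caller that already holds a bitmap would rather not pay for.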

What changes are included in this PR?

RowSelection can now be backed by either Vec<RowSelector> or BooleanBuffer. New public constructors:

pub fn RowSelection::from_boolean_buffer(mask: BooleanBuffer) -> Self;
impl From<BooleanBuffer> for RowSelection;

Methods that can work directly on the bitmap now do so:

  • iter() streams via BitSliceIterator
  • row_count / skipped_row_count use count_set_bits
  • selects_any uses set_indices().next()
  • trim preserves mask backing via BooleanBuffer::slice
  • intersection / union on Mask+Mask use BitAnd / BitOr
  • split_off on a mask uses BooleanBuffer::slice (O(1), both halves stay mask-backed)
  • limit slices at the selected-row boundary via find_nth_set_bit_position, staying mask-backed
  • offset finds the first selected row to keep via find_nth_set_bit_position and rebuilds only the mask buffer, avoiding selector materialization
  • and_then applies the inner selection over the mask's set positions, returning a mask-backed result
  • FromIterator<RowSelection> concatenates BooleanBuffers when every input is mask-backed
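As a rough illustration of why several of these operations can stay on the bitmap, here is a minimal sketch using only the standard library. `Mask` is a hypothetical stand-in for a mask-backed selection (it is not `BooleanBuffer`), `nth_set_bit` plays the role that `find_nth_set_bit_position` has in the list above, and `intersection` assumes equal-length inputs, whereas the PR also covers uneven lengths.

```rust
/// Hypothetical stand-in for a mask-backed selection: one bit per row,
/// packed into 64-bit words, least-significant bit first.
/// Invariant: bits at positions >= `len` are always zero.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Mask {
    words: Vec<u64>,
    len: usize, // number of rows covered
}

impl Mask {
    pub fn from_bools(bits: &[bool]) -> Self {
        let mut words = vec![0u64; (bits.len() + 63) / 64];
        for (i, &b) in bits.iter().enumerate() {
            if b {
                words[i / 64] |= 1u64 << (i % 64);
            }
        }
        Mask { words, len: bits.len() }
    }

    pub fn get(&self, i: usize) -> bool {
        i < self.len && (self.words[i / 64] >> (i % 64)) & 1 == 1
    }

    /// Number of selected rows: a popcount over the words.
    pub fn row_count(&self) -> usize {
        self.words.iter().map(|w| w.count_ones() as usize).sum()
    }

    /// Position of the n-th (0-based) selected row, if there is one.
    pub fn nth_set_bit(&self, n: usize) -> Option<usize> {
        let mut remaining = n;
        for (wi, &word) in self.words.iter().enumerate() {
            let ones = word.count_ones() as usize;
            if remaining >= ones {
                remaining -= ones; // the target is in a later word
                continue;
            }
            let mut w = word;
            for _ in 0..remaining {
                w &= w - 1; // clear the lowest set bit
            }
            return Some(wi * 64 + w.trailing_zeros() as usize);
        }
        None
    }

    /// Keep at most `limit` selected rows by cutting the mask right
    /// after the `limit`-th selected row. This sketch copies the bits;
    /// a real buffer could slice instead.
    pub fn limit(&self, limit: usize) -> Mask {
        if limit == 0 {
            return Mask { words: Vec::new(), len: 0 };
        }
        match self.nth_set_bit(limit - 1) {
            Some(pos) => {
                let bits: Vec<bool> = (0..=pos).map(|i| self.get(i)).collect();
                Mask::from_bools(&bits)
            }
            None => self.clone(), // fewer than `limit` rows are selected
        }
    }

    /// Intersection of two equal-length masks: a word-wise AND.
    pub fn intersection(&self, other: &Mask) -> Mask {
        assert_eq!(self.len, other.len, "sketch assumes equal lengths");
        let words = self
            .words
            .iter()
            .zip(&other.words)
            .map(|(a, b)| a & b)
            .collect();
        Mask { words, len: self.len }
    }
}
```

None of these operations ever builds a list of runs: counting is a popcount, limiting is a search for one bit position, and intersection is a word-wise AND.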

Mixed inputs, and existing selector-backed inputs, still use the existing selector helpers. Existing callers keep the same behavior.

The reader (ReadPlanBuilder::build) passes a mask-backed selection straight to RowSelectionCursor::new_mask_from_buffer, so it skips rebuilding the bitmap from selectors.

Are these changes tested?

Yes. This PR extends the existing RowSelection unit tests with coverage for:

  • constructing from BooleanBuffer, including empty and all-unset masks
  • From<BooleanBuffer>
  • preserving mask backing across clone, split_off, limit, offset, and_then, and all-mask FromIterator<RowSelection>
  • falling back to selector backing for mixed-backed concatenation
  • equality between equivalent selector-backed and mask-backed selections
  • mask-backed intersection / union, including uneven-length inputs
  • fuzz-style equivalence between mask-backed selections and the existing from_filters selector path

Are there any user-facing changes?

Yes, one source-breaking change, suitable for the next major release: RowSelection::iter() now yields RowSelector (Item = RowSelector) instead of &RowSelector. This is needed because a mask-backed selection does not have a Vec<RowSelector> to borrow from.

RowSelector is Copy (16 bytes), so most call sites are source-compatible:

selection.iter().map(|s| s.row_count).sum::<usize>()    // unchanged
selection.iter().filter(|s| !s.skip).count()            // unchanged
selection.iter().any(|s| !s.skip)                       // unchanged

Call sites that need updating:

// before
let v: Vec<RowSelector> = selection.iter().cloned().collect();
let v: Vec<&RowSelector> = selection.iter().collect();

// after: drop `.cloned()`; collect owned values instead of references
let v: Vec<RowSelector> = selection.iter().collect();

Within arrow-rs, the required updates are limited to parquet's own RowSelection code and tests. The other crates do not use RowSelection::iter().
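To make the reason for the `iter()` change concrete, here is a minimal sketch of a two-variant selection whose iterator yields selectors by value. The `Selection` enum, its variants, and the `Vec<bool>` mask are illustrative assumptions, not the PR's actual types, and the boxed iterator is chosen only to keep the example short.

```rust
/// Local stand-in mirroring the shape of parquet's `RowSelector`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct RowSelector {
    pub row_count: usize,
    pub skip: bool,
}

/// Hypothetical two-variant backing, loosely following the PR description.
pub enum Selection {
    Selectors(Vec<RowSelector>),
    Mask(Vec<bool>), // stand-in for a BooleanBuffer
}

impl Selection {
    /// Yields selectors by value. The `Mask` arm builds each selector on
    /// the fly, so there is no stored `RowSelector` that a reference
    /// could point to.
    pub fn iter(&self) -> Box<dyn Iterator<Item = RowSelector> + '_> {
        match self {
            // `RowSelector` is `Copy`, so copying out of the Vec is cheap.
            Selection::Selectors(v) => Box::new(v.iter().copied()),
            Selection::Mask(bits) => {
                let mut pos = 0;
                Box::new(std::iter::from_fn(move || {
                    if pos >= bits.len() {
                        return None;
                    }
                    // Consume one run of equal flags and emit it as a selector.
                    let selected = bits[pos];
                    let start = pos;
                    while pos < bits.len() && bits[pos] == selected {
                        pos += 1;
                    }
                    Some(RowSelector { row_count: pos - start, skip: !selected })
                }))
            }
        }
    }
}
```

With this shape, closures such as `|s| s.row_count` or `|s| !s.skip` behave the same for both backings, which matches the "unchanged" call sites listed above; only code that names `&RowSelector` explicitly or calls `.cloned()` has to change.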

github-actions bot added the "parquet" label (Changes to the parquet crate) Jun 14, 2026

Development

Successfully merging this pull request may close these issues.

Allow RowSelection to be backed by a BooleanBuffer
