perf: eagerly flush shuffle partitions once batch_size rows accumulated [experimental] by andygrove · Pull Request #3839 · apache/datafusion-comet

andygrove · 2026-03-30T01:45:53Z

Which issue does this PR close?

N/A - experimental performance optimization

Rationale for this change

The current multi-partition shuffle partitioner buffers all input batches in memory until `shuffle_write()` is called at the end. This requires large off-heap memory allocations, especially at higher scale factors.

This PR modifies the multi-partition partitioner to eagerly flush a partition's accumulated row indices into compressed IPC output (via the existing spill infrastructure) as soon as that partition reaches `batch_size` rows. This bounds the index memory held per partition while keeping the output format unchanged and, in the Q1 benchmark reported under "How are these changes tested?", performance on par with the current approach.

What changes are included in this PR?

- In `buffer_partitioned_batch_may_spill`, after accumulating row indices, check whether any partition has reached `batch_size` rows and flush it immediately via `flush_partition()`
- New `flush_partition()` method that creates a `PartitionedBatchIterator` for a single partition, spills the produced batches to disk, and clears the indices
- Made `PartitionedBatchIterator::new` visible to sibling modules
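The eager-flush check described above can be illustrated with a minimal, simplified model. This is a sketch under stated assumptions, not the code in this PR: `EagerPartitioner`, `buffer_batch`, `pending`, and `flushed_chunks` are hypothetical names, the "spill" is an in-memory `Vec` rather than compressed IPC on disk, and the buffered input batches that the indices point into are not modelled at all.

```rust
/// Hypothetical model of a multi-partition partitioner that flushes a
/// partition as soon as it holds at least `batch_size` row indices.
/// Not Comet's types: the real code works on Arrow record batches and
/// writes compressed IPC through the spill path.
pub struct EagerPartitioner {
    batch_size: usize,
    /// Sequence number assigned to the next input batch.
    next_batch: u32,
    /// Pending (batch index, row index) pairs, one Vec per partition.
    indices: Vec<Vec<(u32, u32)>>,
    /// Stand-in for spill files: each inner Vec is one flushed chunk.
    flushed: Vec<Vec<Vec<(u32, u32)>>>,
}

impl EagerPartitioner {
    pub fn new(num_partitions: usize, batch_size: usize) -> Self {
        assert!(batch_size > 0, "batch_size must be positive");
        Self {
            batch_size,
            next_batch: 0,
            indices: vec![Vec::new(); num_partitions],
            flushed: vec![Vec::new(); num_partitions],
        }
    }

    /// Records the target partition of every row in one input batch, then
    /// flushes each partition that has reached the threshold.
    pub fn buffer_batch(&mut self, partition_ids: &[usize]) {
        let batch = self.next_batch;
        self.next_batch += 1;
        for (row, &p) in partition_ids.iter().enumerate() {
            self.indices[p].push((batch, row as u32));
        }
        for p in 0..self.indices.len() {
            if self.indices[p].len() >= self.batch_size {
                self.flush_partition(p);
            }
        }
    }

    /// Moves one partition's pending indices into its "spill" and leaves
    /// the in-memory index buffer empty.
    fn flush_partition(&mut self, p: usize) {
        let chunk = std::mem::take(&mut self.indices[p]);
        self.flushed[p].push(chunk);
    }

    /// Number of indices still buffered in memory for partition `p`.
    pub fn pending(&self, p: usize) -> usize {
        self.indices[p].len()
    }

    /// Number of chunks flushed so far for partition `p`.
    pub fn flushed_chunks(&self, p: usize) -> usize {
        self.flushed[p].len()
    }
}
```

In this model a partition's pending indices are cleared whenever it crosses the threshold, so between calls to `buffer_batch` no partition holds `batch_size` or more indices. A single large input batch can still push a partition past `batch_size` before the check runs.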

How are these changes tested?

Benchmarked on TPC-H SF1000 (Q1):

- Baseline (main): 65.7s
- This PR: 63.9s (no regression)
- For comparison, the immediate-mode partitioner (#3838, "feat: add immediate-mode shuffle partitioner [experimental]"): 111.4s

Full TPC-H SF1000 run in progress.

Add a `shuffle_bench` binary that benchmarks shuffle write and read performance independently from Spark, making it easy to profile with tools like `cargo flamegraph`, `perf`, or `instruments`. Supports reading Parquet files (e.g. TPC-H/TPC-DS) or generating synthetic data with configurable schema. Covers different scenarios including compression codecs, partition counts, partitioning schemes, and memory-constrained spilling.

…arquet

- Add `spark.comet.exec.shuffle.maxBufferedBatches` config to limit the number of batches buffered before spilling, allowing earlier spilling to reduce peak memory usage on executors
- Fix too-many-open-files: close spill file FD after each spill and reopen in append mode, rather than holding one FD open per partition
- Refactor shuffle_bench to stream directly from Parquet instead of loading all input data into memory; remove synthetic data generation
- Add `--max-buffered-batches` CLI arg to shuffle_bench
- Add shuffle benchmark documentation to README
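The too-many-open-files fix in the list above amounts to keeping only a path per partition and opening the file just for the duration of each spill. A minimal sketch of that pattern with the standard library follows; `SpillFile` and its methods are hypothetical names chosen for illustration and are not taken from the Comet code.

```rust
use std::fs::OpenOptions;
use std::io::{self, Write};
use std::path::{Path, PathBuf};

/// Hypothetical per-partition spill handle that stores only the path,
/// so no file descriptor is held open between spills.
pub struct SpillFile {
    path: PathBuf,
}

impl SpillFile {
    pub fn new(path: impl Into<PathBuf>) -> Self {
        Self { path: path.into() }
    }

    /// Opens the file in append mode (creating it on first use), writes
    /// the bytes, and closes the descriptor when `file` is dropped at the
    /// end of this function.
    pub fn spill(&self, bytes: &[u8]) -> io::Result<()> {
        let mut file = OpenOptions::new()
            .create(true)
            .append(true)
            .open(&self.path)?;
        file.write_all(bytes)?;
        file.flush()
    }

    pub fn path(&self) -> &Path {
        &self.path
    }
}
```

The trade-off is one extra `open` call per spill in exchange for a descriptor count that no longer grows with the number of partitions.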

Merge latest from apache/main, resolve conflicts, and strip out COMET_SHUFFLE_MAX_BUFFERED_BATCHES config and all related plumbing. This branch now only adds the shuffle benchmark binary.

Spawns N parallel shuffle tasks to simulate executor parallelism. Each task reads the same input and writes to its own output files. Extracts core shuffle logic into shared async helper to avoid code duplication between single and concurrent paths.
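The shape of this change, one shared helper used by both the single-task and the N-task path, can be sketched as follows. This is an assumption-laden illustration: the commit describes an async helper, whereas the sketch swaps in scoped OS threads from the standard library to stay dependency-free, and `run_shuffle_task`, `run_single`, `run_concurrent`, and the output file names are all hypothetical.

```rust
use std::path::{Path, PathBuf};
use std::thread;

/// Hypothetical shared helper for one shuffle task. A real task would read
/// `input` and write shuffle output; here it only derives the per-task
/// output paths so the control flow stays visible.
fn run_shuffle_task(input: &Path, output_dir: &Path, task_id: usize) -> (PathBuf, PathBuf) {
    let _ = input; // every task reads the same input
    (
        output_dir.join(format!("data_{task_id}.out")),
        output_dir.join(format!("index_{task_id}.out")),
    )
}

/// Single-task path: calls the shared helper once.
pub fn run_single(input: &Path, output_dir: &Path) -> (PathBuf, PathBuf) {
    run_shuffle_task(input, output_dir, 0)
}

/// Concurrent path: spawns `n` scoped threads that each call the same
/// helper with a distinct task id, then collects results in task order.
pub fn run_concurrent(input: &Path, output_dir: &Path, n: usize) -> Vec<(PathBuf, PathBuf)> {
    thread::scope(|s| {
        let handles: Vec<_> = (0..n)
            .map(|task_id| s.spawn(move || run_shuffle_task(input, output_dir, task_id)))
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("shuffle task panicked"))
            .collect()
    })
}
```

Because each task derives its output paths from its own `task_id`, the tasks never write to the same file even though they share the input.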

andygrove added 13 commits March 21, 2026 07:43

feat: add --limit option to shuffle benchmark (default 1M rows)

9b5b305

perf: apply limit during parquet read to avoid scanning all files

e1ab490

feat: move shuffle_bench binary into shuffle crate

b7682f4

chore: add comment explaining parquet/rand deps in shuffle crate

ca36cbd

Merge remote-tracking branch 'apache/main' into shuffle-bench-binary

7225afd

merge apache/main, remove max_buffered_batches changes

16ce30f

cargo fmt

2ef57e7

prettier

9136e10

machete

7e16819

perf: eagerly flush partitions once batch_size rows accumulated

66c8480

andygrove closed this Mar 30, 2026

andygrove wants to merge 13 commits into apache:main from andygrove:eager-flush
