
fix: improve memory accounting in shuffle write [WIP]#3824

Draft
andygrove wants to merge 12 commits into apache:main from andygrove:double-count-memory

Conversation


@andygrove andygrove commented Mar 27, 2026

Which issue does this PR close?

Closes #3821.

Rationale for this change

The shuffle writer's memory reservation does not reflect how much memory is actually used: significant allocations in the write path go untracked.

These untracked allocations cause actual RSS to significantly exceed the configured memory pool limit, because the pool doesn't know about them when making spill decisions.

What changes are included in this PR?

Apply a 2x multiplier to mem_growth when reserving memory in buffer_partitioned_batch_may_spill. This triggers spills earlier, keeping total process memory closer to the configured limit.
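A minimal sketch of the idea, in standalone form. The `Reservation` struct and `buffer_batch_may_spill` helper here are illustrative stand-ins, not Comet's actual types; only the function name `buffer_partitioned_batch_may_spill` and the 2x multiplier come from this PR.

```rust
/// Multiplier applied to observed buffer growth to account for untracked
/// allocations in the write path (illustrative constant, per this PR's change).
const MEMORY_OVERHEAD_MULTIPLIER: usize = 2;

/// Simplified stand-in for a memory-pool reservation.
struct Reservation {
    reserved: usize,
    limit: usize,
}

impl Reservation {
    /// Try to grow the reservation; `false` means the caller should spill
    /// before buffering more batches.
    fn try_grow(&mut self, bytes: usize) -> bool {
        if self.reserved + bytes > self.limit {
            false
        } else {
            self.reserved += bytes;
            true
        }
    }
}

/// Sketch of the reservation step inside something like
/// `buffer_partitioned_batch_may_spill`: reserve 2x the measured growth so
/// spills trigger earlier and RSS stays closer to the configured pool limit.
fn buffer_batch_may_spill(res: &mut Reservation, mem_growth: usize) -> bool {
    res.try_grow(mem_growth * MEMORY_OVERHEAD_MULTIPLIER)
}
```

With a 100-byte limit, a batch whose measured growth is 40 bytes reserves 80; a second such batch fails the reservation and would spill, even though the raw (1x) accounting would have admitted it.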

How are these changes tested?

Benchmarked with shuffle_bench on TPC-H SF100 lineitem (100M rows, 16 columns, hash partitioning with 200 partitions, lz4 compression) at memory limits of 2 GB, 4 GB, 8 GB, and 16 GB.

**Before**

| Memory Limit | Peak RSS | RSS / Limit | Write Time | Throughput |
| --- | --- | --- | --- | --- |
| 2 GB | 3.8 GB | 1.78x | 43.4s | 2.30M rows/s |
| 4 GB | 6.7 GB | 1.56x | 45.1s | 2.22M rows/s |
| 8 GB | 11.0 GB | 1.28x | 46.4s | 2.15M rows/s |
| 16 GB | 13.1 GB | 0.76x | 51.9s | 1.93M rows/s |

**After (2x multiplier)**

| Memory Limit | Peak RSS | RSS / Limit | Write Time | Throughput |
| --- | --- | --- | --- | --- |
| 2 GB | 2.8 GB | 1.37x | 43.8s | 2.34M rows/s |
| 4 GB | 4.0 GB | 0.94x | 43.2s | 2.33M rows/s |
| 8 GB | 6.7 GB | 0.78x | 45.7s | 2.20M rows/s |
| 16 GB | 10.8 GB | 0.63x | 47.9s | 2.10M rows/s |

Peak RSS is reduced by roughly 18-40% across the four configurations while throughput is equal or better. The 4 GB case now stays within the configured limit (0.94x). All existing tests pass.
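The per-configuration reductions implied by the two tables can be double-checked with a quick calculation (values in GB, copied from the tables above):

```rust
// Percentage reduction in peak RSS between the before/after runs.
fn reduction_pct(before_gb: f64, after_gb: f64) -> f64 {
    (before_gb - after_gb) / before_gb * 100.0
}
// 2 GB limit:  reduction_pct(3.8, 2.8)   ≈ 26.3%
// 4 GB limit:  reduction_pct(6.7, 4.0)   ≈ 40.3%
// 8 GB limit:  reduction_pct(11.0, 6.7)  ≈ 39.1%
// 16 GB limit: reduction_pct(13.1, 10.8) ≈ 17.6%
```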

Add a `shuffle_bench` binary that benchmarks shuffle write and read
performance independently from Spark, making it easy to profile with
tools like `cargo flamegraph`, `perf`, or `instruments`.

Supports reading Parquet files (e.g. TPC-H/TPC-DS) or generating
synthetic data with configurable schema. Covers different scenarios
including compression codecs, partition counts, partitioning schemes,
and memory-constrained spilling.
…arquet

- Add `spark.comet.exec.shuffle.maxBufferedBatches` config to limit
  the number of batches buffered before spilling, allowing earlier
  spilling to reduce peak memory usage on executors
- Fix too-many-open-files: close spill file FD after each spill and
  reopen in append mode, rather than holding one FD open per partition
- Refactor shuffle_bench to stream directly from Parquet instead of
  loading all input data into memory; remove synthetic data generation
- Add --max-buffered-batches CLI arg to shuffle_bench
- Add shuffle benchmark documentation to README
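The too-many-open-files fix described above can be sketched as follows. This is a hedged illustration of the pattern (close the spill file's descriptor after each spill, reopen in append mode next time) using only the standard library; `append_spill` is a hypothetical helper, not Comet's actual code.

```rust
use std::fs::OpenOptions;
use std::io::Write;
use std::path::Path;

/// Append spill data to `path`, holding the file descriptor only for the
/// duration of this spill rather than keeping one FD open per partition.
fn append_spill(path: &Path, bytes: &[u8]) -> std::io::Result<()> {
    // `create(true)` handles the first spill; `append(true)` makes later
    // spills add to the end instead of truncating earlier data.
    let mut f = OpenOptions::new().create(true).append(true).open(path)?;
    f.write_all(bytes)?;
    Ok(())
    // `f` is dropped here, closing the descriptor until the next spill.
}
```

The trade-off is one extra open/close syscall pair per spill in exchange for a bounded FD count, which matters once partition counts reach the hundreds.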
Merge latest from apache/main, resolve conflicts, and strip out
COMET_SHUFFLE_MAX_BUFFERED_BATCHES config and all related plumbing.
This branch now only adds the shuffle benchmark binary.
The shuffle writer's memory reservation only tracks buffered RecordBatches
and partition indices, but significant untracked allocations occur in the
write path: BufBatchWriter buffers, BatchCoalescer state, Arrow IPC
serialization, compression encoder state, and interleave_record_batch
output. This caused actual RSS to be 1.5-1.8x the configured memory limit.

Apply a 2x multiplier to the reservation growth to trigger spills earlier,
keeping total process memory closer to the configured limit.

Closes apache#3821
@andygrove

I plan to run benchmarks at scale next week and experiment with different amounts of off-heap memory. Leaving as draft until then.

@andygrove changed the title from "fix: apply 2x memory accounting multiplier for shuffle write overhead" to "fix: apply 2x memory accounting multiplier for shuffle write overhead [WIP]" on Mar 28, 2026
@andygrove changed the title from "fix: apply 2x memory accounting multiplier for shuffle write overhead [WIP]" to "fix: improve memory accounting in shuffle write [WIP]" on Mar 29, 2026


Development

Successfully merging this pull request may close these issues.

Shuffle memory pool accounting: peak RSS significantly exceeds configured memory limit
