
fix: improve memory accounting in shuffle write [WIP]#3824

Draft
andygrove wants to merge 12 commits into apache:main from andygrove:double-count-memory

Conversation


@andygrove andygrove commented Mar 27, 2026

Which issue does this PR close?

Closes #3821.

Rationale for this change

The shuffle writer's memory reservation does not reflect how much memory is actually used: significant allocations in the write path go untracked.

These untracked allocations cause actual RSS to significantly exceed the configured memory pool limit, because the pool doesn't know about them when making spill decisions.

What changes are included in this PR?

Apply a 2x multiplier to mem_growth when reserving memory in buffer_partitioned_batch_may_spill. This triggers spills earlier, keeping total process memory closer to the configured limit.
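A minimal sketch of the idea, in standalone form. The `Reservation` struct and `buffer_batch_may_spill` helper here are illustrative stand-ins, not Comet's actual types; only the function name `buffer_partitioned_batch_may_spill` and the 2x multiplier come from this PR.

```rust
/// Multiplier applied to observed buffer growth to account for untracked
/// allocations in the write path (illustrative constant, per this PR's change).
const MEMORY_OVERHEAD_MULTIPLIER: usize = 2;

/// Simplified stand-in for a memory-pool reservation.
struct Reservation {
    reserved: usize,
    limit: usize,
}

impl Reservation {
    /// Try to grow the reservation; `false` means the caller should spill
    /// before buffering more batches.
    fn try_grow(&mut self, bytes: usize) -> bool {
        if self.reserved + bytes > self.limit {
            false
        } else {
            self.reserved += bytes;
            true
        }
    }
}

/// Sketch of the reservation step inside something like
/// `buffer_partitioned_batch_may_spill`: reserve 2x the measured growth so
/// spills trigger earlier and RSS stays closer to the configured pool limit.
fn buffer_batch_may_spill(res: &mut Reservation, mem_growth: usize) -> bool {
    res.try_grow(mem_growth * MEMORY_OVERHEAD_MULTIPLIER)
}
```

With a 100-byte limit, a batch whose measured growth is 40 bytes reserves 80; a second such batch fails the reservation and would spill, even though the raw (1x) accounting would have admitted it.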

How are these changes tested?

Benchmarked with shuffle_bench on TPC-H SF100 lineitem (100M rows, 16 columns, hash partitioning with 200 partitions, lz4 compression) at memory limits of 2 GB, 4 GB, 8 GB, and 16 GB.

**Before**

| Memory Limit | Peak RSS | RSS / Limit | Write Time | Throughput |
| --- | --- | --- | --- | --- |
| 2 GB | 3.8 GB | 1.78x | 43.4s | 2.30M rows/s |
| 4 GB | 6.7 GB | 1.56x | 45.1s | 2.22M rows/s |
| 8 GB | 11.0 GB | 1.28x | 46.4s | 2.15M rows/s |
| 16 GB | 13.1 GB | 0.76x | 51.9s | 1.93M rows/s |

**After (2x multiplier)**

| Memory Limit | Peak RSS | RSS / Limit | Write Time | Throughput |
| --- | --- | --- | --- | --- |
| 2 GB | 2.8 GB | 1.37x | 43.8s | 2.34M rows/s |
| 4 GB | 4.0 GB | 0.94x | 43.2s | 2.33M rows/s |
| 8 GB | 6.7 GB | 0.78x | 45.7s | 2.20M rows/s |
| 16 GB | 10.8 GB | 0.63x | 47.9s | 2.10M rows/s |

Peak RSS is reduced by roughly 18-40% across the four configurations while throughput is equal or better. The 4 GB case now stays within the configured limit (0.94x). All existing tests pass.
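The per-configuration reductions implied by the two tables can be double-checked with a quick calculation (values in GB, copied from the tables above):

```rust
// Percentage reduction in peak RSS between the before/after runs.
fn reduction_pct(before_gb: f64, after_gb: f64) -> f64 {
    (before_gb - after_gb) / before_gb * 100.0
}
// 2 GB limit:  reduction_pct(3.8, 2.8)   ≈ 26.3%
// 4 GB limit:  reduction_pct(6.7, 4.0)   ≈ 40.3%
// 8 GB limit:  reduction_pct(11.0, 6.7)  ≈ 39.1%
// 16 GB limit: reduction_pct(13.1, 10.8) ≈ 17.6%
```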

Add a `shuffle_bench` binary that benchmarks shuffle write and read
performance independently from Spark, making it easy to profile with
tools like `cargo flamegraph`, `perf`, or `instruments`.

Supports reading Parquet files (e.g. TPC-H/TPC-DS) or generating
synthetic data with configurable schema. Covers different scenarios
including compression codecs, partition counts, partitioning schemes,
and memory-constrained spilling.
…arquet

- Add `spark.comet.exec.shuffle.maxBufferedBatches` config to limit
  the number of batches buffered before spilling, allowing earlier
  spilling to reduce peak memory usage on executors
- Fix too-many-open-files: close spill file FD after each spill and
  reopen in append mode, rather than holding one FD open per partition
- Refactor shuffle_bench to stream directly from Parquet instead of
  loading all input data into memory; remove synthetic data generation
- Add --max-buffered-batches CLI arg to shuffle_bench
- Add shuffle benchmark documentation to README
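The too-many-open-files fix described above can be sketched as follows. This is a hedged illustration of the pattern (close the spill file's descriptor after each spill, reopen in append mode next time) using only the standard library; `append_spill` is a hypothetical helper, not Comet's actual code.

```rust
use std::fs::OpenOptions;
use std::io::Write;
use std::path::Path;

/// Append spill data to `path`, holding the file descriptor only for the
/// duration of this spill rather than keeping one FD open per partition.
fn append_spill(path: &Path, bytes: &[u8]) -> std::io::Result<()> {
    // `create(true)` handles the first spill; `append(true)` makes later
    // spills add to the end instead of truncating earlier data.
    let mut f = OpenOptions::new().create(true).append(true).open(path)?;
    f.write_all(bytes)?;
    Ok(())
    // `f` is dropped here, closing the descriptor until the next spill.
}
```

The trade-off is one extra open/close syscall pair per spill in exchange for a bounded FD count, which matters once partition counts reach the hundreds.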
Merge latest from apache/main, resolve conflicts, and strip out
COMET_SHUFFLE_MAX_BUFFERED_BATCHES config and all related plumbing.
This branch now only adds the shuffle benchmark binary.
The shuffle writer's memory reservation only tracks buffered RecordBatches
and partition indices, but significant untracked allocations occur in the
write path: BufBatchWriter buffers, BatchCoalescer state, Arrow IPC
serialization, compression encoder state, and interleave_record_batch
output. This caused actual RSS to be 1.5-1.8x the configured memory limit.

Apply a 2x multiplier to the reservation growth to trigger spills earlier,
keeping total process memory closer to the configured limit.

Closes apache#3821
@andygrove

I plan to run benchmarks at scale next week and experiment with different amounts of off-heap memory. Leaving as draft until then.

@andygrove changed the title from "fix: apply 2x memory accounting multiplier for shuffle write overhead" to "fix: apply 2x memory accounting multiplier for shuffle write overhead [WIP]" on Mar 28, 2026
@andygrove changed the title from "fix: apply 2x memory accounting multiplier for shuffle write overhead [WIP]" to "fix: improve memory accounting in shuffle write [WIP]" on Mar 29, 2026


Development

Successfully merging this pull request may close these issues.

Shuffle memory pool accounting: peak RSS significantly exceeds configured memory limit
