Description
Currently, when writing shuffle data with many output partitions, each input batch gets split into many small per-partition batches (e.g. 8192 rows across 200 partitions yields ~41 rows per partition). Before serialization, we use Arrow's BatchCoalescer to combine these small batches into larger ones to reduce per-batch IPC overhead.
However, BatchCoalescer allocates new arrays and copies row data to produce the coalesced batch. This is unnecessary work — the data is about to be serialized anyway.
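A minimal sketch of the current path, with `Vec<u8>` standing in for an Arrow column buffer (illustrative names, not the actual arrow-rs `BatchCoalescer` API): row data is copied once into the coalesced batch, and again when that batch is IPC-serialized.

```rust
/// Stand-in for one column buffer of a small per-partition batch.
type Buffer = Vec<u8>;

/// Stand-in for IPC encoding of a batch body (the real writer also emits
/// a message header; we only model the body copy here).
fn serialize(buf: &[u8]) -> Vec<u8> {
    buf.to_vec()
}

/// Current path: allocate a new buffer, copy every small batch into it,
/// then serialize the coalesced result. The copy in the loop is the
/// redundant work this issue proposes to eliminate.
fn coalesce_then_serialize(small: &[Buffer]) -> Vec<u8> {
    let total: usize = small.iter().map(|b| b.len()).sum();
    let mut coalesced = Vec::with_capacity(total);
    for b in small {
        coalesced.extend_from_slice(b); // copy #1: coalesce
    }
    serialize(&coalesced) // copy #2: IPC write
}
```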
Proposed Optimization
Explore writing multiple small batches as a single IPC record batch message by concatenating their Arrow buffers directly in the IPC writer, avoiding the intermediate copy. This would be a "coalesce-on-write" approach:
- Instead of: small batches → BatchCoalescer (alloc + copy) → large batch → IPC serialize
- Do: small batches → IPC serialize directly as one message (concatenate buffers)
This would eliminate one full data copy per batch in the shuffle write hot path.
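The proposed path can be sketched the same way (again with `Vec<u8>` as a stand-in for an Arrow buffer; this is a simplified model, not the actual IPC writer): the writer walks the small batches' buffers and writes them back-to-back into the output as one message body, so row data is copied exactly once.

```rust
/// Stand-in for one column buffer of a small per-partition batch.
type Buffer = Vec<u8>;

/// Hypothetical "coalesce-on-write": concatenate the small batches'
/// buffers directly into the output stream as a single message body.
/// A real implementation would also emit an IPC message header whose
/// buffer offsets and row count reflect the combined batches; here we
/// only model the single body copy.
fn serialize_concatenated(small: &[Buffer], out: &mut Vec<u8>) {
    for b in small {
        out.extend_from_slice(b); // the only copy of the row data
    }
}
```

Under this model the output bytes match the coalesce-then-serialize path, but the intermediate allocation and copy disappear; the real work is in computing the adjusted buffer offsets for the combined message header.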
Context
This applies to both the block-based and IPC stream shuffle formats. The BatchCoalescer is used in BufBatchWriter (block format) and in the IPC stream multi-partition write path.
Benchmark data (200 partitions, LZ4, 10M rows from TPC-H SF100 lineitem):
- Block format: 2.64M rows/s, 609 MiB output
- IPC stream format: 2.60M rows/s, 634 MiB output
Both paths use BatchCoalescer and would benefit from this optimization.