Explore zero-copy batch coalescing during shuffle write #3836

@andygrove

Description

Currently, when writing shuffle data with many output partitions, each input batch is split into many small per-partition batches (e.g. 8192 rows across 200 partitions ≈ 41 rows per partition). Before serialization, we use Arrow's BatchCoalescer to combine these small batches into larger ones, reducing per-batch IPC overhead.

However, BatchCoalescer allocates new arrays and copies row data to produce the coalesced batch. This is unnecessary work, since the data is about to be serialized anyway.

Proposed Optimization

Explore writing multiple small batches as a single IPC record batch message by concatenating their Arrow buffers directly in the IPC writer, avoiding the intermediate copy. This would be a "coalesce-on-write" approach:

  • Instead of: small batches → BatchCoalescer (alloc + copy) → large batch → IPC serialize
  • Do: small batches → IPC serialize directly as one message (concatenate buffers)

This would eliminate one full data copy per batch in the shuffle write hot path.

Context

This applies to both the block-based and IPC stream shuffle formats. The BatchCoalescer is used in BufBatchWriter (block format) and in the IPC stream multi-partition write path.

Benchmark data (200 partitions, LZ4, 10M rows from TPC-H SF100 lineitem):

  • Block format: 2.64M rows/s, 609 MiB output
  • IPC stream format: 2.60M rows/s, 634 MiB output

Both paths use BatchCoalescer and would benefit from this optimization.
