Skip to content

Shuffle memory pool accounting: peak RSS significantly exceeds configured memory limit #3821

@andygrove

Description

@andygrove

Description

When running the shuffle benchmark with a configured memory pool limit, the actual peak RSS (resident set size) significantly exceeds the configured limit. This suggests there are significant memory allocations happening outside of the memory pool accounting.

Benchmark Data

Setup: TPCH SF100 lineitem (100M rows, 16 columns), hash partitioning (200 partitions), lz4 compression, single iteration.

Memory Limit Peak RSS RSS / Limit Write Time Throughput
2 GB 3.8 GB 1.78x 43.4s 2.30M rows/s
4 GB 6.7 GB 1.56x 45.1s 2.22M rows/s
8 GB 11.0 GB 1.28x 46.4s 2.15M rows/s
16 GB 13.1 GB 0.76x 51.9s 1.93M rows/s

Observations

  1. Peak RSS exceeds the memory limit by up to 1.78x. At a 2 GB memory limit, the process uses 3.8 GB of RSS. This means nearly half the memory usage is untracked by the memory pool.

  2. Overhead outside the pool grows with the limit. Subtracting the memory limit from peak RSS: 2g has ~1.8 GB overhead, 4g has ~2.7 GB, 8g has ~3.0 GB. This suggests some internal structures scale with available memory rather than being fixed-size.

  3. Higher memory limits are slower, not faster. The 2 GB run (43.4s) is 20% faster than the 16 GB run (51.9s). Smaller pools may keep the working set more cache-friendly, while larger buffers may cause more memory pressure at flush time.

Expected Behavior

The memory pool limit should more closely bound the total process memory usage. Untracked allocations (Arrow buffers, Parquet reader state, partition writer buffers, etc.) should either be accounted for in the pool or documented as expected overhead.

Reproduction

cargo build --release --bin shuffle_bench --features shuffle-bench

/usr/bin/time -l ./native/target/release/shuffle_bench \
    --input /opt/tpch/sf100/lineitem \
    --codec lz4 \
    --partitions 200 \
    --hash-columns 0,3 \
    --memory-limit 2147483648 \
    --limit 100000000

Metadata

Metadata

Assignees

Labels

area:shuffleShuffle (JVM and native)

Type

No type

Projects

No projects

Milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions