-
Notifications
You must be signed in to change notification settings - Fork 296
Description
Description
When running the shuffle benchmark with a configured memory pool limit, the actual peak RSS (resident set size) significantly exceeds the configured limit. This suggests there are significant memory allocations happening outside of the memory pool accounting.
Benchmark Data
Setup: TPCH SF100 lineitem (100M rows, 16 columns), hash partitioning (200 partitions), lz4 compression, single iteration.
| Memory Limit | Peak RSS | RSS / Limit | Write Time | Throughput |
|---|---|---|---|---|
| 2 GB | 3.8 GB | 1.78x | 43.4s | 2.30M rows/s |
| 4 GB | 6.7 GB | 1.56x | 45.1s | 2.22M rows/s |
| 8 GB | 11.0 GB | 1.28x | 46.4s | 2.15M rows/s |
| 16 GB | 13.1 GB | 0.76x | 51.9s | 1.93M rows/s |
Observations
-
Peak RSS exceeds the memory limit by up to 1.78x. At a 2 GB memory limit, the process uses 3.8 GB of RSS. This means nearly half the memory usage is untracked by the memory pool.
-
Overhead outside the pool grows with the limit. Subtracting the memory limit from peak RSS: 2g has ~1.8 GB overhead, 4g has ~2.7 GB, 8g has ~3.0 GB. This suggests some internal structures scale with available memory rather than being fixed-size.
-
Higher memory limits are slower, not faster. The 2 GB run (43.4s) is 20% faster than the 16 GB run (51.9s). Smaller pools may keep the working set more cache-friendly, while larger buffers may cause more memory pressure at flush time.
Expected Behavior
The memory pool limit should more closely bound the total process memory usage. Untracked allocations (Arrow buffers, Parquet reader state, partition writer buffers, etc.) should either be accounted for in the pool or documented as expected overhead.
Reproduction
cargo build --release --bin shuffle_bench --features shuffle-bench
/usr/bin/time -l ./native/target/release/shuffle_bench \
--input /opt/tpch/sf100/lineitem \
--codec lz4 \
--partitions 200 \
--hash-columns 0,3 \
--memory-limit 2147483648 \
--limit 100000000