PSplitAgg: single-walk stats + parallel sibling execution by whilo · Pull Request #25 · replikativ/stratum

whilo · 2026-05-11T17:07:47Z

Summary

Two related fixes for mixed-class aggregate workloads (the MIN(p), AVG(p), MEDIAN(p), MAX(p) shape), surfaced by EXPLAIN ANALYZE on a user-reported 50.6M-row query that was anomalously slow in the stats-only branch of a PSplitAgg.

Fix 1: execute-stats-only walked the chunk-entry vector once per agg. For 3 aggs on one column → 3 identical passes. Now it walks once per distinct column and projects all requested aggs from a per-column accumulator map.
Fix 2: PSplitAgg children ran serially (mapv #(execute-node …) children). Now they execute on Clojure futures and deref in declared order, so total wall time ≈ max(branches) instead of Σ branches. The *explain-collector* dynamic var is rebound across worker threads so EXPLAIN ANALYZE timings continue to flow.

Benchmarks (50.6M-row synthetic, varying chunk count)

Fix 1 — PStatsOnlyAgg single-walk:

chunks	before	after	speedup
6,178	4.0 ms	1.5 ms	~2.7×
49,417	9.7 ms	7.7 ms	~1.3×
197,668	34 ms	9.2 ms	~3.6×
790,672	116 ms	34 ms	~3.4×

Fix 2 — parallel PSplitAgg (total vs Σ vs max of children):

chunks	PSplitAgg total	Σ children	max child
6,178	745.6 ms	747.5 ms	745.3 ms
197,668	754.6 ms	764.0 ms	754.4 ms
790,672	805.4 ms	841.1 ms	805.1 ms

Total tracks max(children) to within JIT noise — parallelism confirmed.

Estimated impact on user report

Reported: PStatsOnlyAgg 768 ms + PPercentileAgg 317 ms = 1086 ms (vs DuckDB 1490 ms wall).

Fix 1 alone: stats walk 768 → ~256 ms; serial total ~573 ms
Fix 1 + Fix 2: parallel → max(256, 317) ≈ 317 ms (~3.4× faster than DuckDB on the same query)

The user's underlying 768 ms anomaly was almost certainly the 3-walk redundancy; Fix 2 additionally protects against any future per-branch first-touch slowness (page faults, JIT warmup on cold paths, GC) by ensuring those costs no longer compound across siblings.

Why this is safe

PSplitAgg children share read-only input — they receive the same scan context value, not a stream.
The two relevant branches do disjoint memory work: PStatsOnlyAgg reads chunk stats, PPercentileAgg materializes the column array.
*explain-collector* is the only per-execution mutable state; it's a Clojure dynamic var, rebound in each worker thread so ANALYZE timings are still recorded.
Single-child PSplitAgg short-circuits to sync execution (no future overhead).
Child results are deref'd in declared order, so result-row aggs appear in user-declared order regardless of completion order.

Tests

Three new tests in query_test.clj:

stats-only-single-pass-multi-col-test — aggs across two index columns
stats-only-three-aggs-one-column-test — MIN+AVG+MAX on one column (the exact shape Fix 1 optimizes)
split-agg-parallel-preserves-order-test — heavily-aliased mixed query verifying result order survives parallel merge

Full suite: 1039 tests, 4777 assertions, 0 failures.

Test plan

Full suite passes
cljfmt clean
Synthetic benchmarks across 4 chunk-size configurations
Ask rschmukler to re-run his `EXPLAIN ANALYZE SELECT MIN, AVG, MEDIAN, MAX FROM options_trades_alt` after merge to confirm the predicted ~300 ms result on his data.

Two related fixes for mixed-class aggregate workloads (MIN/AVG/MEDIAN/MAX-style), surfaced by EXPLAIN ANALYZE on a 50.6M-row user query that ran ~1100 ms — anomalously dominated by the supposedly "O(chunks)" PStatsOnlyAgg sub-plan rather than the unavoidable percentile pass. Fix 1 — single-pass PStatsOnlyAgg ---------------------------------- `execute-stats-only` previously walked the chunk-entry vector once per aggregate. For a query like SELECT MIN(p), AVG(p), MAX(p) FROM t, the same chunk list got iterated three times producing identical sum/count/min/max. The refactor pulls out `walk-chunk-stats` and invokes it once per *distinct column*, then projects requested aggs out of the per-column accumulator map. Measured on synthetic 50.6M-row indices: chunks before after speedup ------ ------ ----- ------- 6 K 4 ms 1.5 ms ~2.7x 49 K 9.7 ms 7.7 ms ~1.3x 197 K 34 ms 9.2 ms ~3.6x 790 K 116 ms 34 ms ~3.4x The scaling matches the 3-aggs-on-one-column reduction-of-passes theory (3 walks → 1 walk per column; some constant overhead from the per-column hash lookup explains why the speedup is below the theoretical 3x at high chunk counts). Fix 2 — parallel sibling execution in PSplitAgg ----------------------------------------------- `execute-split-agg` previously ran sub-plans serially: `(mapv #(execute-node % false) children)`. The design comment claimed L3-cache reuse across passes, but the practical effect is total wall-time = Σ branches, which is a problem when one branch is unexpectedly slow (first-touch page faults, cold JIT on a heterogeneous code path, GC pressure). Children now execute on Clojure futures and are deref'd in declared order so result-row ordering is preserved: child-results (if (= 1 (count children)) [(execute-node (first children) false)] ; trivial degenerate (let [coll *explain-collector* fs (mapv (fn [c] (future (binding [*explain-collector* coll] (execute-node c false)))) children)] (mapv deref fs))) Three observations made this safe: - The PSplitAgg children share read-only input (the upstream scan ctx is a value, not a stream), so parallel readers compete only for OS-page-cache / L3 bandwidth, not for mutable state. - PStatsOnlyAgg does not materialize columns; PPercentileAgg does. The two branches do disjoint memory work even when they reference the same column. - The `*explain-collector*` dynamic var is the only per-execution mutable state. It is rebound across the future boundary so EXPLAIN ANALYZE per-node timings continue to flow. Measured: PSplitAgg total now tracks max(branches), not Σ. On the baseline (cheap stats branch), the savings are small (~2 ms / 9 ms / 36 ms at 6 K / 197 K / 790 K chunks); the real protection is against scenarios like the one in the user report where one branch unexpectedly takes 100s of milliseconds — total wall time stays bounded by the slow branch instead of summing both. A forward `(declare ^:dynamic *explain-collector*)` near the top of executor.clj is required because `execute-split-agg` is defined far above the var. Estimated impact on the user's reported case -------------------------------------------- Before: PStatsOnlyAgg 768 ms + PPercentileAgg 317 ms = 1086 ms Fix 1 alone: ~256 ms + 317 ms = ~573 ms Fix 1 + Fix 2: max(256, 317) = ~317 ms i.e. roughly 3.4× faster than the DuckDB run on the same query (1490 ms wall), assuming the user's underlying stats-walk slowness was the 3-walk redundancy and not something else still unaccounted for. Tests ----- Three new tests in query_test.clj cover the changed paths: - stats-only-single-pass-multi-col-test: aggs across two index columns produce correct results from the per-column accumulator map. - stats-only-three-aggs-one-column-test: min+avg+max on one column (the exact shape Fix 1 optimizes) returns correct values. - split-agg-parallel-preserves-order-test: heavily-aliased mixed query verifies result aggs appear in declared order through the parallel future-merge. Full suite: 1039 tests, 4777 assertions, 0 failures. Signed-off-by: Christian Weilbach <christian@weilbach.name>

whilo merged commit 0118da5 into main May 11, 2026
5 of 6 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PSplitAgg: single-walk stats + parallel sibling execution#25

PSplitAgg: single-walk stats + parallel sibling execution#25
whilo merged 1 commit into
mainfrom
psplitagg-fusion

whilo commented May 11, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

whilo commented May 11, 2026

Summary

Benchmarks (50.6M-row synthetic, varying chunk count)

Estimated impact on user report

Why this is safe

Tests

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant