PSplitAgg: single-walk stats + parallel sibling execution#25
Merged
Conversation
Two related fixes for mixed-class aggregate workloads
(MIN/AVG/MEDIAN/MAX-style), surfaced by EXPLAIN ANALYZE on a 50.6M-row
user query that ran ~1100 ms — anomalously dominated by the supposedly
"O(chunks)" PStatsOnlyAgg sub-plan rather than the unavoidable
percentile pass.
Fix 1 — single-pass PStatsOnlyAgg
----------------------------------
`execute-stats-only` previously walked the chunk-entry vector once
per aggregate. For a query like SELECT MIN(p), AVG(p), MAX(p) FROM t,
the same chunk list got iterated three times producing identical
sum/count/min/max. The refactor pulls out `walk-chunk-stats` and
invokes it once per *distinct column*, then projects requested aggs
out of the per-column accumulator map.
Measured on synthetic 50.6M-row indices:
chunks before after speedup
------ ------ ----- -------
6 K 4 ms 1.5 ms ~2.7x
49 K 9.7 ms 7.7 ms ~1.3x
197 K 34 ms 9.2 ms ~3.6x
790 K 116 ms 34 ms ~3.4x
The scaling matches the 3-aggs-on-one-column reduction-of-passes
theory (3 walks → 1 walk per column; some constant overhead from the
per-column hash lookup explains why the speedup is below the
theoretical 3x at high chunk counts).
Fix 2 — parallel sibling execution in PSplitAgg
-----------------------------------------------
`execute-split-agg` previously ran sub-plans serially:
`(mapv #(execute-node % false) children)`. The design comment
claimed L3-cache reuse across passes, but the practical effect is
total wall-time = Σ branches, which is a problem when one branch is
unexpectedly slow (first-touch page faults, cold JIT on a
heterogeneous code path, GC pressure). Children now execute on
Clojure futures and are deref'd in declared order so result-row
ordering is preserved:
child-results
(if (= 1 (count children))
[(execute-node (first children) false)] ; trivial degenerate
(let [coll *explain-collector*
fs (mapv (fn [c]
(future
(binding [*explain-collector* coll]
(execute-node c false))))
children)]
(mapv deref fs)))
Three observations made this safe:
- The PSplitAgg children share read-only input (the upstream scan
ctx is a value, not a stream), so parallel readers compete only
for OS-page-cache / L3 bandwidth, not for mutable state.
- PStatsOnlyAgg does not materialize columns; PPercentileAgg does.
The two branches do disjoint memory work even when they reference
the same column.
- The `*explain-collector*` dynamic var is the only per-execution
mutable state. It is rebound across the future boundary so
EXPLAIN ANALYZE per-node timings continue to flow.
Measured: PSplitAgg total now tracks max(branches), not Σ. On the
baseline (cheap stats branch), the savings are small (~2 ms / 9 ms /
36 ms at 6 K / 197 K / 790 K chunks); the real protection is against
scenarios like the one in the user report where one branch
unexpectedly takes 100s of milliseconds — total wall time stays
bounded by the slow branch instead of summing both.
A forward `(declare ^:dynamic *explain-collector*)` near the top of
executor.clj is required because `execute-split-agg` is defined far
above the var.
Estimated impact on the user's reported case
--------------------------------------------
Before: PStatsOnlyAgg 768 ms + PPercentileAgg 317 ms = 1086 ms
Fix 1 alone: ~256 ms + 317 ms = ~573 ms
Fix 1 + Fix 2: max(256, 317) = ~317 ms
i.e. roughly 3.4× faster than the DuckDB run on the same query
(1490 ms wall), assuming the user's underlying stats-walk slowness
was the 3-walk redundancy and not something else still unaccounted
for.
Tests
-----
Three new tests in query_test.clj cover the changed paths:
- stats-only-single-pass-multi-col-test: aggs across two index
columns produce correct results from the per-column accumulator
map.
- stats-only-three-aggs-one-column-test: min+avg+max on one column
(the exact shape Fix 1 optimizes) returns correct values.
- split-agg-parallel-preserves-order-test: heavily-aliased mixed
query verifies result aggs appear in declared order through the
parallel future-merge.
Full suite: 1039 tests, 4777 assertions, 0 failures.
Signed-off-by: Christian Weilbach <christian@weilbach.name>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Two related fixes for mixed-class aggregate workloads (the
MIN(p), AVG(p), MEDIAN(p), MAX(p)shape), surfaced by EXPLAIN ANALYZE on a user-reported 50.6M-row query that was anomalously slow in the stats-only branch of a PSplitAgg.execute-stats-onlywalked the chunk-entry vector once per agg. For 3 aggs on one column → 3 identical passes. Now it walks once per distinct column and projects all requested aggs from a per-column accumulator map.mapv #(execute-node …) children). Now they execute on Clojurefutures and deref in declared order, so total wall time ≈max(branches)instead ofΣ branches. The*explain-collector*dynamic var is rebound across worker threads so EXPLAIN ANALYZE timings continue to flow.Benchmarks (50.6M-row synthetic, varying chunk count)
Fix 1 — PStatsOnlyAgg single-walk:
Fix 2 — parallel PSplitAgg (total vs Σ vs max of children):
Total tracks
max(children)to within JIT noise — parallelism confirmed.Estimated impact on user report
Reported:
PStatsOnlyAgg 768 ms + PPercentileAgg 317 ms = 1086 ms(vs DuckDB 1490 ms wall).max(256, 317)≈ 317 ms (~3.4× faster than DuckDB on the same query)The user's underlying 768 ms anomaly was almost certainly the 3-walk redundancy; Fix 2 additionally protects against any future per-branch first-touch slowness (page faults, JIT warmup on cold paths, GC) by ensuring those costs no longer compound across siblings.
Why this is safe
*explain-collector*is the only per-execution mutable state; it's a Clojure dynamic var, rebound in each worker thread so ANALYZE timings are still recorded.Tests
Three new tests in
query_test.clj:stats-only-single-pass-multi-col-test— aggs across two index columnsstats-only-three-aggs-one-column-test— MIN+AVG+MAX on one column (the exact shape Fix 1 optimizes)split-agg-parallel-preserves-order-test— heavily-aliased mixed query verifying result order survives parallel mergeFull suite: 1039 tests, 4777 assertions, 0 failures.
Test plan