PSplitAgg: split mixed-class aggregations by whilo · Pull Request #22 · replikativ/stratum

whilo · 2026-05-10T23:26:33Z

Summary

Reported issue: SELECT min(price), avg(price), median(price), max(price) FROM t ran ~3× slower than DuckDB on the same data (6M rows: 457ms vs DuckDB 152ms).
Root cause: the cascading global-agg / group-by strategy cond collapses to PScalarAgg (per-row Clojure reduce) whenever the agg list mixes SIMD-friendly aggs with median / percentile / approx-quantile. The "fast" siblings get dragged into the slow path.
Fix: new physical IR node PSplitAgg that partitions aggs by strategy class, plans each subset with the regular chooser, and merges results. Single-class queries are unaffected.
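
The partitioning step described in the fix could look roughly like the sketch below. It is an illustration, not the code in plan.clj: the names agg-class and partition-aggs-by-class come from the Files list, but their signatures, the class keywords (:simd, :percentile), the agg map shape ({:op … :col …}), and the helper needs-split? are all assumptions made for this example.

```clojure
;; Assumed: the ops that need the percentile operator instead of a SIMD kernel.
(def percentile-ops #{:median :percentile :approx-quantile})

(defn agg-class
  "Hypothetical classifier: map an agg spec to its strategy class."
  [{:keys [op]}]
  (if (contains? percentile-ops op) :percentile :simd))

(defn partition-aggs-by-class
  "Group aggs by strategy class. Each agg is tagged with its original
  position (:pos) so merged results can be restored to SELECT order."
  [aggs]
  (->> aggs
       (map-indexed (fn [i agg] (assoc agg :pos i)))
       (group-by agg-class)))

(defn needs-split?
  "True only when the agg list spans more than one strategy class."
  [aggs]
  (> (count (distinct (map agg-class aggs))) 1))

(def example-aggs
  [{:op :min :col :price} {:op :avg :col :price}
   {:op :median :col :price} {:op :max :col :price}])

(needs-split? example-aggs)                      ;; => true
(keys (partition-aggs-by-class example-aggs))    ;; => (:simd :percentile)
(needs-split? [{:op :median :col :price}])       ;; => false
```

Under these assumptions a single-class list (all SIMD-friendly, or percentile-only) yields one partition and would fall through to the regular chooser, which matches the "single-class queries are unaffected" behaviour stated above.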

Why this design

We studied DuckDB and ClickHouse before settling on the architecture:

Both engines use single-scan / per-agg dispatch (UngroupedAggregateExecuteState::Sink in DuckDB; Aggregator::executeOnIntervalWithoutKey with AggregateFunctionInstruction array in ClickHouse). They call each agg's add_batch once per chunk, sharing column data by reference.
Crucially, they do not fuse filter+agg in their kernels — Stratum does, and that's our biggest perf advantage over DuckDB on filtered queries.
Adopting a DuckDB-style unified PMultiAgg operator would require new masked-kernel signatures duplicated across every agg type and would unwind the filter+agg fusion. The theoretical upside (one extra predicate eval saved) is ms-scale.
PSplitAgg pays one extra column scan per class (cheap — L3-hot after the first pass) and reuses every existing fused kernel unchanged. New fast paths (variance, count-distinct/HLL) plug in by joining the right partition.

Numbers (6M rows, NT, vs DuckDB JDBC in-process)

Query	Before	After	DuckDB	Result
min+avg+median+max global	457ms	83ms	152ms	1.84× faster than DuckDB
min + median global	300ms	77ms	145ms	1.88× faster
sum + median global	215ms	78ms	121ms	1.55× faster
median alone	60ms	73ms	157ms	2.14× faster
GROUP BY min+median+max	1073ms	79ms	128ms	1.62× faster
GROUP BY sum+median	953ms	76ms	127ms	1.67× faster

Single-class fast-path numbers (min+avg+max only, etc.) are unchanged.

Plan tree shape

Before:

PScalarAgg   est-rows=1
  PScan  cols=[:price] len=6000000

After:

PSplitAgg  aggs=[:min :avg :median :max] classes=2
  PDenseGroupBy  groups=[] aggs=[:min :avg :max] max-key=1
    PScan  cols=[:price] len=6000000
  PPercentileAgg
    PScan  cols=[:price] len=6000000
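
To make the shape of the two trees concrete, here is a minimal sketch that models them as records and counts the scans each plan performs. The record names mirror the EXPLAIN output above, but the field layouts and the helpers children and scan-count are assumptions for illustration; the actual defrecords in ir.clj may differ.

```clojure
;; Hypothetical field layouts; only the node names come from the plan trees.
(defrecord PScan [cols len])
(defrecord PScalarAgg [aggs input])
(defrecord PDenseGroupBy [groups aggs input])
(defrecord PPercentileAgg [aggs input])
(defrecord PSplitAgg [aggs children])

(defn children
  "Child nodes of a plan node: PSplitAgg holds one sub-plan per class,
  other operators hold a single :input, PScan is a leaf."
  [node]
  (cond
    (instance? PSplitAgg node) (:children node)
    (:input node)              [(:input node)]
    :else                      []))

(defn scan-count
  "Number of PScan leaves, i.e. column passes the plan will make."
  [node]
  (if (instance? PScan node)
    1
    (reduce + 0 (map scan-count (children node)))))

(def scan (->PScan [:price] 6000000))

(def before-plan (->PScalarAgg [:min :avg :median :max] scan))

(def after-plan
  (->PSplitAgg [:min :avg :median :max]
               [(->PDenseGroupBy [] [:min :avg :max] scan)
                (->PPercentileAgg [:median] scan)]))

(scan-count before-plan) ;; => 1
(scan-count after-plan)  ;; => 2
```

The count of 2 for the split plan corresponds to the "one extra column scan per class" cost described under "Why this design".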

Test plan

11 new split-agg tests under stratum.query-test/split-agg-* cover: global / GROUP BY, aliases preserved, predicates respected, single-class does NOT split, percentile-only does NOT split, mixed DOES split, results match individual queries.
Full test suite passes: clojure -M:test → 1012 tests / 4680 assertions / 0 failures.
OLAP regression check: T1 + T2 (H2O) + T3 (ClickBench) tiers run; numbers match memory baselines on the paths PSplitAgg doesn't touch. B2 idx NT slowness on 10M-row scale pre-exists on origin/main (verified via stash-and-test) and is unrelated.

Files

src/stratum/query/ir.clj — PSplitAgg defrecord + map-children walk support
src/stratum/query/plan.clj — agg-class / partition-aggs-by-class / cond branch in select-global-agg-strategy and select-group-by-strategy + propagate-est-rows + EXPLAIN printer
src/stratum/query/executor.clj — execute-split-agg (global row-map merge + GROUP BY group-key merge)
test/stratum/query_test.clj — 11 new tests
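
The two merge modes named for execute-split-agg can be sketched as follows. This is a hedged illustration rather than the executor's code: the function names merge-global and merge-grouped, the row-map result shape, and the assumption that every class returns the same set of groups (same predicate, same group keys) are all made up for the example.

```clojure
(defn merge-global
  "Global aggs: each class yields exactly one row map, so the merged
  result is a single row containing every class's columns."
  [class-results]
  [(apply merge (map first class-results))])

(defn merge-grouped
  "GROUP BY: join per-class rows on their group-key columns.
  Output row order is unspecified in this sketch."
  [group-cols class-results]
  (->> (apply concat class-results)
       (group-by #(select-keys % group-cols))
       vals
       (mapv #(apply merge %))))

(merge-global [[{:min 1 :avg 5.0 :max 9}] [{:median 5}]])
;; => [{:min 1, :avg 5.0, :max 9, :median 5}]

(set (merge-grouped [:g]
                    [[{:g "a" :sum 3} {:g "b" :sum 7}]
                     [{:g "b" :median 4} {:g "a" :median 2}]]))
;; => #{{:g "a", :sum 3, :median 2} {:g "b", :sum 7, :median 4}}
```

Joining on the group-key columns rather than on row position means the per-class sub-plans are free to emit groups in different orders.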

When a global agg or GROUP BY mixes SIMD-friendly aggs (min/max/avg/sum/count) with aggs that need a different physical operator (median / percentile / approx-quantile), the cascading strategy cond collapses to PScalarAgg, a per-row Clojure reduce loop that ran 6× slower than executing each class separately.

The new PSplitAgg physical node partitions aggs by strategy class, plans each subset with the regular strategy chooser, and merges the results. Existing fused filter+agg SIMD kernels are reused unchanged.

DuckDB/ClickHouse use a single-scan / per-agg-dispatch design that forgoes filter+agg fusion; PSplitAgg keeps Stratum's fusion advantage while matching their per-class specialization.

Performance on 6M rows (NT, vs DuckDB):

min+avg+median+max global 457ms → 83ms (1.84× faster than DuckDB)
min+median global 300ms → 77ms (1.88× faster)
sum+median global 215ms → 78ms (1.55× faster)
GROUP BY min+median+max 1073ms → 79ms (1.62× faster)
GROUP BY sum+median 953ms → 76ms (1.67× faster)

Single-class queries fall through to the existing path with no plan or execution change. Future fast paths (variance, count-distinct/HLL) plug in by joining the right partition, with no executor change needed.

All 1012 existing tests pass; 11 new split-agg tests added.

Signed-off-by: Christian Weilbach <christian@weilbach.name>

whilo merged commit 04cc23a into main May 10, 2026
5 of 6 checks passed

whilo deleted the bugfix/multi-agg-same-column branch May 10, 2026 23:28

whilo merged 1 commit into main from bugfix/multi-agg-same-column
