PSplitAgg: split mixed-class aggregations#22

Merged: whilo merged 1 commit into main from bugfix/multi-agg-same-column on May 10, 2026

@whilo whilo commented May 10, 2026

Summary

  • Reported issue: SELECT min(price), avg(price), median(price), max(price) FROM t ran ~3× slower than DuckDB on the same data (6M rows: 457ms vs DuckDB 152ms).
  • Root cause: the cascading global-agg / group-by strategy cond collapses to PScalarAgg (per-row Clojure reduce) whenever the agg list mixes SIMD-friendly aggs with median / percentile / approx-quantile. The "fast" siblings get dragged into the slow path.
  • Fix: new physical IR node PSplitAgg that partitions aggs by strategy class, plans each subset with the regular chooser, and merges results. Single-class queries are unaffected.
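
The partitioning step described in the fix can be sketched as follows. This is a minimal Python illustration, not the Clojure code from plan.clj: the class names `"simd"` and `"percentile"`, the `PERCENTILE_AGGS` set, and the `needs_split` helper are assumptions made for the example; only the idea of grouping aggs by strategy class comes from the PR.

```python
# Hypothetical class assignment; the real agg-class in plan.clj may use
# different class names or more classes.
PERCENTILE_AGGS = {"median", "percentile", "approx-quantile"}


def agg_class(agg_op):
    """Return the (assumed) strategy class for one aggregate operator name."""
    return "percentile" if agg_op in PERCENTILE_AGGS else "simd"


def partition_aggs_by_class(aggs):
    """Group (position, agg) pairs by class, keeping first-seen class order.

    Positions are kept so the merged result can be put back into the
    original SELECT order.
    """
    parts = {}
    for pos, agg in enumerate(aggs):
        parts.setdefault(agg_class(agg), []).append((pos, agg))
    return parts


def needs_split(aggs):
    """A split is only worthwhile when more than one class is present."""
    return len(partition_aggs_by_class(aggs)) > 1


parts = partition_aggs_by_class(["min", "avg", "median", "max"])
```

Under these assumptions, `["min", "avg", "max"]` and `["median"]` each form a single class and would not be split, while `["min", "avg", "median", "max"]` yields two partitions.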

Why this design

We studied DuckDB and ClickHouse before settling on the architecture:

  • Both engines use single-scan / per-agg dispatch (UngroupedAggregateExecuteState::Sink in DuckDB; Aggregator::executeOnIntervalWithoutKey with AggregateFunctionInstruction array in ClickHouse). They call each agg's add_batch once per chunk, sharing column data by reference.
  • Crucially, they do not fuse filter+agg in their kernels — Stratum does, and that's our biggest perf advantage over DuckDB on filtered queries.
  • Adopting a DuckDB-style unified PMultiAgg operator would require new masked-kernel signatures duplicated across every agg type and would unwind the filter+agg fusion. The theoretical upside (one extra predicate eval saved) is ms-scale.
  • PSplitAgg pays one extra column scan per class (cheap — L3-hot after the first pass) and reuses every existing fused kernel unchanged. New fast paths (variance, count-distinct/HLL) plug in by joining the right partition.
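
The cost side of this trade-off can be illustrated with a toy model. The sketch below is plain Python, not Stratum's SIMD kernels: both "kernels" and the predicate counter are hypothetical stand-ins. It only shows that running one fused filter+agg pass per class means the predicate is evaluated once per row per class, with no mask shared between passes.

```python
import statistics


def fused_filter_min_avg_max(column, pred):
    """One pass: the predicate and the min/avg/max updates share a loop."""
    lo = hi = None
    total, n = 0.0, 0
    for v in column:
        if pred(v):
            lo = v if lo is None else min(lo, v)
            hi = v if hi is None else max(hi, v)
            total += v
            n += 1
    return {"min": lo, "avg": total / n if n else None, "max": hi}


def fused_filter_median(column, pred):
    """Second pass for the percentile class; re-evaluates the predicate."""
    kept = [v for v in column if pred(v)]
    return {"median": statistics.median(kept) if kept else None}


def split_agg(column, pred):
    """Run each class with its own fused pass, then merge the one-row maps."""
    out = fused_filter_min_avg_max(column, pred)   # pass 1
    out.update(fused_filter_median(column, pred))  # pass 2
    return out


calls = {"n": 0}  # counts predicate evaluations across both passes


def price_over_10(v):
    calls["n"] += 1
    return v > 10


prices = [5, 12, 30, 8, 18]
result = split_agg(prices, price_over_10)
```

With two classes, the predicate runs twice per row (10 evaluations for 5 rows here), which is the extra work the PR accepts in exchange for leaving the fused kernels untouched.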

Numbers (6M rows, NT, vs DuckDB JDBC in-process)

| Query | Before | After | DuckDB | After vs DuckDB |
|---|---|---|---|---|
| min+avg+median+max global | 457ms | 83ms | 152ms | 1.84× faster |
| min + median global | 300ms | 77ms | 145ms | 1.88× faster |
| sum + median global | 215ms | 78ms | 121ms | 1.55× faster |
| median alone | 60ms | 73ms | 157ms | 2.14× faster |
| GROUP BY min+median+max | 1073ms | 79ms | 128ms | 1.62× faster |
| GROUP BY sum+median | 953ms | 76ms | 127ms | 1.67× faster |

Single-class fast-path numbers (min+avg+max only, etc.) are unchanged.

Plan tree shape

Before:

PScalarAgg   est-rows=1
  PScan  cols=[:price] len=6000000

After:

PSplitAgg  aggs=[:min :avg :median :max] classes=2
  PDenseGroupBy  groups=[] aggs=[:min :avg :max] max-key=1
    PScan  cols=[:price] len=6000000
  PPercentileAgg
    PScan  cols=[:price] len=6000000
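
To make the shape of the "After" tree concrete, here is a small Python sketch of a plan tree and an indenting printer. The `PlanNode` class and `explain` function are hypothetical illustrations; the actual node is a Clojure defrecord in ir.clj, and the real EXPLAIN printer lives in plan.clj.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PlanNode:
    """Hypothetical stand-in for a physical IR node."""
    name: str
    attrs: str = ""
    children: List["PlanNode"] = field(default_factory=list)


def explain(node, depth=0):
    """Return the plan as a list of lines, two spaces of indent per level."""
    line = "  " * depth + node.name
    if node.attrs:
        line += "  " + node.attrs
    lines = [line]
    for child in node.children:
        lines.extend(explain(child, depth + 1))
    return lines


def scan():
    # Each class gets its own scan of the same column.
    return PlanNode("PScan", "cols=[:price] len=6000000")


plan = PlanNode(
    "PSplitAgg",
    "aggs=[:min :avg :median :max] classes=2",
    [
        PlanNode("PDenseGroupBy", "groups=[] aggs=[:min :avg :max] max-key=1", [scan()]),
        PlanNode("PPercentileAgg", "", [scan()]),
    ],
)

print("\n".join(explain(plan)))
```

The point of the shape is that PSplitAgg has one child per class, and each child is an ordinary plan that the regular chooser could have produced on its own.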

Test plan

  • 11 new split-agg tests under stratum.query-test/split-agg-* cover: global / GROUP BY, aliases preserved, predicates respected, single-class does NOT split, percentile-only does NOT split, mixed DOES split, results match individual queries.
  • Full test suite passes: clojure -M:test → 1012 tests / 4680 assertions / 0 failures.
  • OLAP regression check: T1 + T2 (H2O) + T3 (ClickBench) tiers run; numbers match memory baselines on the paths PSplitAgg doesn't touch. B2 idx NT slowness on 10M-row scale pre-exists on origin/main (verified via stash-and-test) and is unrelated.

Files

  • src/stratum/query/ir.clj: PSplitAgg defrecord + map-children walk support
  • src/stratum/query/plan.clj: agg-class / partition-aggs-by-class / cond branch in select-global-agg-strategy and select-group-by-strategy + propagate-est-rows + EXPLAIN printer
  • src/stratum/query/executor.clj: execute-split-agg (global row-map merge + GROUP BY group-key merge)
  • test/stratum/query_test.clj: 11 new tests
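
The two merge modes named for execute-split-agg can be sketched like this. The Python below is an illustrative assumption, not the executor's code: the function names, the row-as-dict representation, and the `region` key column are all hypothetical. It also assumes every class sees the same input and predicate, so each class produces the same set of groups.

```python
def merge_global(partials):
    """Global agg: each class returns one row map; merge them into one row."""
    merged = {}
    for row in partials:
        merged.update(row)
    return merged


def merge_group_by(partials, key_cols):
    """GROUP BY: join the per-class result rows on the group key.

    Output follows the group order of the first class; rows from later
    classes are matched by key, so their order does not matter.
    """
    merged = {}
    for rows in partials:
        for row in rows:
            key = tuple(row[k] for k in key_cols)
            merged.setdefault(key, {}).update(row)
    return list(merged.values())


global_row = merge_global([{"min": 1, "max": 9}, {"median": 5}])

simd_rows = [
    {"region": "eu", "min": 1, "max": 9},
    {"region": "us", "min": 2, "max": 7},
]
pct_rows = [
    {"region": "us", "median": 4},
    {"region": "eu", "median": 5},
]
grouped = merge_group_by([simd_rows, pct_rows], ["region"])
```

A real implementation would also need to restore the original SELECT column order and aliases, which this sketch leaves out.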

When a global agg or GROUP BY mixes SIMD-friendly aggs (min/max/avg/sum/
count) with aggs that need a different physical operator (median /
percentile / approx-quantile), the cascading strategy cond collapses to
PScalarAgg — a per-row Clojure reduce loop that ran 6× slower than
executing each class separately.

The new PSplitAgg physical node partitions aggs by strategy class,
plans each subset with the regular strategy chooser, and merges the
results. Existing fused filter+agg SIMD kernels are reused unchanged.
DuckDB/ClickHouse use a single-scan / per-agg-dispatch design that
forgoes filter+agg fusion; PSplitAgg keeps Stratum's fusion advantage
while matching their per-class specialization.

Performance on 6M rows (NT, vs DuckDB):

  min+avg+median+max global       457ms → 83ms   (1.84× faster than DuckDB)
  min+median global               300ms → 77ms   (1.88× faster)
  sum+median global               215ms → 78ms   (1.55× faster)
  GROUP BY min+median+max        1073ms → 79ms   (1.62× faster)
  GROUP BY sum+median             953ms → 76ms   (1.67× faster)

Single-class queries fall through to the existing path with no plan
or execution change. Future fast paths (variance, count-distinct/HLL)
plug in by joining the right partition — no executor change needed.

All 1012 existing tests pass; 11 new split-agg tests added.

Signed-off-by: Christian Weilbach <christian@weilbach.name>
@whilo whilo merged commit 04cc23a into main May 10, 2026
5 of 6 checks passed
@whilo whilo deleted the bugfix/multi-agg-same-column branch May 10, 2026 23:28