
PSplitAgg: single-walk stats + parallel sibling execution #25

Merged
whilo merged 1 commit into main from psplitagg-fusion on May 11, 2026
Conversation

@whilo whilo commented May 11, 2026

Summary

Two related fixes for mixed-class aggregate workloads (the MIN(p), AVG(p), MEDIAN(p), MAX(p) shape), surfaced by EXPLAIN ANALYZE on a user-reported 50.6M-row query that was anomalously slow in the stats-only branch of a PSplitAgg.

  • Fix 1: execute-stats-only walked the chunk-entry vector once per agg. For 3 aggs on one column → 3 identical passes. Now it walks once per distinct column and projects all requested aggs from a per-column accumulator map.
  • Fix 2: PSplitAgg children ran serially (mapv #(execute-node …) children). Now they execute on Clojure futures and deref in declared order, so total wall time ≈ max(branches) instead of Σ branches. The *explain-collector* dynamic var is rebound across worker threads so EXPLAIN ANALYZE timings continue to flow.

Benchmarks (50.6M-row synthetic, varying chunk count)

Fix 1 — PStatsOnlyAgg single-walk:

| chunks | before | after | speedup |
| ---: | ---: | ---: | ---: |
| 6,178 | 4.0 ms | 1.5 ms | ~2.7× |
| 49,417 | 9.7 ms | 7.7 ms | ~1.3× |
| 197,668 | 34 ms | 9.2 ms | ~3.6× |
| 790,672 | 116 ms | 34 ms | ~3.4× |

Fix 2 — parallel PSplitAgg (total vs Σ vs max of children):

| chunks | PSplitAgg total | Σ children | max child |
| ---: | ---: | ---: | ---: |
| 6,178 | 745.6 ms | 747.5 ms | 745.3 ms |
| 197,668 | 754.6 ms | 764.0 ms | 754.4 ms |
| 790,672 | 805.4 ms | 841.1 ms | 805.1 ms |

Total tracks max(children) to within JIT noise — parallelism confirmed.

Estimated impact on user report

Reported: PStatsOnlyAgg 768 ms + PPercentileAgg 317 ms = 1086 ms (vs DuckDB 1490 ms wall).

  • Fix 1 alone: stats walk 768 → ~256 ms; serial total ~573 ms
  • Fix 1 + Fix 2: parallel → max(256, 317) ≈ 317 ms (~3.4× faster than DuckDB on the same query)

The user's underlying 768 ms anomaly was almost certainly the 3-walk redundancy; Fix 2 additionally protects against any future per-branch first-touch slowness (page faults, JIT warmup on cold paths, GC) by ensuring those costs no longer compound across siblings.

Why this is safe

  • PSplitAgg children share read-only input — they receive the same scan context value, not a stream.
  • The two relevant branches do disjoint memory work: PStatsOnlyAgg reads chunk stats, PPercentileAgg materializes the column array.
  • *explain-collector* is the only per-execution mutable state; it's a Clojure dynamic var, rebound in each worker thread so ANALYZE timings are still recorded.
  • Single-child PSplitAgg short-circuits to sync execution (no future overhead).
  • Child results are deref'd in declared order, so result-row aggs appear in user-declared order regardless of completion order.
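The declared-order guarantee can be seen in a minimal REPL sketch (the sleep timings are hypothetical; any workload that finishes out of order works the same way):

```clojure
;; Futures may complete out of declared order, but deref'ing the
;; vector in declared order returns results in declared order.
(def results
  (let [fs (mapv (fn [ms] (future (Thread/sleep ms) ms))
                 [30 10 20])]
    (mapv deref fs)))
;; results => [30 10 20]
```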

Tests

Three new tests in query_test.clj:

  • stats-only-single-pass-multi-col-test — aggs across two index columns
  • stats-only-three-aggs-one-column-test — MIN+AVG+MAX on one column (the exact shape Fix 1 optimizes)
  • split-agg-parallel-preserves-order-test — heavily-aliased mixed query verifying result order survives parallel merge

Full suite: 1039 tests, 4777 assertions, 0 failures.

Test plan

  • Full suite passes
  • cljfmt clean
  • Synthetic benchmarks across 4 chunk-size configurations
  • Ask rschmukler to re-run his `EXPLAIN ANALYZE SELECT MIN, AVG, MEDIAN, MAX FROM options_trades_alt` after merge to confirm the predicted ~300 ms result on his data.

Two related fixes for mixed-class aggregate workloads
(MIN/AVG/MEDIAN/MAX-style), surfaced by EXPLAIN ANALYZE on a 50.6M-row
user query that ran ~1100 ms — anomalously dominated by the supposedly
"O(chunks)" PStatsOnlyAgg sub-plan rather than the unavoidable
percentile pass.

  Fix 1 — single-pass PStatsOnlyAgg
  ----------------------------------
  `execute-stats-only` previously walked the chunk-entry vector once
  per aggregate. For a query like SELECT MIN(p), AVG(p), MAX(p) FROM t,
  the same chunk list got iterated three times producing identical
  sum/count/min/max. The refactor pulls out `walk-chunk-stats` and
  invokes it once per *distinct column*, then projects requested aggs
  out of the per-column accumulator map.
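  A minimal sketch of that shape — the function and key names follow the
  description above but are assumptions, not the actual executor code:

```clojure
;; One pass accumulates every statistic any aggregate might need;
;; requested aggregates are then projected out of the accumulator map.
(defn walk-chunk-stats
  "Single walk over one column's chunk entries."
  [chunk-entries]
  (reduce (fn [acc {mn :min mx :max s :sum n :count}]
            (-> acc
                (update :min (fnil min mn) mn)   ; fnil seeds the first entry
                (update :max (fnil max mx) mx)
                (update :sum + s)
                (update :count + n)))
          {:min nil :max nil :sum 0 :count 0}
          chunk-entries))

(defn project-aggs
  "Answer each requested aggregate from the accumulator map; no re-walk."
  [{:keys [min max sum count]} aggs]
  (mapv #(case % :min min :max max :avg (/ sum (double count))) aggs))
```

  With this shape, `(project-aggs (walk-chunk-stats entries) [:min :avg :max])`
  answers all three aggregates from one pass over the chunk entries.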

  Measured on synthetic 50.6M-row indices:

    chunks    before    after   speedup
    ------    ------    -----   -------
      6 K       4 ms     1.5 ms   ~2.7x
     49 K     9.7 ms     7.7 ms   ~1.3x
    197 K      34 ms     9.2 ms   ~3.6x
    790 K     116 ms      34 ms   ~3.4x

  The scaling matches the 3-aggs-on-one-column reduction-of-passes
  theory (3 walks → 1 walk per column; some constant overhead from the
  per-column hash lookup explains why the speedup is below the
  theoretical 3x at high chunk counts).

  Fix 2 — parallel sibling execution in PSplitAgg
  -----------------------------------------------
  `execute-split-agg` previously ran sub-plans serially:
  `(mapv #(execute-node % false) children)`. The design comment
  claimed L3-cache reuse across passes, but the practical effect is
  total wall-time = Σ branches, which is a problem when one branch is
  unexpectedly slow (first-touch page faults, cold JIT on a
  heterogeneous code path, GC pressure). Children now execute on
  Clojure futures and are deref'd in declared order so result-row
  ordering is preserved:

      child-results
      (if (= 1 (count children))
        [(execute-node (first children) false)]    ; trivial degenerate
        (let [coll *explain-collector*
              fs   (mapv (fn [c]
                           (future
                             (binding [*explain-collector* coll]
                               (execute-node c false))))
                         children)]
          (mapv deref fs)))

  Three observations made this safe:

    - The PSplitAgg children share read-only input (the upstream scan
      ctx is a value, not a stream), so parallel readers compete only
      for OS-page-cache / L3 bandwidth, not for mutable state.
    - PStatsOnlyAgg does not materialize columns; PPercentileAgg does.
      The two branches do disjoint memory work even when they reference
      the same column.
    - The `*explain-collector*` dynamic var is the only per-execution
      mutable state. It is rebound across the future boundary so
      EXPLAIN ANALYZE per-node timings continue to flow.
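  A self-contained sketch of the capture-then-rebind pattern (the
  atom-backed collector is an assumption for illustration; only the
  pattern mirrors the change):

```clojure
;; Capture the collector on the parent thread, rebind it in the worker,
;; and observe the worker's report through the parent's binding.
(def ^:dynamic *explain-collector* nil)

(def recorded
  (binding [*explain-collector* (atom [])]
    (let [coll *explain-collector*                    ; capture on parent
          f    (future
                 (binding [*explain-collector* coll]  ; rebind in worker
                   (swap! *explain-collector* conj {:node :stats :ms 1.5})))]
      @f
      @*explain-collector*)))
;; recorded => [{:node :stats, :ms 1.5}]
```

  Clojure's `future` already conveys the caller's dynamic bindings to the
  worker thread, so the explicit rebind is defensive rather than strictly
  required — it makes the intent visible at the fork site.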

  Measured: PSplitAgg total now tracks max(branches), not Σ. On the
  baseline (cheap stats branch), the savings are small (~2 ms / 9 ms /
  36 ms at 6 K / 197 K / 790 K chunks); the real protection is against
  scenarios like the one in the user report where one branch
  unexpectedly takes 100s of milliseconds — total wall time stays
  bounded by the slow branch instead of summing both.

  A forward `(declare ^:dynamic *explain-collector*)` near the top of
  executor.clj is required because `execute-split-agg` is defined far
  above the var.
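  A simplified sketch of that file layout — `report!` is a hypothetical
  stand-in for `execute-split-agg`:

```clojure
;; The forward declare lets a fn defined earlier in the namespace
;; read and rebind the dynamic var defined much further down.
(declare ^:dynamic *explain-collector*)

(defn report! [entry]
  (when *explain-collector*
    (swap! *explain-collector* conj entry)))

;; ...much later in the file...
(def ^:dynamic *explain-collector* nil)
```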

  Estimated impact on the user's reported case
  --------------------------------------------
    Before: PStatsOnlyAgg 768 ms + PPercentileAgg 317 ms = 1086 ms
    Fix 1 alone:           ~256 ms + 317 ms          = ~573 ms
    Fix 1 + Fix 2:      max(256, 317)                = ~317 ms

  i.e. roughly 3.4× faster than the DuckDB run on the same query
  (1490 ms wall), assuming the user's underlying stats-walk slowness
  was the 3-walk redundancy and not something else still unaccounted
  for.

  Tests
  -----
  Three new tests in query_test.clj cover the changed paths:
    - stats-only-single-pass-multi-col-test: aggs across two index
      columns produce correct results from the per-column accumulator
      map.
    - stats-only-three-aggs-one-column-test: min+avg+max on one column
      (the exact shape Fix 1 optimizes) returns correct values.
    - split-agg-parallel-preserves-order-test: heavily-aliased mixed
      query verifies result aggs appear in declared order through the
      parallel future-merge.

  Full suite: 1039 tests, 4777 assertions, 0 failures.

Signed-off-by: Christian Weilbach <christian@weilbach.name>
@whilo whilo merged commit 0118da5 into main May 11, 2026
5 of 6 checks passed