Time-series query engine: micros precision, RANGE frames, FIRST/LAST/EMA/RLEID, window-join, latest-on #21

Merged
whilo merged 7 commits into main from feature/time-series on May 9, 2026

whilo commented May 9, 2026

Summary

Adds the time-series feature set surveyed across DuckDB / TimescaleDB / QuestDB / kdb / Polars / datajure. Each phase is its own commit. Tests cover every new operator, and the legacy seconds-based code paths behave exactly as before.

Phase A — Sub-day temporal precision (epoch-microseconds, DuckDB-compatible)

  • New column metadata: :temporal-unit ∈ #{:days :seconds :millis :micros} (defaults to legacy :seconds).
  • Java SIMD kernels for micros: arrayDateTrunc{Year,Month,Day,Hour,Minute,Second,Milli,Micro}Micros, arrayExtract{Hour,Minute,Second,Millisecond,Microsecond}Micros, arrayDateAddMicrosMicros, arrayDateAddMonthsMicros, arrayDateDiffMicros.
  • Both vectorized (eval-expr-{to-long,vectorized}) and scalar (gb/eval-agg-expr) paths dispatch on :temporal-unit via the existing expr/*columns-meta* dynamic var (filter widened to retain temporal-unit columns alongside dict-encoded ones).
  • The same metadata flows through both the planner and the legacy fallback.
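
As an illustration of what an epoch-microsecond kernel has to get right, here is a minimal Java sketch of hour truncation and hour extraction. The class and method names are hypothetical and are not the PR's actual kernels (arrayDateTruncHourMicros and friends), and this version is a plain scalar loop rather than SIMD. The point it shows is the use of floor division so that pre-1970 (negative) timestamps round toward the earlier hour.

```java
// Hypothetical sketch: names are illustrative, not the PR's actual kernels.
final class MicrosKernels {
    static final long MICROS_PER_SECOND = 1_000_000L;
    static final long MICROS_PER_HOUR = 3_600L * MICROS_PER_SECOND;
    static final long MICROS_PER_DAY = 24L * MICROS_PER_HOUR;

    /** Truncates each epoch-microsecond timestamp down to the start of its hour. */
    static long[] truncHour(long[] ts) {
        long[] out = new long[ts.length];
        for (int i = 0; i < ts.length; i++) {
            // floorDiv rounds toward negative infinity, so negative
            // (pre-1970) timestamps land on the earlier hour boundary.
            out[i] = Math.floorDiv(ts[i], MICROS_PER_HOUR) * MICROS_PER_HOUR;
        }
        return out;
    }

    /** Extracts the hour of day (0..23, UTC) from each epoch-microsecond timestamp. */
    static long[] extractHour(long[] ts) {
        long[] out = new long[ts.length];
        for (int i = 0; i < ts.length; i++) {
            // floorMod keeps the time-of-day non-negative for negative inputs.
            out[i] = Math.floorMod(ts[i], MICROS_PER_DAY) / MICROS_PER_HOUR;
        }
        return out;
    }
}
```

With plain `/` and `%` the value -1 (one microsecond before the epoch) would truncate to 0 and extract hour 0, which is why the sketch uses `Math.floorDiv` and `Math.floorMod`.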

Phase B — TIME_BUCKET, FIRST_VALUE, LAST_VALUE, NTH_VALUE

  • [:time-bucket N unit col] — arbitrary-width bucketing on micros, days (with :weeks/:months), or seconds.
  • FIRST_VALUE / LAST_VALUE / NTH_VALUE window functions; the latter reuses the :offset slot for n. SQL parser registers all three.
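
The bucketing arithmetic behind TIME_BUCKET can be sketched in a few lines of Java. This is a hedged illustration only: `TimeBucket.bucket` is a hypothetical name, the width is assumed to be already converted to the column's unit (microseconds here), and calendar-aware units such as months are not covered.

```java
// Hypothetical sketch of floor-div bucketing with an optional origin offset.
final class TimeBucket {
    /**
     * Maps each timestamp to the start of its bucket.
     * Buckets are [origin + k*width, origin + (k+1)*width) for integer k.
     */
    static long[] bucket(long[] ts, long width, long origin) {
        if (width <= 0) {
            throw new IllegalArgumentException("width must be positive");
        }
        long[] out = new long[ts.length];
        for (int i = 0; i < ts.length; i++) {
            // Shift by origin, floor to a multiple of width, shift back.
            out[i] = origin + Math.floorDiv(ts[i] - origin, width) * width;
        }
        return out;
    }
}
```

For example, with a 5-minute width (300,000,000 µs) and origin 0, a timestamp at 7 minutes falls into the bucket starting at 5 minutes; with the origin moved to 1 minute, the same timestamp falls into the bucket starting at 6 minutes.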

Phase C — RANGE BETWEEN INTERVAL frames + GENERATE_SERIES

  • Two-pointer sliding window keyed off the (single) ascending ORDER BY column, O(N) per partition, for :sum/:count/:avg. Required for correct rolling aggregates over irregular time series.
  • Fixed a long-standing partition-reset bug in compute-sliding-window-sum: at a partition boundary, the cumulative sum was overwritten, silently corrupting the last row of every non-final partition. The helper now uses a single monotonic prefix array and subtracts partition-local endpoints to localize each result.
  • stratum.api/generate-series produces a dense column suitable as :from, with a temporal form ((generate-series 0 (* 5 day-us) 1 :days :micros)) that tags the output with the right :temporal-unit. Combined with ASOF LEFT JOIN, this gives canonical gap-fill + LOCF without new join machinery.
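
The two-pointer RANGE frame and the single-prefix-array idea can be sketched together in Java. This is an assumption-laden illustration, not the code in compute-sliding-window-sum-range: the names are hypothetical, rows are assumed to be already sorted by partition and then by ascending key, the frame is assumed to be RANGE BETWEEN `preceding` PRECEDING AND CURRENT ROW, and `partBounds` is an assumed encoding of partition start offsets with a trailing end sentinel.

```java
// Hypothetical sketch: RANGE-frame rolling sum over partitioned, sorted rows.
final class RangeFrameSum {
    static double[] rangeSum(long[] key, double[] val, int[] partBounds, long preceding) {
        int n = val.length;
        // One global, monotonic prefix array: prefix[k] = val[0] + ... + val[k-1].
        // It is never reset at partition boundaries.
        double[] prefix = new double[n + 1];
        for (int i = 0; i < n; i++) {
            prefix[i + 1] = prefix[i] + val[i];
        }
        double[] out = new double[n];
        for (int p = 0; p + 1 < partBounds.length; p++) {
            int start = partBounds[p];
            int end = partBounds[p + 1];
            int lo = start; // first row in the partition with key >= key[i] - preceding
            int hi = start; // first row in the partition with key > key[i] (exclusive end)
            for (int i = start; i < end; i++) {
                while (key[lo] < key[i] - preceding) lo++;
                while (hi < end && key[hi] <= key[i]) hi++;
                // Both endpoints lie inside the partition, so the
                // subtraction cancels everything that precedes it.
                out[i] = prefix[hi] - prefix[lo];
            }
        }
        return out;
    }
}
```

Both pointers only move forward within a partition, which gives the O(N) per-partition cost. Because the frame endpoints are clamped to the partition, the global prefix array needs no per-partition reset, which is the property the bug fix above relies on.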

Phase D — FILLS / LOCF and named moving aggregates

  • :fills window op forward-fills NaN/NULL within partition (leading NaNs stay NaN).
  • MAVG / MSUM / MMIN / MMAX / MCOUNT / MDEV (q-style sugar) expand at execute time to op OVER (ROWS BETWEEN N-1 PRECEDING AND CURRENT ROW). :min / :max gained sliding-frame ROWS branches; :mdev is moving population stddev.
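
A rough Java sketch of the two behaviours described in this phase: forward fill within a partition, and a moving average over ROWS BETWEEN N-1 PRECEDING AND CURRENT ROW. The names are hypothetical, `bounds` is an assumed encoding of partition start offsets plus an end sentinel, and the moving average here covers a single partition and does not treat NaN specially.

```java
// Hypothetical sketch of FILLS (LOCF) and an N-row moving average.
final class FillsAndMavg {
    /** Forward-fills NaN within each partition; leading NaNs stay NaN. */
    static double[] fills(double[] v, int[] bounds) {
        double[] out = v.clone();
        for (int p = 0; p + 1 < bounds.length; p++) {
            // Start at the second row so nothing carries across partitions.
            for (int i = bounds[p] + 1; i < bounds[p + 1]; i++) {
                if (Double.isNaN(out[i])) {
                    out[i] = out[i - 1];
                }
            }
        }
        return out;
    }

    /** Average over the current row and up to n-1 preceding rows. */
    static double[] mavg(double[] v, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be >= 1");
        }
        double[] out = new double[v.length];
        double sum = 0.0;
        for (int i = 0; i < v.length; i++) {
            sum += v[i];
            if (i >= n) {
                sum -= v[i - n]; // drop the row that just left the frame
            }
            out[i] = sum / Math.min(i + 1, n);
        }
        return out;
    }
}
```

The first n-1 rows of `mavg` average over a shorter frame, which matches the usual ROWS-frame semantics where the frame is truncated at the partition start.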

Phase E — EMA and RLEID

  • :ema op: per-partition exponential moving average. :offset >= 1.0 is treated as period N (α = 2/(N+1)); otherwise α directly. Initializes at first non-NaN; NaN inputs carry the previous EMA.
  • :rleid: run-length-encoding group ID (handles long, double, string columns; restarts per partition).
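
The EMA and RLEID rules above can be sketched in Java as follows. This is an illustrative single-partition version with hypothetical names. It follows the stated conventions: an :offset of 1.0 or more is read as a period N with α = 2/(N+1), the EMA initializes at the first non-NaN value, and NaN inputs carry the previous EMA.

```java
// Hypothetical single-partition sketch of the EMA and RLEID semantics.
final class EmaRleid {
    /** Interprets the offset as a period N when >= 1.0, else as alpha directly. */
    static double alphaFromOffset(double offset) {
        return offset >= 1.0 ? 2.0 / (offset + 1.0) : offset;
    }

    static double[] ema(double[] v, double offset) {
        double alpha = alphaFromOffset(offset);
        double[] out = new double[v.length];
        double prev = Double.NaN; // NaN means "not yet initialized"
        for (int i = 0; i < v.length; i++) {
            double x = v[i];
            if (!Double.isNaN(x)) {
                prev = Double.isNaN(prev) ? x : alpha * x + (1.0 - alpha) * prev;
            }
            out[i] = prev; // NaN input: carry the previous EMA unchanged
        }
        return out;
    }

    /** Run-length group id: starts at 1 and increments whenever the value changes. */
    static long[] rleid(long[] v) {
        long[] out = new long[v.length];
        long id = 0;
        for (int i = 0; i < v.length; i++) {
            if (i == 0 || v[i] != v[i - 1]) {
                id++;
            }
            out[i] = id;
        }
        return out;
    }
}
```

A partitioned version would reset `prev` and `id` at each partition start; that part is omitted here to keep the sketch short.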

Phase F — Window-join and LATEST ON / DISTINCT ON

  • stratum.api/window-join (q wj semantics): for each left row at time t, aggregate the right rows whose time falls in [t+lo, t+hi]. Sorts both sides by their ts columns, two-pointer sweep with monotonic lo/hi pointers. SUM/AVG/COUNT use right-side prefix sums (O(1) per left row); MIN/MAX scan the slice.
  • stratum.api/latest-on: most-recent-row-per-partition via ROW_NUMBER OVER (PARTITION BY … ORDER BY ts DESC) + HAVING rn=1.
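
The window-join sweep can be sketched in Java for the SUM case. This is a minimal illustration with hypothetical names, assuming both time arrays are already sorted ascending, that lo <= hi, and that an empty window yields 0.0 (the PR does not state what the real operator returns for an empty window).

```java
// Hypothetical sketch of a window join (SUM) with two monotonic pointers.
final class WindowJoin {
    /** For each left time t, sums right values whose time lies in [t+lo, t+hi]. */
    static double[] sum(long[] leftTs, long[] rightTs, double[] rightVal, long lo, long hi) {
        int m = rightTs.length;
        // Right-side prefix sums make each window sum O(1).
        double[] prefix = new double[m + 1];
        for (int j = 0; j < m; j++) {
            prefix[j + 1] = prefix[j] + rightVal[j];
        }
        double[] out = new double[leftTs.length];
        int a = 0; // first right index with time >= t + lo
        int b = 0; // first right index with time >  t + hi
        for (int i = 0; i < leftTs.length; i++) {
            long t = leftTs[i];
            while (a < m && rightTs[a] < t + lo) a++;
            while (b < m && rightTs[b] <= t + hi) b++;
            out[i] = prefix[b] - prefix[a]; // assumption: empty window gives 0.0
        }
        return out;
    }
}
```

Since the left times ascend, both pointers only ever advance, so the sweep is linear in the combined input size after sorting. MIN and MAX cannot use prefix sums, which is why they scan the matching slice instead.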

Bonus

  • validate-query now accepts SELECT references that point at :window outputs.
  • :offset window-spec field accepts doubles too (needed for EMA's α parameter).

Tests

  • New: temporal_micros_test.clj (44 assertions), range_frame_test.clj (16), window_value_test.clj (18), window_extra_test.clj (28), moving_agg_test.clj (27), generate_series_test.clj (16), temporal_join_test.clj (27).
  • All existing tests pass; full sweep (query-test, sql-test, asof-join-test, planner tests, …) green.

OLAP-bench regression check (10M rows, T1)

| Benchmark | main 1T / NT (ms) | feat 1T / NT (ms) |
| --- | --- | --- |
| B1 TPC-H Q6 | 16.5 / 10.7 | 17.3 / 10.5 |
| B2 TPC-H Q1 | 114.3 / 204.2 | 113.2 / 203.5 |
| B3 SSB Q1.1 | 16.5 / 8.1 | 17.0 / 8.3 |
| B5 Filtered COUNT | 2.7 / 1.5 | 3.3 / 2.3 |
| B6 Low-card group-by | 17.5 / 8.3 | 17.9 / 8.5 |
| SSB-Q1.2 | 15.8 / 8.0 | 16.3 / 8.1 |

No major regressions: five of six are within run-to-run noise. B5 (Filtered COUNT NEQ) shows a small absolute slowdown (~0.6 ms 1T), likely attributable to the widened *columns-meta* filter. It is left as-is for follow-up, since the absolute difference is sub-millisecond and the filter widening is needed for the temporal-unit dispatch.

Test plan

  • cljfmt formatting applied (clj -M:ffix)
  • Full test suite green (1094 assertions across the new + touched test files)
  • OLAP T1 bench compared against main, no significant regressions
  • CI passes

whilo added 7 commits May 9, 2026 00:34
…VALUE

Phase A — Microsecond-precision TIMESTAMP support:
- Column metadata gains :temporal-unit ∈ #{:days :seconds :millis :micros}
  defaulting to legacy :seconds when absent. TIMESTAMP literals encoded
  as long[] of microseconds since epoch (DuckDB-compatible).
- Java kernels added for micros: arrayDateTrunc{Year,Month,Day,Hour,
  Minute,Second,Milli,Micro}Micros, arrayExtract{Hour,Minute,Second,
  Millisecond,Microsecond}Micros, arrayDateAddMicrosMicros,
  arrayDateAddMonthsMicros, arrayDateDiffMicros.
- Expression eval (vectorized + scalar) dispatches on the column's
  :temporal-unit via the existing *columns-meta* dynamic var, which now
  retains temporal-unit columns alongside dict-encoded ones.
- All paths (planner + legacy) thread temporal-unit through.

Phase B1 — TIME_BUCKET function:
- New :time-bucket op. arrayTimeBucketMicros / Days / Months Java
  kernels do floor-div by arbitrary width with optional origin offset.
- Sub-day units (microseconds → hours) on micros columns; days/weeks/
  months on epoch-day DATE columns.

Phase B2 — FIRST_VALUE / LAST_VALUE / NTH_VALUE window functions:
- New :first-value / :last-value / :nth-value window ops in
  query.window. NTH_VALUE uses the LAG/LEAD :offset slot for n.
- LAST_VALUE follows DuckDB convention of full-partition scope so OHLC
  bar generation works with the typical default frame.
- SQL parser registers FIRST_VALUE/LAST_VALUE/NTH_VALUE as analytic
  functions; spec accepts the new ops.

validate-query: SELECT references can now point at :window outputs.

Tests: 1739 assertions across 528 tests pass (44 new for micros, 18
new for value windows). Legacy seconds-based path unchanged.

Signed-off-by: Christian Weilbach <christian@weilbach.name>
Phase C1 — RANGE BETWEEN INTERVAL frames:
- Two-pointer sliding window keyed off the (single) ascending ORDER BY
  column, O(N) per partition. compute-sliding-window-sum-range and
  compute-sliding-window-count-range live alongside the existing
  ROWS-mode helpers; SUM/COUNT/AVG dispatch on (range-frame? frame).
- Fixes a long-standing partition-reset bug in the existing prefix-sum
  helper: at a partition boundary, the cumulative was overwritten,
  silently corrupting the last row of every non-final partition. Switch
  to a single monotonic prefix array and rely on partition-local
  endpoint subtraction to localize the result.

Phase D1 — FILLS (LOCF):
- Forward-fill NaN/NULL within partition. Leading NaNs stay NaN.

Phase E1 — EMA:
- Per-partition exponential moving average. Smoothing factor passed via
  :offset; values >= 1.0 are interpreted as a period N (alpha = 2/(N+1)).
- Initializes at the first non-NaN value of the partition; NaN inputs
  are treated as no-op (carry the previous EMA).

Phase E2 — RLEID:
- Run-length-encoding group ID. Increments when the value differs from
  the previous row in sorted order. Handles long, double, and string
  columns; restarts at 1 per partition.

Tests: +28 new (window-extra: 28 incl. EMA/FILLS/RLEID semantics),
+12 new (range-frame). Full sweep 1573/1573 pass; the reset-bug fix
also makes existing partitioned ROWS-frame queries correct.

Signed-off-by: Christian Weilbach <christian@weilbach.name>
stratum.api/generate-series produces a dense column as a `:from`-ready
column map. Example calls:

  (generate-series 1 10)               → 1..10 step 1, long[]
  (generate-series 0 100 25)           → 0,25,50,75,100, long[]
  (generate-series 0.0 1.0 0.25)       → double[] when step is float
  (generate-series 0 (* 5 day-us)
                   1 :days :micros)    → temporal-tagged spine

Combined with ASOF LEFT JOIN, this enables the canonical gap-fill
pattern (dense time spine + LOCF carry-forward of sparse data) without
any new join machinery. Verified via test/stratum/generate_series_test.
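
For readers unfamiliar with the shape of the output, a small Java sketch of the integer form follows. It is hypothetical and only mirrors the examples above: the stop value is inclusive, and the step is assumed to be positive (whether the real helper accepts negative steps is not stated here).

```java
// Hypothetical sketch of an inclusive integer series, mirroring the examples above.
final class GenerateSeries {
    static long[] series(long start, long stop, long step) {
        if (step <= 0) {
            throw new IllegalArgumentException("step must be positive in this sketch");
        }
        if (stop < start) {
            return new long[0];
        }
        // Inclusive upper bound: 0..100 step 25 has (100 - 0) / 25 + 1 = 5 elements.
        int n = Math.toIntExact((stop - start) / step + 1);
        long[] out = new long[n];
        for (int i = 0; i < n; i++) {
            out[i] = start + i * step;
        }
        return out;
    }
}
```

The temporal form corresponds to calling this with a step of one day expressed in microseconds, giving six spine timestamps for a 0 to 5 day range.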

5 tests / 16 assertions pass.

Signed-off-by: Christian Weilbach <christian@weilbach.name>
Phase D2 — Named moving aggregates (q-style sugar):
- MAVG / MSUM / MMIN / MMAX / MCOUNT / MDEV window ops, expanded at
  execute-window-functions to AVG/SUM/… OVER (ROWS BETWEEN N-1 PRECEDING
  AND CURRENT ROW). Width N rides on :offset.
- :min and :max gained sliding-frame ROWS branches (previously only
  full-partition or running). Simple per-row scan within the frame —
  monotonic-deque optimization deferred but the contract is correct.
- New :mdev op: moving population stddev (ddof=0), two-pass mean +
  variance to avoid cancellation.
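
The two-pass moving stddev can be sketched in Java like this. It is a hedged single-partition illustration with a hypothetical name, using the same simple per-row scan described above for :min/:max, so the cost is O(N·n) rather than O(N).

```java
// Hypothetical sketch of moving population stddev (ddof = 0), two passes per frame.
final class MovingDev {
    static double[] mdev(double[] v, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be >= 1");
        }
        double[] out = new double[v.length];
        for (int i = 0; i < v.length; i++) {
            int from = Math.max(0, i - n + 1); // frame: rows from..i inclusive
            int count = i - from + 1;
            // Pass 1: mean of the frame.
            double sum = 0.0;
            for (int j = from; j <= i; j++) {
                sum += v[j];
            }
            double mean = sum / count;
            // Pass 2: squared deviations from that mean. This avoids the
            // cancellation that E[x^2] - E[x]^2 suffers on large values.
            double ss = 0.0;
            for (int j = from; j <= i; j++) {
                double d = v[j] - mean;
                ss += d * d;
            }
            out[i] = Math.sqrt(ss / count);
        }
        return out;
    }
}
```

For the textbook input 2, 4, 4, 4, 5, 5, 7, 9 with n = 8, the final row has mean 5 and population stddev 2.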

Phase F1 — window join (q `wj` semantics):
- stratum.api/window-join: for each left row at time t, aggregate the
  right rows whose time falls in [t+lo, t+hi]. Sorts both sides ascending
  by their respective ts columns, two-pointer sweep over left while lo/hi
  pointers monotonically advance through right.
- SUM / AVG / COUNT use right-side prefix sums (O(1) per left row);
  MIN / MAX scan the matching slice.
- Single-partition (no equality keys) for now; that's the bulk of the
  realistic usage and matches q's `wj` over a single sym slice.

Phase F2 — LATEST ON / DISTINCT ON:
- stratum.api/latest-on: most recent row per partition, expressed via
  ROW_NUMBER OVER (PARTITION BY … ORDER BY ts DESC) + HAVING rn=1. SQL
  WHERE doesn't see window outputs, but HAVING does, so the rewrite goes
  through cleanly without engine changes.
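
To make the expected result of latest-on concrete, here is a Java sketch that computes the same answer a different way: a single hash-map pass instead of the ROW_NUMBER rewrite used by the helper. All names are hypothetical, and ties on the timestamp keep the first row seen, which is one of the arbitrary choices ROW_NUMBER would also allow.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: row index of the most recent row per partition key.
// Uses a hash map, NOT the ROW_NUMBER + HAVING rewrite described above.
final class LatestOn {
    static int[] latestRowPerPartition(long[] part, long[] ts) {
        // LinkedHashMap keeps partitions in first-seen order for stable output.
        Map<Long, Integer> best = new LinkedHashMap<>();
        for (int i = 0; i < ts.length; i++) {
            Integer cur = best.get(part[i]);
            if (cur == null || ts[i] > ts[cur]) {
                best.put(part[i], i);
            }
        }
        int[] out = new int[best.size()];
        int k = 0;
        for (int idx : best.values()) {
            out[k++] = idx;
        }
        return out;
    }
}
```

The ROW_NUMBER formulation reuses the existing window and HAVING machinery, which is why the helper needs no engine changes; the sketch is only meant to pin down the semantics.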

Tests: +27 (moving-agg) +27 (temporal-join). Full sweep: 551 tests /
1857 assertions, all green.

Signed-off-by: Christian Weilbach <christian@weilbach.name>
Output of `clj -M:ffix` (cljfmt 0.9.2) over the time-series branch.
Touches pre-existing formatting nits in files unrelated to this branch
in addition to the new code; full test suite (1094 assertions) green
post-format.

Signed-off-by: Christian Weilbach <christian@weilbach.name>
Updates SQL Capabilities, DSL Reference, and Features sections so the
README reflects the new operators and helpers shipped on this branch:
RANGE BETWEEN INTERVAL frames, TIME_BUCKET, FIRST/LAST/NTH_VALUE,
FILLS/LOCF, EMA, RLEID, the q-style MAVG/MSUM/MMIN/MMAX/MDEV moving
aggregates, the :temporal-unit metadata model, and the
window-join / latest-on / generate-series Clojure helpers.

Signed-off-by: Christian Weilbach <christian@weilbach.name>
Wires the time-series operators added earlier on this branch into the
SQL parser so they're reachable from SELECT / WHERE / GROUP BY / OVER
clauses, and adds five sqllogictest files exercising them end-to-end.

SQL parser additions (src/stratum/sql.clj):
- TIMESTAMP literal: `TIMESTAMP '2024-01-15 10:30:45.123456'` parses to
  epoch-microseconds (matches the canonical micros-precision storage).
- DATE / TIMESTAMP / TIMESTAMPTZ in CREATE TABLE: column descriptors
  now carry `:temporal-unit :days` / `:micros`. The schema rides on the
  table-registry value as Clojure metadata so existing INSERT / UPDATE /
  UPSERT / DELETE paths (which assume raw arrays) keep working unchanged.
- ExtractExpression handler: `EXTRACT(field FROM col)` translates field
  to the granular op (:hour, :millisecond, :microsecond, :day-of-week,
  …) so normalization recognizes it.
- TIME_BUCKET(width, 'unit', col [, origin]): registered as a scalar
  function, both in translate-function and in translate-group-expr so
  it works in GROUP BY too.
- Window function names: MAVG, MSUM, MMIN, MMAX, MCOUNT, MDEV (q-style
  sliding aggregates with width passed via the second positional arg →
  picked up via .getOffset by JSqlParser); FILLS / LOCF, EMA, RLEID.

server.clj + sqllogictest_test.clj — propagate the table's column-schema
metadata across INSERT/UPDATE/UPSERT/DELETE atomic swaps so temporal
columns retain their `:temporal-unit` after mutations.

translate-select wraps temporal columns at query-input time using the
schema metadata so the engine sees `{:type :int64 :data arr
:temporal-unit U}` even though the table itself stores raw arrays.

sqllogictest coverage:
- test_temporal_micros.test  — TIMESTAMP literal, EXTRACT MS/US,
  DATE_TRUNC at sub-day precisions, TIMESTAMP comparisons.
- test_time_bucket.test      — 5-min / 1-hour / 1-second bucketing,
  GROUP BY TIME_BUCKET aggregation.
- test_window_value.test     — FIRST_VALUE, LAST_VALUE, OHLC pattern.
- test_moving_aggs.test      — MAVG, MSUM, MMIN, MMAX, MCOUNT, MDEV.
- test_window_locf_ema_rleid.test — RLEID, EMA.

Full sweep: 552 tests / 2529 assertions, all green.

Signed-off-by: Christian Weilbach <christian@weilbach.name>
whilo merged commit 10aa0e0 into main on May 9, 2026
5 of 6 checks passed
whilo deleted the feature/time-series branch on May 9, 2026 09:51