Time-series query engine: micros precision, RANGE frames, FIRST/LAST/EMA/RLEID, window-join, latest-on by whilo · Pull Request #21 · replikativ/stratum

whilo · 2026-05-09T08:59:11Z

Summary

Adds the time-series feature set surveyed across DuckDB / TimescaleDB / QuestDB / kdb / Polars / datajure. Each phase is its own commit; tests cover every new operator and the legacy seconds-based code paths still behave exactly as before.

Phase A — Sub-day temporal precision (epoch-microseconds, DuckDB-compatible)

New column metadata: :temporal-unit ∈ #{:days :seconds :millis :micros} (defaults to legacy :seconds).
Java SIMD kernels for micros: arrayDateTrunc{Year,Month,Day,Hour,Minute,Second,Milli,Micro}Micros, arrayExtract{Hour,Minute,Second,Millisecond,Microsecond}Micros, arrayDateAddMicrosMicros, arrayDateAddMonthsMicros, arrayDateDiffMicros.
Both vectorized (eval-expr-{to-long,vectorized}) and scalar (gb/eval-agg-expr) paths dispatch on :temporal-unit via the existing expr/*columns-meta* dynamic var (filter widened to retain temporal-unit columns alongside dict-encoded ones).
The same metadata flows through both the planner and the legacy fallback.

Phase B — TIME_BUCKET, FIRST_VALUE, LAST_VALUE, NTH_VALUE

[:time-bucket N unit col] — arbitrary-width bucketing on micros, days (with :weeks/:months), or seconds.
FIRST_VALUE / LAST_VALUE / NTH_VALUE window functions; the latter reuses the :offset slot for n. SQL parser registers all three.

Phase C — RANGE BETWEEN INTERVAL frames + GENERATE_SERIES

Two-pointer sliding window keyed off the (single) ascending ORDER BY column, O(N) per partition, for :sum/:count/:avg. Required for correct rolling aggregates over irregular time series.
Fixed a long-standing partition-reset bug in compute-sliding-window-sum: at a partition boundary, the cumulative was overwritten, silently corrupting the last row of every non-final partition. Now uses a single monotonic prefix array, with partition-local endpoint subtraction localizing each query.
stratum.api/generate-series produces a dense column suitable as :from, with a temporal form ((generate-series 0 (* 5 day-us) 1 :days :micros)) that tags the output with the right :temporal-unit. Combined with ASOF LEFT JOIN, this gives canonical gap-fill + LOCF without new join machinery.

Phase D — FILLS / LOCF and named moving aggregates

:fills window op forward-fills NaN/NULL within partition (leading NaNs stay NaN).
MAVG / MSUM / MMIN / MMAX / MCOUNT / MDEV (q-style sugar) expand at execute time to op OVER (ROWS BETWEEN N-1 PRECEDING AND CURRENT ROW). :min / :max gained sliding-frame ROWS branches; :mdev is moving population stddev.

Phase E — EMA and RLEID

:ema op: per-partition exponential moving average. :offset >= 1.0 is treated as period N (α = 2/(N+1)); otherwise α directly. Initializes at first non-NaN; NaN inputs carry the previous EMA.
:rleid: run-length-encoding group ID (handles long, double, string columns; restarts per partition).

Phase F — Window-join and LATEST ON / DISTINCT ON

stratum.api/window-join (q wj semantics): for each left row at time t, aggregate the right rows whose time falls in [t+lo, t+hi]. Sorts both sides by their ts columns, two-pointer sweep with monotonic lo/hi pointers. SUM/AVG/COUNT use right-side prefix sums (O(1) per left row); MIN/MAX scan the slice.
stratum.api/latest-on: most-recent-row-per-partition via ROW_NUMBER OVER (PARTITION BY … ORDER BY ts DESC) + HAVING rn=1.

Bonus

validate-query now accepts SELECT references that point at :window outputs.
:offset window-spec field accepts doubles too (needed for EMA's α parameter).

Tests

New: temporal_micros_test.clj (44 assertions), range_frame_test.clj (16), window_value_test.clj (18), window_extra_test.clj (28), moving_agg_test.clj (27), generate_series_test.clj (16), temporal_join_test.clj (27).
All existing tests pass; full sweep (query-test, sql-test, asof-join-test, planner tests, …) green.

OLAP-bench regression check (10M rows, T1)

Benchmark	main 1T / NT	feat 1T / NT
B1 TPC-H Q6	16.5 / 10.7	17.3 / 10.5
B2 TPC-H Q1	114.3 / 204.2	113.2 / 203.5
B3 SSB Q1.1	16.5 / 8.1	17.0 / 8.3
B5 Filtered COUNT	2.7 / 1.5	3.3 / 2.3
B6 Low-card group-by	17.5 / 8.3	17.9 / 8.5
SSB-Q1.2	15.8 / 8.0	16.3 / 8.1

No major regressions — five of six within run-to-run noise. B5 (Filtered COUNT NEQ) shows a small absolute slowdown (~0.6 ms 1T) likely attributable to the widened *columns-meta* filter; left as-is for follow-up since absolute time is sub-millisecond and the filter widening is needed for the temporal-unit dispatch.

Test plan

cljfmt formatting applied (clj -M:ffix)
Full test suite green (1094 assertions across the new + touched test files)
OLAP T1 bench compared against main, no significant regressions
CI passes

…VALUE Phase A — Microsecond-precision TIMESTAMP support: - Column metadata gains :temporal-unit ∈ #{:days :seconds :millis :micros} defaulting to legacy :seconds when absent. TIMESTAMP literals encoded as long[] of microseconds since epoch (DuckDB-compatible). - Java kernels added for micros: arrayDateTrunc{Year,Month,Day,Hour, Minute,Second,Milli,Micro}Micros, arrayExtract{Hour,Minute,Second, Millisecond,Microsecond}Micros, arrayDateAddMicrosMicros, arrayDateAddMonthsMicros, arrayDateDiffMicros. - Expression eval (vectorized + scalar) dispatches on the column's :temporal-unit via the existing *columns-meta* dynamic var, which now retains temporal-unit columns alongside dict-encoded ones. - All paths (planner + legacy) thread temporal-unit through. Phase B1 — TIME_BUCKET function: - New :time-bucket op. arrayTimeBucketMicros / Days / Months Java kernels do floor-div by arbitrary width with optional origin offset. - Sub-day units (microseconds → hours) on micros columns; days/weeks/ months on epoch-day DATE columns. Phase B2 — FIRST_VALUE / LAST_VALUE / NTH_VALUE window functions: - New :first-value / :last-value / :nth-value window ops in query.window. NTH_VALUE uses the LAG/LEAD :offset slot for n. - LAST_VALUE follows DuckDB convention of full-partition scope so OHLC bar generation works with the typical default frame. - SQL parser registers FIRST_VALUE/LAST_VALUE/NTH_VALUE as analytic functions; spec accepts the new ops. validate-query: SELECT references can now point at :window outputs. Tests: 1739 assertions across 528 tests pass (44 new for micros, 18 new for value windows). Legacy seconds-based path unchanged. Signed-off-by: Christian Weilbach <christian@weilbach.name>

Phase C1 — RANGE BETWEEN INTERVAL frames: - Two-pointer sliding window keyed off the (single) ascending ORDER BY column, O(N) per partition. compute-sliding-window-sum-range and compute-sliding-window-count-range live alongside the existing ROWS-mode helpers; SUM/COUNT/AVG dispatch on (range-frame? frame). - Fixes a long-standing partition-reset bug in the existing prefix-sum helper: at a partition boundary, the cumulative was overwritten, silently corrupting the last row of every non-final partition. Switch to a single monotonic prefix array and rely on partition-local endpoint subtraction to localize the result. Phase D1 — FILLS (LOCF): - Forward-fill NaN/NULL within partition. Leading NaNs stay NaN. Phase E1 — EMA: - Per-partition exponential moving average. Smoothing factor passed via :offset; values >= 1.0 are interpreted as a period N (alpha = 2/(N+1)). - Initializes at the first non-NaN value of the partition; NaN inputs are treated as no-op (carry the previous EMA). Phase E2 — RLEID: - Run-length-encoding group ID. Increments when the value differs from the previous row in sorted order. Handles long, double, and string columns; restarts at 1 per partition. Tests: +28 new (window-extra: 28 incl. EMA/FILLS/RLEID semantics), +12 new (range-frame). Full sweep 1573/1573 pass; the reset-bug fix also makes existing partitioned ROWS-frame queries correct. Signed-off-by: Christian Weilbach <christian@weilbach.name>

stratum.api/generate-series produces a dense column as a `:from`-ready column map. Three forms: (generate-series 1 10) → 1..10 step 1, long[] (generate-series 0 100 25) → 0,25,50,75,100, long[] (generate-series 0.0 1.0 0.25) → double[] when step is float (generate-series 0 (* 5 day-us) 1 :days :micros) → temporal-tagged spine Combined with ASOF LEFT JOIN, this enables the canonical gap-fill pattern (dense time spine + LOCF carry-forward of sparse data) without any new join machinery. Verified via test/stratum/generate_series_test. 5 tests / 16 assertions pass. Signed-off-by: Christian Weilbach <christian@weilbach.name>

Phase D2 — Named moving aggregates (q-style sugar): - MAVG / MSUM / MMIN / MMAX / MCOUNT / MDEV window ops, expanded at execute-window-functions to AVG/SUM/… OVER (ROWS BETWEEN N-1 PRECEDING AND CURRENT ROW). Width N rides on :offset. - :min and :max gained sliding-frame ROWS branches (previously only full-partition or running). Simple per-row scan within the frame — monotonic-deque optimization deferred but the contract is correct. - New :mdev op: moving population stddev (ddof=0), two-pass mean + variance to avoid cancellation. Phase F1 — window join (q `wj` semantics): - stratum.api/window-join: for each left row at time t, aggregate the right rows whose time falls in [t+lo, t+hi]. Sorts both sides ascending by their respective ts columns, two-pointer sweep over left while lo/hi pointers monotonically advance through right. - SUM / AVG / COUNT use right-side prefix sums (O(1) per left row); MIN / MAX scan the matching slice. - Single-partition (no equality keys) for now; that's the bulk of the realistic usage and matches q's `wj` over a single sym slice. Phase F2 — LATEST ON / DISTINCT ON: - stratum.api/latest-on: most recent row per partition, expressed via ROW_NUMBER OVER (PARTITION BY … ORDER BY ts DESC) + HAVING rn=1. SQL WHERE doesn't see window outputs, but HAVING does, so the rewrite goes through cleanly without engine changes. Tests: +27 (moving-agg) +27 (temporal-join). Full sweep: 551 tests / 1857 assertions, all green. Signed-off-by: Christian Weilbach <christian@weilbach.name>

Output of `clj -M:ffix` (cljfmt 0.9.2) over the time-series branch. Touches pre-existing formatting nits in files unrelated to this branch in addition to the new code; full test suite (1094 assertions) green post-format. Signed-off-by: Christian Weilbach <christian@weilbach.name>

Updates SQL Capabilities, DSL Reference, and Features sections so the README reflects the new operators and helpers shipped on this branch: RANGE BETWEEN INTERVAL frames, TIME_BUCKET, FIRST/LAST/NTH_VALUE, FILLS/LOCF, EMA, RLEID, the q-style MAVG/MSUM/MMIN/MMAX/MDEV moving aggregates, the :temporal-unit metadata model, and the window-join / latest-on / generate-series Clojure helpers. Signed-off-by: Christian Weilbach <christian@weilbach.name>

Wires the time-series operators added earlier on this branch into the SQL parser so they're reachable from SELECT / WHERE / GROUP BY / OVER clauses, and adds five sqllogictest files exercising them end-to-end. SQL parser additions (src/stratum/sql.clj): - TIMESTAMP literal: `TIMESTAMP '2024-01-15 10:30:45.123456'` parses to epoch-microseconds (matches the canonical micros-precision storage). - DATE / TIMESTAMP / TIMESTAMPTZ in CREATE TABLE: column descriptors now carry `:temporal-unit :days` / `:micros`. The schema rides on the table-registry value as Clojure metadata so existing INSERT / UPDATE / UPSERT / DELETE paths (which assume raw arrays) keep working unchanged. - ExtractExpression handler: `EXTRACT(field FROM col)` translates field to the granular op (:hour, :millisecond, :microsecond, :day-of-week, …) so normalization recognizes it. - TIME_BUCKET(width, 'unit', col [, origin]): registered as a scalar function, both in translate-function and in translate-group-expr so it works in GROUP BY too. - Window function names: MAVG, MSUM, MMIN, MMAX, MCOUNT, MDEV (q-style sliding aggregates with width passed via the second positional arg → picked up via .getOffset by JSqlParser); FILLS / LOCF, EMA, RLEID. server.clj + sqllogictest_test.clj — propagate the table's column-schema metadata across INSERT/UPDATE/UPSERT/DELETE atomic swaps so temporal columns retain their `:temporal-unit` after mutations. translate-select wraps temporal columns at query-input time using the schema metadata so the engine sees `{:type :int64 :data arr :temporal-unit U}` even though the table itself stores raw arrays. sqllogictest coverage: - test_temporal_micros.test — TIMESTAMP literal, EXTRACT MS/US, DATE_TRUNC at sub-day precisions, TIMESTAMP comparisons. - test_time_bucket.test — 5-min / 1-hour / 1-second bucketing, GROUP BY TIME_BUCKET aggregation. - test_window_value.test — FIRST_VALUE, LAST_VALUE, OHLC pattern. - test_moving_aggs.test — MAVG, MSUM, MMIN, MMAX, MCOUNT, MDEV. - test_window_locf_ema_rleid.test — RLEID, EMA. Full sweep: 552 tests / 2529 assertions, all green. Signed-off-by: Christian Weilbach <christian@weilbach.name>

whilo added 7 commits May 9, 2026 00:34

whilo merged commit 10aa0e0 into main May 9, 2026
5 of 6 checks passed

whilo deleted the feature/time-series branch May 9, 2026 09:51

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Time-series query engine: micros precision, RANGE frames, FIRST/LAST/EMA/RLEID, window-join, latest-on#21

Time-series query engine: micros precision, RANGE frames, FIRST/LAST/EMA/RLEID, window-join, latest-on#21
whilo merged 7 commits into
mainfrom
feature/time-series

whilo commented May 9, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

whilo commented May 9, 2026

Summary

Phase A — Sub-day temporal precision (epoch-microseconds, DuckDB-compatible)

Phase B — TIME_BUCKET, FIRST_VALUE, LAST_VALUE, NTH_VALUE

Phase C — RANGE BETWEEN INTERVAL frames + GENERATE_SERIES

Phase D — FILLS / LOCF and named moving aggregates

Phase E — EMA and RLEID

Phase F — Window-join and LATEST ON / DISTINCT ON

Bonus

Tests

OLAP-bench regression check (10M rows, T1)

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant