Time-series query engine: micros precision, RANGE frames, FIRST/LAST/EMA/RLEID, window-join, latest-on#21
Merged
Merged
Conversation
…VALUE
Phase A — Microsecond-precision TIMESTAMP support:
- Column metadata gains :temporal-unit ∈ #{:days :seconds :millis :micros}
defaulting to legacy :seconds when absent. TIMESTAMP literals encoded
as long[] of microseconds since epoch (DuckDB-compatible).
- Java kernels added for micros: arrayDateTrunc{Year,Month,Day,Hour,
Minute,Second,Milli,Micro}Micros, arrayExtract{Hour,Minute,Second,
Millisecond,Microsecond}Micros, arrayDateAddMicrosMicros,
arrayDateAddMonthsMicros, arrayDateDiffMicros.
- Expression eval (vectorized + scalar) dispatches on the column's
:temporal-unit via the existing *columns-meta* dynamic var, which now
retains temporal-unit columns alongside dict-encoded ones.
- All paths (planner + legacy) thread temporal-unit through.
Phase B1 — TIME_BUCKET function:
- New :time-bucket op. arrayTimeBucketMicros / Days / Months Java
kernels do floor-div by arbitrary width with optional origin offset.
- Sub-day units (microseconds → hours) on micros columns; days/weeks/
months on epoch-day DATE columns.
Phase B2 — FIRST_VALUE / LAST_VALUE / NTH_VALUE window functions:
- New :first-value / :last-value / :nth-value window ops in
query.window. NTH_VALUE uses the LAG/LEAD :offset slot for n.
- LAST_VALUE follows DuckDB convention of full-partition scope so OHLC
bar generation works with the typical default frame.
- SQL parser registers FIRST_VALUE/LAST_VALUE/NTH_VALUE as analytic
functions; spec accepts the new ops.
validate-query: SELECT references can now point at :window outputs.
Tests: 1739 assertions across 528 tests pass (44 new for micros, 18
new for value windows). Legacy seconds-based path unchanged.
Signed-off-by: Christian Weilbach <christian@weilbach.name>
Phase C1 — RANGE BETWEEN INTERVAL frames: - Two-pointer sliding window keyed off the (single) ascending ORDER BY column, O(N) per partition. compute-sliding-window-sum-range and compute-sliding-window-count-range live alongside the existing ROWS-mode helpers; SUM/COUNT/AVG dispatch on (range-frame? frame). - Fixes a long-standing partition-reset bug in the existing prefix-sum helper: at a partition boundary, the cumulative was overwritten, silently corrupting the last row of every non-final partition. Switch to a single monotonic prefix array and rely on partition-local endpoint subtraction to localize the result. Phase D1 — FILLS (LOCF): - Forward-fill NaN/NULL within partition. Leading NaNs stay NaN. Phase E1 — EMA: - Per-partition exponential moving average. Smoothing factor passed via :offset; values >= 1.0 are interpreted as a period N (alpha = 2/(N+1)). - Initializes at the first non-NaN value of the partition; NaN inputs are treated as no-op (carry the previous EMA). Phase E2 — RLEID: - Run-length-encoding group ID. Increments when the value differs from the previous row in sorted order. Handles long, double, and string columns; restarts at 1 per partition. Tests: +28 new (window-extra: 28 incl. EMA/FILLS/RLEID semantics), +12 new (range-frame). Full sweep 1573/1573 pass; the reset-bug fix also makes existing partitioned ROWS-frame queries correct. Signed-off-by: Christian Weilbach <christian@weilbach.name>
stratum.api/generate-series produces a dense column as a `:from`-ready
column map. Three forms:
(generate-series 1 10) → 1..10 step 1, long[]
(generate-series 0 100 25) → 0,25,50,75,100, long[]
(generate-series 0.0 1.0 0.25) → double[] when step is float
(generate-series 0 (* 5 day-us)
1 :days :micros) → temporal-tagged spine
Combined with ASOF LEFT JOIN, this enables the canonical gap-fill
pattern (dense time spine + LOCF carry-forward of sparse data) without
any new join machinery. Verified via test/stratum/generate_series_test.
5 tests / 16 assertions pass.
Signed-off-by: Christian Weilbach <christian@weilbach.name>
Phase D2 — Named moving aggregates (q-style sugar): - MAVG / MSUM / MMIN / MMAX / MCOUNT / MDEV window ops, expanded at execute-window-functions to AVG/SUM/… OVER (ROWS BETWEEN N-1 PRECEDING AND CURRENT ROW). Width N rides on :offset. - :min and :max gained sliding-frame ROWS branches (previously only full-partition or running). Simple per-row scan within the frame — monotonic-deque optimization deferred but the contract is correct. - New :mdev op: moving population stddev (ddof=0), two-pass mean + variance to avoid cancellation. Phase F1 — window join (q `wj` semantics): - stratum.api/window-join: for each left row at time t, aggregate the right rows whose time falls in [t+lo, t+hi]. Sorts both sides ascending by their respective ts columns, two-pointer sweep over left while lo/hi pointers monotonically advance through right. - SUM / AVG / COUNT use right-side prefix sums (O(1) per left row); MIN / MAX scan the matching slice. - Single-partition (no equality keys) for now; that's the bulk of the realistic usage and matches q's `wj` over a single sym slice. Phase F2 — LATEST ON / DISTINCT ON: - stratum.api/latest-on: most recent row per partition, expressed via ROW_NUMBER OVER (PARTITION BY … ORDER BY ts DESC) + HAVING rn=1. SQL WHERE doesn't see window outputs, but HAVING does, so the rewrite goes through cleanly without engine changes. Tests: +27 (moving-agg) +27 (temporal-join). Full sweep: 551 tests / 1857 assertions, all green. Signed-off-by: Christian Weilbach <christian@weilbach.name>
Output of `clj -M:ffix` (cljfmt 0.9.2) over the time-series branch. Touches pre-existing formatting nits in files unrelated to this branch in addition to the new code; full test suite (1094 assertions) green post-format. Signed-off-by: Christian Weilbach <christian@weilbach.name>
Updates SQL Capabilities, DSL Reference, and Features sections so the README reflects the new operators and helpers shipped on this branch: RANGE BETWEEN INTERVAL frames, TIME_BUCKET, FIRST/LAST/NTH_VALUE, FILLS/LOCF, EMA, RLEID, the q-style MAVG/MSUM/MMIN/MMAX/MDEV moving aggregates, the :temporal-unit metadata model, and the window-join / latest-on / generate-series Clojure helpers. Signed-off-by: Christian Weilbach <christian@weilbach.name>
Wires the time-series operators added earlier on this branch into the
SQL parser so they're reachable from SELECT / WHERE / GROUP BY / OVER
clauses, and adds five sqllogictest files exercising them end-to-end.
SQL parser additions (src/stratum/sql.clj):
- TIMESTAMP literal: `TIMESTAMP '2024-01-15 10:30:45.123456'` parses to
epoch-microseconds (matches the canonical micros-precision storage).
- DATE / TIMESTAMP / TIMESTAMPTZ in CREATE TABLE: column descriptors
now carry `:temporal-unit :days` / `:micros`. The schema rides on the
table-registry value as Clojure metadata so existing INSERT / UPDATE /
UPSERT / DELETE paths (which assume raw arrays) keep working unchanged.
- ExtractExpression handler: `EXTRACT(field FROM col)` translates field
to the granular op (:hour, :millisecond, :microsecond, :day-of-week,
…) so normalization recognizes it.
- TIME_BUCKET(width, 'unit', col [, origin]): registered as a scalar
function, both in translate-function and in translate-group-expr so
it works in GROUP BY too.
- Window function names: MAVG, MSUM, MMIN, MMAX, MCOUNT, MDEV (q-style
sliding aggregates with width passed via the second positional arg →
picked up via .getOffset by JSqlParser); FILLS / LOCF, EMA, RLEID.
server.clj + sqllogictest_test.clj — propagate the table's column-schema
metadata across INSERT/UPDATE/UPSERT/DELETE atomic swaps so temporal
columns retain their `:temporal-unit` after mutations.
translate-select wraps temporal columns at query-input time using the
schema metadata so the engine sees `{:type :int64 :data arr
:temporal-unit U}` even though the table itself stores raw arrays.
sqllogictest coverage:
- test_temporal_micros.test — TIMESTAMP literal, EXTRACT MS/US,
DATE_TRUNC at sub-day precisions, TIMESTAMP comparisons.
- test_time_bucket.test — 5-min / 1-hour / 1-second bucketing,
GROUP BY TIME_BUCKET aggregation.
- test_window_value.test — FIRST_VALUE, LAST_VALUE, OHLC pattern.
- test_moving_aggs.test — MAVG, MSUM, MMIN, MMAX, MCOUNT, MDEV.
- test_window_locf_ema_rleid.test — RLEID, EMA.
Full sweep: 552 tests / 2529 assertions, all green.
Signed-off-by: Christian Weilbach <christian@weilbach.name>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Adds the time-series feature set surveyed across DuckDB / TimescaleDB / QuestDB / kdb / Polars / datajure. Each phase is its own commit; tests cover every new operator and the legacy seconds-based code paths still behave exactly as before.
Phase A — Sub-day temporal precision (epoch-microseconds, DuckDB-compatible)
:temporal-unit ∈ #{:days :seconds :millis :micros}(defaults to legacy:seconds).arrayDateTrunc{Year,Month,Day,Hour,Minute,Second,Milli,Micro}Micros,arrayExtract{Hour,Minute,Second,Millisecond,Microsecond}Micros,arrayDateAddMicrosMicros,arrayDateAddMonthsMicros,arrayDateDiffMicros.eval-expr-{to-long,vectorized}) and scalar (gb/eval-agg-expr) paths dispatch on:temporal-unitvia the existingexpr/*columns-meta*dynamic var (filter widened to retain temporal-unit columns alongside dict-encoded ones).Phase B — TIME_BUCKET, FIRST_VALUE, LAST_VALUE, NTH_VALUE
[:time-bucket N unit col]— arbitrary-width bucketing on micros, days (with:weeks/:months), or seconds.FIRST_VALUE/LAST_VALUE/NTH_VALUEwindow functions; the latter reuses the:offsetslot forn. SQL parser registers all three.Phase C — RANGE BETWEEN INTERVAL frames + GENERATE_SERIES
:sum/:count/:avg. Required for correct rolling aggregates over irregular time series.compute-sliding-window-sum: at a partition boundary, the cumulative was overwritten, silently corrupting the last row of every non-final partition. Now uses a single monotonic prefix array, with partition-local endpoint subtraction localizing each query.stratum.api/generate-seriesproduces a dense column suitable as:from, with a temporal form ((generate-series 0 (* 5 day-us) 1 :days :micros)) that tags the output with the right:temporal-unit. Combined withASOF LEFT JOIN, this gives canonical gap-fill + LOCF without new join machinery.Phase D — FILLS / LOCF and named moving aggregates
:fillswindow op forward-fills NaN/NULL within partition (leading NaNs stay NaN).MAVG / MSUM / MMIN / MMAX / MCOUNT / MDEV(q-style sugar) expand at execute time toop OVER (ROWS BETWEEN N-1 PRECEDING AND CURRENT ROW).:min/:maxgained sliding-frame ROWS branches;:mdevis moving population stddev.Phase E — EMA and RLEID
:emaop: per-partition exponential moving average.:offset >= 1.0is treated as period N (α = 2/(N+1)); otherwise α directly. Initializes at first non-NaN; NaN inputs carry the previous EMA.:rleid: run-length-encoding group ID (handles long, double, string columns; restarts per partition).Phase F — Window-join and LATEST ON / DISTINCT ON
stratum.api/window-join(qwjsemantics): for each left row at timet, aggregate the right rows whose time falls in[t+lo, t+hi]. Sorts both sides by their ts columns, two-pointer sweep with monotonic lo/hi pointers. SUM/AVG/COUNT use right-side prefix sums (O(1) per left row); MIN/MAX scan the slice.stratum.api/latest-on: most-recent-row-per-partition via ROW_NUMBER OVER (PARTITION BY … ORDER BY ts DESC) + HAVING rn=1.Bonus
validate-querynow accepts SELECT references that point at:windowoutputs.:offsetwindow-spec field accepts doubles too (needed for EMA's α parameter).Tests
temporal_micros_test.clj(44 assertions),range_frame_test.clj(16),window_value_test.clj(18),window_extra_test.clj(28),moving_agg_test.clj(27),generate_series_test.clj(16),temporal_join_test.clj(27).query-test,sql-test,asof-join-test, planner tests, …) green.OLAP-bench regression check (10M rows, T1)
No major regressions — five of six within run-to-run noise. B5 (Filtered COUNT NEQ) shows a small absolute slowdown (~0.6 ms 1T) likely attributable to the widened
*columns-meta*filter; left as-is for follow-up since absolute time is sub-millisecond and the filter widening is needed for the temporal-unit dispatch.Test plan
clj -M:ffix)