|
1 | | -# Tumble Window Aggregation {#tumble_window} |
| 1 | +# Tumble Window Aggregation |
2 | 2 |
|
3 | 3 | ## Overview |
4 | | -Tumble slices the unbounded data into different windows according to its parameters. Internally, Timeplus observes the data streaming and automatically decides when to close a sliced window and emit the final results for that window. |
| 4 | + |
| 5 | +A **tumbling window** divides an unbounded data stream into **fixed-size, non-overlapping intervals** based on event time or processing time. |
| 6 | + |
| 7 | +Each event belongs to **exactly one** window. This is useful for periodic aggregations such as per-minute averages, hourly counts, or daily summaries. |
| 8 | + |
| 9 | +Unlike **hopping** or **session** windows, tumbling windows do not overlap — once a window closes, a new one immediately starts. |
| 10 | +This makes them simple, deterministic, and ideal for producing periodic reports or metrics. |
| 11 | + |
| 12 | +## Syntax |
5 | 13 |
|
6 | 14 | ```sql |
7 | | -SELECT <column_name1>, <column_name2>, <aggr_function> |
8 | | -FROM tumble(<stream_name>, [<timestamp_column>], <tumble_window_size>, [<time_zone>]) |
| 15 | +SELECT <grouping-keys>, <aggr_functions> |
| 16 | +FROM tumble(<stream-name>, [<timestamp-column>], <window-size>]) |
9 | 17 | [WHERE clause] |
10 | | -GROUP BY [window_start | window_end], ... |
11 | | -EMIT <window_emit_policy> |
| 18 | +GROUP BY [window_start | window_end], <other-group-keys> ... |
| 19 | +EMIT <emit-policy> |
12 | 20 | SETTINGS <key1>=<value1>, <key2>=<value2>, ... |
13 | 21 | ``` |
14 | 22 |
|
15 | | -Tumble window means a fixed non-overlapped time window. Here is one example for a 5 seconds tumble window: |
| 23 | +### Parameters |
| 24 | + |
| 25 | +- `<stream-name>` : the source stream the tumble window applied to. **Required** |
| 26 | +- `<timestamp-column>` : the event timestamp column which is used to calculate window starts / ends and internal watermark. You can use `now()` or `now64(3)` to enable processing time tumble window. Default is `_tp_time` if absent. **Optional** |
| 27 | +- `<window-size>` : tumble window interval size. Supported interval units are listed below. **Required** |
| 28 | + - `s` : second |
| 29 | + - `m` : miniute |
| 30 | + - `h` : hour |
| 31 | + - `d` : day |
| 32 | + - `w` : week |
| 33 | + |
| 34 | +### Example |
16 | 35 |
|
17 | 36 | ``` |
18 | | -["2020-01-01 00:00:00", "2020-01-01 00:00:05] |
19 | | -["2020-01-01 00:00:05", "2020-01-01 00:00:10] |
20 | | -["2020-01-01 00:00:10", "2020-01-01 00:00:15] |
21 | | -... |
| 37 | +CREATE STREAM device_metrics ( |
| 38 | + device_id string, |
| 39 | + cpu_usage float, |
| 40 | + event_time datetime64(3) |
| 41 | +); |
| 42 | +
|
| 43 | +SELECT |
| 44 | + window_start, |
| 45 | + window_end, |
| 46 | + device_id, |
| 47 | + avg(cpu_usage) AS avg_cpu |
| 48 | +FROM tumble(device_metrics, event_time, 5s) |
| 49 | +GROUP BY |
| 50 | + window_start, |
| 51 | + device_id |
| 52 | +EMIT AFTER WINDOW CLOSE; |
22 | 53 | ``` |
23 | 54 |
|
24 | | -`tumble` window in Timeplus is left closed and right open `[)` meaning it includes all events which have timestamps **greater or equal** to the **lower bound** of the window, but **less** than the **upper bound** of the window. |
| 55 | +**Explanation**: |
| 56 | +- Events are grouped into **5-second, non-overlapping windows** based on their `event_time`. |
| 57 | +- Each `device_id`’s events are aggregated independently within each window. |
| 58 | +- When a window closes, the system emits one aggregated result per device with the computed `avg_cpu`. |
25 | 59 |
|
26 | | -`tumble` in the above SQL spec is a table function whose core responsibility is assigning tumble window to each event in a streaming way. The `tumble` table function will generate 2 new columns: `window_start, window_end` which correspond to the low and high bounds of a tumble window. |
| 60 | +**Example timeline**: |
27 | 61 |
|
28 | | -`tumble` table function accepts 4 parameters: `<timestamp_column>` and `<time-zone>` are optional, the others are mandatory. |
| 62 | +| Window | Time Range | Events Included | |
| 63 | +| :----: | :------------------ | :----------------------------- | |
| 64 | +| W1 | 00:00:00 → 00:00:05 | Events in [00:00:00, 00:00:05) | |
| 65 | +| W2 | 00:00:05 → 00:00:10 | Events in [00:00:05, 00:00:10) | |
| 66 | +| W3 | 00:00:10 → 00:00:15 | Events in [00:00:10, 00:00:15) | |
29 | 67 |
|
30 | | -When the `<timestamp_column>` parameter is omitted from the query, the stream's default event timestamp column which is `_tp_time` will be used. |
| 68 | + |
31 | 69 |
|
32 | | -When the `<time_zone>` parameter is omitted the system's default timezone will be used. `<time_zone>` is a string type parameter, for example `UTC`. |
| 70 | +Each event falls into exactly one window, ensuring deterministic aggregation and predictable output intervals. |
33 | 71 |
|
34 | | -`<tumble_window_size>` is an interval parameter: `<n><UNIT>` where `<UNIT>` supports `s`, `m`, `h`, `d`, `w`. |
35 | | -It doesn't yet support `M`, `q`, `y`. For example, `tumble(my_stream, 5s)`. |
| 72 | +## Emit Policies |
36 | 73 |
|
37 | | -More concrete examples: |
| 74 | +Emit policies define **when** Timeplus should output results from **time-windowed** aggregations such as **tumble**, **hop**, **session** and **global-windowed** aggregation. |
38 | 75 |
|
39 | | -```sql |
40 | | -SELECT device, max(cpu_usage) |
41 | | -FROM tumble(device_utils, 5s) |
42 | | -GROUP BY device, window_end |
43 | | -``` |
| 76 | +These policies control whether results are emitted **only after the window closes**, **after a delay** to honor late events, or **incrementally during the window**. |
| 77 | + |
| 78 | +### `EMIT AFTER WINDOW CLOSE` |
44 | 79 |
|
45 | | -The above example SQL continuously aggregates max cpu usage per device per tumble window for the stream `devices_utils`. Every time a window is closed, Timeplus Proton emits the aggregation results. |
| 80 | +This is the **default behavior** for all time window aggregations. Timeplus emits the aggregated results once the window closes. |
46 | 81 |
|
47 | | -Let's change `tumble(stream, 5s)` to `tumble(stream, timestmap, 5s)` : |
| 82 | +**Example**: |
48 | 83 |
|
49 | 84 | ```sql |
50 | | -SELECT device, max(cpu_usage) |
51 | | -FROM tumble(devices, timestamp, 5s) |
52 | | -GROUP BY device, window_end |
53 | | -EMIT AFTER WINDOW CLOSE WITH DELAY 2s; |
| 85 | +SELECT window_start, device, max(cpu_usage) |
| 86 | +FROM tumble(device_utils, 5s) |
| 87 | +GROUP BY window_start, device; |
54 | 88 | ``` |
55 | 89 |
|
56 | | -Same as the above delayed tumble window aggregation, except in this query, user specifies a **specific time column** `timestamp` for tumble windowing. |
| 90 | +This query continuously computes the **maximum CPU usage** per device in every 5-second tumble window from the stream `device_utils`. |
| 91 | +Each time a window closes (as determined by the internal watermark), Timeplus emits the results once. |
57 | 92 |
|
58 | | -The example below is so called processing time processing which uses wall clock time to assign windows. Timeplus internally processes `now/now64` in a streaming way. |
| 93 | +:::info |
| 94 | +A watermark is an internal timestamp that advances monotonically per stream, determining when a window can be safely closed. |
| 95 | +::: |
59 | 96 |
|
60 | | -```sql |
61 | | -SELECT device, max(cpu_usage) |
62 | | -FROM tumble(devices, now64(3, 'UTC'), 5s) |
63 | | -GROUP BY device, window_end |
64 | | -EMIT AFTER WINDOW CLOSE WITH DELAY 2s; |
65 | | -``` |
| 97 | +### `EMIT AFTER WINDOW CLOSE WITH DELAY` |
66 | 98 |
|
67 | | -## Emit Policies |
68 | | - |
69 | | -### EMIT AFTER WINDOW CLOSE {#emit_after} |
| 99 | +Adds a **delay period** before emitting window results, allowing **late events** to be included. |
70 | 100 |
|
71 | | -You can omit `EMIT AFTER WINDOW CLOSE`, since this is the default behavior for time window aggregations. For example: |
| 101 | +**Example**: |
72 | 102 |
|
73 | 103 | ```sql |
74 | | -SELECT device, max(cpu_usage) |
| 104 | +SELECT window_start, device, max(cpu_usage) |
75 | 105 | FROM tumble(device_utils, 5s) |
76 | | -GROUP BY device, window_end |
| 106 | +GROUP BY window_start, device |
| 107 | +EMIT AFTER WINDOW CLOSE WITH DELAY 2s; |
77 | 108 | ``` |
78 | 109 |
|
79 | | -The above example SQL continuously aggregates max cpu usage per device per tumble window for the stream `devices_utils`. Every time a window is closed, Timeplus Proton emits the aggregation results. How to determine the window should be closed? This is done by [Watermark](/understanding-watermark), which is an internal timestamp. It is guaranteed to be increased monotonically per stream query. |
| 110 | +This query aggregates the **maximum CPU usage** for each device per 5-second tumble window, then waits 2 additional seconds after the window closes before emitting results. This helps capture any late-arriving events that fall within the window period. |
80 | 111 |
|
81 | | -### EMIT AFTER WINDOW CLOSE WITH DELAY {#emit_after_with_delay} |
| 112 | +### `EMIT TIMEOUT` |
82 | 113 |
|
83 | | -Example: |
| 114 | +In some streaming scenarios, the **last tumble window** might remain open because no new events arrive to advance the watermark (i.e., the event time progress). |
| 115 | +The **`EMIT TIMEOUT`** clause helps forcefully close such idle windows after a specified period of inactivity. |
| 116 | + |
| 117 | +**Example**: |
84 | 118 |
|
85 | 119 | ```sql |
86 | | -SELECT device, max(cpu_usage) |
| 120 | +SELECT window_start, device, max(cpu_usage) |
87 | 121 | FROM tumble(device_utils, 5s) |
88 | | -GROUP BY device, widnow_end |
89 | | -EMIT AFTER WINDOW CLOSE WITH DELAY 2s; |
| 122 | +GROUP BY window_start, device |
| 123 | +EMIT TIMEOUT 3s; |
90 | 124 | ``` |
91 | 125 |
|
92 | | -The above example SQL continuously aggregates max cpu usage per device per tumble window for the stream `device_utils`. Every time a window is closed, Timeplus Proton waits for another 2 seconds and then emits the aggregation results. |
| 126 | +In this example: |
| 127 | + |
| 128 | +- The query continuously aggregates the maximum CPU usage per device in **5-second tumble windows**. |
| 129 | +- If **no new events** arrive for **3 seconds**, Timeplus will **force-close** the most recent window and emit the final results. |
| 130 | +- Once the window is closed, the **internal watermark** (event time) advances as well. |
| 131 | +- Any **late events** that belong to this now-closed window will be discarded. |
| 132 | + |
| 133 | +### `EMIT ON UPDATE` |
93 | 134 |
|
94 | | -### EMIT ON UPDATE {#emit_on_update} |
| 135 | +Emits **intermediate aggregation updates** whenever the results change within an open window. |
| 136 | +This is useful for near real-time visibility into evolving metrics. |
95 | 137 |
|
96 | | -You can apply `EMIT ON UPDATE` in time windows, such as tumble/hop/session, with `GROUP BY` keys. For example: |
| 138 | +**Example**: |
97 | 139 |
|
98 | 140 | ```sql |
99 | 141 | SELECT |
100 | | - window_start, cid, count() AS cnt |
| 142 | + window_start, device, max(cpu_usage) |
101 | 143 | FROM |
102 | | - tumble(car_live_data, 5s) |
103 | | -WHERE |
104 | | - cid IN ('c00033', 'c00022') |
| 144 | + tumble(device_utils, 5s) |
105 | 145 | GROUP BY |
106 | | - window_start, cid |
107 | | -EMIT ON UPDATE |
| 146 | + window_start, device |
| 147 | +EMIT ON UPDATE; |
108 | 148 | ``` |
109 | 149 |
|
110 | | -During the 5 second tumble window, even the window is not closed, as long as the aggregation value(`cnt`) for the same `cid` is different , the results will be emitted. |
| 150 | +Here, during each 5-second window, the system emits updates whenever there are new events flowing into the open window, even before the window closes. |
111 | 151 |
|
112 | | -### EMIT ON UPDATE WITH DELAY {#emit_on_update_with_delay} |
| 152 | +### `EMIT ON UPDATE WITH BATCH` |
113 | 153 |
|
114 | | -Adding the `WITH DELAY` to `EMIT ON UPDATE` will allow late event for the window aggregation. |
| 154 | +Combines **periodic emission** with **update-based** triggers. |
| 155 | +Timeplus checks the intermediate aggregation results at regular intervals and emits them if they have changed which can significally improve the emit efficiency and throughput compared with `EMIT ON UPDATE`. |
| 156 | + |
| 157 | +**Example**: |
115 | 158 |
|
116 | 159 | ```sql |
117 | 160 | SELECT |
118 | | - window_start, cid, count() AS cnt |
| 161 | + window_start, device, max(cpu_usage) |
119 | 162 | FROM |
120 | | - tumble(car_live_data, 5s) |
121 | | -WHERE |
122 | | - cid IN ('c00033', 'c00022') |
| 163 | + tumble(device_utils, 5s) |
123 | 164 | GROUP BY |
124 | | - window_start, cid |
125 | | -EMIT ON UPDATE WITH DELAY 2s |
| 165 | + window_start, device |
| 166 | +EMIT ON UPDATE WITH BATCH 1s; |
126 | 167 | ``` |
127 | 168 |
|
128 | | -### EMIT ON UPDATE WITH BATCH {#emit_on_update_with_batch} |
| 169 | +### `EMIT ON UPDATE WITH DELAY` |
| 170 | + |
| 171 | +Similar to **`EMIT ON UPDATE`**, but includes a delay to allow late events before emitting incremental updates. |
| 172 | + |
| 173 | +**Example**: |
129 | 174 |
|
130 | | -You can combine `EMIT PERIODIC` and `EMIT ON UPDATE` together. In this case, even the window is not closed, Timeplus will check the intermediate aggregation result at the specified interval and emit rows if the result is changed. |
131 | 175 | ```sql |
132 | 176 | SELECT |
133 | | - window_start, cid, count() AS cnt |
| 177 | + window_start, device, max(cpu_usage) |
134 | 178 | FROM |
135 | | - tumble(car_live_data, 5s) |
136 | | -WHERE |
137 | | - cid IN ('c00033', 'c00022') |
| 179 | + tumble(device_utils, 5s) |
138 | 180 | GROUP BY |
139 | | - window_start, cid |
140 | | -EMIT ON UPDATE WITH BATCH 2s |
| 181 | + window_start, device |
| 182 | +EMIT ON UPDATE WITH DELAY 2s; |
141 | 183 | ``` |
0 commit comments