| 1 | +--- |
| 2 | +title: Optimizing Queries |
| 3 | +description: How to write jsonata_query expressions that stream in constant memory — and what happens when they don't |
| 4 | +--- |
| 5 | + |
| 6 | +# Optimizing Queries |
| 7 | + |
| 8 | +`jsonata_query` includes a query planner that decomposes expressions into streaming accumulators at compile time. When it can decompose, queries run in **constant memory** with a single table scan — matching native SQL performance. When it can't, it falls back to accumulating all rows in memory. |
| 9 | + |
| 10 | +The difference is measurable: **83ms vs 439ms** (about 5.3x) on 100K rows for the same 5-aggregate report. On larger datasets the memory gap widens further, since streaming uses O(1) memory while accumulation uses O(n). |
| 11 | + |
| 12 | +## What streams |
| 13 | + |
| 14 | +These patterns are recognized at compile time and never buffer rows (count distinct keeps only the set of unique values): |
| 15 | + |
| 16 | +| Pattern | Example | Memory | |
| 17 | +|---|---|---| |
| 18 | +| Simple aggregates | `$sum(amount)`, `$count($)`, `$max(price)` | O(1) | |
| 19 | +| Filtered aggregates | `$sum($filter($, function($v){$v.status = "completed"}).amount)` | O(1) | |
| 20 | +| Count distinct | `$count($distinct(region))` | O(unique) | |
| 21 | +| Object/array constructors | `{ "a": $sum(x), "b": $max(y) }` | O(1) | |
| 22 | +| Post-aggregate arithmetic | `$sum(x) - $count($)` | O(1) | |
| 23 | +| Finalizer functions | `$round($average(x), 2)` | O(1) | |
| 24 | +| Constants | `"Q1 Report"`, `42` | O(1) | |
| 25 | +| Constant folding | `$sum(amount * 1.1)` → `$sum(amount) * 1.1` | O(1) | |
| 26 | + |
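To make the O(1) column concrete, here is a minimal Python sketch of the accumulator idea. It is an illustration only, not the planner's implementation: `StreamingReport` and its fields are hypothetical names, and the class hand-models an expression shaped like `{ "total": $sum(amount), "n": $count($), "avg": $round($average(amount), 2), "regions": $count($distinct(region)) }`.

```python
from dataclasses import dataclass, field


@dataclass
class StreamingReport:
    """Hypothetical stand-in for a decomposed query: one small piece of state per aggregate."""

    total: float = 0.0                          # state for $sum(amount)
    count: int = 0                              # state for $count($), also feeds $average
    regions: set = field(default_factory=set)   # $count($distinct(region)): O(unique)

    def step(self, row: dict) -> None:
        # Called once per row; the row itself is not retained.
        self.total += row["amount"]
        self.count += 1
        self.regions.add(row["region"])

    def finalize(self) -> dict:
        # Finalizers and post-aggregate arithmetic run once, after the scan.
        avg = self.total / self.count if self.count else None
        return {
            "total": self.total,
            "n": self.count,
            "avg": round(avg, 2) if avg is not None else None,
            "regions": len(self.regions),
        }


rows = [
    {"amount": 10.0, "region": "eu"},
    {"amount": 20.0, "region": "us"},
    {"amount": 5.0, "region": "eu"},
]
report = StreamingReport()
for row in rows:
    report.step(row)
result = report.finalize()
print(result)
```

Memory is bounded by the accumulator state, not the row count. The only state that grows is the set behind the distinct count, which is why that row of the table is O(unique) instead of O(1).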
| 27 | +## What falls back to O(n) |
| 28 | + |
| 29 | +These patterns require all rows in memory. They work correctly, but memory and time scale linearly with row count: |
| 30 | + |
| 31 | +| Pattern | Why it can't stream | |
| 32 | +|---|---| |
| 33 | +| `$sort($, function($a,$b){...})` | Needs all rows to determine order | |
| 34 | +| `$reduce($, function($a,$v){...}, init)` | Each step depends on all previous rows | |
| 35 | +| `$map($, function($v){...})` | Output is one element per row — O(n) by definition | |
| 36 | +| Variable bindings + nested lambdas | `($x := $sum(amount); $map($, function($v){$v.amount / $x}))` — two-pass dependency | |
| 37 | + |
| 38 | +### What O(n) costs in practice |
| 39 | + |
| 40 | +On 100K rows with a 5-aggregate report: |
| 41 | + |
| 42 | +| Mode | Time | Memory | |
| 43 | +|---|---|---| |
| 44 | +| Streaming | 83ms | O(1) | |
| 45 | +| Accumulating | 439ms | O(100K rows) | |
| 46 | + |
| 47 | +At 1M rows, accumulation means holding every row in memory before evaluation begins. Streaming processes each row once and discards it. |
| 48 | + |
| 49 | +## Mixed expressions: partial fallback |
| 50 | + |
| 51 | +When streaming patterns and non-streaming (opaque) patterns coexist in one expression, **only the opaque keys pay the O(n) cost**: |
| 52 | + |
| 53 | +```sql |
| 54 | +jsonata_query('{ |
| 55 | + "total": $sum(amount), /* streams: O(1) */ |
| 56 | + "avg": $average(amount), /* streams: O(1) */ |
| 57 | + "top_5": $sort($, fn)[0..4] /* accumulates: O(n) */ |
| 58 | +}', data) |
| 59 | +``` |
| 60 | + |
| 61 | +`total` and `avg` run in constant memory regardless. The planner doesn't give up on the entire expression because one key is expensive. |
| 62 | + |
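The partial fallback can be pictured with a small Python model. This is a sketch under stated assumptions, not the planner's code: `run_mixed` is a hypothetical name, and the comparator written as `fn` in the expression is assumed here to order rows by `amount`, descending.

```python
def run_mixed(rows):
    """Hypothetical model of a mixed plan: two streaming keys, one buffered key."""
    total = 0.0     # $sum(amount): constant state
    count = 0       # supports $average(amount)
    buffered = []   # only "top_5" needs the rows themselves

    for row in rows:
        total += row["amount"]
        count += 1
        buffered.append(row)  # the O(n) cost, paid for one key only

    # Assumed comparator: amount, descending.
    top_5 = sorted(buffered, key=lambda r: r["amount"], reverse=True)[:5]
    return {
        "total": total,
        "avg": total / count if count else None,
        "top_5": top_5,
    }


rows = [{"amount": float(a)} for a in (3, 9, 1, 7, 5, 8, 2)]
result = run_mixed(rows)
print(result["total"], result["avg"], [r["amount"] for r in result["top_5"]])
```

All three keys share one pass over the rows. Dropping the `top_5` key would remove the `buffered` list and with it the only state that grows with the input.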
| 63 | +## Keeping expressions on the fast path |
| 64 | + |
| 65 | +### Use identical predicate text for shared filters |
| 66 | + |
| 67 | +Predicates are deduplicated by **string equality**. Identical text shares one evaluation per row; rephrased predicates evaluate separately: |
| 68 | + |
| 69 | +```sql |
| 70 | +-- Shared: one predicate evaluation per row |
| 71 | +$sum($filter($, function($v){$v.status = "completed"}).amount) |
| 72 | +$average($filter($, function($v){$v.status = "completed"}).amount) |
| 73 | + |
| 74 | +-- NOT shared: different parameter name → two evaluations per row |
| 75 | +$sum($filter($, function($v){$v.status = "completed"}).amount) |
| 76 | +$average($filter($, function($row){$row.status = "completed"}).amount) |
| 77 | +``` |
| 78 | + |
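A minimal Python sketch shows why the exact text matters when sharing is keyed on the predicate's source string. `PredicateCache` is a hypothetical name used for illustration; the real deduplication lives inside the planner and may differ in detail.

```python
class PredicateCache:
    """Hypothetical per-row cache keyed by the predicate's source text."""

    def __init__(self):
        self.evaluations = 0

    def evaluate_row(self, row, predicates):
        # predicates: list of (source_text, callable) pairs, one per filtered aggregate.
        results = {}
        for text, fn in predicates:
            if text not in results:  # string equality decides sharing
                results[text] = fn(row)
                self.evaluations += 1
        return results


def is_completed(row):
    return row["status"] == "completed"


text_v = 'function($v){$v.status = "completed"}'
text_row = 'function($row){$row.status = "completed"}'
rows = [{"status": "completed"}, {"status": "pending"}, {"status": "completed"}]

shared = PredicateCache()
for row in rows:
    shared.evaluate_row(row, [(text_v, is_completed), (text_v, is_completed)])

rephrased = PredicateCache()
for row in rows:
    rephrased.evaluate_row(row, [(text_v, is_completed), (text_row, is_completed)])

print(shared.evaluations, rephrased.evaluations)
```

Both predicate texts mean the same thing, but a cache keyed on text cannot know that, so the rephrased pair does twice the work per row.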
| 79 | +### Push sorting and filtering into SQL |
| 80 | + |
| 81 | +If you need only the top N results, sort and limit in SQL so the expression sees N rows instead of the whole table: |
| 82 | + |
| 83 | +```sql |
| 84 | +-- Instead of jsonata_query('$sort($, fn)[0..4]', data) over 100K rows: |
| 85 | +SELECT jsonata('...', data) FROM orders |
| 86 | +ORDER BY json_extract(data, '$.amount') DESC LIMIT 5; |
| 87 | +``` |
| 88 | + |
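The `ORDER BY ... LIMIT` half of this pattern can be tried without the extension. The sketch below uses Python's `sqlite3` module and SQLite's built-in `json_extract`, assuming a SQLite build with JSON support. The `orders` table mirrors the snippet above, the sample rows are invented, and the `jsonata(...)` call is left out because the extension is not loaded here.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (data TEXT)")
conn.executemany(
    "INSERT INTO orders (data) VALUES (?)",
    [
        (json.dumps({"id": i, "amount": amount}),)
        for i, amount in enumerate([30, 10, 50, 20, 40, 60, 5])
    ],
)

# SQLite sorts and limits; only five rows would reach a per-row expression.
top_5 = conn.execute(
    """
    SELECT json_extract(data, '$.amount')
    FROM orders
    ORDER BY json_extract(data, '$.amount') DESC
    LIMIT 5
    """
).fetchall()
amounts = [row[0] for row in top_5]
print(amounts)
```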
| 89 | +### Use json_each for simple array expansion |
| 90 | + |
| 91 | +`jsonata_each` evaluates a full JSONata expression per row. For simple array expansion, `json_each` is ~6x faster: |
| 92 | + |
| 93 | +```sql |
| 94 | +-- Simple expand: prefer json_each |
| 95 | +SELECT j.value FROM events, json_each(data, '$.items') j; |
| 96 | + |
| 97 | +-- Filter + transform: jsonata_each earns its cost |
| 98 | +SELECT * FROM events, jsonata_each('items[price > 100].{ |
| 99 | + "name": product, "total": price * qty |
| 100 | +}', data); |
| 101 | +``` |
| 102 | + |
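The simple-expand case runs on stock SQLite, so it can be tried in isolation. This sketch uses Python's `sqlite3` module, assuming a SQLite build with JSON support; the `events` table follows the snippet above and the sample document is invented.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (data TEXT)")
doc = {
    "items": [
        {"product": "a", "price": 150, "qty": 2},
        {"product": "b", "price": 50, "qty": 1},
    ]
}
conn.execute("INSERT INTO events (data) VALUES (?)", (json.dumps(doc),))

# One output row per array element; object elements come back as JSON text.
rows = conn.execute(
    "SELECT j.value FROM events, json_each(data, '$.items') AS j"
).fetchall()
items = [json.loads(row[0]) for row in rows]
print(items)
```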
| 103 | +### Use json_set for simple mutations |
| 104 | + |
| 105 | +`jsonata_set` re-parses the entire document. For simple path updates, `json_set` is 5-7x faster: |
| 106 | + |
| 107 | +```sql |
| 108 | +-- Simple: prefer json_set |
| 109 | +SELECT json_set(data, '$.status', 'done') FROM events; |
| 110 | + |
| 111 | +-- Nested creation: jsonata_set earns its cost (creates intermediate objects) |
| 112 | +SELECT jsonata_set(data, 'meta.source.type', '"import"') FROM events; |
| 113 | +``` |
| 114 | + |
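The simple path update can likewise be checked against stock SQLite. The sketch below assumes a SQLite build with JSON support and uses an invented sample row; `jsonata_set` is not exercised because the extension is not loaded.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (data TEXT)")
conn.execute(
    "INSERT INTO events (data) VALUES (?)",
    (json.dumps({"id": 1, "status": "open"}),),
)

# json_set returns a new JSON text; the stored row is unchanged unless you UPDATE.
updated_text = conn.execute(
    "SELECT json_set(data, '$.status', 'done') FROM events"
).fetchone()[0]
updated = json.loads(updated_text)
stored = json.loads(conn.execute("SELECT data FROM events").fetchone()[0])
print(updated, stored)
```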
| 115 | +### Watch for format functions |
| 116 | + |
| 117 | +`$base64`, `$urlencode`, `$htmlescape`, and other format functions bypass the GJSON fast path, requiring full JSONata evaluation (~8-18 µs/row vs ~0.25 µs/row for simple paths). In mixed expressions, only the key using the format function pays this cost. |
| 118 | + |
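As a rough sense of scale, the per-row figures above can be extrapolated to the 100K-row table size used elsewhere on this page. This is back-of-the-envelope arithmetic, not a benchmark: it assumes the cost scales linearly with row count and ignores fixed overhead.

```python
ROWS = 100_000

# Per-row costs in microseconds, taken from the paragraph above.
simple_us = 0.25
format_low_us, format_high_us = 8.0, 18.0

simple_ms = ROWS * simple_us / 1000
format_ms = (ROWS * format_low_us / 1000, ROWS * format_high_us / 1000)

print(f"simple paths:     ~{simple_ms:.0f} ms")
print(f"format functions: ~{format_ms[0]:.0f}-{format_ms[1]:.0f} ms")
```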
| 119 | +## Quick reference |
| 120 | + |
| 121 | +| Expression | Streams? | Notes | |
| 122 | +|---|---|---| |
| 123 | +| `$sum(amount)` | yes | Simple path accumulator | |
| 124 | +| `$sum(amount * 1.1)` | yes | Constant folded | |
| 125 | +| `$sum($filter($, fn).amount)` | yes | Predicate + conditional accumulator | |
| 126 | +| `$count($distinct(region))` | yes | O(unique) memory | |
| 127 | +| `{ "a": $sum(x), "b": $max(y) }` | yes | Parallel accumulators, batch extraction | |
| 128 | +| `$round($average(x), 2)` | yes | Finalizer on streaming average | |
| 129 | +| `$sum(x) - $count($)` | yes | Post-aggregate arithmetic | |
| 130 | +| `$sort(...)` | **no** | O(n) — needs all data | |
| 131 | +| `$reduce($, fn, init)` | **no** | O(n) — cross-row state | |
| 132 | +| `$map($, fn)` | **no** | O(n) — output is one element per row | |
| 133 | + |
| 134 | +See the [query planner](/docs/explanation/query-planner) for the full decomposition model and internal optimization details. |