From e4f24c106b44ffd64d131bc5b6706a862474089c Mon Sep 17 00:00:00 2001 From: rob Date: Fri, 15 May 2026 14:54:27 +0300 Subject: [PATCH 01/15] Add initial prompt --- .../sync-service/subquery-index-prompt.md | 152 ++++++++++++++++++ 1 file changed, 152 insertions(+) create mode 100644 packages/sync-service/subquery-index-prompt.md diff --git a/packages/sync-service/subquery-index-prompt.md b/packages/sync-service/subquery-index-prompt.md new file mode 100644 index 0000000000..2d7d9b6cdb --- /dev/null +++ b/packages/sync-service/subquery-index-prompt.md @@ -0,0 +1,152 @@ +# Shared Subquery Views with Logical-Time Reads + +## Summary + +Electric v1.6 introduced per-shape subquery indexing so consumers can keep +boolean subquery shapes live while dependency rows move across `WHERE` +boundaries. That solved correctness, but it made memory scale with the number of +shape consumers. Each consumer can keep its own materialized dependency view in +the `SubqueryIndex`, and while move-ins are buffered it can also hold both a +before and after view. + +This RFC proposes replacing per-consumer materialized subquery views with one +shared, versioned view per subquery. Consumers do not copy the view. +Instead, they read the shared view at a logical time. + + +## Background + +Related implementation work: + +- Commit: https://github.com/electric-sql/electric/commit/a04b25962cdb7ca86c4434585b6f74c758e1a31b +- PR: https://github.com/electric-sql/electric/pull/4051 +- issue: https://github.com/electric-sql/electric/issues/4279 +- Current index: `packages/sync-service/lib/electric/shapes/filter/indexes/subquery_index.ex` + +The v1.6 work lets shapes with boolean combinations around subqueries remain +live when dependency rows move. The key correctness problem is that consumers can +temporarily disagree about a subquery's membership while one consumer has +processed a move and another has not. + +The current implementation handles that by letting each shape consumer seed and +update exact per-shape membership rows. That keeps each consumer correct, but it +duplicates the same view across many shapes. During move-in buffering, the +consumer also carries before and after views so it can convert buffered +transactions and build the move-in query. + +## Problem + +The memory problem is broader than value-keyed routing rows in `SubqueryIndex`. +There are at least two duplicated memory pools: + +1. `SubqueryIndex` membership and routing rows, currently keyed by shape. +2. Consumer/materializer views, including before and after views during active + move-in buffering. + +Adding a reverse index such as `shape_handle -> all values` would make removal +faster, but it would increase memory. + +## Goals + +- Reduce memory footprint of subqueries significantly while remaining consitant and performant +- have near O(1) performance for: + - subquery addition and removal + - row processing by the where clause filter (so for afffected_shapes in the SubqueryIndex) +- Store one shared materialized view per subquery. +- Support exact membership reads at separate logical times. +- Preserve positive, negated, AND, OR, and NOT subquery correctness from v1.6. + +## Non-Goals + +- Do not change the client wire protocol. + +## Proposal + +### Components + +#### SubqueryIndex.MultiTimeView + +The MultiTimeView is an Materialized view of a subquery, queryable at multiple points in time. + +It's implimented as an ETS table (one ETS table per stack_id) + +subquery_id, value -> list(times) + +the meaning of the result: + doesn't exist - the value is not a member of the subquery for all logical times + [] - the value is a member of the subquery for all logical times + [:out, 9] - the value was out of the set before 9 and in the set from time 9 and above + [:out, 9, 11] - the value was out of the set before 9 and in the set from 9 to 10 and out the set again from time 11 and above + [:in, 9] - the value was in of the set before 9 and out the set from time 9 and above + [:in, 9, 11] - the value was in of the set before 9 and out the set from 9 to 10 and in the set again from time 11 and above + +note: the list(times) structure above has been chosen for memory efficientcy, but if you can think of a smaller structure let me know. for example if `[]` takes up more space than `true` then we should use `true` since this will be the most common case and we want to be memory efficient. + +so for subquery_id, value - [:in, 9, 11] + +member?(subquery_id, value, time: 8) = true +member?(subquery_id, value, time: 9) = false +member?(subquery_id, value, time: 10) = false +member?(subquery_id, value, time: 11) = true + + +rather than specifying a time you can also ask for membership across all times: + +member_of_union?(subquery_id, value) = true +member_of_intersection?(subquery_id, value) = false + +These are useful for the where clause filter which needs to keep the filter broad enough so that all consumers get all the changes they need while they may be at any of the logical times. + +For each subquery there will be a minimum logical time needed (the minimum in-flight logical time for the subquery) which the SubqueryProgressMonitor will set on the MultiTimeView. This allows the MultiTimeViewETS table to be compacted for memory and performace efficientcy. For any given list(times) it can be compacted by removing times from before the minimum in-flight logical time, making sure to update the :in/:out marker at the beginning of the list appropriately or removing it if there are no times left. + +Compacting should happen: +- when the list is read (e.g. when member? for the value is called) +- when the list is written to (e.g. when a value is moved in or out) +- when an async compaction routine is run (the design of this will need to be discussed) + +Removing a subquery should not involve a full ETS table scan as this will be too slow with lots of subqueries. If the ETS table is orderd we should be able to find the first item for the subqery, delete that, then find the next, and continue until the whole subquery is gone. That means it scales with the number of values (which is acceptable) rather than the number of subqueries. + + +#### SubqueryIndex + +This is a complete re-write of the existing SubqueryIndex that delegates most of it's resposibility to the the MultiTimeView. +It initialised the MultiTimeView. + +The Subquery index will need to know if a shapes use of a subquery is positive or negative (as it does now) and use: +- member_of_union? for positive +- member_of_intersection? for negative +This will ensure that the rows are included for all available times + +If the MultiTimeView has not been populated by the Materializer yet, the SubqueryIndex should just widen as much as possible. + +#### Materializer + +This is the existing Materializer. It will just need to be updated to: +- populate the MultiTimeView when the Materializer has initialised (it has a full materialized view). This should be at logical time 0. +- increment logical time for each `{:materializer_changes` message it sends to outer consumers, and include the new logical time in that message +- before the `{:materializer_changes` message is sent, the MultiTimeView should be updated with the changes giving the new logical time as the time of the change + +#### Logical Time + +Logical Time is monotonically incrementing counter per subquery. + +This needs to be a memory efficient data staructure that can be incremented indefinately. If it needs to wrap we need to make sure we use appropriate conparison functions when comparing times. Wrapping is an acceptable solution since there will only ever be so many moves in flight for any given subquery and memory would explode due to that before wrapping would cause comparison failures. + +#### SubqueryProgressMonitor + +Pin worked out from acks +-LRU algorithm + +#### Consumer EventProcessors + +These should be updated so that rather than holding views of the subquery, they just hold the logical time. so the before and after views should instead just be the before and after logical times. +- `convert_change` should have a function passed to it that access MultiTimeView.member? at the specified time +- the move-in query needs entire views at specific times and so should call MultiTimeView.get(time) and care should be made to not keep this in memory for too long, perhaps we should GC the consumer process afterwards, or perhaps the task process that runs the query should call MultiTimeView.get(time) so that the memory is freed when the process ends + + + +Please: +- read the codebase to check this design would work +- ask any clarifying questions about anything you are unsure of about my proposal +- give me your evaluation of the design +- make suggestions to improve the design for memory, performance or simplicity reasons From 181e1552a3b9881bc1cabdc0e1fc380a3b1425ca Mon Sep 17 00:00:00 2001 From: rob Date: Fri, 15 May 2026 15:33:08 +0100 Subject: [PATCH 02/15] Update subqueryIndex's role --- .../sync-service/subquery-index-prompt.md | 35 ++++++++++++++----- 1 file changed, 27 insertions(+), 8 deletions(-) diff --git a/packages/sync-service/subquery-index-prompt.md b/packages/sync-service/subquery-index-prompt.md index 2d7d9b6cdb..4f812161ed 100644 --- a/packages/sync-service/subquery-index-prompt.md +++ b/packages/sync-service/subquery-index-prompt.md @@ -89,11 +89,10 @@ member?(subquery_id, value, time: 9) = false member?(subquery_id, value, time: 10) = false member?(subquery_id, value, time: 11) = true - rather than specifying a time you can also ask for membership across all times: -member_of_union?(subquery_id, value) = true -member_of_intersection?(subquery_id, value) = false +member_at_some_time?(subquery_id, value) = true +member_at_all_times?(subquery_id, value) = false These are useful for the where clause filter which needs to keep the filter broad enough so that all consumers get all the changes they need while they may be at any of the logical times. @@ -110,14 +109,34 @@ Removing a subquery should not involve a full ETS table scan as this will be too #### SubqueryIndex This is a complete re-write of the existing SubqueryIndex that delegates most of it's resposibility to the the MultiTimeView. -It initialised the MultiTimeView. -The Subquery index will need to know if a shapes use of a subquery is positive or negative (as it does now) and use: -- member_of_union? for positive -- member_of_intersection? for negative +When a shape is added to the SubqueryIndex at a particular node in the filter tree, SubqueryIndex will need to keep something like: + +node_id, subquery_id, polarity -> child_node_id + +and add the shape to the child WhereCondition node + +When the SubqueryIndex is asked the affected_shapes for a given value, it will need to iterate through all {subquery_id, polarity} pairs it has and MapSet.union the affected shapes for each. + +For a given {subquery_id, :positive} pair the affected shaped will be: + +if MultiTimeView.member_at_some_time?(subquery_id, value) do + WhereCondition.affected_shapes(node_id) +else + MapSet.new() +end + +For a given {subquery_id, :negative} pair the affected shaped will be: + +if MultiTimeView.member_at_all_times?(subquery_id, value) do + MapSet.new() +else + WhereCondition.affected_shapes(node_id) +end + This will ensure that the rows are included for all available times -If the MultiTimeView has not been populated by the Materializer yet, the SubqueryIndex should just widen as much as possible. +If the MultiTimeView has not been populated by the Materializer yet, the SubqueryIndex should return WhereCondition.affected_shapes(node_id) #### Materializer From 0993d5148b31ed7b308ca761e9195e6c3567064a Mon Sep 17 00:00:00 2001 From: rob Date: Sat, 16 May 2026 10:11:58 +0100 Subject: [PATCH 03/15] Add problem --- .../sync-service/subquery-index-prompt.md | 28 +++++++++++++++---- 1 file changed, 23 insertions(+), 5 deletions(-) diff --git a/packages/sync-service/subquery-index-prompt.md b/packages/sync-service/subquery-index-prompt.md index 4f812161ed..bf8634d769 100644 --- a/packages/sync-service/subquery-index-prompt.md +++ b/packages/sync-service/subquery-index-prompt.md @@ -163,9 +163,27 @@ These should be updated so that rather than holding views of the subquery, they - the move-in query needs entire views at specific times and so should call MultiTimeView.get(time) and care should be made to not keep this in memory for too long, perhaps we should GC the consumer process afterwards, or perhaps the task process that runs the query should call MultiTimeView.get(time) so that the memory is freed when the process ends +# The Problem With The Above Design -Please: -- read the codebase to check this design would work -- ask any clarifying questions about anything you are unsure of about my proposal -- give me your evaluation of the design -- make suggestions to improve the design for memory, performance or simplicity reasons +Subqueries have different subquery_ids even if they only differ in a constant, so: +- SELECT id FROM users WHERE company_id=7 +- SELECT id FROM users WHERE company_id=8 +are two different subqueries. If the SubqueryIndex iterates through {subquery_id, :positive} pairs that may be thousands of pairs and be too slow since it's in the replication stream hot path. + +Instead we should, at each node, for each {field, polarity} pair, keep a reverse index for all the subqueries for that pair. So: +WHERE user_id IN (SELECT id FROM users WHERE company_id=7) +WHERE user_id IN (SELECT id FROM users WHERE company_id=8) + +would be in the same reverse index because they have the same field (user_id) and polarity (:positive). + +Perhaps the index could have the form: +subquery_cohort_id, value -> list({child_node_id, list(times)}) + +where: +subquery_cohort_id is a number (whatever is smallest in memory) and represents {node_id, field, polarity} but to save memory (as it's going to be repeated lots in the ETS table) we keep it small and also store: +subquery_cohort_id -> {node_id, field, polarity} and +{node_id, field, polarity} -> subquery_cohort_id + +and there's one child_node_id per subquery_id for the cohort + +Shape removal can be quick because we can keep track of subquery_id -> child_node_id and remove the shape from the child node, but removing nodes becomes slow since they're scattered throughout the ETS table. I suggest the cleaning up of nodes should be done asynchronously by a process that walks through the ETS table for nodes with no shapes and removes them. Race conditions can be avoided by doing an atomic conditional replace in the ETS table. From 1c71f8a1ff54200b72458e9048d022e324ba43f6 Mon Sep 17 00:00:00 2001 From: rob Date: Mon, 18 May 2026 10:34:53 +0100 Subject: [PATCH 04/15] Update problem --- .../sync-service/subquery-index-prompt.md | 49 ++++++++++++++----- 1 file changed, 37 insertions(+), 12 deletions(-) diff --git a/packages/sync-service/subquery-index-prompt.md b/packages/sync-service/subquery-index-prompt.md index bf8634d769..a12a711e3e 100644 --- a/packages/sync-service/subquery-index-prompt.md +++ b/packages/sync-service/subquery-index-prompt.md @@ -94,9 +94,9 @@ rather than specifying a time you can also ask for membership across all times: member_at_some_time?(subquery_id, value) = true member_at_all_times?(subquery_id, value) = false -These are useful for the where clause filter which needs to keep the filter broad enough so that all consumers get all the changes they need while they may be at any of the logical times. +These are useful for the where clause filter which needs to keep the filter broad enough so that all consumers get all the changes they need while they may be at any of the logical times. -For each subquery there will be a minimum logical time needed (the minimum in-flight logical time for the subquery) which the SubqueryProgressMonitor will set on the MultiTimeView. This allows the MultiTimeViewETS table to be compacted for memory and performace efficientcy. For any given list(times) it can be compacted by removing times from before the minimum in-flight logical time, making sure to update the :in/:out marker at the beginning of the list appropriately or removing it if there are no times left. +For each subquery there will be a minimum logical time needed (the minimum in-flight logical time for the subquery) which the SubqueryProgressMonitor will set on the MultiTimeView. This allows the MultiTimeViewETS table to be compacted for memory and performace efficientcy. For any given list(times) it can be compacted by removing times from before the minimum in-flight logical time, making sure to update the :in/:out marker at the beginning of the list appropriately or removing it if there are no times left. Compacting should happen: - when the list is read (e.g. when member? for the value is called) @@ -147,19 +147,20 @@ This is the existing Materializer. It will just need to be updated to: #### Logical Time -Logical Time is monotonically incrementing counter per subquery. +Logical Time is monotonically incrementing counter per subquery. This needs to be a memory efficient data staructure that can be incremented indefinately. If it needs to wrap we need to make sure we use appropriate conparison functions when comparing times. Wrapping is an acceptable solution since there will only ever be so many moves in flight for any given subquery and memory would explode due to that before wrapping would cause comparison failures. #### SubqueryProgressMonitor -Pin worked out from acks --LRU algorithm +This can be a separate process that the outer consumer calls to acknoledge that it's finished with a logical time for a subquery. The SubqueryProgressMonitor can then keep track of the minimum in-flight logical time for each subquery and set that on the MultiTimeView so that the MultiTimeView can compact it's ETS table for memory and performance efficientcy. + +The SubqueryProgressMonitor can be implimented as an ETS table ordered by subquery_id then logical time with an index to where an outer shape_id entry is so that when an outer consumer acks a logical time for a subquery, the outer shape can be found in the the ordered list and removed and replaced with the acked time. The minimum of theses times is the minimum in-flight logical time for the subquery. This should mean that updating a outer shape's logical time is O(1) and reading the minimum in-flight logical time is O(1). The SubqueryProgressMonitor should notify the MultiTimeView when the minimum in-flight logical time for a subquery changes so that the MultiTimeView can compact it's ETS table. #### Consumer EventProcessors -These should be updated so that rather than holding views of the subquery, they just hold the logical time. so the before and after views should instead just be the before and after logical times. -- `convert_change` should have a function passed to it that access MultiTimeView.member? at the specified time +These should be updated so that rather than holding views of the subquery, they just hold the logical time. so the before and after views should instead just be the before and after logical times. +- `convert_change` should have a function passed to it that access MultiTimeView.member? at the specified time - the move-in query needs entire views at specific times and so should call MultiTimeView.get(time) and care should be made to not keep this in memory for too long, perhaps we should GC the consumer process afterwards, or perhaps the task process that runs the query should call MultiTimeView.get(time) so that the memory is freed when the process ends @@ -177,13 +178,37 @@ WHERE user_id IN (SELECT id FROM users WHERE company_id=8) would be in the same reverse index because they have the same field (user_id) and polarity (:positive). Perhaps the index could have the form: -subquery_cohort_id, value -> list({child_node_id, list(times)}) +subquery_group_id, value -> list({child_node_id, list(times)}) where: -subquery_cohort_id is a number (whatever is smallest in memory) and represents {node_id, field, polarity} but to save memory (as it's going to be repeated lots in the ETS table) we keep it small and also store: -subquery_cohort_id -> {node_id, field, polarity} and -{node_id, field, polarity} -> subquery_cohort_id +subquery_group_id is a number (whatever is smallest in memory) and represents {node_id, field, polarity} but to save memory (as it's going to be repeated lots in the ETS table) we keep it small and also store: +subquery_group_id -> {node_id, field, polarity} and +{node_id, field, polarity} -> subquery_group_id -and there's one child_node_id per subquery_id for the cohort +and there's one child_node_id per subquery_id for the group Shape removal can be quick because we can keep track of subquery_id -> child_node_id and remove the shape from the child node, but removing nodes becomes slow since they're scattered throughout the ETS table. I suggest the cleaning up of nodes should be done asynchronously by a process that walks through the ETS table for nodes with no shapes and removes them. Race conditions can be avoided by doing an atomic conditional replace in the ETS table. + +Perhaps this will replace the MultiTimeView proposes above, we're still multi-time but we work at a group level rather than the subquery level. This would mean that the Materializer calls to add/remove values from the subquery must update all groups that the subquery is in. + + +### Definitions + +#### Subquery + +Each subquery gets it's own shape. If the select statement differs at all we count it as a different subquery, even if the difference is just in a constant. So: +- SELECT id FROM users WHERE company_id=7 +- SELECT id FROM users WHERE company_id=8 +are two different subqueries and each get their own subquery_id (the handle for the subquery shape) + +#### Subquery Group + +A subquery group is a set of subqueries that have the same field and polarity at a particular node in the filter tree. + +So for example the two subqueries in the two shapes below are differnt subqueries (because they differ by the company_id constant) but they are in the same subquery group because they have the same field (user_id) and polarity (:positive) at the same node in the filter tree: +WHERE user_id IN (SELECT id FROM users WHERE company_id=7) +WHERE user_id IN (SELECT id FROM users WHERE company_id=8) + +A subquery_id may appear in multiple subquery groups if it appears at multiple nodes in the filter tree. For the subquery is the same (has the same subquery_id) in the two shapes below but falls into different subquery groups because it appears at a differnt node in the filter tree: +WHERE user_id IN (SELECT id FROM users WHERE company_id=7) +WHERE project_id=4 AND user_id IN (SELECT id FROM users WHERE company_id=7) From 1919d2f98aeb9f932eab17c8d4c1dbca7b94ef7c Mon Sep 17 00:00:00 2001 From: rob Date: Mon, 18 May 2026 17:48:51 +0100 Subject: [PATCH 05/15] Add concurrency models an O(1) removal with groups --- .../sync-service/subquery-index-prompt.md | 111 ++++++++---------- 1 file changed, 50 insertions(+), 61 deletions(-) diff --git a/packages/sync-service/subquery-index-prompt.md b/packages/sync-service/subquery-index-prompt.md index a12a711e3e..a75af9a287 100644 --- a/packages/sync-service/subquery-index-prompt.md +++ b/packages/sync-service/subquery-index-prompt.md @@ -46,12 +46,33 @@ There are at least two duplicated memory pools: Adding a reverse index such as `shape_handle -> all values` would make removal faster, but it would increase memory. +## Definitions + +### Subquery + +Each subquery gets it's own shape. If the select statement differs at all we count it as a different subquery, even if the difference is just in a constant. So: +- SELECT id FROM users WHERE company_id=7 +- SELECT id FROM users WHERE company_id=8 +are two different subqueries and each get their own subquery_id (the handle for the subquery shape) + +### Subquery Group + +A subquery group is a set of subqueries that have the same field and polarity at a particular node in the filter tree. + +So for example the two subqueries in the two shapes below are differnt subqueries (because they differ by the company_id constant) but they are in the same subquery group because they have the same field (user_id) and polarity (:positive) at the same node in the filter tree: +WHERE user_id IN (SELECT id FROM users WHERE company_id=7) +WHERE user_id IN (SELECT id FROM users WHERE company_id=8) + +A subquery_id may appear in multiple subquery groups if it appears at multiple nodes in the filter tree. For the subquery is the same (has the same subquery_id) in the two shapes below but falls into different subquery groups because it appears at a differnt node in the filter tree: +WHERE user_id IN (SELECT id FROM users WHERE company_id=7) +WHERE project_id=4 AND user_id IN (SELECT id FROM users WHERE company_id=7) + ## Goals - Reduce memory footprint of subqueries significantly while remaining consitant and performant - have near O(1) performance for: - - subquery addition and removal - - row processing by the where clause filter (so for afffected_shapes in the SubqueryIndex) + - subquery addition and removal, including subquery group addition and removal where needed + - row processing by the where clause filter (so for afffected_shapes in the SubqueryIndex) even when there are thousands of subqueries in a subquery group - Store one shared materialized view per subquery. - Support exact membership reads at separate logical times. - Preserve positive, negated, AND, OR, and NOT subquery correctness from v1.6. @@ -108,20 +129,24 @@ Removing a subquery should not involve a full ETS table scan as this will be too #### SubqueryIndex -This is a complete re-write of the existing SubqueryIndex that delegates most of it's resposibility to the the MultiTimeView. +This is a complete re-write of the existing SubqueryIndex that delegates some of it's resposibility to the the MultiTimeView. + +Since there may be many subqueries in a subquery group, the SubqueryIndex should keep: -When a shape is added to the SubqueryIndex at a particular node in the filter tree, SubqueryIndex will need to keep something like: +subquery_group_id, value -> list(child_node_id) -node_id, subquery_id, polarity -> child_node_id +where: +subquery_group_id is a number (whatever is smallest in memory) and represents {node_id, field, polarity} but to save memory (as it's going to be repeated lots in the ETS table) we keep it small and also store: +subquery_group_id -> {node_id, field, polarity} and +{node_id, field, polarity} -> subquery_group_id -and add the shape to the child WhereCondition node +and there's one child_node_id per subquery_id for the group. child_node_id is smaller in memory so we keep that in places where it's going to be repeated lots in the ETS table (e.g. in `subquery_group_id, value -> list(child_node_id)`) -When the SubqueryIndex is asked the affected_shapes for a given value, it will need to iterate through all {subquery_id, polarity} pairs it has and MapSet.union the affected shapes for each. -For a given {subquery_id, :positive} pair the affected shaped will be: +So for `afffected_shapes` for a particular value, we'd look up the list of child_node_ids from the subquery_group_id, value pair then lookup the subquery_ids from the child_node_ids then for each subquery_id: if MultiTimeView.member_at_some_time?(subquery_id, value) do - WhereCondition.affected_shapes(node_id) + WhereCondition.affected_shapes(child_node_id) else MapSet.new() end @@ -131,19 +156,21 @@ For a given {subquery_id, :negative} pair the affected shaped will be: if MultiTimeView.member_at_all_times?(subquery_id, value) do MapSet.new() else - WhereCondition.affected_shapes(node_id) + WhereCondition.affected_shapes(child_node_id) end -This will ensure that the rows are included for all available times +This will ensure that the rows are included for all available times. -If the MultiTimeView has not been populated by the Materializer yet, the SubqueryIndex should return WhereCondition.affected_shapes(node_id) +If the MultiTimeView has not been marked ready by the Materializer yet, the SubqueryIndex should return WhereCondition.affected_shapes(child_node_id) + +Removal of a subquery must not scale with the total number of shapes or the number of subqueries in the group, but can scale with the number of values for the subquery. This can be achived by getting the getting the values for the subquery from the MultiTimeView (as discussed above in the MultiTimeView section when talking about subquery removal) - whilst iterating though those values we can also delete those values in the SubqueryIndex for all the groups that it's in. #### Materializer This is the existing Materializer. It will just need to be updated to: -- populate the MultiTimeView when the Materializer has initialised (it has a full materialized view). This should be at logical time 0. +- populate the SubqueryIndex when the Materializer has initialised (it has a full materialized view). This should be at logical time 0. - increment logical time for each `{:materializer_changes` message it sends to outer consumers, and include the new logical time in that message -- before the `{:materializer_changes` message is sent, the MultiTimeView should be updated with the changes giving the new logical time as the time of the change +- before the `{:materializer_changes` message is sent, the SubqueryIndex should be updated with the changes giving the new logical time as the time of the change #### Logical Time @@ -157,58 +184,20 @@ This can be a separate process that the outer consumer calls to acknoledge that The SubqueryProgressMonitor can be implimented as an ETS table ordered by subquery_id then logical time with an index to where an outer shape_id entry is so that when an outer consumer acks a logical time for a subquery, the outer shape can be found in the the ordered list and removed and replaced with the acked time. The minimum of theses times is the minimum in-flight logical time for the subquery. This should mean that updating a outer shape's logical time is O(1) and reading the minimum in-flight logical time is O(1). The SubqueryProgressMonitor should notify the MultiTimeView when the minimum in-flight logical time for a subquery changes so that the MultiTimeView can compact it's ETS table. +The SubqueryProgressMonitor must know about all shapes for a subquery (so for example if it's not seen an ack from one of them it needs to know the minimum time is still 0) or a subquery and have those shapes removed + #### Consumer EventProcessors These should be updated so that rather than holding views of the subquery, they just hold the logical time. so the before and after views should instead just be the before and after logical times. - `convert_change` should have a function passed to it that access MultiTimeView.member? at the specified time - the move-in query needs entire views at specific times and so should call MultiTimeView.get(time) and care should be made to not keep this in memory for too long, perhaps we should GC the consumer process afterwards, or perhaps the task process that runs the query should call MultiTimeView.get(time) so that the memory is freed when the process ends +### Concurrency model -# The Problem With The Above Design +Reads and writes to the MultiTimeView and SubqueryIndex ETS tables will mostly not be concurrent: +- add_shape and remove_shape will happen on the ShapeLogCollector process +- add_value and remove_value will happen while the ShapeLogCollector process is blocked so acts as if it were on the ShapeLogCollector process (ShapeLogCollector calls the Consumer which calls the Materializer which calls the SubqueryIndex to add/remove values, all synchronously) +- a Materializer seeding a subquery will happen when the Materializer is ready (so asyncronously to the ShapeLogCollector process) but will then call mark_ready on the SubqueryIndex which is an atomic process +- read of MultiTimeView may happen async by a consumer, but will be a read at a specific logical time so concurrentcy should not be an issue +- the mimimum in-flight logical time for a subquery will be updated by the SubqueryProgressMonitor async, but this will just update a single number, so concurrentcy should not be an issue -Subqueries have different subquery_ids even if they only differ in a constant, so: -- SELECT id FROM users WHERE company_id=7 -- SELECT id FROM users WHERE company_id=8 -are two different subqueries. If the SubqueryIndex iterates through {subquery_id, :positive} pairs that may be thousands of pairs and be too slow since it's in the replication stream hot path. - -Instead we should, at each node, for each {field, polarity} pair, keep a reverse index for all the subqueries for that pair. So: -WHERE user_id IN (SELECT id FROM users WHERE company_id=7) -WHERE user_id IN (SELECT id FROM users WHERE company_id=8) - -would be in the same reverse index because they have the same field (user_id) and polarity (:positive). - -Perhaps the index could have the form: -subquery_group_id, value -> list({child_node_id, list(times)}) - -where: -subquery_group_id is a number (whatever is smallest in memory) and represents {node_id, field, polarity} but to save memory (as it's going to be repeated lots in the ETS table) we keep it small and also store: -subquery_group_id -> {node_id, field, polarity} and -{node_id, field, polarity} -> subquery_group_id - -and there's one child_node_id per subquery_id for the group - -Shape removal can be quick because we can keep track of subquery_id -> child_node_id and remove the shape from the child node, but removing nodes becomes slow since they're scattered throughout the ETS table. I suggest the cleaning up of nodes should be done asynchronously by a process that walks through the ETS table for nodes with no shapes and removes them. Race conditions can be avoided by doing an atomic conditional replace in the ETS table. - -Perhaps this will replace the MultiTimeView proposes above, we're still multi-time but we work at a group level rather than the subquery level. This would mean that the Materializer calls to add/remove values from the subquery must update all groups that the subquery is in. - - -### Definitions - -#### Subquery - -Each subquery gets it's own shape. If the select statement differs at all we count it as a different subquery, even if the difference is just in a constant. So: -- SELECT id FROM users WHERE company_id=7 -- SELECT id FROM users WHERE company_id=8 -are two different subqueries and each get their own subquery_id (the handle for the subquery shape) - -#### Subquery Group - -A subquery group is a set of subqueries that have the same field and polarity at a particular node in the filter tree. - -So for example the two subqueries in the two shapes below are differnt subqueries (because they differ by the company_id constant) but they are in the same subquery group because they have the same field (user_id) and polarity (:positive) at the same node in the filter tree: -WHERE user_id IN (SELECT id FROM users WHERE company_id=7) -WHERE user_id IN (SELECT id FROM users WHERE company_id=8) - -A subquery_id may appear in multiple subquery groups if it appears at multiple nodes in the filter tree. For the subquery is the same (has the same subquery_id) in the two shapes below but falls into different subquery groups because it appears at a differnt node in the filter tree: -WHERE user_id IN (SELECT id FROM users WHERE company_id=7) -WHERE project_id=4 AND user_id IN (SELECT id FROM users WHERE company_id=7) From b3e0c40bfcca726b576875ba978597d3cbf4bf4f Mon Sep 17 00:00:00 2001 From: rob Date: Mon, 18 May 2026 19:16:57 +0100 Subject: [PATCH 06/15] Add RFC --- docs/rfcs/subquery-index.md | 606 ++++++++++++++++++++++++++++++++++++ 1 file changed, 606 insertions(+) create mode 100644 docs/rfcs/subquery-index.md diff --git a/docs/rfcs/subquery-index.md b/docs/rfcs/subquery-index.md new file mode 100644 index 0000000000..a9780d5cc5 --- /dev/null +++ b/docs/rfcs/subquery-index.md @@ -0,0 +1,606 @@ +# RFC: Shared Subquery Indexes with Logical-Time Views + +Status: Draft + +Scope: `packages/sync-service` + +## Summary + +Electric v1.6 introduced per-shape subquery indexing so boolean subquery +shapes stay live while dependency rows move across `WHERE` boundaries. That +solved correctness, but it made memory scale with the number of outer shape +consumers. + +This RFC proposes replacing per-consumer materialized subquery views with one +shared, versioned view per subquery. Consumers do not copy the subquery view. +Instead, each consumer keeps the logical time it is reading and asks the shared +view for membership at that time. + +The design keeps current move-in/move-out correctness for positive, negated, +`AND`, `OR`, and `NOT` subquery expressions, while reducing duplicate memory in +the filter index and in consumer event handlers. + +## Background + +The current implementation stores subquery state in two duplicated places: + +- `Electric.Shapes.Filter.Indexes.SubqueryIndex` stores per-shape routing and + exact membership rows. +- `Electric.Shapes.Consumer.EventHandler.Subqueries` stores per-consumer + `MapSet` views, including both before and after views while a move-in is + buffering. + +The key correctness problem is that consumers can temporarily disagree about a +subquery's membership. One consumer may have processed a dependency move while +another has not. The current implementation handles that by letting each outer +shape seed and update its own exact view. + +This is correct, but it duplicates the same dependency view across many +consumers. + +## Problem + +For a popular subquery, memory currently scales roughly with: + +```text +number_of_outer_consumers * number_of_values_in_subquery +``` + +There are two major pools of duplicated memory: + +1. Subquery routing and exact membership rows keyed by outer shape. +2. Consumer-held dependency views, including before and after views during + active move-in buffering. + +A reverse index such as `shape_handle -> all values` would make removal faster, +but it would add another per-consumer value list and worsen the memory problem. + +## Goals + +- Store one shared materialized view per subquery. +- Allow consumers to read exact membership at separate logical times. +- Keep routing conservative enough while consumers are at different logical + times. +- Keep subquery addition and removal proportional to the subquery and group + data being changed, not to the total number of shapes in the stack. +- Preserve correctness for positive subqueries, negated subqueries, `AND`, + `OR`, and `NOT`. +- Avoid changing the client wire protocol. + +## Non-Goals + +- This RFC does not change Electric's HTTP protocol. +- This RFC does not change the semantics of supported subqueries. +- This RFC does not attempt to make negated-subquery routing better than + `O(number_of_affected_shapes)`. If a value is absent from a large negated + subquery group, all of those shapes are genuinely affected. + +## Definitions + +### Subquery + +A subquery is represented by its dependency shape. The `subquery_id` is the +handle of that dependency shape. + +Different `SELECT` statements are different subqueries, even if they differ +only by constants. For example: + +```sql +SELECT id FROM users WHERE company_id = 7 +SELECT id FROM users WHERE company_id = 8 +``` + +These are two different subqueries and get two different `subquery_id` values. + +### Subquery Group + +A subquery group is a set of subquery occurrences with the same: + +- filter tree node +- field key +- polarity + +For example, these two outer shapes use different subqueries, but the same +subquery group if the subquery occurrence appears at the same filter node: + +```sql +WHERE user_id IN (SELECT id FROM users WHERE company_id = 7) +WHERE user_id IN (SELECT id FROM users WHERE company_id = 8) +``` + +The subqueries differ by `company_id`, but the group is the same because the +field key is `user_id` and the polarity is positive. + +A single `subquery_id` can appear in multiple groups if it appears at multiple +nodes in outer filter trees. + +### Child Node + +A `child_node_id` is created per `{subquery_group_id, subquery_id}` pair. + +The child node owns a child `WhereCondition` that contains all outer shapes +using that subquery in that group. This means many outer shapes can share one +child node. + +### Logical Time + +Logical time is a monotonically increasing integer per subquery. + +Time `0` represents the materializer's initial view. Each committed dependency +move that changes subquery membership increments the logical time and records +the move at the new time. + +Logical time should use normal BEAM integers. Wrapping is unnecessary and would +make comparison and compaction harder to reason about. + +### Processed-Up-To Time + +Consumers call: + +```elixir +SubqueryProgressMonitor.notify_processed_up_to(time, subquery_id) +``` + +after they no longer need to read the subquery at `time` or earlier. + +For a move from logical time `a` to logical time `b`, once the consumer has +finished processing that move and is steady at `b`, it notifies that it has +processed up to `a`. + +The minimum required time for compaction is therefore: + +```text +min_processed_up_to_for_live_consumers + 1 +``` + +Consumers registered at current time `t` start with `processed_up_to = t - 1`, +because they need to read time `t`. + +## Proposal + +### MultiTimeView + +`SubqueryIndex.MultiTimeView` stores one shared materialized view per subquery. + +It is an ETS-backed structure, with one ETS table per stack. The main logical +key is: + +```text +{subquery_id, value} -> membership_history +``` + +Absence means the value is not a member at any retained logical time. + +The common case is a value that is always present for the retained window. That +should be represented compactly, for example: + +```elixir +true +``` + +Values that have moved use a small transition history. The exact structure +should be benchmarked before implementation. A simple starting point is: + +```elixir +{:out, [9]} +{:out, [9, 11]} +{:in, [9]} +{:in, [9, 11]} +``` + +The first atom is the membership state before the first transition. Each time in +the list toggles membership from that time onwards. + +Examples: + +```elixir +# Out before 9, in from 9 onwards. +{:out, [9]} + +# Out before 9, in from 9 to 10, out from 11 onwards. +{:out, [9, 11]} + +# In before 9, out from 9 to 10, in from 11 onwards. +{:in, [9, 11]} +``` + +The API should support: + +```elixir +member?(subquery_id, value, time) +member_at_some_time?(subquery_id, value) +member_at_all_times?(subquery_id, value) +values(subquery_id) +values(subquery_id, time) +mark_ready(subquery_id) +ready?(subquery_id) +set_min_required_time(subquery_id, time) +remove_subquery(subquery_id) +``` + +`member_at_some_time?/2` and `member_at_all_times?/2` operate over the retained +time window for that subquery. + +#### Compaction + +The `SubqueryProgressMonitor` provides the minimum required logical time for +each subquery. `MultiTimeView` can compact entries by removing transitions +before that time. + +Compaction must preserve membership at all retained times. For example: + +```elixir +{:out, [9, 11]} +``` + +If `min_required_time = 10`, membership at time `10` is `true`, and the compacted +history becomes: + +```elixir +{:in, [11]} +``` + +If `min_required_time = 12`, the value is out for the whole retained window, so +the row can be deleted. + +Compaction should run: + +- when a value is read +- when a value is written +- in a periodic asynchronous compaction pass +- when the progress monitor advances the minimum required time + +`remove_subquery/1` must not scan the whole ETS table. The table should be an +ordered set with keys ordered by `subquery_id`, so removal can iterate the +contiguous key range for one subquery. + +### SubqueryIndex + +`SubqueryIndex` becomes responsible for topology and routing, while +`MultiTimeView` owns exact membership. + +The index stores compact integer identifiers for repeated values: + +```text +{node_id, field_key, polarity} -> subquery_group_id +subquery_group_id -> {node_id, field_key, polarity} + +{subquery_group_id, subquery_id} -> child_node_id +child_node_id -> {subquery_group_id, subquery_id, next_condition_id} +subquery_id -> [{subquery_group_id, child_node_id}] +``` + +Using small integer ids avoids repeating large tuples, field keys, and shape +handles in per-value ETS rows. + +For positive groups, the routing index stores values that can be members at +some retained logical time: + +```text +{positive_value, subquery_group_id, value} -> [child_node_id] +``` + +For negative groups, the index stores all negative children for the group: + +```text +{negative_children, subquery_group_id} -> [child_node_id] +``` + +The negative path cannot avoid considering all affected children when a value +is absent from the subquery views. That is acceptable because those children are +affected. + +#### Affected Shapes + +For a root-table change, `SubqueryIndex.affected_shapes/4` evaluates the +left-hand side value for the subquery node. + +If evaluation fails, it falls back to all children in the group. + +If a subquery is not ready, the child node is routed conservatively. + +For a positive group: + +```elixir +for child_node_id <- positive_children_for_value(group_id, value), + subquery_id = subquery_id_for_child(child_node_id), + MultiTimeView.member_at_some_time?(subquery_id, value) do + WhereCondition.affected_shapes(child_node_id, record) +end +``` + +For a negative group: + +```elixir +for child_node_id <- all_negative_children(group_id), + subquery_id = subquery_id_for_child(child_node_id), + not MultiTimeView.member_at_all_times?(subquery_id, value) do + WhereCondition.affected_shapes(child_node_id, record) +end +``` + +This keeps routing broad enough for all consumers reading any retained logical +time. + +#### Shape Addition + +Adding an outer shape to an existing `{group, subquery_id}` child is near O(1): +the shape is added to the child `WhereCondition`. + +Creating the first child for `{group, subquery_id}` requires indexing current +values for that subquery in the group. That is O(number of values in the +subquery), which is acceptable and unavoidable unless the child stays in a +fallback mode until asynchronous seeding completes. + +#### Shape Removal + +Removing an outer shape removes it from the child `WhereCondition`. + +If the child becomes empty, remove the `{group, subquery_id}` child and update +the group value indexes by iterating the values for `subquery_id` from +`MultiTimeView`. This is proportional to the number of values in that subquery, +not to the total number of shapes or subqueries. + +### Materializer + +The materializer continues to track dependency shape membership from the +dependency log. + +It changes in three ways: + +1. On initial load, it populates `MultiTimeView` for the subquery at logical + time `0`, then marks the subquery ready. +2. On each committed batch that produces net move events, it increments the + subquery logical time. +3. Before sending `{:materializer_changes, ...}` to subscribers, it writes the + move events to `MultiTimeView` at the new logical time. + +The subscriber payload should include both the old and new logical time: + +```elixir +%{ + move_in: [{value, original_string}], + move_out: [{value, original_string}], + txids: [txid], + from_time: old_time, + to_time: new_time +} +``` + +If a committed batch has no net membership move, logical time does not need to +advance. + +The existing `Materializer.LinkValues` ETS cache should be removed or replaced +by `MultiTimeView` rather than kept as a second full shared copy. + +### Consumer Event Handlers + +Consumers store logical times instead of materialized subquery views. + +The steady handler keeps: + +```elixir +%{subquery_ref => logical_time} +``` + +`Shape.convert_change/3` and DNF metadata projection need to accept a membership +callback instead of requiring a concrete `MapSet` view: + +```elixir +fn subquery_ref, value, time -> + MultiTimeView.member?(subquery_id, value, time) +end +``` + +In steady state, old and new records use the same logical time. + +During a buffered move-in, `ActiveMove` stores: + +```elixir +times_before_move +times_after_move +``` + +instead of: + +```elixir +views_before_move +views_after_move +``` + +Buffered transactions before the splice boundary are converted using +`times_before_move`. Buffered transactions after the splice boundary are +converted using `times_after_move`. + +After the move is spliced and the consumer becomes steady at `to_time`, it +calls: + +```elixir +SubqueryProgressMonitor.notify_processed_up_to(from_time, subquery_id) +``` + +#### Move-In Queries + +Move-in queries currently build SQL from whole before and after views. The new +implementation should avoid retaining large views in the consumer process. + +Preferred approach: + +- Build the triggering dependency candidate predicate from the move delta + values when possible. +- Read full view values at a specific time only for positions that require + exclusion logic. +- If full views are required, materialize them inside the task process that + runs the query so the memory is released when the task exits. + +This is important because replacing long-lived consumer views with +short-lived task views is where much of the memory win comes from. + +### SubqueryProgressMonitor + +`SubqueryProgressMonitor` tracks the earliest logical time still needed by live +outer consumers. + +Consumers register for each subquery they read. Registration at current time +`t` inserts: + +```text +processed_up_to = t - 1 +``` + +When a consumer finishes a move from `from_time` to `to_time`, it calls: + +```elixir +SubqueryProgressMonitor.notify_processed_up_to(from_time, subquery_id) +``` + +The monitor maintains two ETS indexes: + +```text +{subquery_id, consumer_shape_handle} -> processed_up_to +{subquery_id, processed_up_to, consumer_shape_handle} -> true +``` + +The first index makes updates O(1). The second index makes the minimum +processed time for a subquery cheap to read. + +When the minimum changes, the monitor notifies `MultiTimeView`: + +```elixir +MultiTimeView.set_min_required_time(subquery_id, min_processed_up_to + 1) +``` + +When an outer shape is removed, the monitor removes that consumer from every +subquery it was registered for and recomputes the affected minima. + +### Concurrency Model + +Most writes are already serialized by existing processes: + +- Shape addition and removal happen through the ShapeLogCollector path. +- Dependency changes are applied by materializers. +- Outer consumers process events synchronously when ShapeLogCollector publishes + to them. + +`MultiTimeView` writes happen in the materializer before it sends +`materializer_changes` to subscribers. + +`SubqueryIndex` topology changes happen when shapes are added or removed from +the filter. + +`SubqueryIndex` reads happen from ShapeLogCollector while routing replication +changes. + +Consumer reads from `MultiTimeView` can happen concurrently with writes. This +is safe because membership is always read at an explicit logical time, and ETS +updates replace complete membership-history values atomically. + +The ready flag is important. Until a subquery has been seeded into +`MultiTimeView` and any group routing rows have been created, routing must be +conservative. + +## Expected Benefits + +- One retained membership view per subquery instead of one per outer shape. +- Consumer processes retain logical times instead of large `MapSet` views. +- Move-in buffering retains before/after logical times instead of before/after + `MapSet` views. +- Subquery removal is proportional to the removed subquery's values and group + entries, not to total stack size. +- Positive routing remains value-keyed and efficient. + +## Risks + +### Off-by-One Compaction + +The biggest correctness risk is compacting away a logical time that some +consumer still needs. + +The invariant is: + +```text +MultiTimeView may compact only times < min_required_time +min_required_time = min(processed_up_to_by_live_consumer) + 1 +``` + +Tests should cover move-in, move-out, repeated toggles, consumer registration, +consumer removal, and compaction across all of those cases. + +### Negated Routing Cost + +For negated subquery groups, a value absent from all subqueries affects all +children in the group. This can be large, but it is proportional to the number +of affected shapes. + +The implementation should avoid extra memory-heavy complement indexes unless +there is evidence they are necessary. + +### Move-In Query Memory + +If move-in query generation materializes full before and after views in the +consumer process, the design will keep a major source of memory duplication. + +Full view materialization should be avoided where possible and isolated to +short-lived task processes where not possible. + +### Fallback Windows + +Unready subqueries and not-yet-seeded group routing must route conservatively. +This may temporarily over-route, but must not under-route. + +## Testing Plan + +Add focused unit tests for `MultiTimeView`: + +- membership at exact logical times +- `member_at_some_time?/2` +- `member_at_all_times?/2` +- move-in and move-out transitions +- repeated toggles +- compaction after `set_min_required_time/2` +- subquery removal by ordered key range + +Add focused unit tests for `SubqueryProgressMonitor`: + +- registration at current time +- `notify_processed_up_to/2` +- minimum required time updates +- consumer removal +- multiple consumers at different times + +Add `SubqueryIndex` tests: + +- positive group routing +- negative group routing +- shared child nodes per `{group, subquery_id}` +- conservative routing for unready subqueries +- child removal after the last outer shape is removed +- no full-table scan on subquery removal + +Update consumer event-handler tests: + +- steady conversion using logical times +- buffered move-in using before and after logical times +- queued moves across multiple logical times +- progress notifications after splice +- negated move-in and move-out behavior + +Keep or extend integration tests for: + +- dependency move-in +- dependency move-out +- nested subqueries +- subqueries combined with non-subquery predicates +- rows moving between two dependency values in one transaction + +## Open Questions + +- Should `MultiTimeView` expose `values(subquery_id, time)` as a materialized + `MapSet`, a stream, or both? +- Should `SubqueryProgressMonitor.notify_processed_up_to/2` infer the consumer + identity from process state, or should callers pass the outer shape handle + explicitly? +- Should first-time child creation seed synchronously, or should it use fallback + routing while an asynchronous seeding task populates group value rows? +- Which transition-history representation is smallest in practice for ETS: + list, tuple, or binary? From 3ceac97ac3defee1c1f8a3dcaa9f590efa0e7535 Mon Sep 17 00:00:00 2001 From: rob Date: Mon, 18 May 2026 19:35:21 +0100 Subject: [PATCH 07/15] Human scan of RFC --- docs/rfcs/subquery-index.md | 9 +-------- 1 file changed, 1 insertion(+), 8 deletions(-) diff --git a/docs/rfcs/subquery-index.md b/docs/rfcs/subquery-index.md index a9780d5cc5..236bc220d4 100644 --- a/docs/rfcs/subquery-index.md +++ b/docs/rfcs/subquery-index.md @@ -248,7 +248,6 @@ Compaction should run: - when a value is read - when a value is written - in a periodic asynchronous compaction pass -- when the progress monitor advances the minimum required time `remove_subquery/1` must not scan the whole ETS table. The table should be an ordered set with keys ordered by `subquery_id`, so removal can iterate the @@ -424,13 +423,7 @@ SubqueryProgressMonitor.notify_processed_up_to(from_time, subquery_id) Move-in queries currently build SQL from whole before and after views. The new implementation should avoid retaining large views in the consumer process. -Preferred approach: - -- Build the triggering dependency candidate predicate from the move delta - values when possible. -- Read full view values at a specific time only for positions that require - exclusion logic. -- If full views are required, materialize them inside the task process that +Preferred approach - materialize them inside the task process that runs the query so the memory is released when the task exits. This is important because replacing long-lived consumer views with From 537286439c93b734f64479b736f6a7e2cffb1179 Mon Sep 17 00:00:00 2001 From: rob Date: Mon, 18 May 2026 19:45:03 +0100 Subject: [PATCH 08/15] Updates to RFC --- docs/rfcs/subquery-index.md | 103 ++++++++++++++++++++++++------------ 1 file changed, 68 insertions(+), 35 deletions(-) diff --git a/docs/rfcs/subquery-index.md b/docs/rfcs/subquery-index.md index 236bc220d4..1f98ad68ab 100644 --- a/docs/rfcs/subquery-index.md +++ b/docs/rfcs/subquery-index.md @@ -135,6 +135,9 @@ make comparison and compaction harder to reason about. ### Processed-Up-To Time +The progress monitor's public API is expressed as a processed-up-to +notification: + Consumers call: ```elixir @@ -142,19 +145,26 @@ SubqueryProgressMonitor.notify_processed_up_to(time, subquery_id) ``` after they no longer need to read the subquery at `time` or earlier. +The public function can include `self()` in the message to the monitor, so the +consumer identity is resolved from registration rather than passed at every +call site. For a move from logical time `a` to logical time `b`, once the consumer has finished processing that move and is steady at `b`, it notifies that it has processed up to `a`. -The minimum required time for compaction is therefore: +Internally, the progress monitor should track `required_time`: the earliest +logical time a live consumer may still read. `notify_processed_up_to(a, +subquery_id)` advances that consumer's `required_time` to `a + 1`. + +The minimum required time for compaction is: ```text -min_processed_up_to_for_live_consumers + 1 +min(required_time_for_live_consumers) ``` -Consumers registered at current time `t` start with `processed_up_to = t - 1`, -because they need to read time `t`. +Consumers register with the logical time they are starting from. If a consumer +starts from current logical time `t`, its initial `required_time` is `t`. ## Proposal @@ -172,36 +182,36 @@ key is: Absence means the value is not a member at any retained logical time. The common case is a value that is always present for the retained window. That -should be represented compactly, for example: +is represented as an empty history: ```elixir -true +[] ``` Values that have moved use a small transition history. The exact structure should be benchmarked before implementation. A simple starting point is: ```elixir -{:out, [9]} -{:out, [9, 11]} -{:in, [9]} -{:in, [9, 11]} +[:out, 9] +[:out, 9, 11] +[:in, 9] +[:in, 9, 11] ``` -The first atom is the membership state before the first transition. Each time in -the list toggles membership from that time onwards. +The first list item is the membership state before the first transition. Each +time after it toggles membership from that time onwards. Examples: ```elixir # Out before 9, in from 9 onwards. -{:out, [9]} +[:out, 9] # Out before 9, in from 9 to 10, out from 11 onwards. -{:out, [9, 11]} +[:out, 9, 11] # In before 9, out from 9 to 10, in from 11 onwards. -{:in, [9, 11]} +[:in, 9, 11] ``` The API should support: @@ -224,20 +234,20 @@ time window for that subquery. #### Compaction The `SubqueryProgressMonitor` provides the minimum required logical time for -each subquery. `MultiTimeView` can compact entries by removing transitions -before that time. +each subquery. `MultiTimeView` can compact entries by evaluating membership at +that time and removing transitions at or before it. Compaction must preserve membership at all retained times. For example: ```elixir -{:out, [9, 11]} +[:out, 9, 11] ``` If `min_required_time = 10`, membership at time `10` is `true`, and the compacted history becomes: ```elixir -{:in, [11]} +[:in, 11] ``` If `min_required_time = 12`, the value is out for the whole retained window, so @@ -423,7 +433,13 @@ SubqueryProgressMonitor.notify_processed_up_to(from_time, subquery_id) Move-in queries currently build SQL from whole before and after views. The new implementation should avoid retaining large views in the consumer process. -Preferred approach - materialize them inside the task process that +Preferred approach: + +- Build the triggering dependency candidate predicate from the move delta + values when possible. +- Read full view values at a specific time only for positions that require + exclusion logic. +- If full views are required, materialize them inside the task process that runs the query so the memory is released when the task exits. This is important because replacing long-lived consumer views with @@ -434,33 +450,53 @@ short-lived task views is where much of the memory win comes from. `SubqueryProgressMonitor` tracks the earliest logical time still needed by live outer consumers. -Consumers register for each subquery they read. Registration at current time -`t` inserts: +Registration must happen atomically with choosing the consumer's starting +logical time. The materializer should provide a serialized registration call, +for example: -```text -processed_up_to = t - 1 +```elixir +{:ok, current_time} = + Materializer.register_subquery_consumer( + subquery_id, + outer_shape_handle, + self() + ) ``` +The materializer handles the call in its GenServer, reads its current logical +time, registers the outer consumer with the progress monitor, and returns that +time to the consumer. This prevents a race where the materializer advances and +compacts a time before the new consumer is registered as needing it. + +The monitor stores the returned time as the consumer's internal +`required_time`. + When a consumer finishes a move from `from_time` to `to_time`, it calls: ```elixir SubqueryProgressMonitor.notify_processed_up_to(from_time, subquery_id) ``` +The monitor then advances that consumer's `required_time` to `from_time + 1`. + The monitor maintains two ETS indexes: ```text -{subquery_id, consumer_shape_handle} -> processed_up_to -{subquery_id, processed_up_to, consumer_shape_handle} -> true +{subquery_id, consumer_shape_handle} -> required_time +{subquery_id, required_time, consumer_shape_handle} -> true ``` The first index makes updates O(1). The second index makes the minimum -processed time for a subquery cheap to read. +required time for a subquery cheap to read. + +If `notify_processed_up_to/2` infers consumer identity from the caller process, +the monitor also needs a lookup from registered consumer pid to +`consumer_shape_handle`. When the minimum changes, the monitor notifies `MultiTimeView`: ```elixir -MultiTimeView.set_min_required_time(subquery_id, min_processed_up_to + 1) +MultiTimeView.set_min_required_time(subquery_id, min_required_time) ``` When an outer shape is removed, the monitor removes that consumer from every @@ -512,8 +548,10 @@ consumer still needs. The invariant is: ```text -MultiTimeView may compact only times < min_required_time -min_required_time = min(processed_up_to_by_live_consumer) + 1 +MultiTimeView may drop transitions at or before min_required_time +after rewriting the entry to preserve membership at min_required_time. + +min_required_time = min(required_time_by_live_consumer) ``` Tests should cover move-in, move-out, repeated toggles, consumer registration, @@ -590,10 +628,5 @@ Keep or extend integration tests for: - Should `MultiTimeView` expose `values(subquery_id, time)` as a materialized `MapSet`, a stream, or both? -- Should `SubqueryProgressMonitor.notify_processed_up_to/2` infer the consumer - identity from process state, or should callers pass the outer shape handle - explicitly? - Should first-time child creation seed synchronously, or should it use fallback routing while an asynchronous seeding task populates group value rows? -- Which transition-history representation is smallest in practice for ETS: - list, tuple, or binary? From c35af7d3e35dae91b9b9aa894bf1f49fe39b0459 Mon Sep 17 00:00:00 2001 From: rob Date: Mon, 18 May 2026 19:47:34 +0100 Subject: [PATCH 09/15] Answer question --- docs/rfcs/subquery-index.md | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/docs/rfcs/subquery-index.md b/docs/rfcs/subquery-index.md index 1f98ad68ab..74b3c6db6d 100644 --- a/docs/rfcs/subquery-index.md +++ b/docs/rfcs/subquery-index.md @@ -338,8 +338,9 @@ the shape is added to the child `WhereCondition`. Creating the first child for `{group, subquery_id}` requires indexing current values for that subquery in the group. That is O(number of values in the -subquery), which is acceptable and unavoidable unless the child stays in a -fallback mode until asynchronous seeding completes. +subquery), which is acceptable and unavoidable. First-time child creation should +seed synchronously so the child is fully routable before the shape is considered +indexed. #### Shape Removal @@ -628,5 +629,3 @@ Keep or extend integration tests for: - Should `MultiTimeView` expose `values(subquery_id, time)` as a materialized `MapSet`, a stream, or both? -- Should first-time child creation seed synchronously, or should it use fallback - routing while an asynchronous seeding task populates group value rows? From 5fe4769b1f95f8c6787e9991c6d5361cf515f6c3 Mon Sep 17 00:00:00 2001 From: rob Date: Mon, 18 May 2026 20:41:32 +0100 Subject: [PATCH 10/15] Follow RFC format --- docs/rfcs/subquery-index.md | 870 +++++++++++++++++++++--------------- 1 file changed, 498 insertions(+), 372 deletions(-) diff --git a/docs/rfcs/subquery-index.md b/docs/rfcs/subquery-index.md index 74b3c6db6d..84feeff5a3 100644 --- a/docs/rfcs/subquery-index.md +++ b/docs/rfcs/subquery-index.md @@ -1,42 +1,65 @@ -# RFC: Shared Subquery Indexes with Logical-Time Views - -Status: Draft - -Scope: `packages/sync-service` +--- +title: Shared Subquery Indexes with Logical-Time Views +version: "0.1" +status: draft +owner: robacourt +contributors: [] +created: 2026-05-18 +last_updated: 2026-05-18 +prd: "N/A - based on https://github.com/electric-sql/electric/issues/4279" +prd_version: "N/A" +--- + +# Shared Subquery Indexes with Logical-Time Views ## Summary -Electric v1.6 introduced per-shape subquery indexing so boolean subquery -shapes stay live while dependency rows move across `WHERE` boundaries. That -solved correctness, but it made memory scale with the number of outer shape -consumers. - -This RFC proposes replacing per-consumer materialized subquery views with one -shared, versioned view per subquery. Consumers do not copy the subquery view. -Instead, each consumer keeps the logical time it is reading and asks the shared -view for membership at that time. - -The design keeps current move-in/move-out correctness for positive, negated, -`AND`, `OR`, and `NOT` subquery expressions, while reducing duplicate memory in -the filter index and in consumer event handlers. +Electric v1.6 added per-shape subquery indexing so shapes with boolean subquery +filters can stay live while dependency rows move across `WHERE` boundaries. +That solved correctness, but it stores the same dependency view repeatedly in +the filter index and in consumer event handlers. This RFC proposes one shared, +logical-time view per subquery. Consumers register the subqueries they read, +keep only the logical time they are reading, and call +`SubqueryProgressMonitor.notify_processed_up_to(time, subquery_id)` when they no +longer need older times. The filter index routes conservatively across retained +times and verifies exact membership by asking the shared view at the consumer's +logical time. ## Background -The current implementation stores subquery state in two duplicated places: - -- `Electric.Shapes.Filter.Indexes.SubqueryIndex` stores per-shape routing and - exact membership rows. -- `Electric.Shapes.Consumer.EventHandler.Subqueries` stores per-consumer - `MapSet` views, including both before and after views while a move-in is - buffering. - -The key correctness problem is that consumers can temporarily disagree about a -subquery's membership. One consumer may have processed a dependency move while -another has not. The current implementation handles that by letting each outer -shape seed and update its own exact view. - -This is correct, but it duplicates the same dependency view across many -consumers. +Issue: https://github.com/electric-sql/electric/issues/4279 + +Related work: + +- PR #4051 introduced the v1.6 subquery move correctness work: + https://github.com/electric-sql/electric/pull/4051 +- PR #4280 proposed a narrower SubqueryIndex memory design using shared base + views with sparse XOR exceptions: + https://github.com/electric-sql/electric/pull/4280 +- Current `SubqueryIndex`: + `packages/sync-service/lib/electric/shapes/filter/indexes/subquery_index.ex` +- Current consumer view setup: + `packages/sync-service/lib/electric/shapes/consumer/event_handler_builder.ex` +- Current move buffering: + `packages/sync-service/lib/electric/shapes/consumer/subqueries/active_move.ex` +- Current SQL move-in query construction: + `packages/sync-service/lib/electric/shapes/querying.ex` + +The v1.6 subquery work allowed shapes with boolean combinations around +subqueries to stay live when dependency rows move. Without that, Electric would +invalidate the outer shape and require a full resync. + +The current implementation achieves correctness by letting each consumer own a +local dependency view. `EventHandlerBuilder` reads each dependency +materializer's values into a per-consumer `MapSet`. During an active move, +`ActiveMove` stores `views_before_move` and `views_after_move`. Separately, +`SubqueryIndex` stores per-shape routing rows and exact membership rows keyed +by `shape_handle`, `subquery_ref`, and value. + +That model is correct because consumers can temporarily disagree about the same +subquery. One consumer may have processed a dependency move while another has +not. The current implementation represents that by copying the dependency view +per consumer. ## Problem @@ -46,41 +69,115 @@ For a popular subquery, memory currently scales roughly with: number_of_outer_consumers * number_of_values_in_subquery ``` -There are two major pools of duplicated memory: +There are two large duplicated pools: + +- `SubqueryIndex` stores value membership and routing rows per outer shape. +- Consumer event handlers store dependency views per outer shape, and store + both before and after views while a move-in is active. + +Shape removal is also expensive because current value-keyed membership rows do +not have a cheap reverse path from a shape to all of the rows it owns. Adding a +reverse index such as `shape_handle -> all values` would improve removal, but +it would add another copy of the full per-shape dependency view. -1. Subquery routing and exact membership rows keyed by outer shape. -2. Consumer-held dependency views, including before and after views during - active move-in buffering. +The wider design problem is that the current system optimizes for the +exceptional case, where every consumer has a distinct subquery view, by paying +that memory cost in the common case where many consumers share the same view +and only diverge briefly during moves. -A reverse index such as `shape_handle -> all values` would make removal faster, -but it would add another per-consumer value list and worsen the memory problem. +**Link to PRD hypothesis:** There is no PRD for this RFC. The working +hypothesis comes from issue #4279: -## Goals +> Redesigning the SubqueryIndex so it does not store full per-shape dependency +> views will make shape add/remove scalable and reduce memory consumption, +> while preserving v1.6 subquery move correctness. + +## Goals & Non-Goals + +### Goals - Store one shared materialized view per subquery. -- Allow consumers to read exact membership at separate logical times. -- Keep routing conservative enough while consumers are at different logical - times. -- Keep subquery addition and removal proportional to the subquery and group - data being changed, not to the total number of shapes in the stack. +- Allow consumers to read exact subquery membership at separate logical times. +- Remove long-lived per-consumer `MapSet` views from event handlers. +- Remove per-shape exact membership rows from `SubqueryIndex`. +- Keep routing conservative while consumers are at different logical times. +- Keep first-time child creation correct by synchronously seeding routing before + the child is considered indexed. +- Keep shape removal proportional to the shape's subquery participants and + routing edges, not to the full dependency view. - Preserve correctness for positive subqueries, negated subqueries, `AND`, `OR`, and `NOT`. - Avoid changing the client wire protocol. -## Non-Goals +### Non-Goals -- This RFC does not change Electric's HTTP protocol. -- This RFC does not change the semantics of supported subqueries. -- This RFC does not attempt to make negated-subquery routing better than +- Do not change Electric's HTTP protocol. +- Do not change supported subquery semantics. +- Do not redesign DNF planning, tags, or `active_conditions`. +- Do not remove the need to materialize SQL array parameters for move-in + queries in the first implementation. The goal is to avoid long-lived copies; + transient query-local arrays may remain. +- Do not make negated-subquery routing better than `O(number_of_affected_shapes)`. If a value is absent from a large negated - subquery group, all of those shapes are genuinely affected. + group, all of those shapes are genuinely affected. +- Do not intern equivalent SQL subqueries that have different dependency shape + handles. A `subquery_id` is the dependency shape handle for v1. + +## Proposal + +### Core Idea + +Move subquery membership out of per-shape state and into one versioned view per +subquery: + +```text +MultiTimeView[{subquery_id, value}] -> membership_history +consumer[{shape_handle, subquery_ref}] -> {subquery_id, filter_time} +``` + +Consumers no longer copy the subquery view. They register each subquery they +read, store the logical time returned by the materializer, and ask +`MultiTimeView.member?(subquery_id, value, time)` when they need exact +membership. + +The filter index no longer stores exact per-shape membership rows. It stores +compact routing topology: + +```text +subquery_group_id +child_node_id per {subquery_group_id, subquery_id} +shape participant rows +fallback rows while initial indexing is incomplete +``` -## Definitions +Positive routing is value-keyed for values that are members at some retained +logical time. Negated routing is group-keyed and then filtered by shared +membership history. -### Subquery +### Architecture + +```text +Dependency materializer + -> writes MultiTimeView at monotonically increasing logical times + -> emits dependency move events with from_time and to_time + +Consumer event handler + -> registers subqueries through the materializer + -> stores subquery_id and logical times, not MapSet views + -> calls notify_processed_up_to/2 after old times are no longer needed + +SubqueryIndex + -> stores subquery groups, child nodes, and participant routing + -> asks MultiTimeView for membership at some/all retained times for routing + -> asks MultiTimeView for membership at a consumer time for exact checks +``` + +### Definitions + +#### Subquery A subquery is represented by its dependency shape. The `subquery_id` is the -handle of that dependency shape. +dependency shape handle. Different `SELECT` statements are different subqueries, even if they differ only by constants. For example: @@ -90,90 +187,81 @@ SELECT id FROM users WHERE company_id = 7 SELECT id FROM users WHERE company_id = 8 ``` -These are two different subqueries and get two different `subquery_id` values. - -### Subquery Group +These get different `subquery_id` values. -A subquery group is a set of subquery occurrences with the same: +#### Subquery Group -- filter tree node -- field key -- polarity +A subquery group is a set of subquery occurrences with the same filter tree +node, field key, and polarity. -For example, these two outer shapes use different subqueries, but the same -subquery group if the subquery occurrence appears at the same filter node: +For example, these outer shapes use different subqueries but can share the same +subquery group if the occurrence is at the same filter node: ```sql WHERE user_id IN (SELECT id FROM users WHERE company_id = 7) WHERE user_id IN (SELECT id FROM users WHERE company_id = 8) ``` -The subqueries differ by `company_id`, but the group is the same because the -field key is `user_id` and the polarity is positive. +The field key is `user_id`, and the polarity is positive. -A single `subquery_id` can appear in multiple groups if it appears at multiple -nodes in outer filter trees. - -### Child Node +#### Child Node A `child_node_id` is created per `{subquery_group_id, subquery_id}` pair. -The child node owns a child `WhereCondition` that contains all outer shapes -using that subquery in that group. This means many outer shapes can share one -child node. +The child node owns a child `WhereCondition` containing all outer shapes using +that subquery in that group. Many outer shapes can therefore share one child +node. -### Logical Time +#### Logical Time Logical time is a monotonically increasing integer per subquery. Time `0` represents the materializer's initial view. Each committed dependency move that changes subquery membership increments the logical time and records -the move at the new time. - -Logical time should use normal BEAM integers. Wrapping is unnecessary and would -make comparison and compaction harder to reason about. +the transition at the new time. -### Processed-Up-To Time +Use normal BEAM integers. Wrapping is unnecessary and would make comparison and +compaction harder to reason about. -The progress monitor's public API is expressed as a processed-up-to -notification: +#### Processed-Up-To Time -Consumers call: +The public progress API is: ```elixir SubqueryProgressMonitor.notify_processed_up_to(time, subquery_id) ``` -after they no longer need to read the subquery at `time` or earlier. -The public function can include `self()` in the message to the monitor, so the -consumer identity is resolved from registration rather than passed at every -call site. - -For a move from logical time `a` to logical time `b`, once the consumer has -finished processing that move and is steady at `b`, it notifies that it has -processed up to `a`. +Consumers call this after they no longer need to read the subquery at `time` or +earlier. For a move from logical time `a` to logical time `b`, once the +consumer has finished processing that move and is steady at `b`, it notifies +that it has processed up to `a`. -Internally, the progress monitor should track `required_time`: the earliest -logical time a live consumer may still read. `notify_processed_up_to(a, -subquery_id)` advances that consumer's `required_time` to `a + 1`. +Internally, the monitor tracks `required_time`: the earliest logical time a +live consumer may still read. `notify_processed_up_to(a, subquery_id)` advances +that consumer's `required_time` to `a + 1`. -The minimum required time for compaction is: +The compaction lower bound is: ```text min(required_time_for_live_consumers) ``` -Consumers register with the logical time they are starting from. If a consumer -starts from current logical time `t`, its initial `required_time` is `t`. +Consumers register at the logical time they are starting from. If a consumer +starts from current logical time `t`, its initial `required_time` is `t` +because it may read time `t`. -## Proposal +`required_time` is a retention bound, not necessarily the same as the time used +for live exact filter checks. During an active move, a consumer may need the old +time for buffered conversion or move-in query work while live exact checks have +already advanced to the new time. The implementation must keep those two uses +explicit. ### MultiTimeView -`SubqueryIndex.MultiTimeView` stores one shared materialized view per subquery. +`Electric.Shapes.Filter.Indexes.SubqueryIndex.MultiTimeView` stores one +shared view per subquery in ETS, with one table per stack. -It is an ETS-backed structure, with one ETS table per stack. The main logical -key is: +The logical key is: ```text {subquery_id, value} -> membership_history @@ -188,8 +276,7 @@ is represented as an empty history: [] ``` -Values that have moved use a small transition history. The exact structure -should be benchmarked before implementation. A simple starting point is: +Values that moved use compact flat histories: ```elixir [:out, 9] @@ -198,8 +285,8 @@ should be benchmarked before implementation. A simple starting point is: [:in, 9, 11] ``` -The first list item is the membership state before the first transition. Each -time after it toggles membership from that time onwards. +The first list item is membership before the first transition. Each integer +after it is a logical time where membership toggles from that time onwards. Examples: @@ -214,6 +301,14 @@ Examples: [:in, 9, 11] ``` +Use `[]` rather than `true` for the always-present case for consistency with +other histories. On BEAM, both `[]` and `true` are immediate terms, so neither +is more compact as an ETS value. + +Use flat lists such as `[:out, 9]` rather than tuples containing lists such as +`{:out, [9]}` because the flat list is smaller and is enough for the common +short-history case. + The API should support: ```elixir @@ -231,11 +326,11 @@ remove_subquery(subquery_id) `member_at_some_time?/2` and `member_at_all_times?/2` operate over the retained time window for that subquery. -#### Compaction +### Compaction -The `SubqueryProgressMonitor` provides the minimum required logical time for -each subquery. `MultiTimeView` can compact entries by evaluating membership at -that time and removing transitions at or before it. +`SubqueryProgressMonitor` provides the minimum required logical time for each +subquery. `MultiTimeView` can compact entries by evaluating membership at that +time and removing transitions at or before it. Compaction must preserve membership at all retained times. For example: @@ -243,8 +338,8 @@ Compaction must preserve membership at all retained times. For example: [:out, 9, 11] ``` -If `min_required_time = 10`, membership at time `10` is `true`, and the compacted -history becomes: +If `min_required_time = 10`, membership at time `10` is `true`, and the +compacted history becomes: ```elixir [:in, 11] @@ -257,375 +352,406 @@ Compaction should run: - when a value is read - when a value is written -- in a periodic asynchronous compaction pass - -`remove_subquery/1` must not scan the whole ETS table. The table should be an -ordered set with keys ordered by `subquery_id`, so removal can iterate the -contiguous key range for one subquery. +- in a periodic asynchronous pass +- when a consumer unregisters and releases the minimum pinned time -### SubqueryIndex +### SubqueryIndex Data Model -`SubqueryIndex` becomes responsible for topology and routing, while -`MultiTimeView` owns exact membership. +The hot ETS rows should use compact integer IDs for groups, children, and +subqueries where practical. Full shape handles and dependency handles can be +stored in metadata rows and interned at boundaries. -The index stores compact integer identifiers for repeated values: +Suggested logical rows: ```text -{node_id, field_key, polarity} -> subquery_group_id -subquery_group_id -> {node_id, field_key, polarity} - -{subquery_group_id, subquery_id} -> child_node_id -child_node_id -> {subquery_group_id, subquery_id, next_condition_id} -subquery_id -> [{subquery_group_id, child_node_id}] +{:group, group_key} -> group_id +{:child, group_id, subquery_id} -> child_node_id +{:child_meta, child_node_id} -> {group_id, subquery_id, polarity, next_condition_id} +{:child_shape, child_node_id} -> {shape_handle, branch_key} +{:shape_child, shape_handle} -> child_node_id +{:shape_subquery, shape_handle, subquery_ref} -> {subquery_id, filter_time} +{:fallback, shape_handle} -> true ``` -Using small integer ids avoids repeating large tuples, field keys, and shape -handles in per-value ETS rows. - -For positive groups, the routing index stores values that can be members at -some retained logical time: +Positive routing keeps value-keyed entries: ```text -{positive_value, subquery_group_id, value} -> [child_node_id] +{:positive, group_id, value} -> child_node_id ``` -For negative groups, the index stores all negative children for the group: +Negated routing keeps group-keyed entries: ```text -{negative_children, subquery_group_id} -> [child_node_id] +{:negated, group_id} -> child_node_id ``` -The negative path cannot avoid considering all affected children when a value -is absent from the subquery views. That is acceptable because those children are -affected. +This replaces per-shape value membership rows with per-child routing rows and a +shared membership view. -#### Affected Shapes +### First-Time Child Creation -For a root-table change, `SubqueryIndex.affected_shapes/4` evaluates the -left-hand side value for the subquery node. +First-time child creation must seed synchronously. -If evaluation fails, it falls back to all children in the group. +When `SubqueryIndex` creates a new `child_node_id` for +`{subquery_group_id, subquery_id}`, it must: -If a subquery is not ready, the child node is routed conservatively. +1. Ensure the dependency materializer has populated `MultiTimeView` and marked + the subquery ready. +2. Create the child `WhereCondition`. +3. Insert the outer shapes into the child condition. +4. Seed positive routing for every value in + `MultiTimeView.values(subquery_id, current_time)`. +5. Add negated group routing if the group is negated. +6. Remove fallback only after the child is fully routable. -For a positive group: +This is `O(number_of_values_in_subquery)` for the first child of a +`{group, subquery_id}` pair. That cost is acceptable because it happens on +child creation, not on every consumer using the same child. -```elixir -for child_node_id <- positive_children_for_value(group_id, value), - subquery_id = subquery_id_for_child(child_node_id), - MultiTimeView.member_at_some_time?(subquery_id, value) do - WhereCondition.affected_shapes(child_node_id, record) -end -``` +### Routing -For a negative group: +Positive routing should route a root-table value to a child if the value is a +member of the child subquery at any retained logical time: ```elixir -for child_node_id <- all_negative_children(group_id), - subquery_id = subquery_id_for_child(child_node_id), - not MultiTimeView.member_at_all_times?(subquery_id, value) do - WhereCondition.affected_shapes(child_node_id, record) -end +MultiTimeView.member_at_some_time?(subquery_id, value) ``` -This keeps routing broad enough for all consumers reading any retained logical -time. - -#### Shape Addition - -Adding an outer shape to an existing `{group, subquery_id}` child is near O(1): -the shape is added to the child `WhereCondition`. - -Creating the first child for `{group, subquery_id}` requires indexing current -values for that subquery in the group. That is O(number of values in the -subquery), which is acceptable and unavoidable. First-time child creation should -seed synchronously so the child is fully routable before the shape is considered -indexed. - -#### Shape Removal - -Removing an outer shape removes it from the child `WhereCondition`. +This is conservative. If some consumers still read an old time and others read +a new time, both old and new members remain routable until compaction proves no +consumer can read the old time. -If the child becomes empty, remove the `{group, subquery_id}` child and update -the group value indexes by iterating the values for `subquery_id` from -`MultiTimeView`. This is proportional to the number of values in that subquery, -not to the total number of shapes or subqueries. - -### Materializer - -The materializer continues to track dependency shape membership from the -dependency log. - -It changes in three ways: - -1. On initial load, it populates `MultiTimeView` for the subquery at logical - time `0`, then marks the subquery ready. -2. On each committed batch that produces net move events, it increments the - subquery logical time. -3. Before sending `{:materializer_changes, ...}` to subscribers, it writes the - move events to `MultiTimeView` at the new logical time. - -The subscriber payload should include both the old and new logical time: +Negated routing should enumerate the negated children for the group and keep +children where the value is not a member at all retained times: ```elixir -%{ - move_in: [{value, original_string}], - move_out: [{value, original_string}], - txids: [txid], - from_time: old_time, - to_time: new_time -} +not MultiTimeView.member_at_all_times?(subquery_id, value) ``` -If a committed batch has no net membership move, logical time does not need to -advance. - -The existing `Materializer.LinkValues` ETS cache should be removed or replaced -by `MultiTimeView` rather than kept as a second full shared copy. - -### Consumer Event Handlers +This is `O(number_of_affected_shapes)` for large negated groups. That is +acceptable because a value absent from a large negated group genuinely affects +all of those shapes. -Consumers store logical times instead of materialized subquery views. - -The steady handler keeps: +Exact filter verification uses the consumer's filter time: ```elixir -%{subquery_ref => logical_time} +MultiTimeView.member?(subquery_id, typed_value, filter_time) ``` -`Shape.convert_change/3` and DNF metadata projection need to accept a membership -callback instead of requiring a concrete `MapSet` view: - -```elixir -fn subquery_ref, value, time -> - MultiTimeView.member?(subquery_id, value, time) -end -``` +`WhereClause.subquery_member_from_index/2` should therefore resolve +`shape_handle + subquery_ref` to `{subquery_id, filter_time}` and call the +shared view. The callback remains the boundary used by +`WhereClause.includes_record?/3`. -In steady state, old and new records use the same logical time. +### Materializer Integration -During a buffered move-in, `ActiveMove` stores: +The materializer owns the source of truth for a dependency subquery. It should +populate `MultiTimeView` during initial materialization and mark the subquery +ready only after the full initial view is visible. -```elixir -times_before_move -times_after_move -``` +When a committed dependency change alters membership, the materializer should: -instead of: +1. Read the current logical time `a`. +2. Increment to logical time `b`. +3. Write the transition into `MultiTimeView` at `b`. +4. Update positive routing before emitting the move if the value is newly + routable at some retained time. +5. Emit the dependency move with `from_time: a`, `to_time: b`, `subquery_id`, + changed values, and move kind. -```elixir -views_before_move -views_after_move -``` +Consumers must not observe a move event whose target time is absent from +`MultiTimeView`. -Buffered transactions before the splice boundary are converted using -`times_before_move`. Buffered transactions after the splice boundary are -converted using `times_after_move`. +### Consumer Registration -After the move is spliced and the consumer becomes steady at `to_time`, it -calls: +Consumers register for each subquery they read. Registration should be +serialized through the dependency materializer so the returned time and the +shared view are consistent: ```elixir -SubqueryProgressMonitor.notify_processed_up_to(from_time, subquery_id) +{:ok, current_time} = + Materializer.register_subquery_consumer( + subquery_id, + outer_shape_handle, + self() + ) ``` -#### Move-In Queries +The registration side effects are: -Move-in queries currently build SQL from whole before and after views. The new -implementation should avoid retaining large views in the consumer process. +- wait until the dependency materializer has finished initial population +- register the consumer with `SubqueryProgressMonitor` +- set the consumer's initial `required_time` to `current_time` +- return `current_time` to the caller -Preferred approach: +This replaces the current `Materializer.get_link_values/1` setup path for +subquery event handlers. The handler should keep compact references such as: -- Build the triggering dependency candidate predicate from the move delta - values when possible. -- Read full view values at a specific time only for positions that require - exclusion logic. -- If full views are required, materialize them inside the task process that - runs the query so the memory is released when the task exits. +```elixir +%{ + ["$sublink", "0"] => %{subquery_id: dep_handle, time: current_time} +} +``` -This is important because replacing long-lived consumer views with -short-lived task views is where much of the memory win comes from. +not `MapSet` views. -### SubqueryProgressMonitor +The monitor should track consumers by process monitor plus registered +subqueries so dead consumers automatically release pinned times. An explicit +unregister path can be added for normal shutdown, but correctness must not +depend on it. -`SubqueryProgressMonitor` tracks the earliest logical time still needed by live -outer consumers. +### Consumer Move Handling -Registration must happen atomically with choosing the consumer's starting -logical time. The materializer should provide a serialized registration call, -for example: +For a move from time `a` to time `b`, `ActiveMove` should store times, not +views: ```elixir -{:ok, current_time} = - Materializer.register_subquery_consumer( - subquery_id, - outer_shape_handle, - self() - ) +%ActiveMove{ + subquery_id: subquery_id, + from_time: a, + to_time: b, + values: values +} ``` -The materializer handles the call in its GenServer, reads its current logical -time, registers the outer consumer with the progress monitor, and returns that -time to the consumer. This prevents a race where the materializer advances and -compacts a time before the new consumer is registered as needing it. - -The monitor stores the returned time as the consumer's internal -`required_time`. - -When a consumer finishes a move from `from_time` to `to_time`, it calls: +Elixir-side evaluation of buffered transactions should use callbacks into +`MultiTimeView`: ```elixir -SubqueryProgressMonitor.notify_processed_up_to(from_time, subquery_id) +before_member? = fn ref, value -> member?(ref, value, a) end +after_member? = fn ref, value -> member?(ref, value, b) end ``` -The monitor then advances that consumer's `required_time` to `from_time + 1`. +For SQL move-in queries, the first implementation can still materialize +query-local arrays by calling `MultiTimeView.values(subquery_id, time)`. The +important change is that these arrays are transient query parameters, not +long-lived per-consumer state. -The monitor maintains two ETS indexes: +After the move is spliced and the consumer no longer needs time `a`, it calls: -```text -{subquery_id, consumer_shape_handle} -> required_time -{subquery_id, required_time, consumer_shape_handle} -> true +```elixir +SubqueryProgressMonitor.notify_processed_up_to(a, subquery_id) ``` -The first index makes updates O(1). The second index makes the minimum -required time for a subquery cheap to read. +The consumer's exact-filter time is separate from this retention notification. +It should advance to `b` at the same point the current implementation would +update per-shape membership rows for subsequent routing. The important +invariant is that live routing must not under-route, while `required_time` +continues to pin `a` until the consumer no longer needs the old view. -If `notify_processed_up_to/2` infers consumer identity from the caller process, -the monitor also needs a lookup from registered consumer pid to -`consumer_shape_handle`. +### Querying Changes -When the minimum changes, the monitor notifies `MultiTimeView`: +`Querying.move_in_where_clause/5` currently receives +`views_before_move` and `views_after_move` maps. Replace those maps with a view +resolver that can provide values for a subquery ref at a logical time: ```elixir -MultiTimeView.set_min_required_time(subquery_id, min_required_time) +values_for.(subquery_ref, time) ``` -When an outer shape is removed, the monitor removes that consumer from every -subquery it was registered for and recomputes the affected minima. - -### Concurrency Model +Initial implementation can adapt this resolver back into arrays at the SQL +boundary, preserving existing SQL generation behavior. A later optimization can +special-case the triggering subquery position and use only the changed values +for candidate selection when the DNF plan makes that safe. + +This keeps the first implementation smaller while still removing long-lived +view copies. + +### Failure Modes + +If `MultiTimeView` is not ready for a subquery, shapes using that subquery must +stay in fallback routing. They must not be marked ready. + +If a consumer dies while it pins an old time, `SubqueryProgressMonitor` must +release its registration via the process monitor. Otherwise compaction can be +blocked indefinitely. + +If a dependency materializer is removed, `MultiTimeView.remove_subquery/1` must +remove the view and `SubqueryIndex` must remove the children and participants +associated with that subquery without scanning unrelated shapes. + +If compaction falls behind, correctness is preserved but routing becomes more +conservative and histories grow. Add telemetry so this is visible. + +### Telemetry + +Add enough telemetry to prove or disprove the design: + +- number of values per subquery +- number of retained histories per subquery +- max and average history length +- min/current logical time gap per subquery +- number of registered consumers per subquery +- number of child nodes per subquery group +- first-child synchronous seed duration +- shape removal duration +- transient SQL move-in array size + +### Complexity Check + +- **Is this the simplest approach?** No. The simplest immediate fix is adding a + reverse index for shape-owned values or using tombstones. Those approaches do + less architectural work, but they keep or increase the duplicated full-view + memory that caused the problem. This proposal is more complex because it + crosses the materializer, event handler, querying, and filter index + boundaries, but it removes both major long-lived duplicate view pools. +- **What could we cut?** The first implementation can keep existing SQL array + generation, materializing arrays only at query time. It can also postpone + aggressive history encoding, background compaction tuning, and cross-handle + subquery interning. +- **What's the 90/10 solution?** Implement `MultiTimeView`, serialized + registration, per-consumer logical times, and shared child routing. Keep + move-in SQL generation structurally the same by resolving values from the + shared view at the SQL boundary. Add telemetry before optimizing the query + format further. -Most writes are already serialized by existing processes: +## Open Questions -- Shape addition and removal happen through the ShapeLogCollector path. -- Dependency changes are applied by materializers. -- Outer consumers process events synchronously when ShapeLogCollector publishes - to them. +Unresolved questions that need further discussion or will be determined during +implementation: -`MultiTimeView` writes happen in the materializer before it sends -`materializer_changes` to subscribers. +| Question | Options | Resolution Path | +|----------|---------|-----------------| +| **How should `values(subquery_id, time)` expose large views?** | Materialized `MapSet`, stream, both | Start with query-local materialization for compatibility, then prototype streaming or chunked array construction if telemetry shows spikes. | +| **Where should exact filter times live?** | In `SubqueryIndex` participant rows, in `SubqueryProgressMonitor`, or in consumer-owned state with callbacks | Decide during implementation. The filter needs fast `shape_handle + subquery_ref -> time` lookup, so `SubqueryIndex` is the likely owner. | +| **When should positive routing rows be removed after compaction?** | Opportunistically on read/write, periodic cleanup, immediate cleanup when min time advances | Implement opportunistic plus periodic cleanup first. Add immediate cleanup only if stale positive routes are expensive. | +| **Should long histories switch representation?** | Keep flat lists, switch to tuples/arrays after a threshold, or compact eagerly | Keep flat lists for v1 and add telemetry for max history length before adding another representation. | -`SubqueryIndex` topology changes happen when shapes are added or removed from -the filter. +## Definition of Success -`SubqueryIndex` reads happen from ShapeLogCollector while routing replication -changes. +### Primary Hypothesis -Consumer reads from `MultiTimeView` can happen concurrently with writes. This -is safe because membership is always read at an explicit logical time, and ETS -updates replace complete membership-history values atomically. +> We believe that implementing shared subquery logical-time views will enable +> the issue #4279 hypothesis: subquery indexing can become scalable for shape +> add/remove and memory use while preserving v1.6 subquery move correctness. +> +> We'll know we're right if shared subqueries no longer allocate full +> per-consumer dependency views in steady state, shape removal no longer scans +> value-keyed membership rows owned by unrelated shapes, and existing subquery +> move correctness tests continue to pass. +> +> We'll know we're wrong if retained histories grow without bound under normal +> consumer lag, move-in query memory still dominates production incidents, or +> the cross-subsystem complexity creates correctness regressions compared with +> the current per-consumer view model. -The ready flag is important. Until a subquery has been seeded into -`MultiTimeView` and any group routing rows have been created, routing must be -conservative. +### Functional Requirements -## Expected Benefits +| Requirement | Acceptance Criteria | +|-------------|---------------------| +| Shared subquery view | One `MultiTimeView` view exists per `subquery_id`, and steady-state consumers do not store full `MapSet` views. | +| Per-consumer logical time | Each consumer can evaluate subquery membership at its own logical time. | +| Correct registration | Consumer registration is serialized with the materializer and returns a current logical time whose view is ready. | +| Progress notification | Consumers call `notify_processed_up_to(time, subquery_id)` after finishing moves, and compaction uses the minimum required time. | +| Synchronous first child seed | First-time child creation seeds routing for the current view before removing fallback. | +| Positive routing correctness | Values that are members at any retained time route to the relevant child node. | +| Negated routing correctness | Negated groups route conservatively and filter with `member_at_all_times?/2`. | +| Shape removal scalability | Removing a shape follows participant and child rows, not all subquery values for unrelated shapes. | +| Move-in compatibility | Existing move-in SQL behavior can be produced from logical-time views without long-lived before/after `MapSet` copies. | +| Observability | Telemetry reports retained time gaps, history sizes, seed duration, and removal duration. | -- One retained membership view per subquery instead of one per outer shape. -- Consumer processes retain logical times instead of large `MapSet` views. -- Move-in buffering retains before/after logical times instead of before/after - `MapSet` views. -- Subquery removal is proportional to the removed subquery's values and group - entries, not to total stack size. -- Positive routing remains value-keyed and efficient. +### Learning Goals -## Risks +1. Measure how large retained logical-time windows become under realistic + consumer lag. +2. Measure whether transient move-in SQL arrays remain a material memory cost + after removing long-lived view copies. +3. Determine whether flat list histories are sufficient or whether a threshold + representation is needed. +4. Determine whether conservative positive routing creates measurable extra + filter work before compaction catches up. -### Off-by-One Compaction +## Alternatives Considered -The biggest correctness risk is compacting away a logical time that some -consumer still needs. +These alternatives are based on the discussion and rejected approaches in +PR #4280. -The invariant is: +### Alternative 1: Add `shape_handle -> all values` -```text -MultiTimeView may drop transitions at or before min_required_time -after rewriting the entry to preserve membership at min_required_time. +**Description:** Add a reverse index from each shape to the full set of values +it has inserted into `SubqueryIndex`. -min_required_time = min(required_time_by_live_consumer) -``` +**Why not:** This improves shape removal, but it adds another full per-shape +dependency view. It makes the removal path easier by increasing the same memory +duplication this RFC is trying to remove. -Tests should cover move-in, move-out, repeated toggles, consumer registration, -consumer removal, and compaction across all of those cases. +### Alternative 2: Tombstone Removed Shapes And Clean Later -### Negated Routing Cost +**Description:** Mark removed shapes as tombstoned and clean their value-keyed +membership rows asynchronously. -For negated subquery groups, a value absent from all subqueries affects all -children in the group. This can be large, but it is proportional to the number -of affected shapes. +**Why not:** This is useful as an emergency mitigation, but it is not a +structural memory fix. It leaves stale rows in the hot routing path and +requires liveness checks or cleanup debt elsewhere. -The implementation should avoid extra memory-heavy complement indexes unless -there is evidence they are necessary. +### Alternative 3: One Global Widened Filter -### Move-In Query Memory +**Description:** Store one widened filter for each subquery and route every +value that might match any participant, relying on downstream exact filtering. -If move-in query generation materializes full before and after views in the -consumer process, the design will keep a major source of memory duplication. +**Why not:** A slow or stalled consumer can keep the shared filter broad and +over-route work for every other participant. This preserves correctness, but it +can move cost from memory to sustained routing and filtering work. -Full view materialization should be avoided where possible and isolated to -short-lived task processes where not possible. +### Alternative 4: Intern Full Dependency Views -### Fallback Windows +**Description:** Deduplicate identical full dependency views by interning +`MapSet` values or equivalent view structures. -Unready subqueries and not-yet-seeded group routing must route conservatively. -This may temporarily over-route, but must not under-route. +**Why not:** This handles exact equality at a point in time, but one-value +moves immediately create new views or require a second delta representation. +At that point the design becomes a versioned or sparse-delta view. Logical time +models that state directly. -## Testing Plan +### Alternative 5: Versioned Lazy Exception Clearing -Add focused unit tests for `MultiTimeView`: +**Description:** Keep sparse exceptions and clear or promote them lazily with +versions instead of doing eager cleanup. + +**Why not:** This can reduce some hot-path work, but it adds versioning and +cleanup complexity while retaining a separate exception model. This is better +as a follow-up optimization if measurements show cleanup cost is high. + +### Alternative 6: Shared Base View With Sparse XOR Exceptions -- membership at exact logical times -- `member_at_some_time?/2` -- `member_at_all_times?/2` -- move-in and move-out transitions -- repeated toggles -- compaction after `set_min_required_time/2` -- subquery removal by ordered key range +**Description:** The design in PR #4280 stores one base dependency view per +cohort and stores sparse per-participant XOR exceptions for values where a +participant temporarily differs from the base. -Add focused unit tests for `SubqueryProgressMonitor`: +**Why not:** This is a lower-risk, index-focused approach and may still be the +right short-term fix if this RFC is too broad. However, it leaves consumer-held +before/after views in place and represents temporary divergence as +per-participant exceptions instead of as consumers reading different logical +times. +The logical-time design is a broader refactor, but it addresses the duplicated +state in both `SubqueryIndex` and consumer event handlers. -- registration at current time -- `notify_processed_up_to/2` -- minimum required time updates -- consumer removal -- multiple consumers at different times +## Revision History -Add `SubqueryIndex` tests: +| Version | Date | Author | Changes | +|---------|------|--------|---------| +| 0.1 | 2026-05-18 | robacourt | Initial draft using the Stratovolt RFC template and alternatives from PR #4280. | -- positive group routing -- negative group routing -- shared child nodes per `{group, subquery_id}` -- conservative routing for unready subqueries -- child removal after the last outer shape is removed -- no full-table scan on subquery removal +--- -Update consumer event-handler tests: +## RFC Quality Checklist -- steady conversion using logical times -- buffered move-in using before and after logical times -- queued moves across multiple logical times -- progress notifications after splice -- negated move-in and move-out behavior +Before submitting for review, verify: -Keep or extend integration tests for: +**Alignment** +- [x] RFC implements the working issue hypothesis, with no separate PRD. +- [x] API naming matches ElectricSQL conventions. +- [x] Success criteria link back to the issue hypothesis. -- dependency move-in -- dependency move-out -- nested subqueries -- subqueries combined with non-subquery predicates -- rows moving between two dependency values in one transaction - -## Open Questions +**Calibration for Level 1-2 PMF** +- [x] This is the smallest version of the logical-time design that validates + the memory hypothesis. +- [x] Non-goals explicitly defer protocol changes, DNF redesign, and deeper + query optimization. +- [x] Complexity Check section is filled out honestly. +- [x] An engineer could start implementing tomorrow. -- Should `MultiTimeView` expose `values(subquery_id, time)` as a materialized - `MapSet`, a stream, or both? +**Completeness** +- [x] Happy path is clear. +- [x] Critical failure modes are addressed. +- [x] Open questions are acknowledged, not glossed over. From 205c875210ff5c1248e4b047e5d381f6de2fef70 Mon Sep 17 00:00:00 2001 From: rob Date: Mon, 18 May 2026 20:49:15 +0100 Subject: [PATCH 11/15] Remove use of phrase 'filter time' --- docs/rfcs/subquery-index.md | 35 +++++++++++++++++++---------------- 1 file changed, 19 insertions(+), 16 deletions(-) diff --git a/docs/rfcs/subquery-index.md b/docs/rfcs/subquery-index.md index 84feeff5a3..fa35859166 100644 --- a/docs/rfcs/subquery-index.md +++ b/docs/rfcs/subquery-index.md @@ -132,7 +132,7 @@ subquery: ```text MultiTimeView[{subquery_id, value}] -> membership_history -consumer[{shape_handle, subquery_ref}] -> {subquery_id, filter_time} +consumer[{shape_handle, subquery_ref}] -> {subquery_id, logical_time} ``` Consumers no longer copy the subquery view. They register each subquery they @@ -250,10 +250,11 @@ Consumers register at the logical time they are starting from. If a consumer starts from current logical time `t`, its initial `required_time` is `t` because it may read time `t`. -`required_time` is a retention bound, not necessarily the same as the time used -for live exact filter checks. During an active move, a consumer may need the old -time for buffered conversion or move-in query work while live exact checks have -already advanced to the new time. The implementation must keep those two uses +`required_time` is a retention bound. It is separate from the consumer's current +logical time for a specific subquery. During an active move, a consumer may need +the old time for buffered conversion or move-in query work while its current +logical time for that subquery has already advanced to the new time. The +implementation must keep `required_time` and per-subquery `logical_time` explicit. ### MultiTimeView @@ -369,7 +370,7 @@ Suggested logical rows: {:child_meta, child_node_id} -> {group_id, subquery_id, polarity, next_condition_id} {:child_shape, child_node_id} -> {shape_handle, branch_key} {:shape_child, shape_handle} -> child_node_id -{:shape_subquery, shape_handle, subquery_ref} -> {subquery_id, filter_time} +{:shape_subquery, shape_handle, subquery_ref} -> {subquery_id, logical_time} {:fallback, shape_handle} -> true ``` @@ -432,14 +433,15 @@ This is `O(number_of_affected_shapes)` for large negated groups. That is acceptable because a value absent from a large negated group genuinely affects all of those shapes. -Exact filter verification uses the consumer's filter time: +Exact filter verification uses the consumer's current logical time for the +requested subquery: ```elixir -MultiTimeView.member?(subquery_id, typed_value, filter_time) +MultiTimeView.member?(subquery_id, typed_value, logical_time) ``` `WhereClause.subquery_member_from_index/2` should therefore resolve -`shape_handle + subquery_ref` to `{subquery_id, filter_time}` and call the +`shape_handle + subquery_ref` to `{subquery_id, logical_time}` and call the shared view. The callback remains the boundary used by `WhereClause.includes_record?/3`. @@ -533,11 +535,12 @@ After the move is spliced and the consumer no longer needs time `a`, it calls: SubqueryProgressMonitor.notify_processed_up_to(a, subquery_id) ``` -The consumer's exact-filter time is separate from this retention notification. -It should advance to `b` at the same point the current implementation would -update per-shape membership rows for subsequent routing. The important -invariant is that live routing must not under-route, while `required_time` -continues to pin `a` until the consumer no longer needs the old view. +The consumer's current logical time for that subquery is separate from this +retention notification. It should advance to `b` at the same point the current +implementation would update per-shape membership rows for subsequent routing. +The important invariant is that live routing must not under-route, while +`required_time` continues to pin `a` until the consumer no longer needs the old +view. ### Querying Changes @@ -613,7 +616,7 @@ implementation: | Question | Options | Resolution Path | |----------|---------|-----------------| | **How should `values(subquery_id, time)` expose large views?** | Materialized `MapSet`, stream, both | Start with query-local materialization for compatibility, then prototype streaming or chunked array construction if telemetry shows spikes. | -| **Where should exact filter times live?** | In `SubqueryIndex` participant rows, in `SubqueryProgressMonitor`, or in consumer-owned state with callbacks | Decide during implementation. The filter needs fast `shape_handle + subquery_ref -> time` lookup, so `SubqueryIndex` is the likely owner. | +| **Where should per-subquery logical times live?** | In `SubqueryIndex` participant rows, in `SubqueryProgressMonitor`, or in consumer-owned state with callbacks | Decide during implementation. Exact membership checks need fast `shape_handle + subquery_ref -> {subquery_id, logical_time}` lookup, so `SubqueryIndex` is the likely owner. | | **When should positive routing rows be removed after compaction?** | Opportunistically on read/write, periodic cleanup, immediate cleanup when min time advances | Implement opportunistic plus periodic cleanup first. Add immediate cleanup only if stale positive routes are expensive. | | **Should long histories switch representation?** | Keep flat lists, switch to tuples/arrays after a threshold, or compact eagerly | Keep flat lists for v1 and add telemetry for max history length before adding another representation. | @@ -640,7 +643,7 @@ implementation: | Requirement | Acceptance Criteria | |-------------|---------------------| | Shared subquery view | One `MultiTimeView` view exists per `subquery_id`, and steady-state consumers do not store full `MapSet` views. | -| Per-consumer logical time | Each consumer can evaluate subquery membership at its own logical time. | +| Per-consumer per-subquery logical time | Each consumer can evaluate each subquery at that subquery's own logical time. | | Correct registration | Consumer registration is serialized with the materializer and returns a current logical time whose view is ready. | | Progress notification | Consumers call `notify_processed_up_to(time, subquery_id)` after finishing moves, and compaction uses the minimum required time. | | Synchronous first child seed | First-time child creation seeds routing for the current view before removing fallback. | From 7c15a853c60705c1d5beaf4839c7d555df5e171c Mon Sep 17 00:00:00 2001 From: rob Date: Mon, 18 May 2026 21:04:07 +0100 Subject: [PATCH 12/15] Add examples --- docs/rfcs/subquery-index.md | 589 ++++++++++++++++++++++++++++++++++++ 1 file changed, 589 insertions(+) diff --git a/docs/rfcs/subquery-index.md b/docs/rfcs/subquery-index.md index fa35859166..1d7e7bf192 100644 --- a/docs/rfcs/subquery-index.md +++ b/docs/rfcs/subquery-index.md @@ -368,6 +368,7 @@ Suggested logical rows: {:group, group_key} -> group_id {:child, group_id, subquery_id} -> child_node_id {:child_meta, child_node_id} -> {group_id, subquery_id, polarity, next_condition_id} +{:subquery_child, subquery_id} -> child_node_id {:child_shape, child_node_id} -> {shape_handle, branch_key} {:shape_child, shape_handle} -> child_node_id {:shape_subquery, shape_handle, subquery_ref} -> {subquery_id, logical_time} @@ -445,6 +446,594 @@ MultiTimeView.member?(subquery_id, typed_value, logical_time) shared view. The callback remains the boundary used by `WhereClause.includes_record?/3`. +### Operation Examples And Costs + +Use this concrete setup for the examples: + +```sql +-- subquery_id = s7, current logical time 0 +SELECT id FROM users WHERE company_id = 7 +-- current values: 10, 20 + +-- subquery_id = s8, current logical time 0 +SELECT id FROM users WHERE company_id = 8 +-- current values: 30 +``` + +Outer shapes: + +```sql +-- shape_a and shape_b share the same positive group and subquery. +WHERE user_id IN (SELECT id FROM users WHERE company_id = 7) + +-- shape_c uses the same positive group but a different subquery. +WHERE user_id IN (SELECT id FROM users WHERE company_id = 8) + +-- shape_n uses a negated group for s7. +WHERE user_id NOT IN (SELECT id FROM users WHERE company_id = 7) +``` + +Symbols used below: + +- `V_s`: values in subquery `s` over the retained window. +- `H_v`: transition history length for one `{subquery_id, value}` row. +- `C_s`: child nodes attached to subquery `s`. +- `P_c`: outer shapes attached to child node `c`. +- `K`: changed values in one dependency move. +- `N_g`: child nodes in a negated group. +- `R`: root-table candidate rows returned by a move-in SQL query. + +#### Initial `MultiTimeView` State + +The initial materializer state for `s7` stores one row per dependency value, +not one row per outer shape: + +```text +{s7, 10} -> [] +{s7, 20} -> [] +{:current_time, s7} -> 0 +{:min_required_time, s7} -> 0 +{:ready, s7} -> true +``` + +The empty history means the value is present for the whole retained window. + +Memory is `O(V_s)` for the shared view. In this example, `shape_a` and +`shape_b` do not duplicate `{10, 20}`. + +#### `register_subquery_consumer` + +Before an outer consumer can read `s7`, it registers through the materializer: + +```elixir +{:ok, 0} = + Materializer.register_subquery_consumer( + s7, + shape_a, + consumer_pid_a + ) +``` + +Progress monitor rows are conceptually: + +```text +{s7, shape_a} -> required_time 0 +{s7, 0, shape_a} -> true +``` + +The shape's subquery reference is then recorded by the indexing/setup path: + +```text +{:shape_subquery, shape_a, ["$sublink", "0"]} -> {s7, 0} +``` + +If registration is called from `add_shape`, this row is inserted once as part +of that setup path; it is shown here to make the registration result explicit. + +What is evaluated: + +1. Wait until `s7` is ready. +2. Read `{:current_time, s7}`. +3. Insert progress monitor rows for `shape_a`. +4. Return `0` to the consumer. + +Cost: + +```text +O(wait_until_ready + progress_index_insert) +``` + +No dependency values are copied. Memory added is +`O(number_of_subqueries_read_by_shape)`. + +#### `add_shape`: First Positive Shape For `{group, subquery}` + +Adding `shape_a` creates a positive group `g_user_pos` and a child `c_s7_pos` +for `{g_user_pos, s7}`. + +Rows stored: + +```text +{:group, {:node_1, :user_id, :positive}} -> g_user_pos +{:child, g_user_pos, s7} -> c_s7_pos +{:child_meta, c_s7_pos} -> {g_user_pos, s7, :positive, wc_s7_pos} +{:subquery_child, s7} -> c_s7_pos + +{:child_shape, c_s7_pos} -> {shape_a, branch_a} +{:shape_child, shape_a} -> c_s7_pos +{:shape_subquery, shape_a, ["$sublink", "0"]} -> {s7, 0} + +{:positive, g_user_pos, 10} -> c_s7_pos +{:positive, g_user_pos, 20} -> c_s7_pos +``` + +The child `WhereCondition` `wc_s7_pos` also stores `shape_a` with the residual +non-subquery predicates for the branch. + +What is evaluated: + +1. Compile or reuse the DNF subquery group key. +2. Register the consumer with the dependency materializer and get logical time + `0`. +3. Create the child `WhereCondition`. +4. Insert `shape_a` into the child condition. +5. Synchronously seed positive routing from `MultiTimeView.values(s7, 0)`. +6. Remove fallback for `shape_a`. + +Cost: + +```text +O(number_of_subquery_occurrences_in_shape + V_s + child_where_insert) +``` + +The `V_s` term only applies because this is the first child for +`{g_user_pos, s7}`. Memory added is `O(V_s)` positive routing rows for the +child plus `O(number_of_subquery_occurrences_in_shape)` participant rows. + +#### `add_shape`: Additional Shape Sharing An Existing Child + +Adding `shape_b` finds the existing child `c_s7_pos`. + +Rows added: + +```text +{:child_shape, c_s7_pos} -> {shape_b, branch_b} +{:shape_child, shape_b} -> c_s7_pos +{:shape_subquery, shape_b, ["$sublink", "0"]} -> {s7, 0} +``` + +No new rows are added for values `10` or `20`. + +What is evaluated: + +1. Resolve `{g_user_pos, s7}` to `c_s7_pos`. +2. Register the consumer and get logical time `0`. +3. Insert `shape_b` into the child condition. +4. Remove fallback for `shape_b`. + +Cost: + +```text +O(number_of_subquery_occurrences_in_shape + child_where_insert) +``` + +Memory added is per-shape metadata only, not `O(V_s)`. + +#### `add_shape`: Same Group, Different Subquery + +Adding `shape_c` reuses group `g_user_pos`, but creates child `c_s8_pos` for +`{g_user_pos, s8}`. + +Rows added include: + +```text +{:child, g_user_pos, s8} -> c_s8_pos +{:child_meta, c_s8_pos} -> {g_user_pos, s8, :positive, wc_s8_pos} +{:subquery_child, s8} -> c_s8_pos +{:positive, g_user_pos, 30} -> c_s8_pos +{:shape_subquery, shape_c, ["$sublink", "0"]} -> {s8, 0} +``` + +Cost is `O(V_s8)` for the first `s8` child in this group. This is expected: +`s8` has different dependency values from `s7`. + +#### `add_shape`: Negated Shape + +Adding `shape_n` creates or reuses a negated group `g_user_neg` and child +`c_s7_neg`. + +Rows stored: + +```text +{:group, {:node_2, :user_id, :negated}} -> g_user_neg +{:child, g_user_neg, s7} -> c_s7_neg +{:child_meta, c_s7_neg} -> {g_user_neg, s7, :negated, wc_s7_neg} +{:subquery_child, s7} -> c_s7_neg +{:negated, g_user_neg} -> c_s7_neg + +{:child_shape, c_s7_neg} -> {shape_n, branch_n} +{:shape_child, shape_n} -> c_s7_neg +{:shape_subquery, shape_n, ["$sublink", "0"]} -> {s7, 0} +``` + +No per-value negated routing rows are stored. + +Cost: + +```text +O(number_of_subquery_occurrences_in_shape + child_where_insert) +``` + +Memory added for negated routing is `O(1)` per child, not `O(V_s)`. + +#### `affected_shapes`: Positive Group + +For a root-table record: + +```text +%{"user_id" => 10} +``` + +Routing does: + +1. Evaluate the left-hand side `user_id` to `10`. +2. Look up `{:positive, g_user_pos, 10}` and get `[c_s7_pos]`. +3. Evaluate child condition `wc_s7_pos`, which considers `shape_a` and + `shape_b`. +4. For each candidate shape, exact subquery checks resolve: + +```text +{:shape_subquery, shape_a, ["$sublink", "0"]} -> {s7, 0} +{:shape_subquery, shape_b, ["$sublink", "0"]} -> {s7, 0} +MultiTimeView.member?(s7, 10, 0) -> true +``` + +Both shapes are affected. + +Cost: + +```text +O(children_for_value + child_where_eval + exact_subquery_checks * H_v) +``` + +For this example, `children_for_value = 1`. There is no scan of all shapes and +no scan of all values in `s7`. + +#### `affected_shapes`: Positive Group With Divergent Consumer Times + +Suppose the materializer adds value `30` to `s7` at logical time `1`: + +```text +{s7, 30} -> [:out, 1] +{:current_time, s7} -> 1 +{:positive, g_user_pos, 30} -> c_s7_pos +``` + +Now `shape_a` has advanced to logical time `1`, but `shape_b` still reads +logical time `0`: + +```text +{:shape_subquery, shape_a, ["$sublink", "0"]} -> {s7, 1} +{:shape_subquery, shape_b, ["$sublink", "0"]} -> {s7, 0} +``` + +For: + +```text +%{"user_id" => 30} +``` + +routing finds `c_s7_pos` because `30` is a member at some retained time. Exact +checks then split the result: + +```text +MultiTimeView.member?(s7, 30, 1) -> true +MultiTimeView.member?(s7, 30, 0) -> false +``` + +Only `shape_a` is affected. + +Cost remains: + +```text +O(children_for_value + child_where_eval + exact_subquery_checks * H_v) +``` + +The extra memory for the move is one history row for `{s7, 30}` plus one +positive routing row per positive child for `s7` in that group. + +#### `affected_shapes`: Negated Group + +For: + +```text +%{"user_id" => 30} +``` + +while `{s7, 30} -> [:out, 1]` is retained, `30` is absent at time `0` and +present at time `1`. Negated routing does: + +1. Look up `{:negated, g_user_neg}` and get `[c_s7_neg]`. +2. Keep `c_s7_neg` because: + +```elixir +not MultiTimeView.member_at_all_times?(s7, 30) +``` + +3. Evaluate `wc_s7_neg` and exact membership at each candidate shape's + subquery logical time. + +For `shape_n` at logical time `0`, `NOT IN s7` is true for `30`. If it later +advances to logical time `1`, `NOT IN s7` is false for `30`. + +Cost: + +```text +O(N_g * H_v + child_where_eval + exact_subquery_checks * H_v) +``` + +This is intentionally proportional to the number of affected negated children. +No complement index is stored. + +#### Dependency Move: Add Or Remove Values + +For a move that adds `30` to `s7`: + +```text +from_time = 0 +to_time = 1 +changed_values = [30] +``` + +Rows written: + +```text +{s7, 30} -> [:out, 1] +{:current_time, s7} -> 1 +{:positive, g_user_pos, 30} -> c_s7_pos +``` + +Rows not written: + +```text +{:membership, shape_a, ["$sublink", "0"], 30} +{:membership, shape_b, ["$sublink", "0"], 30} +``` + +What is evaluated: + +1. Update the `MultiTimeView` history for each changed value. +2. Find children from `{:subquery_child, s7}`. +3. For each positive child, insert a positive routing row if the value changed + from not routable to routable for the retained window. +4. Emit a move event containing `from_time`, `to_time`, `subquery_id`, and + changed values. + +Cost: + +```text +O(K * (history_update + C_s)) +``` + +For a remove of `20` from `s7` at time `2`, the history becomes: + +```text +{s7, 20} -> [:in, 2] +``` + +The positive routing row for `20` stays while any retained time still contains +`20`. It is removed later when compaction proves `member_at_some_time?(s7, 20)` +is false. + +#### Consumer Move Handling + +When `shape_a` receives the `s7` move from `0` to `1`, `ActiveMove` stores: + +```elixir +%ActiveMove{ + subquery_id: s7, + from_time: 0, + to_time: 1, + values: [{30, "30"}] +} +``` + +It does not store: + +```text +views_before_move: MapSet.new([10, 20]) +views_after_move: MapSet.new([10, 20, 30]) +``` + +Buffered row conversion evaluates exact membership by calling +`MultiTimeView.member?/3` at `from_time` or `to_time`. Move-in SQL may +materialize `values(s7, 1)` as a query-local parameter array, but that memory +belongs to the query task and is released after the query. + +Steady memory added per active move is: + +```text +O(number_of_changed_values + number_of_subquery_refs) +``` + +not `O(V_s)`. + +#### `notify_processed_up_to` And Compaction + +After `shape_a` no longer needs time `0`, it calls: + +```elixir +SubqueryProgressMonitor.notify_processed_up_to(0, s7) +``` + +Progress monitor rows conceptually change from: + +```text +{s7, shape_a} -> required_time 0 +{s7, shape_b} -> required_time 0 +``` + +to: + +```text +{s7, shape_a} -> required_time 1 +{s7, shape_b} -> required_time 0 +``` + +The minimum is still `0`, so `MultiTimeView` cannot compact away time `0`. +After `shape_b` also notifies up to `0`, the minimum becomes `1`. Then: + +```text +{s7, 30} -> [:out, 1] +``` + +can compact to: + +```text +{s7, 30} -> [] +``` + +For a removed value: + +```text +{s7, 20} -> [:in, 2] +``` + +if the minimum required time later advances past `2`, compaction can delete the +`MultiTimeView` row and remove stale positive routes: + +```text +delete {s7, 20} +delete {:positive, g_user_pos, 20} -> c_s7_pos +``` + +Cost for notification: + +```text +O(progress_index_update + min_recompute_for_subquery) +``` + +With an index keyed by `{subquery_id, required_time, consumer_id}`, reading the +minimum is `O(1)` or `O(log consumers_for_subquery)` depending on the ETS +layout chosen. Compaction cost is paid separately and can be incremental. For +one compacted value it is: + +```text +O(H_v + positive_children_for_subquery) +``` + +If compaction is batched, total work is proportional to the histories visited +and the stale route rows removed. + +#### Move-In Query Construction + +For the `s7` move from time `0` to `1`, existing SQL generation may still need +arrays for the before and after views. The new design builds them from +`MultiTimeView` inside the query task: + +```elixir +values_for.(["$sublink", "0"], 0) -> [10, 20] +values_for.(["$sublink", "0"], 1) -> [10, 20, 30] +``` + +What is stored persistently: + +```text +nothing beyond the ActiveMove times and changed values +``` + +What is allocated transiently: + +```text +query-local arrays for values(s7, 0) and values(s7, 1) +move-in snapshot rows returned by Postgres +``` + +Cost for the compatibility implementation: + +```text +O(V_s + R) +``` + +where `R` is the number of root-table rows returned by the move-in query. + +This does not yet minimize move-in query memory, but it moves full-view arrays +out of steady consumer state and into short-lived query tasks. + +#### `remove_shape` + +Removing `shape_a` reads: + +```text +{:shape_child, shape_a} -> c_s7_pos +{:shape_subquery, shape_a, ["$sublink", "0"]} -> {s7, 1} +``` + +Rows removed: + +```text +{:child_shape, c_s7_pos} -> {shape_a, branch_a} +{:shape_child, shape_a} -> c_s7_pos +{:shape_subquery, shape_a, ["$sublink", "0"]} -> {s7, 1} +``` + +The monitor registration for `{shape_a, s7}` is removed. `shape_a` is removed +from the child `WhereCondition`. + +If `shape_b` still uses `c_s7_pos`, no value routing rows are touched. Cost is: + +```text +O(children_for_shape + subqueries_for_shape + child_where_remove) +``` + +If this removes the last shape from `c_s7_pos`, the child is deleted too: + +```text +{:child, g_user_pos, s7} +{:child_meta, c_s7_pos} +{:subquery_child, s7} -> c_s7_pos +{:positive, g_user_pos, value} -> c_s7_pos for each retained value +``` + +The positive route cleanup iterates `MultiTimeView.values(s7)` and deletes the +specific `{group, value, child}` route rows. That last-child case costs: + +```text +O(V_s + child_metadata) +``` + +It does not scan unrelated subqueries or unrelated shapes. + +#### `remove_subquery` + +Removing dependency subquery `s7` reads: + +```text +{:subquery_child, s7} -> c_s7_pos +{:subquery_child, s7} -> c_s7_neg +``` + +Then it removes: + +```text +child metadata for c_s7_pos and c_s7_neg +participant rows for shapes attached to those children +positive routing rows for s7 values +negated group rows for s7 negated children +MultiTimeView rows with key prefix s7 +progress monitor rows for s7 +``` + +Cost: + +```text +O(C_s + sum(P_c) + V_s) +``` + +This is proportional to the removed subquery's children, participants, and +values. It should not scan the whole `SubqueryIndex` or all shapes in the +stack. + ### Materializer Integration The materializer owns the source of truth for a dependency subquery. It should From 75be436c50db26255a74426beed6cd11367264da Mon Sep 17 00:00:00 2001 From: rob Date: Tue, 19 May 2026 08:58:49 +0100 Subject: [PATCH 13/15] Add memory figures --- docs/rfcs/subquery-index.md | 107 +++++++++++++++++++++++++++++++++++- 1 file changed, 106 insertions(+), 1 deletion(-) diff --git a/docs/rfcs/subquery-index.md b/docs/rfcs/subquery-index.md index 1d7e7bf192..7e058b008d 100644 --- a/docs/rfcs/subquery-index.md +++ b/docs/rfcs/subquery-index.md @@ -1,6 +1,6 @@ --- title: Shared Subquery Indexes with Logical-Time Views -version: "0.1" +version: "0.2" status: draft owner: robacourt contributors: [] @@ -1034,6 +1034,110 @@ This is proportional to the removed subquery's children, participants, and values. It should not scan the whole `SubqueryIndex` or all shapes in the stack. +### Memory Savings Prototype + +The prototype script is: + +```text +packages/sync-service/scripts/subquery_logical_time_memory.exs +``` + +Run it directly with Elixir so it does not start the sync-service application: + +```sh +elixir scripts/subquery_logical_time_memory.exs +``` + +There is also a focused test file: + +```text +packages/sync-service/test/electric/shapes/filter/subquery_logical_time_memory_bench_test.exs +``` + +The prototype compares: + +- the current model: current `SubqueryIndex`-style ETS rows, per-consumer + `MapSet` views, and active-move before/after views; +- the logical-time model: shared `MultiTimeView` rows, shared child routing and + metadata rows, progress-monitor rows, compact per-consumer subquery + references, and active moves that store changed values plus logical times. + +The model intentionally uses small integer dependency values. That is +conservative for workloads with large text, UUID, or composite values because +the current model duplicates those values per shape, while the logical-time +model stores them once per retained subquery value plus routing rows. + +The local run below was generated on: + +```text +OTP: 28 +Elixir: 1.19.5 +Architecture: aarch64-apple-darwin24.5.0 +Word size: 8 bytes +``` + +#### Local Measured Scenarios + +| Scenario | Current total | Current index | Current consumers | Logical total | Logical ETS | Logical consumers | Savings | +|----------|---------------|---------------|-------------------|---------------|-------------|-------------------|---------| +| 1 shape, 1k values, steady | 331.6 KiB | 302.4 KiB | 29.3 KiB | 222.9 KiB | 222.6 KiB | 256 B | 32.8% | +| 10 shapes, 1k values, steady | 3.2 MiB | 2.91 MiB | 292.5 KiB | 229.1 KiB | 226.6 KiB | 2.5 KiB | 93.0% | +| 100 shapes, 1k values, steady | 31.92 MiB | 29.06 MiB | 2.86 MiB | 290.9 KiB | 265.9 KiB | 25.0 KiB | 99.1% | +| 100 shapes, 10k values, steady | 318.9 MiB | 290.02 MiB | 28.87 MiB | 1.78 MiB | 1.76 MiB | 25.0 KiB | 99.4% | +| 100 shapes, 1k base, 100 added x 10 advanced | 32.24 MiB | 29.35 MiB | 2.88 MiB | 309.6 KiB | 284.6 KiB | 25.0 KiB | 99.1% | +| 100 shapes, 1k base, 100 added x 99 advanced | 35.07 MiB | 31.94 MiB | 3.13 MiB | 309.6 KiB | 284.6 KiB | 25.0 KiB | 99.1% | +| 100 shapes, 1k base, 100 added x 10 active move | 32.87 MiB | 29.35 MiB | 3.52 MiB | 349.8 KiB | 284.6 KiB | 65.2 KiB | 99.0% | +| 100 shapes, 1k base, 1k added x 99 active move | 75.51 MiB | 57.77 MiB | 17.75 MiB | 4.25 MiB | 453.4 KiB | 3.81 MiB | 94.4% | + +Interpretation: + +- One-shape cohorts still save memory, but only by a constant factor. There is + no sharing benefit when a subquery has one participant. +- Shared steady-state cohorts get the largest win because the current model + stores value membership and consumer views once per shape. +- Active moves remain materially smaller because the logical-time model stores + changed values and times, not before and after full dependency views. +- The harsh `1k added x 99 active move` case still grows because every active + move stores the changed values. It is still much smaller than the current + model because it avoids duplicating the 1k base view twice per active move. + +#### Customer-Shaped Estimates + +These estimates use the same script. They extrapolate from measured row costs +and use the customer workload ratios from PR #4280: + +- HumanLayer: 75 observed `WHERE` clauses, 134 subquery occurrences, 13 literal + cohorts. +- AutoArc: 611 observed `WHERE` clauses, 291 subquery occurrences, 209 literal + cohorts. +- Hazel: 13 observed shape handles, 4 subquery occurrences, 4 literal cohorts. + +The extrapolation is for 100k shapes and preserves each workload's observed +ratio of subquery occurrences to literal cohorts. + +| Customer | Observed occurrences -> cohorts | Shared occurrences | Participants @100k | Cohorts @100k | Rows/subquery | Current | Logical-time | Savings | +|----------|---------------------------------|--------------------|--------------------|--------------|---------------|---------|--------------|---------| +| HumanLayer | 134 -> 13 | 90.3% | 178,667 | 17,334 | 1,000 | 55.77 GiB | 4.2 GiB | 92.5% | +| HumanLayer | 134 -> 13 | 90.3% | 178,667 | 17,334 | 10,000 | 556.19 GiB | 40.59 GiB | 92.7% | +| AutoArc | 291 -> 209 | 28.2% | 47,627 | 34,207 | 1,000 | 14.87 GiB | 8.04 GiB | 45.9% | +| AutoArc | 291 -> 209 | 28.2% | 47,627 | 34,207 | 10,000 | 148.26 GiB | 79.86 GiB | 46.1% | +| Hazel | 4 -> 4 | 0.0% | 30,770 | 30,770 | 1,000 | 9.61 GiB | 7.23 GiB | 24.8% | +| Hazel | 4 -> 4 | 0.0% | 30,770 | 30,770 | 10,000 | 95.79 GiB | 71.83 GiB | 25.0% | + +Interpretation: + +- HumanLayer benefits most because the captured workload has high literal + subquery sharing. +- AutoArc still benefits, but many literal subqueries are not shared, so the + logical-time model stores more per-cohort shared views. +- Hazel has no observed literal sharing. The estimate still shows a constant + factor reduction because the current model stores both index membership rows + and consumer `MapSet` views per shape, while the logical-time model stores + one shared view per one-participant cohort and compact consumer references. +- If a production workload has one-off subqueries with large dependency views, + the logical-time design is still better than current state, but it is not the + main win. The main win comes when multiple shapes share a subquery. + ### Materializer Integration The materializer owns the source of truth for a dependency subquery. It should @@ -1322,6 +1426,7 @@ state in both `SubqueryIndex` and consumer event handlers. | Version | Date | Author | Changes | |---------|------|--------|---------| +| 0.2 | 2026-05-18 | robacourt | Added operation examples, a memory prototype script, measured local memory scenarios, and customer-shaped estimates based on PR #4280 ratios. | | 0.1 | 2026-05-18 | robacourt | Initial draft using the Stratovolt RFC template and alternatives from PR #4280. | --- From 98cf709266837bddffd20dd75f47a08ffea588ea Mon Sep 17 00:00:00 2001 From: rob Date: Tue, 19 May 2026 09:04:29 +0100 Subject: [PATCH 14/15] Remove use of work 'cohort' --- docs/rfcs/subquery-index.md | 33 ++++++++++++++++++--------------- 1 file changed, 18 insertions(+), 15 deletions(-) diff --git a/docs/rfcs/subquery-index.md b/docs/rfcs/subquery-index.md index 7e058b008d..a8f04762be 100644 --- a/docs/rfcs/subquery-index.md +++ b/docs/rfcs/subquery-index.md @@ -1091,9 +1091,9 @@ Word size: 8 bytes Interpretation: -- One-shape cohorts still save memory, but only by a constant factor. There is - no sharing benefit when a subquery has one participant. -- Shared steady-state cohorts get the largest win because the current model +- Subqueries used by one shape still save memory, but only by a constant + factor. There is no sharing benefit when a subquery has one participant. +- Shared steady-state subqueries get the largest win because the current model stores value membership and consumer views once per shape. - Active moves remain materially smaller because the logical-time model stores changed values and times, not before and after full dependency views. @@ -1106,17 +1106,20 @@ Interpretation: These estimates use the same script. They extrapolate from measured row costs and use the customer workload ratios from PR #4280: -- HumanLayer: 75 observed `WHERE` clauses, 134 subquery occurrences, 13 literal - cohorts. -- AutoArc: 611 observed `WHERE` clauses, 291 subquery occurrences, 209 literal - cohorts. -- Hazel: 13 observed shape handles, 4 subquery occurrences, 4 literal cohorts. +- HumanLayer: 75 observed `WHERE` clauses, 134 subquery occurrences, 13 + distinct literal subqueries. +- AutoArc: 611 observed `WHERE` clauses, 291 subquery occurrences, 209 + distinct literal subqueries. +- Hazel: 13 observed shape handles, 4 subquery occurrences, 4 distinct literal + subqueries. The extrapolation is for 100k shapes and preserves each workload's observed -ratio of subquery occurrences to literal cohorts. +ratio of subquery occurrences to distinct literal subqueries. A distinct +literal subquery here means a distinct dependency subquery, not a subquery +group. -| Customer | Observed occurrences -> cohorts | Shared occurrences | Participants @100k | Cohorts @100k | Rows/subquery | Current | Logical-time | Savings | -|----------|---------------------------------|--------------------|--------------------|--------------|---------------|---------|--------------|---------| +| Customer | Observed occurrences -> distinct subqueries | Shared occurrences | Participants @100k | Distinct subqueries @100k | Rows/subquery | Current | Logical-time | Savings | +|----------|----------------------------------------------|--------------------|--------------------|---------------------------|---------------|---------|--------------|---------| | HumanLayer | 134 -> 13 | 90.3% | 178,667 | 17,334 | 1,000 | 55.77 GiB | 4.2 GiB | 92.5% | | HumanLayer | 134 -> 13 | 90.3% | 178,667 | 17,334 | 10,000 | 556.19 GiB | 40.59 GiB | 92.7% | | AutoArc | 291 -> 209 | 28.2% | 47,627 | 34,207 | 1,000 | 14.87 GiB | 8.04 GiB | 45.9% | @@ -1129,11 +1132,11 @@ Interpretation: - HumanLayer benefits most because the captured workload has high literal subquery sharing. - AutoArc still benefits, but many literal subqueries are not shared, so the - logical-time model stores more per-cohort shared views. + logical-time model stores more per-subquery shared views. - Hazel has no observed literal sharing. The estimate still shows a constant factor reduction because the current model stores both index membership rows and consumer `MapSet` views per shape, while the logical-time model stores - one shared view per one-participant cohort and compact consumer references. + one shared view per one-participant subquery and compact consumer references. - If a production workload has one-off subqueries with large dependency views, the logical-time design is still better than current state, but it is not the main win. The main win comes when multiple shapes share a subquery. @@ -1411,8 +1414,8 @@ as a follow-up optimization if measurements show cleanup cost is high. ### Alternative 6: Shared Base View With Sparse XOR Exceptions **Description:** The design in PR #4280 stores one base dependency view per -cohort and stores sparse per-participant XOR exceptions for values where a -participant temporarily differs from the base. +grouped subquery index entry and stores sparse per-participant XOR exceptions +for values where a participant temporarily differs from the base. **Why not:** This is a lower-risk, index-focused approach and may still be the right short-term fix if this RFC is too broad. However, it leaves consumer-held From 767442acc16fdf58d12d17a174a7d716f3b50f40 Mon Sep 17 00:00:00 2001 From: rob Date: Tue, 19 May 2026 09:19:22 +0100 Subject: [PATCH 15/15] Remove symbols --- docs/rfcs/subquery-index.md | 73 +++++++++++++++++++++---------------- 1 file changed, 42 insertions(+), 31 deletions(-) diff --git a/docs/rfcs/subquery-index.md b/docs/rfcs/subquery-index.md index a8f04762be..864daa8288 100644 --- a/docs/rfcs/subquery-index.md +++ b/docs/rfcs/subquery-index.md @@ -473,16 +473,6 @@ WHERE user_id IN (SELECT id FROM users WHERE company_id = 8) WHERE user_id NOT IN (SELECT id FROM users WHERE company_id = 7) ``` -Symbols used below: - -- `V_s`: values in subquery `s` over the retained window. -- `H_v`: transition history length for one `{subquery_id, value}` row. -- `C_s`: child nodes attached to subquery `s`. -- `P_c`: outer shapes attached to child node `c`. -- `K`: changed values in one dependency move. -- `N_g`: child nodes in a negated group. -- `R`: root-table candidate rows returned by a move-in SQL query. - #### Initial `MultiTimeView` State The initial materializer state for `s7` stores one row per dependency value, @@ -498,8 +488,8 @@ not one row per outer shape: The empty history means the value is present for the whole retained window. -Memory is `O(V_s)` for the shared view. In this example, `shape_a` and -`shape_b` do not duplicate `{10, 20}`. +Memory is `O(number_of_values_in_subquery_retained_window)` for the shared +view. In this example, `shape_a` and `shape_b` do not duplicate `{10, 20}`. #### `register_subquery_consumer` @@ -583,12 +573,17 @@ What is evaluated: Cost: ```text -O(number_of_subquery_occurrences_in_shape + V_s + child_where_insert) +O( + number_of_subquery_occurrences_in_shape + + number_of_values_in_s7_retained_window + + child_where_insert +) ``` -The `V_s` term only applies because this is the first child for -`{g_user_pos, s7}`. Memory added is `O(V_s)` positive routing rows for the -child plus `O(number_of_subquery_occurrences_in_shape)` participant rows. +The value-count term only applies because this is the first child for +`{g_user_pos, s7}`. Memory added is +`O(number_of_values_in_s7_retained_window)` positive routing rows for the child +plus `O(number_of_subquery_occurrences_in_shape)` participant rows. #### `add_shape`: Additional Shape Sharing An Existing Child @@ -617,7 +612,8 @@ Cost: O(number_of_subquery_occurrences_in_shape + child_where_insert) ``` -Memory added is per-shape metadata only, not `O(V_s)`. +Memory added is per-shape metadata only, not +`O(number_of_values_in_s7_retained_window)`. #### `add_shape`: Same Group, Different Subquery @@ -634,8 +630,8 @@ Rows added include: {:shape_subquery, shape_c, ["$sublink", "0"]} -> {s8, 0} ``` -Cost is `O(V_s8)` for the first `s8` child in this group. This is expected: -`s8` has different dependency values from `s7`. +Cost is `O(number_of_values_in_s8_retained_window)` for the first `s8` child in +this group. This is expected: `s8` has different dependency values from `s7`. #### `add_shape`: Negated Shape @@ -664,7 +660,8 @@ Cost: O(number_of_subquery_occurrences_in_shape + child_where_insert) ``` -Memory added for negated routing is `O(1)` per child, not `O(V_s)`. +Memory added for negated routing is `O(1)` per child, not +`O(number_of_values_in_s7_retained_window)`. #### `affected_shapes`: Positive Group @@ -693,7 +690,11 @@ Both shapes are affected. Cost: ```text -O(children_for_value + child_where_eval + exact_subquery_checks * H_v) +O( + children_for_value + + child_where_eval + + exact_subquery_checks * transition_history_length_for_value +) ``` For this example, `children_for_value = 1`. There is no scan of all shapes and @@ -736,7 +737,11 @@ Only `shape_a` is affected. Cost remains: ```text -O(children_for_value + child_where_eval + exact_subquery_checks * H_v) +O( + children_for_value + + child_where_eval + + exact_subquery_checks * transition_history_length_for_value +) ``` The extra memory for the move is one history row for `{s7, 30}` plus one @@ -769,7 +774,11 @@ advances to logical time `1`, `NOT IN s7` is false for `30`. Cost: ```text -O(N_g * H_v + child_where_eval + exact_subquery_checks * H_v) +O( + number_of_negated_children_in_group * transition_history_length_for_value + + child_where_eval + + exact_subquery_checks * transition_history_length_for_value +) ``` This is intentionally proportional to the number of affected negated children. @@ -812,7 +821,7 @@ What is evaluated: Cost: ```text -O(K * (history_update + C_s)) +O(number_of_changed_values * (history_update + child_nodes_for_subquery)) ``` For a remove of `20` from `s7` at time `2`, the history becomes: @@ -856,7 +865,7 @@ Steady memory added per active move is: O(number_of_changed_values + number_of_subquery_refs) ``` -not `O(V_s)`. +not `O(number_of_values_in_s7_retained_window)`. #### `notify_processed_up_to` And Compaction @@ -919,7 +928,7 @@ layout chosen. Compaction cost is paid separately and can be incremental. For one compacted value it is: ```text -O(H_v + positive_children_for_subquery) +O(transition_history_length_for_value + positive_children_for_subquery) ``` If compaction is batched, total work is proportional to the histories visited @@ -952,11 +961,9 @@ move-in snapshot rows returned by Postgres Cost for the compatibility implementation: ```text -O(V_s + R) +O(number_of_values_in_s7_retained_window + root_rows_returned_by_move_in_query) ``` -where `R` is the number of root-table rows returned by the move-in query. - This does not yet minimize move-in query memory, but it moves full-view arrays out of steady consumer state and into short-lived query tasks. @@ -999,7 +1006,7 @@ The positive route cleanup iterates `MultiTimeView.values(s7)` and deletes the specific `{group, value, child}` route rows. That last-child case costs: ```text -O(V_s + child_metadata) +O(number_of_values_in_s7_retained_window + child_metadata) ``` It does not scan unrelated subqueries or unrelated shapes. @@ -1027,7 +1034,11 @@ progress monitor rows for s7 Cost: ```text -O(C_s + sum(P_c) + V_s) +O( + child_nodes_for_subquery + + sum(shapes_attached_to_each_child) + + number_of_values_in_s7_retained_window +) ``` This is proportional to the removed subquery's children, participants, and