Add Continuous Materialized View to emulator#30
Add Continuous Materialized View to emulator#30jakeheft wants to merge 8 commits intobitly:masterfrom
Conversation
76994ff to
5e7c996
Compare
| var deletedKeys []string // populated for prefix case; nil for deleteAll | ||
|
|
||
| tbl.mu.Lock() | ||
| defer tbl.mu.Unlock() |
There was a problem hiding this comment.
I removed defer tbl.mu.Unlock() in functions that call CMV propagation helpers — those helpers acquire s.mu, and holding tbl.mu into them would invert the lock order and risk deadlock. Error paths now carry their own explicit unlocks as a result.
ea23f26 to
a30ef55
Compare
| s.mu.Lock() | ||
| sourceTbl := s.tables[fqSourceTable] | ||
| for _, cmv := range cmvs { | ||
| s.ensureCMVTable(cmv, sourceTbl) |
There was a problem hiding this comment.
it looks like we check if the CMV table exists on all inserts and updates can that be moved into NewServer func?
There was a problem hiding this comment.
It can't be moved there because the source table (and its column families) may not exist yet at registration time. The check on writes is just a map lookup once the shadow table exists — it returns immediately after the first write creates it.
| for _, action := range cmvActions { | ||
| if action.deleted { | ||
| s.deleteCMVRow(req.TableName, action.key) | ||
| } else { | ||
| s.syncCMVRow(req.TableName, action.rowCopy) | ||
| } | ||
| } |
There was a problem hiding this comment.
is it only possible to propagate CMV mutations if the primary table is unlocked? Ideally Id like to keep the defer tbl.mu.Unlock() line on 986 where it is
There was a problem hiding this comment.
The lock ordering throughout the file is s.mu → tbl.mu. Things like ReadRows hold s.mu and then acquire tbl.mu, so holding tbl.mu while syncCMVRow tries to grab s.mu would deadlock.
We could pre-resolve the view tables under s.mu before the defer to restore it, but it adds a pre-pass on every mutation. Happy to make that change if you prefer, but the lock ordering constraint itself can't be avoided.
bttest/cmv.go
Outdated
| func ParseCMVConfigFromSQL(viewID, query string) (*CMVConfig, error) { | ||
| cfg := &CMVConfig{ | ||
| ViewID: viewID, | ||
| KeySeparator: "#", |
There was a problem hiding this comment.
should this use the key seperator in the CMVConfig
There was a problem hiding this comment.
The "#" default doesn't actually matter here because ParseCMVConfigFromSQL derives the separator from the SQL itself and the default is never actually used. I'll remove it.
bttest/cmv_test.go
Outdated
| SourceTable: "events", | ||
| ViewID: "events_by_account", | ||
| KeySeparator: "#", | ||
| KeyMapping: []int{3, 4, 1, 2, 0}, |
There was a problem hiding this comment.
What happens if KeyMapping is short a key?
There was a problem hiding this comment.
This is by design as key mapping doesn't need to produce a 1:1 mapping with the source table. In attributions we actually won't pull on keys from attributions_conversion_events into attributions_conversion_events_by_client.
But that's a good callout as it makes me wonder about the key mapping being too many elements. I think it will just map empty strings which we don't want. I'll go ahead and add in a guard for this
1f1064c to
87dd05b
Compare
bttest/cmv.go
Outdated
| cfg.SourceTable = fromMatch[1] | ||
|
|
||
| // Extract column aliases and their SAFE_OFFSET indices from SELECT. | ||
| offsetRe := regexp.MustCompile(`SPLIT\(_key,\s*'([^']+)'\)\[SAFE_OFFSET\((\d+)\)\].*?\bAS\s+\w+\)\s+AS\s+(\w+)|SPLIT\(_key,\s*'([^']+)'\)\[SAFE_OFFSET\((\d+)\)\]\s+AS\s+(\w+)`) |
There was a problem hiding this comment.
Are there other options to parse the SQL with a parser instead of regex? It feels like this is very over-fitted to key constructs?
There was a problem hiding this comment.
I've been struggling to find something that can parse this specific SQL since this uses BigQuery-specific syntax SPLIT(_key, '#')[SAFE_OFFSET(n)]. I did find https://github.com/goccy/go-zetasql but that uses CGO (which is also used by sqllite package) but there are some downsides to the dependency (very large build dependency, not actively maintained and only 2 contributors, doesn't necessarily simplify parsing logic we need to write).
So while the regex isn't great, I don't think I've come across any options I like better
There was a problem hiding this comment.
I also had cursor whip up a string splitting approach but it adds a good deal of complexity and I don't think it makes it any more readable. I'm happy to push up a commit if you want to take a look and compare against the regex approach
There was a problem hiding this comment.
Let's align on example (test cases) of SQL strings we should be able to parse - and let that drive us.
Do you think we would benefit from splitting this parsing (ParseCMVConfigFromSQL) off into to a sql_parse.go to make it easier to review what's happening and test it specifically?
There was a problem hiding this comment.
Here are the SQL patterns the parser should handle, based on the Bigtable CMV query docs :
Supported (ORDER BY secondary index with SPLIT-based re-keying):
SPLIT(_key, '#')[SAFE_OFFSET(n)] AS alias— plain key component extraction_key AS alias in SELECT + ORDER BY— appends full source keyfamily AS family— column family inclusion- Custom separators (not just #)
I took another crack at a parser and got something I feel good about and pushed that up. You can see the 4 sql patterns we accept in the tests. The parser will return a clear error for any SQL it can't handle, so unsupported patterns fail loudly at CreateMaterializedView time rather than silently producing wrong results.
There was a problem hiding this comment.
@jakeheft I took the liberty pushing commits to rewrite the tests to a table driven test and added the following example query from Googles integration test which should parse, but doesn't.
https://github.com/googleapis/google-cloud-go/blob/main/bigtable/integration_test.go#L4831
There was a problem hiding this comment.
@jehiah Thanks for updating tests! Regarding the sql statements that won't parse, I left a comment about that in the code here:
// Only ORDER BY (secondary index) queries are supported. GROUP BY
// (aggregation) queries require maintaining running aggregates and are
// not implemented by the emulator.ORDER BY CMVs are pure key re-mapping — every write produces a deterministic output row, so the emulator just transforms the key on the fly. GROUP BY aggregation CMVs require maintaining running state, meaning the emulator would need to re-aggregate across all existing rows on every write. I opted not to add that as it's a fundamentally different execution model, not just a parser extension.
There was a problem hiding this comment.
I added a commit that sends an error back for any GROUP BY queries and updated the test cases accordingly.
Let me know if you disagree on this approach, but I think if we want to bring in the ability to account for GROUP BY then that blows up our scope here and that should be moved to its own ticket (and probably needs architecture/design discussions)
There was a problem hiding this comment.
Ok - I think i can be convinced to add minimal CMV support for 1:1 mappings, but we should support all the supported functions that are allowed in a CMV. I am ok if we also dissallow subqueries.
0eb2613 to
02e3325
Compare
| // syncCMVRow writes/updates the CMV shadow row for a given source row mutation. | ||
| // Must NOT be called with s.mu held (acquires its own locks). | ||
| func (s *server) syncCMVRow(fqSourceTable string, sourceRow *row) { | ||
| cmvs := s.cmvs.cmvsForTable(fqSourceTable) |
There was a problem hiding this comment.
these calls need to be within the s.mu.lock()/unlock to avoid concurrent map read/write. not sure it would actually happen in a dev scenario but for the case that something is writing to the table while a new cmv is getting created
There was a problem hiding this comment.
Good call. Rather than wrapping in s.mu (which would deadlock since syncCMVRow acquires s.mu internally), I added a dedicated RWMutex directly on cmvRegistry so reads and writes to the config map are independently safe. Same outcome, avoids the lock ordering issue.
Let me know if you disagree with that approach
zoemccormick
left a comment
There was a problem hiding this comment.
a couple of small comments. also as a note, I don't see the ability to pass in --cmv-config as described in the PR comment, not sure if that was by design after iterating but it would be nice to be able to test this locally without needing to create another program to call CreateMaterializedView if possible (or, if not possible, maybe we could include a small program that we can run locally to do this). interesting learning cmvs by reviewing this!
68e9b2b to
e7e837c
Compare
@zoemccormick yes, this was removed during development in order to match how production runs. But a good callout on testing so I've added it back in. |
I think we should not add a CLI argument to configure CMV and we should only support grpc as we do with all table configs, etc. It's good to avoid multiple ways to do something since this is an open source project and we want to avoid space for backwards incompatible changes which is likely where we specify our own config format - we want to enforce we are matching the google APIs for configuration. (presumably this will eventually be exposed in If we have tests we want to exercise we should just commit them as Go code to setup the state we want with the google APIs. |
| materializedViews: make(map[string]*btapb.MaterializedView), | ||
| db: db, | ||
| tableBackend: NewSqlTables(db), | ||
| cmvs: newCMVRegistry(), |
There was a problem hiding this comment.
newCMVRegistry is currently in-memory; the CMV's need to be persisted in db; little_bigtables design promise - it's whole point originally - is that it persists where the google provided emulator is in-memory.
I know this is a change because initially this implementation was framed around reading in a bespoke config instead of supporting the provisioning endpoints.
Description
This PR adds the ability to use a Continuous Materialized View (CMV) in the little bigtable emulator.
https://docs.cloud.google.com/bigtable/docs/continuous-materialized-views
CMVs are created via the standard CreateMaterializedView gRPC call, the same as production. On write, the emulator automatically creates the shadow table (if needed), transforms the row key per the SQL query, and syncs the row. Deletes propagate too.
JIRA
Resolves SQ-4524
Testing
Testing
Unit tests
go test -race ./bttest/... -vLocal end-to-end (using
--cmv-config)1. Build the emulator
go build -o little_bigtable .2. Create a CMV config file
3. Start the emulator
./little_bigtable --port 9000 --db-file /tmp/lbt.db --cmv-config /tmp/cmv_config.json export BIGTABLE_EMULATOR_HOST=localhost:90004. Create the source table
5. Write a row to the source table
Source key format:
item_id#ts#type#region#account_id6. Verify the row exists in the source table
7. Verify the re-keyed row appears in the CMV shadow table
Expected key:
region-a#account-42#9999999#type-x#item-abc#item-abc#9999999#type-x#region-a#account-428. Verify delete propagation
Local end-to-end (using
CreateMaterializedViewgRPC)The same flow above can be driven programmatically using the standard Go admin client
pointed at
BIGTABLE_EMULATOR_HOST. This matches the production code path and exercisesthe SQL parser. See
CMV_SUPPORT.mdfor a Go snippet.Deploy