Skip to content

Add Continuous Materialized View to emulator#30

Open
jakeheft wants to merge 8 commits intobitly:masterfrom
jakeheft:add_cmv_to_emulator
Open

Add Continuous Materialized View to emulator#30
jakeheft wants to merge 8 commits intobitly:masterfrom
jakeheft:add_cmv_to_emulator

Conversation

@jakeheft
Copy link
Copy Markdown

@jakeheft jakeheft commented Mar 17, 2026

Description

This PR adds the ability to use a Continuous Materialized View (CMV) in the little bigtable emulator.
https://docs.cloud.google.com/bigtable/docs/continuous-materialized-views

CMVs are created via the standard CreateMaterializedView gRPC call, the same as production. On write, the emulator automatically creates the shadow table (if needed), transforms the row key per the SQL query, and syncs the row. Deletes propagate too.

JIRA

Resolves SQ-4524

Testing

Testing

Unit tests

go test -race ./bttest/... -v

Local end-to-end (using --cmv-config)

1. Build the emulator

go build -o little_bigtable .

2. Create a CMV config file

cat > /tmp/cmv_config.json << 'EOF'
[
  {
    "source_table": "events",
    "view_id": "events_by_account",
    "key_separator": "#",
    "key_mapping": [3, 4, 1, 2, 0],
    "append_source_key": true,
    "include_families": ["cf1"]
  }
]
EOF

3. Start the emulator

./little_bigtable --port 9000 --db-file /tmp/lbt.db --cmv-config /tmp/cmv_config.json
export BIGTABLE_EMULATOR_HOST=localhost:9000

4. Create the source table

cbt -project local -instance local createtable events
cbt -project local -instance local createfamily events cf1

5. Write a row to the source table

Source key format: item_id#ts#type#region#account_id

cbt -project local -instance local set events \
  "item-abc#9999999#type-x#region-a#account-42" cf1:data=test

6. Verify the row exists in the source table

cbt -project local -instance local read events

7. Verify the re-keyed row appears in the CMV shadow table

Expected key: region-a#account-42#9999999#type-x#item-abc#item-abc#9999999#type-x#region-a#account-42

cbt -project local -instance local read events_by_account

8. Verify delete propagation

# Delete the source row
cbt -project local -instance local deleterow events "item-abc#9999999#type-x#region-a#account-42"

# Confirm source row is gone
cbt -project local -instance local read events

# Confirm CMV row is also gone
cbt -project local -instance local read events_by_account

Local end-to-end (using CreateMaterializedView gRPC)

The same flow above can be driven programmatically using the standard Go admin client
pointed at BIGTABLE_EMULATOR_HOST. This matches the production code path and exercises
the SQL parser. See CMV_SUPPORT.md for a Go snippet.

Deploy

  • squash
  • merge

@jakeheft jakeheft force-pushed the add_cmv_to_emulator branch 2 times, most recently from 76994ff to 5e7c996 Compare March 18, 2026 22:20
var deletedKeys []string // populated for prefix case; nil for deleteAll

tbl.mu.Lock()
defer tbl.mu.Unlock()
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed defer tbl.mu.Unlock() in functions that call CMV propagation helpers — those helpers acquire s.mu, and holding tbl.mu into them would invert the lock order and risk deadlock. Error paths now carry their own explicit unlocks as a result.

@jakeheft jakeheft force-pushed the add_cmv_to_emulator branch from ea23f26 to a30ef55 Compare March 18, 2026 23:15
s.mu.Lock()
sourceTbl := s.tables[fqSourceTable]
for _, cmv := range cmvs {
s.ensureCMVTable(cmv, sourceTbl)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it looks like we check if the CMV table exists on all inserts and updates can that be moved into NewServer func?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It can't be moved there because the source table (and its column families) may not exist yet at registration time. The check on writes is just a map lookup once the shadow table exists — it returns immediately after the first write creates it.

Comment on lines +1190 to +1196
for _, action := range cmvActions {
if action.deleted {
s.deleteCMVRow(req.TableName, action.key)
} else {
s.syncCMVRow(req.TableName, action.rowCopy)
}
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it only possible to propagate CMV mutations if the primary table is unlocked? Ideally Id like to keep the defer tbl.mu.Unlock() line on 986 where it is

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The lock ordering throughout the file is s.mutbl.mu. Things like ReadRows hold s.mu and then acquire tbl.mu, so holding tbl.mu while syncCMVRow tries to grab s.mu would deadlock.

We could pre-resolve the view tables under s.mu before the defer to restore it, but it adds a pre-pass on every mutation. Happy to make that change if you prefer, but the lock ordering constraint itself can't be avoided.

bttest/cmv.go Outdated
func ParseCMVConfigFromSQL(viewID, query string) (*CMVConfig, error) {
cfg := &CMVConfig{
ViewID: viewID,
KeySeparator: "#",
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this use the key seperator in the CMVConfig

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The "#" default doesn't actually matter here because ParseCMVConfigFromSQL derives the separator from the SQL itself and the default is never actually used. I'll remove it.

SourceTable: "events",
ViewID: "events_by_account",
KeySeparator: "#",
KeyMapping: []int{3, 4, 1, 2, 0},
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if KeyMapping is short a key?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is by design as key mapping doesn't need to produce a 1:1 mapping with the source table. In attributions we actually won't pull on keys from attributions_conversion_events into attributions_conversion_events_by_client.

But that's a good callout as it makes me wonder about the key mapping being too many elements. I think it will just map empty strings which we don't want. I'll go ahead and add in a guard for this

@jakeheft jakeheft force-pushed the add_cmv_to_emulator branch 2 times, most recently from 1f1064c to 87dd05b Compare March 19, 2026 21:09
bttest/cmv.go Outdated
cfg.SourceTable = fromMatch[1]

// Extract column aliases and their SAFE_OFFSET indices from SELECT.
offsetRe := regexp.MustCompile(`SPLIT\(_key,\s*'([^']+)'\)\[SAFE_OFFSET\((\d+)\)\].*?\bAS\s+\w+\)\s+AS\s+(\w+)|SPLIT\(_key,\s*'([^']+)'\)\[SAFE_OFFSET\((\d+)\)\]\s+AS\s+(\w+)`)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there other options to parse the SQL with a parser instead of regex? It feels like this is very over-fitted to key constructs?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been struggling to find something that can parse this specific SQL since this uses BigQuery-specific syntax SPLIT(_key, '#')[SAFE_OFFSET(n)]. I did find https://github.com/goccy/go-zetasql but that uses CGO (which is also used by sqllite package) but there are some downsides to the dependency (very large build dependency, not actively maintained and only 2 contributors, doesn't necessarily simplify parsing logic we need to write).

So while the regex isn't great, I don't think I've come across any options I like better

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also had cursor whip up a string splitting approach but it adds a good deal of complexity and I don't think it makes it any more readable. I'm happy to push up a commit if you want to take a look and compare against the regex approach

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's align on example (test cases) of SQL strings we should be able to parse - and let that drive us.

Do you think we would benefit from splitting this parsing (ParseCMVConfigFromSQL) off into to a sql_parse.go to make it easier to review what's happening and test it specifically?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here are the SQL patterns the parser should handle, based on the Bigtable CMV query docs :
Supported (ORDER BY secondary index with SPLIT-based re-keying):

  1. SPLIT(_key, '#')[SAFE_OFFSET(n)] AS alias — plain key component extraction
  2. _key AS alias in SELECT + ORDER BY — appends full source key
  3. family AS family — column family inclusion
  4. Custom separators (not just #)

I took another crack at a parser and got something I feel good about and pushed that up. You can see the 4 sql patterns we accept in the tests. The parser will return a clear error for any SQL it can't handle, so unsupported patterns fail loudly at CreateMaterializedView time rather than silently producing wrong results.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jakeheft I took the liberty pushing commits to rewrite the tests to a table driven test and added the following example query from Googles integration test which should parse, but doesn't.

https://github.com/googleapis/google-cloud-go/blob/main/bigtable/integration_test.go#L4831

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jehiah Thanks for updating tests! Regarding the sql statements that won't parse, I left a comment about that in the code here:

// Only ORDER BY (secondary index) queries are supported. GROUP BY
// (aggregation) queries require maintaining running aggregates and are
// not implemented by the emulator.

ORDER BY CMVs are pure key re-mapping — every write produces a deterministic output row, so the emulator just transforms the key on the fly. GROUP BY aggregation CMVs require maintaining running state, meaning the emulator would need to re-aggregate across all existing rows on every write. I opted not to add that as it's a fundamentally different execution model, not just a parser extension.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a commit that sends an error back for any GROUP BY queries and updated the test cases accordingly.

Let me know if you disagree on this approach, but I think if we want to bring in the ability to account for GROUP BY then that blows up our scope here and that should be moved to its own ticket (and probably needs architecture/design discussions)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok - I think i can be convinced to add minimal CMV support for 1:1 mappings, but we should support all the supported functions that are allowed in a CMV. I am ok if we also dissallow subqueries.

@jakeheft jakeheft force-pushed the add_cmv_to_emulator branch from 0eb2613 to 02e3325 Compare March 24, 2026 20:12
// syncCMVRow writes/updates the CMV shadow row for a given source row mutation.
// Must NOT be called with s.mu held (acquires its own locks).
func (s *server) syncCMVRow(fqSourceTable string, sourceRow *row) {
cmvs := s.cmvs.cmvsForTable(fqSourceTable)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these calls need to be within the s.mu.lock()/unlock to avoid concurrent map read/write. not sure it would actually happen in a dev scenario but for the case that something is writing to the table while a new cmv is getting created

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call. Rather than wrapping in s.mu (which would deadlock since syncCMVRow acquires s.mu internally), I added a dedicated RWMutex directly on cmvRegistry so reads and writes to the config map are independently safe. Same outcome, avoids the lock ordering issue.

Let me know if you disagree with that approach

Copy link
Copy Markdown
Contributor

@zoemccormick zoemccormick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a couple of small comments. also as a note, I don't see the ability to pass in --cmv-config as described in the PR comment, not sure if that was by design after iterating but it would be nice to be able to test this locally without needing to create another program to call CreateMaterializedView if possible (or, if not possible, maybe we could include a small program that we can run locally to do this). interesting learning cmvs by reviewing this!

@jakeheft jakeheft force-pushed the add_cmv_to_emulator branch from 68e9b2b to e7e837c Compare March 26, 2026 21:34
@jakeheft
Copy link
Copy Markdown
Author

I don't see the ability to pass in --cmv-config as described in the PR comment, not sure if that was by design after iterating but it would be nice to be able to test this locally without needing to create another program to call CreateMaterializedView if possible (or, if not possible, maybe we could include a small program that we can run locally to do this). interesting learning cmvs by reviewing this!

@zoemccormick yes, this was removed during development in order to match how production runs. But a good callout on testing so I've added it back in.

@jehiah
Copy link
Copy Markdown
Member

jehiah commented Mar 27, 2026

But a good callout on testing so I've added it back in.

I think we should not add a CLI argument to configure CMV and we should only support grpc as we do with all table configs, etc. It's good to avoid multiple ways to do something since this is an open source project and we want to avoid space for backwards incompatible changes which is likely where we specify our own config format - we want to enforce we are matching the google APIs for configuration. (presumably this will eventually be exposed in cbt as well)

If we have tests we want to exercise we should just commit them as Go code to setup the state we want with the google APIs.

materializedViews: make(map[string]*btapb.MaterializedView),
db: db,
tableBackend: NewSqlTables(db),
cmvs: newCMVRegistry(),
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

newCMVRegistry is currently in-memory; the CMV's need to be persisted in db; little_bigtables design promise - it's whole point originally - is that it persists where the google provided emulator is in-memory.

I know this is a change because initially this implementation was framed around reading in a bespoke config instead of supporting the provisioning endpoints.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

4 participants