Recently we had a production incident where two things were broken:
- some datasets had no computed key block cache at all
- some had an incomplete key block cache (we stored only the `AddPushSource` event, but no `Seed` or `SetDataSchema`, as earlier events were somehow lost)

The root cause of the incident has already been fixed (incorrect streaming of dataset entries during key block re-indexing), but it left some "data consequences" that were not 100% cleanly recovered.
So the idea is to enforce a few invariants that detect broken consistency in our state:
- All datasets in the `dataset_entries` table should be:
  - present in the `dataset_references` table as unique rows for the "Head" ref
  - present in the `dataset_statistics` table as unique rows
  - present in the `dataset_key_blocks` table (at least once)
- Records in `dataset_key_blocks` must respect validation criteria:
  - the `Seed` event is present and stays at sequence number 0
- Derivative datasets should be represented in the `dataset_dependecies` table as a "downstream" edge.
For these invariants:
- implement SQL queries that detect violations
- expose a Prometheus metric per problem type (i.e., a gauge with the anomaly count)
- configure alerts for new incidents
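As a starting point, the detection queries could look like the sketch below. The column names (`dataset_id`, `block_ref`, `sequence_number`, `event_type`) are assumptions about the schema and would need to be adjusted to the real table definitions:

```sql
-- Sketch of detection queries (assumed column names, adjust to actual schema).

-- Invariant 1a: datasets in dataset_entries missing a "Head" row
-- in dataset_references.
SELECT e.dataset_id, 'missing_head_ref' AS problem
FROM dataset_entries e
LEFT JOIN dataset_references r
       ON r.dataset_id = e.dataset_id
      AND r.block_ref  = 'Head'
WHERE r.dataset_id IS NULL

UNION ALL

-- Invariant 2: key block chains whose event at sequence number 0
-- is missing or is not a Seed event.
SELECT b.dataset_id, 'seed_not_first' AS problem
FROM dataset_key_blocks b
GROUP BY b.dataset_id
HAVING BOOL_OR(b.sequence_number = 0 AND b.event_type = 'Seed') IS NOT TRUE;
```

The remaining invariants (`dataset_statistics`, `dataset_key_blocks` presence, `dataset_dependecies` downstream edges) would follow the same `LEFT JOIN ... WHERE ... IS NULL` anti-join pattern, each tagged with its own `problem` label so the per-type gauges fall out naturally.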
Since these are quite heavyweight checks, it would be naive to bind Prometheus to those queries directly. Consider the following implementation strategy, which provides more control over performance:
- detect anomalies in the form of a materialized view in Postgres
- refresh it concurrently every 30-60 minutes (configure a cron job using `pg_cron` or something similar)
- bind Prometheus metrics to that materialized view, so scraping is just a quick read of the latest snapshot instead of a heavy computation
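Wired together, this strategy could look roughly like the following. The view name `dataset_anomalies` and the underlying query are illustrative; note that `REFRESH ... CONCURRENTLY` requires a unique index on the view:

```sql
-- Hypothetical sketch of the materialized-view approach.
CREATE MATERIALIZED VIEW dataset_anomalies AS
SELECT e.dataset_id, 'missing_head_ref' AS problem
FROM dataset_entries e
LEFT JOIN dataset_references r
       ON r.dataset_id = e.dataset_id
      AND r.block_ref  = 'Head'
WHERE r.dataset_id IS NULL;
-- ... UNION ALL the other invariant checks here.

-- CONCURRENTLY needs a unique index to diff against the old snapshot:
CREATE UNIQUE INDEX ON dataset_anomalies (dataset_id, problem);

-- pg_cron: refresh every 30 minutes without blocking readers.
SELECT cron.schedule(
  'refresh-dataset-anomalies',
  '*/30 * * * *',
  'REFRESH MATERIALIZED VIEW CONCURRENTLY dataset_anomalies'
);
```

The exporter side then stays cheap: a query like `SELECT problem, COUNT(*) FROM dataset_anomalies GROUP BY problem;` can back one gauge per problem type, scraped as often as Prometheus likes.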
In addition, it could make sense to trigger refreshes manually after deployments, since deployments are likely to cause 90%+ of incidents. The refresh should run after the deployment stabilizes: this could be manual, or it could be automated with a delay or a post-deploy hook.
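Whether manual or hooked into the deploy pipeline, the post-deploy step is the same single statement (assuming the view is named `dataset_anomalies` as in the sketch above); `CONCURRENTLY` keeps the old snapshot readable while it runs, so Prometheus scrapes are unaffected:

```sql
-- One-off refresh after a deployment has stabilized:
REFRESH MATERIALIZED VIEW CONCURRENTLY dataset_anomalies;
```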