Recently we had a production incident where two things were broken:
- some datasets had no computed key block cache at all
- some had an incomplete key block cache (we stored only the `AddPushSource` event, but no `Seed` or `SetDataSchema`, as earlier events were somehow lost)

The root cause of the incident has already been fixed (incorrect streaming of dataset entries during key block re-indexing), but it left some "data consequences" that were not 100% cleanly recovered.
So the idea is to enforce a few invariants that detect broken consistency in our state:
- All datasets in the `dataset_entries` table should be:
  - present in the `dataset_references` table as unique rows for the "Head" ref
  - present in the `dataset_statistics` table as unique rows
  - present in the `dataset_key_blocks` table (at least once)
- Records in `dataset_key_blocks` must respect validation criteria:
  - the `Seed` event is present and stays at sequence number 0
- Derivative datasets should be represented in the `dataset_dependecies` table as a "downstream" edge.
For these invariants:
- implement SQL queries that detect violations
- expose a Prometheus metric per problem type (i.e., a gauge with the anomaly count)
- configure alerts for new incidents
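As a starting point, the detection queries could look like the sketch below. The column names (`dataset_id`, `block_ref`, `sequence_number`, `event_type`) are assumptions about the schema and would need to be adjusted to the real table definitions:

```sql
-- Sketch of detection queries (assumed column names, adjust to actual schema).

-- Invariant 1a: datasets in dataset_entries missing a "Head" row
-- in dataset_references.
SELECT e.dataset_id, 'missing_head_ref' AS problem
FROM dataset_entries e
LEFT JOIN dataset_references r
       ON r.dataset_id = e.dataset_id
      AND r.block_ref  = 'Head'
WHERE r.dataset_id IS NULL

UNION ALL

-- Invariant 2: key block chains whose event at sequence number 0
-- is missing or is not a Seed event.
SELECT b.dataset_id, 'seed_not_first' AS problem
FROM dataset_key_blocks b
GROUP BY b.dataset_id
HAVING BOOL_OR(b.sequence_number = 0 AND b.event_type = 'Seed') IS NOT TRUE;
```

The remaining invariants (`dataset_statistics`, `dataset_key_blocks` presence, `dataset_dependecies` downstream edges) would follow the same `LEFT JOIN ... WHERE ... IS NULL` anti-join pattern, each tagged with its own `problem` label so the per-type gauges fall out naturally.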
Since these are quite heavyweight checks, it would be naive to bind Prometheus to those queries directly. Consider the following implementation strategy, which provides more control over performance:
- detect anomalies in the form of a materialized view in Postgres
- refresh it concurrently every 30-60 minutes (configure a cron job using `pg_cron` or something similar)
- bind Prometheus metrics to that materialized view, so scraping is just a quick read of the latest snapshot instead of a heavy computation
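Wired together, this strategy could look roughly like the following. The view name `dataset_anomalies` and the underlying query are illustrative; note that `REFRESH ... CONCURRENTLY` requires a unique index on the view:

```sql
-- Hypothetical sketch of the materialized-view approach.
CREATE MATERIALIZED VIEW dataset_anomalies AS
SELECT e.dataset_id, 'missing_head_ref' AS problem
FROM dataset_entries e
LEFT JOIN dataset_references r
       ON r.dataset_id = e.dataset_id
      AND r.block_ref  = 'Head'
WHERE r.dataset_id IS NULL;
-- ... UNION ALL the other invariant checks here.

-- CONCURRENTLY needs a unique index to diff against the old snapshot:
CREATE UNIQUE INDEX ON dataset_anomalies (dataset_id, problem);

-- pg_cron: refresh every 30 minutes without blocking readers.
SELECT cron.schedule(
  'refresh-dataset-anomalies',
  '*/30 * * * *',
  'REFRESH MATERIALIZED VIEW CONCURRENTLY dataset_anomalies'
);
```

The exporter side then stays cheap: a query like `SELECT problem, COUNT(*) FROM dataset_anomalies GROUP BY problem;` can back one gauge per problem type, scraped as often as Prometheus likes.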
In addition, it could make sense to trigger refreshes manually after deployments, since deployments are likely to cause 90%+ of incidents. The refresh should run after the deployment stabilizes: this could be manual, or it could be automated with a delay or a post-deploy hook.
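Whether manual or hooked into the deploy pipeline, the post-deploy step is the same single statement (assuming the view is named `dataset_anomalies` as in the sketch above); `CONCURRENTLY` keeps the old snapshot readable while it runs, so Prometheus scrapes are unaffected:

```sql
-- One-off refresh after a deployment has stabilized:
REFRESH MATERIALIZED VIEW CONCURRENTLY dataset_anomalies;
```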