data-review

Change-time review for data-engineering and financial-controlling work.

An agent skill that reviews the data impact of a code change: it re-runs the affected pipeline lanes and checks whether the numbers move the way the author said they would.

Code review checks the code; tests check the assertions you remembered to write. Neither runs a query, so locale mis-parses, sign flips, silent NULLs, join fan-out, and double counting all pass green. Scripts compute, agents judge: no agent does arithmetic a query can do, no script decides whether a number should have moved.

you             →  a PR touching parsers / transforms / models / deliverables
blast_radius    →  which lanes the diff hits                    (deterministic)
T1  tieouts     →  is the world in a known-good state?          (deterministic)
T2  rerun+diff  →  scratch rebuild vs committed baseline        (deterministic)
T3  row diff    →  row-level EXCEPT ALL + top-k moved groups    (deterministic)
judgment layer  →  declared / explained / UNEXPLAINED           (agents)
report          →  findings ranked by € impact + coverage table

Try it

A real before/after data bug caught in under a minute, no agent or account:

git clone https://github.com/10fra/data-review.git
cd data-review && ./demo.sh

It builds a toy two-stage pipeline, blesses a baseline, then drops a cents-to-dollars conversion in staging. A unit test and the orders-vs-staging tie-out both stay green (everything scaled in lockstep); the baseline diff catches sum(amount) jumping $134.50 → $13,450 and escalates it.

Setup

git clone https://github.com/10fra/data-review.git
cd data-review && ./install.sh

install.sh builds the venv, links the skill into ~/.claude/skills/, and runs the tests; re-run after a git pull. A consumer project carries a data-review.yml and committed baselines under .data-review/baselines/.

Usage

In a project with a data-review.yml, ask Claude Code:

review the data impact of this branch

Auto-discovered from ~/.claude/skills/, it runs the evidence scripts, then reviews the diff through ten failure-class lenses (each seeded from a real production bug), checks every delta against the PR's ## Expected data impact, and verifies each finding adversarially before it reaches the report.

Deterministic, so runnable by hand or in CI. The data-review launcher uses the repo's venv from any consumer directory:

data-review blast --manifest data-review.yml --diff main...HEAD   # diff × manifest → tier plan
data-review bless --manifest data-review.yml --lane <name>        # snapshot live state → committed baseline
data-review t1    --manifest data-review.yml                      # tieouts + live-vs-baseline conformance
data-review t2    --manifest data-review.yml --lane <name>        # scratch re-run, metric diff vs baseline
data-review t3    --manifest data-review.yml --lane <name> --scratch <db>  # row-level EXCEPT ALL
data-review xlsx  --manifest data-review.yml --workbook <path>    # Excel tie-outs, hardcodes, short SUMs

(alias dr=/path/to/data-review/data-review to shorten.)

The manifest

One data-review.yml per project, declaring what the skill cannot infer:

engine:                 # where the numbers live (DuckDB; adapters convert anything else)
lanes:                  # unit of blast radius: code globs + scratch-safe run command + output tables
baselines:              # metrics + group-by dims snapshotted into committed JSON
tieouts:                # cross-artifact invariants, verified on live data before being committed
deliverables:           # Excel workbooks audited against SQL
context:                # prose priors for the judgment agents (locale, sign conventions, known traps)

A lane run command must take a scratch-target override (scratch_env) or the skill refuses it, so a re-run never touches the canonical store. It must build what it consumes, or it re-runs stale code; non-DuckDB outputs convert there too.

How it works

Baselines are committed JSON. Re-blessing is a reviewed git change, so baseline drift shows up in diffs.
Tiers escalate mechanically. Any beyond-tolerance T2 delta makes the lane a T3 candidate; money_critical lanes always get T3. Whether the movement is explained is the agents' call.
Findings are falsifiable. Each carries a one-sentence claim, € impact from evidence, a reproduce command, and its verification votes.
Coverage is reported, not implied. The report ends with a lane × tier table; an empty report is not a clean review.

Proving ground

Tested on four stacks (one private, three public), replaying real historical bugs from their git history.

Project	Stack	Bug replayed / injected	Caught	Signal
consumer #1 (private)	TS parsers + DuckDB	supplier A amount explosion (real)	T2	sum €63.5M → €3.91B, 74 group deltas, zero leakage
consumer #1 (private)	TS parsers + DuckDB	supplier B US-locale dates (real)	T2	null_rate(accounting_period) +16pp, 96 group deltas
jaffle_shop_duckdb	dbt + DuckDB	fan-out / filter drift / unit-scale (injected)	T2	all three; onboarding took 3 minutes
owid/etl	Python ETL + feather	drug overdoses double-counted (`ef7a53c9b`, real)	T2	exactly 2 facts: +41,552 deaths, group `accidents`
pudl	dagster + parquet	FERC values silently dropped (`09d8efa7d`, real)	T2	14 facts, all in detailed tables, zero in core; +$54.5B distortion

Every clean-control rerun produced zero beyond-tolerance facts, deterministic to the cent, and one caught real drift on its own (a canonical store stale against merged parser fixes). The recurring lesson: tie-outs and pipeline tests are blind to scaling, double-counting, and drop distortions. Shares renormalize, identities scale in lockstep, sets dedupe. Only the committed-baseline diff is anchored outside the change.

Hard rules

Scripts emit facts, never findings. Severity is the judgment layer's job.
Exit 2 (evidence-unavailable) is never a pass. Missing input is a finding, not a skip.
Silence never looks green: unmatched files, skipped formulas, dropped metrics, and unavailable tiers are all reported facts.
A stale scratch database is never diffed; pre-existing scratch files are deleted before every re-run.
A tieout is only committed after it has been verified to hold on live data.
€ impact is computed from evidence by script, never estimated by an agent.

Onboarding

Write data-review.yml (schema: skills/data-review/manifest.schema.json).
data-review bless --manifest data-review.yml --lane <name>, commit the baselines.
data-review t1 --manifest data-review.yml; a clean run means ready.

Cost: 3 minutes (dbt + DuckDB) to half a day (dagster monorepo).

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
examples/finance-demo		examples/finance-demo
skills/data-review		skills/data-review
tests		tests
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
data-review		data-review
demo.sh		demo.sh
install.sh		install.sh
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

data-review

Try it

Setup

Usage

The manifest

How it works

Proving ground

Hard rules

Onboarding

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

data-review

Try it

Setup

Usage

The manifest

How it works

Proving ground

Hard rules

Onboarding

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages