Skip to content

10fra/data-review

Repository files navigation

data-review

Change-time review for data-engineering and financial-controlling work.

An agent skill that reviews the data impact of a code change: it re-runs the affected pipeline lanes and checks whether the numbers move the way the author said they would.

Code review checks the code; tests check the assertions you remembered to write. Neither runs a query, so locale mis-parses, sign flips, silent NULLs, join fan-out, and double counting all pass green. Scripts compute, agents judge: no agent does arithmetic a query can do, no script decides whether a number should have moved.

you             →  a PR touching parsers / transforms / models / deliverables
blast_radius    →  which lanes the diff hits                    (deterministic)
T1  tieouts     →  is the world in a known-good state?          (deterministic)
T2  rerun+diff  →  scratch rebuild vs committed baseline        (deterministic)
T3  row diff    →  row-level EXCEPT ALL + top-k moved groups    (deterministic)
judgment layer  →  declared / explained / UNEXPLAINED           (agents)
report          →  findings ranked by € impact + coverage table

Try it

A real before/after data bug caught in under a minute, no agent or account:

git clone https://github.com/10fra/data-review.git
cd data-review && ./demo.sh

It builds a toy two-stage pipeline, blesses a baseline, then drops a cents-to-dollars conversion in staging. A unit test and the orders-vs-staging tie-out both stay green (everything scaled in lockstep); the baseline diff catches sum(amount) jumping $134.50 → $13,450 and escalates it.

Setup

git clone https://github.com/10fra/data-review.git
cd data-review && ./install.sh

install.sh builds the venv, links the skill into ~/.claude/skills/, and runs the tests; re-run after a git pull. A consumer project carries a data-review.yml and committed baselines under .data-review/baselines/.

Usage

In a project with a data-review.yml, ask Claude Code:

review the data impact of this branch

Auto-discovered from ~/.claude/skills/, it runs the evidence scripts, then reviews the diff through ten failure-class lenses (each seeded from a real production bug), checks every delta against the PR's ## Expected data impact, and verifies each finding adversarially before it reaches the report.

Deterministic, so runnable by hand or in CI. The data-review launcher uses the repo's venv from any consumer directory:

data-review blast --manifest data-review.yml --diff main...HEAD   # diff × manifest → tier plan
data-review bless --manifest data-review.yml --lane <name>        # snapshot live state → committed baseline
data-review t1    --manifest data-review.yml                      # tieouts + live-vs-baseline conformance
data-review t2    --manifest data-review.yml --lane <name>        # scratch re-run, metric diff vs baseline
data-review t3    --manifest data-review.yml --lane <name> --scratch <db>  # row-level EXCEPT ALL
data-review xlsx  --manifest data-review.yml --workbook <path>    # Excel tie-outs, hardcodes, short SUMs

(alias dr=/path/to/data-review/data-review to shorten.)

The manifest

One data-review.yml per project, declaring what the skill cannot infer:

engine:                 # where the numbers live (DuckDB; adapters convert anything else)
lanes:                  # unit of blast radius: code globs + scratch-safe run command + output tables
baselines:              # metrics + group-by dims snapshotted into committed JSON
tieouts:                # cross-artifact invariants, verified on live data before being committed
deliverables:           # Excel workbooks audited against SQL
context:                # prose priors for the judgment agents (locale, sign conventions, known traps)

A lane run command must take a scratch-target override (scratch_env) or the skill refuses it, so a re-run never touches the canonical store. It must build what it consumes, or it re-runs stale code; non-DuckDB outputs convert there too.

How it works

  • Baselines are committed JSON. Re-blessing is a reviewed git change, so baseline drift shows up in diffs.
  • Tiers escalate mechanically. Any beyond-tolerance T2 delta makes the lane a T3 candidate; money_critical lanes always get T3. Whether the movement is explained is the agents' call.
  • Findings are falsifiable. Each carries a one-sentence claim, € impact from evidence, a reproduce command, and its verification votes.
  • Coverage is reported, not implied. The report ends with a lane × tier table; an empty report is not a clean review.

Proving ground

Tested on four stacks (one private, three public), replaying real historical bugs from their git history.

Project Stack Bug replayed / injected Caught Signal
consumer #1 (private) TS parsers + DuckDB supplier A amount explosion (real) T2 sum €63.5M → €3.91B, 74 group deltas, zero leakage
consumer #1 (private) TS parsers + DuckDB supplier B US-locale dates (real) T2 null_rate(accounting_period) +16pp, 96 group deltas
jaffle_shop_duckdb dbt + DuckDB fan-out / filter drift / unit-scale (injected) T2 all three; onboarding took 3 minutes
owid/etl Python ETL + feather drug overdoses double-counted (ef7a53c9b, real) T2 exactly 2 facts: +41,552 deaths, group accidents
pudl dagster + parquet FERC values silently dropped (09d8efa7d, real) T2 14 facts, all in detailed tables, zero in core; +$54.5B distortion

Every clean-control rerun produced zero beyond-tolerance facts, deterministic to the cent, and one caught real drift on its own (a canonical store stale against merged parser fixes). The recurring lesson: tie-outs and pipeline tests are blind to scaling, double-counting, and drop distortions. Shares renormalize, identities scale in lockstep, sets dedupe. Only the committed-baseline diff is anchored outside the change.

Hard rules

  • Scripts emit facts, never findings. Severity is the judgment layer's job.
  • Exit 2 (evidence-unavailable) is never a pass. Missing input is a finding, not a skip.
  • Silence never looks green: unmatched files, skipped formulas, dropped metrics, and unavailable tiers are all reported facts.
  • A stale scratch database is never diffed; pre-existing scratch files are deleted before every re-run.
  • A tieout is only committed after it has been verified to hold on live data.
  • € impact is computed from evidence by script, never estimated by an agent.

Onboarding

  1. Write data-review.yml (schema: skills/data-review/manifest.schema.json).
  2. data-review bless --manifest data-review.yml --lane <name>, commit the baselines.
  3. data-review t1 --manifest data-review.yml; a clean run means ready.

Cost: 3 minutes (dbt + DuckDB) to half a day (dagster monorepo).

About

Change-time data review for data-engineering and financial-controlling work. Re-runs affected pipeline lanes, diffs the numbers vs a blessed baseline, and judges whether movement matches declared intent. Catches the data bugs code review and unit tests miss.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors