1 Create a new GitHub-ready repository named DataPact with a clean Python project layout, packaging, and tests.
2 Goal of the tool
Build a framework that validates datasets against a “data contract” (YAML) containing:
◦ schema expectations (columns, types, required/optional)
◦ quality rules (nulls, uniqueness, ranges, regex, allowed values)
◦ basic distribution rules (mean/std drift thresholds for numeric columns)
◦ contract versioning + compatibility checks
3 Primary user experience
◦ Provide a CLI command: datapact validate --contract <contract.yaml> --data <file.csv|parquet|json> --format <auto>
◦ Output:
▪ console summary (PASS/FAIL + key errors)
▪ machine-readable JSON report saved to ./reports/<timestamp>.json (see the sample report sketch after this list)
▪ non-zero exit code on FAIL (so it works in CI/CD)
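A sketch of what the saved JSON report could look like; the exact field names here are illustrative, not a fixed schema:

    {
      "summary": {"status": "FAIL", "checks_run": 12, "errors": 2, "warnings": 1},
      "errors": [
        {"code": "NULL_CONSTRAINT", "field": "email",
         "message": "3 null values found", "sample_values": [null]}
      ],
      "warnings": [
        {"code": "MEAN_DRIFT", "field": "price",
         "message": "mean drifted 4.2% from the expected value"}
      ],
      "metadata": {"contract_version": "1.0.0", "dataset": "orders",
                   "timestamp": "2024-01-01T00:00:00Z", "tool_version": "0.1.0"}
    }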
4 Contract file format
◦ Implement a YAML contract schema (v1) like the following (a full example is sketched after this list):
▪ contract.name, contract.version, dataset.name
▪ fields[] with name, type, required, and rules
▪ rules can include: not_null, unique, min, max, regex, enum, max_null_ratio
▪ distribution rules: mean, std, max_drift_pct (or max_z_score)
◦ Add JSON Schema validation (or pydantic validation) so invalid contract YAML fails early with good errors.
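One plausible v1 contract using the keys above; the exact layout is an assumption, not a mandated format:

    contract:
      name: orders_contract
      version: 1.0.0
    dataset:
      name: orders
    fields:
      - name: order_id
        type: string
        required: true
        rules:
          not_null: true
          unique: true
          regex: "^ORD-[0-9]{6}$"
      - name: status
        type: string
        required: true
        rules:
          enum: [pending, shipped, delivered]
      - name: amount
        type: float
        required: false
        rules:
          min: 0
          max: 100000
          max_null_ratio: 0.01
        distribution:
          mean: 250.0
          std: 80.0
          max_drift_pct: 10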
5 Supported data inputs
◦ CSV (pandas)
◦ Parquet (pandas/pyarrow)
◦ JSON lines (pandas)
◦ Auto-detect the format if not provided (see the loader sketch below).
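A minimal sketch of the loader with extension-based auto-detection; load_dataset and the format names are assumptions about the eventual API:

    from pathlib import Path

    import pandas as pd

    def load_dataset(path: str, fmt: str = "auto") -> pd.DataFrame:
        """Load CSV, Parquet, or JSON-lines data into a pandas DataFrame."""
        if fmt == "auto":
            # Fall back to the file extension when no explicit format is given.
            ext_map = {".csv": "csv", ".parquet": "parquet",
                       ".json": "jsonl", ".jsonl": "jsonl"}
            fmt = ext_map.get(Path(path).suffix.lower(), "csv")
        if fmt == "csv":
            return pd.read_csv(path)
        if fmt == "parquet":
            return pd.read_parquet(path)  # requires pyarrow (or fastparquet)
        if fmt == "jsonl":
            return pd.read_json(path, lines=True)
        raise ValueError(f"Unsupported format: {fmt}")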
6 Core modules to implement
◦ contracts/: parse and validate contract YAML, map it to typed Python models, handle versions.
◦ datasource/: load the dataset into a pandas DataFrame and infer a schema summary.
◦ validators/
▪ schema validator (missing/extra columns, type mismatches)
▪ quality validator (null, unique, range, regex, enum, null ratio)
▪ distribution validator (mean/std drift checks)
◦ reporting/: unified report model (sketched after this list) with:
▪ summary (pass/fail, counts)
▪ errors (list with code, field, message, sample values)
▪ warnings
▪ metadata (contract version, dataset, timestamp, tool version)
◦ cli/: argument parsing, wiring the modules together, exit codes.
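A sketch of the unified report model using dataclasses (pydantic would work just as well); the class and attribute names are illustrative:

    from dataclasses import dataclass, field
    from typing import Any

    @dataclass
    class Issue:
        code: str                # machine-readable code, e.g. "TYPE_MISMATCH"
        field_name: str          # the column the issue applies to
        message: str             # human-readable explanation
        sample_values: list[Any] = field(default_factory=list)

    @dataclass
    class Report:
        passed: bool
        errors: list[Issue] = field(default_factory=list)
        warnings: list[Issue] = field(default_factory=list)
        # contract version, dataset name, timestamp, tool version
        metadata: dict[str, str] = field(default_factory=dict)

        def summary(self) -> dict[str, Any]:
            return {"status": "PASS" if self.passed else "FAIL",
                    "errors": len(self.errors), "warnings": len(self.warnings)}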
7 Validation semantics
◦ Schema validation runs first; if it fails badly (missing required fields), still run the remaining checks where possible and record partial results.
◦ Output severity levels: ERROR (fails the build) and WARN (does not fail, but visible); a sketch of how severity drives the exit code follows this list.
◦ Configurable thresholds in the contract (e.g. max_null_ratio).
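A sketch of how a distribution check could emit either severity, and how severity maps to the exit code; check_mean_drift and the issue dict shape are assumptions:

    import pandas as pd

    def check_mean_drift(series: pd.Series, expected_mean: float,
                         max_drift_pct: float, severity: str = "ERROR") -> dict | None:
        """Return an issue when the observed mean drifts past the threshold."""
        # Assumes expected_mean != 0; a real implementation would guard this.
        drift_pct = abs(series.mean() - expected_mean) / abs(expected_mean) * 100
        if drift_pct > max_drift_pct:
            return {"severity": severity, "code": "MEAN_DRIFT",
                    "message": f"mean drifted {drift_pct:.1f}% (limit {max_drift_pct}%)"}
        return None

    def exit_code(issues: list[dict]) -> int:
        # Any ERROR fails the build (non-zero exit); WARNs alone still pass.
        return 1 if any(i["severity"] == "ERROR" for i in issues) else 0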
8 Test suite requirements
◦ Use pytest.
◦ Include sample contracts + sample datasets under tests/fixtures/.
◦ Minimum test scenarios (a sample test sketch follows this list):
▪ valid dataset passes
▪ missing required column fails
▪ type mismatch fails
▪ null constraint fails
▪ range constraint fails
▪ regex constraint fails
▪ enum constraint fails
▪ uniqueness constraint fails
▪ distribution drift produces a warning or a failure, depending on the contract setting
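One of the scenarios above as a sketch; validate(), the MISSING_COLUMN code, and the fixture file names are assumptions about the eventual API:

    from pathlib import Path

    from datapact import validate  # hypothetical top-level entry point

    FIXTURES = Path(__file__).parent / "fixtures"

    def test_missing_required_column_fails() -> None:
        # orders_missing_column.csv is a fixture lacking a required column.
        report = validate(contract=FIXTURES / "orders_contract.yaml",
                          data=FIXTURES / "orders_missing_column.csv")
        assert not report.passed
        assert any(e.code == "MISSING_COLUMN" for e in report.errors)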
9 Quality requirements
◦ Type hints everywhere.
◦ Clear error messages.
◦ Lintable formatting (add ruff config or equivalent).
◦ Add README.md with:
▪ what the tool does
▪ contract YAML example
▪ CLI usage examples
▪ how to run tests
10 Deliverables
◦ Full repo code with the modules above
◦ Working CLI
◦ Example contract + dataset
◦ Tests passing locally
11 Nice-to-have (only if time permits)
◦ GitHub Actions workflow for running tests on PRs (a minimal workflow sketch follows)
◦ datapact init command to generate a starter contract from a dataset (inferring columns/types)
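If the workflow is attempted, a minimal sketch could look like this (assumes a dev extra that installs pytest):

    name: tests
    on: [pull_request]
    jobs:
      test:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          - uses: actions/setup-python@v5
            with:
              python-version: "3.11"
          - run: pip install -e ".[dev]"
          - run: pytest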