This guide maps each file to its purpose and shows how they fit together.
- Purpose: Package entry point, exports main classes
- Exports: `Contract`, `OdcsContract`, `ValidationReport`, `DataSource`, `DatabaseSource`, `DatabaseConfig`, `profile_dataframe`
- When to modify: Adding new top-level exports
- Purpose: Parse YAML contracts, model validation rules
- Classes: `Contract`, `Field`, `FieldRule`, `DistributionRule`, `PIIConfig`, `Dataset`, `FlattenConfig`
- Key methods:
  - `Contract.from_yaml(path)` - Load and parse contract file
  - `Contract._parse_rules()` - Extract field validation rules
  - `Contract._parse_distribution()` - Extract distribution rules
  - `Contract._parse_pii()` - Extract PII metadata from field (`category`, `masked`, `severity`)
- PII contract keys: `pii_scan` (bool, default `true`) at contract level; `pii` block per field
- When to modify: Adding new rule types, PII categories, or contract metadata
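
The PII keys above could be read from a parsed contract roughly as follows. This is a minimal sketch, not the actual `Contract._parse_pii()` code: the `parse_pii` and `pii_scan_enabled` helpers, the dataclass layout, and the defaults for `masked` and `severity` are assumptions. Only the key names and the `pii_scan` default of `true` come from this guide.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class PIIConfig:
    """Assumed shape of the per-field PII metadata."""
    category: str
    masked: bool = False      # assumption: unmasked unless stated
    severity: str = "WARN"    # assumption: WARN unless stated


def parse_pii(field_def: dict[str, Any]) -> Optional[PIIConfig]:
    """Return a PIIConfig when the field declares a `pii` block, else None."""
    block = field_def.get("pii")
    if not block:
        return None
    return PIIConfig(
        category=block["category"],
        masked=bool(block.get("masked", False)),
        severity=str(block.get("severity", "WARN")).upper(),
    )


def pii_scan_enabled(contract_def: dict[str, Any]) -> bool:
    """`pii_scan` defaults to true when the contract omits it."""
    return bool(contract_def.get("pii_scan", True))
```
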
- Purpose: Contract provider dispatch (DataPact, ODCS, and API Pact)
- Files: `base.py`, `datapact_provider.py`, `odcs_provider.py`, `pact_provider.py`
- When to modify: Adding new contract formats or provider behavior
- Purpose: Load and convert API Pact contracts to DataPact format
- Classes: `PactProvider`
- Key methods:
  - `load()` - Load Pact JSON file and convert to `Contract`
  - `_infer_fields_from_body()` - Extract fields from Pact response body
  - `_infer_type()` - Map JSON types to DataPact types
- Limitations: Quality/distribution rules not inferred; must be added manually
- When to modify: Changing Pact-to-DataPact field mapping logic
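
To illustrate the Pact-to-DataPact field mapping, here is a hedged sketch of JSON type inference. The function names mirror the methods listed above, but the bodies are assumptions, as is the fallback to `string` for nulls and nested values.

```python
from typing import Any


def infer_type(value: Any) -> str:
    """Map a decoded JSON value to a DataPact type name."""
    # bool must be tested before int: in Python, bool is a subclass of int.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "integer"
    if isinstance(value, float):
        return "float"
    # Assumption: strings, nulls, and nested structures fall back to "string".
    return "string"


def infer_fields_from_body(body: dict[str, Any]) -> list[dict[str, str]]:
    """Build one field definition per top-level key of a response body."""
    return [{"name": key, "type": infer_type(value)} for key, value in body.items()]
```

As the limitation above notes, a mapping like this yields only names and types; quality and distribution rules still have to be added by hand.
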
- Purpose: Parse ODCS v3.1.0 contracts and map to DataPact models
- Classes: `OdcsContract`, `OdcsSchemaObject`, `OdcsSchemaProperty`
- Key methods:
  - `OdcsContract.from_dict()` - Parse ODCS contract dict
  - `OdcsContract.to_datapact_contract()` - Map ODCS schema to DataPact
- When to modify: Expanding ODCS mapping or metadata coverage
- Purpose: Policy pack registry and merge logic
- Exports: `POLICY_PACKS`, `apply_policy_packs()`
- When to modify: Adding new policy packs or merge behavior
- Purpose: Load datasets in multiple formats
- Classes: `DataSource`, `DatabaseSource`, `DatabaseConfig`
- Supported file formats: CSV, Parquet, JSON Lines (`.jsonl`), Excel (XLSX, XLS)
- Key methods:
  - `load()` - Load data into DataFrame
  - `infer_schema()` - Discover column types
  - `iter_chunks()` - Stream CSV/JSONL in chunks (not supported for Excel/Parquet)
  - `sample_dataframe()` - Sample rows for large datasets
  - `_detect_format()` - Auto-detect file format from extension
- Excel support:
  - Auto-detection of `.xlsx` and `.xls` files
  - Optional `sheet_name` parameter for sheet selection (defaults to 0 for first sheet)
  - Full-file load only (no chunking) due to Excel format limitations
  - Sheet can be specified by name (string) or index (integer)
- Database support: `DatabaseConfig`, `DatabaseSource` for Postgres/MySQL/SQLite
- Type inference: Maps pandas dtypes to DataPact types (integer, float, string, boolean)
- When to modify: Adding support for new data formats or DB engines
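
Extension-based format detection and dtype mapping might look like the sketch below. The table and helper names are hypothetical, and the fallback to `string` for unrecognized dtypes is an assumption. The supported extensions, the chunking restriction, and the four target types are taken from the list above.

```python
from pathlib import Path

# Extension-to-format table; the formats are the ones listed in this guide.
EXTENSION_FORMATS = {
    ".csv": "csv",
    ".parquet": "parquet",
    ".jsonl": "jsonl",
    ".xlsx": "excel",
    ".xls": "excel",
}

# Only CSV and JSONL can be streamed in chunks.
CHUNKABLE_FORMATS = {"csv", "jsonl"}


def detect_format(path: str) -> str:
    """Pick a format from the file extension, case-insensitively."""
    suffix = Path(path).suffix.lower()
    if suffix not in EXTENSION_FORMATS:
        raise ValueError(f"Unsupported file extension: {suffix!r}")
    return EXTENSION_FORMATS[suffix]


def supports_chunking(fmt: str) -> bool:
    return fmt in CHUNKABLE_FORMATS


def datapact_type(dtype_name: str) -> str:
    """Map a pandas dtype name such as 'int64' to a DataPact type."""
    name = dtype_name.lower()
    if name.startswith("bool"):
        return "boolean"
    if name.startswith(("int", "uint")):
        return "integer"
    if name.startswith("float"):
        return "float"
    # Assumption: everything else (object, datetime, category) maps to string.
    return "string"
```
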
- Purpose: Contract-aware normalization scaffold
- Files: `config.py`, `normalizer.py`
- When to modify: Adding flatten or other normalization modes
- Purpose: Command-line interface and orchestration
- Functions: `main()`, `validate_command()`, `init_command()`, `profile_command()`
- Commands: `validate`, `init`, `profile`
- Data source parameters:
  - `--data` - Path to data file (CSV, Parquet, JSON Lines, Excel)
  - `--format` - Override auto-detection (`csv`, `parquet`, `jsonl`, `excel`, `auto`)
  - `--sheet` - For Excel files, specify sheet name or 0-indexed position (defaults to 0)
- Database parameters: `--db-type`, `--db-host`, `--db-port`, `--db-user`, `--db-password`, `--db-name`, `--db-table`, `--db-query`
- Limitations:
  - `--chunksize` not supported with Excel (auto-rejected with error message)
  - Excel is always fully loaded into memory
- When to modify: Adding new CLI commands, options, or data source types
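
The data source flags and the Excel chunking restriction can be sketched with `argparse`. This is an illustration under assumptions, not the real parser in `cli.py`: only the flag names and `--format` choices come from the list above, while `build_parser`, `parse_sheet`, `check_chunking`, and the error text are hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="datapact")
    sub = parser.add_subparsers(dest="command", required=True)
    validate = sub.add_parser("validate")
    validate.add_argument("--data", default=None, help="Path to data file")
    validate.add_argument(
        "--format",
        default="auto",
        choices=["csv", "parquet", "jsonl", "excel", "auto"],
    )
    validate.add_argument("--sheet", default="0", help="Sheet name or 0-indexed position")
    validate.add_argument("--chunksize", type=int, default=None)
    return parser


def parse_sheet(raw: str):
    """Digits select a 0-indexed sheet; anything else is treated as a sheet name."""
    return int(raw) if raw.isdigit() else raw


def check_chunking(args: argparse.Namespace) -> None:
    """Reject --chunksize for Excel input, which is always fully loaded."""
    looks_like_excel = (args.data or "").lower().endswith((".xlsx", ".xls"))
    is_excel = args.format == "excel" or (args.format == "auto" and looks_like_excel)
    if is_excel and args.chunksize is not None:
        raise SystemExit("--chunksize is not supported with Excel files")
```
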
- Purpose: Profile data to infer rules for new contracts
- Functions: `profile_dataframe()`
- When to modify: Adjusting inference heuristics or defaults
- Purpose: Generate validation reports with lineage tracking
- Classes: `ErrorRecord`, `ValidationReport`
- ErrorRecord fields:
  - Core: `field`, `rule`, `severity`, `message`
  - Lineage (Phase 9): `logical_path` (contract field path), `actual_column` (physical column after normalization)
  - Timestamps and metadata
- Key methods:
  - `to_dict()` - Convert to JSON-serializable format; includes lineage fields when present
  - `save_json()` - Write report to `./reports/<timestamp>.json`
  - `print_summary()` - Print human-readable console output; displays lineage info (e.g., "field 'user_id' (path: user.id, column: user__id)") when flattening is used
- Report sinks: `FileReportSink`, `StdoutReportSink`, `WebhookReportSink`
- Lineage tracking: When contracts include normalization (flatten config), errors automatically populate `logical_path` (contract field name) and `actual_column` (flattened dataframe column) for debugging
- When to modify: Changing report format, adding metadata, or enhancing error attribution
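
A simplified model of an error record with optional lineage is sketched below. It assumes a dataclass layout and a hypothetical `label()` helper; the real `ErrorRecord` also carries timestamps and metadata, which are omitted here.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class ErrorRecord:
    field: str
    rule: str
    severity: str
    message: str
    logical_path: Optional[str] = None   # contract field path
    actual_column: Optional[str] = None  # physical column after normalization

    def to_dict(self) -> dict[str, Any]:
        """Serialize core fields; add lineage keys only when they are set."""
        data: dict[str, Any] = {
            "field": self.field,
            "rule": self.rule,
            "severity": self.severity,
            "message": self.message,
        }
        if self.logical_path is not None:
            data["logical_path"] = self.logical_path
        if self.actual_column is not None:
            data["actual_column"] = self.actual_column
        return data

    def label(self) -> str:
        """Hypothetical helper that formats the lineage-aware console label."""
        if self.logical_path is None and self.actual_column is None:
            return f"field '{self.field}'"
        return f"field '{self.field}' (path: {self.logical_path}, column: {self.actual_column})"
```
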
- Purpose: Contract version management, migration, and compatibility checking
- Classes: `VersionInfo`, `VersionMigration`
- Functions: `validate_version()`, `check_tool_compatibility()`, `get_breaking_changes()`, `get_deprecation_message()`
- Key data: `VERSION_REGISTRY`, `TOOL_COMPATIBILITY`, `LATEST_VERSION`
- When to modify: Adding new contract versions or migration paths
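
A registry-driven migration path can be pictured as below. The registry contents, its list shape, and the `migration_path` helper are illustrative assumptions. The real `VERSION_REGISTRY` may hold more versions and richer metadata.

```python
# Illustrative registry, ordered oldest to newest (an assumption).
VERSION_REGISTRY = ["1.0.0", "2.0.0"]
LATEST_VERSION = VERSION_REGISTRY[-1]


def validate_version(version: str) -> None:
    """Raise if the contract declares a version the tool does not know."""
    if version not in VERSION_REGISTRY:
        known = ", ".join(VERSION_REGISTRY)
        raise ValueError(f"Unknown contract version {version!r}; known versions: {known}")


def migration_path(version: str) -> list[tuple[str, str]]:
    """List the (from, to) steps needed to reach LATEST_VERSION."""
    validate_version(version)
    start = VERSION_REGISTRY.index(version)
    return list(zip(VERSION_REGISTRY[start:], VERSION_REGISTRY[start + 1:]))
```

With this shape, adding a version means appending to the registry and supplying one migration step from the previous entry.
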
- Purpose: Check structure (columns, types, required fields)
- Classes: `SchemaValidator`
- Runs: First (blocks if critical issues)
- Output: ERROR or WARN (extra columns)
- When to modify: Adding new schema checks (e.g., column ordering, drift policy)
- Purpose: Validate data content (nulls, ranges, patterns, etc.)
- Classes: `QualityValidator`
- Runs: Second (non-blocking)
- Output: ERROR or WARN severity violations (rule-level severities supported)
- When to modify: Adding new validation rules (min, max, regex, enum, etc.)
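
A per-value rule check with rule-level severity could be sketched as follows. The rule dictionary keys (`not_null`, `min`, `max`, `regex`, `enum`, `severity`) and the default severity of `ERROR` are assumptions for illustration; the real `QualityValidator` works from `FieldRule` objects.

```python
import re
from typing import Any


def check_value(field: str, value: Any, rules: dict[str, Any]) -> list[tuple[str, str]]:
    """Return (severity, message) pairs for one value checked against its rules."""
    severity = rules.get("severity", "ERROR")  # assumption: ERROR unless overridden
    found: list[tuple[str, str]] = []
    if value is None:
        # Other rules cannot be evaluated on a missing value.
        if rules.get("not_null"):
            found.append((severity, f"{field}: null value"))
        return found
    if "min" in rules and value < rules["min"]:
        found.append((severity, f"{field}: {value!r} is below min {rules['min']}"))
    if "max" in rules and value > rules["max"]:
        found.append((severity, f"{field}: {value!r} is above max {rules['max']}"))
    if "regex" in rules and re.fullmatch(rules["regex"], str(value)) is None:
        found.append((severity, f"{field}: {value!r} does not match pattern"))
    if "enum" in rules and value not in rules["enum"]:
        found.append((severity, f"{field}: {value!r} is not an allowed value"))
    return found
```
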
- Purpose: Validate custom plugin rules
- Classes: `CustomRuleValidator`
- Runs: After SLA validation (non-blocking)
- Output: ERROR or WARN severity violations
- When to modify: Adjusting plugin rule interfaces or behavior
- Purpose: Validate SLA checks (row count thresholds)
- Classes: `SLAValidator`
- Runs: After quality validation (non-blocking)
- Output: ERROR or WARN severity violations
- When to modify: Adding new SLA checks
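
A row count threshold check reduces to a pair of comparisons. In this sketch the parameter names `min_rows` and `max_rows` and the message wording are hypothetical.

```python
from typing import Optional


def check_row_count(
    row_count: int,
    min_rows: Optional[int] = None,
    max_rows: Optional[int] = None,
    severity: str = "ERROR",
) -> list[tuple[str, str]]:
    """Return (severity, message) pairs when the row count leaves its bounds."""
    found: list[tuple[str, str]] = []
    if min_rows is not None and row_count < min_rows:
        found.append((severity, f"row count {row_count} is below minimum {min_rows}"))
    if max_rows is not None and row_count > max_rows:
        found.append((severity, f"row count {row_count} is above maximum {max_rows}"))
    return found
```
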
- Purpose: Monitor numeric distributions (mean, std drift)
- Classes: `DistributionValidator`
- Runs: After custom rules (always non-blocking)
- Output: WARN severity violations
- When to modify: Adding new statistical checks (e.g., percentile thresholds)
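
Mean and standard deviation drift can be expressed as a percentage deviation from expected values. The sketch below assumes a single `max_drift_pct` threshold and a hypothetical `check_drift` function; the actual `DistributionRule` parameters may differ.

```python
import statistics
from typing import Sequence


def check_drift(
    values: Sequence[float],
    expected_mean: float,
    expected_std: float,
    max_drift_pct: float = 10.0,
) -> list[str]:
    """Return WARN messages when the mean or std drifts beyond max_drift_pct."""
    # statistics.stdev needs at least two values.
    observed = {"mean": statistics.fmean(values), "std": statistics.stdev(values)}
    expected = {"mean": expected_mean, "std": expected_std}
    warnings: list[str] = []
    for name in ("mean", "std"):
        if expected[name] == 0:
            continue  # relative drift is undefined against a zero baseline
        drift = abs(observed[name] - expected[name]) / abs(expected[name]) * 100
        if drift > max_drift_pct:
            warnings.append(
                f"WARN: {name} drifted {drift:.1f}% "
                f"(observed {observed[name]:.2f}, expected {expected[name]:.2f})"
            )
    return warnings
```
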
- Purpose: Detect PII — declared fields and auto-scan of undeclared columns
- Classes: `PIIValidator`
- Runs: Last in pipeline (non-blocking by default)
- Output: WARN (auto-detected or declared with `severity: WARN`); ERROR when a declared field has `severity: ERROR`
- Contract metadata: `PIIConfig` on `Field` (category, masked, severity); `pii_scan: bool` on `Contract`
- Pass 1 (declared): checks fields with `pii:` block; skips if `masked: true` or column absent
- Pass 2 (auto-detect): column-name keywords + regex value-pattern matching on 500-row sample; disabled by `pii_scan: false`
- When to modify: Adding new PII categories, keywords, or detection patterns
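
The auto-detect pass can be sketched as a name check followed by a value check on a bounded sample. The keyword lists, regular expressions, and the choice to flag a column when any sampled value matches are assumptions. Only the 500-row sample size, the two detection methods, and the `pii_scan` opt-out come from the description above.

```python
import re
from typing import Any

# Hypothetical keyword and pattern tables; the real ones may be broader.
NAME_KEYWORDS = {
    "email": ("email", "e_mail"),
    "phone": ("phone", "mobile"),
    "ssn": ("ssn", "social_security"),
}
VALUE_PATTERNS = {
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "ssn": re.compile(r"\d{3}-\d{2}-\d{4}"),
}
SAMPLE_ROWS = 500


def auto_detect_pii(
    columns: dict[str, list[Any]],
    declared: set[str],
    pii_scan: bool = True,
) -> dict[str, str]:
    """Return {column: category} for undeclared columns that look like PII."""
    if not pii_scan:
        return {}
    findings: dict[str, str] = {}
    for name, values in columns.items():
        if name in declared:
            continue  # declared fields are handled by the first pass
        lowered = name.lower()
        # Step 1: column-name keywords.
        category = next(
            (cat for cat, kws in NAME_KEYWORDS.items() if any(kw in lowered for kw in kws)),
            None,
        )
        # Step 2: value patterns on the first SAMPLE_ROWS non-null values.
        if category is None:
            sample = [str(v) for v in values[:SAMPLE_ROWS] if v is not None]
            category = next(
                (
                    cat
                    for cat, pattern in VALUE_PATTERNS.items()
                    if any(pattern.fullmatch(s) for s in sample)
                ),
                None,
            )
        if category is not None:
            findings[name] = category
    return findings
```
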
- Purpose: Project metadata, dependencies, build config
- Sections:
  - `[build-system]` - Setuptools config
  - `[project]` - Package metadata
  - `[project.optional-dependencies]` - Dev tools (pytest, black, mypy, ruff)
  - `[project.scripts]` - CLI entry point `datapact`
  - `[tool.ruff]`, `[tool.mypy]`, `[tool.pytest.ini_options]` - Tool configs
- When to modify: Updating dependencies, adding new tool config
- Purpose: Setuptools configuration for editable installs
- When to modify: Rarely (pyproject.toml is preferred)
- Purpose: Exclude files from git
- Sections: Python artifacts, build files, IDE files, test coverage, reports
- When to modify: Adding project-specific exclusions
- Purpose: GitHub Actions CI/CD pipeline
- Jobs: Lint (ruff, black), type check (mypy), tests (pytest with coverage)
- Trigger: Push to main/develop, PR to main/develop
- When to modify: Adding new checks, changing Python versions
- Purpose: User-facing documentation
- Sections: Features, installation, quick start, contract format, CLI usage
- Audience: End users and integrators
- When to modify: Updating examples, adding features
- Purpose: Functional feature list with compact examples
- Sections: Validation, rules, policies, streaming, plugins, reporting
- Audience: End users and evaluators
- When to modify: Adding or changing feature capabilities
- Purpose: Get developers running in minutes
- Sections: Installation, running CLI, running tests, code quality
- When to modify: Changing setup process or common commands
- Purpose: Developer guide for contributions
- Sections: Setup, code standards, workflow, adding validators/formats
- When to modify: Updating contribution guidelines
- Purpose: Design documentation
- Sections: Component overview, validation semantics, error handling, extensibility
- When to modify: Major architectural changes
- Purpose: Contract versioning guide and reference
- Sections: Version history, breaking changes, migration guide, API usage, best practices
- Audience: Developers working with contract versions
- When to modify: Adding new contract versions or migration paths
- Purpose: Guide for AI agents (Copilot, Claude, etc.)
- Sections: Project overview, architecture, key files, contract format, versioning, workflows, conventions
- When to modify: New features, design changes, new patterns
- Note: This is auto-generated/updated from project structure
- Purpose: This file; provides an overview of what was created
- When to modify: After major restructuring
- Purpose: Project overview dashboard with stats and quick references
- When to modify: Updating counts, versions, or quick links
- Purpose: Delivery summary for stakeholders
- When to modify: Updating scope, counts, or release status
- Purpose: Visual project tree
- When to modify: Adding or removing files/folders
- Purpose: Navigation guide for all docs
- When to modify: Adding new docs or cross-references
- Purpose: Feature and QA checklist
- When to modify: Updating scope or release status
- Purpose: Direct dependency list and tool descriptions
- When to modify: Updating dependencies or tooling
- Purpose: Guide for Mermaid sequence diagrams
- When to modify: Diagram location or viewing changes
- Purpose: Versioning implementation notes and test summary
- When to modify: Versioning changes or test count updates
- Purpose: Unit tests for validators and core functionality
- Test classes: `TestSchemaValidator`, `TestQualityValidator`, `TestDataSource`, `TestDistributionValidator`
- Fixtures: `customer_contract`, `valid_df`, `invalid_df`
- When to modify: Adding new test cases for features
- Purpose: Unit tests for contract versioning, migration, and compatibility
- Test classes: `TestVersionValidation`, `TestToolCompatibility`, `TestVersionMigration`, `TestContractVersionLoading`, `TestVersionInfo`
- Coverage: 18 test cases including tool 2.0.0 compatibility check
- When to modify: Adding new contract versions, migration logic, or tool versions
- Purpose: Multi-table banking/finance validation scenarios with consumer-specific contracts
- Test classes: `TestDepositsAccountsStrict`, `TestDepositsAccountsAggregate`, `TestDepositsTransactions`, `TestLendingLoansStrict`, `TestLendingLoansAggregate`, `TestLendingPayments`, `TestComplexConsumption`
- When to modify: Adding banking/finance rules, fixtures, or scenario coverage
- Purpose: Concurrency validation using threads
- When to modify: Changing concurrency behavior or validation safety checks
- Purpose: Concurrency validation using multiprocessing
- When to modify: Changing multiprocessing behavior or validation safety checks
- Purpose: Report sink tests and lineage tracking validation
- Coverage:
  - File, stdout, webhook sinks
  - `ErrorRecord` fields: `logical_path` (contract field), `actual_column` (physical column)
  - JSON serialization with lineage (`to_dict()`)
  - Console output formatting with flattened column names
  - Lineage display: "field 'email' (path: email, column: email_normalized)"
- When to modify: Adding new report sinks, changing output format, or enhancing lineage features
- Purpose: Database source tests (SQLite, MySQL)
- When to modify: Adding DB source capabilities or drivers
- Purpose: ODCS contract parsing and mapping tests
- When to modify: Expanding ODCS support or fixtures
- Purpose: Policy pack parsing and merge tests
- When to modify: Adding new policy packs or override behavior
- Purpose: Exhaustive positive/negative/boundary coverage for core features
- When to modify: Expanding feature coverage or new rule types
- Purpose: Chunked validation and sampling tests
- When to modify: Adjusting chunking or sampling behavior
- Purpose: Custom rule plugin tests
- When to modify: Adjusting plugin rule interfaces or examples
- Purpose: Provider dispatch and ODCS/DataPact provider tests
- When to modify: Adding new providers or provider behavior
- Purpose: Normalization scaffold and config mapping tests
- When to modify: Adding normalization modes or integration behavior
- Purpose: Example plugin module for custom rules
- When to modify: Updating plugin examples for documentation/tests
- Purpose: Profiling tests for inferred rules and distributions
- When to modify: Adjusting profiling heuristics or defaults
- Purpose: PIIValidator tests — declared PII, auto-detection, edge cases, ErrorRecord integration
- Test classes: `TestDeclaredPII`, `TestAutoDetectionByName`, `TestAutoDetectionByValue`, `TestPIIErrorCode`, `TestChunkedPII`
- Coverage: 23 test cases
- When to modify: Adding new PII categories, detection patterns, or severity behaviour
- Purpose: Contract with declared PII fields across all categories (email, phone, ssn)
- Usage: PIIValidator tests; template for PII-aware contracts
- When to modify: Adding examples of new PII categories
- Purpose: Sample data with unmasked PII values (email, phone), masked SSN
- When to modify: Expanding PII positive test cases
- Purpose: Contract with `pii_scan: false` to test auto-detection opt-out
- When to modify: Testing changes to `pii_scan` behaviour
- Purpose: Example contract with all rule types (v2.0.0)
- Usage: Test reference and template for new contracts
- When to modify: Adding examples of new rule types
- Purpose: Example contract in v1.0.0 format for testing auto-migration
- When to modify: Testing legacy version support
- Purpose: Example contract in v2.0.0 format with advanced rules
- When to modify: Adding examples of new v2.0.0 features
- Purpose: Sample data that passes all validation rules
- When to modify: Adding test data for new rules
- Purpose: Sample data with intentional violations (missing fields, invalid email, etc.)
- When to modify: Adding test cases for new rules
- Purpose: Banking deposits contract with schema and quality rules
- When to modify: Adjusting deposits validation rules or schema
- Purpose: Banking lending contract with schema and quality rules
- When to modify: Adjusting lending validation rules or schema
- Purpose: Deposits accounts data (positive/negative/boundary cases)
- When to modify: Expanding deposits accounts coverage
- Purpose: Lending loans data (positive/negative/boundary cases)
- When to modify: Expanding lending loans coverage
- Purpose: Aggregate consumer contract for deposits accounts
- When to modify: Adjusting aggregate consumer requirements
- Purpose: Aggregate consumer contract for lending loans
- When to modify: Adjusting aggregate consumer requirements
- Purpose: Deposits transactions contract
- When to modify: Adjusting transaction validation rules
- Purpose: Lending payments contract
- When to modify: Adjusting payments validation rules
- Purpose: Deposits transactions data (positive/negative/boundary cases)
- When to modify: Expanding transaction coverage
- Purpose: Lending payments data (positive/negative/boundary cases)
- When to modify: Expanding payments coverage
- Purpose: Aggregate deposits accounts data with relaxed customer_id constraints
- When to modify: Adjusting aggregate dataset mix
- Purpose: Aggregate lending loans data with relaxed customer_id constraints
- When to modify: Adjusting aggregate dataset mix
cli.py (entry point)
↓
├→ contracts.py (parse YAML + version validation)
│ └→ versioning.py (auto-migration, compatibility)
├→ datasource.py (load data)
└→ validators/ (sequential pipeline)
├→ schema_validator.py
├→ quality_validator.py
├→ sla_validator.py
├→ custom_rule_validator.py
├→ distribution_validator.py
└→ pii_validator.py
└→ reporting.py (aggregate results + version info)
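
The sequential pipeline above, with a blocking schema step followed by non-blocking validators, can be sketched as follows. This assumes that "critical" means any ERROR-severity schema violation and models validators as plain callables; the real orchestration lives in `validate_command()`.

```python
from typing import Callable

Violation = tuple[str, str]  # (severity, message)
Validator = Callable[[dict], list[Violation]]


def run_pipeline(
    data: dict,
    schema_validator: Validator,
    later_validators: list[Validator],
) -> list[Violation]:
    """Run schema checks first; continue only if they raise no ERROR."""
    violations = list(schema_validator(data))
    # Assumption: any ERROR from the schema step blocks the rest of the pipeline.
    if any(severity == "ERROR" for severity, _ in violations):
        return violations
    # Quality, SLA, custom rule, distribution, and PII validators are
    # non-blocking: each one runs and contributes its violations.
    for validator in later_validators:
        violations.extend(validator(data))
    return violations
```
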
Tests:
test_validator.py (core tests: 10)
↓ uses
├→ fixtures/customer_contract.yaml (v2.0.0)
├→ fixtures/valid_customers.csv
└→ fixtures/invalid_customers.csv
test_versioning.py (versioning tests: 17)
↓ uses
├→ fixtures/customer_contract_v1.yaml (v1.0.0)
└→ fixtures/customer_contract_v2.yaml (v2.0.0)
test_banking_finance.py (banking/finance tests: 16)
↓ uses
├→ fixtures/deposits_contract.yaml
├→ fixtures/lending_contract.yaml
├→ fixtures/deposits_data.csv
├→ fixtures/lending_data.csv
├→ fixtures/deposits_accounts_agg_contract.yaml
├→ fixtures/lending_loans_agg_contract.yaml
├→ fixtures/deposits_transactions_contract.yaml
├→ fixtures/lending_payments_contract.yaml
├→ fixtures/deposits_transactions.csv
├→ fixtures/lending_payments.csv
├→ fixtures/deposits_accounts_agg.csv
└→ fixtures/lending_loans_agg.csv
test_concurrency.py (threaded concurrency)
test_concurrency_mp.py (multiprocessing concurrency)
- New validation rule → Modify `FieldRule` in `contracts.py`, check in `quality_validator.py`, add test to `tests/test_validator.py`
- New data format → Modify `DataSource._detect_format()` and `load()`, add test fixture
- New CLI command → Add to argument parser in `cli.py`, create command function, document in `README.md`
- New validator type → Create `validators/new_validator.py`, import in `cli.py`, call in `validate_command()`, document in `.github/copilot-instructions.md`
- New contract version → Add to `VERSION_REGISTRY` in `versioning.py`, create migration path from previous version, add test fixtures in `tests/fixtures/`, add tests to `tests/test_versioning.py`, update `docs/VERSIONING.md`