The project includes a suite of automated performance and non-functional requirements (NFR) tests to ensure DataPact remains robust, efficient, and production-ready at scale. These tests are located in tests/test_performance.py and tests/test_performance_extra.py.
See PERFORMANCE_NFR_SUMMARY.md for the latest results, coverage, and CI integration details. Performance/NFR tests are run automatically in CI and reports are uploaded as artifacts.
- Large dataset validation time: Validates 1M+ row CSVs and asserts runtime is within SLA.
- Contract parsing speed: Measures time to parse large contracts (100+ fields, 50+ rules each).
- CLI startup time: Ensures CLI responds quickly for small contracts.
- Memory usage: Checks RAM usage when loading and validating large files.
- Batch and concurrent validation: Runs many validations in sequence and in parallel to test throughput and scalability.
- Performance degradation: Measures how validation time scales with increasing data size.
```shell
PYTHONPATH=src python3 -m pytest tests/test_performance.py tests/test_performance_extra.py --durations=10 --tb=short --maxfail=2 --junitxml=performance_report.xml
```

This will generate a JUnit XML report (`performance_report.xml`) with timing and pass/fail status for each scenario.
- Add new scenarios to the performance test files.
- Use realistic data and contracts for benchmarking.
- Document any new NFRs in this section.
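A new scenario added to the performance files typically pairs a synthetic dataset with a wall-clock budget. The sketch below shows the general shape, assuming hypothetical names (`validate_frame`, the 30-second budget) that stand in for the real suite's helpers and SLAs:

```python
import time

import pandas as pd


def validate_frame(df: pd.DataFrame) -> bool:
    """Stand-in for DataPact validation: a simple not-null check per column."""
    return not df.isnull().any().any()


def test_large_dataset_within_sla():
    # Build a synthetic 1M-row frame, mirroring the "large dataset" scenario.
    df = pd.DataFrame({"id": range(1_000_000)})
    start = time.perf_counter()
    assert validate_frame(df)
    elapsed = time.perf_counter() - start
    # The threshold is illustrative; the real suite pins its own budget.
    assert elapsed < 30.0, f"validation took {elapsed:.2f}s, over SLA"
```

Keeping the SLA assertion separate from the correctness assertion makes timing regressions distinguishable from functional failures in the JUnit report.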
DataPact follows a modular pipeline:
```
Contract YAML → Contract Provider → Contract Parser → Validators → Report → JSON/Console/Sinks
Data File / DB → DataSource Loader ↓
```
- Parses YAML contract files into typed Python models
- Supports ODCS v3.1.0 contracts via dedicated mapping
- Defines `Contract`, `Field`, `FieldRule`, and `DistributionRule` dataclasses
- Handles contract versioning metadata
- Applies policy packs before field rule parsing
- Integrates with versioning module for auto-migration
- Responsibility: Contract validation, deserialization, and version management
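The typed models named above can be sketched as plain dataclasses. The exact attributes are assumptions for illustration; the real definitions in the parser may carry more fields:

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FieldRule:
    # Hypothetical rule attributes; the real dataclass may differ.
    required: bool = False
    unique: bool = False
    regex: Optional[str] = None
    min_value: Optional[float] = None
    max_value: Optional[float] = None


@dataclass
class Field:
    name: str
    type: str
    rules: FieldRule = field(default_factory=FieldRule)


@dataclass
class Contract:
    name: str
    version: str
    fields: List[Field] = field(default_factory=list)
```

Dataclasses give the parser a typed target to deserialize YAML into, so downstream validators never touch raw dictionaries.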
- Resolves contract format via provider dispatch (DataPact YAML, ODCS, or API Pact)
- Encapsulates format-specific compatibility checks and mapping
- DataPact Provider: Loads YAML contracts in native DataPact format
- ODCS Provider: Maps Open Data Contract Standard v3.1.0 schemas to DataPact format
- Pact Provider: Infers DataPact fields from Pact API contracts via response body type inference
- Type Inference: Parses Pact JSON contracts and extracts field types from example response bodies
- Automatic Detection: Uses the `--contract-format pact` flag or auto-infers from the `.json` extension
- Nested Support: Schemas can be flattened for nested API responses (e.g., `user.address.city` → `user__address__city`)
- Limitations: Type inference is automatic, but quality rules (uniqueness, ranges, regex, enums) and distribution rules must be added manually post-inference
- Workflow: Infer base contract from Pact JSON → Review inferred types → Add custom quality/distribution rules → Validate API responses
- Example Fixture: See `tests/fixtures/pact_user_api.json` for a sample Pact contract with a user API schema
- Responsibility: Format detection, contract loading, and type inference
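Type inference over an example response body reduces to a recursive walk that flattens nested keys. This is a minimal sketch under assumed names (`infer_types` is not DataPact's actual API), but it shows the `__` flattening convention described above:

```python
from typing import Dict


def infer_types(body: dict, prefix: str = "", sep: str = "__") -> Dict[str, str]:
    """Walk an example response body, flattening nested keys with `sep`."""
    out: Dict[str, str] = {}
    for key, value in body.items():
        path = f"{prefix}{sep}{key}" if prefix else key
        if isinstance(value, dict):
            out.update(infer_types(value, path, sep))
        elif isinstance(value, bool):
            # bool must be checked before int: bool is a subclass of int.
            out[path] = "boolean"
        elif isinstance(value, int):
            out[path] = "integer"
        elif isinstance(value, float):
            out[path] = "number"
        else:
            out[path] = "string"
    return out
```

For example, `infer_types({"user": {"address": {"city": "NYC"}}})` yields `{"user__address__city": "string"}`, matching the flattening example above.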
- Loads CSV, Parquet, JSON Lines, and Excel (XLSX/XLS) formats
- Loads database tables and queries (Postgres, MySQL, SQLite)
- Auto-detects format from file extension (`.csv`, `.parquet`, `.jsonl`, `.xlsx`, `.xls`)
- Excel support:
  - Loads Excel files via pandas `read_excel()`
  - Optional sheet selection via the `sheet_name` parameter (name as string, or 0-indexed position as int; defaults to 0)
  - Full-file load only (no chunking support due to Excel format characteristics)
- Provides schema inference (column names and inferred types)
- Supports chunked streaming for large CSV/JSONL files and database queries via `--db-chunksize`
- Responsibility: Data I/O and schema discovery
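Extension-based dispatch of this kind can be sketched as a lookup table over pandas readers. The names (`load_data`, `_LOADERS`) are illustrative, not DataPact's actual implementation, and the real loader additionally handles database sources and chunking:

```python
from pathlib import Path

import pandas as pd

# Illustrative extension dispatch table.
_LOADERS = {
    ".csv": pd.read_csv,
    ".parquet": pd.read_parquet,
    ".jsonl": lambda p, **kw: pd.read_json(p, lines=True, **kw),
    ".xlsx": pd.read_excel,  # full-file load; sheet_name defaults to 0
    ".xls": pd.read_excel,
}


def load_data(path: str, **kwargs) -> pd.DataFrame:
    """Pick a reader by file extension, raising on unknown formats."""
    ext = Path(path).suffix.lower()
    try:
        loader = _LOADERS[ext]
    except KeyError:
        raise ValueError(f"unsupported file format: {ext}") from None
    return loader(path, **kwargs)
```

Raising on an unknown extension keeps format errors loud at load time instead of surfacing as confusing validation failures later.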
- Provides a contract-aware normalization step (noop by default)
- Supports flatten metadata for future schema flattening
- Responsibility: Optional preprocessing before validation
- Generates contract rules from observed data
- Infers enums, ranges, null ratios, and distributions
- Responsibility: Baseline rule generation for new contracts
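Baseline rule generation of this sort can be sketched per column: numeric columns get range rules, low-cardinality columns get enum rules, and every column records its observed null ratio. `profile_column` and the `enum_max` threshold are assumptions for illustration:

```python
import pandas as pd


def profile_column(series: pd.Series, enum_max: int = 10) -> dict:
    """Infer a baseline rule dict from observed column values."""
    rule: dict = {"null_ratio": float(series.isnull().mean())}
    non_null = series.dropna()
    if pd.api.types.is_numeric_dtype(non_null):
        # Numeric columns: propose an observed min/max range.
        rule["min"] = float(non_null.min())
        rule["max"] = float(non_null.max())
    elif non_null.nunique() <= enum_max:
        # Few distinct values: propose an enum rule.
        rule["enum"] = sorted(non_null.unique().tolist())
    return rule
```

Inferred rules are a starting point, not a contract: observed ranges and enums should be reviewed before committing, since a sample may not cover all legitimate values.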
Validators run sequentially in this order:
- Validates structure: column existence, types, required fields
- Runs first because structural issues block detailed validation
- Produces `ERROR`-severity violations
- Applies schema drift policy for extra columns (WARN/ERROR)
- Exit: Errors prevent subsequent validators from running on affected fields
- Checks data content: nulls, uniqueness, ranges, patterns, enums
- Operates only on columns present in schema
- Produces both `ERROR` (constraint violations) and `WARN` (soft failures)
- Supports rule-level severity metadata and CLI overrides
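The content checks above can be sketched as follows. `check_quality` is an illustrative stand-in, and the real validator also covers ranges, enums, severity metadata, and CLI overrides; but it shows the skip-missing-columns behavior and the error-record shape:

```python
import re
from typing import List, Optional

import pandas as pd


def check_quality(df: pd.DataFrame, field: str, required: bool = False,
                  unique: bool = False, regex: Optional[str] = None) -> List[dict]:
    errors: List[dict] = []
    if field not in df.columns:
        return errors  # operates only on columns present in the schema
    col = df[field]
    if required and col.isnull().any():
        errors.append({"code": "QUALITY", "severity": "ERROR",
                       "field": field, "message": "null in required field"})
    if unique and col.dropna().duplicated().any():
        errors.append({"code": "QUALITY", "severity": "ERROR",
                       "field": field, "message": "duplicate values"})
    if regex is not None:
        pattern = re.compile(regex)
        # Regex applies to non-null values only.
        bad = col.dropna().astype(str).map(lambda v: not pattern.fullmatch(v))
        if bad.any():
            errors.append({"code": "QUALITY", "severity": "ERROR",
                           "field": field, "message": "regex mismatch"})
    return errors
```

Returning structured records rather than raising lets the pipeline collect every violation in one pass instead of stopping at the first failure.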
- Checks dataset SLAs (row count thresholds)
- Produces `ERROR` or `WARN` depending on SLA severity
- Executes plugin-based custom rules
- Supports field-level and dataset-level checks
- Input: DataFrame + Field rules
- Monitors numeric column statistics: mean, std, outlier detection
- Compares current vs. expected distributions
- Always produces `WARN` (never blocks validation)
- Input: DataFrame + Distribution rules
- Two-pass PII detection: (1) declared PII fields from contract, (2) auto-detection on all other columns
- Pass 1 (declared): emits `WARN` or `ERROR` (field-configurable) for fields tagged with `pii:` in YAML where `masked: false`
- Pass 2 (auto-detect): scans undeclared columns using column-name keywords (26 keywords, 8 categories), then regex value-pattern matching on a 500-row sample at a 20% hit threshold; always `WARN`
- Auto-detection disabled when `pii_scan: false` on the contract
- Contract metadata: `PIIConfig` dataclass (category, masked, severity) on `Field`; `pii_scan: bool` on `Contract`
- Output: `ErrorRecord` with `code="PII"`
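The auto-detection pass can be sketched as follows. The keyword and pattern sets here are a small illustrative subset (the real validator ships 26 keywords across 8 categories), and `auto_detect_pii` is an assumed name:

```python
import re
from typing import List

import pandas as pd

_NAME_KEYWORDS = {"email", "phone", "ssn", "address"}
_VALUE_PATTERNS = {
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[a-z]{2,}"),
    "ssn": re.compile(r"\d{3}-\d{2}-\d{4}"),
}


def auto_detect_pii(df: pd.DataFrame, sample_rows: int = 500,
                    hit_threshold: float = 0.2) -> List[str]:
    """Flag columns that look like PII by name keyword or value pattern."""
    flagged = []
    for col in df.columns:
        # Cheap check first: column-name keyword match.
        if any(kw in col.lower() for kw in _NAME_KEYWORDS):
            flagged.append(col)
            continue
        # Then regex value-pattern matching on a bounded sample.
        sample = df[col].dropna().astype(str).head(sample_rows)
        if sample.empty:
            continue
        for pattern in _VALUE_PATTERNS.values():
            hit_ratio = sample.map(lambda v: bool(pattern.search(v))).mean()
            if hit_ratio >= hit_threshold:
                flagged.append(col)
                break
    return flagged
```

Sampling 500 rows with a 20% hit threshold keeps the scan cheap while tolerating occasional coincidental matches in non-PII columns.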
- Aggregates errors/warnings from all validators
- Produces machine-readable JSON and human-readable console output
- Supports report sinks for file, stdout, and webhooks
- Tracks metadata: timestamp, contract version, tool version, breaking changes
- Lineage tracking (Phase 9):
  - `ErrorRecord` extends with optional `logical_path` and `actual_column` fields
  - `logical_path`: Contract field name or path (e.g., `"user"` or `"user.id"`)
  - `actual_column`: Physical dataframe column after normalization (e.g., `"user__id"` if flattened with `separator="__"`)
  - Enables error attribution when data is flattened or column-mapped
- Console output shows lineage: "field 'email' (path: email, column: email_normalized)"
- JSON output includes both fields for programmatic access
- Output: `./reports/<timestamp>.json`
- Maintains version registry for all contract versions
- Handles automatic migration between versions (1.0.0 → 1.1.0 → 2.0.0)
- Checks tool-contract compatibility
- Tracks breaking changes and deprecation status
- Responsibility: Version validation, migration, compatibility checking
- Entry point: parses arguments, orchestrates validation
- Performs version compatibility checking before validation
- Commands: `validate` (run validation), `init` (infer contract), `profile` (infer rules)
- Handles exit codes (0 = pass, 1 = fail with errors)
- Supports chunked validation and sampling options for large datasets
- Supports multiple contract formats: DataPact YAML, ODCS, or API Pact JSON
- Format Resolution:
  - Auto-detects format from file extension (`.yaml` → DataPact, `.json` → Pact for API contracts)
  - Explicit format specification via `--contract-format datapact|odcs|pact`
  - YAML files follow DataPact or ODCS schemas based on structure
  - `.json` files with Pact contract structure automatically load via the Pact provider
- Pact CLI Examples:

  ```shell
  datapact validate --contract pact_user_api.json --data api_response.json
  datapact validate --contract pact_user_api.json --data api_response.json --contract-format pact
  datapact validate --contract user_contract.yaml --data users.csv
  ```
- Applies normalization before validation
- Schema validation runs first and is blocking
- If required fields are missing, stop early
- Type mismatches are recorded as ERRORs
- Normalization runs before validation (noop unless enabled)
- Quality validation skips missing columns
- SLA validation runs after quality checks (non-blocking)
- Custom rule validation runs after SLA checks (non-blocking)
- Distribution validation is always non-blocking (WARNings only)
- PII validation runs last (non-blocking); ERRORs only when `severity: ERROR` is declared on a field
- Exit code is non-zero if any ERRORs exist (for CI/CD)
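The control flow above reduces to one blocking step followed by accumulate-only steps. A minimal sketch, with stand-in validator callables returning `(passed, errors)` pairs rather than DataPact's actual interfaces:

```python
from typing import Callable, List, Tuple

Validator = Callable[[], Tuple[bool, List[dict]]]


def run_pipeline(schema: Validator,
                 non_blocking: List[Validator]) -> Tuple[int, List[dict]]:
    """Run schema first (blocking), then the non-blocking validators."""
    errors: List[dict] = []
    ok, schema_errors = schema()
    errors.extend(schema_errors)
    if not ok:
        # Schema validation is blocking: stop before content checks.
        return 1, errors
    for validate in non_blocking:
        # Quality/SLA/custom/distribution/PII: collect, never abort.
        _, more = validate()
        errors.extend(more)
    has_error = any(e["severity"] == "ERROR" for e in errors)
    return (1 if has_error else 0), errors
```

WARN-only runs exit 0, which is what lets distribution and auto-detected PII findings surface in CI logs without failing the build.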
```yaml
# customer_contract.yaml
contract:
  name: customers
  version: 2.0.0
  fields:
    - name: email
      type: string
      required: true
      rules:
        regex: '^[a-z]+@[a-z]+\.[a-z]+$'
```

Process:
1. `Contract.from_yaml()` → Contract object
2. `DataSource.load()` → DataFrame with email column
3. SchemaValidator → checks email column exists, is string type
4. QualityValidator → checks regex matches all non-null emails
5. ValidationReport → aggregates results
6. `report.save_json()` → writes `./reports/20260208_103045.json`

```mermaid
sequenceDiagram
    autonumber
    actor User as User/CLI
    participant CLI as CLI Interface
    participant Provider as Contract Provider
    participant Parser as Contract Parser
    participant Loader as Data Loader
    participant Normalizer as Normalizer
    participant Schema as Schema Validator
    participant Quality as Quality Validator
    participant SLA as SLA Validator
    participant Custom as Custom Rule Validator
    participant Distribution as Distribution Validator
    participant PII as PII Validator
    participant Reporter as Report Generator
    participant Output as JSON/Console/Sinks
    User->>+CLI: datapact validate --contract.yaml --data.csv/--db-*
    CLI->>+Provider: Resolve format and load contract
    Provider->>Parser: Parse contract YAML
    Parser->>Parser: Apply policy packs
    Provider-->>-CLI: Contract object
    CLI->>+Loader: Load data (file or DB)
    Loader-->>-CLI: DataFrame
    CLI->>+Normalizer: Normalize dataframe (noop by default)
    Normalizer-->>-CLI: DataFrame
    rect rgb(200, 220, 255)
        Note over Schema,PII: VALIDATION PIPELINE (Sequential)
        CLI->>+Schema: Validate schema
        Schema-->>-CLI: Errors/OK
        CLI->>+Quality: Validate quality rules
        Quality-->>-CLI: Errors & warnings (non-blocking)
        CLI->>+SLA: Validate SLA thresholds
        SLA-->>-CLI: Errors & warnings (non-blocking)
        CLI->>+Custom: Run custom rule plugins
        Custom-->>-CLI: Errors & warnings (non-blocking)
        CLI->>+Distribution: Check distributions
        Distribution-->>-CLI: Warnings only (never blocks)
        CLI->>+PII: Detect PII (declared + auto-scan)
        PII-->>-CLI: Warnings/errors (non-blocking by default)
    end
    CLI->>+Reporter: Aggregate results
    Reporter-->>-CLI: ValidationReport
    CLI->>+Output: Generate output
    Output->>Output: Save JSON report
    Output->>Output: Send to report sinks
    Output->>Output: Print summary
    Output-->>-CLI: Done
    CLI->>User: Exit 0 (pass) or 1 (fail)
```
The framework supports multiple contract versions with automatic migration:
```
Old Contract (v1.0.0) → Auto-Migration → Latest Contract (v2.0.0) → Validation
                              ↓
                    Deprecation Warning
                    Breaking Changes Tracked
                    Migration Path Logged
```
- v1.0.0: Legacy version (basic rules)
- v1.1.0: Enhanced version (adds max_z_score)
- v2.0.0: Current latest (refactored quality rules)
- Contract is loaded from YAML
- Version is validated against registry
- If not latest version:
- Migration engine determines path (1.0→1.1 or 1.1→2.0)
- Applies schema transformations
- Logs deprecation warning to console
- Contract is upgraded to v2.0.0
- Validation proceeds with latest schema
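The stepwise migration path can be sketched as a chain of single-version transforms. The registry layout and transform bodies here are assumptions for illustration, not the real `versioning.py` internals:

```python
from typing import Callable, Dict, Tuple

# Maps a version to (next version, transform). Transforms are illustrative:
# v1.1.0 adds max_z_score; v2.0.0 refactors quality rules.
MIGRATIONS: Dict[str, Tuple[str, Callable[[dict], dict]]] = {
    "1.0.0": ("1.1.0", lambda c: {**c, "max_z_score": 3.0}),
    "1.1.0": ("2.0.0", lambda c: {**c, "quality_rules_v2": True}),
}


def migrate_to_latest(contract: dict, latest: str = "2.0.0") -> dict:
    """Apply one-step migrations until the contract reaches `latest`."""
    path = [contract["version"]]
    while contract["version"] != latest:
        next_version, transform = MIGRATIONS[contract["version"]]
        contract = {**transform(contract), "version": next_version}
        path.append(next_version)
    contract["migration_path"] = " -> ".join(path)
    return contract
```

Chaining one-step migrations means each release only has to define the transform from its immediate predecessor; older contracts reach the latest schema by walking the chain.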
- Tool version is tracked (currently 0.2.0)
- Compatibility matrix defines which tool versions support which contract versions
- Warnings issued if mismatch detected
See docs/VERSIONING.md for detailed migration guide.
All validators return `(bool, List[ErrorRecord])`:
- `bool`: Overall pass/fail
- `List[ErrorRecord]`: Structured error records, each containing:
  - `code`: Error category (`SCHEMA`, `QUALITY`, `SLA`, `CUSTOM`, `DISTRIBUTION`, `PII`)
  - `severity`: `ERROR` or `WARN`
  - `field`: Field name where the violation occurred
  - `message`: Human-readable description
  - `logical_path` (optional): Contract field name or path (e.g., `"user.id"`); populated when lineage tracking is enabled
  - `actual_column` (optional): Physical DataFrame column after normalization (e.g., `"user__id"`); populated when data is flattened or column-mapped
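A dataclass matching the record fields listed above might look like this (a sketch of the shape, not the verbatim source definition):

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class ErrorRecord:
    code: str          # SCHEMA, QUALITY, SLA, CUSTOM, DISTRIBUTION, PII
    severity: str      # ERROR or WARN
    field: str
    message: str
    # Lineage fields (optional): populated when data is flattened or mapped.
    logical_path: Optional[str] = None
    actual_column: Optional[str] = None
```

`asdict()` makes the record trivially serializable into the JSON report, including the optional lineage fields.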
Version information added to reports:
- Current contract version
- Tool version
- Breaking changes (if any)
- Migration status
To add a new validator:
- Implement in `validators/new_validator.py` with `validate() -> Tuple[bool, List[ErrorRecord]]`
- Import in `cli.py`
- Call after appropriate validators in the pipeline
- Errors are automatically aggregated into report
To add a new contract version:
- Add to `VERSION_REGISTRY` in `versioning.py`
- Implement migration path in the `VersionMigration` class
- Add test fixtures and tests
- Update `docs/VERSIONING.md` with breaking changes
To add new contract rules:
- Add field to the `FieldRule` or `DistributionRule` dataclass
- Parse in `Contract._parse_rules()`
- Check in the corresponding validator
To extend PII detection:
- Add to `VALID_PII_CATEGORIES` in `contracts.py`
- Add regex to `_VALUE_PATTERNS` or keyword to `_NAME_KEYWORDS` in `pii_validator.py`
- Add test cases in `tests/test_pii_validator.py`