forge-mock

Statistically realistic synthetic data from SQL DDL files.

forge-mock is a high-performance Python CLI that reads your CREATE TABLE statements and produces synthetic datasets that respect your schema — column types, foreign keys, nullability, and statistical distributions — without ever touching production data.




Overview

Real data is messy, private, and hard to move. Test data is usually either too fake to be useful or too real to be safe. Forge bridges that gap.

  • Schema-first — your DDL is the single source of truth
  • Referentially honest — FK constraints are respected, not ignored
  • Statistically tunable — override any column with a real distribution
  • Deterministic — pin a --seed and get the same data every time
  • Battle-hardened — inject corruption to test pipeline resilience

This is a Fynes Forge project built with precision over cleverness.


Installation

pip install forge-mock

Or from source:

git clone https://github.com/fynes-forge/forge-mock
cd forge-mock
pip install -e ".[dev]"

Requirements: Python 3.10+



Quick Start

1. Point Forge at your DDL

forge generate schema.sql

This generates 1000 rows per table as .parquet files in the current directory.
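As an illustration, schema.sql might contain something like the following (hypothetical tables — any DDL with the usual PK/FK/NOT NULL constraints works):

```sql
CREATE TABLE customers (
    id         INTEGER PRIMARY KEY,
    email      VARCHAR(255) NOT NULL,
    created_at TIMESTAMP
);

CREATE TABLE orders (
    id           INTEGER PRIMARY KEY,
    customer_id  INTEGER NOT NULL REFERENCES customers(id),
    total_amount NUMERIC(10, 2),
    status       VARCHAR(20)
);
```

Forge would generate customers first, then orders, so every customer_id points at a real customer row.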

2. Customise the run

forge generate schema.sql \
  --rows 5000 \
  --format csv \
  --dialect snowflake \
  --output ./data/synthetic \
  --seed 42

3. Inspect without generating

forge inspect schema.sql --dialect postgres

Prints a rich table of every column, its type mapping, PK/FK relationships, and nullability — no data written.


Statistical Profiles (YAML Config)

Override any column with a specific distribution via a YAML file:

# config.yaml
tables:
  orders:
    rows: 10000
    columns:
      total_amount:
        distribution: lognormal
        mean: 4.2
        sigma: 0.9
      status:
        distribution: choice
        values: [pending, shipped, delivered, cancelled]
  products:
    rows: 500
    columns:
      price:
        distribution: normal
        mean: 49.99
        std: 15.0
      stock_qty:
        distribution: poisson
        lam: 75

Run with:

forge generate schema.sql --config config.yaml

Supported Distributions

Distribution    Parameters
normal          mean, std, decimals
uniform         low, high, decimals
lognormal       mean, sigma, decimals
poisson         lam
exponential     scale, decimals
binomial        n, p
beta            a, b, decimals
choice          values (list)
gamma           shape, scale, decimals
integer_range   low, high
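The project structure notes that distributions are NumPy-backed, so these parameters map naturally onto numpy.random.Generator calls. A rough sketch of that mapping (illustrative only, not forge-mock's actual code — the `sample` helper and its signature are assumptions):

```python
import numpy as np

# A seeded Generator is what would make a pinned --seed reproducible.
rng = np.random.default_rng(42)

def sample(distribution: str, size: int, **params) -> np.ndarray:
    """Draw `size` values for one column, rounding to `decimals` when given."""
    decimals = params.pop("decimals", None)
    draw = {
        "normal": lambda: rng.normal(params["mean"], params["std"], size),
        "uniform": lambda: rng.uniform(params["low"], params["high"], size),
        "lognormal": lambda: rng.lognormal(params["mean"], params["sigma"], size),
        "poisson": lambda: rng.poisson(params["lam"], size),
        "exponential": lambda: rng.exponential(params["scale"], size),
        "choice": lambda: rng.choice(params["values"], size),
        # note: numpy's integers() treats `high` as exclusive by default
        "integer_range": lambda: rng.integers(params["low"], params["high"], size),
    }[distribution]()
    return np.round(draw, decimals) if decimals is not None else draw

# Mirrors the config.yaml above: a price column and a status column.
prices = sample("normal", 500, mean=49.99, std=15.0, decimals=2)
statuses = sample("choice", 10, values=["pending", "shipped", "delivered"])
```

The same seeded Generator would be shared across all columns, so one --seed pins the entire dataset.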

Schema Drift / Corruption Mode

Test your pipeline's resilience against bad data:

forge generate schema.sql --corrupt 0.05

--corrupt 0.05 injects bad values into ~5% of cells: nulls in non-nullable columns, type mismatches, out-of-range integers, invalid dates, empty strings, and control characters.
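The mechanics are simple to picture: walk every cell and, with probability equal to the corruption rate, swap the value for something deliberately wrong. A minimal sketch of that idea (not forge-mock's internals — the helper name and the exact pool of bad values are assumptions):

```python
import random

# A pool of deliberately bad values: nulls, type mismatches,
# out-of-range integers, invalid dates, empty strings, control chars.
BAD_VALUES = [None, "NOT_A_NUMBER", -999999999, "0000-00-00", "", "\x00\x07"]

def corrupt(rows: list[dict], rate: float, seed: int = 42) -> list[dict]:
    """Return a copy of `rows` with ~rate of cells replaced by bad values."""
    rng = random.Random(seed)
    return [
        {col: (rng.choice(BAD_VALUES) if rng.random() < rate else val)
         for col, val in row.items()}
        for row in rows
    ]

clean = [{"id": i, "qty": i * 2} for i in range(1000)]
dirty = corrupt(clean, rate=0.05)
```

Because the corruption pass is seeded, a corrupted fixture is just as reproducible as a clean one.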


CLI Reference

Usage: forge [COMMAND] [OPTIONS]

Commands:
  generate   Generate synthetic data from a DDL file
  inspect    Inspect a DDL file schema without generating data
  version    Print forge-mock version

forge generate OPTIONS:
  DDL                     Path to SQL DDL file          [required]
  --rows, -r INT          Rows per table                [default: 1000]
  --output, -o PATH       Output directory              [default: .]
  --format, -f            Output format                 [parquet|csv|sql]
  --dialect, -d           SQL dialect                   [postgres|snowflake|bigquery|trino|...]
  --seed, -s INT          Random seed for reproducibility
  --config, -c PATH       YAML config for distribution overrides
  --corrupt FLOAT         Corruption injection rate (0.0–1.0)
  --verbose, -v           Show detailed schema info

Supported SQL Dialects

Dialect     Status
postgres    ✅ Full support
snowflake   ✅ Full support
bigquery    ✅ Full support
trino       ✅ Full support
duckdb      ✅ Full support
mysql       ✅ Full support
sqlite      ✅ Full support

Output Formats

Format    Description
parquet   Snappy-compressed via PyArrow — best for analytics pipelines
csv       UTF-8, comma-delimited — universal compatibility
sql       Batched INSERT INTO statements — ready to replay
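The sql format's "batched INSERT" shape can be sketched in a few lines. This is illustrative only — the function name, quoting rules, and batch size here are assumptions, not forge-mock's actual serialiser:

```python
def to_insert_statements(table, columns, rows, batch_size=2):
    """Serialise rows as batched INSERT INTO statements."""
    def fmt(v):
        if v is None:
            return "NULL"
        if isinstance(v, str):
            return "'" + v.replace("'", "''") + "'"  # escape single quotes
        return str(v)

    stmts = []
    for i in range(0, len(rows), batch_size):
        batch = rows[i:i + batch_size]
        values = ",\n  ".join(
            "(" + ", ".join(fmt(v) for v in row) + ")" for row in batch
        )
        stmts.append(
            f"INSERT INTO {table} ({', '.join(columns)}) VALUES\n  {values};"
        )
    return stmts

stmts = to_insert_statements(
    "products", ["id", "name", "price"],
    [(1, "Widget", 9.99), (2, "O'Ring", 1.25), (3, None, 0.5)],
)
```

Batching keeps replay fast: one round-trip per batch instead of one per row.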

Project Structure

forge-mock/
├── src/
│   └── forge_mock/
│       ├── __init__.py
│       ├── cli/
│       │   └── main.py          # Typer CLI entry point
│       ├── parser/
│       │   ├── ddl_parser.py    # sqlglot-based DDL parser
│       │   └── schema_models.py # ColumnSchema, TableSchema dataclasses
│       ├── generators/
│       │   ├── column_generator.py      # Per-column value generation
│       │   ├── distribution_generator.py # NumPy statistical distributions
│       │   └── type_map.py              # SQL type → Faker provider mapping
│       └── engine/
│           ├── forge_engine.py     # Orchestration & serialisation
│           ├── dependency_graph.py # FK dependency resolution (networkx DAG)
│           └── config_loader.py    # YAML config loader
├── tests/
│   ├── conftest.py
│   ├── fixtures.py
│   ├── test_parser.py
│   └── test_engine.py
├── examples/
│   ├── ecommerce.sql
│   └── ecommerce_config.yaml
└── pyproject.toml
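The FK dependency resolution in dependency_graph.py (networkx-based, per the tree above) boils down to a topological sort: parent tables must be generated before the tables that reference them. The idea can be sketched with the stdlib instead of networkx (the table names here are hypothetical):

```python
from graphlib import TopologicalSorter

# table -> set of tables it references via foreign keys
fk_deps = {
    "customers": set(),
    "categories": set(),
    "products": {"categories"},
    "orders": {"customers"},
    "order_items": {"orders", "products"},
}

# static_order() yields dependencies before their dependents,
# so FK values always point at rows that already exist.
generation_order = list(TopologicalSorter(fk_deps).static_order())
```

A cycle in the FK graph (table A references B, B references A) would raise here, which is the right failure mode: such a schema has no valid generation order without deferring constraints.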

Development

# Install with dev extras
pip install -e ".[dev]"

# Run tests
pytest

# Run tests with coverage report
pytest --cov=forge_mock --cov-report=html

# Type check
mypy src/

# Lint
ruff check src/ tests/

CI/CD Integration

Pin a seed for byte-for-byte reproducible test fixtures:

# Generate fixed test data as part of your CI pipeline
forge generate tests/schema.sql \
  --rows 100 \
  --seed 42 \
  --format parquet \
  --output tests/fixtures/

Example: Full E-Commerce Run

forge generate examples/ecommerce.sql \
  --config examples/ecommerce_config.yaml \
  --format parquet \
  --output ./data \
  --seed 42 \
  --verbose

Expected output:

  ███████╗ ██████╗ ██████╗  ██████╗ ███████╗
  ██╔════╝██╔═══██╗██╔══██╗██╔════╝ ██╔════╝
  ...

┌─ Run Configuration ──────────────────────┐
│ DDL File    ecommerce.sql                 │
│ Rows/table  1000                          │
│ Format      parquet                       │
│ Seed        42                            │
└───────────────────────────────────────────┘

▶ customers  → 500 rows
▶ categories → 20 rows
▶ products   → 200 rows
▶ orders     → 2,000 rows
▶ order_items→ 5,000 rows

✓ Done. Generated 5 table(s) in ./data

╭─ Generated Datasets ──────────────────────────────────────╮
│ Table        │ Rows  │ Columns │ File                      │
│ customers    │ 500   │ 7       │ ./data/customers.parquet  │
│ categories   │ 20    │ 3       │ ./data/categories.parquet │
│ products     │ 200   │ 8       │ ./data/products.parquet   │
│ orders       │ 2,000 │ 7       │ ./data/orders.parquet     │
│ order_items  │ 5,000 │ 5       │ ./data/order_items.parquet│
╰───────────────────────────────────────────────────────────╯

Contributing

Contributions are welcome. Please read CONTRIBUTING.md before opening a PR.


Licence

MIT © Fynes Forge — see LICENSE for details.
