forgery

Fake data at the speed of Rust.

A high-performance fake data generation library for Python, powered by Rust. Designed to be 50-100x faster than Faker for batch operations.

Installation

pip install forgery

From source (for development)

git clone https://github.com/williajm/forgery.git
cd forgery
pip install maturin
maturin develop --release

Quick Start

from forgery import fake

# Generate 10,000 names in one fast call
names = fake.names(10_000)

# Single values work too
email = fake.email()
name = fake.name()

# Deterministic output with seeding
fake.seed(42)
data1 = fake.names(100)
fake.seed(42)
data2 = fake.names(100)
assert data1 == data2

Features

Batch-first design: Generate thousands of values in a single call
50-100x faster than Faker for batch operations
Multi-locale support: 7 locales with locale-specific data
Deterministic seeding: Reproducible output for testing
Type hints: Full type stub support for IDE autocompletion
Familiar API: Method names match Faker for easy migration

Locale Support

forgery supports 7 locales with locale-specific names, addresses, phone numbers, and more:

Locale	Language	Country
`en_US`	English	United States (default)
`en_GB`	English	United Kingdom
`de_DE`	German	Germany
`fr_FR`	French	France
`es_ES`	Spanish	Spain
`it_IT`	Italian	Italy
`ja_JP`	Japanese	Japan

from forgery import Faker

# Default locale is en_US
fake = Faker()
fake.names(5)  # American names

# Use a different locale
german = Faker("de_DE")
german.names(5)  # German names

japanese = Faker("ja_JP")
japanese.addresses(3)  # Japanese addresses with prefecture

Each locale provides:

Names: First names, last names, and full names in the local language
Addresses: Cities, regions/states, postal codes in the correct format
Phone numbers: Country-specific formats and country codes
Companies: Local company names and job titles
Colors: Color names in the local language
SSN/National IDs: Country-specific formats (US SSN, UK NINO, DE Steuer-ID, etc.)
License plates: Country-specific formats

API

Module-level functions (use default instance)

from forgery import seed, names, emails, integers, uuids
from forgery import name, email, integer, uuid  # single-value functions used below

seed(42)  # Seed for reproducibility

# Batch generation (fast path)
names(1000)           # list[str] of full names
emails(1000)          # list[str] of email addresses
integers(1000, 0, 100)  # list[int] in range
uuids(1000)           # list[str] of UUIDv4

# Single values
name()                # str
email()               # str
integer(0, 100)       # int
uuid()                # str

Faker class (independent instances)

from forgery import Faker

# Each instance has its own RNG state
fake1 = Faker()
fake2 = Faker()

fake1.seed(42)
fake2.seed(99)

# Generate independently
fake1.names(100)
fake2.emails(100)

Available Generators

Names & Identity

Batch	Single	Description
`names(n)`	`name()`	Full names (first + last)
`first_names(n)`	`first_name()`	First names
`last_names(n)`	`last_name()`	Last names

Contact Information

Batch	Single	Description
`emails(n)`	`email()`	Email addresses
`safe_emails(n)`	`safe_email()`	Safe domain emails (@example.com, etc.)
`free_emails(n)`	`free_email()`	Free provider emails (@gmail.com, etc.)
`phone_numbers(n)`	`phone_number()`	Phone numbers: (XXX) XXX-XXXX for en_US, country-specific formats for other locales

Numbers & Identifiers

Batch	Single	Description
`integers(n, min, max)`	`integer(min, max)`	Random integers in range
`floats(n, min, max)`	`float_(min, max)`	Random floats in range (Note: `float_` avoids shadowing Python's `float` builtin)
`uuids(n)`	`uuid()`	UUID v4 strings
`md5s(n)`	`md5()`	Random 32-char hex strings (MD5-like format, not cryptographic hashes)
`sha256s(n)`	`sha256()`	Random 64-char hex strings (SHA256-like format, not cryptographic hashes)

Dates & Times

Batch	Single	Description
`dates(n, start, end)`	`date(start, end)`	Random dates (YYYY-MM-DD)
`datetimes(n, start, end)`	`datetime_(start, end)`	Random datetimes (ISO 8601). Note: `datetime_` avoids shadowing Python's `datetime` module
`dates_of_birth(n, min_age, max_age)`	`date_of_birth(min_age, max_age)`	Birth dates for given age range

Addresses

Batch	Single	Description
`street_addresses(n)`	`street_address()`	Street addresses (e.g., "123 Main Street")
`cities(n)`	`city()`	City names
`states(n)`	`state()`	State names
`countries(n)`	`country()`	Country names
`zip_codes(n)`	`zip_code()`	ZIP codes (5 or 9 digit)
`addresses(n)`	`address()`	Full addresses

Company & Business

Batch	Single	Description
`companies(n)`	`company()`	Company names
`jobs(n)`	`job()`	Job titles
`catch_phrases(n)`	`catch_phrase()`	Business catch phrases

Network

Batch	Single	Description
`urls(n)`	`url()`	URLs with https://
`domain_names(n)`	`domain_name()`	Domain names
`ipv4s(n)`	`ipv4()`	IPv4 addresses
`ipv6s(n)`	`ipv6()`	IPv6 addresses
`mac_addresses(n)`	`mac_address()`	MAC addresses

Web & HTML

Batch	Single	Description
`url_paths(n)`	`url_path()`	URL paths (e.g., "/blog/products/42")
`url_slugs(n)`	`url_slug()`	URL slugs (e.g., "ultimate-guide-2024")
`query_strings(n)`	`query_string()`	Query strings (e.g., "?page=2&sort=date")
`meta_descriptions(n)`	`meta_description()`	HTML meta description tags
`og_tags_batch(n)`	`og_tags()`	Open Graph meta tag sets (multi-line)
`hreflang_tags_batch(n)`	`hreflang_tags()`	Hreflang link tag sets with x-default
`img_tags(n, ratio)`	`img_tag(ratio)`	Image tags (configurable missing alt ratio)
`content_type_headers(n)`	`content_type_header()`	Content-Type header values
`http_headers_batch(n)`	`http_headers()`	HTTP response header dicts
`robots_txts(n)`	`robots_txt()`	robots.txt file contents
`html_pages(n, ...)`	`html_page(...)`	Full HTML5 pages with configurable SEO elements
-	`website(pages, domain)`	Interlinked website (dict of URL → HTML)

from forgery import Faker

fake = Faker()
fake.seed(42)

# Generate a full HTML page with SEO elements
page = fake.html_page(
    headings=4,
    internal_links=5,
    images=3,
    include_og_tags=True,
    domain="mysite.com",
)

# Generate an interlinked website for crawl testing
site = fake.website(pages=20, domain="example.com")
# site = {"https://example.com/": "<html>...", "https://example.com/blog/guide": "<html>...", ...}
# Every page is reachable from the homepage via link traversal

Finance

Batch	Single	Description
`credit_cards(n)`	`credit_card()`	Credit card numbers (valid Luhn)
`credit_card_providers(n)`	`credit_card_provider()`	Card network name (Visa, Mastercard, Amex, Discover)
`credit_card_expires(n)`	`credit_card_expire()`	Expiry date in MM/YY format
`credit_card_security_codes(n)`	`credit_card_security_code()`	CVV: 3 digits (Visa/MC/Discover) or 4 digits (Amex)
`credit_card_fulls(n)`	`credit_card_full()`	Complete card info dict (provider, number, expire, security_code, name)
`ibans(n)`	`iban()`	IBAN numbers (valid checksum)
`bics(n)`	`bic()`	BIC/SWIFT codes (8 or 11 characters)
`bank_accounts(n)`	`bank_account()`	Bank account numbers (8-17 digits)
`bank_names(n)`	`bank_name()`	Bank names (locale-specific)
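
The "valid Luhn" and "valid checksum" claims in the table above can be checked independently of forgery. The following is a minimal pure-Python sketch of the two standard checks (Luhn for card numbers, ISO 13616 mod-97 for IBANs). The helper names `luhn_valid` and `iban_valid` are hypothetical and are not part of forgery's API.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isascii() and ch.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk right to left, doubling every second digit.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def iban_valid(iban: str) -> bool:
    """Return True if `iban` passes the mod-97 check (length rules per country are not checked)."""
    compact = iban.replace(" ", "").upper()
    if len(compact) < 5 or not (compact.isascii() and compact.isalnum()):
        return False
    # Move country code + check digits to the end, then map A=10 ... Z=35.
    rearranged = compact[4:] + compact[:4]
    numeric = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(numeric) % 97 == 1
```

In a test suite, helpers like these could be applied to every value returned by `credit_cards(n)` or `ibans(n)` to confirm the documented guarantee.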

Currency

Batch	Single	Description
`currency_codes(n)`	`currency_code()`	ISO 4217 currency codes (e.g., "USD", "EUR")
`currency_names(n)`	`currency_name()`	Currency names in English (e.g., "United States Dollar")
`currencies(n)`	`currency()`	(code, name) tuples
`prices(n, min, max)`	`price(min, max)`	Prices with 2 decimal places

UK Banking

Batch	Single	Description
`sort_codes(n)`	`sort_code()`	UK sort codes (XX-XX-XX format)
`uk_account_numbers(n)`	`uk_account_number()`	UK account numbers (exactly 8 digits)
`transaction_amounts(n, min, max)`	`transaction_amount(min, max)`	Transaction amounts (2 decimal places)
`transactions(n, balance, start, end)`	-	Full transaction records with running balance

Passwords

Batch	Single	Description
`passwords(n, ...)`	`password(...)`	Random passwords with configurable character sets

Password options:

length: Password length (default: 12)
uppercase: Include uppercase letters (default: True)
lowercase: Include lowercase letters (default: True)
digits: Include digits (default: True)
symbols: Include symbols (default: True)

Text & Lorem Ipsum

Batch	Single	Description
`sentences(n, word_count)`	`sentence(word_count)`	Lorem ipsum sentences
`paragraphs(n, sentence_count)`	`paragraph(sentence_count)`	Lorem ipsum paragraphs
`texts(n, min_chars, max_chars)`	`text(min_chars, max_chars)`	Text blocks with length limits

Colors

Batch	Single	Description
`colors(n)`	`color()`	Color names
`hex_colors(n)`	`hex_color()`	Hex color codes (#RRGGBB)
`rgb_colors(n)`	`rgb_color()`	RGB tuples (r, g, b)

Geographic

Batch	Single	Description
`latitudes(n)`	`latitude()`	Random latitude in [-90.0, 90.0]
`longitudes(n)`	`longitude()`	Random longitude in [-180.0, 180.0]
`coordinates(n)`	`coordinate()`	(latitude, longitude) tuples

User Agents

Batch	Single	Description
`user_agents(n)`	`user_agent()`	Random browser user agent string (any browser)
`chromes(n)`	`chrome()`	Chrome user agent string
`firefoxes(n)`	`firefox()`	Firefox user agent string
`safaris(n)`	`safari()`	Safari user agent string

Booleans

Batch	Single	Description
`booleans(n, probability)`	`boolean(probability)`	Random booleans (default: 50% True)

String Pattern Templates

Batch	Single	Description
`numerify_batch(pattern, n)`	`numerify(pattern)`	Replace `#` with random digits (0-9)
`letterify_batch(pattern, n)`	`letterify(pattern)`	Replace `?` with random lowercase letters (a-z)
`bothify_batch(pattern, n)`	`bothify(pattern)`	Replace `#` with digits and `?` with lowercase letters
`lexify_batch(pattern, n)`	`lexify(pattern)`	Replace `?` with random uppercase letters (A-Z)

from forgery import Faker

fake = Faker()
fake.numerify("###-###-####")   # "847-321-9056"
fake.letterify("??-??")         # "kx-bp"
fake.bothify("??-####")         # "mz-7314"
fake.lexify("???-###")          # "QWR-###" (only ? is replaced)
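
To make the placeholder rules in the table concrete, here is a pure-Python model of the four functions as they are described above. It is only a sketch of the documented semantics, not forgery's Rust implementation, and the `sketch_*` names are hypothetical.

```python
import random
import string


def fill(pattern: str, rng: random.Random, hash_pool: str = "", question_pool: str = "") -> str:
    """Replace '#' and '?' with random picks; an empty pool leaves that placeholder untouched."""
    out = []
    for ch in pattern:
        if ch == "#" and hash_pool:
            out.append(rng.choice(hash_pool))
        elif ch == "?" and question_pool:
            out.append(rng.choice(question_pool))
        else:
            out.append(ch)
    return "".join(out)


def sketch_numerify(pattern: str, rng: random.Random) -> str:
    return fill(pattern, rng, hash_pool=string.digits)


def sketch_letterify(pattern: str, rng: random.Random) -> str:
    return fill(pattern, rng, question_pool=string.ascii_lowercase)


def sketch_bothify(pattern: str, rng: random.Random) -> str:
    return fill(pattern, rng, hash_pool=string.digits, question_pool=string.ascii_lowercase)


def sketch_lexify(pattern: str, rng: random.Random) -> str:
    # Per the table above, only '?' is replaced, with uppercase letters.
    return fill(pattern, rng, question_pool=string.ascii_uppercase)
```

Any character other than `#` or `?` passes through unchanged, which is why separators such as `-` survive in the examples above.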

Barcode

Batch	Single	Description
`ean13s(n)`	`ean13()`	EAN-13 barcodes (valid check digit)
`ean8s(n)`	`ean8()`	EAN-8 barcodes (valid check digit)
`upc_as(n)`	`upc_a()`	UPC-A barcodes (valid check digit)
`upc_es(n)`	`upc_e()`	UPC-E barcodes (valid check digit)
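
The check digit shared by EAN-13, EAN-8 and UPC-A is the standard GS1 mod-10 rule: weights of 3 and 1 alternate starting from the rightmost payload digit. A minimal sketch of a validator follows; `gtin_check_digit` and `gtin_valid` are hypothetical helper names. UPC-E is deliberately not covered, since its check digit is computed on the expanded UPC-A form.

```python
def gtin_check_digit(payload: str) -> int:
    """Compute the GS1 mod-10 check digit for the digits preceding it."""
    total = 0
    for index, ch in enumerate(reversed(payload)):
        weight = 3 if index % 2 == 0 else 1  # rightmost payload digit gets weight 3
        total += int(ch) * weight
    return (10 - total % 10) % 10


def gtin_valid(code: str) -> bool:
    """Validate an EAN-13, UPC-A or EAN-8 code given as a plain digit string."""
    if not (code.isascii() and code.isdigit()) or len(code) not in (8, 12, 13):
        return False
    return gtin_check_digit(code[:-1]) == int(code[-1])
```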

ISBN

Batch	Single	Description
`isbn10s(n)`	`isbn10()`	ISBN-10 with hyphens (valid check digit, may end in X)
`isbn13s(n)`	`isbn13()`	ISBN-13 with hyphens (978/979 prefix, valid check digit)
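
For reference, the two ISBN check rules are different: ISBN-10 uses a weighted sum modulo 11 (which is why the last character may be X, standing for 10), while ISBN-13 uses alternating weights of 1 and 3 modulo 10. A small sketch of both checks, with hypothetical helper names:

```python
def isbn10_valid(isbn: str) -> bool:
    """Validate a hyphenated or plain ISBN-10, allowing a trailing X."""
    chars = isbn.replace("-", "").upper()
    if len(chars) != 10:
        return False
    total = 0
    for position, ch in enumerate(chars):
        if ch == "X" and position == 9:
            value = 10
        elif ch in "0123456789":
            value = int(ch)
        else:
            return False
        total += (10 - position) * value  # weights 10, 9, ..., 1
    return total % 11 == 0


def isbn13_valid(isbn: str) -> bool:
    """Validate a hyphenated or plain ISBN-13."""
    chars = isbn.replace("-", "")
    if len(chars) != 13 or not (chars.isascii() and chars.isdigit()):
        return False
    total = sum(int(ch) * (1 if i % 2 == 0 else 3) for i, ch in enumerate(chars))
    return total % 10 == 0
```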

File/System

Batch	Single	Description
`file_names(n)`	`file_name()`	File names with extension (e.g., "report.pdf")
`file_extensions(n)`	`file_extension()`	File extensions (e.g., "pdf", "csv")
`mime_types(n)`	`mime_type()`	MIME types (e.g., "application/pdf")
`file_paths(n)`	`file_path_()`	File paths (e.g., "/home/user/documents/report.pdf")

Commerce/Product

Batch	Single	Description
`product_names(n)`	`product_name()`	Product names (e.g., "Ergonomic Steel Chair")
`product_categories(n)`	`product_category()`	Product categories (e.g., "Electronics")
`departments(n)`	`department()`	Store departments (e.g., "Home & Garden")
`product_materials(n)`	`product_material()`	Product materials (e.g., "Cotton", "Steel")

SSN/National ID

Batch	Single	Description
`ssns(n)`	`ssn()`	Locale-specific national ID numbers

Formats by locale:

Locale	Format	Example
`en_US`	SSN (XXX-XX-XXXX)	`"123-45-6789"`
`en_GB`	NI Number (XX 99 99 99 X)	`"AB 12 34 56 C"`
`de_DE`	Steuer-ID (11 digits)	`"12345678901"`
`fr_FR`	NSS (15 digits with check key)	`"185076923400145"`
`es_ES`	DNI (8 digits + letter)	`"12345678Z"`
`it_IT`	Codice Fiscale (16 alphanumeric)	`"RSSMRA85M01H501Z"`
`ja_JP`	My Number (12 digits with check)	`"123456789012"`

Vehicle/Automotive

Batch	Single	Description
`license_plates(n)`	`license_plate()`	Locale-specific license plates
`vehicle_makes(n)`	`vehicle_make()`	Vehicle manufacturers (e.g., "Toyota")
`vehicle_models(n)`	`vehicle_model()`	Vehicle models (e.g., "Camry")
`vehicle_years(n)`	`vehicle_year()`	Model years (1990-2026)
`vins(n)`	`vin()`	17-character VINs (valid check digit, no I/O/Q)
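
The "valid check digit, no I/O/Q" description matches the North American VIN rule, where position 9 is a mod-11 check character. The sketch below assumes that rule applies; `vin_valid` is a hypothetical helper and not part of forgery.

```python
# Letter-to-number transliteration; I, O and Q are intentionally absent.
TRANSLITERATION = {
    **{str(d): d for d in range(10)},
    **dict(zip("ABCDEFGH", range(1, 9))),
    **dict(zip("JKLMN", range(1, 6))),
    "P": 7,
    "R": 9,
    **dict(zip("STUVWXYZ", range(2, 10))),
}
# Position weights; the check digit itself (position 9) has weight 0.
WEIGHTS = (8, 7, 6, 5, 4, 3, 2, 10, 0, 9, 8, 7, 6, 5, 4, 3, 2)


def vin_valid(vin: str) -> bool:
    """Return True if `vin` has 17 legal characters and a correct check character."""
    vin = vin.upper()
    if len(vin) != 17 or any(ch not in TRANSLITERATION for ch in vin):
        return False
    total = sum(TRANSLITERATION[ch] * weight for ch, weight in zip(vin, WEIGHTS))
    remainder = total % 11
    expected = "X" if remainder == 10 else str(remainder)
    return vin[8] == expected
```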

License plate formats by locale:

Locale	Format	Example
`en_US`	ABC-1234	`"KHX-4829"`
`en_GB`	AB12 CDE	`"LM65 NXR"`
`de_DE`	X AB 1234	`"B KL 3847"`
`fr_FR`	AB-123-CD	`"FG-482-HJ"`
`es_ES`	1234 ABC	`"4829 FKH"`
`it_IT`	AB 123 CD	`"FG 482 HJ"`
`ja_JP`	300 12-34	`"500 38-47"`

Package Registry Data

For seeding test databases of package registries (PyPI, npm, Maven, Cargo, RubyGems). Cross-ecosystem primitives share one API; ecosystem-specific shapes have their own methods.

Cross-ecosystem primitives

Batch	Single	Description
`commit_shas(n)`	`commit_sha()`	40-hex-char git commit SHA
`short_commit_shas(n)`	`short_commit_sha()`	7-hex-char short SHA
`semvers(n)`	`semver()`	SemVer `MAJOR.MINOR.PATCH`
`semver_prereleases(n)`	`semver_prerelease()`	Pre-release (e.g. `1.2.3-alpha.1+build.5`)
`calvers(n)`	`calver()`	CalVer in mixed schemes (`YYYY.MM.DD`, `YY.MM`, ...)
`spdx_licenses(n)`	`spdx_license()`	SPDX identifier (50 common IDs)
`git_usernames(n)`	`git_username()`	GitHub/GitLab/Bitbucket-compatible username

Ecosystem-specific versions (where SemVer alone doesn't cover the format)

Batch	Single	Description
`pypi_versions(n)`	`pypi_version()`	PEP 440 (pre/post/dev releases)
`maven_versions(n)`	`maven_version()`	Maven version with qualifiers (`-SNAPSHOT`, `.RELEASE`, ...)

Version constraints

Batch	Single	Description
`pypi_version_specifiers(n)`	`pypi_version_specifier()`	PEP 440 (e.g. `>=1.2,<2.0`, `~=1.0`)
`npm_version_ranges(n)`	`npm_version_range()`	npm (e.g. `^1.2.3`, `~1.2.3`, `1.x`)
`cargo_version_reqs(n)`	`cargo_version_req()`	Cargo (e.g. `^1.0`, `~1.2`)
`maven_version_ranges(n)`	`maven_version_range()`	Maven (e.g. `[1.0,2.0)`)
`gem_version_requirements(n)`	`gem_version_requirement()`	RubyGems (e.g. `~> 1.2`)

Package identity

Batch	Single	Description
`pypi_package_names(n)`	`pypi_package_name()`	PEP 503 normalised (lowercase `[a-z0-9-]`)
`npm_package_names(n)`	`npm_package_name()`	Plain or `@scope/pkg` (~30% scoped)
`cargo_package_names(n)`	`cargo_package_name()`	Rust-ident flavour
`gem_names(n)`	`gem_name()`	RubyGems gem name
`maven_group_ids(n)`	`maven_group_id()`	Reverse domain (e.g. `com.example.tools`)
`maven_artifact_ids(n)`	`maven_artifact_id()`	Lowercase with hyphens
`maven_coordinates(n)`	`maven_coordinate()`	GAV (`group:artifact:version`)
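
"PEP 503 normalised" refers to the name normalisation rule from PEP 503: runs of `-`, `_` and `.` collapse to a single `-`, and the result is lowercased. A short sketch of that rule and of a check that a name is already in normalised form (the function names are hypothetical):

```python
import re


def normalize_pypi_name(name: str) -> str:
    """Apply the PEP 503 normalisation rule to a project name."""
    return re.sub(r"[-_.]+", "-", name).lower()


def is_normalized(name: str) -> bool:
    """True if normalising the name would leave it unchanged."""
    return name == normalize_pypi_name(name)
```

A registry test could assert `is_normalized(n)` for every value returned by `pypi_package_names()`.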

Full requirement lines

Batch	Single	Description
`pypi_requirements(n)`	`pypi_requirement()`	e.g. `requests>=2.0.0,<3.0.0`

from forgery import Faker

fake = Faker()
fake.seed(42)
fake.pypi_requirement()       # 'requests>=2.0.0,<3.0.0'
fake.maven_coordinate()       # 'com.example.tools:widget-core:1.2.3-SNAPSHOT'
fake.npm_package_name()       # '@types/fast-parser'
fake.spdx_license()           # 'Apache-2.0'
fake.git_username()           # 'tiny-logger42'
fake.commit_sha()             # 'a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4e5f6a1b2'

The nine batch methods below accept unique=True for no-duplicate output, matching the names(n, unique=True) pattern — useful when seeding registry tables that have a unique-name constraint. Exhausting the combinatorial pool raises ValueError:

fake.pypi_package_names(100, unique=True)   # 100 distinct package names
fake.maven_coordinates(500, unique=True)    # 500 distinct GAVs
fake.spdx_licenses(60, unique=True)         # ValueError: only 50 SPDX IDs available

Methods with unique support: pypi_package_names, npm_package_names, cargo_package_names, gem_names, maven_group_ids, maven_artifact_ids, maven_coordinates, git_usernames, spdx_licenses.

Profile

Batch	Single	Description
`profiles(n)`	`profile()`	Complete personal profiles (returns dict)

Each profile dict contains: first_name, last_name, name, email, phone, address, city, state, zip_code, country, company, job, date_of_birth.

from forgery import Faker

fake = Faker()
fake.seed(42)
p = fake.profile()
# {"first_name": "Ryan", "last_name": "Grant", "name": "Ryan Grant",
#  "email": "rgrant@example.com", "phone": "(555) 123-4567", ...}

Unique Value Generation

For batch methods that select from finite lists (names, cities, countries, etc.), you can request unique values:

from forgery import Faker

fake = Faker()
fake.seed(42)

# Generate 50 unique names (no duplicates)
unique_names = fake.names(50, unique=True)
assert len(unique_names) == len(set(unique_names))

# Generate 20 unique cities
unique_cities = fake.cities(20, unique=True)

# Generate 50 unique countries
unique_countries = fake.countries(50, unique=True)

Important Notes:

Unique generation will raise ValueError if you request more unique values than are available in the underlying data set.
Performance: Unique generation uses O(n) memory (all outputs are stored in a HashSet) and can take O(n × 100) time in the worst case because of retry logic. For very large unique batches, consider whether duplicates are actually a problem for your use case.
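
The cost model in the notes above can be pictured with a small pure-Python sketch: draw at random, keep a set of values already seen, and retry on duplicates up to a fixed cap. This is an illustrative model only. The cap of 100 attempts per value is an assumption read off the "O(n × 100)" figure, and `unique_sample` is a hypothetical name, not forgery's implementation.

```python
import random


def unique_sample(pool: list[str], n: int, rng: random.Random, max_attempts: int = 100) -> list[str]:
    """Draw n distinct values from pool, retrying duplicate draws."""
    distinct = len(set(pool))
    if n > distinct:
        raise ValueError(f"requested {n} unique values, only {distinct} available")
    seen: set[str] = set()      # O(n) extra memory
    result: list[str] = []
    for _ in range(n):
        for _attempt in range(max_attempts):
            candidate = rng.choice(pool)
            if candidate not in seen:
                seen.add(candidate)
                result.append(candidate)
                break
        else:
            # Every attempt hit a duplicate: give up rather than loop forever.
            raise ValueError("too many duplicate draws")
    return result
```

Retries become more frequent as n approaches the size of the pool, which is where the worst-case time comes from.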

Financial Transaction Generation

Generate realistic bank transaction data with running balances:

from forgery import Faker

fake = Faker()
fake.seed(42)

# Generate 50 transactions from Jan to Mar 2024, starting with £1000 balance
txns = fake.transactions(50, 1000.0, "2024-01-01", "2024-03-31")

for txn in txns[:3]:
    print(f"{txn['date']} | {txn['transaction_type']:15} | {txn['amount']:>10.2f} | {txn['balance']:>10.2f}")
# 2024-01-03 | Card Payment    |    -42.50 |     957.50
# 2024-01-05 | Direct Debit    |   -125.00 |     832.50
# 2024-01-08 | Faster Payment  |   1250.00 |    2082.50

Each transaction dict contains:

reference: 8-character alphanumeric reference
date: Transaction date (YYYY-MM-DD)
amount: Transaction amount (negative for debits)
transaction_type: e.g., "Card Payment", "Direct Debit", "Salary"
description: Merchant or payee name
balance: Running balance after transaction
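
Because each record carries both `amount` and `balance`, the running balance can be verified row by row. The sketch below uses the three rows from the example output above as hand-written sample data; `balances_consistent` is a hypothetical helper.

```python
def balances_consistent(opening_balance: float, transactions: list[dict], tolerance: float = 0.005) -> bool:
    """Check that each balance equals the previous balance plus the amount."""
    previous = opening_balance
    for txn in transactions:
        expected = previous + txn["amount"]
        if abs(expected - txn["balance"]) > tolerance:
            return False
        previous = txn["balance"]  # continue from the reported balance
    return True


# Rows copied from the example output above (other fields omitted).
sample = [
    {"date": "2024-01-03", "amount": -42.50, "balance": 957.50},
    {"date": "2024-01-05", "amount": -125.00, "balance": 832.50},
    {"date": "2024-01-08", "amount": 1250.00, "balance": 2082.50},
]
```

The tolerance of half a penny allows for floating-point rounding on two-decimal amounts.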

Structured Data Generation

Generate entire datasets with a single call using schema definitions:

records()

Returns a list of dictionaries:

from forgery import records, seed

seed(42)
data = records(1000, {
    "id": "uuid",
    "name": "name",
    "email": "email",
    "age": ("int", 18, 65),
    "salary": ("float", 30000.0, 150000.0),
    "hire_date": ("date", "2020-01-01", "2024-12-31"),
    "bio": ("text", 50, 200),
    "status": ("choice", ["active", "inactive", "pending"]),
})

# data[0] = {"id": "88917925-...", "name": "Austin Bell", "age": 50, ...}

records_tuples()

Returns a list of tuples (faster, values in alphabetical key order):

from forgery import records_tuples, seed

seed(42)
data = records_tuples(1000, {
    "age": ("int", 18, 65),
    "name": "name",
})
# data[0] = (50, "Ryan Grant")  # (age, name) - alphabetical order
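
Since the tuples carry no field names, the column order has to be recovered from the schema. A minimal sketch, assuming that "alphabetical key order" coincides with Python's `sorted()` on the schema keys (true for plain ASCII keys like those used here); `tuples_to_dicts` is a hypothetical helper:

```python
def tuples_to_dicts(schema: dict, rows: list[tuple]) -> list[dict]:
    """Re-attach field names to rows returned in alphabetical key order."""
    columns = sorted(schema)  # e.g. ["age", "name"]
    return [dict(zip(columns, row)) for row in rows]


schema = {"name": "name", "age": ("int", 18, 65)}
rows = [(50, "Ryan Grant")]  # hand-written sample in (age, name) order
records = tuples_to_dicts(schema, rows)
```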

records_arrow()

Returns a PyArrow RecordBatch for high-performance data processing:

import pyarrow as pa
from forgery import records_arrow, seed

seed(42)
batch = records_arrow(100_000, {
    "id": "uuid",
    "name": "name",
    "age": ("int", 18, 65),
    "salary": ("float", 30000.0, 150000.0),
})

# batch is a pyarrow.RecordBatch
print(batch.num_rows)     # 100000
print(batch.num_columns)  # 4
print(batch.schema)
# age: int64 not null
# id: string not null
# name: string not null
# salary: double not null

# Convert to pandas DataFrame
df = batch.to_pandas()

# Or to Polars DataFrame
import polars as pl
df_polars = pl.from_arrow(batch)

Note: Requires pyarrow to be installed: pip install pyarrow

The records_arrow() function generates data in columnar format, which is more efficient for large batches and integrates seamlessly with the Arrow ecosystem (PyArrow, Polars, pandas, DuckDB, etc.).

Serialized Output Formats

Generate records directly as serialized strings or bytes, avoiding the overhead of creating Python objects just to serialize them.

records_csv()

Returns a CSV string with a header row (fields in alphabetical order):

from forgery import records_csv, seed

seed(42)
csv_str = records_csv(1000, {
    "name": "name",
    "email": "email",
    "age": ("int", 18, 65),
})
# age,email,name
# 50,austin.bell@example.com,Austin Bell
# ...

records_json()

Returns a JSON array of objects:

from forgery import records_json, seed

seed(42)
json_str = records_json(1000, {
    "name": "name",
    "age": ("int", 18, 65),
    "active": "boolean",
})
# [{"active":true,"age":50,"name":"Austin Bell"},...]

Integer and float values are JSON numbers, booleans are JSON booleans, and tuples (e.g., RGB colors, coordinates) become JSON arrays.

records_ndjson()

Returns newline-delimited JSON (one JSON object per line, no trailing newline):

from forgery import records_ndjson, seed

seed(42)
ndjson_str = records_ndjson(1000, {
    "id": "uuid",
    "name": "name",
})
# {"id":"88917925-...","name":"Austin Bell"}
# {"id":"a3c1e7f2-...","name":"Maria Garcia"}
# ...
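
NDJSON output can be read back with the standard library alone, one `json.loads` call per line. The sample string below is hand-written for illustration (the `id` values are placeholders, not real forgery output), and `parse_ndjson` is a hypothetical helper.

```python
import json


def parse_ndjson(text: str) -> list[dict]:
    """Parse newline-delimited JSON, skipping blank lines."""
    return [json.loads(line) for line in text.split("\n") if line.strip()]


sample = '{"id":"id-1","name":"Austin Bell"}\n{"id":"id-2","name":"Maria Garcia"}'
rows = parse_ndjson(sample)
```

Splitting on `"\n"` rather than using `str.splitlines()` avoids breaking on other Unicode line separators that may legally appear inside JSON strings.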

records_parquet()

Returns Parquet file content as bytes (uses the Arrow path internally).

Note: Like records_arrow(), this uses column-major generation. With a fixed seed and multi-column schema, the row data will differ from the row-major methods (records_csv, records_json, records_ndjson, records_sql).

from forgery import records_parquet, seed

seed(42)
parquet_bytes = records_parquet(100_000, {
    "id": "uuid",
    "name": "name",
    "salary": ("float", 30000.0, 150000.0),
})

# Write to disk
with open("data.parquet", "wb") as f:
    f.write(parquet_bytes)

# Or load directly with PyArrow
import pyarrow.parquet as pq
import io
table = pq.read_table(io.BytesIO(parquet_bytes))

records_sql()

Returns ANSI SQL INSERT statements with properly escaped values:

from forgery import records_sql, seed

seed(42)
sql = records_sql(1000, {
    "name": "name",
    "email": "email",
    "age": ("int", 18, 65),
}, "users")
# INSERT INTO "users" ("age", "email", "name") VALUES
# (50, 'austin.bell@example.com', 'Austin Bell'),
# ...
# (34, 'maria.garcia@gmail.com', 'Maria Garcia');

For large batches, multiple INSERT statements are generated with up to 1000 rows each. Column names are double-quoted and string values use single-quote escaping.
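
As a rough illustration of the quoting and batching described above, the sketch below shows the usual ANSI conventions: embedded quotes are doubled, and rows are split into statements of at most 1000. That forgery doubles embedded quotes in exactly this way is an assumption, and all three helper names are hypothetical.

```python
import math


def quote_identifier(name: str) -> str:
    """Double-quote a column or table name, doubling embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def quote_string(value: str) -> str:
    """Single-quote a string literal, doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def statement_count(n_rows: int, rows_per_statement: int = 1000) -> int:
    """Number of INSERT statements needed for n_rows."""
    return math.ceil(n_rows / rows_per_statement)
```

With these rules a name such as O'Brien becomes `'O''Brien'`, and 2,500 rows need three INSERT statements.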

Streaming File Writer

For datasets that exceed available memory, records_to_file() generates records in bounded-memory chunks and writes each chunk to disk before generating the next. Memory usage is proportional to chunk_size, not total n.

from forgery import Faker

fake = Faker()
fake.seed(42)

# Generate 100 million records — memory stays at ~500-800 MB
fake.records_to_file(
    100_000_000,
    {"id": "uuid", "name": "name", "amount": ("float", 0.01, 9999.99)},
    "transactions.parquet",
    chunk_size=1_000_000,  # records per chunk (default: 1M, max: 10M)
)

Supported formats: CSV (.csv), NDJSON (.ndjson/.jsonl), SQL (.sql), Parquet (.parquet). Format is auto-detected from the file extension, or set explicitly with format="csv".

SQL format requires a table parameter:

from forgery import records_to_file, seed

seed(42)
records_to_file(
    50_000_000,
    {"name": "name", "email": "email"},
    "users.sql",
    table="users",
    chunk_size=500_000,
)

Progress callback — track progress with an optional callback:

from forgery import records_to_file, seed

seed(42)
records_to_file(
    10_000_000,
    {"name": "name", "email": "email"},
    "users.csv",
    on_progress=lambda written, total: print(f"\r{written/total:.0%}", end=""),
)

Memory estimation — plan chunk sizes based on available RAM:

from forgery import Faker

schema = {"id": "uuid", "name": "name", "amount": ("float", 0.01, 9999.99)}
est = Faker.estimate_memory(1_000_000, schema)
print(f"~{est / 1024**2:.0f} MB per 1M records")
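
Given such an estimate, a chunk size for a memory budget is simple arithmetic. The sketch below assumes the estimate is a byte count that scales linearly with the number of records, and clamps to the documented 10M maximum; `plan_chunk_size` is a hypothetical helper.

```python
def plan_chunk_size(bytes_per_million: int, memory_budget_bytes: int, max_chunk: int = 10_000_000) -> int:
    """Largest chunk size that fits the budget, assuming linear scaling."""
    if bytes_per_million <= 0:
        raise ValueError("estimate must be positive")
    # Integer arithmetic avoids floating-point rounding at exact boundaries.
    affordable = memory_budget_bytes * 1_000_000 // bytes_per_million
    return max(1, min(affordable, max_chunk))


# Example: an estimate of 600 MiB per 1M records and a 300 MiB budget.
chunk = plan_chunk_size(600 * 1024**2, 300 * 1024**2)
```

The result could then be passed as `chunk_size` to `records_to_file()`; leaving some headroom below the real memory limit is advisable because the estimate is approximate.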

All streaming formats use row-major generation, so the same seed produces identical data across CSV, NDJSON, SQL, and Parquet output. This differs from the in-memory records_parquet(), which uses column-major generation.

Schema Field Types

Type	Syntax	Example
Simple types	`"type_name"`	`"name"`, `"email"`, `"uuid"`, `"int"`, `"float"`
Integer range	`("int", min, max)`	`("int", 18, 65)`
Float range	`("float", min, max)`	`("float", 0.0, 100.0)`
Text with limits	`("text", min_chars, max_chars)`	`("text", 50, 200)`
Date range	`("date", start, end)`	`("date", "2020-01-01", "2024-12-31")`
Choice	`("choice", [options])`	`("choice", ["a", "b", "c"])`

All simple types from the generators above are supported: name, first_name, last_name, email, safe_email, free_email, phone, uuid, int, float, date, datetime, street_address, city, state, country, zip_code, address, company, job, catch_phrase, url, domain_name, ipv4, ipv6, mac_address, credit_card, iban, sentence, paragraph, text, color, hex_color, rgb_color, md5, sha256, latitude, longitude, coordinate, boolean, ssn, file_name, file_extension, mime_type, file_path, license_plate, vehicle_make, vehicle_model, vehicle_year, vin, ean13, ean8, upc_a, upc_e, isbn10, isbn13, product_name, product_category, department, product_material, url_path, url_slug, query_string.

Async Generation

For large datasets (millions of records), async methods prevent blocking the Python event loop:

records_async()

import asyncio
from forgery import records_async, seed

async def main():
    seed(42)
    records = await records_async(1_000_000, {
        "id": "uuid",
        "name": "name",
        "email": "email",
    })
    print(f"Generated {len(records)} records")

asyncio.run(main())

records_tuples_async()

import asyncio
from forgery import records_tuples_async, seed

async def main():
    seed(42)
    records = await records_tuples_async(1_000_000, {
        "age": ("int", 18, 65),
        "name": "name",
    })
    return records

asyncio.run(main())

records_arrow_async()

import asyncio
from forgery import records_arrow_async, seed

async def main():
    seed(42)
    batch = await records_arrow_async(1_000_000, {
        "id": "uuid",
        "name": "name",
        "salary": ("float", 30000.0, 150000.0),
    })
    return batch.to_pandas()

asyncio.run(main())

All async methods accept an optional chunk_size parameter (default: 10,000) that controls how frequently control is yielded to the event loop. Smaller chunks yield more frequently but have slightly higher overhead.

Note: Async methods use a snapshot of the RNG state at call time. The main Faker instance's RNG is not advanced, so two consecutive calls to the same async method return identical results unless you reseed in between. For distinct results across multiple async calls, use different seeds or different Faker instances.

Arrow async chunking caveat: For records_arrow_async(), when n > chunk_size, the output differs from records_arrow() due to column-major RNG consumption within each chunk. If you need identical results to the sync version, set chunk_size >= n. The records_async() and records_tuples_async() methods always match their sync counterparts regardless of chunk size.
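
The effect of `chunk_size` can be modelled with plain asyncio: generate one chunk, then yield to the event loop before starting the next. The sketch below is a pure-Python stand-in, not forgery's implementation; `generate_chunked` and `demo` are hypothetical names. Because a single RNG is consumed row by row, the output does not depend on the chunk size, mirroring the behaviour described for records_async().

```python
import asyncio
import random


async def generate_chunked(n: int, chunk_size: int = 10_000, seed: int = 42) -> list[int]:
    rng = random.Random(seed)  # one RNG, consumed sequentially
    values: list[int] = []
    for start in range(0, n, chunk_size):
        count = min(chunk_size, n - start)
        values.extend(rng.randrange(100) for _ in range(count))
        await asyncio.sleep(0)  # hand control back to the event loop
    return values


async def demo() -> tuple[int, int]:
    """Run generation next to a heartbeat task to show the loop stays responsive."""
    ticks = 0

    async def heartbeat() -> None:
        nonlocal ticks
        while True:
            ticks += 1
            await asyncio.sleep(0)

    task = asyncio.create_task(heartbeat())
    values = await generate_chunked(50_000)
    task.cancel()
    await asyncio.gather(task, return_exceptions=True)
    return len(values), ticks
```

Smaller chunks give the heartbeat more turns at the cost of more context switches, which is the trade-off described above.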

Custom Providers

Register your own data providers for domain-specific generation:

Basic Custom Provider

from forgery import Faker

fake = Faker()

# Register a uniform (equal probability) provider
fake.add_provider("team", ["Engineering", "Sales", "HR", "Marketing"])

# Generate values
team = fake.generate("team")
teams = fake.generate_batch("team", 100)

Weighted Custom Provider

# Register a weighted provider (higher weights = more likely)
fake.add_weighted_provider("status", [
    ("active", 80),    # 80% probability
    ("inactive", 20),  # 20% probability
])

# Generate with weighted distribution
statuses = fake.generate_batch("status", 1000)
# Expect ~800 "active", ~200 "inactive"
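
Weights are relative, so (80, 20) behaves like (4, 1). The same sampling idea can be reproduced with the standard library's `random.choices`, shown here as a sketch of the concept rather than of forgery's internals; `weighted_batch` is a hypothetical helper.

```python
import random
from collections import Counter


def weighted_batch(options: list[tuple[str, int]], n: int, rng: random.Random) -> list[str]:
    """Draw n values, each picked with probability proportional to its weight."""
    values = [value for value, _ in options]
    weights = [weight for _, weight in options]
    return rng.choices(values, weights=weights, k=n)


counts = Counter(weighted_batch([("active", 80), ("inactive", 20)], 10_000, random.Random(42)))
```

The observed split varies from run to run around the expected 80/20 ratio; only the long-run proportion is fixed by the weights.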

Custom Providers in Records

Custom providers integrate seamlessly with records():

from forgery import Faker

fake = Faker()
fake.add_provider("team", ["Eng", "Sales", "HR"])
fake.add_weighted_provider("priority", [("high", 20), ("medium", 50), ("low", 30)])

data = fake.records(1000, {
    "id": "uuid",
    "name": "name",
    "team": "team",              # Custom provider
    "priority": "priority",      # Weighted custom provider
})

Provider Management

fake.has_provider("team")  # Check if provider exists
fake.list_providers()      # List all custom provider names
fake.remove_provider("team")  # Remove a provider

Module-level Convenience

from forgery import add_provider, generate, generate_batch, seed

seed(42)
add_provider("tier", ["gold", "silver", "bronze"])
tier = generate("tier")
tiers = generate_batch("tier", 100)

Note: Custom provider names cannot conflict with built-in types (e.g., "name", "email", "uuid").

Performance

Benchmark generating 100,000 items:

Names:
  forgery.names():  0.015s
  Faker.name():     1.523s
  Speedup: 101x

Emails:
  forgery.emails():  0.021s
  Faker.email():     2.134s
  Speedup: 101x

Benchmark generating 1,000,000 items:

Names:
  forgery.names():   0.108s
  Faker.name():     47.111s
  Speedup: 436x

Emails:
  forgery.emails():   0.167s
  Faker.email():     46.984s
  Speedup: 281x

Seeding Contract

seed(n) affects the default fake instance only
Each Faker instance has its own independent RNG state
Single-threaded determinism only: Results are reproducible within one thread
No cross-version guarantee: Output may differ between forgery versions
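
The first two points of the contract can be illustrated with a toy class in which every instance owns its own `random.Random`. This is a conceptual model in pure Python, not forgery's code, and `SketchFaker` is a hypothetical name.

```python
import random


class SketchFaker:
    """Toy generator: each instance holds independent RNG state."""

    def __init__(self) -> None:
        self._rng = random.Random()

    def seed(self, n: int) -> None:
        self._rng.seed(n)

    def integers(self, n: int, low: int, high: int) -> list[int]:
        return [self._rng.randint(low, high) for _ in range(n)]


a, b = SketchFaker(), SketchFaker()
a.seed(42)
first = a.integers(5, 0, 100)

a.seed(42)
b.seed(99)
b.integers(100, 0, 100)           # advancing b's state...
second = a.integers(5, 0, 100)    # ...does not disturb a
```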

Parallel Generation

For large batches, enable parallel generation to split work across multiple CPU cores:

from forgery import Faker

fake = Faker()
fake.seed(42)
fake.set_parallel(True)  # Auto-detect thread count

# All batch methods now run in parallel
names = fake.names(1_000_000)      # ~3.3x faster than sequential
emails = fake.emails(1_000_000)
uuids = fake.uuids(1_000_000)

# Explicit thread count (useful for reproducibility across machines)
fake.set_parallel(True, num_threads=4)

# Check current settings
fake.get_parallel()      # True
fake.get_num_threads()   # 4

# Disable parallel
fake.set_parallel(False)

Determinism contract:

Same seed + same num_threads = identical output
Changing num_threads produces different output
unique=True always uses the sequential path

Performance (names benchmark):

Batch Size	Sequential	Parallel	Speedup
10,000	443 µs	753 µs	0.6x (overhead)
100,000	8.5 ms	2.5 ms	3.4x
1,000,000	83 ms	25 ms	3.3x

Auto-detection ensures parallelism is only used when beneficial (minimum 1,000 items per thread).
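
One way to see why the thread count is part of the determinism contract: if a batch is split into one chunk per thread and each chunk gets its own RNG derived from the base seed, the output depends on where the chunk boundaries fall. The sketch below models that idea sequentially in pure Python. The seed-derivation scheme is an assumption made for illustration, not forgery's documented algorithm, and `parallel_like` is a hypothetical name.

```python
import random


def parallel_like(seed: int, n: int, num_threads: int) -> list[int]:
    """Model per-chunk RNGs: chunk i is generated from a seed derived from (seed, i)."""
    base, extra = divmod(n, num_threads)
    out: list[int] = []
    for index in range(num_threads):
        count = base + (1 if index < extra else 0)
        rng = random.Random(seed * 1_000_003 + index)  # assumed derivation scheme
        out.extend(rng.randrange(10**9) for _ in range(count))
    return out


same_a = parallel_like(42, 100, 4)
same_b = parallel_like(42, 100, 4)
other = parallel_like(42, 100, 2)
```

Here `same_a` and `same_b` match, while `other` differs because its chunks start and end at different positions. Pinning `num_threads` explicitly keeps results reproducible across machines with different core counts.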

Thread Safety

forgery is NOT thread-safe. Each Faker instance maintains mutable RNG state.

For multi-threaded applications, create one Faker instance per thread:

from concurrent.futures import ThreadPoolExecutor
from forgery import Faker

def generate_names(seed: int) -> list[str]:
    fake = Faker()  # Create per-thread instance
    fake.seed(seed)
    return fake.names(1000)

with ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(generate_names, range(4)))

Do NOT share a Faker instance across threads.

Note: set_parallel(True) uses Rayon's internal thread pool for parallel generation within a single Faker instance. This is different from sharing a Faker across Python threads, which remains unsafe.

Development

# Install Rust
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# Install maturin
pip install maturin

# Build and install locally
maturin develop --release

# Run tests
cargo test          # Rust tests
pytest              # Python tests

# Run benchmarks
python tests/benchmarks/bench_vs_faker.py

License

MIT
