
feat: streaming file writer for bounded-memory generation #46

Merged
williajm merged 5 commits into main from feat/streaming-file-writer
Apr 3, 2026

Conversation


@williajm williajm commented Apr 3, 2026

Summary

  • Adds records_to_file() for generating arbitrarily large datasets (100M+ records) with bounded memory, by writing records in configurable chunks to disk
  • Supports CSV, NDJSON, SQL, and Parquet formats with auto-detection from file extension
  • Includes estimate_memory() utility for planning chunk sizes based on available RAM
  • Progress callback (on_progress) for tracking long-running writes, with proper exception propagation preserving the original Python exception type
  • I/O errors surface as OSError/FileNotFoundError/PermissionError (not ValueError)
  • Schema validated before file creation to avoid truncating existing files on invalid input

Example

from forgery import Faker

fake = Faker(seed=42)
fake.records_to_file(
    100_000_000,
    {"id": "uuid", "name": "name", "amount": ("float", 0.01, 9999.99)},
    "transactions.parquet",
    chunk_size=1_000_000,
    on_progress=lambda w, t: print(f"\r{w/t:.0%}", end=""),
)
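
The chunk-size planning that estimate_memory() is described as doing can be approximated with a back-of-envelope sketch. The per-field byte costs and the plan_chunk_size helper below are illustrative assumptions, not the library's actual implementation:

```python
# Hypothetical per-field in-memory costs (assumed values, for illustration only).
FIELD_BYTES = {"uuid": 36, "name": 24, "float": 8, "int": 8}

def plan_chunk_size(schema: dict, ram_budget_bytes: int) -> int:
    """Largest chunk whose in-memory footprint fits the RAM budget."""
    # A tuple spec like ("float", 0.01, 9999.99) keys on its first element.
    per_record = sum(
        FIELD_BYTES[spec[0] if isinstance(spec, tuple) else spec]
        for spec in schema.values()
    )
    return max(1, ram_budget_bytes // per_record)

schema = {"id": "uuid", "name": "name", "amount": ("float", 0.01, 9999.99)}
chunk = plan_chunk_size(schema, 512 * 1024 * 1024)  # 512 MiB budget
```

With these assumed sizes a record costs 68 bytes, so a 512 MiB budget allows a chunk of several million records; the real utility presumably accounts for format-specific overhead as well.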

Test plan

  • 25 Rust unit tests in file_writer.rs (format detection, all 4 writers, chunking, determinism, progress, errors, memory estimation)
  • 40 Python integration tests in test_file_writer.py (all formats, auto-detection, chunking equivalence, progress callbacks, exception preservation, OSError mapping, schema validation, custom providers, 10k-record stress tests)
  • 2 convenience function tests in test_convenience_new.py
  • Full suite: 802 Rust tests, 1335 Python tests, 100% Python coverage
  • cargo fmt, clippy -D warnings, ruff, mypy --strict all clean

🤖 Generated with Claude Code

williajm and others added 4 commits April 3, 2026 23:31
Add records_to_file() which generates records in chunks and writes each
chunk to disk, keeping memory bounded by chunk_size regardless of total n.
Supports CSV, NDJSON, SQL, and Parquet formats with auto-detection from
file extension. Includes progress callback and estimate_memory() utility.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
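
The Rust implementation isn't shown on this page, but the chunking idea in this commit can be sketched in pure Python (NDJSON only; write_ndjson_chunked and make_record are hypothetical names, not the library's API):

```python
import json
from typing import Callable, Optional

def write_ndjson_chunked(
    path: str,
    make_record: Callable[[int], dict],
    n: int,
    chunk_size: int,
    on_progress: Optional[Callable[[int, int], None]] = None,
) -> None:
    """Generate n records but hold at most chunk_size of them in memory."""
    if chunk_size <= 0:
        # Guard against the chunk_size=0 infinite loop fixed later in this PR.
        raise ValueError("chunk_size must be positive")
    written = 0
    with open(path, "w", encoding="utf-8") as f:
        while written < n:
            # Only one chunk of records exists in memory at a time.
            batch = [make_record(i) for i in range(written, min(written + chunk_size, n))]
            f.write("".join(json.dumps(r) + "\n" for r in batch))
            written += len(batch)
            if on_progress is not None:
                on_progress(written, n)
```

Memory stays proportional to chunk_size rather than n, which is what makes 100M+ record runs feasible.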
…atch

- chunk_size=0 caused an infinite loop; now returns a clear error
- build_column_from_records used panic! for type mismatches on a pub
  function; replaced with Err(SchemaError) returns
- Added regression tests for both issues

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… map I/O to OSError

- P1: Call validate_schema_with_custom() before File::create() so
  invalid schemas (e.g. ("int", 10, 1)) fail fast without truncating
  the target file. Also rejects invalid schemas when n=0.
- P2: Progress callback exceptions are now propagated instead of
  silently swallowed — callers can use the callback to abort writes.
- P3: FileWriteError::Io is mapped to PyOSError/PyFileNotFoundError/
  PyPermissionError instead of ValueError, matching the documented API.
- Added regression tests for all three issues.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
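
The P1 ordering matters because opening a file for writing truncates it. A minimal sketch of the fix (validate_schema and write_records are hypothetical stand-ins for the real functions):

```python
def validate_schema(schema: dict) -> None:
    """Hypothetical validator: range specs like ("int", 10, 1) must have min <= max."""
    for field, spec in schema.items():
        if isinstance(spec, tuple) and len(spec) == 3 and spec[1] > spec[2]:
            raise ValueError(f"{field}: min {spec[1]} > max {spec[2]}")

def write_records(path: str, schema: dict, n: int) -> None:
    # Validate BEFORE open(..., "w"): opening for write truncates the
    # target, so a late validation error would destroy existing data.
    validate_schema(schema)
    with open(path, "w", encoding="utf-8") as f:
        for _ in range(n):
            f.write("...\n")  # placeholder for real record generation
```

With validation first, an invalid schema fails fast and the existing file is left untouched, even when n=0.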
The progress callback error path was stringifying PyErr then wrapping
it as FileWriteError::Config, losing the original exception type and
traceback. Now stashes the PyErr in a RefCell and re-raises it directly,
so a RuntimeError in the callback surfaces as RuntimeError to the caller.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
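
The contract this commit restores can be illustrated without the library: an exception raised inside the callback must reach the caller with its original type, which is what makes callback-driven aborts possible (AbortWrite, drive_chunks, and stop_at_half are illustrative names):

```python
class AbortWrite(RuntimeError):
    """Raised from a progress callback to cancel a long write."""

def drive_chunks(n: int, chunk_size: int, on_progress) -> int:
    # The callback runs between chunks; its exceptions propagate
    # unchanged rather than being stringified or swallowed.
    written = 0
    while written < n:
        written += min(chunk_size, n - written)
        on_progress(written, n)
    return written

def stop_at_half(written: int, total: int) -> None:
    if written * 2 >= total:
        raise AbortWrite(f"stopped at {written}/{total}")
```

A caller can then catch AbortWrite specifically, which would be impossible if the driver wrapped every callback error in a generic config error.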

codecov Bot commented Apr 3, 2026

Codecov Report

❌ Patch coverage is 80.11494% with 173 lines in your changes missing coverage. Please review.

Files with missing lines     | Patch % | Lines
src/providers/records.rs     | 28.57%  | 105 Missing ⚠️
src/providers/file_writer.rs | 90.46%  | 68 Missing ⚠️


Extract each Arrow column type into its own helper function
(extract_i64_column, extract_f64_column, extract_bool_column,
extract_coordinate_column, extract_rgb_column) to bring the parent
function under SonarCloud's cognitive complexity threshold of 15.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

sonarqubecloud Bot commented Apr 3, 2026

@williajm williajm merged commit 34640f2 into main Apr 3, 2026
13 checks passed
@williajm williajm deleted the feat/streaming-file-writer branch April 3, 2026 23:35
williajm added a commit that referenced this pull request Apr 17, 2026
…elog

Those were shipped in #46 and #45 but never made it into CHANGELOG.md.
Adding them to the 0.4.0 release notes so users upgrading from 0.3.0 see
every user-facing feature in one place.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
williajm added a commit that referenced this pull request Apr 17, 2026
* feat: package registry providers + 0.4.0 release

Adds 22 method pairs for seeding test databases of package registries
(PyPI, npm, Maven, Cargo, RubyGems). Cross-ecosystem primitives share
one API; ecosystem-specific shapes have their own methods where the
canonical form genuinely differs.

New providers:
- commit_sha / short_commit_sha
- semver / semver_prerelease / calver
- spdx_license (50 common IDs)
- git_username (strict GitHub rules)
- pypi_version (PEP 440), maven_version (with qualifiers)
- pypi/npm/cargo/maven/gem version constraints
- pypi/npm/cargo/gem package names, maven group/artifact/coordinate
- pypi_requirement (full pip-install line)

Bumps version to 0.4.0, updates README, ARCHITECTURE.md, CHANGELOG.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: backfill streaming writer + serialized formats into 0.4.0 changelog

Those were shipped in #46 and #45 but never made it into CHANGELOG.md.
Adding them to the 0.4.0 release notes so users upgrading from 0.3.0 see
every user-facing feature in one place.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: guard against empty git_username after trailing-hyphen strip

Current word lists can't produce a username long enough to enter the
truncation branch, so this fix is defensive rather than bug-chasing — but
the invariant was wrong: if a future data entry were a run of hyphens,
the pop loop would empty the string and we'd return "". Add a length
guard so at least one character always remains.

Found in review of PR branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: actually PEP 503-normalize pypi_package_name

Response to Copilot review on PR #48.

The docstring and README claimed "PEP 503 normalised" but the generator
was emitting underscores ~20% of the time. PEP 503 §Normalized Names
collapses runs of `[-_.]+` to a single `-`, so normalized output must
contain only `[a-z0-9-]`.

Changes:
- Drop the underscore branch from generate_pypi_package_name; hyphen is
  the sole separator. Collapse the two now-identical `py-{primary}`
  match arms.
- Tighten the Rust test to reject any char outside [a-z0-9-] and to
  enforce no leading/trailing/double hyphens.
- Tighten the Python regex to `^[a-z0-9]([a-z0-9-]*[a-z0-9])?$` and
  add explicit assertions against `_` and `.`.
- Revert the README / CHANGELOG softening from the previous iteration —
  the "PEP 503 normalised" claim is now accurate.
- Update the Rust + Python docstrings to explain the normalization.

Also addresses three other Copilot findings:
- Add the 44 missing Faker-class stubs to python/forgery/_forgery.pyi
  so IDE autocomplete and type checking work for callers using the
  Faker class directly.
- Fix a broken assertion in test_sometimes_has_qualifier: the check
  `"." in v.split(".")[-1]` was always false (the last split
  segment never contains a dot). Replace with `v.count(".") > 2`,
  which correctly identifies dot-separated Maven qualifiers like
  `.Final` / `.RELEASE`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
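
The PEP 503 rule described above is a one-liner: lowercase the name and collapse every run of `-`, `_`, and `.` into a single hyphen. A direct sketch:

```python
import re

def pep503_normalize(name: str) -> str:
    """PEP 503 'Normalized Names': collapse runs of [-_.] to one '-', lowercase."""
    return re.sub(r"[-_.]+", "-", name).lower()
```

Any output of this function necessarily matches `^[a-z0-9]([a-z0-9-]*[a-z0-9])?$` as long as the input starts and ends with an alphanumeric character, which is the property the tightened tests assert.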

* feat: add unique=True to package-name methods + expand keyword pool

Nine methods that select from combinatorial patterns now accept
unique=True to guarantee no duplicates, matching the names(n, unique=True)
contract used elsewhere in the library. Useful for seeding registry
tables that have a unique-name constraint.

Methods with unique support:
- pypi_package_names, npm_package_names, cargo_package_names, gem_names
- maven_group_ids, maven_artifact_ids, maven_coordinates
- git_usernames
- spdx_licenses (capped at 50 — the pool size)

Also:
- PACKAGE_KEYWORDS: 94 -> 245 entries
- PACKAGE_MODIFIERS: 32 -> 67 entries
  Combinatorial headroom is now ~77k distinct pypi names, ~1.9M distinct
  git usernames, millions for maven coordinates — plenty of room before
  UniqueExhaustedError hits for realistic batch sizes.

Implementation:
- New batch_simple_unique! macro mirrors batch_locale_unique! for
  generators that don't take locale; wraps them in a closure that
  ignores the locale argument generate_unique passes.
- Return type shifts from Result<_, BatchSizeError> to
  Result<_, ForgeryError> for the affected methods, since ForgeryError
  covers both batch-size and unique-exhaustion failures.
- Python signatures gain `unique: bool = False` (keyword-compatible, so
  existing calls keep working).
- Both .pyi stub files updated.
- New TestUnique pytest class covers all 9 methods: no-duplicates,
  determinism under seed, exhaustion error, non-unique path unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
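
The unique=True contract described above (no duplicates, UniqueExhaustedError when the pool runs dry) can be sketched independently of the macro machinery; unique_batch and the retry limit below are illustrative, not the library's implementation:

```python
import random

class UniqueExhaustedError(RuntimeError):
    """Stand-in for the library's error when a pool yields no more distinct values."""

def unique_batch(gen, n: int, max_attempts: int = 10_000) -> list:
    # Retry-based uniqueness: draw until an unseen value appears, or give
    # up after max_attempts consecutive duplicates.
    seen, out = set(), []
    attempts = 0
    while len(out) < n:
        v = gen()
        attempts += 1
        if v not in seen:
            seen.add(v)
            out.append(v)
            attempts = 0
        elif attempts >= max_attempts:
            raise UniqueExhaustedError(f"pool exhausted after {len(out)} values")
    return out

rng = random.Random(42)
names = unique_batch(lambda: f"pkg-{rng.randrange(100_000)}", 50)
```

With ~77k distinct pypi names available, realistic batch sizes stay far from the exhaustion path; the error only fires when the request genuinely exceeds the combinatorial pool.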

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
