
feat: streaming file writer for bounded-memory generation #46

Merged
williajm merged 5 commits into main from feat/streaming-file-writer
Apr 3, 2026

Conversation


@williajm williajm commented Apr 3, 2026

Summary

  • Adds records_to_file() for generating arbitrarily large datasets (100M+ records) with bounded memory, by writing records in configurable chunks to disk
  • Supports CSV, NDJSON, SQL, and Parquet formats with auto-detection from file extension
  • Includes estimate_memory() utility for planning chunk sizes based on available RAM
  • Progress callback (on_progress) for tracking long-running writes, with proper exception propagation preserving the original Python exception type
  • I/O errors surface as OSError/FileNotFoundError/PermissionError (not ValueError)
  • Schema validated before file creation to avoid truncating existing files on invalid input

Example

from forgery import Faker

fake = Faker(seed=42)
fake.records_to_file(
    100_000_000,
    {"id": "uuid", "name": "name", "amount": ("float", 0.01, 9999.99)},
    "transactions.parquet",
    chunk_size=1_000_000,
    on_progress=lambda w, t: print(f"\r{w/t:.0%}", end=""),
)
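
The chunk-size planning that estimate_memory() is described as doing can be approximated with a back-of-envelope sketch. The per-field byte costs and the plan_chunk_size helper below are illustrative assumptions, not the library's actual implementation:

```python
# Hypothetical per-field in-memory costs (assumed values, for illustration only).
FIELD_BYTES = {"uuid": 36, "name": 24, "float": 8, "int": 8}

def plan_chunk_size(schema: dict, ram_budget_bytes: int) -> int:
    """Largest chunk whose in-memory footprint fits the RAM budget."""
    # A tuple spec like ("float", 0.01, 9999.99) keys on its first element.
    per_record = sum(
        FIELD_BYTES[spec[0] if isinstance(spec, tuple) else spec]
        for spec in schema.values()
    )
    return max(1, ram_budget_bytes // per_record)

schema = {"id": "uuid", "name": "name", "amount": ("float", 0.01, 9999.99)}
chunk = plan_chunk_size(schema, 512 * 1024 * 1024)  # 512 MiB budget
```

With these assumed sizes a record costs 68 bytes, so a 512 MiB budget allows a chunk of several million records; the real utility presumably accounts for format-specific overhead as well.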

Test plan

  • 25 Rust unit tests in file_writer.rs (format detection, all 4 writers, chunking, determinism, progress, errors, memory estimation)
  • 40 Python integration tests in test_file_writer.py (all formats, auto-detection, chunking equivalence, progress callbacks, exception preservation, OSError mapping, schema validation, custom providers, 10k-record stress tests)
  • 2 convenience function tests in test_convenience_new.py
  • Full suite: 802 Rust tests, 1335 Python tests, 100% Python coverage
  • cargo fmt, clippy -D warnings, ruff, mypy --strict all clean

🤖 Generated with Claude Code

williajm and others added 4 commits April 3, 2026 23:31
Add records_to_file() which generates records in chunks and writes each
chunk to disk, keeping memory bounded by chunk_size regardless of total n.
Supports CSV, NDJSON, SQL, and Parquet formats with auto-detection from
file extension. Includes progress callback and estimate_memory() utility.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
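
The Rust implementation isn't shown on this page, but the chunking idea in this commit can be sketched in pure Python (NDJSON only; write_ndjson_chunked and make_record are hypothetical names, not the library's API):

```python
import json
from typing import Callable, Optional

def write_ndjson_chunked(
    path: str,
    make_record: Callable[[int], dict],
    n: int,
    chunk_size: int,
    on_progress: Optional[Callable[[int, int], None]] = None,
) -> None:
    """Generate n records but hold at most chunk_size of them in memory."""
    if chunk_size <= 0:
        # Guard against the chunk_size=0 infinite loop fixed later in this PR.
        raise ValueError("chunk_size must be positive")
    written = 0
    with open(path, "w", encoding="utf-8") as f:
        while written < n:
            # Only one chunk of records exists in memory at a time.
            batch = [make_record(i) for i in range(written, min(written + chunk_size, n))]
            f.write("".join(json.dumps(r) + "\n" for r in batch))
            written += len(batch)
            if on_progress is not None:
                on_progress(written, n)
```

Memory stays proportional to chunk_size rather than n, which is what makes 100M+ record runs feasible.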
…atch

- chunk_size=0 caused an infinite loop; now returns a clear error
- build_column_from_records used panic! for type mismatches on a pub
  function; replaced with Err(SchemaError) returns
- Added regression tests for both issues

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… map I/O to OSError

- P1: Call validate_schema_with_custom() before File::create() so
  invalid schemas (e.g. ("int", 10, 1)) fail fast without truncating
  the target file. Also rejects invalid schemas when n=0.
- P2: Progress callback exceptions are now propagated instead of
  silently swallowed — callers can use the callback to abort writes.
- P3: FileWriteError::Io is mapped to PyOSError/PyFileNotFoundError/
  PyPermissionError instead of ValueError, matching the documented API.
- Added regression tests for all three issues.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
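
The P1 ordering matters because opening a file for writing truncates it. A minimal sketch of the fix (validate_schema and write_records are hypothetical stand-ins for the real functions):

```python
def validate_schema(schema: dict) -> None:
    """Hypothetical validator: range specs like ("int", 10, 1) must have min <= max."""
    for field, spec in schema.items():
        if isinstance(spec, tuple) and len(spec) == 3 and spec[1] > spec[2]:
            raise ValueError(f"{field}: min {spec[1]} > max {spec[2]}")

def write_records(path: str, schema: dict, n: int) -> None:
    # Validate BEFORE open(..., "w"): opening for write truncates the
    # target, so a late validation error would destroy existing data.
    validate_schema(schema)
    with open(path, "w", encoding="utf-8") as f:
        for _ in range(n):
            f.write("...\n")  # placeholder for real record generation
```

With validation first, an invalid schema fails fast and the existing file is left untouched, even when n=0.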
The progress callback error path was stringifying PyErr then wrapping
it as FileWriteError::Config, losing the original exception type and
traceback. Now stashes the PyErr in a RefCell and re-raises it directly,
so a RuntimeError in the callback surfaces as RuntimeError to the caller.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
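
The contract this commit restores can be illustrated without the library: an exception raised inside the callback must reach the caller with its original type, which is what makes callback-driven aborts possible (AbortWrite, drive_chunks, and stop_at_half are illustrative names):

```python
class AbortWrite(RuntimeError):
    """Raised from a progress callback to cancel a long write."""

def drive_chunks(n: int, chunk_size: int, on_progress) -> int:
    # The callback runs between chunks; its exceptions propagate
    # unchanged rather than being stringified or swallowed.
    written = 0
    while written < n:
        written += min(chunk_size, n - written)
        on_progress(written, n)
    return written

def stop_at_half(written: int, total: int) -> None:
    if written * 2 >= total:
        raise AbortWrite(f"stopped at {written}/{total}")
```

A caller can then catch AbortWrite specifically, which would be impossible if the driver wrapped every callback error in a generic config error.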

codecov Bot commented Apr 3, 2026

Codecov Report

❌ Patch coverage is 80.11494% with 173 lines in your changes missing coverage. Please review.

Files with missing lines     | Patch % | Lines
src/providers/records.rs     | 28.57%  | 105 Missing ⚠️
src/providers/file_writer.rs | 90.46%  | 68 Missing ⚠️


Extract each Arrow column type into its own helper function
(extract_i64_column, extract_f64_column, extract_bool_column,
extract_coordinate_column, extract_rgb_column) to bring the parent
function under SonarCloud's cognitive complexity threshold of 15.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

sonarqubecloud Bot commented Apr 3, 2026

@williajm williajm merged commit 34640f2 into main Apr 3, 2026
13 checks passed
@williajm williajm deleted the feat/streaming-file-writer branch April 3, 2026 23:35
williajm added a commit that referenced this pull request Apr 17, 2026
…elog

Those were shipped in #46 and #45 but never made it into CHANGELOG.md.
Adding them to the 0.4.0 release notes so users upgrading from 0.3.0 see
every user-facing feature in one place.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
williajm added a commit that referenced this pull request Apr 17, 2026
* feat: package registry providers + 0.4.0 release

Adds 22 method pairs for seeding test databases of package registries
(PyPI, npm, Maven, Cargo, RubyGems). Cross-ecosystem primitives share
one API; ecosystem-specific shapes have their own methods where the
canonical form genuinely differs.

New providers:
- commit_sha / short_commit_sha
- semver / semver_prerelease / calver
- spdx_license (50 common IDs)
- git_username (strict GitHub rules)
- pypi_version (PEP 440), maven_version (with qualifiers)
- pypi/npm/cargo/maven/gem version constraints
- pypi/npm/cargo/gem package names, maven group/artifact/coordinate
- pypi_requirement (full pip-install line)

Bumps version to 0.4.0, updates README, ARCHITECTURE.md, CHANGELOG.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: backfill streaming writer + serialized formats into 0.4.0 changelog

Those were shipped in #46 and #45 but never made it into CHANGELOG.md.
Adding them to the 0.4.0 release notes so users upgrading from 0.3.0 see
every user-facing feature in one place.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: guard against empty git_username after trailing-hyphen strip

Current word lists can't produce a username long enough to enter the
truncation branch, so this fix is defensive rather than bug-chasing — but
the invariant was wrong: if a future data entry were a run of hyphens,
the pop loop would empty the string and we'd return "". Add a length
guard so at least one character always remains.

Found in review of PR branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: actually PEP 503-normalize pypi_package_name

Response to Copilot review on PR #48.

The docstring and README claimed "PEP 503 normalised" but the generator
was emitting underscores ~20% of the time. PEP 503 §Normalized Names
collapses runs of `[-_.]+` to a single `-`, so normalized output must
contain only `[a-z0-9-]`.

Changes:
- Drop the underscore branch from generate_pypi_package_name; hyphen is
  the sole separator. Collapse the two now-identical `py-{primary}`
  match arms.
- Tighten the Rust test to reject any char outside [a-z0-9-] and to
  enforce no leading/trailing/double hyphens.
- Tighten the Python regex to `^[a-z0-9]([a-z0-9-]*[a-z0-9])?$` and
  add explicit assertions against `_` and `.`.
- Revert the README / CHANGELOG softening from the previous iteration —
  the "PEP 503 normalised" claim is now accurate.
- Update the Rust + Python docstrings to explain the normalization.

Also addresses three other Copilot findings:
- Add the 44 missing Faker-class stubs to python/forgery/_forgery.pyi
  so IDE autocomplete and type checking work for callers using the
  Faker class directly.
- Fix a broken assertion in test_sometimes_has_qualifier: the check
  `"." in v.split(".")[-1]` was always false (the last split
  segment never contains a dot). Replace with `v.count(".") > 2`,
  which correctly identifies dot-separated Maven qualifiers like
  `.Final` / `.RELEASE`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
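
The PEP 503 rule described above is a one-liner: lowercase the name and collapse every run of `-`, `_`, and `.` into a single hyphen. A direct sketch:

```python
import re

def pep503_normalize(name: str) -> str:
    """PEP 503 'Normalized Names': collapse runs of [-_.] to one '-', lowercase."""
    return re.sub(r"[-_.]+", "-", name).lower()
```

Any output of this function necessarily matches `^[a-z0-9]([a-z0-9-]*[a-z0-9])?$` as long as the input starts and ends with an alphanumeric character, which is the property the tightened tests assert.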

* feat: add unique=True to package-name methods + expand keyword pool

Nine methods that select from combinatorial patterns now accept
unique=True to guarantee no duplicates, matching the names(n, unique=True)
contract used elsewhere in the library. Useful for seeding registry
tables that have a unique-name constraint.

Methods with unique support:
- pypi_package_names, npm_package_names, cargo_package_names, gem_names
- maven_group_ids, maven_artifact_ids, maven_coordinates
- git_usernames
- spdx_licenses (capped at 50 — the pool size)

Also:
- PACKAGE_KEYWORDS: 94 -> 245 entries
- PACKAGE_MODIFIERS: 32 -> 67 entries
  Combinatorial headroom is now ~77k distinct pypi names, ~1.9M distinct
  git usernames, millions for maven coordinates — plenty of room before
  UniqueExhaustedError hits for realistic batch sizes.

Implementation:
- New batch_simple_unique! macro mirrors batch_locale_unique! for
  generators that don't take locale; wraps them in a closure that
  ignores the locale argument generate_unique passes.
- Return type shifts from Result<_, BatchSizeError> to
  Result<_, ForgeryError> for the affected methods, since ForgeryError
  covers both batch-size and unique-exhaustion failures.
- Python signatures gain `unique: bool = False` (keyword-compatible, so
  existing calls keep working).
- Both .pyi stub files updated.
- New TestUnique pytest class covers all 9 methods: no-duplicates,
  determinism under seed, exhaustion error, non-unique path unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
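
The unique=True contract described above (no duplicates, UniqueExhaustedError when the pool runs dry) can be sketched independently of the macro machinery; unique_batch and the retry limit below are illustrative, not the library's implementation:

```python
import random

class UniqueExhaustedError(RuntimeError):
    """Stand-in for the library's error when a pool yields no more distinct values."""

def unique_batch(gen, n: int, max_attempts: int = 10_000) -> list:
    # Retry-based uniqueness: draw until an unseen value appears, or give
    # up after max_attempts consecutive duplicates.
    seen, out = set(), []
    attempts = 0
    while len(out) < n:
        v = gen()
        attempts += 1
        if v not in seen:
            seen.add(v)
            out.append(v)
            attempts = 0
        elif attempts >= max_attempts:
            raise UniqueExhaustedError(f"pool exhausted after {len(out)} values")
    return out

rng = random.Random(42)
names = unique_batch(lambda: f"pkg-{rng.randrange(100_000)}", 50)
```

With ~77k distinct pypi names available, realistic batch sizes stay far from the exhaustion path; the error only fires when the request genuinely exceeds the combinatorial pool.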

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
