feat: streaming file writer for bounded-memory generation (#46)
Merged
Conversation
Add records_to_file(), which generates records in chunks and writes each chunk to disk, keeping memory bounded by chunk_size regardless of total n. Supports CSV, NDJSON, SQL, and Parquet formats, with auto-detection from the file extension. Includes a progress callback and an estimate_memory() utility.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
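The chunked-write loop can be sketched in plain Python for the NDJSON case. Everything here (records_to_ndjson, make_record, the signature) is illustrative, not the library's actual API; the real records_to_file() is implemented in Rust:

```python
import json

def records_to_ndjson(path, make_record, n, chunk_size=10_000, on_progress=None):
    """Stream n generated records to disk in chunks of chunk_size.

    Only one chunk is ever held in memory at a time, so peak memory is
    bounded by chunk_size regardless of n.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be a positive integer")
    written = 0
    with open(path, "w", encoding="utf-8") as f:
        while written < n:
            count = min(chunk_size, n - written)
            chunk = [make_record(written + i) for i in range(count)]
            f.writelines(json.dumps(rec) + "\n" for rec in chunk)
            written += count
            if on_progress is not None:
                on_progress(written, n)  # exceptions propagate to the caller
    return written
```

Note the chunk_size guard up front: the zero case returns a clear error instead of looping forever.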
…atch

- chunk_size=0 caused an infinite loop; now returns a clear error.
- build_column_from_records used panic! for type mismatches on a pub function; replaced with Err(SchemaError) returns.
- Added regression tests for both issues.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… map I/O to OSError
- P1: Call validate_schema_with_custom() before File::create() so
invalid schemas (e.g. ("int", 10, 1)) fail fast without truncating
the target file. Also rejects invalid schemas when n=0.
- P2: Progress callback exceptions are now propagated instead of
silently swallowed — callers can use the callback to abort writes.
- P3: FileWriteError::Io is mapped to PyOSError/PyFileNotFoundError/
PyPermissionError instead of ValueError, matching the documented API.
- Added regression tests for all three issues.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
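The ordering the P1 fix enforces can be shown with a minimal sketch. validate_schema and write_dataset below are illustrative stand-ins (the real validator is validate_schema_with_custom() on the Rust side), not the actual API:

```python
def validate_schema(schema):
    """Toy stand-in for the schema validator: an ("int", lo, hi) spec
    with lo > hi is invalid."""
    for name, (kind, lo, hi) in schema.items():
        if kind == "int" and lo > hi:
            raise ValueError(f"{name}: invalid int range ({lo}, {hi})")

def write_dataset(path, schema, n):
    validate_schema(schema)  # P1: fail fast, before the file is opened
    # The file is created (truncating any existing file) only after the
    # schema is known to be valid -- including when n == 0.
    with open(path, "w", encoding="utf-8") as f:
        for i in range(n):
            f.write(",".join(str(i) for _ in schema) + "\n")
```

The point of the ordering is visible in the failure path: an invalid schema raises before `open()` runs, so a pre-existing output file is never truncated.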
The progress callback error path was stringifying the PyErr and wrapping it as FileWriteError::Config, losing the original exception type and traceback. It now stashes the PyErr in a RefCell and re-raises it directly, so a RuntimeError raised in the callback surfaces as a RuntimeError to the caller.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
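The contract this change establishes, expressed as a small Python sketch (run_chunks is a hypothetical driver, not the library's code):

```python
def run_chunks(n_chunks, on_progress):
    """Invoke the callback once per chunk and let its exceptions escape
    unwrapped, so the caller sees the original exception type."""
    for i in range(1, n_chunks + 1):
        on_progress(i)  # no stringify-and-rewrap: type and traceback survive

def abort_at_two(i):
    if i == 2:
        raise RuntimeError("abort at chunk 2")

caught = None
try:
    run_chunks(5, abort_at_two)
except RuntimeError as exc:  # surfaces as RuntimeError, not a generic error
    caught = exc
```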
Extract each Arrow column type into its own helper function (extract_i64_column, extract_f64_column, extract_bool_column, extract_coordinate_column, extract_rgb_column) to bring the parent function under SonarCloud's cognitive complexity threshold of 15.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
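In Python terms, the refactor turns one branchy function into a flat dispatch over small per-type helpers. The names below mirror the Rust helpers, but the sketch is illustrative:

```python
def extract_i64_column(values):
    return [int(v) for v in values]

def extract_f64_column(values):
    return [float(v) for v in values]

def extract_bool_column(values):
    return [bool(v) for v in values]

_EXTRACTORS = {
    "i64": extract_i64_column,
    "f64": extract_f64_column,
    "bool": extract_bool_column,
}

def build_column(kind, values):
    """The parent is now a thin dispatcher, so its cognitive complexity
    stays flat no matter how many column types are supported."""
    try:
        extractor = _EXTRACTORS[kind]
    except KeyError:
        # error return instead of a panic!-style crash on a public entry point
        raise ValueError(f"unsupported column type: {kind!r}")
    return extractor(values)
```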
williajm added a commit that referenced this pull request on Apr 17, 2026
* feat: package registry providers + 0.4.0 release

  Adds 22 method pairs for seeding test databases of package registries
  (PyPI, npm, Maven, Cargo, RubyGems). Cross-ecosystem primitives share one
  API; ecosystem-specific shapes have their own methods where the canonical
  form genuinely differs.

  New providers:
  - commit_sha / short_commit_sha
  - semver / semver_prerelease / calver
  - spdx_license (50 common IDs)
  - git_username (strict GitHub rules)
  - pypi_version (PEP 440), maven_version (with qualifiers)
  - pypi/npm/cargo/maven/gem version constraints
  - pypi/npm/cargo/gem package names, maven group/artifact/coordinate
  - pypi_requirement (full pip-install line)

  Bumps version to 0.4.0; updates README, ARCHITECTURE.md, CHANGELOG.md.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: backfill streaming writer + serialized formats into 0.4.0 changelog

  These were shipped in #46 and #45 but never made it into CHANGELOG.md.
  Adding them to the 0.4.0 release notes means users upgrading from 0.3.0
  see every user-facing feature in one place.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: guard against empty git_username after trailing-hyphen strip

  Current word lists can't produce a username long enough to enter the
  truncation branch, so this fix is defensive rather than bug-chasing.
  Still, the invariant was wrong: if a future data entry were a run of
  hyphens, the pop loop would empty the string and we'd return "". Add a
  length guard so at least one character always remains. Found in review
  of the PR branch.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: actually PEP 503-normalize pypi_package_name

  Response to Copilot review on PR #48. The docstring and README claimed
  "PEP 503 normalised", but the generator was emitting underscores ~20% of
  the time. PEP 503 §Normalized Names collapses runs of `[-_.]+` to a
  single `-`, so normalized output must contain only `[a-z0-9-]`.
  Changes:
  - Drop the underscore branch from generate_pypi_package_name; hyphen is
    the sole separator. Collapse the two now-identical `py-{primary}` match arms.
  - Tighten the Rust test to reject any char outside [a-z0-9-] and to
    enforce no leading/trailing/double hyphens.
  - Tighten the Python regex to `^[a-z0-9]([a-z0-9-]*[a-z0-9])?$` and add
    explicit assertions against `_` and `.`.
  - Revert the README / CHANGELOG softening from the previous iteration;
    the "PEP 503 normalised" claim is now accurate.
  - Update the Rust + Python docstrings to explain the normalization.

  Also addresses three other Copilot findings:
  - Add the 44 missing Faker-class stubs to python/forgery/_forgery.pyi so
    IDE autocomplete and type checking work for callers using the Faker
    class directly.
  - Fix a broken assertion in test_sometimes_has_qualifier: the check
    `"." in v.split(".")[-1]` was tautologically false (the last split
    segment never contains a dot). Replace with `v.count(".") > 2`, which
    correctly identifies dot-separated Maven qualifiers like `.Final` /
    `.RELEASE`.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: add unique=True to package-name methods + expand keyword pool

  Nine methods that select from combinatorial patterns now accept
  unique=True to guarantee no duplicates, matching the names(n, unique=True)
  contract used elsewhere in the library. Useful for seeding registry
  tables that have a unique-name constraint.

  Methods with unique support:
  - pypi_package_names, npm_package_names, cargo_package_names, gem_names
  - maven_group_ids, maven_artifact_ids, maven_coordinates
  - git_usernames
  - spdx_licenses (capped at 50, the pool size)

  Also:
  - PACKAGE_KEYWORDS: 94 -> 245 entries
  - PACKAGE_MODIFIERS: 32 -> 67 entries

  Combinatorial headroom is now ~77k distinct pypi names, ~1.9M distinct
  git usernames, and millions of maven coordinates: plenty of room before
  UniqueExhaustedError hits for realistic batch sizes.

  Implementation:
  - New batch_simple_unique! macro mirrors batch_locale_unique! for
    generators that don't take a locale; wraps them in a closure that
    ignores the locale argument generate_unique passes.
  - Return type shifts from Result<_, BatchSizeError> to Result<_, ForgeryError>
    for the affected methods, since ForgeryError covers both batch-size and
    unique-exhaustion failures.
  - Python signatures gain `unique: bool = False` (keyword-compatible, so
    existing calls keep working).
  - Both .pyi stub files updated.
  - New TestUnique pytest class covers all 9 methods: no-duplicates,
    determinism under seed, exhaustion error, non-unique path unchanged.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
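The PEP 503 rule referenced in the commit above is small enough to state in code; the regex substitution is the normalization function given in the spec itself:

```python
import re

def pep503_normalize(name: str) -> str:
    """PEP 503 'Normalized Names': collapse every run of '-', '_', '.'
    into a single hyphen and lowercase the result, so normalized names
    contain only characters in [a-z0-9-]."""
    return re.sub(r"[-_.]+", "-", name).lower()
```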



Summary
- records_to_file() for generating arbitrarily large datasets (100M+ records) with bounded memory, by writing records in configurable chunks to disk
- estimate_memory() utility for planning chunk sizes based on available RAM
- on_progress callback for tracking long-running writes, with proper exception propagation preserving the original Python exception type
- I/O errors surface as OSError/FileNotFoundError/PermissionError (not ValueError)

Example
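A back-of-envelope version of the chunk-size planning that estimate_memory() supports. The helper below and its formula are illustrative, not the library's API:

```python
def plan_chunk_size(bytes_per_record: int, ram_budget_bytes: int,
                    safety: float = 0.5) -> int:
    """Pick the largest chunk_size whose in-memory chunk stays under a
    fraction (safety) of the available RAM budget."""
    if bytes_per_record <= 0 or ram_budget_bytes <= 0:
        raise ValueError("sizes must be positive")
    return max(1, int(ram_budget_bytes * safety) // bytes_per_record)
```

For example, with 200-byte records and a 100 MB budget at the default 50% safety margin, one chunk holds 250,000 records, however large the total n is.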
Test plan
- file_writer.rs (format detection, all 4 writers, chunking, determinism, progress, errors, memory estimation)
- test_file_writer.py (all formats, auto-detection, chunking equivalence, progress callbacks, exception preservation, OSError mapping, schema validation, custom providers, 10k-record stress tests)
- test_convenience_new.py

🤖 Generated with Claude Code