feat: package registry providers + 0.4.0 release #48

Merged
williajm merged 5 commits into main from feat/package-registry-providers
Apr 17, 2026

@williajm
Owner

Summary

What's in the new provider

Cross-ecosystem primitives: commit_sha / short_commit_sha, semver / semver_prerelease, calver, spdx_license (50 common IDs), git_username (enforces GitHub's rules — alphanumerics + single hyphens, no leading/trailing hyphen, no consecutive hyphens, ≤ 39 chars).
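The username rules listed above can be checked with a single regular expression. The following is a minimal sketch for validating generated values in a test; `USERNAME_RE` and `is_valid_git_username` are hypothetical names chosen for illustration and are not part of the library.

```python
import re

# Rules as listed in the PR description: alphanumerics and single hyphens,
# no leading/trailing hyphen, no consecutive hyphens, at most 39 characters.
# First char + up to 37 middle chars + last char gives the 39-char ceiling.
USERNAME_RE = re.compile(r"(?!.*--)[A-Za-z0-9](?:[A-Za-z0-9-]{0,37}[A-Za-z0-9])?")


def is_valid_git_username(name: str) -> bool:
    """Return True if `name` satisfies every rule encoded in USERNAME_RE."""
    return USERNAME_RE.fullmatch(name) is not None
```

A check like this is handy as a property-test oracle: every value a generator emits should pass it, whatever the seed.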

Ecosystem-specific versions: pypi_version (PEP 440 — includes pre/post/dev releases), maven_version (with qualifiers -SNAPSHOT, .RELEASE, .Final, etc.).

Version constraints (syntax differs per ecosystem): pypi_version_specifier, npm_version_range, cargo_version_req, maven_version_range, gem_version_requirement.

Package identity: pypi_package_name (PEP 503-normalised), npm_package_name (plain or @scope/pkg), cargo_package_name, gem_name, maven_group_id / maven_artifact_id / maven_coordinate (GAV).

Full lines: pypi_requirement — e.g. requests>=2.0.0,<3.0.0.
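To make the shape of such a line concrete, here is a rough stdlib-only sketch that splits a requirement into its name and version clauses. It assumes a deliberately simplified grammar (no extras, markers, wildcards, or epochs), and `split_requirement` is a hypothetical helper; the PR's own test plan cross-validates against the `packaging` library, which is the better choice for real parsing.

```python
import re

# Simplified shapes, assumed for illustration only. PEP 508 and PEP 440
# permit considerably more than this.
NAME_RE = r"[A-Za-z0-9](?:[A-Za-z0-9._-]*[A-Za-z0-9])?"
CLAUSE_RE = r"(?:~=|==|!=|<=|>=|<|>)\d+(?:\.\d+)*"
REQUIREMENT_RE = re.compile(rf"({NAME_RE})({CLAUSE_RE}(?:,{CLAUSE_RE})*)")


def split_requirement(line: str) -> tuple[str, list[str]]:
    """Split 'name<clauses>' into (name, [clause, ...]) or raise ValueError."""
    match = REQUIREMENT_RE.fullmatch(line)
    if match is None:
        raise ValueError(f"not a simple requirement line: {line!r}")
    return match.group(1), match.group(2).split(",")


name, clauses = split_requirement("requests>=2.0.0,<3.0.0")
print(name, clauses)
```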

All batch methods are parallel-safe via set_parallel().

Docs

  • README.md — new "Package Registry Data" section under ## Available Generators with a usage snippet
  • ARCHITECTURE.md — provider table + en_us data directory
  • CHANGELOG.md: [0.4.0] section dated 2026-04-17 with packages + backfilled features

Test plan

  • CI passes (rust-check, python-check, security, codeql)
  • cargo test --lib — 832 tests pass (30 new; unit + proptest)
  • pytest — 1446 tests pass, 100% Python coverage
  • cargo clippy --all-targets -- -D warnings clean
  • cargo fmt --check clean
  • ruff check + ruff format --check clean
  • mypy --strict clean
  • bandit -r python/ -ll clean
  • Cross-validated pypi_version output against packaging.version.Version and pypi_version_specifier against packaging.specifiers.SpecifierSet

🤖 Generated with Claude Code

williajm and others added 3 commits April 17, 2026 22:16
Adds 22 method pairs for seeding test databases of package registries
(PyPI, npm, Maven, Cargo, RubyGems). Cross-ecosystem primitives share
one API; ecosystem-specific shapes have their own methods where the
canonical form genuinely differs.

New providers:
- commit_sha / short_commit_sha
- semver / semver_prerelease / calver
- spdx_license (50 common IDs)
- git_username (strict GitHub rules)
- pypi_version (PEP 440), maven_version (with qualifiers)
- pypi/npm/cargo/maven/gem version constraints
- pypi/npm/cargo/gem package names, maven group/artifact/coordinate
- pypi_requirement (full pip-install line)

Bumps version to 0.4.0, updates README, ARCHITECTURE.md, CHANGELOG.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…elog

Those were shipped in #46 and #45 but never made it into CHANGELOG.md.
Adding them to the 0.4.0 release notes so users upgrading from 0.3.0 see
every user-facing feature in one place.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Current word lists can't produce a username long enough to enter the
truncation branch, so this fix is defensive rather than bug-chasing — but
the invariant was wrong: if a future data entry were a run of hyphens,
the pop loop would empty the string and we'd return "". Add a length
guard so at least one character always remains.

Found in review of PR branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
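
The length guard described in the commit above can be sketched as follows. This is a Python rendition written for illustration; the actual fix lives in the Rust provider, and `truncate_username` and `MAX_LEN` are hypothetical names.

```python
MAX_LEN = 39  # GitHub's username length ceiling, per the PR description


def truncate_username(candidate: str, max_len: int = MAX_LEN) -> str:
    """Cut to max_len, then strip trailing hyphens without emptying the string."""
    out = candidate[:max_len]
    # The `len(out) > 1` guard is the point of the fix: without it, an input
    # made only of hyphens would be popped down to "".
    while len(out) > 1 and out.endswith("-"):
        out = out[:-1]
    return out
```

The guard only ensures a non-empty result. A lone `-` would still break the username rules, so this stays a defensive measure rather than full validation.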
@codecov

codecov Bot commented Apr 17, 2026

Codecov Report

❌ Patch coverage is 98.57143% with 8 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/providers/packages.rs | 98.30% | 8 Missing ⚠️ |


Copilot AI left a comment

Pull request overview

This PR introduces a new “package registry data” provider to generate realistic fake registry identifiers (PyPI, npm, Maven, Cargo, RubyGems) across both the Rust and Python APIs, and rolls the project version forward to 0.4.0 with updated documentation/changelog entries.

Changes:

  • Add a new Rust provider (packages) plus PyO3-exposed Faker methods and Python module-level convenience functions for package-registry primitives, versions, constraints, and identifiers.
  • Add comprehensive Rust + Python test coverage for the new generators, including determinism and parallel-shape checks.
  • Bump version to 0.4.0 and update README/ARCHITECTURE/CHANGELOG to document the new provider and backfill prior shipped features.

Reviewed changes

Copilot reviewed 14 out of 15 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| src/providers/packages.rs | Implements package-registry generators (SHAs, versions, constraints, package IDs, requirements) with unit + proptest coverage. |
| src/providers/mod.rs | Registers the new packages provider module. |
| src/lib.rs | Exposes new provider methods on Faker (Rust API) and adds corresponding PyO3 Python-visible methods. |
| src/data/en_us/packages.rs | Adds wordlists for package keywords/modifiers, Maven components, npm scopes, prerelease tags, qualifiers. |
| src/data/en_us/spdx_licenses.rs | Adds curated list of common SPDX identifiers for generation. |
| src/data/en_us/mod.rs | Wires new packages and spdx_licenses datasets into the locale data exports. |
| python/forgery/__init__.py | Adds module-level convenience functions and exports; bumps __version__ to 0.4.0. |
| python/forgery/__init__.pyi | Adds type stubs for the new module-level convenience functions. |
| tests/test_packages.py | Adds Python test suite for new package-registry APIs (shape checks, determinism, parallel invariants, convenience functions). |
| README.md | Documents the new “Package Registry Data” generator section and example usage. |
| ARCHITECTURE.md | Adds provider/data-file documentation entries for packages + SPDX data. |
| CHANGELOG.md | Adds 0.4.0 section and updates comparison links/backfills. |
| Cargo.toml | Bumps crate version to 0.4.0. |
| Cargo.lock | Updates lockfile to reflect crate version bump. |
| pyproject.toml | Bumps Python package version to 0.4.0. |

Comment thread src/providers/packages.rs Outdated
Comment thread README.md Outdated
Comment thread tests/test_packages.py Outdated
Comment thread python/forgery/__init__.pyi
williajm and others added 2 commits April 17, 2026 22:59
Response to Copilot review on PR #48.

The docstring and README claimed "PEP 503 normalised" but the generator
was emitting underscores ~20% of the time. PEP 503 §Normalized Names
collapses runs of `[-_.]+` to a single `-`, so normalized output must
contain only `[a-z0-9-]`.

Changes:
- Drop the underscore branch from generate_pypi_package_name; hyphen is
  the sole separator. Collapse the two now-identical `py-{primary}`
  match arms.
- Tighten the Rust test to reject any char outside [a-z0-9-] and to
  enforce no leading/trailing/double hyphens.
- Tighten the Python regex to `^[a-z0-9]([a-z0-9-]*[a-z0-9])?$` and
  add explicit assertions against `_` and `.`.
- Revert the README / CHANGELOG softening from the previous iteration —
  the "PEP 503 normalised" claim is now accurate.
- Update the Rust + Python docstrings to explain the normalization.
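
For reference, the normalization rule this commit relies on is short enough to state in code. The `normalize` function below follows the rule as PEP 503 gives it; `is_normalized` is a hypothetical helper that combines the tightened regex from this commit with a round-trip check, and is only a sketch of how a test might assert the invariant.

```python
import re


def normalize(name: str) -> str:
    """Collapse runs of '-', '_' and '.' into one '-' and lowercase (PEP 503)."""
    return re.sub(r"[-_.]+", "-", name).lower()


# Same pattern as the tightened Python test regex in this commit.
NORMALIZED_RE = re.compile(r"^[a-z0-9]([a-z0-9-]*[a-z0-9])?$")


def is_normalized(name: str) -> bool:
    """True if `name` matches the pattern and normalizing it changes nothing."""
    # The round-trip check also rejects double hyphens, which the regex allows.
    return NORMALIZED_RE.fullmatch(name) is not None and normalize(name) == name
```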

Also addresses three other Copilot findings:
- Add the 44 missing Faker-class stubs to python/forgery/_forgery.pyi
  so IDE autocomplete and type checking work for callers using the
  Faker class directly.
- Fix a broken assertion in test_sometimes_has_qualifier: the check
  `"." in v.split(".")[-1]` was always false (the last split
  segment never contains a dot). Replace with `v.count(".") > 2`,
  which correctly identifies dot-separated Maven qualifiers like
  `.Final` / `.RELEASE`.
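
The difference between the two checks is easy to see side by side. This sketch assumes a three-part numeric base version, as the `> 2` threshold implies; `old_check` and `has_dot_qualifier` are hypothetical names for illustration.

```python
def old_check(version: str) -> bool:
    """The broken assertion: the last '.'-split segment never contains a '.'."""
    return "." in version.split(".")[-1]


def has_dot_qualifier(version: str) -> bool:
    """The replacement: more than two dots means a dot-separated qualifier."""
    return version.count(".") > 2


for v in ("1.2.3", "1.2.3.Final", "1.2.3.RELEASE", "1.2.3-SNAPSHOT"):
    print(v, old_check(v), has_dot_qualifier(v))
```

Hyphen-separated qualifiers such as `-SNAPSHOT` are deliberately not matched by the dot count.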

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Nine methods that select from combinatorial patterns now accept
unique=True to guarantee no duplicates, matching the names(n, unique=True)
contract used elsewhere in the library. Useful for seeding registry
tables that have a unique-name constraint.

Methods with unique support:
- pypi_package_names, npm_package_names, cargo_package_names, gem_names
- maven_group_ids, maven_artifact_ids, maven_coordinates
- git_usernames
- spdx_licenses (capped at 50 — the pool size)

Also:
- PACKAGE_KEYWORDS: 94 -> 245 entries
- PACKAGE_MODIFIERS: 32 -> 67 entries
  Combinatorial headroom is now ~77k distinct pypi names, ~1.9M distinct
  git usernames, millions for maven coordinates — plenty of room before
  UniqueExhaustedError hits for realistic batch sizes.

Implementation:
- New batch_simple_unique! macro mirrors batch_locale_unique! for
  generators that don't take locale; wraps them in a closure that
  ignores the locale argument generate_unique passes.
- Return type shifts from Result<_, BatchSizeError> to
  Result<_, ForgeryError> for the affected methods, since ForgeryError
  covers both batch-size and unique-exhaustion failures.
- Python signatures gain `unique: bool = False` (keyword-compatible, so
  existing calls keep working).
- Both .pyi stub files updated.
- New TestUnique pytest class covers all 9 methods: no-duplicates,
  determinism under seed, exhaustion error, non-unique path unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
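
The unique-generation contract described in the commit above (no duplicates, deterministic under a seed, an error once the pool is exhausted) can be illustrated with a small rejection-sampling sketch. Everything here is an assumption made for illustration: the retry budget, the word lists, and the `generate_unique` signature are hypothetical and do not reflect the library's Rust implementation.

```python
import random
from typing import Callable


class UniqueExhaustedError(Exception):
    """Hypothetical stand-in for the library's error of the same name."""


def generate_unique(
    gen: Callable[[random.Random], str],
    n: int,
    rng: random.Random,
    max_attempts_per_item: int = 100,  # assumed retry budget
) -> list[str]:
    """Draw from `gen` until n distinct values are collected or the budget runs out."""
    seen: set[str] = set()
    out: list[str] = []
    attempts = 0
    limit = n * max_attempts_per_item
    while len(out) < n:
        if attempts >= limit:
            raise UniqueExhaustedError(f"only {len(out)} of {n} unique values found")
        attempts += 1
        value = gen(rng)
        if value not in seen:
            seen.add(value)
            out.append(value)
    return out


# Tiny illustrative pools: 2 modifiers x 3 keywords = 6 distinct names.
MODIFIERS = ["fast", "tiny"]
KEYWORDS = ["http", "json", "cache"]


def fake_package_name(rng: random.Random) -> str:
    return f"{rng.choice(MODIFIERS)}-{rng.choice(KEYWORDS)}"
```

With pools this small, asking for seven names exhausts the budget and raises, which mirrors why the commit enlarges the keyword and modifier lists.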
@sonarqubecloud

@williajm williajm merged commit ee5d0d1 into main Apr 17, 2026
13 checks passed
@williajm williajm deleted the feat/package-registry-providers branch April 17, 2026 22:20
williajm added a commit that referenced this pull request Apr 17, 2026
The v0.4.0 release workflow failed at the Publish step because twine
rejected sbom.cdx.json as an InvalidDistribution. The first download
step used `merge-multiple: true` with no filter, which pulled every
workflow artifact — including the SBOM — into dist/. pypa/gh-action-pypi-publish
v1.13.0 then ran twine against dist/* and choked on the JSON file:

    Checking dist/sbom.cdx.json: ERROR InvalidDistribution:
    Unknown distribution format: 'sbom.cdx.json'

Fix: replace the single "download all" step with three targeted steps.
Wheels (pattern wheels-*) and sdist (name: sdist) land in dist/; the
SBOM (name: sbom, no path) lands in the workflow root where the later
`gh release upload` step already expects it. twine now only sees valid
distribution files.

Noticed on PR #48's v0.4.0 release run; nothing ever made it to PyPI
and the GitHub release assets never populated. Re-tagging v0.4.0 after
this merges will go through cleanly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
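
A failure of this kind can also be caught before the publish step with a small pre-flight check over `dist/`. The sketch below is a hypothetical helper, not part of the workflow in this commit. It assumes wheels and gzipped sdists are the only files expected in the directory, which is narrower than what twine itself accepts.

```python
from pathlib import Path

# Assumed set of expected distribution suffixes for this sketch.
DIST_SUFFIXES = (".whl", ".tar.gz")


def unexpected_files(dist_dir: Path) -> list[str]:
    """Return sorted names of files in dist_dir that are not wheels or sdists."""
    return sorted(
        p.name
        for p in dist_dir.iterdir()
        if p.is_file() and not p.name.endswith(DIST_SUFFIXES)
    )
```

Failing the job when this list is non-empty would have flagged `sbom.cdx.json` before the upload was attempted.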