rdzehtsiar/corpusforge

CorpusForge

CorpusForge is a planned offline, deterministic corpus compiler for hostile text. It is intended for engineers who need reproducible inputs that stress tokenizers, parsers, renderers, compression behavior, Unicode handling, and text preprocessing pipelines.

The project is not an AI writing tool, a local language model, or a generic lorem ipsum generator. Its goal is engineering reliability: generate adversarial text and byte corpora that can be reproduced, minimized, and turned into regression fixtures.

Vision

CorpusForge is meant to make hostile text testing practical in local development and CI.

The product direction is:

  • deterministic generation from explicit seeds and profile metadata
  • Unicode-aware tokenizer and parser torture testing
  • reproducible failing samples and byte ranges
  • shrinking/minimization of failure cases
  • transparent profile formats that can be inspected and verified
  • offline-first operation with no telemetry or cloud dependency
  • clear documentation of what is supported, partial, unstable, or intentionally unsupported

Current State

CorpusForge is at an early implementation stage.

This repository contains a Rust workspace with shared error types and deterministic seed and stream primitives. On top of that foundation it implements Milestone 3 .cff v0 profile support, Milestone 6 built-in tokenizer Unicode workflows, a narrow Milestone 7 shrink/replay MVP, and narrow Milestone 8 fixture/template-based grammar generation. The corpusforge binary prints top-level and command-specific help, and .cff v0 profile build, read, inspect, and verify workflows exist for deterministic fixture profiles.

The corpusforge-unicode crate includes Milestone 4 library APIs for deterministic, fixture-based Unicode adversarial generation. The CLI also exposes these fixture-based tokenizer modes through corpusforge gen --unicode ... and corpusforge ci tokenizer: grapheme, bidi, zero-width, emoji, normalization, mixed, and invalid-utf8.

Unicode output boundaries are intentionally separate. Valid-text generation returns UTF-8 text and rejects invalid-utf8. Raw-byte generation returns bytes and is the only supported path for invalid-utf8 cases. The current implementation samples from built-in fixtures with deterministic streams; it is not a broad Unicode or tokenizer compatibility guarantee.
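The separation between the valid-text and raw-byte paths can be illustrated with a minimal sketch. The Mode enum, the function names, and the fixture bytes below are hypothetical and are not the corpusforge-unicode API; the sketch only shows the boundary rule that invalid-utf8 is reachable through the byte path alone.

```rust
/// Hypothetical subset of modes, for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Mode {
    Grapheme,
    InvalidUtf8,
}

/// Raw-byte path: may return bytes that are not well-formed UTF-8.
pub fn generate_bytes(mode: Mode) -> Vec<u8> {
    match mode {
        // "e" followed by U+0301 COMBINING ACUTE ACCENT (valid UTF-8).
        Mode::Grapheme => "e\u{301}".as_bytes().to_vec(),
        // 0xC3 starts a two-byte sequence, but 0x28 is not a continuation byte.
        Mode::InvalidUtf8 => vec![0x61, 0xC3, 0x28],
    }
}

/// Valid-text path: always returns a String and refuses invalid-utf8 up front.
pub fn generate_text(mode: Mode) -> Result<String, String> {
    if mode == Mode::InvalidUtf8 {
        return Err("invalid-utf8 is only available on the raw-byte path".to_string());
    }
    // Defensive check: text output must be well-formed UTF-8.
    String::from_utf8(generate_bytes(mode)).map_err(|e| e.to_string())
}
```

Keeping the two return types distinct (String versus Vec<u8>) means a caller cannot accidentally treat invalid bytes as text.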

N-gram training and profile-backed generation are implemented as a byte-level bigram MVP. corpusforge ci tokenizer can run an external stdin harness against built-in tokenizer Unicode samples and write a stable JSON report.
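As a rough illustration of what a byte-level bigram model involves, the sketch below counts adjacent byte pairs and then samples from those counts with a seeded generator. It is an assumption-laden toy, not the CorpusForge implementation: the Bigrams type, the train and generate functions, and the choice of SplitMix64 as the seeded stream are all hypothetical, and nothing here reflects the .cff on-disk format.

```rust
use std::collections::BTreeMap;

/// Bigram counts: previous byte -> (next byte -> count).
/// BTreeMap keeps iteration order stable, which matters for determinism.
pub type Bigrams = BTreeMap<u8, BTreeMap<u8, u64>>;

/// Count every adjacent byte pair in the training data.
pub fn train(data: &[u8]) -> Bigrams {
    let mut model = Bigrams::new();
    for pair in data.windows(2) {
        *model.entry(pair[0]).or_default().entry(pair[1]).or_default() += 1;
    }
    model
}

/// One SplitMix64 step (stand-in for a deterministic stream primitive).
fn next_u64(state: &mut u64) -> u64 {
    *state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
    let mut z = *state;
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

/// Emit up to `len` bytes starting at `start`, choosing each successor
/// in proportion to its count. Stops early if a byte has no successor.
pub fn generate(model: &Bigrams, start: u8, len: usize, seed: u64) -> Vec<u8> {
    let mut state = seed;
    let mut out = Vec::with_capacity(len);
    let mut prev = start;
    while out.len() < len {
        out.push(prev);
        let Some(nexts) = model.get(&prev) else { break };
        let total: u64 = nexts.values().sum();
        // Modulo introduces a slight bias; acceptable for a sketch.
        let mut pick = next_u64(&mut state) % total;
        for (&byte, &count) in nexts {
            if pick < count {
                prev = byte;
                break;
            }
            pick -= count;
        }
    }
    out
}
```

Because the model is an ordered map and the only randomness comes from the seed, the same training data, start byte, length, and seed always produce the same bytes.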

Milestone 7 adds byte-level shrinking and profile-backed replay by byte range. corpusforge shrink reads an original failing byte input, invokes a predicate executable directly without a shell, writes candidate bytes to the predicate stdin, and preserves a reproducible failure signature. Predicate exit code 0 means the candidate passed; a nonzero exit means failure. A timeout is preserved only when the original input consistently times out, and flaky predicates are rejected. Defaults are --timeout-ms 1000 and --max-runs 10000.
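To make the byte-level shrinking idea concrete, here is a minimal chunk-removal sketch under stated assumptions. The shrink function and its signature are hypothetical; a closure stands in for the external predicate process (returning true where the real predicate would exit nonzero), and timeouts, failure signatures, and flakiness checks are left out. The actual corpusforge shrink strategy may differ.

```rust
/// Greedy chunk-removal shrinker over raw bytes.
/// `fails` returns true when the candidate still reproduces the failure.
/// `max_runs` caps predicate invocations, loosely mirroring --max-runs.
pub fn shrink<F>(input: &[u8], mut fails: F, max_runs: usize) -> Vec<u8>
where
    F: FnMut(&[u8]) -> bool,
{
    let mut current = input.to_vec();
    let mut runs = 0;
    // Start with large chunks and halve down to single bytes.
    let mut chunk = (current.len() / 2).max(1);
    while chunk > 0 {
        let mut start = 0;
        while start < current.len() {
            if runs >= max_runs {
                return current;
            }
            let end = (start + chunk).min(current.len());
            // Candidate = current input with bytes [start, end) removed.
            let mut candidate = Vec::with_capacity(current.len() - (end - start));
            candidate.extend_from_slice(&current[..start]);
            candidate.extend_from_slice(&current[end..]);
            runs += 1;
            if fails(&candidate) {
                // Removal kept the failure: accept it and retry at the same offset.
                current = candidate;
            } else {
                // Removal lost the failure: keep these bytes and move on.
                start = end;
            }
        }
        chunk /= 2;
    }
    current
}
```

For example, with a predicate that fails whenever the input contains byte 0xFF, shrinking b"abc\xFFdef" in this sketch converges on the single byte 0xFF. The loop operates on bytes only, so it can split a multi-byte UTF-8 sequence, which matches the stated limitation that the shrinker is not Unicode-aware.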

corpusforge replay reads a .cff profile with an embedded n-gram model, accepts --seed or --seed-file, and replays a half-open --range <start>..<end> byte range. Without --out, replay writes binary bytes directly to stdout. --json requires --out because stdout otherwise carries replayed bytes. Shrink and replay can write stable metadata JSON without timestamps.
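The half-open --range <start>..<end> convention can be sketched as follows. The parse_range and slice_range helpers are hypothetical, and the error messages and the choice to accept an empty range (start equal to end) are assumptions rather than documented CorpusForge behavior.

```rust
use std::ops::Range;

/// Parse "<start>..<end>" into a half-open byte range [start, end).
pub fn parse_range(s: &str) -> Result<Range<usize>, String> {
    let (start, end) = s
        .split_once("..")
        .ok_or_else(|| format!("expected <start>..<end>, got {s:?}"))?;
    let start: usize = start.parse().map_err(|e| format!("bad start: {e}"))?;
    let end: usize = end.parse().map_err(|e| format!("bad end: {e}"))?;
    if start > end {
        return Err(format!("start {start} is greater than end {end}"));
    }
    Ok(start..end)
}

/// Return the bytes covered by the range, or None if it runs past the data.
pub fn slice_range(data: &[u8], range: Range<usize>) -> Option<&[u8]> {
    data.get(range)
}
```

Under this convention 4..8 covers four bytes (offsets 4, 5, 6, and 7), and adjacent ranges such as 0..4 and 4..8 tile the stream without overlap.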

Profile format support is limited to unstable .cff v0 behavior with no cross-version compatibility guarantee. The shrinker is byte-level, not Unicode-aware or structure-aware. Replay currently uses direct profile, seed, and range flags rather than consuming a saved metadata file. Broader deterministic output guarantees, compatibility claims, and generation behavior should be treated as planned until implemented and covered by tests.

Milestone 8 adds built-in grammar generation for Markdown and JSON through corpusforge gen --grammar markdown|json --grammar-mode valid|near-valid|malformed. Grammar output is UTF-8 text only and is built from deterministic fixtures/templates. It is not a full Markdown or JSON conformance suite, and it is not backed by .cff profiles yet. Grammar generation can optionally compose valid-text Unicode fixture modes into leaf content with --unicode <mode>; invalid-utf8 does not compose with grammar generation because grammar output is valid UTF-8. See Grammar workflow demo for the current harness-oriented workflow.

Do not rely on CorpusForge for production workflows yet.

Intended Users

CorpusForge is for engineers who need to answer practical questions such as:

  • whether a tokenizer handles hostile Unicode and byte sequences correctly
  • why a parser or renderer crashes on rare text edge cases
  • how to reproduce a text-processing failure from a seed and profile
  • how to minimize a failing input into a stable regression fixture
  • how ingestion, embedding, RAG, or preprocessing pipelines behave with adversarial text
  • how to run deterministic text stress tests in CI without network access

Planned Scope

Initial development is focused on:

  • seedable corpus profiles
  • deterministic text and byte generation
  • Unicode adversarial modes
  • built-in Markdown and JSON grammar fixtures
  • profile inspection and verification
  • reproducible replay
  • shrinking/minimization workflows
  • CI-friendly reports

Later work may connect grammar generation to .cff profiles, expand coverage, or add grammar-specific CI reports once those behaviors are implemented and tested.

Non-Goals

CorpusForge is not intended to be:

  • a generic AI text generator
  • a transformer runtime
  • a hosted service
  • a telemetry-backed product
  • a tool that requires a cloud account
  • a replacement for format-specific conformance suites
  • a guarantee of parser or tokenizer correctness without evidence

Principles

  • offline by default
  • no telemetry by default
  • deterministic output where practical
  • reproducible profiles, generated corpora, replay ranges, and minimized cases
  • compatibility and reliability claims backed by tests
  • explicit unsupported and partial behavior
  • stable, inspectable report formats
  • cross-platform development and CI friendliness

Development

Development currently uses a CLI-first Rust workspace with deterministic tests. A single static binary is the distribution goal.

Local checks:

cargo fmt --check
cargo clippy --workspace --all-targets -- -D warnings
cargo test --workspace
cargo run -p corpusforge-cli -- --help

These checks should stay offline and deterministic.

License

Licensed under the Apache License, Version 2.0. See LICENSE.
