CorpusForge

CorpusForge is a planned offline, deterministic corpus compiler for hostile text. It is intended for engineers who need reproducible inputs that stress tokenizers, parsers, renderers, compression behavior, Unicode handling, and text preprocessing pipelines.

The project is not an AI writing tool, a local language model, or a generic lorem ipsum generator. Its goal is engineering reliability: generate adversarial text and byte corpora that can be reproduced, minimized, and turned into regression fixtures.

Vision

CorpusForge is meant to make hostile text testing practical in local development and CI.

The product direction is:

deterministic generation from explicit seeds and profile metadata
Unicode-aware tokenizer and parser torture testing
reproducible failing samples and byte ranges
shrinking/minimization of failure cases
transparent profile formats that can be inspected and verified
offline-first operation with no telemetry or cloud dependency
clear documentation of what is supported, partial, unstable, or intentionally unsupported

Current State

CorpusForge is at an early implementation stage.

This repository contains a Rust workspace with shared error types and deterministic seed and stream primitives. On top of that foundation it implements .cff v0 profile support (Milestone 3), built-in tokenizer Unicode workflows (Milestone 6), a narrow shrink/replay MVP (Milestone 7), and narrow fixture/template-based grammar generation (Milestone 8). The corpusforge binary prints top-level and command-specific help, and .cff v0 profiles for deterministic fixtures can be built, read, inspected, and verified.
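To illustrate what a deterministic seed and stream primitive can look like, here is a minimal sketch. It is not CorpusForge's actual API: the `SeedStream` type and its methods are hypothetical, and SplitMix64 is used only because it is small and well known. The point is that the same seed always yields the same bytes and the same fixture choices.

```rust
/// Hypothetical deterministic stream (illustration only, not the
/// CorpusForge primitive). Uses the SplitMix64 mixing function.
pub struct SeedStream {
    state: u64,
}

impl SeedStream {
    /// Start a stream from an explicit seed.
    pub fn new(seed: u64) -> Self {
        Self { state: seed }
    }

    /// Advance the stream and return the next 64-bit value.
    pub fn next_u64(&mut self) -> u64 {
        self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Produce exactly `n` bytes from the stream.
    pub fn fill(&mut self, n: usize) -> Vec<u8> {
        let mut out = Vec::with_capacity(n);
        while out.len() < n {
            let bytes = self.next_u64().to_le_bytes();
            let take = (n - out.len()).min(bytes.len());
            out.extend_from_slice(&bytes[..take]);
        }
        out
    }

    /// Pick one fixture deterministically; `None` if the slice is empty.
    pub fn pick<'a, T>(&mut self, items: &'a [T]) -> Option<&'a T> {
        if items.is_empty() {
            return None;
        }
        let index = (self.next_u64() % items.len() as u64) as usize;
        items.get(index)
    }
}
```

Two streams created with the same seed return identical results from `fill` and `pick`, which is the property that reproducible generation and replay depend on.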

The corpusforge-unicode crate includes Milestone 4 library APIs for deterministic, fixture-based Unicode adversarial generation. The CLI also exposes these fixture-based tokenizer modes through corpusforge gen --unicode ... and corpusforge ci tokenizer: grapheme, bidi, zero-width, emoji, normalization, mixed, and invalid-utf8.

Unicode output boundaries are intentionally separate. Valid-text generation returns UTF-8 text and rejects invalid-utf8. Raw-byte generation returns bytes and is the only supported path for invalid-utf8 cases. The current implementation samples from built-in fixtures with deterministic streams; it is not a broad Unicode or tokenizer compatibility guarantee.
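The text/bytes boundary described above can be sketched as two separate entry points. This is an assumption-laden illustration: the `Mode` enum mirrors the mode names listed earlier, but `generate_text`, `generate_bytes`, `GenError`, and the fixture contents are hypothetical and are not the corpusforge-unicode API.

```rust
/// Mode names follow the README; everything else here is hypothetical.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Mode {
    Grapheme,
    Bidi,
    ZeroWidth,
    Emoji,
    Normalization,
    Mixed,
    InvalidUtf8,
}

#[derive(Debug, PartialEq, Eq)]
pub enum GenError {
    /// invalid-utf8 was requested from the valid-text path.
    InvalidUtf8NeedsBytes,
}

/// Illustrative fixtures only; one fixed sample per valid-text mode.
fn text_fixture(mode: Mode) -> Option<&'static str> {
    match mode {
        Mode::Grapheme => Some("e\u{0301}"),
        Mode::Bidi => Some("abc\u{202E}def"),
        Mode::ZeroWidth => Some("a\u{200B}b"),
        Mode::Emoji => Some("\u{1F469}\u{200D}\u{1F4BB}"),
        Mode::Normalization => Some("\u{00E9} e\u{0301}"),
        Mode::Mixed => Some("e\u{0301}\u{200B}\u{202E}\u{1F469}"),
        Mode::InvalidUtf8 => None,
    }
}

/// Valid-text path: always returns UTF-8 and rejects invalid-utf8.
pub fn generate_text(mode: Mode) -> Result<String, GenError> {
    text_fixture(mode)
        .map(str::to_owned)
        .ok_or(GenError::InvalidUtf8NeedsBytes)
}

/// Raw-byte path: the only path that may return malformed UTF-8.
pub fn generate_bytes(mode: Mode) -> Vec<u8> {
    match text_fixture(mode) {
        Some(text) => text.as_bytes().to_vec(),
        // 0xC3 followed by 0x28 is a broken two-byte sequence; 0xFF is never valid.
        None => vec![b'a', 0xC3, 0x28, 0xFF],
    }
}
```

Keeping the return types distinct (`String` versus `Vec<u8>`) means a caller cannot accidentally receive malformed bytes through a text API.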

N-gram training and profile-backed generation are implemented as a byte-level bigram MVP. corpusforge ci tokenizer can run an external stdin harness against built-in tokenizer Unicode samples and write a stable JSON report.
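For readers unfamiliar with byte-level bigram models, the following sketch shows the general technique: count which byte follows which, then sample followers in proportion to those counts using a seeded generator. The `Bigram` type, its methods, and the SplitMix64 helper are hypothetical and do not describe the model format stored in .cff profiles.

```rust
use std::collections::BTreeMap;

/// Small deterministic PRNG step (SplitMix64), used only for this sketch.
fn splitmix64(state: &mut u64) -> u64 {
    *state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
    let mut z = *state;
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

/// Hypothetical byte-level bigram model: counts[prev][next] = occurrences.
/// BTreeMap keeps iteration order stable, which keeps sampling deterministic.
pub struct Bigram {
    counts: BTreeMap<u8, BTreeMap<u8, u64>>,
}

impl Bigram {
    /// Count every adjacent byte pair in the training data.
    pub fn train(data: &[u8]) -> Self {
        let mut counts: BTreeMap<u8, BTreeMap<u8, u64>> = BTreeMap::new();
        for pair in data.windows(2) {
            *counts.entry(pair[0]).or_default().entry(pair[1]).or_insert(0) += 1;
        }
        Self { counts }
    }

    /// Generate up to `len` bytes starting at `start`. Stops early if the
    /// current byte was never followed by anything in the training data.
    pub fn generate(&self, start: u8, len: usize, seed: u64) -> Vec<u8> {
        let mut state = seed;
        let mut out = Vec::with_capacity(len);
        let mut current = start;
        while out.len() < len {
            out.push(current);
            let Some(followers) = self.counts.get(&current) else {
                break;
            };
            let total: u64 = followers.values().sum();
            // Weighted pick: walk the followers until the roll is used up.
            let mut roll = splitmix64(&mut state) % total;
            for (&byte, &count) in followers {
                if roll < count {
                    current = byte;
                    break;
                }
                roll -= count;
            }
        }
        out
    }
}
```

Because both the follower order and the random stream are fixed by the inputs, calling `generate` twice with the same model, start byte, and seed returns the same bytes.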

Milestone 7 adds byte-level shrinking and profile-backed replay by byte range. corpusforge shrink reads an original failing byte input, invokes a predicate executable directly without a shell, writes candidate bytes to the predicate's stdin, and preserves a reproducible failure signature. Predicate exit code 0 means the candidate passed; a nonzero exit means it failed. A timeout is preserved only when the original input consistently times out, and flaky predicates are rejected. Defaults are --timeout-ms 1000 and --max-runs 10000.
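The idea behind byte-level shrinking can be sketched with a simple chunk-removal loop. This is not the corpusforge shrink algorithm, which is not described here; the `shrink` function below is hypothetical, and a closure stands in for the external predicate (returning `true` corresponds to a nonzero exit, meaning the candidate still fails). The `max_runs` budget plays the role of --max-runs.

```rust
/// Hypothetical byte-level shrinker: try deleting chunks of decreasing size
/// and keep any deletion after which the predicate still reports failure.
pub fn shrink<F>(input: &[u8], max_runs: usize, mut still_fails: F) -> Vec<u8>
where
    F: FnMut(&[u8]) -> bool,
{
    let mut best = input.to_vec();
    let mut runs = 0;
    let mut chunk = (best.len() / 2).max(1);

    loop {
        let mut start = 0;
        while start < best.len() {
            if runs >= max_runs {
                return best; // budget exhausted; return the smallest failure so far
            }
            let end = (start + chunk).min(best.len());

            // Candidate = best with bytes [start, end) removed.
            let mut candidate = Vec::with_capacity(best.len() - (end - start));
            candidate.extend_from_slice(&best[..start]);
            candidate.extend_from_slice(&best[end..]);

            runs += 1;
            if still_fails(&candidate) {
                // Deletion kept the failure: adopt it and retry at the same offset.
                best = candidate;
            } else {
                // Deletion lost the failure: keep these bytes and move on.
                start += chunk;
            }
        }
        if chunk == 1 {
            return best;
        }
        chunk /= 2;
    }
}
```

For example, with a predicate that fails whenever the input contains the bytes `BAD`, shrinking `xxxxBADyyyy` with this sketch leaves just `BAD`. A real shrinker would also need the timeout and flakiness checks described above, which this sketch omits.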

corpusforge replay reads a .cff profile with an embedded n-gram model, accepts --seed or --seed-file, and replays a half-open --range <start>..<end> byte range. Without --out, replay writes binary bytes directly to stdout. --json requires --out because stdout otherwise carries replayed bytes. Shrink and replay can write stable metadata JSON without timestamps.
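A half-open range `<start>..<end>` includes `start` and excludes `end`, so `2..5` selects bytes 2, 3, and 4. The sketch below shows one way to parse and apply such a range. The function names are hypothetical, and details such as whether an empty range like `3..3` is accepted by corpusforge replay are assumptions.

```rust
use std::ops::Range;

/// Parse "<start>..<end>" into a half-open range (hypothetical helper).
/// Inclusive syntax such as "1..=5" is rejected because "=5" is not a number.
pub fn parse_range(s: &str) -> Result<Range<usize>, String> {
    let (start, end) = s
        .split_once("..")
        .ok_or_else(|| format!("expected <start>..<end>, got {s:?}"))?;
    let start: usize = start
        .parse()
        .map_err(|e| format!("bad start {start:?}: {e}"))?;
    let end: usize = end
        .parse()
        .map_err(|e| format!("bad end {end:?}: {e}"))?;
    if start > end {
        return Err(format!("start {start} is past end {end}"));
    }
    Ok(start..end)
}

/// Take bytes [start, end) from any deterministic byte stream.
pub fn replay_range<I>(stream: I, range: Range<usize>) -> Vec<u8>
where
    I: Iterator<Item = u8>,
{
    stream.skip(range.start).take(range.len()).collect()
}
```

Because the stream is deterministic, replaying a range yields the same bytes that occupied those offsets in the original full output.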

Profile format support is limited to unstable .cff v0 behavior with no cross-version compatibility guarantee. The shrinker is byte-level, not Unicode-aware or structure-aware. Replay currently uses direct profile, seed, and range flags rather than consuming a saved metadata file. Broader deterministic output guarantees, compatibility claims, and generation behavior should be treated as planned until implemented and covered by tests.

Milestone 8 adds built-in grammar generation for Markdown and JSON through corpusforge gen --grammar markdown|json --grammar-mode valid|near-valid|malformed. Grammar output is UTF-8 text only and is built from deterministic fixtures/templates. It is not a full Markdown or JSON conformance suite, and it is not backed by .cff profiles yet. Grammar generation can optionally compose valid-text Unicode fixture modes into leaf content with --unicode <mode>; invalid-utf8 does not compose with grammar generation because grammar output is valid UTF-8. See Grammar workflow demo for the current harness-oriented workflow.
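As a rough illustration of template-based grammar modes, the sketch below builds one JSON sample per mode around a leaf string. The templates, the `json_sample` function, and the interpretation of near-valid (a trailing comma) and malformed (an unterminated string) are assumptions made for this example; the built-in CorpusForge fixtures may differ.

```rust
/// Mode names follow the --grammar-mode values; the templates are hypothetical.
#[derive(Debug, Clone, Copy)]
pub enum GrammarMode {
    Valid,
    NearValid,
    Malformed,
}

/// Escape a leaf so it is safe inside a JSON string literal.
fn escape_json(leaf: &str) -> String {
    let mut out = String::with_capacity(leaf.len());
    for c in leaf.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Build a single-key JSON object whose value is the given leaf text.
pub fn json_sample(mode: GrammarMode, leaf: &str) -> String {
    let value = escape_json(leaf);
    match mode {
        // Well-formed object.
        GrammarMode::Valid => format!("{{\"key\": \"{value}\"}}"),
        // One small deviation: a trailing comma that strict parsers reject.
        GrammarMode::NearValid => format!("{{\"key\": \"{value}\",}}"),
        // Structurally broken: the string and object are never closed.
        GrammarMode::Malformed => format!("{{\"key\": \"{value}"),
    }
}
```

The leaf can be any valid UTF-8 text, for example a zero-width fixture such as `a\u{200B}b`, which matches the way valid-text Unicode modes compose into grammar output. Every mode still returns a `String`, so the output is valid UTF-8 even when it is not valid JSON.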

Do not rely on CorpusForge for production workflows yet.

Intended Users

CorpusForge is for engineers who need to answer practical questions such as:

whether a tokenizer handles hostile Unicode and byte sequences correctly
why a parser or renderer crashes on rare text edge cases
how to reproduce a text-processing failure from a seed and profile
how to minimize a failing input into a stable regression fixture
how ingestion, embedding, RAG, or preprocessing pipelines behave with adversarial text
how to run deterministic text stress tests in CI without network access

Planned Scope

Initial development is focused on:

seedable corpus profiles
deterministic text and byte generation
Unicode adversarial modes
built-in Markdown and JSON grammar fixtures
profile inspection and verification
reproducible replay
shrinking/minimization workflows
CI-friendly reports

Later work may connect grammar generation to .cff profiles, expand coverage, or add grammar-specific CI reports once those behaviors are implemented and tested.

Non-Goals

CorpusForge is not intended to be:

a generic AI text generator
a transformer runtime
a hosted service
a telemetry-backed product
a tool that requires a cloud account
a replacement for format-specific conformance suites
a guarantee of parser or tokenizer correctness without evidence

Principles

offline by default
no telemetry by default
deterministic output where practical
reproducible profiles, generated corpora, replay ranges, and minimized cases
compatibility and reliability claims backed by tests
explicit unsupported and partial behavior
stable, inspectable report formats
cross-platform development and CI friendliness

Development

Development currently uses a CLI-first Rust workspace with deterministic tests. A single static binary is the distribution goal.

Local checks:

cargo fmt --check
cargo clippy --workspace --all-targets -- -D warnings
cargo test --workspace
cargo run -p corpusforge-cli -- --help

These checks should stay offline and deterministic.

Project documentation:

License

Licensed under the Apache License, Version 2.0. See LICENSE.
