Install crowd-code 2.0 to help crowd-source the next-generation coding dataset.
Rust CLI for fast anonymization of CSV datasets that contain crowd-sourced software engineering traces. It scans cells for PII, leaked secrets, developer metadata, and high-entropy tokens, then writes cleaned CSVs and an optional JSON report.
- Parallel redaction with progress reporting; optional dry-run and interactive review modes.
- PII detector: emails, international phone numbers, IPv4 addresses, SSN/ITIN/PAN/NRIC, IBAN (strict adds extra IDs).
- Secrets detector: AWS keys, GitHub/GitLab tokens, Stripe/OpenAI/HuggingFace/Slack/Discord/Telegram keys, JWTs, generic key/value secrets, and PEM private keys.
- Name detection using the bundled `names-data/{first,last}_names.json` (top-1000 names per country) with accent normalization.
- Custom/structural patterns: user home paths across OSes, SSH key comments, git author lines (strict), plus a high-entropy catch-all to flag unknown tokens.
- JSON report writer capturing file-level stats and each redaction (row/offset/text/label/source).
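
The high-entropy catch-all mentioned above can be pictured with a small sketch. It assumes the check is based on Shannon entropy over a token's characters; the tool's actual heuristic may differ, and the names `shannon_entropy`, `looks_high_entropy`, `MIN_TOKEN_LEN`, and `MIN_ENTROPY_BITS`, as well as the threshold values, are hypothetical and chosen only for illustration.

```rust
use std::collections::HashMap;

/// Shannon entropy of `token`, in bits per character.
/// An empty token has an entropy of 0.
fn shannon_entropy(token: &str) -> f64 {
    let mut counts: HashMap<char, usize> = HashMap::new();
    let mut total = 0usize;
    for ch in token.chars() {
        *counts.entry(ch).or_insert(0) += 1;
        total += 1;
    }
    if total == 0 {
        return 0.0;
    }
    counts
        .values()
        .map(|&n| {
            let p = n as f64 / total as f64;
            -p * p.log2()
        })
        .sum()
}

// Hypothetical thresholds (assumptions, not the tool's real settings):
// short tokens are skipped because their entropy estimate is unreliable.
const MIN_TOKEN_LEN: usize = 20;
const MIN_ENTROPY_BITS: f64 = 3.5;

/// Flags tokens that are long enough and whose characters are spread
/// evenly enough to resemble a random key rather than natural text.
fn looks_high_entropy(token: &str) -> bool {
    token.chars().count() >= MIN_TOKEN_LEN && shannon_entropy(token) >= MIN_ENTROPY_BITS
}
```

Under these assumed thresholds, a repetitive phrase such as `hello world hello world` stays below the cutoff, while a 40-character mixed-case key-like string exceeds it. A catch-all of this kind trades precision for recall, which is presumably why it only flags tokens that the dedicated detectors did not recognize.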
- Requires a recent Rust toolchain.
- Build locally: `cargo build --release` (binary at `target/release/anonymize`).
- Or install into your cargo bin: `cargo install --path .`
anonymize <input> [--output <dir>] [--recursive] [--dry-run] [--strict] [--review] \
[-j <workers>] [--report <path>] [--names-dir <dir>]
- Single CSV → new directory: `anonymize data/events.csv --output anonymized/`
- Directory, recurse, strict detection, and JSON report: `anonymize ./traces --recursive --strict --report report.json`
- Manual review without writing files: `anonymize ./traces --review --dry-run`
- Input can be a CSV file or a directory; `--recursive` descends into subdirectories.
- Without `--output`, results are written next to inputs using `file.anonymized.csv`.
- Review mode uses arrow keys (`→` yes, `←` no, `↑` redo) or `y`/`n`; decisions are cached per unique text+label.
- Name detection looks for `first_names.json` and `last_names.json` in `names-data/` by default; override with `--names-dir`.
- The JSON report (via `--report`) includes every redaction with row/line/column offsets to support audits.
- Run tests: `cargo test`
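
The review-mode note above says decisions are cached per unique text+label pair. A minimal sketch of that idea follows, assuming a plain in-memory map; `Decision`, `DecisionCache`, and `decide` are hypothetical names for illustration and are not taken from the tool's source. The `↑` redo action is not modeled here.

```rust
use std::collections::HashMap;

/// Outcome of a single review prompt (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Decision {
    Redact,
    Keep,
}

/// Remembers the reviewer's answer for each (text, label) pair so the
/// same finding is only prompted once per run.
#[derive(Default)]
struct DecisionCache {
    decisions: HashMap<(String, String), Decision>,
}

impl DecisionCache {
    /// Returns the cached decision for (text, label). `ask` is called
    /// only when no decision has been recorded for that pair yet.
    fn decide<F>(&mut self, text: &str, label: &str, ask: F) -> Decision
    where
        F: FnOnce() -> Decision,
    {
        *self
            .decisions
            .entry((text.to_owned(), label.to_owned()))
            .or_insert_with(ask)
    }
}
```

Keying on both text and label means the same string can receive different answers under different detectors, for example a word flagged once as a name and once as part of a path.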
