p-doom/crowd-code-anonymizer

Install crowd-code 2.0 to help crowd-source the next-generation coding dataset.




p(doom)

crowd-code-anonymizer

Rust CLI for fast anonymization of CSV datasets that contain crowd-sourced software engineering traces. It scans cells for PII, leaked secrets, developer metadata, and high-entropy tokens, then writes cleaned CSVs and an optional JSON report.

Features

  • Parallel redaction with progress reporting; optional dry-run and interactive review modes.
  • PII detector: emails, international phone numbers, IPv4 addresses, SSN/ITIN/PAN/NRIC, and IBAN (--strict adds extra ID formats).
  • Secrets detector: AWS keys, GitHub/GitLab tokens, Stripe/OpenAI/HuggingFace/Slack/Discord/Telegram keys, JWTs, generic key/value secrets, and PEM private keys.
  • Name detection using the bundled names-data/{first,last}_names.json (top-1000 names per country) with accent normalization.
  • Custom/structural patterns: user home paths across OSes, SSH key comments, git author lines (with --strict), plus a high-entropy catch-all that flags otherwise unrecognized tokens.
  • JSON report writer capturing file-level stats and each redaction (row/offset/text/label/source).
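
The high-entropy catch-all in the feature list can be illustrated with a Shannon entropy check. This is a minimal sketch, not the tool's actual implementation: the function names, the minimum length `MIN_LEN`, and the threshold `MIN_ENTROPY_BITS` are all assumptions chosen for illustration.

```rust
use std::collections::HashMap;

/// Shannon entropy of `token`, in bits per character.
fn shannon_entropy(token: &str) -> f64 {
    let mut counts: HashMap<char, usize> = HashMap::new();
    let mut total = 0usize;
    for ch in token.chars() {
        *counts.entry(ch).or_insert(0) += 1;
        total += 1;
    }
    if total == 0 {
        return 0.0;
    }
    counts
        .values()
        .map(|&n| {
            let p = n as f64 / total as f64;
            -p * p.log2()
        })
        .sum()
}

// Hypothetical thresholds, picked for this sketch only.
const MIN_LEN: usize = 20;
const MIN_ENTROPY_BITS: f64 = 4.0;

/// Flags tokens that are both long and close to uniformly random.
fn looks_high_entropy(token: &str) -> bool {
    token.chars().count() >= MIN_LEN && shannon_entropy(token) >= MIN_ENTROPY_BITS
}

fn main() {
    let samples = [
        "aaaaaaaaaaaaaaaaaaaaaaaa",
        "hello_world_function_name",
        "q7Zp2LxV9sKd4TfY8bWm3RhC",
    ];
    for token in samples {
        println!(
            "{token}: {:.2} bits/char, flagged = {}",
            shannon_entropy(token),
            looks_high_entropy(token)
        );
    }
}
```

A length floor keeps short identifiers from being flagged, and the entropy threshold separates random-looking keys from ordinary snake_case names. A real detector would likely need tuning per character set (hex, base64) to limit false positives.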

Install

  • Requires a recent Rust toolchain.
  • Build locally: cargo build --release (binary at target/release/anonymize).
  • Or install into your cargo bin: cargo install --path .

Usage

anonymize <input> [--output <dir>] [--recursive] [--dry-run] [--strict] [--review] \
  [-j <workers>] [--report <path>] [--names-dir <dir>]

Common flows

  • Single CSV → new directory: anonymize data/events.csv --output anonymized/
  • Directory, recurse, strict detection, and JSON report: anonymize ./traces --recursive --strict --report report.json
  • Manual review without writing files: anonymize ./traces --review --dry-run
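
The file and directory flows above imply an input discovery step. The sketch below shows one way this could work using only the standard library. `collect_csv_files` is a hypothetical helper, and both the `.csv` extension filter and its case-insensitive match are assumptions rather than documented behavior.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Collect CSV files for `input`. A file input is returned as-is; a directory
/// is scanned, descending into subdirectories only when `recursive` is true.
fn collect_csv_files(input: &Path, recursive: bool) -> io::Result<Vec<PathBuf>> {
    let mut found = Vec::new();
    if input.is_file() {
        found.push(input.to_path_buf());
        return Ok(found);
    }
    // Explicit stack instead of recursion, so deep trees cannot overflow the call stack.
    let mut pending = vec![input.to_path_buf()];
    while let Some(dir) = pending.pop() {
        for entry in fs::read_dir(&dir)? {
            let path = entry?.path();
            if path.is_dir() {
                if recursive {
                    pending.push(path);
                }
            } else if path
                .extension()
                .map_or(false, |ext| ext.eq_ignore_ascii_case("csv"))
            {
                // Assumption: only files with a .csv extension are processed.
                found.push(path);
            }
        }
    }
    found.sort(); // deterministic order for reports and tests
    Ok(found)
}

fn main() -> io::Result<()> {
    // The first CLI argument is the input path; defaults to the current directory.
    let input = std::env::args().nth(1).unwrap_or_else(|| ".".to_string());
    for path in collect_csv_files(Path::new(&input), true)? {
        println!("{}", path.display());
    }
    Ok(())
}
```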

Notes

  • Input can be a CSV file or a directory; --recursive descends into subdirectories.
  • Without --output, results are written next to inputs using file.anonymized.csv.
  • Review mode accepts arrow keys (yes, no, redo) or y/n; decisions are cached per unique text+label pair.
  • Name detection looks for first_names.json and last_names.json in names-data/ by default; override with --names-dir.
  • The JSON report (via --report) includes every redaction with row/line/column offsets to support audits.
  • Run tests: cargo test
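
The review-mode note above says decisions are cached per unique text+label. A minimal sketch of that idea follows, assuming a map keyed on the (text, label) pair. `Decision`, `DecisionCache`, and the `EMAIL`/`NAME` labels are hypothetical names, and the redo option is not modelled.

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Decision {
    Redact,
    Keep,
}

/// Remembers the reviewer's answer for each unique (text, label) pair.
struct DecisionCache {
    seen: HashMap<(String, String), Decision>,
}

impl DecisionCache {
    fn new() -> Self {
        Self { seen: HashMap::new() }
    }

    /// Return the cached decision, or call `ask` once and remember its answer.
    fn decide<F>(&mut self, text: &str, label: &str, ask: F) -> Decision
    where
        F: FnOnce() -> Decision,
    {
        *self
            .seen
            .entry((text.to_string(), label.to_string()))
            .or_insert_with(ask)
    }
}

fn main() {
    let mut cache = DecisionCache::new();
    let mut prompts = 0;
    // The same text under a different label counts as a separate decision.
    let candidates = [
        ("alice@example.com", "EMAIL"),
        ("alice@example.com", "EMAIL"),
        ("alice@example.com", "NAME"),
    ];
    for (text, label) in candidates {
        // Stand-in for an interactive prompt: always answers "redact".
        let decision = cache.decide(text, label, || {
            prompts += 1;
            Decision::Redact
        });
        println!("{label} {text}: {decision:?}");
    }
    println!("prompts shown: {prompts}");
}
```

Keying on the pair rather than the text alone means a string confirmed as one kind of finding is still reviewed when another detector flags it under a different label.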
