A Python service for redaction and anonymization of personally identifiable information (PII) and sensitive technical data in structured and semi-structured files. Built on top of Microsoft Presidio, it detects and replaces PII with synthetic data, supporting multiple file formats.
- Multi-format support: Plain text, PDF, Microsoft Word documents (.docx), and images (with OCR)
- Flexible PII detection:
  - Built-in recognizers for standard PII (names, emails, phone numbers, IP addresses, credit cards, etc.)
  - Custom keyword matching via deny-list configuration
  - Custom regex patterns for domain-specific sensitive data
  - Technical term recognizers (hostnames, Kubernetes namespaces)
- Consistent anonymization: Same values are replaced with the same synthetic data within a session
- Structured output: Generates both redacted text and detailed reports (YAML or JSON)
- Command-line interface: Easy-to-use CLI tool with configuration options
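The consistent-anonymization behavior can be pictured as a per-session lookup table. The sketch below is only an illustration under stated assumptions, not the project's implementation: the `SessionAnonymizer` class is hypothetical, and it emits numbered placeholders where the real tool would produce synthetic values.

```python
from collections import defaultdict


class SessionAnonymizer:
    """Hypothetical helper: maps each original value to one stable replacement."""

    def __init__(self):
        # (entity_type, original value) -> replacement chosen on first sight
        self._replacements = {}
        # entity_type -> number of distinct values seen so far
        self._counters = defaultdict(int)

    def replace(self, entity_type, value):
        key = (entity_type, value)
        if key not in self._replacements:
            self._counters[entity_type] += 1
            # Assumption: numbered placeholder instead of generated synthetic data
            self._replacements[key] = f"<{entity_type}_{self._counters[entity_type]}>"
        return self._replacements[key]


session = SessionAnonymizer()
first = session.replace("EMAIL_ADDRESS", "alice@example.com")
second = session.replace("EMAIL_ADDRESS", "bob@example.com")
again = session.replace("EMAIL_ADDRESS", "alice@example.com")
print(first, second, again)
```

Keying the table on both entity type and value keeps identical strings detected as different entity types from sharing a replacement.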
Heidi Anonymizer requires tesseract-ocr for image text extraction via OCR.
On Debian or Ubuntu:
sudo apt update
sudo apt install tesseract-ocr
On macOS (Homebrew):
brew install tesseract
On Windows:
Download from UB Mannheim - Tesseract OCR
- Clone the repository:
git clone https://github.com/stackable/heidi-anonymizer.git
cd heidi-anonymizer
- Download the required spaCy language model:
python -m spacy download en_core_web_lg
To anonymize a folder with the default settings:
uv run heidi-anonymizer /path/to/input/folder /path/to/output/folder
This processes all supported files in the input folder using built-in PII recognizers.
Create a configuration file (config.yaml) with custom keywords and patterns:
uv run heidi-anonymizer /path/to/input/folder /path/to/output/folder --config config.yaml
See example-config.yaml for configuration format and examples.
--config FILE          Path to YAML or TOML configuration file
--format [yaml|json]   Output format for the report (default: yaml)
--language TEXT        Language for NLP-based detection (default: en)
Generate JSON report instead of YAML:
uv run heidi-anonymizer input/ output/ --format json --config config.yaml
The heidi-jira-fetch tool downloads issues from Atlassian Cloud Jira and saves them locally for processing. Each issue is saved with its description, all comments, and any attachments.
First, generate an API token from your Atlassian account:
- Visit https://id.atlassian.com/manage-profile/security/api-tokens
- Click Create API token
- Copy the token
Set up environment variables:
export ATLASSIAN_EMAIL="your-email@example.com"
export ATLASSIAN_TOKEN="your-api-token"
uv run heidi-jira-fetch /path/to/output/folder SUP-313 SUP-322 --url https://yourorg.atlassian.net
This fetches the two issues and saves them under SUP-313/ and SUP-322/ inside the output folder.
--url TEXT       Atlassian Cloud base URL (required, e.g. https://myorg.atlassian.net)
-v, --verbose    Enable debug logging to troubleshoot authentication or API issues
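For orientation, here is a rough sketch of how the two environment variables could turn into an authenticated request. Atlassian Cloud accepts HTTP Basic authentication with the account email and an API token, and `/rest/api/3/issue/{key}` is the Jira Cloud REST v3 issue endpoint. Whether heidi-jira-fetch uses exactly this endpoint is an assumption, and `build_issue_request` is a hypothetical helper. The request is built but never sent.

```python
import base64
import os
import urllib.request


def build_issue_request(base_url, issue_key, email, token):
    """Build (but do not send) a Basic-auth request for a single issue."""
    credentials = base64.b64encode(f"{email}:{token}".encode("utf-8")).decode("ascii")
    # Assumption: the Jira Cloud REST v3 issue endpoint
    url = f"{base_url.rstrip('/')}/rest/api/3/issue/{issue_key}"
    return urllib.request.Request(
        url,
        headers={
            "Authorization": f"Basic {credentials}",
            "Accept": "application/json",
        },
    )


request = build_issue_request(
    "https://yourorg.atlassian.net",
    "SUP-313",
    os.environ.get("ATLASSIAN_EMAIL", "your-email@example.com"),
    os.environ.get("ATLASSIAN_TOKEN", "your-api-token"),
)
print(request.full_url)
```

If authentication fails, checking that both variables are exported in the same shell that runs the tool is a reasonable first step.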
For each issue, a directory is created with:
OUTPUT_DIR/
  ISSUE_KEY/
    issue.txt        ← issue summary and description (plain text)
    comments.txt     ← all comments with author and date headers
    attachments/     ← downloaded attachment files
      screenshot.png
      debug.log
      ...
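A small script can confirm that a fetch produced the layout shown above. This is a sketch that uses only the standard library. `list_issue_files` is a hypothetical helper, and the demo builds a throwaway directory rather than reading real fetched data.

```python
import tempfile
from pathlib import Path


def list_issue_files(output_dir):
    """Return {issue_key: sorted relative file paths} for each issue directory."""
    result = {}
    for issue_dir in sorted(p for p in output_dir.iterdir() if p.is_dir()):
        result[issue_dir.name] = sorted(
            f.relative_to(issue_dir).as_posix()
            for f in issue_dir.rglob("*")
            if f.is_file()
        )
    return result


# Demo: recreate the documented layout in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "SUP-313" / "attachments").mkdir(parents=True)
    (root / "SUP-313" / "issue.txt").write_text("summary", encoding="utf-8")
    (root / "SUP-313" / "comments.txt").write_text("comments", encoding="utf-8")
    (root / "SUP-313" / "attachments" / "screenshot.png").write_bytes(b"")
    listing = list_issue_files(root)

print(listing)
```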
Fetch a single issue:
uv run heidi-jira-fetch ./data/input SUP-313 --url https://stackable.atlassian.net
Fetch multiple issues with debug output:
uv run heidi-jira-fetch ./data/input SUP-313 SUP-322 SUP-281 --url https://stackable.atlassian.net -v
Error: "Issue does not exist or you do not have permission to see it"
- Verify the issue key is correct
- Verify your API token is not expired
- Check that your Atlassian account has permission to view the issue
Error: "Authentication failed"
Run with the -v flag to see detailed debug logs:
uv run heidi-jira-fetch ./data/input SUP-313 --url https://stackable.atlassian.net -v
Configuration can be in YAML or TOML format. The file must contain:
keywords:
  - "confidential"
  - "internal"
patterns:
  - name: "customer_id"
    regex: "CUST-[0-9]{6}"
    score: 0.9
  - name: "internal_domain"
    regex: ".*\\.internal$"
    score: 0.7
language: "en"
- keywords (list): Strings to be treated as PII if found (deny-list matching)
- patterns (list): Custom regex patterns with entity type names
  - name: Entity type name (will be uppercased)
  - regex: Regular expression pattern to match
- language (string): Language code for NLP-based recognizers (default: "en")
- nlp_engine (object, optional): Presidio NLP engine configuration
  - nlp_engine_name: NLP backend name (spacy or transformers, default: spacy)
  - models: List of model entries. Schema depends on nlp_engine_name:
    - For spacy: each model entry must contain
      - lang_code: string
      - model_name: string (for example, en_core_web_lg)
    - For transformers: each model entry must contain
      - lang_code: string
      - model_name: mapping with keys:
        - spacy: spaCy pipeline used by Presidio
        - transformers: HuggingFace model identifier
  - If omitted entirely, defaults to spacy with en_core_web_lg
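To make the keywords and patterns fields concrete, the following sketch applies the example configuration to a sample string using only the `re` module. It is an approximation, not how the tool or Presidio performs matching. `find_matches` is a hypothetical helper, and the case-insensitive whole-word keyword matching and the keyword score of 1.0 are assumptions.

```python
import re

# The example configuration, written as a plain dict so no YAML parser is needed.
config = {
    "keywords": ["confidential", "internal"],
    "patterns": [
        {"name": "customer_id", "regex": r"CUST-[0-9]{6}", "score": 0.9},
        {"name": "internal_domain", "regex": r".*\.internal$", "score": 0.7},
    ],
}


def find_matches(text, config):
    """Return (entity_type, matched_text, score) tuples for keywords and patterns."""
    matches = []
    for keyword in config.get("keywords", []):
        # Assumption: case-insensitive, whole-word deny-list matching, score 1.0
        for m in re.finditer(rf"\b{re.escape(keyword)}\b", text, re.IGNORECASE):
            matches.append(("CUSTOM_KEYWORD", m.group(), 1.0))
    for pattern in config.get("patterns", []):
        # Entity type names are uppercased, as described for the name field.
        for m in re.finditer(pattern["regex"], text, re.MULTILINE):
            matches.append((pattern["name"].upper(), m.group(), pattern["score"]))
    return matches


sample = "Ticket for CUST-123456 is confidential.\ndb01.prod.internal"
for match in find_matches(sample, config):
    print(match)
```

Note that the `internal_domain` pattern begins with `.*`, so it matches everything from the start of the line to the `.internal` suffix. A narrower pattern would redact less surrounding text.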
Heidi creates two output files per input in the specified output directory:
- redacted - The redacted text with file section markers.
- report - Structured report of identified entities and replacements.
- PERSON - Names of people
- EMAIL_ADDRESS - Email addresses
- IP_ADDRESS - IPv4 and IPv6 addresses
- (disabled) URL - Web URLs. Disabled because it wrongly redacts Java namespaces.
- IBAN_CODE - International bank account numbers
- K8S_NAMESPACE - Kubernetes namespace names
- CUSTOM_KEYWORD - Custom keywords from configuration
Additional entity types can be defined via regex patterns in the configuration file.
The service consists of several components:
- Extractor - Reads files and extracts text (supports txt, pdf, docx, images with OCR)
- Analyzer - Uses Presidio's AnalyzerEngine to detect PII entities
- Anonymizer - Replaces detected entities with synthetic data using Faker
- Reporter - Generates structured reports of findings
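The way these four components hand data to one another can be sketched with standard-library stand-ins. Nothing below is the project's code. A single email regex takes the place of Presidio's AnalyzerEngine, a fixed placeholder takes the place of Faker output, and the report layout is an assumed shape chosen for illustration.

```python
import json
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def extract(raw):
    """Extractor stand-in: handles plain-text bytes only."""
    return raw.decode("utf-8")


def analyze(text):
    """Analyzer stand-in: one regex instead of Presidio's AnalyzerEngine."""
    return [
        {"entity_type": "EMAIL_ADDRESS", "start": m.start(), "end": m.end()}
        for m in EMAIL_RE.finditer(text)
    ]


def anonymize(text, findings):
    """Anonymizer stand-in: replace spans right to left so offsets stay valid."""
    replaced = []
    for finding in sorted(findings, key=lambda f: f["start"], reverse=True):
        original = text[finding["start"]:finding["end"]]
        substitute = f"<{finding['entity_type']}>"
        text = text[:finding["start"]] + substitute + text[finding["end"]:]
        replaced.append({**finding, "original": original, "replacement": substitute})
    return text, replaced[::-1]


def report(replaced):
    """Reporter stand-in: JSON summary (assumed schema)."""
    return json.dumps({"entities": replaced}, indent=2)


text = extract(b"Contact alice@example.com or bob@example.org.")
redacted, replaced = anonymize(text, analyze(text))
print(redacted)
print(report(replaced))
```

Replacing from the last span backwards avoids recomputing offsets after each substitution, which matters once replacements differ in length from the originals.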
- OCR Accuracy: Image OCR quality depends on image clarity and text size
- Language Support: NLP-based recognizers depend on available spaCy models
- Pattern Coverage: Custom regex patterns must be carefully crafted to avoid false positives
- Performance: Large files (especially images) may take time to process
Error: TesseractNotFoundError or pytesseract.TesseractNotFoundError: tesseract is not installed
Solution: Install tesseract-ocr as described in the System Dependencies section.
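Before reinstalling, it can help to check whether the `tesseract` executable is visible on PATH, since that is where pytesseract looks by default. The check below is a minimal sketch, and `tesseract_available` is a hypothetical helper name.

```python
import shutil


def tesseract_available():
    """Return True if a `tesseract` executable can be found on PATH."""
    return shutil.which("tesseract") is not None


if not tesseract_available():
    print("tesseract not found on PATH; install it before processing images")
```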
Error: OSError: [E050] Can't find model "en_core_web_lg"
Solution: Download the model:
python -m spacy download en_core_web_lg
If processing runs out of memory on large inputs: process files in batches or increase available system memory. Presidio caches NLP models, which may consume significant RAM.
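One way to process files in batches is to split the input list into fixed-size groups and handle each group in turn. The helper below is a hypothetical sketch of that grouping only. How each batch is then passed to the tool (for example, by copying it into its own input folder) is left open here.

```python
from itertools import islice
from pathlib import Path


def batches(paths, size):
    """Yield successive lists of at most `size` paths."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(paths)
    while batch := list(islice(iterator, size)):
        yield batch


# Hypothetical file names, used only to show the grouping.
files = [Path(f"scan_{i}.png") for i in range(5)]
for batch in batches(files, size=2):
    print([p.name for p in batch])
```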
To set up a development environment:
git clone https://github.com/stackable/heidi-anonymizer.git
cd heidi-anonymizer
uv venv
source .venv/bin/activate
uv pip install -e ".[dev]"
[Add your license here]
For issues, questions, or suggestions, please open an issue on the GitHub repository.