Heidi Anonymizer

A Python service for redaction and anonymization of personally identifiable information (PII) and sensitive technical data in structured and semi-structured files. Built on top of Microsoft Presidio, it detects and replaces PII with synthetic data, supporting multiple file formats.

Features

Multi-format support: Plain text, PDF, Microsoft Word documents (.docx), and images (with OCR)
Flexible PII detection:
- Built-in recognizers for standard PII (names, emails, phone numbers, IP addresses, credit cards, etc.)
- Custom keyword matching via deny-list configuration
- Custom regex patterns for domain-specific sensitive data
- Technical term recognizers (hostnames, Kubernetes namespaces)
Consistent anonymization: Same values are replaced with the same synthetic data within a session
Structured output: Generates both redacted text and detailed reports (YAML or JSON)
Command-line interface: Easy-to-use CLI tool with configuration options

Installation

System Dependencies

Heidi Anonymizer requires tesseract-ocr for image text extraction via OCR.

Linux (Debian/Ubuntu)

sudo apt update
sudo apt install tesseract-ocr

macOS

brew install tesseract

Windows

Download from UB Mannheim - Tesseract OCR

Python Package

Clone the repository:

git clone https://github.com/stackable/heidi-anonymizer.git
cd heidi-anonymizer

Download the required spaCy language model:

python -m spacy download en_core_web_lg

Usage

Basic Usage

uv run heidi-anonymizer /path/to/input/folder /path/to/output/folder

This processes all supported files in the input folder using built-in PII recognizers.

With Custom Configuration

Create a configuration file (config.yaml) with custom keywords and patterns, then pass it with --config:

uv run heidi-anonymizer /path/to/input/folder /path/to/output/folder --config config.yaml

See example-config.yaml for configuration format and examples.

Options

--config FILE           Path to YAML or TOML configuration file
--format [yaml|json]    Output format for the report (default: yaml)
--language TEXT         Language for NLP-based detection (default: en)

Examples

Generate JSON report instead of YAML:

uv run heidi-anonymizer input/ output/ --format json --config config.yaml

Jira Fetcher

The heidi-jira-fetch tool downloads issues from Atlassian Cloud Jira and saves them locally for processing. Each issue is saved with its description, all comments, and any attachments.

Setup

First, generate an API token from your Atlassian account:

Visit https://id.atlassian.com/manage-profile/security/api-tokens
Click Create API token
Copy the token

Set up environment variables:

export ATLASSIAN_EMAIL="your-email@example.com"
export ATLASSIAN_TOKEN="your-api-token"
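Atlassian Cloud's REST API accepts HTTP Basic authentication with the account email as the username and the API token as the password. The following is a minimal sketch of how the two variables could be turned into a request header. The helper name build_auth_header is hypothetical, and this is not necessarily how heidi-jira-fetch handles credentials internally.

```python
import base64
import os


def build_auth_header(env=None):
    """Build an HTTP Basic Authorization header from the two variables above.

    Hypothetical helper for illustration; not part of heidi-jira-fetch.
    """
    env = os.environ if env is None else env
    email = env.get("ATLASSIAN_EMAIL")
    token = env.get("ATLASSIAN_TOKEN")
    if not email or not token:
        raise RuntimeError("ATLASSIAN_EMAIL and ATLASSIAN_TOKEN must both be set")
    # Basic auth is base64("<email>:<token>")
    credentials = base64.b64encode(f"{email}:{token}".encode("utf-8")).decode("ascii")
    return {"Authorization": f"Basic {credentials}"}
```

Failing early when either variable is missing gives a clearer message than an "Authentication failed" response from the server.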

Basic Usage

uv run heidi-jira-fetch /path/to/output/folder SUP-313 SUP-322 --url https://yourorg.atlassian.net

This fetches the two issues and saves them under /path/to/output/folder/SUP-313/ and /path/to/output/folder/SUP-322/.

Options

--url TEXT              Atlassian Cloud base URL (required, e.g. https://myorg.atlassian.net)
-v, --verbose           Enable debug logging to troubleshoot authentication or API issues

Output Structure

For each issue, a directory is created with:

OUTPUT_DIR/
  ISSUE_KEY/
    issue.txt        ← issue summary and description (plain text)
    comments.txt     ← all comments with author and date headers
    attachments/     ← downloaded attachment files
      screenshot.png
      debug.log
      ...
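Because the layout is predictable, a fetched folder can be inspected with a few lines of Python before handing it to the anonymizer. This is a sketch that assumes exactly the layout shown above. The function name list_fetched_issues is hypothetical and not part of the tool.

```python
from pathlib import Path


def list_fetched_issues(output_dir):
    """Map each issue key to the files found in its directory.

    Assumes the layout shown above: issue.txt, comments.txt, attachments/.
    """
    issues = {}
    for issue_dir in sorted(Path(output_dir).iterdir()):
        if not issue_dir.is_dir():
            continue
        attachments_dir = issue_dir / "attachments"
        issues[issue_dir.name] = {
            "has_issue_text": (issue_dir / "issue.txt").is_file(),
            "has_comments": (issue_dir / "comments.txt").is_file(),
            # An issue without attachments may have no attachments/ directory.
            "attachments": sorted(p.name for p in attachments_dir.iterdir())
            if attachments_dir.is_dir()
            else [],
        }
    return issues
```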

Examples

Fetch a single issue:

uv run heidi-jira-fetch ./data/input SUP-313 --url https://stackable.atlassian.net

Fetch multiple issues with debug output:

uv run heidi-jira-fetch ./data/input SUP-313 SUP-322 SUP-281 --url https://stackable.atlassian.net -v

Troubleshooting

Error: "Issue does not exist or you do not have permission to see it"

Verify the issue key is correct
Verify your API token is not expired
Check that your Atlassian account has permission to view the issue

Error: "Authentication failed"

Run with the -v flag to see detailed debug logs:

uv run heidi-jira-fetch ./data/input SUP-313 --url https://stackable.atlassian.net -v

Configuration

Configuration File Format

The configuration file can be written in YAML or TOML. The supported keys are described under Configuration Options.

YAML Example

keywords:
  - "confidential"
  - "internal"

patterns:
  - name: "customer_id"
    regex: "CUST-[0-9]{6}"
    score: 0.9
  - name: "internal_domain"
    regex: ".*\\.internal$"
    score: 0.7

language: "en"
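To illustrate what keywords and patterns entries like those above mean in practice, here is a minimal sketch that applies them with Python's re module. It is not the tool's implementation, which builds on Presidio recognizers. The function name find_matches, the case-insensitive keyword matching, and the score of 1.0 for keywords are assumptions of this sketch.

```python
import re

# Mirrors the YAML example above as a plain dict (no YAML parser needed here).
config = {
    "keywords": ["confidential", "internal"],
    "patterns": [
        {"name": "customer_id", "regex": r"CUST-[0-9]{6}", "score": 0.9},
        {"name": "internal_domain", "regex": r".*\.internal$", "score": 0.7},
    ],
}


def find_matches(text, config):
    """Return (entity_type, matched_text, score) tuples for a piece of text."""
    matches = []
    for keyword in config.get("keywords", []):
        # Deny-list matching; case-insensitivity is an assumption of this sketch.
        for m in re.finditer(re.escape(keyword), text, flags=re.IGNORECASE):
            matches.append(("CUSTOM_KEYWORD", m.group(0), 1.0))
    for pattern in config.get("patterns", []):
        entity_type = pattern["name"].upper()  # pattern names are uppercased
        for m in re.finditer(pattern["regex"], text):
            matches.append((entity_type, m.group(0), pattern["score"]))
    return matches
```

Note that the double backslash in the YAML string becomes a single backslash once parsed, which is why the dict above uses a raw string with one backslash.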

Configuration Options

keywords (list): Strings to be treated as PII if found (deny-list matching)
patterns (list): Custom regex patterns with entity type names
- name: Entity type name (will be uppercased)
- regex: Regular expression pattern to match
- score: Confidence score assigned to matches of this pattern (for example, 0.9)
language (string): Language code for NLP-based recognizers (default: "en")
nlp_engine (object, optional): Presidio NLP engine configuration
- nlp_engine_name: NLP backend name (spacy or transformers, default: spacy)
- models: List of model entries. Schema depends on nlp_engine_name:
  - For spacy: each model entry must contain
    - lang_code: string
    - model_name: string (for example, en_core_web_lg)
  - For transformers: each model entry must contain
    - lang_code: string
    - model_name: mapping with keys:
      - spacy: spaCy pipeline used by Presidio
      - transformers: Hugging Face model identifier
- If omitted entirely, defaults to spacy with en_core_web_lg
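The two nlp_engine shapes are easier to compare side by side. The sketch below writes them as Python dicts that mirror the schema described above and adds a small shape check. The transformers model names are placeholders chosen for illustration, not values taken from this project, and is_valid_nlp_engine is a hypothetical helper rather than the tool's own validation.

```python
# Two illustrative nlp_engine values mirroring the schema described above.
SPACY_ENGINE = {
    "nlp_engine_name": "spacy",
    "models": [{"lang_code": "en", "model_name": "en_core_web_lg"}],
}

TRANSFORMERS_ENGINE = {
    "nlp_engine_name": "transformers",
    "models": [
        {
            "lang_code": "en",
            # Placeholder model names, assumed for this sketch only.
            "model_name": {
                "spacy": "en_core_web_sm",
                "transformers": "dslim/bert-base-NER",
            },
        }
    ],
}


def is_valid_nlp_engine(cfg):
    """Check a config dict against the schema described above (sketch only)."""
    engine = cfg.get("nlp_engine_name", "spacy")
    if engine not in ("spacy", "transformers"):
        return False
    for model in cfg.get("models", []):
        if not isinstance(model.get("lang_code"), str):
            return False
        name = model.get("model_name")
        # spacy expects a plain string, transformers a two-key mapping.
        if engine == "spacy" and not isinstance(name, str):
            return False
        if engine == "transformers" and not (
            isinstance(name, dict) and {"spacy", "transformers"} <= set(name)
        ):
            return False
    return True
```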

Output

Structure

Heidi creates two output files per input in the specified output directory:

redacted - The redacted text with file section markers.
report - Structured report of identified entities and replacements.

Supported Entity Types

Built-in PII Entities

PERSON - Names of people
EMAIL_ADDRESS - Email addresses
IP_ADDRESS - IPv4 and IPv6 addresses
(disabled) URL - Web URLs. Disabled because it wrongly redacts Java namespaces.
IBAN_CODE - International bank account numbers

Technical Entities

K8S_NAMESPACE - Kubernetes namespace names
CUSTOM_KEYWORD - Custom keywords from configuration
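Kubernetes namespace names are RFC 1123 DNS labels: at most 63 characters, lowercase alphanumerics or '-', starting and ending with an alphanumeric. Since almost any lowercase word fits that shape, a recognizer needs context to avoid false positives. The sketch below shows one possible approach that looks for kubectl flags and YAML keys. These context cues and the function name find_namespaces are assumptions of this sketch and may differ from what the K8S_NAMESPACE recognizer actually does.

```python
import re

# RFC 1123 DNS label: lowercase alphanumerics or '-', alphanumeric at both ends.
_LABEL = r"[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?"

# Context cues are an assumption of this sketch: kubectl flags and YAML keys.
_NAMESPACE_CONTEXT = re.compile(
    rf"(?:(?<!\S)-n\s+|--namespace[=\s]+|\bnamespace:\s*)({_LABEL})(?![a-z0-9-])"
)


def find_namespaces(text):
    """Return namespace names that appear after a recognizable context cue."""
    return [m.group(1) for m in _NAMESPACE_CONTEXT.finditer(text)]
```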

Custom Entities

Additional entity types can be defined via regex patterns in the configuration file.

Architecture

The service consists of several components:

Extractor - Reads files and extracts text (supports txt, pdf, docx, images with OCR)
Analyzer - Uses Presidio's AnalyzerEngine to detect PII entities
Anonymizer - Replaces detected entities with synthetic data using Faker
Reporter - Generates structured reports of findings
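The flow through the last three components can be sketched with the standard library alone. This is a simplified stand-in, not the project's code: a single email regex replaces Presidio's AnalyzerEngine, numbered placeholders replace Faker output, text extraction is omitted, and all class, function, and report field names are hypothetical. It also shows how a per-session mapping keeps replacements consistent.

```python
import re
from dataclasses import dataclass, field

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


@dataclass
class Session:
    """Remembers replacements so equal values map to equal substitutes."""
    replacements: dict = field(default_factory=dict)

    def substitute(self, entity_type, value):
        key = (entity_type, value)
        if key not in self.replacements:
            # Stand-in for Faker: a numbered placeholder per distinct value.
            index = sum(1 for t, _ in self.replacements if t == entity_type) + 1
            self.replacements[key] = f"<{entity_type}_{index}>"
        return self.replacements[key]


def analyze(text):
    """Analyzer stand-in: return (start, end, entity_type) spans."""
    return [(m.start(), m.end(), "EMAIL_ADDRESS") for m in EMAIL.finditer(text)]


def anonymize(text, spans, session):
    """Anonymizer stand-in: rebuild the text with each span replaced."""
    parts, cursor = [], 0
    for start, end, entity_type in sorted(spans):
        parts.append(text[cursor:start])
        parts.append(session.substitute(entity_type, text[start:end]))
        cursor = end
    parts.append(text[cursor:])
    return "".join(parts)


def report(session):
    """Reporter stand-in: list each original value with its replacement."""
    return [
        {"entity_type": t, "original": v, "replacement": r}
        for (t, v), r in session.replacements.items()
    ]
```

Reusing one Session across all files of a run is what would make the same email address map to the same substitute everywhere.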

Limitations

OCR Accuracy: Image OCR quality depends on image clarity and text size
Language Support: NLP-based recognizers depend on available spaCy models
Pattern Coverage: Custom regex patterns must be carefully crafted to avoid false positives
Performance: Large files (especially images) may take time to process

Troubleshooting

Tesseract Not Found

Error: TesseractNotFoundError or pytesseract.TesseractNotFoundError: tesseract is not installed

Solution: Install tesseract-ocr as described in the System Dependencies section.

spaCy Model Not Found

Error: OSError: [E050] Can't find model "en_core_web_lg"

Solution: Download the model:

python -m spacy download en_core_web_lg

Memory Issues with Large Files

Solution: Process files in batches or increase available system memory. Presidio caches NLP models which may consume significant RAM.

Development

To set up a development environment:

git clone https://github.com/stackable/heidi-anonymizer.git
cd heidi-anonymizer
uv venv
source .venv/bin/activate
uv pip install -e ".[dev]"

License

[Add your license here]

Support

For issues, questions, or suggestions, please open an issue on the GitHub repository.
