Inconnu

What is Inconnu?

Inconnu is a GDPR-compliant data privacy tool for entity redaction and de-anonymization. It provides NLP-based tools for anonymizing and pseudonymizing text data while preserving data utility, helping your business meet stringent privacy regulations.

Why Inconnu?

Seamless Compliance: Inconnu reduces the complexity of GDPR and other privacy laws, helping keep your data handling practices in line with legal standards.
State-of-the-Art NLP: Using spaCy models and custom entity recognition, Inconnu detects personal identifiers reliably and handles them consistently.
Transparency and Trust: Complete processing documentation with timestamping, hashing, and entity mapping for full audit trails.
Reversible Processing: Support for both anonymization and pseudonymization with complete de-anonymization capabilities.
Performance Optimized: Fast processing with singleton pattern optimization and configurable text length limits.

Installation

Prerequisites

Python 3.10 or higher
pip (Python package manager)

Install from PyPI

# Using pip
pip install inconnu

# Using UV (Recommended)
uv add inconnu

Note: Language models are NOT included as optional dependencies. You'll need to download them separately using the inconnu-download command after installation (see below).

Download Language Models

After installing Inconnu, use the inconnu-download command to download spaCy language models:

# Download default (small) models
inconnu-download en              # English
inconnu-download de              # German
inconnu-download en de fr        # Multiple languages
inconnu-download all             # All default models

# Download specific model sizes
inconnu-download en --size large       # Large English model
inconnu-download en --size transformer # Transformer model (English only)

# List available models and check what's installed
inconnu-download --list

# Upgrade existing models
inconnu-download en --upgrade

# Get help for UV environments
inconnu-download --uv-help

How Model Installation Works

No Optional Dependencies: spaCy models are NOT included as pip/uv optional dependencies to avoid unnecessary downloads during dependency resolution
On-Demand Downloads: The inconnu-download command downloads only the models you need
Smart Environment Detection: Automatically detects UV environments and provides appropriate guidance
Verification: Checks if models are already installed before downloading
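
The verification step can be approximated in plain Python, because spaCy models are installed as regular Python packages. The sketch below is an assumption about how such a check might look, not the code `inconnu-download` actually runs; `is_model_installed` is a hypothetical helper name.

```python
import importlib.util


def is_model_installed(package_name: str) -> bool:
    """Return True if a top-level Python package with this name can be found."""
    return importlib.util.find_spec(package_name) is not None


# Report the status of two common spaCy model packages.
for name in ("en_core_web_sm", "de_core_news_sm"):
    status = "installed" if is_model_installed(name) else "missing"
    print(f"{name}: {status}")
```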

Available Model Sizes

Small (sm): Default, fast processing, ~15-50MB, good for high-volume
Medium (md): Better accuracy, ~50-200MB, moderate speed
Large (lg): High accuracy, ~200-600MB, slower processing
Transformer (trf): Highest accuracy, ~400MB+, GPU-optimized (English only)
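
The size labels above map onto spaCy's package naming scheme (`en_core_web_sm`, `de_core_news_lg`, and so on). The helper below sketches that mapping for the five languages this README lists. It is illustrative only: `spacy_model_name` is a hypothetical function, and the `small` and `medium` size keywords are assumed by analogy with the `large` and `transformer` values shown in the CLI examples.

```python
# Size keywords mapped to spaCy package suffixes ("small"/"medium" are assumed).
SIZE_SUFFIX = {"small": "sm", "medium": "md", "large": "lg", "transformer": "trf"}
SUPPORTED_LANGUAGES = {"en", "de", "fr", "it", "es"}


def spacy_model_name(language: str, size: str = "small") -> str:
    """Build a spaCy package name such as 'en_core_web_lg'."""
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language: {language}")
    suffix = SIZE_SUFFIX[size]
    if suffix == "trf" and language != "en":
        raise ValueError("transformer models are English only in this sketch")
    # English pipelines are named 'web'; the other languages here use 'news'.
    genre = "web" if language == "en" else "news"
    return f"{language}_core_{genre}_{suffix}"


print(spacy_model_name("en", "large"))  # en_core_web_lg
print(spacy_model_name("de"))           # de_core_news_sm
```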

Alternative: Direct spaCy Download

You can also use spaCy directly if preferred:

python -m spacy download en_core_web_sm   # English small
python -m spacy download de_core_news_lg  # German large

Install from Source

Clone the repository:

git clone https://github.com/0xjgv/inconnu.git
cd inconnu

Install with UV (recommended for development):

uv sync                      # Install dependencies
inconnu-download en de       # Download language models
make test                    # Run tests

Or install with pip:

pip install -e .     # Install in editable mode
python -m spacy download en_core_web_sm

Development Commands

For development, the Makefile provides convenience targets:

make install        # Install dependencies
make test           # Run tests (auto-downloads required models)
make check          # Run format check, lint, and tests
make lint           # Check code with ruff
make format         # Format code
make clean          # Format, lint, fix, and clean cache

To download additional language models, use the CLI directly:

uv run inconnu-download en de fr   # Download specific models
uv run inconnu-download all        # Download all models
uv run inconnu-download --list     # List available models

Using Different Models in Code

To use a different model size, first download it, then specify it when initializing:

from inconnu import Inconnu
from inconnu.nlp.entity_redactor import SpacyModels

# First, download the model you want
# $ inconnu-download en --size large

# Then use it in your code
inconnu = Inconnu(
    language="en",
    model_name=SpacyModels.EN_CORE_WEB_LG  # Use large model
)

# For highest accuracy (transformer model)
inconnu_trf = Inconnu(
    language="en",
    model_name=SpacyModels.EN_CORE_WEB_TRF
)

Model Selection Guide:

en_core_web_sm: Fast processing, good for high-volume
en_core_web_lg: High accuracy, slower processing
en_core_web_trf: Highest accuracy, GPU-optimized (recommended for sensitive data)

For a complete list of supported models, run inconnu-download --list

Development Setup

Available Commands

make install          # Install all dependencies
make test             # Run full test suite (downloads required models)
make check            # Run format check, lint, and tests
make lint             # Check code with ruff
make format           # Format code with ruff
make fix              # Auto-fix linting issues
make clean            # Format, lint, fix, and clean cache
make update-deps      # Update dependencies

# Download language models via CLI
uv run inconnu-download en de it   # Download specific models
uv run inconnu-download all        # Download all supported models

Running Tests

# Run all tests
make test

# Run with verbose output
uv run pytest -vv

# Run specific test file
uv run pytest tests/test_inconnu.py -vv

# Run specific test class
uv run pytest tests/test_inconnu.py::TestInconnuPseudonymizer -vv

Usage Examples

Basic Text Anonymization

from inconnu import Inconnu

# Simple initialization - no Config class required!
inconnu = Inconnu()  # Uses sensible defaults

# Simple anonymization - just the redacted text
text = "John Doe from New York visited Paris last summer."
redacted = inconnu.redact(text)
print(redacted)
# Output: "[PERSON] from [GPE] visited [GPE] [DATE]."

# Pseudonymization - get both redacted text and entity mapping
redacted_text, entity_map = inconnu.pseudonymize(text)
print(redacted_text)
# Output: "[PERSON_0] from [GPE_0] visited [GPE_1] [DATE_0]."
print(entity_map)
# Output: {'[PERSON_0]': 'John Doe', '[GPE_0]': 'New York', '[GPE_1]': 'Paris', '[DATE_0]': 'last summer'}

# Advanced usage with full metadata (original API)
result = inconnu(text=text)
print(result.redacted_text)
print(f"Processing time: {result.processing_time_ms:.2f}ms")
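
The entity map returned by pseudonymize is what makes the process reversible. The example above does not show Inconnu's own de-anonymization call, so the sketch below uses a hypothetical standalone helper, `restore_text`, to illustrate the idea with plain string replacement.

```python
def restore_text(redacted_text: str, entity_map: dict[str, str]) -> str:
    """Replace each placeholder with the original value it stands for."""
    restored = redacted_text
    for placeholder, original in entity_map.items():
        restored = restored.replace(placeholder, original)
    return restored


# Values copied from the pseudonymization output shown above.
redacted_text = "[PERSON_0] from [GPE_0] visited [GPE_1] [DATE_0]."
entity_map = {
    "[PERSON_0]": "John Doe",
    "[GPE_0]": "New York",
    "[GPE_1]": "Paris",
    "[DATE_0]": "last summer",
}

print(restore_text(redacted_text, entity_map))
# John Doe from New York visited Paris last summer.
```

Because the entity map contains the original personal data, it needs the same protection as the unredacted text.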

Async and Batch Processing

import asyncio

from inconnu import Inconnu

# Async processing for non-blocking operations
async def process_texts():
    inconnu = Inconnu()

    # Single async processing
    text = "John Doe called from +1-555-123-4567"
    redacted = await inconnu.redact_async(text)
    print(redacted)  # "[PERSON] called from [PHONE_NUMBER]"

    # Batch async processing
    texts = [
        "Alice Smith visited Berlin",
        "Bob Jones went to Tokyo",
        "Carol Brown lives in Paris"
    ]
    results = await inconnu.redact_batch_async(texts)
    for result in results:
        print(result)

asyncio.run(process_texts())
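
For intuition about how a batch call can stay non-blocking, the sketch below runs a synchronous function in worker threads with a concurrency cap. It is not how `redact_batch_async` is implemented (the source does not say); `redact_blocking` and `redact_many` are hypothetical names, and the stand-in function only replaces two hard-coded names.

```python
import asyncio


def redact_blocking(text: str) -> str:
    """Hypothetical stand-in for a CPU-bound, synchronous redaction call."""
    return text.replace("Alice Smith", "[PERSON]").replace("Bob Jones", "[PERSON]")


async def redact_many(texts: list[str], max_concurrency: int = 4) -> list[str]:
    """Run the blocking function in worker threads, a few texts at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def redact_one(text: str) -> str:
        async with semaphore:
            return await asyncio.to_thread(redact_blocking, text)

    # gather returns results in the same order as the inputs
    return await asyncio.gather(*(redact_one(t) for t in texts))


results = asyncio.run(
    redact_many(["Alice Smith visited Berlin", "Bob Jones went to Tokyo"])
)
print(results)  # ['[PERSON] visited Berlin', '[PERSON] went to Tokyo']
```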

Customer Service Email Processing

from inconnu import Inconnu

inconnu = Inconnu()

# Process customer service email with personal data
customer_email = """
Dear SolarTech Team,

I am Max Mustermann living at Hauptstraße 50, 80331 Munich, Germany.
My phone number is +49 89 1234567 and my email is max@example.com.
I need to return my solar modules (Order: ST-78901) due to relocation.

Best regards,
Max Mustermann
"""

# Simple redaction
redacted = inconnu.redact(customer_email)
print(redacted)
# Personal identifiers are automatically detected and redacted

Multi-language Support

from inconnu import Inconnu

# German language processing - simplified!
inconnu_de = Inconnu("de")  # Just specify the language

german_text = "Herr Schmidt aus München besuchte Berlin im März."
redacted = inconnu_de.redact(german_text)
print(redacted)
# Output: "[PERSON] aus [GPE] besuchte [GPE] [DATE]."

Custom Entity Recognition

from inconnu import Inconnu, NERComponent
import re

# Add custom entity recognition
custom_components = [
    NERComponent(
        label="CREDIT_CARD",
        pattern=re.compile(r'\b\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}\b'),
        processing_function=None
    )
]

# Simple initialization with custom components
inconnu_custom = Inconnu(
    language="en",
    custom_components=custom_components
)

# Test custom entity detection
text = "My card number is 1234 5678 9012 3456"
redacted = inconnu_custom.redact(text)
print(redacted)  # "My card number is [CREDIT_CARD]"
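
A 16-digit pattern like the one above also matches numbers that are not real card numbers. One common way to cut false positives is a Luhn checksum. The sketch below applies it to the same regex on its own, outside Inconnu; `passes_luhn` and `find_card_numbers` are hypothetical helpers, and how such a check would plug into `processing_function` is not shown in this README.

```python
import re

# Same pattern as in the custom component above.
CARD_PATTERN = re.compile(r"\b\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}\b")


def passes_luhn(candidate: str) -> bool:
    """Validate the digits of a candidate number with the Luhn checksum."""
    digits = [int(ch) for ch in candidate if ch.isdigit()]
    total = 0
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:  # double every second digit, counting from the right
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def find_card_numbers(text: str) -> list[str]:
    """Return regex matches that also pass the Luhn checksum."""
    return [m.group() for m in CARD_PATTERN.finditer(text) if passes_luhn(m.group())]


text = "Pay with 4111-1111-1111-1111 or 1234 5678 9012 3456"
print(CARD_PATTERN.findall(text))  # both numbers match the pattern
print(find_card_numbers(text))     # ['4111-1111-1111-1111']
```

Whether to redact only checksum-valid numbers is a policy choice: for privacy purposes it can be safer to redact every match and accept some over-redaction.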

Context Manager for Resource Management

from inconnu import Inconnu

# Automatic resource cleanup
with Inconnu() as inc:
    redacted = inc.redact("Sensitive data about John Doe")
    print(redacted)
# Resources automatically cleaned up

Error Handling

from inconnu import Inconnu, TextTooLongError, ProcessingError

inconnu = Inconnu(max_text_length=100)  # Set small limit for demo

try:
    long_text = "x" * 200  # Exceeds limit
    result = inconnu.redact(long_text)
except TextTooLongError as e:
    print(f"Text too long: {e}")
    # Error includes helpful suggestions for resolution
except ProcessingError as e:
    print(f"Processing failed: {e}")
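
One way to handle text that exceeds the limit is to split it into smaller pieces and redact each piece separately. The helper below is a minimal sketch under that assumption; `split_text` is a hypothetical function, not part of Inconnu, and it prefers to cut at spaces so that words stay intact.

```python
def split_text(text: str, max_length: int) -> list[str]:
    """Split text into pieces of at most max_length characters."""
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_length, len(text))
        if end < len(text):
            # Prefer to cut after the last space inside the current window.
            cut = text.rfind(" ", start, end)
            if cut > start:
                end = cut + 1
        chunks.append(text[start:end])
        start = end
    return chunks


long_text = "John Doe visited Paris. " * 10
chunks = split_text(long_text, 100)
print([len(chunk) for chunk in chunks])
print("".join(chunks) == long_text)  # True: no characters are lost
```

Splitting on spaces can still separate a multi-word entity such as a full name across two chunks, so sentence or paragraph boundaries are a safer cut point when they are available.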

Use Cases

1. Customer Support Systems

Automatically redact personal information from customer service emails, chat logs, and support tickets while maintaining context for analysis.

2. Legal Document Processing

Anonymize legal documents, contracts, and case files for training, analysis, or public release while ensuring GDPR compliance.

3. Medical Record Anonymization

Process medical records and research data to remove patient identifiers while preserving clinical information for research purposes.

4. Financial Transaction Analysis

Redact personal financial information from transaction logs and banking communications for fraud analysis and compliance reporting.

5. Survey and Feedback Analysis

Anonymize customer feedback, survey responses, and user-generated content for analysis while protecting respondent privacy.

6. Training Data Preparation

Prepare training datasets for machine learning models by removing personal identifiers from text data while maintaining semantic meaning.

Supported Entity Types

Standard Entities: PERSON, GPE (geopolitical entities such as countries and cities), DATE, ORG, MONEY
Custom Entities: EMAIL, IBAN, PHONE_NUMBER
Enhanced Detection: Person titles (Dr, Mr, Ms), international phone numbers
Multilingual: English, German, Italian, Spanish, and French language support

Features

Robust Entity Detection: Advanced NLP with spaCy models and custom regex patterns
Dual Processing Modes: Anonymization ([PERSON]) and pseudonymization ([PERSON_0])
Complete Audit Trail: Timestamping, hashing, and processing metadata
Reversible Processing: Full de-anonymization capabilities with entity mapping
Performance Optimized: Singleton pattern for model loading, configurable limits
GDPR Compliant: Built-in data retention policies and compliance features
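
To make the audit trail idea concrete, the sketch below builds a small record with a UTC timestamp and a SHA-256 hash of the input. The field names and the `build_audit_record` helper are assumptions for illustration; the metadata Inconnu actually stores may differ.

```python
import hashlib
from datetime import datetime, timezone


def build_audit_record(original_text: str, redacted_text: str,
                       entity_map: dict[str, str]) -> dict:
    """Describe one processing run without storing the original text."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "original_sha256": hashlib.sha256(original_text.encode("utf-8")).hexdigest(),
        "redacted_text": redacted_text,
        "entity_count": len(entity_map),
    }


record = build_audit_record(
    "John Doe lives in Paris.",
    "[PERSON_0] lives in [GPE_0].",
    {"[PERSON_0]": "John Doe", "[GPE_0]": "Paris"},
)
print(record["entity_count"])          # 2
print(len(record["original_sha256"]))  # 64 hex characters
```

Storing only the hash lets an auditor confirm which input was processed without keeping the personal data in the log itself.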

Contributing

We welcome contributions to Inconnu! As an open source project, we believe in the power of community collaboration to build better privacy tools.

How to Contribute

1. Bug Reports & Feature Requests

Open an issue on GitHub with detailed descriptions
Include code examples and expected vs actual behavior
Tag issues appropriately (bug, enhancement, documentation)

2. Code Contributions

# Fork the repository and create a feature branch
git checkout -b feature/your-feature-name

# Make your changes and ensure tests pass
make test
make lint

# Submit a pull request with:
# - Clear description of changes
# - Test coverage for new features
# - Updated documentation if needed

3. Development Guidelines

Follow existing code style and patterns
Add tests for new functionality
Update documentation for user-facing changes
Ensure GDPR compliance considerations are addressed

4. Areas for Contribution

Language Support: Add new language models and region-specific entity detection
Custom Entities: Implement detection for industry-specific identifiers
Performance: Optimize processing speed and memory usage
Documentation: Improve examples, tutorials, and API documentation
Testing: Expand test coverage and edge case handling

5. Code Review Process

All contributions require code review
Automated tests must pass
Documentation updates are appreciated
Maintain backward compatibility when possible

Community Guidelines

Be Respectful: Foster an inclusive environment for all contributors
Privacy First: Always consider privacy implications of changes
Security Minded: Report security issues privately before public disclosure
Quality Focused: Prioritize code quality and comprehensive testing

Getting Help

Discussions: Use GitHub Discussions for questions and ideas
Issues: Report bugs and request features through GitHub Issues
Documentation: Check existing docs and contribute improvements

Thank you for helping make Inconnu a better tool for data privacy and GDPR compliance!

Publishing to PyPI

For Maintainers

To publish a new version to PyPI:

1. Configure Trusted Publisher (first time only):
- Go to https://pypi.org/manage/project/inconnu/settings/publishing/
- Add a new trusted publisher:
  - Publisher: GitHub
  - Organization/username: 0xjgv
  - Repository name: inconnu
  - Workflow name: publish.yml
  - Environment name: pypi (optional but recommended)
- For Test PyPI, do the same at https://test.pypi.org with environment name: testpypi
2. Update Version: Update the version in pyproject.toml and inconnu/__init__.py
3. Create a Git Tag:
```
git tag v0.1.0
git push origin v0.1.0
```
4. GitHub Actions: The workflow will automatically:
- Run tests on Python 3.10, 3.11, and 3.12
- Build the package
- Publish to PyPI using Trusted Publisher (no API tokens needed!)
- Generate PEP 740 attestations for security
5. Test PyPI Publishing:
- Use workflow_dispatch to manually trigger Test PyPI publishing
- Go to Actions → Publish to PyPI → Run workflow

Manual Publishing (if needed)

# Build the package
uv build

# Check the package
twine check dist/*

# Upload to Test PyPI (requires API token)
twine upload --repository testpypi dist/*

# Upload to PyPI (requires API token)
twine upload dist/*

GitHub Environments (Recommended)

Configure GitHub environments for additional security:

Go to Settings → Environments
Create pypi and testpypi environments
Add protection rules:
- Required reviewers
- Restrict to specific tags (e.g., v*)
- Add deployment branch restrictions

Additional Resources

spaCy Models Directory - Complete list of available language models
spaCy Model Releases - GitHub repository for model updates
pgeocode - Geographic location processing (potential future integration)
