Privacy and Data Protection

Nick edited this page Nov 21, 2025 · 3 revisions

Privacy & Data Protection in PATAS

Purpose: This document describes how PATAS handles data privacy, what data it processes, and how to configure it for strict privacy requirements.


On-Prem Deployment by Default

PATAS is designed for on-premises deployment by default:

  • ✅ Runs on your infrastructure
  • ✅ No telemetry or external calls unless explicitly configured
  • ✅ All data stays within your infrastructure
  • ✅ LLM provider is configurable (internal endpoint or external, operator's choice)

Privacy Modes

STANDARD Mode (Default)

Configuration:

privacy_mode = "STANDARD"

Behavior:

  • ✅ External LLM providers can be used (if configured by operator)
  • ✅ Message texts can be included in test reports
  • ✅ Full logging available for debugging
  • ✅ Operator controls all external endpoints

Use Case: Development, testing, or when operator explicitly configures external services.


STRICT Mode

Configuration:

privacy_mode = "STRICT"

Behavior:

  • ❌ External LLM providers disabled (an LLM can be used only through an explicitly configured internal endpoint)
  • ❌ Logs avoid storing full message texts (only ids + pattern ids / counts)
  • ❌ No telemetry or external calls unless explicitly configured
  • ✅ Only internal/on-prem LLM endpoints allowed
  • ✅ Minimal data retention

Use Case: Production deployment with strict privacy requirements.
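The two modes above can be read as a small policy table. The following is a minimal sketch of that reading, not PATAS's actual implementation; `PrivacyMode`, `PrivacyPolicy`, `POLICIES`, and `policy_for` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class PrivacyMode(str, Enum):
    STANDARD = "STANDARD"
    STRICT = "STRICT"


@dataclass(frozen=True)
class PrivacyPolicy:
    """Hypothetical flags derived from the behavior lists above."""
    allow_external_llm: bool  # may an external LLM provider be configured?
    log_full_text: bool       # may logs contain full message texts?
    texts_in_reports: bool    # may test reports include message texts?


POLICIES = {
    PrivacyMode.STANDARD: PrivacyPolicy(True, True, True),
    PrivacyMode.STRICT: PrivacyPolicy(False, False, False),
}


def policy_for(mode: str) -> PrivacyPolicy:
    """Look up the policy for a mode string; unknown modes raise ValueError."""
    return POLICIES[PrivacyMode(mode.strip().upper())]
```

Keeping the flags in one table means each component can ask a single question (for example, `policy_for(mode).log_full_text`) instead of comparing mode strings in many places.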


Data Processing

What Data PATAS Processes

Message Data:

  • text: Message content (can be hashed or dropped in STRICT mode)
  • id: Message identifier
  • timestamp: Message timestamp
  • is_spam: Spam label (True/False/None)
  • meta: JSON metadata (channel, language, country, etc.)

Pattern Data:

  • Pattern descriptions
  • Pattern examples (can be truncated/hashed in STRICT mode)
  • SQL rules (safe SELECT queries only)

Evaluation Data:

  • Rule evaluation metrics (precision, recall, coverage)
  • Match counts (spam/ham hits)

Data Retention

Configurable Retention

Settings:

log_retention_days: int = 30  # How long to keep logs
report_retention_days: int = 90  # How long to keep reports

In STRICT Mode:

  • Raw message texts can be dropped after pattern mining
  • Only aggregated statistics and pattern IDs are retained
  • Logs contain only message IDs and pattern matches, not full texts
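The retention settings above (`log_retention_days`, `report_retention_days`) imply some periodic cleanup step. A minimal sketch of such a step follows, assuming each stored record is a dict with a timezone-aware `created_at` field; the functions `is_expired` and `purge` and the record layout are assumptions for illustration, not PATAS's documented API.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_expired(created_at: datetime, retention_days: int,
               now: Optional[datetime] = None) -> bool:
    """Return True if the record is older than the retention window."""
    now = now or datetime.now(timezone.utc)
    return created_at < now - timedelta(days=retention_days)


def purge(records: list, retention_days: int,
          now: Optional[datetime] = None) -> list:
    """Keep only records that are still inside the retention window."""
    return [r for r in records
            if not is_expired(r["created_at"], retention_days, now)]
```

For example, `purge(log_records, retention_days=30)` would drop log records older than 30 days, and the same helper could be called with `report_retention_days` for reports.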

LLM & External Services

LLM Provider Configuration

Operator Controls:

  • LLM provider is configurable by operator
  • No hardcoded external endpoints
  • Operator chooses: internal endpoint, external API, or disabled

Configuration:

llm_provider: str = "openai"  # or "local", "none", "disabled"
llm_api_endpoint: str = ""  # Internal endpoint URL (if using local provider)

In STRICT Mode:

  • External LLM providers disabled by default
  • Only internal/on-prem endpoints allowed
  • Operator must explicitly configure internal endpoint if LLM is needed
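One way to enforce the STRICT-mode rule above is to validate the LLM settings at startup. The sketch below is an assumption about how such a check could look, not PATAS's actual code: `is_internal_endpoint`, `check_llm_config`, and the `INTERNAL_SUFFIXES` list are hypothetical, and what counts as "internal" should be defined by the operator's own network policy.

```python
import ipaddress
from urllib.parse import urlparse

# Hypothetical hostname suffixes treated as internal; adjust to your network.
INTERNAL_SUFFIXES = (".internal", ".local", ".lan")


def is_internal_endpoint(url: str) -> bool:
    """Heuristic: loopback/private IPs and known internal hostnames only."""
    host = urlparse(url).hostname
    if not host:
        return False
    if host == "localhost" or host.endswith(INTERNAL_SUFFIXES):
        return True
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        return False  # a public-looking hostname is not assumed internal
    return ip.is_private or ip.is_loopback


def check_llm_config(privacy_mode: str, llm_provider: str,
                     llm_api_endpoint: str) -> None:
    """Raise ValueError if STRICT mode is combined with a non-internal LLM."""
    if privacy_mode != "STRICT" or llm_provider in ("none", "disabled"):
        return
    if llm_provider != "local" or not is_internal_endpoint(llm_api_endpoint):
        raise ValueError(
            "STRICT mode allows only the 'local' provider with an internal endpoint"
        )
```

Failing fast at startup is a deliberate choice here: a misconfigured endpoint is rejected before any message text can be sent to it.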

Data Minimization

What Can Be Dropped/Hashed

At Ingestion:

  • User identifiers (sender IDs, user names) can be hashed or dropped
  • Message texts can be hashed (for pattern matching without storing full text)
  • Metadata can be anonymized

In Logs:

  • Full message texts can be replaced with message IDs
  • Only pattern matches and counts stored
  • User identifiers can be hashed

Example Log Entry (STRICT mode):

# In STRICT mode, logs only contain:
{
  "message_id": "abc123",
  "pattern_id": 42,
  "match_count": 5,
  # No "text" field, no "sender" field
}
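A log entry of this shape could be produced by a small helper that simply never copies the sensitive fields in STRICT mode. This is a hedged sketch: the function `to_log_entry` and the input message layout (`id`, `text`, `sender`) are assumptions for illustration, not a documented PATAS interface.

```python
def to_log_entry(message: dict, pattern_id: int, match_count: int,
                 strict: bool) -> dict:
    """Build a log record; in STRICT mode text and sender are left out."""
    entry = {
        "message_id": message["id"],
        "pattern_id": pattern_id,
        "match_count": match_count,
    }
    if not strict:
        # STANDARD mode: full details are available for debugging.
        entry["text"] = message.get("text")
        entry["sender"] = message.get("sender")
    return entry
```

Building the entry from an allow-list of fields, rather than deleting fields from a copy of the message, makes it harder for a newly added message field to slip into STRICT-mode logs by accident.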

No Data Leakage

Guarantees

  1. No Hardcoded External Calls: PATAS contains no hardcoded calls that send data to external services
  2. Operator Controls Endpoints: All external endpoints are configured by operator
  3. On-Prem by Default: All processing happens on-premises unless operator configures otherwise
  4. No Telemetry: PATAS does not send telemetry or usage statistics to external services

Compliance & Audit

Audit Trail

PATAS can maintain audit logs (if enabled):

  • Pattern creation/modification
  • Rule promotion/deprecation
  • Safety evaluation results
  • Configuration changes

In STRICT Mode:

  • Audit logs contain only IDs and metadata, not full message texts
  • User identifiers can be hashed in audit logs

Configuration Examples

STRICT Privacy Configuration

# .env or config file
PRIVACY_MODE=STRICT
LLM_PROVIDER=none  # or "local" with internal endpoint
LOG_RETENTION_DAYS=7  # Shorter retention
REPORT_RETENTION_DAYS=30
ENABLE_LLM=false  # Disable LLM entirely if not needed

STANDARD Privacy Configuration

# .env or config file
PRIVACY_MODE=STANDARD
LLM_PROVIDER=openai  # or "local" with internal endpoint
LOG_RETENTION_DAYS=30
REPORT_RETENTION_DAYS=90
ENABLE_LLM=true
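The variables in the two examples above could be read into a typed settings object along the following lines. This is a sketch under stated assumptions: `PrivacySettings` and `load_settings` are hypothetical names, the defaults mirror the values shown on this page (with `ENABLE_LLM` defaulting to true as a guess), and values are assumed to arrive without the trailing inline comments used in the examples.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class PrivacySettings:
    privacy_mode: str
    llm_provider: str
    log_retention_days: int
    report_retention_days: int
    enable_llm: bool


def _as_bool(value: str) -> bool:
    """Interpret common truthy spellings; everything else is False."""
    return value.strip().lower() in ("1", "true", "yes", "on")


def load_settings(env: Optional[Mapping[str, str]] = None) -> PrivacySettings:
    """Read settings from a mapping (defaults to the process environment)."""
    env = os.environ if env is None else env
    return PrivacySettings(
        privacy_mode=env.get("PRIVACY_MODE", "STANDARD").strip().upper(),
        llm_provider=env.get("LLM_PROVIDER", "openai").strip().lower(),
        log_retention_days=int(env.get("LOG_RETENTION_DAYS", "30")),
        report_retention_days=int(env.get("REPORT_RETENTION_DAYS", "90")),
        enable_llm=_as_bool(env.get("ENABLE_LLM", "true")),
    )
```

Passing a plain dict instead of the real environment, as in `load_settings({"PRIVACY_MODE": "STRICT"})`, keeps the loader easy to test.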

Data Flow

Ingestion

External Source (TAS logs, CSV)
  ↓
PATAS Ingestion
  ↓
[STRICT: Hash/drop user identifiers, truncate texts]
  ↓
Normalized Message Storage
  ↓
[STRICT: Only IDs + metadata stored]
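The STRICT step in the ingestion flow above (hash user identifiers, truncate texts) might look like the sketch below. It assumes a keyed hash (HMAC-SHA256) so that identifiers cannot be recovered by hashing guessed values without the key; the function names, the `sender` field, the `max_chars` limit, and the key handling are all assumptions for illustration, and the page does not specify which hashing scheme PATAS uses.

```python
import hashlib
import hmac


def hash_identifier(value: str, key: bytes) -> str:
    """Keyed SHA-256 hash of an identifier, returned as 64 hex characters."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def normalize_message(raw: dict, strict: bool, key: bytes,
                      max_chars: int = 80) -> dict:
    """Map a raw record to the normalized shape, minimizing data if strict."""
    msg = {
        "id": raw["id"],
        "timestamp": raw.get("timestamp"),
        "is_spam": raw.get("is_spam"),
        "meta": dict(raw.get("meta") or {}),
    }
    sender = raw.get("sender")
    text = raw.get("text", "")
    if strict:
        # STRICT: replace the identifier with its hash and truncate the text.
        msg["sender"] = hash_identifier(sender, key) if sender is not None else None
        msg["text"] = text[:max_chars]
    else:
        msg["sender"] = sender
        msg["text"] = text
    return msg
```

Because the hash is deterministic for a given key, the same sender still maps to the same value across messages, so per-sender aggregation remains possible without storing the original identifier.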

Pattern Mining

Message Storage
  ↓
Pattern Mining Pipeline
  ↓
[STRICT: Aggregated signals only, no individual texts]
  ↓
LLM (if enabled, internal endpoint only in STRICT)
  ↓
Pattern + Rule Creation
  ↓
[STRICT: Examples truncated/hashed]

Evaluation

Rules
  ↓
Offline Evaluation
  ↓
[STRICT: Only metrics stored, no message texts]
  ↓
Safety Evaluation
  ↓
[STRICT: Only IDs + metrics in reports]

Summary

Privacy Guarantees:

  • ✅ On-prem deployment by default
  • ✅ No telemetry or external calls unless explicitly configured
  • ✅ LLM provider configurable by operator (internal or external)
  • ✅ STRICT mode: minimal data storage, no external LLM by default
  • ✅ Data retention configurable
  • ✅ User identifiers can be hashed/dropped
  • ✅ Message texts can be hashed/truncated in STRICT mode

For Integration:

  • PATAS runs on your infrastructure
  • You control all external endpoints
  • STRICT mode available for production
  • No data leaves your infrastructure unless you explicitly configure it

Last Updated: 2025-11-18