Safety Guide

Nick edited this page Nov 24, 2025 · 3 revisions

PATAS Safety & Enforcement Guide

Purpose: This document explains how PATAS should be integrated into spam detection infrastructure, with clear safety boundaries and enforcement recommendations.


Enforcement & Safety Model

PATAS is designed as a high-precision spam pattern signal engine, not as a standalone enforcement system.

Core Architecture: PATAS Core is a deterministic, rule-based engine. ML/LLM integration (if used) is optional and external, running inside your infrastructure for pattern discovery.

Core Principle

PATAS provides signals and patterns. You control enforcement decisions.

We strongly recommend:

✅ Conservative Profile → Low-Risk Automatic Actions

Use the Conservative profile only for low-impact automatic actions:

  • Spam labeling (marking messages as spam for filtering)
  • Hiding / deprioritizing messages (reducing visibility without blocking)
  • Rate-limiting (temporary throttling of suspicious senders)
  • Soft actions that can be easily reversed

Metrics:

  • Ham hit rate: ≤ 1.5%
  • Spam recall: 20-40%
  • Precision: ≥ 98%

Safety: The Conservative profile is considered safe for low-impact automatic actions because it includes only SAFE_AUTO tier patterns, which have very low false positive rates.
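
To make the three metrics above concrete, here is a minimal sketch of how they could be computed from a labeled evaluation set. The function name and the counts are hypothetical and this is not PATAS's own evaluation code. It assumes "ham hit rate" means the fraction of legitimate (ham) messages matched by the profile's patterns.

```python
def profile_metrics(spam_hit: int, spam_total: int, ham_hit: int, ham_total: int) -> dict:
    """Compute ham hit rate, spam recall, and precision from raw counts."""
    flagged = spam_hit + ham_hit  # every message the profile matched
    return {
        "ham_hit_rate": ham_hit / ham_total,    # share of ham that was matched
        "spam_recall": spam_hit / spam_total,   # share of spam that was matched
        "precision": spam_hit / flagged if flagged else 0.0,
    }


# Hypothetical evaluation set: 1000 spam and 1000 ham messages.
metrics = profile_metrics(spam_hit=300, spam_total=1000, ham_hit=5, ham_total=1000)
```

With these made-up counts the profile would sit inside the Conservative targets (0.5% ham hit rate, 30% recall, precision above 98%). Note that precision depends on the spam/ham mix of the evaluation set, while ham hit rate and recall do not.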


⚠️ Balanced / Aggressive Profiles → Signals Only

Use Balanced and Aggressive profiles as signals only:

  • Input features into your existing risk-scoring systems
  • Prioritization for manual review queues
  • Research / experiments for pattern discovery
  • NOT for auto-ban or irreversible actions

Metrics:

  • Balanced: 9.75% ham hit rate, 66.87% recall
  • Aggressive: 19.12% ham hit rate, 72.84% recall

Why signals only: These profiles have higher recall but also higher false positive rates. They should be combined with other signals before making enforcement decisions.
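
As a rough illustration of "signals only", the sketch below feeds a Balanced-profile match into a risk score alongside internal signals and uses the result only to route messages to manual review. Every name, weight, and threshold here is a hypothetical placeholder for your own risk-scoring system, not part of PATAS.

```python
from dataclasses import dataclass


@dataclass
class MessageSignals:
    patas_balanced_hit: bool  # did any Balanced-profile pattern match?
    user_reports: int         # reports filed against the sender
    sender_trust: float       # 0.0 (untrusted) to 1.0 (trusted), from your own systems


def risk_score(s: MessageSignals) -> float:
    """Blend the PATAS signal with internal signals. Weights are illustrative."""
    score = 0.4 if s.patas_balanced_hit else 0.0
    score += min(s.user_reports, 5) * 0.1   # cap the contribution of reports
    score += (1.0 - s.sender_trust) * 0.3
    return min(score, 1.0)


def route(s: MessageSignals, review_threshold: float = 0.6) -> str:
    """Return a queue name; the only outcome is prioritization, never a ban."""
    return "manual_review" if risk_score(s) >= review_threshold else "no_action"
```

In this sketch a PATAS match on its own (0.4) never reaches the review threshold for a fully trusted sender with no reports; it has to be corroborated by at least one other signal.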


❌ Permanent / High-Impact Enforcement

Keep permanent or high-impact enforcement (account bans, global blocks, long-term penalties) under your existing systems that combine:

  • ✅ User reports
  • ✅ Account history and trust scores
  • ✅ Device / network heuristics
  • ✅ Behavioral patterns
  • PATAS signals as just one of the inputs

We do NOT recommend wiring PATAS patterns directly into irreversible enforcement without combining them with your internal signals and policies.
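
One way to enforce this recommendation on the integration side is a small guard that rejects irreversible actions justified by PATAS alone. This is a sketch under assumptions: the action names and the signal-source labels ("patas", "user_reports", and so on) are hypothetical and would map onto whatever your enforcement pipeline already uses.

```python
from enum import Enum


class Action(Enum):
    LABEL_SPAM = "label_spam"
    HIDE = "hide"
    RATE_LIMIT = "rate_limit"
    ACCOUNT_BAN = "account_ban"
    GLOBAL_BLOCK = "global_block"


# Actions that cannot be easily reversed.
IRREVERSIBLE = {Action.ACCOUNT_BAN, Action.GLOBAL_BLOCK}


def is_allowed(action: Action, sources: set[str]) -> bool:
    """Allow an irreversible action only if a non-PATAS signal supports it.

    This checks irreversibility only; profile-level rules (for example,
    Conservative-only auto actions) must be enforced separately.
    """
    if action not in IRREVERSIBLE:
        return True
    internal_sources = sources - {"patas"}
    return len(internal_sources) >= 1
```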


Safety Profiles Overview

| Profile | Patterns | Ham Hit Rate | Spam Recall | Use Case | Automatic Enforcement? |
| --- | --- | --- | --- | --- | --- |
| Conservative | SAFE_AUTO only | ≤ 1.5% | 20-40% | Low-impact auto actions | ✅ Yes (low-risk actions only) |
| Balanced | SAFE_AUTO + top REVIEW_ONLY | ≤ 12% | ≥ 60% | Signals, prioritization | ❌ No |
| Aggressive | All except FEATURE_ONLY | ≤ 20% | ≥ 70% | Research, experiments | ❌ No |
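
The limits in this overview can be read as per-profile thresholds. The sketch below encodes them and checks a profile's measured values against them. It is only an illustration of the idea, not the logic of patas safety-eval; the names are hypothetical, and treating the lower end of the Conservative recall range (20%) as a minimum is an assumption.

```python
# profile -> (max ham hit rate, min spam recall), expressed as fractions
THRESHOLDS = {
    "conservative": (0.015, 0.20),  # assumption: 20% is taken as the recall floor
    "balanced": (0.12, 0.60),
    "aggressive": (0.20, 0.70),
}


def violations(profile: str, ham_hit_rate: float, spam_recall: float) -> list[str]:
    """Return human-readable threshold violations; an empty list means pass."""
    max_ham, min_recall = THRESHOLDS[profile]
    problems = []
    if ham_hit_rate > max_ham:
        problems.append(f"{profile}: ham hit rate {ham_hit_rate:.2%} exceeds {max_ham:.2%}")
    if spam_recall < min_recall:
        problems.append(f"{profile}: spam recall {spam_recall:.2%} is below {min_recall:.2%}")
    return problems


# Figures quoted earlier in this guide for Balanced and Aggressive.
balanced_issues = violations("balanced", 0.0975, 0.6687)
aggressive_issues = violations("aggressive", 0.1912, 0.7284)
```

Both quoted profiles fall inside their limits here, so both lists are empty.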

Rule Safety Categories

PATAS classifies rules into three safety categories:

AUTO_SAFE (Ready for Automation)

  • Precision: ≥ 95% (from shadow evaluation)
  • False Positive Rate: ≤ 1%
  • Minimum hits: ≥ 10 (sufficient sample size)
  • Usage: Can be auto-promoted to active rules for Conservative profile
  • Criteria: Must pass deterministic checks AND (high precision evaluation OR whitelist patterns with AND conditions)
  • Description: High confidence, low risk (<10%). Specific rules with very few false positives. Safe for automatic application.

REQUIRES_REVIEW (Requires Human Verification)

  • Precision: 90-98% OR ham hit rate 1-5%
  • Usage: Requires manual review before promotion
  • Use in: Balanced profile (top patterns only)
  • Description: Good pattern, but may catch edge cases. Recommended for quarantine or manual review.

DANGEROUS (Insight Only - High Risk)

  • Pattern characteristics: Too broad (e.g., "Hello", "work", "time", stop words, very short patterns <3 chars)
  • SQL status: Automatically commented out to prevent accidental execution
  • Usage: Insight only - useful for understanding attack trends, NOT for enforcement
  • Description: Pattern is too broad and would cause many false positives. SQL is commented out. Useful for research and understanding spam evolution.
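
The numeric criteria above can be sketched as a small decision function. This is a simplified illustration, not the code in app/v2_rule_safety_classifier.py: it omits the deterministic checks and whitelist conditions, uses only the three example stop words from this guide, treats false positive rate and ham hit rate as the same quantity, and assumes that a rule meeting no category's criteria falls back to DANGEROUS.

```python
# Example stop words taken from this guide; a real list would be much larger.
STOP_WORDS = {"hello", "work", "time"}


def is_too_broad(pattern: str) -> bool:
    """Flag stop words and very short patterns (fewer than 3 characters)."""
    p = pattern.strip().lower()
    return len(p) < 3 or p in STOP_WORDS


def classify(pattern: str, precision: float, fpr: float, hits: int) -> str:
    """Map a rule's shadow-evaluation results to a safety category."""
    if is_too_broad(pattern):
        return "DANGEROUS"
    if precision >= 0.95 and fpr <= 0.01 and hits >= 10:
        return "AUTO_SAFE"
    if precision >= 0.90 or 0.01 <= fpr <= 0.05:
        return "REQUIRES_REVIEW"
    return "DANGEROUS"  # assumption: most restrictive category as the fallback
```

Note that a rule with excellent precision but fewer than 10 hits lands in REQUIRES_REVIEW in this sketch, because the sample is too small to justify automation.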

Running Safety Evaluation

Before deploying PATAS patterns to production:

patas safety-eval

This command:

  • Evaluates all profiles against safety thresholds
  • Generates SAFETY_EVAL_REPORT.json
  • Exits with code 0 if all thresholds pass, non-zero if any threshold is violated

DO NOT DEPLOY if patas safety-eval fails.
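
A deployment script can enforce this rule by relying only on the documented exit code. The sketch below is one possible wrapper; the function name is hypothetical, and treating a missing patas executable as a failure is a design choice of this example, not documented PATAS behavior.

```python
import subprocess
import sys
from typing import Sequence


def safety_eval_passes(cmd: Sequence[str] = ("patas", "safety-eval")) -> bool:
    """Return True only if the command runs and exits with code 0."""
    try:
        result = subprocess.run(list(cmd))
    except OSError:
        # Fail closed: if the command cannot be started, do not deploy.
        return False
    return result.returncode == 0


# Typical use in a deploy script:
#     if not safety_eval_passes():
#         sys.exit("patas safety-eval failed; aborting deployment")
```

Failing closed means a broken or missing installation blocks the deployment instead of silently skipping the check.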


Integration Recommendations

1. Start with Conservative Profile

  • Deploy only Conservative profile patterns initially
  • Monitor ham hit rate in production logs
  • Track user complaints about false positives

2. Gradually Add Balanced Profile

  • Use Balanced profile patterns as signals only
  • Combine with existing risk-scoring systems
  • Do NOT use Balanced patterns for auto-ban

3. Use Aggressive Profile for Research

  • Aggressive profile is for offline analysis only
  • Use for pattern discovery and experiments
  • Never use for production enforcement

Safety Guardrails in Code

PATAS includes safety guardrails enforced in code:

  • SafetyMode enum: CONSERVATIVE, BALANCED, OFF
  • Conservative mode: Only AUTO_SAFE patterns can trigger auto-ban
  • Balanced mode: Patterns produce scores/flags only, never auto-ban
  • SQL safety: Only SELECT queries, whitelisted columns, no match-everything rules
  • Rule Safety Classifier: Three-category system (AUTO_SAFE / REQUIRES_REVIEW / DANGEROUS)
    • AUTO_SAFE: Ready for Automation - Precision ≥ 95%, FPR ≤ 1%, whitelist patterns with AND conditions. Can be automatically applied.
    • REQUIRES_REVIEW: Requires Human Verification - Good pattern, but may catch edge cases. Recommended for quarantine or manual review.
    • DANGEROUS: Insight Only (High Risk) - Pattern too broad (e.g., "Hello", "work", "time"). SQL is commented out to prevent accidental execution. Useful for understanding attack trends.
  • Auto-fix SQL errors: Simple SQL errors are fixed automatically during shadow evaluation
  • Stop words filtering: Common words are blocked to prevent false positives

See app/v2_safety_mode.py, app/v2_sql_safety.py, and app/v2_rule_safety_classifier.py for implementation details.
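
For orientation, the mode-based guardrail described above could look roughly like the following. This is a hedged sketch, not the contents of app/v2_safety_mode.py: the enum values and the function name are invented for illustration, and because this guide does not describe what OFF does, the sketch conservatively treats it as "no automatic enforcement".

```python
from enum import Enum


class SafetyMode(Enum):
    CONSERVATIVE = "conservative"
    BALANCED = "balanced"
    OFF = "off"


def may_auto_enforce(mode: SafetyMode, category: str) -> bool:
    """Decide whether a matched rule may trigger an automatic action."""
    if mode is SafetyMode.CONSERVATIVE:
        # Only AUTO_SAFE rules can trigger automatic actions.
        return category == "AUTO_SAFE"
    # BALANCED produces scores/flags only.
    # OFF: behavior is not specified in this guide (assumption: never enforce).
    return False
```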


Monitoring & Rollback

Continuous Monitoring

After deployment:

  1. Monitor ham hit rate in production logs
  2. Track user complaints about false positives
  3. Re-run patas safety-eval if:
    • Pattern mining discovers new patterns
    • Pattern quality thresholds change
    • Significant changes to pattern SQL generation
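
The monitoring steps above can be supported by a periodic audit of production matches. The sketch below assumes you have a sample of matches that humans have labeled as spam or not spam (for example from complaints or spot checks) and flags patterns whose observed precision drops below the 95% AUTO_SAFE floor. The function name and the input format are hypothetical; the 10-hit minimum mirrors the sample-size criterion used for AUTO_SAFE rules.

```python
from collections import Counter
from typing import Iterable


def patterns_to_disable(
    matches: Iterable[tuple[str, bool]],
    min_precision: float = 0.95,
    min_hits: int = 10,
) -> list[str]:
    """Return pattern IDs whose audited precision is below min_precision.

    matches: (pattern_id, is_spam) for each audited production match.
    Patterns with fewer than min_hits audited matches are skipped because
    the sample is too small to judge.
    """
    hits: Counter = Counter()
    spam_hits: Counter = Counter()
    for pattern_id, is_spam in matches:
        hits[pattern_id] += 1
        if is_spam:
            spam_hits[pattern_id] += 1
    return sorted(
        p for p, n in hits.items()
        if n >= min_hits and spam_hits[p] / n < min_precision
    )


# Hypothetical audit sample.
audit = (
    [("p1", True)] * 20                          # 100% precision
    + [("p2", True)] * 8 + [("p2", False)] * 2   # 80% precision
    + [("p3", False)]                            # only one audited hit
)
flagged = patterns_to_disable(audit)
```

Here only p2 is flagged: p1 is clean and p3 has too few audited hits to decide either way.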

Rollback Procedure

If safety thresholds are violated:

  1. Immediately disable affected patterns
  2. Review SAFETY_EVAL_REPORT.json for violations
  3. Fix issues before re-enabling patterns
  4. Re-run patas safety-eval to confirm fixes

Summary

TL;DR for Engineers:

  • ✅ Use Conservative profile for low-impact auto actions (labeling, hiding, throttling)
  • ⚠️ Use Balanced profile as signals only (NOT for auto-ban)
  • ❌ Use Aggressive profile for research only (NOT for production)
  • 🔒 Keep permanent enforcement (bans, blocks) in your existing systems
  • ✅ Always run patas safety-eval before deployment
  • 📊 Monitor ham hit rates and user complaints continuously

PATAS provides the "brain" for pattern detection. You control the "trigger" for enforcement.


Last Updated: 2025-11-21
