Nick edited this page Nov 21, 2025 · 2 revisions

FAQ

General

What is PATAS and how does it work?

PATAS uses a two-stage approach:

  1. Stage 1: Fast deterministic patterns (URLs, keywords) - no LLM/embeddings
  2. Stage 2: Deep semantic analysis only for suspicious patterns (2-3% of messages)

Why is it efficient? The two-stage design lets most traffic exit cheaply at Stage 1, reducing LLM/embedding costs by 70-90%.
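The two-stage flow can be sketched in a few lines. This is a minimal illustration, not PATAS's actual rule set: the URL regex and keyword list are invented stand-ins for the deterministic Stage 1 patterns.

```python
import re

# Stage 1: cheap deterministic checks (illustrative patterns, not PATAS's real rules).
URL_RE = re.compile(r"https?://\S+")
SPAM_KEYWORDS = {"free crypto", "guaranteed income", "click here"}

def stage1_suspicious(text: str) -> bool:
    """Fast pattern screen: decide whether a message needs deep (Stage 2) analysis."""
    lowered = text.lower()
    return bool(URL_RE.search(text)) or any(k in lowered for k in SPAM_KEYWORDS)

messages = [
    "Lunch at noon?",
    "FREE CRYPTO, click here: http://spam.example",
    "Meeting moved to 3pm",
]
# Only the flagged minority (2-3% in practice, per the FAQ) would reach the
# expensive Stage 2 LLM/embedding analysis.
to_stage2 = [m for m in messages if stage1_suspicious(m)]
```

Because Stage 1 is pure string matching, it runs at line rate; the cost savings come from how few messages survive the screen.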

How does PATAS identify spam?

PATAS is a signal engine that provides rules and metrics rather than blocking messages directly:

  1. Analyzes historical data for recurring patterns
  2. Groups similar messages via semantic clustering (DBSCAN)
  3. Generates SQL rules with quality metrics
  4. Tests rules in shadow mode before activation
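Step 2 (semantic clustering with DBSCAN) can be sketched as follows. This is a toy example with synthetic vectors standing in for real message embeddings; the `eps` and `min_samples` values are illustrative, not PATAS defaults.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy embeddings: two tight groups plus one outlier (stand-ins for message vectors).
rng = np.random.default_rng(0)
cluster_a = rng.normal(0.0, 0.01, size=(10, 4))
cluster_b = rng.normal(1.0, 0.01, size=(10, 4))
outlier = np.array([[5.0, 5.0, 5.0, 5.0]])
embeddings = np.vstack([cluster_a, cluster_b, outlier])

# DBSCAN groups dense regions of similar messages; label -1 marks noise
# (messages that belong to no recurring pattern).
labels = DBSCAN(eps=0.1, min_samples=3).fit_predict(embeddings)
```

Each resulting cluster is a candidate recurring spam pattern from which a SQL rule can be generated; noise points are simply ignored.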

What's the accuracy? How many false positives?

Profiles:

  • Conservative (default): precision >= 0.95, max 5 false positives
  • Balanced: precision >= 0.90
  • Aggressive: precision >= 0.85

Real metrics (500K messages):

  • Precision: 0.93-0.97 (conservative)
  • False positive rate: 0.15%
  • Coverage: 5-8% of spam messages

How long does processing take?

| Volume | Stage 1 | Stage 2 | Total   |
|--------|---------|---------|---------|
| 100K   | ~30 sec | ~20 min | ~21 min |
| 500K   | ~2 min  | ~3.5 hr | ~3.5 hr |
| 1M     | ~4 min  | ~7 hr   | ~7 hr   |
| 10M    | ~42 min | ~70 hr  | ~3 days |

For 10M+: Use incremental mining, parallel evaluation, rule filtering.

What are the costs?

| Volume | Per Run | Monthly (4 runs) |
|--------|---------|------------------|
| 500K   | ~$91    | ~$364            |
| 1M     | ~$182   | ~$728            |

Reduce costs:

  • Use local models (Mistral-7B, Llama-3.1-8B)
  • Disable LLM (use_llm: false)
  • Use incremental mining

Where is data stored? Are messages sent to external services?

On-premise deployment:

  • Fully local deployment supported
  • Local LLM models (vLLM, TGI, Ollama)
  • Local embeddings (BGE-M3, E5)
  • Can disable LLM entirely

Privacy:

  • All data stays in your infrastructure
  • privacy_mode: STRICT for additional safeguards
  • GDPR compliant

Technical

How to integrate PATAS?

PATAS works as a signal engine:

  1. Export rules: SQL format, messenger backend, ROL format
  2. Use as signal: Combine with existing moderation rules
  3. API: REST API for ingestion, mining, rule management

What data format is expected?

Formats: JSONL (recommended), CSV, API

Required fields:

  • message_id or id
  • text
  • timestamp
  • is_spam (true/false)

Optional: external_id (for idempotency), user_id, chat_id, language
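A minimal JSONL record showing the required and optional fields (the values are invented for illustration):

```json
{"external_id": "msg-0001", "message_id": "0001", "text": "Free crypto!!! Click now", "timestamp": "2025-11-01T12:00:00Z", "is_spam": true, "user_id": "u42", "chat_id": "c7", "language": "en"}
```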

How does idempotency work?

Deduplication is based on external_id:

  • If a message with the same external_id was already ingested, the new message is skipped
  • If external_id is not provided, a content hash (message_hash) is used instead

Recommendation: Always use external_id.
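The dedup logic described above can be sketched like this. The function names and the SHA-256 choice are assumptions for illustration; only the external_id-first, hash-fallback behavior comes from the FAQ.

```python
import hashlib

def dedup_key(message: dict) -> str:
    """Prefer the caller-supplied external_id; fall back to a content hash."""
    if message.get("external_id"):
        return f"ext:{message['external_id']}"
    digest = hashlib.sha256(message["text"].encode("utf-8")).hexdigest()
    return f"hash:{digest}"

seen: set[str] = set()

def ingest(message: dict) -> bool:
    """Return True if the message is accepted, False if skipped as a duplicate."""
    key = dedup_key(message)
    if key in seen:
        return False
    seen.add(key)
    return True
```

Note the weakness of the fallback: two genuinely different messages with identical text collide on the hash, which is why always supplying external_id is recommended.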

How often should I run pattern mining?

Recommended:

  • Daily: Incremental mining (--since-checkpoint)
  • Weekly: Full mining for new patterns
  • On-demand: When new spam types appear

Can I use 10 instances to process 10M messages in 7 hours?

Current limitation: Distributed locks prevent concurrent processing of same dataset.

Solution: Shard data:

  • Split into 10 shards (by message_id or timestamp)
  • Each instance processes its shard with unique lock key
  • Merge results after processing

Result: ~7 hours (instead of ~3 days on a single instance)

Roadmap: Automatic sharding in P1. See Horizontal Scaling.
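Until automatic sharding lands, a stable shard assignment is easy to do client-side. This is a sketch under the assumption that you shard by hashing message_id; the MD5 choice and lock-key format are illustrative, not part of PATAS.

```python
import hashlib

NUM_SHARDS = 10

def shard_for(message_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Stable shard assignment: the same message_id always maps to the same shard."""
    digest = hashlib.md5(message_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

def lock_key_for(shard: int) -> str:
    """Illustrative per-shard lock key so instances don't contend on one lock."""
    return f"patas-mining-shard-{shard}"
```

Each of the 10 instances filters its input to one shard and acquires only its own lock key, so the distributed-lock limitation never triggers.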

How to minimize false positives?

Multi-layer protection:

  1. SQL Safety: Only SELECT queries, whitelist tables/columns, SQL injection checks
  2. LLM Validation (optional): Logic check before saving
  3. Shadow Evaluation: Test on historical data before activation
  4. Safety Profiles: Conservative (precision >= 0.95, max 5 FP)
  5. Quality Tiers: SAFE_AUTO (precision >= 0.98), REVIEW_ONLY (>= 0.90)

Real metrics: Precision 0.93-0.97, FPR 0.15%

Recommendation: Use conservative profile. Main protection is shadow evaluation.

What if LLM generates incorrect SQL rule?

SQL safety validation:

  • Only SELECT queries allowed
  • Whitelist tables/columns
  • SQL injection checks
  • Syntax validation via sqlparse

Fallback: Invalid rules are not saved, errors are logged.

How does the system adapt to new spam types?

Automatic adaptation:

  • No retraining needed (rule-based system)
  • New patterns discovered automatically from historical data
  • Rules auto-deprecated on degradation (>10% precision drop)

Update rules:

  • Run pattern mining regularly (daily/weekly)
  • Use incremental mining for new messages only
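The auto-deprecation trigger can be sketched as a simple threshold check. One assumption to flag: the FAQ says ">10% precision drop" without specifying absolute vs. relative; this sketch interprets it as a relative drop from the rule's baseline.

```python
def should_deprecate(baseline_precision: float, current_precision: float,
                     max_drop: float = 0.10) -> bool:
    """Retire a rule when its precision degrades by more than max_drop
    relative to its baseline (interpretation assumed, see lead-in)."""
    if baseline_precision <= 0:
        return True
    drop = (baseline_precision - current_precision) / baseline_precision
    return drop > max_drop
```

Running this check on every mining pass is what lets the system shed stale rules without any model retraining.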

Performance

Why is evaluation so slow? How to speed up?

Problem: Evaluation can take 30+ hours for 500K messages.

Optimization:

  1. Parallel processing: shadow_evaluation_parallel_workers: 8-16
  2. Rule filtering: max_shadow_rules_to_evaluate: 1000-5000 (top-N by quality)
  3. Sampling: shadow_evaluation_sample_size: 10000 for large datasets

Results:

  • With 4 workers: ~7.5 hours for 1K rules (vs 30 hours)
  • With 8 workers: ~4 hours for 1K rules
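The three knobs above can be combined in one config. The key names come from this FAQ; the surrounding file layout and the specific values are assumptions, so check them against your deployment's config schema:

```yaml
shadow_evaluation_parallel_workers: 8
max_shadow_rules_to_evaluate: 2000
shadow_evaluation_sample_size: 10000
```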

How to handle 10M+ messages?

Recommendations:

  1. Incremental mining: Process only new messages
  2. Parallel evaluation: 8-16 workers
  3. Rule filtering: Top-N rules only
  4. Local models: On-premise LLM/embeddings
  5. Horizontal scaling: Shard data across instances (see Horizontal Scaling)

Security

How are API keys stored?

  • Stored in environment variables or config files
  • Not in code or repository
  • Recommend secrets management (Vault, AWS Secrets Manager)

Is there rate limiting? WAF?

  • Current: No built-in rate limiting
  • Recommendation: Use Nginx or an API Gateway in front of PATAS
  • Roadmap: Built-in rate limiting (P1)

Is there audit logging?

Logging:

  • All operations logged (mining, promotion, evaluation)
  • Includes: timestamp, user/instance, operation, result
  • Configurable log level (INFO, DEBUG, ERROR)

Roadmap: Structured audit logging in DB (P2)

Customization

Can I customize precision/recall thresholds?

Custom profiles:

```yaml
custom_profiles:
  my_custom:
    min_precision: 0.92
    max_coverage: 0.10
    min_sample_size: 50
    max_ham_hits: 3
```

How to add custom rules or exceptions?

Adding rules:

  • Add manually via API
  • Use whitelist for pattern exceptions (roadmap P2)

Roadmap: UI for rule management (P1)
