Architecture
Version: 2.0
This document provides a high-level overview of the PATAS Core v2 architecture: its data models, services, and data flow.
PATAS Core is a generic engine for:
- Analyzing large corpora of messages (spam / not_spam)
- Discovering spam patterns automatically
- Generating machine-readable blocking rules
- Evaluating rules on real traffic
- Promoting good rules and deactivating bad ones
Key Principle: PATAS Core is generic and reusable. It works with abstract domain models and can be wrapped by different integration layers without modification.
Message: normalized message storage from logs or CSV imports.
Fields:
- `id` - Internal ID
- `external_id` - External message ID (for idempotence)
- `timestamp` - Message timestamp
- `text` - Message text content
- `meta` - JSON metadata (channel, language, country, etc.)
- `is_spam` - Optional spam label (True/False/None)
- `tas_action` - Action taken ('blocked' / 'allowed')
- `user_complaint` - User-reported spam
- `unbanned` - Whether message/user was unbanned
Pattern: discovered spam patterns.
Fields:
- `id` - Pattern ID
- `type` - Pattern type (URL, PHONE, TEXT, META, SIGNATURE, KEYWORD)
- `description` - Human-readable description
- `examples` - Representative message texts (JSON array)
Rule: SQL blocking rules with lifecycle management.
Fields:
- `id` - Rule ID
- `pattern_id` - Associated pattern (optional)
- `sql_expression` - Safe SELECT query
- `status` - Lifecycle state (candidate → shadow → active → deprecated)
- `origin` - Origin ('llm', 'pattern_mining', 'manual')
- `created_at`, `updated_at` - Timestamps
RuleEvaluation: evaluation metrics for rules.
Fields:
- `id` - Evaluation ID
- `rule_id` - Associated rule
- `time_period_start`, `time_period_end` - Evaluation window
- `hits_total` - Total messages matched
- `spam_hits` - Spam messages matched
- `ham_hits` - Non-spam messages matched
- `precision` - spam_hits / hits_total
- `recall` - spam_hits / total spam messages in the window (requires the total spam count)
- `coverage` - hits_total / total_messages
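The metric definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the PATAS implementation: `EvalCounts` and `compute_metrics` are hypothetical names, and returning `None` when a denominator is zero or unknown is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class EvalCounts:
    """Raw counts for one rule over one evaluation window (hypothetical helper)."""
    hits_total: int
    spam_hits: int
    ham_hits: int
    total_messages: int
    total_spam: Optional[int] = None  # only needed for recall


def compute_metrics(c: EvalCounts) -> Dict[str, Optional[float]]:
    """Derive precision, recall and coverage from raw counts.

    A metric is None when its denominator is zero or unknown (assumption).
    """
    precision = c.spam_hits / c.hits_total if c.hits_total else None
    recall = c.spam_hits / c.total_spam if c.total_spam else None
    coverage = c.hits_total / c.total_messages if c.total_messages else None
    return {"precision": precision, "recall": recall, "coverage": coverage}
```

Keeping `total_spam` optional mirrors the note that recall can only be computed when the total spam count for the window is known.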
Important: PATAS uses LLMs for offline pattern discovery only, not for real-time message classification.
- ✅ Pattern Discovery: Analyze aggregated spam signals to identify semantic patterns
- ✅ SQL Rule Generation: Propose SQL rules that catch spam variations
- ✅ SQL Quality Validation (optional): Assess false positive risks for generated rules
- ✅ Offline Only: All LLM processing happens during pattern mining, not during message evaluation
- ❌ NOT used for real-time message classification
- ❌ NOT making ban/unban decisions
- ❌ NOT processing individual messages online
- Provider: Configurable (OpenAI, local/on-prem endpoint, or disabled)
- Privacy: On-prem deployment by default, no hardcoded external calls
- Data: LLM sees only aggregated signals (top URLs, keywords, sample messages), not individual user messages
- Validation: LLM quality validation (if enabled) uses the same client/API key as pattern mining
See LLM Usage for detailed documentation.
Purpose: Load messages from external sources into PATAS storage.
Components:
- `TASLogIngester` - Ingest from TAS API or storage
- `CSVIngester` - Ingest from CSV files
- Idempotency handling via `external_id`
Flow:
- Fetch messages from source
- Normalize to `Message` model
- Store in database (with deduplication)
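One way to get idempotent ingestion is a unique constraint on `external_id`, so that re-importing the same batch inserts nothing. The sketch below uses SQLite for illustration only; the table layout, the raw field names (`id`, `ts`) and the `ingest` function are assumptions, not the actual PATAS schema or ingester API.

```python
import json
import sqlite3
from typing import Iterable, Mapping

# Hypothetical storage layout; the real PATAS schema may differ.
SCHEMA = """
CREATE TABLE IF NOT EXISTS messages (
    id INTEGER PRIMARY KEY,
    external_id TEXT UNIQUE,
    timestamp TEXT,
    text TEXT,
    meta TEXT,
    is_spam INTEGER
)
"""


def ingest(conn: sqlite3.Connection, raw_messages: Iterable[Mapping]) -> int:
    """Normalize raw records and store them; return how many rows were new."""
    conn.execute(SCHEMA)
    inserted = 0
    for raw in raw_messages:
        cur = conn.execute(
            "INSERT OR IGNORE INTO messages "
            "(external_id, timestamp, text, meta, is_spam) VALUES (?, ?, ?, ?, ?)",
            (
                str(raw["id"]),                    # external ID drives deduplication
                raw.get("ts"),
                raw.get("text", "").strip(),       # minimal text normalization
                json.dumps(raw.get("meta", {})),   # metadata stored as JSON
                raw.get("is_spam"),                # True/False/None
            ),
        )
        inserted += cur.rowcount  # 0 when the external_id already exists
    conn.commit()
    return inserted
```

Running `ingest` twice on the same batch leaves the table unchanged the second time, which is the property the `external_id` field is meant to provide.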
Purpose: Discover spam patterns from message corpus.
Components:
- `PatternMiningPipeline` - Main orchestration
- `PatternMiningEngine` - Abstract interface (implemented by LLM engine)
- Chunked processing for large datasets
- Pre-aggregation before LLM calls
Flow:
- Load messages from storage
- Aggregate by type (URLs, keywords, etc.)
- Cluster similar messages (semantic or exact)
- Generate pattern descriptions via LLM
- Create `Pattern` records
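The pre-aggregation step might look roughly like the following: count the most frequent URLs and keywords across a chunk of messages so that only aggregated signals are passed on. The regular expressions, the four-letter keyword cutoff and the `aggregate_signals` name are illustrative assumptions, not the pipeline's actual logic.

```python
import re
from collections import Counter
from typing import Dict, Iterable, List, Tuple

URL_RE = re.compile(r"https?://\S+", re.IGNORECASE)
WORD_RE = re.compile(r"[a-z]{4,}")  # crude keyword filter: 4+ ASCII letters


def aggregate_signals(texts: Iterable[str], top_n: int = 5) -> Dict[str, List[Tuple[str, int]]]:
    """Return the most common URLs and keywords across a chunk of messages."""
    urls: Counter = Counter()
    keywords: Counter = Counter()
    for text in texts:
        # Normalize URLs so trivially different spellings count together.
        urls.update(u.lower().rstrip(".,)") for u in URL_RE.findall(text))
        # Count each keyword once per message, ignoring URL text.
        without_urls = URL_RE.sub(" ", text).lower()
        keywords.update(set(WORD_RE.findall(without_urls)))
    return {
        "top_urls": urls.most_common(top_n),
        "top_keywords": keywords.most_common(top_n),
    }
```

The resulting dictionary is the kind of compact summary that could be handed to a `PatternMiningEngine` instead of raw messages.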
Purpose: Manage rule state transitions.
State Machine:
- `candidate` → Newly discovered, not yet evaluated
- `shadow` → Evaluated on historical data, not active
- `active` → Deployed and monitored
- `deprecated` → Deactivated due to poor performance
Components:
- `RuleLifecycleService` - State transitions
- Validation before state changes
- Audit logging
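A minimal sketch of such a state machine, assuming only the linear transitions named above are legal (candidate → shadow → active → deprecated). `RuleStatus`, `ALLOWED`, `audit_log` and `transition` are hypothetical names; the real `RuleLifecycleService` may allow other transitions and log differently.

```python
from enum import Enum
from typing import Dict, List, Set, Tuple


class RuleStatus(str, Enum):
    CANDIDATE = "candidate"
    SHADOW = "shadow"
    ACTIVE = "active"
    DEPRECATED = "deprecated"


# Assumption: only the transitions named in the state machine are allowed.
ALLOWED: Dict[RuleStatus, Set[RuleStatus]] = {
    RuleStatus.CANDIDATE: {RuleStatus.SHADOW},
    RuleStatus.SHADOW: {RuleStatus.ACTIVE},
    RuleStatus.ACTIVE: {RuleStatus.DEPRECATED},
    RuleStatus.DEPRECATED: set(),
}

# In-memory stand-in for audit logging: (rule_id, from_status, to_status).
audit_log: List[Tuple[int, str, str]] = []


def transition(rule_id: int, current: RuleStatus, target: RuleStatus) -> RuleStatus:
    """Validate a state change, record it, and return the new status."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    audit_log.append((rule_id, current.value, target.value))
    return target
```

Keeping the allowed transitions in one table makes the validation step explicit and easy to audit.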
Purpose: Evaluate rules on historical data without deploying them.
Components:
- `ShadowEvaluationService` - Run SQL queries on message storage
- Compute metrics (precision, recall, coverage, ham_rate)
- Store results in `RuleEvaluation`
Flow:
- Load `shadow` rules
- Execute `sql_expression` on message storage
- Compute metrics
- Store `RuleEvaluation` records
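The "execute and count" part of this flow can be sketched against SQLite. The sketch assumes that a rule's `sql_expression` is an already validated SELECT returning the `id` of each matched message, and that messages live in a `messages` table with an `is_spam` column; both are assumptions for illustration, as is the `evaluate_rule` name.

```python
import sqlite3
from typing import Dict


def evaluate_rule(conn: sqlite3.Connection, sql_expression: str) -> Dict[str, int]:
    """Count how many stored messages a rule matches, split by label.

    Assumes sql_expression is a validated SELECT that returns message ids.
    """
    rule_sql = sql_expression.strip().rstrip(";")
    rows = conn.execute(
        "SELECT is_spam, COUNT(*) FROM messages "
        f"WHERE id IN ({rule_sql}) GROUP BY is_spam"
    ).fetchall()
    by_label = dict(rows)  # keys: 1 (spam), 0 (ham), None (unlabeled)
    total = conn.execute("SELECT COUNT(*) FROM messages").fetchone()[0]
    return {
        "hits_total": sum(by_label.values()),
        "spam_hits": by_label.get(1, 0),
        "ham_hits": by_label.get(0, 0),
        "total_messages": total,
    }
```

Because the rule only reads from storage, it can be evaluated on historical data without affecting live traffic.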
Purpose: Promote good rules to active, deprecate bad ones.
Components:
- `PromotionService` - Review evaluation metrics
- Apply safety profile thresholds
- Export rules to external systems (via `RuleBackend`)
Flow:
- Load `shadow` rules with recent evaluations
- Check metrics against safety profile thresholds
- Promote to `active` if thresholds met
- Export to external system (your platform, etc.)
- Deprecate `active` rules if metrics degrade
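The promote/keep/deprecate decision could be expressed as a small pure function. Everything here is a sketch under assumptions: `Thresholds` and `decide` are hypothetical names, `ham_rate` is taken to mean ham_hits / hits_total, and the `min_hits` guard against deciding on too little evidence is an addition not described in this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """Threshold values supplied by a safety profile (hypothetical shape)."""
    min_precision: float
    max_ham_rate: float
    min_hits: int = 1  # assumption: require some evidence before deciding


def decide(status: str, hits_total: int, spam_hits: int, ham_hits: int,
           t: Thresholds) -> str:
    """Return the next status for a rule given its latest evaluation counts."""
    if hits_total < t.min_hits:
        return status  # not enough data: leave the rule where it is
    precision = spam_hits / hits_total
    ham_rate = ham_hits / hits_total  # assumed definition of ham_rate
    meets = precision >= t.min_precision and ham_rate <= t.max_ham_rate
    if status == "shadow":
        return "active" if meets else "shadow"
    if status == "active":
        return "active" if meets else "deprecated"
    return status  # candidate and deprecated rules are not touched here
```

For example, with `Thresholds(min_precision=0.98, max_ham_rate=0.01)`, a shadow rule with 995 spam hits out of 1000 would be promoted, while an active rule with 90 spam hits out of 100 would be deprecated.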
Purpose: Export rules to external systems.
Interfaces:
- `RuleBackend` - Abstract interface
- `SqlRuleBackend` - Export as SQL
- `RolRuleBackend` - Export as ROL (Rule Object Language)
Usage: Implement `RuleBackend` for your system (e.g., your platform's rule engine).
Messages (Storage)
↓
Pattern Mining Pipeline
↓
Aggregation (URLs, keywords, etc.)
↓
Clustering (semantic or exact)
↓
LLM Pattern Description
↓
Pattern Records
↓
SQL Rule Generation
↓
Rule Records (status: candidate)

Rule (status: shadow)
↓
Shadow Evaluation Service
↓
Execute SQL on Messages
↓
Compute Metrics
↓
RuleEvaluation Records
↓
Promotion Service
↓
Check Safety Thresholds
↓
Promote to active OR Keep in shadow

Active Rules
↓
Export via RuleBackend
↓
External System (your platform, etc.)
↓
Monitor Performance
↓
Re-evaluate Periodically
↓
Deprecate if Metrics Degrade
Implement the RuleBackend interface to export rules to your system:

    class MyRuleBackend(RuleBackend):
        def export_rule(self, rule: Rule) -> str:
            # Convert rule to your format
            return formatted_rule

Implement the PatternMiningEngine interface for custom LLM providers:

    class MyLLMEngine(PatternMiningEngine):
        async def discover_patterns(self, signals: Dict) -> List[Pattern]:
            # Your LLM integration
            return patterns

Extend MessageRepository or create an adapter for custom message formats:

    class MyMessageAdapter:
        def to_patas_message(self, raw_message: Dict) -> Message:
            # Convert to PATAS Message model
            return message

All SQL rules are validated:
- Only SELECT queries allowed
- No DDL/DELETE/UPDATE/INSERT
- Syntax validation
- "Match everything" detection
See SQL Rule Generation for details.
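The checks listed above could be approximated as follows. This is a simplified sketch, not the PATAS validator: the keyword list, the "match everything" heuristic (no WHERE clause or a trivially true one), the scratch schema built from the Message fields, and the use of SQLite's `EXPLAIN` for syntax checking are all assumptions.

```python
import re
import sqlite3
from typing import List

# Scratch table mirroring the Message fields, used only for syntax checking.
SCRATCH_SCHEMA = (
    "CREATE TABLE messages (id INTEGER PRIMARY KEY, external_id TEXT, "
    "timestamp TEXT, text TEXT, meta TEXT, is_spam INTEGER, tas_action TEXT, "
    "user_complaint INTEGER, unbanned INTEGER)"
)
STRING_LITERAL = re.compile(r"'(?:[^']|'')*'")
FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|create|truncate|attach|pragma)\b",
    re.IGNORECASE,
)
MATCH_ALL = re.compile(r"\s*(1\s*=\s*1|true|1)\s*", re.IGNORECASE)


def validate_rule_sql(sql: str) -> List[str]:
    """Return a list of problems; an empty list means the rule passed."""
    problems: List[str] = []
    stmt = sql.strip().rstrip(";").strip()
    # Blank out string literals so words inside LIKE patterns are not flagged.
    bare = STRING_LITERAL.sub("''", stmt)
    if ";" in bare:
        problems.append("multiple statements")
    if not re.match(r"select\b", bare, re.IGNORECASE):
        problems.append("not a SELECT query")
    if FORBIDDEN.search(bare):
        problems.append("forbidden keyword")
    where = re.search(r"\bwhere\b(.*)$", bare, re.IGNORECASE | re.DOTALL)
    if where is None or MATCH_ALL.fullmatch(where.group(1)):
        problems.append("matches everything")
    if not problems:
        # EXPLAIN compiles the statement without running it.
        conn = sqlite3.connect(":memory:")
        try:
            conn.execute(SCRATCH_SCHEMA)
            conn.execute("EXPLAIN " + stmt)
        except sqlite3.Error as exc:
            problems.append(f"syntax error: {exc}")
        finally:
            conn.close()
    return problems
```

A regex-based check like this is deliberately conservative; a production validator would more likely parse the statement properly.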
Three profiles with different risk tolerances:
- Conservative: High precision (≥98%), low false positive rate (≤1%)
- Balanced: Moderate precision (≥95%), higher recall
- Aggressive: Maximum recall, higher false positive rate
See Safety Profiles for details.
The API layer (app/api/) is a thin orchestration layer over Core services:
- No business logic in API
- Delegates to Core services
- Pydantic models for request/response
- FastAPI for HTTP handling
See API Reference for endpoint documentation.
The CLI (app/cli.py) provides command-line access to Core services:
- `patas ingest-logs` - Ingest messages
- `patas mine-patterns` - Discover patterns
- `patas eval-rules` - Evaluate shadow rules
- `patas promote-rules` - Promote/deprecate rules
- `patas safety-eval` - Run safety evaluation
- Code Overview - Detailed code structure
- Configuration - Configuration options
- Engineering Notes for integration - Platform-specific guidance
- Safety Profiles - Safety profile details