Product PRD
Version: 2.0.0
Date: 2025-01-27
Status: Pilot-ready MVP (production-grade core, missing streaming/UI)
PATAS is an autonomous pattern discovery and rule management system for anti-spam operations. It analyzes historical message logs, automatically discovers spam patterns, generates safe blocking rules, and evaluates their effectiveness before deployment.
Key Value Proposition:
- Signal engine, not enforcement - Provides patterns and metrics that inform anti-spam decisions
- On-premise deployment - Designed for deployment within your infrastructure
- Two-stage processing - Fast scanning + deep analysis for 70-90% cost reduction
- Deterministic and rule-based - Core engine is deterministic; ML/LLM is optional
- Safety-first design - Multiple safety profiles with clear risk boundaries
- Batch/offline analysis - Designed for daily/weekly batch processing of historical logs, not real-time filtering
Traditional anti-spam systems are constrained by:
- Manual rule creation and maintenance
- Constant monitoring and adjustment
- High false positive rates
- Inability to adapt to new spam patterns quickly
- Expensive ML/LLM costs for processing all messages
PATAS automates the entire spam pattern discovery and rule generation pipeline:
- Ingest historical message logs
- Discover recurring spam patterns automatically (two-stage: fast scan + deep analysis)
- Generate safe SQL rules from discovered patterns
- Evaluate rules on historical data (shadow mode)
- Promote rules that meet safety thresholds
- Monitor rule performance and deprecate underperforming rules
- Primary: Anti-spam teams at messaging platforms (Telegram, WhatsApp, etc.)
- Secondary: Content moderation teams, security operations
- Tertiary: Research teams studying spam patterns
Observed on internal benchmark dataset (500K messages):
- Precision: 0.93-0.97 (conservative profile)
- False positive rate: <0.15%
- Coverage: 5-8% of all spam messages
- Cost reduction: 70-90% vs. processing all messages with LLM
- Processing time: 3.5 hours for 500K messages (vs. days with naive approach)
- AUTO_SAFE classification: 50-55% of rules (target: >50%)
- Pattern Mining time: 60-65% of total time (optimized from 70.1%)
Note: On new datasets, we recommend running PATAS in shadow mode first to re-calibrate these metrics.
Design Philosophy: PATAS is an offline/batch analysis system for historical message logs. Typical use case: daily/weekly logs → run PATAS overnight → receive new rules in the morning. It is not designed for inline real-time filtering.
Layered Architecture:
- API Layer (`app/api/`) - FastAPI HTTP endpoints
- Service Layer (`app/v2_*.py`) - Business logic
- Repository Layer (`app/repositories.py`) - Data access
- Infrastructure - Database, caching, observability
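The layering above separates HTTP handling, business logic, and data access. A minimal sketch of how those layers might compose (class and method names here are illustrative, not taken from the actual codebase):

```python
from dataclasses import dataclass


@dataclass
class RuleRepository:
    """Repository layer: data access only, no business logic."""
    db: object  # any object exposing a query(sql) method

    def list_active(self) -> list[dict]:
        return self.db.query("SELECT * FROM rules WHERE status = 'ACTIVE'")


class RuleService:
    """Service layer: business logic, depends on the repository abstraction."""

    def __init__(self, repo: RuleRepository):
        self.repo = repo

    def active_rules(self) -> list[dict]:
        return self.repo.list_active()


# The API layer (app/api/) would wire a FastAPI route to RuleService,
# keeping endpoint handlers thin.
```

Because the service depends on an abstraction rather than a concrete database, each layer can be tested in isolation with a stub.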
Two-Stage Approach:
Stage 1: Fast Scanning
- Large chunks (10K-50K messages)
- Deterministic patterns only (URLs, keywords, signatures)
- No LLM/embeddings
- Fast aggregation (~2-4 min for 500K messages)
Stage 2: Deep Analysis
- Small chunks (1K-5K messages)
- Suspicious patterns only (top 2-3% by default)
- Semantic mining + LLM analysis
- High quality rules (~3.5 hours for 500K messages)
Benefits:
- 70-90% cost reduction (only 2-3% of messages use expensive LLM/embeddings)
- Maintains high quality (deep analysis for important patterns)
- Scales to millions of messages
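The Stage 1 → Stage 2 handoff can be sketched as follows. This is an illustrative sketch, not the actual implementation: the function name and the frequency-based ranking are assumptions, but the key idea from the source holds, namely that only the top 2-3% of patterns reach the expensive semantic/LLM stage.

```python
from collections import Counter


def select_for_deep_analysis(pattern_counts: Counter, top_fraction: float = 0.03) -> list[str]:
    """Keep only the most frequent Stage 1 patterns (top 2-3% by default)
    as candidates for Stage 2 semantic mining and LLM analysis."""
    ranked = [pattern for pattern, _ in pattern_counts.most_common()]
    k = max(1, int(len(ranked) * top_fraction))
    return ranked[:k]


# Stage 1: cheap deterministic counting over a large chunk of messages
counts = Counter({"bit.ly/spam123": 4200, "free crypto": 1900, "hello": 12})
suspicious = select_for_deep_analysis(counts)  # only these patterns reach Stage 2
```

Everything outside the selected fraction is handled with deterministic patterns alone, which is where the 70-90% cost reduction comes from.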
States:
- CANDIDATE - Newly generated rule, not yet evaluated
- SHADOW - Evaluated on historical data, not yet active
- ACTIVE - Deployed and monitoring performance
- DEPRECATED - Underperforming (>10% precision drop) or manually disabled
Promotion Criteria:
- Conservative: precision >= 0.95, max 5 false positives
- Balanced: precision >= 0.90
- Aggressive: precision >= 0.85
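The promotion criteria above reduce to a simple threshold check per profile. A hedged sketch (the function name is illustrative; the thresholds are the ones listed above):

```python
def meets_promotion_criteria(profile: str, precision: float, false_positives: int) -> bool:
    """Check whether a SHADOW rule's evaluation metrics clear the
    promotion bar for the given safety profile."""
    if profile == "conservative":
        return precision >= 0.95 and false_positives <= 5
    if profile == "balanced":
        return precision >= 0.90
    if profile == "aggressive":
        return precision >= 0.85
    raise ValueError(f"unknown profile: {profile}")


meets_promotion_criteria("conservative", 0.96, 3)   # passes
meets_promotion_criteria("conservative", 0.96, 12)  # fails: too many false positives
```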
Multi-layer Protection:
- SQL Safety Validation:
  - Only SELECT queries (no INSERT/UPDATE/DELETE/DROP)
  - Whitelist tables/columns
  - SQL injection detection
  - Syntax validation via `sqlparse`
  - "Match-everything" detection (coverage >80% = red flag)
- LLM Validation (optional):
  - Logic verification before saving
  - False positive detection
- Shadow Evaluation:
  - Testing on historical data before activation
  - Precision, recall, F1-score metrics
  - Automatic deprecation on degradation
- Quality Tiers:
  - SAFE_AUTO: precision >= 0.98, low FPR - auto-activate
  - REVIEW_ONLY: precision >= 0.90 - manual review required
  - FEATURE_ONLY: precision < 0.90 - use as ML feature, not standalone rule
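A deliberately simplified sketch of the SQL safety checks described above. The real pipeline parses statements with `sqlparse`; this regex-based version only illustrates the decision logic (SELECT-only, keyword blacklist, table whitelist, and the "match-everything" coverage guard), and the table whitelist here is an assumption:

```python
import re

FORBIDDEN = re.compile(r"\b(INSERT|UPDATE|DELETE|DROP|ALTER|TRUNCATE|GRANT)\b", re.I)
ALLOWED_TABLES = {"messages"}  # illustrative whitelist


def is_rule_sql_safe(sql: str, coverage: float) -> bool:
    """Reject anything that is not a whitelisted SELECT, and flag rules
    that match more than 80% of messages as 'match-everything'."""
    stripped = sql.strip().rstrip(";")
    if not stripped.upper().startswith("SELECT"):
        return False
    if FORBIDDEN.search(stripped):
        return False
    tables = set(re.findall(r"\bFROM\s+(\w+)", stripped, re.I))
    if not tables <= ALLOWED_TABLES:
        return False
    return coverage <= 0.80


is_rule_sql_safe("SELECT id FROM messages WHERE text LIKE '%bit.ly%'", coverage=0.02)  # safe
is_rule_sql_safe("DELETE FROM messages", coverage=0.02)                                # rejected
```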
Core Entities:
- Message - Normalized message storage (text, timestamp, is_spam, metadata)
- Pattern - Discovered spam pattern (type, description, examples)
- Rule - SQL rule generated from pattern (status, sql_expression, evaluation metrics)
- RuleEvaluation - Historical evaluation results (precision, recall, coverage)
- Checkpoint - Progress tracking for incremental mining
Backend:
- Python 3.10+
- FastAPI (async HTTP framework)
- SQLAlchemy 2.0 (ORM)
- PostgreSQL (production) / SQLite (development)
ML/AI:
- OpenAI API (default) or local LLM (vLLM/TGI/Ollama)
- OpenAI Embeddings (default) or local embeddings (BGE-M3, E5)
- DBSCAN clustering for semantic similarity
Infrastructure:
- Redis (distributed locks, caching)
- Prometheus + Grafana (monitoring)
- Docker + docker-compose (deployment)
- OpenTelemetry (observability)
Development:
- Poetry (dependency management)
- Pytest (testing)
- Ruff + Black (linting/formatting)
- MyPy (type checking)
- Sources: TAS logs, CSV files, API endpoints
- Formats: JSONL (recommended), CSV
- Idempotency: Deduplication via `external_id` or `message_hash`
- Batch processing: Efficient bulk ingestion
- Deterministic patterns: URLs, phone numbers, keywords, signatures
- Semantic patterns: DBSCAN clustering on embeddings
- Incremental mining: Process only new messages (5-10x faster)
- Checkpointing: Resume from interruptions
- SQL rules: Transparent, executable SQL expressions
- LLM-based refinement: Optional pattern explanation and rule optimization
- Multi-language support: Works with any Unicode text
- Shadow mode: Test rules on historical data before activation
- Metrics: Precision, recall, F1-score, coverage, false positive rate
- Parallel evaluation: 4-16 workers for faster processing
- Sampling: Optional sampling for very large datasets
- Safety profiles: Conservative, Balanced, Aggressive
- Custom profiles: Configurable thresholds
- Automatic deprecation: Monitor active rules, deprecate on degradation
- Export backends: SQL, ROL, platform-specific formats
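Idempotent ingestion (deduplication via `external_id` or `message_hash`, listed above) can be sketched as follows. The normalization steps and in-memory set are illustrative; the real system would deduplicate against the database:

```python
import hashlib
import unicodedata


def message_hash(text: str) -> str:
    """Stable dedup key: NFKC-normalize, collapse whitespace, lowercase,
    then hash, so trivially re-formatted duplicates collapse to one key."""
    normalized = " ".join(unicodedata.normalize("NFKC", text).split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


seen: set[str] = set()


def ingest(text: str) -> bool:
    """Return True if the message was new, False if a duplicate was skipped."""
    h = message_hash(text)
    if h in seen:
        return False
    seen.add(h)
    return True
```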
Core Endpoints:
- `POST /api/v1/messages/ingest` - Ingest messages
- `POST /api/v1/patterns/mine` - Run pattern mining
- `GET /api/v1/patterns` - List patterns
- `GET /api/v1/rules` - List rules (with filtering, explanations, risk assessment)
- `POST /api/v1/rules/eval-shadow` - Evaluate shadow rules
- `POST /api/v1/rules/promote` - Promote/rollback rules
- `GET /api/v1/rules/export` - Export rules
- `POST /api/v1/analyze` - High-level batch analysis
Legacy v1 Endpoints:
- `POST /v1/classify` - Single message classification
- `POST /v1/train` - Submit training example
- `GET /v1/stats` - System statistics
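A minimal client sketch for the ingestion endpoint above, using only the standard library. The request body shape, field names, and `X-API-Key` header are assumptions for illustration, not the documented schema:

```python
import json
from urllib import request


def build_ingest_request(base_url: str, messages: list[dict]) -> request.Request:
    """Build (but do not send) a POST /api/v1/messages/ingest request."""
    body = json.dumps({"messages": messages}).encode("utf-8")
    return request.Request(
        url=f"{base_url}/api/v1/messages/ingest",
        data=body,
        headers={"Content-Type": "application/json", "X-API-Key": "<your-key>"},
        method="POST",
    )


req = build_ingest_request(
    "http://localhost:8000",
    [{"external_id": "m1", "text": "free crypto now", "is_spam": True}],
)
# request.urlopen(req) would submit the batch to a running PATAS instance
```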
patas ingest-logs # Ingest messages
patas mine-patterns # Discover patterns
patas eval-rules # Evaluate rules
patas promote-rules # Promote/rollback rules
patas list-checkpoints # List mining checkpoints
- Process only new messages (after last checkpoint)
- 5-10x faster than full mining
- Suitable for daily operations
- Redis-based distributed locks
- Multi-instance coordination
- Horizontal scaling support (with data sharding)
- Two-stage processing (70-90% cost reduction)
- Local LLM/embeddings support (zero API costs)
- LLM/embedding caching
- Cost guard with budget alerts
- OpenTelemetry tracing
- Prometheus metrics
- Grafana dashboards
- Structured logging
- Audit trails
Typical values observed on internal benchmark dataset. Performance may vary based on dataset characteristics, hardware, and configuration.
Pattern Mining (Two-Stage): Designed for offline/batch analysis. Typical use case: daily/weekly logs → run PATAS overnight → receive new rules in the morning.
| Messages | Stage 1 | Stage 2 | Total | Stage 2 % |
|---|---|---|---|---|
| 100K | ~30 sec | ~20 min | ~21 min | 2-3% |
| 500K | ~2 min | ~3.5 h | ~3.5 h | 2-3% |
| 1M | ~4 min | ~7 h | ~7 h | 2-3% |
| 10M | ~42 min | ~70 h | ~3 days | 2-3% |
Shadow Evaluation:
| Rules | Sequential | Parallel (4) | Parallel (8) |
|---|---|---|---|
| 100 | ~3 hours | ~45 min | ~25 min |
| 1,000 | ~30 hours | ~7.5 hours | ~4 hours |
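The near-linear speedups in the table above come from evaluating independent rules concurrently. A sketch of that fan-out with a worker pool (the evaluator here is a stub; the real one runs each rule's SQL against historical messages and computes precision/recall/coverage):

```python
from concurrent.futures import ThreadPoolExecutor


def evaluate_rule(rule_id: str) -> dict:
    # Stub standing in for the real shadow evaluator.
    return {"rule_id": rule_id, "precision": 0.96}


def shadow_evaluate(rule_ids: list[str], workers: int = 8) -> list[dict]:
    """Evaluate rules concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(evaluate_rule, rule_ids))


results = shadow_evaluate([f"rule-{i}" for i in range(100)], workers=8)
```

Because rules are evaluated independently, throughput scales roughly with worker count until the database becomes the bottleneck.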
Ingestion:
- Throughput: ~200 msg/sec (end-to-end including validation, ORM, and DB writes)
- 500K messages: ~42 min
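The ~42 min figure follows directly from the quoted throughput:

```python
throughput = 200      # msg/sec, end-to-end (validation, ORM, DB writes)
messages = 500_000

minutes = messages / throughput / 60  # 2,500 seconds -> ~42 minutes
```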
For 10M+ Messages:
- Incremental mining (process only new messages)
- Parallel evaluation (8-16 workers)
- Rule filtering (top-N by quality tier)
- Sampling for very large datasets
- Local models (eliminate API costs)
Horizontal Scaling:
- Current: Distributed locks prevent concurrent processing
- Solution: Data sharding (split by message_id or timestamp)
- Roadmap: Automatic sharding (P1)
Example cost profile for OpenAI-based deployment. Actual costs depend on number of patterns found, message characteristics, and LLM usage patterns.
OpenAI Mode (per 500K messages):
- Embeddings: ~$0.06 (only 2-3% in Stage 2)
- LLM: ~$91 (rule generation)
- Total: ~$91 per run
Monthly (weekly runs):
- 500K messages/week: ~$364/month
- 1M messages/week: ~$728/month
Local Models:
- Zero API costs
- Infrastructure only (GPU server)
- Break-even: ~2-3 months for 500K/week
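The monthly figures above are simply the per-run cost scaled by run frequency:

```python
runs_per_month = 4                 # weekly runs
cost_per_run_500k = 0.06 + 91      # embeddings + LLM, per the figures above

monthly_500k = round(cost_per_run_500k * runs_per_month)  # ~$364/month
monthly_1m = monthly_500k * 2                             # ~$728/month
```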
Conservative (default, recommended for high-risk environments):
- Precision >= 0.95
- Max 5 false positives
- Recall ~0.08-0.10
- Recommended for production use in high-risk environments like Telegram
- Minimizes false positives at the cost of lower recall
Balanced:
- Precision >= 0.90
- Recall ~0.10-0.15
- Suitable for controlled scenarios and internal experiments
Aggressive:
- Precision >= 0.85
- Recall ~0.15-0.20
- Not recommended for production without careful evaluation
- Use only in controlled scenarios with extensive shadow testing
Custom Profiles:
- Configurable thresholds
- Per-use-case optimization
- For high-risk environments, we recommend starting with Conservative and adjusting only after thorough shadow evaluation
API Security:
- API key authentication
- Rate limiting
- WAF (pattern-based attack detection)
- IP whitelisting (optional)
Data Protection:
- PII redaction (SSN, passport, bank accounts, driver licenses)
- Privacy modes (STANDARD, STRICT)
- On-premise deployment support
- Designed for GDPR-friendly on-premise deployment with built-in PII redaction and privacy modes to help meet internal compliance requirements
SQL Safety:
- Only SELECT queries
- Whitelist validation
- SQL injection detection
- Syntax validation
- Audit logging (all operations)
- Request tracing (OpenTelemetry)
- Retention policies (configurable)
- Security audit checklist
On-Premise:
- Docker + docker-compose
- PostgreSQL + Redis
- Local LLM/embeddings (vLLM, TGI, Ollama)
- Air-gapped deployment support
Cloud:
- Hetzner Cloud VPS (recommended)
- AWS/GCP/Azure compatible
- Managed PostgreSQL/Redis
Minimum:
- 2 CPU cores
- 4 GB RAM
- 20 GB storage
- PostgreSQL 12+
- Redis 6+ (optional, for distributed locks)
Recommended (production):
- 4+ CPU cores
- 8+ GB RAM
- 100+ GB storage (for 10M+ messages)
- PostgreSQL 14+ with connection pooling
- Redis 7+ for caching and locks
- GPU (for local LLM/embeddings)
Metrics:
- Prometheus (system metrics)
- Custom metrics (pattern mining, rule evaluation)
- Cost tracking (LLM usage)
Dashboards:
- Grafana (pre-configured dashboards)
- Pattern mining progress
- Rule performance
- System health
Alerting:
- AlertManager integration
- Rule degradation alerts
- Cost budget alerts
- System health alerts
Database:
- Automatic migrations
- Retention policies (configurable)
- Backup scripts (pg_dump)
Updates:
- Zero-downtime deployments
- Secret rotation support
- Graceful degradation
Streaming Ingestion:
- Kafka/RabbitMQ consumer
- Real-time message processing
- Automatic backpressure handling
UI for Rule Management:
- Web interface for viewing/editing rules
- Quick disable/enable
- Rule performance dashboards
- Moderator feedback interface
Unicode & Emoji Handling:
- Improved Unicode normalization
- Better emoji processing
- RTL language support
Multilingual Support:
- Multilingual embeddings (BGE-M3, E5)
- Language-specific normalization
- Cross-language pattern detection
Sharded Evaluation:
- Parallel evaluation across message shards
- Distributed evaluation for 10M+ messages
Database Optimization:
- Table partitioning by timestamp
- Read replicas for evaluation queries
- Optimized indexes
Retention & Archiving:
- Automatic data retention policies
- Cold storage archiving
- Configurable TTL
Enhanced Risk Assessment:
- Whitelist support
- Context-aware risk scoring
- LLM-based false positive detection
A/B Testing:
- Shadow mode for partial traffic
- Comparison metrics
- Automatic promotion on improvement
Feedback Loop:
- Moderator feedback API
- Automatic rule deprecation from feedback
- Pattern improvement from corrections
Cost Optimization:
- Automatic threshold adjustment based on budget
- Smart batching for LLM calls
- Cost guard with auto-scaling
Dark Launch:
- Gradual rule rollout
- Canary deployments
- Automatic rollback on degradation
Traditional Rule-Based Systems:
- ❌ Manual rule creation
- ❌ Static rules, no adaptation
- ✅ High precision
- ❌ Low recall
ML-Based Systems:
- ✅ Automatic pattern discovery
- ✅ Adaptive to new patterns
- ❌ High false positive rate
- ❌ Expensive (process all messages)
- ❌ Black box (hard to explain)
PATAS:
- ✅ Automatic pattern discovery
- ✅ Adaptive to new patterns
- ✅ High precision (0.93-0.97 observed on benchmark dataset)
- ✅ Low false positive rate (<0.15% observed on benchmark dataset)
- ✅ Transparent SQL rules
- ✅ Cost-effective (70-90% reduction vs. naive approach)
- ✅ On-premise deployment
- Two-stage processing - Only 2-3% of messages use expensive LLM/embeddings
- Transparent rules - SQL expressions, not black box ML
- Safety-first - Multiple safety profiles, shadow evaluation
- On-premise - Full control, designed for GDPR-friendly deployment
- Cost-effective - 70-90% cost reduction vs. naive approach
Risk: High false positive rate
- Mitigation: Conservative profile by default, shadow evaluation, automatic deprecation
Risk: SQL injection in generated rules
- Mitigation: SQL safety validation, whitelist, syntax validation
Risk: LLM API failures
- Mitigation: Graceful degradation, local LLM support, retries with backoff
Risk: Performance degradation at scale
- Mitigation: Two-stage processing, incremental mining, parallel evaluation, horizontal scaling
Risk: Data privacy violations
- Mitigation: PII redaction, privacy modes, on-premise deployment, GDPR-friendly design with built-in compliance tools
Risk: High LLM costs
- Mitigation: Two-stage processing, local models, caching, cost guard
Risk: Rule quality degradation
- Mitigation: Shadow evaluation, automatic deprecation, quality tiers
Risk: Low adoption due to complexity
- Mitigation: Comprehensive documentation, CLI tools, API quickstart, examples
Risk: Competition from established players
- Mitigation: Open-source, on-premise focus, transparent rules, cost-effectiveness
Target metrics based on observed performance on internal benchmark dataset. Actual values may vary on new datasets and should be re-calibrated in shadow mode.
- Precision: >= 0.93 (conservative profile, observed on benchmark)
- False positive rate: < 0.15% (observed on benchmark)
- Coverage: 5-8% of spam messages (observed on benchmark)
- Processing time: < 4 hours for 500K messages (batch/offline mode)
- Cost reduction: 70-90% vs. naive approach (typical range)
- Adoption: 3+ production deployments in 6 months
- User satisfaction: > 4.0/5.0 (if survey conducted)
- Community: Active GitHub stars, contributions
- Documentation: Complete wiki, examples, tutorials
- Uptime: > 99.5%
- Mean time to recovery: < 1 hour
- API response time: P95 < 500ms
- Error rate: < 0.1%
Strengths:
- ✅ Complete two-stage pipeline implementation
- ✅ Comprehensive API and CLI
- ✅ Safety profiles and quality tiers
- ✅ Shadow evaluation and automatic deprecation
- ✅ On-premise deployment support
- ✅ Extensive documentation
Gaps:
- ⚠️ No streaming ingestion (batch only)
- ⚠️ No UI for rule management
- ⚠️ Limited horizontal scaling (requires manual sharding)
- ⚠️ No feedback loop from moderators
- ⚠️ Limited multilingual support (Unicode issues)
Must-Have (P0):
- ✅ Two-stage pattern mining (DONE)
- ✅ Rule lifecycle management (DONE)
- ✅ Shadow evaluation (DONE)
- ✅ Safety profiles (DONE)
- ✅ API and CLI (DONE)
- ✅ Documentation (DONE)
Should-Have (P1):
- Streaming ingestion (Kafka/RabbitMQ)
- UI for rule management
- Improved Unicode/emoji handling
- Automatic horizontal scaling (sharding)
Nice-to-Have (P2):
- Moderator feedback loop
- A/B testing
- Advanced cost optimization
- Dark launch support
Pilot Deployment:
- Deploy to pilot/production environment
- Run in shadow mode first to re-calibrate metrics on real data
- Monitor performance and costs
- Collect user feedback
P1 Features:
- Implement streaming ingestion
- Build UI for rule management
- Improve Unicode handling
P2 Features:
- Implement feedback loop
- Add A/B testing
- Advanced optimizations
PATAS is a pilot-ready MVP with production-grade core functionality:
- ✅ Complete two-stage pattern mining pipeline
- ✅ Comprehensive API and CLI interfaces
- ✅ Safety-first design with shadow evaluation
- ✅ Cost-effective architecture (70-90% reduction)
- ✅ On-premise deployment support
- ✅ Extensive documentation
Ready for:
- Production pilot deployments (with shadow mode calibration)
- Integration with messaging platforms
- Community adoption
Note: Core functionality is mature and production-grade. Enterprise features (streaming ingestion, UI, automatic scaling) are planned for P1/P2 roadmap after successful pilot feedback.
Future focus:
- Streaming ingestion (P1)
- UI development (P1)
- Advanced scaling (P1/P2)
- Feedback loops (P2)
- Pattern: Recurring spam characteristic (URL, keyword, semantic similarity)
- Rule: SQL expression that matches spam messages
- Shadow Evaluation: Testing rules on historical data before activation
- Two-Stage Processing: Fast scanning (Stage 1) + deep analysis (Stage 2)
- Safety Profile: Aggressiveness level (Conservative, Balanced, Aggressive)
- Quality Tier: Rule quality classification (SAFE_AUTO, REVIEW_ONLY, FEATURE_ONLY)
- Checkpoint: Progress tracking for incremental mining
- Incremental Mining: Processing only new messages (after last checkpoint)
Document Version: 1.0
Last Updated: 2025-01-27
Author: KikuAI Lab
Status: Final
Note: This PRD is maintained as part of the codebase. If code changes, the PRD is updated as the last step to ensure consistency.