Product PRD
Version: 2.0.0
Date: 2025-01-27
Status: Pilot-ready MVP (production-grade core, missing streaming/UI)
PATAS is an autonomous pattern discovery and rule management system for anti-spam operations. It analyzes historical message logs, automatically discovers spam patterns, generates safe blocking rules, and evaluates their effectiveness before deployment.
Key Value Proposition:
- Signal engine, not enforcement - Provides patterns and metrics that inform anti-spam decisions
- On-premise deployment - Designed for deployment within your infrastructure
- Two-stage processing - Fast scanning + deep analysis for 70-90% cost reduction
- Deterministic and rule-based - Core engine is deterministic; ML/LLM is optional
- Safety-first design - Multiple safety profiles with clear risk boundaries
- Batch/offline analysis - Designed for daily/weekly batch processing of historical logs, not real-time filtering
Traditional anti-spam systems are constrained by:
- Manual rule creation and maintenance
- Constant monitoring and adjustment
- High false positive rates
- Inability to adapt to new spam patterns quickly
- Expensive ML/LLM costs for processing all messages
PATAS automates the entire spam pattern discovery and rule generation pipeline:
- Ingest historical message logs
- Discover recurring spam patterns automatically (two-stage: fast scan + deep analysis)
- Generate safe SQL rules from discovered patterns
- Evaluate rules on historical data (shadow mode)
- Promote rules that meet safety thresholds
- Monitor rule performance and deprecate underperforming rules
- Primary: Anti-spam teams at messaging platforms (Telegram, WhatsApp, etc.)
- Secondary: Content moderation teams, security operations
- Tertiary: Research teams studying spam patterns
Observed on internal benchmark dataset (500K messages):
- Precision: 0.93-0.97 (conservative profile)
- False positive rate: <0.15%
- Coverage: 5-8% of all spam messages
- Cost reduction: 70-90% vs. processing all messages with LLM
- Processing time: 3.5 hours for 500K messages (vs. days with naive approach)
- AUTO_SAFE classification: 50-55% of rules (target: >50%)
- Pattern Mining time: 60-65% of total time (optimized from 70.1%)
Note: On new datasets, we recommend running PATAS in shadow mode first to re-calibrate these metrics.
Design Philosophy: PATAS is an offline/batch analysis system for historical message logs. Typical use case: daily/weekly logs → run PATAS overnight → receive new rules in the morning. It is not designed for inline real-time filtering.
Layered Architecture:
- API Layer (`app/api/`) - FastAPI HTTP endpoints
- Service Layer (`app/v2_*.py`) - Business logic
- Repository Layer (`app/repositories.py`) - Data access
- Infrastructure - Database, caching, observability
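The layering above separates HTTP handling, business logic, and data access. A minimal sketch of how those layers might compose (class and method names here are illustrative, not taken from the actual codebase):

```python
from dataclasses import dataclass


@dataclass
class RuleRepository:
    """Repository layer: data access only, no business logic."""
    db: object  # any object exposing a query(sql) method

    def list_active(self) -> list[dict]:
        return self.db.query("SELECT * FROM rules WHERE status = 'ACTIVE'")


class RuleService:
    """Service layer: business logic, depends on the repository abstraction."""

    def __init__(self, repo: RuleRepository):
        self.repo = repo

    def active_rules(self) -> list[dict]:
        return self.repo.list_active()


# The API layer (app/api/) would wire a FastAPI route to RuleService,
# keeping endpoint handlers thin.
```

Because the service depends on an abstraction rather than a concrete database, each layer can be tested in isolation with a stub.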
Two-Stage Approach:
Stage 1: Fast Scanning
- Large chunks (10K-50K messages)
- Deterministic patterns only (URLs, keywords, signatures)
- No LLM/embeddings
- Fast aggregation (~2-4 min for 500K messages)
Stage 2: Deep Analysis
- Small chunks (1K-5K messages)
- Suspicious patterns only (top 2-3% by default)
- Semantic mining + LLM analysis
- High quality rules (~3.5 hours for 500K messages)
Benefits:
- 70-90% cost reduction (only 2-3% of messages use expensive LLM/embeddings)
- Maintains high quality (deep analysis for important patterns)
- Scales to millions of messages
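The Stage 1 → Stage 2 handoff can be sketched as follows. This is an illustrative sketch, not the actual implementation: the function name and the frequency-based ranking are assumptions, but the key idea from the source holds, namely that only the top 2-3% of patterns reach the expensive semantic/LLM stage.

```python
from collections import Counter


def select_for_deep_analysis(pattern_counts: Counter, top_fraction: float = 0.03) -> list[str]:
    """Keep only the most frequent Stage 1 patterns (top 2-3% by default)
    as candidates for Stage 2 semantic mining and LLM analysis."""
    ranked = [pattern for pattern, _ in pattern_counts.most_common()]
    k = max(1, int(len(ranked) * top_fraction))
    return ranked[:k]


# Stage 1: cheap deterministic counting over a large chunk of messages
counts = Counter({"bit.ly/spam123": 4200, "free crypto": 1900, "hello": 12})
suspicious = select_for_deep_analysis(counts)  # only these patterns reach Stage 2
```

Everything outside the selected fraction is handled with deterministic patterns alone, which is where the 70-90% cost reduction comes from.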
States:
- CANDIDATE - Newly generated rule, not yet evaluated
- SHADOW - Evaluated on historical data, not yet active
- ACTIVE - Deployed and monitoring performance
- DEPRECATED - Underperforming (>10% precision drop) or manually disabled
Promotion Criteria:
- Conservative: precision >= 0.95, max 5 false positives
- Balanced: precision >= 0.90
- Aggressive: precision >= 0.85
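The promotion criteria above reduce to a simple threshold check per profile. A hedged sketch (the function name is illustrative; the thresholds are the ones listed above):

```python
def meets_promotion_criteria(profile: str, precision: float, false_positives: int) -> bool:
    """Check whether a SHADOW rule's evaluation metrics clear the
    promotion bar for the given safety profile."""
    if profile == "conservative":
        return precision >= 0.95 and false_positives <= 5
    if profile == "balanced":
        return precision >= 0.90
    if profile == "aggressive":
        return precision >= 0.85
    raise ValueError(f"unknown profile: {profile}")


meets_promotion_criteria("conservative", 0.96, 3)   # passes
meets_promotion_criteria("conservative", 0.96, 12)  # fails: too many false positives
```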
Multi-layer Protection:
- SQL Safety Validation:
  - Only SELECT queries (no INSERT/UPDATE/DELETE/DROP)
  - Whitelist tables/columns
  - SQL injection detection
  - Syntax validation via `sqlparse`
  - "Match-everything" detection (coverage >80% = red flag)
- LLM Validation (optional):
  - Logic verification before saving
  - False positive detection
- Shadow Evaluation:
  - Testing on historical data before activation
  - Precision, recall, F1-score metrics
  - Automatic deprecation on degradation
- Quality Tiers:
  - SAFE_AUTO: precision >= 0.98, low FPR - auto-activate
  - REVIEW_ONLY: precision >= 0.90 - manual review required
  - FEATURE_ONLY: precision < 0.90 - use as ML feature, not standalone rule
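A deliberately simplified sketch of the SQL safety checks described above. The real pipeline parses statements with `sqlparse`; this regex-based version only illustrates the decision logic (SELECT-only, keyword blacklist, table whitelist, and the "match-everything" coverage guard), and the table whitelist here is an assumption:

```python
import re

FORBIDDEN = re.compile(r"\b(INSERT|UPDATE|DELETE|DROP|ALTER|TRUNCATE|GRANT)\b", re.I)
ALLOWED_TABLES = {"messages"}  # illustrative whitelist


def is_rule_sql_safe(sql: str, coverage: float) -> bool:
    """Reject anything that is not a whitelisted SELECT, and flag rules
    that match more than 80% of messages as 'match-everything'."""
    stripped = sql.strip().rstrip(";")
    if not stripped.upper().startswith("SELECT"):
        return False
    if FORBIDDEN.search(stripped):
        return False
    tables = set(re.findall(r"\bFROM\s+(\w+)", stripped, re.I))
    if not tables <= ALLOWED_TABLES:
        return False
    return coverage <= 0.80


is_rule_sql_safe("SELECT id FROM messages WHERE text LIKE '%bit.ly%'", coverage=0.02)  # safe
is_rule_sql_safe("DELETE FROM messages", coverage=0.02)                                # rejected
```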
Core Entities:
- Message - Normalized message storage (text, timestamp, is_spam, metadata)
- Pattern - Discovered spam pattern (type, description, examples)
- Rule - SQL rule generated from pattern (status, sql_expression, evaluation metrics)
- RuleEvaluation - Historical evaluation results (precision, recall, coverage)
- Checkpoint - Progress tracking for incremental mining
Backend:
- Python 3.10+
- FastAPI (async HTTP framework)
- SQLAlchemy 2.0 (ORM)
- PostgreSQL (production) / SQLite (development)
ML/AI:
- OpenAI API (default) or local LLM (vLLM/TGI/Ollama)
- OpenAI Embeddings (default) or local embeddings (BGE-M3, E5)
- DBSCAN clustering for semantic similarity
Infrastructure:
- Redis (distributed locks, caching)
- Prometheus + Grafana (monitoring)
- Docker + docker-compose (deployment)
- OpenTelemetry (observability)
Development:
- Poetry (dependency management)
- Pytest (testing)
- Ruff + Black (linting/formatting)
- MyPy (type checking)
- Sources: TAS logs, CSV files, API endpoints
- Formats: JSONL (recommended), CSV
- Idempotency: Deduplication via `external_id` or `message_hash`
- Batch processing: Efficient bulk ingestion
- Deterministic patterns: URLs, phone numbers, keywords, signatures
- Semantic patterns: DBSCAN clustering on embeddings
- Incremental mining: Process only new messages (5-10x faster)
- Checkpointing: Resume from interruptions
- SQL rules: Transparent, executable SQL expressions
- LLM-based refinement: Optional pattern explanation and rule optimization
- Multi-language support: Works with any Unicode text
- Shadow mode: Test rules on historical data before activation
- Metrics: Precision, recall, F1-score, coverage, false positive rate
- Parallel evaluation: 4-16 workers for faster processing
- Sampling: Optional sampling for very large datasets
- Safety profiles: Conservative, Balanced, Aggressive
- Custom profiles: Configurable thresholds
- Automatic deprecation: Monitor active rules, deprecate on degradation
- Export backends: SQL, ROL, platform-specific formats
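Idempotent ingestion (deduplication via `external_id` or `message_hash`, listed above) can be sketched as follows. The normalization steps and in-memory set are illustrative; the real system would deduplicate against the database:

```python
import hashlib
import unicodedata


def message_hash(text: str) -> str:
    """Stable dedup key: NFKC-normalize, collapse whitespace, lowercase,
    then hash, so trivially re-formatted duplicates collapse to one key."""
    normalized = " ".join(unicodedata.normalize("NFKC", text).split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


seen: set[str] = set()


def ingest(text: str) -> bool:
    """Return True if the message was new, False if a duplicate was skipped."""
    h = message_hash(text)
    if h in seen:
        return False
    seen.add(h)
    return True
```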
Core Endpoints:
- `POST /api/v1/messages/ingest` - Ingest messages
- `POST /api/v1/patterns/mine` - Run pattern mining
- `GET /api/v1/patterns` - List patterns
- `GET /api/v1/rules` - List rules (with filtering, explanations, risk assessment)
- `POST /api/v1/rules/eval-shadow` - Evaluate shadow rules
- `POST /api/v1/rules/promote` - Promote/rollback rules
- `GET /api/v1/rules/export` - Export rules
- `POST /api/v1/analyze` - High-level batch analysis
Legacy v1 Endpoints:
- `POST /v1/classify` - Single message classification
- `POST /v1/train` - Submit training example
- `GET /v1/stats` - System statistics
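A minimal client sketch for the ingestion endpoint above, using only the standard library. The request body shape, field names, and `X-API-Key` header are assumptions for illustration, not the documented schema:

```python
import json
from urllib import request


def build_ingest_request(base_url: str, messages: list[dict]) -> request.Request:
    """Build (but do not send) a POST /api/v1/messages/ingest request."""
    body = json.dumps({"messages": messages}).encode("utf-8")
    return request.Request(
        url=f"{base_url}/api/v1/messages/ingest",
        data=body,
        headers={"Content-Type": "application/json", "X-API-Key": "<your-key>"},
        method="POST",
    )


req = build_ingest_request(
    "http://localhost:8000",
    [{"external_id": "m1", "text": "free crypto now", "is_spam": True}],
)
# request.urlopen(req) would submit the batch to a running PATAS instance
```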
patas ingest-logs # Ingest messages
patas mine-patterns # Discover patterns
patas eval-rules # Evaluate rules
patas promote-rules # Promote/rollback rules
patas list-checkpoints # List mining checkpoints
- Process only new messages (after last checkpoint)
- 5-10x faster than full mining
- Suitable for daily operations
- Redis-based distributed locks
- Multi-instance coordination
- Horizontal scaling support (with data sharding)
- Two-stage processing (70-90% cost reduction)
- Local LLM/embeddings support (zero API costs)
- LLM/embedding caching
- Cost guard with budget alerts
- OpenTelemetry tracing
- Prometheus metrics
- Grafana dashboards
- Structured logging
- Audit trails
Typical values observed on internal benchmark dataset. Performance may vary based on dataset characteristics, hardware, and configuration.
Pattern Mining (Two-Stage): Designed for offline/batch analysis. Typical use case: daily/weekly logs → run PATAS overnight → receive new rules in the morning.
| Messages | Stage 1 | Stage 2 | Total | Stage 2 % |
|---|---|---|---|---|
| 100K | ~30 sec | ~20 min | ~21 min | 2-3% |
| 500K | ~2 min | ~3.5 h | ~3.5 h | 2-3% |
| 1M | ~4 min | ~7 h | ~7 h | 2-3% |
| 10M | ~42 min | ~70 h | ~3 days | 2-3% |
Shadow Evaluation:
| Rules | Sequential | Parallel (4) | Parallel (8) |
|---|---|---|---|
| 100 | ~3 hours | ~45 min | ~25 min |
| 1,000 | ~30 hours | ~7.5 hours | ~4 hours |
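The near-linear speedups in the table above come from evaluating independent rules concurrently. A sketch of that fan-out with a worker pool (the evaluator here is a stub; the real one runs each rule's SQL against historical messages and computes precision/recall/coverage):

```python
from concurrent.futures import ThreadPoolExecutor


def evaluate_rule(rule_id: str) -> dict:
    # Stub standing in for the real shadow evaluator.
    return {"rule_id": rule_id, "precision": 0.96}


def shadow_evaluate(rule_ids: list[str], workers: int = 8) -> list[dict]:
    """Evaluate rules concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(evaluate_rule, rule_ids))


results = shadow_evaluate([f"rule-{i}" for i in range(100)], workers=8)
```

Because rules are evaluated independently, throughput scales roughly with worker count until the database becomes the bottleneck.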
Ingestion:
- Throughput: ~200 msg/sec (end-to-end including validation, ORM, and DB writes)
- 500K messages: ~42 min
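The ~42 min figure follows directly from the quoted throughput:

```python
throughput = 200      # msg/sec, end-to-end (validation, ORM, DB writes)
messages = 500_000

minutes = messages / throughput / 60  # 2,500 seconds -> ~42 minutes
```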
For 10M+ Messages:
- Incremental mining (process only new messages)
- Parallel evaluation (8-16 workers)
- Rule filtering (top-N by quality tier)
- Sampling for very large datasets
- Local models (eliminate API costs)
Horizontal Scaling:
- Current: Distributed locks prevent concurrent processing
- Solution: Data sharding (split by message_id or timestamp)
- Roadmap: Automatic sharding (P1)
Example cost profile for OpenAI-based deployment. Actual costs depend on number of patterns found, message characteristics, and LLM usage patterns.
OpenAI Mode (per 500K messages):
- Embeddings: ~$0.06 (only 2-3% in Stage 2)
- LLM: ~$91 (rule generation)
- Total: ~$91 per run
Monthly (weekly runs):
- 500K messages/week: ~$364/month
- 1M messages/week: ~$728/month
Local Models:
- Zero API costs
- Infrastructure only (GPU server)
- Break-even: ~2-3 months for 500K/week
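The monthly figures above are simply the per-run cost scaled by run frequency:

```python
runs_per_month = 4                 # weekly runs
cost_per_run_500k = 0.06 + 91      # embeddings + LLM, per the figures above

monthly_500k = round(cost_per_run_500k * runs_per_month)  # ~$364/month
monthly_1m = monthly_500k * 2                             # ~$728/month
```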
Conservative (default, recommended for high-risk environments):
- Precision >= 0.95
- Max 5 false positives
- Recall ~0.08-0.10
- Recommended for production use in high-risk environments like Telegram
- Minimizes false positives at the cost of lower recall
Balanced:
- Precision >= 0.90
- Recall ~0.10-0.15
- Suitable for controlled scenarios and internal experiments
Aggressive:
- Precision >= 0.85
- Recall ~0.15-0.20
- Not recommended for production without careful evaluation
- Use only in controlled scenarios with extensive shadow testing
Custom Profiles:
- Configurable thresholds
- Per-use-case optimization
- For high-risk environments, we recommend starting with Conservative and adjusting only after thorough shadow evaluation
API Security:
- API key authentication
- Rate limiting
- WAF (pattern-based attack detection)
- IP whitelisting (optional)
Data Protection:
- PII redaction (SSN, passport, bank accounts, driver licenses)
- Privacy modes (STANDARD, STRICT)
- On-premise deployment support
- Designed for GDPR-friendly on-premise deployment with built-in PII redaction and privacy modes to help meet internal compliance requirements
SQL Safety:
- Only SELECT queries
- Whitelist validation
- SQL injection detection
- Syntax validation
- Audit logging (all operations)
- Request tracing (OpenTelemetry)
- Retention policies (configurable)
- Security audit checklist
On-Premise:
- Docker + docker-compose
- PostgreSQL + Redis
- Local LLM/embeddings (vLLM, TGI, Ollama)
- Air-gapped deployment support
Cloud:
- Hetzner Cloud VPS (recommended)
- AWS/GCP/Azure compatible
- Managed PostgreSQL/Redis
Minimum:
- 2 CPU cores
- 4 GB RAM
- 20 GB storage
- PostgreSQL 12+
- Redis 6+ (optional, for distributed locks)
Recommended (production):
- 4+ CPU cores
- 8+ GB RAM
- 100+ GB storage (for 10M+ messages)
- PostgreSQL 14+ with connection pooling
- Redis 7+ for caching and locks
- GPU (for local LLM/embeddings)
Metrics:
- Prometheus (system metrics)
- Custom metrics (pattern mining, rule evaluation)
- Cost tracking (LLM usage)
Dashboards:
- Grafana (pre-configured dashboards)
- Pattern mining progress
- Rule performance
- System health
Alerting:
- AlertManager integration
- Rule degradation alerts
- Cost budget alerts
- System health alerts
Database:
- Automatic migrations
- Retention policies (configurable)
- Backup scripts (pg_dump)
Updates:
- Zero-downtime deployments
- Secret rotation support
- Graceful degradation
Streaming Ingestion:
- Kafka/RabbitMQ consumer
- Real-time message processing
- Automatic backpressure handling
UI for Rule Management:
- Web interface for viewing/editing rules
- Quick disable/enable
- Rule performance dashboards
- Moderator feedback interface
Unicode & Emoji Handling:
- Improved Unicode normalization
- Better emoji processing
- RTL language support
Multilingual Support:
- Multilingual embeddings (BGE-M3, E5)
- Language-specific normalization
- Cross-language pattern detection
Sharded Evaluation:
- Parallel evaluation across message shards
- Distributed evaluation for 10M+ messages
Database Optimization:
- Table partitioning by timestamp
- Read replicas for evaluation queries
- Optimized indexes
Retention & Archiving:
- Automatic data retention policies
- Cold storage archiving
- Configurable TTL
Enhanced Risk Assessment:
- Whitelist support
- Context-aware risk scoring
- LLM-based false positive detection
A/B Testing:
- Shadow mode for partial traffic
- Comparison metrics
- Automatic promotion on improvement
Feedback Loop:
- Moderator feedback API
- Automatic rule deprecation from feedback
- Pattern improvement from corrections
Cost Optimization:
- Automatic threshold adjustment based on budget
- Smart batching for LLM calls
- Cost guard with auto-scaling
Dark Launch:
- Gradual rule rollout
- Canary deployments
- Automatic rollback on degradation
Traditional Rule-Based Systems:
- ❌ Manual rule creation
- ❌ Static rules, no adaptation
- ✅ High precision
- ❌ Low recall
ML-Based Systems:
- ✅ Automatic pattern discovery
- ✅ Adaptive to new patterns
- ❌ High false positive rate
- ❌ Expensive (process all messages)
- ❌ Black box (hard to explain)
PATAS:
- ✅ Automatic pattern discovery
- ✅ Adaptive to new patterns
- ✅ High precision (0.93-0.97 observed on benchmark dataset)
- ✅ Low false positive rate (<0.15% observed on benchmark dataset)
- ✅ Transparent SQL rules
- ✅ Cost-effective (70-90% reduction vs. naive approach)
- ✅ On-premise deployment
- Two-stage processing - Only 2-3% of messages use expensive LLM/embeddings
- Transparent rules - SQL expressions, not black box ML
- Safety-first - Multiple safety profiles, shadow evaluation
- On-premise - Full control, designed for GDPR-friendly deployment
- Cost-effective - 70-90% cost reduction vs. naive approach
Risk: High false positive rate
- Mitigation: Conservative profile by default, shadow evaluation, automatic deprecation
Risk: SQL injection in generated rules
- Mitigation: SQL safety validation, whitelist, syntax validation
Risk: LLM API failures
- Mitigation: Graceful degradation, local LLM support, retries with backoff
Risk: Performance degradation at scale
- Mitigation: Two-stage processing, incremental mining, parallel evaluation, horizontal scaling
Risk: Data privacy violations
- Mitigation: PII redaction, privacy modes, on-premise deployment, GDPR-friendly design with built-in compliance tools
Risk: High LLM costs
- Mitigation: Two-stage processing, local models, caching, cost guard
Risk: Rule quality degradation
- Mitigation: Shadow evaluation, automatic deprecation, quality tiers
Risk: Low adoption due to complexity
- Mitigation: Comprehensive documentation, CLI tools, API quickstart, examples
Risk: Competition from established players
- Mitigation: Open-source, on-premise focus, transparent rules, cost-effectiveness
Target metrics based on observed performance on internal benchmark dataset. Actual values may vary on new datasets and should be re-calibrated in shadow mode.
- Precision: >= 0.93 (conservative profile, observed on benchmark)
- False positive rate: < 0.15% (observed on benchmark)
- Coverage: 5-8% of spam messages (observed on benchmark)
- Processing time: < 4 hours for 500K messages (batch/offline mode)
- Cost reduction: 70-90% vs. naive approach (typical range)
- Adoption: 3+ production deployments in 6 months
- User satisfaction: > 4.0/5.0 (if survey conducted)
- Community: Active GitHub stars, contributions
- Documentation: Complete wiki, examples, tutorials
- Uptime: > 99.5%
- Mean time to recovery: < 1 hour
- API response time: P95 < 500ms
- Error rate: < 0.1%
Strengths:
- ✅ Complete two-stage pipeline implementation
- ✅ Comprehensive API and CLI
- ✅ Safety profiles and quality tiers
- ✅ Shadow evaluation and automatic deprecation
- ✅ On-premise deployment support
- ✅ Extensive documentation
Gaps:
- ⚠️ No streaming ingestion (batch only)
- ⚠️ No UI for rule management
- ⚠️ Limited horizontal scaling (requires manual sharding)
- ⚠️ No feedback loop from moderators
- ⚠️ Limited multilingual support (Unicode issues)
Must-Have (P0):
- ✅ Two-stage pattern mining (DONE)
- ✅ Rule lifecycle management (DONE)
- ✅ Shadow evaluation (DONE)
- ✅ Safety profiles (DONE)
- ✅ API and CLI (DONE)
- ✅ Documentation (DONE)
Should-Have (P1):
- Streaming ingestion (Kafka/RabbitMQ)
- UI for rule management
- Improved Unicode/emoji handling
- Automatic horizontal scaling (sharding)
Nice-to-Have (P2):
- Moderator feedback loop
- A/B testing
- Advanced cost optimization
- Dark launch support
Pilot Deployment:
- Deploy to pilot/production environment
- Run in shadow mode first to re-calibrate metrics on real data
- Monitor performance and costs
- Collect user feedback
P1 Features:
- Implement streaming ingestion
- Build UI for rule management
- Improve Unicode handling
P2 Features:
- Implement feedback loop
- Add A/B testing
- Advanced optimizations
PATAS is a pilot-ready MVP with production-grade core functionality:
- ✅ Complete two-stage pattern mining pipeline
- ✅ Comprehensive API and CLI interfaces
- ✅ Safety-first design with shadow evaluation
- ✅ Cost-effective architecture (70-90% reduction)
- ✅ On-premise deployment support
- ✅ Extensive documentation
Ready for:
- Production pilot deployments (with shadow mode calibration)
- Integration with messaging platforms
- Community adoption
Note: Core functionality is mature and production-grade. Enterprise features (streaming ingestion, UI, automatic scaling) are planned for P1/P2 roadmap after successful pilot feedback.
Future focus:
- Streaming ingestion (P1)
- UI development (P1)
- Advanced scaling (P1/P2)
- Feedback loops (P2)
- Pattern: Recurring spam characteristic (URL, keyword, semantic similarity)
- Rule: SQL expression that matches spam messages
- Shadow Evaluation: Testing rules on historical data before activation
- Two-Stage Processing: Fast scanning (Stage 1) + deep analysis (Stage 2)
- Safety Profile: Aggressiveness level (Conservative, Balanced, Aggressive)
- Quality Tier: Rule quality classification (SAFE_AUTO, REVIEW_ONLY, FEATURE_ONLY)
- Checkpoint: Progress tracking for incremental mining
- Incremental Mining: Processing only new messages (after last checkpoint)
Document Version: 1.0
Last Updated: 2025-01-27
Author: KikuAI Lab
Status: Final
Note: This PRD is maintained as part of the codebase. If code changes, the PRD is updated as the last step to ensure consistency.