
Product PRD


Product Requirements Document (PRD)

PATAS - Pattern-Adaptive Transmodal Anti-Spam System

Version: 2.0.0
Date: 2025-01-27
Status: Pilot-ready MVP (production-grade core, missing streaming/UI)


Executive Summary

PATAS is an autonomous pattern discovery and rule management system for anti-spam operations. It analyzes historical message logs, automatically discovers spam patterns, generates safe blocking rules, and evaluates their effectiveness before deployment.

Key Value Proposition:

  • Signal engine, not enforcement - Provides patterns and metrics that inform anti-spam decisions
  • On-premise deployment - Designed for deployment within your infrastructure
  • Two-stage processing - Fast scanning + deep analysis for 70-90% cost reduction
  • Deterministic and rule-based - Core engine is deterministic; ML/LLM is optional
  • Safety-first design - Multiple safety profiles with clear risk boundaries
  • Batch/offline analysis - Designed for daily/weekly batch processing of historical logs, not real-time filtering

1. Product Overview

1.1 Problem Statement

Traditional anti-spam systems require:

  • Manual rule creation and maintenance
  • Constant monitoring and adjustment
  • High false positive rates
  • Inability to adapt to new spam patterns quickly
  • Expensive ML/LLM costs for processing all messages

1.2 Solution

PATAS automates the entire spam pattern discovery and rule generation pipeline:

  1. Ingest historical message logs
  2. Discover recurring spam patterns automatically (two-stage: fast scan + deep analysis)
  3. Generate safe SQL rules from discovered patterns
  4. Evaluate rules on historical data (shadow mode)
  5. Promote rules that meet safety thresholds
  6. Monitor rule performance and deprecate underperforming rules
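The six steps above can be sketched end-to-end in a few lines. This is an illustrative toy, not the actual PATAS API: `mine_patterns`, `evaluate_rule`, and `run_pipeline` are hypothetical names, and the "pattern" here is just a repeated keyword.

```python
# Toy sketch of the PATAS pipeline (discover -> evaluate in shadow -> promote).
# All function and field names are illustrative, not the real PATAS API.
from collections import Counter

def mine_patterns(messages):
    # Toy discovery: any token appearing in >= 2 spam messages is a "pattern".
    counts = Counter()
    for m in messages:
        if m["is_spam"]:
            counts.update(set(m["text"].lower().split()))
    return [tok for tok, n in counts.items() if n >= 2]

def evaluate_rule(token, messages):
    # Shadow evaluation: apply the candidate rule to historical data.
    hits = [m for m in messages if token in m["text"].lower()]
    tp = sum(m["is_spam"] for m in hits)
    return {"rule": token, "precision": tp / len(hits) if hits else 0.0}

def run_pipeline(messages, min_precision=0.95):
    # Promote only rules that meet the safety threshold.
    return [r for p in mine_patterns(messages)
            if (r := evaluate_rule(p, messages))["precision"] >= min_precision]

msgs = [
    {"text": "win free crypto now", "is_spam": True},
    {"text": "free crypto giveaway", "is_spam": True},
    {"text": "lunch tomorrow?",      "is_spam": False},
]
promoted = run_pipeline(msgs)
```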

1.3 Target Users

  • Primary: Anti-spam teams at messaging platforms (Telegram, WhatsApp, etc.)
  • Secondary: Content moderation teams, security operations
  • Tertiary: Research teams studying spam patterns

1.4 Success Metrics

Observed on internal benchmark dataset (500K messages):

  • Precision: 0.93-0.97 (conservative profile)
  • False positive rate: <0.15%
  • Coverage: 5-8% of all spam messages
  • Cost reduction: 70-90% vs. processing all messages with LLM
  • Processing time: 3.5 hours for 500K messages (vs. days with naive approach)
  • SAFE_AUTO classification: 50-55% of rules (target: >50%)
  • Pattern Mining time: 60-65% of total time (optimized from 70.1%)

Note: On new datasets, we recommend running PATAS in shadow mode first to re-calibrate these metrics. PATAS is designed for offline/batch analysis of historical logs (daily/weekly runs), not inline real-time filtering.


2. Architecture & Technical Design

2.1 System Architecture

Design Philosophy: PATAS is designed as an offline/batch analysis system for historical message logs. Typical use case: daily/weekly logs → run PATAS overnight → receive new rules in the morning. It is not designed for inline real-time filtering.

Layered Architecture:

  • API Layer (app/api/) - FastAPI HTTP endpoints
  • Service Layer (app/v2_*.py) - Business logic
  • Repository Layer (app/repositories.py) - Data access
  • Infrastructure - Database, caching, observability

2.2 Core Components

2.2.1 Pattern Mining Pipeline

Two-Stage Approach:

Stage 1: Fast Scanning

  • Large chunks (10K-50K messages)
  • Deterministic patterns only (URLs, keywords, signatures)
  • No LLM/embeddings
  • Fast aggregation (~2-4 min for 500K messages)

Stage 2: Deep Analysis

  • Small chunks (1K-5K messages)
  • Suspicious patterns only (top 2-3% by default)
  • Semantic mining + LLM analysis
  • High quality rules (~3.5 hours for 500K messages)

Benefits:

  • 70-90% cost reduction (only 2-3% of messages use expensive LLM/embeddings)
  • Maintains high quality (deep analysis for important patterns)
  • Scales to millions of messages
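The cost model behind the two stages can be sketched directly: a cheap deterministic scan scores every message, and only the most suspicious ~2-3% proceed to expensive analysis. Scoring signals and keywords here are illustrative placeholders.

```python
# Sketch of the two-stage cost model: deterministic Stage 1 scan over all
# messages, expensive Stage 2 reserved for the top few percent.
# Signals and thresholds are illustrative, not the real PATAS heuristics.
import re

URL_RE = re.compile(r"https?://\S+")

def fast_score(text):
    # Stage 1: deterministic signals only (URLs, keywords) -- no LLM calls.
    score = 2 * len(URL_RE.findall(text))
    score += sum(kw in text.lower() for kw in ("free", "winner", "crypto"))
    return score

def select_for_deep_analysis(messages, fraction=0.03):
    # Stage 2 input: only the most suspicious ~2-3% of messages.
    ranked = sorted(messages, key=fast_score, reverse=True)
    k = max(1, int(len(ranked) * fraction))
    return ranked[:k]

msgs = ["hello there"] * 97 + [
    "FREE crypto at http://scam.example",
    "you are a winner, claim now",
    "meeting at noon",
]
suspicious = select_for_deep_analysis(msgs)  # 3 of 100 messages
```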

2.2.2 Rule Lifecycle Management

States:

  1. CANDIDATE - Newly generated rule, not yet evaluated
  2. SHADOW - Evaluated on historical data, not yet active
  3. ACTIVE - Deployed and monitoring performance
  4. DEPRECATED - Underperforming (>10% precision drop) or manually disabled

Promotion Criteria:

  • Conservative: precision >= 0.95, max 5 false positives
  • Balanced: precision >= 0.90
  • Aggressive: precision >= 0.85
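The promotion check can be expressed as a simple threshold gate. The thresholds below mirror the documented profiles; the dictionary layout and field names are illustrative.

```python
# Promotion criteria per safety profile, as documented above.
# Structure and field names are illustrative, not the real PATAS config schema.
PROFILES = {
    "conservative": {"min_precision": 0.95, "max_false_positives": 5},
    "balanced":     {"min_precision": 0.90, "max_false_positives": None},
    "aggressive":   {"min_precision": 0.85, "max_false_positives": None},
}

def can_promote(evaluation, profile="conservative"):
    p = PROFILES[profile]
    if evaluation["precision"] < p["min_precision"]:
        return False
    fp_cap = p["max_false_positives"]
    return fp_cap is None or evaluation["false_positives"] <= fp_cap

can_promote({"precision": 0.96, "false_positives": 3})               # True
can_promote({"precision": 0.92, "false_positives": 3})               # False
can_promote({"precision": 0.92, "false_positives": 9}, "balanced")   # True
```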

2.2.3 Safety & Quality Assurance

Multi-layer Protection:

  1. SQL Safety Validation:

    • Only SELECT queries (no INSERT/UPDATE/DELETE/DROP)
    • Whitelist tables/columns
    • SQL injection detection
    • Syntax validation via sqlparse
    • "Match-everything" detection (coverage >80% = red flag)
  2. LLM Validation (optional):

    • Logic verification before saving
    • False positive detection
  3. Shadow Evaluation:

    • Testing on historical data before activation
    • Precision, recall, F1-score metrics
    • Automatic deprecation on degradation
  4. Quality Tiers:

    • SAFE_AUTO: precision >= 0.98, low FPR - auto-activate
    • REVIEW_ONLY: precision >= 0.90 - manual review required
    • FEATURE_ONLY: precision < 0.90 - use as ML feature, not standalone rule
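A minimal version of the SQL safety layer can be written with the standard library alone. The real system also uses sqlparse for syntax validation and full table/column whitelists; the patterns and table names below are simplified illustrations.

```python
# Minimal sketch of the SQL-safety checks: SELECT-only, no stacked statements,
# forbidden-keyword scan, and a table whitelist. Simplified for illustration;
# the real system additionally validates syntax via sqlparse.
import re

FORBIDDEN = re.compile(r"\b(INSERT|UPDATE|DELETE|DROP|ALTER|TRUNCATE|GRANT)\b", re.I)
ALLOWED_TABLES = {"messages"}   # illustrative whitelist

def is_safe_rule_sql(sql):
    stripped = sql.strip().rstrip(";")
    if ";" in stripped:                          # reject stacked statements
        return False
    if not stripped.lower().startswith("select"):
        return False
    if FORBIDDEN.search(stripped):               # no mutating keywords
        return False
    tables = set(re.findall(r"\bfrom\s+(\w+)", stripped, re.I))
    return tables <= ALLOWED_TABLES              # whitelist check

is_safe_rule_sql("SELECT id FROM messages WHERE text LIKE '%crypto%'")  # True
is_safe_rule_sql("SELECT id FROM messages; DROP TABLE messages")        # False
```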

2.3 Data Models

Core Entities:

  • Message - Normalized message storage (text, timestamp, is_spam, metadata)
  • Pattern - Discovered spam pattern (type, description, examples)
  • Rule - SQL rule generated from pattern (status, sql_expression, evaluation metrics)
  • RuleEvaluation - Historical evaluation results (precision, recall, coverage)
  • Checkpoint - Progress tracking for incremental mining
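The core entities can be sketched as plain dataclasses. The real system persists them via SQLAlchemy 2.0 models; the field names below follow the descriptions above but are illustrative, not the actual schema.

```python
# Dataclass sketch of the core entities (the real system uses SQLAlchemy 2.0).
# Field names are illustrative, not the actual PATAS schema.
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class Message:
    text: str
    timestamp: datetime
    is_spam: bool
    metadata: dict = field(default_factory=dict)

@dataclass
class Pattern:
    type: str            # e.g. "url", "keyword", "semantic"
    description: str
    examples: list

@dataclass
class Rule:
    pattern_id: int
    sql_expression: str
    status: str = "CANDIDATE"   # CANDIDATE -> SHADOW -> ACTIVE -> DEPRECATED

rule = Rule(pattern_id=1, sql_expression="SELECT id FROM messages WHERE ...")
```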

2.4 Technology Stack

Backend:

  • Python 3.10+
  • FastAPI (async HTTP framework)
  • SQLAlchemy 2.0 (ORM)
  • PostgreSQL (production) / SQLite (development)

ML/AI:

  • OpenAI API (default) or local LLM (vLLM/TGI/Ollama)
  • OpenAI Embeddings (default) or local embeddings (BGE-M3, E5)
  • DBSCAN clustering for semantic similarity

Infrastructure:

  • Redis (distributed locks, caching)
  • Prometheus + Grafana (monitoring)
  • Docker + docker-compose (deployment)
  • OpenTelemetry (observability)

Development:

  • Poetry (dependency management)
  • Pytest (testing)
  • Ruff + Black (linting/formatting)
  • MyPy (type checking)

3. Features & Functionality

3.1 Core Features

3.1.1 Message Ingestion

  • Sources: TAS logs, CSV files, API endpoints
  • Formats: JSONL (recommended), CSV
  • Idempotency: Deduplication via external_id or message_hash
  • Batch processing: Efficient bulk ingestion
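Idempotent ingestion hinges on the dedup key: `external_id` when present, otherwise a content hash. A minimal sketch with a dict standing in for the database (helper names are illustrative):

```python
# Hash-based idempotent ingestion sketch; a dict stands in for the message
# table, and helper names are illustrative.
import hashlib

def message_hash(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def ingest(batch, store):
    # Re-ingesting the same batch is a no-op: dedup via external_id or hash.
    inserted = 0
    for msg in batch:
        key = msg.get("external_id") or message_hash(msg["text"])
        if key not in store:
            store[key] = msg
            inserted += 1
    return inserted

db = {}
ingest([{"text": "free crypto"}, {"text": "hello"}], db)   # 2 inserted
ingest([{"text": "free crypto"}], db)                      # 0 -- deduplicated
```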

3.1.2 Pattern Mining

  • Deterministic patterns: URLs, phone numbers, keywords, signatures
  • Semantic patterns: DBSCAN clustering on embeddings
  • Incremental mining: Process only new messages (5-10x faster)
  • Checkpointing: Resume from interruptions
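Semantic mining groups messages whose embeddings are close. As a simplified, dependency-free stand-in for the DBSCAN step, the sketch below greedily groups vectors by cosine similarity (the real system runs DBSCAN on model embeddings):

```python
# Simplified stand-in for the DBSCAN step: greedy grouping of message
# embeddings by cosine similarity. The real system uses DBSCAN; vectors
# here are tiny toy embeddings.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def cluster(embeddings, threshold=0.95):
    clusters = []
    for i, e in enumerate(embeddings):
        for c in clusters:
            # Join the first cluster whose representative is similar enough.
            if cosine(e, embeddings[c[0]]) >= threshold:
                c.append(i)
                break
        else:
            clusters.append([i])
    return clusters

embs = [(1.0, 0.0), (0.99, 0.05), (0.0, 1.0)]
cluster(embs)   # [[0, 1], [2]]
```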

3.1.3 Rule Generation

  • SQL rules: Transparent, executable SQL expressions
  • LLM-based refinement: Optional pattern explanation and rule optimization
  • Multi-language support: Works with any Unicode text

3.1.4 Rule Evaluation

  • Shadow mode: Test rules on historical data before activation
  • Metrics: Precision, recall, F1-score, coverage, false positive rate
  • Parallel evaluation: 4-16 workers for faster processing
  • Sampling: Optional sampling for very large datasets
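The shadow-mode metrics reduce to standard definitions over the rule's historical matches. A minimal sketch (a "hit" is a message the rule would have matched; names are illustrative):

```python
# Shadow-evaluation metrics: precision, recall, F1, coverage over historical
# data. "hits" is the set of matched message ids; names are illustrative.
def evaluate(hits, labels):
    tp = sum(labels[i] for i in hits)            # true positives
    spam_total = sum(labels.values())
    precision = tp / len(hits) if hits else 0.0
    recall = tp / spam_total if spam_total else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1,
            "coverage": len(hits) / len(labels)}

labels = {1: True, 2: True, 3: True, 4: False}   # id -> is_spam
metrics = evaluate({1, 2, 4}, labels)
# precision = 2/3, recall = 2/3, coverage = 3/4
```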

3.1.5 Rule Promotion

  • Safety profiles: Conservative, Balanced, Aggressive
  • Custom profiles: Configurable thresholds
  • Automatic deprecation: Monitor active rules, deprecate on degradation
  • Export backends: SQL, ROL, platform-specific formats

3.2 API Endpoints

Core Endpoints:

  • POST /api/v1/messages/ingest - Ingest messages
  • POST /api/v1/patterns/mine - Run pattern mining
  • GET /api/v1/patterns - List patterns
  • GET /api/v1/rules - List rules (with filtering, explanations, risk assessment)
  • POST /api/v1/rules/eval-shadow - Evaluate shadow rules
  • POST /api/v1/rules/promote - Promote/rollback rules
  • GET /api/v1/rules/export - Export rules
  • POST /api/v1/analyze - High-level batch analysis
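For illustration, an ingest request might look as follows. The payload field names follow the data model described above but are assumptions, not a documented schema; the `X-API-Key` header name is likewise an assumption.

```python
# Illustrative request body for POST /api/v1/messages/ingest.
# Field names and the auth header are assumptions, not a documented schema.
import json

payload = {
    "messages": [
        {"external_id": "tg-1001",
         "text": "FREE crypto at http://scam.example",
         "timestamp": "2025-01-27T10:00:00Z",
         "is_spam": True},
    ]
}
body = json.dumps(payload)
# e.g. with curl (assumed header name):
#   curl -X POST http://localhost:8000/api/v1/messages/ingest \
#        -H "X-API-Key: $PATAS_API_KEY" -H "Content-Type: application/json" \
#        -d @payload.json
```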

Legacy v1 Endpoints:

  • POST /v1/classify - Single message classification
  • POST /v1/train - Submit training example
  • GET /v1/stats - System statistics

3.3 CLI Commands

patas ingest-logs          # Ingest messages
patas mine-patterns        # Discover patterns
patas eval-rules           # Evaluate rules
patas promote-rules        # Promote/rollback rules
patas list-checkpoints     # List mining checkpoints

3.4 Advanced Features

3.4.1 Incremental Mining

  • Process only new messages (after last checkpoint)
  • 5-10x faster than full mining
  • Suitable for daily operations
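The checkpoint mechanism reduces each run to the messages newer than the last processed one. A minimal sketch (names are illustrative):

```python
# Checkpoint-based incremental mining sketch: only messages newer than the
# last checkpoint are processed. Names are illustrative.
def incremental_batch(messages, checkpoint):
    # checkpoint: id of the last processed message (None on first run)
    new = [m for m in messages if checkpoint is None or m["id"] > checkpoint]
    next_checkpoint = max((m["id"] for m in new), default=checkpoint)
    return new, next_checkpoint

msgs = [{"id": 1}, {"id": 2}, {"id": 3}]
batch, ckpt = incremental_batch(msgs, checkpoint=2)   # only id 3 is new
```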

3.4.2 Distributed Processing

  • Redis-based distributed locks
  • Multi-instance coordination
  • Horizontal scaling support (with data sharding)

3.4.3 Cost Optimization

  • Two-stage processing (70-90% cost reduction)
  • Local LLM/embeddings support (zero API costs)
  • LLM/embedding caching
  • Cost guard with budget alerts

3.4.4 Observability

  • OpenTelemetry tracing
  • Prometheus metrics
  • Grafana dashboards
  • Structured logging
  • Audit trails

4. Performance & Scalability

4.1 Performance Metrics

Typical values observed on internal benchmark dataset. Performance may vary based on dataset characteristics, hardware, and configuration.

Pattern Mining (Two-Stage), run in offline/batch mode (see Section 2.1):

| Messages | Stage 1 | Stage 2 | Total   | Stage 2 % |
|----------|---------|---------|---------|-----------|
| 100K     | ~30 sec | ~20 min | ~21 min | 2-3%      |
| 500K     | ~2 min  | ~3.5 h  | ~3.5 h  | 2-3%      |
| 1M       | ~4 min  | ~7 h    | ~7 h    | 2-3%      |
| 10M      | ~42 min | ~70 h   | ~3 days | 2-3%      |

Shadow Evaluation:

| Rules | Sequential | Parallel (4) | Parallel (8) |
|-------|------------|--------------|--------------|
| 100   | ~3 hours   | ~45 min      | ~25 min      |
| 1,000 | ~30 hours  | ~7.5 hours   | ~4 hours     |
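The parallel speedups come from evaluating independent rules in a worker pool. A minimal sketch using the standard library (`evaluate_rule` is a placeholder for the real per-rule work):

```python
# Parallel shadow evaluation sketch: independent rules fan out to a worker
# pool. evaluate_rule is a placeholder for the real per-rule evaluation.
from concurrent.futures import ThreadPoolExecutor

def evaluate_rule(rule_id):
    return {"rule": rule_id, "precision": 0.95}   # placeholder work

def evaluate_all(rule_ids, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map preserves input order in the results
        return list(pool.map(evaluate_rule, rule_ids))

results = evaluate_all(range(100), workers=8)
```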

Ingestion:

  • Throughput: ~200 msg/sec (end-to-end including validation, ORM, and DB writes)
  • 500K messages: ~42 min

4.2 Scalability Strategies

For 10M+ Messages:

  1. Incremental mining (process only new messages)
  2. Parallel evaluation (8-16 workers)
  3. Rule filtering (top-N by quality tier)
  4. Sampling for very large datasets
  5. Local models (eliminate API costs)

Horizontal Scaling:

  • Current: Distributed locks prevent concurrent processing
  • Solution: Data sharding (split by message_id or timestamp)
  • Roadmap: Automatic sharding (P1)

4.3 Cost Estimates

Example cost profile for OpenAI-based deployment. Actual costs depend on number of patterns found, message characteristics, and LLM usage patterns.

OpenAI Mode (per 500K messages):

  • Embeddings: ~$0.06 (only 2-3% in Stage 2)
  • LLM: ~$91 (rule generation)
  • Total: ~$91 per run

Monthly (weekly runs):

  • 500K messages/week: ~$364/month
  • 1M messages/week: ~$728/month

Local Models:

  • Zero API costs
  • Infrastructure only (GPU server)
  • Break-even: ~2-3 months for 500K/week
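The break-even arithmetic can be checked directly from the figures above. The one-off GPU server cost below is an assumed placeholder, not a quoted price:

```python
# Back-of-the-envelope check of the cost figures above.
# gpu_server_upfront is an assumed placeholder; API figures come from the table.
cost_per_run = 0.06 + 91          # embeddings + LLM, per 500K messages
runs_per_month = 4                # weekly runs
api_monthly = cost_per_run * runs_per_month          # ~$364/month

gpu_server_upfront = 900          # assumed one-off local-infra cost
break_even_months = gpu_server_upfront / api_monthly  # ~2.5 months
```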

5. Safety & Security

5.1 Safety Profiles

Conservative (default):

  • Precision >= 0.95
  • Max 5 false positives
  • Recall ~0.08-0.10
  • Recommended for production use in high-risk environments (e.g., Telegram); minimizes false positives at the cost of lower recall

Balanced:

  • Precision >= 0.90
  • Recall ~0.10-0.15
  • Suitable for controlled scenarios and internal experiments

Aggressive:

  • Precision >= 0.85
  • Recall ~0.15-0.20
  • Not recommended for production without careful evaluation
  • Use only in controlled scenarios with extensive shadow testing

Custom Profiles:

  • Configurable thresholds
  • Per-use-case optimization
  • For high-risk environments, we recommend starting with Conservative and adjusting only after thorough shadow evaluation

5.2 Security Features

API Security:

  • API key authentication
  • Rate limiting
  • WAF (pattern-based attack detection)
  • IP whitelisting (optional)

Data Protection:

  • PII redaction (SSN, passport, bank accounts, driver licenses)
  • Privacy modes (STANDARD, STRICT)
  • On-premise deployment support (GDPR-friendly; helps meet internal compliance requirements)
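As an illustration of the redaction pass, a single simplified pattern is shown below; the real system covers SSNs, passports, bank accounts, and driver licenses, with patterns that are more robust than this sketch:

```python
# Illustrative PII-redaction pass. Pattern deliberately simplified; the real
# system covers SSNs, passports, bank accounts, and driver licenses.
import re

SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def redact(text):
    return SSN_RE.sub("[REDACTED-SSN]", text)

redact("my ssn is 123-45-6789")   # "my ssn is [REDACTED-SSN]"
```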

SQL Safety:

  • Only SELECT queries
  • Whitelist validation
  • SQL injection detection
  • Syntax validation

5.3 Audit & Compliance

  • Audit logging (all operations)
  • Request tracing (OpenTelemetry)
  • Retention policies (configurable)
  • Security audit checklist

6. Deployment & Operations

6.1 Deployment Options

On-Premise:

  • Docker + docker-compose
  • PostgreSQL + Redis
  • Local LLM/embeddings (vLLM, TGI, Ollama)
  • Air-gapped deployment support

Cloud:

  • Hetzner Cloud VPS (recommended)
  • AWS/GCP/Azure compatible
  • Managed PostgreSQL/Redis

6.2 Infrastructure Requirements

Minimum:

  • 2 CPU cores
  • 4 GB RAM
  • 20 GB storage
  • PostgreSQL 12+
  • Redis 6+ (optional, for distributed locks)

Recommended (production):

  • 4+ CPU cores
  • 8+ GB RAM
  • 100+ GB storage (for 10M+ messages)
  • PostgreSQL 14+ with connection pooling
  • Redis 7+ for caching and locks
  • GPU (for local LLM/embeddings)

6.3 Monitoring & Alerting

Metrics:

  • Prometheus (system metrics)
  • Custom metrics (pattern mining, rule evaluation)
  • Cost tracking (LLM usage)

Dashboards:

  • Grafana (pre-configured dashboards)
  • Pattern mining progress
  • Rule performance
  • System health

Alerting:

  • AlertManager integration
  • Rule degradation alerts
  • Cost budget alerts
  • System health alerts

6.4 Maintenance

Database:

  • Automatic migrations
  • Retention policies (configurable)
  • Backup scripts (pg_dump)

Updates:

  • Zero-downtime deployments
  • Secret rotation support
  • Graceful degradation

7. Roadmap & Future Enhancements

7.1 P1: Integration Enhancements (Month 1-2)

Streaming Ingestion:

  • Kafka/RabbitMQ consumer
  • Real-time message processing
  • Automatic backpressure handling

UI for Rule Management:

  • Web interface for viewing/editing rules
  • Quick disable/enable
  • Rule performance dashboards
  • Moderator feedback interface

Unicode & Emoji Handling:

  • Improved Unicode normalization
  • Better emoji processing
  • RTL language support

Multilingual Support:

  • Multilingual embeddings (BGE-M3, E5)
  • Language-specific normalization
  • Cross-language pattern detection

7.2 P2: Scaling Optimizations (Month 3-6)

Sharded Evaluation:

  • Parallel evaluation across message shards
  • Distributed evaluation for 10M+ messages

Database Optimization:

  • Table partitioning by timestamp
  • Read replicas for evaluation queries
  • Optimized indexes

Retention & Archiving:

  • Automatic data retention policies
  • Cold storage archiving
  • Configurable TTL

Enhanced Risk Assessment:

  • Whitelist support
  • Context-aware risk scoring
  • LLM-based false positive detection

7.3 P3: Advanced Features (Month 6+)

A/B Testing:

  • Shadow mode for partial traffic
  • Comparison metrics
  • Automatic promotion on improvement

Feedback Loop:

  • Moderator feedback API
  • Automatic rule deprecation from feedback
  • Pattern improvement from corrections

Cost Optimization:

  • Automatic threshold adjustment based on budget
  • Smart batching for LLM calls
  • Cost guard with auto-scaling

Dark Launch:

  • Gradual rule rollout
  • Canary deployments
  • Automatic rollback on degradation

8. Competitive Analysis & Positioning

8.1 Comparison with Traditional Approaches

Traditional Rule-Based Systems:

  • ❌ Manual rule creation
  • ❌ Static rules, no adaptation
  • ✅ High precision
  • ❌ Low recall

ML-Based Systems:

  • ✅ Automatic pattern discovery
  • ✅ Adaptive to new patterns
  • ❌ High false positive rate
  • ❌ Expensive (process all messages)
  • ❌ Black box (hard to explain)

PATAS:

  • ✅ Automatic pattern discovery
  • ✅ Adaptive to new patterns
  • ✅ High precision (0.93-0.97 observed on benchmark dataset)
  • ✅ Low false positive rate (<0.15% observed on benchmark dataset)
  • ✅ Transparent SQL rules
  • ✅ Cost-effective (70-90% reduction vs. naive approach)
  • ✅ On-premise deployment

8.2 Unique Selling Points

  1. Two-stage processing - Only 2-3% of messages use expensive LLM/embeddings
  2. Transparent rules - SQL expressions, not black box ML
  3. Safety-first - Multiple safety profiles, shadow evaluation
  4. On-premise - Full control, designed for GDPR-friendly deployment
  5. Cost-effective - 70-90% cost reduction vs. naive approach

9. Risk Assessment & Mitigation

9.1 Technical Risks

Risk: High false positive rate

  • Mitigation: Conservative profile by default, shadow evaluation, automatic deprecation

Risk: SQL injection in generated rules

  • Mitigation: SQL safety validation, whitelist, syntax validation

Risk: LLM API failures

  • Mitigation: Graceful degradation, local LLM support, retries with backoff

Risk: Performance degradation at scale

  • Mitigation: Two-stage processing, incremental mining, parallel evaluation, horizontal scaling

9.2 Operational Risks

Risk: Data privacy violations

  • Mitigation: PII redaction, privacy modes, on-premise deployment, GDPR-friendly design with built-in compliance tools

Risk: High LLM costs

  • Mitigation: Two-stage processing, local models, caching, cost guard

Risk: Rule quality degradation

  • Mitigation: Shadow evaluation, automatic deprecation, quality tiers

9.3 Business Risks

Risk: Low adoption due to complexity

  • Mitigation: Comprehensive documentation, CLI tools, API quickstart, examples

Risk: Competition from established players

  • Mitigation: Open-source, on-premise focus, transparent rules, cost-effectiveness

10. Success Criteria & KPIs

10.1 Technical KPIs

Target metrics based on observed performance on internal benchmark dataset. Actual values may vary on new datasets and should be re-calibrated in shadow mode.

  • Precision: >= 0.93 (conservative profile, observed on benchmark)
  • False positive rate: < 0.15% (observed on benchmark)
  • Coverage: 5-8% of spam messages (observed on benchmark)
  • Processing time: < 4 hours for 500K messages (batch/offline mode)
  • Cost reduction: 70-90% vs. naive approach (typical range)

10.2 Business KPIs

  • Adoption: 3+ production deployments in 6 months
  • User satisfaction: > 4.0/5.0 (if survey conducted)
  • Community: Active GitHub stars, contributions
  • Documentation: Complete wiki, examples, tutorials

10.3 Operational KPIs

  • Uptime: > 99.5%
  • Mean time to recovery: < 1 hour
  • API response time: P95 < 500ms
  • Error rate: < 0.1%

11. Recommendations for MVP

11.1 Current State Assessment

Strengths:

  • ✅ Complete two-stage pipeline implementation
  • ✅ Comprehensive API and CLI
  • ✅ Safety profiles and quality tiers
  • ✅ Shadow evaluation and automatic deprecation
  • ✅ On-premise deployment support
  • ✅ Extensive documentation

Gaps:

  • ⚠️ No streaming ingestion (batch only)
  • ⚠️ No UI for rule management
  • ⚠️ Limited horizontal scaling (requires manual sharding)
  • ⚠️ No feedback loop from moderators
  • ⚠️ Limited multilingual support (Unicode issues)

11.2 MVP Recommendations

Must-Have (P0):

  1. ✅ Two-stage pattern mining (DONE)
  2. ✅ Rule lifecycle management (DONE)
  3. ✅ Shadow evaluation (DONE)
  4. ✅ Safety profiles (DONE)
  5. ✅ API and CLI (DONE)
  6. ✅ Documentation (DONE)

Should-Have (P1):

  1. Streaming ingestion (Kafka/RabbitMQ)
  2. UI for rule management
  3. Improved Unicode/emoji handling
  4. Automatic horizontal scaling (sharding)

Nice-to-Have (P2):

  1. Moderator feedback loop
  2. A/B testing
  3. Advanced cost optimization
  4. Dark launch support

11.3 Next Steps

  1. Pilot Deployment:

    • Deploy to pilot/production environment
    • Run in shadow mode first to re-calibrate metrics on real data
    • Monitor performance and costs
    • Collect user feedback
  2. P1 Features:

    • Implement streaming ingestion
    • Build UI for rule management
    • Improve Unicode handling
  3. P2 Features:

    • Implement feedback loop
    • Add A/B testing
    • Advanced optimizations

12. Conclusion

PATAS is a pilot-ready MVP with production-grade core functionality:

  • ✅ Complete two-stage pattern mining pipeline
  • ✅ Comprehensive API and CLI interfaces
  • ✅ Safety-first design with shadow evaluation
  • ✅ Cost-effective architecture (70-90% reduction)
  • ✅ On-premise deployment support
  • ✅ Extensive documentation

Ready for:

  • Production pilot deployments (with shadow mode calibration)
  • Integration with messaging platforms
  • Community adoption

Note: Core functionality is mature and production-grade. Enterprise features (streaming ingestion, UI, automatic scaling) are planned for P1/P2 roadmap after successful pilot feedback.

Future focus:

  • Streaming ingestion (P1)
  • UI development (P1)
  • Advanced scaling (P1/P2)
  • Feedback loops (P2)

Appendix A: Glossary

  • Pattern: Recurring spam characteristic (URL, keyword, semantic similarity)
  • Rule: SQL expression that matches spam messages
  • Shadow Evaluation: Testing rules on historical data before activation
  • Two-Stage Processing: Fast scanning (Stage 1) + deep analysis (Stage 2)
  • Safety Profile: Aggressiveness level (Conservative, Balanced, Aggressive)
  • Quality Tier: Rule quality classification (SAFE_AUTO, REVIEW_ONLY, FEATURE_ONLY)
  • Checkpoint: Progress tracking for incremental mining
  • Incremental Mining: Processing only new messages (after last checkpoint)



Document Version: 1.0
Last Updated: 2025-01-27
Author: KikuAI Lab
Status: Final

Note: This PRD is maintained as part of the codebase. If code changes, the PRD is updated as the last step to ensure consistency.

