# Scaling and Cost Design

How PATAS handles millions of logs without breaking infrastructure or budget.
## Chunked Processing

Problem: Processing millions of messages at once exhausts memory and API rate limits.

Solution: Split data into configurable chunks (default: 10,000 messages per chunk).
Benefits:
- Memory efficient: ~25 MB per chunk vs GBs for full dataset
- Parallelizable: Chunks can be processed independently
- Fault-tolerant: Single chunk failure doesn't crash entire pipeline
- Progressive: Results available before full dataset completion
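The chunked flow can be sketched as follows. This is a minimal illustration; `iter_chunks` and the `mine_all` driver are hypothetical names, not PATAS's actual internals:

```python
from typing import Callable, Iterator, List

def iter_chunks(messages: List[str], chunk_size: int = 10_000) -> Iterator[List[str]]:
    """Yield fixed-size slices so only one chunk is resident in memory at a time."""
    for start in range(0, len(messages), chunk_size):
        yield messages[start:start + chunk_size]

def mine_all(messages: List[str],
             mine_chunk: Callable[[List[str]], list],
             chunk_size: int = 10_000) -> list:
    """Fault-tolerant driver: a failing chunk is skipped, partial results survive."""
    patterns: list = []
    for i, chunk in enumerate(iter_chunks(messages, chunk_size)):
        try:
            patterns.extend(mine_chunk(chunk))
        except Exception as exc:
            # A single bad chunk must not crash the pipeline
            print(f"chunk {i} failed, skipping: {exc}")
    return patterns
```

Because each chunk is independent, the loop body is also trivially parallelizable (e.g. with a process pool).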
Configuration:

```bash
PATTERN_MINING_CHUNK_SIZE=10000
```

## Two-Stage Pipeline

Problem: Running expensive LLM/embedding analysis on all messages is cost-prohibitive at scale.
Solution: Split processing into two stages.

Stage 1 (deterministic triage):
- Large chunks (10,000 messages)
- Deterministic patterns only (URLs, keywords, signatures)
- No LLM or embedding calls
- Fast aggregation

Stage 2 (semantic deep analysis):
- Small chunks (1,000 messages)
- Top 3-10% suspicious patterns from Stage 1
- Semantic mining with embeddings
- LLM analysis for quality
- High-quality rules
Benefits:
- Cost reduction: 50-80% fewer API calls
- Same quality: Deep analysis applied where it matters
- Faster: Parallel Stage 1 for bulk, focused Stage 2 for quality
- Scalable: Handles millions of messages efficiently
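A minimal sketch of the two-stage idea, assuming a cheap deterministic Stage 1 scorer and a pluggable Stage 2 analyzer. All names here are illustrative, not the actual PATAS API:

```python
import re
from typing import Callable, List

URL_RE = re.compile(r"https?://\S+")
KEYWORDS = ("free money", "click here", "limited offer")  # illustrative keyword list

def suspiciousness(message: str) -> float:
    """Stage 1: cheap deterministic score; no LLM or embedding calls."""
    score = 0.0
    if URL_RE.search(message):
        score += 0.5
    score += 0.5 * sum(kw in message.lower() for kw in KEYWORDS)
    return score

def two_stage(messages: List[str],
              deep_analyze: Callable[[List[str]], list],
              threshold: float = 0.03) -> list:
    """Score everything cheaply, then run expensive analysis on the top slice only."""
    ranked = sorted(messages, key=suspiciousness, reverse=True)
    top_n = max(1, int(len(ranked) * threshold))  # e.g. top 3%
    return deep_analyze(ranked[:top_n])  # Stage 2: embeddings/LLM on a small subset
```

With `threshold=0.03`, the expensive `deep_analyze` callable sees only 3% of the input, which is where the API-call savings come from.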
Configuration:

```bash
ENABLE_TWO_STAGE_PROCESSING=true
SUSPICIOUSNESS_THRESHOLD=0.03  # Top 3% for Stage 2
```

## Fast Clustering (DBSCAN)

Problem: Naive similarity clustering is O(n²), which is prohibitively slow at scale.
Solution: DBSCAN (Density-Based Spatial Clustering of Applications with Noise) with cosine distance.
Benefits:
- Fast: O(n log n) with optimizations (10-100x faster than naive)
- Automatic: No need to specify number of clusters
- Noise-resistant: Automatically filters outliers
- Quality: Finds clusters of arbitrary shape
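With scikit-learn available, the clustering step can be sketched like this. The function name and wiring are illustrative; only the parameters mirror the configuration values in this section:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_embeddings(embeddings: np.ndarray,
                       similarity_threshold: float = 0.75,
                       min_cluster_size: int = 3) -> np.ndarray:
    """Cluster message embeddings; the label -1 marks noise (outliers)."""
    # Cosine distance = 1 - cosine similarity, so eps mirrors the threshold
    eps = 1.0 - similarity_threshold
    return DBSCAN(eps=eps,
                  min_samples=min_cluster_size,
                  metric="cosine").fit_predict(embeddings)
```

Note that DBSCAN needs no cluster count up front: the number of clusters falls out of the density structure, and sparse points are returned as noise rather than forced into a cluster.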
Parameters (tuned for spam detection):

```bash
USE_DBSCAN_CLUSTERING=true
SEMANTIC_SIMILARITY_THRESHOLD=0.75
SEMANTIC_MIN_CLUSTER_SIZE=3
```

## Batching & Caching

Problem: Embedding APIs have per-request limits (OpenAI: 2,048 texts per request), and repeated analysis wastes API calls.
Solution: Automatic batching + TTL cache.

Batching:
- Splits large requests into API-compliant batches (2,048 texts)
- Parallel processing where possible
- Respects rate limits

Caching:
- Caches embeddings for 7 days (TTL)
- 50-80% hit rate on repeated analyses
- Reduces redundant API calls
Benefits:
- API compliance: Never exceeds limits
- Cost reduction: 50-80% fewer embedding calls (with cache)
- Faster: Cached embeddings retrieved instantly
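A simplified sketch of batching plus a TTL cache. The in-memory dict and the `embed_batch` callable are stand-ins for illustration; the real cache layer and API client may differ:

```python
import time
from typing import Callable, Dict, List, Tuple

MAX_BATCH = 2048           # OpenAI per-request text limit
CACHE_TTL = 7 * 24 * 3600  # 7 days, in seconds

_cache: Dict[str, Tuple[float, list]] = {}  # text -> (stored_at, embedding)

def embed_all(texts: List[str],
              embed_batch: Callable[[List[str]], List[list]]) -> List[list]:
    """Serve embeddings from the TTL cache, batching only the cache misses."""
    now = time.time()
    # Deduplicate while preserving order, then keep only expired/missing texts
    misses = [t for t in dict.fromkeys(texts)
              if t not in _cache or now - _cache[t][0] > CACHE_TTL]
    # Split misses into API-compliant batches of at most MAX_BATCH texts
    for start in range(0, len(misses), MAX_BATCH):
        batch = misses[start:start + MAX_BATCH]
        for text, vec in zip(batch, embed_batch(batch)):
            _cache[text] = (now, vec)
    return [_cache[t][1] for t in texts]
```

On a repeated analysis, every text already in the cache costs zero API calls, which is where the quoted 50-80% hit rate pays off.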
Configuration:
EMBEDDING_BATCH_SIZE=2048 # AutomaticProblem: One-size-fits-all doesn't work for different spam distributions.
Solution: Three configurable profiles.
Ultra-conservative:
- Only the strongest, most obvious patterns
- Minimum API usage
- Best for: highly distributed spam, tight budgets

Conservative (default):
- Strong patterns with good coverage
- Moderate API usage
- Best for: typical spam distributions

Balanced:
- Broader pattern coverage
- Higher API usage
- Best for: moderate spam concentration

Aggressive:
- Maximum pattern coverage
- Highest API usage
- Best for: concentrated spam (one dominant pattern)
Configuration:

```bash
# Choose one profile
SUSPICIOUSNESS_THRESHOLD=0.01  # Ultra-conservative
SUSPICIOUSNESS_THRESHOLD=0.03  # Conservative (default)
SUSPICIOUSNESS_THRESHOLD=0.05  # Balanced
SUSPICIOUSNESS_THRESHOLD=0.15  # Aggressive
```

## Simple Mode

Problem: Teams may want to start with transparent, understandable processing.

Solution: Disable two-stage processing for a simple, deterministic-only mode.
Simple mode characteristics:
- Single-stage pipeline
- Deterministic patterns only (URLs, keywords, regex)
- No LLM or embeddings
- Fully transparent and explainable
- Zero AI costs
Enable simple mode:

```bash
patas mine-patterns --no-two-stage
# Or in config
ENABLE_TWO_STAGE_PROCESSING=false
```

When to use:
- Initial testing and validation
- When LLM/embeddings unavailable
- When full transparency required
- When cost must be minimized
## Graceful Degradation

Problem: External dependencies (OpenAI, sklearn) may be unavailable.
Solution: Automatic fallbacks at every level.
Fallback chain:
- DBSCAN unavailable → Naive clustering
- Embeddings unavailable → Deterministic patterns only
- LLM unavailable → Skip LLM analysis
- Cache unavailable → Direct API calls
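The fallback chain can be implemented with import-time probing, sketched here with illustrative feature names (not PATAS's actual internals):

```python
def resolve_features() -> dict:
    """Probe optional dependencies once; each absence degrades to a simpler mode."""
    features = {}
    try:
        import sklearn.cluster  # noqa: F401
        features["clustering"] = "dbscan"
    except ImportError:
        features["clustering"] = "naive"   # O(n^2) fallback clustering
    try:
        import openai  # noqa: F401
        features["embeddings"] = "api"
    except ImportError:
        features["embeddings"] = "none"    # deterministic patterns only
    return features
```

Each probe fails closed to a simpler method, so a missing dependency changes quality, never availability.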
Benefits:
- Never crashes on missing dependencies
- Degrades gracefully to simpler methods
- Always produces results (even if lower quality)
## Cost & Resource Estimates

Example: 1,000,000 messages per run.

Single-stage (all messages embedded):
- Messages processed: 1,000,000
- Embedding API calls: ~489 (1M / 2,048)
- Estimated cost: ~$8-10 per run
- Monthly cost: ~$240-300

Two-stage:
- Stage 1: 1,000,000 messages (deterministic, no API calls)
- Stage 2: ~100,000-300,000 messages (matching the top 3-10% of suspicious patterns)
- Embedding API calls: ~49-147 (100K-300K / 2,048)
- Estimated cost: ~$0.80-2.50 per run
- Monthly cost: ~$24-75
- Savings: 70-90%
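The call counts above follow from simple ceiling division over the batch size:

```python
import math

def embedding_calls(n_texts: int, batch_size: int = 2048) -> int:
    """API requests needed when texts are packed into maximal batches."""
    return math.ceil(n_texts / batch_size)

# 1,000,000 / 2,048 -> 489 calls; 100,000 -> 49; 300,000 -> 147
```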
Memory:
- Per chunk: ~25 MB (10K messages)
- Peak: ~300 MB (embeddings for 50K messages)
- Database: Indexed queries, <1 s response time

CPU:
- Stage 1: Regex/string matching (fast)
- DBSCAN: Multi-core (uses all CPUs)
- Overall: Low CPU usage (most time spent waiting on APIs)

Network:
- API calls: Batched and cached
- Rate limits: Automatically respected
- Retries: Built-in with exponential backoff
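Retry with exponential backoff can be sketched as follows. This is a generic helper, not PATAS's actual implementation; the injectable `sleep` parameter exists only to make the helper testable:

```python
import random
import time
from typing import Callable

def with_retries(call: Callable[[], object],
                 max_attempts: int = 5,
                 base_delay: float = 1.0,
                 sleep: Callable[[float], None] = time.sleep):
    """Retry a rate-limited API call, doubling the wait after each failure."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error
            # 1s, 2s, 4s, ... plus jitter to avoid synchronized retries
            sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1))
```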
## Summary

PATAS is designed for large-scale deployment:
| Feature | Benefit | Impact |
|---|---|---|
| Chunked processing | Memory efficient | Handles unlimited dataset size |
| Two-stage pipeline | Cost reduction | 50-80% fewer API calls |
| Fast clustering | Performance | 10-100x faster clustering |
| Batching & caching | API optimization | 50-80% fewer redundant calls |
| Aggressiveness profiles | Flexibility | Tune for your use case |
| Simple mode | Transparency | Fallback to deterministic only |
| Graceful degradation | Reliability | Never crashes on missing deps |
Result: Process millions of messages efficiently without infrastructure strain or budget overruns.
Real-world validation:
- Tested on: 20,000 real moderator reports from the platform
- Performance: 37 patterns found in ~25 seconds
- Savings: 50-80% API cost reduction