Performance
Nick edited this page Nov 24, 2025 · 2 revisions
Real-world performance metrics and cost estimates for PATAS pattern mining.
| Messages | Stage 1 Time | Stage 2 Time | Total Time | Sent to Stage 2 | Cost Savings |
|---|---|---|---|---|---|
| 100K | ~30 sec | ~20 min | ~21 min | 2-3% | 97-98% |
| 500K | ~2 min | ~3.5 hours | ~3.5 hours | 2-3% | 97-98% |
| 1M | ~4 min | ~7 hours | ~7 hours | 2-3% | 97-98% |
| 10M | ~42 min | ~70 hours | ~3 days | 2-3% | 97-98% |
Notes:
- Stage 1: Fast deterministic patterns (URLs, keywords, signatures) - no LLM/embeddings
- Stage 2: Deep semantic analysis only for suspicious patterns (top 2-3% by default)
- Cost savings: Only 2-3% of messages require expensive LLM/embedding analysis
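The two-stage split above can be sketched as a triage step: score every message with cheap deterministic checks, then forward only the top few percent to the expensive semantic stage. This is a minimal illustration; the scoring rules, function names, and the 2.5% cutoff are hypothetical, not the actual PATAS implementation.

```python
# Minimal sketch of two-stage triage: cheap deterministic scoring first,
# expensive semantic analysis only for the top ~2-3% of messages.
# All names and thresholds here are illustrative, not PATAS internals.
import re

URL_RE = re.compile(r"https?://\S+")
SPAM_KEYWORDS = {"free", "winner", "crypto"}

def stage1_score(message: str) -> float:
    """Fast deterministic scoring: no LLM or embedding calls."""
    score = 0.5 * len(URL_RE.findall(message))
    score += sum(0.3 for word in SPAM_KEYWORDS if word in message.lower())
    return score

def triage(messages, stage2_fraction=0.025):
    """Return the top ~2-3% of messages (by Stage 1 score) for Stage 2."""
    ranked = sorted(messages, key=stage1_score, reverse=True)
    cutoff = max(1, int(len(ranked) * stage2_fraction))
    return ranked[:cutoff]  # only these reach semantic mining / LLM

msgs = ["hello there"] * 97 + ["FREE crypto winner http://x.test"] * 3
suspicious = triage(msgs)
print(len(suspicious))  # 2, i.e. ~2-3% of 100 messages
```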
Performance Improvements:
- Reduced commit frequency: batch_size 10→25, commit intervals 5→10
- Lazy ham message loading: fixed sample size (1000) instead of proportional
- Parallel chunk processing: 2-3x speedup for large datasets
- Result: Pattern Mining time reduced from 70.1% to ~60-65% of total time
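The batched-commit change above (commit every N records instead of per record) can be sketched as follows. The store and function names are hypothetical stand-ins, not the actual PATAS persistence layer.

```python
# Illustrative sketch of batched commits: flush every `batch_size` records
# instead of committing per record. Names are hypothetical.

class FakeStore:
    """Stand-in for a database session that counts commits."""
    def __init__(self):
        self.rows, self.commits = [], 0
    def add(self, row):
        self.rows.append(row)
    def commit(self):
        self.commits += 1

def save_patterns(store, patterns, batch_size=25):
    pending = 0
    for p in patterns:
        store.add(p)
        pending += 1
        if pending == batch_size:
            store.commit()
            pending = 0
    if pending:
        store.commit()  # flush the final partial batch

store = FakeStore()
save_patterns(store, range(100), batch_size=25)
print(store.commits)  # 4 commits instead of 100 with per-record commits
```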
Throughput:
- 615-2,224 messages/second (depending on dataset size)
- Sub-linear scaling: 3.17x time increase for 10x data
- Low memory usage: average peak 18.4 MB
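The sub-linear scaling claim can be sanity-checked from the figures above: a 3.17x time increase for 10x data corresponds to an empirical exponent of about 0.5 in T ~ N^alpha, i.e. roughly square-root scaling.

```python
import math

# From the figures above: 10x more data took only ~3.17x longer,
# so the empirical exponent alpha in T ~ N**alpha is:
alpha = math.log(3.17) / math.log(10)
print(round(alpha, 2))  # 0.5, i.e. roughly square-root scaling
```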
| Rules to Evaluate | Sequential Time | Parallel (4 workers) | Parallel (8 workers) |
|---|---|---|---|
| 100 | ~3 hours | ~45 min | ~25 min |
| 1,000 | ~30 hours | ~7.5 hours | ~4 hours |
| 10,000 | ~300 hours | ~75 hours | ~40 hours |
Optimization:
- Use `max_shadow_rules_to_evaluate` to limit evaluation to the top-N rules by quality tier
- Use `shadow_evaluation_parallel_workers` to enable parallel evaluation
- Use `shadow_evaluation_sample_size` for sampling on very large datasets
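The three knobs above can be sketched together: cap the rule list at top-N, sample the message set, and fan evaluation out to a worker pool. The evaluation logic and function names here are illustrative stand-ins, not the PATAS shadow evaluator.

```python
# Sketch of the three tuning knobs above, with stand-in evaluation logic.
from concurrent.futures import ThreadPoolExecutor
import random

def evaluate_rule(rule, sample):
    """Stand-in for one shadow evaluation: count sample messages a rule hits."""
    return rule, sum(1 for m in sample if rule in m)

def shadow_evaluate(rules, messages, top_n=1000, sample_size=10000, workers=8):
    rules = rules[:top_n]                                  # max_shadow_rules_to_evaluate
    sample = random.sample(messages, min(sample_size, len(messages)))  # shadow_evaluation_sample_size
    with ThreadPoolExecutor(max_workers=workers) as pool:  # shadow_evaluation_parallel_workers
        return dict(pool.map(lambda r: evaluate_rule(r, sample), rules))

msgs = ["buy crypto now", "team standup at 10", "crypto giveaway"] * 100
results = shadow_evaluate(["crypto", "standup", "lottery"], msgs,
                          top_n=2, sample_size=150, workers=4)
print(sorted(results))  # only the top-2 rules were evaluated
```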
| Messages | Time (Batch) | Throughput |
|---|---|---|
| 100K | ~8 min | ~200 msg/sec |
| 500K | ~42 min | ~200 msg/sec |
| 1M | ~1.4 hours | ~200 msg/sec |
Per-run cost (500K messages):
- Embeddings: ~$0.06 (only 2-3% of messages reach Stage 2)
- LLM (rule generation): ~$91 (depends on the number of patterns found)
- Total: ~$91 per run

Per-run cost (1M messages):
- Embeddings: ~$0.12
- LLM (rule generation): ~$182
- Total: ~$182 per run
- 500K messages/week: ~$364/month
- 1M messages/week: ~$728/month
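The monthly figures follow directly from the per-run costs, assuming one run per week and ~4 weeks per month:

```python
# Reproducing the cost arithmetic above: embeddings + LLM per run,
# one run per week, ~4 weeks per month.
per_run = {"500K": 0.06 + 91, "1M": 0.12 + 182}   # USD per run
monthly = {size: round(cost * 4) for size, cost in per_run.items()}
print(monthly)  # {'500K': 364, '1M': 728}
```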
Disable LLM and embeddings:
- `use_llm: false`
- `use_semantic: false` or `embedding_provider: none`
Performance:
- Stage 1 only: ~2-4 min for 500K messages
- No LLM costs
- Lower recall, but still effective for obvious spam patterns
Default two-stage pipeline:
- Stage 1: Deterministic patterns (fast, cheap)
- Stage 2: Semantic mining + LLM (slow, expensive, but only 2-3% of messages)
Recommendation for on-premise:
- Use local models (Mistral-7B, Llama-3.1-8B) to eliminate API costs
- See On-Premise Deployment for details
- Incremental Mining: Use `--since-checkpoint` to process only new messages
- Parallel Evaluation: Set `shadow_evaluation_parallel_workers: 8-16`
- Rule Filtering: Set `max_shadow_rules_to_evaluate: 1000-5000` (top-N by quality)
- Sampling: Set `shadow_evaluation_sample_size: 10000` for very large datasets
- Local Models: Use on-premise LLM/embeddings to eliminate API costs
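Putting the tunable settings above into one place, a consolidated config might look like the fragment below. The option names come from this page; the exact file location and structure are assumptions and may differ in your deployment.

```yaml
# Hypothetical consolidated tuning config using the options above;
# exact file layout may differ in your deployment.
shadow_evaluation_parallel_workers: 8
max_shadow_rules_to_evaluate: 1000
shadow_evaluation_sample_size: 10000
```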
Current limitation: Distributed locks prevent concurrent processing of the same dataset.
Solution: Shard data (see Horizontal Scaling):
- Split data into N shards (by message_id or timestamp)
- Each instance processes its shard with unique lock key
- Merge results after processing
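The sharding steps above can be sketched as a stable hash of `message_id` into N buckets; each instance then processes one bucket under its own lock key. The helper names and lock-key format below are illustrative, not the PATAS implementation.

```python
# Sketch of the sharding scheme above: hash message_id into N shards,
# process each shard independently, then merge. Names are illustrative.
import hashlib

def shard_of(message_id: str, num_shards: int) -> int:
    """Stable shard assignment by message_id (fixed across runs)."""
    digest = hashlib.sha256(message_id.encode()).hexdigest()
    return int(digest, 16) % num_shards

def split_into_shards(message_ids, num_shards=10):
    shards = [[] for _ in range(num_shards)]
    for mid in message_ids:
        shards[shard_of(mid, num_shards)].append(mid)
    return shards

ids = [f"msg-{i}" for i in range(10_000)]
shards = split_into_shards(ids, num_shards=10)
# each instance would take a unique lock, e.g. "mining:shard-3",
# process its shard, and results would be merged afterwards
print(len(shards), sum(len(s) for s in shards))  # 10 10000
```

A stable hash (rather than round-robin) keeps a message in the same shard across runs, which matters for incremental mining and checkpointing.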
Example (10 instances, 10M messages):
- Time: ~7 hours (instead of 3 days on single instance)
- Quality: Similar to single-instance processing
Roadmap: Automatic sharding in P1 (after successful pilot)
- Consider sharded evaluation (evaluate rules on message shards in parallel)
- Use incremental mining exclusively (process only new messages)
- Database partitioning by timestamp
- Read replicas for evaluation queries
- Horizontal scaling with data sharding (see Horizontal Scaling)
Production Example (500K messages/week):
- Pattern mining: ~3.5 hours weekly
- Shadow evaluation: ~3-4 hours (with parallel workers, top-1000 rules)
- LLM costs: ~$91/week (~$364/month) with OpenAI
- LLM costs: $0 with local models (Mistral-7B on GPU)