
Performance

Nick edited this page Nov 24, 2025 · 2 revisions

Performance and Cost

Real-world performance metrics and cost estimates for PATAS pattern mining.

Performance Metrics

Pattern Mining (Two-Stage Pipeline)

| Messages | Stage 1 Time | Stage 2 Time | Total Time | Stage 2 % | Cost Savings |
|----------|--------------|--------------|------------|-----------|--------------|
| 100K     | ~30 sec      | ~20 min      | ~21 min    | 2-3%      | 97-98%       |
| 500K     | ~2 min       | ~3.5 hours   | ~3.5 hours | 2-3%      | 97-98%       |
| 1M       | ~4 min       | ~7 hours     | ~7 hours   | 2-3%      | 97-98%       |
| 10M      | ~42 min      | ~70 hours    | ~3 days    | 2-3%      | 97-98%       |

Notes:

  • Stage 1: fast deterministic patterns (URLs, keywords, signatures); no LLM or embedding calls
  • Stage 2: deep semantic analysis, run only on suspicious messages (top 2-3% by default)
  • Cost savings: only 2-3% of messages require expensive LLM/embedding analysis
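Illustratively, the two-stage split amounts to cheap scoring followed by top-fraction routing. The sketch below uses invented function names, regexes, and keywords; it is not the actual PATAS implementation:

```python
import re

# Hypothetical Stage 1: cheap deterministic signals, no LLM or embedding calls.
URL_RE = re.compile(r"https?://\S+")
SPAM_KEYWORDS = ("free money", "click here", "act now")

def stage1_score(message: str) -> float:
    """Return a cheap suspicion score in [0, 1]."""
    score = 0.5 if URL_RE.search(message) else 0.0
    lowered = message.lower()
    hits = sum(kw in lowered for kw in SPAM_KEYWORDS)
    return min(1.0, score + 0.5 * hits / len(SPAM_KEYWORDS))

def select_for_stage2(messages, fraction=0.03):
    """Route only the top `fraction` most suspicious messages to the
    expensive Stage 2 (semantic/LLM) analysis."""
    ranked = sorted(messages, key=stage1_score, reverse=True)
    cutoff = max(1, int(len(ranked) * fraction))
    return ranked[:cutoff]

messages = ["FREE MONEY, click here: http://spam.example"] + ["regular chat"] * 99
stage2_batch = select_for_stage2(messages)
print(len(stage2_batch))  # 3 of 100 messages reach Stage 2
```

Because Stage 1 is pure string matching, it stays fast at any scale; only the small Stage 2 batch incurs API or GPU cost.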

Recent Optimizations (v2.1)

Performance Improvements:

  • Reduced commit frequency: batch_size raised 10→25, commit interval raised 5→10
  • Lazy ham message loading: a fixed sample of 1,000 messages instead of a proportional sample
  • Parallel chunk processing: 2-3x speedup on large datasets
  • Result: pattern mining's share of total runtime dropped from 70.1% to ~60-65%

Throughput:

  • 615-2,224 messages/second (depending on dataset size)
  • Sub-linear scaling: 3.17x time increase for 10x data
  • Low memory usage: average peak 18.4 MB

Shadow Evaluation

| Rules to Evaluate | Sequential Time | Parallel (4 workers) | Parallel (8 workers) |
|-------------------|-----------------|----------------------|----------------------|
| 100               | ~3 hours        | ~45 min              | ~25 min              |
| 1,000             | ~30 hours       | ~7.5 hours           | ~4 hours             |
| 10,000            | ~300 hours      | ~75 hours            | ~40 hours            |

Optimization:

  • Use max_shadow_rules_to_evaluate to limit evaluation to top-N rules by quality tier
  • Use shadow_evaluation_parallel_workers to enable parallel evaluation
  • Use shadow_evaluation_sample_size for sampling on very large datasets
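As a rough sketch, parallel shadow evaluation fans per-rule work out over a worker pool. The `evaluate_rule` body and its return shape below are invented for illustration; the real evaluator runs queries against the message store:

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate_rule(rule_id: int) -> dict:
    """Stand-in for the expensive per-rule shadow evaluation
    (in practice this would fetch matched messages and compute metrics)."""
    return {"rule": rule_id, "matches": rule_id % 7}

def evaluate_rules(rule_ids, workers: int = 4):
    # Threads suffice when evaluation is I/O-bound (database queries);
    # wall-clock time shrinks roughly in proportion to the worker count.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(evaluate_rule, rule_ids))

results = evaluate_rules(range(100), workers=4)
print(len(results))  # 100 rules evaluated
```

`Executor.map` preserves input order, so results line up with the submitted rule IDs regardless of which worker finished first.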

Ingestion

| Messages | Time (Batch) | Throughput   |
|----------|--------------|--------------|
| 100K     | ~8 min       | ~200 msg/sec |
| 500K     | ~42 min      | ~200 msg/sec |
| 1M       | ~1.4 hours   | ~200 msg/sec |

Cost Estimates (OpenAI Mode)

Per 500K Messages

  • Embeddings: ~$0.06 (only 2-3% of messages in Stage 2)
  • LLM (rule generation): ~$91 (depends on number of patterns found)
  • Total: ~$91 per run

Per 1M Messages

  • Embeddings: ~$0.12
  • LLM (rule generation): ~$182
  • Total: ~$182 per run

Monthly (Weekly Runs)

  • 500K messages/week: ~$364/month
  • 1M messages/week: ~$728/month
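A quick sanity check on the monthly figures (plain arithmetic, assuming four weekly runs per month):

```python
def monthly_cost(cost_per_run_usd: float, runs_per_month: int = 4) -> float:
    """Weekly runs -> roughly four runs per month."""
    return cost_per_run_usd * runs_per_month

print(monthly_cost(91.0))   # 500K messages/week -> 364.0 USD/month
print(monthly_cost(182.0))  # 1M messages/week -> 728.0 USD/month
```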

Cost Optimization Modes

Cheap Mode (Deterministic Only)

Disable LLM and embeddings:

  • use_llm: false
  • use_semantic: false or embedding_provider: none
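In config form, cheap mode might look like this (key names taken from this page; the surrounding config structure is assumed):

```yaml
# Cheap mode: Stage 1 only, no API costs.
use_llm: false
use_semantic: false
# alternatively, disable embeddings via the provider:
# embedding_provider: none
```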

Performance:

  • Stage 1 only: ~2-4 min for 500K messages
  • No LLM costs
  • Lower recall, but still effective for obvious spam patterns

Full Mode (Semantic + LLM)

Default two-stage pipeline:

  • Stage 1: Deterministic patterns (fast, cheap)
  • Stage 2: Semantic mining + LLM (slow, expensive, but only 2-3% of messages)

Recommendation for on-premise:

  • Use local models (Mistral-7B, Llama-3.1-8B) to eliminate API costs
  • See On-Premise Deployment for details

Scaling Recommendations

For 10M+ Messages

  1. Incremental Mining: Use --since-checkpoint to process only new messages
  2. Parallel Evaluation: Set shadow_evaluation_parallel_workers: 8-16
  3. Rule Filtering: Set max_shadow_rules_to_evaluate: 1000-5000 (top-N by quality)
  4. Sampling: Set shadow_evaluation_sample_size: 10000 for very large datasets
  5. Local Models: Use on-premise LLM/embeddings to eliminate API costs
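Put together as a config sketch (key names from this page; the exact file layout is assumed):

```yaml
# Scaling settings for 10M+ messages (values from the recommendations above).
shadow_evaluation_parallel_workers: 8    # up to 16 on larger hosts
max_shadow_rules_to_evaluate: 1000       # top-N rules by quality; up to 5000
shadow_evaluation_sample_size: 10000     # sample messages on very large datasets
```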

Horizontal Scaling (10+ Instances)

Current limitation: Distributed locks prevent concurrent processing of the same dataset.

Solution: Shard data (see Horizontal Scaling):

  • Split data into N shards (by message_id or timestamp)
  • Each instance processes its shard with unique lock key
  • Merge results after processing
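A minimal sharding sketch, assuming hashing on message_id (the function and lock-key format here are hypothetical):

```python
import hashlib

def shard_for(message_id: str, num_shards: int = 10) -> int:
    """Stable shard assignment: each instance owns a disjoint subset."""
    digest = hashlib.sha256(message_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

def lock_key(dataset: str, shard: int) -> str:
    """Per-shard lock keys avoid contention on the single-dataset lock."""
    return f"{dataset}:shard:{shard}"

# Every message lands in exactly one shard, deterministically.
counts = [0] * 10
for i in range(10_000):
    counts[shard_for(f"msg-{i}")] += 1
print(min(counts), max(counts))  # shards come out roughly balanced
```

Hashing gives balanced shards without coordination; sharding by timestamp instead keeps each shard's messages temporally contiguous, which can matter for time-sensitive patterns.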

Example (10 instances, 10M messages):

  • Time: ~7 hours (instead of ~3 days on a single instance)
  • Quality: Similar to single-instance processing

Roadmap: automatic sharding is planned for P1 (after a successful pilot).

For 100M+ Messages

  • Consider sharded evaluation (evaluate rules on message shards in parallel)
  • Use incremental mining exclusively (process only new messages)
  • Database partitioning by timestamp
  • Read replicas for evaluation queries
  • Horizontal scaling with data sharding (see Horizontal Scaling)

Real-World Example

Production Example (500K messages/week):

  • Pattern mining: ~3.5 hours weekly
  • Shadow evaluation: ~3-4 hours (with parallel workers, top-1000 rules)
  • LLM costs: $91/week ($364/month) with OpenAI
  • LLM costs: $0 with local models (Mistral-7B on GPU)
