
Performance

Nick edited this page Nov 24, 2025 · 2 revisions

Performance and Cost

Real-world performance metrics and cost estimates for PATAS pattern mining.

Performance Metrics

Pattern Mining (Two-Stage Pipeline)

| Messages | Stage 1 Time | Stage 2 Time | Total Time | Stage 2 % | Cost Savings |
|----------|--------------|--------------|------------|-----------|--------------|
| 100K     | ~30 sec      | ~20 min      | ~21 min    | 2-3%      | 97-98%       |
| 500K     | ~2 min       | ~3.5 hours   | ~3.5 hours | 2-3%      | 97-98%       |
| 1M       | ~4 min       | ~7 hours     | ~7 hours   | 2-3%      | 97-98%       |
| 10M      | ~42 min      | ~70 hours    | ~3 days    | 2-3%      | 97-98%       |

Notes:

  • Stage 1: fast deterministic patterns (URLs, keywords, signatures); no LLM or embedding calls
  • Stage 2: deep semantic analysis, run only on suspicious messages (top 2-3% by default)
  • Cost savings: only 2-3% of messages require expensive LLM/embedding analysis
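Illustratively, the two-stage split amounts to cheap scoring followed by top-fraction routing. The sketch below uses invented function names, regexes, and keywords; it is not the actual PATAS implementation:

```python
import re

# Hypothetical Stage 1: cheap deterministic signals, no LLM or embedding calls.
URL_RE = re.compile(r"https?://\S+")
SPAM_KEYWORDS = ("free money", "click here", "act now")

def stage1_score(message: str) -> float:
    """Return a cheap suspicion score in [0, 1]."""
    score = 0.5 if URL_RE.search(message) else 0.0
    lowered = message.lower()
    hits = sum(kw in lowered for kw in SPAM_KEYWORDS)
    return min(1.0, score + 0.5 * hits / len(SPAM_KEYWORDS))

def select_for_stage2(messages, fraction=0.03):
    """Route only the top `fraction` most suspicious messages to the
    expensive Stage 2 (semantic/LLM) analysis."""
    ranked = sorted(messages, key=stage1_score, reverse=True)
    cutoff = max(1, int(len(ranked) * fraction))
    return ranked[:cutoff]

messages = ["FREE MONEY, click here: http://spam.example"] + ["regular chat"] * 99
stage2_batch = select_for_stage2(messages)
print(len(stage2_batch))  # 3 of 100 messages reach Stage 2
```

Because Stage 1 is pure string matching, it stays fast at any scale; only the small Stage 2 batch incurs API or GPU cost.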

Recent Optimizations (v2.1)

Performance Improvements:

  • Reduced commit frequency: batch_size raised 10→25, commit interval raised 5→10
  • Lazy ham message loading: a fixed sample of 1,000 messages instead of a proportional sample
  • Parallel chunk processing: 2-3x speedup on large datasets
  • Result: pattern mining's share of total runtime dropped from 70.1% to ~60-65%

Throughput:

  • 615-2,224 messages/second (depending on dataset size)
  • Sub-linear scaling: 3.17x time increase for 10x data
  • Low memory usage: average peak 18.4 MB

Shadow Evaluation

| Rules to Evaluate | Sequential Time | Parallel (4 workers) | Parallel (8 workers) |
|-------------------|-----------------|----------------------|----------------------|
| 100               | ~3 hours        | ~45 min              | ~25 min              |
| 1,000             | ~30 hours       | ~7.5 hours           | ~4 hours             |
| 10,000            | ~300 hours      | ~75 hours            | ~40 hours            |

Optimization:

  • Use max_shadow_rules_to_evaluate to limit evaluation to top-N rules by quality tier
  • Use shadow_evaluation_parallel_workers to enable parallel evaluation
  • Use shadow_evaluation_sample_size for sampling on very large datasets
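As a rough sketch, parallel shadow evaluation fans per-rule work out over a worker pool. The `evaluate_rule` body and its return shape below are invented for illustration; the real evaluator runs queries against the message store:

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate_rule(rule_id: int) -> dict:
    """Stand-in for the expensive per-rule shadow evaluation
    (in practice this would fetch matched messages and compute metrics)."""
    return {"rule": rule_id, "matches": rule_id % 7}

def evaluate_rules(rule_ids, workers: int = 4):
    # Threads suffice when evaluation is I/O-bound (database queries);
    # wall-clock time shrinks roughly in proportion to the worker count.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(evaluate_rule, rule_ids))

results = evaluate_rules(range(100), workers=4)
print(len(results))  # 100 rules evaluated
```

`Executor.map` preserves input order, so results line up with the submitted rule IDs regardless of which worker finished first.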

Ingestion

| Messages | Time (Batch) | Throughput   |
|----------|--------------|--------------|
| 100K     | ~8 min       | ~200 msg/sec |
| 500K     | ~42 min      | ~200 msg/sec |
| 1M       | ~1.4 hours   | ~200 msg/sec |

Cost Estimates (OpenAI Mode)

Per 500K Messages

  • Embeddings: ~$0.06 (only 2-3% of messages in Stage 2)
  • LLM (rule generation): ~$91 (depends on number of patterns found)
  • Total: ~$91 per run

Per 1M Messages

  • Embeddings: ~$0.12
  • LLM (rule generation): ~$182
  • Total: ~$182 per run

Monthly (Weekly Runs)

  • 500K messages/week: ~$364/month
  • 1M messages/week: ~$728/month
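A quick sanity check on the monthly figures (plain arithmetic, assuming four weekly runs per month):

```python
def monthly_cost(cost_per_run_usd: float, runs_per_month: int = 4) -> float:
    """Weekly runs -> roughly four runs per month."""
    return cost_per_run_usd * runs_per_month

print(monthly_cost(91.0))   # 500K messages/week -> 364.0 USD/month
print(monthly_cost(182.0))  # 1M messages/week -> 728.0 USD/month
```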

Cost Optimization Modes

Cheap Mode (Deterministic Only)

Disable LLM and embeddings:

  • use_llm: false
  • use_semantic: false or embedding_provider: none
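In config form, cheap mode might look like this (key names taken from this page; the surrounding config structure is assumed):

```yaml
# Cheap mode: Stage 1 only, no API costs.
use_llm: false
use_semantic: false
# alternatively, disable embeddings via the provider:
# embedding_provider: none
```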

Performance:

  • Stage 1 only: ~2-4 min for 500K messages
  • No LLM costs
  • Lower recall, but still effective for obvious spam patterns

Full Mode (Semantic + LLM)

Default two-stage pipeline:

  • Stage 1: Deterministic patterns (fast, cheap)
  • Stage 2: Semantic mining + LLM (slow, expensive, but only 2-3% of messages)

Recommendation for on-premise:

  • Use local models (Mistral-7B, Llama-3.1-8B) to eliminate API costs
  • See On-Premise Deployment for details

Scaling Recommendations

For 10M+ Messages

  1. Incremental Mining: Use --since-checkpoint to process only new messages
  2. Parallel Evaluation: Set shadow_evaluation_parallel_workers: 8-16
  3. Rule Filtering: Set max_shadow_rules_to_evaluate: 1000-5000 (top-N by quality)
  4. Sampling: Set shadow_evaluation_sample_size: 10000 for very large datasets
  5. Local Models: Use on-premise LLM/embeddings to eliminate API costs
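Put together as a config sketch (key names from this page; the exact file layout is assumed):

```yaml
# Scaling settings for 10M+ messages (values from the recommendations above).
shadow_evaluation_parallel_workers: 8    # up to 16 on larger hosts
max_shadow_rules_to_evaluate: 1000       # top-N rules by quality; up to 5000
shadow_evaluation_sample_size: 10000     # sample messages on very large datasets
```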

Horizontal Scaling (10+ Instances)

Current limitation: Distributed locks prevent concurrent processing of the same dataset.

Solution: Shard data (see Horizontal Scaling):

  • Split data into N shards (by message_id or timestamp)
  • Each instance processes its shard with unique lock key
  • Merge results after processing
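A minimal sharding sketch, assuming hashing on message_id (the function and lock-key format here are hypothetical):

```python
import hashlib

def shard_for(message_id: str, num_shards: int = 10) -> int:
    """Stable shard assignment: each instance owns a disjoint subset."""
    digest = hashlib.sha256(message_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

def lock_key(dataset: str, shard: int) -> str:
    """Per-shard lock keys avoid contention on the single-dataset lock."""
    return f"{dataset}:shard:{shard}"

# Every message lands in exactly one shard, deterministically.
counts = [0] * 10
for i in range(10_000):
    counts[shard_for(f"msg-{i}")] += 1
print(min(counts), max(counts))  # shards come out roughly balanced
```

Hashing gives balanced shards without coordination; sharding by timestamp instead keeps each shard's messages temporally contiguous, which can matter for time-sensitive patterns.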

Example (10 instances, 10M messages):

  • Time: ~7 hours (instead of ~3 days on a single instance)
  • Quality: Similar to single-instance processing

Roadmap: automatic sharding is planned for P1 (after a successful pilot).

For 100M+ Messages

  • Consider sharded evaluation (evaluate rules on message shards in parallel)
  • Use incremental mining exclusively (process only new messages)
  • Database partitioning by timestamp
  • Read replicas for evaluation queries
  • Horizontal scaling with data sharding (see Horizontal Scaling)

Real-World Example

Production Example (500K messages/week):

  • Pattern mining: ~3.5 hours weekly
  • Shadow evaluation: ~3-4 hours (with parallel workers, top-1000 rules)
  • LLM costs: $91/week ($364/month) with OpenAI
  • LLM costs: $0 with local models (Mistral-7B on GPU)
