
Scaling and Cost Design

Nick edited this page Nov 21, 2025 · 2 revisions


How PATAS handles millions of logs without breaking infrastructure or budget


1. Chunked Processing

Problem: Processing millions of messages at once exhausts memory and API rate limits.

Solution: Split data into configurable chunks (default: 10,000 messages per chunk).

Benefits:

  • Memory efficient: ~25 MB per chunk vs GBs for full dataset
  • Parallelizable: Chunks can be processed independently
  • Fault-tolerant: Single chunk failure doesn't crash entire pipeline
  • Progressive: Results available before full dataset completion

Configuration:

PATTERN_MINING_CHUNK_SIZE=10000
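As an illustration, the chunking step can be sketched with a Python generator that keeps only one chunk in memory at a time; `iter_chunks` is a hypothetical helper, not the actual PATAS implementation:

```python
from itertools import islice
from typing import Iterable, Iterator, List

def iter_chunks(messages: Iterable[str], chunk_size: int = 10_000) -> Iterator[List[str]]:
    """Yield fixed-size chunks so only one chunk is held in memory at a time."""
    it = iter(messages)
    while chunk := list(islice(it, chunk_size)):
        yield chunk

# Each chunk can be processed (and fail) independently of the others.
sizes = [len(c) for c in iter_chunks((f"msg-{i}" for i in range(25_000)),
                                     chunk_size=10_000)]
# sizes == [10000, 10000, 5000]
```

Because the input is consumed lazily, the same loop works whether the messages come from a list, a database cursor, or a streaming source.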

2. Two-Stage Pipeline

Problem: Running expensive LLM/embedding analysis on all messages is cost-prohibitive at scale.

Solution: Split processing into two stages.

Stage 1: Fast Scanning (all messages)

  • Large chunks (10,000 messages)
  • Deterministic patterns only (URLs, keywords, signatures)
  • No LLM or embedding calls
  • Fast aggregation

Stage 2: Deep Analysis (suspicious patterns only)

  • Small chunks (1,000 messages)
  • Top 3-10% suspicious patterns from Stage 1
  • Semantic mining with embeddings
  • LLM analysis for quality
  • High-quality rules

Benefits:

  • Cost reduction: 50-80% fewer API calls
  • Same quality: Deep analysis applied where it matters
  • Faster: Parallel Stage 1 for bulk, focused Stage 2 for quality
  • Scalable: Handles millions of messages efficiently

Configuration:

ENABLE_TWO_STAGE_PROCESSING=true
SUSPICIOUSNESS_THRESHOLD=0.03  # Top 3% for Stage 2
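A minimal sketch of the two-stage split, assuming a toy Stage 1 scorer; the keyword list and the `stage1_score` / `select_for_stage2` helpers are illustrative, not PATAS internals:

```python
import re

SUSPICIOUSNESS_THRESHOLD = 0.03  # top 3% advance to Stage 2

def stage1_score(message: str) -> int:
    """Stage 1: cheap deterministic signals only -- no LLM, no embeddings."""
    score = len(re.findall(r"https?://\S+", message))                    # links
    score += sum(kw in message.lower() for kw in ("free", "winner", "crypto"))
    return score

def select_for_stage2(messages, threshold=SUSPICIOUSNESS_THRESHOLD):
    """Keep only the top-scoring fraction for expensive deep analysis."""
    ranked = sorted(messages, key=stage1_score, reverse=True)
    cutoff = max(1, int(len(ranked) * threshold))
    return ranked[:cutoff]

msgs = ["hello there"] * 97 + ["FREE crypto winner http://x.test"] * 3
suspicious = select_for_stage2(msgs)
# 3 of 100 messages survive to Stage 2
```

Everything below the cutoff never touches an embedding or LLM API, which is where the cost savings come from.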

3. Fast Clustering

Problem: Naive similarity clustering is O(n²), which is prohibitively slow at scale.

Solution: DBSCAN (Density-Based Spatial Clustering of Applications with Noise) with cosine distance.

Benefits:

  • Fast: O(n log n) with optimizations (10-100x faster than naive)
  • Automatic: No need to specify number of clusters
  • Noise-resistant: Automatically filters outliers
  • Quality: Finds clusters of arbitrary shape

Parameters (tuned for spam detection):

USE_DBSCAN_CLUSTERING=true
SEMANTIC_SIMILARITY_THRESHOLD=0.75
SEMANTIC_MIN_CLUSTER_SIZE=3
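The parameters above map onto scikit-learn's `DBSCAN` roughly as follows. This is a sketch with synthetic 2-D vectors standing in for real embeddings, and the eps derivation (1 minus the similarity threshold) is an assumption about how PATAS wires the config in:

```python
import numpy as np
from sklearn.cluster import DBSCAN

SEMANTIC_SIMILARITY_THRESHOLD = 0.75
SEMANTIC_MIN_CLUSTER_SIZE = 3

# Cosine distance = 1 - cosine similarity, so eps is the allowed distance.
eps = 1.0 - SEMANTIC_SIMILARITY_THRESHOLD

rng = np.random.default_rng(0)
campaign_a = rng.normal([5.0, 0.0], 0.05, size=(10, 2))   # near-duplicate spam
campaign_b = rng.normal([0.0, 5.0], 0.05, size=(10, 2))   # a second campaign
outliers = np.array([[-5.0, 0.0], [-3.0, -4.0], [1.0, -6.0]])  # isolated directions
embeddings = np.vstack([campaign_a, campaign_b, outliers])

labels = DBSCAN(eps=eps, min_samples=SEMANTIC_MIN_CLUSTER_SIZE,
                metric="cosine").fit_predict(embeddings)
n_clusters = len(set(labels) - {-1})   # outliers get label -1 automatically
```

Note that the number of clusters is never specified: DBSCAN discovers both campaigns and assigns the scattered outliers the noise label `-1`.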

4. Embedding Batching & Caching

Problem: Embedding APIs have per-request limits (OpenAI: 2,048 texts) and repeated analysis wastes API calls.

Solution: Automatic batching + TTL cache.

Automatic Batching

  • Splits large requests into API-compliant batches (2,048 texts)
  • Parallel processing where possible
  • Respects rate limits

Caching Strategy

  • Cache embeddings for 7 days (TTL)
  • 50-80% hit rate on repeated analyses
  • Reduces redundant API calls

Benefits:

  • API compliance: Never exceeds limits
  • Cost reduction: 50-80% fewer embedding calls (with cache)
  • Faster: Cached embeddings retrieved instantly

Configuration:

EMBEDDING_BATCH_SIZE=2048  # Automatic

5. Aggressiveness Profiles

Problem: One-size-fits-all doesn't work for different spam distributions.

Solution: Four configurable profiles.

Ultra-Conservative (1%)

  • Only strongest, most obvious patterns
  • Minimum API usage
  • Best for: highly distributed spam, tight budgets

Conservative (3%) ⭐ Default

  • Strong patterns with good coverage
  • Moderate API usage
  • Best for: typical spam distributions

Balanced (5%)

  • Broader pattern coverage
  • Higher API usage
  • Best for: moderate spam concentration

Aggressive (10-20%)

  • Maximum pattern coverage
  • Highest API usage
  • Best for: concentrated spam (one dominant pattern)

Configuration:

# Choose profile
SUSPICIOUSNESS_THRESHOLD=0.01  # Ultra-conservative
SUSPICIOUSNESS_THRESHOLD=0.03  # Conservative (default)
SUSPICIOUSNESS_THRESHOLD=0.05  # Balanced
SUSPICIOUSNESS_THRESHOLD=0.15  # Aggressive
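The threshold translates directly into a Stage 2 budget; the `stage2_budget` helper below is illustrative arithmetic, not a PATAS function:

```python
def stage2_budget(total_messages: int, threshold: float) -> int:
    """How many messages qualify for deep (Stage 2) analysis at a given profile."""
    return int(total_messages * threshold)

profiles = {"ultra-conservative": 0.01, "conservative": 0.03,
            "balanced": 0.05, "aggressive": 0.15}
budgets = {name: stage2_budget(1_000_000, t) for name, t in profiles.items()}
# the default profile keeps 30,000 of 1,000,000 messages for deep analysis
```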

6. Simple Mode Fallback

Problem: Teams may want to start with transparent, understandable processing.

Solution: Disable two-stage processing for simple, deterministic-only mode.

Simple mode characteristics:

  • Single-stage pipeline
  • Deterministic patterns only (URLs, keywords, regex)
  • No LLM or embeddings
  • Fully transparent and explainable
  • Zero AI costs

Enable simple mode:

patas mine-patterns --no-two-stage

# Or in config
ENABLE_TWO_STAGE_PROCESSING=false

When to use:

  • Initial testing and validation
  • When LLM/embeddings unavailable
  • When full transparency required
  • When cost must be minimized

7. Graceful Degradation

Problem: External dependencies (OpenAI, sklearn) may be unavailable.

Solution: Automatic fallbacks at every level.

Fallback chain:

  1. DBSCAN unavailable → Naive clustering
  2. Embeddings unavailable → Deterministic patterns only
  3. LLM unavailable → Skip LLM analysis
  4. Cache unavailable → Direct API calls

Benefits:

  • Never crashes on missing dependencies
  • Degrades gracefully to simpler methods
  • Always produces results (even if lower quality)
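The fallback chain boils down to try-imports with simpler substitutes. Here is a sketch of step 1; the greedy `naive_cluster` grouping is an illustrative stand-in for whatever naive method PATAS actually falls back to:

```python
import math

def _cos(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def naive_cluster(vectors, threshold=0.75, min_size=3):
    """O(n^2) pairwise fallback: greedy grouping by cosine similarity."""
    labels = [-1] * len(vectors)
    next_label = 0
    for i, v in enumerate(vectors):
        if labels[i] != -1:
            continue
        members = [j for j in range(i, len(vectors))
                   if labels[j] == -1 and _cos(v, vectors[j]) >= threshold]
        if len(members) >= min_size:
            for j in members:
                labels[j] = next_label
            next_label += 1
    return labels

def cluster_messages(vectors):
    """Degrade to the naive method instead of crashing when sklearn is missing."""
    try:
        import numpy as np
        from sklearn.cluster import DBSCAN   # optional dependency
        return list(DBSCAN(eps=0.25, min_samples=3,
                           metric="cosine").fit_predict(np.array(vectors)))
    except ImportError:
        return naive_cluster(vectors)
```

Both paths return the same label shape (cluster ids plus `-1` for noise), so downstream code doesn't care which one ran.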

Cost Comparison

Example: 1M messages over 30 days

Single-Stage (no optimization)

  • Messages processed: 1,000,000
  • Embedding API calls: ~489 (1M / 2048)
  • Estimated cost: ~$8-10 per run
  • Monthly cost: ~$240-300

Two-Stage (optimized, default profile)

  • Stage 1: 1,000,000 messages (deterministic, no API)
  • Stage 2: ~100,000-300,000 messages (3-10% suspicious)
  • Embedding API calls: ~49-147 (100K-300K / 2048)
  • Estimated cost: ~$0.80-2.50 per run
  • Monthly cost: ~$24-75
  • Savings: 70-90%
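The call counts above follow directly from the per-request limit; the dollar figures depend on provider pricing, so only the request arithmetic is reproduced here:

```python
import math

BATCH_LIMIT = 2048   # texts per embedding request (the OpenAI limit cited above)

def embedding_calls(n_messages: int) -> int:
    """Requests needed to embed n_messages at the per-request limit."""
    return math.ceil(n_messages / BATCH_LIMIT)

single_stage = embedding_calls(1_000_000)     # 489 requests
two_stage_low = embedding_calls(100_000)      # 49 requests
two_stage_high = embedding_calls(300_000)     # 147 requests
worst_case_savings = 1 - two_stage_high / single_stage   # ~70% at the high end
```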

Infrastructure Impact

Memory Usage

  • Per chunk: ~25 MB (10K messages)
  • Peak: ~300 MB (embeddings for 50K messages)
  • Database: Indexed queries, <1s response time

CPU Usage

  • Stage 1: Regex/string matching (fast)
  • DBSCAN: Multi-core (uses all CPUs)
  • Overall: Low CPU usage (most time in API waits)

Network

  • API calls: Batched and cached
  • Rate limits: Automatically respected
  • Retries: Built-in with exponential backoff
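The retry behavior can be sketched as exponential backoff with jitter; the `with_retries` helper and its defaults are illustrative, not PATAS's actual retry code:

```python
import random
import time

def with_retries(call, max_attempts=5, base_delay=1.0):
    """Retry a rate-limited API call, doubling the wait on each failure."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise               # out of attempts: surface the error
            # 1s, 2s, 4s, ... plus jitter so clients don't retry in lockstep
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1))
```

The jitter term matters at scale: without it, many workers hitting the same rate limit would all retry at the same instant and get throttled again.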

Summary

PATAS is designed for large-scale deployment:

| Feature | Benefit | Impact |
| --- | --- | --- |
| Chunked processing | Memory efficient | Handles unlimited dataset size |
| Two-stage pipeline | Cost reduction | 50-80% fewer API calls |
| Fast clustering | Performance | 10-100x faster clustering |
| Batching & caching | API optimization | 50-80% fewer redundant calls |
| Aggressiveness profiles | Flexibility | Tune for your use case |
| Simple mode | Transparency | Fallback to deterministic only |
| Graceful degradation | Reliability | Never crashes on missing deps |

Result: Process millions of messages efficiently without infrastructure strain or budget overruns.


Tested on: 20,000 real moderator reports
Performance: 37 patterns found in ~25 seconds
Savings: 50-80% API cost reduction
