# Scaling and Cost Design

How PATAS handles millions of logs without breaking infrastructure or budget.
## Chunked Processing

Problem: Processing millions of messages at once exhausts memory and API rate limits.

Solution: Split data into configurable chunks (default: 10,000 messages per chunk).
Benefits:
- Memory efficient: ~25 MB per chunk vs GBs for full dataset
- Parallelizable: Chunks can be processed independently
- Fault-tolerant: Single chunk failure doesn't crash entire pipeline
- Progressive: Results available before full dataset completion
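The chunked flow can be sketched as follows. This is a minimal illustration; `iter_chunks` and the `mine_all` driver are hypothetical names, not PATAS's actual internals:

```python
from typing import Callable, Iterator, List

def iter_chunks(messages: List[str], chunk_size: int = 10_000) -> Iterator[List[str]]:
    """Yield fixed-size slices so only one chunk is resident in memory at a time."""
    for start in range(0, len(messages), chunk_size):
        yield messages[start:start + chunk_size]

def mine_all(messages: List[str],
             mine_chunk: Callable[[List[str]], list],
             chunk_size: int = 10_000) -> list:
    """Fault-tolerant driver: a failing chunk is skipped, partial results survive."""
    patterns: list = []
    for i, chunk in enumerate(iter_chunks(messages, chunk_size)):
        try:
            patterns.extend(mine_chunk(chunk))
        except Exception as exc:
            # A single bad chunk must not crash the pipeline
            print(f"chunk {i} failed, skipping: {exc}")
    return patterns
```

Because each chunk is independent, the loop body is also trivially parallelizable (e.g. with a process pool).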
Configuration:

```bash
PATTERN_MINING_CHUNK_SIZE=10000
```

## Two-Stage Pipeline

Problem: Running expensive LLM/embedding analysis on all messages is cost-prohibitive at scale.
Solution: Split processing into two stages.

Stage 1 (deterministic triage):
- Large chunks (10,000 messages)
- Deterministic patterns only (URLs, keywords, signatures)
- No LLM or embedding calls
- Fast aggregation

Stage 2 (semantic deep analysis):
- Small chunks (1,000 messages)
- Top 3-10% suspicious patterns from Stage 1
- Semantic mining with embeddings
- LLM analysis for quality
- High-quality rules
Benefits:
- Cost reduction: 50-80% fewer API calls
- Same quality: Deep analysis applied where it matters
- Faster: Parallel Stage 1 for bulk, focused Stage 2 for quality
- Scalable: Handles millions of messages efficiently
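A minimal sketch of the two-stage idea, assuming a cheap deterministic Stage 1 scorer and a pluggable Stage 2 analyzer. All names here are illustrative, not the actual PATAS API:

```python
import re
from typing import Callable, List

URL_RE = re.compile(r"https?://\S+")
KEYWORDS = ("free money", "click here", "limited offer")  # illustrative keyword list

def suspiciousness(message: str) -> float:
    """Stage 1: cheap deterministic score; no LLM or embedding calls."""
    score = 0.0
    if URL_RE.search(message):
        score += 0.5
    score += 0.5 * sum(kw in message.lower() for kw in KEYWORDS)
    return score

def two_stage(messages: List[str],
              deep_analyze: Callable[[List[str]], list],
              threshold: float = 0.03) -> list:
    """Score everything cheaply, then run expensive analysis on the top slice only."""
    ranked = sorted(messages, key=suspiciousness, reverse=True)
    top_n = max(1, int(len(ranked) * threshold))  # e.g. top 3%
    return deep_analyze(ranked[:top_n])  # Stage 2: embeddings/LLM on a small subset
```

With `threshold=0.03`, the expensive `deep_analyze` callable sees only 3% of the input, which is where the API-call savings come from.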
Configuration:

```bash
ENABLE_TWO_STAGE_PROCESSING=true
SUSPICIOUSNESS_THRESHOLD=0.03  # Top 3% for Stage 2
```

## Fast Clustering (DBSCAN)

Problem: Naive similarity clustering is O(n²), which is prohibitively slow at scale.
Solution: DBSCAN (Density-Based Spatial Clustering of Applications with Noise) with cosine distance.
Benefits:
- Fast: O(n log n) with optimizations (10-100x faster than naive)
- Automatic: No need to specify number of clusters
- Noise-resistant: Automatically filters outliers
- Quality: Finds clusters of arbitrary shape
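With scikit-learn available, the clustering step can be sketched like this. The function name and wiring are illustrative; only the parameters mirror the configuration values in this section:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_embeddings(embeddings: np.ndarray,
                       similarity_threshold: float = 0.75,
                       min_cluster_size: int = 3) -> np.ndarray:
    """Cluster message embeddings; the label -1 marks noise (outliers)."""
    # Cosine distance = 1 - cosine similarity, so eps mirrors the threshold
    eps = 1.0 - similarity_threshold
    return DBSCAN(eps=eps,
                  min_samples=min_cluster_size,
                  metric="cosine").fit_predict(embeddings)
```

Note that DBSCAN needs no cluster count up front: the number of clusters falls out of the density structure, and sparse points are returned as noise rather than forced into a cluster.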
Parameters (tuned for spam detection):

```bash
USE_DBSCAN_CLUSTERING=true
SEMANTIC_SIMILARITY_THRESHOLD=0.75
SEMANTIC_MIN_CLUSTER_SIZE=3
```

## Batching & Caching

Problem: Embedding APIs have per-request limits (OpenAI: 2,048 texts per request), and repeated analysis wastes API calls.
Solution: Automatic batching + TTL cache.

Batching:
- Splits large requests into API-compliant batches (2,048 texts)
- Parallel processing where possible
- Respects rate limits

Caching:
- Caches embeddings for 7 days (TTL)
- 50-80% hit rate on repeated analyses
- Reduces redundant API calls
Benefits:
- API compliance: Never exceeds limits
- Cost reduction: 50-80% fewer embedding calls (with cache)
- Faster: Cached embeddings retrieved instantly
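A simplified sketch of batching plus a TTL cache. The in-memory dict and the `embed_batch` callable are stand-ins for illustration; the real cache layer and API client may differ:

```python
import time
from typing import Callable, Dict, List, Tuple

MAX_BATCH = 2048           # OpenAI per-request text limit
CACHE_TTL = 7 * 24 * 3600  # 7 days, in seconds

_cache: Dict[str, Tuple[float, list]] = {}  # text -> (stored_at, embedding)

def embed_all(texts: List[str],
              embed_batch: Callable[[List[str]], List[list]]) -> List[list]:
    """Serve embeddings from the TTL cache, batching only the cache misses."""
    now = time.time()
    # Deduplicate while preserving order, then keep only expired/missing texts
    misses = [t for t in dict.fromkeys(texts)
              if t not in _cache or now - _cache[t][0] > CACHE_TTL]
    # Split misses into API-compliant batches of at most MAX_BATCH texts
    for start in range(0, len(misses), MAX_BATCH):
        batch = misses[start:start + MAX_BATCH]
        for text, vec in zip(batch, embed_batch(batch)):
            _cache[text] = (now, vec)
    return [_cache[t][1] for t in texts]
```

On a repeated analysis, every text already in the cache costs zero API calls, which is where the quoted 50-80% hit rate pays off.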
Configuration:
EMBEDDING_BATCH_SIZE=2048 # AutomaticProblem: One-size-fits-all doesn't work for different spam distributions.
Solution: Three configurable profiles.
Ultra-conservative:
- Only the strongest, most obvious patterns
- Minimum API usage
- Best for: highly distributed spam, tight budgets

Conservative (default):
- Strong patterns with good coverage
- Moderate API usage
- Best for: typical spam distributions

Balanced:
- Broader pattern coverage
- Higher API usage
- Best for: moderate spam concentration

Aggressive:
- Maximum pattern coverage
- Highest API usage
- Best for: concentrated spam (one dominant pattern)
Configuration:

```bash
# Choose one profile
SUSPICIOUSNESS_THRESHOLD=0.01  # Ultra-conservative
SUSPICIOUSNESS_THRESHOLD=0.03  # Conservative (default)
SUSPICIOUSNESS_THRESHOLD=0.05  # Balanced
SUSPICIOUSNESS_THRESHOLD=0.15  # Aggressive
```

## Simple Mode

Problem: Teams may want to start with transparent, understandable processing.

Solution: Disable two-stage processing for a simple, deterministic-only mode.
Simple mode characteristics:
- Single-stage pipeline
- Deterministic patterns only (URLs, keywords, regex)
- No LLM or embeddings
- Fully transparent and explainable
- Zero AI costs
Enable simple mode:

```bash
patas mine-patterns --no-two-stage
# Or in config
ENABLE_TWO_STAGE_PROCESSING=false
```

When to use:
- Initial testing and validation
- When LLM/embeddings unavailable
- When full transparency required
- When cost must be minimized
## Graceful Degradation

Problem: External dependencies (OpenAI, sklearn) may be unavailable.
Solution: Automatic fallbacks at every level.
Fallback chain:
- DBSCAN unavailable → Naive clustering
- Embeddings unavailable → Deterministic patterns only
- LLM unavailable → Skip LLM analysis
- Cache unavailable → Direct API calls
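The fallback chain can be implemented with import-time probing, sketched here with illustrative feature names (not PATAS's actual internals):

```python
def resolve_features() -> dict:
    """Probe optional dependencies once; each absence degrades to a simpler mode."""
    features = {}
    try:
        import sklearn.cluster  # noqa: F401
        features["clustering"] = "dbscan"
    except ImportError:
        features["clustering"] = "naive"   # O(n^2) fallback clustering
    try:
        import openai  # noqa: F401
        features["embeddings"] = "api"
    except ImportError:
        features["embeddings"] = "none"    # deterministic patterns only
    return features
```

Each probe fails closed to a simpler method, so a missing dependency changes quality, never availability.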
Benefits:
- Never crashes on missing dependencies
- Degrades gracefully to simpler methods
- Always produces results (even if lower quality)
## Cost & Resource Estimates

Example: 1,000,000 messages per run.

Single-stage (all messages embedded):
- Messages processed: 1,000,000
- Embedding API calls: ~489 (1M / 2,048)
- Estimated cost: ~$8-10 per run
- Monthly cost: ~$240-300

Two-stage:
- Stage 1: 1,000,000 messages (deterministic, no API calls)
- Stage 2: ~100,000-300,000 messages (matching the top 3-10% of suspicious patterns)
- Embedding API calls: ~49-147 (100K-300K / 2,048)
- Estimated cost: ~$0.80-2.50 per run
- Monthly cost: ~$24-75
- Savings: 70-90%
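The call counts above follow from simple ceiling division over the batch size:

```python
import math

def embedding_calls(n_texts: int, batch_size: int = 2048) -> int:
    """API requests needed when texts are packed into maximal batches."""
    return math.ceil(n_texts / batch_size)

# 1,000,000 / 2,048 -> 489 calls; 100,000 -> 49; 300,000 -> 147
```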
Memory:
- Per chunk: ~25 MB (10K messages)
- Peak: ~300 MB (embeddings for 50K messages)
- Database: Indexed queries, <1 s response time

CPU:
- Stage 1: Regex/string matching (fast)
- DBSCAN: Multi-core (uses all CPUs)
- Overall: Low CPU usage (most time spent waiting on APIs)

Network:
- API calls: Batched and cached
- Rate limits: Automatically respected
- Retries: Built-in with exponential backoff
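Retry with exponential backoff can be sketched as follows. This is a generic helper, not PATAS's actual implementation; the injectable `sleep` parameter exists only to make the helper testable:

```python
import random
import time
from typing import Callable

def with_retries(call: Callable[[], object],
                 max_attempts: int = 5,
                 base_delay: float = 1.0,
                 sleep: Callable[[float], None] = time.sleep):
    """Retry a rate-limited API call, doubling the wait after each failure."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error
            # 1s, 2s, 4s, ... plus jitter to avoid synchronized retries
            sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1))
```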
## Summary

PATAS is designed for large-scale deployment:
| Feature | Benefit | Impact |
|---|---|---|
| Chunked processing | Memory efficient | Handles unlimited dataset size |
| Two-stage pipeline | Cost reduction | 50-80% fewer API calls |
| Fast clustering | Performance | 10-100x faster clustering |
| Batching & caching | API optimization | 50-80% fewer redundant calls |
| Aggressiveness profiles | Flexibility | Tune for your use case |
| Simple mode | Transparency | Fallback to deterministic only |
| Graceful degradation | Reliability | Never crashes on missing deps |
Result: Process millions of messages efficiently without infrastructure strain or budget overruns.
Real-world validation:
- Tested on: 20,000 real moderator reports from the platform
- Performance: 37 patterns found in ~25 seconds
- Savings: 50-80% API cost reduction