Nick edited this page Mar 10, 2026 · 2 revisions

PATAS Threshold Calibration Guide

This guide helps you calibrate PATAS thresholds for your specific spam patterns and requirements.


Understanding Thresholds

Pattern Mining Thresholds

Key thresholds:

  1. min_spam_count: Minimum number of spam messages a candidate must match to become a pattern (default: 10)
  2. min_spam_ratio: Minimum ratio of spam among the messages a pattern matches (default: 0.05 = 5%)
  3. semantic_similarity_threshold: Cosine-similarity threshold for semantic clustering (default: 0.75)
  4. semantic_min_cluster_size: Minimum number of messages per semantic cluster (default: 3)
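As a rough illustration of how the first two thresholds combine, here is a hedged sketch; the function name and signature are hypothetical, not part of the PATAS codebase:

```python
def passes_mining_gates(spam_count: int, matched_count: int,
                        min_spam_count: int = 10,
                        min_spam_ratio: float = 0.05) -> bool:
    """Return True if a candidate pattern clears both mining gates.

    spam_count: spam messages the candidate matches.
    matched_count: all messages (spam + ham) the candidate matches.
    """
    if spam_count < min_spam_count:      # absolute-count gate
        return False
    # ratio gate: what share of the candidate's matches are spam?
    return spam_count / matched_count >= min_spam_ratio

# 12 spam hits out of 300 matches: count gate passes, ratio gate (4%) fails.
print(passes_mining_gates(12, 300))  # False
# 12 spam hits out of 200 matches: both gates pass (6% >= 5%).
print(passes_mining_gates(12, 200))  # True
```

Note that both gates must pass: a pattern matching lots of spam still fails if it also matches too much ham.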

Rule Promotion Thresholds

Key thresholds:

  1. min_precision: Minimum precision a rule must achieve in shadow evaluation (default: 0.95)
  2. max_ham_hits: Maximum number of ham (legitimate) messages a rule may match (default: 5)
  3. min_coverage: Minimum fraction of evaluated messages the rule must match (default: 0.01 = 1%)
  4. min_sample_size: Minimum evaluation sample size required before a promotion decision (default: 100)
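The promotion decision can be pictured as a series of gates. The sketch below is illustrative only; in particular, it assumes the evaluation sample is the set of messages a rule matched during shadow evaluation, which may differ from how PATAS defines it:

```python
def eligible_for_promotion(spam_hits: int, ham_hits: int,
                           evaluated_total: int,
                           min_precision: float = 0.95,
                           max_ham_hits: int = 5,
                           min_coverage: float = 0.01,
                           min_sample_size: int = 100) -> bool:
    matched = spam_hits + ham_hits
    if matched < min_sample_size:   # not enough evidence yet
        return False
    if ham_hits > max_ham_hits:     # hard cap on false positives
        return False
    precision = spam_hits / matched
    coverage = matched / evaluated_total
    return precision >= min_precision and coverage >= min_coverage

# 120 spam hits, 3 ham hits over 10,000 replayed messages:
# precision ~0.976, coverage ~1.2%, 3 <= 5 ham hits -> promoted.
print(eligible_for_promotion(120, 3, 10000))  # True
# Same rule with 7 ham hits trips the max_ham_hits cap.
print(eligible_for_promotion(120, 7, 10000))  # False
```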

Calibration Process

Step 1: Start with Defaults

Initial configuration:

pattern_mining:
  min_spam_count: 10
  min_spam_ratio: 0.05
  semantic_similarity_threshold: 0.75
  semantic_min_cluster_size: 3

rule_lifecycle:
  promotion:
    min_precision: 0.95
    max_ham_hits: 5
    min_coverage: 0.01
    min_sample_size: 100

Step 2: Run Initial Pattern Mining

Run pattern mining on historical data:

patas mine-patterns --days=30

Review results:

  • Number of patterns discovered
  • Number of rules generated
  • Pattern types (URL, keyword, semantic, etc.)

Step 3: Evaluate Rules

Run shadow evaluation:

patas eval-rules --days=30

Review metrics:

  • Precision distribution
  • Recall distribution
  • Coverage distribution
  • False positive rate
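The metrics above fall directly out of shadow-evaluation counts. This toy computation is a sketch (variable names are illustrative; the true labels come from your historical ham/spam data, not from PATAS internals):

```python
def shadow_metrics(tp: int, fp: int, fn: int, total_evaluated: int) -> dict:
    """tp: spam correctly matched, fp: ham wrongly matched,
    fn: spam the rule missed, total_evaluated: all messages replayed."""
    matched = tp + fp
    return {
        "precision": tp / matched if matched else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "coverage": matched / total_evaluated,
        "false_positive_rate": fp / total_evaluated,
    }

m = shadow_metrics(tp=95, fp=5, fn=20, total_evaluated=10000)
print(m["precision"])  # 0.95
print(m["coverage"])   # 0.01
```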

Step 4: Adjust Thresholds

If missing patterns:

  • Lower min_spam_count (e.g., 5 instead of 10)
  • Lower min_spam_ratio (e.g., 0.03 instead of 0.05)
  • Lower semantic_similarity_threshold (e.g., 0.70 instead of 0.75)

If too many false positives:

  • Raise min_precision (e.g., 0.98 instead of 0.95)
  • Lower max_ham_hits (e.g., 3 instead of 5)
  • Raise semantic_similarity_threshold (e.g., 0.80 instead of 0.75)

If rules too specific:

  • Lower semantic_similarity_threshold (e.g., 0.70-0.75)
  • Enable semantic mining if disabled
  • Lower semantic_min_cluster_size (e.g., 2 instead of 3)

If rules too broad:

  • Raise semantic_similarity_threshold (e.g., 0.80-0.85)
  • Raise min_spam_ratio (e.g., 0.10 instead of 0.05)
  • Increase min_sample_size for evaluation

Step 5: Iterate

Repeat steps 2-4:

  • Run pattern mining with adjusted thresholds
  • Evaluate rules
  • Review metrics
  • Adjust thresholds based on results
  • Continue until precision, recall, and coverage reach an acceptable balance

Threshold Recommendations by Spam Type

Concentrated Spam (Same Pattern Repeated)

Characteristics:

  • Same spam message sent many times
  • High repetition rate
  • Easy to detect with deterministic patterns

Recommended thresholds:

pattern_mining:
  min_spam_count: 5  # Lower (pattern appears frequently)
  min_spam_ratio: 0.10  # Higher (pattern is concentrated)
  semantic_similarity_threshold: 0.80  # Higher (messages are very similar)
  semantic_min_cluster_size: 5  # Higher (larger clusters expected)

Distributed Spam (Many Variations)

Characteristics:

  • Spam messages vary in wording
  • Same meaning, different words
  • Requires semantic analysis

Recommended thresholds:

pattern_mining:
  min_spam_count: 10  # Default
  min_spam_ratio: 0.03  # Lower (pattern is distributed)
  semantic_similarity_threshold: 0.70  # Lower (catch variations)
  semantic_min_cluster_size: 3  # Default

High False Positive Tolerance

Use case:

  • Can tolerate some false positives
  • Want to catch more spam
  • Manual review available

Recommended thresholds:

rule_lifecycle:
  promotion:
    min_precision: 0.90  # Lower (allow more false positives)
    max_ham_hits: 10  # Higher (allow more false positives)
    min_coverage: 0.01  # Default
    min_sample_size: 50  # Lower (faster promotion)

Low False Positive Tolerance

Use case:

  • Cannot tolerate false positives
  • High-traffic system
  • Automated blocking

Recommended thresholds:

rule_lifecycle:
  promotion:
    min_precision: 0.98  # Higher (fewer false positives)
    max_ham_hits: 3  # Lower (fewer false positives)
    min_coverage: 0.01  # Default
    min_sample_size: 200  # Higher (more confidence)

Semantic Similarity Threshold Tuning

Understanding the Threshold

Semantic similarity threshold:

  • Range: 0.0 to 1.0
  • Higher = stricter (only very similar messages)
  • Lower = more lenient (catches more variations)

Threshold Ranges

0.80-0.85: Very Strict

  • Only very similar messages grouped
  • Fewer false positives
  • May miss legitimate variations
  • Use when: False positives are critical

0.75-0.80: Balanced (Recommended)

  • Good balance of precision and recall
  • Catches most variations
  • Acceptable false positive rate
  • Use when: General purpose

0.70-0.75: More Lenient

  • Catches more variations
  • May include some false positives
  • Better recall
  • Use when: Missing too many spam patterns

<0.70: Very Lenient

  • Catches many variations
  • Higher false positive risk
  • Best recall
  • Use when: Need to catch all variations, manual review available

Tuning Process

  1. Start with 0.75: Good default balance
  2. Run pattern mining: Discover patterns
  3. Evaluate clusters: Check if similar messages are grouped
  4. Adjust based on results:
    • Too many small clusters → Lower threshold
    • Too many false positives → Raise threshold
    • Missing variations → Lower threshold
  5. Test on sample data: Validate before applying to full dataset
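The effect described in step 4 can be seen with a toy greedy clustering pass. This is a simplification for intuition only (real pattern mining presumably uses a proper clustering algorithm over sentence embeddings):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

def greedy_clusters(vectors, threshold):
    """Assign each vector to the first cluster seed it clears the
    threshold with, otherwise start a new cluster. One label per vector."""
    seeds, labels = [], []
    for v in vectors:
        for i, s in enumerate(seeds):
            if cosine(v, s) >= threshold:
                labels.append(i)
                break
        else:
            seeds.append(v)
            labels.append(len(seeds) - 1)
    return labels

messages = [[1.0, 0.0], [0.95, 0.31], [0.7, 0.7]]  # toy embeddings
print(greedy_clusters(messages, 0.80))  # [0, 0, 1] -> third message split off
print(greedy_clusters(messages, 0.70))  # [0, 0, 0] -> one merged cluster
```

Lowering the threshold from 0.80 to 0.70 merges the borderline message into the main cluster: the same trade-off you are tuning when you adjust semantic_similarity_threshold.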

Tools for Tuning

Clustering visualization:

# Visualize clusters to understand threshold impact
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# `embeddings` and `cluster_labels` come from your pattern-mining run:
# the message embedding vectors and their assigned cluster IDs.
# Reduce embeddings to 2D for visualization (fix the seed for repeatability).
embeddings_2d = TSNE(n_components=2, random_state=42).fit_transform(embeddings)

# Plot points colored by cluster label
plt.scatter(embeddings_2d[:, 0], embeddings_2d[:, 1], c=cluster_labels)
plt.show()

Calibration script:

poetry run python scripts/calibrate_similarity_threshold.py \
  --dataset data/sample_messages.jsonl \
  --thresholds 0.70,0.75,0.80,0.85

Calibration Tools

Automated Calibration Script

Usage:

poetry run python scripts/calibrate_thresholds.py \
  --dataset data/historical_messages.jsonl \
  --days 30 \
  --output calibration_results.json

Output:

  • Optimal thresholds for your data
  • Precision/recall trade-offs
  • Recommended configuration

Manual Calibration

Step-by-step:

  1. Run pattern mining with different thresholds
  2. Evaluate rules for each configuration
  3. Compare metrics (precision, recall, coverage)
  4. Choose configuration with best balance

Example:

# Test different min_spam_count values
for count in 5 10 15 20; do
  patas mine-patterns --min-spam-count=$count --days=30
  patas eval-rules --days=30
  # Review results
done

Best Practices

  1. Start conservative: Use higher thresholds initially
  2. Iterate gradually: Adjust thresholds in small increments
  3. Test on sample data: Validate before applying to full dataset
  4. Monitor metrics: Track precision/recall over time
  5. Document changes: Keep track of threshold adjustments
  6. Review regularly: Recalibrate as spam patterns evolve
  7. Use profiles: Create custom profiles for different use cases
  8. A/B testing: Compare different threshold configurations

Troubleshooting

Too Few Patterns Discovered

Symptoms:

  • Very few patterns discovered
  • Missing obvious spam patterns

Solutions:

  • Lower min_spam_count (e.g., 5 instead of 10)
  • Lower min_spam_ratio (e.g., 0.03 instead of 0.05)
  • Lower semantic_similarity_threshold (e.g., 0.70 instead of 0.75)
  • Enable semantic mining if disabled
  • Check if sufficient historical data available

Too Many False Positives

Symptoms:

  • High false positive rate
  • Legitimate messages blocked

Solutions:

  • Raise min_precision (e.g., 0.98 instead of 0.95)
  • Lower max_ham_hits (e.g., 3 instead of 5)
  • Raise semantic_similarity_threshold (e.g., 0.80 instead of 0.75)
  • Increase min_sample_size for evaluation
  • Review and manually deprecate problematic rules

Rules Too Specific

Symptoms:

  • Rules catch only exact matches
  • Missing variations of spam

Solutions:

  • Lower semantic_similarity_threshold (e.g., 0.70-0.75)
  • Enable semantic mining
  • Lower semantic_min_cluster_size (e.g., 2 instead of 3)
  • Check if embeddings are working correctly

Rules Too Broad

Symptoms:

  • Rules match too many messages
  • High coverage but low precision

Solutions:

  • Raise semantic_similarity_threshold (e.g., 0.80-0.85)
  • Raise min_spam_ratio (e.g., 0.10 instead of 0.05)
  • Increase min_sample_size for evaluation
  • Review patterns and split into more specific patterns

Additional Resources

For calibration questions or issues, please open an issue on GitHub.
