
PATTERN_QUALITY_TESTING

Nick edited this page Mar 10, 2026 · 1 revision

Pattern Quality Testing Guide

This guide explains how to test the quality of the patterns PATAS discovers, so that the resulting rules do not produce false positives (legitimate messages blocked as spam).

Problem

PATAS needs to discover spam patterns without creating overly broad rules that would block legitimate messages. For example:

  • ❌ Bad: Blocking all messages with "now" (common word)
  • ❌ Bad: Blocking all messages with "buy" (legitimate commerce)
  • ✅ Good: Blocking messages with specific spam URLs
  • ✅ Good: Blocking messages with specific spam phrases like "earn $1000 daily"

Solution

We've implemented Pattern Quality Filters that:

  1. Filter common words - Words like "now", "buy", "click" are never used as patterns alone
  2. Check ham messages - Patterns must rarely or never match legitimate (ham) messages
  3. Require minimum thresholds - Patterns must appear often enough in spam (for keywords, in at least 5% of spam messages)
  4. Compare spam/ham ratios - Patterns must appear far more often in spam than in ham (10x for keywords, 5x for URLs and phrases)

Testing

1. Run Quality Tests

# Run pattern quality tests
poetry run pytest tests/test_pattern_quality.py -v

# Run specific test
poetry run pytest tests/test_pattern_quality.py::test_pattern_quality_checker_logic -v

2. Run Quality Analysis Script

# Analyze patterns on test dataset
poetry run python scripts/test_pattern_quality.py

This script will:

  • Load test dataset (tests/data/large_test_dataset.json)
  • Run PATAS pattern mining
  • Analyze each pattern for false positives
  • Report unsafe patterns

3. Manual Review

After running the analysis, manually review any unsafe patterns:

# Example output
⚠️  Unsafe patterns: 2

Pattern ID: 5
Description: Keyword: now (found in 15 spam messages)
Type: KEYWORD
False positive rate: 12.5%
False positives: 5/40
Issues: ['Pattern is just a common word: now']
Example false positives:
  - ham_001: hey, are you free now? let's meet up.
  - ham_011: the meeting is now scheduled for tomorrow.
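
The false positive figures in the report above can be reproduced by hand. The following is a minimal sketch, not the PATAS implementation: the helper name false_positive_stats and the whole-word, case-insensitive matching are assumptions made for illustration.

```python
import re


def false_positive_stats(keyword, ham_texts):
    """Return (hits, rate): how many ham messages contain the keyword as a whole word."""
    pattern = re.compile(rf"\b{re.escape(keyword)}\b", re.IGNORECASE)
    hits = sum(1 for text in ham_texts if pattern.search(text))
    rate = hits / len(ham_texts) if ham_texts else 0.0
    return hits, rate


# Hypothetical ham set: 5 of 40 messages contain the word "now".
ham = ["Hey, are you free now? Let's meet up."] * 5 + ["See you tomorrow."] * 35
hits, rate = false_positive_stats("now", ham)
print(f"False positives: {hits}/{len(ham)} ({rate:.1%})")  # False positives: 5/40 (12.5%)
```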

Test Dataset

The test dataset (tests/data/large_test_dataset.json) includes:

  • Legitimate messages (ham): Normal conversations that should NOT be blocked
  • Spam messages: Actual spam that SHOULD be blocked
  • Edge cases: Legitimate messages that might look like spam

Example Ham Messages

{"id": "ham_001", "text": "Hey, are you free now? Let's meet up.", "is_spam": false}
{"id": "ham_002", "text": "I need to buy groceries. Can you help?", "is_spam": false}
{"id": "ham_003", "text": "Click here to see the document I shared.", "is_spam": false}

Example Spam Messages

{"id": "spam_001", "text": "Buy now! http://spam-shop.com/offer", "is_spam": true}
{"id": "spam_006", "text": "Call now: +1-555-123-4567 for amazing deals!", "is_spam": true}
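
Records in this shape can be split into ham and spam with the standard json module. The sketch below parses a few of the sample records from an inline string, one JSON object per line; how the real dataset file is laid out (a single array or one object per line) is not stated here, so treat the layout as an assumption.

```python
import json

# A few of the sample records, one JSON object per line (layout is an assumption).
SAMPLE = """
{"id": "ham_001", "text": "Hey, are you free now? Let's meet up.", "is_spam": false}
{"id": "ham_002", "text": "I need to buy groceries. Can you help?", "is_spam": false}
{"id": "spam_001", "text": "Buy now! http://spam-shop.com/offer", "is_spam": true}
"""

records = [json.loads(line) for line in SAMPLE.splitlines() if line.strip()]

# Split message texts by label so each group can be counted separately.
ham_texts = [r["text"] for r in records if not r["is_spam"]]
spam_texts = [r["text"] for r in records if r["is_spam"]]

print(len(ham_texts), len(spam_texts))  # 2 1
```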

Quality Filter Rules

Keyword Patterns

A keyword is safe if:

  1. ✅ Not a common word (not in COMMON_WORDS list)
  2. ✅ At least 3 characters long
  3. ✅ Appears in < 5% of ham messages OR appears 10x more in spam than ham
  4. ✅ Appears in at least 5% of spam messages
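
The four keyword rules can be sketched as a single check. This is an illustration under stated assumptions, not the PATAS code: the function names are hypothetical, COMMON_WORDS is a three-word stand-in for the real list, and "10x more in spam than ham" is read as a ratio of message shares.

```python
import re

COMMON_WORDS = {"now", "buy", "click"}  # illustrative subset, not the real list


def share(keyword, texts):
    """Fraction of messages that contain the keyword as a whole word."""
    if not texts:
        return 0.0
    pattern = re.compile(rf"\b{re.escape(keyword)}\b", re.IGNORECASE)
    return sum(1 for text in texts if pattern.search(text)) / len(texts)


def is_safe_keyword(keyword, spam_texts, ham_texts):
    word = keyword.lower()
    if word in COMMON_WORDS:        # rule 1: not a common word
        return False
    if len(word) < 3:               # rule 2: at least 3 characters
        return False
    spam_share = share(word, spam_texts)
    ham_share = share(word, ham_texts)
    if spam_share < 0.05:           # rule 4: in at least 5% of spam
        return False
    # rule 3: rare in ham, OR at least 10x more frequent in spam than in ham
    return ham_share < 0.05 or spam_share >= 10 * ham_share


spam = ["Get free crypto today", "Crypto giveaway, join fast", "Cheap meds online"]
ham = ["Are you free now?", "Lunch tomorrow?"]
print(is_safe_keyword("crypto", spam, ham))  # True
print(is_safe_keyword("now", spam, ham))     # False (common word)
print(is_safe_keyword("free", spam, ham))    # False (too frequent in ham)
```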

URL Patterns

A URL is safe if:

  1. ✅ Appears in at least 3 spam messages
  2. ✅ Not a generic TLD alone (e.g., not just ".com")
  3. ✅ Appears in < 20% of ham messages OR appears 5x more in spam than ham
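
A comparable sketch for the URL rules follows. It assumes that a URL pattern is identified by its domain, that domains are pulled from message text with a simple regular expression, and that "5x more" compares message shares; the names is_safe_url_pattern and GENERIC_TLDS are hypothetical.

```python
import re
from urllib.parse import urlparse

URL_RE = re.compile(r"https?://\S+")
GENERIC_TLDS = {".com", ".net", ".org"}  # illustrative subset


def domains_in(text):
    """Set of lowercased domains for every http(s) URL found in the text."""
    return {urlparse(url).netloc.lower() for url in URL_RE.findall(text)}


def is_safe_url_pattern(domain, spam_texts, ham_texts):
    domain = domain.lower()
    if domain in GENERIC_TLDS:      # rule 2: not a bare generic TLD
        return False
    spam_hits = sum(1 for text in spam_texts if domain in domains_in(text))
    ham_hits = sum(1 for text in ham_texts if domain in domains_in(text))
    if spam_hits < 3:               # rule 1: in at least 3 spam messages
        return False
    spam_share = spam_hits / len(spam_texts)
    ham_share = ham_hits / len(ham_texts) if ham_texts else 0.0
    # rule 3: under 20% of ham, OR at least 5x more frequent in spam
    return ham_share < 0.20 or spam_share >= 5 * ham_share


spam = [
    "Buy now! http://spam-shop.com/offer",
    "Last chance http://spam-shop.com/deal today",
    "Visit http://spam-shop.com/win for prizes",
]
ham = ["The docs are at https://example.org/guide if you need them"]
print(is_safe_url_pattern("spam-shop.com", spam, ham))  # True
print(is_safe_url_pattern(".com", spam, ham))           # False
```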

Phrase Patterns

A phrase (multiple words) is safe if:

  1. ✅ Has at least 2 words
  2. ✅ Not all words are common words
  3. ✅ Appears in < 5% of ham messages OR appears 5x more in spam than ham
  4. ✅ Appears in at least 3% of spam messages
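
The phrase rules follow the same shape. In this sketch a phrase matches when it occurs as a case-insensitive substring of the message, which is an assumption about matching; is_safe_phrase and the three-word COMMON_WORDS set are hypothetical stand-ins.

```python
COMMON_WORDS = {"now", "buy", "click"}  # illustrative subset, not the real list


def phrase_share(phrase, texts):
    """Fraction of messages containing the phrase (case-insensitive substring)."""
    if not texts:
        return 0.0
    needle = phrase.lower()
    return sum(1 for text in texts if needle in text.lower()) / len(texts)


def is_safe_phrase(phrase, spam_texts, ham_texts):
    words = phrase.lower().split()
    if len(words) < 2:                            # rule 1: at least 2 words
        return False
    if all(word in COMMON_WORDS for word in words):  # rule 2: not only common words
        return False
    spam_share = phrase_share(phrase, spam_texts)
    ham_share = phrase_share(phrase, ham_texts)
    if spam_share < 0.03:                         # rule 4: in at least 3% of spam
        return False
    # rule 3: rare in ham, OR at least 5x more frequent in spam than in ham
    return ham_share < 0.05 or spam_share >= 5 * ham_share


spam = [
    "Earn $1000 daily from home!",
    "Want to earn $1000 daily? Click here",
    "Cheap meds online",
]
ham = ["I earn my living as a teacher."]
print(is_safe_phrase("earn $1000 daily", spam, ham))  # True
print(is_safe_phrase("daily", spam, ham))             # False (single word)
```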

Improving Pattern Quality

If you find unsafe patterns:

  1. Review the pattern - Why is it matching ham messages?
  2. Check the filter - Is the quality filter working correctly?
  3. Adjust thresholds - Are the current thresholds too permissive for your data?
  4. Add to common words - If a word is too common, add it to COMMON_WORDS
  5. Use phrases instead - Instead of "now", use "buy now" or "click now"

Example: Fixing "now" Pattern

Problem: Pattern "now" matches legitimate messages like "Are you free now?"

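Solution: Use phrase patterns instead:

  • Keyword: now → Too broad
  • Phrase: "buy now" → More specific to spam
  • Phrase: "click now" → More specific to spam

Note that rule 2 under Phrase Patterns rejects a phrase made up only of common words, so these candidates pass only if "buy" or "click" is not in your COMMON_WORDS list.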

The quality filter will automatically suggest safer phrases using suggest_safer_pattern().
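
How suggest_safer_pattern() works internally is not described here. As a rough illustration of the idea only, the hypothetical helper below collects two-word phrases that contain the keyword, keeps those seen in at least min_count spam messages and in no ham message, and ranks them by spam count. Candidates would still need to pass the phrase rules.

```python
import re
from collections import Counter


def tokens(text):
    """Lowercased word tokens; punctuation is dropped."""
    return re.findall(r"[a-z0-9$']+", text.lower())


def bigrams_with(keyword, text):
    """Set of adjacent word pairs in the text that include the keyword."""
    words = tokens(text)
    return {f"{a} {b}" for a, b in zip(words, words[1:]) if keyword in (a, b)}


def suggest_phrases(keyword, spam_texts, ham_texts, min_count=2, top_n=3):
    keyword = keyword.lower()
    spam_counts = Counter()
    for text in spam_texts:
        spam_counts.update(bigrams_with(keyword, text))
    ham_seen = set()
    for text in ham_texts:
        ham_seen |= bigrams_with(keyword, text)
    candidates = [
        (phrase, count)
        for phrase, count in spam_counts.items()
        if count >= min_count and phrase not in ham_seen
    ]
    # Most frequent in spam first; ties broken alphabetically for stable output.
    candidates.sort(key=lambda item: (-item[1], item[0]))
    return [phrase for phrase, _ in candidates[:top_n]]


spam = [
    "Buy now! http://spam-shop.com/offer",
    "Buy now and save big",
    "Call now: +1-555-123-4567 for amazing deals!",
]
ham = [
    "Hey, are you free now? Let's meet up.",
    "The meeting is now scheduled for tomorrow.",
]
print(suggest_phrases("now", spam, ham))  # ['buy now']
```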

Next Steps

  1. Run tests on your own dataset
  2. Review unsafe patterns
  3. Adjust quality filter thresholds if needed
  4. Add domain-specific common words
  5. Test in shadow mode before promoting to active
