Pattern Quality Testing Guide

This guide explains how to test the quality of the patterns PATAS discovers, so that patterns causing false positives on legitimate messages are caught before they are used.

Problem

PATAS needs to discover spam patterns without creating overly broad rules that would block legitimate messages. For example:

❌ Bad: Blocking all messages with "now" (common word)
❌ Bad: Blocking all messages with "buy" (legitimate commerce)
✅ Good: Blocking messages with specific spam URLs
✅ Good: Blocking messages with specific spam phrases like "earn $1000 daily"

Solution

We've implemented Pattern Quality Filters that:

Filter common words - Words like "now", "buy", "click" are never used as patterns alone
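Check ham messages - Patterns are checked against legitimate messages and rejected if they match too many of them
Require minimum thresholds - Patterns must appear in a minimum share of spam messages (5% for keywords, 3% for phrases, at least 3 messages for URLs)
Compare spam/ham ratios - Patterns that exceed the ham limit are kept only if they appear much more often in spam than in ham (10x for keywords, 5x for URLs and phrases)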

Testing

1. Run Quality Tests

# Run pattern quality tests
poetry run pytest tests/test_pattern_quality.py -v

# Run specific test
poetry run pytest tests/test_pattern_quality.py::test_pattern_quality_checker_logic -v

2. Run Quality Analysis Script

# Analyze patterns on test dataset
poetry run python scripts/test_pattern_quality.py

This script will:

Load test dataset (tests/data/large_test_dataset.json)
Run PATAS pattern mining
Analyze each pattern for false positives
Report unsafe patterns

3. Manual Review

After running the analysis, manually review any unsafe patterns:

# Example output
⚠️  Unsafe patterns: 2

Pattern ID: 5
Description: Keyword: now (found in 15 spam messages)
Type: KEYWORD
False positive rate: 12.5%
False positives: 5/40
Issues: ['Pattern is just a common word: now']
Example false positives:
  - ham_001: hey, are you free now? let's meet up.
  - ham_011: the meeting is now scheduled for tomorrow.
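The false positive rate in a report like this is the number of ham messages the pattern matches divided by the total number of ham messages (5/40 = 12.5% above). The following is a minimal sketch of that calculation, not the PATAS implementation; the function name `false_positive_rate` and the whole-word, case-insensitive matching are assumptions made for illustration.

```python
import re


def false_positive_rate(keyword: str, ham_messages: list[str]) -> tuple[int, float]:
    """Count ham messages containing the keyword as a whole word.

    Hypothetical helper: returns (number of false positives, rate).
    """
    matcher = re.compile(rf"\b{re.escape(keyword)}\b", re.IGNORECASE)
    hits = sum(1 for text in ham_messages if matcher.search(text))
    rate = hits / len(ham_messages) if ham_messages else 0.0
    return hits, rate


# 40 ham messages, 5 of which contain the word "now"
ham = ["are you free now?"] * 5 + ["see you at lunch"] * 35
hits, rate = false_positive_rate("now", ham)
print(f"False positives: {hits}/{len(ham)} ({rate:.1%})")
# prints: False positives: 5/40 (12.5%)
```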

Test Dataset

The test dataset (tests/data/large_test_dataset.json) includes:

Legitimate messages (ham): Normal conversations that should NOT be blocked
Spam messages: Actual spam that SHOULD be blocked
Edge cases: Legitimate messages that might look like spam

Example Ham Messages

{"id": "ham_001", "text": "Hey, are you free now? Let's meet up.", "is_spam": false}
{"id": "ham_002", "text": "I need to buy groceries. Can you help?", "is_spam": false}
{"id": "ham_003", "text": "Click here to see the document I shared.", "is_spam": false}

Example Spam Messages

{"id": "spam_001", "text": "Buy now! http://spam-shop.com/offer", "is_spam": true}
{"id": "spam_006", "text": "Call now: +1-555-123-4567 for amazing deals!", "is_spam": true}

Quality Filter Rules

Keyword Patterns

A keyword is safe if all of the following hold:

✅ Not a common word (not in COMMON_WORDS list)
✅ At least 3 characters long
✅ Appears in < 5% of ham messages OR appears 10x more in spam than ham
✅ Appears in at least 5% of spam messages
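The keyword rules above can be sketched as a single predicate. This is an illustration under stated assumptions, not the PATAS code: `is_safe_keyword` is a hypothetical name, the `COMMON_WORDS` set here is a three-word stand-in for the real list, and "10x more in spam than ham" is read as a comparison of match rates rather than raw counts.

```python
# Stand-in for the real COMMON_WORDS list (assumption: a set of lowercase words).
COMMON_WORDS = {"now", "buy", "click"}


def is_safe_keyword(word: str, spam_hits: int, spam_total: int,
                    ham_hits: int, ham_total: int) -> bool:
    """Apply the four keyword rules; all must hold for the keyword to be safe."""
    word = word.lower()
    if word in COMMON_WORDS:       # rule 1: not a common word
        return False
    if len(word) < 3:              # rule 2: at least 3 characters
        return False
    spam_rate = spam_hits / spam_total if spam_total else 0.0
    ham_rate = ham_hits / ham_total if ham_total else 0.0
    if spam_rate < 0.05:           # rule 4: in at least 5% of spam
        return False
    # rule 3: rare in ham, OR at least 10x more frequent in spam than in ham
    return ham_rate < 0.05 or spam_rate >= 10 * ham_rate


print(is_safe_keyword("now", 15, 60, 5, 40))       # False: common word
print(is_safe_keyword("jackpot", 10, 100, 0, 100))  # True
```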

URL Patterns

A URL is safe if all of the following hold:

✅ Appears in at least 3 spam messages
✅ Not a generic TLD alone (e.g., not just ".com")
✅ Appears in < 20% of ham messages OR appears 5x more in spam than ham
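For URL patterns the minimum is an absolute count (3 spam messages) rather than a percentage, and the ham limits are looser. A rough sketch, assuming the pattern is a plain string such as a domain, that `GENERIC_TLDS` is a hypothetical stand-in for whatever PATAS treats as a bare TLD, and that "5x more" compares match rates:

```python
# Hypothetical list of bare TLDs that are too generic to use as a pattern.
GENERIC_TLDS = {".com", ".net", ".org"}


def is_safe_url_pattern(url_pattern: str, spam_hits: int, spam_total: int,
                        ham_hits: int, ham_total: int) -> bool:
    """Apply the three URL rules; all must hold for the pattern to be safe."""
    pattern = url_pattern.strip().lower()
    if pattern in GENERIC_TLDS:    # not a generic TLD alone
        return False
    if spam_hits < 3:              # appears in at least 3 spam messages
        return False
    spam_rate = spam_hits / spam_total if spam_total else 0.0
    ham_rate = ham_hits / ham_total if ham_total else 0.0
    # under 20% of ham, OR at least 5x more frequent in spam than in ham
    return ham_rate < 0.20 or spam_rate >= 5 * ham_rate


print(is_safe_url_pattern("spam-shop.com", 4, 100, 0, 100))  # True
print(is_safe_url_pattern(".com", 50, 100, 30, 100))         # False
```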

Phrase Patterns

A phrase (multiple words) is safe if all of the following hold:

✅ Has at least 2 words
✅ Not all words are common words
✅ Appears in < 5% of ham messages OR appears 5x more in spam than ham
✅ Appears in at least 3% of spam messages
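The phrase rules add two structural checks (word count and common-word content) before the frequency checks. The sketch below is one possible reading, with a hypothetical `is_safe_phrase` function, whitespace tokenization, a three-word stand-in for `COMMON_WORDS`, and rates rather than raw counts for the "5x" comparison.

```python
# Stand-in for the real COMMON_WORDS list (assumption).
COMMON_WORDS = {"now", "buy", "click"}


def is_safe_phrase(phrase: str, spam_hits: int, spam_total: int,
                   ham_hits: int, ham_total: int) -> bool:
    """Apply the four phrase rules; all must hold for the phrase to be safe."""
    words = phrase.lower().split()
    if len(words) < 2:                           # at least 2 words
        return False
    if all(w in COMMON_WORDS for w in words):    # not all words are common
        return False
    spam_rate = spam_hits / spam_total if spam_total else 0.0
    ham_rate = ham_hits / ham_total if ham_total else 0.0
    if spam_rate < 0.03:                         # in at least 3% of spam
        return False
    # under 5% of ham, OR at least 5x more frequent in spam than in ham
    return ham_rate < 0.05 or spam_rate >= 5 * ham_rate


print(is_safe_phrase("earn $1000 daily", 5, 100, 0, 100))  # True
print(is_safe_phrase("jackpot", 5, 100, 0, 100))           # False: one word
```

Under this reading, a phrase made up only of listed words would not pass the second check; whether PATAS applies the rule that strictly is not confirmed here.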

Improving Pattern Quality

If you find unsafe patterns:

Review the pattern - Why is it matching ham messages?
Check the filter - Is the quality filter working correctly?
Adjust thresholds - If unsafe patterns still pass the filter, the thresholds may be too permissive for your data
Add to common words - If a word is too common, add it to COMMON_WORDS
Use phrases instead - Instead of "now", use "buy now" or "click now"

Example: Fixing "now" Pattern

Problem: Pattern "now" matches legitimate messages like "Are you free now?"

Solution: Use phrase patterns instead:

❌ Keyword: now → Too broad
✅ Phrase: "buy now" → Specific to spam
✅ Phrase: "click now" → Specific to spam

The quality filter will automatically suggest safer phrases using suggest_safer_pattern().
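The signature and logic of `suggest_safer_pattern()` are not shown in this guide. As an illustration of the general idea only, the sketch below uses a hypothetical function, `suggest_phrases_for_keyword`, that collects two-word phrases containing the keyword from spam messages and drops any phrase that also occurs in ham. The tokenization is a simplifying assumption.

```python
import re
from collections import Counter


def bigrams(text: str) -> list[tuple[str, str]]:
    """Lowercase the text and return adjacent word pairs."""
    words = re.findall(r"[a-z0-9$']+", text.lower())
    return list(zip(words, words[1:]))


def suggest_phrases_for_keyword(keyword: str, spam_texts: list[str],
                                ham_texts: list[str], top_n: int = 3) -> list[str]:
    """Hypothetical helper: rank spam-only bigrams that contain the keyword."""
    keyword = keyword.lower()
    ham_phrases = {" ".join(bg) for text in ham_texts for bg in bigrams(text)}
    counts: Counter[str] = Counter()
    for text in spam_texts:
        for bg in set(bigrams(text)):  # count each phrase once per message
            phrase = " ".join(bg)
            if keyword in bg and phrase not in ham_phrases:
                counts[phrase] += 1
    return [phrase for phrase, _ in counts.most_common(top_n)]


spam = [
    "Buy now! http://spam-shop.com/offer",
    "Buy now and save big",
    "Call now: +1-555-123-4567 for amazing deals!",
]
ham = [
    "Hey, are you free now? Let's meet up.",
    "The meeting is now scheduled for tomorrow.",
]
print(suggest_phrases_for_keyword("now", spam, ham, top_n=1))
# prints: ['buy now']
```

Any phrase suggested this way would still need to pass the phrase rules before being used.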

Next Steps

Run tests on your own dataset
Review unsafe patterns
Adjust quality filter thresholds if needed
Add domain-specific common words
Test in shadow mode before promoting to active
