PATTERN_QUALITY_TESTING
Nick edited this page Mar 10, 2026
This guide explains how to test the quality of PATAS patterns so that mined rules do not produce false positives.
PATAS needs to discover spam patterns without creating overly broad rules that would block legitimate messages. For example:
- ❌ Bad: Blocking all messages with "now" (common word)
- ❌ Bad: Blocking all messages with "buy" (legitimate commerce)
- ✅ Good: Blocking messages with specific spam URLs
- ✅ Good: Blocking messages with specific spam phrases like "earn $1000 daily"
We've implemented Pattern Quality Filters that:
- Filter common words - Words like "now", "buy", "click" are never used as patterns alone
- Check ham messages - Patterns must not match legitimate messages
- Require minimum thresholds - Patterns must appear in at least 5% of spam messages
- Compare spam/ham ratios - Patterns must appear 10x more in spam than in ham
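The spam and ham rates behind these filters can be sketched as follows. This is a minimal illustration, not the PATAS implementation: the helper names `match_rate` and `spam_ham_ratio` are hypothetical, and whole-word, case-insensitive matching is an assumption.

```python
import re


def match_rate(pattern: str, texts: list[str]) -> float:
    """Fraction of texts containing `pattern` as a whole word or phrase."""
    if not texts:
        return 0.0
    # Assumption: case-insensitive match on word boundaries.
    regex = re.compile(r"\b" + re.escape(pattern) + r"\b", re.IGNORECASE)
    hits = sum(1 for text in texts if regex.search(text))
    return hits / len(texts)


def spam_ham_ratio(spam_rate: float, ham_rate: float) -> float:
    """How many times more often a pattern appears in spam than in ham."""
    if ham_rate == 0.0:
        # Never seen in ham: treat the ratio as unbounded if it occurs in spam.
        return float("inf") if spam_rate > 0.0 else 0.0
    return spam_rate / ham_rate


# Messages taken from the example dataset entries in this guide.
spam = [
    "Buy now! http://spam-shop.com/offer",
    "Call now: +1-555-123-4567 for amazing deals!",
]
ham = [
    "Hey, are you free now? Let's meet up.",
    "I need to buy groceries. Can you help?",
    "Click here to see the document I shared.",
]

spam_rate = match_rate("now", spam)
ham_rate = match_rate("now", ham)
ratio = spam_ham_ratio(spam_rate, ham_rate)
print(f"spam={spam_rate:.2f} ham={ham_rate:.2f} ratio={ratio:.1f}")
```

On this tiny sample, "now" appears in every spam message but also in one of three ham messages, so its ratio is far below the 10x requirement.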
# Run pattern quality tests
poetry run pytest tests/test_pattern_quality.py -v
# Run specific test
poetry run pytest tests/test_pattern_quality.py::test_pattern_quality_checker_logic -v
# Analyze patterns on test dataset
poetry run python scripts/test_pattern_quality.py

This script will:
- Load test dataset (tests/data/large_test_dataset.json)
- Run PATAS pattern mining
- Analyze each pattern for false positives
- Report unsafe patterns
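The false-positive analysis step can be sketched roughly as below. This is not the actual script: `analyze_keyword` is a hypothetical helper, the dataset is assumed to be a JSON array of records in the shape shown in this guide, and the inline sample stands in for the real file.

```python
import json
import re

# Assumption: the dataset is a JSON array of {"id", "text", "is_spam"} records.
SAMPLE_JSON = """
[
  {"id": "ham_001", "text": "Hey, are you free now? Let's meet up.", "is_spam": false},
  {"id": "ham_002", "text": "I need to buy groceries. Can you help?", "is_spam": false},
  {"id": "ham_003", "text": "Click here to see the document I shared.", "is_spam": false},
  {"id": "spam_001", "text": "Buy now! http://spam-shop.com/offer", "is_spam": true},
  {"id": "spam_006", "text": "Call now: +1-555-123-4567 for amazing deals!", "is_spam": true}
]
"""


def analyze_keyword(keyword, records):
    """Count how often `keyword` matches ham messages (false positives)."""
    word = keyword.lower()
    ham = [r for r in records if not r["is_spam"]]
    spam = [r for r in records if r["is_spam"]]

    def matches(record):
        # Whole-word match on a simple lowercase tokenization.
        return word in re.findall(r"[a-z0-9']+", record["text"].lower())

    false_positives = [r for r in ham if matches(r)]
    return {
        "pattern": keyword,
        "spam_matches": sum(1 for r in spam if matches(r)),
        "false_positives": len(false_positives),
        "ham_total": len(ham),
        "false_positive_rate": len(false_positives) / len(ham) if ham else 0.0,
        "examples": [f'{r["id"]}: {r["text"]}' for r in false_positives],
    }


records = json.loads(SAMPLE_JSON)
report = analyze_keyword("now", records)
print(f'False positives: {report["false_positives"]}/{report["ham_total"]}')
for example in report["examples"]:
    print(f"  - {example}")
```

To run the same check against the real dataset, replace the inline sample with `json.load()` on the file, provided its layout matches the assumption above.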
After running the analysis, manually review any unsafe patterns:
# Example output
⚠️ Unsafe patterns: 2
Pattern ID: 5
Description: Keyword: now (found in 15 spam messages)
Type: KEYWORD
False positive rate: 12.5%
False positives: 5/40
Issues: ['Pattern is just a common word: now']
Example false positives:
- ham_001: hey, are you free now? let's meet up.
- ham_011: the meeting is now scheduled for tomorrow.

The test dataset (tests/data/large_test_dataset.json) includes:
- Legitimate messages (ham): Normal conversations that should NOT be blocked
- Spam messages: Actual spam that SHOULD be blocked
- Edge cases: Legitimate messages that might look like spam
{"id": "ham_001", "text": "Hey, are you free now? Let's meet up.", "is_spam": false}
{"id": "ham_002", "text": "I need to buy groceries. Can you help?", "is_spam": false}
{"id": "ham_003", "text": "Click here to see the document I shared.", "is_spam": false}

{"id": "spam_001", "text": "Buy now! http://spam-shop.com/offer", "is_spam": true}
{"id": "spam_006", "text": "Call now: +1-555-123-4567 for amazing deals!", "is_spam": true}

A keyword is safe if:
- ✅ Not a common word (not in COMMON_WORDS list)
- ✅ At least 3 characters long
- ✅ Appears in < 5% of ham messages OR appears 10x more in spam than ham
- ✅ Appears in at least 5% of spam messages
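The keyword rules above could be expressed as a single check along these lines. This is a sketch only: `is_safe_keyword` is a hypothetical name, the `COMMON_WORDS` set here is a small stand-in for the project's real list, and "10x more" is assumed to compare match rates rather than raw counts.

```python
# Hypothetical stand-in for the project's COMMON_WORDS list.
COMMON_WORDS = {"now", "buy", "click"}


def is_safe_keyword(keyword, spam_hits, spam_total, ham_hits, ham_total):
    """Apply the four keyword rules to precomputed match counts."""
    word = keyword.lower()
    if word in COMMON_WORDS:
        return False  # rule 1: must not be a common word
    if len(word) < 3:
        return False  # rule 2: at least 3 characters long
    spam_rate = spam_hits / spam_total if spam_total else 0.0
    ham_rate = ham_hits / ham_total if ham_total else 0.0
    if spam_rate < 0.05:
        return False  # rule 4: must appear in at least 5% of spam
    # rule 3: rare in ham OR at least 10x more frequent in spam (by rate)
    return ham_rate < 0.05 or spam_rate >= 10 * ham_rate


# "casino" is an illustrative keyword, not taken from the dataset.
print(is_safe_keyword("now", 15, 100, 5, 40))      # common word
print(is_safe_keyword("casino", 20, 100, 1, 40))   # frequent in spam, rare in ham
```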
A URL is safe if:
- ✅ Appears in at least 3 spam messages
- ✅ Not a generic TLD alone (e.g., not just ".com")
- ✅ Appears in < 20% of ham messages OR appears 5x more in spam than ham
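A comparable sketch for the URL rules is shown below. The function name `is_safe_url_pattern` and the `GENERIC_TLDS` set are assumptions made for illustration, and the 5x comparison is again assumed to use match rates.

```python
# Hypothetical list of bare TLDs that are too generic to block on their own.
GENERIC_TLDS = {".com", ".net", ".org", ".info"}


def is_safe_url_pattern(url_pattern, spam_hits, spam_total, ham_hits, ham_total):
    """Apply the three URL rules to precomputed match counts."""
    candidate = url_pattern.strip().lower()
    # Reject a bare TLD, written with or without the leading dot.
    if "." + candidate.lstrip(".") in GENERIC_TLDS:
        return False
    if spam_hits < 3:
        return False  # must appear in at least 3 spam messages
    spam_rate = spam_hits / spam_total if spam_total else 0.0
    ham_rate = ham_hits / ham_total if ham_total else 0.0
    # Uncommon in ham OR at least 5x more frequent in spam (by rate)
    return ham_rate < 0.20 or spam_rate >= 5 * ham_rate


print(is_safe_url_pattern(".com", 50, 100, 10, 40))
print(is_safe_url_pattern("spam-shop.com", 4, 100, 0, 40))
```

Note that the URL rule uses an absolute count (3 messages) rather than a percentage, so it behaves differently on very small and very large datasets.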
A phrase (multiple words) is safe if:
- ✅ Has at least 2 words
- ✅ Not all words are common words
- ✅ Appears in < 5% of ham messages OR appears 5x more in spam than ham
- ✅ Appears in at least 3% of spam messages
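The phrase rules can be sketched in the same way. `is_safe_phrase` is a hypothetical name, the `COMMON_WORDS` set is a stand-in for the real list, and the tokenization is a simplifying assumption.

```python
import re

# Hypothetical stand-in for the project's COMMON_WORDS list.
COMMON_WORDS = {"now", "buy", "click", "here"}


def is_safe_phrase(phrase, spam_hits, spam_total, ham_hits, ham_total):
    """Apply the four phrase rules to precomputed match counts."""
    # Assumption: words are runs of letters, digits, "$" and apostrophes.
    words = re.findall(r"[a-z0-9$']+", phrase.lower())
    if len(words) < 2:
        return False  # must have at least 2 words
    if all(word in COMMON_WORDS for word in words):
        return False  # must contain at least one non-common word
    spam_rate = spam_hits / spam_total if spam_total else 0.0
    ham_rate = ham_hits / ham_total if ham_total else 0.0
    if spam_rate < 0.03:
        return False  # must appear in at least 3% of spam
    # Rare in ham OR at least 5x more frequent in spam (by rate)
    return ham_rate < 0.05 or spam_rate >= 5 * ham_rate


print(is_safe_phrase("earn $1000 daily", 6, 100, 0, 40))
print(is_safe_phrase("click here", 30, 100, 0, 40))
```

Under this strict reading, a phrase made up entirely of listed common words is rejected regardless of its counts. Whether the real filter applies the rule this strictly is an assumption worth checking against the code.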
If you find unsafe patterns:
- Review the pattern - Why is it matching ham messages?
- Check the filter - Is the quality filter working correctly?
- Adjust thresholds - Are the current thresholds too lenient for your data?
- Add to common words - If a word is too common, add it to COMMON_WORDS
- Use phrases instead - Instead of "now", use "buy now" or "click now"
Problem: Pattern "now" matches legitimate messages like "Are you free now?"
Solution: Use phrase patterns instead:
- ❌ Keyword: now → Too broad
- ✅ Phrase: "buy now" → Specific to spam
- ✅ Phrase: "click now" → Specific to spam
The quality filter will automatically suggest safer phrases using suggest_safer_pattern().
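One way such a suggestion could work is to look for two-word phrases that contain the broad keyword, occur in spam, and never occur in ham. The sketch below illustrates that idea only. It is not the implementation of suggest_safer_pattern(), and `suggest_safer_phrases` and `bigrams_with` are hypothetical names.

```python
import re
from collections import Counter

URL_RE = re.compile(r"https?://\S+")
SEGMENT_RE = re.compile(r"[.!?:;,]")
WORD_RE = re.compile(r"[a-z']+")


def bigrams_with(keyword, text):
    """Return the set of two-word phrases in `text` that contain `keyword`."""
    keyword = keyword.lower()
    found = set()
    # Drop URLs, then split on punctuation so phrases do not span clauses.
    cleaned = URL_RE.sub(" ", text.lower())
    for segment in SEGMENT_RE.split(cleaned):
        words = WORD_RE.findall(segment)
        for first, second in zip(words, words[1:]):
            if keyword in (first, second):
                found.add(f"{first} {second}")
    return found


def suggest_safer_phrases(keyword, spam_texts, ham_texts, limit=3):
    """Suggest phrases containing `keyword` that occur in spam but never in ham."""
    spam_counts = Counter()
    for text in spam_texts:
        spam_counts.update(bigrams_with(keyword, text))
    ham_phrases = set()
    for text in ham_texts:
        ham_phrases |= bigrams_with(keyword, text)
    candidates = [(p, n) for p, n in spam_counts.items() if p not in ham_phrases]
    # Most frequent in spam first, ties broken alphabetically.
    candidates.sort(key=lambda item: (-item[1], item[0]))
    return [phrase for phrase, _ in candidates[:limit]]


spam = [
    "Buy now! http://spam-shop.com/offer",
    "Call now: +1-555-123-4567 for amazing deals!",
]
ham = ["Hey, are you free now? Let's meet up."]
suggestions = suggest_safer_phrases("now", spam, ham)
print(suggestions)
```

Any phrase suggested this way should still pass the phrase safety rules before it is used as a pattern.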
- Run tests on your own dataset
- Review unsafe patterns
- Adjust quality filter thresholds if needed
- Add domain-specific common words
- Test in shadow mode before promoting to active