SEMANTIC_PATTERN_MINING
Why we need semantic patterns, not keyword patterns
Spammers are getting smarter. They use:
- LLM-generated variations: same spam intent, different words
- Synonyms and paraphrases: "earn money" → "make cash" → "get income"
- Different phrasings: "Work from home!" → "Remote work available!" → "Home-based jobs!"
- Language variations: English, Russian, mixed languages
Traditional keyword-based detection fails:
- ❌ Pattern "earn money" → misses "make cash", "get income"
- ❌ Pattern "work from home" → misses "remote work", "home-based jobs"
- ❌ Pattern "buy now" → misses "purchase today", "order immediately"
PATAS uses semantic similarity to find patterns by meaning, not exact words.
- Generate Embeddings
  - Convert messages to vector embeddings (OpenAI, local models)
  - Embeddings capture semantic meaning, not just words
- Cluster by Similarity
  - Group messages with similar embeddings (cosine similarity)
  - Messages with the same meaning cluster together, even with different words
- LLM Analysis
  - For each cluster, the LLM analyzes:
    - What do they mean? (semantic pattern)
    - Why are they similar? (common concepts, intent)
    - What variations exist? (synonyms, paraphrases)
- Generate Semantic Rules
  - Create SQL rules that cover every variation found in the cluster
  - Use OR conditions for synonyms:
    text LIKE '%earn money%' OR text LIKE '%make cash%' OR text LIKE '%get income%'
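The clustering step can be illustrated with a small sketch. This document does not say which clustering algorithm PATAS uses, so the greedy seed-based grouping below is an assumption chosen for brevity, and the three-dimensional vectors are made-up stand-ins for real embeddings. The parameter names mirror `similarity_threshold` and `min_cluster_size` from the usage examples.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cluster_by_similarity(embeddings, similarity_threshold=0.75, min_cluster_size=3):
    """Greedy grouping: each vector joins the first cluster whose seed
    (its first member) is similar enough, otherwise it starts a new cluster.
    Clusters smaller than min_cluster_size are dropped at the end."""
    clusters = []  # each cluster is a list of indices into `embeddings`
    for index, vector in enumerate(embeddings):
        for cluster in clusters:
            seed = embeddings[cluster[0]]
            if cosine_similarity(vector, seed) >= similarity_threshold:
                cluster.append(index)
                break
        else:
            clusters.append([index])
    return [c for c in clusters if len(c) >= min_cluster_size]


# Hypothetical toy embeddings: three point in roughly the same direction,
# the fourth points elsewhere.
toy_embeddings = [
    [0.90, 0.10, 0.00],
    [0.80, 0.20, 0.10],
    [0.85, 0.15, 0.05],
    [0.00, 0.10, 0.90],
]

clusters = cluster_by_similarity(toy_embeddings)
print(clusters)  # [[0, 1, 2]]
```

Comparing against a fixed seed keeps the sketch short but makes the result depend on message order; a production pipeline would more likely use a library clustering routine.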
1. "Earn $1000 daily working from home! No experience needed."
2. "Make cash fast! Work remotely, no skills required."
3. "Get income from home! Start today, no qualifications!"
4. "Work from home and earn money! Apply now!"
Patterns found:
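- "earn money" (matches 4)
- "work from home" (matches 4)
- "make cash" (matches 2)
- "get income" (matches 3)
Problem: 4 different patterns for the same spam intent! And under literal substring matching (as with SQL LIKE), message 1 ("Earn $1000", "working from home") still matches none of them.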
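The keyword approach can be checked directly against the four example messages. The sketch below assumes plain case-insensitive substring matching; `keyword_hits` is a hypothetical helper written for this illustration, not part of PATAS.

```python
# The four example messages, keyed by their number in the list above.
messages = {
    1: "Earn $1000 daily working from home! No experience needed.",
    2: "Make cash fast! Work remotely, no skills required.",
    3: "Get income from home! Start today, no qualifications!",
    4: "Work from home and earn money! Apply now!",
}

keyword_patterns = ["earn money", "work from home", "make cash", "get income"]


def keyword_hits(pattern, messages):
    """Return the ids of messages that contain the pattern as a substring."""
    return [msg_id for msg_id, text in messages.items() if pattern in text.lower()]


hits = {pattern: keyword_hits(pattern, messages) for pattern in keyword_patterns}

# Messages that no keyword pattern catches at all.
unmatched = [
    msg_id for msg_id in messages
    if not any(msg_id in ids for ids in hits.values())
]

for pattern, ids in hits.items():
    print(f"{pattern!r}: {ids}")
print("unmatched:", unmatched)
```

Each pattern covers only the exact wording it was written for, which is why small changes such as "working from home" are enough to slip past.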
Cluster found:
- All 4 messages cluster together (semantic similarity > 0.75)
LLM Analysis:
- Semantic pattern: "Unrealistic work-from-home income promises"
- Similarity reason: "All promise easy money from home work, use urgency, require no qualifications"
- Key concepts: ["work from home", "earn money", "no experience", "fast/quick"]
- Variations: ["earn money", "make cash", "get income", "work from home", "remote work"]
Rule generated:
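SELECT id, is_spam FROM messages
WHERE (
    LOWER(text) LIKE '%work from home%' OR
    LOWER(text) LIKE '%remote work%' OR
    LOWER(text) LIKE '%earn money%' OR
    LOWER(text) LIKE '%make cash%' OR
    LOWER(text) LIKE '%get income%'
) AND (
    LOWER(text) LIKE '%no experience%' OR
    LOWER(text) LIKE '%no skills%' OR
    LOWER(text) LIKE '%no qualifications%'
)
Result: one rule replaces four keyword patterns. As written it matches messages 2 and 3; message 1 says "working from home" and message 4 has no "no experience" phrase, so the variation lists (or the AND condition) need adjusting before the rule covers the whole cluster.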
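It is worth running a generated rule against the cluster it came from before trusting it. The sketch below rebuilds the rule above from two phrase lists and runs it against the four example messages in an in-memory SQLite database. The helpers `like_group` and `build_rule` and the table layout are assumptions made for this illustration; PATAS's own rule generator may work differently.

```python
import sqlite3


def like_group(column, phrases):
    """OR together one LIKE test per phrase.
    Single quotes are doubled for SQL; % and _ inside a phrase would still
    act as LIKE wildcards and are not handled here."""
    tests = []
    for phrase in phrases:
        escaped = phrase.lower().replace("'", "''")
        tests.append(f"LOWER({column}) LIKE '%{escaped}%'")
    return "(" + " OR ".join(tests) + ")"


def build_rule(topic_phrases, qualifier_phrases):
    """A message must match one topic phrase AND one qualifier phrase."""
    return (
        "SELECT id, is_spam FROM messages WHERE "
        + like_group("text", topic_phrases)
        + " AND "
        + like_group("text", qualifier_phrases)
        + " ORDER BY id"
    )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages (id INTEGER PRIMARY KEY, text TEXT, is_spam INTEGER)"
)
conn.executemany(
    "INSERT INTO messages VALUES (?, ?, 1)",
    [
        (1, "Earn $1000 daily working from home! No experience needed."),
        (2, "Make cash fast! Work remotely, no skills required."),
        (3, "Get income from home! Start today, no qualifications!"),
        (4, "Work from home and earn money! Apply now!"),
    ],
)

rule = build_rule(
    ["work from home", "remote work", "earn money", "make cash", "get income"],
    ["no experience", "no skills", "no qualifications"],
)
matched_ids = [row[0] for row in conn.execute(rule)]
print(matched_ids)  # [2, 3]
```

A check like this shows which cluster members a rule misses, so the phrase lists can be extended before the rule is deployed.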
File: app/v2_semantic_mining.py
from app.v2_semantic_mining import SemanticPatternMiner

miner = SemanticPatternMiner(
    db=db_session,
    embedding_provider=embedding_engine,
    llm_engine=llm_engine,
)
# Must be awaited from inside an async function
result = await miner.mine_semantic_patterns(
    days=7,
    min_cluster_size=3,
    similarity_threshold=0.75,
)

File: app/v2_embedding_engine.py
Supports:
- OpenAI: text-embedding-3-small (default)
- Local: sentence-transformers/all-MiniLM-L6-v2
from app.v2_embedding_engine import create_embedding_engine

embedding_engine = create_embedding_engine(
    provider="openai",  # or "local"
    api_key=api_key,
    model="text-embedding-3-small",
)

File: app/v2_llm_engine.py
LLM prompt now focuses on:
- Semantic patterns (meaning, not words)
- Variations (synonyms, paraphrases)
- Key concepts (core ideas that define the pattern)
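A prompt with this focus could be assembled roughly as follows. The actual prompt in app/v2_llm_engine.py is not shown in this document, so the wording, the JSON field names (borrowed from the LLM Analysis example above), and both helper functions are assumptions for illustration only.

```python
import json

# Field names taken from the LLM Analysis example; the real schema may differ.
RESPONSE_FIELDS = ["semantic_pattern", "similarity_reason", "key_concepts", "variations"]


def build_cluster_prompt(messages):
    """Ask the LLM to describe a cluster by meaning rather than wording."""
    numbered = "\n".join(f"{i}. {text}" for i, text in enumerate(messages, start=1))
    return (
        "These messages were grouped by embedding similarity.\n"
        "Describe the meaning they share, not the exact words they use.\n"
        "Reply with a JSON object with the keys: "
        + ", ".join(RESPONSE_FIELDS)
        + ".\n\n"
        + numbered
    )


def parse_cluster_analysis(raw_reply):
    """Parse the LLM reply and fail loudly if an expected field is missing."""
    data = json.loads(raw_reply)
    missing = [field for field in RESPONSE_FIELDS if field not in data]
    if missing:
        raise ValueError(f"LLM reply is missing fields: {missing}")
    return data


prompt = build_cluster_prompt([
    "Make cash fast! Work remotely, no skills required.",
    "Get income from home! Start today, no qualifications!",
])

# Hand-written stand-in for an LLM reply, used only to exercise the parser.
sample_reply = json.dumps({
    "semantic_pattern": "Unrealistic work-from-home income promises",
    "similarity_reason": "All promise easy money from home work",
    "key_concepts": ["work from home", "earn money", "no experience"],
    "variations": ["earn money", "make cash", "get income"],
})
analysis = parse_cluster_analysis(sample_reply)
print(analysis["variations"])  # ['earn money', 'make cash', 'get income']
```

Validating the reply before use keeps a malformed LLM answer from silently producing an empty or broken rule.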
# In pattern mining pipeline
from app.v2_semantic_mining import SemanticPatternMiner
from app.v2_embedding_engine import create_embedding_engine

# Create embedding engine
embedding_engine = create_embedding_engine(
    provider=settings.embedding_provider,  # "openai" or "local"
    api_key=settings.openai_api_key,
)

# Create semantic miner
semantic_miner = SemanticPatternMiner(
    db=db_session,
    embedding_provider=embedding_engine,
    llm_engine=llm_engine,
)

# Run semantic mining (must be awaited from inside an async function)
result = await semantic_miner.mine_semantic_patterns(
    days=7,
    min_cluster_size=3,
    similarity_threshold=0.75,
)

Add to app/config.py:
embedding_provider: str = "openai" # "openai", "local", "none"
embedding_model: str = "text-embedding-3-small"
semantic_similarity_threshold: float = 0.75
semantic_min_cluster_size: int = 3

✅ Before: Pattern "earn money" misses "make cash"
✅ After: Semantic pattern catches all variations
✅ Before: 10 patterns for same spam intent (different words)
✅ After: 1 semantic pattern covers all variations
✅ Before: New wording = new pattern needed
✅ After: Semantic pattern catches new wording automatically
✅ Before: Spammer uses LLM to generate variations → bypasses patterns
✅ After: Semantic similarity catches LLM-generated variations
- Requires Embeddings
  - Needs an embedding provider (OpenAI API or local model)
  - Adds latency and cost
- Clustering Quality
  - Depends on embedding quality
  - Similarity threshold needs tuning
- LLM Dependency
  - Requires an LLM for pattern analysis
  - Can be slow for large clusters
- Use Both Approaches
  - Keyword patterns for exact matches (URLs, phone numbers)
  - Semantic patterns for meaning-based detection
- Tune Thresholds
  - similarity_threshold: higher = stricter clustering (tighter clusters, so fewer reach min_cluster_size)
  - min_cluster_size: minimum messages per pattern
- Monitor Quality
  - Check false positive rate
  - Review semantic patterns manually
  - Adjust thresholds based on results
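Monitoring can start from a few standard metrics computed on labeled messages. The sketch below assumes each message yields a pair of booleans (did the rule match, is the message really spam); the function name `rule_quality` and the sample data are made up for illustration.

```python
def rule_quality(results):
    """Compute basic quality metrics for one rule.

    `results` is an iterable of (rule_matched, is_spam) boolean pairs,
    one per labeled message.
    """
    tp = fp = tn = fn = 0
    for rule_matched, is_spam in results:
        if rule_matched and is_spam:
            tp += 1  # spam correctly caught
        elif rule_matched and not is_spam:
            fp += 1  # legitimate message wrongly flagged
        elif not rule_matched and is_spam:
            fn += 1  # spam the rule missed
        else:
            tn += 1  # legitimate message left alone
    return {
        # Share of legitimate messages that the rule wrongly flags.
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
        # Share of flagged messages that really are spam.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        # Share of spam messages that the rule catches.
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical labeled sample: 2 true positives, 1 false positive,
# 2 true negatives, 1 false negative.
sample = [
    (True, True), (True, True), (True, False),
    (False, False), (False, False), (False, True),
]
quality = rule_quality(sample)
print(quality)
```

A rule whose false positive rate climbs after a threshold change is a signal to raise `similarity_threshold` or to review the pattern manually.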
- ✅ Integrate semantic mining into main pipeline
- ✅ Test on real data with LLM-generated spam variations
- ✅ Compare results with keyword-based approach
- ✅ Tune thresholds based on false positive rate
Key Insight: Spammers use LLMs to generate variations. We use LLMs plus embeddings to catch those variations by meaning.