
Nick edited this page Mar 10, 2026 · 1 revision

Semantic Pattern Mining

Why we need semantic patterns, not keyword patterns


The Problem

Spammers are getting smarter. They use:

  • LLMs to generate variations - Same spam intent, different words
  • Synonyms and paraphrases - "earn money" → "make cash" → "get income"
  • Different phrasings - "Work from home!" → "Remote work available!" → "Home-based jobs!"
  • Language variations - English, Russian, mixed languages

Traditional keyword-based detection fails:

  • ❌ Pattern: "earn money" → Misses "make cash", "get income"
  • ❌ Pattern: "work from home" → Misses "remote work", "home-based jobs"
  • ❌ Pattern: "buy now" → Misses "purchase today", "order immediately"

The Solution: Semantic Pattern Mining

PATAS uses semantic similarity to find patterns by meaning, not exact words.

How It Works

  1. Generate Embeddings

    • Convert messages to vector embeddings (OpenAI, local models)
    • Embeddings capture semantic meaning, not just words
  2. Cluster by Similarity

    • Group messages with similar embeddings (cosine similarity)
    • Messages with same meaning cluster together, even with different words
  3. LLM Analysis

    • For each cluster, LLM analyzes:
      • What do they mean? (semantic pattern)
      • Why are they similar? (common concepts, intent)
      • What variations exist? (synonyms, paraphrases)
  4. Generate Semantic Rules

    • Create SQL rules that catch ALL variations
    • Use OR conditions for synonyms: LIKE '%earn money%' OR LIKE '%make cash%' OR LIKE '%get income%'
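Steps 1 and 2 above can be sketched in a few lines, assuming the embeddings are already computed. This is a minimal greedy single-pass clustering, not the actual implementation in PATAS (which may use a different clustering strategy):

```python
import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def cluster(embeddings, threshold=0.75):
    # Greedy clustering: each message joins the first cluster whose
    # representative (first member) exceeds the similarity threshold,
    # otherwise it starts a new cluster.
    clusters = []  # each cluster is a list of message indices
    for i, emb in enumerate(embeddings):
        for c in clusters:
            if cosine(embeddings[c[0]], emb) >= threshold:
                c.append(i)
                break
        else:
            clusters.append([i])
    return clusters
```

Messages whose embeddings point in nearly the same direction land in the same cluster even though they share no keywords; each resulting cluster is then handed to the LLM for analysis.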

Example

Input Messages

1. "Earn $1000 daily working from home! No experience needed."
2. "Make cash fast! Work remotely, no skills required."
3. "Get income from home! Start today, no qualifications!"
4. "Work from home and earn money! Apply now!"

Traditional Approach (Keywords)

Patterns found:

  • "earn money" (matches 1, 4)
  • "work from home" (matches 1, 4)
  • "make cash" (matches 2)
  • "get income" (matches 3)

Problem: 4 different patterns for the same spam intent!

Semantic Approach

Cluster found:

  • All 4 messages cluster together (semantic similarity > 0.75)

LLM Analysis:

  • Semantic pattern: "Unrealistic work-from-home income promises"
  • Similarity reason: "All promise easy money from home work, use urgency, require no qualifications"
  • Key concepts: ["work from home", "earn money", "no experience", "fast/quick"]
  • Variations: ["earn money", "make cash", "get income", "work from home", "remote work"]

Rule generated:

SELECT id, is_spam FROM messages 
WHERE (
  LOWER(text) LIKE '%work from home%' OR
  LOWER(text) LIKE '%remote work%' OR
  LOWER(text) LIKE '%earn money%' OR
  LOWER(text) LIKE '%make cash%' OR
  LOWER(text) LIKE '%get income%'
) AND (
  LOWER(text) LIKE '%no experience%' OR
  LOWER(text) LIKE '%no skills%' OR
  LOWER(text) LIKE '%no qualifications%'
)

Result: ✅ One rule catches all variations!
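Turning the LLM's variation lists into the SQL above is mechanical. A hypothetical helper (`build_rule` is illustrative, not part of the PATAS codebase) might look like:

```python
def build_rule(variations, requirements, table="messages"):
    # Join each term list into an OR group of case-insensitive LIKE
    # conditions, then AND the two groups together, mirroring the
    # generated rule shown above.
    def like_group(terms):
        return " OR\n  ".join(
            f"LOWER(text) LIKE '%{t.lower()}%'" for t in terms
        )
    return (
        f"SELECT id, is_spam FROM {table}\n"
        f"WHERE (\n  {like_group(variations)}\n"
        f") AND (\n  {like_group(requirements)}\n)"
    )

sql = build_rule(
    ["work from home", "remote work", "earn money", "make cash", "get income"],
    ["no experience", "no skills", "no qualifications"],
)
```

The AND between the two groups is what keeps the rule precise: a message must match both a money-from-home variation and a no-qualifications variation to be flagged.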


Implementation

1. Semantic Pattern Miner

File: app/v2_semantic_mining.py

from app.v2_semantic_mining import SemanticPatternMiner

miner = SemanticPatternMiner(
    db=db_session,
    embedding_provider=embedding_engine,
    llm_engine=llm_engine,
)

result = await miner.mine_semantic_patterns(
    days=7,
    min_cluster_size=3,
    similarity_threshold=0.75,
)

2. Embedding Engine

File: app/v2_embedding_engine.py

Supports:

  • OpenAI: text-embedding-3-small (default)
  • Local: sentence-transformers/all-MiniLM-L6-v2

from app.v2_embedding_engine import create_embedding_engine

embedding_engine = create_embedding_engine(
    provider="openai",  # or "local"
    api_key=api_key,
    model="text-embedding-3-small",
)

3. LLM Prompt Enhancement

File: app/v2_llm_engine.py

LLM prompt now focuses on:

  • Semantic patterns (meaning, not words)
  • Variations (synonyms, paraphrases)
  • Key concepts (core ideas that define the pattern)
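A cluster-analysis prompt along these lines could be structured as follows; the exact wording in app/v2_llm_engine.py may differ, and this template is only illustrative:

```python
# Illustrative prompt template asking the LLM for the four outputs
# described above: semantic pattern, similarity reason, key concepts,
# and variations.
CLUSTER_ANALYSIS_PROMPT = """You are analyzing a cluster of similar messages.

Messages:
{messages}

Return JSON with:
- "semantic_pattern": what these messages mean (intent, not wording)
- "similarity_reason": why they cluster together
- "key_concepts": core ideas that define the pattern
- "variations": synonyms and paraphrases observed in the cluster
"""

prompt = CLUSTER_ANALYSIS_PROMPT.format(
    messages="1. Earn $1000 daily working from home! No experience needed."
)
```

Asking for structured JSON keeps the downstream rule generator simple: the "variations" list feeds directly into the OR conditions of the generated SQL rule.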

Configuration

Enable Semantic Mining

# In pattern mining pipeline
from app.v2_semantic_mining import SemanticPatternMiner
from app.v2_embedding_engine import create_embedding_engine

# Create embedding engine
embedding_engine = create_embedding_engine(
    provider=settings.embedding_provider,  # "openai" or "local"
    api_key=settings.openai_api_key,
)

# Create semantic miner
semantic_miner = SemanticPatternMiner(
    db=db_session,
    embedding_provider=embedding_engine,
    llm_engine=llm_engine,
)

# Run semantic mining
result = await semantic_miner.mine_semantic_patterns(
    days=7,
    min_cluster_size=3,
    similarity_threshold=0.75,
)

Settings

Add to app/config.py:

embedding_provider: str = "openai"  # "openai", "local", "none"
embedding_model: str = "text-embedding-3-small"
semantic_similarity_threshold: float = 0.75
semantic_min_cluster_size: int = 3
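A sketch of how these fields might sit in app/config.py; the real settings class may use pydantic BaseSettings rather than a plain dataclass:

```python
from dataclasses import dataclass

@dataclass
class SemanticSettings:
    # Sketch only: field names and defaults match the fragment above,
    # but the surrounding class is an assumption.
    embedding_provider: str = "openai"  # "openai", "local", "none"
    embedding_model: str = "text-embedding-3-small"
    semantic_similarity_threshold: float = 0.75
    semantic_min_cluster_size: int = 3
```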

Benefits

1. Catches Variations

Before: Pattern "earn money" misses "make cash"
After: Semantic pattern catches all variations

2. Fewer Patterns

Before: 10 patterns for same spam intent (different words)
After: 1 semantic pattern covers all variations

3. Future-Proof

Before: New wording = new pattern needed
After: Semantic pattern catches new wording automatically

4. LLM-Resistant

Before: Spammers use LLMs to generate variations → bypass patterns
After: Semantic similarity catches LLM-generated variations


Limitations

  1. Requires Embeddings

    • Needs embedding provider (OpenAI API or local model)
    • Adds latency and cost
  2. Clustering Quality

    • Depends on embedding quality
    • Similarity threshold needs tuning
  3. LLM Dependency

    • Requires LLM for pattern analysis
    • Can be slow for large clusters

Best Practices

  1. Use Both Approaches

    • Keyword patterns for exact matches (URLs, phone numbers)
    • Semantic patterns for meaning-based detection
  2. Tune Thresholds

    • similarity_threshold: Higher = stricter clustering (fewer, more similar clusters)
    • min_cluster_size: Minimum messages per pattern
  3. Monitor Quality

    • Check false positive rate
    • Review semantic patterns manually
    • Adjust thresholds based on results
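The "monitor quality" step can be sketched as a quick offline check: estimate a candidate rule's false positive rate against a labeled sample before enabling it. `matches_rule` and `false_positive_rate` are hypothetical helpers, not part of PATAS:

```python
def false_positive_rate(messages, labels, matches_rule):
    # Fraction of ham (non-spam) messages the rule would wrongly flag.
    # labels: ground truth, True = spam.
    flagged_ham = sum(
        1 for msg, is_spam in zip(messages, labels)
        if matches_rule(msg) and not is_spam
    )
    total_ham = sum(1 for is_spam in labels if not is_spam)
    return flagged_ham / total_ham if total_ham else 0.0
```

If the rate is too high, raise similarity_threshold or tighten the rule's AND conditions and re-check.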

Next Steps

  1. Integrate semantic mining into main pipeline
  2. Test on real data with LLM-generated spam variations
  3. Compare results with keyword-based approach
  4. Tune thresholds based on false positive rate

Key Insight: Spammers use LLMs to generate variations. We use an LLM + embeddings to catch them all.
