
Nick edited this page Mar 10, 2026 · 1 revision

Semantic Pattern Mining

Why we need semantic patterns, not keyword patterns


The Problem

Spammers are getting smarter. They use:

  • LLMs to generate variations - Same spam intent, different words
  • Synonyms and paraphrases - "earn money" → "make cash" → "get income"
  • Different phrasings - "Work from home!" → "Remote work available!" → "Home-based jobs!"
  • Language variations - English, Russian, mixed languages

Traditional keyword-based detection fails:

  • ❌ Pattern: "earn money" → Misses "make cash", "get income"
  • ❌ Pattern: "work from home" → Misses "remote work", "home-based jobs"
  • ❌ Pattern: "buy now" → Misses "purchase today", "order immediately"

The Solution: Semantic Pattern Mining

PATAS uses semantic similarity to find patterns by meaning, not exact words.

How It Works

  1. Generate Embeddings

    • Convert messages to vector embeddings (OpenAI, local models)
    • Embeddings capture semantic meaning, not just words
  2. Cluster by Similarity

    • Group messages with similar embeddings (cosine similarity)
    • Messages with same meaning cluster together, even with different words
  3. LLM Analysis

    • For each cluster, LLM analyzes:
      • What do they mean? (semantic pattern)
      • Why are they similar? (common concepts, intent)
      • What variations exist? (synonyms, paraphrases)
  4. Generate Semantic Rules

    • Create SQL rules that catch ALL variations
    • Use OR conditions for synonyms: LIKE '%earn money%' OR LIKE '%make cash%' OR LIKE '%get income%'
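Steps 1 and 2 above can be sketched in a few lines, assuming the embeddings are already computed. This is a minimal greedy single-pass clustering, not the actual implementation in PATAS (which may use a different clustering strategy):

```python
import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def cluster(embeddings, threshold=0.75):
    # Greedy clustering: each message joins the first cluster whose
    # representative (first member) exceeds the similarity threshold,
    # otherwise it starts a new cluster.
    clusters = []  # each cluster is a list of message indices
    for i, emb in enumerate(embeddings):
        for c in clusters:
            if cosine(embeddings[c[0]], emb) >= threshold:
                c.append(i)
                break
        else:
            clusters.append([i])
    return clusters
```

Messages whose embeddings point in nearly the same direction land in the same cluster even though they share no keywords; each resulting cluster is then handed to the LLM for analysis.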

Example

Input Messages

1. "Earn $1000 daily working from home! No experience needed."
2. "Make cash fast! Work remotely, no skills required."
3. "Get income from home! Start today, no qualifications!"
4. "Work from home and earn money! Apply now!"

Traditional Approach (Keywords)

Patterns found:

  • "earn money" (matches 1, 4)
  • "work from home" (matches 1, 4)
  • "make cash" (matches 2)
  • "get income" (matches 3)

Problem: 4 different patterns for the same spam intent!

Semantic Approach

Cluster found:

  • All 4 messages cluster together (semantic similarity > 0.75)

LLM Analysis:

  • Semantic pattern: "Unrealistic work-from-home income promises"
  • Similarity reason: "All promise easy money from home work, use urgency, require no qualifications"
  • Key concepts: ["work from home", "earn money", "no experience", "fast/quick"]
  • Variations: ["earn money", "make cash", "get income", "work from home", "remote work"]

Rule generated:

SELECT id, is_spam FROM messages 
WHERE (
  LOWER(text) LIKE '%work from home%' OR
  LOWER(text) LIKE '%remote work%' OR
  LOWER(text) LIKE '%earn money%' OR
  LOWER(text) LIKE '%make cash%' OR
  LOWER(text) LIKE '%get income%'
) AND (
  LOWER(text) LIKE '%no experience%' OR
  LOWER(text) LIKE '%no skills%' OR
  LOWER(text) LIKE '%no qualifications%'
)

Result: ✅ One rule catches all variations!
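Turning the LLM's variation lists into the SQL above is mechanical. A hypothetical helper (`build_rule` is illustrative, not part of the PATAS codebase) might look like:

```python
def build_rule(variations, requirements, table="messages"):
    # Join each term list into an OR group of case-insensitive LIKE
    # conditions, then AND the two groups together, mirroring the
    # generated rule shown above.
    def like_group(terms):
        return " OR\n  ".join(
            f"LOWER(text) LIKE '%{t.lower()}%'" for t in terms
        )
    return (
        f"SELECT id, is_spam FROM {table}\n"
        f"WHERE (\n  {like_group(variations)}\n"
        f") AND (\n  {like_group(requirements)}\n)"
    )

sql = build_rule(
    ["work from home", "remote work", "earn money", "make cash", "get income"],
    ["no experience", "no skills", "no qualifications"],
)
```

The AND between the two groups is what keeps the rule precise: a message must match both a money-from-home variation and a no-qualifications variation to be flagged.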


Implementation

1. Semantic Pattern Miner

File: app/v2_semantic_mining.py

from app.v2_semantic_mining import SemanticPatternMiner

miner = SemanticPatternMiner(
    db=db_session,
    embedding_provider=embedding_engine,
    llm_engine=llm_engine,
)

result = await miner.mine_semantic_patterns(
    days=7,
    min_cluster_size=3,
    similarity_threshold=0.75,
)

2. Embedding Engine

File: app/v2_embedding_engine.py

Supports:

  • OpenAI: text-embedding-3-small (default)
  • Local: sentence-transformers/all-MiniLM-L6-v2

from app.v2_embedding_engine import create_embedding_engine

embedding_engine = create_embedding_engine(
    provider="openai",  # or "local"
    api_key=api_key,
    model="text-embedding-3-small",
)

3. LLM Prompt Enhancement

File: app/v2_llm_engine.py

LLM prompt now focuses on:

  • Semantic patterns (meaning, not words)
  • Variations (synonyms, paraphrases)
  • Key concepts (core ideas that define the pattern)
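A cluster-analysis prompt along these lines could be structured as follows; the exact wording in app/v2_llm_engine.py may differ, and this template is only illustrative:

```python
# Illustrative prompt template asking the LLM for the four outputs
# described above: semantic pattern, similarity reason, key concepts,
# and variations.
CLUSTER_ANALYSIS_PROMPT = """You are analyzing a cluster of similar messages.

Messages:
{messages}

Return JSON with:
- "semantic_pattern": what these messages mean (intent, not wording)
- "similarity_reason": why they cluster together
- "key_concepts": core ideas that define the pattern
- "variations": synonyms and paraphrases observed in the cluster
"""

prompt = CLUSTER_ANALYSIS_PROMPT.format(
    messages="1. Earn $1000 daily working from home! No experience needed."
)
```

Asking for structured JSON keeps the downstream rule generator simple: the "variations" list feeds directly into the OR conditions of the generated SQL rule.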

Configuration

Enable Semantic Mining

# In pattern mining pipeline
from app.v2_semantic_mining import SemanticPatternMiner
from app.v2_embedding_engine import create_embedding_engine

# Create embedding engine
embedding_engine = create_embedding_engine(
    provider=settings.embedding_provider,  # "openai" or "local"
    api_key=settings.openai_api_key,
)

# Create semantic miner
semantic_miner = SemanticPatternMiner(
    db=db_session,
    embedding_provider=embedding_engine,
    llm_engine=llm_engine,
)

# Run semantic mining
result = await semantic_miner.mine_semantic_patterns(
    days=7,
    min_cluster_size=3,
    similarity_threshold=0.75,
)

Settings

Add to app/config.py:

embedding_provider: str = "openai"  # "openai", "local", "none"
embedding_model: str = "text-embedding-3-small"
semantic_similarity_threshold: float = 0.75
semantic_min_cluster_size: int = 3
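A sketch of how these fields might sit in app/config.py; the real settings class may use pydantic BaseSettings rather than a plain dataclass:

```python
from dataclasses import dataclass

@dataclass
class SemanticSettings:
    # Sketch only: field names and defaults match the fragment above,
    # but the surrounding class is an assumption.
    embedding_provider: str = "openai"  # "openai", "local", "none"
    embedding_model: str = "text-embedding-3-small"
    semantic_similarity_threshold: float = 0.75
    semantic_min_cluster_size: int = 3
```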

Benefits

1. Catches Variations

Before: Pattern "earn money" misses "make cash"
After: Semantic pattern catches all variations

2. Fewer Patterns

Before: 10 patterns for same spam intent (different words)
After: 1 semantic pattern covers all variations

3. Future-Proof

Before: New wording = new pattern needed
After: Semantic pattern catches new wording automatically

4. LLM-Resistant

Before: Spammers use LLMs to generate variations → bypass patterns
After: Semantic similarity catches LLM-generated variations


Limitations

  1. Requires Embeddings

    • Needs embedding provider (OpenAI API or local model)
    • Adds latency and cost
  2. Clustering Quality

    • Depends on embedding quality
    • Similarity threshold needs tuning
  3. LLM Dependency

    • Requires LLM for pattern analysis
    • Can be slow for large clusters

Best Practices

  1. Use Both Approaches

    • Keyword patterns for exact matches (URLs, phone numbers)
    • Semantic patterns for meaning-based detection
  2. Tune Thresholds

    • similarity_threshold: Higher = stricter clustering (fewer, more similar clusters)
    • min_cluster_size: Minimum messages per pattern
  3. Monitor Quality

    • Check false positive rate
    • Review semantic patterns manually
    • Adjust thresholds based on results
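The "monitor quality" step can be sketched as a quick offline check: estimate a candidate rule's false positive rate against a labeled sample before enabling it. `matches_rule` and `false_positive_rate` are hypothetical helpers, not part of PATAS:

```python
def false_positive_rate(messages, labels, matches_rule):
    # Fraction of ham (non-spam) messages the rule would wrongly flag.
    # labels: ground truth, True = spam.
    flagged_ham = sum(
        1 for msg, is_spam in zip(messages, labels)
        if matches_rule(msg) and not is_spam
    )
    total_ham = sum(1 for is_spam in labels if not is_spam)
    return flagged_ham / total_ham if total_ham else 0.0
```

If the rate is too high, raise similarity_threshold or tighten the rule's AND conditions and re-check.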

Next Steps

  1. Integrate semantic mining into main pipeline
  2. Test on real data with LLM-generated spam variations
  3. Compare results with keyword-based approach
  4. Tune thresholds based on false positive rate

Key Insight: Spammers use LLMs to generate variations. We use an LLM + embeddings to catch them all.
