Skip to content

Degenerate Categories with ".*" Regex Can Create "Black Hole" Effect in Log Categorization #139140

@valeriy42

Description

@valeriy42

Summary

The log categorization algorithm can produce degenerate categories with regex pattern ".*" that match almost any message. Once such a category accumulates sufficient matches, it creates a "black hole" effect where it captures most new messages, preventing formation of new specific categories and severely degrading categorization quality.

Background

The categorize_text aggregation (introduced in 8.3.0) and ES|QL CATEGORIZE function use a token-based algorithm to group similar log messages. Each category maintains:

  • Base tokens: Original tokens from the first message (immutable)
  • Common unique tokens: Tokens present in all messages in the category (dynamic, can shrink)
  • Ordered common token range: Subsequence of base tokens that appear in order in all messages
  • Regex pattern: Generated from ordered common tokens for reverse search

The categorizer checks existing categories in descending order by match count before creating new categories.

Problem Description

Types of Degenerate Categories

A category becomes "degenerate" (regex ".*") when:

Type 1: Born Degenerate (Low Impact)

  • No tokens identified by analyzer (e.g., aggressive categorization_filters: [".*"])
  • Only matches other tokenless messages
  • Impact: Minimal - messages with tokens still form separate categories

Type 2A: Fully Degenerate (CRITICAL)

  • Has base tokens, but commonUniqueTokenIds becomes empty through gradual erosion
  • orderedCommonTokenBeginIndex == orderedCommonTokenEndIndex (empty ordered range)
  • Matches ANY message via matchesSearchForCategory(), bypassing similarity checks
  • Impact: SEVERE - Creates positive feedback loop

Type 2B: Partially Degenerate (Moderate Impact)

  • Has non-empty commonUniqueTokenIds but no ordered subsequence
  • Matches messages containing all common tokens in any order
  • Impact: Moderate - still constrained by requiring specific tokens

The "Black Hole" Effect (Type 2A)

Once a Type 2A degenerate category forms and accumulates more matches than specific categories:

  1. Priority placement: Checked first due to high numMatches
  2. Bypass protection: matchesSearchForCategory() returns true (empty loops always pass)
  3. Immediate acceptance: Returns at line 290 before checking other categories
  4. Match count increases: Category gets another match, stays at top
  5. Cycle repeats: More messages captured, specific categories starved

This creates a positive feedback loop where the degenerate category monopolizes incoming messages.

Root Cause Analysis

Code Flow

File: x-pack/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/aggs/categorization/TokenListCategorizer.java

// Line 238-241: Categories checked by match count (highest first)
for (int index = 0; index < categoriesByNumMatches.size(); ++index) {
    TokenListCategory compCategory = categoriesByNumMatches.get(index);
    
    // Line 250-255: Check if message matches category's reverse search
    boolean matchesSearch = compCategory.matchesSearchForCategory(
        workWeight, maxUnfilteredStringLen, workTokenUniqueIds, weightedTokenIds
    );
    
    // Line 256-273: Protection mechanisms (ONLY if matchesSearch == false)
    if (matchesSearch == false) {
        // Check token weight ranges
        // Check if adding would reduce common tokens below threshold
    }
    
    // Line 278-290: Accept match if matchesSearch OR high similarity
    if (matchesSearch || similarity > upperThreshold) {
        return addCategoryMatch(...); // EARLY RETURN
    }
}

File: x-pack/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/aggs/categorization/TokenListCategory.java

// Line 533-543: matchesSearchForCategory implementation
public boolean matchesSearchForCategory(...) {
    return (baseWeight == 0) == (otherBaseWeight == 0)
        && maxMatchingStringLen() >= otherUnfilteredStringLen
        && isMissingCommonTokenWeightZero(otherUniqueTokenIds)      // Empty loop if commonUniqueTokenIds is empty
        && containsCommonInOrderTokensInOrder(otherBaseTokenIds);   // Empty loop if ordered range is empty
}

The Critical Flaw

When commonUniqueTokenIds == [] and ordered range is empty:

  • isMissingCommonTokenWeightZero(): Loop doesn't execute (lines 554-568) → returns true
  • containsCommonInOrderTokensInOrder(): Loop doesn't execute (lines 576+) → returns true
  • Result: matchesSearch == true for ANY message
  • Protection at lines 256-273 is bypassed
  • similarity_threshold is ignored

How Degeneration Occurs

Gradual Erosion Path:

  1. Category starts: ["Error", "system", "failure", "module"]
  2. Similar messages merge at 70%+ similarity
  3. Common tokens gradually reduced: ["Error", "system"]
  4. matchesSearch becomes easier to satisfy (fewer required tokens)
  5. More diverse messages match via matchesSearch → bypass protection
  6. Eventually all common tokens eroded → Type 2A state
  7. Category now matches everything within length constraints

Impact Assessment

Severity: High

Affected Components:

  • categorize_text aggregation (Elasticsearch 8.3+)
  • ES|QL CATEGORIZE function
  • ML categorization jobs using similar algorithm

Symptoms:

  • Single category with regex ".*" accumulating most documents
  • Expected specific categories have very low match counts
  • Poor categorization quality despite well-structured logs
  • Issue worsens over time as degenerate category grows

Conditions:

  • More likely with similarity_threshold < 50%
  • Possible (though less likely) at default 70% with:
    • Highly diverse log sources
    • Long messages with 30%+ variation
    • Logs sharing common words across semantic types
    • Low initial token count per message

Reproduction Scenario

POST /test-logs/_search
{
  "aggs": {
    "categories": {
      "categorize_text": {
        "field": "message",
        "similarity_threshold": 50
      }
    }
  }
}

Test data sequence:

  1. "Error in system module A" → Category 1 created
  2. "Error in system module B" → Matches Category 1
  3. "Error system module C different" → Matches Category 1 (50%+)
  4. "system Error module D reversed" → Matches Category 1 (50%+)
  5. After many such merges → commonUniqueTokenIds becomes empty
  6. "Database connection timeout" → Matches Category 1 (via matchesSearch)
  7. "User authentication failed" → Matches Category 1 (via matchesSearch)
  8. New specific categories cannot form

Potential Solutions

Option 1: Prevent Early Return on Degenerate Categories (Recommended)

Modify TokenListCategorizer.computeCategory() to continue searching even if matchesSearch == true when category is degenerate:

// Line 278-290
if (matchesSearch || similarity > upperThreshold) {
    // NEW: If category is degenerate, don't return immediately
    if (compCategory.isDegenerate() && matchesSearch) {
        // Track as best candidate but continue checking other categories
        if (similarity > bestSoFarSimilarity) {
            bestSoFarIndex = index;
            bestSoFarSimilarity = similarity;
        }
        continue; // Don't return immediately
    }
    
    // Existing behavior for non-degenerate categories
    return addCategoryMatch(...);
}

Add to TokenListCategory:

public boolean isDegenerate() {
    return commonUniqueTokenIds.isEmpty() 
        || orderedCommonTokenBeginIndex == orderedCommonTokenEndIndex;
}

Pros:

  • Allows specific categories to compete fairly
  • Minimal algorithmic change
  • Preserves existing matching logic

Cons:

  • Degenerate categories can still match, just not preferentially
  • May increase categorization time slightly

Option 2: Stricter Token Erosion Protection

Prevent commonUniqueTokenIds from becoming completely empty:

// In updateCommonUniqueTokenIds()
if (outputIndex == 0 && initialSize > 0) {
    // Keep at least one token to prevent full degeneration
    // Use highest-weighted token from original set
    // Log warning about preventing degeneration
}

Pros:

  • Prevents Type 2A at source
  • Maintains some semantic meaning

Cons:

  • May force semantically unrelated messages into same category
  • Arbitrary token choice when none are truly common

Option 3: Track and Demote Degenerate Categories

Add degenerate flag and check it during match evaluation:

// Check specific categories before degenerate ones
for (category : nonDegenerateCategories) { ... }
for (category : degenerateCategories) { ... }

Pros:

  • Clear separation of behavior
  • Explicit handling of edge case

Cons:

  • Requires maintaining separate lists
  • More complex code structure

Option 4: Add max_regex_generality Parameter

Allow users to reject categories that become too general:

{
  "categorize_text": {
    "field": "message",
    "max_regex_generality": 0.9  // Reject if 90%+ of messages match
  }
}

Pros:

  • User control over quality vs quantity tradeoff
  • Non-breaking change

Cons:

  • Shifts responsibility to user
  • Hard to tune without understanding internals

Test Coverage Gap

Missing test scenarios:

  • Category degenerating to empty commonUniqueTokenIds
  • Behavior when degenerate category has high match count
  • Verification that new categories can still form
  • Edge case where multiple categories are degenerate

Recommended test:

public void testDegenerateCategoryDoesNotBlockNewCategories() {
    // 1. Create degenerate category with high match count
    // 2. Add message that should form new specific category
    // 3. Assert new category is created, not matched to degenerate
}

References

Key Files:

  • x-pack/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/aggs/categorization/TokenListCategorizer.java

    • Lines 238-310: Category matching logic
    • Lines 352-392: Category repositioning by match count
  • x-pack/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/aggs/categorization/TokenListCategory.java

    • Lines 261-299: updateCommonUniqueTokenIds() - Token erosion
    • Lines 310-409: updateOrderedCommonTokenIds() - Ordered range calculation
    • Lines 533-596: matchesSearchForCategory() - Match detection
  • x-pack/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/aggs/categorization/SerializableTokenListCategory.java

    • Lines 172-180: getRegex() - Regex generation logic

Related Documentation:

  • docs/reference/aggregations/search-aggregations-bucket-categorize-text-aggregation.md
  • Algorithm changed in 8.3.0 (TransportVersion: V_8_3_0)

Acceptance Criteria

  • Degenerate categories cannot monopolize incoming messages
  • Specific categories can form even when degenerate category exists with high match count
  • Existing categorization quality maintained for well-structured logs
  • Performance impact < 5% for typical workloads
  • Test coverage for degenerate category scenarios
  • Documentation updated with behavior explanation
  • Backport consideration for 8.x branches

Priority

Suggested: P2 (High)

While not affecting all users (requires specific conditions), the impact is severe when it occurs, potentially rendering categorization useless for affected datasets.

Metadata

Metadata

Assignees

No one assigned

    Labels

    :mlMachine learning>bugTeam:MLMeta label for the ML team

    Type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions