Degenerate Categories with `".*"` Regex Can Create "Black Hole" Effect in Log Categorization

## Summary

The log categorization algorithm can produce degenerate categories with regex pattern `".*"` that match almost any message. Once such a category accumulates sufficient matches, it creates a "black hole" effect where it captures most new messages, preventing formation of new specific categories and severely degrading categorization quality.

## Background

The `categorize_text` aggregation (introduced in 8.3.0) and ES|QL `CATEGORIZE` function use a token-based algorithm to group similar log messages. Each category maintains:
- **Base tokens**: Original tokens from the first message (immutable)
- **Common unique tokens**: Tokens present in all messages in the category (dynamic, can shrink)
- **Ordered common token range**: Subsequence of base tokens that appear in order in all messages
- **Regex pattern**: Generated from ordered common tokens for reverse search

The categorizer checks existing categories in **descending order by match count** before creating new categories.

## Problem Description

### Types of Degenerate Categories

A category becomes "degenerate" (regex `".*"`) when:

**Type 1: Born Degenerate** (Low Impact)
- No tokens identified by analyzer (e.g., aggressive `categorization_filters: [".*"]`)
- Only matches other tokenless messages
- Impact: Minimal - messages with tokens still form separate categories

**Type 2A: Fully Degenerate** (CRITICAL)
- Has base tokens, but `commonUniqueTokenIds` becomes empty through gradual erosion
- `orderedCommonTokenBeginIndex == orderedCommonTokenEndIndex` (empty ordered range)
- Matches **ANY** message via `matchesSearchForCategory()`, bypassing similarity checks
- Impact: **SEVERE** - Creates positive feedback loop

**Type 2B: Partially Degenerate** (Moderate Impact)
- Has non-empty `commonUniqueTokenIds` but no ordered subsequence
- Matches messages containing all common tokens in any order
- Impact: Moderate - still constrained by requiring specific tokens

### The "Black Hole" Effect (Type 2A)

Once a Type 2A degenerate category forms and accumulates more matches than specific categories:

1. **Priority placement**: Checked first due to high `numMatches`
2. **Bypass protection**: `matchesSearchForCategory()` returns `true` (empty loops always pass)
3. **Immediate acceptance**: Returns at line 290 before checking other categories
4. **Match count increases**: Category gets another match, stays at top
5. **Cycle repeats**: More messages captured, specific categories starved

This creates a **positive feedback loop** where the degenerate category monopolizes incoming messages.

## Root Cause Analysis

### Code Flow

**File**: `x-pack/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/aggs/categorization/TokenListCategorizer.java`

```java
// Line 238-241: Categories checked by match count (highest first)
for (int index = 0; index < categoriesByNumMatches.size(); ++index) {
    TokenListCategory compCategory = categoriesByNumMatches.get(index);
    
    // Line 250-255: Check if message matches category's reverse search
    boolean matchesSearch = compCategory.matchesSearchForCategory(
        workWeight, maxUnfilteredStringLen, workTokenUniqueIds, weightedTokenIds
    );
    
    // Line 256-273: Protection mechanisms (ONLY if matchesSearch == false)
    if (matchesSearch == false) {
        // Check token weight ranges
        // Check if adding would reduce common tokens below threshold
    }
    
    // Line 278-290: Accept match if matchesSearch OR high similarity
    if (matchesSearch || similarity > upperThreshold) {
        return addCategoryMatch(...); // EARLY RETURN
    }
}
```

**File**: `x-pack/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/aggs/categorization/TokenListCategory.java`

```java
// Line 533-543: matchesSearchForCategory implementation
public boolean matchesSearchForCategory(...) {
    return (baseWeight == 0) == (otherBaseWeight == 0)
        && maxMatchingStringLen() >= otherUnfilteredStringLen
        && isMissingCommonTokenWeightZero(otherUniqueTokenIds)      // Empty loop if commonUniqueTokenIds is empty
        && containsCommonInOrderTokensInOrder(otherBaseTokenIds);   // Empty loop if ordered range is empty
}
```

### The Critical Flaw

When `commonUniqueTokenIds == []` and ordered range is empty:
- `isMissingCommonTokenWeightZero()`: Loop doesn't execute (lines 554-568) → returns `true`
- `containsCommonInOrderTokensInOrder()`: Loop doesn't execute (lines 576+) → returns `true`
- Result: `matchesSearch == true` for ANY message
- Protection at lines 256-273 is **bypassed**
- `similarity_threshold` is **ignored**

### How Degeneration Occurs

**Gradual Erosion Path:**
1. Category starts: `["Error", "system", "failure", "module"]`
2. Similar messages merge at 70%+ similarity
3. Common tokens gradually reduced: `["Error", "system"]`
4. `matchesSearch` becomes easier to satisfy (fewer required tokens)
5. More diverse messages match via `matchesSearch` → bypass protection
6. Eventually all common tokens eroded → Type 2A state
7. Category now matches everything within length constraints

## Impact Assessment

### Severity: High

**Affected Components:**
- `categorize_text` aggregation (Elasticsearch 8.3+)
- ES|QL `CATEGORIZE` function
- ML categorization jobs using similar algorithm

**Symptoms:**
- Single category with regex `".*"` accumulating most documents
- Expected specific categories have very low match counts
- Poor categorization quality despite well-structured logs
- Issue worsens over time as degenerate category grows

**Conditions:**
- More likely with `similarity_threshold` < 50%
- Possible (though less likely) at default 70% with:
  - Highly diverse log sources
  - Long messages with 30%+ variation
  - Logs sharing common words across semantic types
  - Low initial token count per message

## Reproduction Scenario

```json
POST /test-logs/_search
{
  "aggs": {
    "categories": {
      "categorize_text": {
        "field": "message",
        "similarity_threshold": 50
      }
    }
  }
}
```

**Test data sequence:**
1. `"Error in system module A"` → Category 1 created
2. `"Error in system module B"` → Matches Category 1
3. `"Error system module C different"` → Matches Category 1 (50%+)
4. `"system Error module D reversed"` → Matches Category 1 (50%+)
5. After many such merges → `commonUniqueTokenIds` becomes empty
6. `"Database connection timeout"` → Matches Category 1 (via `matchesSearch`)
7. `"User authentication failed"` → Matches Category 1 (via `matchesSearch`)
8. New specific categories cannot form

## Potential Solutions

### Option 1: Prevent Early Return on Degenerate Categories (Recommended)

Modify `TokenListCategorizer.computeCategory()` to continue searching even if `matchesSearch == true` when category is degenerate:

```java
// Line 278-290
if (matchesSearch || similarity > upperThreshold) {
    // NEW: If category is degenerate, don't return immediately
    if (compCategory.isDegenerate() && matchesSearch) {
        // Track as best candidate but continue checking other categories
        if (similarity > bestSoFarSimilarity) {
            bestSoFarIndex = index;
            bestSoFarSimilarity = similarity;
        }
        continue; // Don't return immediately
    }
    
    // Existing behavior for non-degenerate categories
    return addCategoryMatch(...);
}
```

Add to `TokenListCategory`:
```java
public boolean isDegenerate() {
    return commonUniqueTokenIds.isEmpty() 
        || orderedCommonTokenBeginIndex == orderedCommonTokenEndIndex;
}
```

**Pros:**
- Allows specific categories to compete fairly
- Minimal algorithmic change
- Preserves existing matching logic

**Cons:**
- Degenerate categories can still match, just not preferentially
- May increase categorization time slightly

### Option 2: Stricter Token Erosion Protection

Prevent `commonUniqueTokenIds` from becoming completely empty:

```java
// In updateCommonUniqueTokenIds()
if (outputIndex == 0 && initialSize > 0) {
    // Keep at least one token to prevent full degeneration
    // Use highest-weighted token from original set
    // Log warning about preventing degeneration
}
```

**Pros:**
- Prevents Type 2A at source
- Maintains some semantic meaning

**Cons:**
- May force semantically unrelated messages into same category
- Arbitrary token choice when none are truly common

### Option 3: Track and Demote Degenerate Categories

Add degenerate flag and check it during match evaluation:

```java
// Check specific categories before degenerate ones
for (category : nonDegenerateCategories) { ... }
for (category : degenerateCategories) { ... }
```

**Pros:**
- Clear separation of behavior
- Explicit handling of edge case

**Cons:**
- Requires maintaining separate lists
- More complex code structure

### Option 4: Add `max_regex_generality` Parameter

Allow users to reject categories that become too general:

```json
{
  "categorize_text": {
    "field": "message",
    "max_regex_generality": 0.9  // Reject if 90%+ of messages match
  }
}
```

**Pros:**
- User control over quality vs quantity tradeoff
- Non-breaking change

**Cons:**
- Shifts responsibility to user
- Hard to tune without understanding internals

## Test Coverage Gap

**Missing test scenarios:**
- Category degenerating to empty `commonUniqueTokenIds`
- Behavior when degenerate category has high match count
- Verification that new categories can still form
- Edge case where multiple categories are degenerate

**Recommended test:**
```java
public void testDegenerateCategoryDoesNotBlockNewCategories() {
    // 1. Create degenerate category with high match count
    // 2. Add message that should form new specific category
    // 3. Assert new category is created, not matched to degenerate
}
```

## References

**Key Files:**
- `x-pack/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/aggs/categorization/TokenListCategorizer.java`
  - Lines 238-310: Category matching logic
  - Lines 352-392: Category repositioning by match count
  
- `x-pack/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/aggs/categorization/TokenListCategory.java`
  - Lines 261-299: `updateCommonUniqueTokenIds()` - Token erosion
  - Lines 310-409: `updateOrderedCommonTokenIds()` - Ordered range calculation
  - Lines 533-596: `matchesSearchForCategory()` - Match detection
  
- `x-pack/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/aggs/categorization/SerializableTokenListCategory.java`
  - Lines 172-180: `getRegex()` - Regex generation logic

**Related Documentation:**
- `docs/reference/aggregations/search-aggregations-bucket-categorize-text-aggregation.md`
- Algorithm changed in 8.3.0 (TransportVersion: `V_8_3_0`)

## Acceptance Criteria

- [ ] Degenerate categories cannot monopolize incoming messages
- [ ] Specific categories can form even when degenerate category exists with high match count
- [ ] Existing categorization quality maintained for well-structured logs
- [ ] Performance impact < 5% for typical workloads
- [ ] Test coverage for degenerate category scenarios
- [ ] Documentation updated with behavior explanation
- [ ] Backport consideration for 8.x branches

## Priority

**Suggested: P2 (High)**

While not affecting all users (requires specific conditions), the impact is severe when it occurs, potentially rendering categorization useless for affected datasets.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Degenerate Categories with `".*"` Regex Can Create "Black Hole" Effect in Log Categorization #139140

Summary

Background

Problem Description

Types of Degenerate Categories

The "Black Hole" Effect (Type 2A)

Root Cause Analysis

Code Flow

The Critical Flaw

How Degeneration Occurs

Impact Assessment

Severity: High

Reproduction Scenario

Potential Solutions

Option 1: Prevent Early Return on Degenerate Categories (Recommended)

Option 2: Stricter Token Erosion Protection

Option 3: Track and Demote Degenerate Categories

Option 4: Add `max_regex_generality` Parameter

Test Coverage Gap

References

Acceptance Criteria

Priority

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Degenerate Categories with ".*" Regex Can Create "Black Hole" Effect in Log Categorization #139140

Description

Summary

Background

Problem Description

Types of Degenerate Categories

The "Black Hole" Effect (Type 2A)

Root Cause Analysis

Code Flow

The Critical Flaw

How Degeneration Occurs

Impact Assessment

Severity: High

Reproduction Scenario

Potential Solutions

Option 1: Prevent Early Return on Degenerate Categories (Recommended)

Option 2: Stricter Token Erosion Protection

Option 3: Track and Demote Degenerate Categories

Option 4: Add max_regex_generality Parameter

Test Coverage Gap

References

Acceptance Criteria

Priority

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

Degenerate Categories with `".*"` Regex Can Create "Black Hole" Effect in Log Categorization #139140

Option 4: Add `max_regex_generality` Parameter