Summary
The log categorization algorithm can produce degenerate categories with regex pattern ".*" that match almost any message. Once such a category accumulates sufficient matches, it creates a "black hole" effect where it captures most new messages, preventing formation of new specific categories and severely degrading categorization quality.
Background
The categorize_text aggregation (introduced in 8.3.0) and ES|QL CATEGORIZE function use a token-based algorithm to group similar log messages. Each category maintains:
- Base tokens: Original tokens from the first message (immutable)
- Common unique tokens: Tokens present in all messages in the category (dynamic, can shrink)
- Ordered common token range: Subsequence of base tokens that appear in order in all messages
- Regex pattern: Generated from ordered common tokens for reverse search
The categorizer checks existing categories in descending order by match count before creating new categories.
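For orientation, here is a minimal sketch of the per-category state described above. It is illustrative only: the field names mirror the identifiers referenced later in this issue (commonUniqueTokenIds, orderedCommonTokenBeginIndex, etc.), but the actual TokenListCategory class holds more state and may differ in detail.
// Illustrative sketch only -- not the actual TokenListCategory implementation.
import java.util.List;
import java.util.Set;

class CategoryStateSketch {
    List<Integer> baseTokenIds;         // tokens of the first message (immutable)
    Set<Integer> commonUniqueTokenIds;  // tokens present in every matched message (can only shrink)
    int orderedCommonTokenBeginIndex;   // start of the in-order common subsequence within baseTokenIds
    int orderedCommonTokenEndIndex;     // end (exclusive); begin == end means the range is empty
    long numMatches;                    // drives the descending check order in the categorizer

    // The regex shown to users is generated from the ordered common tokens;
    // with an empty ordered range it collapses to ".*".
}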
Problem Description
Types of Degenerate Categories
A category becomes "degenerate" (regex ".*") when:
Type 1: Born Degenerate (Low Impact)
- No tokens identified by the analyzer (e.g., an aggressive categorization_filters: [".*"])
- Only matches other tokenless messages
- Impact: Minimal - messages with tokens still form separate categories
Type 2A: Fully Degenerate (CRITICAL)
- Has base tokens, but commonUniqueTokenIds becomes empty through gradual erosion
- orderedCommonTokenBeginIndex == orderedCommonTokenEndIndex (empty ordered range)
- Matches ANY message via matchesSearchForCategory(), bypassing similarity checks
- Impact: SEVERE - creates a positive feedback loop
Type 2B: Partially Degenerate (Moderate Impact)
- Has non-empty commonUniqueTokenIds but no ordered subsequence
- Matches messages containing all common tokens in any order
- Impact: Moderate - still constrained by requiring specific tokens
The "Black Hole" Effect (Type 2A)
Once a Type 2A degenerate category forms and accumulates more matches than specific categories:
- Priority placement: checked first due to high numMatches
- Bypass protection: matchesSearchForCategory() returns true (empty loops always pass)
- Immediate acceptance: returns at line 290 before checking other categories
- Match count increases: Category gets another match, stays at top
- Cycle repeats: More messages captured, specific categories starved
This creates a positive feedback loop where the degenerate category monopolizes incoming messages.
Root Cause Analysis
Code Flow
File: x-pack/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/aggs/categorization/TokenListCategorizer.java
// Line 238-241: Categories checked by match count (highest first)
for (int index = 0; index < categoriesByNumMatches.size(); ++index) {
    TokenListCategory compCategory = categoriesByNumMatches.get(index);

    // Line 250-255: Check if message matches category's reverse search
    boolean matchesSearch = compCategory.matchesSearchForCategory(
        workWeight, maxUnfilteredStringLen, workTokenUniqueIds, weightedTokenIds
    );

    // Line 256-273: Protection mechanisms (ONLY if matchesSearch == false)
    if (matchesSearch == false) {
        // Check token weight ranges
        // Check if adding would reduce common tokens below threshold
    }

    // Line 278-290: Accept match if matchesSearch OR high similarity
    if (matchesSearch || similarity > upperThreshold) {
        return addCategoryMatch(...); // EARLY RETURN
    }
}
File: x-pack/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/aggs/categorization/TokenListCategory.java
// Line 533-543: matchesSearchForCategory implementation
public boolean matchesSearchForCategory(...) {
    return (baseWeight == 0) == (otherBaseWeight == 0)
        && maxMatchingStringLen() >= otherUnfilteredStringLen
        && isMissingCommonTokenWeightZero(otherUniqueTokenIds)      // Empty loop if commonUniqueTokenIds is empty
        && containsCommonInOrderTokensInOrder(otherBaseTokenIds);   // Empty loop if ordered range is empty
}
The Critical Flaw
When commonUniqueTokenIds is empty and the ordered token range is empty:
- isMissingCommonTokenWeightZero(): loop doesn't execute (lines 554-568) → returns true
- containsCommonInOrderTokensInOrder(): loop doesn't execute (lines 576+) → returns true
- Result: matchesSearch == true for ANY message
- The protection at lines 256-273 is bypassed
- similarity_threshold is ignored
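To make the vacuous-truth behavior concrete, here is a simplified, self-contained sketch (not the actual Elasticsearch code) showing how an "every required token is present" check trivially succeeds when the set of required tokens is empty:
// Simplified illustration of the vacuous-truth problem; not the real methods.
import java.util.Collection;
import java.util.Collections;

class VacuousMatchSketch {
    // Stand-in for isMissingCommonTokenWeightZero(): "every required token is present".
    static boolean allRequiredTokensPresent(Collection<Integer> required, Collection<Integer> message) {
        for (int tokenId : required) {
            if (message.contains(tokenId) == false) {
                return false;
            }
        }
        return true; // reached immediately when 'required' is empty
    }

    public static void main(String[] args) {
        Collection<Integer> noRequiredTokens = Collections.emptySet();
        Collection<Integer> unrelatedMessage = Collections.singleton(42);
        // Prints true: with no required tokens, ANY message "matches".
        System.out.println(allRequiredTokensPresent(noRequiredTokens, unrelatedMessage));
    }
}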
How Degeneration Occurs
Gradual Erosion Path:
- Category starts with base tokens ["Error", "system", "failure", "module"]
- Similar messages merge at 70%+ similarity
- Common tokens are gradually reduced to ["Error", "system"]
- matchesSearch becomes easier to satisfy (fewer required tokens)
- More diverse messages match via matchesSearch → bypass protection
- Eventually all common tokens are eroded → Type 2A state
- The category now matches everything within length constraints
Impact Assessment
Severity: High
Affected Components:
- categorize_text aggregation (Elasticsearch 8.3+)
- ES|QL CATEGORIZE function
- ML categorization jobs using a similar algorithm
Symptoms:
- A single category with regex ".*" accumulating most documents
- Expected specific categories have very low match counts
- Poor categorization quality despite well-structured logs
- Issue worsens over time as degenerate category grows
Conditions:
- More likely with similarity_threshold < 50%
- Possible (though less likely) at the default of 70% with:
- Highly diverse log sources
- Long messages with 30%+ variation
- Logs sharing common words across semantic types
- Low initial token count per message
Reproduction Scenario
POST /test-logs/_search
{
  "aggs": {
    "categories": {
      "categorize_text": {
        "field": "message",
        "similarity_threshold": 50
      }
    }
  }
}
Test data sequence:
"Error in system module A"→ Category 1 created"Error in system module B"→ Matches Category 1"Error system module C different"→ Matches Category 1 (50%+)"system Error module D reversed"→ Matches Category 1 (50%+)- After many such merges →
commonUniqueTokenIdsbecomes empty "Database connection timeout"→ Matches Category 1 (viamatchesSearch)"User authentication failed"→ Matches Category 1 (viamatchesSearch)- New specific categories cannot form
Potential Solutions
Option 1: Prevent Early Return on Degenerate Categories (Recommended)
Modify TokenListCategorizer.computeCategory() to continue searching even if matchesSearch == true when category is degenerate:
// Line 278-290
if (matchesSearch || similarity > upperThreshold) {
    // NEW: If category is degenerate, don't return immediately
    if (compCategory.isDegenerate() && matchesSearch) {
        // Track as best candidate but continue checking other categories
        if (similarity > bestSoFarSimilarity) {
            bestSoFarIndex = index;
            bestSoFarSimilarity = similarity;
        }
        continue; // Don't return immediately
    }
    // Existing behavior for non-degenerate categories
    return addCategoryMatch(...);
}
Add to TokenListCategory:
public boolean isDegenerate() {
    return commonUniqueTokenIds.isEmpty()
        || orderedCommonTokenBeginIndex == orderedCommonTokenEndIndex;
}
Pros:
- Allows specific categories to compete fairly
- Minimal algorithmic change
- Preserves existing matching logic
Cons:
- Degenerate categories can still match, just not preferentially
- May increase categorization time slightly
Option 2: Stricter Token Erosion Protection
Prevent commonUniqueTokenIds from becoming completely empty:
// In updateCommonUniqueTokenIds()
if (outputIndex == 0 && initialSize > 0) {
    // Keep at least one token to prevent full degeneration
    // Use highest-weighted token from original set
    // Log warning about preventing degeneration
}
Pros:
- Prevents Type 2A at source
- Maintains some semantic meaning
Cons:
- May force semantically unrelated messages into same category
- Arbitrary token choice when none are truly common
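A minimal sketch of the Option 2 idea. Names here are hypothetical (intersectWithFloor and the highestWeightedBaseToken parameter are invented for illustration and do not exist in the codebase):
// Hypothetical sketch of Option 2; not the actual updateCommonUniqueTokenIds() code.
import java.util.Set;
import java.util.TreeSet;

class ErosionGuardSketch {
    // Intersect the current common tokens with the new message, but never let the set
    // become empty: fall back to one anchor token chosen from the original base tokens.
    static Set<Integer> intersectWithFloor(Set<Integer> commonTokens,
                                           Set<Integer> messageTokens,
                                           int highestWeightedBaseToken /* hypothetical choice */) {
        Set<Integer> result = new TreeSet<>(commonTokens);
        result.retainAll(messageTokens);
        if (result.isEmpty() && commonTokens.isEmpty() == false) {
            // Keep at least one token so the category never fully degenerates (Type 2A).
            result.add(highestWeightedBaseToken);
        }
        return result;
    }
}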
Option 3: Track and Demote Degenerate Categories
Add degenerate flag and check it during match evaluation:
// Check specific categories before degenerate ones
for (TokenListCategory category : nonDegenerateCategories) { ... }
for (TokenListCategory category : degenerateCategories) { ... }
Pros:
- Clear separation of behavior
- Explicit handling of edge case
Cons:
- Requires maintaining separate lists
- More complex code structure
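One way to realize Option 3 without permanently maintaining two lists is to partition at match time. A sketch, assuming the isDegenerate() helper proposed in Option 1 (orderForMatching is a hypothetical name):
// Sketch of Option 3: give specific categories first chance to claim a message.
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

class DemoteDegenerateSketch {
    static <C> List<C> orderForMatching(List<C> categoriesByNumMatches, Predicate<C> isDegenerate) {
        List<C> specific = new ArrayList<>();
        List<C> degenerate = new ArrayList<>();
        for (C category : categoriesByNumMatches) {
            (isDegenerate.test(category) ? degenerate : specific).add(category);
        }
        // Degenerate categories only see the message if nothing specific matched first.
        specific.addAll(degenerate);
        return specific;
    }
}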
Option 4: Add max_regex_generality Parameter
Allow users to reject categories that become too general:
{
  "categorize_text": {
    "field": "message",
    "max_regex_generality": 0.9  // Reject if 90%+ of messages match
  }
}
Pros:
- User control over quality vs quantity tradeoff
- Non-breaking change
Cons:
- Shifts responsibility to user
- Hard to tune without understanding internals
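A rough sketch of how the proposed check could be evaluated. Both max_regex_generality and this helper are hypothetical; nothing like this exists in the current API:
// Hypothetical sketch for Option 4; max_regex_generality is a proposed, not existing, parameter.
class GeneralitySketch {
    // Reject (or demote) a category whose share of all categorized messages exceeds the cap,
    // e.g. maxRegexGenerality = 0.9 flags a category matching 90%+ of messages.
    static boolean exceedsGenerality(long categoryMatches, long totalMessages, double maxRegexGenerality) {
        if (totalMessages == 0) {
            return false;
        }
        return (double) categoryMatches / totalMessages > maxRegexGenerality;
    }
}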
Test Coverage Gap
Missing test scenarios:
- Category degenerating to empty commonUniqueTokenIds
- Behavior when a degenerate category has a high match count
- Verification that new categories can still form
- Edge case where multiple categories are degenerate
Recommended test:
public void testDegenerateCategoryDoesNotBlockNewCategories() {
    // 1. Create degenerate category with high match count
    // 2. Add message that should form new specific category
    // 3. Assert new category is created, not matched to degenerate
}
References
Key Files:
- x-pack/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/aggs/categorization/TokenListCategorizer.java
  - Lines 238-310: Category matching logic
  - Lines 352-392: Category repositioning by match count
- x-pack/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/aggs/categorization/TokenListCategory.java
  - Lines 261-299: updateCommonUniqueTokenIds() - token erosion
  - Lines 310-409: updateOrderedCommonTokenIds() - ordered range calculation
  - Lines 533-596: matchesSearchForCategory() - match detection
- x-pack/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/aggs/categorization/SerializableTokenListCategory.java
  - Lines 172-180: getRegex() - regex generation logic
Related Documentation:
- docs/reference/aggregations/search-aggregations-bucket-categorize-text-aggregation.md
- Algorithm changed in 8.3.0 (TransportVersion: V_8_3_0)
Acceptance Criteria
- Degenerate categories cannot monopolize incoming messages
- Specific categories can form even when degenerate category exists with high match count
- Existing categorization quality maintained for well-structured logs
- Performance impact < 5% for typical workloads
- Test coverage for degenerate category scenarios
- Documentation updated with behavior explanation
- Backport consideration for 8.x branches
Priority
Suggested: P2 (High)
While this does not affect all users (specific conditions are required), the impact is severe when it occurs, potentially rendering categorization useless for affected datasets.