Adds a document:match query function for substring matching against d column entries #3470

Open
drewfarris wants to merge 9 commits into integration from feature/document-match-function

@drewfarris
Collaborator

d column entries

  • Adds the document:match(viewname, string) and document:match(string) query functions, which scan the d columns of candidate documents at evaluation time and filter out candidates whose values do not contain the specified string.
    • Exposed via Lucene syntax using the #DOCUMENT_MATCH operator.
    • If no view name is included as a function parameter, all d columns will be scanned.
    • The viewname can be a prefix that ends with '*' to search all views with the specified prefix.
  • If the specified string is found, the view name and start offsets for matches will be stored as a JSON map in the DOCUMENT_MATCHES field in the result.
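
As a rough illustration of the semantics described above, here is a minimal, self-contained sketch. It is not the code in this PR: the class `DocumentMatchSketch` and its methods are hypothetical, views are modeled as a plain map of view name to decoded text, and the choices to match case-sensitively, to count overlapping occurrences, and to report `char` offsets are assumptions made only for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; these are not the classes added by this PR.
class DocumentMatchSketch {

    // True if viewName is selected by pattern. A trailing '*' makes the pattern a prefix match.
    // A null pattern stands in for the single-argument document:match(string) form (all views).
    static boolean viewMatches(String pattern, String viewName) {
        if (pattern == null) {
            return true;
        }
        if (pattern.endsWith("*")) {
            return viewName.startsWith(pattern.substring(0, pattern.length() - 1));
        }
        return pattern.equals(viewName);
    }

    // Start offsets of every occurrence of needle in content (assumption: overlaps are counted).
    static List<Integer> findOffsets(String content, String needle) {
        List<Integer> offsets = new ArrayList<>();
        if (needle.isEmpty()) {
            return offsets;
        }
        for (int i = content.indexOf(needle); i >= 0; i = content.indexOf(needle, i + 1)) {
            offsets.add(i);
        }
        return offsets;
    }

    // Maps each selected view containing needle to its match offsets.
    // An empty result means the candidate document would be filtered out.
    static Map<String,List<Integer>> match(Map<String,String> views, String pattern, String needle) {
        Map<String,List<Integer>> matches = new LinkedHashMap<>();
        for (Map.Entry<String,String> view : views.entrySet()) {
            if (viewMatches(pattern, view.getKey())) {
                List<Integer> offsets = findOffsets(view.getValue(), needle);
                if (!offsets.isEmpty()) {
                    matches.put(view.getKey(), offsets);
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        Map<String,String> views = new LinkedHashMap<>();
        views.put("CONTENT_A", "foo bar foo");
        views.put("META", "foo");
        System.out.println(match(views, "CONTENT*", "foo"));
    }
}
```

Running `main` prints `{CONTENT_A=[0, 8]}`: only the view selected by the `CONTENT*` prefix is scanned, and both start offsets of `foo` are reported for it.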

This change includes:

  • Lucene-to-JEXL translation
  • Planner/iterator wiring
  • Runtime document-match evaluation
  • Configurable limits for d column sizes to prevent evaluation of large documents
  • Unit and integration tests

While useful in its own right, this is a predecessor for more advanced matching functions on d column payloads.

@drewfarris drewfarris changed the title Adds a document:match query function for substring matching against… Adds a document:match query function for substring matching against d column entries Mar 23, 2026
@drewfarris drewfarris self-assigned this Mar 23, 2026
@drewfarris drewfarris marked this pull request as ready for review March 24, 2026 13:15
Comment thread warehouse/query-core/src/main/java/datawave/query/function/JexlEvaluation.java Outdated
Comment thread warehouse/query-core/src/main/java/datawave/query/function/KeyToDocumentData.java Outdated
@drewfarris drewfarris requested a review from ivakegg March 25, 2026 19:07
Comment thread warehouse/query-core/src/main/java/datawave/query/iterator/QueryIterator.java Outdated
Comment thread warehouse/query-core/src/test/java/datawave/query/DocumentMatchQueryTest.java Outdated
Comment thread warehouse/query-core/src/test/java/datawave/query/DocumentMatchQueryTest.java Outdated
@drewfarris drewfarris requested a review from apmoriarty March 29, 2026 23:52
@drewfarris drewfarris requested a review from apmoriarty March 31, 2026 15:48
Comment thread warehouse/query-core/src/main/java/datawave/query/function/JexlEvaluation.java Outdated
Comment thread warehouse/query-core/src/main/java/datawave/query/jexl/DatawaveInterpreter.java Outdated
Comment thread warehouse/query-core/src/main/java/datawave/query/function/KeyToDocumentData.java Outdated
Comment on lines +1159 to +1162
if (nestedScript == null) {
    nestedScript = ArithmeticJexlEngines.getEngine(getArithmetic()).parse(nestedQuery.getQuery());
}
return DocumentMatchFunctionVisitor.requiresDocumentMatchContext(nestedScript);
Collaborator

There are methods in JexlASTHelper that can parse the functions out of the JexlNode

protected EventDataQueryFilter eventEvaluationFilter;
// filter specifically for event keys. required when performing a seeking aggregation
protected EventDataQueryFilter eventFilter;
protected boolean retainDocumentColumnFamily = false;
Collaborator

If we can agree d-columns won't be aggregated into the event, I think this can be removed.

((DelayedNonEventIndexContext) input.third()).populateDocument(input.second());
}

String documentMatches = (documentMatchContext == null) ? "" : DocumentFunctions.toJson(documentMatchContext.getMergedMatches());
Collaborator Author

Based on a conversation with @FineAndDandy, consider whether we want to do the serialization to JSON here or move it to a transformer layer, so that the data structure can be populated through other means (e.g., this is similar to the existing hit term code and thus should follow economy of mechanism instead of implementing its own path).

@drewfarris drewfarris force-pushed the feature/document-match-function branch from d94ae23 to 472b64d Compare April 11, 2026 13:25
* Added javadoc regarding TRUE_NODE to JexlFunctionArgumentDescriptorFactory that shows this should be used when index searching should be skipped for a function
* Added documentMatchMaxEncodedContextSize to limit total size of encoded d columns collected in DocumentMatchContextFunction.
* Clean up duplicate d column decode paths by tailoring the decode methods in ContentKeyValueFactory
* Improve handling for documentMatchFunction cases in DatawaveInterpreter
* Employ constants where possible
* Avoid clearing documentMatchContext in JexlEvaluation; added tests to validate this is the right thing to do
* Avoid merging all results into a single Attribute and choosing the first visibility; instead add multiple values for the DOCUMENT_MATCHES field, each with the appropriate visibility based on the original d-column.
* Significant refactoring of the return format as a result of avoiding merges - adds DocumentMatchResults object to hold results.
* Updated the document match function to return the matched string if there's a successful match, an empty string if not. There was no need to return a full JSON object containing all matches because this comes from the DocumentMatchContext.
* Properly dedups offsets in cases where multiple document match functions against the same query string return the same offsets for a document.
* Updated unit tests to reflect new conditions, edge cases, incorrect input.
* Consolidate serialization to DocumentMatchResults
* Removed the dead DocumentMatchFactory and EmptyDocumentMatchFunctions, updated QueryIterator and TLDQueryIterator to construct DocumentMatchContextFunction directly
* Removed dead code from DocumentMatchResults (copy, contained search, payload builder)
* Removed unnecessary Content.withKeyMetadata helper
* Cleaned up some brittleness in the tests related to JSON assertions - now assert the structure instead of exact string matches
* Additional validation of visibility in unit tests
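
Two of the items above, the cap on the total encoded d column size and the deduplication of offsets, can be pictured with a small sketch. Everything here is an assumption for illustration: `MatchContextSketch` is a hypothetical class, the limit only loosely mirrors what `documentMatchMaxEncodedContextSize` is described as doing, and the decision to silently skip a payload that would exceed the budget may not be how the PR handles that case.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical illustration only; names and behavior are assumptions, not the PR's classes.
class MatchContextSketch {
    private final long maxEncodedContextSize; // assumed budget, in bytes, for all collected payloads
    private long collectedSize = 0;
    private final Map<String,SortedSet<Integer>> offsetsByView = new LinkedHashMap<>();

    MatchContextSketch(long maxEncodedContextSize) {
        this.maxEncodedContextSize = maxEncodedContextSize;
    }

    // Accounts for one encoded d column payload. Returns false, and counts nothing,
    // if adding it would push the running total past the budget (assumed policy: skip it).
    boolean tryCollect(byte[] encodedPayload) {
        if (collectedSize + encodedPayload.length > maxEncodedContextSize) {
            return false;
        }
        collectedSize += encodedPayload.length;
        return true;
    }

    // Records offsets for a view. The sorted set drops duplicates, so two functions
    // reporting the same offset for the same view contribute it only once.
    void recordMatches(String viewName, List<Integer> offsets) {
        offsetsByView.computeIfAbsent(viewName, k -> new TreeSet<>()).addAll(offsets);
    }

    Map<String,SortedSet<Integer>> getMatches() {
        return offsetsByView;
    }

    public static void main(String[] args) {
        MatchContextSketch context = new MatchContextSketch(10);
        System.out.println(context.tryCollect(new byte[6]));
        System.out.println(context.tryCollect(new byte[5]));
        context.recordMatches("CONTENT", List.of(8, 0));
        context.recordMatches("CONTENT", List.of(0, 3));
        System.out.println(context.getMatches());
    }
}
```

Backing each view with a `SortedSet` gives both deduplication and a stable, ascending offset order without a separate cleanup pass.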
Comment on lines +78 to +82
Map<String,List<Integer>> jsonMatches = new LinkedHashMap<>();
for (Map.Entry<String,SortedSet<Integer>> matchEntry : matches.entrySet()) {
    jsonMatches.put(matchEntry.getKey(), new ArrayList<>(matchEntry.getValue()));
}
payload.put(MATCHES_FIELD, jsonMatches);
Collaborator Author

Consider adding matches to the payload directly instead of converting to a LinkedHashMap
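
To make the suggestion concrete, the sketch below compares the two approaches. It is hedged heavily: the JSON library used by the PR is not shown in this thread, so `toJson` here is a toy stand-in that treats any `Collection` as an array, `PayloadSketch` is a hypothetical class, and the value `"matches"` for `MATCHES_FIELD` is assumed.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.StringJoiner;
import java.util.TreeSet;

// Hypothetical illustration only; not the PR's DocumentMatchResults code.
class PayloadSketch {
    static final String MATCHES_FIELD = "matches"; // assumed field name

    // Toy serializer standing in for a real JSON library: handles maps, collections,
    // numbers, and strings, and performs no escaping.
    static String toJson(Object value) {
        if (value instanceof Map) {
            StringJoiner joiner = new StringJoiner(",", "{", "}");
            for (Map.Entry<?,?> entry : ((Map<?,?>) value).entrySet()) {
                joiner.add("\"" + entry.getKey() + "\":" + toJson(entry.getValue()));
            }
            return joiner.toString();
        }
        if (value instanceof Collection) {
            StringJoiner joiner = new StringJoiner(",", "[", "]");
            for (Object element : (Collection<?>) value) {
                joiner.add(toJson(element));
            }
            return joiner.toString();
        }
        if (value instanceof Number) {
            return value.toString();
        }
        return "\"" + value + "\"";
    }

    // Approach in the quoted lines: copy each SortedSet into a List first.
    static String withCopy(Map<String,SortedSet<Integer>> matches) {
        Map<String,List<Integer>> jsonMatches = new LinkedHashMap<>();
        for (Map.Entry<String,SortedSet<Integer>> matchEntry : matches.entrySet()) {
            jsonMatches.put(matchEntry.getKey(), new ArrayList<>(matchEntry.getValue()));
        }
        Map<String,Object> payload = new LinkedHashMap<>();
        payload.put(MATCHES_FIELD, jsonMatches);
        return toJson(payload);
    }

    // Suggested alternative: add the map of sorted sets to the payload directly.
    static String direct(Map<String,SortedSet<Integer>> matches) {
        Map<String,Object> payload = new LinkedHashMap<>();
        payload.put(MATCHES_FIELD, matches);
        return toJson(payload);
    }

    public static void main(String[] args) {
        Map<String,SortedSet<Integer>> matches = new LinkedHashMap<>();
        matches.put("CONTENT", new TreeSet<>(List.of(3, 1)));
        System.out.println(withCopy(matches));
        System.out.println(direct(matches));
    }
}
```

Both calls print `{"matches":{"CONTENT":[1,3]}}` with this toy serializer. Whether the same holds in the PR depends on the real serializer writing a `SortedSet` as an array, and the direct form shares the underlying sets with the payload rather than taking a defensive copy.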

3 participants