Adds a `document:match` query function for substring matching against `d` column entries by drewfarris · Pull Request #3470 · NationalSecurityAgency/datawave

drewfarris · 2026-03-23T21:36:21Z

Adds the `document:match(viewname, string)` and `document:match(string)` query functions, which scan the `d` columns of candidate documents at evaluation time and filter out candidates whose values do not contain the specified string.
- Exposed via Lucene syntax using the `#DOCUMENT_MATCH` operator.
- If no view name is included as a function parameter, all `d` columns will be scanned.
- The viewname can be a prefix that ends with `*` to search all views with the specified prefix.
- If the specified string is found, the view name and start offsets for matches will be stored as a JSON map in the `DOCUMENT_MATCHES` field in the result.
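
The matching semantics described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the class and method names are hypothetical, views are modelled as an in-memory map of view name to decoded text, and the sketch assumes case-sensitive matching that also records overlapping occurrences, which the description does not specify.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

/** Hypothetical illustration only; not the DATAWAVE implementation. */
final class DocumentMatchSketch {

    /** A null pattern selects every view; a trailing '*' selects by prefix; otherwise the name must be equal. */
    static boolean viewSelected(String pattern, String viewName) {
        if (pattern == null) {
            return true;
        }
        if (pattern.endsWith("*")) {
            return viewName.startsWith(pattern.substring(0, pattern.length() - 1));
        }
        return pattern.equals(viewName);
    }

    /** Start offsets of every occurrence of needle in content (assumption: overlapping occurrences count). */
    static SortedSet<Integer> findOffsets(String content, String needle) {
        SortedSet<Integer> offsets = new TreeSet<>();
        if (needle.isEmpty()) {
            return offsets;
        }
        for (int i = content.indexOf(needle); i >= 0; i = content.indexOf(needle, i + 1)) {
            offsets.add(i);
        }
        return offsets;
    }

    /** View name to start offsets for each selected view that contains needle. */
    static Map<String,SortedSet<Integer>> match(Map<String,String> views, String pattern, String needle) {
        Map<String,SortedSet<Integer>> matches = new LinkedHashMap<>();
        for (Map.Entry<String,String> view : views.entrySet()) {
            if (!viewSelected(pattern, view.getKey())) {
                continue;
            }
            SortedSet<Integer> offsets = findOffsets(view.getValue(), needle);
            if (!offsets.isEmpty()) {
                matches.put(view.getKey(), offsets);
            }
        }
        return matches;
    }
}
```

Under this sketch, a candidate document would be kept when `match(...)` returns a non-empty map, and that map corresponds to what the description calls the JSON map of view names to start offsets.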

This change includes:

- Lucene-to-JEXL translation
- Planner/iterator wiring
- Runtime document-match evaluation
- Configurable limits for `d` column sizes to prevent evaluation of large documents
- Unit and integration tests
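
One way to picture the configurable size limits is a small budget object that is consulted before each `d` column entry is collected. The sketch below is an assumption about the shape of such a guard, not the PR's code: the class name, the two limits, and the choice to skip oversized entries (rather than fail the query) are all hypothetical.

```java
/** Hypothetical size guard; names and skip-on-overflow behavior are assumptions. */
final class DocumentSizeBudgetSketch {
    private final long maxEntryBytes;
    private final long maxTotalBytes;
    private long usedBytes = 0;

    DocumentSizeBudgetSketch(long maxEntryBytes, long maxTotalBytes) {
        this.maxEntryBytes = maxEntryBytes;
        this.maxTotalBytes = maxTotalBytes;
    }

    /**
     * Returns true and records the bytes when the entry fits under both limits.
     * Returns false, leaving the budget unchanged, when the entry should be skipped.
     */
    boolean tryAccept(int encodedLength) {
        if (encodedLength > maxEntryBytes) {
            return false;
        }
        if (usedBytes + encodedLength > maxTotalBytes) {
            return false;
        }
        usedBytes += encodedLength;
        return true;
    }

    long usedBytes() {
        return usedBytes;
    }
}
```

A caller would check `tryAccept(value.length)` for each encoded entry and only decode and scan the entries that are accepted.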

While useful in its own right, this is a predecessor for more advanced matching functions on d column payloads.

FineAndDandy · 2026-04-08T23:20:39Z

+        if (nestedScript == null) {
+            nestedScript = ArithmeticJexlEngines.getEngine(getArithmetic()).parse(nestedQuery.getQuery());
+        }
+        return DocumentMatchFunctionVisitor.requiresDocumentMatchContext(nestedScript);


There are methods in JexlASTHelper that can parse the functions out of the JexlNode

FineAndDandy · 2026-04-08T23:31:19Z

    protected EventDataQueryFilter eventEvaluationFilter;
    // filter specifically for event keys. required when performing a seeking aggregation
    protected EventDataQueryFilter eventFilter;
+    protected boolean retainDocumentColumnFamily = false;


If we can agree d-columns won't be aggregated into the event, I think this can be removed.

drewfarris · 2026-04-09T12:55:35Z

            ((DelayedNonEventIndexContext) input.third()).populateDocument(input.second());
        }

+        String documentMatches = (documentMatchContext == null) ? "" : DocumentFunctions.toJson(documentMatchContext.getMergedMatches());


Based on a conversation with @FineAndDandy, consider whether we want to do the serialization to JSON here or move it to a transformer layer, so that the data structure can be populated through other means (e.g., this is similar to the existing hit term code and thus should follow economy of mechanism instead of implementing its own path).

* Added javadoc regarding TRUE_NODE to JexlFunctionArgumentDescriptorFactory that shows this should be used when index searching should be skipped for a function
* Added documentMatchMaxEncodedContextSize to limit total size of encoded d columns collected in DocumentMatchContextFunction.

* Clean up duplicate d column decode paths by tailoring the decode methods in ContentKeyValueFactory
* Improve handling for documentMatchFunction cases in DatawaveInterpreter
* Employ constants where possible

* Avoid clearing documentMatchContext in JexlEvaluation; added tests to validate this is the right thing to do
* Avoid merging all results into a single Attribute and choosing the first visibility; adds multiple values for the DOCUMENT_MATCHES field with the appropriate visibility based on the original d-column.
* Significant refactoring of the return format as a result of avoiding merges; adds a DocumentMatchResults object to hold results.
* Updated the document match function to return the matched string if there's a successful match, or an empty string if not. There was no need to return a full JSON object containing all matches because this comes from the DocumentMatchContext.
* Properly dedups offsets in cases where multiple document match functions against the same query string return the same offsets for a document.
* Updated unit tests to reflect new conditions, edge cases, and incorrect input.
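
The per-visibility grouping and offset deduplication described in this commit message can be illustrated with nested sorted collections. This is a hedged sketch only: the class below is hypothetical and is not the `DocumentMatchResults` class from the PR, and visibilities are modelled as plain strings.

```java
import java.util.Collections;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical illustration of grouping matches by visibility; not the PR's DocumentMatchResults. */
final class MatchesByVisibilitySketch {
    // visibility -> view name -> sorted, unique start offsets
    private final Map<String,Map<String,SortedSet<Integer>>> byVisibility = new TreeMap<>();

    /** Records one match; adding the same (visibility, view, offset) twice has no further effect. */
    void add(String visibility, String viewName, int startOffset) {
        byVisibility.computeIfAbsent(visibility, v -> new TreeMap<>())
                    .computeIfAbsent(viewName, n -> new TreeSet<>())
                    .add(startOffset);
    }

    /** Read-only view; each top-level entry would become one DOCUMENT_MATCHES value with that visibility. */
    Map<String,Map<String,SortedSet<Integer>>> asMap() {
        return Collections.unmodifiableMap(byVisibility);
    }
}
```

Because the offsets live in a `SortedSet`, two match functions that report the same offset for the same view collapse into a single entry without an explicit dedup pass.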

* Consolidate serialization to DocumentMatchResults
* Removed the dead DocumentMatchFactory and EmptyDocumentMatchFunctions, updated QueryIterator and TLDQueryIterator to construct DocumentMatchContextFunction directly
* Removed dead code from DocumentMatchResults (copy, contained search, payload builder)
* Removed unnecessary Content.withKeyMetadata helper
* Cleaned up some brittleness in the tests related to JSON assertions; they now assert the structure instead of exact strings
* Additional validation of visibility in unit tests

drewfarris · 2026-04-13T02:40:22Z

+        Map<String,List<Integer>> jsonMatches = new LinkedHashMap<>();
+        for (Map.Entry<String,SortedSet<Integer>> matchEntry : matches.entrySet()) {
+            jsonMatches.put(matchEntry.getKey(), new ArrayList<>(matchEntry.getValue()));
+        }
+        payload.put(MATCHES_FIELD, jsonMatches);


Consider adding matches to the payload directly instead of converting to a LinkedHashMap
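
To illustrate why the copy may be unnecessary, the sketch below renders both the copied `Map<String,List<Integer>>` and the original `Map<String,SortedSet<Integer>>` through the same writer and gets identical output. The class, method names, and the tiny hand-rolled JSON writer are hypothetical stand-ins; the assumption is that the real serializer, like this one, treats any `Collection` as a JSON array. Keys are assumed not to need escaping.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.stream.Collectors;

/** Hypothetical comparison of the copied and direct payloads; not the PR's serialization code. */
final class PayloadSketch {

    /** Mirrors the loop in the diff: copies each SortedSet into an ArrayList. */
    static Map<String,List<Integer>> copyToLists(Map<String,SortedSet<Integer>> matches) {
        Map<String,List<Integer>> jsonMatches = new LinkedHashMap<>();
        for (Map.Entry<String,SortedSet<Integer>> matchEntry : matches.entrySet()) {
            jsonMatches.put(matchEntry.getKey(), new ArrayList<>(matchEntry.getValue()));
        }
        return jsonMatches;
    }

    /** Minimal writer that accepts any Collection of offsets (assumes keys need no escaping). */
    static String toJson(Map<String,? extends Collection<Integer>> matches) {
        StringBuilder json = new StringBuilder("{");
        String separator = "";
        for (Map.Entry<String,? extends Collection<Integer>> entry : matches.entrySet()) {
            String offsets = entry.getValue().stream()
                            .map(String::valueOf)
                            .collect(Collectors.joining(",", "[", "]"));
            json.append(separator).append('"').append(entry.getKey()).append("\":").append(offsets);
            separator = ",";
        }
        return json.append('}').toString();
    }
}
```

If the serializer in use handles sets this way, `payload.put(MATCHES_FIELD, matches)` would produce the same JSON as the copied map, since the copy preserves both key order and offset order.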

drewfarris changed the title ~~Adds a document:match query function for substring matching against…~~ Adds a document:match query function for substring matching against d column entries Mar 23, 2026

drewfarris requested review from FineAndDandy and apmoriarty March 23, 2026 21:36

drewfarris self-assigned this Mar 23, 2026

drewfarris added the Integration Tested label Mar 24, 2026

drewfarris marked this pull request as ready for review March 24, 2026 13:15

apmoriarty reviewed Mar 25, 2026

Comment thread warehouse/query-core/src/main/java/datawave/query/function/JexlEvaluation.java Outdated

apmoriarty reviewed Mar 25, 2026

Comment thread warehouse/query-core/src/main/java/datawave/query/function/KeyToDocumentData.java Outdated

drewfarris requested a review from ivakegg March 25, 2026 19:07

apmoriarty reviewed Mar 26, 2026

Comment thread warehouse/query-core/src/main/java/datawave/query/function/DocumentMatchContextFunction.java Outdated

apmoriarty reviewed Mar 26, 2026

Comment thread warehouse/query-core/src/main/java/datawave/query/function/DocumentMatchContextFunction.java Outdated

apmoriarty reviewed Mar 26, 2026

Comment thread warehouse/query-core/src/main/java/datawave/query/iterator/QueryIterator.java Outdated

apmoriarty reviewed Mar 26, 2026

Comment thread ...-core/src/main/java/datawave/query/jexl/visitors/DocumentMatchFunctionRebuildingVisitor.java Outdated

apmoriarty reviewed Mar 26, 2026

Comment thread ...-core/src/main/java/datawave/query/jexl/visitors/DocumentMatchFunctionRebuildingVisitor.java Outdated

apmoriarty reviewed Mar 26, 2026

Comment thread ...-core/src/main/java/datawave/query/jexl/visitors/DocumentMatchFunctionRebuildingVisitor.java Outdated

apmoriarty reviewed Mar 26, 2026

Comment thread ...ouse/query-core/src/main/java/datawave/query/jexl/functions/DocumentFunctionsDescriptor.java

apmoriarty reviewed Mar 26, 2026

Comment thread warehouse/query-core/src/main/java/datawave/query/planner/DefaultQueryPlanner.java Outdated

apmoriarty reviewed Mar 26, 2026

Comment thread ...e/src/test/java/datawave/query/jexl/visitors/DocumentMatchFunctionRebuildingVisitorTest.java Outdated

apmoriarty reviewed Mar 26, 2026

Comment thread warehouse/query-core/src/test/java/datawave/query/DocumentMatchQueryTest.java Outdated

apmoriarty reviewed Mar 26, 2026

Comment thread warehouse/query-core/src/test/java/datawave/query/DocumentMatchQueryTest.java Outdated

apmoriarty reviewed Mar 26, 2026

Comment thread warehouse/query-core/src/test/resources/datawave/query/QueryLogicFactory.xml

drewfarris requested a review from apmoriarty March 29, 2026 23:52

drewfarris commented Mar 30, 2026

Comment thread warehouse/query-core/src/main/java/datawave/query/iterator/QueryIterator.java

apmoriarty reviewed Mar 30, 2026

Comment thread warehouse/query-core/src/test/java/datawave/query/function/JexlEvaluationTest.java

apmoriarty reviewed Mar 30, 2026

Comment thread warehouse/query-core/src/test/java/datawave/query/planner/DefaultQueryPlannerTest.java Outdated

apmoriarty reviewed Mar 30, 2026

Comment thread .../query-core/src/test/java/datawave/query/jexl/visitors/DocumentMatchFunctionVisitorTest.java

apmoriarty reviewed Mar 30, 2026

Comment thread ...ouse/query-core/src/main/java/datawave/query/jexl/functions/DocumentFunctionsDescriptor.java

drewfarris requested a review from apmoriarty March 31, 2026 15:48

FineAndDandy reviewed Apr 1, 2026

Comment thread ...ouse/query-core/src/main/java/datawave/query/jexl/functions/DocumentFunctionsDescriptor.java

FineAndDandy reviewed Apr 1, 2026

Comment thread warehouse/query-core/src/main/java/datawave/query/function/DocumentMatchContextFunction.java Outdated