Adds a document:match query function for substring matching against d column entries #3470
drewfarris wants to merge 9 commits into integration
    if (nestedScript == null) {
        nestedScript = ArithmeticJexlEngines.getEngine(getArithmetic()).parse(nestedQuery.getQuery());
    }
    return DocumentMatchFunctionVisitor.requiresDocumentMatchContext(nestedScript);
There are methods in JexlASTHelper that can parse the functions out of the JexlNode
    protected EventDataQueryFilter eventEvaluationFilter;
    // filter specifically for event keys. required when performing a seeking aggregation
    protected EventDataQueryFilter eventFilter;
    protected boolean retainDocumentColumnFamily = false;
If we can agree that d-columns won't be aggregated into the event, I think this can be removed.
        ((DelayedNonEventIndexContext) input.third()).populateDocument(input.second());
    }

    String documentMatches = (documentMatchContext == null) ? "" : DocumentFunctions.toJson(documentMatchContext.getMergedMatches());
Based on a conversation with @FineAndDandy, consider whether we want to do the serialization to JSON here or move it to a transformer layer, so that the data structure can be populated through other means (this is similar to the existing hit term code and thus should follow an economy of mechanism instead of implementing its own path).
… `d` column entries

* Adds the `document:match(viewname, string)` and `document:match(string)` query functions, which scan the `d` columns of candidate documents at evaluation time and filter out candidates whose values do not contain the specified string.
* Exposed via Lucene syntax using the `#DOCUMENT_MATCH` operator.
* If no view name is passed as a function parameter, all `d` columns are scanned.
* The view name can be a prefix ending in `*` to search all views with that prefix.
* If the specified string is found, the view name and start offsets of the matches are stored as a JSON map in the `DOCUMENT_MATCHES` field of the result.

This change includes:

* Lucene-to-JEXL translation
* Planner/iterator wiring
* Runtime document-match evaluation
* Configurable limits for `d` column sizes to prevent evaluation of large documents
* Unit and integration tests

While useful in its own right, this is a precursor to more advanced matching functions on `d` column payloads.
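The matching behavior described above can be illustrated with a minimal sketch. This is not the code in this PR: the class `DocumentMatchSketch` and its methods are hypothetical, views are modeled as already-decoded strings keyed by view name, and the sketch assumes case-sensitive matching that also counts overlapping occurrences, neither of which is stated in the description.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical illustration only; names and matching rules are assumptions.
final class DocumentMatchSketch {

    // A null pattern selects every view; a trailing '*' selects by prefix;
    // anything else must equal the view name exactly.
    static boolean viewMatches(String pattern, String viewName) {
        if (pattern == null) {
            return true;
        }
        if (pattern.endsWith("*")) {
            return viewName.startsWith(pattern.substring(0, pattern.length() - 1));
        }
        return viewName.equals(pattern);
    }

    // Start offsets of every occurrence of needle in content (overlaps included).
    static SortedSet<Integer> offsets(String content, String needle) {
        SortedSet<Integer> result = new TreeSet<>();
        if (needle.isEmpty()) {
            return result;
        }
        for (int i = content.indexOf(needle); i >= 0; i = content.indexOf(needle, i + 1)) {
            result.add(i);
        }
        return result;
    }

    // View name -> start offsets, for the selected views that contain needle.
    // An empty result means the candidate document would be filtered out.
    static Map<String, SortedSet<Integer>> match(Map<String, String> views, String pattern, String needle) {
        Map<String, SortedSet<Integer>> matches = new LinkedHashMap<>();
        for (Map.Entry<String, String> view : views.entrySet()) {
            if (!viewMatches(pattern, view.getKey())) {
                continue;
            }
            SortedSet<Integer> found = offsets(view.getValue(), needle);
            if (!found.isEmpty()) {
                matches.put(view.getKey(), found);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        Map<String, String> views = new LinkedHashMap<>();
        views.put("CONTENT", "the quick brown fox");
        views.put("CONTENT_RAW", "fox and fox");
        views.put("META", "no match here");
        System.out.println(match(views, "CONTENT*", "fox")); // {CONTENT=[16], CONTENT_RAW=[0, 8]}
    }
}
```

The map returned by `match` has the same shape as the view-name-to-offsets map the description says is serialized into `DOCUMENT_MATCHES`.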
Force-pushed from d94ae23 to 472b64d.
* Added javadoc regarding TRUE_NODE to JexlFunctionArgumentDescriptorFactory explaining that it should be used when index searching should be skipped for a function
* Added documentMatchMaxEncodedContextSize to limit the total size of encoded d columns collected in DocumentMatchContextFunction.

* Clean up duplicate d column decode paths by tailoring the decode methods in ContentKeyValueFactory
* Improve handling for documentMatchFunction cases in DatawaveInterpreter
* Employ constants where possible

* Avoid clearing documentMatchContext in JexlEvaluation; added tests to validate this is the right thing to do
* Avoid merging all results into a single Attribute and choosing the first visibility; instead, add multiple values for the DOCUMENT_MATCHES field, each with the appropriate visibility based on the original d-column.
* Significant refactoring of the return format as a result of avoiding merges: adds a DocumentMatchResults object to hold results.
* Updated the document match function to return the matched string on a successful match and an empty string otherwise. There was no need to return a full JSON object containing all matches because this comes from the DocumentMatchContext.
* Properly deduplicates offsets in cases where multiple document match functions against the same query string return the same offsets for a document.
* Updated unit tests to reflect new conditions, edge cases, and incorrect input.

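As a rough illustration of the offset deduplication mentioned in this commit, the sketch below accumulates offsets per view in a `TreeSet`, so offsets reported by more than one function call collapse to a single entry. The class `OffsetMergeSketch` and its methods are hypothetical and are not taken from DocumentMatchResults; per-visibility handling is deliberately left out.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical sketch of offset deduplication; names are illustrative only.
final class OffsetMergeSketch {

    // View name -> sorted, duplicate-free start offsets.
    private final Map<String, SortedSet<Integer>> merged = new LinkedHashMap<>();

    // Records the offsets reported by one function call for one view.
    // Offsets already present are ignored because the backing set is a TreeSet.
    void record(String viewName, List<Integer> offsets) {
        merged.computeIfAbsent(viewName, k -> new TreeSet<>()).addAll(offsets);
    }

    Map<String, SortedSet<Integer>> merged() {
        return merged;
    }

    public static void main(String[] args) {
        OffsetMergeSketch sketch = new OffsetMergeSketch();
        sketch.record("CONTENT", Arrays.asList(8, 0));     // first function call
        sketch.record("CONTENT", Arrays.asList(0, 8, 21)); // second call repeats 0 and 8
        System.out.println(sketch.merged()); // {CONTENT=[0, 8, 21]}
    }
}
```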
* Consolidate serialization to DocumentMatchResults
* Removed the dead DocumentMatchFactory and EmptyDocumentMatchFunctions; updated QueryIterator and TLDQueryIterator to construct DocumentMatchContextFunction directly
* Removed dead code from DocumentMatchResults (copy, contained search, payload builder)
* Removed unnecessary Content.withKeyMetadata helper
* Cleaned up some brittleness in the tests related to JSON assertions: they now assert the structure instead of exact strings
* Additional validation of visibility in unit tests
    Map<String,List<Integer>> jsonMatches = new LinkedHashMap<>();
    for (Map.Entry<String,SortedSet<Integer>> matchEntry : matches.entrySet()) {
        jsonMatches.put(matchEntry.getKey(), new ArrayList<>(matchEntry.getValue()));
    }
    payload.put(MATCHES_FIELD, jsonMatches);
Consider adding matches to the payload directly instead of converting to a LinkedHashMap
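One way to read this suggestion, as a sketch only: put the existing `matches` map into the payload as-is. This assumes the JSON serializer writes any `Collection` as an array, so a `SortedSet<Integer>` needs no conversion to a list; which serializer is used is not shown here. The class `PayloadSketch` and the value of `MATCHES_FIELD` are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical sketch; the real payload builder and constant value are not shown in the diff.
final class PayloadSketch {

    static final String MATCHES_FIELD = "matches"; // assumed value, for illustration

    // Adds the matches map to the payload without copying each set into an ArrayList.
    static Map<String, Object> buildPayload(Map<String, SortedSet<Integer>> matches) {
        Map<String, Object> payload = new LinkedHashMap<>();
        payload.put(MATCHES_FIELD, matches);
        return payload;
    }

    public static void main(String[] args) {
        Map<String, SortedSet<Integer>> matches = new LinkedHashMap<>();
        matches.computeIfAbsent("CONTENT", k -> new TreeSet<>()).add(8);
        matches.get("CONTENT").add(0);
        System.out.println(buildPayload(matches)); // {matches={CONTENT=[0, 8]}}
    }
}
```

The trade-off is that the payload then shares the `matches` instance instead of holding a copy, so later mutation of `matches` would be visible through the payload.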