[analytics-engine] Coverage fixes for search command on the analytics-engine route #21681
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files:
@@             Coverage Diff              @@
##               main   #21681      +/-   ##
============================================
+ Coverage     73.33%   73.45%   +0.12%
- Complexity    74996    75040     +44
============================================
  Files          6012     6012
  Lines        340934   340934
  Branches      49076    49076
============================================
+ Hits         250021   250441    +420
+ Misses        71005    70484    -521
- Partials      19908    20009    +101
Previously `visitSearch` discarded `Search.getOriginalExpression()` and emitted a single `query_string(...)` function call covering the whole predicate. On the analytics-engine route every such filter is delegation-routed to Lucene; on a parquet-only composite shard the delegation handle NPEs in `LuceneAnalyticsBackendPlugin.getFilterDelegationHandle` because there is no Lucene reader.

Walk the typed `SearchExpression` AST instead. Structured fragments (field comparisons, AND, OR, NOT, IN, plus time-modifier-resolved comparisons such as `earliest=` / `latest=`) become native PPL filter AST (`Compare` / `And` / `Or` / `Not` / `In`) that the existing rexVisitor handles like any `where`-clause condition. DataFusion executes them natively against parquet without any Lucene round-trip. Free-text / phrase literals stay in `query_string` form because they have no native equivalent.

Top-level `AND` mixes the two: the structured conjunct is lowered, the relevance conjunct stays in `query_string`, and the resulting `AND(native, query_string(...))` lets the planner route each clause independently: DataFusion prunes parquet, Lucene handles the relevance term against the secondary format.

Three guards keep the lowering faithful to PPL search semantics:
- `containsLuceneWildcard`: `*` / `?` in a comparison value forces the fallback so Lucene wildcards keep working (`severityText=ERR*`).
- `isOpenSearchDateMath`: values produced by `visitTimeModifierExpression` (`now`, `now-1h`, `now+1y/M-1M`, epoch-millis strings) can't be parsed as Calcite timestamps; keep them in `query_string` so Lucene evaluates the date math.
- `isLowerableField`: comparisons against fields absent from the relation's row type fall through (matches Lucene's "missing field → no docs" semantics; native would error out in the analyzer).

`SearchNot` always falls back: Lucene `NOT field=value` matches docs where the field is missing OR field != value, but `NOT (field = value)` in Calcite simplifies to `field <> value`, which drops missing-field rows since `NULL <> value` evaluates to NULL.

Net `CalciteSearchCommandIT` impact (analytics-engine route, all OS-side fixes in `opensearch-project/OpenSearch#21681` in place): 3/52 → 35/52. Tests that pass via the native path include all field-equality, AND, OR, IN, range, double-comparison, attribute-field, and mixed boolean tests.

Signed-off-by: Kai Huang <ahkcs@amazon.com>
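The `SearchNot` argument rests on SQL three-valued logic, which a small model can make concrete. The following is a minimal sketch, not code from either repository: the class and method names are hypothetical, a missing field is modeled as a Java `null`, and SQL's UNKNOWN is modeled as a `null` `Boolean`.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;
import java.util.stream.Collectors;

// Hypothetical illustration only; none of these names exist in the PR.
class NotSemanticsSketch {

    /** SQL-style `field <> literal`: a NULL (missing) field yields UNKNOWN, modeled as null. */
    static Boolean sqlNotEquals(String fieldValue, String literal) {
        if (fieldValue == null) {
            return null; // NULL <> 'x' evaluates to NULL, not TRUE
        }
        return !fieldValue.equals(literal);
    }

    /** A SQL filter keeps a row only when the predicate is TRUE; UNKNOWN rows are dropped. */
    static List<String> sqlFilter(List<String> rows, String literal) {
        return rows.stream()
            .filter(v -> Boolean.TRUE.equals(sqlNotEquals(v, literal)))
            .collect(Collectors.toList());
    }

    /** Lucene-style `NOT field=value`: keep every doc that does not match, missing field included. */
    static List<String> luceneNot(List<String> rows, String literal) {
        return rows.stream()
            .filter(v -> !Objects.equals(v, literal))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Three documents: two with a severity value, one where the field is missing.
        List<String> severity = Arrays.asList("ERROR", "INFO", null);
        System.out.println(sqlFilter(severity, "ERROR"));
        System.out.println(luceneNot(severity, "ERROR"));
    }
}
```

In this model the SQL-style filter keeps only the `INFO` row, while the Lucene-style filter also keeps the missing-field row. That one-row difference is the divergence the commit message cites as the reason to keep `NOT` in `query_string` form.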
Documents the 3 → 35 / 52 progression for `CalciteSearchCommandIT` on the analytics-engine route across this PR + opensearch-project/OpenSearch#21681, plus the well-bucketed remaining 17 failures and where each one's fix lands. Mirrors the existing per-command status docs for stats, top/rare, union, and appendcol. Signed-off-by: Kai Huang <ahkcs@amazon.com>
Force-pushed from 7258541 to 64127e6.
❕ Gradle check result for 64127e6: UNSTABLE. Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.
Force-pushed from 64127e6 to 1b2d651.
Force-pushed from 1b2d651 to 672aa21.
❌ Gradle check result for 672aa21: FAILURE. Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
❌ Gradle check result for b920fac: FAILURE. Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
Force-pushed from b920fac to f8fee15.
…_OPS

PPL `timestamp(expr)` lowers to a `ScalarFunction.TIMESTAMP` call. The TimestampFunctionAdapter already wires the call into DataFusion's native `to_timestamp`, but `STANDARD_PROJECT_OPS` did not declare TIMESTAMP, so `OpenSearchProjectRule.annotateExpr` rejected every plan that contained the operator with "No backend supports scalar function [TIMESTAMP] among [datafusion]".

Same call also shows up implicitly after the analyzer coerces a string literal to TIMESTAMP for column comparisons such as `@timestamp="2024-01-15T10:30:00Z"` once `@timestamp` is typed as TIMESTAMP (see the date_nanos schema fix in the previous commit).

Update the inline comment block to match: the legacy-engine-only path the prior comment described is gone now that the capability is wired.

Signed-off-by: Kai Huang <ahkcs@amazon.com>
…ltPlanExecutor

RexUtil.isFlat / RelOptUtil.eq / Project.isValid / RexChecker call into Calcite's Litmus.THROW, which raises AssertionError from raw Java code rather than via the `assert` keyword. JVM `-da` doesn't gate that path, so an assertion firing inside a search thread escapes to OpenSearchUncaughtExceptionHandler and exits the cluster JVM.

This bit hard once `CalciteRelNodeVisitor` started lowering structured PPL `search` predicates to native filter shape: queries like `severityNumber="not-a-number"` fold to `=(SAFE_CAST($X), null)` ahead of the marking phase, and the Litmus check fires before any plan-executor listener gets to translate the error. The cluster died mid-IT and 21 subsequent tests failed with `Connection refused`.

Convert AssertionError caught at the executor entrypoint to an IllegalStateException so the query reports as HTTP 500 with a bucketable message and the cluster survives. The same pattern is already in place at `UnifiedQueryPlanner.plan` on the SQL plugin side; this is the analytics-engine-side mirror so neither layer can produce a cluster-fatal assertion.

Signed-off-by: Kai Huang <ahkcs@amazon.com>
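The conversion described in this commit can be sketched as a small wrapper. This is a minimal illustration under stated assumptions, not the actual `DefaultPlanExecutor` code: `runGuarded`, `failingCheck`, and the message text are hypothetical names chosen for the example.

```java
import java.util.function.Supplier;

// Hypothetical illustration only; the real executor entrypoint is not shown in this PR page.
class AssertionGuardSketch {

    /** Runs a planning step and translates a raw AssertionError into a recoverable exception. */
    static <T> T runGuarded(Supplier<T> planStep) {
        try {
            return planStep.get();
        } catch (AssertionError e) {
            // AssertionError is an Error, so a plain `catch (Exception e)` would not see it.
            throw new IllegalStateException("Plan validation failed: " + e.getMessage(), e);
        }
    }

    /** Stand-in for a Litmus.THROW-style check: throws directly, without the `assert` keyword. */
    static String failingCheck() {
        throw new AssertionError("expression is not flat"); // hypothetical message
    }

    public static void main(String[] args) {
        try {
            runGuarded(AssertionGuardSketch::failingCheck);
        } catch (IllegalStateException e) {
            System.out.println("translated: " + e.getMessage());
        }
    }
}
```

Because the check throws `AssertionError` explicitly rather than through `assert`, disabling assertions with `-da` has no effect on it, which is why the catch has to name `AssertionError` itself. Keeping the original error as the cause preserves the stack trace for debugging.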
…eOutputCastRewriter format

Widen the format string passed to `to_char` from the seconds-only
{@code "%Y-%m-%d %H:%M:%S"} to {@code "%Y-%m-%d %H:%M:%S%.f"}. The
trailing {@code %.f} is chrono's variable-length fractional-second
specifier: a leading dot followed by 0-9 digits, omitted when the
value has no sub-second precision.

This matches PPL's legacy formatting for {@code date} and
{@code date_nanos} fields where the displayed precision tracks the
source value:
- {@code 2024-01-15T10:30:01.23456789Z} (date_nanos) →
{@code "2024-01-15 10:30:01.23456789"} (legacy) / now
{@code "2024-01-15 10:30:01.234567890"} (analytics route: internally
9-digit, trailing 0 because Arrow Timestamp(ns) precision is fixed)
- {@code 2025-08-01T03:47:41Z} (date) →
{@code "2025-08-01 03:47:41"} (both paths), with no decimal because the
source value has no fractional component

Surfaced by `CalciteSearchCommandIT.testSearchWithDateRangeComparisons`.
Follow-up to opensearch-project/sql#5420 which the original PR (opensearch-project#21650)
closed with a seconds-only format that dropped the fractional digits
the tests still expect.

Signed-off-by: Kai Huang <ahkcs@amazon.com>
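For readers more familiar with `java.time` than with chrono, the legacy "precision tracks the source value" behavior can be reproduced with `DateTimeFormatterBuilder.appendFraction`. This is a Java analogue offered as a sketch, not the Rust `to_char` path the commit changes; the class name is hypothetical.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.time.temporal.ChronoField;

// Hypothetical illustration only; the PR itself changes a chrono format string, not Java code.
class FractionalSecondsSketch {

    // Seconds-level pattern followed by an optional fraction of 0 to 9 digits.
    // With minWidth 0, trailing zeros are trimmed and the decimal point is
    // omitted entirely when the nanosecond field is zero.
    static final DateTimeFormatter VARIABLE_FRACTION = new DateTimeFormatterBuilder()
        .appendPattern("yyyy-MM-dd HH:mm:ss")
        .appendFraction(ChronoField.NANO_OF_SECOND, 0, 9, true)
        .toFormatter();

    static String format(LocalDateTime timestamp) {
        return VARIABLE_FRACTION.format(timestamp);
    }

    public static void main(String[] args) {
        // date_nanos-style value with sub-second precision
        System.out.println(format(LocalDateTime.of(2024, 1, 15, 10, 30, 1, 234_567_890)));
        // date-style value with no fractional component
        System.out.println(format(LocalDateTime.of(2025, 8, 1, 3, 47, 41)));
    }
}
```

The first call yields `2024-01-15 10:30:01.23456789` and the second `2025-08-01 03:47:41`, matching the legacy strings quoted in the commit message. The analytics route's extra trailing `0` is a property of the fixed nanosecond precision on that path and is not modeled here.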
…tor placeholder

The indexed-execution path inferred the parquet schema via build_segments
(which calls FileFormat::infer_schema) and registered it directly on the
PlaceholderProvider that from_substrait_plan binds against. The non-indexed
paths (session_context.rs, query_executor.rs, api.rs) all routed their
inferred schemas through schema_coerce::coerce_inferred_schema before
registering, but the indexed path skipped this step.

Result: an OpenSearch `ip` column lands as parquet `BinaryView` on disk;
isthmus on the Java side serializes the Substrait base schema as plain
`Binary` (Substrait has no view types). The placeholder reports `BinaryView`
while the plan declares `Binary`; DataFusion's substrait consumer rejects
the bind with:

    Substrait error: Field '<x>' in Substrait schema has a different type
    (Binary) than the corresponding field in the table schema (BinaryView).

Every analytics-engine query against an index that includes an `ip` column
(every OTEL-logs query in CalciteSearchCommandIT, for example) fails at
fragment start.

Apply coerce_inferred_schema right after build_segments, before
PlaceholderProvider construction. The placeholder, the
expr_to_bool_tree analysis, and the downstream IndexedTableProvider all
see the same Substrait-compatible (Binary / Int64 / Float32) schema, so
the bind succeeds and the parquet reader's SchemaAdapter handles the
per-batch BinaryView→Binary relabeling at scan time.

This restores parity with the other infer_schema sites; no behavior
change for non-IP columns.

Signed-off-by: Kai Huang <ahkcs@amazon.com>
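The mismatch and its fix can be modeled in a few lines. The real change is in the Rust DataFusion backend; the sketch below is a Java toy model under loud assumptions: schemas are plain name-to-type-string maps, `coerce` and `bind` are hypothetical stand-ins for `coerce_inferred_schema` and the Substrait bind check, and `client_ip` is an invented field name.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical toy model; the actual coercion lives in the Rust schema_coerce module.
class SchemaCoerceSketch {

    /** Relabels the view type to the plain type the Substrait plan can declare. */
    static Map<String, String> coerce(Map<String, String> inferred) {
        Map<String, String> coerced = new LinkedHashMap<>();
        for (Map.Entry<String, String> field : inferred.entrySet()) {
            String type = field.getValue();
            coerced.put(field.getKey(), type.equals("BinaryView") ? "Binary" : type);
        }
        return coerced;
    }

    /** Stand-in for the bind check: each plan field must match the table schema's type. */
    static void bind(Map<String, String> planSchema, Map<String, String> tableSchema) {
        for (Map.Entry<String, String> field : planSchema.entrySet()) {
            String tableType = tableSchema.get(field.getKey());
            if (!field.getValue().equals(tableType)) {
                throw new IllegalArgumentException("Field '" + field.getKey()
                    + "' in Substrait schema has a different type (" + field.getValue()
                    + ") than the corresponding field in the table schema (" + tableType + ").");
            }
        }
    }

    public static void main(String[] args) {
        Map<String, String> plan = Map.of("client_ip", "Binary");         // what the plan declares
        Map<String, String> inferred = Map.of("client_ip", "BinaryView"); // what parquet inference reports
        try {
            bind(plan, inferred); // indexed path before the fix: schema registered uncoerced
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
        bind(plan, coerce(inferred)); // after the fix: coerce first, then bind succeeds
        System.out.println("bind ok after coercion");
    }
}
```

The point of the model is ordering: coercion has to happen before the schema is registered on the provider that the plan binds against, so that every consumer downstream sees the same types.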
Force-pushed from f8fee15 to df38aab.
…-engine route (opensearch-project#21681)
* [analytics-backend-datafusion] Register TIMESTAMP in STANDARD_PROJECT_OPS
* [analytics-engine] Catch Calcite Litmus.THROW AssertionError in DefaultPlanExecutor
* [analytics-backend-datafusion] Preserve fractional seconds in DatetimeOutputCastRewriter format
* [analytics-backend-datafusion] Apply schema coercion on indexed-executor placeholder
Signed-off-by: Kai Huang <ahkcs@amazon.com>
What this PR does
Four focused fixes in the analytics-engine sandbox plugins, narrowly scoped to the `search` command's analytics-engine route.

1. Register TIMESTAMP in STANDARD_PROJECT_OPS: `timestamp(expr)` (also implicit when the analyzer coerces a string literal to TIMESTAMP for `@timestamp="..."` comparisons) was rejected by `OpenSearchProjectRule` even though `TimestampFunctionAdapter` already wires it to DataFusion's `to_timestamp`.
2. Catch Calcite Litmus.THROW AssertionError in DefaultPlanExecutor: `Litmus.THROW` raises raw `AssertionError` from non-asserted Java. Without a catch, the analytics-engine executor lets it bubble to `OpenSearchUncaughtExceptionHandler` and kills the cluster JVM. Mirrors the SQL-plugin-side catch in `UnifiedQueryPlanner.plan`.
3. Preserve fractional seconds in DatetimeOutputCastRewriter: widens the `to_char` format from `'%Y-%m-%d %H:%M:%S'` to `'%Y-%m-%d %H:%M:%S%.f'` so `date_nanos` columns keep sub-second precision in the response. Follow-up to #21650.
4. Apply schema coercion on indexed-executor placeholder: the indexed-execution path inferred the parquet schema via `build_segments` and registered it directly on the `PlaceholderProvider` that `from_substrait_plan` binds against, bypassing the `schema_coerce::coerce_inferred_schema` that the three other `infer_schema` sites (session_context, query_executor, api) all apply. So parquet's `BinaryView` for `ip` columns leaked to the substrait bind step, conflicting with isthmus's `Binary`. One-line fix: call `coerce_inferred_schema` on the schema returned by `build_segments`.

Why commit 4 matters
Without commit 4, every analytics-engine query against an OTEL-logs-style index (any mapping that includes an `ip` field) fails at fragment start with a Substrait bind error: the field's type in the Substrait schema (`Binary`) differs from the type of the corresponding field in the table schema (`BinaryView`).
The Substrait plan declares `Binary` (isthmus has no view types), the table provider reports `BinaryView` (parquet's storage form). The three other `infer_schema` call sites coerce; only the indexed-executor path slipped. Adding the missing coerce call restores parity.

Pass / fail breakdown
`CalciteSearchCommandIT` on the analytics-engine route, against current `upstream/main`:
[Table not recoverable from the extracted page. Surviving row fragments: "`schema_coerce` `BinaryView` ↔ `Utf8`)", "`ip` column unblocked at the bind step)", "`byte[]` → `ExprIpValue`)".]
The drop from 34 → 26 with sql#5447 is not a regression: sql#5447's earlier iteration carried an SQL-side workaround (lower numeric / boolean predicates to typed `RexCall` to skip `query_string`) that lifted the same suite to 38 / 52. That workaround was deliberately removed in the latest sql#5447 revision because the root cause (Lucene-secondary `query_string` doesn't re-type literals inside the query body using the field mapping) should be fixed at the Lucene-secondary layer rather than in every consumer.

Remaining 25 failures: single upstream root cause, not blocking this PR
All 25 collapse to the typed-literal `query_string` gap on the analytics-engine Lucene-secondary path (adjacent to opensearch-project/OpenSearch#21562).
[Table not recoverable from the extracted page. Surviving header fragments: "`query_string`", "test".]
See opensearch-project/sql#5447 for the per-test bucketing and a minimal smoking-gun repro.

Testing
- `./gradlew check -p sandbox -Dsandbox.enabled=true`: see CI
- `CalciteSearchCommandIT` on analytics route: 26 / 52 pass with this PR + sql#5447 against current `upstream/main`