Skip to content

[analytics-engine] Coverage fixes for search command on the analytics-engine route#21681

Merged
mch2 merged 4 commits into
opensearch-project:mainfrom
ahkcs:pr/datafusion-extra-field-types
May 19, 2026
Merged

[analytics-engine] Coverage fixes for search command on the analytics-engine route#21681
mch2 merged 4 commits into
opensearch-project:mainfrom
ahkcs:pr/datafusion-extra-field-types

Conversation

@ahkcs
Copy link
Copy Markdown
Contributor

@ahkcs ahkcs commented May 15, 2026

What this PR does

Four focused fixes in the analytics-engine sandbox plugins, narrowly scoped to the search command's analytics-engine route.

# Commit What it fixes
1 Register TIMESTAMP in STANDARD_PROJECT_OPS PPL timestamp(expr) (also implicit when the analyzer coerces a string literal to TIMESTAMP for @timestamp="..." comparisons) was rejected by OpenSearchProjectRule even though TimestampFunctionAdapter already wires it to DataFusion's to_timestamp.
2 Catch Calcite Litmus.THROW AssertionError in DefaultPlanExecutor Calcite's Litmus.THROW raises raw AssertionError from non-asserted Java. Without a catch, the analytics-engine executor lets it bubble to OpenSearchUncaughtExceptionHandler and kills the cluster JVM. Mirrors the SQL-plugin-side catch in UnifiedQueryPlanner.plan.
3 Preserve fractional seconds in DatetimeOutputCastRewriter Widens the to_char format from '%Y-%m-%d %H:%M:%S' to '%Y-%m-%d %H:%M:%S%.f' so date_nanos columns keep sub-second precision in the response. Follow-up to #21650.
4 Apply schema coercion on indexed-executor placeholder The indexed-execution path inferred the parquet schema via build_segments and registered it directly on the PlaceholderProvider that from_substrait_plan binds against — bypassing the schema_coerce::coerce_inferred_schema that the three other infer_schema sites (session_context, query_executor, api) all apply. So parquet's BinaryView for ip columns leaked to the substrait bind step, conflicting with isthmus's Binary. One-line fix: call coerce_inferred_schema on the schema returned by build_segments.

Why commit 4 matters

Without commit 4, every analytics-engine query against an OTEL-logs-style index (any mapping that includes an ip field) fails at fragment start with:

Substrait error: Field 'attributes.client.ip' in Substrait schema has a different type
(Binary) than the corresponding field in the table schema (BinaryView).

The Substrait plan declares Binary (isthmus has no view types), the table provider reports BinaryView (parquet's storage form). The three other infer_schema call sites coerce; only the indexed-executor path slipped. Adding the missing coerce call restores parity.

Pass / fail breakdown

CalciteSearchCommandIT on the analytics-engine route, against current upstream/main:

Step PASS / 52
Pre-#21631 baseline 3
+ #21631 (IP/BINARY/MATCH_ONLY_TEXT scan + schema_coerce BinaryView↔Utf8) ~24
+ #21698 + #21628 + #21622 + #21562 (unsupported-type drop + PPL conversion + dedup + relevance functions on Lucene secondary) 27 (indexed-executor schema_coerce gap surfaces here)
+ Commit 4 (schema_coerce on indexed-executor) 34 (+7 — every OTEL-logs query that scans an ip column unblocked at the bind step)
+ Commits 1, 2, 3 (TIMESTAMP capability, Litmus catch, fractional seconds) 34 (no new passes; safety / precision improvements that the existing test queries don't directly assert on)
+ opensearch-project/sql#5447 (correctness fixes: datetime schema-type recovery + IP byte[]ExprIpValue) 26

The drop from 34 → 26 with sql#5447 is not a regression: sql#5447's earlier iteration carried an SQL-side workaround (lower numeric / boolean predicates to typed RexCall to skip query_string) that lifted the same suite to 38 / 52. That workaround was deliberately removed in the latest sql#5447 revision because the root cause — Lucene-secondary query_string doesn't re-type literals inside the query body using the field mapping — should be fixed at the Lucene-secondary layer rather than in every consumer.

Remaining 25 failures — single upstream root cause, not blocking this PR

All 25 collapse to the typed-literal query_string gap on the analytics-engine Lucene-secondary path (adjacent to opensearch-project/OpenSearch#21562):

  • 5 numeric / decimal range tests
  • 10 date-math / date range tests
  • 4 compound-predicate tests (mixed-AND, same family as Add Support for Relevance functions #21562's tests 10c / 10d that the author flagged as "handled separately")
  • 2 attribute / multi-field tests
  • 1 IP-via-query_string test
  • 2 Lucene-secondary edge cases (wildcard pattern, NOT semantics)
  • 1 composite-engine empty-mapping streaming test

See opensearch-project/sql#5447 for the per-test bucketing and a minimal smoking-gun repro:

where query_string([severityText], "ERROR")               # ✅ works
where query_string([severityNumber], "severityNumber:>15") # ❌ 0 rows (expected 6)

Testing

  • Sandbox-wide check: ./gradlew check -p sandbox -Dsandbox.enabled=true — see CI
  • CalciteSearchCommandIT on analytics route: 26 / 52 pass with this PR + sql#5447 against current upstream/main

@ahkcs ahkcs requested a review from a team as a code owner May 15, 2026 18:37
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 15, 2026

PR Reviewer Guide 🔍

(Review updated until commit df38aab)

Here are some key observations to aid the review process:

🧪 No relevant tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ No major issues detected

@github-actions
Copy link
Copy Markdown
Contributor

✅ Gradle check result for ca67f37: SUCCESS

@codecov
Copy link
Copy Markdown

codecov Bot commented May 15, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.45%. Comparing base (debf6ed) to head (df38aab).

Additional details and impacted files
@@             Coverage Diff              @@
##               main   #21681      +/-   ##
============================================
+ Coverage     73.33%   73.45%   +0.12%     
- Complexity    74996    75040      +44     
============================================
  Files          6012     6012              
  Lines        340934   340934              
  Branches      49076    49076              
============================================
+ Hits         250021   250441     +420     
+ Misses        71005    70484     -521     
- Partials      19908    20009     +101     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

ahkcs added a commit to ahkcs/sql that referenced this pull request May 15, 2026
Previously `visitSearch` discarded `Search.getOriginalExpression()` and
emitted a single `query_string(...)` function call covering the whole
predicate. On the analytics-engine route every such filter is
delegation-routed to Lucene; on a parquet-only composite shard the
delegation handle NPEs in `LuceneAnalyticsBackendPlugin.getFilterDelegationHandle`
because there is no Lucene reader.

Walk the typed `SearchExpression` AST instead. Structured fragments —
field comparisons, AND, OR, NOT, IN, plus time-modifier-resolved
comparisons (`earliest=` / `latest=`) — become native PPL filter AST
(`Compare` / `And` / `Or` / `Not` / `In`) that the existing rexVisitor
handles like any `where`-clause condition. DataFusion executes them
natively against parquet without any Lucene round-trip.

Free-text / phrase literals stay in `query_string` form because they
have no native equivalent. Top-level `AND` mixes the two: the
structured conjunct is lowered, the relevance conjunct stays in
`query_string`, and the resulting `AND(native, query_string(...))`
lets the planner route each clause independently — DataFusion prunes
parquet, Lucene handles the relevance term against the secondary
format.

Three guards keep the lowering faithful to PPL search semantics:

- `containsLuceneWildcard` — `*` / `?` in a comparison value forces
  the fallback so Lucene wildcards keep working (`severityText=ERR*`).
- `isOpenSearchDateMath` — values produced by `visitTimeModifierExpression`
  (`now`, `now-1h`, `now+1y/M-1M`, epoch-millis strings) can't be
  parsed as Calcite timestamps; keep them in `query_string` so Lucene
  evaluates the date math.
- `isLowerableField` — comparisons against fields absent from the
  relation's row type fall through (matches Lucene's "missing field →
  no docs" semantics; native would error out in the analyzer).

`SearchNot` always falls back: Lucene `NOT field=value` matches docs
where the field is missing OR field != value, but
`NOT (field = value)` in Calcite simplifies to `field <> value`,
which drops missing-field rows since `NULL <> value` evaluates to
NULL.

Net `CalciteSearchCommandIT` impact (analytics-engine route, all OS-
side fixes in `opensearch-project/OpenSearch#21681` in place): 3/52 →
35/52. Tests that pass via the native path include all field-equality,
AND, OR, IN, range, double-comparison, attribute-field, and mixed
boolean tests.

Signed-off-by: Kai Huang <ahkcs@amazon.com>
@ahkcs ahkcs changed the title [analytics-backend-datafusion] Declare scan support for IP / BINARY / MATCH_ONLY_TEXT [analytics-engine] Coverage fixes for search command on the analytics-engine route May 15, 2026
@github-actions
Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit 821fcd0

ahkcs added a commit to ahkcs/sql that referenced this pull request May 15, 2026
Documents the 3 → 35 / 52 progression for `CalciteSearchCommandIT` on
the analytics-engine route across this PR + opensearch-project/OpenSearch#21681,
plus the well-bucketed remaining 17 failures and where each one's fix
lands.

Mirrors the existing per-command status docs for stats, top/rare,
union, and appendcol.

Signed-off-by: Kai Huang <ahkcs@amazon.com>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 15, 2026

PR Code Suggestions ✨

Latest suggestions up to df38aab

Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
Security
Validate AssertionError source before catching

Catching AssertionError is generally unsafe because it can mask critical JVM-level
failures (e.g., OutOfMemoryError subclasses). Consider catching only specific
Calcite-thrown assertion errors or wrapping the Calcite planning phase separately to
isolate these expected assertion failures from true JVM errors.

sandbox/plugins/analytics-engine/src/main/java/org/opensearch/analytics/exec/DefaultPlanExecutor.java [121-128]

 } catch (AssertionError e) {
-    // Calcite's Litmus.THROW (used by RelOptUtil.eq, RexUtil.isFlat, Project.isValid,
-    // RexChecker) throws AssertionError directly via Java code rather than via the
-    // `assert` keyword, so JVM -da doesn't gate them. If one fires inside this
-    // executor, OpenSearchUncaughtExceptionHandler exits the cluster JVM. Convert to
-    // an IllegalStateException so the query path treats it as a per-query failure
-    // (HTTP 500 with a bucketable message) instead of cluster-fatal.
-    listener.onFailure(new IllegalStateException("Analytics-engine executor rejected the plan: " + e.getMessage(), e));
+    // Only convert Calcite-specific assertion errors; rethrow others
+    if (e.getStackTrace().length > 0 && 
+        e.getStackTrace()[0].getClassName().startsWith("org.apache.calcite")) {
+        listener.onFailure(new IllegalStateException("Analytics-engine executor rejected the plan: " + e.getMessage(), e));
+    } else {
+        throw e;
+    }
 }
Suggestion importance[1-10]: 4

__

Why: While the concern about catching AssertionError is valid, the suggested stack trace inspection is fragile and may not reliably distinguish Calcite errors from JVM-level failures. The PR's approach is reasonable given the specific Calcite behavior documented in the comment, though a more robust solution would be preferable.

Low

Previous suggestions

Suggestions up to commit f8fee15
CategorySuggestion                                                                                                                                    Impact
Possible issue
Validate AssertionError source before catching

Catching AssertionError is risky because it can mask critical JVM-level failures
(e.g., OutOfMemoryError subclasses). Consider catching only specific Calcite
assertion types or wrapping the executor logic in a more targeted exception handler
that validates the error source before converting to IllegalStateException.

sandbox/plugins/analytics-engine/src/main/java/org/opensearch/analytics/exec/DefaultPlanExecutor.java [113-120]

 } catch (AssertionError e) {
-    // Calcite's Litmus.THROW (used by RelOptUtil.eq, RexUtil.isFlat, Project.isValid,
-    // RexChecker) throws AssertionError directly via Java code rather than via the
-    // `assert` keyword, so JVM -da doesn't gate them. If one fires inside this
-    // executor, OpenSearchUncaughtExceptionHandler exits the cluster JVM. Convert to
-    // an IllegalStateException so the query path treats it as a per-query failure
-    // (HTTP 500 with a bucketable message) instead of cluster-fatal.
-    listener.onFailure(new IllegalStateException("Analytics-engine executor rejected the plan: " + e.getMessage(), e));
+    // Only convert Calcite-specific assertions; rethrow others to preserve JVM-fatal signals
+    if (e.getStackTrace().length > 0 && e.getStackTrace()[0].getClassName().startsWith("org.apache.calcite")) {
+        listener.onFailure(new IllegalStateException("Analytics-engine executor rejected the plan: " + e.getMessage(), e));
+    } else {
+        throw e;
+    }
 }
Suggestion importance[1-10]: 4

__

Why: While the concern about catching AssertionError is valid, the suggested stack trace inspection is fragile and may not reliably distinguish Calcite assertions from JVM-fatal errors. The current approach is reasonable given the specific Calcite context mentioned in the comment, though a more robust solution could be considered.

Low
Suggestions up to commit b920fac
CategorySuggestion                                                                                                                                    Impact
Possible issue
Avoid catching all AssertionErrors

Catching AssertionError is risky because it can mask critical JVM-level failures
(e.g., OutOfMemoryError subclasses). Consider catching only specific Calcite
assertion errors or using a more targeted approach to avoid suppressing genuine JVM
errors that should terminate the process.

sandbox/plugins/analytics-engine/src/main/java/org/opensearch/analytics/exec/DefaultPlanExecutor.java [113-120]

 } catch (AssertionError e) {
-    // Calcite's Litmus.THROW (used by RelOptUtil.eq, RexUtil.isFlat, Project.isValid,
-    // RexChecker) throws AssertionError directly via Java code rather than via the
-    // `assert` keyword, so JVM -da doesn't gate them. If one fires inside this
-    // executor, OpenSearchUncaughtExceptionHandler exits the cluster JVM. Convert to
-    // an IllegalStateException so the query path treats it as a per-query failure
-    // (HTTP 500 with a bucketable message) instead of cluster-fatal.
-    listener.onFailure(new IllegalStateException("Analytics-engine executor rejected the plan: " + e.getMessage(), e));
+    // Only convert Calcite-specific assertion errors to query failures
+    if (e.getStackTrace().length > 0 && 
+        e.getStackTrace()[0].getClassName().startsWith("org.apache.calcite")) {
+        listener.onFailure(new IllegalStateException("Analytics-engine executor rejected the plan: " + e.getMessage(), e));
+    } else {
+        throw e;
+    }
 }
Suggestion importance[1-10]: 7

__

Why: The suggestion correctly identifies a risk in catching all AssertionError instances, which could mask critical JVM failures. However, the proposed stack trace inspection approach is fragile and may not catch all Calcite assertions. A more robust solution would involve Calcite-specific exception types if available.

Medium
General
Document or remove empty loop

The empty loop body suggests incomplete implementation. If this loop is
intentionally empty for debugging or placeholder purposes, add a comment explaining
why. Otherwise, implement the intended logic or remove the unused loop.

sandbox/plugins/analytics-backend-datafusion/rust/src/indexed_executor.rs [458-459]

 let schema = crate::schema_coerce::coerce_inferred_schema(schema);
+// Segment validation loop - currently no per-segment processing needed
 for (i, seg) in segments.iter().enumerate() {
+    // Reserved for future segment validation logic
 }
Suggestion importance[1-10]: 3

__

Why: The suggestion identifies an empty loop that appears incomplete. While adding documentation would improve code clarity, this is a minor maintainability issue. The loop may be intentionally empty or a placeholder for future logic, but without more context it's difficult to assess if removal is appropriate.

Low
Suggestions up to commit ae47f33
CategorySuggestion                                                                                                                                    Impact
Possible issue
Validate AssertionError source before catching

Catching AssertionError is risky because it can mask critical JVM-level failures
(e.g., OutOfMemoryError subclasses). Consider catching only Calcite-specific
assertion errors by checking the stack trace or exception source, or document that
this catch block should only handle planning-phase assertions.

sandbox/plugins/analytics-engine/src/main/java/org/opensearch/analytics/exec/DefaultPlanExecutor.java [113-120]

 } catch (AssertionError e) {
     // Calcite's Litmus.THROW (used by RelOptUtil.eq, RexUtil.isFlat, Project.isValid,
     // RexChecker) throws AssertionError directly via Java code rather than via the
     // `assert` keyword, so JVM -da doesn't gate them. If one fires inside this
     // executor, OpenSearchUncaughtExceptionHandler exits the cluster JVM. Convert to
     // an IllegalStateException so the query path treats it as a per-query failure
     // (HTTP 500 with a bucketable message) instead of cluster-fatal.
-    listener.onFailure(new IllegalStateException("Analytics-engine executor rejected the plan: " + e.getMessage(), e));
+    if (e.getStackTrace().length > 0 && e.getStackTrace()[0].getClassName().startsWith("org.apache.calcite")) {
+        listener.onFailure(new IllegalStateException("Analytics-engine executor rejected the plan: " + e.getMessage(), e));
+    } else {
+        throw e;
+    }
 }
Suggestion importance[1-10]: 4

__

Why: While the concern about catching AssertionError is valid, the suggested stack trace check is fragile and could miss Calcite errors from nested calls. The PR's approach is reasonable given the documented context that Calcite's Litmus.THROW specifically throws AssertionError in planning code. The suggestion adds complexity without significantly improving safety.

Low
Suggestions up to commit 672aa21
CategorySuggestion                                                                                                                                    Impact
General
Log AssertionError before conversion

Catching AssertionError is risky because it can mask critical JVM-level failures
(e.g., OutOfMemoryError subclasses). Consider catching only specific Calcite
assertion errors or logging the error before conversion to ensure visibility of
unexpected assertion failures.

sandbox/plugins/analytics-engine/src/main/java/org/opensearch/analytics/exec/DefaultPlanExecutor.java [116-123]

 } catch (AssertionError e) {
     // Calcite's Litmus.THROW (used by RelOptUtil.eq, RexUtil.isFlat, Project.isValid,
     // RexChecker) throws AssertionError directly via Java code rather than via the
     // `assert` keyword, so JVM -da doesn't gate them. If one fires inside this
     // executor, OpenSearchUncaughtExceptionHandler exits the cluster JVM. Convert to
     // an IllegalStateException so the query path treats it as a per-query failure
     // (HTTP 500 with a bucketable message) instead of cluster-fatal.
+    logger.error("AssertionError caught in analytics-engine executor", e);
     listener.onFailure(new IllegalStateException("Analytics-engine executor rejected the plan: " + e.getMessage(), e));
 }
Suggestion importance[1-10]: 5

__

Why: Adding logging before converting AssertionError to IllegalStateException improves observability and debugging. However, the comment already explains the specific Calcite assertions being caught, and the concern about masking critical JVM failures is somewhat mitigated by the narrow context. The suggestion provides moderate value for operational visibility.

Low
Suggestions up to commit 1b2d651
CategorySuggestion                                                                                                                                    Impact
Possible issue
Validate AssertionError source before catching

Catching AssertionError is risky because it can mask critical JVM-level failures
(e.g., OutOfMemoryError subclasses). Consider catching only Calcite-specific
assertion errors by checking the stack trace or exception source, or document that
this catch block should only handle planning-phase assertions.

sandbox/plugins/analytics-engine/src/main/java/org/opensearch/analytics/exec/DefaultPlanExecutor.java [116-123]

 } catch (AssertionError e) {
     // Calcite's Litmus.THROW (used by RelOptUtil.eq, RexUtil.isFlat, Project.isValid,
     // RexChecker) throws AssertionError directly via Java code rather than via the
     // `assert` keyword, so JVM -da doesn't gate them. If one fires inside this
     // executor, OpenSearchUncaughtExceptionHandler exits the cluster JVM. Convert to
     // an IllegalStateException so the query path treats it as a per-query failure
     // (HTTP 500 with a bucketable message) instead of cluster-fatal.
-    listener.onFailure(new IllegalStateException("Analytics-engine executor rejected the plan: " + e.getMessage(), e));
+    if (isCalciteAssertion(e)) {
+        listener.onFailure(new IllegalStateException("Analytics-engine executor rejected the plan: " + e.getMessage(), e));
+    } else {
+        throw e;
+    }
 }
Suggestion importance[1-10]: 7

__

Why: The suggestion correctly identifies a risk in catching AssertionError broadly, as it could mask critical JVM failures. However, the improved_code introduces an undefined isCalciteAssertion(e) method without implementation, and the comment already documents the specific Calcite use case. The concern is valid but the solution needs refinement.

Medium

@github-actions
Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit be39849

ahkcs added a commit to ahkcs/sql that referenced this pull request May 15, 2026
Previously `visitSearch` discarded `Search.getOriginalExpression()` and
emitted a single `query_string(...)` function call covering the whole
predicate. On the analytics-engine route every such filter is
delegation-routed to Lucene; on a parquet-only composite shard the
delegation handle NPEs in `LuceneAnalyticsBackendPlugin.getFilterDelegationHandle`
because there is no Lucene reader.

Walk the typed `SearchExpression` AST instead. Structured fragments —
field comparisons, AND, OR, NOT, IN, plus time-modifier-resolved
comparisons (`earliest=` / `latest=`) — become native PPL filter AST
(`Compare` / `And` / `Or` / `Not` / `In`) that the existing rexVisitor
handles like any `where`-clause condition. DataFusion executes them
natively against parquet without any Lucene round-trip.

Free-text / phrase literals stay in `query_string` form because they
have no native equivalent. Top-level `AND` mixes the two: the
structured conjunct is lowered, the relevance conjunct stays in
`query_string`, and the resulting `AND(native, query_string(...))`
lets the planner route each clause independently — DataFusion prunes
parquet, Lucene handles the relevance term against the secondary
format.

Three guards keep the lowering faithful to PPL search semantics:

- `containsLuceneWildcard` — `*` / `?` in a comparison value forces
  the fallback so Lucene wildcards keep working (`severityText=ERR*`).
- `isOpenSearchDateMath` — values produced by `visitTimeModifierExpression`
  (`now`, `now-1h`, `now+1y/M-1M`, epoch-millis strings) can't be
  parsed as Calcite timestamps; keep them in `query_string` so Lucene
  evaluates the date math.
- `isLowerableField` — comparisons against fields absent from the
  relation's row type fall through (matches Lucene's "missing field →
  no docs" semantics; native would error out in the analyzer).

`SearchNot` always falls back: Lucene `NOT field=value` matches docs
where the field is missing OR field != value, but
`NOT (field = value)` in Calcite simplifies to `field <> value`,
which drops missing-field rows since `NULL <> value` evaluates to
NULL.

Net `CalciteSearchCommandIT` impact (analytics-engine route, all OS-
side fixes in `opensearch-project/OpenSearch#21681` in place): 3/52 →
35/52. Tests that pass via the native path include all field-equality,
AND, OR, IN, range, double-comparison, attribute-field, and mixed
boolean tests.

Signed-off-by: Kai Huang <ahkcs@amazon.com>
@ahkcs ahkcs force-pushed the pr/datafusion-extra-field-types branch 2 times, most recently from 7258541 to 64127e6 Compare May 15, 2026 22:37
@github-actions
Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit 64127e6

@github-actions
Copy link
Copy Markdown
Contributor

❕ Gradle check result for 64127e6: UNSTABLE

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

@ahkcs ahkcs force-pushed the pr/datafusion-extra-field-types branch from 64127e6 to 1b2d651 Compare May 18, 2026 17:34
@github-actions
Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit 1b2d651

@ahkcs ahkcs force-pushed the pr/datafusion-extra-field-types branch from 1b2d651 to 672aa21 Compare May 18, 2026 18:08
@github-actions
Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit 672aa21

@github-actions
Copy link
Copy Markdown
Contributor

❌ Gradle check result for 672aa21: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit ae47f33

@github-actions
Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit b920fac

ahkcs added a commit to ahkcs/sql that referenced this pull request May 18, 2026
Previously `visitSearch` discarded `Search.getOriginalExpression()` and
emitted a single `query_string(...)` function call covering the whole
predicate. On the analytics-engine route every such filter is
delegation-routed to Lucene; on a parquet-only composite shard the
delegation handle NPEs in `LuceneAnalyticsBackendPlugin.getFilterDelegationHandle`
because there is no Lucene reader.

Walk the typed `SearchExpression` AST instead. Structured fragments —
field comparisons, AND, OR, NOT, IN, plus time-modifier-resolved
comparisons (`earliest=` / `latest=`) — become native PPL filter AST
(`Compare` / `And` / `Or` / `Not` / `In`) that the existing rexVisitor
handles like any `where`-clause condition. DataFusion executes them
natively against parquet without any Lucene round-trip.

Free-text / phrase literals stay in `query_string` form because they
have no native equivalent. Top-level `AND` mixes the two: the
structured conjunct is lowered, the relevance conjunct stays in
`query_string`, and the resulting `AND(native, query_string(...))`
lets the planner route each clause independently — DataFusion prunes
parquet, Lucene handles the relevance term against the secondary
format.

Three guards keep the lowering faithful to PPL search semantics:

- `containsLuceneWildcard` — `*` / `?` in a comparison value forces
  the fallback so Lucene wildcards keep working (`severityText=ERR*`).
- `isOpenSearchDateMath` — values produced by `visitTimeModifierExpression`
  (`now`, `now-1h`, `now+1y/M-1M`, epoch-millis strings) can't be
  parsed as Calcite timestamps; keep them in `query_string` so Lucene
  evaluates the date math.
- `isLowerableField` — comparisons against fields absent from the
  relation's row type fall through (matches Lucene's "missing field →
  no docs" semantics; native would error out in the analyzer).

`SearchNot` always falls back: Lucene `NOT field=value` matches docs
where the field is missing OR field != value, but
`NOT (field = value)` in Calcite simplifies to `field <> value`,
which drops missing-field rows since `NULL <> value` evaluates to
NULL.

Net `CalciteSearchCommandIT` impact (analytics-engine route, all OS-
side fixes in `opensearch-project/OpenSearch#21681` in place): 3/52 →
35/52. Tests that pass via the native path include all field-equality,
AND, OR, IN, range, double-comparison, attribute-field, and mixed
boolean tests.

Signed-off-by: Kai Huang <ahkcs@amazon.com>
@github-actions
Copy link
Copy Markdown
Contributor

❌ Gradle check result for b920fac: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Comment thread sandbox/plugins/analytics-backend-datafusion/rust/src/indexed_executor.rs Outdated
@ahkcs ahkcs force-pushed the pr/datafusion-extra-field-types branch from b920fac to f8fee15 Compare May 19, 2026 18:09
@github-actions
Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit f8fee15

ahkcs added 4 commits May 19, 2026 11:15
…_OPS

PPL `timestamp(expr)` lowers to a `ScalarFunction.TIMESTAMP` call. The
TimestampFunctionAdapter already wires the call into DataFusion's native
`to_timestamp`, but `STANDARD_PROJECT_OPS` did not declare TIMESTAMP, so
`OpenSearchProjectRule.annotateExpr` rejected every plan that contained
the operator with "No backend supports scalar function [TIMESTAMP]
among [datafusion]".

Same call also shows up implicitly after the analyzer coerces a string
literal to TIMESTAMP for column comparisons such as
`@timestamp="2024-01-15T10:30:00Z"` once `@timestamp` is typed as
TIMESTAMP (see the date_nanos schema fix in the previous commit).

Update the inline comment block to match: the legacy-engine-only path
the prior comment described is gone now that the capability is wired.

Signed-off-by: Kai Huang <ahkcs@amazon.com>
…ltPlanExecutor

RexUtil.isFlat / RelOptUtil.eq / Project.isValid / RexChecker call into
Calcite's Litmus.THROW, which raises AssertionError from raw Java code
rather than via the `assert` keyword. JVM `-da` doesn't gate that path,
so an assertion firing inside a search thread escapes to
OpenSearchUncaughtExceptionHandler and exits the cluster JVM.

This bit hard once `CalciteRelNodeVisitor` started lowering structured
PPL `search` predicates to native filter shape: queries like
`severityNumber="not-a-number"` fold to `=(SAFE_CAST($X), null)` ahead of
the marking phase, and the Litmus check fires before any plan-executor
listener gets to translate the error. The cluster died mid-IT and 21
subsequent tests failed with `Connection refused`.

Convert AssertionError caught at the executor entrypoint to an
IllegalStateException so the query reports as HTTP 500 with a
bucketable message and the cluster survives. The same pattern is
already in place at `UnifiedQueryPlanner.plan` on the SQL plugin side;
this is the analytics-engine-side mirror so neither layer can produce a
cluster-fatal assertion.

Signed-off-by: Kai Huang <ahkcs@amazon.com>
…eOutputCastRewriter format

Widen the format string passed to `to_char` from the seconds-only
{@code "%Y-%m-%d %H:%M:%S"} to {@code "%Y-%m-%d %H:%M:%S%.f"}. The
trailing {@code %.f} is chrono's variable-length fractional-second
specifier — a leading dot followed by 0-9 digits, omitted when the
value has no sub-second precision.

This matches PPL's legacy formatting for {@code date} and
{@code date_nanos} fields where the displayed precision tracks the
source value:

- {@code 2024-01-15T10:30:01.23456789Z} (date_nanos) →
  {@code "2024-01-15 10:30:01.23456789"} (legacy) / now
  {@code "2024-01-15 10:30:01.234567890"} (analytics route — internally
  9-digit, leading 0 because Arrow Timestamp(ns) precision is fixed)
- {@code 2025-08-01T03:47:41Z} (date)            →
  {@code "2025-08-01 03:47:41"} (both paths) — no decimal because the
  source value has no fractional component

Surfaced by `CalciteSearchCommandIT.testSearchWithDateRangeComparisons`.
Follow-up to opensearch-project/sql#5420 which the original PR (opensearch-project#21650)
closed with a seconds-only format that dropped the fractional digits
the tests still expect.

Signed-off-by: Kai Huang <ahkcs@amazon.com>
…tor placeholder

The indexed-execution path inferred the parquet schema via build_segments
(which calls FileFormat::infer_schema) and registered it directly on the
PlaceholderProvider that from_substrait_plan binds against. The non-indexed
paths (session_context.rs, query_executor.rs, api.rs) all routed their
inferred schemas through schema_coerce::coerce_inferred_schema before
registering, but the indexed path skipped this step.

Result: an OpenSearch `ip` column lands as parquet `BinaryView` on disk;
isthmus on the Java side serializes the Substrait base schema as plain
`Binary` (Substrait has no view types). The placeholder reports `BinaryView`
while the plan declares `Binary` — DataFusion's substrait consumer rejects
the bind with:

    Substrait error: Field '<x>' in Substrait schema has a different type
    (Binary) than the corresponding field in the table schema (BinaryView).

Every analytics-engine query against an index that includes an `ip` column
(every OTEL-logs query in CalciteSearchCommandIT, for example) fails at
fragment start.

Apply coerce_inferred_schema right after build_segments, before
PlaceholderProvider construction. The placeholder, the
expr_to_bool_tree analysis, and the downstream IndexedTableProvider all
see the same Substrait-compatible (Binary / Int64 / Float32) schema, so
the bind succeeds and the parquet reader's SchemaAdapter handles the
per-batch BinaryView→Binary relabeling at scan time.

This restores parity with the other infer_schema sites; no behavior
change for non-IP columns.

Signed-off-by: Kai Huang <ahkcs@amazon.com>
@ahkcs ahkcs force-pushed the pr/datafusion-extra-field-types branch from f8fee15 to df38aab Compare May 19, 2026 18:16
@github-actions
Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit df38aab

@github-actions
Copy link
Copy Markdown
Contributor

✅ Gradle check result for df38aab: SUCCESS

@ahkcs ahkcs requested review from mch2, sandeshkr419 and vinaykpud May 19, 2026 21:25
@mch2 mch2 merged commit f15c326 into opensearch-project:main May 19, 2026
19 of 20 checks passed
nssuresh2007 pushed a commit to nssuresh2007/OpenSearch that referenced this pull request May 20, 2026
…-engine route (opensearch-project#21681)

* [analytics-backend-datafusion] Register TIMESTAMP in STANDARD_PROJECT_OPS

PPL `timestamp(expr)` lowers to a `ScalarFunction.TIMESTAMP` call. The
TimestampFunctionAdapter already wires the call into DataFusion's native
`to_timestamp`, but `STANDARD_PROJECT_OPS` did not declare TIMESTAMP, so
`OpenSearchProjectRule.annotateExpr` rejected every plan that contained
the operator with "No backend supports scalar function [TIMESTAMP]
among [datafusion]".

Same call also shows up implicitly after the analyzer coerces a string
literal to TIMESTAMP for column comparisons such as
`@timestamp="2024-01-15T10:30:00Z"` once `@timestamp` is typed as
TIMESTAMP (see the date_nanos schema fix in the previous commit).

Update the inline comment block to match: the legacy-engine-only path
the prior comment described is gone now that the capability is wired.

Signed-off-by: Kai Huang <ahkcs@amazon.com>

* [analytics-engine] Catch Calcite Litmus.THROW AssertionError in DefaultPlanExecutor

RexUtil.isFlat / RelOptUtil.eq / Project.isValid / RexChecker call into
Calcite's Litmus.THROW, which raises AssertionError from raw Java code
rather than via the `assert` keyword. JVM `-da` doesn't gate that path,
so an assertion firing inside a search thread escapes to
OpenSearchUncaughtExceptionHandler and exits the cluster JVM.

This bit hard once `CalciteRelNodeVisitor` started lowering structured
PPL `search` predicates to native filter shape: queries like
`severityNumber="not-a-number"` fold to `=(SAFE_CAST($X), null)` ahead of
the marking phase, and the Litmus check fires before any plan-executor
listener gets to translate the error. The cluster died mid-IT and 21
subsequent tests failed with `Connection refused`.

Convert AssertionError caught at the executor entrypoint to an
IllegalStateException so the query reports as HTTP 500 with a
bucketable message and the cluster survives. The same pattern is
already in place at `UnifiedQueryPlanner.plan` on the SQL plugin side;
this is the analytics-engine-side mirror so neither layer can produce a
cluster-fatal assertion.

Signed-off-by: Kai Huang <ahkcs@amazon.com>

* [analytics-backend-datafusion] Preserve fractional seconds in DatetimeOutputCastRewriter format

Widen the format string passed to `to_char` from the seconds-only
{@code "%Y-%m-%d %H:%M:%S"} to {@code "%Y-%m-%d %H:%M:%S%.f"}. The
trailing {@code %.f} is chrono's variable-length fractional-second
specifier — a leading dot followed by 0-9 digits, omitted when the
value has no sub-second precision.

This matches PPL's legacy formatting for {@code date} and
{@code date_nanos} fields where the displayed precision tracks the
source value:

- {@code 2024-01-15T10:30:01.23456789Z} (date_nanos) →
  {@code "2024-01-15 10:30:01.23456789"} (legacy) / now
  {@code "2024-01-15 10:30:01.234567890"} (analytics route — internally
  9-digit, leading 0 because Arrow Timestamp(ns) precision is fixed)
- {@code 2025-08-01T03:47:41Z} (date)            →
  {@code "2025-08-01 03:47:41"} (both paths) — no decimal because the
  source value has no fractional component

Surfaced by `CalciteSearchCommandIT.testSearchWithDateRangeComparisons`.
Follow-up to opensearch-project/sql#5420 which the original PR (opensearch-project#21650)
closed with a seconds-only format that dropped the fractional digits
the tests still expect.

Signed-off-by: Kai Huang <ahkcs@amazon.com>

* [analytics-backend-datafusion] Apply schema coercion on indexed-executor placeholder

The indexed-execution path inferred the parquet schema via build_segments
(which calls FileFormat::infer_schema) and registered it directly on the
PlaceholderProvider that from_substrait_plan binds against. The non-indexed
paths (session_context.rs, query_executor.rs, api.rs) all routed their
inferred schemas through schema_coerce::coerce_inferred_schema before
registering, but the indexed path skipped this step.

Result: an OpenSearch `ip` column lands as parquet `BinaryView` on disk;
isthmus on the Java side serializes the Substrait base schema as plain
`Binary` (Substrait has no view types). The placeholder reports `BinaryView`
while the plan declares `Binary` — DataFusion's substrait consumer rejects
the bind with:

    Substrait error: Field '<x>' in Substrait schema has a different type
    (Binary) than the corresponding field in the table schema (BinaryView).

Every analytics-engine query against an index that includes an `ip` column
(every OTEL-logs query in CalciteSearchCommandIT, for example) fails at
fragment start.

Apply coerce_inferred_schema right after build_segments, before
PlaceholderProvider construction. The placeholder, the
expr_to_bool_tree analysis, and the downstream IndexedTableProvider all
see the same Substrait-compatible (Binary / Int64 / Float32) schema, so
the bind succeeds and the parquet reader's SchemaAdapter handles the
per-batch BinaryView→Binary relabeling at scan time.

This restores parity with the other infer_schema sites; no behavior
change for non-IP columns.

Signed-off-by: Kai Huang <ahkcs@amazon.com>

---------

Signed-off-by: Kai Huang <ahkcs@amazon.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants