Skip to content

Test dual-impl (native + codegen-dispatch) expressions consistently across the full routing matrix #4616

@andygrove

Description

@andygrove

Background

Several expressions now have both a native (Rust) implementation and a JVM codegen-dispatch fallback (mixing in CodegenDispatchFallback). For these, the execution path is selected by a combination of the per-input support level and two config flags:

  • spark.comet.expression.<Name>.allowIncompatible
  • spark.comet.exec.scalaUDF.codegen.enabled (default true)

This produces a routing matrix with five outcomes:

# Support level for the input allowIncompatible codegen.enabled Resulting path What a test should assert
1 Compatible any any native Rust native + matches Spark
2 Incompatible true any native Rust stays in Comet only (results may diverge)
3 Incompatible false true (default) JVM codegen dispatch native + matches Spark
4 Incompatible false false Spark fallback fallback reason + matches Spark
5 Unsupported any any Spark fallback fallback reason + matches Spark

Problem

Coverage of this matrix is currently ad hoc and inconsistent across expressions. date_format in CometTemporalExpressionSuite covers the full trio for rows 2/3/4, but most other dual-impl expressions only assert the default dispatch path (row 3). The fallback-when-dispatcher-disabled path (row 4) and the native-when-allowIncompatible path (row 2) are easy to drop silently, since a path-agnostic checkSparkAnswer passes regardless of which engine ran.

This was observed while reviewing #4538: tightening the NTZ hour/minute/second and non-UTC date_trunc tests to assert the dispatch path inadvertently removed any assertion that those cases still fall back correctly when the dispatcher is disabled. That specific gap was fixed in #4538, but the same gap likely exists for the other expression families that route through the dispatcher (string, collection, map, json, etc.).

Proposal

Establish a consistent testing convention for dual-impl (native + codegen-dispatch) expressions so every such expression covers the applicable rows of the matrix above. Likely shape:

  • A small set of shared test helpers that encode the matrix once, for example:
    • checkNativeCompatible(sql) (row 1)
    • checkStaysNativeWhenAllowIncompatible(sql, exprName) (row 2, no equality)
    • checkCodegenDispatch(sql) (row 3)
    • checkFallbackWhenCodegenDisabled(sql) (row 4)
    • checkUnsupportedFallback(sql, reason) (row 5)
  • Per-expression suites call the relevant helpers, so the intended path for each input shape is explicit and a missing quadrant is obvious.

Then retrofit the existing dual-impl expressions (those mixing in CodegenDispatchFallback) to use the convention.

Scope notes

Metadata

Metadata

Assignees

No one assigned

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions