Route structured-text functions (CSV/JSON/XPath/XML) through the codegen dispatcher #4619

@andygrove

Description

What is the problem the feature request solves?

The structured-text functions have no Comet implementation, so any query using them falls back to Spark for the enclosing operator:

  • CSV: from_csv, schema_of_csv
  • JSON: schema_of_json, json_object_keys
  • XPath: xpath, xpath_boolean, xpath_short, xpath_int, xpath_long, xpath_float, xpath_double, xpath_string
  • XML (Spark 4.0+): from_xml, to_xml, schema_of_xml

They are hard to implement natively in Rust, because the CSV/JSON/XML parsing would have to reproduce Spark-specific semantics.

Describe the potential solution

These all extend Spark's CodegenFallback, which the codegen dispatcher already admits (the same mechanism backing from_json/to_json). Routing them through the dispatcher keeps a top-level projection native while matching Spark exactly.

On Spark 3.4/3.5 they are plain expressions and can be registered directly in the serde maps. On Spark 4.x they are RuntimeReplaceable, and the optimizer rewrites them to Invoke(evaluator) / StaticInvoke before Comet sees the plan. They must therefore be dispatched from the 4.x shim, mirroring how from_json/to_json/parse_url are already handled.
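
To make the version split concrete, here is a minimal, self-contained sketch of the routing decision. It is a toy model, not Comet or Spark code: `Expr`, `Plain`, `Invoke`, `StaticInvoke`, `Route`, and `Routing` are hypothetical stand-ins defined only for this example, and the evaluator and owner class names are placeholders rather than real Spark classes. The function names are the ones listed above.

```scala
// Toy model only: these types are hypothetical stand-ins, not Spark or Comet classes.
sealed trait Expr
// A plain expression node, as the functions appear on Spark 3.4/3.5.
final case class Plain(name: String, args: Seq[Expr] = Nil) extends Expr
// The shapes a RuntimeReplaceable function takes after the Spark 4.x optimizer rewrite.
final case class Invoke(evaluatorClass: String, args: Seq[Expr] = Nil) extends Expr
final case class StaticInvoke(ownerClass: String, method: String, args: Seq[Expr] = Nil)
    extends Expr

sealed trait Route
case object CodegenDispatch extends Route
case object FallBackToSpark extends Route

object Routing {
  // 3.4/3.5 path: direct registration keyed by function name (subset shown).
  val plainRegistry: Set[String] = Set(
    "from_csv", "schema_of_csv", "schema_of_json", "json_object_keys",
    "xpath", "xpath_boolean", "xpath_int", "xpath_string")

  // 4.x path: the shim has to recognize the rewritten form instead.
  // Placeholder class names, assumed for illustration only.
  val evaluatorRegistry: Set[String] = Set("ExampleCsvEvaluator", "ExampleXPathEvaluator")
  val staticRegistry: Set[(String, String)] = Set(("ExampleXmlUtils", "schemaOfXml"))

  def route(e: Expr): Route = e match {
    case Plain(name, _) if plainRegistry(name)                              => CodegenDispatch
    case Invoke(cls, _) if evaluatorRegistry(cls)                           => CodegenDispatch
    case StaticInvoke(owner, method, _) if staticRegistry((owner, method)) => CodegenDispatch
    case _                                                                  => FallBackToSpark
  }
}
```

In this model, `Routing.route(Plain("from_csv"))` takes the 3.4/3.5 path, while `Routing.route(Invoke("ExampleCsvEvaluator"))` takes the 4.x path. Anything unrecognized yields `FallBackToSpark`, which corresponds to today's behavior for these functions.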

Additional context

Tier 2 of the codegen-dispatch expansion identified in #4616. Related: the HOF tier in #4618.
