feat: route higher-order functions through codegen dispatcher [experimental / WIP] by andygrove · Pull Request #4618 · apache/datafusion-comet

andygrove · 2026-06-10T14:22:58Z

Which issue does this PR close?

Closes #4617.

Rationale for this change

Spark's array and map higher-order (lambda) functions had no Comet implementation, so any query using them fell back to Spark for the enclosing operator. They are hard to implement natively in Rust because they evaluate an arbitrary user lambda per element.

The codegen dispatcher already admits CodegenFallback expressions, which includes all higher-order functions: CometBatchKernelCodegen.canHandle accepts them, and CometCodegenHOFSuite already proved they evaluate correctly inside the kernel when nested in a registered ScalaUDF. Wiring each HOF into the serde lets a top-level HOF projection stay native instead of falling back, while matching Spark exactly (the kernel runs Spark's own per-element evaluation).

What changes are included in this PR?

Register the following previously-unsupported higher-order functions as CometCodegenDispatch (no native Rust path; they ride the codegen dispatcher):

- array: transform, exists, forall, aggregate/reduce, array_sort (with comparator), zip_with
- map: map_filter, transform_keys, transform_values, map_zip_with
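
For readers less familiar with these functions, the sketch below models what a few of them compute for a single row. It is plain Python, not Comet or Spark code; the helper names simply mirror the SQL function names, and the three-valued null logic of exists and forall is deliberately left out. The `row` dictionary is a hypothetical stand-in for a row with an array column and a second column captured by the lambda.

```python
from typing import Any, Callable, Dict, List, Optional


def transform(arr: Optional[List[Any]], f: Callable[[Any], Any]) -> Optional[List[Any]]:
    """Apply f to every element; a null array yields null."""
    if arr is None:
        return None
    return [f(x) for x in arr]


def aggregate(arr, zero, merge, finish=lambda acc: acc):
    """Fold the array into an accumulator, then apply the finish function."""
    if arr is None:
        return None
    acc = zero
    for x in arr:
        acc = merge(acc, x)
    return finish(acc)


def zip_with(left, right, f):
    """Combine two arrays element-wise; the shorter one is padded with nulls."""
    if left is None or right is None:
        return None
    n = max(len(left), len(right))
    padded_left = left + [None] * (n - len(left))
    padded_right = right + [None] * (n - len(right))
    return [f(x, y) for x, y in zip(padded_left, padded_right)]


def map_filter(m: Optional[Dict[Any, Any]], pred) -> Optional[Dict[Any, Any]]:
    """Keep only the entries for which the predicate is true."""
    if m is None:
        return None
    return {k: v for k, v in m.items() if pred(k, v)}


def transform_values(m, f):
    """Replace each value with f(key, value), keeping the keys."""
    if m is None:
        return None
    return {k: f(k, v) for k, v in m.items()}


# Column capture: the lambda closes over another column of the same row.
row = {"arr": [1, 2, 3], "offset": 10}
shifted = transform(row["arr"], lambda x: x + row["offset"])
```

The lambda is an arbitrary function evaluated once per element, which is why the PR reuses Spark's own evaluation inside the codegen kernel rather than re-implementing each function natively.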

When spark.comet.exec.scalaUDF.codegen.enabled=false, these fall back to Spark cleanly.

To keep higher-order functions over nested-complex element types correct natively, two supporting fixes are included:

- Emit copy() on the codegen kernel's nested input views (InputArray, InputStruct, InputMap). Spark's interpreted lambda evaluation calls InternalRow.copyValue on complex elements; the views read straight off the per-batch Arrow buffers, so copy() now deep-materializes into on-heap GenericArrayData / GenericInternalRow / ArrayBasedMapData, cloning strings and recursing into nested elements.
- Reconcile nested operand nullability for native comparisons. DataFusion's nested comparison kernel requires identical types including nested field nullability, whereas Spark comparisons ignore it. When a comparison's operands are nested types that differ only in nullability (for example a transform result with non-null elements compared against a nullable-element column), both operands are cast to their nullability-union type. This surfaces with char-type read-side padding, which Spark rewrites to transform(arr, x -> mapsort(x)).
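
The nullability reconciliation can be pictured with a small type model. This is a hedged sketch only: the dataclasses and the `nullability_union` helper are hypothetical names invented for illustration, not the PR's Scala code or DataFusion's API. It assumes the rule described above: two nested types that match in everything except nullability reconcile to a type that is nullable wherever either side is.

```python
from dataclasses import dataclass
from typing import Optional, Tuple, Union


@dataclass(frozen=True)
class ArrayType:
    element: "DataType"
    contains_null: bool


@dataclass(frozen=True)
class StructField:
    name: str
    dtype: "DataType"
    nullable: bool


@dataclass(frozen=True)
class StructType:
    fields: Tuple[StructField, ...]


@dataclass(frozen=True)
class MapType:
    key: "DataType"
    value: "DataType"
    value_contains_null: bool


# Primitive types are modelled as plain strings such as "int" or "string".
DataType = Union[str, ArrayType, StructType, MapType]


def nullability_union(a: DataType, b: DataType) -> Optional[DataType]:
    """Return the type that is nullable wherever a or b is.

    Returns None when the two types differ in anything besides nullability,
    in which case no reconciliation applies.
    """
    if isinstance(a, str) or isinstance(b, str):
        return a if a == b else None
    if isinstance(a, ArrayType) and isinstance(b, ArrayType):
        elem = nullability_union(a.element, b.element)
        if elem is None:
            return None
        return ArrayType(elem, a.contains_null or b.contains_null)
    if isinstance(a, StructType) and isinstance(b, StructType):
        if len(a.fields) != len(b.fields):
            return None
        merged = []
        for fa, fb in zip(a.fields, b.fields):
            if fa.name != fb.name:
                return None
            dtype = nullability_union(fa.dtype, fb.dtype)
            if dtype is None:
                return None
            merged.append(StructField(fa.name, dtype, fa.nullable or fb.nullable))
        return StructType(tuple(merged))
    if isinstance(a, MapType) and isinstance(b, MapType):
        key = nullability_union(a.key, b.key)
        value = nullability_union(a.value, b.value)
        if key is None or value is None:
            return None
        return MapType(key, value, a.value_contains_null or b.value_contains_null)
    return None


# A transform result with non-null elements compared against a nullable-element column.
transform_result = ArrayType("string", contains_null=False)
nullable_column = ArrayType("string", contains_null=True)
cast_target = nullability_union(transform_result, nullable_column)
```

In this model both operands would be cast to `cast_target` before the comparison, so a kernel that demands identical types sees the same type on each side.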

The columnar-shuffle map-array-element test is updated: on Spark 4.0+ the shuffle key normalizes to transform(arr, x -> mapsort(x)), which now stays native, so the shuffle runs as a Comet shuffle on all versions.

array_filter with a general lambda is intentionally left out: it already has a partial native serde (the array_compact / IsNotNull special case) that reports Unsupported for general lambdas, so routing it through the dispatcher is a separate, more involved change.

How are these changes tested?

Adds SQL file tests under spark/src/test/resources/sql-tests/expressions/array and .../map, one per higher-order function, covering basic usage, column capture (the lambda referencing another column), nested element types (array<array>, array<struct>, array<map>), null and empty collections, the nested-comparison reconciliation path, and the disabled-dispatcher fallback path. A CometCodegenSourceSuite assertion locks in the copy() emission on the nested input views.
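
Why the copy() emission deserves its own assertion is easier to see with a toy model of the aliasing problem. The sketch below is illustrative Python, not the generated kernel code: `InputArrayView` and `deep_copy_value` are hypothetical stand-ins, a Python list plays the role of the per-batch buffer, and it assumes that buffer can be overwritten once the batch is done.

```python
class InputArrayView:
    """Toy stand-in for a zero-copy array view over a per-batch buffer."""

    def __init__(self, buffer, start, length):
        self._buffer = buffer
        self._start = start
        self._length = length

    def get(self, i):
        # Reads go straight to the shared buffer; nothing is materialized.
        return self._buffer[self._start + i]

    def num_elements(self):
        return self._length

    def copy(self):
        # Deep-materialize into an independent list, recursing into nested views.
        return [deep_copy_value(self.get(i)) for i in range(self._length)]


def deep_copy_value(value):
    """Copy one element so that it no longer references the shared buffer."""
    if isinstance(value, InputArrayView):
        return value.copy()
    if isinstance(value, bytearray):
        # Mutable byte strings are cloned into immutable bytes.
        return bytes(value)
    return value


# A retained view silently changes when the buffer is reused for the next batch.
batch_buffer = [1, 2, 3]
view = InputArrayView(batch_buffer, 0, 3)
snapshot = view.copy()
batch_buffer[:] = [7, 8, 9]  # simulate the next batch overwriting the buffer

# Nested case: an outer view whose single element is itself a view.
inner_buffer = [1, 2]
outer = InputArrayView([InputArrayView(inner_buffer, 0, 2)], 0, 1)
nested_snapshot = outer.copy()
inner_buffer[0] = 99
```

After the buffer is overwritten, `view` reads the new values while `snapshot` and `nested_snapshot` keep the original ones, which is the property a caller relies on when it copies a complex element before holding on to it.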

comphead

Thanks @andygrove, it would be easier if we included SQL tests with basic scenarios, nested types, column capture, etc.

Register the array and map higher-order (lambda) functions that previously fell back to Spark so they stay native via the codegen dispatcher:

- array: transform, exists, forall, aggregate, array_sort (comparator), zip_with
- map: map_filter, transform_keys, transform_values, map_zip_with

These have no native (rust) implementation and extend Spark's CodegenFallback, which the dispatcher's canHandle already admits, so the projection stays native and matches Spark exactly. When the dispatcher is disabled they fall back to Spark.

Supporting fixes so higher-order functions over nested-complex element types stay correct natively:

- Emit copy() on the codegen kernel's nested input views (InputArray, InputStruct, InputMap). Spark's interpreted lambda evaluation calls InternalRow.copyValue on complex elements; the views read straight off the per-batch Arrow buffers, so copy() deep-materializes into on-heap GenericArrayData / GenericInternalRow / ArrayBasedMapData, cloning strings and recursing into nested elements.
- Reconcile nested operand nullability for native comparisons. DataFusion's nested comparison kernel requires identical types including nested nullability, whereas Spark comparisons ignore it. When a comparison's operands are nested types that differ only in nullability (e.g. a transform result vs a nullable-element column), cast both to their nullability-union type.

Update the columnar-shuffle map-array-element test: on Spark 4.0+ the shuffle key normalizes to transform(arr, x -> mapsort(x)), which now stays native, so the shuffle runs as Comet shuffle on all versions.

Add SQL file test coverage under expressions/array and expressions/map for each higher-order function (basic, column capture, nested element types, null and empty collections, and the disabled-dispatcher fallback path).

andygrove · 2026-06-10T20:08:14Z

> Thanks @andygrove, it would be easier if we included SQL tests with basic scenarios, nested types, column capture, etc.

Thanks, I converted the tests to Comet SQL tests.

comphead · 2026-06-10T22:22:41Z

> Thanks, I converted the tests to Comet SQL tests.

Awesome, the tests make sense, and it is good to see support for aggregate, which is one of the most complicated functions.

andygrove force-pushed the feat/hof-codegen-dispatch branch from e939e51 to 191995e Compare June 10, 2026 14:30

andygrove mentioned this pull request Jun 10, 2026

Route structured-text functions (CSV/JSON/XPath/XML) through the codegen dispatcher #4619

Open

andygrove force-pushed the feat/hof-codegen-dispatch branch 2 times, most recently from c3665f7 to 42a6878 Compare June 10, 2026 16:26

andygrove changed the title ~~feat: route higher-order functions through codegen dispatcher~~ feat: route higher-order functions through codegen dispatcher [experimental / WIP] Jun 10, 2026

comphead reviewed Jun 10, 2026

andygrove force-pushed the feat/hof-codegen-dispatch branch from 42a6878 to 2642bd0 Compare June 10, 2026 20:02

andygrove force-pushed the feat/hof-codegen-dispatch branch from 2642bd0 to a89b235 Compare June 10, 2026 20:08

andygrove wants to merge 1 commit into apache:main from andygrove:feat/hof-codegen-dispatch

2 participants