Skip to content

feat: route higher-order functions through codegen dispatcher [experimental / WIP]#4618

Draft
andygrove wants to merge 1 commit into
apache:mainfrom
andygrove:feat/hof-codegen-dispatch
Draft

feat: route higher-order functions through codegen dispatcher [experimental / WIP]#4618
andygrove wants to merge 1 commit into
apache:mainfrom
andygrove:feat/hof-codegen-dispatch

Conversation

@andygrove

@andygrove andygrove commented Jun 10, 2026

Copy link
Copy Markdown
Member

Which issue does this PR close?

Closes #4617.

Rationale for this change

Spark's array and map higher-order (lambda) functions had no Comet implementation, so any query using them fell back to Spark for the enclosing operator. They are hard to implement natively in Rust because they evaluate an arbitrary user lambda per element.

The codegen dispatcher already admits CodegenFallback expressions, which includes all higher-order functions: CometBatchKernelCodegen.canHandle accepts them, and CometCodegenHOFSuite already proved they evaluate correctly inside the kernel when nested in a registered ScalaUDF. Wiring each HOF into the serde lets a top-level HOF projection stay native instead of falling back, while matching Spark exactly (the kernel runs Spark's own per-element evaluation).

What changes are included in this PR?

Register the following previously-unsupported higher-order functions as CometCodegenDispatch (no native rust path; they ride the codegen dispatcher):

  • array: transform, exists, forall, aggregate/reduce, array_sort (with comparator), zip_with
  • map: map_filter, transform_keys, transform_values, map_zip_with

When spark.comet.exec.scalaUDF.codegen.enabled=false, these fall back to Spark cleanly.

To keep higher-order functions over nested-complex element types correct natively, two supporting fixes are included:

  • Emit copy() on the codegen kernel's nested input views (InputArray, InputStruct, InputMap). Spark's interpreted lambda evaluation calls InternalRow.copyValue on complex elements; the views read straight off the per-batch Arrow buffers, so copy() now deep-materializes into on-heap GenericArrayData / GenericInternalRow / ArrayBasedMapData, cloning strings and recursing into nested elements.
  • Reconcile nested operand nullability for native comparisons. DataFusion's nested comparison kernel requires identical types including nested field nullability, whereas Spark comparisons ignore it. When a comparison's operands are nested types that differ only in nullability (for example a transform result with non-null elements compared against a nullable-element column), both operands are cast to their nullability-union type. This surfaces with char-type read-side padding, which Spark rewrites to transform(arr, x -> mapsort(x)).

The columnar-shuffle map-array-element test is updated: on Spark 4.0+ the shuffle key normalizes to transform(arr, x -> mapsort(x)), which now stays native, so the shuffle runs as a Comet shuffle on all versions.

array_filter with a general lambda is intentionally left out: it already has a partial native serde (the array_compact / IsNotNull special case) that reports Unsupported for general lambdas, so routing it through the dispatcher is a separate, more involved change.

How are these changes tested?

Adds SQL file tests under spark/src/test/resources/sql-tests/expressions/array and .../map, one per higher-order function, covering basic usage, column capture (the lambda referencing another column), nested element types (array<array>, array<struct>, array<map>), null and empty collections, the nested-comparison reconciliation path, and the disabled-dispatcher fallback path. A CometCodegenSourceSuite assertion locks in the copy() emission on the nested input views.

@andygrove andygrove force-pushed the feat/hof-codegen-dispatch branch from e939e51 to 191995e Compare June 10, 2026 14:30
@andygrove andygrove force-pushed the feat/hof-codegen-dispatch branch 2 times, most recently from c3665f7 to 42a6878 Compare June 10, 2026 16:26
@andygrove andygrove changed the title feat: route higher-order functions through codegen dispatcher feat: route higher-order functions through codegen dispatcher [experimental / WIP] Jun 10, 2026

@comphead comphead left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @andygrove it would be easier if we include sql tests with basic scenarios, nested, column capture, etc

@andygrove andygrove force-pushed the feat/hof-codegen-dispatch branch from 42a6878 to 2642bd0 Compare June 10, 2026 20:02
Register the array and map higher-order (lambda) functions that previously fell
back to Spark so they stay native via the codegen dispatcher:

- array: transform, exists, forall, aggregate, array_sort (comparator), zip_with
- map: map_filter, transform_keys, transform_values, map_zip_with

These have no native (rust) implementation and extend Spark's CodegenFallback,
which the dispatcher's canHandle already admits, so the projection stays native
and matches Spark exactly. When the dispatcher is disabled they fall back to
Spark.

Supporting fixes so higher-order functions over nested-complex element types
stay correct natively:

- Emit copy() on the codegen kernel's nested input views (InputArray, InputStruct,
  InputMap). Spark's interpreted lambda evaluation calls InternalRow.copyValue on
  complex elements; the views read straight off the per-batch Arrow buffers, so
  copy() deep-materializes into on-heap GenericArrayData / GenericInternalRow /
  ArrayBasedMapData, cloning strings and recursing into nested elements.
- Reconcile nested operand nullability for native comparisons. DataFusion's nested
  comparison kernel requires identical types including nested nullability, whereas
  Spark comparisons ignore it. When a comparison's operands are nested types that
  differ only in nullability (e.g. a transform result vs a nullable-element column),
  cast both to their nullability-union type.

Update the columnar-shuffle map-array-element test: on Spark 4.0+ the shuffle key
normalizes to transform(arr, x -> mapsort(x)), which now stays native, so the
shuffle runs as Comet shuffle on all versions.

Add SQL file test coverage under expressions/array and expressions/map for each
higher-order function (basic, column capture, nested element types, null and empty
collections, and the disabled-dispatcher fallback path).
@andygrove

Copy link
Copy Markdown
Member Author

Thanks @andygrove it would be easier if we include sql tests with basic scenarios, nested, column capture, etc

Thanks, I converted the tests to Comet SQL tests.

@andygrove andygrove force-pushed the feat/hof-codegen-dispatch branch from 2642bd0 to a89b235 Compare June 10, 2026 20:08
@comphead

Copy link
Copy Markdown
Contributor

Thanks, I converted the tests to Comet SQL tests.

Awesome, tests makes sense and good to see support for the aggregate which is one of the most complicated functions,

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Route array/map higher-order (lambda) functions through the codegen dispatcher

2 participants