Skip to content

feat: fix array_compact for Spark 4.0 and correct return type metadata#3796

Open
andygrove wants to merge 3 commits intoapache:mainfrom
andygrove:fix-array-compact
Open

feat: fix array_compact for Spark 4.0 and correct return type metadata#3796
andygrove wants to merge 3 commits intoapache:mainfrom
andygrove:fix-array-compact

Conversation

@andygrove
Copy link
Copy Markdown
Member

Summary

Fixes several issues with the array_compact expression found during an audit:

  • Spark 4.0 support: On Spark 4.0, array_compact rewrites to KnownNotContainsNull(ArrayFilter(IsNotNull)). Because KnownNotContainsNull was not handled by Comet, the expression always fell back to Spark on 4.0. This PR adds a case in the Spark 4.0 shim that unwraps KnownNotContainsNull, serializes the inner ArrayFilter using DataFusion's array_remove_all, and propagates the containsNull=false return type correctly.

  • Return type bug: CometArrayCompact.convert was hardcoding ArrayType(elementType = elementType) as the return type passed to DataFusion, which always set containsNull=true. Changed to use expr.dataType so the correct nullability metadata is emitted — matching Spark's behavior where containsNull=false after compaction on Spark 4.0.

  • Incorrect Incompatible classification: CometArrayCompact.getSupportLevel was returning Incompatible(None) with no explanation. The implementation is semantically correct (DataFusion's array_remove_all(arr, null) behaves the same as Spark's ArrayFilter(IsNotNull)), so this is changed to Compatible().

  • SQL test improvements: Added string, double, and array<array<int>> (nested) type coverage; changed spark_answer_only to query to verify native execution.

  • Remove Scala test skip: The assume(\!isSpark40Plus) guard with a // TODO fix for Spark 4.0.0 comment is removed now that Spark 4.0 is supported.

  • Docs: Updated expressions.md to mark ArrayCompact as supported.

- Handle KnownNotContainsNull wrapper in Spark 4.0 shim so that
  array_compact runs natively on Spark 4.0 (previously always fell back)
- Fix return type passed to DataFusion: use expr.dataType instead of
  hardcoded ArrayType(elementType) so that containsNull=false is
  correctly propagated on Spark 4.0
- Mark CometArrayCompact as Compatible() instead of Incompatible(None)
- Expand SQL test coverage: add string, double, and nested array types;
  change spark_answer_only to query to verify native execution
- Remove assume(\!isSpark40Plus) skip from Scala test
- Update expressions.md: ArrayCompact is now supported
…DataFusion

DataFusion's array_remove_all function always returns a list with nullable
elements (nullable=true), but array_compact's Spark dataType has
containsNull=false. Passing the Spark type as the promised return type caused
a runtime type-mismatch assertion in DataFusion's ScalarFunctionExpr. Fix by
passing ArrayType(elementType, containsNull=true) in both the Spark 3.x
CometArrayCompact serde and the Spark 4.0 KnownNotContainsNull shim.
@andygrove andygrove marked this pull request as ready for review March 26, 2026 03:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant