feat(velox): Support RESPECT NULLS for collect_list/collect_set#11837
feat(velox): Support RESPECT NULLS for collect_list/collect_set#11837yaooqinn wants to merge 2 commits intoapache:mainfrom
Conversation
Add ignoreNulls parameter to VeloxCollectList/VeloxCollectSet to support Spark's RESPECT NULLS syntax (SPARK-55256). When ignoreNulls=false, null elements are included in the collected array. - VeloxCollect: conditionally skip nulls based on ignoreNulls parameter - CollectRewriteRule: propagate ignoreNulls from Spark's CollectList/CollectSet via reflection (backward-compatible with Spark versions without ignoreNulls) - ArrayType containsNull reflects the ignoreNulls setting Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
@yaooqinn thanks, could you please also verify on a new veox branch with patches for collect_list/set - in the current branch the collect_set commit is actually reverted https://github.com/IBM/velox/commits/dft-2026_03_24/ |
|
Attempted local verification with Velox patches (collect_set from upstream main + collect_list from PR #16933). The Velox native lib (libvelox.a) builds successfully with both patches. However, full Gluten build cannot complete locally due to pre-existing Scala version mismatch in cached Maven artifacts (unrelated to this PR). The VeloxCollect changes are purely Scala/JVM-side (DeclarativeAggregate), so they're independent of the Velox native collect_set/collect_list functions. CI will provide the definitive verification. @zhouyuan The native Velox patches for collect_set/collect_list add |
|
@yaooqinn not really I just made a test with your patch + velox changes, the spark unit tests are failing https://github.com/apache/gluten/actions/runs/23652189825/job/68903134168?pr=11712 |
Root Cause AnalysisThanks @zhouyuan for catching this! I traced the issue: What changedBefore PR #16416: After PR #16416: The mismatchGluten's
Fix options
I'll fix this in the Velox PR #16416 (or a follow-up) since the root cause is the signature change creating a suffix-based companion registration. |
|
@yaooqinn I assume we should adapt to the signature change in Gluten (see |
…t_set/list When aggregate functions have multiple signatures with the same intermediate type (e.g., collect_set with 1-arg and 2-arg signatures), Velox registers companion functions with suffix using generic type variables (e.g., collect_set_merge_extract_array_T). The Substrait layer was constructing concrete type suffixes (e.g., array_row_VARCHAR_BIGINT_BIGINT_endrow) that don't match. Fix: After failing exact concrete suffix lookup, fall back to discovering companion function names via getCompanionFunctionSignatures() API. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
Pushed fix: Updated Change: After failing the concrete type suffix lookup (e.g., The file compiles cleanly against the existing Velox EP. @zhouyuan could you verify with your test setup? |
What changes were proposed in this pull request?
Add
ignoreNullsparameter toVeloxCollectList/VeloxCollectSetto support Spark's RESPECT NULLS syntax (SPARK-55256). WhenignoreNulls=false, null elements are included in the collected array.Changes:
VeloxCollectSet/VeloxCollectList: AcceptignoreNullsparameter (defaulttruefor backward compatibility).CollectRewriteRule: PropagatesignoreNullsfrom Spark'sCollectList/CollectSetusing reflection, making it backward-compatible with Spark versions that don't haveignoreNulls.Related:
How was this patch tested?
CI validation. The default behavior (
ignoreNulls=true) is unchanged, ensuring backward compatibility.