[SPARK-55411][SQL] SPJ may throw ArrayIndexOutOfBoundsException when join keys are less than cluster keys #54182
pan3793 wants to merge 4 commits into apache:master
Conversation
JIRA Issue Information: Bug SPARK-55411 (this comment was automatically generated by GitHub Actions)
sql/core/src/test/scala/org/apache/spark/sql/connector/KeyGroupedPartitioningSuite.scala
Thanks for the repro, I'll try to take a look.
```diff
  partitioning.numPartitions,
- partitioning.partitionValues)
+ partitioning.partitionValues,
+ partitioning.originalPartitionValues)
```
I found that `originalPartitionValues` is not always populated. Is that intentional?
```diff
    projectedExpressions.map(_.dataType))
  basePartitioning.partitionValues.map { r =>
-   val projectedRow = KeyGroupedPartitioning.project(expressions,
+   val projectedRow = KeyGroupedPartitioning.project(basePartitioning.expressions,
```
Actually, the wrong projected expression is the root cause of the `ArrayIndexOutOfBoundsException` you hit, and passing in `basePartitioning.expressions` looks good.
But the test you added is unlikely to pass, as there is an issue with the test framework. I left a note here about avoiding `bucket()` in these one-side shuffle tests. The problem is that the `bucket()` implementation in `InMemoryBaseTable` and the one in `BucketFunction` differ, so technically the partition keys that the datasource reports and the calculated keys of the partitions where the partitioner puts the shuffled records don't match.
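A minimal, hypothetical sketch of that failure mode: `reportedBucket` and `shuffleBucket` below are stand-ins (not the actual `InMemoryBaseTable` or `BucketFunction` code) that happen to hash different representations of the same key, so the bucket the datasource reports and the bucket the partitioner computes disagree.

```scala
// Hypothetical stand-ins illustrating the mismatch; the real implementations
// live in InMemoryBaseTable.scala and transformFunctions.scala.
object BucketMismatchSketch {
  // Stand-in for the datasource side: hashes the long value directly.
  def reportedBucket(numBuckets: Int, key: Long): Int =
    Math.floorMod(key.hashCode, numBuckets)

  // Stand-in for the shuffle side: hashes a different representation.
  def shuffleBucket(numBuckets: Int, key: Long): Int =
    Math.floorMod(key.toString.hashCode, numBuckets)

  def main(args: Array[String]): Unit = {
    val key = 42L
    // 42 hashes to bucket 0, "42" hashes to bucket 3: the row is reported
    // under one partition key but shuffled into another.
    println(reportedBucket(7, key)) // 0
    println(shuffleBucket(7, key))  // 3
  }
}
```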
@pan3793, could you please keep your fix in KeyGroupedPartitionedScan.scala and fix the BucketTransform key calculation in InMemoryBaseTable?
You don't need the other changes. `originalPartitionValues` seems unrelated, as it is used only when partially clustered distribution is enabled.
BTW, I'm working on refactoring SPJ based on this idea: #53859 (comment), and it looks promising so far, but I need some more days to wrap it up.
```scala
// Do not use bucket() in "one side partition" tests as its implementation in
// InMemoryBaseTable conflicts with BucketFunction
```
Oh god, @peter-toth, thanks a lot for pointing this out. I wasn't aware of it and had spent a few hours trying to figure out why the SMJ partition key values mismatched and produced wrong results after fixing the `ArrayIndexOutOfBoundsException`...
Actually, the current code changes are just a draft; the test cases have not passed yet. I will try to fix them following your guidance. Thank you again, @peter-toth!
pan3793 force-pushed "…join keys are less than cluster keys" from bbf8c3b to cb78da6
```scala
  case (v, t) =>
    throw new IllegalArgumentException(s"Match: unsupported argument(s) type - ($v, $t)")
}
(acc + valueHash) & 0xFFFFFFFFFFFFL
```
```scala
scala> Long.MaxValue + 1L
res0: Long = -9223372036854775808

scala> (Long.MaxValue + 1L) & 0xFFFFFFFFFFFFL
res1: Long = 0

scala> (Long.MaxValue + 2L) & 0xFFFFFFFFFFFFL
res2: Long = 1
```
Ah, this is needed because `% N` can return negative results, isn't it? That seems like a problem in both places, as `bucket(N)` should return at most N different values.
Should we use Math.floorMod()?
The bucket num should be >= 1 (it seems we don't have such a check, though), so (non_negative_long % positive_int) should always be non-negative?
Yeah, that's correct, but if we used `Math.floorMod()` then we wouldn't need that `& 0xFFFFFFFFFFFFL` non-negative conversion.
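A quick sketch of that suggestion, under the assumption that the mask exists only to keep the accumulated hash non-negative before the final modulo (the names `acc`, `valueHash`, and `numBuckets` follow the snippet above):

```scala
// Minimal sketch: Math.floorMod already returns a value in [0, numBuckets),
// so the & 0xFFFFFFFFFFFFL non-negative conversion becomes unnecessary.
object FloorModSketch {
  def main(args: Array[String]): Unit = {
    val acc = Long.MaxValue // overflows to a negative long once valueHash is added
    val valueHash = 2L
    val numBuckets = 7

    // Current approach: mask first, then %.
    val masked = (((acc + valueHash) & 0xFFFFFFFFFFFFL) % numBuckets).toInt

    // Suggested approach: floorMod handles the negative overflow directly.
    val floored = Math.floorMod(acc + valueHash, numBuckets.toLong).toInt

    println(masked)  // non-negative, but the mask discards the high bits (see REPL above)
    println(floored) // non-negative without any mask
  }
}
```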
Looks good to me, let's wait for CI.
What changes were proposed in this pull request?
Fix a `java.lang.ArrayIndexOutOfBoundsException` thrown when `spark.sql.sources.v2.bucketing.allowJoinKeysSubsetOfPartitionKeys.enabled=true`, by correcting the expressions passed to `KeyGroupedPartitioning#project` (the full partition expressions should be passed instead of the projected ones).
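A minimal sketch of why the projected list fails; the names here (`project`, `clusterKeys`, `joinKeyPositions`) are illustrative, not Spark's actual API. The projection positions index into the full cluster-key list, so indexing the shorter, already-projected list with them overflows.

```scala
// Hypothetical sketch: positions computed against the FULL expression list
// must not be used to index the projected (shorter) list.
object ProjectionSketch {
  def project[T](expressions: Array[T], positions: Seq[Int]): Seq[T] =
    positions.map(expressions(_))

  def main(args: Array[String]): Unit = {
    val clusterKeys = Array("id", "dept", "ts")
    val joinKeyPositions = Seq(2) // join only on "ts", the third cluster key

    val projected = project(clusterKeys, joinKeyPositions) // OK: Seq("ts")
    println(projected)

    // Bug pattern: projecting again with the already-projected list throws
    // ArrayIndexOutOfBoundsException (index 2 into an array of length 1).
    // project(projected.toArray, joinKeyPositions)
  }
}
```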
Also, fix a test code issue: change the calculation result of `BucketTransform` defined in `InMemoryBaseTable.scala` to match `BucketFunction` defined in `transformFunctions.scala` (thanks @peter-toth for pointing this out!).
Why are the changes needed?
It's a bug fix.
Does this PR introduce any user-facing change?
Some queries that failed when `spark.sql.sources.v2.bucketing.allowJoinKeysSubsetOfPartitionKeys.enabled=true` now run normally.
A new UT is added; it previously failed with `ArrayIndexOutOfBoundsException` and now passes. UTs affected by the `bucket()` calculation logic change are adjusted.
Was this patch authored or co-authored using generative AI tooling?
No.