
[SPARK-55411][SQL] SPJ may throw ArrayIndexOutOfBoundsException when join keys are less than cluster keys #54182

Open

pan3793 wants to merge 4 commits into apache:master from pan3793:spj-subset-joinkey-bug

Conversation

@pan3793 (Member) commented Feb 6, 2026

What changes were proposed in this pull request?

Fix a java.lang.ArrayIndexOutOfBoundsException that occurs when spark.sql.sources.v2.bucketing.allowJoinKeysSubsetOfPartitionKeys.enabled=true, by correcting the expressions passed to KeyGroupedPartitioning#project: the full partition expressions must be passed instead of the already-projected ones.
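
For illustration, a minimal standalone Scala sketch of the failure mode (hypothetical names, not Spark's actual code): project resolves keys by their positions in the expression list it receives, so those positions must index the full partition expressions rather than the already-projected subset.

object ProjectionSketch {
  def main(args: Array[String]): Unit = {
    val fullPartitionExprs = Seq("id", "dept") // two cluster keys
    val joinKeyPositions = Seq(1)              // join key is only "dept", position 1

    // Correct: positions index into the full expression list.
    val projected = joinKeyPositions.map(fullPartitionExprs(_))
    println(projected) // List(dept)

    // Buggy: positions index into the already-projected one-element list,
    // so position 1 is out of bounds -- the same class of error as the
    // ArrayIndexOutOfBoundsException reported here.
    try joinKeyPositions.map(projected(_))
    catch { case e: IndexOutOfBoundsException => println(s"boom: $e") }
  }
}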

Also, fix a test code issue: change the bucket calculation of the BucketTransform defined in InMemoryBaseTable.scala to match the BucketFunction defined in transformFunctions.scala (thanks @peter-toth for pointing this out!).

Why are the changes needed?

It's a bug fix.

Does this PR introduce any user-facing change?

Some queries that previously failed when spark.sql.sources.v2.bucketing.allowJoinKeysSubsetOfPartitionKeys.enabled=true now run normally.

How was this patch tested?

A new UT is added; it previously failed with ArrayIndexOutOfBoundsException and now passes.

$ build/sbt "sql/testOnly *KeyGroupedPartitioningSuite -- -z SPARK-55411"
...
[info] - bug *** FAILED *** (1 second, 884 milliseconds)
[info]   java.lang.ArrayIndexOutOfBoundsException: Index 1 out of bounds for length 1
[info]   at scala.collection.immutable.ArraySeq$ofRef.apply(ArraySeq.scala:331)
[info]   at org.apache.spark.sql.catalyst.plans.physical.KeyGroupedPartitioning$.$anonfun$project$1(partitioning.scala:471)
[info]   at org.apache.spark.sql.catalyst.plans.physical.KeyGroupedPartitioning$.$anonfun$project$1$adapted(partitioning.scala:471)
[info]   at scala.collection.immutable.ArraySeq.map(ArraySeq.scala:75)
[info]   at scala.collection.immutable.ArraySeq.map(ArraySeq.scala:35)
[info]   at org.apache.spark.sql.catalyst.plans.physical.KeyGroupedPartitioning$.project(partitioning.scala:471)
[info]   at org.apache.spark.sql.execution.KeyGroupedPartitionedScan.$anonfun$getOutputKeyGroupedPartitioning$5(KeyGroupedPartitionedScan.scala:58)
...

UTs affected by the bucket() calculation logic change are adjusted accordingly.

Was this patch authored or co-authored using generative AI tooling?

No.

github-actions bot commented Feb 6, 2026

JIRA Issue Information

=== Bug SPARK-55411 ===
Summary: SPJ may throw ArrayIndexOutOfBoundsException when join keys are less than cluster keys
Assignee: None
Status: Open
Affected: ["4.0.2","4.1.1"]


This comment was automatically generated by GitHub Actions

@github-actions github-actions bot added the SQL label Feb 6, 2026
@szehon-ho (Member)

Thanks for the repro, I'll try to take a look.

    partitioning.numPartitions,
-   partitioning.partitionValues)
+   partitioning.partitionValues,
+   partitioning.originalPartitionValues)
@pan3793 (Member Author)

I found that originalPartitionValues is not always populated. Is that intentional?

@pan3793 pan3793 changed the title from "[SPARK-XXXXX][SQL] Internel error when SPJ ALLOW_JOIN_KEYS_SUBSET_OF_PARTITION_KEYS enabled" to "[SPARK-55411][SQL] SPJ may throw ArrayIndexOutOfBoundsException when join keys are less than cluster keys" Feb 7, 2026
      projectedExpressions.map(_.dataType))
    basePartitioning.partitionValues.map { r =>
-     val projectedRow = KeyGroupedPartitioning.project(expressions,
+     val projectedRow = KeyGroupedPartitioning.project(basePartitioning.expressions,
@peter-toth (Contributor) commented Feb 7, 2026

Actually, the wrong projected expression is the root cause of the ArrayIndexOutOfBoundsException you hit, and passing in basePartitioning.expressions looks good.

But the test you added is unlikely to pass, as there is an issue with the test framework.
I left a note here:

// Do not use `bucket()` in "one side partition" tests as its implementation in
// `InMemoryBaseTable` conflicts with `BucketFunction`

but forgot to open a fix for the problem with using bucket() in these one side shuffle tests.

The problem is that this bucket() implementation:

override def produceResult(input: InternalRow): Int = {
  (input.getLong(1) % input.getInt(0)).toInt
}

and the one in InMemoryBaseTable:

val valueTypePairs = cols.map(col => extractor(col.fieldNames, cleanedSchema, row))
var valueHashCode = 0
valueTypePairs.foreach(pair =>
  if (pair._1 != null) valueHashCode += pair._1.hashCode()
)
var dataTypeHashCode = 0
valueTypePairs.foreach(dataTypeHashCode += _._2.hashCode())
((valueHashCode + 31 * dataTypeHashCode) & Integer.MAX_VALUE) % numBuckets

don't match.
So the partition keys that the data source reports and the keys that the shuffle partitioner computes when placing the shuffled records don't match.
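
To make the mismatch concrete, here is a hypothetical sketch (simplified analogues, not the actual Spark implementations): if the function the data source uses to report partition values disagrees with the function the shuffle partitioner applies, the same key lands in different buckets on the two sides and the join matches the wrong groups.

object BucketMismatchSketch {
  // Simplified analogue of a modulo-based V2 BucketFunction.
  def partitionerBucket(value: Long, numBuckets: Int): Int =
    (value % numBuckets).toInt

  // Simplified analogue of a hash-based BucketTransform.
  def reportedBucket(value: Long, numBuckets: Int): Int =
    ((value.hashCode * 31) & Integer.MAX_VALUE) % numBuckets

  def main(args: Array[String]): Unit = {
    // The two sides disagree on where key 7 lives.
    println(partitionerBucket(7L, 4)) // 3
    println(reportedBucket(7L, 4))    // 1
  }
}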

@pan3793, could you please keep your fix in KeyGroupedPartitionedScan.scala and fix the BucketTransform key calculation in InMemoryBaseTable?
You don't need the other changes. originalPartitionValues seems unrelated, as it is used only when partially clustered distribution is enabled.

BTW, I'm working on refactoring SPJ based on this idea: #53859 (comment) and it looks promising so far, but I need a few more days to wrap it up.

@pan3793 (Member Author)

// Do not use bucket() in "one side partition" tests as its implementation in
// InMemoryBaseTable conflicts with BucketFunction

Oh god, @peter-toth, thanks a lot for pointing this out. I wasn't aware of it and spent a few hours trying to figure out why the SMJ partition key values mismatched and produced wrong results after fixing the ArrayIndexOutOfBoundsException ...

Actually, the current code changes are just a draft; the test cases have not yet passed. I will try to fix it following your guidance. Thank you again, @peter-toth!

@pan3793 pan3793 force-pushed the spj-subset-joinkey-bug branch from bbf8c3b to cb78da6 on February 8, 2026 05:17
@pan3793 pan3793 marked this pull request as ready for review February 8, 2026 08:27
    case (v, t) =>
      throw new IllegalArgumentException(s"Match: unsupported argument(s) type - ($v, $t)")
  }
  (acc + valueHash) & 0xFFFFFFFFFFFFL
@pan3793 (Member Author)

scala> Long.MaxValue + 1L
res0: Long = -9223372036854775808

scala> (Long.MaxValue + 1L) & 0xFFFFFFFFFFFFL
res1: Long = 0

scala> (Long.MaxValue + 2L) & 0xFFFFFFFFFFFFL
res2: Long = 1

@peter-toth (Contributor) commented Feb 8, 2026

Ah, this is needed because % N can return negative results, isn't it? That seems like a problem in both places, as bucket(N) should return at most N different values.

Contributor

Should we use Math.floorMod()?

@pan3793 (Member Author)

The bucket num should be >= 1 (though we don't seem to have such a check), so (non_negative_long % positive_int) should always be non-negative?

@peter-toth (Contributor) commented Feb 8, 2026

Yeah, that's correct, but

override def produceResult(input: InternalRow): Int = {
  (input.getLong(1) % input.getInt(0)).toInt
}

also seems wrong, as it can return values between -N+1 and N-1, so we should probably fix both places. If we used Math.floorMod() then we wouldn't need that & 0xFFFFFFFFFFFFL non-negative conversion.
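
As a quick sanity check (standard JDK/Scala behavior, not code from this PR): % keeps the sign of the dividend, while Math.floorMod returns a non-negative result for a positive divisor.

scala> 7L % 4
res0: Long = 3

scala> -7L % 4
res1: Long = -3

scala> Math.floorMod(-7L, 4L)
res2: Long = 1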

@peter-toth (Contributor)

Looks good to me, let's wait for CI.
