[SPARK-56190][SQL] Support nested partition columns for DSV2 PartitionPredicate #54995
szehon-ho wants to merge 4 commits into apache:master
Conversation
Force-pushed from 302f3b8 to e23797a.
Force-pushed from e23797a to 50d1e68.
@cloud-fan @peter-toth could you help take a look? Thanks
```scala
      expr: Expression,
      partitionFields: Seq[PartitionPredicateField],
      resolver: (String, String) => Boolean): Expression = {
    val partitionAttrs = toAttributes(StructType(partitionFields.map(_.structField)))
```
We call normalizePartitionRefs in a loop, so it would make sense to precompute partitionAttrs and pass it into normalizePartitionRefs().
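The shape of that refactor, sketched with stand-in types (not the PR's real Expression/PartitionPredicateField classes; only the hoisting pattern is the point):

```scala
// Sketch: compute the attribute list once and pass it into each normalization
// call, instead of rebuilding it per filter inside the loop.
case class Attr(name: String)
case class PartField(name: String)

def toAttrs(fields: Seq[PartField]): Seq[Attr] = fields.map(f => Attr(f.name))

// normalize receives the precomputed attrs rather than the raw fields.
def normalize(filter: String, attrs: Seq[Attr]): String =
  s"$filter[${attrs.map(_.name).mkString(",")}]"

val partitionFields = Seq(PartField("s.tz"), PartField("day"))
val partitionAttrs = toAttrs(partitionFields) // hoisted out of the loop
val normalized = Seq("f1", "f2").map(normalize(_, partitionAttrs))
```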
```diff
@@ -178,18 +190,100 @@ object PushDownUtils {
    */
   private def toSupportedPartitionField(
       transform: Transform,
-      relation: DataSourceV2Relation): Option[StructField] = {
+      relationOutput: Seq[AttributeReference]): Option[StructField] = {
     transform match {
-      case t: IdentityTransform if t.ref.fieldNames.length == 1 =>
-        val colName = t.ref.fieldNames.head
-        relation.output
-          .find(_.name == colName)
-          .map(attr => StructField(colName, attr.dataType, attr.nullable))
+      case t: IdentityTransform =>
+        val names = t.ref.fieldNames.toIndexedSeq
+        resolveIdentityPartitionField(names, relationOutput)
       case _ =>
         None
     }
   }
 
+  /**
+   * Resolves an identity partition column path to a StructField.
+   */
+  private def resolveIdentityPartitionField(
+      names: Seq[String],
+      relationOutput: Seq[AttributeReference]): Option[StructField] = {
+    if (names.isEmpty) {
+      None
+    } else {
+      val resolver = SQLConf.get.resolver
+      val rootStruct =
+        StructType(relationOutput.map(a => StructField(a.name, a.dataType, a.nullable)))
+      rootStruct.findNestedField(names, resolver = resolver).map {
+        case (_, leaf) =>
+          StructField(names.mkString("."), leaf.dataType, leaf.nullable)
+      }
+    }
+  }
```
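For illustration, resolveIdentityPartitionField walks a dotted path like `Seq("s", "tz")` into nested structs and flattens the leaf's name. A minimal self-contained sketch of that behavior (stand-in types, not Spark's StructType; the real findNestedField also takes a resolver and an includeCollections flag):

```scala
// Stand-ins for StructType/StructField, just to show how a nested path
// resolves to a leaf field whose name is the flattened "s.tz".
sealed trait DType
case object StringType extends DType
final case class Struct(fields: Seq[SField]) extends DType
final case class SField(name: String, dataType: DType, nullable: Boolean = true)

def findNested(struct: Struct, names: Seq[String]): Option[SField] =
  names match {
    case Seq() => None
    case Seq(last) => struct.fields.find(_.name == last)
    case head +: rest =>
      struct.fields.find(_.name == head).flatMap {
        case SField(_, s: Struct, _) => findNested(s, rest)
        case _ => None // path continues past a non-struct leaf
      }
  }

val root = Struct(Seq(SField("s", Struct(Seq(SField("tz", StringType))))))

// Resolve the path and flatten the name, mirroring resolveIdentityPartitionField.
val resolved = findNested(root, Seq("s", "tz"))
  .map(f => SField(Seq("s", "tz").mkString("."), f.dataType, f.nullable))
```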
Suggested change:

```scala
  def getPartitionSchemaInfo(
      relation: DataSourceV2Relation): Option[Seq[PartitionPredicateField]] = {
    val transforms = relation.table.partitioning
    if (transforms.isEmpty) return None
    val resolver = SQLConf.get.resolver
    val rootStruct =
      StructType(relation.output.map(a => StructField(a.name, a.dataType, a.nullable)))
    val fields = transforms.flatMap {
      case t: IdentityTransform =>
        toSupportedPartitionField(t, rootStruct, resolver).map(PartitionPredicateField(_, t.ref))
      case _ => None
    }
    if (fields.length == transforms.length) Some(fields.toIndexedSeq) else None
  }

  /**
   * Returns a StructField for the given identity partition transform if it is
   * supported for iterative partition predicate push down.
   */
  private def toSupportedPartitionField(
      transform: IdentityTransform,
      rootStruct: StructType,
      resolver: (String, String) => Boolean): Option[StructField] = {
    val names = transform.ref.fieldNames().toIndexedSeq
    if (names.isEmpty) None
    else rootStruct.findNestedField(names, resolver = resolver).map {
      case (_, leaf) => StructField(names.mkString("."), leaf.dataType, leaf.nullable)
    }
  }
```
Thanks, done on the updated branch.
thanks @peter-toth, addressed. FYI I found an issue in the first approach: the normalization of the nested partition field filters, i.e. GetStructField(AttributeRef("parent"), "child") => AttributeRef("parent.child"), messed up the post-scan filters when a filter was rejected, so I'm restoring them to their original form before returning.
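The restore step described above can be sketched as a simple round-trip map (hypothetical toy types; the PR works on Catalyst expressions):

```scala
// Sketch: remember each filter's original nested form keyed by its normalized
// (flattened) form, so rejected filters can be restored before being returned
// as post-scan filters.
case class Expr(repr: String)

// Toy "normalization": rewrite the nested access into a flattened attribute name.
def normalize(e: Expr): Expr =
  Expr(e.repr.replace("GetStructField(parent, child)", "parent.child"))

val original = Seq(Expr("GetStructField(parent, child) = 'UTC'"), Expr("x > 1"))
val normalizedToOriginal: Map[Expr, Expr] =
  original.map(e => normalize(e) -> e).toMap

// Suppose the flattened filter was rejected by the source: map it back.
val rejected = Seq(Expr("parent.child = 'UTC'"))
val restored = rejected.map(normalizedToOriginal)
```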
```scala
        case (_, leaf) => StructField(names.mkString("."), leaf.dataType, leaf.nullable)
      }
    } catch {
      case _: AnalysisException =>
```
This should not happen, but the underlying code does throw an exception if something is not resolvable; in our case we should just skip pushdown, since it should not be fatal.
```diff
-      nonPartitionFilters ++ nonPushable ++ rejectedPartitionFilters
+        p => p.asInstanceOf[PartitionPredicateImpl].expression
+      }.toSeq
+      (nonPartitionFilters ++ nonPushable ++ rejectedPartitionFilters).map(normalizedToOriginal)
```
This is where I return the original (non-flattened) partition filters.
cloud-fan left a comment:

Review Summary

This PR extends DSV2 partition predicate pushdown (SPARK-55596) to support nested partition columns (e.g., PARTITIONED BY (s.tz)). The approach normalizes GetStructField chains to flat AttributeReferences before the existing partition filter machinery runs, then denormalizes the results back for post-scan filters. The design is clean: normalization is localized to pushPartitionPredicates without changing shared utilities.

One correctness concern with the denormalization map lookup, and two minor text fixes from the rename.
```diff
-      nonPartitionFilters ++ nonPushable ++ rejectedPartitionFilters
+        p => p.asInstanceOf[PartitionPredicateImpl].expression
+      }.toSeq
+      (nonPartitionFilters ++ nonPushable ++ rejectedPartitionFilters).map(normalizedToOriginal)
```
`.map(normalizedToOriginal)` can throw NoSuchElementException when getPartitionFiltersAndDataFilters extracts partition sub-expressions from conjunction data filters via extractPredicatesWithinOutputSet. Those extracted sub-expressions are not keys in the normalizedToOriginal map (which only maps top-level expressions).

In practice Spark typically pre-splits conjunctions, so the risk is low, but the code path is reachable for untranslatable conjunction expressions. Consider using `.map(e => normalizedToOriginal.getOrElse(e, e))` to fall back to the expression itself, or filtering out sub-expressions that are not in the map.
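A minimal illustration of the failure mode and the suggested fallback, using a plain Map of strings rather than the PR's actual expression types:

```scala
// The map only contains top-level normalized filters as keys.
val normalizedToOriginal: Map[String, String] =
  Map("a.b = 1" -> "GetStructField(a, b) = 1")

// A sub-expression extracted from a conjunction is not a key in the map.
val subExpr = "a.b > 0"

// normalizedToOriginal(subExpr) would throw NoSuchElementException here.
// getOrElse keeps the expression as-is when there is no mapping:
val safe = normalizedToOriginal.getOrElse(subExpr, subExpr)
```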
```diff
       refs.foreach { ref =>
-        assert(ref.isInstanceOf[PartitionColumnReference],
+        assert(ref.isInstanceOf[PartitionFieldReference],
           s"Expected PartitionColumnReference, got ${ref.getClass.getName}")
```
Assertion message still says PartitionColumnReference after the rename. Same issue at lines 469, 471, and 476.

Suggested change:

```diff
-          s"Expected PartitionColumnReference, got ${ref.getClass.getName}")
+          s"Expected PartitionFieldReference, got ${ref.getClass.getName}")
```
```diff
    * the partition columns (from {@link Table#partitioning()}) referenced by this predicate.
-   * Each reference's {@link PartitionColumnReference#fieldNames()} gives the partition column
-   * name; {@link PartitionColumnReference#ordinal()} gives the 0-based position in
+   * Each reference's {@link PartitionFieldReference#fieldNames()} gives the partition column
```
"partition column" should be "partition field" for consistency with the PartitionColumnReference → PartitionFieldReference rename.

Suggested change:

```diff
-   * Each reference's {@link PartitionFieldReference#fieldNames()} gives the partition column
+   * Each reference's {@link PartitionFieldReference#fieldNames()} gives the partition field
```
```diff
       }
-      val partitionNames = partitionSchema.map(_.name).toSet
+      val partitionNames = partitionFields.map(_.structField.name).toSet
       val refNames = catalystExpr.references.map(_.name).toSet
```
It's an anti-pattern to compare qualified names as single strings. I think one side is from the nested cols in partition predicates, and the other side is reported by the v2 table. How does the v2 table report partition cols?
Yes, the comparison is between

- the Catalyst partition filter
- the v2 table partition cols

Both are flattened (i.e. turned into "a.b.c"):

- using normalizePartitionFilters(), which returns an AttributeReference with the flattened name
- using resolveIdentityPartitionField(), which returns a StructField with the flattened name

The v2 table reports it via Transform, which has transform.ref.fieldNames() returning Seq[String]. But I do need to flatten it for the comparison; do you have any other thoughts?

Another reason for flattening is that later I need to pass the partition schema as a StructType to DataSourceUtils.getPartitionFiltersAndDataFilters. That has some valuable logic (e.g. extracting more partition filters) that I did not want to re-implement.
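Flattening a transform reference for that comparison is just a `mkString` over its field-name path (sketch; `fieldNames` here stands in for `transform.ref.fieldNames()`, and note cloud-fan's caveat above that a literal dot inside a column name would collide with this encoding):

```scala
// transform.ref.fieldNames() yields the path segments of the partition
// reference, e.g. ["s", "tz"] for PARTITIONED BY (s.tz).
val fieldNames: Seq[String] = Seq("s", "tz")

// Flattened form, comparable to the flattened AttributeReference names
// produced on the partition-filter side.
val flattened = fieldNames.mkString(".")
```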
What changes were proposed in this pull request?
Support nested struct fields as partition columns in DSV2 partition predicate pushdown (e.g., PARTITIONED BY (s.tz)).
Why are the changes needed?
DSV2 connectors support nested struct fields as partition fields.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Add unit tests to DSV2EnhancedPartitionFilterSuite.
Was this patch authored or co-authored using generative AI tooling?
Yes, Cursor with hand refactoring.