[SPARK-47017][SQL] Show internal SQL plans for RDDScanExec #56225

Open
alexandrefimov wants to merge 1 commit into apache:master from alexandrefimov:SPARK-47017-rddscan-history-metrics

What changes were proposed in this pull request?

This PR preserves the original SQL physical plan inside SQLExecutionRDD and uses it when building SparkPlanInfo for RDDScanExec.

When a DataFrame is converted to an RDD and then back to a DataFrame, Spark currently shows only Scan ExistingRDD in the SQL UI. This change lets SparkPlanInfo discover the underlying SQL plan from the SQLExecutionRDD lineage and attach it under the RDDScanExec node.

The RDD lineage traversal is defensive:

  • it deduplicates RDDs by id;
  • it stops at each discovered SQLExecutionRDD;
  • it tolerates dependency lookup failures by skipping that branch.
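To illustrate the three rules above, here is a minimal, Spark-free sketch of such a traversal. It is not the code in this patch: `Node`, `PlainNode`, `SqlNode`, and `findSqlNodes` are hypothetical stand-ins for `RDD`, `SQLExecutionRDD`, and the lineage walk, and the plan is reduced to a string.

```scala
import scala.collection.mutable
import scala.util.Try

object LineageSketch {
  // Hypothetical stand-in for an RDD: an id plus lazily evaluated dependencies.
  sealed trait Node {
    def id: Int
    def dependencies: Seq[Node]
  }

  // An ordinary lineage node with no SQL plan attached.
  final class PlainNode(val id: Int, deps: () => Seq[Node]) extends Node {
    def dependencies: Seq[Node] = deps()
  }

  // Hypothetical stand-in for SQLExecutionRDD: it carries the preserved plan.
  final class SqlNode(val id: Int, val plan: String, deps: () => Seq[Node]) extends Node {
    def dependencies: Seq[Node] = deps()
  }

  // Depth-first walk that collects the SQL-backed nodes nearest to the root.
  def findSqlNodes(root: Node): List[SqlNode] = {
    val visited = mutable.HashSet.empty[Int]
    val found = mutable.ArrayBuffer.empty[SqlNode]
    var pending: List[Node] = List(root)
    while (pending.nonEmpty) {
      val node = pending.head
      pending = pending.tail
      // Rule 1: deduplicate by id, so shared lineage is visited only once.
      if (visited.add(node.id)) {
        node match {
          // Rule 2: stop at a discovered SQL node; do not descend below it.
          case sql: SqlNode => found += sql
          // Rule 3: a failing dependency lookup only skips that branch.
          case other =>
            val deps = Try(other.dependencies).getOrElse(Seq.empty)
            pending = deps.toList ::: pending
        }
      }
    }
    found.toList
  }

  // Example lineage: two branches share one SQL node, and a third branch fails.
  val inner = new SqlNode(0, "Range", () => Seq.empty)
  val sql = new SqlNode(1, "Filter", () => Seq(inner))
  val a = new PlainNode(2, () => Seq(sql))
  val b = new PlainNode(3, () => Seq(sql))
  val broken = new PlainNode(5, () => throw new IllegalStateException("lookup failed"))
  val root = new PlainNode(4, () => Seq(a, b, broken))

  def main(args: Array[String]): Unit =
    println(findSqlNodes(root).map(_.plan))
}
```

In this example the shared `Filter` node is reported once, the `Range` node beneath it is never reached, and the failing branch is skipped without aborting the walk.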

Why are the changes needed?

Without this, the SQL UI loses useful operator visibility and metrics after workflows such as:

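import spark.implicits._  // enables the $"id" column syntax (already in scope in spark-shell)

val source = spark.range(10).filter($"id" > 3)
// Round-trip through an RDD, then run the recreated DataFrame
val recreated = spark.createDataFrame(source.rdd, source.schema)
recreated.collect()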

The UI exposes only Scan ExistingRDD, while the internal SQL operators, such as Filter, are hidden. This makes query diagnosis harder for workflows that go from DataFrame to RDD and back to DataFrame.

Does this PR introduce any user-facing change?

Yes. For SQL executions that scan an RDD backed by a previous SQL execution, the SQL UI can now show the internal SQL plan below Scan ExistingRDD, including metrics such as number of output rows.

How was this patch tested?

Added regression coverage in SparkPlanInfoSuite for:

  • exposing the internal SQL plan under RDDScanExec;
  • preserving SQL metrics in the SQL UI plan graph;
  • deduplicating shared RDD lineage;
  • nested createDataFrame(df.rdd, df.schema) cases.

Ran:

build/sbt "sql/testOnly *SparkPlanInfoSuite -- -z SPARK-47017"

Result:

Tests: succeeded 1, failed 0
All tests passed.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: OpenAI Codex (GPT-5)

alexandrefimov force-pushed the SPARK-47017-rddscan-history-metrics branch from 6dcf6ca to a8a3773 on May 30, 2026 16:14