[WIP][SPARK-XXXXX][SQL] df.cache() with DSv2 interfaces to enable V2ScanRelationPushDown optimizer rules #55017
Open

dbtsai wants to merge 3 commits into apache:master
Reviewer comment: Currently `InMemoryTableScanExec` already has column pruning; it only deserializes the necessary columns.
## What changes were proposed in this pull request?
`df.cache()` is backed by `InMemoryRelation`, a pre-DSv2 logical plan node. Queries on cached DataFrames skip the entire `V2ScanRelationPushDown` optimizer batch, losing the column pruning, filter pushdown, sort-order propagation, and standard statistics reporting that all DSv2-backed sources benefit from.
This PR introduces a thin DSv2 wrapper layer (`InMemoryCacheTable`, `InMemoryScanBuilder`, and `InMemoryCacheScan`) in a new file, `InMemoryCacheTable.scala`. `CacheManager.useCachedData()` now substitutes matching plan fragments with `DataSourceV2Relation(InMemoryCacheTable)` instead of a bare `InMemoryRelation`, so the `V2ScanRelationPushDown` batch fires on cached DataFrames just as it does for any other DSv2 source.
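The substitution step described above might be sketched as follows. This is an illustrative sketch only, not the actual patch: `InMemoryCacheTable`'s constructor and the use of `CacheManager.lookupCachedData` here are assumptions based on the PR description.

```scala
// Hypothetical sketch of the new substitution inside CacheManager.useCachedData().
// Assumption: InMemoryCacheTable wraps the cached InMemoryRelation.
plan.transformDown { case fragment =>
  lookupCachedData(fragment)
    .map { cached =>
      val imr = cached.cachedRepresentation
      // Wrap the cached relation in a DSv2 table so the
      // V2ScanRelationPushDown batch can see and optimize it.
      DataSourceV2Relation.create(new InMemoryCacheTable(imr), None, None)
    }
    .getOrElse(fragment)
}
```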
The DSv2 interfaces implemented:

- `SupportsPushDownRequiredColumns`: `InMemoryTableScanExec` deserializes only the requested columns.
- `SupportsPushDownV2Filters`: filters are translated through `CachedBatchSerializer.buildFilter`; all predicates are returned as post-scan residuals (category 2), so a `FilterExec` is always added.
- `SupportsReportOrdering`: output ordering is propagated through `V2ScanPartitioningAndOrdering`, eliminating redundant sorts on ordered cached data.
- `SupportsReportStatistics`: statistics (including those from `ANALYZE`) are reported to the optimizer and AQE via the standard V2 path.

Physical execution is fully preserved. A new case in `DataSourceV2Strategy` intercepts `DataSourceV2ScanRelation(InMemoryCacheScan)` before the generic `BatchScanExec` path and routes it back to `InMemoryTableScanExec`. There is no change to the columnar scan hot path.
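The strategy interception described above could look roughly like the sketch below. The pattern arity and the fields read from the scan are assumptions; this is not the actual patch.

```scala
// Hypothetical sketch of the new case in DataSourceV2Strategy.
// Assumption: InMemoryCacheScan exposes the pruned output attributes,
// pushed predicates, and the underlying InMemoryRelation.
case DataSourceV2ScanRelation(_, scan: InMemoryCacheScan, output, _, _) =>
  // Route cached scans back to the existing columnar scan operator
  // instead of the generic BatchScanExec.
  InMemoryTableScanExec(output, scan.predicates, scan.relation) :: Nil
```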
A `CachedRelation` extractor object matches any of the three plan forms a cached relation can take across query stages (`InMemoryRelation`, `DataSourceV2Relation(InMemoryCacheTable)`, and `DataSourceV2ScanRelation(InMemoryCacheScan)`), providing backward compatibility for all existing pattern-match sites.
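A sketch of what such an extractor might look like, assuming `InMemoryCacheTable` exposes its wrapped relation through a `relation` member (that member name is an assumption):

```scala
// Hypothetical sketch of the CachedRelation extractor covering the
// three plan forms named in the PR description.
object CachedRelation {
  def unapply(plan: LogicalPlan): Option[InMemoryRelation] = plan match {
    case imr: InMemoryRelation =>
      Some(imr)
    case r: DataSourceV2Relation =>
      r.table match {
        case t: InMemoryCacheTable => Some(t.relation)  // assumed accessor
        case _                     => None
      }
    case s: DataSourceV2ScanRelation =>
      unapply(s.relation)  // delegate to the DataSourceV2Relation case
    case _ =>
      None
  }
}
```

Callers then match `case CachedRelation(imr) => ...` regardless of which stage of planning the cached relation is in.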
## Why are the changes needed?
Without this change, every query on a cached DataFrame misses standard optimizer rules that every DSv2 source benefits from. The most impactful gap is column pruning: even `SELECT one_col FROM cached_wide_table` causes all columns to be deserialized from the in-memory columnar store.
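The pruning gap can be observed from a spark-shell session along these lines (illustrative only; exact plan output depends on the Spark build):

```scala
// Build and cache a 10-column DataFrame, then query a single column.
// With this change, only c1 should be deserialized from the cache.
val wide = spark.range(1000000L).selectExpr(
  (1 to 10).map(i => s"id * $i AS c$i"): _*)
wide.cache()
wide.count()  // materialize the cache

val pruned = wide.selectExpr("sum(c1)")
pruned.explain()  // the cached scan's output should list only c1
```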
## Does this PR introduce any user-facing change?
Yes, in a positive way: `ANALYZE` column statistics for cached queries now correctly propagate through the optimized plan.
The `InMemoryRelation` type is still used internally and remains visible in `CachedData.cachedRepresentation`. The change is otherwise transparent: `df.cache()`, `spark.catalog.cacheTable()`, and related APIs behave identically from the user's perspective.
## How was this patch tested?
All existing tests in `CachedTableSuite` (104 tests), `InMemoryColumnarQuerySuite`, `DatasetCacheSuite`, `UDFSuite`, `DataSourceV2SQLSuite`, and `LogicalPlanTagInSparkPlanSuite` pass without modification to test logic (only call-site updates to use the new `CachedRelation` extractor).

A new benchmark, `InMemoryCacheDSv2Benchmark`, was added. Results on 1M rows (AQE off, single partition):
```
Column pruning - 1000000 rows, 10 cols, select 2:
  sum 2 of 10 cols (column pruning via DSv2)            19ms   1.0X
  sum all 10 cols (no pruning - pre-DSv2)               54ms   0.3X (~2.8x speedup)

Planning overhead - 1000 plan-only iterations:
  optimizedPlan (DSv2 path, V2ScanRelationPushDown)    254ms total (~0.25ms/plan)
```
Column pruning yields a ~2.8x speedup when accessing 2 of 10 cached columns. The planning overhead from the additional optimizer rules is ~0.25ms per query, negligible in practice.