
[WIP][SPARK-XXXXX][SQL] df.cache() with DSv2 interfaces to enable V2ScanRelationPushDown optimizer rules #55017

Open

dbtsai wants to merge 3 commits into apache:master from dbtsai:dfCache

Conversation

@dbtsai
Member

@dbtsai dbtsai commented Mar 25, 2026

What changes were proposed in this pull request?

df.cache() is backed by InMemoryRelation, a pre-DSv2 logical plan node. Queries on cached
DataFrames skip the entire V2ScanRelationPushDown optimizer batch, losing column pruning, filter
pushdown, sort-order propagation, and standard statistics reporting that all DSv2-backed sources
benefit from.

This PR introduces a thin DSv2 wrapper layer — InMemoryCacheTable, InMemoryScanBuilder, and
InMemoryCacheScan — in a new file InMemoryCacheTable.scala. CacheManager.useCachedData() now
substitutes matching plan fragments with DataSourceV2Relation(InMemoryCacheTable) instead of bare
InMemoryRelation, so the V2ScanRelationPushDown batch fires on cached DataFrames just as it does
for any other DSv2 source.

The DSv2 interfaces implemented:

| Interface | Benefit |
| --- | --- |
| SupportsPushDownRequiredColumns | Column pruning: InMemoryTableScanExec deserializes only the requested columns |
| SupportsPushDownV2Filters | Filter pushdown: predicates are recorded for per-batch min/max pruning via CachedBatchSerializer.buildFilter; all predicates are also returned as post-scan filters (category 2), so a FilterExec is always added for row-level evaluation |
| SupportsReportOrdering | Sort-order propagation: the cached sort order is visible to V2ScanPartitioningAndOrdering, eliminating redundant sorts on ordered cached data |
| SupportsReportStatistics | Statistics: accurate row count, size, and column statistics (including those gathered via ANALYZE) are reported to the optimizer and AQE via the standard V2 path |
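The column-pruning half of the design can be illustrated with a small, dependency-free sketch. The names below (InMemoryScanBuilderSketch, CachedBatch, pruneColumns, build) are toy stand-ins chosen to mirror the PR's InMemoryScanBuilder and Spark's SupportsPushDownRequiredColumns; they are not Spark's actual connector types:

```scala
// Toy model of SupportsPushDownRequiredColumns-style column pruning.
// A cached batch stores one Seq[Int] per column name.
final case class CachedBatch(columns: Map[String, Seq[Int]])

final class InMemoryScanBuilderSketch(batch: CachedBatch) {
  // Until the optimizer prunes, every column is considered required.
  private var required: Set[String] = batch.columns.keySet

  // Analogous to pruneColumns(requiredSchema): the optimizer tells the
  // builder which columns the query actually reads.
  def pruneColumns(cols: Set[String]): Unit =
    required = cols

  // Analogous to build(): only the requested columns are materialized.
  def build(): Map[String, Seq[Int]] =
    batch.columns.filter { case (name, _) => required.contains(name) }
}
```

Under this sketch, selecting 2 of 10 columns means build() touches only those 2 entries, which is the effect the benchmark below measures.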

Physical execution is fully preserved. A new case in DataSourceV2Strategy intercepts
DataSourceV2ScanRelation(InMemoryCacheScan) before the generic BatchScanExec path and routes it
back to InMemoryTableScanExec. No change to the columnar scan hot path.
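The interception order matters: the cache-specific case must be tried before the generic fallback. A minimal sketch with toy plan types in place of Spark's (the real code pattern-matches on DataSourceV2ScanRelation inside DataSourceV2Strategy):

```scala
// Toy stand-ins for the scan and physical-plan types involved.
sealed trait ScanSketch
case object InMemoryCacheScanSketch extends ScanSketch
case object SomeOtherV2ScanSketch extends ScanSketch

sealed trait ExecSketch
case object InMemoryTableScanExecSketch extends ExecSketch
case object BatchScanExecSketch extends ExecSketch

object DataSourceV2StrategySketch {
  // The cache-specific case comes before the generic fallback, so cached
  // scans keep the existing columnar InMemoryTableScanExec hot path while
  // every other DSv2 scan still plans to BatchScanExec.
  def plan(scan: ScanSketch): ExecSketch = scan match {
    case InMemoryCacheScanSketch => InMemoryTableScanExecSketch
    case _                       => BatchScanExecSketch
  }
}
```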

A CachedRelation extractor object matches any of the three plan forms a cached relation takes
across query stages — InMemoryRelation, DataSourceV2Relation(InMemoryCacheTable), and
DataSourceV2ScanRelation(InMemoryCacheScan) — providing backward compatibility for all existing
pattern-match sites.
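The extractor idea can be sketched in isolation. The three case classes below are simplified stand-ins for the real plan nodes; only the shape of the unapply is the point:

```scala
// Toy stand-ins for the three plan forms a cached relation can take.
sealed trait PlanSketch
final case class InMemoryRelationSketch(id: String) extends PlanSketch
final case class DataSourceV2RelationSketch(
    cached: InMemoryRelationSketch) extends PlanSketch
final case class DataSourceV2ScanRelationSketch(
    cached: InMemoryRelationSketch) extends PlanSketch

// One extractor normalizes all three forms back to the underlying cached
// relation, so existing pattern-match sites only need to swap the name
// they match on rather than handle each form separately.
object CachedRelationSketch {
  def unapply(plan: PlanSketch): Option[InMemoryRelationSketch] = plan match {
    case r: InMemoryRelationSketch         => Some(r)
    case DataSourceV2RelationSketch(r)     => Some(r)
    case DataSourceV2ScanRelationSketch(r) => Some(r)
  }
}
```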

Why are the changes needed?

Without this change, every query on a cached DataFrame ignores standard optimizer rules that every
DSv2 source benefits from. The most impactful gap is column pruning: even
SELECT one_col FROM cached_wide_table causes all columns to be deserialized from the in-memory
columnar store.

Does this PR introduce any user-facing change?

Yes, in a positive way:

  • Queries that access a subset of columns from a cached DataFrame will be significantly faster.
  • Queries on sorted cached DataFrames may eliminate redundant sort operations.
  • The ANALYZE column statistics API for cached queries now correctly propagates stats through the
    optimized plan.

The InMemoryRelation type is still used internally and visible in CachedData.cachedRepresentation.
The change is transparent: df.cache(), spark.catalog.cacheTable(), and related APIs behave
identically from the user's perspective.

How was this patch tested?

All existing tests in CachedTableSuite (104 tests), InMemoryColumnarQuerySuite,
DatasetCacheSuite, UDFSuite, DataSourceV2SQLSuite, and LogicalPlanTagInSparkPlanSuite pass
without modification to test logic (only callsite updates to use the new CachedRelation extractor).

A new benchmark InMemoryCacheDSv2Benchmark was added. Results on 1M rows (AQE off, single
partition):

Column pruning - 1,000,000 rows, 10 cols, select 2:

    sum 2 of 10 cols (column pruning via DSv2)          19 ms   1.0X
    sum all 10 cols (no pruning - pre-DSv2)             54 ms   0.3X (~2.8x speedup from pruning)

Planning overhead - 1000 plan-only iterations:

    optimizedPlan (DSv2 path, V2ScanRelationPushDown)   254 ms total (~0.25 ms/plan)

Column pruning yields a ~2.8x speedup when accessing 2 of 10 cached columns. Planning overhead
from the additional optimizer rules is ~0.25 ms per query, negligible in practice.

@viirya
Member

viirya commented Mar 25, 2026

> The most impactful gap is column pruning: even SELECT one_col FROM cached_wide_table causes all
> columns to be deserialized from the in-memory columnar store.

Currently InMemoryTableScanExec has column pruning already. It only deserializes the necessary columns.

