
[WIP][SPARK-XXXXX][SQL] df.cache() with DSv2 interfaces to enable V2ScanRelationPushDown optimizer rules #55017

Open

dbtsai wants to merge 3 commits into apache:master from dbtsai:dfCache

Conversation

@dbtsai
Member

@dbtsai dbtsai commented Mar 25, 2026

What changes were proposed in this pull request?

df.cache() is backed by InMemoryRelation, a pre-DSv2 logical plan node. Queries on cached
DataFrames skip the entire V2ScanRelationPushDown optimizer batch, losing column pruning, filter
pushdown, sort-order propagation, and standard statistics reporting that all DSv2-backed sources
benefit from.

This PR introduces a thin DSv2 wrapper layer — InMemoryCacheTable, InMemoryScanBuilder, and
InMemoryCacheScan — in a new file InMemoryCacheTable.scala. CacheManager.useCachedData() now
substitutes matching plan fragments with DataSourceV2Relation(InMemoryCacheTable) instead of bare
InMemoryRelation, so the V2ScanRelationPushDown batch fires on cached DataFrames just as it does
for any other DSv2 source.

The DSv2 interfaces implemented:

| Interface | Benefit |
| --- | --- |
| SupportsPushDownRequiredColumns | Column pruning: InMemoryTableScanExec deserializes only the requested columns |
| SupportsPushDownV2Filters | Filter pushdown: predicates are recorded for per-batch min/max pruning via CachedBatchSerializer.buildFilter; all predicates are also returned as post-scan filters (category 2), so a FilterExec is always added for row-level evaluation |
| SupportsReportOrdering | Sort-order propagation: the cached sort order is visible to V2ScanPartitioningAndOrdering, eliminating redundant sorts on ordered cached data |
| SupportsReportStatistics | Statistics: accurate row count, size, and column statistics (including those gathered via ANALYZE) are reported to the optimizer and AQE via the standard V2 path |
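The column-pruning half of the design can be illustrated with a small, dependency-free sketch. The names below (InMemoryScanBuilderSketch, CachedBatch, pruneColumns, build) are toy stand-ins chosen to mirror the PR's InMemoryScanBuilder and Spark's SupportsPushDownRequiredColumns; they are not Spark's actual connector types:

```scala
// Toy model of SupportsPushDownRequiredColumns-style column pruning.
// A cached batch stores one Seq[Int] per column name.
final case class CachedBatch(columns: Map[String, Seq[Int]])

final class InMemoryScanBuilderSketch(batch: CachedBatch) {
  // Until the optimizer prunes, every column is considered required.
  private var required: Set[String] = batch.columns.keySet

  // Analogous to pruneColumns(requiredSchema): the optimizer tells the
  // builder which columns the query actually reads.
  def pruneColumns(cols: Set[String]): Unit =
    required = cols

  // Analogous to build(): only the requested columns are materialized.
  def build(): Map[String, Seq[Int]] =
    batch.columns.filter { case (name, _) => required.contains(name) }
}
```

Under this sketch, selecting 2 of 10 columns means build() touches only those 2 entries, which is the effect the benchmark below measures.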

Physical execution is fully preserved. A new case in DataSourceV2Strategy intercepts
DataSourceV2ScanRelation(InMemoryCacheScan) before the generic BatchScanExec path and routes it
back to InMemoryTableScanExec. No change to the columnar scan hot path.
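The interception order matters: the cache-specific case must be tried before the generic fallback. A minimal sketch with toy plan types in place of Spark's (the real code pattern-matches on DataSourceV2ScanRelation inside DataSourceV2Strategy):

```scala
// Toy stand-ins for the scan and physical-plan types involved.
sealed trait ScanSketch
case object InMemoryCacheScanSketch extends ScanSketch
case object SomeOtherV2ScanSketch extends ScanSketch

sealed trait ExecSketch
case object InMemoryTableScanExecSketch extends ExecSketch
case object BatchScanExecSketch extends ExecSketch

object DataSourceV2StrategySketch {
  // The cache-specific case comes before the generic fallback, so cached
  // scans keep the existing columnar InMemoryTableScanExec hot path while
  // every other DSv2 scan still plans to BatchScanExec.
  def plan(scan: ScanSketch): ExecSketch = scan match {
    case InMemoryCacheScanSketch => InMemoryTableScanExecSketch
    case _                       => BatchScanExecSketch
  }
}
```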

A CachedRelation extractor object matches any of the three plan forms a cached relation takes
across query stages — InMemoryRelation, DataSourceV2Relation(InMemoryCacheTable), and
DataSourceV2ScanRelation(InMemoryCacheScan) — providing backward compatibility for all existing
pattern-match sites.
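The extractor idea can be sketched in isolation. The three case classes below are simplified stand-ins for the real plan nodes; only the shape of the unapply is the point:

```scala
// Toy stand-ins for the three plan forms a cached relation can take.
sealed trait PlanSketch
final case class InMemoryRelationSketch(id: String) extends PlanSketch
final case class DataSourceV2RelationSketch(
    cached: InMemoryRelationSketch) extends PlanSketch
final case class DataSourceV2ScanRelationSketch(
    cached: InMemoryRelationSketch) extends PlanSketch

// One extractor normalizes all three forms back to the underlying cached
// relation, so existing pattern-match sites only need to swap the name
// they match on rather than handle each form separately.
object CachedRelationSketch {
  def unapply(plan: PlanSketch): Option[InMemoryRelationSketch] = plan match {
    case r: InMemoryRelationSketch         => Some(r)
    case DataSourceV2RelationSketch(r)     => Some(r)
    case DataSourceV2ScanRelationSketch(r) => Some(r)
  }
}
```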

Why are the changes needed?

Without this change, every query on a cached DataFrame ignores standard optimizer rules that every
DSv2 source benefits from. The most impactful gap is column pruning: even
SELECT one_col FROM cached_wide_table causes all columns to be deserialized from the in-memory
columnar store.

Does this PR introduce any user-facing change?

Yes, in a positive way:

  • Queries that access a subset of columns from a cached DataFrame will be significantly faster.
  • Queries on sorted cached DataFrames may eliminate redundant sort operations.
  • The ANALYZE column statistics API for cached queries now correctly propagates stats through the
    optimized plan.

The InMemoryRelation type is still used internally and visible in CachedData.cachedRepresentation.
The change is transparent: df.cache(), spark.catalog.cacheTable(), and related APIs behave
identically from the user's perspective.

How was this patch tested?

All existing tests in CachedTableSuite (104 tests), InMemoryColumnarQuerySuite,
DatasetCacheSuite, UDFSuite, DataSourceV2SQLSuite, and LogicalPlanTagInSparkPlanSuite pass
without modification to test logic (only callsite updates to use the new CachedRelation extractor).

A new benchmark InMemoryCacheDSv2Benchmark was added. Results on 1M rows (AQE off, single
partition):

Column pruning - 1,000,000 rows, 10 cols, select 2:

    sum 2 of 10 cols (column pruning via DSv2)          19 ms   1.0X
    sum all 10 cols (no pruning - pre-DSv2)             54 ms   0.3X (~2.8x speedup from pruning)

Planning overhead - 1000 plan-only iterations:

    optimizedPlan (DSv2 path, V2ScanRelationPushDown)   254 ms total (~0.25 ms/plan)

Column pruning yields a ~2.8x speedup when accessing 2 of 10 cached columns. Planning overhead
from the additional optimizer rules is ~0.25 ms per query, negligible in practice.

@viirya
Member

viirya commented Mar 25, 2026

> The most impactful gap is column pruning: even SELECT one_col FROM cached_wide_table causes all
> columns to be deserialized from the in-memory columnar store.

Currently InMemoryTableScanExec has column pruning already. It only deserializes the necessary columns.

