
[SPARK-55395][SQL] Disable RDD cache in DataFrame.zipWithIndex #54178

Open
zhengruifeng wants to merge 5 commits into apache:master from zhengruifeng:zip_with_index_cache

Conversation

zhengruifeng (Contributor) commented Feb 6, 2026

What changes were proposed in this pull request?

Disable the RDD cache in DataFrame.zipWithIndex.

Why are the changes needed?

When AttachDistributedSequence was first introduced for the Pandas API on Spark in 93cec49, the underlying RDD was always locally checkpointed (cached) to avoid recomputation.
We then hit serious executor memory issues, so in 4279090 we made the storage level configurable and released the cached data after each stage via AQE.

Since we are reusing AttachDistributedSequence to implement DataFrame.zipWithIndex, we start with a no-cache version to be more conservative; it will be easy to enable caching in the future if necessary.

Moreover, there is a chance to further optimize the no-cache version: #54169

This PR disables the RDD cache in DistributedSequenceID by default; the PS callsites explicitly set cache=True.
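
For intuition, here is a minimal sketch of the mechanism being toggled, assuming nothing about Spark's actual AttachDistributedSequence internals (all names below are illustrative): attaching a global sequence id takes two passes over the input, one to count rows per partition and one to assign offset-adjusted ids, so without a cache the input lineage is computed twice.

```python
def attach_sequence_ids(rdd, cache=False):
    # Illustrative two-pass scheme, not Spark's implementation.
    if cache:
        rdd.localCheckpoint()  # mark the RDD for local checkpointing
        rdd.count()            # materialize it so pass 2 reads cached blocks
    # Pass 1: count rows in each partition to derive per-partition offsets.
    counts = rdd.mapPartitions(lambda it: [sum(1 for _ in it)]).collect()
    offsets = [0]
    for c in counts[:-1]:
        offsets.append(offsets[-1] + c)
    # Pass 2: assign ids; recomputes the lineage unless cached above.
    def attach(split, it):
        for i, row in enumerate(it):
            yield (offsets[split] + i, row)
    return rdd.mapPartitionsWithIndex(attach)
```

With the new default, DataFrame.zipWithIndex takes the no-cache path, accepting a possible recomputation of the input in exchange for lower executor memory pressure, while the PS callsites keep the cached behavior.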

Does this PR introduce any user-facing change?

No

How was this patch tested?

CI

Was this patch authored or co-authored using generative AI tooling?

No

github-actions bot commented Feb 6, 2026

JIRA Issue Information

=== Improvement SPARK-55395 ===
Summary: Disable RDD cache in DataFrame.zipWithIndex
Assignee: None
Status: Open
Affected: ["4.2.0"]


This comment was automatically generated by GitHub Actions

@zhengruifeng zhengruifeng marked this pull request as ready for review February 6, 2026 13:06
zhengruifeng (Contributor, Author) commented Feb 7, 2026

on second thought, let me use a bool parameter cache to replace the storage level, to simplify the code

```diff
 @staticmethod
 def distributed_sequence_id() -> Column:
-    return InternalFunction._invoke_internal_function_over_columns("distributed_sequence_id")
+    return InternalFunction._invoke_internal_function_over_columns(
```
zhengruifeng (Contributor, Author) commented: this is only used in PS
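
Since the added line in the hunk above is cut off, here is a hedged guess at the completed helper, assuming the bool cache parameter from the earlier comment is forwarded as a literal column (the actual signature in the patch may differ):

```python
@staticmethod
def distributed_sequence_id(cache: bool = False) -> Column:
    # Hedged completion: forward the cache flag as a literal column argument.
    return InternalFunction._invoke_internal_function_over_columns(
        "distributed_sequence_id", lit(cache)
    )
```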

```diff
  */
 private[sql] def withSequenceColumn(name: String) = {
-  select(Column(DistributedSequenceID()).alias(name), col("*"))
+  select(Column(DistributedSequenceID(Literal(true))).alias(name), col("*"))
```
zhengruifeng (Contributor, Author) commented: this should also be a place for PS on PySpark classic
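
For the PySpark-classic side hinted at here, a hypothetical equivalent might look like the following (with_sequence_column and the cache keyword are assumed for illustration, not taken from the patch):

```python
from pyspark.sql import functions as sf

def with_sequence_column(sdf, name):
    # Hypothetical mirror of the Scala withSequenceColumn above: PS callsites
    # opt in to caching, matching DistributedSequenceID(Literal(true)).
    seq = InternalFunction.distributed_sequence_id(cache=True)  # assumed API
    return sdf.select(seq.alias(name), sf.col("*"))
```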
