[SPARK-55229][PYTHON] Implement DataFrame.zipWithIndex in PySpark Classic #54195

Open
fangchenli wants to merge 3 commits into apache:master from fangchenli:pyspark-zip-with-index

@fangchenli
Contributor

What changes were proposed in this pull request?

Implement DataFrame.zipWithIndex in PySpark Classic

Why are the changes needed?

This method was added to the Scala API earlier. It needs to be added to PySpark Classic so that users can call it from PySpark.

Does this PR introduce any user-facing change?

Yes, users can see and use this API in PySpark.
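
As a rough illustration of the expected result, here is a plain-Python sketch of the numbering scheme. It assumes the new method follows the same ordering as `RDD.zipWithIndex` (rows numbered from 0, by partition first and then by position within each partition). The function name `zip_with_index_sketch` and the list-of-dicts representation are hypothetical and do not use Spark.

```python
from itertools import accumulate
from typing import Dict, List

Row = Dict[str, object]


def zip_with_index_sketch(
    partitions: List[List[Row]], index_col_name: str = "index"
) -> List[List[Row]]:
    """Append a consecutive 0-based index to every row (illustrative only)."""
    # Step 1: count the rows in each partition.
    counts = [len(part) for part in partitions]
    # Step 2: a partition's starting offset is the number of rows before it.
    offsets = [0] + list(accumulate(counts))[:-1]
    # Step 3: number the rows inside each partition, starting at its offset.
    return [
        [{**row, index_col_name: offset + i} for i, row in enumerate(part)]
        for part, offset in zip(partitions, offsets)
    ]


if __name__ == "__main__":
    parts = [[{"v": "a"}, {"v": "b"}], [], [{"v": "c"}]]
    for part in zip_with_index_sketch(parts):
        print(part)
```

Empty partitions contribute no rows and do not leave gaps in the numbering, which is the property that distinguishes this scheme from ID generators that only guarantee uniqueness.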

How was this patch tested?

Unit tests added.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Opus 4.6

@github-actions

github-actions bot commented Feb 7, 2026

JIRA Issue Information

=== Sub-task SPARK-55229 ===
Summary: Implement DataFrame.zipWithIndex in PySpark Classic
Assignee: None
Status: Open
Affected: ["4.2.0"]


This comment was automatically generated by GitHub Actions


def zipWithIndex(self, indexColName: str = "index") -> ParentDataFrame:
    return self.select(
        F.col("*"), InternalFunction.distributed_sequence_id().alias(indexColName)
    )
Contributor

This function is dedicated to PS (pandas API on Spark), and I am changing it so that it differs from zipWithIndex in how the underlying RDD is cached.

Basically, for methods in PySpark Classic, we invoke the JVM methods directly via py4j.
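
A minimal sketch of that delegation pattern, for illustration only. PySpark Classic DataFrames keep a py4j handle to the JVM Dataset in `_jdf` and wrap whatever the JVM call returns. Here `FakeJavaDataFrame` and `ClassicDataFrameSketch` are hypothetical stand-ins so the example runs without a JVM. The real change would call the actual JVM method and wrap the result in a PySpark `DataFrame`.

```python
from typing import Dict, List

Row = Dict[str, object]


class FakeJavaDataFrame:
    """Hypothetical stand-in for the py4j proxy of a JVM Dataset."""

    def __init__(self, rows: List[Row]) -> None:
        self.rows = rows

    def zipWithIndex(self, indexColName: str) -> "FakeJavaDataFrame":
        # Stands in for the JVM-side implementation: append a 0-based index.
        return FakeJavaDataFrame(
            [{**row, indexColName: i} for i, row in enumerate(self.rows)]
        )


class ClassicDataFrameSketch:
    """Hypothetical Python wrapper that only forwards calls to the JVM object."""

    def __init__(self, jdf: FakeJavaDataFrame) -> None:
        self._jdf = jdf  # handle to the JVM-side object

    def zipWithIndex(self, indexColName: str = "index") -> "ClassicDataFrameSketch":
        # No Python-side logic: call the JVM method and wrap its result.
        return ClassicDataFrameSketch(self._jdf.zipWithIndex(indexColName))


if __name__ == "__main__":
    df = ClassicDataFrameSketch(FakeJavaDataFrame([{"v": "a"}, {"v": "b"}]))
    print(df.zipWithIndex()._jdf.rows)
```

With this shape the Python method stays a thin wrapper, so its behavior follows the Scala implementation instead of being rebuilt from Python-side column expressions.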
