[SPARK-55401][PYTHON] Add retry logic and timeout handling to pyspark install download #54183
yaooqinn wants to merge 1 commit into apache:master
Conversation
JIRA Issue Information: Improvement SPARK-55401. (This comment was automatically generated by GitHub Actions.)
Pull request overview
This PR improves the reliability of Spark distribution downloads during pyspark installation by adding request timeouts and retry/backoff behavior, targeting flakiness in pyspark.tests.test_install_spark.
Changes:
- Add a timeout to Apache mirror resolution (closer.lua) to prevent indefinite hangs.
- Route Spark distribution downloads through a new retrying helper with timeout/backoff and partial-download cleanup.
Comments suppressed due to low confidence (1)
python/pyspark/install.py:180
The comment "No retry" is misleading because this loop does attempt up to 3 times. Either adjust the comment to match the behavior (e.g., "best-effort") or remove it to avoid confusion during debugging.
except Exception:
# If we can't get a mirror URL, skip it. No retry.
pass
for attempt in range(max_retries):
    try:
        response = urllib.request.urlopen(url, timeout=timeout)
        download_to_file(response, path)
        return
    except Exception as e:
        if os.path.exists(path):
            os.remove(path)
The current retry loop will also retry on non-transient failures like HTTPError 404 (bad version / missing artifact) or 403, which can add minutes of unnecessary backoff before the outer loop moves to the next mirror. Consider special-casing urllib.error.HTTPError to retry only likely-transient status codes (e.g., 429/5xx) and fail fast otherwise.
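A sketch of that suggestion, written as the revised retry loop inside _download_with_retries(); the set of transient status codes here is illustrative, not part of the PR:

for attempt in range(max_retries):
    try:
        # Context manager also addresses the response-leak comments below.
        with urllib.request.urlopen(url, timeout=timeout) as response:
            download_to_file(response, path)
        return
    except urllib.error.HTTPError as e:
        if os.path.exists(path):
            os.remove(path)
        if e.code not in (429, 500, 502, 503, 504):
            raise  # fail fast on 403/404 so the caller tries the next mirror
    except Exception:
        if os.path.exists(path):
            os.remove(path)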
  response = urllib.request.urlopen(
-     "https://www.apache.org/dyn/closer.lua?preferred=true"
+     "https://www.apache.org/dyn/closer.lua?preferred=true", timeout=10
  )
  mirror_urls.append(response.read().decode("utf-8"))
urlopen() responses should be closed to avoid leaking sockets/file descriptors. Consider using a context manager (with urllib.request.urlopen(...) as response:) around the response.read() call here, or closing the response in a try/finally block.
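For example, the mirror-resolution call could take this form (a sketch of the suggestion, not the merged code):

with urllib.request.urlopen(
    "https://www.apache.org/dyn/closer.lua?preferred=true", timeout=10
) as response:
    mirror_urls.append(response.read().decode("utf-8"))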
try:
    response = urllib.request.urlopen(url, timeout=timeout)
    download_to_file(response, path)
    return
response is never closed, so each attempt can leak an open HTTP connection/file descriptor (especially problematic with retries). Wrap the urlopen call in a context manager or ensure response.close() runs in a finally block even when download_to_file raises.
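With a context manager, the attempt body quoted above could read:

with urllib.request.urlopen(url, timeout=timeout) as response:
    download_to_file(response, path)
return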
thanks for the fix.
Merged to master, thank you @dongjoon-hyun
What changes were proposed in this pull request?
This PR adds retry logic and timeout handling to the Spark distribution download in pyspark/install.py to reduce flakiness in pyspark.tests.test_install_spark.

Changes:
- Add timeout=10 to the mirror resolution urlopen() call in get_preferred_mirrors() — prevents hanging when closer.lua is unresponsive
- Add a _download_with_retries() helper — wraps the download with a timeout on urlopen() to prevent indefinite hangs, retries with backoff between attempts, and cleanup of partially downloaded files (a sketch follows this list)
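A simplified sketch of what such a helper can look like; the timeout value, retry count, and backoff schedule here are illustrative, and download_to_file is the existing helper referenced in the diff:

import os
import time
import urllib.request

def _download_with_retries(url, path, timeout=60, max_retries=3):
    """Download url to path, cleaning up partial files and retrying with backoff."""
    last_error = None
    for attempt in range(max_retries):
        try:
            response = urllib.request.urlopen(url, timeout=timeout)
            try:
                download_to_file(response, path)
            finally:
                response.close()  # avoid leaking the connection across retries
            return
        except Exception as e:
            last_error = e
            # Remove any partially downloaded file so the next attempt starts clean.
            if os.path.exists(path):
                os.remove(path)
            if attempt < max_retries - 1:
                time.sleep(2 ** attempt)  # simple exponential backoff
    raise IOError("Unable to download %s: %s" % (url, last_error))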
Why are the changes needed?
The pyspark-install CI job frequently fails due to transient network issues when downloading ~400MB Spark distributions from Apache mirrors. Current issues:
- urlopen() has no timeout — downloads can hang indefinitely
- downloads are not retried, so a single transient failure fails the whole job

Does this PR introduce any user-facing change?
No. The download behavior is improved with retries and timeouts, but the API is unchanged. Users who call install_spark or pip install pyspark with PYSPARK_HADOOP_VERSION will benefit from more reliable downloads.

How was this patch tested?
- Existing unit tests pass (test_package_name, test_checked_versions)
- The install test (test_install_spark) exercises the new retry path in CI

Was this patch authored or co-authored using generative AI tooling?
No.