
[SPARK-55401][PYTHON] Add retry logic and timeout handling to pyspark install download #54183

Closed

yaooqinn wants to merge 1 commit into apache:master from yaooqinn:SPARK-55401

Conversation

@yaooqinn
Member

@yaooqinn yaooqinn commented Feb 6, 2026

What changes were proposed in this pull request?

This PR adds retry logic and timeout handling to the Spark distribution download in pyspark/install.py to reduce flakiness in pyspark.tests.test_install_spark.

Changes:

  1. Added timeout=10 to the mirror resolution urlopen() call in get_preferred_mirrors() — prevents hanging when closer.lua is unresponsive
  2. Added _download_with_retries() helper — wraps the download with (a sketch follows this list):
    • Configurable timeout (default: 600s) on urlopen() to prevent indefinite hangs
    • Up to 3 retry attempts with exponential backoff (5s, 10s, 20s)
    • Cleanup of partial downloads on failure
    • Clear logging of retry attempts for CI debugging
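
For reference, here is a minimal sketch of what such a helper could look like. The name _download_with_retries and the defaults (600-second timeout, 3 attempts, 5s/10s/20s backoff) come from this PR's description; the exact signature, logging, and streaming code below are assumptions and may differ from the merged patch, which reuses the existing download_to_file helper in pyspark/install.py.

    import logging
    import os
    import shutil
    import time
    import urllib.request

    LOGGER = logging.getLogger(__name__)

    def _download_with_retries(url, path, timeout=600, max_retries=3, initial_backoff=5):
        """Download url to path, retrying transient failures with exponential backoff."""
        backoff = initial_backoff
        last_error = None
        for attempt in range(1, max_retries + 1):
            try:
                # The timeout keeps urlopen() from hanging indefinitely mid-stream.
                with urllib.request.urlopen(url, timeout=timeout) as response, open(path, "wb") as f:
                    shutil.copyfileobj(response, f)
                return
            except Exception as e:
                last_error = e
                # Drop any partially written file before the next attempt.
                if os.path.exists(path):
                    os.remove(path)
                if attempt < max_retries:
                    LOGGER.warning(
                        "Download attempt %d/%d failed (%s); retrying in %ds",
                        attempt, max_retries, e, backoff,
                    )
                    time.sleep(backoff)
                    backoff *= 2  # 5s, 10s, 20s with the defaults
        raise RuntimeError(
            "Failed to download %s after %d attempts" % (url, max_retries)
        ) from last_error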

Why are the changes needed?

The pyspark-install CI job frequently fails due to transient network issues when downloading ~400MB Spark distributions from Apache mirrors. Current issues:

  • urlopen() has no timeout — downloads can hang indefinitely
  • No retry logic — a single transient network error causes complete failure
  • CI logs show downloads stalling mid-stream (e.g., at 64%) with no recovery

Does this PR introduce any user-facing change?

No. The download behavior is improved with retries and timeouts, but the API is unchanged. Users who call install_spark or pip install pyspark with PYSPARK_HADOOP_VERSION will benefit from more reliable downloads.

How was this patch tested?

  • Existing unit tests pass: test_package_name, test_checked_versions
  • The download test (test_install_spark) exercises the new retry path in CI

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions

github-actions bot commented Feb 6, 2026

JIRA Issue Information

=== Improvement SPARK-55401 ===
Summary: Add retry logic and timeout handling to pyspark install download
Assignee: None
Status: Open
Affected: ["4.2.0"]


This comment was automatically generated by GitHub Actions


Copilot AI left a comment


Pull request overview

This PR improves the reliability of Spark distribution downloads during pyspark installation by adding request timeouts and retry/backoff behavior, targeting flakiness in pyspark.tests.test_install_spark.

Changes:

  • Add a timeout to Apache mirror resolution (closer.lua) to prevent indefinite hangs.
  • Route Spark distribution downloads through a new retrying helper with timeout/backoff and partial-download cleanup.

Comments suppressed due to low confidence (1)

python/pyspark/install.py:180

  • The comment "No retry" is misleading because this loop does attempt up to 3 times. Either adjust the comment to match the behavior (e.g., "best-effort") or remove it to avoid confusion during debugging.
        except Exception:
            # If we can't get a mirror URL, skip it. No retry.
            pass


Comment on lines +205 to +212
for attempt in range(max_retries):
    try:
        response = urllib.request.urlopen(url, timeout=timeout)
        download_to_file(response, path)
        return
    except Exception as e:
        if os.path.exists(path):
            os.remove(path)

Copilot AI Feb 6, 2026


The current retry loop will also retry on non-transient failures such as HTTPError 404 (bad version / missing artifact) or 403, which can add minutes of unnecessary backoff before the outer loop moves on to the next mirror. Consider special-casing urllib.error.HTTPError so that only likely-transient status codes (e.g., 429/5xx) are retried, and failing fast otherwise.
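
One hypothetical way to act on this suggestion (not part of the PR) is a small predicate that decides whether an error is worth retrying, so the retry loop can re-raise non-transient HTTP errors immediately instead of sleeping and retrying:

    import urllib.error

    # Status codes that plausibly succeed on retry; anything else (403, 404, ...)
    # should fail fast so the outer loop can move on to the next mirror.
    TRANSIENT_HTTP_CODES = {429, 500, 502, 503, 504}

    def _is_transient(error):
        """Return True if a download error is likely to succeed on a retry."""
        if isinstance(error, urllib.error.HTTPError):
            return error.code in TRANSIENT_HTTP_CODES
        # URLError, socket timeouts, connection resets, etc. are treated as transient.
        return True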

Comment on lines 174 to 177
  response = urllib.request.urlopen(
-     "https://www.apache.org/dyn/closer.lua?preferred=true"
+     "https://www.apache.org/dyn/closer.lua?preferred=true", timeout=10
  )
  mirror_urls.append(response.read().decode("utf-8"))

Copilot AI Feb 6, 2026


urlopen() responses should be closed to avoid leaking sockets/file descriptors. Consider wrapping the response.read() call in a context manager (with urllib.request.urlopen(...) as response:), or using try/finally to call response.close().
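
A sketch of the context-manager variant for the mirror-resolution call; get_preferred_mirror is a hypothetical standalone wrapper, whereas in install.py this call sits inside get_preferred_mirrors() and appends the result to mirror_urls:

    import urllib.request

    def get_preferred_mirror(timeout=10):
        # The context manager closes the connection even if read() or decode() raises.
        with urllib.request.urlopen(
            "https://www.apache.org/dyn/closer.lua?preferred=true", timeout=timeout
        ) as response:
            return response.read().decode("utf-8")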

Comment on lines +206 to +209
try:
    response = urllib.request.urlopen(url, timeout=timeout)
    download_to_file(response, path)
    return

Copilot AI Feb 6, 2026


response is never closed, so each attempt can leak an open HTTP connection/file descriptor (especially problematic with retries). Wrap the urlopen call in a context manager or ensure response.close() runs in a finally block even when download_to_file raises.
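
A sketch of the try/finally variant for a single download attempt; _fetch_once is a hypothetical name, and it assumes download_to_file (the existing helper used in pyspark/install.py) is in scope:

    import urllib.request

    def _fetch_once(url, path, timeout):
        # Single-attempt download: the finally block closes the response even
        # when download_to_file(...) raises partway through the transfer.
        response = urllib.request.urlopen(url, timeout=timeout)
        try:
            download_to_file(response, path)
        finally:
            response.close()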

@zhengruifeng
Contributor

thanks for the fix.
also cc @HyukjinKwon @gaogaotiantian

Member

@dongjoon-hyun dongjoon-hyun left a comment


+1, LGTM.

@yaooqinn yaooqinn closed this in 3a9307f Feb 7, 2026
@yaooqinn yaooqinn deleted the SPARK-55401 branch February 7, 2026 16:39
@yaooqinn
Member Author

yaooqinn commented Feb 7, 2026

Merged to master, thank you @dongjoon-hyun
