[SPARK-55401][PYTHON] Add retry logic and timeout handling to pyspark install download #54183
yaooqinn wants to merge 1 commit into apache:master
Conversation
JIRA Issue Information: Improvement SPARK-55401. (This comment was automatically generated by GitHub Actions.)
Pull request overview
This PR improves the reliability of Spark distribution downloads during pyspark installation by adding request timeouts and retry/backoff behavior, targeting flakiness in pyspark.tests.test_install_spark.
Changes:
- Add a timeout to Apache mirror resolution (closer.lua) to prevent indefinite hangs.
- Route Spark distribution downloads through a new retrying helper with timeout/backoff and partial-download cleanup.
Comments suppressed due to low confidence (1)
python/pyspark/install.py:180
The comment "No retry" is misleading because this loop does attempt up to 3 times. Either adjust the comment to match the behavior (e.g., "best-effort") or remove it to avoid confusion during debugging.
except Exception:
# If we can't get a mirror URL, skip it. No retry.
pass
for attempt in range(max_retries):
    try:
        response = urllib.request.urlopen(url, timeout=timeout)
        download_to_file(response, path)
        return
    except Exception as e:
        if os.path.exists(path):
            os.remove(path)
The current retry loop will also retry on non-transient failures like HTTPError 404 (bad version / missing artifact) or 403, which can add minutes of unnecessary backoff before the outer loop moves to the next mirror. Consider special-casing urllib.error.HTTPError to retry only likely-transient status codes (e.g., 429/5xx) and fail fast otherwise.
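A sketch of that suggestion, written as the revised retry loop inside _download_with_retries(); the set of transient status codes here is illustrative, not part of the PR:

for attempt in range(max_retries):
    try:
        # Context manager also addresses the response-leak comments below.
        with urllib.request.urlopen(url, timeout=timeout) as response:
            download_to_file(response, path)
        return
    except urllib.error.HTTPError as e:
        if os.path.exists(path):
            os.remove(path)
        if e.code not in (429, 500, 502, 503, 504):
            raise  # fail fast on 403/404 so the caller tries the next mirror
    except Exception:
        if os.path.exists(path):
            os.remove(path)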
  response = urllib.request.urlopen(
-     "https://www.apache.org/dyn/closer.lua?preferred=true"
+     "https://www.apache.org/dyn/closer.lua?preferred=true", timeout=10
  )
  mirror_urls.append(response.read().decode("utf-8"))
urlopen() responses should be closed to avoid leaking sockets/file descriptors. Consider using a context manager (with urllib.request.urlopen(...) as response:) around the response.read() call here, or closing the response in a try/finally block.
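For example, the mirror-resolution call could take this form (a sketch of the suggestion, not the merged code):

with urllib.request.urlopen(
    "https://www.apache.org/dyn/closer.lua?preferred=true", timeout=10
) as response:
    mirror_urls.append(response.read().decode("utf-8"))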
try:
    response = urllib.request.urlopen(url, timeout=timeout)
    download_to_file(response, path)
    return
response is never closed, so each attempt can leak an open HTTP connection/file descriptor (especially problematic with retries). Wrap the urlopen call in a context manager or ensure response.close() runs in a finally block even when download_to_file raises.
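With a context manager, the attempt body quoted above could read:

with urllib.request.urlopen(url, timeout=timeout) as response:
    download_to_file(response, path)
return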
thanks for the fix.
Merged to master, thank you @dongjoon-hyun
What changes were proposed in this pull request?
This PR adds retry logic and timeout handling to the Spark distribution download in pyspark/install.py to reduce flakiness in pyspark.tests.test_install_spark.

Changes:
- Add timeout=10 to the mirror resolution urlopen() call in get_preferred_mirrors() — prevents hanging when closer.lua is unresponsive
- Add a _download_with_retries() helper — wraps the download with a timeout on urlopen() to prevent indefinite hangs, retries with backoff between attempts, and cleanup of partially downloaded files (a sketch follows this list)
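A simplified sketch of what such a helper can look like; the timeout value, retry count, and backoff schedule here are illustrative, and download_to_file is the existing helper referenced in the diff:

import os
import time
import urllib.request

def _download_with_retries(url, path, timeout=60, max_retries=3):
    """Download url to path, cleaning up partial files and retrying with backoff."""
    last_error = None
    for attempt in range(max_retries):
        try:
            response = urllib.request.urlopen(url, timeout=timeout)
            try:
                download_to_file(response, path)
            finally:
                response.close()  # avoid leaking the connection across retries
            return
        except Exception as e:
            last_error = e
            # Remove any partially downloaded file so the next attempt starts clean.
            if os.path.exists(path):
                os.remove(path)
            if attempt < max_retries - 1:
                time.sleep(2 ** attempt)  # simple exponential backoff
    raise IOError("Unable to download %s: %s" % (url, last_error))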
Why are the changes needed?
The pyspark-install CI job frequently fails due to transient network issues when downloading ~400MB Spark distributions from Apache mirrors. Current issues:
- urlopen() has no timeout — downloads can hang indefinitely
- downloads are not retried, so a single transient failure fails the whole job

Does this PR introduce any user-facing change?
No. The download behavior is improved with retries and timeouts, but the API is unchanged. Users who call install_spark or pip install pyspark with PYSPARK_HADOOP_VERSION will benefit from more reliable downloads.

How was this patch tested?
- Existing unit tests pass (test_package_name, test_checked_versions)
- The install test (test_install_spark) exercises the new retry path in CI

Was this patch authored or co-authored using generative AI tooling?
No.