fix: add network timeouts to discoverValidSitemaps to prevent indefinite hangs#3429
stakeswky wants to merge 2 commits into apify:master
Conversation
fix: add network timeouts to discoverValidSitemaps to prevent indefinite hangs

The discoverValidSitemaps function makes HTTP requests (HEAD for URL existence checks, GET for robots.txt) without any timeout configuration. When a server stalls or never responds, these requests hang indefinitely, blocking the async iterator and preventing the crawler from starting.

Changes:
- Add networkTimeouts option to discoverValidSitemaps (default: 60s)
- Pass timeout to urlExists HEAD requests
- Add networkTimeouts parameter to RobotsTxtFile.find/load
- Pass timeout through to robots.txt fetch

Fixes apify#3412
barjin
left a comment
Thank you for your contribution @stakeswky !
I have a few high-level comments.
In Crawlee v4 (the upcoming version of Crawlee), we are removing the direct dependency on got-scraping and allowing users to pass custom HttpClient implementations instead. To make your changes more future-proof, could you please change this so the methods accept a timeout?: number parameter (not the got-scraping-specific Delays type)?
Related to this - I believe most users will understand any timeout-related option as a "global" timeout for the entire find() / load() call. This is currently not the case in e.g. discoverValidSitemaps, which applies the networkTimeouts to each request separately. While this is technically correct (and I can imagine it being useful), I'd rather have the "global" timeout for the sake of API simplicity. You can achieve this by e.g. passing an AbortSignal to got-scraping here.
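The "global timeout" idea can be sketched as a small helper: one AbortController whose signal is shared by every request the call makes, so a single budget bounds the whole operation. This is a minimal sketch assuming the HTTP client accepts an AbortSignal (got, which got-scraping wraps, exposes a signal option); withGlobalTimeout and run are illustrative names, not Crawlee code.

```typescript
// Hedged sketch of a "global" timeout: one AbortController whose signal is
// shared by every request made during a find()/load() call, so the timeout
// bounds the whole call rather than each request individually.
// `run` stands in for the body that issues the HTTP requests.
async function withGlobalTimeout<T>(
    timeout: number | undefined,
    run: (signal: AbortSignal) => Promise<T>,
): Promise<T> {
    const controller = new AbortController();
    // Abort everything sharing this signal once the budget elapses.
    const handle = timeout !== undefined
        ? setTimeout(() => controller.abort(), timeout)
        : undefined;
    try {
        return await run(controller.signal);
    } finally {
        // Clear the timer so it does not keep the event loop alive.
        if (handle !== undefined) clearTimeout(handle);
    }
}
```

Each HTTP call inside run would receive the same signal (e.g. via the client's signal option), so one elapsed budget cancels all in-flight requests at once.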
All this being said, I'm definitely open for discussion, feel free to share your ideas if you disagree with some of the points. Cheers!
Per reviewer feedback:
- Remove the got-scraping-specific Delays type; accept timeout?: number instead
- Create an AbortController internally in discoverValidSitemaps and pass the signal to every HTTP request (urlExists, RobotsFile.find, load), so the entire discovery call is cancelled once the timeout elapses, not just individual requests
- Clear the timeout handle in a finally block to avoid leaks
- RobotsTxtFile.find / load now accept an AbortSignal directly, keeping the API future-proof for Crawlee v4, which drops the got-scraping dependency
stakeswky
left a comment
Thanks for the thorough feedback, @barjin!
Both points addressed in the latest push:
- Delays → timeout?: number — removed the got-scraping-specific type entirely. discoverValidSitemaps and RobotsTxtFile.find/load now accept a plain number (ms) or an AbortSignal, respectively.
- Global timeout via AbortSignal — discoverValidSitemaps now creates an AbortController internally, fires controller.abort() after timeout ms, and passes the signal to every HTTP call (urlExists, RobotsFile.find, and the underlying load). The timeout handle is cleared in a finally block to avoid leaks. This makes the timeout a true global budget for the whole discovery call rather than a per-request cap.
Happy to adjust if you'd prefer the AbortSignal to be caller-supplied instead (so callers can cancel externally too) — just let me know.
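The flow described in the two points above can be sketched as follows. All names here are simplified stand-ins for illustration, not Crawlee's actual internals; the two helper functions model where the real code would issue the robots.txt GET and the sitemap HEAD checks with the shared signal.

```typescript
// Stand-ins for the real HTTP helpers; each honours the shared signal.
async function findInRobotsTxt(url: string, signal: AbortSignal): Promise<string[]> {
    if (signal.aborted) throw new Error('aborted'); // real code: GET robots.txt with { signal }
    return [];
}

async function urlExists(url: string, signal: AbortSignal): Promise<boolean> {
    if (signal.aborted) throw new Error('aborted'); // real code: HEAD request with { signal }
    return true;
}

// Hypothetical sketch of the discovery wiring: one AbortController, one
// timeout, one signal shared by every network call in the discovery.
async function discoverValidSitemapsSketch(baseUrl: string, timeout?: number): Promise<string[]> {
    const controller = new AbortController();
    const handle = timeout !== undefined
        ? setTimeout(() => controller.abort(), timeout)
        : undefined;
    const { signal } = controller;
    try {
        const fromRobots = await findInRobotsTxt(baseUrl, signal);
        const candidates = [...fromRobots, new URL('/sitemap.xml', baseUrl).href];
        const valid: string[] = [];
        for (const candidate of candidates) {
            if (await urlExists(candidate, signal)) valid.push(candidate);
        }
        return valid;
    } finally {
        if (handle !== undefined) clearTimeout(handle); // avoid leaking the timer
    }
}
```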
Summary
Add network timeouts to discoverValidSitemaps to prevent indefinite hangs when servers stall or never respond.

Problem

discoverValidSitemaps makes HTTP requests without any timeout configuration:
- urlExists — HEAD requests to check if sitemap URLs exist
- RobotsFile.find → RobotsTxtFile.load — GET request to fetch robots.txt

When a server stalls or the connection hangs, these requests block indefinitely, preventing the async iterator from yielding and blocking the crawler from starting.
Fix
- Add a networkTimeouts option to discoverValidSitemaps with a default of { request: 60_000 } (60 seconds)
- Pass the timeout to urlExists HEAD requests via got-scraping's timeout option
- Add a networkTimeouts parameter to RobotsTxtFile.find() and RobotsTxtFile.load(), forwarded to the gotScraping call

API
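To make the documented default concrete, here is a hedged sketch of how the networkTimeouts option could be resolved. The helper name and shape are assumptions for illustration, not the PR's exact code.

```typescript
// Illustrative sketch: resolving networkTimeouts with its documented
// default of { request: 60_000 } milliseconds.
interface NetworkTimeouts {
    request?: number; // ms, forwarded to got-scraping's timeout option
}

// Hypothetical helper, not the PR's code: caller-supplied value wins,
// otherwise the 60-second default applies.
function resolveNetworkTimeouts(networkTimeouts?: NetworkTimeouts): { request: number } {
    return { request: networkTimeouts?.request ?? 60_000 };
}
```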
Fixes #3412