fix: add network timeouts to discoverValidSitemaps to prevent indefinite hangs#3429
stakeswky wants to merge 2 commits into apify:master
Conversation
fix: add network timeouts to discoverValidSitemaps to prevent indefinite hangs

The discoverValidSitemaps function makes HTTP requests (HEAD for URL existence checks, GET for robots.txt) without any timeout configuration. When a server stalls or never responds, these requests hang indefinitely, blocking the async iterator and preventing the crawler from starting.

Changes:
- Add networkTimeouts option to discoverValidSitemaps (default: 60s)
- Pass timeout to urlExists HEAD requests
- Add networkTimeouts parameter to RobotsTxtFile.find/load
- Pass timeout through to robots.txt fetch

Fixes apify#3412
barjin
left a comment
Thank you for your contribution @stakeswky !
I have a few high-level comments.
In Crawlee v4 (the upcoming version of Crawlee), we are removing the direct dependency on got-scraping and allowing users to pass custom HttpClient implementations instead. To make your changes more future-proof, could you please change this so the methods accept a timeout?: number parameter (not the got-scraping-specific Delays type)?
Related to this - I believe most users will understand any timeout-related option as a "global" timeout for the entire find() / load() call. This is currently not the case in e.g. discoverValidSitemaps, which applies the networkTimeouts to each request separately. While this is technically correct (and I can imagine it being useful), I'd rather have the "global" timeout for the sake of API simplicity. You can achieve this by e.g. passing an AbortSignal to got-scraping here.
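The "global timeout" idea can be sketched as a small helper: one AbortController whose signal is shared by every request the call makes, so a single budget bounds the whole operation. This is a minimal sketch assuming the HTTP client accepts an AbortSignal (got, which got-scraping wraps, exposes a signal option); withGlobalTimeout and run are illustrative names, not Crawlee code.

```typescript
// Hedged sketch of a "global" timeout: one AbortController whose signal is
// shared by every request made during a find()/load() call, so the timeout
// bounds the whole call rather than each request individually.
// `run` stands in for the body that issues the HTTP requests.
async function withGlobalTimeout<T>(
    timeout: number | undefined,
    run: (signal: AbortSignal) => Promise<T>,
): Promise<T> {
    const controller = new AbortController();
    // Abort everything sharing this signal once the budget elapses.
    const handle = timeout !== undefined
        ? setTimeout(() => controller.abort(), timeout)
        : undefined;
    try {
        return await run(controller.signal);
    } finally {
        // Clear the timer so it does not keep the event loop alive.
        if (handle !== undefined) clearTimeout(handle);
    }
}
```

Each HTTP call inside run would receive the same signal (e.g. via the client's signal option), so one elapsed budget cancels all in-flight requests at once.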
All this being said, I'm definitely open for discussion, feel free to share your ideas if you disagree with some of the points. Cheers!
Per reviewer feedback:
- Remove the got-scraping-specific Delays type; accept timeout?: number instead
- Create an AbortController internally in discoverValidSitemaps and pass the signal to every HTTP request (urlExists, RobotsFile.find, load), so the entire discovery call is cancelled once the timeout elapses, not just individual requests
- Clear the timeout handle in a finally block to avoid leaks
- RobotsTxtFile.find / load now accept an AbortSignal directly, keeping the API future-proof for Crawlee v4, which drops the got-scraping dependency
stakeswky
left a comment
Thanks for the thorough feedback, @barjin!
Both points addressed in the latest push:
- Delays → timeout?: number — removed the got-scraping-specific type entirely. discoverValidSitemaps and RobotsTxtFile.find/load now accept a plain number (ms) or an AbortSignal, respectively.
- Global timeout via AbortSignal — discoverValidSitemaps now creates an AbortController internally, fires controller.abort() after timeout ms, and passes the signal to every HTTP call (urlExists, RobotsFile.find, and the underlying load). The timeout handle is cleared in a finally block to avoid leaks. This makes the timeout a true global budget for the whole discovery call rather than a per-request cap.
Happy to adjust if you'd prefer the AbortSignal to be caller-supplied instead (so callers can cancel externally too) — just let me know.
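The flow described in the two points above can be sketched as follows. All names here are simplified stand-ins for illustration, not Crawlee's actual internals; the two helper functions model where the real code would issue the robots.txt GET and the sitemap HEAD checks with the shared signal.

```typescript
// Stand-ins for the real HTTP helpers; each honours the shared signal.
async function findInRobotsTxt(url: string, signal: AbortSignal): Promise<string[]> {
    if (signal.aborted) throw new Error('aborted'); // real code: GET robots.txt with { signal }
    return [];
}

async function urlExists(url: string, signal: AbortSignal): Promise<boolean> {
    if (signal.aborted) throw new Error('aborted'); // real code: HEAD request with { signal }
    return true;
}

// Hypothetical sketch of the discovery wiring: one AbortController, one
// timeout, one signal shared by every network call in the discovery.
async function discoverValidSitemapsSketch(baseUrl: string, timeout?: number): Promise<string[]> {
    const controller = new AbortController();
    const handle = timeout !== undefined
        ? setTimeout(() => controller.abort(), timeout)
        : undefined;
    const { signal } = controller;
    try {
        const fromRobots = await findInRobotsTxt(baseUrl, signal);
        const candidates = [...fromRobots, new URL('/sitemap.xml', baseUrl).href];
        const valid: string[] = [];
        for (const candidate of candidates) {
            if (await urlExists(candidate, signal)) valid.push(candidate);
        }
        return valid;
    } finally {
        if (handle !== undefined) clearTimeout(handle); // avoid leaking the timer
    }
}
```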
Summary
Add network timeouts to discoverValidSitemaps to prevent indefinite hangs when servers stall or never respond.

Problem

discoverValidSitemaps makes HTTP requests without any timeout configuration:
- urlExists — HEAD requests to check if sitemap URLs exist
- RobotsFile.find → RobotsTxtFile.load — GET request to fetch robots.txt

When a server stalls or the connection hangs, these requests block indefinitely, preventing the async iterator from yielding and blocking the crawler from starting.
Fix
- Add a networkTimeouts option to discoverValidSitemaps with a default of { request: 60_000 } (60 seconds)
- Pass the timeout to urlExists HEAD requests via got-scraping's timeout option
- Add a networkTimeouts parameter to RobotsTxtFile.find() and RobotsTxtFile.load(), forwarded to the gotScraping call

API
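To make the documented default concrete, here is a hedged sketch of how the networkTimeouts option could be resolved. The helper name and shape are assumptions for illustration, not the PR's exact code.

```typescript
// Illustrative sketch: resolving networkTimeouts with its documented
// default of { request: 60_000 } milliseconds.
interface NetworkTimeouts {
    request?: number; // ms, forwarded to got-scraping's timeout option
}

// Hypothetical helper, not the PR's code: caller-supplied value wins,
// otherwise the 60-second default applies.
function resolveNetworkTimeouts(networkTimeouts?: NetworkTimeouts): { request: number } {
    return { request: networkTimeouts?.request ?? 60_000 };
}
```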
Fixes #3412