fix: add network timeouts to discoverValidSitemaps to prevent indefinite hangs #3429

Open
stakeswky wants to merge 2 commits into apify:master from stakeswky:fix/sitemap-discovery-timeout

Conversation

@stakeswky

Summary

Add network timeouts to discoverValidSitemaps to prevent indefinite hangs when servers stall or never respond.

Problem

discoverValidSitemaps makes HTTP requests without any timeout configuration:

  1. urlExists — HEAD requests to check if sitemap URLs exist
  2. RobotsFile.find / RobotsTxtFile.load — GET request to fetch robots.txt

When a server stalls or the connection hangs, these requests block indefinitely, preventing the async iterator from yielding and blocking the crawler from starting.

Fix

  • Added networkTimeouts option to discoverValidSitemaps with a default of { request: 60_000 } (60 seconds)
  • Pass timeout to urlExists HEAD requests via got-scraping's timeout option
  • Added networkTimeouts parameter to RobotsTxtFile.find() and RobotsTxtFile.load(), forwarded to the gotScraping call
  • All changes are backward-compatible — existing callers without the option get the 60s default

API

```ts
// Before: could hang indefinitely
for await (const url of discoverValidSitemaps(urls)) { ... }

// After: 60s default timeout, or customizable
for await (const url of discoverValidSitemaps(urls, { networkTimeouts: { request: 30_000 } })) { ... }
```
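The timeout plumbing behind this option can be illustrated with a small sketch. Note this uses the standard `fetch` API rather than got-scraping, and `urlExists` here is an illustrative stand-in for the internal helper, not the exact Crawlee implementation:

```typescript
// Illustrative sketch only — the real helper uses got-scraping's `timeout`
// option; here the same idea is shown with the standard fetch API.
async function urlExists(url: string, timeoutMs = 60_000): Promise<boolean> {
  try {
    const res = await fetch(url, {
      method: 'HEAD',
      // AbortSignal.timeout() aborts the request once timeoutMs elapses,
      // so a stalled server can no longer block the caller forever.
      signal: AbortSignal.timeout(timeoutMs),
    });
    return res.ok;
  } catch {
    // Network error, refused connection, or timeout: treat as "not found".
    return false;
  }
}
```

With this shape, a hung server turns into a `false` result after the timeout instead of an indefinitely pending promise, so the async iterator keeps yielding.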

Fixes #3412


The discoverValidSitemaps function makes HTTP requests (HEAD for URL
existence checks, GET for robots.txt) without any timeout configuration.
When a server stalls or never responds, these requests hang indefinitely,
blocking the async iterator and preventing the crawler from starting.

Changes:
- Add networkTimeouts option to discoverValidSitemaps (default: 60s)
- Pass timeout to urlExists HEAD requests
- Add networkTimeouts parameter to RobotsTxtFile.find/load
- Pass timeout through to robots.txt fetch

Fixes apify#3412
@barjin barjin self-requested a review February 23, 2026 08:19
Member

@barjin barjin left a comment


Thank you for your contribution, @stakeswky!

I have a few high-level comments.

In Crawlee v4 (the upcoming version of Crawlee), we remove the direct dependency on got-scraping and allow users to pass custom HttpClient implementations instead. To make your changes more future-proof, could you please change this so the methods accept a timeout?: number parameter rather than got-scraping's Delays type?

Related to this - I believe most users will understand any timeout-related option as a "global" timeout for the entire find() / load() call. This is currently not the case in e.g. discoverValidSitemaps, which applies the networkTimeouts to each request separately. While this is technically correct (and I can imagine it being useful), I'd rather have the "global" timeout for the sake of API simplicity. You can achieve this by e.g. passing an AbortSignal to got-scraping here.

All this being said, I'm definitely open to discussion; feel free to share your ideas if you disagree with some of the points. Cheers!
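The "global budget" the review suggests can be sketched as a small wrapper. The name `withGlobalTimeout` is hypothetical, not part of Crawlee's API; it only illustrates the AbortController pattern:

```typescript
// Hypothetical helper illustrating the suggestion: one AbortController
// bounds the entire operation, not each request individually.
async function withGlobalTimeout<T>(
  fn: (signal: AbortSignal) => Promise<T>,
  timeoutMs: number,
): Promise<T> {
  const controller = new AbortController();
  const handle = setTimeout(() => controller.abort(), timeoutMs);
  try {
    // Every request made inside `fn` should forward `signal`, so they all
    // draw on the same time budget.
    return await fn(controller.signal);
  } finally {
    clearTimeout(handle); // always release the timer handle
  }
}
```

A caller would wrap the whole find()/load() sequence in one `withGlobalTimeout` call, so the total wall-clock time is bounded regardless of how many requests run inside.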

Per reviewer feedback:
- Remove got-scraping-specific Delays type; accept timeout?: number instead
- Create AbortController internally in discoverValidSitemaps and pass the
  signal to every HTTP request (urlExists, RobotsFile.find, load), so the
  entire discovery call is cancelled once the timeout elapses — not just
  individual requests
- Clear the timeout handle in a finally block to avoid leaks
- RobotsTxtFile.find / load now accept AbortSignal directly, keeping the
  API future-proof for Crawlee v4 which drops the got-scraping dependency
Author

@stakeswky stakeswky left a comment


Thanks for the thorough feedback, @barjin!

Both points addressed in the latest push:

  1. **Delays → timeout?: number** — removed the got-scraping-specific type entirely. discoverValidSitemaps and RobotsTxtFile.find/load now accept a plain number (ms) and an AbortSignal, respectively.

  2. **Global timeout via AbortSignal** — discoverValidSitemaps now creates an AbortController internally, fires controller.abort() after timeout ms, and passes the signal to every HTTP call (urlExists, RobotsFile.find, and the underlying load). The timeout handle is cleared in a finally block to avoid leaks. This makes the timeout a true global budget for the whole discovery call rather than a per-request cap.

Happy to adjust if you'd prefer the AbortSignal to be caller-supplied instead (so callers can cancel externally too) — just let me know.
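The behaviour described in point 2 — several sequential requests drawing on one shared budget — can be demonstrated with stubbed requests. All names below are illustrative, not Crawlee internals:

```typescript
// Stubbed "request" that resolves after `ms`, or rejects if the shared
// signal aborts first — standing in for urlExists / robots.txt fetches.
function fakeRequest(ms: number, signal: AbortSignal): Promise<string> {
  return new Promise((resolve, reject) => {
    const t = setTimeout(() => resolve('ok'), ms);
    signal.addEventListener('abort', () => {
      clearTimeout(t);
      reject(new Error('aborted'));
    });
  });
}

// Sequential requests sharing one AbortController: the whole loop, not each
// call, is bounded by `timeoutMs`.
async function discoverWithBudget(delays: number[], timeoutMs: number): Promise<string[]> {
  const controller = new AbortController();
  const handle = setTimeout(() => controller.abort(), timeoutMs);
  const results: string[] = [];
  try {
    for (const ms of delays) {
      results.push(await fakeRequest(ms, controller.signal));
    }
    return results;
  } finally {
    clearTimeout(handle); // cleared in `finally` to avoid leaking the timer
  }
}
```

With a generous budget all stubbed requests succeed; with a budget shorter than their combined duration, the controller aborts mid-sequence and the whole call rejects, which is the "global budget" semantics rather than a per-request cap.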


Development

Successfully merging this pull request may close these issues.

discoverValidSitemaps can hang indefinitely
