fix: prevent v3 sitemap discovery init hangs in discoverValidSitemaps #3434
nikitachapovskii-dev wants to merge 2 commits into master
barjin left a comment:
There is another PR touching similar parts of the code. They are not mutually exclusive (I'd still like to have the timeout option here, even if we merge this), so just a heads-up that there might be some conflicts.
Could you please elaborate more on the http2 / fingerprinting issues you're currently having?
    const robotsTxtFileUrl = new URL(url);
    robotsTxtFileUrl.pathname = '/robots.txt';
    robotsTxtFileUrl.search = '';
Sure, here's all I got after investigating this:
barjin left a comment:
Thank you, @nikitachapovskii-dev!
I don't want to block your downstream fixes (so I'm approving this), but I'm still curious - could you please share a code example with a URL that is affected by this?
Although it has its quirks, I always considered got-scraping with the default settings to be reasonably battle-tested. If there is a bug like this, maybe we should fix it there, as it looks quite severe.

In v3, `discoverValidSitemaps` could occasionally hang during initialization (before crawler startup), especially on proxy-heavy targets used by Website Content Crawler.

Root cause:
Discovery requests (`GET /robots.txt` and `HEAD` sitemap checks) used default `got-scraping` behavior. In this path, HTTP/2 + browser-header generation could become unstable and stall on some targets/proxy combinations.

What changed:
- Updated `discoverValidSitemaps` internals in `packages/utils/src/internals/sitemap.ts`.
- Added dedicated discovery request options:
  - `http2: false`
  - `useHeaderGenerator: false`
- Applied these options consistently to:
  - `HEAD` checks

Note: this PR intentionally keeps got-scraping since we’re on v3; this gives us a minimal, safer fix for the hang without replacing the HTTP stack or introducing broader regressions.

Tested locally with a patched `@crawlee/utils`.

Closes #3412
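
For readers who want to see the shape of the change, here is a minimal sketch of how the dedicated discovery options could be assembled. It is not the code from `sitemap.ts`: the helper names `toRobotsTxtUrl` and `buildDiscoveryRequestOptions`, the `DiscoveryRequestOptions` type, and the example URLs are all hypothetical. Only the `http2: false` and `useHeaderGenerator: false` values and the `robots.txt` URL derivation come from this PR.

```typescript
// Hypothetical shape for the per-request options; only `http2` and
// `useHeaderGenerator` are taken from the PR description.
interface DiscoveryRequestOptions {
    url: string;
    method: 'GET' | 'HEAD';
    http2: false;
    useHeaderGenerator: false;
    proxyUrl?: string;
}

// Derive the robots.txt URL for the host of an arbitrary page URL,
// mirroring the three lines quoted in the review thread.
function toRobotsTxtUrl(url: string): string {
    const robotsTxtFileUrl = new URL(url);
    robotsTxtFileUrl.pathname = '/robots.txt';
    robotsTxtFileUrl.search = '';
    return robotsTxtFileUrl.toString();
}

// Build options for one discovery request with HTTP/2 and browser-header
// generation switched off. The proxy URL is attached only when provided.
function buildDiscoveryRequestOptions(
    url: string,
    method: 'GET' | 'HEAD',
    proxyUrl?: string,
): DiscoveryRequestOptions {
    return {
        url,
        method,
        http2: false,
        useHeaderGenerator: false,
        ...(proxyUrl ? { proxyUrl } : {}),
    };
}

// Example: one robots.txt fetch and one sitemap existence check.
const robotsRequest = buildDiscoveryRequestOptions(
    toRobotsTxtUrl('https://example.com/blog/post?page=2'),
    'GET',
);
const sitemapCheck = buildDiscoveryRequestOptions(
    'https://example.com/sitemap.xml',
    'HEAD',
    'http://proxy.example:8000',
);
console.log(robotsRequest, sitemapCheck);
```

In this sketch the resulting objects are plain data. The assumption is that they would be passed to `gotScraping(...)`, which accepts `http2`, `useHeaderGenerator`, and `proxyUrl` among its options. Building them in one place is one way to keep the `GET` and `HEAD` paths from drifting apart.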