fix: prevent v3 sitemap discovery init hangs in discoverValidSitemaps #3434

Open

nikitachapovskii-dev wants to merge 2 commits into master from fix/prevent-hangs-discovervalidsitemaps

Conversation

@nikitachapovskii-dev (Contributor)

In v3, discoverValidSitemaps could occasionally hang during initialization (before crawler startup), especially on proxy-heavy targets used by Website Content Crawler.

Root cause:
Discovery requests (GET /robots.txt and HEAD sitemap checks) used the default got-scraping behavior. In this path, the combination of HTTP/2 and browser-header generation could become unstable and stall on some target/proxy combinations.

What changed:
Updated discoverValidSitemaps internals in packages/utils/src/internals/sitemap.ts.
Added dedicated discovery request options:

  • http2: false
  • useHeaderGenerator: false

Applied these options consistently to:

  • robots.txt fetch
  • sitemap candidate HEAD checks
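The change above can be sketched roughly as follows. This is an illustrative reconstruction, not the actual packages/utils/src/internals/sitemap.ts diff: `http2` and `useHeaderGenerator` are real got-scraping options, but the helper names and structure here are hypothetical.

```typescript
// Simplified request profile for discovery probes: plain HTTP/1.1,
// no browser-like header generation.
const DISCOVERY_REQUEST_OPTIONS = {
    http2: false,              // force HTTP/1.1 for discovery probes
    useHeaderGenerator: false, // skip browser-header generation
};

// Build the robots.txt URL for a target: pathname replaced, query dropped.
function getRobotsTxtUrl(url: string): string {
    const robotsTxtFileUrl = new URL(url);
    robotsTxtFileUrl.pathname = '/robots.txt';
    robotsTxtFileUrl.search = '';
    return robotsTxtFileUrl.href;
}

// Usage with got-scraping (import assumed, names illustrative):
//   const { body } = await gotScraping({ url: getRobotsTxtUrl(target), ...DISCOVERY_REQUEST_OPTIONS });
//   const { statusCode } = await gotScraping({ url: candidate, method: 'HEAD', ...DISCOVERY_REQUEST_OPTIONS });
```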

Note: this PR intentionally keeps got-scraping since we’re on v3; this gives us a minimal, safer fix for the hang without replacing the HTTP stack or introducing broader regressions.

Tested locally with a patched @crawlee/utils.

Closes #3412

@github-actions github-actions bot added this to the 135th sprint - Tooling team milestone Feb 23, 2026
@github-actions github-actions bot added the t-tooling Issues with this label are in the ownership of the tooling team. label Feb 23, 2026
@barjin (Member) left a comment


There is another PR touching similar parts of the code. They are not mutually exclusive (I'd still like to have the timeout option here, even if we merge this), so just a heads-up that there might be some conflicts.

Could you please elaborate more on the http2 / fingerprinting issues you're currently having?
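The "timeout option" mentioned above could look something like the following sketch: race the discovery promise against a deadline so a stalled probe cannot hang initialization forever. This is not an actual Crawlee API; `withTimeout` is an illustrative helper and the `discoverValidSitemaps` usage is an assumption.

```typescript
// Hypothetical helper: reject if the wrapped promise does not settle in time.
async function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
    let timer: ReturnType<typeof setTimeout> | undefined;
    const deadline = new Promise<never>((_, reject) => {
        timer = setTimeout(() => reject(new Error(`Discovery timed out after ${ms} ms`)), ms);
    });
    try {
        return await Promise.race([promise, deadline]);
    } finally {
        if (timer !== undefined) clearTimeout(timer); // avoid a dangling timer
    }
}

// Usage (assumed signature):
//   const sitemaps = await withTimeout(discoverValidSitemaps(url), 30_000);
```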

Comment on lines 489 to 491
const robotsTxtFileUrl = new URL(url);
robotsTxtFileUrl.pathname = '/robots.txt';
robotsTxtFileUrl.search = '';
Member

The new URL constructor might be a more concise choice

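The suggestion presumably refers to the two-argument form of the URL constructor, which resolves an absolute path against a base URL in one step, replacing the pathname and dropping any query string (the example input below is illustrative):

```typescript
// Two-argument form: '/robots.txt' is resolved against the base URL,
// replacing the pathname and dropping the query string in one step.
const url = 'https://example.com/some/page?q=1'; // example input
const robotsTxtFileUrl = new URL('/robots.txt', url);
// robotsTxtFileUrl.href === 'https://example.com/robots.txt'
```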

@nikitachapovskii-dev (Contributor, Author)

Could you please elaborate more on the http2 / fingerprinting issues you're currently having?

Sure, here's all I got after investigating this:

  • The stall happens while awaiting discoverValidSitemaps (before any crawler starts).
  • v4 no longer uses this same discovery profile, and we don’t see this issue there.
  • With the same inputs/runtime, switching only discovery probes to a simpler profile (http2: false, useHeaderGenerator: false) removed those init stalls in our repeated validation runs.

@barjin (Member) left a comment


Thank you, @nikitachapovskii-dev !

I don't want to block your downstream fixes (so I'm approving this), but I'm still curious - could you please share a code example with a URL that is affected by this?

Although it has its quirks, I always considered got-scraping with the default settings to be reasonably battle-tested. If there is a bug like this, maybe we should fix it there, as it looks quite severe.



Development

Successfully merging this pull request may close these issues.

discoverValidSitemaps can hang indefinitely
