fix: prevent v3 sitemap discovery init hangs in discoverValidSitemaps #3434

Open

nikitachapovskii-dev wants to merge 2 commits into master from fix/prevent-hangs-discovervalidsitemaps

Conversation

@nikitachapovskii-dev (Contributor)

In v3, discoverValidSitemaps could occasionally hang during initialization (before crawler startup), especially on proxy-heavy targets used by Website Content Crawler.

Root cause:
Discovery requests (GET /robots.txt and HEAD sitemap checks) used the default got-scraping behavior. In this path, the combination of HTTP/2 and browser-header generation could become unstable and stall on some target/proxy combinations.

What changed:
Updated discoverValidSitemaps internals in packages/utils/src/internals/sitemap.ts.
Added dedicated discovery request options:

  • http2: false
  • useHeaderGenerator: false

Applied these options consistently to:

  • robots.txt fetch
  • sitemap candidate HEAD checks
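The change above can be sketched roughly as follows. This is an illustrative reconstruction, not the actual packages/utils/src/internals/sitemap.ts diff: `http2` and `useHeaderGenerator` are real got-scraping options, but the helper names and structure here are hypothetical.

```typescript
// Simplified request profile for discovery probes: plain HTTP/1.1,
// no browser-like header generation.
const DISCOVERY_REQUEST_OPTIONS = {
    http2: false,              // force HTTP/1.1 for discovery probes
    useHeaderGenerator: false, // skip browser-header generation
};

// Build the robots.txt URL for a target: pathname replaced, query dropped.
function getRobotsTxtUrl(url: string): string {
    const robotsTxtFileUrl = new URL(url);
    robotsTxtFileUrl.pathname = '/robots.txt';
    robotsTxtFileUrl.search = '';
    return robotsTxtFileUrl.href;
}

// Usage with got-scraping (import assumed, names illustrative):
//   const { body } = await gotScraping({ url: getRobotsTxtUrl(target), ...DISCOVERY_REQUEST_OPTIONS });
//   const { statusCode } = await gotScraping({ url: candidate, method: 'HEAD', ...DISCOVERY_REQUEST_OPTIONS });
```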

Note: this PR intentionally keeps got-scraping since we’re on v3; this gives us a minimal, safer fix for the hang without replacing the HTTP stack or introducing broader regressions.

Tested locally with a patched @crawlee/utils.

Closes #3412

@github-actions github-actions bot added this to the 135th sprint - Tooling team milestone Feb 23, 2026
@github-actions github-actions bot added the t-tooling Issues with this label are in the ownership of the tooling team. label Feb 23, 2026
@barjin (Member) left a comment


There is another PR touching similar parts of the code. They are not mutually exclusive (I'd still like to have the timeout option here, even if we merge this), so just a heads-up that there might be some conflicts.

Could you please elaborate more on the http2 / fingerprinting issues you're currently having?
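The "timeout option" mentioned above could look something like the following sketch: race the discovery promise against a deadline so a stalled probe cannot hang initialization forever. This is not an actual Crawlee API; `withTimeout` is an illustrative helper and the `discoverValidSitemaps` usage is an assumption.

```typescript
// Hypothetical helper: reject if the wrapped promise does not settle in time.
async function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
    let timer: ReturnType<typeof setTimeout> | undefined;
    const deadline = new Promise<never>((_, reject) => {
        timer = setTimeout(() => reject(new Error(`Discovery timed out after ${ms} ms`)), ms);
    });
    try {
        return await Promise.race([promise, deadline]);
    } finally {
        if (timer !== undefined) clearTimeout(timer); // avoid a dangling timer
    }
}

// Usage (assumed signature):
//   const sitemaps = await withTimeout(discoverValidSitemaps(url), 30_000);
```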

Comment on lines 489 to 491
const robotsTxtFileUrl = new URL(url);
robotsTxtFileUrl.pathname = '/robots.txt';
robotsTxtFileUrl.search = '';
Member

The new URL constructor might be a more concise choice

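The suggestion presumably refers to the two-argument form of the URL constructor, which resolves an absolute path against a base URL in one step, replacing the pathname and dropping any query string (the example input below is illustrative):

```typescript
// Two-argument form: '/robots.txt' is resolved against the base URL,
// replacing the pathname and dropping the query string in one step.
const url = 'https://example.com/some/page?q=1'; // example input
const robotsTxtFileUrl = new URL('/robots.txt', url);
// robotsTxtFileUrl.href === 'https://example.com/robots.txt'
```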

@nikitachapovskii-dev (Contributor, Author)

Could you please elaborate more on the http2 / fingerprinting issues you're currently having?

Sure, here's all I got after investigating this:

  • The stall happens while awaiting discoverValidSitemaps (before any crawler starts).
  • v4 no longer uses this same discovery profile, and we don’t see this issue there.
  • With the same inputs/runtime, switching only discovery probes to a simpler profile (http2: false, useHeaderGenerator: false) removed those init stalls in our repeated validation runs.

@barjin (Member) left a comment


Thank you, @nikitachapovskii-dev !

I don't want to block your downstream fixes (so I'm approving this), but I'm still curious - could you please share a code example with a URL that is affected by this?

Although it has its quirks, I always considered got-scraping with the default settings to be reasonably battle-tested. If there is a bug like this, maybe we should fix it there, as it looks quite severe.



Development

Successfully merging this pull request may close these issues.

discoverValidSitemaps can hang indefinitely
