
feat(scraper): Smart retry – only retry transient errors, fail fast on permanent#345

Open
kenzaelk98 wants to merge 2 commits into arabold:main from kenzaelk98:feature/smart-retry-logic

Conversation

@kenzaelk98

Issue

The scraper retries every failed request (including permanent errors like 500) up to 7 times with exponential backoff. That wastes time on broken or misconfigured sites and clutters logs, since those errors will not succeed on retry.

Summary of the proposed enhancement

Add smart retry logic so only transient errors are retried; permanent errors (e.g. 4xx, 500) fail fast instead of using all retries.

Changes

  • Retry only on transient HTTP statuses: 408, 429, 502, 503, 504
  • Retry only on transient network errors: ETIMEDOUT, ECONNRESET, ECONNREFUSED
  • Do not retry on permanent errors (4xx except 408/429, 500, other 5xx)
  • Log once when skipping retry: Permanent error, not retrying: <status/code>
  • Existing max retries and exponential backoff are unchanged
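The rules above could be sketched roughly as follows. This is a hypothetical helper, not the actual project code: the split of errors into an HTTP status and a Node.js error code, and the `shouldRetry` name, are assumptions for illustration.

```typescript
// Sketch of the proposed retry predicate. The error shape (separate
// HTTP status vs. network error code) and all names here are assumed.
const TRANSIENT_STATUSES = new Set([408, 429, 502, 503, 504]);
const TRANSIENT_NETWORK_CODES = new Set([
  "ETIMEDOUT",
  "ECONNRESET",
  "ECONNREFUSED",
]);

function shouldRetry(status?: number, code?: string): boolean {
  if (status !== undefined) {
    // Allowlist: only transient statuses retry; every other 4xx/5xx
    // (including 500) is treated as permanent and fails fast.
    return TRANSIENT_STATUSES.has(status);
  }
  if (code !== undefined) {
    return TRANSIENT_NETWORK_CODES.has(code);
  }
  // No status and no recognizable code: permanent under this proposal.
  return false;
}

console.log(shouldRetry(503)); // true  (transient)
console.log(shouldRetry(500)); // false (fails fast)
console.log(shouldRetry(undefined, "ECONNRESET")); // true
```

The retry loop would consult this predicate before scheduling the next backoff delay, and emit the "Permanent error, not retrying" log line once when it returns false.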

Testing

Tested locally; behavior is as expected. Scraping one library completed about 10 minutes faster.


Copilot AI left a comment


Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@kenzaelk98
Author

Hi @arabold, it seems Copilot encountered an error while reviewing 😕


Copilot AI left a comment


Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@kenzaelk98
Author

> Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@arabold it did it again 🙂‍↔️ – hopefully the next one works 🥲

Owner

@arabold arabold left a comment


Sorry, it took me a while to do a full review of this. Not sure why Copilot refused it twice now 🤷 Anyway, thanks for your contribution, @kenzaelk98 !

I left some comments, and I would generally prefer keeping most of the existing functionality unchanged to avoid regressions; see my specific code comments. In addition, the PR introduces a behavioral change here: completely unknown errors (no HTTP status, no recognizable error code) stop being retried.

I think this is debatable, and I can see both sides. The current "retry unknown errors" behavior is the safer default for a scraper, but it's true that non-network errors would also be retried (an accidental null or undefined type error, for example). What issue did you observe specifically? Your PR suggests only not retrying 500 errors, but was that the main intention?

@kenzaelk98
Author

kenzaelk98 commented Feb 23, 2026

> What issue did you observe specifically? Your PR suggests only not retrying 500 errors, but was that the main intention?

Thank you for the review! ☺️
I get that 500 can be transient (overload, deploys) and that many libraries retry it. For a scraper that hits many different, often unreliable doc/API hosts, the cost of retrying permanent 500s felt higher than the benefit of retrying the occasional transient one. When 500s are permanent (e.g. broken endpoints), retrying them 7 times with exponential backoff adds roughly a minute per failed request; in one run that meant about 10 minutes lost on a single library before it finally failed.

As a middle ground, I’ve kept 500 retryable but capped it at 3 attempts total (1 initial + 2 retries). That way transient 500s still get a couple of retries, while permanent 500s fail in ~3–4 seconds instead of ~63. Other status codes (408, 429, 502, 503, 504, 525) and network errors still use the full maxRetries (default 7).

I’ve added 525 back as retryable (Cloudflare/cert rotation) and reverted to a blocklist for network errors so we don’t miss transient codes like EAI_AGAIN, EPIPE, EHOSTUNREACH. I kept the shouldRetry() refactor and the “permanent error, not retrying” log.
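A rough sketch of that middle ground, under stated assumptions: the attempt-count convention, the function signature, and the blocklist contents below are illustrative, not the actual diff.

```typescript
// Illustrative sketch of the revised rules; the real PR code differs.
// 408/429/500/502/503/504/525 are retryable by status, but 500 alone is
// capped at 3 attempts total (1 initial + 2 retries). With an assumed
// 1 s base and doubling backoff, 7 attempts wait 1+2+4+8+16+32 = 63 s
// in total, while the 3-attempt cap waits only 1+2 = 3 s before giving up.
const RETRYABLE_STATUSES = new Set([408, 429, 500, 502, 503, 504, 525]);

// Blocklist instead of allowlist for network errors, so transient codes
// like EAI_AGAIN, EPIPE, EHOSTUNREACH stay retryable by default.
// ENOTFOUND is only an example of a code one might treat as permanent.
const PERMANENT_NETWORK_CODES = new Set(["ENOTFOUND"]);

function shouldRetry(
  attemptsMade: number, // attempts already completed, starting at 1
  maxAttempts: number,  // full budget, e.g. 1 initial + maxRetries
  status?: number,
  code?: string,
): boolean {
  if (status !== undefined) {
    if (!RETRYABLE_STATUSES.has(status)) return false; // permanent, fail fast
    const cap = status === 500 ? Math.min(3, maxAttempts) : maxAttempts;
    return attemptsMade < cap;
  }
  if (code !== undefined && PERMANENT_NETWORK_CODES.has(code)) return false;
  // Unknown or transient network errors keep the full retry budget.
  return attemptsMade < maxAttempts;
}

console.log(shouldRetry(1, 8, 500)); // true  (first 500, retry)
console.log(shouldRetry(3, 8, 500)); // false (3 attempts used, give up)
console.log(shouldRetry(3, 8, 503)); // true  (503 keeps the full budget)
```

The key design choice is the asymmetry: HTTP statuses use an allowlist (a status is a definite signal), while network error codes use a blocklist (the long tail of transient codes makes an allowlist fragile).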

ivanzud pushed a commit to ivanzud/docs-mcp-server that referenced this pull request Mar 13, 2026
Cherry-pick PR arabold#345 from upstream (kenzaelk98).

Only retry on transient HTTP statuses (408, 429, 502, 503, 504) and
transient network errors (ETIMEDOUT, ECONNRESET, ECONNREFUSED). Permanent
errors (4xx except 408/429, 500) fail immediately instead of exhausting
all retries with exponential backoff. ~10min faster on large scrapes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@arabold
Owner

arabold commented Mar 29, 2026

Sorry for the lack of feedback on this. I was going back and forth, as adding 100+ lines of code just to adjust the retry count doesn't seem justified. Here's an alternative proposal that incorporates your core idea and adds a true fail-fast implementation that will abort any crawl if a certain failure threshold is reached: #377

