feat(scraper): Smart retry – only retry transient errors, fail fast on permanent #345
kenzaelk98 wants to merge 2 commits into arabold:main
Conversation
…n permanent (e.g. 500)
Hi @arabold, it seems Copilot encountered an error while reviewing 😕
@arabold it did it again 🙂
arabold
left a comment
Sorry, it took me a while to do a full review of this. Not sure why Copilot refused it twice now 🤷 Anyway, thanks for your contribution, @kenzaelk98 !
I left some comments, and I would generally prefer keeping most of the existing functionality unchanged to avoid regressions; see my specific code comments. In addition, the PR introduces a behavioral change: completely unknown errors (no HTTP status, no recognizable error code) are no longer retried.
I think this is debatable, and I can see both sides. The current "retry unknown errors" behavior is the safer default for a scraper, but it's true that non-network errors would also be retried (an accidental null or undefined type error, for example). What specific issue did you observe? Your PR suggests only not retrying 500 errors, but was that the main intention?
Thank you for the review! As a middle ground, I’ve kept 500 retryable but capped it at 3 attempts total (1 initial + 2 retries). That way transient 500s still get a couple of retries, while permanent 500s fail in ~3–4 seconds instead of ~63. Other status codes (408, 429, 502, 503, 504, 525) and network errors still use the full maxRetries (default 7).

I’ve added 525 back as retryable (Cloudflare/cert rotation) and reverted to a blocklist for network errors so we don’t miss transient codes like EAI_AGAIN, EPIPE, EHOSTUNREACH. I kept the shouldRetry() refactor and the “permanent error, not retrying” log.
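The middle-ground policy described in this comment can be sketched roughly as follows. This is an illustrative sketch only; `shouldRetry`, `MAX_RETRIES`, and the permanent-code blocklist are assumed names, not the project's actual API.

```typescript
const MAX_RETRIES = 7;       // default attempt budget for transient errors
const MAX_500_ATTEMPTS = 3;  // 1 initial attempt + 2 retries for HTTP 500

const TRANSIENT_STATUSES = new Set([408, 429, 502, 503, 504, 525]);
// Blocklist: network error codes treated as permanent. Everything else
// (EAI_AGAIN, EPIPE, EHOSTUNREACH, ...) stays retryable by default.
const PERMANENT_NETWORK_CODES = new Set(["ENOTFOUND", "ERR_INVALID_URL"]);

function shouldRetry(attempt: number, status?: number, code?: string): boolean {
  if (status !== undefined) {
    if (TRANSIENT_STATUSES.has(status)) return attempt < MAX_RETRIES;
    if (status === 500) return attempt < MAX_500_ATTEMPTS; // capped, not permanent
    return false; // other 4xx/5xx: fail fast
  }
  if (code !== undefined) {
    // Blocklist rather than allowlist, so unlisted transient codes still retry.
    return !PERMANENT_NETWORK_CODES.has(code);
  }
  return true; // unknown errors keep the safer retry-by-default behavior
}
```

The blocklist choice is the key design point: with an allowlist, any transient network code the list forgot would fail immediately; with a blocklist, only codes known to be permanent are cut short.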
…3 attempts), blocklist for network errors
Cherry-pick PR arabold#345 from upstream (kenzaelk98). Only retry on transient HTTP statuses (408, 429, 502, 503, 504) and transient network errors (ETIMEDOUT, ECONNRESET, ECONNREFUSED). Permanent errors (4xx except 408/429, 500) fail immediately instead of exhausting all retries with exponential backoff. ~10 min faster on large scrapes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
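The cherry-picked variant classifies errors with allowlists on both sides. A minimal sketch, assuming a hypothetical `isTransient` helper (not the project's real function name):

```typescript
// Allowlist approach from the cherry-pick: only these are ever retried.
const TRANSIENT_STATUSES = new Set([408, 429, 502, 503, 504]);
const TRANSIENT_NETWORK_CODES = new Set(["ETIMEDOUT", "ECONNRESET", "ECONNREFUSED"]);

function isTransient(status?: number, code?: string): boolean {
  if (status !== undefined) return TRANSIENT_STATUSES.has(status); // 500 is permanent here
  if (code !== undefined) return TRANSIENT_NETWORK_CODES.has(code);
  return false; // unknown errors fail fast
}
```

Note the trade-off: an allowlist like this treats unlisted network codes such as EAI_AGAIN or EPIPE as permanent, which is the gap the blocklist approach discussed earlier in the thread is meant to close.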
Sorry for the lack of feedback on this. I was going back and forth, as adding 100+ lines of code just to adjust the retry count doesn't seem justified. Here's an alternative proposal that incorporates your core idea and adds a true fail-fast implementation, which will abort any crawl once a certain failure threshold is reached: #377
Issue
The scraper retries every failed request (including permanent errors like 500) up to 7 times with exponential backoff. That wastes time on broken or misconfigured sites and clutters logs, since those errors will not succeed on retry.
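The time cost quoted in the thread follows from simple backoff arithmetic. As a rough sketch (assuming a 1 s base delay that doubles per retry; the project's actual backoff parameters may differ):

```typescript
// Total wait accumulated across N exponential-backoff delays.
// Assumed parameters: 1 s base delay, doubling each retry, no jitter or cap.
function totalBackoffMs(retries: number, baseMs: number = 1000): number {
  let total = 0;
  for (let i = 0; i < retries; i++) {
    total += baseMs * 2 ** i; // 1s, 2s, 4s, 8s, ...
  }
  return total;
}
```

Under these assumptions, six backoff waits sum to 1 + 2 + 4 + 8 + 16 + 32 = 63 s, which lines up with the ~63 s figure mentioned in the thread, while capping a permanent 500 at two retries costs only 1 + 2 = 3 s.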
Summary of the proposed enhancement
Add smart retry logic so only transient errors are retried; permanent errors (e.g. 4xx, 500) fail fast instead of using all retries.
Changes
Permanent error, not retrying: <status/code>
Testing
Tested locally; behavior is as expected. Scraping one library completed about 10 minutes faster.