## Context

The current scraper has two behaviors that combine poorly on unhealthy targets: HTTP fetches retry up to six times by default, and crawl processing ignores child-page errors when `ignoreErrors` is enabled (as it is by default). This provides good resilience against isolated transient failures, but it also lets a scrape continue deep into a target that is broadly unhealthy due to authentication walls, anti-bot challenges, or persistent server-side failures.

The change needs to preserve the existing fail-fast behavior for root URL failures, reduce time wasted on repeated child-page failures, and keep the configuration surface small. The stated user preference is to avoid exposing many related knobs.
| 6 | + |
## Goals / Non-Goals

**Goals:**
- Reduce default per-page retry cost by lowering the HTTP retry default from 6 to 3.
- Add a target-level abort policy that stops a crawl when child-page failures exceed a configured failure rate.
- Keep the configuration surface minimal by exposing a single user-facing threshold.
- Exclude expected refresh-deletion handling from the threshold so refresh cleanup remains safe.
- Preserve the existing semantics that a root-page failure aborts the job immediately.

**Non-Goals:**
- Adding both rate and count thresholds as separate public settings.
- Introducing a time-windowed circuit breaker, cooldowns, or half-open recovery logic.
- Changing which HTTP status codes are treated as retryable.
- Reworking `ignoreErrors` into a broader policy system.
| 21 | + |
## Decisions

### Use a failure-rate threshold instead of a count threshold

The scraper will expose a single configuration key, `scraper.abortOnFailureRate`, representing the maximum tolerated child-page failure rate before aborting the crawl.

Rationale:
- A rate scales better across small and large crawls than a fixed count.
- One threshold keeps the configuration surface small.
- The failure-rate model aligns with the user's fail-fast goal without requiring per-target tuning for crawl size.

Alternatives considered:
- Count-only threshold: simpler on paper, but behaves inconsistently across crawl sizes.
- Exposing both rate and count: more flexible, but adds configuration complexity the user explicitly wants to avoid.
| 36 | + |
### Use an internal minimum sample size before evaluating the threshold

The scraper will only evaluate the failure-rate threshold after an internal minimum number of child-page attempts have completed. The initial design uses a constant of 10 child-page attempts.

Rationale:
- Prevents early aborts caused by one or two isolated failures near the start of a crawl.
- Keeps the public API simple while still making rate-based aborts practical.

Alternatives considered:
- No minimum sample: too sensitive for small crawls.
- Configurable minimum sample: useful but adds another knob for marginal benefit.
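Taken together, the rate threshold and the internal minimum sample size could be sketched as a small tracker. This is a hypothetical sketch: the class name and `MIN_SAMPLE_SIZE` constant are illustrative, and only `abortOnFailureRate` corresponds to the configuration key in this design.

```typescript
// Hypothetical sketch of the abort policy. Only `abortOnFailureRate` maps to
// the config key in this design; the class and constant names are illustrative.
const MIN_SAMPLE_SIZE = 10; // internal constant, deliberately not user-facing

class FailureRateTracker {
  private attempts = 0;
  private failures = 0;

  constructor(private readonly abortOnFailureRate: number) {}

  record(success: boolean): void {
    this.attempts += 1;
    if (!success) this.failures += 1;
  }

  shouldAbort(): boolean {
    // Never evaluate the rate before the minimum sample size is reached.
    if (this.attempts < MIN_SAMPLE_SIZE) return false;
    return this.failures / this.attempts > this.abortOnFailureRate;
  }
}
```

With a threshold of `0.5`, nine consecutive failures would not abort (sample too small), but a tenth would clear both the sample floor and the rate check.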
| 48 | + |
### Count only terminal child-page processing failures toward the threshold

The strategy layer will increment failure counters only when a child page fails after the retry policy has been exhausted and the page-processing path throws. Root-page failures remain immediate hard failures and do not flow through the threshold logic.

Rationale:
- Keeps responsibility boundaries clear: the fetcher handles per-request retries, while the strategy handles crawl policy.
- Measures true page-level failures rather than transient sub-attempts.

Alternatives considered:
- Counting every retry attempt: too noisy, and would over-penalize transient outages.
- Counting pipeline warnings or empty content as failures: would blur the difference between degraded content and hard failure.
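The boundary described above might look like the following sketch, where `fetchWithRetries` stands in for the fetcher's existing retry policy; all names here are illustrative, not the actual implementation.

```typescript
// Illustrative sketch: the strategy only sees the fetcher's final outcome, so
// a failure is counted exactly once per page, after retries are exhausted.
type PageResult = { ok: boolean };

// Placeholder for the fetcher layer; the real one retries internally
// (default lowered from 6 to 3 by this change).
async function fetchWithRetries(url: string): Promise<PageResult> {
  return { ok: !url.includes("broken") };
}

async function processChildPage(
  url: string,
  counters: { attempts: number; failures: number },
): Promise<void> {
  counters.attempts += 1;
  try {
    const result = await fetchWithRetries(url);
    if (!result.ok) throw new Error(`terminal failure: ${url}`);
  } catch {
    // Only terminal, post-retry failures feed the threshold accounting.
    counters.failures += 1;
  }
}
```

Root pages would bypass this path entirely and fail the job immediately, as today.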
| 60 | + |
### Exclude refresh deletions from failure accounting

Pages that resolve to `FetchStatus.NOT_FOUND` during refresh mode will continue to be treated as expected deletions and will not count toward the failure-rate threshold.

Rationale:
- Refresh cleanup is expected maintenance behavior, not a target-health failure.
- Prevents valid refresh jobs from aborting when many stale pages have been removed upstream.
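A minimal sketch of the exclusion, assuming a `FetchStatus` enum with the members shown; only `NOT_FOUND` is named by this design, and the other members and the helper function are illustrative.

```typescript
// Sketch of refresh-mode failure accounting. Only FetchStatus.NOT_FOUND is
// named by this design; the other members and the helper are illustrative.
enum FetchStatus {
  SUCCESS = "success",
  NOT_FOUND = "not_found",
  ERROR = "error",
}

function countsAsFailure(status: FetchStatus, isRefresh: boolean): boolean {
  // An expected deletion during refresh is maintenance, not a health failure.
  if (isRefresh && status === FetchStatus.NOT_FOUND) {
    return false;
  }
  return status !== FetchStatus.SUCCESS;
}
```

Outside refresh mode, a `NOT_FOUND` still counts as a failure, so the exclusion is scoped exactly to the refresh-cleanup case.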
| 68 | + |
### Apply the threshold even when `ignoreErrors` is true

`ignoreErrors` will continue to suppress isolated child-page failures, but once the configured failure-rate threshold is exceeded after the minimum sample size, the crawl will abort.

Rationale:
- Preserves resilience when only a few pages are bad.
- Prevents `ignoreErrors` from masking a target that is broadly broken.
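The interaction could be expressed as a single decision function. This is a sketch; names other than `ignoreErrors` and `abortOnFailureRate` are hypothetical.

```typescript
// Sketch of the precedence rule: the threshold overrides ignoreErrors, while
// ignoreErrors still decides what happens to isolated failures below it.
interface CrawlOptions {
  ignoreErrors: boolean;
  abortOnFailureRate: number;
}

const MIN_SAMPLE = 10; // internal constant from this design

function onChildFailure(
  opts: CrawlOptions,
  attempts: number,
  failures: number,
): "continue" | "abort" | "throw" {
  if (attempts >= MIN_SAMPLE && failures / attempts > opts.abortOnFailureRate) {
    return "abort"; // threshold wins even when ignoreErrors is true
  }
  return opts.ignoreErrors ? "continue" : "throw";
}
```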
| 76 | + |
## Risks / Trade-offs

- [Small crawls may still complete despite high failure ratios] -> The minimum sample size intentionally favors tolerance for tiny crawls to avoid noisy aborts.
- [Some borderline-unhealthy sites may abort sooner than today] -> This is expected and aligned with fail-fast behavior; documentation and tests should make the policy explicit.
- [A single threshold may not fit every deployment] -> Start with one config key and a documented default; add more tuning only if real usage shows a clear need.
- [Behavior changes for existing scrape jobs] -> Keep root-failure semantics unchanged and limit the change to child-page threshold behavior plus reduced retries.
| 83 | + |
## Migration Plan

1. Update scraper config defaults and schema to lower `scraper.fetcher.maxRetries` to `3` and add `scraper.abortOnFailureRate`.
2. Implement child-page attempt and failure accounting in `BaseScraperStrategy`.
3. Abort the crawl when the failure rate exceeds the configured threshold after the internal minimum sample size.
4. Add or update tests for fetcher retry defaults, threshold-triggered aborts, root-page failures, and refresh deletions.
5. Document the new default behavior through specs and config-facing tests.

Rollback strategy:
- Restore the previous retry default and remove the threshold checks if production behavior proves too aggressive.
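For step 1, the resulting config might look like the following fragment. The key names come from this design; the surrounding structure and the `abortOnFailureRate` value are illustrative, since the shipped default is still an open question.

```json
{
  "scraper": {
    "fetcher": {
      "maxRetries": 3
    },
    "abortOnFailureRate": 0.5
  }
}
```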
| 94 | + |
## Open Questions

- What default value should `scraper.abortOnFailureRate` ship with? The implementation should pick a conservative default that tolerates a few isolated failures while still aborting clearly unhealthy crawls.