fix(scraper): add fail-fast threshold for unhealthy targets #377

Open
arabold wants to merge 3 commits into main from docs/fail-fast-scrape-thresholds

@arabold (Owner) commented Mar 29, 2026

Summary

  • add a fail-fast threshold on the child-page failure rate, evaluated only once a minimum sample of 10 pages is reached, and reduce the default HTTP retries from 6 to 3
  • preserve refresh deletion semantics: during a refresh, a tracked page that returns 404 is handled as a deletion, while a 404 in a normal (non-refresh) scrape remains a terminal failure rather than a deletion event
  • update the OpenSpec change, strategy/config tests, and refresh regression coverage to lock down the new behavior
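
The threshold check described in the first bullet can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `ChildPageStats`, `shouldAbortCrawl`, and `MIN_SAMPLE_SIZE` are hypothetical names, the threshold is assumed to be a fraction between 0 and 1, and the `>=` comparison is an assumption. Only the 10-page minimum sample comes from the summary above.

```typescript
// Hypothetical sketch of a fail-fast check on the child-page failure rate.
// All names here are illustrative assumptions, not identifiers from the PR.

// Minimum number of finished child pages before the rate is evaluated
// (the 10-page figure is taken from the PR summary).
const MIN_SAMPLE_SIZE = 10;

interface ChildPageStats {
  attempted: number; // child pages that reached a terminal outcome
  failed: number; // child pages that ended in a terminal failure
}

// Returns true when the crawl should stop early.
// Assumption: abortOnFailureRate is a fraction in [0, 1] and the crawl
// aborts when the observed rate meets or exceeds it.
function shouldAbortCrawl(
  stats: ChildPageStats,
  abortOnFailureRate: number,
): boolean {
  if (stats.attempted < MIN_SAMPLE_SIZE) {
    return false; // too few pages to judge the target's health
  }
  return stats.failed / stats.attempted >= abortOnFailureRate;
}
```

With an assumed threshold of 0.5, nine failures out of nine pages would not abort because the sample is still too small, while five failures out of ten would.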

Validation

  • npm test
  • npm run typecheck
  • npm run lint

Copilot AI left a comment

Pull request overview

Adds an OpenSpec change describing a new “fail-fast” crawl policy based on child-page failure rate, alongside a reduced default HTTP retry budget, to avoid wasting time on broadly unhealthy scrape targets while preserving existing root-page fail-fast behavior.

Changes:

  • Defines new spec requirements/scenarios for default HTTP retries, root-page abort semantics, child-page failure-rate threshold aborts, and refresh deletion exclusions.
  • Documents the design decisions (single exposed threshold, internal minimum sample size, exclusion rules).
  • Adds an implementation task checklist for config/schema/tests, fetcher retry behavior, and strategy/tool integration.
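
One way to picture the exclusion rules mentioned above is a small classifier that separates refresh deletions from terminal failures before anything is counted. This is a sketch under assumptions: `PageOutcome`, `PageContext`, `classifyOutcome`, and `countsTowardFailureRate` are hypothetical names, and reading "exclusion" as "deletions do not count toward the failure rate" is an interpretation of the review text, not something the PR confirms.

```typescript
// Hypothetical sketch: classify a child-page response so that refresh
// deletions are kept apart from terminal failures. Names are illustrative.

type PageOutcome = "success" | "deleted" | "failed";

interface PageContext {
  isRefresh: boolean; // the crawl is a refresh of an existing index
  isTracked: boolean; // the page was already known before this crawl
}

function classifyOutcome(status: number, ctx: PageContext): PageOutcome {
  if (status >= 200 && status < 300) {
    return "success";
  }
  // A tracked page that disappears during a refresh is a deletion.
  if (status === 404 && ctx.isRefresh && ctx.isTracked) {
    return "deleted";
  }
  // Any other non-2xx response, including a 404 outside a refresh,
  // is treated as a terminal failure in this sketch.
  return "failed";
}

// Assumption: only terminal failures feed the failure-rate threshold.
function countsTowardFailureRate(outcome: PageOutcome): boolean {
  return outcome === "failed";
}
```

Keeping classification separate from counting means a refresh over a site that removed many pages would not, under this reading, be mistaken for an unhealthy target.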

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| openspec/changes/fail-fast-scrape-thresholds/tasks.md | Implementation checklist for config defaults, failure-rate accounting, retry behavior, and test verification. |
| openspec/changes/fail-fast-scrape-thresholds/specs/scrape-failure-policy/spec.md | Spec scenarios for retry behavior, root-page abort, failure-rate threshold evaluation, and refresh deletion exclusions. |
| openspec/changes/fail-fast-scrape-thresholds/specs/configuration/spec.md | Config-facing spec for scraper.abortOnFailureRate and the updated scraper.fetcher.maxRetries default. |
| openspec/changes/fail-fast-scrape-thresholds/proposal.md | High-level rationale, scope, and impact for the change set. |
| openspec/changes/fail-fast-scrape-thresholds/design.md | Design rationale and trade-offs for the failure-rate threshold and minimum sample size approach. |
| openspec/changes/fail-fast-scrape-thresholds/.openspec.yaml | Metadata marking this as a spec-driven change with creation date. |

@arabold force-pushed the docs/fail-fast-scrape-thresholds branch from 117c536 to e8df0fa on March 29, 2026 15:40
@arabold changed the title from "docs(openspec): add fail-fast scrape thresholds change" to "fix(scraper): add fail-fast threshold for unhealthy targets" on Mar 30, 2026
@arabold requested a review from Copilot on March 30, 2026 02:51
Copilot AI left a comment

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 3 comments.

