## Context

The current scraper has two behaviors that combine poorly on unhealthy targets: HTTP fetches retry up to six times by default, and crawl processing ignores child-page errors when `ignoreErrors` is enabled (as it is by default). This provides good resilience against isolated transient failures, but it also lets a scrape continue deep into a target that is broadly unhealthy due to authentication walls, anti-bot challenges, or persistent server-side failures.

The change needs to preserve the existing fail-fast behavior for root URL failures, reduce time wasted on repeated child-page failures, and keep the configuration surface small. The stated user preference is to avoid exposing many related knobs.
| 6 | + |
## Goals / Non-Goals

**Goals:**
- Reduce default per-page retry cost by lowering the HTTP retry default from 6 to 3.
- Add a target-level abort policy that stops a crawl when child-page failures exceed a configured failure rate.
- Keep the configuration surface minimal by exposing a single user-facing threshold.
- Exclude expected refresh-deletion handling from the threshold so refresh cleanup remains safe.
- Preserve the existing semantics that a root-page failure aborts the job immediately.

**Non-Goals:**
- Adding both rate and count thresholds as separate public settings.
- Introducing a time-windowed circuit breaker, cooldowns, or half-open recovery logic.
- Changing which HTTP status codes are treated as retryable.
- Reworking `ignoreErrors` into a broader policy system.
| 21 | + |
## Decisions

### Use a failure-rate threshold instead of a count threshold

The scraper will expose a single configuration key, `scraper.abortOnFailureRate`, representing the maximum tolerated child-page failure rate before aborting the crawl.

Rationale:
- A rate scales better across small and large crawls than a fixed count.
- One threshold keeps the configuration surface small.
- The failure-rate model aligns with the user's fail-fast goal without requiring per-target tuning for crawl size.

Alternatives considered:
- Count-only threshold: simpler on paper, but behaves inconsistently across crawl sizes.
- Exposing both rate and count: more flexible, but adds configuration complexity the user explicitly wants to avoid.
| 36 | + |
### Use an internal minimum sample size before evaluating the threshold

The scraper will only evaluate the failure-rate threshold after an internal minimum number of child-page attempts have completed. The initial design uses a constant of 10 child-page attempts.

Rationale:
- Prevents early aborts caused by one or two isolated failures near the start of a crawl.
- Keeps the public API simple while still making rate-based aborts practical.

Alternatives considered:
- No minimum sample: too sensitive for small crawls.
- Configurable minimum sample: useful but adds another knob for marginal benefit.
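Taken together, the rate threshold and the internal minimum sample size could be sketched as a small tracker. This is a hypothetical sketch: the class name and `MIN_SAMPLE_SIZE` constant are illustrative, and only `abortOnFailureRate` corresponds to the configuration key in this design.

```typescript
// Hypothetical sketch of the abort policy. Only `abortOnFailureRate` maps to
// the config key in this design; the class and constant names are illustrative.
const MIN_SAMPLE_SIZE = 10; // internal constant, deliberately not user-facing

class FailureRateTracker {
  private attempts = 0;
  private failures = 0;

  constructor(private readonly abortOnFailureRate: number) {}

  record(success: boolean): void {
    this.attempts += 1;
    if (!success) this.failures += 1;
  }

  shouldAbort(): boolean {
    // Never evaluate the rate before the minimum sample size is reached.
    if (this.attempts < MIN_SAMPLE_SIZE) return false;
    return this.failures / this.attempts > this.abortOnFailureRate;
  }
}
```

With a threshold of `0.5`, nine consecutive failures would not abort (sample too small), but a tenth would clear both the sample floor and the rate check.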
| 48 | + |
### Count only terminal child-page processing failures toward the threshold

The strategy layer will increment failure counters only when a child page fails after the retry policy has been exhausted and the page-processing path throws. Root-page failures remain immediate hard failures and do not flow through the threshold logic.

Rationale:
- Keeps responsibility boundaries clear: the fetcher handles per-request retries, while the strategy handles crawl policy.
- Measures true page-level failures rather than transient sub-attempts.

Alternatives considered:
- Counting every retry attempt: too noisy, and would over-penalize transient outages.
- Counting pipeline warnings or empty content as failures: would blur the difference between degraded content and hard failure.
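The boundary described above might look like the following sketch, where `fetchWithRetries` stands in for the fetcher's existing retry policy; all names here are illustrative, not the actual implementation.

```typescript
// Illustrative sketch: the strategy only sees the fetcher's final outcome, so
// a failure is counted exactly once per page, after retries are exhausted.
type PageResult = { ok: boolean };

// Placeholder for the fetcher layer; the real one retries internally
// (default lowered from 6 to 3 by this change).
async function fetchWithRetries(url: string): Promise<PageResult> {
  return { ok: !url.includes("broken") };
}

async function processChildPage(
  url: string,
  counters: { attempts: number; failures: number },
): Promise<void> {
  counters.attempts += 1;
  try {
    const result = await fetchWithRetries(url);
    if (!result.ok) throw new Error(`terminal failure: ${url}`);
  } catch {
    // Only terminal, post-retry failures feed the threshold accounting.
    counters.failures += 1;
  }
}
```

Root pages would bypass this path entirely and fail the job immediately, as today.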
| 60 | + |
### Exclude refresh deletions from failure accounting

Pages that resolve to `FetchStatus.NOT_FOUND` during refresh mode will continue to be treated as expected deletions and will not count toward the failure-rate threshold.

Rationale:
- Refresh cleanup is expected maintenance behavior, not a target-health failure.
- Prevents valid refresh jobs from aborting when many stale pages have been removed upstream.
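A minimal sketch of the exclusion, assuming a `FetchStatus` enum with the members shown; only `NOT_FOUND` is named by this design, and the other members and the helper function are illustrative.

```typescript
// Sketch of refresh-mode failure accounting. Only FetchStatus.NOT_FOUND is
// named by this design; the other members and the helper are illustrative.
enum FetchStatus {
  SUCCESS = "success",
  NOT_FOUND = "not_found",
  ERROR = "error",
}

function countsAsFailure(status: FetchStatus, isRefresh: boolean): boolean {
  // An expected deletion during refresh is maintenance, not a health failure.
  if (isRefresh && status === FetchStatus.NOT_FOUND) {
    return false;
  }
  return status !== FetchStatus.SUCCESS;
}
```

Outside refresh mode, a `NOT_FOUND` still counts as a failure, so the exclusion is scoped exactly to the refresh-cleanup case.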
| 68 | + |
### Apply the threshold even when `ignoreErrors` is true

`ignoreErrors` will continue to suppress isolated child-page failures, but once the configured failure-rate threshold is exceeded after the minimum sample size, the crawl will abort.

Rationale:
- Preserves resilience when only a few pages are bad.
- Prevents `ignoreErrors` from masking a target that is broadly broken.
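The interaction could be expressed as a single decision function. This is a sketch; names other than `ignoreErrors` and `abortOnFailureRate` are hypothetical.

```typescript
// Sketch of the precedence rule: the threshold overrides ignoreErrors, while
// ignoreErrors still decides what happens to isolated failures below it.
interface CrawlOptions {
  ignoreErrors: boolean;
  abortOnFailureRate: number;
}

const MIN_SAMPLE = 10; // internal constant from this design

function onChildFailure(
  opts: CrawlOptions,
  attempts: number,
  failures: number,
): "continue" | "abort" | "throw" {
  if (attempts >= MIN_SAMPLE && failures / attempts > opts.abortOnFailureRate) {
    return "abort"; // threshold wins even when ignoreErrors is true
  }
  return opts.ignoreErrors ? "continue" : "throw";
}
```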
| 76 | + |
## Risks / Trade-offs

- [Small crawls may still complete despite high failure ratios] -> The minimum sample size intentionally favors tolerance for tiny crawls to avoid noisy aborts.
- [Some borderline-unhealthy sites may abort sooner than today] -> This is expected and aligned with fail-fast behavior; documentation and tests should make the policy explicit.
- [A single threshold may not fit every deployment] -> Start with one config key and a documented default; add more tuning only if real usage shows a clear need.
- [Behavior changes for existing scrape jobs] -> Keep root-failure semantics unchanged and limit the change to child-page threshold behavior plus reduced retries.
| 83 | + |
## Migration Plan

1. Update scraper config defaults and schema to lower `scraper.fetcher.maxRetries` to `3` and add `scraper.abortOnFailureRate`.
2. Implement child-page attempt and failure accounting in `BaseScraperStrategy`.
3. Abort the crawl when the failure rate exceeds the configured threshold after the internal minimum sample size.
4. Add or update tests for fetcher retry defaults, threshold-triggered aborts, root-page failures, and refresh deletions.
5. Document the new default behavior through specs and config-facing tests.

Rollback strategy:
- Restore the previous retry default and remove the threshold checks if production behavior proves too aggressive.
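For step 1, the resulting config might look like the following fragment. The key names come from this design; the surrounding structure and the `abortOnFailureRate` value are illustrative, since the shipped default is still an open question.

```json
{
  "scraper": {
    "fetcher": {
      "maxRetries": 3
    },
    "abortOnFailureRate": 0.5
  }
}
```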
| 94 | + |
## Open Questions

- What default value should `scraper.abortOnFailureRate` ship with? The implementation should pick a conservative default that tolerates a few isolated failures while still aborting clearly unhealthy crawls.