2 changes: 2 additions & 0 deletions openspec/changes/fail-fast-scrape-thresholds/.openspec.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,2 @@
schema: spec-driven
created: 2026-03-29
97 changes: 97 additions & 0 deletions openspec/changes/fail-fast-scrape-thresholds/design.md
@@ -0,0 +1,97 @@
## Context

The current scraper has two behaviors that combine poorly on unhealthy targets: HTTP fetches retry up to six times by default, and crawl processing ignores child-page errors by default when `ignoreErrors` is enabled. This gives good resilience for isolated transient failures, but it also allows a scrape to continue deep into a target that is broadly unhealthy due to authentication walls, anti-bot challenges, or persistent server-side failures.

The change needs to preserve existing fail-fast behavior for root URL failures, reduce wasted time on repeated child-page failures, and keep the configuration surface small. The user preference is to avoid exposing many related knobs.

## Goals / Non-Goals

**Goals:**
- Reduce default per-page retry cost by lowering the HTTP retry default from 6 to 3.
- Add a target-level abort policy that stops a crawl when child-page failures exceed a configured failure rate.
- Keep the configuration surface minimal by exposing a single user-facing threshold.
- Exclude expected refresh deletion handling from the threshold so refresh cleanup remains safe.
- Preserve existing semantics that a root page failure aborts the job immediately.

**Non-Goals:**
- Add both rate and count thresholds as separate public settings.
- Introduce a time-windowed circuit breaker, cooldowns, or half-open recovery logic.
- Change which HTTP status codes are treated as retryable.
- Rework `ignoreErrors` into a broader policy system.

## Decisions

### Use a failure-rate threshold instead of a count threshold

The scraper will expose a single configuration key, `scraper.abortOnFailureRate`, representing the maximum tolerated child-page failure rate before aborting the crawl. The value will be a fraction in the inclusive range `[0.0, 1.0]`, and the initial default will be `0.5`.

Rationale:
- A rate scales better across small and large crawls than a fixed count.
- One threshold keeps the configuration surface small.
- The failure-rate model aligns with the user's fail-fast goal without requiring per-target tuning for crawl size.

Alternatives considered:
- Count-only threshold: simpler on paper, but it behaves inconsistently across crawl sizes.
- Exposing both rate and count: more flexible, but adds configuration complexity the user explicitly wants to avoid.

### Use an internal minimum sample size before evaluating the threshold

The scraper will only evaluate the failure-rate threshold after an internal minimum number of completed child-page attempts have been observed. The initial design uses a constant of 10 completed child-page attempts, counted as successful child-page completions plus terminal child-page failures, excluding refresh deletions and in-flight attempts.

Rationale:
- Prevents early aborts from one or two isolated failures near the start of a crawl.
- Keeps the public API simple while still making rate-based aborts practical.

Alternatives considered:
- No minimum sample: too sensitive for small crawls.
- Configurable minimum sample: useful in principle, but adds another knob for marginal benefit.
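Taken together, the rate threshold, minimum sample size, and exclusion rules can be sketched as a small tracker. This is an illustrative sketch only; the class and method names below are assumptions, not the actual `BaseScraperStrategy` internals.

```typescript
// Hypothetical sketch of the failure-rate accounting described above.
// Names (FailureRateTracker, MIN_COMPLETED_ATTEMPTS) are illustrative.
const MIN_COMPLETED_ATTEMPTS = 10;

class FailureRateTracker {
  private successes = 0;
  private failures = 0;

  recordSuccess(): void {
    this.successes += 1;
  }

  /** Terminal child-page failure: retry policy already exhausted. */
  recordFailure(): void {
    this.failures += 1;
  }

  // Refresh deletions and in-flight attempts are simply never recorded,
  // so they cannot influence the rate.

  get completed(): number {
    return this.successes + this.failures;
  }

  shouldAbort(abortOnFailureRate: number): boolean {
    if (this.completed < MIN_COMPLETED_ATTEMPTS) return false;
    return this.failures / this.completed > abortOnFailureRate;
  }
}
```

With the default threshold of `0.5`, 6 terminal failures out of 10 completed attempts gives a rate of 0.6 and aborts, while 6 out of 12 sits exactly at the threshold and continues, since only rates strictly above the threshold trigger an abort.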

### Count only terminal child-page processing failures toward the threshold

The strategy layer will increment failure counters only when a child page fails after retry policy has already been exhausted and the page processing path throws. Root-page failures remain immediate hard failures and do not flow through threshold logic.

Rationale:
- Keeps responsibility boundaries clear: fetcher handles per-request retries, strategy handles crawl policy.
- Measures true page-level failures rather than transient sub-attempts.

Alternatives considered:
- Counting every retry attempt: too noisy and would over-penalize transient outages.
- Counting pipeline warnings or empty content as failures: would blur the difference between degraded content and hard failure.

### Exclude refresh deletions from failure accounting

Pages that resolve to `FetchStatus.NOT_FOUND` during refresh mode will continue to be treated as expected deletions and will not count toward the failure-rate threshold.

Rationale:
- Refresh cleanup is expected maintenance behavior, not a target health failure.
- Prevents valid refresh jobs from aborting when many stale pages have been removed upstream.

### Apply the threshold even when `ignoreErrors` is true

`ignoreErrors` will continue to suppress isolated child-page failures, but once the configured failure-rate threshold is exceeded after the minimum sample size, the crawl will abort.

Rationale:
- Preserves resilience for a few bad pages.
- Prevents `ignoreErrors` from masking a target that is broadly broken.
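The interaction can be illustrated with a self-contained simulation; `crawl`, its parameters, and the `MIN_SAMPLE` constant below are hypothetical stand-ins for the real strategy code:

```typescript
// Illustrative only: ignoreErrors suppresses individual child-page
// failures, but the rate check runs after every completed attempt.
const MIN_SAMPLE = 10;

function crawl(
  results: boolean[], // per-page outcomes: true = success, false = terminal failure
  ignoreErrors: boolean,
  abortRate: number,
): "done" | "aborted" {
  let ok = 0;
  let failed = 0;
  for (const success of results) {
    if (success) {
      ok += 1;
    } else {
      failed += 1;
      if (!ignoreErrors) return "aborted"; // existing per-page fail-fast
    }
    const completed = ok + failed;
    if (completed >= MIN_SAMPLE && failed / completed > abortRate) {
      return "aborted"; // threshold applies even when ignoreErrors is true
    }
  }
  return "done";
}
```

For example, a crawl of 3 successes followed by 7 failures with `ignoreErrors` enabled still returns `"aborted"` once the tenth completed attempt pushes the rate to 0.7.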

## Risks / Trade-offs

- [Small crawls may still complete despite high failure ratios] -> The minimum sample size intentionally favors tolerance for tiny crawls to avoid noisy aborts.
- [Some borderline unhealthy sites may abort sooner than today] -> This is expected and aligned with fail-fast behavior; documentation and tests should make the policy explicit.
- [A single threshold may not fit every deployment] -> Start with one config key and a documented default; add more tuning only if real usage shows a clear need.
- [Behavior changes for existing scrape jobs] -> Keep root failure semantics unchanged and limit the change to child-page threshold behavior plus reduced retries.

## Migration Plan

1. Update scraper config defaults and schema to lower `scraper.fetcher.maxRetries` to `3` and add `scraper.abortOnFailureRate`.
2. Implement child-page attempt and failure accounting in `BaseScraperStrategy`.
3. Abort the crawl when the failure rate exceeds the configured threshold after the internal minimum sample size.
4. Add or update tests for fetcher retry defaults, threshold-triggered aborts, root-page failures, and refresh deletions.
5. Document the new default behavior through specs and config-facing tests.

Rollback strategy:
- Restore the previous retry default and remove the threshold checks if production behavior proves too aggressive.

## Open Questions

- None.
26 changes: 26 additions & 0 deletions openspec/changes/fail-fast-scrape-thresholds/proposal.md
@@ -0,0 +1,26 @@
## Why

The scraper currently combines a high default HTTP retry count with a crawl policy that ignores non-root page failures by default, which can make unhealthy targets look more successful than they are and waste time on permanently broken sites. We need a simpler fail-fast policy so scrape jobs stop earlier when a target is broadly unhealthy while still tolerating a small number of transient page failures.

## What Changes

- Reduce the default HTTP fetch retry count from 6 retries to 3 retries.
- Add a target-level scrape failure-rate threshold that aborts a crawl when too many child pages fail after a minimum number of attempted pages.
- Keep root page failures as immediate job failures.
- Exclude expected refresh deletions (`404`/`NOT_FOUND`) from the failure-rate threshold so refresh cleanup does not trip the breaker.
- Surface the new threshold through configuration with a default `scraper.abortOnFailureRate` of `0.5`.
- Add specification scenarios covering retry defaults, threshold evaluation, ignored deletion cases, and root-page fail-fast behavior.

## Capabilities

### New Capabilities
- `scrape-failure-policy`: Defines retry defaults and fail-fast target abort behavior for unhealthy scrape targets.

### Modified Capabilities
- `configuration`: Add configuration support for the scrape failure-rate threshold and updated default fetch retry count.

## Impact

- Affected code: `src/utils/config.ts`, `src/scraper/fetcher/HttpFetcher.ts`, `src/scraper/strategies/BaseScraperStrategy.ts`, `src/tools/ScrapeTool.ts`, and related tests.
- Affected behavior: scrape jobs will stop earlier on broadly failing targets and will spend less time retrying transient fetch failures.
- Affected interfaces: configuration defaults and config schema for scraper settings.
@@ -0,0 +1,27 @@
## ADDED Requirements

### Requirement: Scrape Failure Rate Threshold Configuration
The configuration system SHALL expose `scraper.abortOnFailureRate` as the child-page failure-rate threshold used to abort unhealthy scrape targets. The value SHALL be a floating-point fraction in the inclusive range `[0.0, 1.0]`, where `0.0` means any terminal child-page failure after the minimum sample is reached triggers an abort and `1.0` means the scraper never aborts based on failure rate alone.

#### Scenario: Config file sets scrape failure rate threshold
- **WHEN** the configuration file sets `scraper.abortOnFailureRate` to a numeric value in the inclusive range `[0.0, 1.0]`
- **THEN** the scraper SHALL interpret that value as the maximum allowed fraction of terminally failed child pages among completed child-page attempts

#### Scenario: Environment variable overrides scrape failure rate threshold
- **WHEN** the environment variable `DOCS_MCP_SCRAPER_ABORT_ON_FAILURE_RATE` is set
- **THEN** the environment variable value SHALL override config file and default values for `scraper.abortOnFailureRate`

#### Scenario: Default scrape failure rate threshold is one half
- **WHEN** no explicit configuration for `scraper.abortOnFailureRate` is provided
- **THEN** the loaded configuration SHALL set `scraper.abortOnFailureRate` to `0.5`

### Requirement: Updated Default HTTP Retry Configuration
The configuration system SHALL default `scraper.fetcher.maxRetries` to `3`.

#### Scenario: Default fetch retry count is three
- **WHEN** no explicit configuration for `scraper.fetcher.maxRetries` is provided
- **THEN** the loaded configuration SHALL set `scraper.fetcher.maxRetries` to `3`

#### Scenario: Explicit fetch retry override still wins
- **WHEN** the user provides an explicit value for `scraper.fetcher.maxRetries`
- **THEN** the loaded configuration SHALL use the explicit value instead of the default
@@ -0,0 +1,59 @@
## ADDED Requirements

### Requirement: Bounded HTTP Fetch Retries
The scraper SHALL default the HTTP fetch retry budget to 3 retries per page request, in addition to the initial attempt.

#### Scenario: Default retry budget uses four total attempts
- **WHEN** an HTTP page fetch repeatedly fails with a retryable error and no per-request override is provided
- **THEN** the fetcher SHALL stop after the initial attempt plus 3 retries

#### Scenario: Permanent HTTP failures do not use retry budget
- **WHEN** an HTTP page fetch fails with a non-retryable permanent error
- **THEN** the fetcher SHALL fail the page without consuming additional retries
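The retry-budget semantics above can be sketched as follows; `fetchOnce` and `isRetryable` are illustrative parameters, not the real `HttpFetcher` API:

```typescript
// Sketch of the retry budget: maxRetries counts retries *after* the
// initial attempt, so the default of 3 yields 4 total attempts.
async function fetchWithRetries(
  fetchOnce: () => Promise<string>,
  isRetryable: (err: unknown) => boolean,
  maxRetries = 3,
): Promise<string> {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fetchOnce();
    } catch (err) {
      // Permanent errors fail immediately without consuming the budget.
      if (!isRetryable(err) || attempt === maxRetries) throw err;
    }
  }
  throw new Error("unreachable"); // satisfies the return type
}
```

A retryable error thus produces four total attempts by default, while a permanent error surfaces after a single attempt.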

### Requirement: Root Page Failures Abort Immediately
The scraper SHALL fail the scrape job immediately when the root page cannot be processed successfully during a normal scrape. During refresh, a tracked root page that returns `NOT_FOUND` SHALL be treated as a deletion instead of a hard failure.

#### Scenario: Root page fetch failure aborts the job
- **WHEN** the initial root page fails during fetch or processing
- **THEN** the scrape job SHALL terminate with an error

#### Scenario: Root page failure bypasses child failure threshold logic
- **WHEN** the initial root page fails before any child pages are attempted
- **THEN** the scraper SHALL abort immediately rather than evaluating the child-page failure threshold

#### Scenario: Root page not found during refresh is treated as deletion
- **WHEN** a refresh crawl re-processes a tracked root page and that page returns `NOT_FOUND`
- **THEN** the scraper SHALL report the root page as deleted
- **AND** the scrape SHALL continue applying refresh deletion handling instead of failing the job

### Requirement: Child Page Failure Rate Aborts Unhealthy Targets
The scraper SHALL track completed child-page attempts and terminal child-page failures, and SHALL abort the crawl when, after a minimum evaluation sample of 10 completed child-page attempts has been reached, the observed child-page failure rate exceeds the configured threshold. Completed child-page attempts SHALL include successful child-page completions and terminal child-page failures, and SHALL exclude refresh deletions and in-flight attempts.

#### Scenario: Isolated child-page failures do not abort before minimum sample
- **WHEN** a crawl encounters child-page failures before 10 completed child-page attempts have been reached
- **THEN** the scraper SHALL continue crawling using normal `ignoreErrors` behavior

#### Scenario: Failure rate above threshold aborts the crawl
- **WHEN** 10 completed child-page attempts have been reached and the observed child-page failure rate exceeds the configured threshold
- **THEN** the scraper SHALL terminate the crawl with an error indicating that the target exceeded the allowed failure rate

#### Scenario: Failure rate at or below threshold continues the crawl
- **WHEN** 10 completed child-page attempts have been reached and the observed child-page failure rate is at or below the configured threshold
- **THEN** the scraper SHALL continue crawling remaining in-scope pages

### Requirement: Refresh Deletions Do Not Count As Failures
The scraper SHALL exclude expected page deletions detected during refresh from child-page failure-rate accounting.

#### Scenario: Refresh deletion does not increase failure rate
- **WHEN** a refresh crawl encounters a child page that returns `NOT_FOUND`
- **THEN** the scraper SHALL mark the page as deleted without incrementing the child-page failure counter

#### Scenario: Refresh cleanup alone cannot trip the threshold
- **WHEN** a refresh crawl processes multiple deleted child pages and no terminal child-page processing failures occur
- **THEN** the scraper SHALL not abort due to the child-page failure threshold

#### Scenario: Non-refresh not found counts as a terminal page failure
- **WHEN** a normal crawl encounters a non-root page that returns `NOT_FOUND`
- **THEN** the scraper SHALL treat that page as a terminal page failure for failure-rate accounting
- **AND** the page SHALL not be treated as a refresh deletion
28 changes: 28 additions & 0 deletions openspec/changes/fail-fast-scrape-thresholds/tasks.md
@@ -0,0 +1,28 @@
## 1. Configuration Defaults

- [x] 1.1 Update the scraper default configuration to set `scraper.fetcher.maxRetries` to `3`.
- [x] 1.2 Extend the scraper config schema and typing to support `scraper.abortOnFailureRate`.
- [x] 1.3 Add or update configuration tests for the new default retry count and `scraper.abortOnFailureRate` override behavior.

## 2. Crawl Failure Policy

- [x] 2.1 Add child-page attempt and failure counters to `BaseScraperStrategy` without changing root-page failure semantics.
- [x] 2.2 Implement the internal minimum sample-size check before evaluating the child-page failure-rate threshold.
- [x] 2.3 Abort the crawl with a scraper error when the configured child-page failure rate is exceeded.
- [x] 2.4 Exclude refresh deletions (`FetchStatus.NOT_FOUND`) from child-page failure-rate accounting.

## 3. Retry Behavior

- [x] 3.1 Update `HttpFetcher`-level tests to verify the new default retry budget and retained permanent-failure behavior.
- [x] 3.2 Ensure per-request retry overrides still work independently of the new default.

## 4. Strategy and Tool Integration

- [x] 4.1 Ensure the new threshold applies during normal scrape jobs even when `ignoreErrors` remains enabled for isolated child-page failures.
- [x] 4.2 Verify scrape progress and terminal error behavior remain correct when the threshold aborts a crawl.

## 5. Verification

- [x] 5.1 Add strategy tests covering root-page fail-fast behavior, below-threshold continuation, above-threshold aborts, and refresh deletion exclusions.
- [x] 5.2 Run targeted tests for config, fetcher, and scraper strategy behavior.
- [x] 5.3 Run the project lint and typecheck commands and address any failures.
2 changes: 0 additions & 2 deletions src/pipeline/PipelineWorker.test.ts
@@ -367,7 +367,6 @@ describe("PipelineWorker", () => {
depth: 1,
maxDepth: 1,
result: null, // No result for 304
deleted: false,
pageId: 123, // Page ID from refresh queue
totalDiscovered: 0,
};
@@ -387,7 +386,6 @@
mockJob,
expect.objectContaining({
result: null,
deleted: false,
pageId: 123,
}),
);
9 changes: 4 additions & 5 deletions src/scraper/fetcher/HttpFetcher.test.ts
@@ -1,6 +1,6 @@
import { beforeEach, describe, expect, it, vi } from "vitest";
import { CancellationError } from "../../pipeline/errors";
import { loadConfig } from "../../utils/config";
import { DEFAULT_CONFIG } from "../../utils/config";
import { RedirectError, ScraperError } from "../../utils/errors";

vi.mock("axios");
@@ -11,8 +11,7 @@ const mockedAxios = vi.mocked(axios, true);

import { HttpFetcher } from "./HttpFetcher";

const scraperConfig = loadConfig().scraper;
const createFetcher = () => new HttpFetcher(scraperConfig);
const createFetcher = () => new HttpFetcher(DEFAULT_CONFIG.scraper);

describe("HttpFetcher", () => {
beforeEach(() => {
@@ -143,8 +142,8 @@ describe("HttpFetcher", () => {
}),
).rejects.toThrow(ScraperError);

// Should call initial attempt + 6 retries (default SCRAPER_FETCHER_MAX_RETRIES = 6)
expect(mockedAxios.get).toHaveBeenCalledTimes(7);
// Should call initial attempt + 3 retries (default SCRAPER_FETCHER_MAX_RETRIES = 3)
expect(mockedAxios.get).toHaveBeenCalledTimes(4);
});

it("should respect custom maxRetries option", async () => {
6 changes: 4 additions & 2 deletions src/scraper/middleware/HtmlPlaywrightMiddleware.test.ts
@@ -163,10 +163,11 @@ describe("HtmlPlaywrightMiddleware", () => {
maxPages: 1000,
maxDepth: 3,
maxConcurrency: 3,
abortOnFailureRate: 0.5,
pageTimeoutMs: 5000,
browserTimeoutMs: 30000,
fetcher: {
maxRetries: 6,
maxRetries: 3,
baseDelayMs: 1000,
maxCacheItems: 200,
maxCacheItemSizeBytes: 500 * 1024,
@@ -932,10 +933,11 @@ describe("Route handling race condition protection", () => {
maxPages: 1000,
maxDepth: 3,
maxConcurrency: 3,
abortOnFailureRate: 0.5,
pageTimeoutMs: 5000,
browserTimeoutMs: 30000,
fetcher: {
maxRetries: 6,
maxRetries: 3,
baseDelayMs: 1000,
maxCacheItems: 200,
maxCacheItemSizeBytes: 500 * 1024,