2 changes: 2 additions & 0 deletions openspec/changes/fail-fast-scrape-thresholds/.openspec.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,2 @@
schema: spec-driven
created: 2026-03-29
97 changes: 97 additions & 0 deletions openspec/changes/fail-fast-scrape-thresholds/design.md
@@ -0,0 +1,97 @@
## Context

The current scraper has two behaviors that combine poorly on unhealthy targets: HTTP fetches retry up to six times by default, and crawl processing ignores child-page errors by default when `ignoreErrors` is enabled. This gives good resilience for isolated transient failures, but it also allows a scrape to continue deep into a target that is broadly unhealthy due to authentication walls, anti-bot challenges, or persistent server-side failures.

The change needs to preserve existing fail-fast behavior for root URL failures, reduce wasted time on repeated child-page failures, and keep the configuration surface small. The user preference is to avoid exposing many related knobs.

## Goals / Non-Goals

**Goals:**
- Reduce default per-page retry cost by lowering the HTTP retry default from 6 to 3.
- Add a target-level abort policy that stops a crawl when child-page failures exceed a configured failure rate.
- Keep the configuration surface minimal by exposing a single user-facing threshold.
- Exclude expected refresh deletion handling from the threshold so refresh cleanup remains safe.
- Preserve existing semantics that a root page failure aborts the job immediately.

**Non-Goals:**
- Add both rate and count thresholds as separate public settings.
- Introduce a time-windowed circuit breaker, cooldowns, or half-open recovery logic.
- Change which HTTP status codes are treated as retryable.
- Rework `ignoreErrors` into a broader policy system.

## Decisions

### Use a failure-rate threshold instead of a count threshold

The scraper will expose a single configuration key, `scraper.abortOnFailureRate`, representing the maximum tolerated child-page failure rate before aborting the crawl. The value will be a fraction in the inclusive range `[0.0, 1.0]`, and the initial default will be `0.5`.

Rationale:
- A rate scales better across small and large crawls than a fixed count.
- One threshold keeps the configuration surface small.
- The failure-rate model aligns with the user's fail-fast goal without requiring per-target tuning for crawl size.

Alternatives considered:
- Count-only threshold: simpler on paper, but it behaves inconsistently across crawl sizes.
- Exposing both rate and count: more flexible, but adds configuration complexity the user explicitly wants to avoid.

### Use an internal minimum sample size before evaluating the threshold

The scraper will only evaluate the failure-rate threshold after an internal minimum number of completed child-page attempts have been observed. The initial design uses a constant of 10 completed child-page attempts, counted as successful child-page completions plus terminal child-page failures, excluding refresh deletions and in-flight attempts.

Rationale:
- Prevents early aborts from one or two isolated failures near the start of a crawl.
- Keeps the public API simple while still making rate-based aborts practical.

Alternatives considered:
- No minimum sample: too sensitive for small crawls.
- Configurable minimum sample: useful in principle, but adds another knob for marginal benefit.
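Taken together, the rate threshold, minimum sample size, and exclusion rules can be sketched as a small tracker. This is an illustrative sketch only; the class and method names below are assumptions, not the actual `BaseScraperStrategy` internals.

```typescript
// Hypothetical sketch of the failure-rate accounting described above.
// Names (FailureRateTracker, MIN_COMPLETED_ATTEMPTS) are illustrative.
const MIN_COMPLETED_ATTEMPTS = 10;

class FailureRateTracker {
  private successes = 0;
  private failures = 0;

  recordSuccess(): void {
    this.successes += 1;
  }

  /** Terminal child-page failure: retry policy already exhausted. */
  recordFailure(): void {
    this.failures += 1;
  }

  // Refresh deletions and in-flight attempts are simply never recorded,
  // so they cannot influence the rate.

  get completed(): number {
    return this.successes + this.failures;
  }

  shouldAbort(abortOnFailureRate: number): boolean {
    if (this.completed < MIN_COMPLETED_ATTEMPTS) return false;
    return this.failures / this.completed > abortOnFailureRate;
  }
}
```

With the default threshold of `0.5`, 6 terminal failures out of 10 completed attempts gives a rate of 0.6 and aborts, while 6 out of 12 sits exactly at the threshold and continues, since only rates strictly above the threshold trigger an abort.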

### Count only terminal child-page processing failures toward the threshold

The strategy layer will increment failure counters only when a child page fails after retry policy has already been exhausted and the page processing path throws. Root-page failures remain immediate hard failures and do not flow through threshold logic.

Rationale:
- Keeps responsibility boundaries clear: fetcher handles per-request retries, strategy handles crawl policy.
- Measures true page-level failures rather than transient sub-attempts.

Alternatives considered:
- Counting every retry attempt: too noisy and would over-penalize transient outages.
- Counting pipeline warnings or empty content as failures: would blur the difference between degraded content and hard failure.

### Exclude refresh deletions from failure accounting

Pages that resolve to `FetchStatus.NOT_FOUND` during refresh mode will continue to be treated as expected deletions and will not count toward the failure-rate threshold.

Rationale:
- Refresh cleanup is expected maintenance behavior, not a target health failure.
- Prevents valid refresh jobs from aborting when many stale pages have been removed upstream.

### Apply the threshold even when `ignoreErrors` is true

`ignoreErrors` will continue to suppress isolated child-page failures, but once the configured failure-rate threshold is exceeded after the minimum sample size, the crawl will abort.

Rationale:
- Preserves resilience for a few bad pages.
- Prevents `ignoreErrors` from masking a target that is broadly broken.
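The interaction can be illustrated with a self-contained simulation; `crawl`, its parameters, and the `MIN_SAMPLE` constant below are hypothetical stand-ins for the real strategy code:

```typescript
// Illustrative only: ignoreErrors suppresses individual child-page
// failures, but the rate check runs after every completed attempt.
const MIN_SAMPLE = 10;

function crawl(
  results: boolean[], // per-page outcomes: true = success, false = terminal failure
  ignoreErrors: boolean,
  abortRate: number,
): "done" | "aborted" {
  let ok = 0;
  let failed = 0;
  for (const success of results) {
    if (success) {
      ok += 1;
    } else {
      failed += 1;
      if (!ignoreErrors) return "aborted"; // existing per-page fail-fast
    }
    const completed = ok + failed;
    if (completed >= MIN_SAMPLE && failed / completed > abortRate) {
      return "aborted"; // threshold applies even when ignoreErrors is true
    }
  }
  return "done";
}
```

For example, a crawl of 3 successes followed by 7 failures with `ignoreErrors` enabled still returns `"aborted"` once the tenth completed attempt pushes the rate to 0.7.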

## Risks / Trade-offs

- [Small crawls may still complete despite high failure ratios] -> The minimum sample size intentionally favors tolerance for tiny crawls to avoid noisy aborts.
- [Some borderline unhealthy sites may abort sooner than today] -> This is expected and aligned with fail-fast behavior; documentation and tests should make the policy explicit.
- [A single threshold may not fit every deployment] -> Start with one config key and a documented default; add more tuning only if real usage shows a clear need.
- [Behavior changes for existing scrape jobs] -> Keep root failure semantics unchanged and limit the change to child-page threshold behavior plus reduced retries.

## Migration Plan

1. Update scraper config defaults and schema to lower `scraper.fetcher.maxRetries` to `3` and add `scraper.abortOnFailureRate`.
2. Implement child-page attempt and failure accounting in `BaseScraperStrategy`.
3. Abort the crawl when the failure rate exceeds the configured threshold after the internal minimum sample size.
4. Add or update tests for fetcher retry defaults, threshold-triggered aborts, root-page failures, and refresh deletions.
5. Document the new default behavior through specs and config-facing tests.

Rollback strategy:
- Restore the previous retry default and remove the threshold checks if production behavior proves too aggressive.

## Open Questions

- None.
26 changes: 26 additions & 0 deletions openspec/changes/fail-fast-scrape-thresholds/proposal.md
@@ -0,0 +1,26 @@
## Why

The scraper currently combines a high default HTTP retry count with a crawl policy that ignores non-root page failures by default, which can make unhealthy targets look more successful than they are and waste time on permanently broken sites. We need a simpler fail-fast policy so scrape jobs stop earlier when a target is broadly unhealthy while still tolerating a small number of transient page failures.

## What Changes

- Reduce the default HTTP fetch retry count from 6 retries to 3 retries.
- Add a target-level scrape failure-rate threshold that aborts a crawl when too many child pages fail after a minimum number of attempted pages.
- Keep root page failures as immediate job failures.
- Exclude expected refresh deletions (`404`/`NOT_FOUND`) from the failure-rate threshold so refresh cleanup does not trip the breaker.
- Surface the new threshold through configuration with a default `scraper.abortOnFailureRate` of `0.5`.
- Add specification scenarios covering retry defaults, threshold evaluation, ignored deletion cases, and root-page fail-fast behavior.

## Capabilities

### New Capabilities
- `scrape-failure-policy`: Defines retry defaults and fail-fast target abort behavior for unhealthy scrape targets.

### Modified Capabilities
- `configuration`: Add configuration support for the scrape failure-rate threshold and updated default fetch retry count.

## Impact

- Affected code: `src/utils/config.ts`, `src/scraper/fetcher/HttpFetcher.ts`, `src/scraper/strategies/BaseScraperStrategy.ts`, `src/tools/ScrapeTool.ts`, and related tests.
- Affected behavior: scrape jobs will stop earlier on broadly failing targets and will spend less time retrying transient fetch failures.
- Affected interfaces: configuration defaults and config schema for scraper settings.
@@ -0,0 +1,27 @@
## ADDED Requirements

### Requirement: Scrape Failure Rate Threshold Configuration
The configuration system SHALL expose `scraper.abortOnFailureRate` as the child-page failure-rate threshold used to abort unhealthy scrape targets. The value SHALL be a floating-point fraction in the inclusive range `[0.0, 1.0]`, where `0.0` means any terminal child-page failure after the minimum sample is reached triggers an abort and `1.0` means the scraper never aborts based on failure rate alone.

#### Scenario: Config file sets scrape failure rate threshold
- **WHEN** the configuration file sets `scraper.abortOnFailureRate` to a numeric value in the inclusive range `[0.0, 1.0]`
- **THEN** the scraper SHALL interpret that value as the maximum allowed fraction of terminally failed child pages among completed child-page attempts

#### Scenario: Environment variable overrides scrape failure rate threshold
- **WHEN** the environment variable `DOCS_MCP_SCRAPER_ABORT_ON_FAILURE_RATE` is set
- **THEN** the environment variable value SHALL override config file and default values for `scraper.abortOnFailureRate`

#### Scenario: Default scrape failure rate threshold is one half
- **WHEN** no explicit configuration for `scraper.abortOnFailureRate` is provided
- **THEN** the loaded configuration SHALL set `scraper.abortOnFailureRate` to `0.5`

### Requirement: Updated Default HTTP Retry Configuration
The configuration system SHALL default `scraper.fetcher.maxRetries` to `3`.

#### Scenario: Default fetch retry count is three
- **WHEN** no explicit configuration for `scraper.fetcher.maxRetries` is provided
- **THEN** the loaded configuration SHALL set `scraper.fetcher.maxRetries` to `3`

#### Scenario: Explicit fetch retry override still wins
- **WHEN** the user provides an explicit value for `scraper.fetcher.maxRetries`
- **THEN** the loaded configuration SHALL use the explicit value instead of the default
@@ -0,0 +1,59 @@
## ADDED Requirements

### Requirement: Bounded HTTP Fetch Retries
The scraper SHALL default the HTTP fetch retry budget to 3 retries per page request, in addition to the initial attempt.

#### Scenario: Default retry budget uses four total attempts
- **WHEN** an HTTP page fetch repeatedly fails with a retryable error and no per-request override is provided
- **THEN** the fetcher SHALL stop after the initial attempt plus 3 retries

#### Scenario: Permanent HTTP failures do not use retry budget
- **WHEN** an HTTP page fetch fails with a non-retryable permanent error
- **THEN** the fetcher SHALL fail the page without consuming additional retries
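The retry-budget semantics above can be sketched as follows; `fetchOnce` and `isRetryable` are illustrative parameters, not the real `HttpFetcher` API:

```typescript
// Sketch of the retry budget: maxRetries counts retries *after* the
// initial attempt, so the default of 3 yields 4 total attempts.
async function fetchWithRetries(
  fetchOnce: () => Promise<string>,
  isRetryable: (err: unknown) => boolean,
  maxRetries = 3,
): Promise<string> {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fetchOnce();
    } catch (err) {
      // Permanent errors fail immediately without consuming the budget.
      if (!isRetryable(err) || attempt === maxRetries) throw err;
    }
  }
  throw new Error("unreachable"); // satisfies the return type
}
```

A retryable error thus produces four total attempts by default, while a permanent error surfaces after a single attempt.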

### Requirement: Root Page Failures Abort Immediately
The scraper SHALL fail the scrape job immediately when the root page cannot be processed successfully during a normal scrape. During refresh, a tracked root page that returns `NOT_FOUND` SHALL be treated as a deletion instead of a hard failure.

#### Scenario: Root page fetch failure aborts the job
- **WHEN** the initial root page fails during fetch or processing
- **THEN** the scrape job SHALL terminate with an error

#### Scenario: Root page failure bypasses child failure threshold logic
- **WHEN** the initial root page fails before any child pages are attempted
- **THEN** the scraper SHALL abort immediately rather than evaluating the child-page failure threshold

#### Scenario: Root page not found during refresh is treated as deletion
- **WHEN** a refresh crawl re-processes a tracked root page and that page returns `NOT_FOUND`
- **THEN** the scraper SHALL report the root page as deleted
- **AND** the scrape SHALL continue applying refresh deletion handling instead of failing the job

### Requirement: Child Page Failure Rate Aborts Unhealthy Targets
The scraper SHALL track completed child-page attempts and terminal child-page failures, and SHALL abort the crawl when, after a minimum evaluation sample of 10 completed child-page attempts has been reached, the observed child-page failure rate exceeds the configured threshold. Completed child-page attempts SHALL include successful child-page completions and terminal child-page failures, and SHALL exclude refresh deletions and in-flight attempts.

#### Scenario: Isolated child-page failures do not abort before minimum sample
- **WHEN** a crawl encounters child-page failures before 10 completed child-page attempts have been reached
- **THEN** the scraper SHALL continue crawling using normal `ignoreErrors` behavior

#### Scenario: Failure rate above threshold aborts the crawl
- **WHEN** 10 completed child-page attempts have been reached and the observed child-page failure rate exceeds the configured threshold
- **THEN** the scraper SHALL terminate the crawl with an error indicating that the target exceeded the allowed failure rate

#### Scenario: Failure rate at or below threshold continues the crawl
- **WHEN** 10 completed child-page attempts have been reached and the observed child-page failure rate is at or below the configured threshold
- **THEN** the scraper SHALL continue crawling remaining in-scope pages

### Requirement: Refresh Deletions Do Not Count As Failures
The scraper SHALL exclude expected page deletions detected during refresh from child-page failure-rate accounting.

#### Scenario: Refresh deletion does not increase failure rate
- **WHEN** a refresh crawl encounters a child page that returns `NOT_FOUND`
- **THEN** the scraper SHALL mark the page as deleted without incrementing the child-page failure counter

#### Scenario: Refresh cleanup alone cannot trip the threshold
- **WHEN** a refresh crawl processes multiple deleted child pages and no terminal child-page processing failures occur
- **THEN** the scraper SHALL not abort due to the child-page failure threshold

#### Scenario: Non-refresh not found counts as a terminal page failure
- **WHEN** a normal crawl encounters a non-root page that returns `NOT_FOUND`
- **THEN** the scraper SHALL treat that page as a terminal page failure for failure-rate accounting
- **AND** the page SHALL not be treated as a refresh deletion
28 changes: 28 additions & 0 deletions openspec/changes/fail-fast-scrape-thresholds/tasks.md
@@ -0,0 +1,28 @@
## 1. Configuration Defaults

- [x] 1.1 Update the scraper default configuration to set `scraper.fetcher.maxRetries` to `3`.
- [x] 1.2 Extend the scraper config schema and typing to support `scraper.abortOnFailureRate`.
- [x] 1.3 Add or update configuration tests for the new default retry count and `scraper.abortOnFailureRate` override behavior.

## 2. Crawl Failure Policy

- [x] 2.1 Add child-page attempt and failure counters to `BaseScraperStrategy` without changing root-page failure semantics.
- [x] 2.2 Implement the internal minimum sample-size check before evaluating the child-page failure-rate threshold.
- [x] 2.3 Abort the crawl with a scraper error when the configured child-page failure rate is exceeded.
- [x] 2.4 Exclude refresh deletions (`FetchStatus.NOT_FOUND`) from child-page failure-rate accounting.

## 3. Retry Behavior

- [x] 3.1 Update `HttpFetcher`-level tests to verify the new default retry budget and retained permanent-failure behavior.
- [x] 3.2 Ensure per-request retry overrides still work independently of the new default.

## 4. Strategy and Tool Integration

- [x] 4.1 Ensure the new threshold applies during normal scrape jobs even when `ignoreErrors` remains enabled for isolated child-page failures.
- [x] 4.2 Verify scrape progress and terminal error behavior remain correct when the threshold aborts a crawl.

## 5. Verification

- [x] 5.1 Add strategy tests covering root-page fail-fast behavior, below-threshold continuation, above-threshold aborts, and refresh deletion exclusions.
- [x] 5.2 Run targeted tests for config, fetcher, and scraper strategy behavior.
- [x] 5.3 Run the project lint and typecheck commands and address any failures.
2 changes: 0 additions & 2 deletions src/pipeline/PipelineWorker.test.ts
@@ -367,7 +367,6 @@ describe("PipelineWorker", () => {
depth: 1,
maxDepth: 1,
result: null, // No result for 304
deleted: false,
pageId: 123, // Page ID from refresh queue
totalDiscovered: 0,
};
@@ -387,7 +386,6 @@
mockJob,
expect.objectContaining({
result: null,
deleted: false,
pageId: 123,
}),
);
9 changes: 4 additions & 5 deletions src/scraper/fetcher/HttpFetcher.test.ts
@@ -1,6 +1,6 @@
import { beforeEach, describe, expect, it, vi } from "vitest";
import { CancellationError } from "../../pipeline/errors";
import { loadConfig } from "../../utils/config";
import { DEFAULT_CONFIG } from "../../utils/config";
import { RedirectError, ScraperError } from "../../utils/errors";

vi.mock("axios");
@@ -11,8 +11,7 @@ const mockedAxios = vi.mocked(axios, true);

import { HttpFetcher } from "./HttpFetcher";

const scraperConfig = loadConfig().scraper;
const createFetcher = () => new HttpFetcher(scraperConfig);
const createFetcher = () => new HttpFetcher(DEFAULT_CONFIG.scraper);

describe("HttpFetcher", () => {
beforeEach(() => {
@@ -143,8 +142,8 @@ describe("HttpFetcher", () => {
}),
).rejects.toThrow(ScraperError);

// Should call initial attempt + 6 retries (default SCRAPER_FETCHER_MAX_RETRIES = 6)
expect(mockedAxios.get).toHaveBeenCalledTimes(7);
// Should call initial attempt + 3 retries (default SCRAPER_FETCHER_MAX_RETRIES = 3)
expect(mockedAxios.get).toHaveBeenCalledTimes(4);
});

it("should respect custom maxRetries option", async () => {
6 changes: 4 additions & 2 deletions src/scraper/middleware/HtmlPlaywrightMiddleware.test.ts
@@ -163,10 +163,11 @@ describe("HtmlPlaywrightMiddleware", () => {
maxPages: 1000,
maxDepth: 3,
maxConcurrency: 3,
abortOnFailureRate: 0.5,
pageTimeoutMs: 5000,
browserTimeoutMs: 30000,
fetcher: {
maxRetries: 6,
maxRetries: 3,
baseDelayMs: 1000,
maxCacheItems: 200,
maxCacheItemSizeBytes: 500 * 1024,
@@ -932,10 +933,11 @@ describe("Route handling race condition protection", () => {
maxPages: 1000,
maxDepth: 3,
maxConcurrency: 3,
abortOnFailureRate: 0.5,
pageTimeoutMs: 5000,
browserTimeoutMs: 30000,
fetcher: {
maxRetries: 6,
maxRetries: 3,
baseDelayMs: 1000,
maxCacheItems: 200,
maxCacheItemSizeBytes: 500 * 1024,