Commit 117c536

docs(openspec): add fail-fast scrape thresholds change
schema: spec-driven
created: 2026-03-29
## Context

The current scraper has two behaviors that combine poorly on unhealthy targets: HTTP fetches retry up to six times by default, and crawl processing ignores child-page errors when `ignoreErrors` is enabled (the default). This gives good resilience for isolated transient failures, but it also allows a scrape to continue deep into a target that is broadly unhealthy due to authentication walls, anti-bot challenges, or persistent server-side failures.

The change needs to preserve existing fail-fast behavior for root URL failures, reduce wasted time on repeated child-page failures, and keep the configuration surface small. The user preference is to avoid exposing many related knobs.

## Goals / Non-Goals

**Goals:**

- Reduce default per-page retry cost by lowering the HTTP retry default from 6 to 3.
- Add a target-level abort policy that stops a crawl when child-page failures exceed a configured failure rate.
- Keep the configuration surface minimal by exposing a single user-facing threshold.
- Exclude expected refresh deletion handling from the threshold so refresh cleanup remains safe.
- Preserve the existing semantics that a root page failure aborts the job immediately.

**Non-Goals:**

- Add both rate and count thresholds as separate public settings.
- Introduce a time-windowed circuit breaker, cooldowns, or half-open recovery logic.
- Change which HTTP status codes are treated as retryable.
- Rework `ignoreErrors` into a broader policy system.

## Decisions

### Use a failure-rate threshold instead of a count threshold

The scraper will expose a single configuration key, `scraper.abortOnFailureRate`, representing the maximum tolerated child-page failure rate before aborting the crawl.

Rationale:

- A rate scales better across small and large crawls than a fixed count.
- One threshold keeps the configuration surface small.
- The failure-rate model aligns with the user's fail-fast goal without requiring per-target tuning for crawl size.

Alternatives considered:

- Count-only threshold: simpler on paper, but it behaves inconsistently across crawl sizes.
- Exposing both rate and count: more flexible, but adds configuration complexity the user explicitly wants to avoid.
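The scaling argument above can be made concrete with a small sketch. The function name, the `0.5` threshold, and the crawl sizes are illustrative only, not values from this change:

```typescript
// Hypothetical illustration: how many child-page failures a crawl tolerates
// under a rate threshold, compared with a fixed count threshold.
function failuresTolerated(attempts: number, rateThreshold: number): number {
  // Largest failure count with failures / attempts still <= threshold.
  return Math.floor(attempts * rateThreshold);
}

const rate = 0.5; // illustrative threshold, not the shipped default
console.log(failuresTolerated(20, rate)); // small crawl: 10 failures tolerated
console.log(failuresTolerated(2000, rate)); // large crawl: 1000 failures tolerated
// A fixed count (say, 10) would be proportionate for the 20-page crawl
// but far too trigger-happy for the 2000-page crawl.
```

A count-only threshold would have to be retuned per target size to get equivalent behavior, which is exactly the per-target tuning the decision avoids.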

### Use an internal minimum sample size before evaluating the threshold

The scraper will only evaluate the failure-rate threshold after an internal minimum number of child-page attempts have completed. The initial design uses a constant of 10 child-page attempts.

Rationale:

- Prevents early aborts from one or two isolated failures near the start of a crawl.
- Keeps the public API simple while still making rate-based aborts practical.

Alternatives considered:

- No minimum sample: too sensitive for small crawls.
- Configurable minimum sample: useful but adds another knob for marginal benefit.
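The sampling rule can be sketched as follows. The class shape and method names are hypothetical stand-ins, not the actual `BaseScraperStrategy` API; only the constant of 10 attempts and the rate comparison come from the design:

```typescript
// Hypothetical sketch of threshold evaluation with a minimum sample size.
const MIN_CHILD_SAMPLE = 10; // internal constant from the design, not configurable

class FailureRateTracker {
  private attempts = 0;
  private failures = 0;

  recordAttempt(failed: boolean): void {
    this.attempts += 1;
    if (failed) this.failures += 1;
  }

  // True once the crawl should abort: evaluated only after the minimum
  // sample, and only when the observed rate strictly exceeds the threshold.
  shouldAbort(abortOnFailureRate: number): boolean {
    if (this.attempts < MIN_CHILD_SAMPLE) return false;
    return this.failures / this.attempts > abortOnFailureRate;
  }
}

const tracker = new FailureRateTracker();
// Two early failures alone never abort: the sample is still too small.
tracker.recordAttempt(true);
tracker.recordAttempt(true);
console.log(tracker.shouldAbort(0.1)); // false: only 2 of 10 required attempts seen
```

Note the strict `>` comparison: a crawl exactly at the threshold continues, which matches the "at or below threshold continues the crawl" scenario later in this change.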

### Count only terminal child-page processing failures toward the threshold

The strategy layer will increment failure counters only when a child page fails after the retry policy has already been exhausted and the page processing path throws. Root-page failures remain immediate hard failures and do not flow through threshold logic.

Rationale:

- Keeps responsibility boundaries clear: the fetcher handles per-request retries, the strategy handles crawl policy.
- Measures true page-level failures rather than transient sub-attempts.

Alternatives considered:

- Counting every retry attempt: too noisy and would over-penalize transient outages.
- Counting pipeline warnings or empty content as failures: would blur the difference between degraded content and hard failure.

### Exclude refresh deletions from failure accounting

Pages that resolve to `FetchStatus.NOT_FOUND` during refresh mode will continue to be treated as expected deletions and will not count toward the failure-rate threshold.

Rationale:

- Refresh cleanup is expected maintenance behavior, not a target health failure.
- Prevents valid refresh jobs from aborting when many stale pages have been removed upstream.

### Apply the threshold even when `ignoreErrors` is true

`ignoreErrors` will continue to suppress isolated child-page failures, but once the configured failure-rate threshold is exceeded after the minimum sample size, the crawl will abort.

Rationale:

- Preserves resilience for a few bad pages.
- Prevents `ignoreErrors` from masking a target that is broadly broken.

## Risks / Trade-offs

- Small crawls may still complete despite high failure ratios. Mitigation: the minimum sample size intentionally favors tolerance for tiny crawls to avoid noisy aborts.
- Some borderline unhealthy sites may abort sooner than today. Mitigation: this is expected and aligned with fail-fast behavior; documentation and tests should make the policy explicit.
- A single threshold may not fit every deployment. Mitigation: start with one config key and a documented default; add more tuning only if real usage shows a clear need.
- Behavior changes for existing scrape jobs. Mitigation: keep root failure semantics unchanged and limit the change to child-page threshold behavior plus reduced retries.

## Migration Plan

1. Update scraper config defaults and schema to lower `scraper.fetcher.maxRetries` to `3` and add `scraper.abortOnFailureRate`.
2. Implement child-page attempt and failure accounting in `BaseScraperStrategy`.
3. Abort the crawl when the failure rate exceeds the configured threshold after the internal minimum sample size.
4. Add or update tests for fetcher retry defaults, threshold-triggered aborts, root-page failures, and refresh deletions.
5. Document the new default behavior through specs and config-facing tests.

Rollback strategy:

- Restore the previous retry default and remove the threshold checks if production behavior proves too aggressive.

## Open Questions

- What default value should `scraper.abortOnFailureRate` ship with? The implementation should pick a conservative default that tolerates a few isolated failures while still aborting clearly unhealthy crawls.
## Why

The scraper currently combines a high default HTTP retry count with a crawl policy that, by default, ignores non-root page failures. This can make unhealthy targets look more successful than they are and waste time on permanently broken sites. We need a simpler fail-fast policy so scrape jobs stop earlier when a target is broadly unhealthy, while still tolerating a small number of transient page failures.

## What Changes

- Reduce the default HTTP fetch retry count from 6 to 3.
- Add a target-level scrape failure-rate threshold that aborts a crawl when too many child pages fail after a minimum number of attempted pages.
- Keep root page failures as immediate job failures.
- Exclude expected refresh deletions (`404`/`NOT_FOUND`) from the failure-rate threshold so refresh cleanup does not trip the breaker.
- Surface the new threshold through configuration with a documented default.
- Add specification scenarios covering retry defaults, threshold evaluation, ignored deletion cases, and root-page fail-fast behavior.

## Capabilities

### New Capabilities

- `scrape-failure-policy`: Defines retry defaults and fail-fast target abort behavior for unhealthy scrape targets.

### Modified Capabilities

- `configuration`: Add configuration support for the scrape failure-rate threshold and the updated default fetch retry count.

## Impact

- Affected code: `src/utils/config.ts`, `src/scraper/fetcher/HttpFetcher.ts`, `src/scraper/strategies/BaseScraperStrategy.ts`, `src/tools/ScrapeTool.ts`, and related tests.
- Affected behavior: scrape jobs will stop earlier on broadly failing targets and will spend less time retrying transient fetch failures.
- Affected interfaces: configuration defaults and config schema for scraper settings.
## ADDED Requirements

### Requirement: Scrape Failure Rate Threshold Configuration

The configuration system SHALL expose `scraper.abortOnFailureRate` as the child-page failure-rate threshold used to abort unhealthy scrape targets.

#### Scenario: Config file sets scrape failure rate threshold

- **WHEN** the configuration file sets `scraper.abortOnFailureRate` to a numeric value
- **THEN** the scraper SHALL use that value as the child-page failure-rate threshold

#### Scenario: Environment variable overrides scrape failure rate threshold

- **WHEN** `DOCS_MCP_SCRAPER_ABORT_ON_FAILURE_RATE` is set
- **THEN** the environment value SHALL override config file and default values for `scraper.abortOnFailureRate`

### Requirement: Updated Default HTTP Retry Configuration

The configuration system SHALL default `scraper.fetcher.maxRetries` to `3`.

#### Scenario: Default fetch retry count is three

- **WHEN** no explicit configuration for `scraper.fetcher.maxRetries` is provided
- **THEN** the loaded configuration SHALL set `scraper.fetcher.maxRetries` to `3`

#### Scenario: Explicit fetch retry override still wins

- **WHEN** the user provides an explicit value for `scraper.fetcher.maxRetries`
- **THEN** the loaded configuration SHALL use the explicit value instead of the default
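The precedence these scenarios describe (built-in default, overridden by config file, overridden by environment variable) can be sketched as below. The loader function, the flat config shape, and the placeholder default rate are illustrative, not the project's actual config loader; only the environment variable name, the setting keys, and the retry default of `3` come from this change:

```typescript
// Hypothetical sketch of config precedence: default < config file < env var.
interface ScraperConfig {
  abortOnFailureRate: number;
  fetcherMaxRetries: number;
}

const DEFAULTS: ScraperConfig = {
  abortOnFailureRate: 0.5, // placeholder only; the real default is an open question
  fetcherMaxRetries: 3, // new default per this change (was 6)
};

function loadScraperConfig(
  fileConfig: Partial<ScraperConfig>,
  env: Record<string, string | undefined>,
): ScraperConfig {
  const merged: ScraperConfig = { ...DEFAULTS, ...fileConfig };
  const envRate = env["DOCS_MCP_SCRAPER_ABORT_ON_FAILURE_RATE"];
  if (envRate !== undefined) {
    merged.abortOnFailureRate = Number(envRate); // env wins over file and default
  }
  return merged;
}
```

With this layering, an operator can tune the threshold per deployment without touching checked-in config files, while unset environments fall back cleanly.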
## ADDED Requirements

### Requirement: Bounded HTTP Fetch Retries

The scraper SHALL default HTTP fetch retries to 3 retries per page request, in addition to the initial attempt.

#### Scenario: Default retry budget uses four total attempts

- **WHEN** an HTTP page fetch repeatedly fails with a retryable error and no per-request override is provided
- **THEN** the fetcher SHALL stop after the initial attempt plus 3 retries

#### Scenario: Permanent HTTP failures do not use retry budget

- **WHEN** an HTTP page fetch fails with a non-retryable permanent error
- **THEN** the fetcher SHALL fail the page without consuming additional retries
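The attempt budget in this requirement can be sketched as follows. The fetch callback, the error shape, and the `isRetryable` flag are hypothetical stand-ins for the real `HttpFetcher` internals; the point is that `maxRetries = 3` means four total attempts and that permanent errors bypass the budget:

```typescript
// Hypothetical sketch: maxRetries = 3 means 1 initial attempt + 3 retries.
interface FetchError extends Error {
  isRetryable: boolean;
}

async function fetchWithRetries<T>(
  attempt: () => Promise<T>,
  maxRetries = 3, // new default per this change
): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i <= maxRetries; i++) {
    try {
      return await attempt();
    } catch (err) {
      lastError = err;
      // Permanent failures surface immediately without consuming retries.
      if (!(err as FetchError).isRetryable) throw err;
    }
  }
  throw lastError;
}
```

Under this sketch, a persistently failing retryable fetch is attempted four times in total, while a non-retryable error (for example, a permanent HTTP status) fails the page after a single attempt.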

### Requirement: Root Page Failures Abort Immediately

The scraper SHALL fail the scrape job immediately when the root page cannot be processed successfully.

#### Scenario: Root page fetch failure aborts the job

- **WHEN** the initial root page fails during fetch or processing
- **THEN** the scrape job SHALL terminate with an error

#### Scenario: Root page failure bypasses child failure threshold logic

- **WHEN** the initial root page fails before any child pages are attempted
- **THEN** the scraper SHALL abort immediately rather than evaluating the child-page failure threshold

### Requirement: Child Page Failure Rate Aborts Unhealthy Targets

The scraper SHALL track child-page attempts and terminal child-page failures, and SHALL abort the crawl when the child-page failure rate exceeds the configured threshold after the minimum evaluation sample has been reached.

#### Scenario: Isolated child-page failures do not abort before minimum sample

- **WHEN** a crawl encounters child-page failures before the minimum evaluation sample has been reached
- **THEN** the scraper SHALL continue crawling using normal `ignoreErrors` behavior

#### Scenario: Failure rate above threshold aborts the crawl

- **WHEN** the minimum child-page sample has been reached and the observed child-page failure rate exceeds the configured threshold
- **THEN** the scraper SHALL terminate the crawl with an error indicating that the target exceeded the allowed failure rate

#### Scenario: Failure rate at or below threshold continues the crawl

- **WHEN** the minimum child-page sample has been reached and the observed child-page failure rate is at or below the configured threshold
- **THEN** the scraper SHALL continue crawling remaining in-scope pages

### Requirement: Refresh Deletions Do Not Count As Failures

The scraper SHALL exclude expected page deletions detected during refresh from child-page failure-rate accounting.

#### Scenario: Refresh deletion does not increase failure rate

- **WHEN** a refresh crawl encounters a child page that returns `NOT_FOUND`
- **THEN** the scraper SHALL mark the page as deleted without incrementing the child-page failure counter

#### Scenario: Refresh cleanup alone cannot trip the threshold

- **WHEN** a refresh crawl processes multiple deleted child pages and no terminal child-page processing failures occur
- **THEN** the scraper SHALL not abort due to the child-page failure threshold
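The refresh-deletion exclusion amounts to a small classification step when a child page finishes. In the sketch below, only `FetchStatus.NOT_FOUND` comes from the source; the other enum members, the function, and the treatment of non-refresh `NOT_FOUND` pages as terminal failures are assumptions for illustration:

```typescript
// Hypothetical sketch: decide whether a finished child page counts toward
// the abort threshold.
enum FetchStatus {
  SUCCESS = "SUCCESS", // assumed member
  NOT_FOUND = "NOT_FOUND", // named in the design
  ERROR = "ERROR", // assumed member: terminal processing failure
}

function countsAsThresholdFailure(status: FetchStatus, isRefresh: boolean): boolean {
  if (status === FetchStatus.SUCCESS) return false;
  // Expected refresh deletions are maintenance, not target-health failures.
  if (isRefresh && status === FetchStatus.NOT_FOUND) return false;
  // Remaining terminal failures count toward the abort threshold.
  return true;
}
```

Because refresh deletions never increment the failure counter, a refresh job that cleans up many upstream-removed pages cannot trip the threshold on its own, matching the final scenario above.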
## 1. Configuration Defaults

- [ ] 1.1 Update the scraper default configuration to set `scraper.fetcher.maxRetries` to `3`.
- [ ] 1.2 Extend the scraper config schema and typing to support `scraper.abortOnFailureRate`.
- [ ] 1.3 Add or update configuration tests for the new default retry count and `scraper.abortOnFailureRate` override behavior.

## 2. Crawl Failure Policy

- [ ] 2.1 Add child-page attempt and failure counters to `BaseScraperStrategy` without changing root-page failure semantics.
- [ ] 2.2 Implement the internal minimum sample-size check before evaluating the child-page failure-rate threshold.
- [ ] 2.3 Abort the crawl with a scraper error when the configured child-page failure rate is exceeded.
- [ ] 2.4 Exclude refresh deletions (`FetchStatus.NOT_FOUND`) from child-page failure-rate accounting.

## 3. Retry Behavior

- [ ] 3.1 Update `HttpFetcher`-level tests to verify the new default retry budget and retained permanent-failure behavior.
- [ ] 3.2 Ensure per-request retry overrides still work independently of the new default.

## 4. Strategy and Tool Integration

- [ ] 4.1 Ensure the new threshold applies during normal scrape jobs even when `ignoreErrors` remains enabled for isolated child-page failures.
- [ ] 4.2 Verify scrape progress and terminal error behavior remain correct when the threshold aborts a crawl.

## 5. Verification

- [ ] 5.1 Add strategy tests covering root-page fail-fast behavior, below-threshold continuation, above-threshold aborts, and refresh deletion exclusions.
- [ ] 5.2 Run targeted tests for config, fetcher, and scraper strategy behavior.
- [ ] 5.3 Run the project lint and typecheck commands and address any failures.
