Skip to content

Commit 414c1e0

Browse files
committed
docs(openspec): add fetch access controls change
1 parent 4f4550e commit 414c1e0

6 files changed

Lines changed: 496 additions & 0 deletions

File tree

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,2 @@
1+
schema: spec-driven
2+
created: 2026-03-29
Lines changed: 146 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,146 @@
1+
## Context
2+
3+
The server intentionally supports arbitrary HTTP(S) and `file://` fetching across CLI, MCP, and web-triggered scraping flows. Today, fetcher selection and local file traversal are centralized enough to apply consistent controls, but there is no policy layer that limits outbound network targets, constrains local file roots, or handles hidden paths and symlink escapes. The project already has a strong configuration model with schema-backed defaults and environment/CLI overrides, so access controls should be introduced as configuration-driven behavior rather than interface-specific flags.
4+
5+
## Goals / Non-Goals
6+
7+
**Goals:**
8+
- Add one shared outbound access policy that applies to `fetch_url` and scrape workflows regardless of entry point.
9+
- Default-deny private, loopback, link-local, and similar special-use network targets unless explicitly configured.
10+
- Add configurable local file access modes with allowed roots, `$DOCUMENTS` token expansion, and secure defaults for hidden paths and symlinks.
11+
- Produce clear user-facing errors when policy blocks access.
12+
- Preserve explicit opt-in paths for trusted local and self-hosted deployments.
13+
14+
**Non-Goals:**
15+
- Add per-user or per-token authorization rules.
16+
- Implement a sandbox or OS-level file isolation boundary.
17+
- Detect every possible cloud metadata alias or enterprise-internal hostname pattern out of the box.
18+
- Redesign the scrape tool surface or add new transport-level authentication features.
19+
20+
## Decisions
21+
22+
### Introduce a shared scraper security policy under configuration
23+
24+
Add a new `scraper.security` section to the typed config schema. This keeps policy close to the fetch/scrape subsystem and lets the existing precedence model handle config file, environment, and CLI overrides.
25+
26+
Alternative considered: a top-level `security` block. Rejected because the affected behavior is currently isolated to scraper-driven URL and file access, and placing the settings under `scraper` avoids broadening the configuration surface prematurely.
27+
28+
### Enforce policy centrally before network or file reads
29+
30+
Introduce a reusable access-policy helper that validates URLs and resolved file paths before `AutoDetectFetcher`, `HttpFetcher`, `FileFetcher`, and local crawl entry points perform I/O. DNS-resolved IP revalidation and redirect-target validation will be mandatory enforcement behavior, not optional configuration knobs. For archive-member URLs such as `file:///path/to/archive.zip/docs/readme.md`, policy evaluation will be based on the resolved backing archive file path plus the virtual entry path, not on the nonexistent combined filesystem path.
31+
32+
Alternative considered: only validating at the MCP tool boundary. Rejected because CLI and web-triggered scraping must behave consistently, and lower-level enforcement prevents bypass through internal call paths.
33+
34+
### Apply network policy to browser-initiated subrequests
35+
36+
Treat browser-based fetching as a series of outbound requests rather than a single navigation. The same network policy must apply to the initial page load, redirects, frames, iframes, scripts, stylesheets, images, XHR/fetch calls, and any other subresource request initiated during rendering so that a public page cannot pivot the renderer into private network access.
37+
38+
Alternative considered: only validating the initial browser navigation target. Rejected because it leaves a clear SSRF-style gap for render-time secondary requests.
39+
40+
### Default-deny special network ranges while keeping explicit opt-in overrides
41+
42+
Model network protection with one main switch, `allowPrivateNetworks`, that defaults to `false` and covers loopback, RFC1918 private ranges, link-local addresses, and equivalent IPv6 special-use ranges. Add `allowedHosts` and `allowedCidrs` so trusted deployments can permit only the specific internal targets they need while the broad default-deny behavior remains in place.
43+
44+
When `allowPrivateNetworks` is `false`, an empty `allowedHosts` and `allowedCidrs` configuration means no private or special-use network targets are allowed. Public internet targets remain allowed unless blocked by some other rule. When `allowPrivateNetworks` is `true`, all network targets become eligible for access, subject to the same redirect and DNS resolution checks.
45+
46+
`allowedHosts` and `allowedCidrs` serve different purposes. `allowedHosts` grants a hostname-bound exception for requests explicitly targeting that host, even when it resolves to private or special-use addresses. `allowedCidrs` grants an address-bound exception for direct IP targets or any hostname whose resolved connection address falls within an allowed subnet. Redirects and secondary requests are evaluated independently against the same rules, so a request allowed via `allowedHosts` does not automatically authorize a different hostname or a direct IP target unless that new target is also allowed.
47+
48+
Example default-secure network configuration:
49+
50+
```yaml
51+
scraper:
52+
security:
53+
network:
54+
allowPrivateNetworks: false
55+
allowedHosts: []
56+
allowedCidrs: []
57+
```
58+
59+
Example selective internal allowlist:
60+
61+
```yaml
62+
scraper:
63+
security:
64+
network:
65+
allowPrivateNetworks: false
66+
allowedHosts:
67+
- docs.internal.example
68+
allowedCidrs:
69+
- 10.42.0.0/16
70+
```
71+
72+
Alternative considered: permissive defaults with warnings only. Rejected because the objective of this change is meaningful hardening for exposed deployments, and warnings do not reduce actual risk.
73+
74+
Alternative considered: separate booleans such as `allowLoopback` and `allowLinkLocal`. Rejected because they create overlapping policy states and make it harder to explain how exceptions should be configured.
75+
76+
### Add file access modes with allowlisted roots and secure traversal defaults
77+
78+
Model local file access as `disabled`, `allowedRoots`, or `unrestricted`. In `allowedRoots` mode, expand configured tokens such as `$DOCUMENTS`, canonicalize paths, and require the effective target to remain inside an allowed root. Set `followSymlinks` and `includeHidden` to `false` by default. Hidden paths will be blocked even when explicitly named unless the user opts in.
79+
80+
When `fileAccess.mode` is `allowedRoots`, an empty `allowedRoots` list means all user-requested `file://` access is denied. This does not block internally managed temporary archive handoff for accepted web archive roots.
81+
82+
If `$DOCUMENTS` cannot be resolved on the current platform or runtime account, it will expand to no path and therefore grant no access by itself. Deployments that require local file access in containers or service accounts must configure explicit absolute paths.
83+
84+
Example default file policy:
85+
86+
```yaml
87+
scraper:
88+
security:
89+
fileAccess:
90+
mode: allowedRoots
91+
allowedRoots:
92+
- $DOCUMENTS
93+
followSymlinks: false
94+
includeHidden: false
95+
```
96+
97+
Example fully disabled local file access:
98+
99+
```yaml
100+
scraper:
101+
security:
102+
fileAccess:
103+
mode: disabled
104+
allowedRoots: []
105+
followSymlinks: false
106+
includeHidden: false
107+
```
108+
109+
Alternative considered: denylisting sensitive directories under `$HOME`. Rejected because denylisting is incomplete and easier to bypass than root allowlisting.
110+
111+
### Preserve internal temporary archive handoff
112+
113+
Treat temporary files created by the application itself for supported web archive scraping as internal implementation artifacts rather than user-requested `file://` access. Policy enforcement will still apply to the original network URL, but once a web archive root has been accepted and downloaded, the handoff into local archive processing must not be blocked merely because the temp file lives outside configured user roots such as `$DOCUMENTS`. This exception is limited to the downloaded archive artifact produced for that accepted request and virtual members read from that archive during the same processing flow.
114+
115+
Alternative considered: requiring the temp directory to be added to default allowed roots. Rejected because it couples user-facing file policy to OS temp locations and would either over-broaden local access or break existing archive behavior.
116+
117+
### Use `$DOCUMENTS` as a convenience token, not a policy special case
118+
119+
Support `$DOCUMENTS` expansion during config resolution or policy evaluation, but enforce access against concrete absolute paths after expansion. This keeps the user experience simple without coupling enforcement logic to platform-specific home-directory conventions.
120+
121+
Alternative considered: using `$HOME` as a default root. Rejected because it is too broad and includes common secret locations such as `.ssh`, `.aws`, and dot-config directories.
122+
123+
## Risks / Trade-offs
124+
125+
- [Behavior break for existing internal-doc users] -> Provide explicit `allowedHosts` and `allowedCidrs` overrides and document migration clearly.
126+
- [Regression in supported web archive scraping] -> Exempt internally managed temp archive handoff from user file-root checks while preserving network policy on the original URL.
127+
- [DNS resolution and redirect validation complexity] -> Keep the first implementation limited to HTTP(S) plus redirect revalidation and well-defined CIDR checks, covered by focused tests.
128+
- [Browser subrequest enforcement complexity] -> Intercept all Playwright requests centrally and apply the same resolver/policy helper used by non-browser fetchers.
129+
- [Platform-specific `$DOCUMENTS` resolution differences] -> Fall back to explicit paths when the token cannot be resolved reliably.
130+
- [Confusion around hidden path semantics] -> Treat `includeHidden: false` as a hard deny for both direct fetch and traversal, and document this explicitly.
131+
- [Symlink handling may surprise users with linked doc folders] -> Keep an opt-in `followSymlinks` flag and validate the resolved target stays within an allowed root.
132+
133+
## Migration Plan
134+
135+
1. Add the new config schema and defaults.
136+
2. Implement shared access-policy utilities and wire them into fetchers and local file traversal.
137+
3. Preserve archive workflows by validating virtual archive paths against the concrete archive file and by exempting internal temp archive handoff from user file-root checks.
138+
4. Update tests to cover blocked and allowed cases for top-level and browser-initiated network requests, hidden paths, symlinks, and archive workflows.
139+
5. Update user-facing documentation to explain defaults, array/env encoding, token expansion, and override patterns.
140+
6. Release with changelog notes highlighting that private-network access is now blocked by default and file access is restricted by policy.
141+
142+
Rollback consists of removing the policy enforcement layer and reverting to the previous permissive defaults, but the preferred operational mitigation is to loosen the new config values rather than removing the feature.
143+
144+
## Open Questions
145+
146+
- Do we want CLI flags for temporary overrides in addition to config and environment variables, or is config-only sufficient for the first iteration?
Lines changed: 26 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,26 @@
1+
## Why
2+
3+
The product intentionally fetches arbitrary web and local content, but today it does so without configurable network or file access restrictions. That is acceptable for trusted local workflows, yet it leaves shared or internet-exposed deployments without a first-class way to reduce SSRF-style risk, sensitive local file access, or accidental access to hidden content.
4+
5+
## What Changes
6+
7+
- Add configurable outbound access controls for HTTP(S) fetches used by `fetch_url` and scraping workflows.
8+
- Default-deny access to private, loopback, link-local, and other special-use network targets unless explicitly allowed via host or CIDR allowlists.
9+
- Add configurable local file access modes, with allowed root directories, `$DOCUMENTS` token expansion, and secure defaults for symlinks and hidden paths.
10+
- Preserve supported archive workflows by distinguishing user-requested local file access from internally managed temporary files and archive member resolution.
11+
- Enforce the same access policy across CLI, MCP, and web-triggered fetch/scrape operations.
12+
- Return clear validation errors when a URL or file path is blocked by policy, including guidance for enabling access.
13+
14+
## Capabilities
15+
16+
### New Capabilities
17+
- `outbound-access-control`: Define and enforce configurable network and local file access restrictions for one-shot fetches and scraping workflows.
18+
19+
### Modified Capabilities
20+
- `configuration`: Add scraper security configuration for network restrictions, file access modes, allowed roots, symlink handling, and hidden path handling.
21+
22+
## Impact
23+
24+
- Affected code: scraper config schema and defaults, URL/file fetcher entry points, local file traversal, archive handoff paths, fetch and scrape tools, and related tests.
25+
- Affected interfaces: CLI, MCP, and web-backed scraping behavior will share the same enforcement rules.
26+
- Documentation impact: configuration and security docs will need updates to explain the new defaults and opt-in overrides.
Lines changed: 115 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,115 @@
1+
## MODIFIED Requirements
2+
3+
### Requirement: Generic Environment Variable Overrides
4+
The system SHALL support overriding any configuration setting via environment variables using a predictable naming convention. Environment-derived configuration values SHALL be normalized by trimming surrounding whitespace and stripping matching surrounding single or double quotes before schema parsing.
5+
6+
#### Scenario: Environment Variable Naming Convention
7+
- **GIVEN** a config path `scraper.document.maxSize`
8+
- **WHEN** the environment variable `DOCS_MCP_SCRAPER_DOCUMENT_MAX_SIZE` is set
9+
- **THEN** its value SHALL override the config file and default values
10+
11+
#### Scenario: Quoted configuration override from Docker Compose
12+
- **GIVEN** the environment variable `DOCS_MCP_EMBEDDING_MODEL` is provided as `"openai:nomic-embed-text"`
13+
- **WHEN** the configuration is loaded
14+
- **THEN** the resulting `app.embeddingModel` value SHALL be `openai:nomic-embed-text`
15+
16+
#### Scenario: Whitespace-padded configuration override
17+
- **GIVEN** the environment variable `DOCS_MCP_SCRAPER_MAX_PAGES` is provided as ` "500" `
18+
- **WHEN** the configuration is loaded
19+
- **THEN** the resulting `scraper.maxPages` value SHALL be parsed as `500`
20+
21+
#### Scenario: CamelCase to Upper Snake Case Conversion
22+
- **GIVEN** a config path segment `maxNestingDepth`
23+
- **WHEN** converted to environment variable format
24+
- **THEN** it SHALL become `MAX_NESTING_DEPTH`
25+
26+
#### Scenario: Deeply Nested Path Conversion
27+
- **GIVEN** a config path `splitter.json.maxNestingDepth`
28+
- **WHEN** converted to environment variable name
29+
- **THEN** it SHALL become `DOCS_MCP_SPLITTER_JSON_MAX_NESTING_DEPTH`
30+
31+
#### Scenario: Nested security override naming
32+
- **GIVEN** a config path `scraper.security.fileAccess.followSymlinks`
33+
- **WHEN** converted to environment variable name
34+
- **THEN** it SHALL become `DOCS_MCP_SCRAPER_SECURITY_FILE_ACCESS_FOLLOW_SYMLINKS`
35+
36+
#### Scenario: Array override provided as JSON
37+
- **GIVEN** the environment variable `DOCS_MCP_SCRAPER_SECURITY_NETWORK_ALLOWED_HOSTS` is provided as `["docs.internal.example","wiki.corp.local"]`
38+
- **WHEN** the configuration is loaded
39+
- **THEN** the resulting `scraper.security.network.allowedHosts` value SHALL be parsed as a string array
40+
41+
#### Scenario: Array override provided as inline array string
42+
- **GIVEN** the environment variable `DOCS_MCP_SCRAPER_SECURITY_FILE_ACCESS_ALLOWED_ROOTS` is provided as `["$DOCUMENTS", "/srv/docs"]`
43+
- **WHEN** the configuration is loaded
44+
- **THEN** the resulting `scraper.security.fileAccess.allowedRoots` value SHALL be parsed as a string array
45+
46+
### Requirement: Scraper Security Configuration
47+
The system SHALL expose configuration for outbound network and local file access controls under `scraper.security`.
48+
49+
#### Scenario: Private network access defaults to disabled
50+
- **GIVEN** no custom security configuration is provided
51+
- **WHEN** the configuration is loaded
52+
- **THEN** private network access for scraper-driven HTTP(S) fetches SHALL default to disabled
53+
54+
#### Scenario: Host allowlist can permit a specific internal hostname
55+
- **GIVEN** `scraper.security.network.allowPrivateNetworks` is `false`
56+
- **AND** `scraper.security.network.allowedHosts` contains `docs.internal.example`
57+
- **WHEN** the configuration is loaded
58+
- **THEN** requests explicitly targeting `docs.internal.example` SHALL be eligible for access despite the default private-network restriction
59+
- **AND** direct IP requests SHALL still require `allowedCidrs` or `allowPrivateNetworks`
60+
61+
#### Scenario: CIDR allowlist can permit a specific internal subnet
62+
- **GIVEN** `scraper.security.network.allowPrivateNetworks` is `false`
63+
- **AND** `scraper.security.network.allowedCidrs` contains `10.42.0.0/16`
64+
- **WHEN** the configuration is loaded
65+
- **THEN** requests whose resolved address falls within `10.42.0.0/16` SHALL be eligible for access despite the default private-network restriction
66+
67+
#### Scenario: Empty network allowlists do not permit private targets
68+
- **GIVEN** `scraper.security.network.allowPrivateNetworks` is `false`
69+
- **AND** `scraper.security.network.allowedHosts` is empty
70+
- **AND** `scraper.security.network.allowedCidrs` is empty
71+
- **WHEN** the configuration is loaded
72+
- **THEN** no private or special-use network targets SHALL be permitted by configuration
73+
74+
#### Scenario: Private network override enables broad internal access
75+
- **GIVEN** `scraper.security.network.allowPrivateNetworks` is `true`
76+
- **WHEN** the configuration is loaded
77+
- **THEN** private, loopback, link-local, and other special-use network targets SHALL be eligible for access
78+
79+
#### Scenario: File access mode defaults to allowed roots
80+
- **GIVEN** no custom security configuration is provided
81+
- **WHEN** the configuration is loaded
82+
- **THEN** local file access mode SHALL default to `allowedRoots`
83+
84+
#### Scenario: Documents root is configured by default
85+
- **GIVEN** no custom security configuration is provided
86+
- **WHEN** the configuration is loaded
87+
- **THEN** the allowed file roots SHALL include `$DOCUMENTS`
88+
89+
#### Scenario: Internal archive handoff does not require temp root configuration
90+
- **GIVEN** no custom security configuration is provided
91+
- **WHEN** the system processes a supported web archive root via an internal temporary file
92+
- **THEN** the default file access configuration SHALL not require the OS temporary directory to be added to `allowedRoots`
93+
94+
#### Scenario: Empty file root allowlist denies user-requested file access
95+
- **GIVEN** `scraper.security.fileAccess.mode` is `allowedRoots`
96+
- **AND** `scraper.security.fileAccess.allowedRoots` is empty
97+
- **WHEN** the configuration is loaded
98+
- **THEN** user-requested `file://` access SHALL not be permitted by configuration
99+
100+
#### Scenario: Unresolvable documents token grants no access
101+
- **GIVEN** `scraper.security.fileAccess.allowedRoots` contains `$DOCUMENTS`
102+
- **AND** the runtime cannot resolve a documents directory for the current platform or account
103+
- **WHEN** the configuration is loaded and file access policy is evaluated
104+
- **THEN** `$DOCUMENTS` SHALL not expand to an implicit fallback path
105+
- **AND** no access SHALL be granted from that token alone
106+
107+
#### Scenario: Hidden path access defaults to disabled
108+
- **GIVEN** no custom security configuration is provided
109+
- **WHEN** the configuration is loaded
110+
- **THEN** hidden file and directory access SHALL default to disabled
111+
112+
#### Scenario: Symlink following defaults to disabled
113+
- **GIVEN** no custom security configuration is provided
114+
- **WHEN** the configuration is loaded
115+
- **THEN** symlink following for local file access SHALL default to disabled

0 commit comments

Comments
 (0)