|
| 1 | +## Context |
| 2 | + |
| 3 | +The server intentionally supports arbitrary HTTP(S) and `file://` fetching across CLI, MCP, and web-triggered scraping flows. Today, fetcher selection and local file traversal are centralized enough to apply consistent controls, but there is no policy layer that limits outbound network targets, constrains local file roots, or handles hidden paths and symlink escapes. The project already has a strong configuration model with schema-backed defaults and environment/CLI overrides, so access controls should be introduced as configuration-driven behavior rather than interface-specific flags. |
| 4 | + |
| 5 | +## Goals / Non-Goals |
| 6 | + |
| 7 | +**Goals:** |
| 8 | +- Add one shared outbound access policy that applies to `fetch_url` and scrape workflows regardless of entry point. |
| 9 | +- Default-deny private, loopback, link-local, and similar special-use network targets unless explicitly configured. |
| 10 | +- Add configurable local file access modes with allowed roots, `$DOCUMENTS` token expansion, and secure defaults for hidden paths and symlinks. |
| 11 | +- Produce clear user-facing errors when policy blocks access. |
| 12 | +- Preserve explicit opt-in paths for trusted local and self-hosted deployments. |
| 13 | + |
| 14 | +**Non-Goals:** |
| 15 | +- Add per-user or per-token authorization rules. |
| 16 | +- Implement a sandbox or OS-level file isolation boundary. |
| 17 | +- Detect every possible cloud metadata alias or enterprise-internal hostname pattern out of the box. |
| 18 | +- Redesign the scrape tool surface or add new transport-level authentication features. |
| 19 | + |
| 20 | +## Decisions |
| 21 | + |
| 22 | +### Introduce a shared scraper security policy under configuration |
| 23 | + |
| 24 | +Add a new `scraper.security` section to the typed config schema. This keeps policy close to the fetch/scrape subsystem and lets the existing precedence model handle config file, environment, and CLI overrides. |
| 25 | + |
| 26 | +Alternative considered: a top-level `security` block. Rejected because the affected behavior is currently isolated to scraper-driven URL and file access, and placing the settings under `scraper` avoids broadening the configuration surface prematurely. |
| 27 | + |
| 28 | +### Enforce policy centrally before network or file reads |
| 29 | + |
| 30 | +Introduce a reusable access-policy helper that validates URLs and resolved file paths before `AutoDetectFetcher`, `HttpFetcher`, `FileFetcher`, and local crawl entry points perform I/O. DNS-resolved IP revalidation and redirect-target validation will be mandatory enforcement behavior, not optional configuration knobs. For archive-member URLs such as `file:///path/to/archive.zip/docs/readme.md`, policy evaluation will be based on the resolved backing archive file path plus the virtual entry path, not on the nonexistent combined filesystem path. |
| 31 | + |
| 32 | +Alternative considered: only validating at the MCP tool boundary. Rejected because CLI and web-triggered scraping must behave consistently, and lower-level enforcement prevents bypass through internal call paths. |
| 33 | + |
| 34 | +### Apply network policy to browser-initiated subrequests |
| 35 | + |
| 36 | +Treat browser-based fetching as a series of outbound requests rather than a single navigation. The same network policy must apply to the initial page load, redirects, frames, iframes, scripts, stylesheets, images, XHR/fetch calls, and any other subresource request initiated during rendering so that a public page cannot pivot the renderer into private network access. |
| 37 | + |
| 38 | +Alternative considered: only validating the initial browser navigation target. Rejected because it leaves a clear SSRF-style gap for render-time secondary requests. |
| 39 | + |
| 40 | +### Default-deny special network ranges while keeping explicit opt-in overrides |
| 41 | + |
| 42 | +Model network protection with one main switch, `allowPrivateNetworks`, that defaults to `false` and covers loopback, RFC1918 private ranges, link-local addresses, and equivalent IPv6 special-use ranges. Add `allowedHosts` and `allowedCidrs` so trusted deployments can permit only the specific internal targets they need while the broad default-deny behavior remains in place. |
| 43 | + |
| 44 | +When `allowPrivateNetworks` is `false`, an empty `allowedHosts` and `allowedCidrs` configuration means no private or special-use network targets are allowed. Public internet targets remain allowed unless blocked by some other rule. When `allowPrivateNetworks` is `true`, all network targets become eligible for access, subject to the same redirect and DNS resolution checks. |
| 45 | + |
| 46 | +`allowedHosts` and `allowedCidrs` serve different purposes. `allowedHosts` grants a hostname-bound exception for requests explicitly targeting that host, even when it resolves to private or special-use addresses. `allowedCidrs` grants an address-bound exception for direct IP targets or any hostname whose resolved connection address falls within an allowed subnet. Redirects and secondary requests are evaluated independently against the same rules, so a request allowed via `allowedHosts` does not automatically authorize a different hostname or a direct IP target unless that new target is also allowed. |
| 47 | + |
| 48 | +Example default-secure network configuration: |
| 49 | + |
| 50 | +```yaml |
| 51 | +scraper: |
| 52 | + security: |
| 53 | + network: |
| 54 | + allowPrivateNetworks: false |
| 55 | + allowedHosts: [] |
| 56 | + allowedCidrs: [] |
| 57 | +``` |
| 58 | +
|
| 59 | +Example selective internal allowlist: |
| 60 | +
|
| 61 | +```yaml |
| 62 | +scraper: |
| 63 | + security: |
| 64 | + network: |
| 65 | + allowPrivateNetworks: false |
| 66 | + allowedHosts: |
| 67 | + - docs.internal.example |
| 68 | + allowedCidrs: |
| 69 | + - 10.42.0.0/16 |
| 70 | +``` |
| 71 | +
|
| 72 | +Alternative considered: permissive defaults with warnings only. Rejected because the objective of this change is meaningful hardening for exposed deployments, and warnings do not reduce actual risk. |
| 73 | +
|
| 74 | +Alternative considered: separate booleans such as `allowLoopback` and `allowLinkLocal`. Rejected because they create overlapping policy states and make it harder to explain how exceptions should be configured. |
| 75 | + |
| 76 | +### Add file access modes with allowlisted roots and secure traversal defaults |
| 77 | + |
| 78 | +Model local file access as `disabled`, `allowedRoots`, or `unrestricted`. In `allowedRoots` mode, expand configured tokens such as `$DOCUMENTS`, canonicalize paths, and require the effective target to remain inside an allowed root. Set `followSymlinks` and `includeHidden` to `false` by default. Hidden paths will be blocked even when explicitly named unless the user opts in. |
| 79 | + |
| 80 | +When `fileAccess.mode` is `allowedRoots`, an empty `allowedRoots` list means all user-requested `file://` access is denied. This does not block internally managed temporary archive handoff for accepted web archive roots. |
| 81 | + |
| 82 | +If `$DOCUMENTS` cannot be resolved on the current platform or runtime account, it will expand to no path and therefore grant no access by itself. Deployments that require local file access in containers or service accounts must configure explicit absolute paths. |
| 83 | + |
| 84 | +Example default file policy: |
| 85 | + |
| 86 | +```yaml |
| 87 | +scraper: |
| 88 | + security: |
| 89 | + fileAccess: |
| 90 | + mode: allowedRoots |
| 91 | + allowedRoots: |
| 92 | + - $DOCUMENTS |
| 93 | + followSymlinks: false |
| 94 | + includeHidden: false |
| 95 | +``` |
| 96 | + |
| 97 | +Example fully disabled local file access: |
| 98 | + |
| 99 | +```yaml |
| 100 | +scraper: |
| 101 | + security: |
| 102 | + fileAccess: |
| 103 | + mode: disabled |
| 104 | + allowedRoots: [] |
| 105 | + followSymlinks: false |
| 106 | + includeHidden: false |
| 107 | +``` |
| 108 | + |
| 109 | +Alternative considered: denylisting sensitive directories under `$HOME`. Rejected because denylisting is incomplete and easier to bypass than root allowlisting. |
| 110 | + |
| 111 | +### Preserve internal temporary archive handoff |
| 112 | + |
| 113 | +Treat temporary files created by the application itself for supported web archive scraping as internal implementation artifacts rather than user-requested `file://` access. Policy enforcement will still apply to the original network URL, but once a web archive root has been accepted and downloaded, the handoff into local archive processing must not be blocked merely because the temp file lives outside configured user roots such as `$DOCUMENTS`. This exception is limited to the downloaded archive artifact produced for that accepted request and virtual members read from that archive during the same processing flow. |
| 114 | + |
| 115 | +Alternative considered: requiring the temp directory to be added to default allowed roots. Rejected because it couples user-facing file policy to OS temp locations and would either over-broaden local access or break existing archive behavior. |
| 116 | + |
| 117 | +### Use `$DOCUMENTS` as a convenience token, not a policy special case |
| 118 | + |
| 119 | +Support `$DOCUMENTS` expansion during config resolution or policy evaluation, but enforce access against concrete absolute paths after expansion. This keeps the user experience simple without coupling enforcement logic to platform-specific home-directory conventions. |
| 120 | + |
| 121 | +Alternative considered: using `$HOME` as a default root. Rejected because it is too broad and includes common secret locations such as `.ssh`, `.aws`, and dot-config directories. |
| 122 | + |
| 123 | +## Risks / Trade-offs |
| 124 | + |
| 125 | +- [Behavior break for existing internal-doc users] -> Provide explicit `allowedHosts` and `allowedCidrs` overrides and document migration clearly. |
| 126 | +- [Regression in supported web archive scraping] -> Exempt internally managed temp archive handoff from user file-root checks while preserving network policy on the original URL. |
| 127 | +- [DNS resolution and redirect validation complexity] -> Keep the first implementation limited to HTTP(S) plus redirect revalidation and well-defined CIDR checks, covered by focused tests. |
| 128 | +- [Browser subrequest enforcement complexity] -> Intercept all Playwright requests centrally and apply the same resolver/policy helper used by non-browser fetchers. |
| 129 | +- [Platform-specific `$DOCUMENTS` resolution differences] -> Fall back to explicit paths when the token cannot be resolved reliably. |
| 130 | +- [Confusion around hidden path semantics] -> Treat `includeHidden: false` as a hard deny for both direct fetch and traversal, and document this explicitly. |
| 131 | +- [Symlink handling may surprise users with linked doc folders] -> Keep an opt-in `followSymlinks` flag and validate the resolved target stays within an allowed root. |
| 132 | + |
| 133 | +## Migration Plan |
| 134 | + |
| 135 | +1. Add the new config schema and defaults. |
| 136 | +2. Implement shared access-policy utilities and wire them into fetchers and local file traversal. |
| 137 | +3. Preserve archive workflows by validating virtual archive paths against the concrete archive file and by exempting internal temp archive handoff from user file-root checks. |
| 138 | +4. Update tests to cover blocked and allowed cases for top-level and browser-initiated network requests, hidden paths, symlinks, and archive workflows. |
| 139 | +5. Update user-facing documentation to explain defaults, array/env encoding, token expansion, and override patterns. |
| 140 | +6. Release with changelog notes highlighting that private-network access is now blocked by default and file access is restricted by policy. |
| 141 | + |
| 142 | +Rollback consists of removing the policy enforcement layer and reverting to the previous permissive defaults, but the preferred operational mitigation is to loosen the new config values rather than removing the feature. |
| 143 | + |
| 144 | +## Open Questions |
| 145 | + |
| 146 | +- Do we want CLI flags for temporary overrides in addition to config and environment variables, or is config-only sufficient for the first iteration? |
0 commit comments