1 change: 1 addition & 0 deletions README.md
@@ -138,6 +138,7 @@ See **[Embedding Models](docs/guides/embedding-models.md)** for configuring **Ol
### Key Concepts & Architecture
- **[Deployment Modes](docs/infrastructure/deployment-modes.md)**: Standalone vs. Distributed (Docker Compose).
- **[Authentication](docs/infrastructure/authentication.md)**: Securing your server with OAuth2/OIDC.
- **[Security](docs/infrastructure/security.md)**: Trust boundaries, deployment hardening, and outbound access controls.
- **[Telemetry](docs/infrastructure/telemetry.md)**: Privacy-first usage data collection.
- **[Architecture](ARCHITECTURE.md)**: Deep dive into the system design.

13 changes: 13 additions & 0 deletions docs/infrastructure/authentication.md
@@ -423,6 +423,19 @@ DEBUG=mcp:auth npx docs-mcp-server --auth-enabled --auth-issuer-url "..."
- Implement proper CORS policies for web clients
- Use secure OAuth2 flows (Authorization Code with PKCE)

### Outbound Access Controls

Authentication controls who can use MCP and HTTP endpoints. It does not, by itself, restrict where scraper-driven fetches can connect after a request is authenticated.

The scraper now applies outbound access controls by default:

- Private, loopback, link-local, and other special-use network targets are blocked unless explicitly allowlisted or `scraper.security.network.allowPrivateNetworks` is enabled.
- Local `file://` access is constrained by `scraper.security.fileAccess` and defaults to `$DOCUMENTS` only.
- Hidden files and symlinks are blocked by default.
- `allowInvalidTls` only changes certificate validation after the target has already passed the network policy. It does not bypass host or CIDR restrictions.

For deployment hardening guidance, see [Infrastructure Security](./security.md).

### Access Control

- All authenticated users receive full access to all endpoints
163 changes: 163 additions & 0 deletions docs/infrastructure/security.md
@@ -0,0 +1,163 @@
# Infrastructure Security

## Overview

The Docs MCP Server intentionally fetches remote URLs and local files. That capability is useful for documentation indexing, but it also creates trust-boundary decisions for deployments that are shared, internet-exposed, or connected to sensitive internal networks.

This document describes the security model for scraper-driven access and the deployment controls that matter most in practice.

## Trust Boundaries

The server has three independent security surfaces:

1. Inbound access to MCP, web, and API endpoints.
2. Outbound network access performed by URL fetching and browser-based scraping.
3. Local file access performed by direct `file://` fetches and local crawling.

OAuth2 protects inbound MCP and HTTP usage. Scraper security settings protect outbound access and local file reads. These concerns are related but separate.

## Deployment Hardening

Use the following defaults for any shared or remotely reachable deployment:

- Enable authentication for exposed MCP and web endpoints.
- Keep the server behind TLS termination.
- Restrict worker and internal API networking at the infrastructure layer.
- Leave `scraper.security.network.allowPrivateNetworks` disabled unless you have an explicit internal use case.
- Keep `scraper.security.fileAccess.mode` at `allowedRoots` or `disabled` unless the host is fully trusted.
- Do not rely on broad local home-directory access for service accounts or containers.

## Outbound Network Access Policy

By default, scraper-driven HTTP and browser requests may reach public internet targets but may not reach private or special-use network targets.

Blocked by default:

- Loopback addresses
- RFC1918 private IPv4 ranges
- Link-local ranges
- Other special-use IPv4 and IPv6 ranges covered by the shared policy

Allowed by default:

- Public internet HTTP and HTTPS targets

Override options:

- `allowedHosts`: hostname-bound exceptions for explicitly targeted internal hosts
- `allowedCidrs`: address-bound exceptions for resolved or direct IP targets
- `allowPrivateNetworks: true`: broad opt-in for private and special-use network access

Important semantics:

- `allowedHosts` does not allow direct IP access to the same service.
- `allowedCidrs` allows by resolved or direct address, not by hostname label alone.
- Redirect targets are revalidated independently.
- Browser subrequests use the same policy as non-browser HTTP fetches.
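
The semantics above can be sketched in a few lines of Python. This is an illustrative model only, not the server's implementation: the function names, the evaluation order, and the use of Python's `ipaddress` flags to approximate "private or special-use" are all assumptions made for the example.

```python
import ipaddress


def is_special_use(ip) -> bool:
    """Approximate 'private or special-use' with stdlib flags (assumption)."""
    return (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local
        or ip.is_multicast
        or ip.is_reserved
        or ip.is_unspecified
    )


def is_target_allowed(host, resolved_addrs, *, allow_private_networks=False,
                      allowed_hosts=(), allowed_cidrs=()):
    """Hypothetical policy check for one request target.

    host           -- the host exactly as written in the requested URL
    resolved_addrs -- the IP addresses that host resolves to (or the literal IP)
    """
    if allow_private_networks:
        return True  # broad opt-in: every target is eligible

    # Hostname-bound exception: matches the requested name only, so a direct
    # IP request to the same service does not match here.
    if host.lower() in {h.lower() for h in allowed_hosts}:
        return True

    # Address-bound exception: matches resolved or direct addresses.
    networks = [ipaddress.ip_network(cidr) for cidr in allowed_cidrs]
    for addr in resolved_addrs:
        ip = ipaddress.ip_address(addr)
        if is_special_use(ip) and not any(ip in net for net in networks):
            return False
    return True
```

Under this model, a redirect is handled by calling `is_target_allowed` again with the redirect target's host and addresses, so an allowed first hop cannot carry a request into a blocked range.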

## TLS Verification Policy

HTTPS certificate validation stays enabled by default.

`allowInvalidTls: true` is a broad override that allows invalid or self-signed certificates for HTTPS requests that are already permitted by the network policy.

Important semantics:

- It does not bypass `allowedHosts`, `allowedCidrs`, or private-network restrictions.
- It applies broadly, so treat it as an environment-level trust decision.
- If you need narrower trust, prefer proper certificates or a future custom-certificate workflow rather than enabling broad invalid TLS trust.
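
The ordering described above (network policy first, TLS relaxation second) might look like the following minimal sketch. The function names are hypothetical; only the standard-library `ssl` calls are real, and the sketch assumes the network decision has already been computed elsewhere.

```python
import ssl


def build_ssl_context(allow_invalid_tls: bool) -> ssl.SSLContext:
    """Return a verifying context unless invalid TLS is explicitly allowed."""
    ctx = ssl.create_default_context()  # verification enabled by default
    if allow_invalid_tls:
        # check_hostname must be disabled before verify_mode can be CERT_NONE.
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE
    return ctx


def plan_https_request(target_allowed: bool, allow_invalid_tls: bool):
    """Hypothetical helper: TLS settings are only consulted for allowed targets."""
    if not target_allowed:
        return None  # blocked by the network policy; TLS settings are irrelevant
    return build_ssl_context(allow_invalid_tls)
```

The point of the shape is that `allow_invalid_tls` never appears in the branch that decides whether the target is reachable.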

## Local File Access Policy

Local file access defaults to `allowedRoots` mode with `$DOCUMENTS` as the only configured root.

Modes:

- `disabled`: all user-requested `file://` access is blocked
- `allowedRoots`: only configured roots are allowed
- `unrestricted`: local file access is fully trusted

Traversal defaults:

- Hidden files and hidden directories are blocked by default
- Symlinks are blocked by default
- Archive-member paths are validated against the real archive file, not the synthetic combined virtual path

Important semantics:

- `allowedRoots: []` in `allowedRoots` mode means no user-requested local file access is allowed.
- `$DOCUMENTS` is only a convenience token. If it cannot be resolved for the runtime account, it grants no access.
- Hidden paths remain blocked even when explicitly requested unless `includeHidden` is enabled.
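
As a rough illustration of how the modes and traversal defaults combine, here is a Python sketch. It is not the server's code: the function name and check order are assumptions, and applying the hidden-file and symlink flags in `unrestricted` mode is also an assumption. The example paths assume a POSIX host.

```python
import os
from pathlib import Path


def _has_hidden_part(path: Path) -> bool:
    # Any component starting with "." counts as hidden; this also rejects
    # ".." segments, which is deliberately conservative.
    return any(part.startswith(".") for part in path.parts)


def _has_symlink_part(path: Path) -> bool:
    # Check the path itself and every ancestor directory.
    return any(os.path.islink(p) for p in [path, *path.parents])


def is_file_allowed(path, *, mode="allowedRoots", allowed_roots=(),
                    follow_symlinks=False, include_hidden=False) -> bool:
    """Hypothetical check for one user-requested file:// path."""
    candidate = Path(path)
    if mode == "disabled":
        return False
    if not include_hidden and _has_hidden_part(candidate):
        return False
    if not follow_symlinks and _has_symlink_part(candidate):
        return False
    if mode == "unrestricted":
        return True
    # allowedRoots: the resolved target must sit inside a configured root,
    # even when symlinks are followed.
    real = candidate.resolve()
    for root in allowed_roots:
        real_root = Path(root).resolve()
        if real == real_root or real_root in real.parents:
            return True
    return False  # includes the allowed_roots == [] case
```

For example, `is_file_allowed("/srv/docs/.git/config", allowed_roots=["/srv/docs"])` is rejected by the hidden-path check even though the path is inside an allowed root.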

## Archive Workflows

Supported web archive scraping downloads an accepted remote archive to a temporary file and then processes it through the local file path.

That handoff remains allowed without requiring the temp directory to be added to user-configured roots. The exception is intentionally narrow:

- It applies only after the original network URL passes the network policy.
- It applies only to the downloaded archive artifact and its virtual members.
- Unrelated temp files remain subject to the normal local file policy.
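
One way to picture the narrow exception is a registry of archives that the scraper itself downloaded. The class below is purely hypothetical (the name, methods, and use of absolute paths are assumptions for illustration); it only shows that the exemption is keyed to the specific artifact, not to the temp directory.

```python
import os


class ManagedArchives:
    """Tracks archive files the scraper itself downloaded (hypothetical helper)."""

    def __init__(self):
        self._artifacts = set()

    def register(self, archive_path, source_url_allowed):
        # The exemption exists only if the original URL passed the network policy.
        if not source_url_allowed:
            raise PermissionError("source URL was rejected by the network policy")
        self._artifacts.add(os.path.abspath(archive_path))

    def is_exempt(self, archive_path):
        # Virtual members are checked through the real archive file that
        # contains them, so only the registered artifact is ever exempt.
        return os.path.abspath(archive_path) in self._artifacts
```

A neighbouring file in the same temp directory is never registered, so it falls back to the normal local file policy.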

## Example Configurations

Conservative shared deployment:

```yaml
scraper:
  security:
    network:
      allowPrivateNetworks: false
      allowedHosts: []
      allowedCidrs: []
      allowInvalidTls: false
    fileAccess:
      mode: allowedRoots
      allowedRoots:
        - $DOCUMENTS
      followSymlinks: false
      includeHidden: false
```

Selective internal docs deployment:

```yaml
scraper:
  security:
    network:
      allowPrivateNetworks: false
      allowedHosts:
        - docs.internal.example
      allowedCidrs:
        - 10.42.0.0/16
      allowInvalidTls: true
    fileAccess:
      mode: allowedRoots
      allowedRoots:
        - /srv/docs
      followSymlinks: false
      includeHidden: false
```

Fully trusted local workstation:

```yaml
scraper:
  security:
    network:
      allowPrivateNetworks: true
      allowInvalidTls: false
    fileAccess:
      mode: unrestricted
      allowedRoots: []
      followSymlinks: true
      includeHidden: true
```

## Operational Guidance

- Start with the defaults and add the smallest explicit exception that satisfies the use case.
- Prefer `allowedHosts` and `allowedCidrs` over `allowPrivateNetworks: true`.
- Prefer valid certificates over `allowInvalidTls: true`.
- Prefer narrow `allowedRoots` over `unrestricted` file access.
- Review these settings when moving from local development to shared infrastructure.
103 changes: 103 additions & 0 deletions docs/setup/configuration.md
@@ -23,6 +23,18 @@ scraper:
  maxDepth: 3
  document:
    maxSize: 10485760 # 10MB
  security:
    network:
      allowPrivateNetworks: false
      allowedHosts: []
      allowedCidrs: []
      allowInvalidTls: false
    fileAccess:
      mode: allowedRoots
      allowedRoots:
        - $DOCUMENTS
      followSymlinks: false
      includeHidden: false

splitter:
  preferredChunkSize: 1500
@@ -74,6 +86,17 @@ export DOCS_MCP_SPLITTER_PREFERRED_CHUNK_SIZE=2000

# Override app settings
export DOCS_MCP_APP_TELEMETRY_ENABLED=false

# Override scraper security settings
export DOCS_MCP_SCRAPER_SECURITY_NETWORK_ALLOW_INVALID_TLS=true
export DOCS_MCP_SCRAPER_SECURITY_FILE_ACCESS_FOLLOW_SYMLINKS=true
```

Array-valued settings accept JSON or YAML-style inline arrays:

```bash
export DOCS_MCP_SCRAPER_SECURITY_NETWORK_ALLOWED_HOSTS='["docs.internal.example","wiki.corp.local"]'
export DOCS_MCP_SCRAPER_SECURITY_FILE_ACCESS_ALLOWED_ROOTS='["$DOCUMENTS", "/srv/docs"]'
```
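
The naming pattern and array formats shown above can be modelled with a short Python sketch. Both helpers are hypothetical: the camelCase-to-`UPPER_SNAKE` rule is inferred from the example variable names in this section, and the fallback parser handles only simple flat inline arrays rather than full YAML.

```python
import json
import re


def env_var_name(config_path: str, prefix: str = "DOCS_MCP") -> str:
    """Map a dotted config path to the env var naming pattern seen above."""
    parts = []
    for segment in config_path.split("."):
        # Insert "_" before each interior capital letter, then upper-case.
        parts.append(re.sub(r"(?<!^)(?=[A-Z])", "_", segment).upper())
    return "_".join([prefix, *parts])


def parse_array_setting(raw: str) -> list:
    """Parse a JSON array, falling back to a simple inline [a, b] form."""
    text = raw.strip()
    try:
        value = json.loads(text)
        if isinstance(value, list):
            return [str(item) for item in value]
    except json.JSONDecodeError:
        pass
    if text.startswith("[") and text.endswith("]"):
        items = text[1:-1].split(",")
        return [item.strip().strip("'\"") for item in items if item.strip()]
    raise ValueError(f"not an array value: {raw!r}")
```

Note that the `export` examples use single quotes, so the shell passes `$DOCUMENTS` through literally instead of expanding it as a shell variable.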

Some settings also have **legacy aliases** for convenience:
@@ -176,6 +199,86 @@ _Note: Scraper settings are often overridden per-job via CLI arguments like `--m

> **Migration Note:** In versions prior to 1.37, `document.maxSize` was a top-level setting. It has been moved to `scraper.document.maxSize`. Update your config files accordingly.

### Scraper Security (`scraper.security`)

Outbound network access and local file access defaults are intentionally conservative so internet-exposed deployments do not automatically gain access to private networks or broad local file trees.

#### Network (`scraper.security.network`)

| Option | Default | Description |
|:-------|:--------|:------------|
| `allowPrivateNetworks` | `false` | Blocks loopback, RFC1918 private ranges, link-local, and other special-use targets unless explicitly allowed. When set to `true`, all network targets become eligible. |
| `allowedHosts` | `[]` | Hostname-bound exceptions. Only requests explicitly targeting these hostnames are exempt from private-network blocking. |
| `allowedCidrs` | `[]` | Address-bound exceptions. Allows direct IP access or resolved host connections whose address falls inside one of the configured CIDRs. |
| `allowInvalidTls` | `false` | Broad HTTPS certificate-verification override. This only applies after the target has already passed network access checks. |

Empty `allowedHosts` and `allowedCidrs` lists do not allow any private or special-use targets while `allowPrivateNetworks` remains `false`.

#### File Access (`scraper.security.fileAccess`)

| Option | Default | Description |
|:-------|:--------|:------------|
| `mode` | `allowedRoots` | Local file access mode: `disabled`, `allowedRoots`, or `unrestricted`. |
| `allowedRoots` | `[$DOCUMENTS]` | Allowlisted local roots when `mode` is `allowedRoots`. Empty means no user-requested `file://` access is permitted. |
| `followSymlinks` | `false` | Blocks symlinks by default. When enabled, resolved targets must still stay inside an allowed root. |
| `includeHidden` | `false` | Blocks hidden files, hidden directories, and hidden archive members by default, even when explicitly requested. |

`$DOCUMENTS` is a convenience token. It expands to the current account's documents directory when that directory can be resolved. If it cannot be resolved, it grants no access by itself.
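
A minimal sketch of that token behaviour, assuming a hypothetical `expand_roots` helper that receives the already-resolved documents directory (or `None` when it could not be resolved):

```python
def expand_roots(configured_roots, documents_dir):
    """Expand the $DOCUMENTS token without any implicit fallback.

    documents_dir is the resolved documents directory for the runtime
    account, or None when it cannot be resolved (assumed input).
    """
    roots = []
    for entry in configured_roots:
        if entry == "$DOCUMENTS":
            if documents_dir:
                roots.append(documents_dir)
            # Unresolved token: contributes nothing, no fallback to $HOME.
        else:
            roots.append(entry)
    return roots
```

With `["$DOCUMENTS"]` configured and no resolvable documents directory, the result is an empty list, which in `allowedRoots` mode means no user-requested `file://` access.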

Internally managed temporary archive files created during accepted web archive scraping remain allowed even when they sit outside user-configured roots. That exception is limited to the downloaded archive artifact and its virtual members.

### Security Examples

Selective internal network access with self-signed HTTPS:

```yaml
scraper:
  security:
    network:
      allowPrivateNetworks: false
      allowedHosts:
        - docs.internal.example
      allowedCidrs:
        - 10.42.0.0/16
      allowInvalidTls: true
```

Restricted local file access:

```yaml
scraper:
  security:
    fileAccess:
      mode: allowedRoots
      allowedRoots:
        - $DOCUMENTS
        - /srv/docs
      followSymlinks: false
      includeHidden: false
```

Fully trusted local deployment:

```yaml
scraper:
  security:
    network:
      allowPrivateNetworks: true
      allowInvalidTls: false
    fileAccess:
      mode: unrestricted
      allowedRoots: []
      followSymlinks: true
      includeHidden: true
```

Be explicit with these overrides:

- `allowPrivateNetworks: true` broadens network reach beyond public internet targets.
- `allowInvalidTls: true` broadly trusts invalid HTTPS certificates, but it still does not bypass network allowlists.
- `fileAccess.mode: allowedRoots` with `allowedRoots: []` denies all user-requested `file://` access.
- Unresolved `$DOCUMENTS` tokens do not fall back to `$HOME` or any other implicit path.

### GitHub Authentication

Environment variables for authenticating with GitHub when scraping private repositories.
2 changes: 2 additions & 0 deletions openspec/changes/add-fetch-access-controls/.openspec.yaml
@@ -0,0 +1,2 @@
schema: spec-driven
created: 2026-03-29