diff --git a/.github/workflows/notify-website.yml b/.github/workflows/notify-website.yml new file mode 100644 index 0000000..986fd13 --- /dev/null +++ b/.github/workflows/notify-website.yml @@ -0,0 +1,19 @@ +name: Notify Website of Doc Changes + +on: + push: + branches: [main] + paths: + - 'docs/**' + - 'README.md' + +jobs: + notify: + runs-on: ubuntu-latest + steps: + - name: Trigger website sync + uses: peter-evans/repository-dispatch@v4 + with: + token: ${{ secrets.WEBSITE_SYNC_TOKEN }} + repository: multiagentcoordinationprotocol/website + event-type: docs-updated diff --git a/CHANGELOG.md b/CHANGELOG.md index b8f4e0f..94836b3 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -9,6 +9,17 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 ### Added +- `docs/` directory with long-form documentation: `README.md` (index), + `getting-started.md`, `integration.md`, `architecture.md`, `API.md`, + `deployment.md`, and `operations.md`. Style and structure match the + `runtime/docs` layout so the website can pick both up with the same sync. + The integration and index pages call out both minter consumers + (control-plane, SDK-based orchestrators) and bearer consumers + (TS + Python SDK agents), with cross-links to the corresponding + control-plane, SDK, and runtime auth docs. +- `.github/workflows/notify-website.yml` — on push to `main` with changes + under `docs/**` or to `README.md`, dispatches a `docs-updated` event to + `multiagentcoordinationprotocol/website`. - Testable factory — `createApp(config, signing)` exported from `src/server.ts` so supertest can exercise the HTTP surface without opening a port. - Jest + supertest unit/integration tests covering `/healthz`, diff --git a/README.md b/README.md index bafbde8..23a741b 100644 --- a/README.md +++ b/README.md @@ -1,30 +1,44 @@ # MACP auth-service JWT-minting identity service for the MACP runtime. 
Implements RFC-MACP-0004 §4 -(direct-agent-auth) as a dedicated identity provider so that spawned agents can -authenticate directly to the runtime with short-lived RS256 bearer tokens. +(direct-agent-auth) as a dedicated identity provider so that SDK-based agents +can authenticate directly to the runtime with short-lived RS256 bearer tokens. ## Role in the stack ``` - examples-service ──POST /tokens──► auth-service :3200 ──┐ - control-plane ──POST /tokens──► auth-service :3200 │ - │ public keys - macp-runtime (gRPC) ◄──GET /.well-known/jwks.json───────┘ cached 60s - for JWT verify + control-plane ──POST /tokens──► auth-service :3200 ──┐ + SDK orchestrators ──POST /tokens──► auth-service :3200 │ + │ public keys + macp-runtime (gRPC) ◄────GET /.well-known/jwks.json─────────┘ cached per + MACP_AUTH_JWKS_TTL_SECS + + SDK agents (TS / Python) ──Authorization: Bearer ──► macp-runtime (gRPC) ``` -- **Minting:** `examples-service` calls `POST /tokens` once per agent it spawns, - passing `sender` + scopes. The returned JWT is written into the agent's - bootstrap payload (`runtime.bearerToken`). The agent then presents that - bearer directly to the runtime's gRPC endpoint. +- **Minting:** the [control-plane](https://github.com/multiagentcoordinationprotocol/control-plane) + (or any orchestrator built on the [TypeScript SDK](https://github.com/multiagentcoordinationprotocol/typescript-sdk) + or [Python SDK](https://github.com/multiagentcoordinationprotocol/python-sdk)) + calls `POST /tokens` once per agent it spawns, passing `sender` + scopes. + The returned JWT is handed to the agent in its bootstrap payload under + `runtime.bearerToken`. +- **Bearer presentation:** SDK-based agents load the bearer from bootstrap + and present it as `Authorization: Bearer ` on every gRPC call to the + runtime. 
See the SDK auth guides + ([TypeScript](https://github.com/multiagentcoordinationprotocol/typescript-sdk/blob/main/docs/guides/authentication.md), + [Python](https://github.com/multiagentcoordinationprotocol/python-sdk/blob/main/docs/guides/direct-agent-auth.md)). - **Verification:** the runtime is configured with `MACP_AUTH_JWKS_URL=http://auth-service:3200/.well-known/jwks.json`. It fetches the JWKS (cached per `MACP_AUTH_JWKS_TTL_SECS`) and validates every - incoming JWT's signature + `iss` + `aud` + `exp` + `nbf` on each gRPC frame. + incoming JWT's signature + `iss` + `aud` + `exp` on each gRPC frame. See + the runtime + [Getting Started](https://github.com/multiagentcoordinationprotocol/runtime/blob/main/docs/getting-started.md#jwt-mode) + and + [Deployment](https://github.com/multiagentcoordinationprotocol/runtime/blob/main/docs/deployment.md#authentication) + guides. This service is *not* in the hot path of a running session — tokens are minted -once per agent at scenario launch, then reused for the session lifetime. +once per agent at provisioning time, then reused for the session lifetime. ## API @@ -138,6 +152,19 @@ curl http://localhost:3200/healthz The published CI image is `ghcr.io/multiagentcoordinationprotocol/auth-service` (see `.github/workflows/docker.yml`). 
+## Documentation + +Full documentation lives under [`docs/`](docs/README.md): + +| Page | Purpose | +|------|---------| +| [Getting Started](docs/getting-started.md) | Install, run locally, mint your first token, verify against JWKS | +| [Integration Guide](docs/integration.md) | End-to-end wiring with the control-plane, SDK orchestrators, SDK agents, and the runtime | +| [Architecture](docs/architecture.md) | Module layout, request flow, key lifecycle, design goals | +| [API Reference](docs/API.md) | All three HTTP endpoints, JWT claim structure, error table | +| [Deployment](docs/deployment.md) | Production checklist, env vars, Docker, Kubernetes, TLS termination | +| [Operations Runbook](docs/operations.md) | Key rotation, diagnostics, common failures, incident response | + ## Security notes - **`POST /tokens` has no client authentication in this implementation.** It diff --git a/docs/API.md b/docs/API.md new file mode 100644 index 0000000..9f290bb --- /dev/null +++ b/docs/API.md @@ -0,0 +1,199 @@ +# API Reference + +This is the reference for every HTTP endpoint exposed by the auth-service. The default base URL is `http://127.0.0.1:3200`, configurable via the `PORT` environment variable. + +For protocol-level transport semantics and the JWT claim model, see the [protocol transports documentation](https://www.multiagentcoordinationprotocol.io/docs/transports) and [protocol security documentation](https://www.multiagentcoordinationprotocol.io/docs/security). + +## Endpoints at a glance + +| Method | Path | Purpose | Auth | +|--------|------|---------|------| +| `GET` | `/healthz` | Liveness probe | none | +| `GET` | `/.well-known/jwks.json` | Public JWKS for JWT verification | none | +| `POST` | `/tokens` | Mint a short-lived RS256 JWT | **none by default** (see [Deployment](deployment.md)) | + +All responses are `application/json`. All requests that carry a body must use `content-type: application/json`. + +## Liveness + +### `GET /healthz` + +Liveness probe. 
Returns `200 OK` as long as the process is accepting connections. There is no readiness signal distinct from liveness: the service is stateless and ready as soon as `loadKey` completes during startup. + +**Response** + +```json +{ "ok": true } +``` + +**Example** + +```bash +curl -sS http://localhost:3200/healthz +``` + +Use this endpoint for Kubernetes `livenessProbe`, Docker `HEALTHCHECK`, and load-balancer health checks. The Dockerfile ships with a built-in `HEALTHCHECK` wired to this path. + +## Key distribution + +### `GET /.well-known/jwks.json` + +Returns the public JWKS document that verifiers (typically the MACP runtime) fetch to validate token signatures. Private material is never exposed here. + +**Response** + +```json +{ + "keys": [ + { + "kty": "RSA", + "n": "…base64url modulus…", + "e": "AQAB", + "kid": "dev-key-1", + "alg": "RS256", + "use": "sig" + } + ] +} +``` + +**Response fields (per key entry)** + +| Field | Type | Description | +|-------|------|-------------| +| `kty` | string | Key type. Always `RSA` for this service. | +| `n` | string | base64url-encoded RSA modulus | +| `e` | string | base64url-encoded RSA exponent (typically `AQAB`) | +| `kid` | string | Key identifier. `dev-key-1` for ephemeral keys; whatever was set in the JWK for pinned keys. | +| `alg` | string | Signature algorithm. Always `RS256`. | +| `use` | string | Key usage. Always `sig`. | + +The service publishes exactly one key at any given time. Rotating keys means replacing the JWK, redeploying, and waiting `MACP_AUTH_JWKS_TTL_SECS` for verifiers to refresh. See [Operations — Key rotation](operations.md#key-rotation). + +**Example** + +```bash +curl -sS http://localhost:3200/.well-known/jwks.json | jq . +``` + +## Token minting + +### `POST /tokens` + +Mints an RS256-signed JWT for the requested `sender` with the supplied scopes and TTL. The returned token can be presented as a gRPC `Authorization: Bearer ` header to the MACP runtime. 
+ +**Request body** + +```json +{ + "sender": "agent://risk", + "scopes": { + "can_start_sessions": true, + "is_observer": false, + "allowed_modes": ["macp.mode.decision.v1"], + "max_open_sessions": 1, + "can_manage_mode_registry": false + }, + "ttl_seconds": 3600 +} +``` + +**Request fields** + +| Field | Type | Required | Description | +|-------|------|----------|-------------| +| `sender` | string | Yes | The agent identity. Becomes the JWT `sub` claim and the authenticated sender the runtime associates with incoming frames. Must be a non-empty string. | +| `scopes` | object | No | Capability flags, serialized verbatim under `macp_scopes`. Defaults to `{}` (permissive in the runtime's current interpretation — see scopes schema below). | +| `ttl_seconds` | number | No | Requested token lifetime in seconds. Must be positive and finite. Clamped to `MACP_AUTH_MAX_TTL_SECONDS`. Defaults to `MACP_AUTH_DEFAULT_TTL_SECONDS` when omitted. | + +**Scopes schema** (all fields optional) + +| Field | Type | Runtime meaning | +|-------|------|-----------------| +| `can_start_sessions` | boolean | May submit `SessionStart` envelopes. | +| `can_manage_mode_registry` | boolean | May register / unregister / promote extension modes. | +| `is_observer` | boolean | May passive-subscribe to sessions the caller is not a declared participant of. | +| `allowed_modes` | string[] | If non-empty, restricts the set of modes the sender may use. Empty or omitted = all modes allowed. | +| `max_open_sessions` | number | Upper bound on concurrent open sessions the sender can initiate. | + +The auth-service does not inspect these fields beyond serializing them — enforcement is entirely on the runtime side. You can pass additional keys and the runtime will surface them via the identity's scopes map, but they will be ignored by current runtime capability checks. 
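
As a rough sketch of how the request fields above fit together, the following TypeScript mirrors the documented rules: `sender` must be a non-empty string, `scopes` defaults to `{}`, and `ttl_seconds` must be positive and finite, falls back to the default when omitted, and is clamped to the maximum. The names `MacpScopes`, `TtlLimits`, `MintResult`, and `resolveMintRequest` are hypothetical and are not taken from the service's source; the limits passed in stand in for `MACP_AUTH_DEFAULT_TTL_SECONDS` and `MACP_AUTH_MAX_TTL_SECONDS`.

```typescript
// Hypothetical types mirroring the documented request body; not the service's own definitions.
interface MacpScopes {
  can_start_sessions?: boolean;
  can_manage_mode_registry?: boolean;
  is_observer?: boolean;
  allowed_modes?: string[];
  max_open_sessions?: number;
  [extra: string]: unknown; // additional keys are passed through untouched
}

// Stand-ins for MACP_AUTH_DEFAULT_TTL_SECONDS and MACP_AUTH_MAX_TTL_SECONDS.
interface TtlLimits {
  defaultTtlSeconds: number;
  maxTtlSeconds: number;
}

type MintResult =
  | { ok: true; sender: string; scopes: MacpScopes; ttlSeconds: number }
  | { ok: false; status: 400; error: string };

function resolveMintRequest(body: unknown, limits: TtlLimits): MintResult {
  // Treat a missing or non-object body as an empty object so the sender check fails cleanly.
  const fields = (typeof body === "object" && body !== null ? body : {}) as Record<string, unknown>;

  const sender = fields["sender"];
  if (typeof sender !== "string" || sender.length === 0) {
    return { ok: false, status: 400, error: "sender is required" };
  }

  // Omitted ttl_seconds falls back to the default; anything else must be a finite positive number.
  const requested = fields["ttl_seconds"] ?? limits.defaultTtlSeconds;
  if (typeof requested !== "number" || !Number.isFinite(requested) || requested <= 0) {
    return { ok: false, status: 400, error: "ttl_seconds must be a positive number" };
  }

  return {
    ok: true,
    sender,
    scopes: (fields["scopes"] ?? {}) as MacpScopes,
    ttlSeconds: Math.min(requested, limits.maxTtlSeconds), // clamp to the configured maximum
  };
}
```

With limits of 300 and 3600 seconds, a request for `ttl_seconds: 7200` resolves to an effective TTL of 3600, which is the value the response would report as `expires_in_seconds`.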
+ +**Response** + +```json +{ + "token": "eyJhbGciOiJSUzI1NiIsImtpZCI6ImRldi1rZXktMSJ9.eyJtYWNwX3Njb3BlcyI6e30sImlhdCI6...", + "sender": "agent://risk", + "expires_in_seconds": 3600 +} +``` + +**Response fields** + +| Field | Type | Description | +|-------|------|-------------| +| `token` | string | The compact serialized JWT. Present as `Authorization: Bearer ` to the runtime. | +| `sender` | string | Echo of the request's `sender`. Also present as the JWT's `sub` claim. | +| `expires_in_seconds` | number | The effective TTL after clamping against `MACP_AUTH_MAX_TTL_SECONDS`. May be less than the requested `ttl_seconds`. | + +**JWT claim structure** + +| Claim | Source | Description | +|-------|--------|-------------| +| `iss` | `MACP_AUTH_ISSUER` | Token issuer. Must match the runtime's configured issuer. | +| `aud` | `MACP_AUTH_AUDIENCE` | Token audience. Must match the runtime's configured audience. | +| `sub` | request `sender` | Authenticated agent identity. | +| `iat` | now | Issued-at (seconds since epoch). | +| `exp` | `iat + effective_ttl` | Expiration (seconds since epoch). | +| `macp_scopes` | request `scopes` | Capability flags, serialized verbatim. | + +The JWT header always carries `alg: RS256` and `kid` matching the key advertised in the JWKS. + +**Example** + +```bash +curl -sS -X POST http://localhost:3200/tokens \ + -H 'content-type: application/json' \ + -d '{ + "sender": "agent://risk", + "scopes": { "can_start_sessions": true, "allowed_modes": ["macp.mode.decision.v1"] }, + "ttl_seconds": 600 + }' +``` + +## Errors + +The service returns plain JSON errors with an `error` field. Only the validation errors below are emitted by the service itself; signature-verification and claim-validation errors surface at the **verifier** (the runtime), not here. + +### Error table + +| Status | Body | Cause | +|--------|------|-------| +| `400` | `{"error":"sender is required"}` | Body missing, not JSON, or `sender` absent / empty / not a string. 
| +| `400` | `{"error":"ttl_seconds must be a positive number"}` | `ttl_seconds` is `0`, negative, `NaN`, `Infinity`, or non-numeric. | +| `404` | (express default) | Unknown path. | +| `500` | (express default) | Should not occur in normal operation. Indicates an unexpected exception during signing; check server logs. | + +### Verifier-side errors + +The following are raised by `jose.jwtVerify` (or an equivalent verifier) at the runtime, not by this service. They are listed here as reference because operators often see them while debugging an integration. + +| Error name | Cause | Resolution | +|------------|-------|------------| +| `JWSSignatureVerificationFailed` | Key rotation not yet reflected in verifier's JWKS cache, or token signed by a different key entirely. | Wait `MACP_AUTH_JWKS_TTL_SECS`, or restart the verifier; confirm `kid` in token matches a JWKS entry. | +| `JWTClaimValidationFailed: iss` | Issuer mismatch between minter and verifier. | Align `MACP_AUTH_ISSUER`. | +| `JWTClaimValidationFailed: aud` | Audience mismatch. | Align `MACP_AUTH_AUDIENCE`. | +| `JWTExpired` | Token `exp` has passed, or large clock skew between minter and verifier. | Mint a fresh token; verify NTP sync. | +| `JWTClaimValidationFailed: nbf` | `nbf` in the future — only possible if a custom minter adds `nbf`; this service does not. | N/A for this service. | + +## Rate limiting + +The service does **not** rate-limit `POST /tokens`. Deployments that need per-caller limits should add them in the reverse proxy (nginx `limit_req`, Envoy local rate limit, API gateway rules, etc.). See [Operations — Abuse mitigation](operations.md#abuse-mitigation). + +## Request size limits + +`express.json()` accepts payloads up to the default 100 KiB. The service does not override this. A well-formed mint request is under 1 KiB; payloads anywhere near the limit indicate misuse. + +## Idempotency + +Mint requests are not idempotent. 
Every call generates a new JWT with fresh `iat` / `exp` claims even when the request body is identical. Callers that need idempotency (e.g. an outer control-plane that retries) should cache the minted token keyed by request parameters and replay within the returned `expires_in_seconds` window. diff --git a/docs/README.md b/docs/README.md new file mode 100644 index 0000000..ab75647 --- /dev/null +++ b/docs/README.md @@ -0,0 +1,57 @@ +# MACP auth-service Documentation + +**Version**: v1.0.0 | **Protocol**: MACP 1.0 | **Language**: TypeScript (Node 20+) + +The MACP auth-service is the reference identity provider for the [Multi-Agent Coordination Protocol](https://www.multiagentcoordinationprotocol.io). It is a small, stateless Express service that mints short-lived RS256 JWTs for agents so they can authenticate directly to the MACP runtime over gRPC. It implements the identity-provider side of direct-agent-auth as described in RFC-MACP-0004 §4. + +This documentation covers the **auth-service implementation** -- how to build, configure, integrate, deploy, and operate it. For protocol-level concepts like the authentication model, session capabilities, and the two-plane architecture, see the [protocol documentation](https://www.multiagentcoordinationprotocol.io/docs). + +## What the auth-service provides + +The service exposes three HTTP endpoints. `POST /tokens` mints a signed JWT for a requested `sender` identity with a scopes payload and a clamped time-to-live. `GET /.well-known/jwks.json` advertises the corresponding public JWK so that any MACP runtime pointed at this service can verify tokens without ever touching the private key. `GET /healthz` is an unauthenticated liveness probe for load balancers and container orchestrators. 
+ +The service sits between three kinds of consumer in a typical MACP deployment: + +- **Orchestrators that mint.** The [control-plane](https://github.com/multiagentcoordinationprotocol/control-plane) and any custom orchestrator built on the [TypeScript SDK](https://github.com/multiagentcoordinationprotocol/typescript-sdk) or [Python SDK](https://github.com/multiagentcoordinationprotocol/python-sdk) call `POST /tokens` to obtain a JWT per agent they provision. +- **SDK-based agents that present.** Agents built with the SDKs receive a minted JWT in their bootstrap payload and carry it on every gRPC frame to the runtime. The SDKs themselves do not call `POST /tokens` — they are the bearer side of the exchange. See the SDK auth guides ([TypeScript](https://github.com/multiagentcoordinationprotocol/typescript-sdk/blob/main/docs/guides/authentication.md), [Python](https://github.com/multiagentcoordinationprotocol/python-sdk/blob/main/docs/guides/direct-agent-auth.md)). +- **The runtime that verifies.** The [runtime](https://github.com/multiagentcoordinationprotocol/runtime) fetches `/.well-known/jwks.json`, caches it, and verifies every incoming JWT's signature, `iss`, `aud`, and `exp`. See the runtime [Getting Started — JWT mode](https://github.com/multiagentcoordinationprotocol/runtime/blob/main/docs/getting-started.md#jwt-mode) and [Deployment — Authentication](https://github.com/multiagentcoordinationprotocol/runtime/blob/main/docs/deployment.md#authentication). + +The service is stateless by design. It holds exactly one RSA keypair in memory for its lifetime, maintains no database, issues no refresh tokens, and keeps no revocation list. Short TTLs and key rotation are the mitigations for compromised tokens. In development the service generates an ephemeral keypair on start; in any shared environment you provide `MACP_AUTH_SIGNING_KEY_JSON` so the key survives restarts and the runtime's JWKS cache stays stable. 
+ +The mint endpoint is intentionally open-by-default: anyone who can reach the port can mint a token for any `sender`. That model assumes a trusted intra-cluster network. Deployments that expose the service more widely front it with mTLS, an authenticating reverse proxy, or a shared-secret `Authorization` check — see [Deployment](deployment.md) and [Operations](operations.md). + +## Documentation + +### Getting started +- [**Getting Started**](getting-started.md) -- Install, run locally, mint your first token, verify it against the JWKS +- [**Integration Guide**](integration.md) -- How the control-plane, SDK orchestrators, SDK agents, and the runtime consume this service end-to-end + +### Implementation reference +- [**Architecture**](architecture.md) -- Module layout, request flow, key lifecycle, and the config / keys / server split +- [**API Reference**](API.md) -- All three HTTP endpoints with request/response fields, error codes, and JWT claim structure + +### Operations +- [**Deployment**](deployment.md) -- Production checklist, environment variables, container deployment, TLS termination, and secret handling +- [**Operations Runbook**](operations.md) -- Key rotation, restart procedure, diagnostics, log interpretation, and common failure modes + +## Protocol documentation + +The auth-service implements the identity-provider side of the protocol's authentication model. 
For protocol-level topics, refer to the specification documentation: + +| Topic | Link | +|-------|------| +| Security model and threat surface | [Protocol Security](https://www.multiagentcoordinationprotocol.io/docs/security) | +| Transport bindings (gRPC, JWT, JWKS) | [Protocol Transports](https://www.multiagentcoordinationprotocol.io/docs/transports) | +| Agent discovery and capability negotiation | [Protocol Discovery](https://www.multiagentcoordinationprotocol.io/docs/discovery) | +| Session lifecycle and participant model | [Protocol Lifecycle](https://www.multiagentcoordinationprotocol.io/docs/lifecycle) | + +## Related repositories + +| Repository | Role | Auth docs | +|------------|------|-----------| +| [multiagentcoordinationprotocol](https://github.com/multiagentcoordinationprotocol/multiagentcoordinationprotocol) | Protocol specification, RFCs, and canonical docs | [Security](https://www.multiagentcoordinationprotocol.io/docs/security) | +| [runtime](https://github.com/multiagentcoordinationprotocol/runtime) | Rust reference runtime that verifies JWTs issued by this service | [JWT mode](https://github.com/multiagentcoordinationprotocol/runtime/blob/main/docs/getting-started.md#jwt-mode), [Deployment auth](https://github.com/multiagentcoordinationprotocol/runtime/blob/main/docs/deployment.md#authentication) | +| [control-plane](https://github.com/multiagentcoordinationprotocol/control-plane) | Orchestrator that mints per-agent tokens via this service | [Integration](https://github.com/multiagentcoordinationprotocol/control-plane/blob/main/docs/INTEGRATION.md), [Architecture](https://github.com/multiagentcoordinationprotocol/control-plane/blob/main/docs/ARCHITECTURE.md) | +| [typescript-sdk](https://github.com/multiagentcoordinationprotocol/typescript-sdk) | TypeScript agent SDK — presents the JWT to the runtime | [Authentication guide](https://github.com/multiagentcoordinationprotocol/typescript-sdk/blob/main/docs/guides/authentication.md) | +| 
[python-sdk](https://github.com/multiagentcoordinationprotocol/python-sdk) | Python agent SDK — presents the JWT to the runtime | [Direct-agent-auth](https://github.com/multiagentcoordinationprotocol/python-sdk/blob/main/docs/guides/direct-agent-auth.md), [Auth overview](https://github.com/multiagentcoordinationprotocol/python-sdk/blob/main/docs/auth.md) | +| [auth-service](https://github.com/multiagentcoordinationprotocol/auth-service) | **This repository** — JWT minting identity provider | — | diff --git a/docs/architecture.md b/docs/architecture.md new file mode 100644 index 0000000..2282b1e --- /dev/null +++ b/docs/architecture.md @@ -0,0 +1,167 @@ +# Architecture + +This page describes the internal shape of the auth-service: the module split, the request lifecycle, the key lifecycle, and the reasoning behind the design choices. For the protocol-level authentication model — why agents hold short-lived JWTs, why the runtime verifies via JWKS, and what guarantees that gives — see the [protocol security documentation](https://www.multiagentcoordinationprotocol.io/docs/security). + +## Design goals + +The service has four goals, in order of priority. + +1. **Stateless.** No database, no session store, no revocation list. A restarted process is indistinguishable from a fresh one, provided the signing key is pinned. +2. **Single responsibility.** Mint tokens. Publish the verification key. Nothing else. No user management, no policy evaluation, no rate limiting. +3. **Testable without I/O.** The HTTP handlers can be exercised by supertest against a constructed app without starting a listener or touching the environment. +4. **Container-native.** Runs as a non-root user, exposes an unauthenticated health probe, ships a multi-stage `node:20-alpine` image, and reads all configuration from environment variables. + +Everything else — the specific library choices (`express`, `jose`), the file layout, the request-handling order — falls out of these four goals. 
+ +## Source layout + +``` +src/ + config.ts — loadConfigFromEnv(): parses env into an AuthServiceConfig + keys.ts — loadKey(): returns SigningMaterial from env JWK or ephemeral gen + server.ts — createApp(config, signing): pure Express app factory + index.ts — main(): wires config + keys + app + listen + graceful shutdown + *.spec.ts — co-located jest + supertest tests +``` + +The split between `config`, `keys`, `server`, and `index` is the central architectural decision. It is not accidental and should not be collapsed. + +- `config.ts` is the only place that reads `MACP_AUTH_*` / `PORT` env vars (other than `index.ts` reading nothing). `loadConfigFromEnv(env)` takes the env as a parameter with a default of `process.env`, so tests can pass a synthetic environment. +- `keys.ts` knows nothing about HTTP or Express. It takes a JWK string (or `undefined`) and returns `{ privateKey, jwks, source }`. Tests can feed it a fixed JWK and get a deterministic key back. +- `server.ts` exports `createApp(config, signing)` — a pure function that builds an `Express` app bound to the given config and signing material. It never calls `.listen()`, never reads `process.env`, and never touches the filesystem. +- `index.ts` is the only file that owns side effects: it calls `loadConfigFromEnv()`, `loadKey()`, `createApp()`, `.listen()`, and wires `SIGTERM` / `SIGINT` handlers. + +This split means tests construct a real Express app with a real RSA key, round-trip real HTTP, and assert real JWT signatures — without opening a network socket or depending on the environment. `server.spec.ts` is the reference for this pattern. + +## Request lifecycle + +The service handles three distinct request types. None of them touch any persistence layer; all responses are derived from in-memory state plus the current request. + +### `GET /healthz` + +``` +request → express.json middleware → handler → { ok: true } response +``` + +No input parsing, no key use, no error path. 
+ +### `GET /.well-known/jwks.json` + +``` +request → handler → signing.jwks response +``` + +Returns the pre-computed JWKS from `SigningMaterial.jwks`. This object is built once in `loadKey()` and never mutates. There is no caching header — the operator sets cache semantics at the reverse proxy if desired. + +### `POST /tokens` + +``` +request + → express.json middleware + → validate body.sender (non-empty string) + → resolve ttl (body.ttl_seconds ?? config.defaultTtlSeconds) + → validate ttl (finite, positive) + → clamp ttl (min(ttl, config.maxTtlSeconds)) + → jose.SignJWT({ macp_scopes: body.scopes ?? {} }) + .setProtectedHeader({ alg: 'RS256', kid }) + .setSubject(body.sender) + .setIssuer(config.issuer) + .setAudience(config.audience) + .setIssuedAt() + .setExpirationTime(`${ttl}s`) + .sign(signing.privateKey) + → { token, sender, expires_in_seconds } response +``` + +Two validation branches fail fast with `400` before any signing work happens. Once validation passes, `jose.SignJWT` builds the JWS compact serialization entirely in memory. The private key is held as a Node `KeyObject` (via `jose.importJWK` or `jose.generateKeyPair`); it is never exposed outside this module and is never logged. + +Clock skew handling is deliberately simple: `iat` is set to the process's current time and `exp` is `iat + ttl`. The verifier is responsible for tolerating skew via `clockTolerance` — the runtime defaults to a small window. + +## Key lifecycle + +`loadKey()` returns a `SigningMaterial` object that the app holds for its entire lifetime. + +``` +SigningMaterial { + privateKey: jose.KeyLike // signing key, never exposed + jwks: { keys: [JWK] } // public JWKS document, served verbatim + source: 'ephemeral' | 'env' // diagnostic only — logged on startup +} +``` + +There are two paths: + +### Ephemeral (dev) + +When `MACP_AUTH_SIGNING_KEY_JSON` is unset, `loadKey()` calls `jose.generateKeyPair('RS256')` at startup. 
It exports the public half as a JWK, tags it with `kid: 'dev-key-1'`, and returns it. The keypair is process-scoped — a restart generates a fresh key, invalidating every outstanding token the previous process signed. + +Use this only for local iteration. Any verifier that fetched and cached the prior JWKS will fail every token issued after a restart until the cache expires. + +### Pinned (prod) + +When `MACP_AUTH_SIGNING_KEY_JSON` is set, `loadKey()` parses it as a JWK, imports the private key, and derives the public JWK by stripping the private component (`d`) and re-importing. The `kid` comes from the JWK itself, falling back to `'key-1'` if absent. + +The JWK must be RSA, must contain private fields (`d`, `p`, `q`, `dp`, `dq`, `qi`), and must be compatible with `alg: RS256`. A parse failure or import failure is fatal: the top-level `void main().catch(...)` handler catches the error and the process exits with code 1. This is intentional: a misconfigured key should prevent startup, not silently fall back to ephemeral. + +### Rotation + +There is no in-process rotation. The service advertises exactly one key at any given time. To rotate: + +1. Generate a new JWK with a new `kid`. +2. Update `MACP_AUTH_SIGNING_KEY_JSON` in the secret store. +3. Restart the service. +4. Wait `MACP_AUTH_JWKS_TTL_SECS` for verifiers to refresh their JWKS caches. +5. Expect tokens signed by the previous key to stop verifying once a verifier has refreshed its JWKS, or at the end of their own TTL if that comes first, because the old key is no longer published. + +See [Operations — Key rotation](operations.md#key-rotation) for the operational procedure. + +## Concurrency model + +The service runs on the Node.js single-threaded event loop. `jose.SignJWT.sign()` is async — it returns a Promise and releases the event loop while the underlying crypto runs in libuv worker threads. The service does not serialize requests; many mints run in parallel bounded only by libuv's thread pool (default 4). + +The in-memory `SigningMaterial` is immutable for the process lifetime, so no locking is needed. There is no shared mutable state. 
+ +## Failure modes and responses + +| Failure | Response | Recovery | +|---------|----------|----------| +| Malformed JSON body | `400` (express default) | Caller resends valid JSON. | +| Missing `sender` | `400` with `{"error":"sender is required"}` | Caller adds `sender`. | +| Invalid `ttl_seconds` | `400` with `{"error":"ttl_seconds must be a positive number"}` | Caller passes a finite positive number. | +| `jose.SignJWT(...).sign()` throws | `500` (express default) | Unexpected — investigate logs. Usually indicates corrupted key material. | +| Invalid `MACP_AUTH_SIGNING_KEY_JSON` at startup | Process exit code 1 | Fix the JWK in the secret store and restart. | +| Port already in use | Process exit code 1 | Change `PORT` or free the port. | + +There are deliberately no fallbacks. If the key is broken, the service exits rather than limping along. + +## Observability + +Two observables leave the process: + +- **Stdout logs.** `index.ts` prints five `[auth-service]` lines on startup (port, issuer/audience, key source, JWKS URL, mint URL) and one line on shutdown. The service does not log per-request — adding an access log is a deployment-time concern handled by the reverse proxy. +- **HTTP status codes.** `GET /healthz` returns 200 whenever the process is serving requests; the orchestrator's probe treats the first successful response after start as "ready." + +The service does not emit metrics. If you need Prometheus counters (mint count, mint latency, error rate), add a `prom-client` registry in a thin wrapper; the hooks are straightforward because `createApp` is pure. + +## Dependency choices + +| Dependency | Version | Why | +|------------|---------|-----| +| `express` | `^5.2.1` | Standard Node HTTP framework. Async-first in v5, no need for `express-async-errors`. | +| `jose` | `^5.9.6` | Spec-compliant JOSE (JWS, JWE, JWK, JWKS) with full TypeScript types. **Pinned to v5** for CommonJS compatibility with ts-jest. v6+ is ESM-only. 
| +| `typescript` | `^5.6.3` | Matches the rest of the MACP monorepo. `strict` mode plus `noUnusedLocals` / `noUnusedParameters` / `noImplicitOverride`. | +| `@types/node` | `^20.x` | Matches `engines.node >= 20`. | +| `jest` + `ts-jest` + `supertest` | current stable | HTTP integration tests against the real Express surface with real RSA keys. | + +`jose` is the only non-framework dependency. No custom crypto, no `jsonwebtoken`, no `node-jose`. This narrows the trusted-code surface for a service whose entire job is signing tokens. + +## What this architecture rules out + +A few capabilities are intentionally absent and should not be added without first questioning the design goals above. + +- **User accounts / password auth.** The auth-service does not know who the caller is. Caller identification is a reverse-proxy concern. +- **Token revocation.** The service has no store to revoke from. Short TTLs plus key rotation are the only revocation primitives. +- **Audit logging.** The service does not log mints. An upstream API gateway (or the caller — typically the [control-plane](https://github.com/multiagentcoordinationprotocol/control-plane) or a custom orchestrator built on the [TypeScript SDK](https://github.com/multiagentcoordinationprotocol/typescript-sdk) or [Python SDK](https://github.com/multiagentcoordinationprotocol/python-sdk)) is responsible for audit. +- **Multiple active keys.** `SigningMaterial.jwks` is always a single-entry array. Supporting N active keys would require a key manager, a selector, and a cache-invalidation strategy — all of which belong in a key management service, not here. + +If any of these is required for your deployment, the right move is to put a dedicated service in front of this one and keep this one minimal. 
diff --git a/docs/deployment.md b/docs/deployment.md new file mode 100644 index 0000000..c063441 --- /dev/null +++ b/docs/deployment.md @@ -0,0 +1,250 @@ +# Deployment Guide + +This guide covers everything you need to run the auth-service in production: the production checklist, environment variables, secret handling, container deployment, TLS termination, and verifier wiring. For protocol-level deployment topologies, see the [protocol deployment](https://www.multiagentcoordinationprotocol.io/docs/deployment) and [protocol security](https://www.multiagentcoordinationprotocol.io/docs/security) documentation. + +## Production checklist + +Before exposing the auth-service to production traffic, confirm these five items. + +1. **Pinned signing key.** Set `MACP_AUTH_SIGNING_KEY_JSON` from your secret store. The service generates an ephemeral keypair when unset, which is fatal in any shared deployment — every restart invalidates every issued token until the verifier's JWKS cache refreshes. + +2. **Issuer and audience alignment.** `MACP_AUTH_ISSUER` and `MACP_AUTH_AUDIENCE` on the auth-service must match the runtime's `MACP_AUTH_ISSUER` and `MACP_AUTH_AUDIENCE`. A mismatch is the single most common cause of `JWTClaimValidationFailed` errors in integrations. + +3. **Front it with auth.** The `POST /tokens` endpoint has no client authentication. Put the service behind mTLS, an authenticating reverse proxy, or a shared-secret `Authorization` check before anything outside your trust boundary can reach it. + +4. **TLS termination.** The service speaks plain HTTP. Run it behind a TLS-terminating proxy (nginx, Envoy, cloud load balancer). Tokens on the wire must be TLS-protected; the JWKS itself is public but should still be served over HTTPS so verifiers can trust the key distribution channel. + +5. **Bounded TTLs.** `MACP_AUTH_MAX_TTL_SECONDS` is your revocation horizon — a stolen token is valid until it expires. 
Keep the max TTL short (hours, not days) unless you have a compensating control. + +## Environment variables + +| Variable | Default | Required? | Description | +|----------|---------|-----------|-------------| +| `PORT` | `3200` | no | HTTP listen port | +| `MACP_AUTH_ISSUER` | `macp-auth-service` | no | JWT `iss` claim. Must match the verifier's expected issuer. | +| `MACP_AUTH_AUDIENCE` | `macp-runtime` | no | JWT `aud` claim. Must match the verifier's expected audience. | +| `MACP_AUTH_MAX_TTL_SECONDS` | `3600` | no | Upper bound on minted token lifetime. Clients requesting more are clamped down. | +| `MACP_AUTH_DEFAULT_TTL_SECONDS` | `300` | no | TTL applied when the request omits `ttl_seconds`. | +| `MACP_AUTH_SIGNING_KEY_JSON` | *(ephemeral)* | **yes in prod** | RSA private JWK as a JSON string. See below for generation. | + +Environment variables are read by `src/config.ts` via `loadConfigFromEnv()`. No other file reads `process.env`. This means a single command with the right env applies globally; there are no per-request overrides. + +## Signing key generation + +Generate an RSA keypair as a JWK once, store the private JWK in your secret manager, and inject it at process start. + +```bash +node -e "const {generateKeyPair,exportJWK}=require('jose'); \ + (async()=>{const {privateKey}=await generateKeyPair('RS256',{extractable:true}); \ + const jwk=await exportJWK(privateKey); \ + jwk.kid='prod-key-' + new Date().toISOString().slice(0,10); \ + jwk.alg='RS256'; jwk.use='sig'; \ + console.log(JSON.stringify(jwk))})()" +``` + +The `kid` should be unique per key version. Embedding the generation date (`prod-key-2026-04-22`) is a reasonable convention — it makes rotation history visible at a glance in logs and verifier caches. + +Store the output in your secret manager as a single string. Do **not** commit the JWK; do **not** log it; do **not** pipe it through a console that retains scrollback on a shared host. 
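Before the JWK goes into the secret manager, a quick structural check can catch a truncated paste or a public-only key. The following is a minimal sketch, not part of the auth-service: `checkSigningJwk` is a hypothetical helper, and it only inspects the standard RSA JWK members (`kty`, `n`, `e`, `d`, `kid`, `alg`) without loading the key or confirming that the service accepts it.

```typescript
// Hypothetical pre-flight check; not part of the auth-service codebase.
// Member names follow the standard RSA JWK layout (kty, n, e, d, kid, alg).
interface JwkCheck {
  ok: boolean;
  problems: string[];
}

function checkSigningJwk(raw: string): JwkCheck {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return { ok: false, problems: ['value is not valid JSON'] };
  }
  if (typeof parsed !== 'object' || parsed === null || Array.isArray(parsed)) {
    return { ok: false, problems: ['value is not a JSON object'] };
  }

  const jwk = parsed as Record<string, unknown>;
  const problems: string[] = [];

  if (jwk['kty'] !== 'RSA') {
    problems.push('kty must be "RSA"');
  }
  // A public-only JWK carries n and e but no private exponent d.
  for (const field of ['n', 'e', 'd']) {
    const value = jwk[field];
    if (typeof value !== 'string' || value === '') {
      problems.push(`missing RSA parameter "${field}"`);
    }
  }
  const kid = jwk['kid'];
  if (typeof kid !== 'string' || kid === '') {
    problems.push('kid should be a non-empty string');
  }
  const alg = jwk['alg'];
  if (alg !== undefined && alg !== 'RS256') {
    problems.push('alg should be "RS256" when present');
  }

  return { ok: problems.length === 0, problems };
}
```

Running this against the generator output before storing the secret is cheap; the service's own startup validation remains the authority on whether a key is usable.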
+ +## Key rotation + +Key rotation is a deploy-and-wait procedure. There is no in-process rotation; there is no multi-key JWKS. + +``` +1. Generate a new JWK with a new kid. +2. Store the new JWK in the secret manager. +3. Roll the deployment so new replicas pick up the new MACP_AUTH_SIGNING_KEY_JSON. +4. Wait MACP_AUTH_JWKS_TTL_SECS for all verifiers' JWKS caches to refresh. +5. Existing tokens signed by the old key remain valid until their own exp. +``` + +See the [Operations Runbook](operations.md#key-rotation) for the step-by-step procedure including verification and rollback. + +## Secret handling + +The private JWK is the most sensitive artifact in the MACP security model. A leaked key lets an attacker mint tokens for any `sender` with any scopes until the key is rotated out. + +- **Kubernetes.** Store as a `Secret` and expose it as an env var via `env[].valueFrom.secretKeyRef`. Do not use a `ConfigMap`. +- **AWS.** Store in Secrets Manager or Parameter Store (SecureString), inject via IAM-scoped retrieval at container start. +- **Vault.** Store under a dedicated path with short-TTL dynamic leases; restart the service when the lease renews. +- **Docker Compose / local.** Use `.env` files excluded from version control. Never commit a `.env` with `MACP_AUTH_SIGNING_KEY_JSON` set. + +Confirm the secret does not appear in container manifests, CI artifacts, or log output. The service never logs tokens or keys; audit that your infra around it matches that discipline. + +## Docker + +The provided Dockerfile is multi-stage and ships a minimal runtime image. + +```bash +docker build -t macp-auth-service:local . +docker run --rm -p 3200:3200 macp-auth-service:local +curl http://localhost:3200/healthz +``` + +### Image details + +- **Base:** `node:20-alpine` (builder, deps, runtime stages). +- **User:** non-root `appuser:appgroup`. +- **Final contents:** `dist/` (compiled TypeScript) + `node_modules` (production only) + `package.json`. +- **Entrypoint:** `node dist/index.js`. 
+- **Healthcheck:** `wget -qO- http://localhost:3200/healthz` with a 10 s interval. +- **Exposed port:** `3200`. +- **Size:** ~200 MB uncompressed. + +### Published images + +CI publishes to `ghcr.io/multiagentcoordinationprotocol/auth-service`. PR builds are tagged `pr-`. Merges to `main` are tagged `latest` and `sha-<7hex>`. See `.github/workflows/docker.yml` for the exact tagging strategy. + +### Recommended runtime configuration + +```bash +docker run -d \ + --name macp-auth \ + --restart unless-stopped \ + --read-only --tmpfs /tmp \ + -e MACP_AUTH_ISSUER=auth.example.com \ + -e MACP_AUTH_AUDIENCE=macp-runtime \ + -e MACP_AUTH_MAX_TTL_SECONDS=3600 \ + -e MACP_AUTH_SIGNING_KEY_JSON="$(cat /run/secrets/signing-key.json)" \ + -p 127.0.0.1:3200:3200 \ + ghcr.io/multiagentcoordinationprotocol/auth-service:latest +``` + +The `--read-only` flag is safe because the service writes nothing to disk. Binding to `127.0.0.1` ensures only local callers (or a reverse proxy on the same host) can reach the mint endpoint. 
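After starting the container, a deploy script may want to wait for `/healthz` before sending traffic to the mint endpoint. The following is a minimal sketch, assuming Node 20's global `fetch`; `waitForHealthy` and `healthzProbe` are hypothetical names, and the attempt count and delay are arbitrary defaults, not values taken from the service.

```typescript
// Hypothetical readiness helper for deploy scripts; not shipped with the service.
type Probe = () => Promise<boolean>;

async function waitForHealthy(probe: Probe, attempts: number = 10, delayMs: number = 500): Promise<boolean> {
  for (let i = 0; i < attempts; i++) {
    try {
      if (await probe()) {
        return true;
      }
    } catch {
      // Connection refused while the container is still starting: retry.
    }
    if (i < attempts - 1) {
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  return false;
}

// Example probe against the loopback port mapping used in the docker run command.
const healthzProbe: Probe = async () => {
  const resp = await fetch('http://127.0.0.1:3200/healthz');
  return resp.ok;
};
```

A script would call `await waitForHealthy(healthzProbe)` and abort the rollout when it resolves to `false`. The probe is injected so the retry loop can be exercised without a running container.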
+ +## Kubernetes + +A minimal `Deployment` + `Service` manifest: + +```yaml +apiVersion: apps/v1 +kind: Deployment +metadata: + name: macp-auth +spec: + replicas: 2 + selector: + matchLabels: { app: macp-auth } + template: + metadata: + labels: { app: macp-auth } + spec: + containers: + - name: auth + image: ghcr.io/multiagentcoordinationprotocol/auth-service:sha-abc1234 + ports: + - containerPort: 3200 + env: + - name: MACP_AUTH_ISSUER + value: auth.example.com + - name: MACP_AUTH_AUDIENCE + value: macp-runtime + - { name: MACP_AUTH_SIGNING_KEY_JSON, valueFrom: { secretKeyRef: { name: macp-auth-signing, key: jwk } } } + readinessProbe: + httpGet: { path: /healthz, port: 3200 } + livenessProbe: + httpGet: { path: /healthz, port: 3200 } + resources: + requests: { cpu: 50m, memory: 128Mi } + limits: { cpu: 500m, memory: 256Mi } + securityContext: + readOnlyRootFilesystem: true + runAsNonRoot: true + allowPrivilegeEscalation: false + capabilities: { drop: ["ALL"] } +--- +apiVersion: v1 +kind: Service +metadata: + name: macp-auth +spec: + selector: { app: macp-auth } + ports: + - port: 3200 + targetPort: 3200 +``` + +Because the service is stateless, horizontal scaling is trivial — any replica can handle any mint. All replicas in a deployment must share the same `MACP_AUTH_SIGNING_KEY_JSON`; mixing keys would advertise different JWKS to different verifiers. + +## Verifier (runtime) wiring + +Configure the Rust runtime to trust tokens issued by this service: + +```bash +export MACP_AUTH_ISSUER=auth.example.com # matches auth-service +export MACP_AUTH_AUDIENCE=macp-runtime # matches auth-service +export MACP_AUTH_JWKS_URL=https://auth.example.com/.well-known/jwks.json +export MACP_AUTH_JWKS_TTL_SECS=300 # cache refresh interval +``` + +The runtime fetches the JWKS on first use and caches it for `MACP_AUTH_JWKS_TTL_SECS`. Any token presented to the runtime is rejected unless: + +- The signature verifies against a key in the cached JWKS. 
+- `iss` matches `MACP_AUTH_ISSUER`. +- `aud` matches `MACP_AUTH_AUDIENCE`. +- `exp` is in the future (within tolerable clock skew). + +See the [runtime Getting Started guide](https://github.com/multiagentcoordinationprotocol/runtime/blob/main/docs/getting-started.md#jwt-mode) for the full JWT configuration reference on the runtime side. + +## TLS termination + +The service does not terminate TLS itself. Run it behind: + +- **nginx / Caddy / Traefik** — standard ingress proxy with a Let's Encrypt cert. +- **Envoy / Istio** — mesh-native TLS between the caller and the auth-service. +- **Cloud load balancers** — AWS ALB, GCP HTTPS LB, Azure Application Gateway. + +A minimal nginx snippet: + +```nginx +server { + listen 443 ssl http2; + server_name auth.example.com; + ssl_certificate /etc/letsencrypt/live/auth.example.com/fullchain.pem; + ssl_certificate_key /etc/letsencrypt/live/auth.example.com/privkey.pem; + + location = /.well-known/jwks.json { + proxy_pass http://127.0.0.1:3200; + add_header Cache-Control "public, max-age=60"; + } + + location /tokens { + # Replace with your actual caller-auth mechanism + auth_request /internal-auth-check; + proxy_pass http://127.0.0.1:3200; + } + + location /healthz { proxy_pass http://127.0.0.1:3200; } +} +``` + +The `Cache-Control` on the JWKS is optional but reduces runtime chatter once the cache warms. + +## CI/CD + +Three GitHub Actions workflows ship with the repo. + +| Workflow | Trigger | Purpose | +|----------|---------|---------| +| `ci.yml` | PR + push to `main` | Lint, typecheck, build, test | +| `docker.yml` | PR + push to `main` | Build and publish container image to GHCR | +| `notify-website.yml` | push to `main` with docs changes | Notify the docs website to sync | + +CI runs `npm ci && npm run lint && npm run typecheck && npm run build && npm test` against Node 20. The Docker workflow tags images `pr-` for PR builds and `latest` + `sha-<7hex>` for main. 
+ +## Resource sizing + +A single replica comfortably handles hundreds of mints per second on a modest VM. RSA signing is the dominant cost and runs on libuv worker threads (default 4). Under sustained heavy load, raise `UV_THREADPOOL_SIZE` or scale horizontally — additional replicas cost effectively nothing since there is no shared state. + +Memory is dominated by Node's baseline plus the single keypair. Steady-state is well under 128 MiB. CPU is bursty — idle between mints, ~10 ms on a modern core per RS256 signature. + +## Rolling upgrades + +Because the service is stateless and every replica advertises the same JWK (they share `MACP_AUTH_SIGNING_KEY_JSON`), rolling upgrades have no cross-replica coordination requirement. Rolling from version N to N+1: + +1. Update the image tag in the deployment manifest. +2. Kubernetes / Nomad / ECS performs a standard rolling replacement. +3. Each replica's `/healthz` returns 200 as soon as `loadKey()` completes — typically under a second. +4. Existing tokens continue to verify because the JWK has not changed. + +The one caveat: if the upgrade also rotates the key, follow the [rotation runbook](operations.md#key-rotation) instead. diff --git a/docs/getting-started.md b/docs/getting-started.md new file mode 100644 index 0000000..d131ada --- /dev/null +++ b/docs/getting-started.md @@ -0,0 +1,216 @@ +# Getting Started + +This guide walks you from a fresh checkout to a running auth-service with a minted and verified token. By the end you will have the service listening locally, a token minted via `curl`, and the signature verified against the published JWKS. + +For protocol-level context on how agents authenticate to the runtime, see the [protocol security documentation](https://www.multiagentcoordinationprotocol.io/docs/security). + +## Prerequisites + +You need Node.js 20 or later and npm. The project uses TypeScript and `jose` for JWT signing — both are installed as dependencies. 
+ +```bash +# macOS +brew install node@20 + +# Ubuntu / Debian +curl -fsSL https://deb.nodesource.com/setup_20.x | sudo -E bash - +sudo apt-get install -y nodejs + +# Verify +node --version # v20.x or later +npm --version +``` + +## Install and run + +Clone the repository and install dependencies. + +```bash +git clone https://github.com/multiagentcoordinationprotocol/auth-service.git +cd auth-service +npm install +``` + +### Development server + +With no configuration, the service generates an ephemeral RSA keypair on start and listens on `127.0.0.1:3200`. The keypair lives only as long as the process. + +```bash +npm run dev +``` + +You should see: + +``` +[auth-service] listening on port 3200 +[auth-service] issuer=macp-auth-service audience=macp-runtime +[auth-service] key source: ephemeral +[auth-service] JWKS: http://localhost:3200/.well-known/jwks.json +[auth-service] Mint: POST http://localhost:3200/tokens +``` + +The `key source: ephemeral` line is the signal that you did not provide `MACP_AUTH_SIGNING_KEY_JSON`. That is fine for local development; it is never correct in production. + +### Production server + +In production the service requires a pinned signing key so it survives restarts and so the runtime's JWKS cache stays warm. Generate one once, store it in your secret manager, and inject it at process start. + +```bash +# Generate a production JWK +node -e "const {generateKeyPair,exportJWK}=require('jose'); \ + (async()=>{const {privateKey}=await generateKeyPair('RS256',{extractable:true}); \ + const jwk=await exportJWK(privateKey); jwk.kid='prod-key-1'; \ + console.log(JSON.stringify(jwk))})()" + +# Run with the pinned key +export MACP_AUTH_SIGNING_KEY_JSON='{"kty":"RSA","kid":"prod-key-1",...}' +export MACP_AUTH_ISSUER=auth.example.com +export MACP_AUTH_AUDIENCE=macp-runtime +npm run build && npm start +``` + +See the [Deployment Guide](deployment.md) for the full environment variable reference and the production checklist. 
+ +## Your first minted token + +The mint flow is a single POST. The service validates the request, clamps the TTL to `MACP_AUTH_MAX_TTL_SECONDS`, signs a JWT with the in-memory private key, and returns the token together with the resolved TTL. + +### Step 1: Mint a token + +```bash +curl -sS -X POST http://localhost:3200/tokens \ + -H 'content-type: application/json' \ + -d '{ + "sender": "agent://risk", + "scopes": { + "can_start_sessions": true, + "is_observer": false, + "allowed_modes": ["macp.mode.decision.v1"], + "max_open_sessions": 1, + "can_manage_mode_registry": false + }, + "ttl_seconds": 600 + }' +``` + +Response: + +```json +{ + "token": "eyJhbGciOiJSUzI1NiIsImtpZCI6ImRldi1rZXktMSJ9...", + "sender": "agent://risk", + "expires_in_seconds": 600 +} +``` + +The returned `expires_in_seconds` reflects the **effective** TTL after clamping. If you request `ttl_seconds: 999999` with the default config, you will get `3600` back, not `999999` — `MACP_AUTH_MAX_TTL_SECONDS` is authoritative. + +### Step 2: Inspect the JWT + +The payload carries the requested scopes under the `macp_scopes` claim, plus the standard JWT claims set by the service. + +```bash +curl -sS -X POST http://localhost:3200/tokens \ + -H 'content-type: application/json' \ + -d '{"sender":"agent://risk"}' \ + | jq -r .token \ + | cut -d. -f2 \ + | base64 -d 2>/dev/null \ + | jq +``` + +Example decoded body: + +```json +{ + "macp_scopes": {}, + "iat": 1713800000, + "exp": 1713800300, + "iss": "macp-auth-service", + "aud": "macp-runtime", + "sub": "agent://risk" +} +``` + +### Step 3: Fetch the JWKS + +The public key is published at `/.well-known/jwks.json`. The runtime fetches this endpoint on first use and caches the result for `MACP_AUTH_JWKS_TTL_SECS` seconds. 
+ +```bash +curl -sS http://localhost:3200/.well-known/jwks.json | jq +``` + +```json +{ + "keys": [ + { + "kty": "RSA", + "n": "ukL3...pQ", + "e": "AQAB", + "kid": "dev-key-1", + "alg": "RS256", + "use": "sig" + } + ] +} +``` + +Note that no private material appears here — only `n`, `e`, and the metadata needed for signature verification. + +### Step 4: Verify the signature + +Round-trip the token through `jose.jwtVerify` to confirm the signature, issuer, and audience match. + +```bash +node -e " +const jose = require('jose'); +(async () => { + const token = process.argv[1]; + const jwks = jose.createRemoteJWKSet(new URL('http://localhost:3200/.well-known/jwks.json')); + const { payload } = await jose.jwtVerify(token, jwks, { + issuer: 'macp-auth-service', + audience: 'macp-runtime', + }); + console.log(payload); +})(); +" "$(curl -sS -X POST http://localhost:3200/tokens \ + -H 'content-type: application/json' \ + -d '{"sender":"agent://risk"}' | jq -r .token)" +``` + +If the signature verifies you will see the decoded payload. If it does not, the error is one of `JWSSignatureVerificationFailed`, `JWTExpired`, `JWTClaimValidationFailed`, etc. — the [API Reference](API.md#error-table) maps them to root causes. + +## Pointing the runtime at your dev auth-service + +Configure the Rust runtime to trust tokens issued by this service. Set the issuer, audience, and JWKS URL on the runtime and start it. + +```bash +export MACP_AUTH_ISSUER=macp-auth-service +export MACP_AUTH_AUDIENCE=macp-runtime +export MACP_AUTH_JWKS_URL=http://127.0.0.1:3200/.well-known/jwks.json +export MACP_AUTH_JWKS_TTL_SECS=60 +# (plus the usual runtime config: MACP_ALLOW_INSECURE=1, MACP_BIND_ADDR, etc.) +cargo run --manifest-path ../runtime/Cargo.toml +``` + +Now run any gRPC client with the minted JWT as a bearer token. The runtime will fetch your JWKS on the first request and cache it for 60 seconds. 
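If the runtime rejects a token, it often helps to look at the claims it will be checked against before changing any configuration. The sketch below decodes a compact JWT without verifying its signature, so it is only suitable for debugging. `describeToken` is a hypothetical helper, and it assumes Node's `Buffer` with `base64url` support, which handles the unpadded URL-safe encoding JWT segments use.

```typescript
// Debug-only sketch: decodes a JWT WITHOUT verifying the signature.
// Never use the result for an authorization decision.
interface TokenSummary {
  kid: string | undefined;
  iss: string | undefined;
  aud: string | string[] | undefined;
  sub: string | undefined;
  secondsUntilExpiry: number | undefined;
}

function decodeSegment(segment: string): Record<string, unknown> {
  // JWT segments are base64url-encoded JSON without padding.
  return JSON.parse(Buffer.from(segment, 'base64url').toString('utf8')) as Record<string, unknown>;
}

function asString(value: unknown): string | undefined {
  return typeof value === 'string' ? value : undefined;
}

function describeToken(token: string, nowSeconds: number = Math.floor(Date.now() / 1000)): TokenSummary {
  const parts = token.split('.');
  if (parts.length !== 3) {
    throw new Error('expected a compact JWT with three segments');
  }
  const header = decodeSegment(parts[0] as string);
  const payload = decodeSegment(parts[1] as string);
  const rawExp: unknown = payload['exp'];
  const exp = typeof rawExp === 'number' ? rawExp : undefined;
  return {
    kid: asString(header['kid']),
    iss: asString(payload['iss']),
    aud: payload['aud'] as string | string[] | undefined,
    sub: asString(payload['sub']),
    // Negative values mean the token has already expired.
    secondsUntilExpiry: exp === undefined ? undefined : exp - nowSeconds,
  };
}
```

Comparing `iss` and `aud` from the summary with the runtime's `MACP_AUTH_ISSUER` and `MACP_AUTH_AUDIENCE`, and `kid` with the key published in the JWKS, narrows most rejections to a single mismatched value.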
+ +## Common errors + +| Error | Cause | Fix | +|-------|-------|-----| +| `400 sender is required` | Missing or empty `sender` in request body | Include a non-empty string for `sender` | +| `400 ttl_seconds must be a positive number` | `ttl_seconds` non-positive, `NaN`, or `Infinity` | Pass a positive finite number, or omit to use the default | +| `JWSSignatureVerificationFailed` at the runtime | Runtime's JWKS cache is stale after a key rotation | Wait `MACP_AUTH_JWKS_TTL_SECS` or restart the runtime | +| `JWTClaimValidationFailed: "iss" claim` | `MACP_AUTH_ISSUER` mismatch between auth-service and runtime | Align the two env vars | +| `JWTClaimValidationFailed: "aud" claim` | `MACP_AUTH_AUDIENCE` mismatch | Align the two env vars | +| `JWTExpired` | Token's `exp` has passed | Mint a fresh token; check clock skew between issuer and verifier | +| Ephemeral key rotates every restart | `MACP_AUTH_SIGNING_KEY_JSON` unset | Set it from a secret store for any shared deployment | + +## Next steps + +- [**Integration Guide**](integration.md) — end-to-end wiring with the control-plane, SDK orchestrators, SDK agents, and the runtime +- [**API Reference**](API.md) — full endpoint surface and JWT claim structure +- [**Architecture**](architecture.md) — module layout and signing flow +- [**Deployment Guide**](deployment.md) — production configuration and Docker +- [**Operations Runbook**](operations.md) — key rotation and diagnostics diff --git a/docs/integration.md b/docs/integration.md new file mode 100644 index 0000000..3fd4ba9 --- /dev/null +++ b/docs/integration.md @@ -0,0 +1,438 @@ +# Integration Guide + +This guide is for engineers wiring the auth-service into a larger MACP deployment. It shows the full end-to-end flow from token mint to runtime verification, explains the two roles that consume this service (minting orchestrators and bearer-presenting SDK agents), and gives reference snippets in TypeScript, Rust, and Python. 
+ +For the architectural rationale behind direct-agent-auth, see RFC-MACP-0004 §4 and the [protocol security documentation](https://www.multiagentcoordinationprotocol.io/docs/security). For companion views from the other side of the wire, see: + +- [Runtime — JWT mode](https://github.com/multiagentcoordinationprotocol/runtime/blob/main/docs/getting-started.md#jwt-mode) and [Runtime — Deployment › Authentication](https://github.com/multiagentcoordinationprotocol/runtime/blob/main/docs/deployment.md#authentication) — how the runtime verifies tokens this service issues. +- [Control-plane — Integration](https://github.com/multiagentcoordinationprotocol/control-plane/blob/main/docs/INTEGRATION.md) and [Control-plane — Architecture](https://github.com/multiagentcoordinationprotocol/control-plane/blob/main/docs/ARCHITECTURE.md) — how the reference orchestrator mints tokens and hands them to agents. +- [TypeScript SDK — Authentication](https://github.com/multiagentcoordinationprotocol/typescript-sdk/blob/main/docs/guides/authentication.md) and [Python SDK — Direct-agent-auth](https://github.com/multiagentcoordinationprotocol/python-sdk/blob/main/docs/guides/direct-agent-auth.md) — how SDK agents consume a minted JWT. + +## Two roles, one identity provider + +Anything that talks to the auth-service plays exactly one of these two roles. + +**Minters** call `POST /tokens`. They hold a trust relationship with the auth-service (typically intra-cluster network, optionally reinforced by a proxy-level auth check) and are authorised to issue identities to agents. In the reference stack, the [control-plane](https://github.com/multiagentcoordinationprotocol/control-plane) is the primary minter. 
Any orchestrator built directly on the [TypeScript SDK](https://github.com/multiagentcoordinationprotocol/typescript-sdk) or [Python SDK](https://github.com/multiagentcoordinationprotocol/python-sdk) can mint the same way — the SDKs ship with sample provisioning paths but do not themselves call `POST /tokens`. + +**Bearers** present a minted JWT to the runtime on every gRPC frame. SDK-based agents are the canonical bearers: they load `runtime.bearerToken` from their bootstrap payload, wrap it in `Auth.bearer(...)` / `AuthConfig.for_bearer(...)`, and let the SDK attach it as `Authorization: Bearer ` metadata on every RPC. Bearers never touch the auth-service directly — their only relationship with it is indirect, via the `iss` / `aud` / `kid` on the tokens they carry. + +The control-plane typically plays both roles: it mints tokens for agents it provisions, and it presents its own token when it talks to the runtime on its own behalf (for example to list sessions or install a policy). + +## End-to-end flow + +``` +┌─────────────────────┐ ┌──────────────────┐ ┌──────────────────┐ +│ Minter │ │ auth-service │ │ MACP runtime │ +│ (control-plane / │ │ :3200 │ │ :50051 (gRPC) │ +│ SDK orchestrator) │ │ │ │ │ +└──────────┬──────────┘ └────────┬─────────┘ └─────────┬────────┘ + │ │ │ + │ 1. POST /tokens │ │ + │ { sender, scopes, ttl } │ │ + ├─────────────────────────────────►│ │ + │ │ │ + │ 2. { token, expires_in_secs } │ │ + │◄─────────────────────────────────┤ │ + │ │ │ + │ 3. bootstrap agent with │ │ + │ runtime.bearerToken = token │ │ + │ │ │ + │ │ 4. GET /.well-known/jwks.json │ + │ │ (runtime-initiated, cached) │ + │ │◄─────────────────────────────────┤ + │ │ │ + │ │ 5. JWKS response │ + │ ├─────────────────────────────────►│ + │ │ │ + + ┌─────────────────────┐ ┌──────────────────┐ + │ SDK agent (bearer) │ │ MACP runtime │ + │ TS / Python │ │ │ + └──────────┬──────────┘ └─────────┬────────┘ + │ │ + │ 6. 
gRPC frame │ + │ metadata: authorization = Bearer │ + ├─────────────────────────────────────────────────────────►│ + │ │ + │ 7. runtime verifies sig + iss + aud + exp against │ + │ cached JWKS; maps sub → sender identity and │ + │ macp_scopes → capability set │ + │ │ +``` + +1. A minter — the control-plane or a custom orchestrator built on an SDK — asks the auth-service to mint a JWT for a specific `sender` with specific scopes. +2. The auth-service returns a signed JWT with the effective TTL. +3. The minter hands the token to the agent it is spawning, typically by embedding it in the bootstrap payload at `runtime.bearerToken`. +4. On first use, the runtime fetches the auth-service's JWKS at `MACP_AUTH_JWKS_URL` and caches it for `MACP_AUTH_JWKS_TTL_SECS`. +5. The runtime holds the JWKS in memory until the TTL expires. +6. The SDK agent opens a gRPC channel, wraps the bootstrap token in an `Auth` / `AuthConfig`, and sends every frame with `Authorization: Bearer ` metadata. The SDKs attach this automatically. +7. The runtime verifies signature, `iss`, `aud`, and `exp` on every frame. A successful verify maps the JWT's `sub` to the authenticated sender identity and the `macp_scopes` claim to the capability set. + +The auth-service is **not** in the hot path of a running session. Tokens are minted once at agent provisioning and reused for the session's lifetime. + +## Minter patterns + +### Pattern 1: control-plane provisions an agent + +The [control-plane](https://github.com/multiagentcoordinationprotocol/control-plane) is invoked by a human operator, a CI pipeline, or an upstream orchestration system. It enforces its own authorization policy, mints a scoped token for the target agent, and hands the bootstrap payload to the agent runner. This is the primary minting path in the reference stack. + +```typescript +async function provisionAgent(req: ProvisionRequest, operator: OperatorIdentity): Promise { + // 1. 
Authorize the operator (outside the scope of the auth-service). + await authorizeOperator(operator, req.targetSender); + + // 2. Compute the scopes the operator may grant. May be narrower than what + // they requested, based on the operator's own role. + const scopes = narrowScopes(req.scopes, operator); + + // 3. Mint the token. + const token = await mintToken({ + sender: req.targetSender, + scopes, + ttl_seconds: Math.min(req.ttlSeconds ?? 3600, 3600), + }); + + // 4. Build the bootstrap payload the agent will consume. + const bootstrap = { + runtime: { address: 'macp-runtime:50051', bearerToken: token, tls: true }, + participant: { participantId: req.targetSender }, + run: { sessionId: req.preallocatedSessionId }, + // ...scenario-specific fields + }; + return spawn(req.targetSender, bootstrap); +} +``` + +The control-plane is also the primary reason `POST /tokens` is unauthenticated at the service itself — operator authorization happens *before* the mint call, in the control-plane's own policy layer. If you deviate from that topology (for example, by exposing the auth-service to a less trusted network), add caller authentication at the reverse proxy. + +For the control-plane's own operator-facing surface and integration contract, see [Control-plane INTEGRATION](https://github.com/multiagentcoordinationprotocol/control-plane/blob/main/docs/INTEGRATION.md) and [Control-plane ARCHITECTURE](https://github.com/multiagentcoordinationprotocol/control-plane/blob/main/docs/ARCHITECTURE.md). + +### Pattern 2: custom orchestrator built on an SDK + +When you build an orchestrator directly on the [TypeScript SDK](https://github.com/multiagentcoordinationprotocol/typescript-sdk) or [Python SDK](https://github.com/multiagentcoordinationprotocol/python-sdk), your orchestrator plays the same minter role as the control-plane: it calls `POST /tokens` per agent, embeds the JWT in the bootstrap payload, then spawns the agent. 
The SDKs themselves are agent-side libraries — they present tokens but do not mint them. + +Typical flow for an SDK-based orchestrator: + +```typescript +// Per agent to provision: +async function mintTokenForAgent(senderId: string): Promise<string> { + const resp = await fetch('http://auth-service:3200/tokens', { + method: 'POST', + headers: { 'content-type': 'application/json' }, + body: JSON.stringify({ + sender: senderId, + scopes: { + can_start_sessions: true, + is_observer: false, + allowed_modes: ['macp.mode.decision.v1'], + max_open_sessions: 1, + }, + ttl_seconds: 3600, + }), + }); + if (!resp.ok) { + throw new Error(`Token mint failed: ${resp.status} ${await resp.text()}`); + } + const { token } = (await resp.json()) as { token: string }; + return token; +} + +// Build the bootstrap the SDK agent will consume on startup. +// The shape is documented on the agent side — see the SDK auth guides. +const bootstrap = { + runtime: { + address: 'macp-runtime:50051', + bearerToken: await mintTokenForAgent('agent://risk'), + tls: true, + }, + participant: { participantId: 'agent://risk' }, + run: { sessionId: preallocatedSessionId }, + // ...scenario-specific configuration +}; +spawnAgent(bootstrap); +``` + +The bootstrap contract and the SDK-side consumption pattern are documented in the SDK guides — see [TypeScript SDK — Authentication](https://github.com/multiagentcoordinationprotocol/typescript-sdk/blob/main/docs/guides/authentication.md) and [Python SDK — Direct-agent-auth](https://github.com/multiagentcoordinationprotocol/python-sdk/blob/main/docs/guides/direct-agent-auth.md). 
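An orchestrator that tracks token lifetimes may also want to record the effective TTL instead of the one it asked for, since the service clamps requests above `MACP_AUTH_MAX_TTL_SECONDS`. The following is a minimal sketch, assuming the response fields shown in the Getting Started guide (`token`, `sender`, `expires_in_seconds`); `interpretMintResponse` and `MintOutcome` are hypothetical names, not part of either SDK.

```typescript
// Hypothetical orchestrator-side helper; the field names mirror the
// documented POST /tokens response body.
interface MintResponse {
  token: string;
  sender: string;
  expires_in_seconds: number;
}

interface MintOutcome {
  token: string;
  expiresAtMs: number;
  clamped: boolean;
}

function interpretMintResponse(
  body: unknown,
  requestedTtlSeconds: number | undefined,
  nowMs: number = Date.now(),
): MintOutcome {
  const candidate = body as Partial<MintResponse> | null | undefined;
  if (!candidate || typeof candidate.token !== 'string' || typeof candidate.expires_in_seconds !== 'number') {
    throw new Error('unexpected mint response shape');
  }
  return {
    token: candidate.token,
    // Local estimate of expiry; the authoritative value is the exp claim.
    expiresAtMs: nowMs + candidate.expires_in_seconds * 1000,
    // True when the service granted less than was requested (TTL clamp).
    clamped: requestedTtlSeconds !== undefined && candidate.expires_in_seconds < requestedTtlSeconds,
  };
}
```

Logging `clamped` at provisioning time makes it visible when an agent will outlive its token, which is easier to diagnose there than as a `JWTExpired` rejection at the runtime.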
+ +### Pattern 3: ad-hoc tooling + +Direct curl for debugging, demo scripts, or one-off operator tools: + +```bash +TOKEN=$(curl -sS -X POST http://localhost:3200/tokens \ + -H 'content-type: application/json' \ + -d '{"sender":"operator:alice","scopes":{"can_start_sessions":true}}' \ + | jq -r .token) + +# Use the token with grpcurl to hit the runtime +grpcurl -H "authorization: Bearer ${TOKEN}" -d '{}' \ + macp-runtime:50051 macp.v1.MACPRuntimeService/Initialize +``` + +## Bearer pattern: SDK agents + +Agents built on the SDKs are pure bearers. They receive a minted JWT in their bootstrap, wrap it in the SDK's auth helper, and let the SDK attach `Authorization: Bearer ` on every gRPC frame. The SDKs also enforce an `expectedSender` identity guard that fails fast client-side if the `sender` on an outgoing envelope disagrees with the one the token will authenticate as — saving a runtime `UNAUTHENTICATED` round trip. + +### TypeScript + +```typescript +import { Auth, DecisionSession, MacpClient } from 'macp-sdk-typescript'; + +// Values loaded from the orchestrator-supplied bootstrap: +const runtimeAddress = bootstrap.runtime.address; +const bearerToken = bootstrap.runtime.bearerToken; // minted by this service +const participantId = bootstrap.participant.participantId; +const sessionId = bootstrap.run.sessionId; + +const auth = Auth.bearer(bearerToken, { expectedSender: participantId }); +const client = new MacpClient({ address: runtimeAddress, auth }); +await client.initialize(); + +const session = new DecisionSession(client, { sessionId, auth }); +// ... agent-specific flow: session.start(...), session.propose(...), etc. +``` + +See [TypeScript SDK — Authentication](https://github.com/multiagentcoordinationprotocol/typescript-sdk/blob/main/docs/guides/authentication.md) for the full auth surface, including per-operation auth, session-level defaults, and the identity guard. 
+ +### Python + +```python +from macp_sdk import AuthConfig, DecisionSession, MacpClient + +bearer_token = bootstrap["runtime"]["bearerToken"] # minted by this service +participant_id = bootstrap["participant"]["participantId"] +session_id = bootstrap["run"]["sessionId"] + +auth = AuthConfig.for_bearer(bearer_token, expected_sender=participant_id) + +client = MacpClient(target=bootstrap["runtime"]["address"], auth=auth) +client.initialize() + +session = DecisionSession(client, session_id=session_id, auth=auth) +# ... agent-specific flow ... +``` + +See [Python SDK — Direct-agent-auth](https://github.com/multiagentcoordinationprotocol/python-sdk/blob/main/docs/guides/direct-agent-auth.md) for the initiator/non-initiator distinction, session pre-allocation, and cancellation patterns. + +### Why `expectedSender` matters + +The runtime derives the envelope `sender` from the authenticated identity; a spoofed `sender=` fails at the runtime with `UNAUTHENTICATED`. Setting `expectedSender` on the SDK auth lets the SDK catch the mistake locally and raise `MacpIdentityMismatchError` **before** the envelope hits the wire. Clearer traceback, no wasted RTT, and no ambiguity about whose identity the session was bound to. The SDK auth guides have detailed examples. + +## Runtime wiring + +The runtime must be told where to fetch the JWKS and which `iss` / `aud` to expect. + +```bash +export MACP_AUTH_ISSUER=macp-auth-service +export MACP_AUTH_AUDIENCE=macp-runtime +export MACP_AUTH_JWKS_URL=http://auth-service:3200/.well-known/jwks.json +export MACP_AUTH_JWKS_TTL_SECS=300 +# Runtime's own config: bind addr, TLS, storage, etc. +export MACP_BIND_ADDR=0.0.0.0:50051 +export MACP_ALLOW_INSECURE=1 # or MACP_TLS_CERT_PATH / MACP_TLS_KEY_PATH in prod +cargo run --manifest-path runtime/Cargo.toml +``` + +When `MACP_AUTH_ISSUER` is set, the runtime's JWT resolver activates and the static-bearer resolver is bypassed for JWT-shaped tokens (tokens containing dots). 
If you configure **both** a JWT issuer and a static `MACP_AUTH_TOKENS_FILE`, the runtime runs the JWT resolver first, then the static resolver. Dev-mode fallback is only active when **neither** is configured. + +## Reference snippets: minting + +### TypeScript / Node + +```typescript +import { mintToken } from './mintToken'; + +const token = await mintToken({ + sender: 'agent://risk', + scopes: { can_start_sessions: true, allowed_modes: ['macp.mode.decision.v1'] }, + ttl_seconds: 3600, +}); +``` + +```typescript +// mintToken.ts +export interface MintArgs { + sender: string; + scopes?: Record<string, unknown>; + ttl_seconds?: number; +} +export async function mintToken(args: MintArgs, baseUrl = process.env.AUTH_SERVICE_URL ?? 'http://auth-service:3200'): Promise<string> { + const resp = await fetch(`${baseUrl}/tokens`, { + method: 'POST', + headers: { 'content-type': 'application/json' }, + body: JSON.stringify(args), + }); + if (!resp.ok) { + throw new Error(`mint failed: ${resp.status} ${await resp.text()}`); + } + const { token } = (await resp.json()) as { token: string }; + return token; +} +``` + +### Rust + +```rust +use reqwest::Client; +use serde::{Deserialize, Serialize}; + +#[derive(Serialize)] +struct MintReq<'a> { + sender: &'a str, + scopes: serde_json::Value, + ttl_seconds: u64, +} + +#[derive(Deserialize)] +pub struct MintResp { + pub token: String, + pub expires_in_seconds: u64, +} + +pub async fn mint(base: &str, sender: &str, ttl: u64) -> anyhow::Result<MintResp> { + let body = MintReq { + sender, + scopes: serde_json::json!({ + "can_start_sessions": true, + "allowed_modes": ["macp.mode.decision.v1"], + }), + ttl_seconds: ttl, + }; + let resp = Client::new() + .post(format!("{base}/tokens")) + .json(&body) + .send() + .await? + .error_for_status()?
+ .json::<MintResp>() + .await?; + Ok(resp) +} +``` + +### Python + +```python +import httpx + +def mint_token(base_url: str, sender: str, ttl_seconds: int = 3600) -> str: + resp = httpx.post( + f"{base_url}/tokens", + json={ + "sender": sender, + "scopes": { + "can_start_sessions": True, + "allowed_modes": ["macp.mode.decision.v1"], + }, + "ttl_seconds": ttl_seconds, + }, + timeout=5.0, + ) + resp.raise_for_status() + return resp.json()["token"] +``` + +## Reference snippets: verifying + +You typically do not verify tokens yourself; the runtime does that for you. These snippets are useful for debugging or for non-runtime verifiers (e.g. auxiliary services that also want to trust the same identity provider). + +### TypeScript / Node + +```typescript +import { jwtVerify, createRemoteJWKSet } from 'jose'; + +const JWKS = createRemoteJWKSet(new URL('http://auth-service:3200/.well-known/jwks.json')); + +export async function verifyMacpToken(token: string) { + const { payload } = await jwtVerify(token, JWKS, { + issuer: process.env.MACP_AUTH_ISSUER ?? 'macp-auth-service', + audience: process.env.MACP_AUTH_AUDIENCE ?? 'macp-runtime', + }); + return { + sender: payload.sub as string, + scopes: (payload as any).macp_scopes as Record<string, unknown>, + }; +} +``` + +### Rust (reference; see runtime for production-grade impl) + +```rust +use jsonwebtoken::{decode, DecodingKey, Validation, Algorithm}; + +#[derive(serde::Deserialize)] +pub struct Claims { + pub sub: String, + pub iss: String, + pub aud: String, + pub exp: usize, + pub macp_scopes: serde_json::Value, +} + +pub fn verify(token: &str, jwks_key: &DecodingKey) -> anyhow::Result<Claims> { + let mut validation = Validation::new(Algorithm::RS256); + validation.set_issuer(&["macp-auth-service"]); + validation.set_audience(&["macp-runtime"]); + let data = decode::<Claims>(token, jwks_key, &validation)?; + Ok(data.claims) +} +``` + +## Scopes model + +The auth-service serializes scopes verbatim into the `macp_scopes` claim; it does **not** interpret them.
Interpretation lives in the runtime. The canonical fields the runtime understands are: + +| Field | Type | Meaning | +|-------|------|---------| +| `can_start_sessions` | boolean | May submit `SessionStart` envelopes. | +| `can_manage_mode_registry` | boolean | May register/unregister/promote extension modes. | +| `is_observer` | boolean | May passive-subscribe to sessions they are not a participant of. | +| `allowed_modes` | string[] | Non-empty = restrict to these mode ids; empty or omitted = all modes. | +| `max_open_sessions` | number | Upper bound on concurrent open sessions initiated by this sender. | + +Because the mint endpoint passes scopes through unmodified, any additional keys you add are surfaced to the runtime. The runtime ignores unknown scope fields for forward compatibility — you can safely extend the shape as long as the runtime's enforcement logic is updated in lockstep. + +## Common integration mistakes + +| Mistake | Symptom | Fix | +|---------|---------|-----| +| `MACP_AUTH_ISSUER` differs between auth-service and runtime | Every mint fails verification with `JWTClaimValidationFailed: iss` | Set both from the same config source. | +| `MACP_AUTH_AUDIENCE` differs | Same, with `aud` | Same fix. | +| Caller forgets to set `content-type: application/json` on mint requests | `400 sender is required` even with `sender` in body | `express.json()` only parses when the header is correct. Set it. | +| Caller passes `ttl_seconds: 0` | `400 ttl_seconds must be a positive number` | Omit the field (defaults apply) or pass a positive number. | +| Agent retries a token after `exp` | Runtime returns `UNAUTHENTICATED` | Mint a fresh token; tokens are not refreshed server-side. | +| Runtime started before the auth-service is reachable | First mint-backed request fails with `UNAUTHENTICATED` because JWKS fetch errored | Ensure orchestration starts auth-service first, or make the runtime's JWKS fetch resilient (retry with backoff). 
| +| Two auth-service replicas with different `MACP_AUTH_SIGNING_KEY_JSON` | Intermittent `JWSSignatureVerificationFailed` depending on which replica served the JWKS last | Every replica in a deployment must share the same key. Use a single secret source. | + +## Observability tips for callers + +The auth-service does not log mints. If you need an audit trail, instrument it **in the caller**: + +```typescript +async function mintTokenAudited(args: MintArgs, context: CallerContext): Promise { + const start = Date.now(); + try { + const token = await mintToken(args); + auditLog.info({ + event: 'token.minted', + caller: context.caller, + target_sender: args.sender, + scopes: args.scopes, + ttl_requested: args.ttl_seconds, + duration_ms: Date.now() - start, + }); + return token; + } catch (err) { + auditLog.error({ + event: 'token.mint_failed', + caller: context.caller, + target_sender: args.sender, + error: (err as Error).message, + duration_ms: Date.now() - start, + }); + throw err; + } +} +``` + +Treat the resulting log as security-sensitive — it reveals which identities exist and which capabilities they hold. diff --git a/docs/operations.md b/docs/operations.md new file mode 100644 index 0000000..47e3c54 --- /dev/null +++ b/docs/operations.md @@ -0,0 +1,207 @@ +# Operations Runbook + +This is the runbook for operators running the auth-service in production. It covers routine operations (key rotation, restarts), diagnostics (log interpretation, common failures), and incident response (suspected key compromise, mint endpoint abuse). + +For first-time setup and the production checklist, see the [Deployment Guide](deployment.md). For protocol-level security considerations, see the [protocol security documentation](https://www.multiagentcoordinationprotocol.io/docs/security). + +## Startup logs + +The service prints five lines on startup. Use these as a fingerprint for correct configuration. 
+ +``` +[auth-service] listening on port 3200 +[auth-service] issuer=macp-auth-service audience=macp-runtime +[auth-service] key source: env +[auth-service] JWKS: http://localhost:3200/.well-known/jwks.json +[auth-service] Mint: POST http://localhost:3200/tokens +``` + +Key things to check: + +| Field | Expect | If wrong | +|-------|--------|----------| +| `port` | Matches your `PORT` env or 3200 default | Container port mapping broken; fix the orchestrator manifest. | +| `issuer` | Matches the runtime's `MACP_AUTH_ISSUER` | Verifiers will reject every token with `JWTClaimValidationFailed: iss`. | +| `audience` | Matches the runtime's `MACP_AUTH_AUDIENCE` | Verifiers will reject every token with `JWTClaimValidationFailed: aud`. | +| `key source` | `env` in production | **`ephemeral` in production is an incident.** See below. | + +### `key source: ephemeral` in production + +If you see `key source: ephemeral` in a production replica, treat it as a P1. + +**Impact.** Every issued token is signed by a process-local keypair that will not survive a restart. Verifiers cache the current JWKS; when the pod restarts (for any reason) every token issued by the new process will fail signature verification until the verifier's cache expires and re-fetches. + +**Cause.** `MACP_AUTH_SIGNING_KEY_JSON` is unset in the container's environment. + +**Remediation.** +1. Confirm the secret is present in your secret store. +2. Confirm the deployment manifest references the secret correctly (`envFrom` / `valueFrom.secretKeyRef`). +3. Roll the deployment. On restart the log should read `key source: env`. + +## Key rotation + +Rotation is the primary remediation for a suspected key compromise and a routine hygiene operation otherwise. Plan for rotations to be seamless to callers — tokens in flight keep working, new tokens use the new key. + +### Routine rotation + +``` +Step 1. Generate a new JWK. 
+ node -e "const {generateKeyPair,exportJWK}=require('jose'); + (async()=>{const {privateKey}=await generateKeyPair('RS256',{extractable:true}); + const jwk=await exportJWK(privateKey); + jwk.kid='prod-key-' + new Date().toISOString().slice(0,10); + jwk.alg='RS256'; jwk.use='sig'; + console.log(JSON.stringify(jwk))})()" + +Step 2. Store the new JWK in the secret manager. + - Do not delete the previous key yet. + - Keep the previous key available for rollback until step 6 completes. + +Step 3. Roll the deployment. + - Kubernetes: kubectl rollout restart deployment/macp-auth + - Docker Compose: docker compose up -d --force-recreate macp-auth + - Verify: every replica logs `key source: env` and the new kid. + +Step 4. Verify the new JWKS is served. + curl -sS https://auth.example.com/.well-known/jwks.json | jq .keys[0].kid + # Should print the new kid. + +Step 5. Wait MACP_AUTH_JWKS_TTL_SECS. + - Verifiers refresh their cache on this interval. + - During this window, new tokens (signed with new key) may fail verification + on verifiers whose cache still holds the old JWKS. Existing tokens + (signed with old key) also fail on verifiers that have already refreshed. + - This window is the only observable disruption; keep it short + (MACP_AUTH_JWKS_TTL_SECS=60 for fast rotations, 300 for routine). + +Step 6. Retire the old key. + - Delete the previous JWK from the secret manager. + - Optionally audit access to confirm no replica is still holding it. +``` + +### Emergency rotation (suspected compromise) + +Rotate immediately. Do not wait for a maintenance window. The process is the same as routine rotation but with these additions: + +- **Set `MACP_AUTH_JWKS_TTL_SECS=30` on verifiers before rotating** so the disruption window shrinks from minutes to seconds. Restart verifiers to apply. +- **Shorten `MACP_AUTH_MAX_TTL_SECONDS` temporarily** to reduce the life of any outstanding tokens signed by the compromised key. 
Every token issued by the compromised key remains valid until its own `exp`. +- **Audit the mint log.** The auth-service itself does not log mints, so you need your reverse-proxy access log, API gateway log, or caller-side audit trail. Reconstruct which `sender` identities were minted during the exposure window. +- **Notify downstream operators** that tokens issued before the rotation timestamp should be treated as suspect for the remainder of their TTL. + +### Rollback + +If the new key is broken (rare — typically a malformed JWK prevents startup entirely), revert the secret to the previous JWK and roll the deployment again. Because the old key's JWKS may still be in verifier caches, rollback usually completes with zero verifier-observable disruption. + +## Restart procedure + +Restarts are low-risk when the signing key is pinned. The startup sequence is: + +1. `loadConfigFromEnv()` parses env. +2. `loadKey()` imports the JWK and derives the public JWKS. +3. `createApp(config, signing)` builds the Express app. +4. `app.listen(config.port)` binds the socket. +5. First `/healthz` returns 200. + +Total elapsed time is typically under 500 ms. `readinessProbe` traffic should succeed on the first attempt after the container starts. + +### Graceful shutdown + +On `SIGTERM` or `SIGINT`, `index.ts` calls `server.close()`, which stops accepting new connections and waits for in-flight requests to complete. If shutdown takes longer than 10 seconds (unlikely — signing is fast), a fallback `setTimeout` forces exit code 1. + +Kubernetes sends `SIGTERM` during a pod termination. The service handles it cleanly; no `preStop` lifecycle hook is required. + +## Common failures + +### Callers receive `ECONNREFUSED` from `/tokens` + +Check: +- Container is running: `docker ps` / `kubectl get pods`. +- Port binding: the container must expose `PORT` and the orchestrator must route it. 
+- Reverse proxy config: if the proxy's upstream points to a stale IP, connections fail even though the container is up. + +### Callers receive `400 sender is required` unexpectedly + +Check: +- `content-type: application/json` is set on the request. Without it, `express.json()` does not parse the body and `req.body` is `undefined`. +- Body is valid JSON. Malformed bodies are rejected by the express json middleware before the handler runs. + +### Verifier reports `JWSSignatureVerificationFailed` after a rotation + +Normal during the `MACP_AUTH_JWKS_TTL_SECS` window immediately after rotation. If it persists: +- Confirm the new JWKS is served: `curl https://auth/.well-known/jwks.json`. +- Confirm the verifier is fetching it: check verifier logs for JWKS fetch activity. +- Restart the verifier to force a cache refresh. + +### Verifier reports `JWTClaimValidationFailed: iss` or `aud` + +The auth-service and the runtime disagree on the issuer/audience. Audit both sides: + +```bash +# Match only issuer/audience so MACP_AUTH_SIGNING_KEY_JSON is never printed. +# Auth-service +kubectl exec deploy/macp-auth -- printenv | grep -E '^MACP_AUTH_(ISSUER|AUDIENCE)=' +# Runtime +kubectl exec deploy/macp-runtime -- printenv | grep -E '^MACP_AUTH_(ISSUER|AUDIENCE)=' +``` + +Any drift between the two is the bug. These values should be set from the same source of truth (shared config map or centralized env source). + +### Tokens expire faster than expected + +Check: +- `MACP_AUTH_MAX_TTL_SECONDS` is not clamping harder than you expect. The returned `expires_in_seconds` in the mint response is authoritative. +- Clock skew between the auth-service host and the verifier host. Large skew (>30 s) makes tokens appear expired to the verifier. Verify NTP sync on both. + +### Cannot start: "Invalid key: …" at boot + +`MACP_AUTH_SIGNING_KEY_JSON` is malformed. Common causes: +- Shell quoting stripped the quotes inside the JSON. Quote the value correctly or use a file-based secret mount. +- The JWK is public-only (missing `d`, `p`, `q`, etc.). The service requires a **private** JWK.
+- The JWK is for a non-RSA algorithm. The service only supports RS256. + +Regenerate the JWK using the command in the [Deployment Guide](deployment.md#signing-key-generation) and re-inject. + +## Abuse mitigation + +`POST /tokens` has no rate limit, no caller authentication by default, and no audit log. If the endpoint is reachable beyond a trusted perimeter, assume abuse is possible. + +Recommended controls, in order of strength: + +1. **Network isolation.** Bind to `127.0.0.1` and front with a local reverse proxy; or deploy in a private subnet with no public ingress. This is the default assumption. +2. **Reverse-proxy rate limiting.** `nginx limit_req`, Envoy local rate limit filter, or your API gateway's per-source quotas. Cap at what your legitimate callers need. +3. **Caller authentication at the proxy.** mTLS, a shared-secret `Authorization` header check, or upstream OAuth. The auth-service does not perform this check itself — always put it in the proxy. +4. **Audit logging at the proxy.** The service does not log mints. The proxy should log `{ timestamp, caller_identity, sender, scopes, ttl }` for every 2xx response. Treat these logs as security-sensitive (they reveal which identities are active). +5. **Short TTLs.** `MACP_AUTH_MAX_TTL_SECONDS` is the ceiling on damage from a single abused mint. Tune down aggressively — legitimate callers can always re-mint. + +## Monitoring signals + +The service does not emit metrics. Monitor it via external signals: + +| Signal | Source | What it tells you | +|--------|--------|-------------------| +| `/healthz` 2xx rate | Reverse proxy / orchestrator probe | Service availability. | +| `POST /tokens` 4xx rate | Reverse proxy access log | Malformed client requests; investigate if spiking. | +| `POST /tokens` 5xx rate | Reverse proxy access log | Unexpected server errors; should be zero in normal operation. | +| `POST /tokens` p95 latency | Reverse proxy access log | RS256 signing latency. Typically <50 ms. 
Sustained >250 ms indicates CPU pressure — scale horizontally. | +| Container restarts | Orchestrator | Unexpected restarts mean unexpected key rotations (for ephemeral keys) or env/secret regressions. Alert on non-zero. | +| JWKS fetch rate (runtime-side) | Runtime metrics | Should equal `1 / MACP_AUTH_JWKS_TTL_SECS` per replica. Missing or bursty fetches indicate a verifier-side caching bug. | + +If you need in-process metrics, wrap `createApp` to register Prometheus counters before mounting the routes. The hooks are straightforward because `createApp` is pure. + +## Incident checklist: suspected key compromise + +1. **Rotate the signing key immediately** using the emergency procedure above. +2. **Shorten `MACP_AUTH_MAX_TTL_SECONDS`** to bound outstanding-token exposure. +3. **Reduce `MACP_AUTH_JWKS_TTL_SECS` on every verifier** to speed rollout, then restart verifiers. +4. **Audit the proxy access log** for unexpected `sender` values or scope escalations during the exposure window. +5. **Audit the runtime-side auth log** to see which tokens were actually *used* during the window. +6. **Notify downstream operators** and file an incident record with timestamps, exposed kid, rotation time, and affected sender identities. +7. **Post-rotation: run a deployment audit** — confirm the secret store held no stale copies of the compromised JWK, and confirm CI / ops tooling rotated any cached values. + +## Incident checklist: unauthorized access to the mint endpoint + +If you discover that `POST /tokens` was reachable from an untrusted network: + +1. **Restrict network access immediately** — update the reverse proxy, firewall, or service mesh to block the exposure. +2. **Treat every active token as potentially compromised.** Rotate the signing key as in the compromise runbook. +3. **Audit the proxy access log** for every request during the exposure window — especially unrecognized `sender` values. +4. 
**Confirm no log-based persistence issues.** The auth-service does not persist, but upstream audit logs may contain sensitive `sender` values and should be reviewed for further exposure.
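
Both incident checklists ask how long tokens minted before a rotation can still carry an unexpired `exp`. For tokens the service itself minted, there is a simple upper bound: the rotation time plus the `MACP_AUTH_MAX_TTL_SECONDS` value that was in effect before it was shortened. The following is a minimal sketch of that arithmetic; `suspectWindowEnd` is a hypothetical helper, and the clock-skew allowance is an assumption rather than a documented runtime setting.

```typescript
// Hypothetical helper: latest instant at which a token minted before the
// rotation could still have an unexpired `exp`.
function suspectWindowEnd(
  rotatedAt: Date,
  maxTtlSeconds: number, // MACP_AUTH_MAX_TTL_SECONDS in effect before the incident
  skewAllowanceSeconds: number = 0, // assumed tolerance for verifier clock skew
): Date {
  if (!Number.isFinite(maxTtlSeconds) || maxTtlSeconds <= 0) {
    throw new RangeError('maxTtlSeconds must be a positive number');
  }
  if (!Number.isFinite(skewAllowanceSeconds) || skewAllowanceSeconds < 0) {
    throw new RangeError('skewAllowanceSeconds must be zero or positive');
  }
  return new Date(rotatedAt.getTime() + (maxTtlSeconds + skewAllowanceSeconds) * 1000);
}

// Example: rotation at 12:00 UTC with a one-hour TTL ceiling.
const windowEnd = suspectWindowEnd(new Date('2025-01-01T12:00:00Z'), 3600);
console.log(windowEnd.toISOString()); // prints "2025-01-01T13:00:00.000Z"
```

Recording this timestamp in the incident record gives downstream operators a concrete end to the period during which pre-rotation tokens should be treated as suspect.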