From 47d8f84ebd24a80a78dc0e16b53358a989a78bc8 Mon Sep 17 00:00:00 2001
From: Ajit Koti
Date: Wed, 22 Apr 2026 18:30:26 -0700
Subject: [PATCH] Update Docs

---
 .env.example            | 12 ++++++
 README.md               | 40 +++++++----------
 docs/API.md             | 23 ++--------
 docs/ARCHITECTURE.md    | 96 ++++++++++++++++++++++++++++++-----------
 docs/INTEGRATION.md     | 65 ++++++++++++++--------------
 docs/TROUBLESHOOTING.md | 30 +++++++++++--
 6 files changed, 160 insertions(+), 106 deletions(-)

diff --git a/.env.example b/.env.example
index 380c3d1..e6c1ace 100644
--- a/.env.example
+++ b/.env.example
@@ -27,6 +27,18 @@ RUNTIME_USE_DEV_HEADER=true # Defaults to false in production (dev-only)
 RUNTIME_DEV_AGENT_ID=control-plane
 RUNTIME_REQUEST_TIMEOUT_MS=30000 # Warn if < 5000
 
+# ── Runtime auth (JWT-mint mode, preferred) ──────
+# When MACP_AUTH_SERVICE_URL is set, RuntimeCredentialResolverService mints a
+# short-lived RS256 JWT for the control-plane (scopes: is_observer=true,
+# can_start_sessions=false) instead of using the static RUNTIME_BEARER_TOKEN.
+# Tokens are cached until TTL minus a 30s refresh buffer and 10s clock-skew;
+# concurrent refreshes are deduped. On mint failure the resolver falls back to
+# the static Bearer so a brief auth-service outage doesn't fail every call.
+MACP_AUTH_SERVICE_URL= # e.g. https://auth.internal — leave blank to disable
+MACP_AUTH_SERVICE_TIMEOUT_MS=5000
+MACP_AUTH_TOKEN_TTL_SECONDS=3600
+MACP_AUTH_TOKEN_SENDER=control-plane
+
 # ── Session polling (observer mode) ──────────────
 # Control-plane polls GetSession(sessionId) until the initiator agent opens
 # the session, then subscribes read-only via StreamSession.
diff --git a/README.md b/README.md
index c20dc4b..6e36de3 100644
--- a/README.md
+++ b/README.md
@@ -16,11 +16,12 @@ The control plane is an observer. **It never calls `Send`** on the runtime.
 
 ## Invariants (see `../ui-console/plans/direct-agent-auth.md` §Invariants)
 
-1. The control-plane runtime identity is least-privilege: `can_start_sessions: false` in runtime's `MACP_AUTH_TOKENS_JSON`.
+1. The control-plane runtime identity is least-privilege: `can_start_sessions: false, is_observer: true` — either encoded in a minted short-lived JWT (preferred) or in a static entry in the runtime's `MACP_AUTH_TOKENS_JSON`.
 2. The control-plane never calls `Send` — enforced by an invariant lint test (`src/runtime/observer-invariant.spec.ts`).
 3. `POST /runs` accepts only a scenario-agnostic `RunDescriptor`. Fields like `kickoff[]`, `participants[].role`, `policyHints`, `commitments[]`, `initiatorParticipantId` are rejected (`forbidNonWhitelisted: true`).
 4. `sessionId` ownership: allocated by the control-plane (UUID v4) at `POST /runs` and returned to the caller, which distributes it to agents via bootstrap.
 5. Cancellation authority stays with the initiator agent unless the scenario's policy explicitly delegates to the control-plane (see `metadata.cancellationDelegated`).
+6. The observer `StreamSession` writes exactly one passive-subscribe frame (`{subscribeSessionId, afterSequence}`) per RFC-MACP-0006 §3.2 and then **keeps the write side open** — half-closing would signal "client is done" and stop live-envelope broadcast.
 
 ## Endpoints
 
@@ -97,15 +98,7 @@ npm run drizzle:migrate
 npm run start:dev
 ```
 
-Make sure the runtime is running at `RUNTIME_ADDRESS`. For dev auth against the reference runtime profile:
-
-```bash
-export MACP_ALLOW_INSECURE=1
-export MACP_ALLOW_DEV_SENDER_HEADER=1
-cargo run
-```
-
-Then:
+Make sure the runtime is running at `RUNTIME_ADDRESS`. For dev auth against the reference runtime profile, start the runtime with `MACP_ALLOW_INSECURE=1 MACP_ALLOW_DEV_SENDER_HEADER=1` (see [runtime/docs/getting-started.md#authentication](../runtime/docs/getting-started.md#authentication) → *Development mode*) and set on the control-plane:
 
 ```bash
 RUNTIME_ALLOW_INSECURE=true
@@ -113,26 +106,23 @@ RUNTIME_USE_DEV_HEADER=true
 RUNTIME_DEV_AGENT_ID=control-plane
 ```
 
-## Production runtime auth
+## Runtime auth (observer identity)
 
-Add one entry to the runtime's `MACP_AUTH_TOKENS_JSON` for the control-plane. It is a **read-only observer** and must not have session-start authority:
+The control-plane has **exactly one** runtime identity with fixed scope `is_observer: true, can_start_sessions: false`. `RuntimeCredentialResolverService` resolves credentials per gRPC call using a three-step fallback chain:
 
-```json
-{
-  "token": "obs-control-plane-token",
-  "sender": "control-plane",
-  "can_start_sessions": false
-}
-```
+| Mode | Trigger | Control-plane env |
+| --- | --- | --- |
+| JWT mint (preferred) | `MACP_AUTH_SERVICE_URL` set | `MACP_AUTH_SERVICE_URL`, `MACP_AUTH_SERVICE_TIMEOUT_MS`, `MACP_AUTH_TOKEN_TTL_SECONDS`, `MACP_AUTH_TOKEN_SENDER` |
+| Static Bearer | JWT disabled or mint fails | `RUNTIME_BEARER_TOKEN` (must match an entry in the runtime's `MACP_AUTH_TOKENS_JSON` with `can_start_sessions: false`) |
+| Dev header | `RUNTIME_USE_DEV_HEADER=true`, local only | `RUNTIME_DEV_AGENT_ID` |
 
-If your deployment makes the control-plane the policy admin (optional), set `can_manage_mode_registry: true`.
+For the runtime-side token configuration, TLS, and the full production auth story, see:
 
-Then in the control-plane environment:
-```bash
-RUNTIME_BEARER_TOKEN=obs-control-plane-token
-```
+- [runtime/docs/getting-started.md#authentication](../runtime/docs/getting-started.md#authentication) — dev / production / JWT modes and resolver order
+- [runtime/docs/deployment.md#authentication](../runtime/docs/deployment.md#authentication) — production resolver chain (JWT → static bearer → dev fallback); TLS env vars live in [§ Production checklist](../runtime/docs/deployment.md#production-checklist) and [§ Environment variables](../runtime/docs/deployment.md#environment-variables)
+- [python-sdk/docs/auth.md#observer-identities](../python-sdk/docs/auth.md#observer-identities) — observer-identity pattern (the shape the control-plane uses) and `expected_sender` guardrail
 
-Each agent additionally gets its own entry (with `can_start_sessions: true` for the initiator). Per-agent tokens are **not** shared with the control-plane — the scenario layer distributes them to agents via bootstrap. See `../ui-console/plans/direct-agent-auth.md` for the full onboarding flow.
+Per-agent tokens are **not** held by the control-plane — the scenario layer distributes them to agents via bootstrap. See `../ui-console/plans/direct-agent-auth.md` for the onboarding flow.
 
 ## Migration from pre-2026-04 control-plane
 
diff --git a/docs/API.md b/docs/API.md
index 43b2d88..cf771d2 100644
--- a/docs/API.md
+++ b/docs/API.md
@@ -17,18 +17,11 @@ Rate limit: 100 requests per 60 seconds per client. Payload limit: 1MB.
 
 ### Upstream runtime auth (observer identity)
 
-The control-plane has **exactly one** runtime identity — its own least-privilege
-Bearer token. It never calls `Send`; agents authenticate to the runtime directly
-(RFC-MACP-0004 §4). Its entry in the runtime's `MACP_AUTH_TOKENS_JSON` must have
-`can_start_sessions: false`.
+The control-plane has **exactly one** runtime identity. It never calls `Send`; agents authenticate to the runtime directly (RFC-MACP-0004 §4). The scope is fixed: `is_observer: true, can_start_sessions: false`.
 
-| Env var | Purpose |
-| --- | --- |
-| `RUNTIME_BEARER_TOKEN` | Control-plane's own observer Bearer token. Used for every runtime call (`Initialize`, `GetSession`, `StreamSession`, `ListPolicies`, etc.). |
-| `RUNTIME_USE_DEV_HEADER` | Local dev fallback — sends `x-macp-agent-id: ` when no Bearer token is configured. Requires `MACP_ALLOW_DEV_SENDER_HEADER=1` on the runtime. |
+Configuration, env vars, and the three-step fallback chain (JWT mint → static Bearer → dev header) are documented in [ARCHITECTURE.md § Runtime Credential Resolution](./ARCHITECTURE.md#runtime-credential-resolution). For the runtime-side token configuration (`MACP_AUTH_TOKENS_JSON` shape, JWT claim expectations, TLS/mTLS), see [runtime/docs/getting-started.md#authentication](../../runtime/docs/getting-started.md#authentication) and [runtime/docs/deployment.md#authentication](../../runtime/docs/deployment.md#authentication).
 
-Per-agent tokens are **not** held by the control-plane. They live in the scenario
-layer (examples-service) and flow to agents via their bootstrap.
+Per-agent tokens are **not** held by the control-plane — they live in the scenario layer (examples-service) and flow to agents via their bootstrap.
 
 ---
 
@@ -542,15 +535,7 @@ Policy events are produced when:
 
 ### Policy Rule Schemas (RFC-MACP-0012)
 
-Rules are opaque to the control plane (passed through as JSON to the runtime), but must conform to the RFC's per-mode schemas:
-
-| Mode | Rule Sections |
-|------|---------------|
-| **Decision** | `voting` (algorithm, threshold, quorum, weights), `objection_handling` (block_severity_vetoes, `veto_threshold`), `evaluation` (required_before_voting, `minimum_confidence`), `commitment` (authority, `designated_roles`, require_vote_quorum) |
-| **Quorum** | `threshold` (type: `n_of_m`/`percentage`/`weighted`, value), `abstention` (`counts_toward_quorum`, `interpretation`), `commitment` |
-| **Proposal** | `acceptance` (`criterion`), `counter_proposal` (`max_rounds`), `rejection` (`terminal_on_any_reject`), `commitment` |
-| **Task** | `assignment` (`allow_reassignment_on_reject`), `completion` (`require_output`), `commitment` |
-| **Handoff** | `acceptance` (`implicit_accept_timeout_ms`), `commitment` |
+Rules are opaque to the control-plane — the request body is passed through as JSON to `runtime.RegisterPolicy`. Per-mode rule schemas (Decision / Proposal / Task / Handoff / Quorum), worked examples, and evaluation semantics are documented canonically in [runtime/docs/policy.md](../../runtime/docs/policy.md) — see *Rule examples by mode*, *How evaluation works*, and *Commitment authority*.
 
 ---
 
diff --git a/docs/ARCHITECTURE.md b/docs/ARCHITECTURE.md
index ebf8a1d..eaafa7c 100644
--- a/docs/ARCHITECTURE.md
+++ b/docs/ARCHITECTURE.md
@@ -40,6 +40,8 @@ MACP distinguishes between two communication planes:
 └─────────────────────────────────────┘ └───────────────────────────────────┘
 ```
 
+Deeper explainers: [python-sdk/docs/protocol.md#two-planes-of-communication](../../python-sdk/docs/protocol.md#two-planes-of-communication) (the plane-split invariant), [python-sdk/docs/protocol.md#envelopes](../../python-sdk/docs/protocol.md#envelopes) (envelope shape + session binding), and [runtime/docs/API.md#streaming-watches](../../runtime/docs/API.md#streaming-watches) (`WatchSignals` semantics on the ambient plane).
+
 ## Request Flow (observer mode — direct-agent-auth 2026-04-15)
 
 ```
@@ -61,22 +63,51 @@ POST /runs (RunDescriptor — scenario-agnostic; see CP-1)
 ```
 
 The control-plane **never** calls `Send` — agents drive the session via their own gRPC
-connection with their own Bearer tokens (RFC-MACP-0004 §4). The read-only observer stream
-filters envelopes by `sessionId` and never writes a frame.
+connection with their own Bearer tokens (RFC-MACP-0004 §4). The observer `StreamSession`
+writes exactly one passive-subscribe frame (`{subscribeSessionId, afterSequence}`) per
+RFC-MACP-0006 §3.2 and then **keeps the write side open**; half-closing would signal
+"client is done" and cause the runtime to stop forwarding envelopes. The read-only
+stream filters envelopes by `sessionId` and never writes another frame.
+
+## Runtime Credential Resolution
+
+Every gRPC call goes through `RuntimeCredentialResolverService`, which resolves the
+control-plane's observer identity using a **three-step fallback chain**:
+
+1. **JWT mint** (when `MACP_AUTH_SERVICE_URL` is set) — `RuntimeJwtMinterService` POSTs to `${url}/tokens` for a short-lived RS256 token with scope `{is_observer: true, can_start_sessions: false}`. Cached until expiry minus a 30s refresh buffer and 10s clock-skew; concurrent refreshes deduped via in-flight promise. Mint failures log `auth_mint_failure` and fall through.
+2. **Static Bearer** — attaches `RUNTIME_BEARER_TOKEN` verbatim. Must match an entry in the runtime's `MACP_AUTH_TOKENS_JSON` with `can_start_sessions: false`.
+3. **Dev header** — attaches `x-macp-agent-id: ` instead of `Authorization`. Requires the runtime to enable `MACP_ALLOW_DEV_SENDER_HEADER=1`.
+
+For token configuration on the runtime side and the resolver order as the runtime sees it, see [runtime/docs/getting-started.md#authentication](../../runtime/docs/getting-started.md#authentication) and [runtime/docs/deployment.md#authentication](../../runtime/docs/deployment.md#authentication). The minter is covered by `src/runtime/runtime-jwt-minter.service.spec.ts` (TTL refresh, concurrent-refresh dedupe, 4xx / missing-token / network failure modes).
 
 ## Event Pipeline
 
+Two gRPC stream sources feed the same normalization pipeline:
+
 ```
-Runtime gRPC stream
-  → StreamConsumerService (consumption loop + idle timeout + reconnection)
-  → EventNormalizerService (raw → canonical, including derived events)
-  → RunEventService (transactional sequence allocation + persistence)
-  → EventRepository.appendRaw + appendCanonical
-  → ProjectionService.applyAndPersist (update UI read model)
-  → MetricsService.recordEvents (update counters)
-  → StreamHubService.publishEvent (SSE → live UI subscribers)
+                                                   ┌─→ EventRepository
+                                                   │     (appendRaw + appendCanonical)
+  StreamSession (per-session) ─┐                   │
+                               ├→ EventNormalizer ─┼─→ ProjectionService.applyAndPersist
+  WatchSignals (ambient)      ─┘ (raw → canonical) │     (UI read model, per-run lock)
+                                                   │
+                                                   ├─→ MetricsService.recordEvents
+                                                   │     (tokenUsage, costUsd, counts)
+                                                   │
+                                                   └─→ StreamHubService.publishEvent
+                                                         (SSE → live UI subscribers)
 ```
 
+- **`StreamConsumerService`** drives the per-session stream with idle timeout + reconnection,
+  and persists a stream cursor for lossless resume.
+- **`SignalConsumerService`** drives the ambient `WatchSignals` stream. Signal envelopes
+  carry an empty `sessionId`; the consumer correlates each envelope to a run through the
+  decoded payload's `correlation_session_id` (or `envelope.sessionId` for progress
+  envelopes that are session-scoped). Without this, agent-emitted signals like
+  `llm.call.completed` (token usage) would be invisible.
+- **`RunEventService.persistRawAndCanonical`** runs sequence allocation, raw append,
+  canonical append, and projection update inside a single DB transaction.
+
 ## Session Discovery (WatchSessions)
 
 When `SESSION_DISCOVERY_ENABLED=true` (default), the `SessionDiscoveryService` subscribes
@@ -85,9 +116,22 @@ started by external launchers (not via `POST /runs`). For each `created` event,
 a run, binds the session, subscribes the observer stream, and begins projecting events.
 Terminal events (`resolved`, `expired`) finalize the auto-discovered run.
 
+`SignalConsumerService` is gated on the same `SESSION_DISCOVERY_ENABLED` flag — if session
+discovery is off, ambient signals are also ignored.
+
 This enables the control-plane to observe and project any session the runtime hosts,
 even if the launching service doesn't use the control-plane's `POST /runs` endpoint.
 
+The three long-running observation services (`StreamConsumerService`,
+`SessionDiscoveryService`, `SignalConsumerService`) each track their in-flight loop promise
+and drain it on `onModuleDestroy` with a bounded 2s timeout. Reconnect sleeps are
+cancellable via an aborted timer, so shutdown doesn't stall for 5s after a transient
+stream end. This matters for both production graceful shutdown and integration-test
+teardown — it's the fix that lets the DB pool close after all `persistRawAndCanonical`
+chain entries have resolved, rather than under them. The integration-test helper
+(`test/helpers/test-app.ts`) also wires `drainBackgroundWork()` into `app.close()` to
+force-terminate in-progress runs before the drain.
+
 ## Message / Signal / Context — removed (direct-agent-auth CP-5/6/7)
 
 The `POST /runs/:id/{messages,signal,context}` endpoints were removed 2026-04-15 and now
@@ -101,11 +145,11 @@ them into canonical events via the pipeline above.
 | Layer | Directory | Responsibility |
 |-------|-----------|---------------|
 | Controllers | `src/controllers/` | HTTP endpoints — runs, runtime, dashboard, webhooks, admin, health |
-| Run Orchestration | `src/runs/` | RunManager (state machine), RunExecutor (coordination), StreamConsumer (event loop) |
-| Runtime Abstraction | `src/runtime/` | `RuntimeProvider` interface, `RustRuntimeProvider` (gRPC), `ProtoRegistryService` |
+| Run Orchestration | `src/runs/` | RunManager (state machine), RunExecutor (coordination), StreamConsumer (per-session event loop), SessionDiscovery (`WatchSessions`), SignalConsumer (`WatchSignals`) |
+| Runtime Abstraction | `src/runtime/` | `RuntimeProvider` interface, `RustRuntimeProvider` (gRPC), `ProtoRegistryService`, `RuntimeCredentialResolverService` (JWT → static-bearer → dev-header chain), `RuntimeJwtMinterService` (short-lived JWT mint + cache) |
 | Events | `src/events/` | Normalization (raw→canonical), transactional persistence, SSE publishing |
 | Projection | `src/projection/` | Applies canonical events to build UI read models (versioned) |
-| Dashboard | `src/dashboard/` | Aggregated KPIs, recent runs, runtime health, and time-series chart data |
+| Dashboard | `src/dashboard/` | Aggregated KPIs (runs, signals, tokens, cost), recent runs, runtime health, time-series charts |
 | Insights | `src/insights/` | Export bundles, run comparison |
 | Webhooks | `src/webhooks/` | Webhook registration, HMAC delivery, retry logic |
 | Audit | `src/audit/` | Administrative action logging |
@@ -140,22 +184,22 @@ Key relationships:
 
 ## Coordination Modes
 
-| Mode | Proto Package | Key Message Types |
-|------|--------------|-------------------|
-| Decision | `macp.modes.decision.v1` | Proposal, Evaluation, Objection, Vote |
-| Proposal | `macp.modes.proposal.v1` | Proposal, CounterProposal, Accept, Reject, Withdraw |
-| Task | `macp.modes.task.v1` | TaskRequest, TaskAccept, TaskUpdate, TaskComplete, TaskFail |
-| Handoff | `macp.modes.handoff.v1` | HandoffOffer, HandoffContext, HandoffAccept, HandoffDecline |
-| Quorum | `macp.modes.quorum.v1` | ApprovalRequest, Approve, Reject, Abstain |
+The control-plane is mode-agnostic — it forwards mode URIs to the runtime, observes the resulting envelopes, and projects them for the UI. The canonical mode specifications (message flow, terminal conditions, payload shapes) live in the runtime docs:
+
+- [runtime/docs/modes.md](../../runtime/docs/modes.md) — Decision, Proposal, Task, Handoff, Quorum, plus Multi-Round and extension modes
+- [runtime/docs/examples.md](../../runtime/docs/examples.md) — end-to-end walkthroughs per mode
 
-All modes terminate with `Commitment` (from `macp.v1.CommitmentPayload`).
+All modes terminate with `Commitment` (`macp.v1.CommitmentPayload`). The control-plane normalises the per-mode message types into two canonical events — `proposal.created` / `proposal.updated` — preserving `messageType` in `data.messageType` for discrimination. See the [Canonical Event Types](./API.md#canonical-event-types) table in API.md for the mapping.
 
 ## Key Design Decisions
 
 1. **Scenario-agnostic**: Accepts only a generic `RunDescriptor` — scenario-specific fields (`kickoff[]`, `participants[].role`, `policyHints`, `commitments[]`, `initiatorParticipantId`) are rejected with 400 via `forbidNonWhitelisted: true`.
 2. **Three-layer event pipeline**: Raw → canonical → projections. Raw preserves original data; canonical provides normalized, typed view.
-3. **Observer-only streaming**: `subscribeSession({runId, sessionId})` returns a read-only `RuntimeSessionHandle` — `events` async iterable + `abort()`. No `send()`.
-4. **Transactional event persistence**: Sequence allocation + persistence in single DB transaction.
-5. **Snake_case → camelCase normalization**: ProtoRegistryService converts Python/JSON snake_case to protobufjs camelCase.
-6. **Proto-encoded payloads**: Real runtime requires proto encoding; control plane supports JSON fallback for testing.
-7. **Circuit breaker**: CLOSED/OPEN/HALF_OPEN wrapping all gRPC unary calls with configurable threshold and reset.
+3. **Observer-only streaming**: `subscribeSession({runId, sessionId, afterSequence?})` returns a read-only `RuntimeSessionHandle` — `events` async iterable + `abort()`. No `send()`. The provider writes exactly one passive-subscribe frame and keeps the write side open for the session's lifetime (RFC-MACP-0006 §3.2).
+4. **JWT-first runtime auth**: The credential resolver prefers minted short-lived JWTs (via `MACP_AUTH_SERVICE_URL`) and falls back to a static Bearer or dev header. Scopes are fixed at mint time (`is_observer: true, can_start_sessions: false`) so the observer identity can never accidentally gain write authority.
+5. **Transactional event persistence**: Sequence allocation + persistence in single DB transaction.
+6. **Snake_case → camelCase normalization**: ProtoRegistryService converts Python/JSON snake_case to protobufjs camelCase.
+7. **Proto-encoded payloads**: Real runtime requires proto encoding; control plane supports JSON fallback for testing.
+8. **Circuit breaker**: CLOSED/OPEN/HALF_OPEN wrapping all gRPC unary calls with configurable threshold and reset.
+9. **Bindable idempotency**: `bindSession` catches `ConflictException` from the state-machine guard and returns the current run, so a raced transition (RunExecutor vs SessionDiscovery) logs a warning instead of crashing the process.
+10. **Graceful drain on shutdown**: Background observation services expose tracked loop promises and a bounded drain (default 2s) from `onModuleDestroy`, ensuring in-flight `persistRawAndCanonical` chain entries complete before the DB pool closes.
diff --git a/docs/INTEGRATION.md b/docs/INTEGRATION.md
index 0de8c23..48c5998 100644
--- a/docs/INTEGRATION.md
+++ b/docs/INTEGRATION.md
@@ -8,45 +8,35 @@
 Key methods to implement (observer-only surface, post direct-agent-auth):
 
 - `initialize()` — protocol version negotiation.
-- `subscribeSession({runId, runtimeSessionId, afterSequence?})` — read-only `StreamSession` observer; returns `{events, abort}`. **Never writes envelopes.** Per RFC-MACP-0006 §3.2 the provider writes a single passive-subscribe frame (`{subscribeSessionId, afterSequence}`) and immediately half-closes the write side; the runtime then replays accepted history from `afterSequence` (default 0 = full replay) before switching to live broadcast.
+- `subscribeSession({runId, runtimeSessionId, afterSequence?})` — read-only `StreamSession` observer; returns `{events, abort}`. **Never writes envelopes.** Per RFC-MACP-0006 §3.2 the provider writes exactly one passive-subscribe frame (`{subscribeSessionId, afterSequence}`) and **keeps the write side open** for the session's lifetime. Half-closing would signal "client is done" and cause the runtime to drop every envelope broadcast afterwards. The runtime replays accepted history from `afterSequence` (default 0 = full replay) then switches to live broadcast. See [runtime/docs/sdk-guide.md#streaming](../../runtime/docs/sdk-guide.md#streaming) and [runtime/docs/API.md#message-transport](../../runtime/docs/API.md#message-transport) for the canonical stream lifecycle.
+- `watchSessions()` — returns an `AsyncIterable` for `created` / `resolved` / `expired` events. Backs `SessionDiscoveryService`. Canonical RPC: [runtime/docs/API.md#session-lifecycle](../../runtime/docs/API.md#session-lifecycle); SDK-side discovery patterns: [python-sdk/docs/guides/session-discovery.md](../../python-sdk/docs/guides/session-discovery.md).
+- `watchSignals()` — returns an `AsyncIterable` of ambient Signal/Progress envelopes off the runtime's `signal_bus`. Backs `SignalConsumerService` — token-usage signals (`llm.call.completed`) arrive here, not on per-session streams. See [runtime/docs/API.md#streaming-watches](../../runtime/docs/API.md#streaming-watches).
 - `getSession()` — poll for session state (used by the observer's `pollForOpenSession` loop).
 - `cancelSession()` — only called when `run.metadata.cancellationDelegated === true` (Option B in direct-agent-auth §Cancellation design).
 - `getManifest()` / `listModes()` / `listRoots()` / `health()` — metadata.
-- `registerPolicy()` / `unregisterPolicy()` / `getPolicy()` / `listPolicies()` — governance (RFC-MACP-0012).
+- `registerPolicy()` / `unregisterPolicy()` / `getPolicy()` / `listPolicies()` — governance. Rule schemas and evaluation semantics: [runtime/docs/policy.md](../../runtime/docs/policy.md) (RFC-MACP-0012).
 
 ## Agents emit envelopes directly
 
-Agents authenticate to the runtime with their own Bearer tokens (RFC-MACP-0004 §4) and emit envelopes via `macp-sdk-python` / `macp-sdk-typescript`:
+Agents authenticate to the runtime with their own Bearer tokens (RFC-MACP-0004 §4) and emit envelopes via `macp-sdk-python` / `macp-sdk-typescript`. The control-plane never brokers agent envelopes — the old HTTP escalation endpoints (`POST /runs/:id/{messages,signal,context}`) now return **410 Gone**.
 
-```python
-# Python example (direct-agent-auth)
-from macp_sdk import MacpClient, AuthConfig, DecisionSession, new_session_id
+For the agent-side bootstrap and how `sessionId` flows from `POST /runs` to the initiator and non-initiator agents, see:
 
-auth = AuthConfig.for_bearer(os.environ["MACP_BEARER_TOKEN"], expected_sender="evaluator")
-client = MacpClient(target="runtime.internal:50051", secure=True, auth=auth)
-await client.initialize()
-session = DecisionSession(client, session_id=bootstrap.run.sessionId, auth=auth)
-stream = session.open_stream()
-await session.evaluate(proposal_id="prop-1", recommendation="APPROVE", confidence=0.95)
-```
+- **Python SDK** — [guides/direct-agent-auth.md](../../python-sdk/docs/guides/direct-agent-auth.md) (bootstrap shape, initiator vs non-initiator, `expected_sender`, cancellation) and [guides/agent-framework.md](../../python-sdk/docs/guides/agent-framework.md) (`from_bootstrap` factory + handler context)
+- **TypeScript SDK** — [README.md § Agent Framework](../../typescript-sdk/README.md#agent-framework) and [docs/guides/agent-framework.md](../../typescript-sdk/docs/guides/agent-framework.md) (`fromBootstrap()` + strategies)
+- **Migration** — `../../ui-console/plans/direct-agent-auth.md` (end-to-end story of the 2026-04-15 refactor)
 
-```typescript
-// TypeScript example
-import { MacpClient, Auth, DecisionSession } from 'macp-sdk-typescript';
-
-const client = new MacpClient({
-  address: 'runtime.internal:50051',
-  secure: true,
-  auth: Auth.bearer(process.env.MACP_BEARER_TOKEN!, { expectedSender: 'evaluator' }),
-});
-await client.initialize();
-const session = new DecisionSession(client, { sessionId: bootstrap.run.sessionId });
-const stream = session.openStream();
-await session.evaluate({ proposalId: 'prop-1', recommendation: 'APPROVE', confidence: 0.95 });
-```
+## Authenticating to the runtime
+
+Per-gRPC-call credential resolution uses a three-step fallback chain:
+
+| Mode | Trigger | Control-plane env vars |
+| --- | --- | --- |
+| **JWT mint (preferred)** | `MACP_AUTH_SERVICE_URL` set | `MACP_AUTH_SERVICE_URL`, `MACP_AUTH_SERVICE_TIMEOUT_MS` (5000), `MACP_AUTH_TOKEN_TTL_SECONDS` (3600), `MACP_AUTH_TOKEN_SENDER` (`control-plane`) |
+| **Static Bearer** | JWT disabled or mint failed | `RUNTIME_BEARER_TOKEN` |
+| **Dev header** (local only) | `RUNTIME_USE_DEV_HEADER=true` | `RUNTIME_DEV_AGENT_ID` (`control-plane`) |
 
-The control-plane's old HTTP escalation endpoints (`POST /runs/:id/{messages,signal,context}`)
-now return **410 Gone**. See `../plans/../../ui-console/plans/direct-agent-auth.md` for the full migration story.
+Mint behaviour: token cached until expiry minus 30s refresh buffer minus 10s clock-skew, concurrent refreshes deduped, mint failures log `auth_mint_failure` and fall through to the static Bearer. For the runtime-side token shape (`MACP_AUTH_TOKENS_JSON`), TLS/mTLS, and the JWT claim expectations, see [runtime/docs/getting-started.md#authentication](../../runtime/docs/getting-started.md#authentication) and [runtime/docs/deployment.md#authentication](../../runtime/docs/deployment.md#authentication).
 
 ## Consuming SSE Streams
 
@@ -112,17 +102,26 @@ Webhook deliveries include `X-MACP-Signature` (HMAC-SHA256) and `X-MACP-Event` h
 ## Running Integration Tests
 
 ```bash
+# Start the test Postgres (port 5433 — separate from the dev DB on 5432)
+docker compose -f docker-compose.test.yml up -d postgres-test
+
 # Mock runtime (fast, no external dependencies)
 npm run test:integration
 
 # Real Rust runtime (needs runtime on port 50051)
 INTEGRATION_RUNTIME=remote RUNTIME_ADDRESS=127.0.0.1:50051 npm run test:integration
-
-# Python agent E2E tests (LangChain + CrewAI)
-./scripts/run-e2e.sh decision
 ```
 
-See `test/integration/` for TypeScript integration tests. Python agent harnesses now live in the `examples-service` repo (not `test-agents/`).
+See `test/integration/` for the suites and `test/helpers/test-app.ts` for the NestJS boot
+harness. The harness wraps `app.close()` so every `afterAll` hook runs
+`drainBackgroundWork()` first — force-terminating in-progress runs, then awaiting
+`StreamConsumerService`, `SessionDiscoveryService`, and `SignalConsumerService` drains
+before the DB pool closes. Without this, pending `persistRawAndCanonical` chain entries
+would race the pool teardown and surface as "Test suite failed to run" even when every
+assertion passed.
+
+Python agent E2E tests live in the `examples-service` repo and run against the runtime
+directly via `macp-sdk-python` — see `examples-service/README.md`.
 
 ## Environment Variables
 
diff --git a/docs/TROUBLESHOOTING.md b/docs/TROUBLESHOOTING.md
index fd4fb28..037712f 100644
--- a/docs/TROUBLESHOOTING.md
+++ b/docs/TROUBLESHOOTING.md
@@ -43,11 +43,32 @@
 4. Manually cancel: `POST /runs/{id}/cancel`
 5. If recovery is enabled (`RUN_RECOVERY_ENABLED=true`), the system auto-recovers orphaned runs on startup
 
+## Auth-service unreachable / JWT mint failure
+
+**Symptom:** Log line `auth_mint_failure reason=...` or `JWT mint failed; falling back to static bearer`.
+
+**Explanation:** `MACP_AUTH_SERVICE_URL` is set, but the auth-service is down, returned non-2xx, or its response was unparseable. The credential resolver automatically falls back to `RUNTIME_BEARER_TOKEN` for this call.
+
+**Checks:**
+1. Is the auth-service reachable? `curl -X POST $MACP_AUTH_SERVICE_URL/tokens -d '{}' -H 'content-type: application/json'` (expect a 4xx response, not a connection error).
+2. Is `RUNTIME_BEARER_TOKEN` set as a fallback? Without it the call eventually proceeds with no `Authorization` header (dev-header mode) or fails auth on the runtime side.
+3. If the auth-service is healthy but calls still fail, check `MACP_AUTH_SERVICE_TIMEOUT_MS` (default 5000 ms) — slow auth-services can time out under load.
+
+**See also:** [runtime/docs/getting-started.md#authentication](../../runtime/docs/getting-started.md#authentication) → *Resolver order* for how the runtime evaluates inbound credentials, and [ARCHITECTURE.md § Runtime Credential Resolution](./ARCHITECTURE.md#runtime-credential-resolution) for the control-plane side of the chain.
+
+## bindSession ConflictException in logs
+
+**Symptom:** Log line `bindSession no-op for run : cannot transition ... (current status=running)`.
+
+**Explanation:** Not an error. Two paths can race to bind the same run — `RunExecutorService` for `POST /runs`-created runs, and `SessionDiscoveryService` for runs auto-discovered via `WatchSessions`. Whichever arrives second sees the run already past `binding_session`. As of the `subscribe-session` PR, the second call is a logged no-op; it no longer crashes the process.
+
+**When to investigate:** only if you see this repeatedly for the *same* runId — that would indicate a loop somewhere retrying the bind. A single occurrence per run is normal.
+
 ## Legacy Write Endpoints Return 410 Gone
 
 **Symptom:** `POST /runs/:id/messages`, `/signal`, or `/context` returns `410 Gone` with `errorCode: ENDPOINT_REMOVED`.
 
-**Explanation:** The control-plane is observer-only as of the 2026-04-15 direct-agent-auth refactor. Agents authenticate to the runtime directly and emit their own envelopes via `macp-sdk-python` / `macp-sdk-typescript`. See `docs/API.md` § "Messages & Signals — emission is NOT via the control-plane" for migration guidance.
+**Explanation:** The control-plane is observer-only as of the 2026-04-15 direct-agent-auth refactor. Agents authenticate to the runtime directly and emit their own envelopes via `macp-sdk-python` / `macp-sdk-typescript`. See `docs/API.md` § "Messages & Signals — emission is NOT via the control-plane" for the mapping, and the SDK guides for the new agent flow: [python-sdk direct-agent-auth](../../python-sdk/docs/guides/direct-agent-auth.md), [typescript-sdk agent-framework](../../typescript-sdk/docs/guides/agent-framework.md).
 
 ## Agent Envelopes Not Appearing in Projection
 
@@ -56,8 +77,8 @@
 **Checks:**
 1. Confirm the run's `runtimeSessionId` matches the `session_id` the agent is writing to (`GET /runs/:id`).
 2. Check stream consumer logs for `StreamSession` reconnection loops — the observer subscribes read-only and must be connected.
-3. Confirm the runtime echoes envelopes back on the stream (some runtimes only echo certain message types). `signal.emitted` and `message.sent` canonical events require `stream-envelope` entries on the observer stream.
-4. For session discovery, verify `SESSION_DISCOVERY_ENABLED=true` so externally-launched sessions auto-create runs.
+3. Confirm the runtime echoes envelopes back on the stream (some runtimes only echo certain message types). `signal.emitted` and `message.sent` canonical events require `stream-envelope` entries on the observer stream. See [runtime/docs/API.md#message-transport](../../runtime/docs/API.md#message-transport) for StreamSession semantics and [runtime/docs/sdk-guide.md#streaming](../../runtime/docs/sdk-guide.md#streaming) for the observer lifecycle.
+4. For session discovery, verify `SESSION_DISCOVERY_ENABLED=true` so externally-launched sessions auto-create runs. Concepts: [python-sdk/docs/guides/session-discovery.md](../../python-sdk/docs/guides/session-discovery.md).
 
 ## SSE Stream Drops
 
@@ -90,6 +111,9 @@
 **Prometheus metric re-registration error:**
 - Tests that create multiple NestJS apps must call `promClient.register.clear()` between apps
 
+**"Test suite failed to run" even though every assertion passed:**
+- Teardown leak — background observation services (`StreamConsumerService`, `SignalConsumerService`, `SessionDiscoveryService`) had in-flight `persistRawAndCanonical` work when the DB pool closed. Fixed by `test/helpers/test-app.ts` → `drainBackgroundWork()` which awaits each service's bounded drain before Nest's own `onModuleDestroy` sweep. If you see this in a *new* test, make sure you created the app via `createTestApp(...)` so the `app.close()` wrapper is in place.
+
 ## Common Error Codes
 
 | Code | HTTP | Meaning |