Merged
23 changes: 17 additions & 6 deletions .env.example
@@ -14,23 +14,34 @@ DB_POOL_MAX=20 # Min 2 in production
DB_POOL_IDLE_TIMEOUT=30000
DB_POOL_CONNECTION_TIMEOUT=5000

# ── Runtime (gRPC) ───────────────────────────────
# ── Runtime (gRPC) — observer-only control-plane identity ────
# The control-plane has ONE least-privilege runtime identity. Agents authenticate
# to the runtime directly (RFC-MACP-0004 §4); the control-plane never calls Send.
# Its token in MACP_AUTH_TOKENS_JSON must have `can_start_sessions: false`.
RUNTIME_KIND=rust
RUNTIME_ADDRESS=127.0.0.1:50051
RUNTIME_TLS=false # Must be true in production (or set RUNTIME_ALLOW_INSECURE=true)
RUNTIME_ALLOW_INSECURE=true # Defaults to false in production
RUNTIME_BEARER_TOKEN= # Required in production when RUNTIME_USE_DEV_HEADER is enabled
RUNTIME_USE_DEV_HEADER=true # Defaults to false in production
RUNTIME_BEARER_TOKEN= # Control-plane's own observer Bearer token
RUNTIME_USE_DEV_HEADER=true # Defaults to false in production (dev-only)
RUNTIME_DEV_AGENT_ID=control-plane
RUNTIME_REQUEST_TIMEOUT_MS=30000 # Warn if < 5000

# ── Session polling (observer mode) ──────────────
# Control-plane polls GetSession(sessionId) until the initiator agent opens
# the session, then subscribes read-only via StreamSession.
SESSION_POLL_BASE_MS=100
SESSION_POLL_MAX_MS=1000
SESSION_POLL_TIMEOUT_MS=60000

# ── UI-initiated cancel (Option A) ────────────────
# HTTP timeout for proxying a UI cancel to the initiator agent's cancelCallback.
CANCEL_CALLBACK_TIMEOUT_MS=5000

# ── Circuit Breaker ──────────────────────────────
RUNTIME_CIRCUIT_BREAKER_THRESHOLD=5
RUNTIME_CIRCUIT_BREAKER_RESET_MS=30000

# ── Kickoff Retry ────────────────────────────────
KICKOFF_MAX_RETRIES=3

# ── Stream Consumer ─────────────────────────────
STREAM_IDLE_TIMEOUT_MS=120000
STREAM_MAX_RETRIES=5
264 changes: 102 additions & 162 deletions README.md
@@ -1,124 +1,92 @@
# MACP Control Plane (NestJS)

A scenario-agnostic control plane for the MACP runtime.
A scenario-agnostic, **observer-only** control plane for the MACP runtime.

This service is the backend that a Next.js UI talks to for run lifecycle, live stream projection, replay, traces, metrics, and artifacts.

## Boundary
## Role

The control plane intentionally does **not** own scenario definitions.
The control plane is an observer. **It never calls `Send`** on the runtime.

- **UI**: browse scenarios, launch runs, render graphs and traces
- **Scenario Registry**: scenario packs, templates, validation, scenario-to-execution request creation
- **Control Plane**: run lifecycle, runtime execution, session streaming, event normalization, replay, traces, artifacts
- **Runtime**: actual MACP orchestration and mode semantics
- **UI**: browse runs, launch, render graphs and traces.
- **Scenario layer** (e.g. `examples-service`): compiles scenarios into a generic `RunDescriptor` for this service, plus a per-agent bootstrap for the initiator and participant agents.
- **Agents**: authenticate to the runtime directly with their own Bearer tokens and emit their own envelopes (SessionStart, kickoff, Proposal / Evaluation / Vote / etc.) via `macp-sdk-python` or `macp-sdk-typescript`.
- **Control plane**: allocates `sessionId`, polls `GetSession(sessionId)` until the initiator agent opens it, then subscribes read-only to `StreamSession(sessionId)`. Projects canonical events for the UI.
- **Runtime**: authoritative orchestrator of MACP envelopes and modes.
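The observer flow in the **Control plane** bullet (poll `GetSession` until the initiator opens the session, then subscribe) could look roughly like the sketch below. It assumes an exponential backoff driven by `SESSION_POLL_BASE_MS`, `SESSION_POLL_MAX_MS`, and `SESSION_POLL_TIMEOUT_MS`; the backoff shape, the names `PollConfig`, `nextDelayMs`, `pollUntilOpen`, and the `getSession` callback are all hypothetical and not taken from this repository.

```typescript
// Hypothetical sketch: names and backoff shape are assumptions, not the repo's API.
interface PollConfig {
  baseMs: number;    // SESSION_POLL_BASE_MS
  maxMs: number;     // SESSION_POLL_MAX_MS
  timeoutMs: number; // SESSION_POLL_TIMEOUT_MS
}

// Assumed backoff: double the delay on each attempt, capped at maxMs.
function nextDelayMs(attempt: number, cfg: PollConfig): number {
  return Math.min(cfg.maxMs, cfg.baseMs * Math.pow(2, attempt));
}

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Resolves true once getSession reports the session exists, false on timeout.
// getSession stands in for a read-only GetSession(sessionId) call.
async function pollUntilOpen(
  getSession: (sessionId: string) => Promise<boolean>,
  sessionId: string,
  cfg: PollConfig,
): Promise<boolean> {
  const deadline = Date.now() + cfg.timeoutMs;
  for (let attempt = 0; Date.now() < deadline; attempt++) {
    if (await getSession(sessionId)) return true; // initiator opened the session
    const remaining = deadline - Date.now();
    if (remaining <= 0) break;
    // Never sleep past the overall deadline.
    await sleep(Math.min(nextDelayMs(attempt, cfg), remaining));
  }
  return false; // caller decides how to mark the run
}
```

Under these assumptions, a `true` result is the point at which the control plane would open its read-only `StreamSession(sessionId)` subscription.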

## Why this repo is generic
## Invariants (see `../ui-console/plans/direct-agent-auth.md` §Invariants)

This service only accepts a fully resolved `ExecutionRequest`.
It does not accept `scenarioId`, interpret fraud/business meaning, or infer domain semantics.

It only knows:

- how to validate execution-plan structure
- how to start a runtime session
- how to stream runtime events
- how to normalize them for the UI
- how to persist projections and replay data

## Important contract additions

To make the request truly runtime-safe, this implementation adds three fields beyond the original sketch:

1. `session.initiatorParticipantId`
- MACP session start needs a sender identity.
- If omitted, the control plane falls back to the first kickoff sender, then the requester, then the first participant.

2. `kickoff[].messageType`
- The runtime envelope needs an exact MACP `message_type`.
- A generic `kind = "request"` is not enough to build a runtime envelope.

3. `payloadEnvelope` / `contextEnvelope`
- The runtime uses raw `bytes` payloads.
- JSON is supported for convenience, but this repo also supports:
- `json`
- `text`
- `base64`
- `proto` (fully-qualified protobuf type name + value)

These additions keep the API scenario-agnostic while making it executable against the runtime.

## Runtime integration notes

This repo vendors the runtime protobuf files under `proto/` and uses `@grpc/grpc-js` + `@grpc/proto-loader` at runtime.

The current runtime protobuf surface supports:

- `Initialize`
- `Send`
- `StreamSession`
- `GetSession`
- `CancelSession`
- `GetManifest`
- `ListModes`
- `ListRoots`

### StreamSession assumption

The uploaded runtime currently disables `StreamSession`, but your target design says it will be added.

This repo therefore introduces one explicit runtime-facing assumption:

- the control plane opens `StreamSession`
- the first outbound streaming frame is a subscription envelope
- the subscription `messageType` defaults to `SessionWatch`
- the payload defaults to `{ "sessionId": "..." }`

That behavior is isolated in `RustRuntimeProvider` so you can update it the moment the runtime finalizes the stream-subscription contract.
1. The control-plane runtime identity is least-privilege: `can_start_sessions: false` in the runtime's `MACP_AUTH_TOKENS_JSON`.
2. The control-plane never calls `Send` — enforced by an invariant lint test (`src/runtime/observer-invariant.spec.ts`).
3. `POST /runs` accepts only a scenario-agnostic `RunDescriptor`. Fields like `kickoff[]`, `participants[].role`, `policyHints`, `commitments[]`, `initiatorParticipantId` are rejected (`forbidNonWhitelisted: true`).
4. `sessionId` ownership: allocated by the control-plane (UUID v4) at `POST /runs` and returned to the caller, which distributes it to agents via bootstrap.
5. Cancellation authority stays with the initiator agent unless the scenario's policy explicitly delegates to the control-plane (see `metadata.cancellationDelegated`).
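Invariant 3 can be illustrated with a small, framework-free whitelist check. This is only a sketch of the idea behind `forbidNonWhitelisted: true`, not the service's actual validation: the allowed keys are copied from the example request in this README and may be incomplete, and `findNonWhitelisted` is a hypothetical helper name.

```typescript
// Hypothetical sketch. The allowed keys come from the README's example request
// and may be incomplete; the real DTOs are the source of truth.
type Json = Record<string, unknown>;

const ALLOWED_ROOT: readonly string[] = ["mode", "runtime", "session", "execution"];
const ALLOWED_SESSION: readonly string[] = [
  "sessionId", "modeName", "modeVersion", "configurationVersion",
  "policyVersion", "ttlMs", "participants", "metadata",
];
const ALLOWED_PARTICIPANT: readonly string[] = ["id"];

function isObject(v: unknown): v is Json {
  return typeof v === "object" && v !== null && !Array.isArray(v);
}

// Returns the paths of keys on obj that are not in the allowed list.
function extraKeys(obj: Json, allowed: readonly string[], prefix: string): string[] {
  return Object.keys(obj)
    .filter((k) => allowed.indexOf(k) === -1)
    .map((k) => prefix + k);
}

// Collects every non-whitelisted field path at the root, session, and participant levels.
function findNonWhitelisted(descriptor: Json): string[] {
  const problems = extraKeys(descriptor, ALLOWED_ROOT, "");
  const session = descriptor["session"];
  if (isObject(session)) {
    problems.push(...extraKeys(session, ALLOWED_SESSION, "session."));
    const participants = session["participants"];
    if (Array.isArray(participants)) {
      participants.forEach((p: unknown, i: number) => {
        if (isObject(p)) {
          problems.push(...extraKeys(p, ALLOWED_PARTICIPANT, "session.participants[" + i + "]."));
        }
      });
    }
  }
  return problems;
}
```

A non-empty result corresponds to the case where `POST /runs` would reject the body, for example a legacy request that still carries `kickoff[]` or `participants[].role`.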

## Endpoints

### Runs

- `POST /runs`
- `GET /runs/:id`
- `GET /runs/:id/state`
- `GET /runs/:id/events`
- `GET /runs/:id/stream` (SSE)
- `POST /runs/:id/cancel`
- `POST /runs/:id/messages` (send a session-bound MACP message into an active run)
- `POST /runs/:id/replay`
- `GET /runs/:id/replay/stream` (SSE)
- `GET /runs/:id/replay/state`
- `POST /runs` — accepts a `RunDescriptor`; returns `{runId, sessionId, status, traceId}`
- `GET /runs/:id` — run record
- `GET /runs/:id/state` — projected UI state
- `GET /runs/:id/events` — canonical events
- `GET /runs/:id/stream` — SSE of live events
- `POST /runs/:id/cancel` — UI cancel (Option A: proxies to initiator agent's cancelCallback; Option B: calls runtime.CancelSession when policy-delegated)
- `POST /runs/validate` — preflight validation
- `POST /runs/:id/clone` — clone with optional tag overrides (session context overrides rejected)
- `POST /runs/:id/replay` — replay descriptor
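The two cancel paths for `POST /runs/:id/cancel` can be read as a small decision function. The sketch below assumes that an explicit `metadata.cancellationDelegated` selects Option B and that otherwise a `cancelCallback` selects Option A; that precedence, the `rejected` fallback, and the names `chooseCancelPath` and `CancelPath` are illustrative assumptions rather than the service's documented behaviour.

```typescript
// Hypothetical sketch: the precedence between Option A and Option B is assumed.
interface CancelCallback {
  url: string;
  bearer?: string;
}

interface SessionMetadata {
  cancellationDelegated?: boolean; // policy explicitly delegates cancel to the control plane
  cancelCallback?: CancelCallback; // initiator agent's cancel endpoint
}

type CancelPath =
  | { kind: "runtime-cancel" }               // Option B: call CancelSession on the runtime
  | { kind: "proxy-callback"; url: string }  // Option A: proxy to the initiator agent
  | { kind: "rejected"; reason: string };

function chooseCancelPath(meta: SessionMetadata): CancelPath {
  if (meta.cancellationDelegated === true) {
    return { kind: "runtime-cancel" };
  }
  const url = meta.cancelCallback ? meta.cancelCallback.url : undefined;
  if (url) {
    return { kind: "proxy-callback", url };
  }
  return { kind: "rejected", reason: "no cancelCallback and cancellation not delegated" };
}
```

In the Option A branch, the outgoing HTTP call would presumably be bounded by `CANCEL_CALLBACK_TIMEOUT_MS`.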

### Removed (direct-agent-auth CP-5/6/7)
These endpoints return **410 Gone**. Agents emit envelopes via the SDKs directly:
- ~~`POST /runs/:id/messages`~~
- ~~`POST /runs/:id/signal`~~
- ~~`POST /runs/:id/context`~~

### Runtime discovery

- `GET /runtime/manifest`
- `GET /runtime/modes`
- `GET /runtime/roots`
- `GET /runtime/health`
- `GET /runtime/manifest`, `/runtime/modes`, `/runtime/roots`, `/runtime/health`
- `GET /runtime/policies`, `POST /runtime/policies`, `DELETE /runtime/policies/:id`

### Observability
- `GET /runs/:id/traces`, `/runs/:id/artifacts`, `/runs/:id/metrics`
- `GET /dashboard/overview`, `/dashboard/agents/metrics`
- `GET /healthz`, `/readyz`, `/metrics`, `/docs` (dev only)

- `GET /runs/:id/traces`
- `GET /runs/:id/artifacts`
- `GET /runs/:id/metrics`

### Ops
## Request shape

- `GET /healthz`
- `GET /readyz`
- `GET /docs`

## Database tables
```json
{
"mode": "live",
"runtime": { "kind": "rust" },
"session": {
"sessionId": "optional — UUID v4/v7 or base64url 22+",
"modeName": "macp.mode.decision.v1",
"modeVersion": "1.0.0",
"configurationVersion": "config.default",
"policyVersion": "policy.default",
"ttlMs": 600000,
"participants": [
{ "id": "fraud-agent" },
{ "id": "risk-agent" },
{ "id": "growth-agent" }
],
"metadata": {
"source": "examples-service",
"sourceRef": "fraud/high-value-new-device@1.0.0",
"environment": "production",
"cancelCallback": {
"url": "http://initiator.internal/agent/cancel",
"bearer": "opt-in-shared-secret"
}
}
},
"execution": {
"idempotencyKey": "fraud-high-value-new-device-demo-1",
"tags": ["demo", "fraud"],
"requester": { "actorId": "coordinator", "actorType": "service" }
}
}
```

- `runs`
- `runtime_sessions`
- `run_events_raw`
- `run_events_canonical`
- `run_projections`
- `run_artifacts`
- `run_metrics`
Response: `{ "runId": "<uuid>", "sessionId": "<uuid>", "status": "queued", "traceId": "..." }`

## Local development

@@ -129,91 +97,63 @@ npm run drizzle:migrate
npm run start:dev
```

Make sure the runtime is running and accessible at `RUNTIME_ADDRESS`.

For local development against the current reference runtime profile:
Make sure the runtime is running at `RUNTIME_ADDRESS`. For dev auth against the reference runtime profile:

```bash
export MACP_ALLOW_INSECURE=1
export MACP_ALLOW_DEV_SENDER_HEADER=1
cargo run
```

Then set:
Then:

```bash
RUNTIME_ALLOW_INSECURE=true
RUNTIME_USE_DEV_HEADER=true
RUNTIME_DEV_AGENT_ID=control-plane
```

## Example execution request
## Production runtime auth

Add one entry to the runtime's `MACP_AUTH_TOKENS_JSON` for the control-plane. It is a **read-only observer** and must not have session-start authority:

```json
{
"mode": "live",
"runtime": { "kind": "rust", "version": "v1" },
"session": {
"modeName": "macp.mode.decision.v1",
"modeVersion": "1.0.0",
"configurationVersion": "config.default",
"policyVersion": "policy.default",
"ttlMs": 600000,
"initiatorParticipantId": "coordinator",
"participants": [
{ "id": "fraud-agent", "role": "fraud" },
{ "id": "risk-agent", "role": "risk" },
{ "id": "growth-agent", "role": "growth" },
{ "id": "coordinator", "role": "coordinator" }
],
"context": {
"transactionAmount": 1800,
"deviceTrustScore": 0.14
},
"metadata": {
"source": "scenario-registry",
"sourceRef": "fraud/high-value-new-device@1.0.0",
"intent": "evaluate transaction"
}
},
"kickoff": [
{
"from": "coordinator",
"to": ["fraud-agent", "risk-agent", "growth-agent"],
"kind": "proposal",
"messageType": "Proposal",
"payloadEnvelope": {
"encoding": "proto",
"proto": {
"typeName": "macp.modes.decision.v1.ProposalPayload",
"value": {
"proposal_id": "p1",
"option": "step_up_verification",
"rationale": "new device + elevated amount"
}
}
}
}
],
"execution": {
"idempotencyKey": "fraud-high-value-new-device-demo-1",
"tags": ["demo", "fraud"],
"requester": {
"actorId": "coordinator",
"actorType": "service"
}
}
"token": "obs-control-plane-token",
"sender": "control-plane",
"can_start_sessions": false
}
```

If your deployment makes the control-plane the policy admin (optional), set `can_manage_mode_registry: true`.

Then in the control-plane environment:
```bash
RUNTIME_BEARER_TOKEN=obs-control-plane-token
```

Each agent additionally gets its own entry (with `can_start_sessions: true` for the initiator). Per-agent tokens are **not** shared with the control-plane — the scenario layer distributes them to agents via bootstrap. See `../ui-console/plans/direct-agent-auth.md` for the full onboarding flow.
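A deployment check for these rules might look like the sketch below. It assumes `MACP_AUTH_TOKENS_JSON` parses to a plain array of entries shaped like the example above; the top-level shape, the strict requirement that `can_start_sessions` be explicitly `false`, and the helper name `auditTokens` are assumptions, so adjust it to the runtime's real format.

```typescript
// Hypothetical sketch: assumes MACP_AUTH_TOKENS_JSON is a JSON array of entries.
interface AuthTokenEntry {
  token: string;
  sender: string;
  can_start_sessions?: boolean;
  can_manage_mode_registry?: boolean;
}

// Returns a list of human-readable problems; an empty list means the checks passed.
function auditTokens(raw: string, controlPlaneSender: string = "control-plane"): string[] {
  const entries = JSON.parse(raw) as AuthTokenEntry[];
  const problems: string[] = [];

  const cp = entries.filter((e) => e.sender === controlPlaneSender);
  if (cp.length !== 1) {
    problems.push("expected exactly one entry for " + controlPlaneSender + ", found " + cp.length);
  }
  cp.forEach((e) => {
    // Strict by choice: require an explicit false rather than relying on a default.
    if (e.can_start_sessions !== false) {
      problems.push(controlPlaneSender + " must have can_start_sessions: false");
    }
  });

  // Per-agent tokens must not be shared with the control-plane identity.
  const cpTokens = cp.map((e) => e.token);
  entries
    .filter((e) => e.sender !== controlPlaneSender && cpTokens.indexOf(e.token) !== -1)
    .forEach((e) => problems.push(e.sender + " shares a token with " + controlPlaneSender));

  return problems;
}
```

Running a check like this in CI or at deploy time is one way to catch an observer identity that was accidentally granted session-start authority.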

## Migration from pre-2026-04 control-plane

If you're upgrading from a control-plane version that had `POST /runs/:id/{messages,signal,context}`, those endpoints now return **410 Gone**. Agents must migrate to `macp-sdk-python` or `macp-sdk-typescript` and authenticate directly to the runtime. `RUNTIME_AGENT_TOKENS_JSON` is removed; its entries move to the runtime's `MACP_AUTH_TOKENS_JSON` (one per agent) and to the scenario layer's per-agent bootstrap.

## Database tables

- `runs` (with `runtime_session_id` populated at creation)
- `runtime_sessions`
- `run_events_raw`, `run_events_canonical`
- `run_projections`, `run_artifacts`, `run_metrics`
- `run_outbound_messages`, `audit_log`, `webhooks`, `webhook_deliveries`

## Repo layout

```text
src/
controllers/ # NestJS controllers
runs/ # run manager, executor, stream consumer
runtime/ # runtime provider registry + Rust provider + proto codec
events/ # canonical event normalizer + live SSE hub
runs/ # run manager, observer executor, stream consumer
runtime/ # observer-only runtime provider, proto decoder, credential resolver
events/ # canonical event normalizer + SSE hub
projection/ # UI read models
replay/ # deterministic replay endpoints
metrics/ # metrics aggregation
@@ -222,5 +162,5 @@ src/
db/ # Drizzle schema + database service
telemetry/ # OpenTelemetry bootstrap and manual spans
dto/ # request/response schemas for OpenAPI
contracts/ # TypeScript interfaces
contracts/ # TypeScript interfaces (RunDescriptor, RuntimeProvider, ...)
```