
Commit message #17

Merged
ajit-zer07 merged 1 commit into main from direct-agent-auth
Apr 16, 2026

@ajit-zer07
Contributor

Save as the GitHub PR body:

Summary

  • Observer-only control-plane. Implements direct-agent-auth plan tasks CP-1..CP-13: POST /runs accepts a scenario-agnostic RunDescriptor, returns {runId, sessionId}; the control-plane polls GetSession until the initiator agent opens it, then subscribes to a read-only StreamSession. POST /runs/:id/{messages,signal,context} return 410 Gone. Agents authenticate to the runtime directly via macp-sdk-python / macp-sdk-typescript. RFC-MACP-0004 §4
    conformant.
  • Quality cleanup. Implements plans/quality-cleanup.md (Phases 0–3): dead code purge, hardcoded sleeps → conditional waitFor() polling, instrumentation assertions
    tightened, conventions documented + enforced by grep, gRPC helpers extracted from the runtime provider.
  • Two new invariant tests keep the boundaries from regressing: observer-invariant.spec.ts (no provider.send( / openSession( / chooseInitiator( / retryKickoff(
    in src/) and projection-coverage.spec.ts (every canonical event type has a reducer).
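
The first invariant amounts to a text scan for write-path calls. A minimal sketch of that idea, assuming sources are available as strings: the four patterns come from the description above, while `findForbiddenCalls`, `Violation`, and the in-memory file map are hypothetical and not the contents of observer-invariant.spec.ts.

```typescript
// Hypothetical sketch: flag any write-path call that would break observer mode.
// Only the four patterns are taken from the PR description.
const FORBIDDEN_CALLS = [
  "provider.send(",
  "openSession(",
  "chooseInitiator(",
  "retryKickoff(",
] as const;

interface Violation {
  file: string;
  line: number; // 1-based line number within the file
  pattern: string;
}

// `sources` maps a file path to its text; a real spec would read src/ from disk
// and would need to exclude the spec file itself, which contains the patterns.
function findForbiddenCalls(sources: Record<string, string>): Violation[] {
  const violations: Violation[] = [];
  for (const [file, text] of Object.entries(sources)) {
    text.split("\n").forEach((content, index) => {
      for (const pattern of FORBIDDEN_CALLS) {
        // Case-sensitive match, so pollForOpenSession( does not trip openSession(.
        if (content.includes(pattern)) {
          violations.push({ file, line: index + 1, pattern });
        }
      }
    });
  }
  return violations;
}
```

A spec built on this would assert that the returned list is empty and print the offending file and line otherwise.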

Behavioural changes (call out for reviewers)

| Endpoint | Before | After |
| --- | --- | --- |
| POST /runs | Accepted ExecutionRequest with kickoff[], participants[].role, policyHints, commitments[], initiatorParticipantId | Accepts RunDescriptor only (forbidNonWhitelisted: true); returns sessionId |
| POST /runs/:id/messages | 200 (control-plane forged envelope via provider.send) | 410 Gone (ENDPOINT_REMOVED) |
| POST /runs/:id/signal | 200 (same) | 410 Gone |
| POST /runs/:id/context | 200 (same) | 410 Gone |
| POST /runs/:id/cancel | Always called runtime.CancelSession with control-plane's identity | Option A (default): proxies to metadata.cancelCallback.url. Option B: calls runtime.CancelSession only when metadata.cancellationDelegated === true |
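
The cancel routing in the last row can be read as a small decision function. The sketch below is illustrative only: the field names `cancelCallback.url` and `cancellationDelegated` come from this PR, while `RunMetadata`, `CancelRoute`, and `resolveCancelRoute` are hypothetical. Checking Option A first when both are set, and the `unavailable` outcome when neither is configured, are assumptions, since the description does not cover those cases.

```typescript
// Hypothetical metadata shape; only the two field names come from the PR.
interface RunMetadata {
  cancelCallback?: { url: string };
  cancellationDelegated?: boolean;
}

type CancelRoute =
  | { kind: "callback"; url: string } // Option A: proxy to the initiator's callback
  | { kind: "runtime" }               // Option B: call runtime.CancelSession
  | { kind: "unavailable" };          // assumed outcome when neither is configured

function resolveCancelRoute(metadata: RunMetadata): CancelRoute {
  // Option A is described as the default, so it is checked first (assumption).
  if (metadata.cancelCallback?.url) {
    return { kind: "callback", url: metadata.cancelCallback.url };
  }
  // Option B applies only when delegation is explicitly true.
  if (metadata.cancellationDelegated === true) {
    return { kind: "runtime" };
  }
  return { kind: "unavailable" };
}
```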

Examples-service and the SDKs landed their direct-agent-auth slices before this PR; they already produce RunDescriptor shapes against the new contract — see
ui-console/plans/direct-agent-auth.md.

Schema changes

  • runs.runtime_session_id is now populated at INSERT (was set later via markBindingSession). GET /runs/:id returns the sessionId immediately.
  • runtime_sessions.initiator_participant_id is populated from the runtime's GetSession snapshot (was forged by the deleted chooseInitiator). Still used by recovery.
  • No new migrations.

Removed (deletion notice — coordinate with downstream consumers)

  • RUNTIME_AGENT_TOKENS_JSON env var.
  • KICKOFF_MAX_RETRIES env var.
  • MockRuntimeProvider class (use ScriptedMockRuntimeProvider for tests).
  • test-agents/ Python harness + scripts/run-e2e.sh.
  • DTOs: send-run-message.dto.ts, send-signal.dto.ts, update-context.dto.ts.
  • Integration specs: decision-mode, proposal-mode, task-mode, quorum-mode, handoff-mode, runs-messaging (replaced by consolidated
    observer-mode.integration.spec.ts).
  • ExecutionRequest type alias (use RunDescriptor).
  • encodeSessionContext / encodePayloadEnvelope / encodeMessage on ProtoRegistryService and their types (PayloadEnvelopeInput, ProtoPayload, PayloadEncoding).
  • TestClient.sendMessage / sendSignal / updateContext helpers + ts-agent.ts.

New env vars

| Var | Default | Purpose |
| --- | --- | --- |
| SESSION_POLL_BASE_MS | 100 | Initial GetSession poll interval (observer mode) |
| SESSION_POLL_MAX_MS | 1000 | Capped backoff |
| SESSION_POLL_TIMEOUT_MS | 60000 | Give up waiting for initiator |
| CANCEL_CALLBACK_TIMEOUT_MS | 5000 | UI cancel → initiator agent (Option A) |
| THROTTLE_TTL_MS / THROTTLE_LIMIT | 60000 / 100 | Already used; now go through AppConfigService |
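
To show how the three SESSION_POLL_* values could interact, here is a rough sketch of a capped-backoff poll. The doubling factor is an assumption (the PR only states a base, a cap, and a timeout), and `PollOptions`, `nextPollDelayMs`, and `pollUntilOpen` are hypothetical names rather than the control-plane's actual code.

```typescript
interface PollOptions {
  baseMs: number;    // SESSION_POLL_BASE_MS
  maxMs: number;     // SESSION_POLL_MAX_MS
  timeoutMs: number; // SESSION_POLL_TIMEOUT_MS
}

// Delay before the next attempt: base * 2^attempt, capped at maxMs.
// Doubling is an assumed growth factor.
function nextPollDelayMs(attempt: number, opts: PollOptions): number {
  return Math.min(opts.baseMs * Math.pow(2, attempt), opts.maxMs);
}

// Calls `isOpen` until it resolves true, or gives up once the accumulated
// sleep time would exceed timeoutMs. Time spent inside `isOpen` itself is
// not counted, which a production version would likely want to track.
async function pollUntilOpen(
  isOpen: () => Promise<boolean>,
  opts: PollOptions,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<boolean> {
  let waitedMs = 0;
  for (let attempt = 0; ; attempt++) {
    if (await isOpen()) return true;
    const delay = nextPollDelayMs(attempt, opts);
    if (waitedMs + delay > opts.timeoutMs) return false;
    await sleep(delay);
    waitedMs += delay;
  }
}
```

With the defaults above, the delays under this assumption would run 100, 200, 400, 800, then 1000 ms for every later attempt.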

Test plan

  • npm run lint --max-warnings=0: clean
  • npm run build: clean
  • npm test: 577/577 across 44 suites
  • npm run test:integration: 59/59 across 14 suites
  • Invariant lints (observer-invariant, projection-coverage) green
  • Convention grep rules (CLAUDE.md §Conventions) all clean
  • Manual smoke (reviewer): point a local examples-service at this branch, launch a fraud scenario; verify POST /runs returns {runId, sessionId}, agents emit
    envelopes directly (runtime gRPC logs), control-plane shows zero Send RPCs, projection populates from runtime Proposal / Vote / Commitment events, UI cancel routes
    via the initiator's cancelCallback.
  • Production runbook (reviewer): confirm RUNTIME_BEARER_TOKEN is set with a least-privilege identity (can_start_sessions: false); confirm
    RUNTIME_AGENT_TOKENS_JSON removed from Railway env; confirm dev-mode toggles (MACP_ALLOW_DEV_SENDER_HEADER, RUNTIME_USE_DEV_HEADER) are NOT set in prod.

Migration impact

  • Examples-service: already uses RunDescriptor (compiler shipped 2026-04-15). Reading the response's sessionId and forwarding the initiator's cancelCallback URL
    into session.metadata.cancelCallback are the only follow-ups needed there.
  • UI: POST /runs/:id/{messages,signal,context} callers need to be removed. Display sessionId in run details (already returned by GET /runs/:id).
  • Railway / production: drop RUNTIME_AGENT_TOKENS_JSON. Set RUNTIME_BEARER_TOKEN to one least-privilege observer token. The runtime's MACP_AUTH_TOKENS_JSON entry
    for the control-plane must have can_start_sessions: false.

Net diff

80 files changed, 2,056 insertions, 7,605 deletions — net −5,549 LoC.

  Observer-only control-plane: direct-agent-auth + quality cleanup

  Two-phase change. Phase 1 (direct-agent-auth, plan CP-1..CP-13) makes the
  control-plane a scenario-agnostic observer: agents authenticate to the runtime
  directly and emit their own envelopes; the control-plane never calls Send.
  Phase 2 (quality-cleanup) removes dead code from the refactor, replaces
  hardcoded test sleeps with conditional polling, documents conventions, and
  extracts gRPC marshalling helpers.

  Why: RFC-MACP-0004 §4 requires `envelope.sender` to derive from authenticated
  identity. The previous control-plane forged envelopes and consumed
  scenario-specific fields (`kickoff[]`, `participants[].role`, `policyHints`,
  `commitments[]`, `initiatorParticipantId`) that violated the documented
  "scenario-agnostic observer" boundary. Once that was unwound, the codebase
  had stale tests, dead types, and inconsistent conventions that this commit
  also addresses.

  Behavioural changes (HTTP):
  - POST /runs accepts a scenario-agnostic RunDescriptor only; unknown keys
    rejected (forbidNonWhitelisted: true). Returns {runId, sessionId, status,
    traceId} — sessionId is allocated by the control-plane (UUID v4) or echoed
    back when caller provides a valid one.
  - POST /runs/:id/{messages,signal,context} return 410 Gone with
    errorCode: ENDPOINT_REMOVED. Agents migrate to macp-sdk-python /
    macp-sdk-typescript.
  - POST /runs/:id/cancel: Option A proxies to metadata.cancelCallback.url;
    Option B (metadata.cancellationDelegated=true) calls runtime.CancelSession.
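
  The sessionId rule for POST /runs can be sketched as below. This is an illustration under assumptions: treating "valid" as "well-formed UUID v4" and rejecting an invalid caller-supplied value are both guesses (the message only says a valid one is echoed back), and `resolveSessionId` is a hypothetical name.

```typescript
import { randomUUID } from "crypto";

// Matches the UUID v4 layout: version nibble 4, variant nibble 8, 9, a, or b.
const UUID_V4 =
  /^[0-9a-f]{8}-[0-9a-f]{4}-4[0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}$/i;

// Echo a valid caller-supplied id, otherwise allocate a fresh UUID v4.
function resolveSessionId(requested?: string): string {
  if (requested === undefined) {
    return randomUUID(); // Node's randomUUID generates a version 4 UUID
  }
  if (!UUID_V4.test(requested)) {
    // Assumed behaviour: the real endpoint may reject this differently.
    throw new Error("sessionId must be a UUID v4");
  }
  return requested;
}
```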

  Internal changes:
  - RuntimeProvider narrowed to observer surface: subscribeSession() opens a
    read-only StreamSession and ends the write side immediately.
  - New RunExecutor.execute() flow: initialize → pollForOpenSession (100ms→1s
    backoff, 60s timeout) → bindSession → subscribeSession → stream consumer.
  - RuntimeCredentialResolverService reverted to single-bearer form; per-agent
    tokens (RUNTIME_AGENT_TOKENS_JSON) removed.
  - Extracted src/runtime/grpc-helpers.ts (fromEnvelope, fromAck,
    fromSessionMetadata, buildMetadata, getClientMethod) — 553→459 lines on
    the provider.
  - ProtoRegistryService.decodeKnown now falls back to JSON when proto decode
    throws (mock runtime emits JSON; real runtime emits proto).
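
  The decodeKnown fallback described above follows a try-proto-then-JSON pattern. A minimal sketch, assuming the proto decoder is passed in as a function; `decodeWithJsonFallback` and `ProtoDecoder` are hypothetical names and this is not the ProtoRegistryService implementation.

```typescript
type ProtoDecoder = (bytes: Uint8Array) => unknown;

// Try the proto decoder first; if it throws, treat the bytes as UTF-8 JSON.
// A JSON parse failure is allowed to propagate to the caller.
function decodeWithJsonFallback(
  bytes: Uint8Array,
  decodeProto: ProtoDecoder,
): unknown {
  try {
    return decodeProto(bytes);
  } catch {
    return JSON.parse(new TextDecoder().decode(bytes));
  }
}
```

  The fallback only triggers when the proto decoder throws, so it relies on that decoder failing loudly on JSON input rather than returning a partially populated message.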

  Invariants enforced by tests:
  - src/runtime/observer-invariant.spec.ts — fails CI on any provider.send(,
    openSession(, chooseInitiator(, retryKickoff( in src/.
  - src/projection/projection-coverage.spec.ts — every CANONICAL_EVENT_TYPES
    entry must have a reducer branch in ProjectionService.applyEvents.
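
  The coverage invariant reduces to a set difference between canonical event types and handled ones. The sketch below assumes reducers are kept in a lookup table, which may differ from the branch structure inside ProjectionService.applyEvents; `missingReducers`, `Reducer`, and the event type strings in the usage example are hypothetical.

```typescript
type Reducer<S> = (state: S, payload: unknown) => S;

// Returns every canonical event type that has no reducer registered.
function missingReducers<S>(
  canonicalEventTypes: readonly string[],
  reducers: Readonly<Record<string, Reducer<S>>>,
): string[] {
  return canonicalEventTypes.filter(
    (eventType) => !Object.prototype.hasOwnProperty.call(reducers, eventType),
  );
}

// Usage with made-up event type names: "vote.cast" has no reducer here.
const uncovered = missingReducers<number>(["proposal.created", "vote.cast"], {
  "proposal.created": (count) => count + 1,
});
```

  A spec in this style would assert that the result is empty, so adding a new canonical event type without a reducer fails CI.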

  Quality cleanup:
  - Deleted dead code: MockRuntimeProvider, encodeSessionContext /
    encodePayloadEnvelope / encodeMessage + their types, ExecutionRequest type
    alias (migrated 6 imports → RunDescriptor), test-agents/ Python harness
    (drove removed HTTP endpoints), 5 mode-specific integration specs +
    runs-messaging spec, 3 stale write-path DTOs, ts-agent.ts test helper.
  - Replaced 17 hardcoded sleep() calls with new test/helpers/wait-for.ts
    conditional polling (~80s of fixed waits → ~5s of polls).
  - Tightened 15 instrumentation assertions to verify metric type + labels
    + round-trip observation.
  - Documented 5 conventions in CLAUDE.md §Conventions with grep rules
    (errors, logger, ValidationPipe, env vars, metrics).
  - Migrated THROTTLE_TTL_MS / THROTTLE_LIMIT to AppConfigService via
    ThrottlerModule.forRootAsync.
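
  For readers unfamiliar with the sleep-to-polling swap, a conditional wait helper might look like the sketch below. The signature, defaults, and error text are assumptions for illustration; the actual test/helpers/wait-for.ts may differ.

```typescript
interface WaitForOptions {
  timeoutMs?: number;  // give up after this long (assumed default)
  intervalMs?: number; // pause between probes (assumed default)
}

// Resolves with the first truthy value the probe returns, or rejects on timeout.
async function waitFor<T>(
  probe: () => T | Promise<T>,
  { timeoutMs = 5000, intervalMs = 25 }: WaitForOptions = {},
): Promise<T> {
  const deadline = Date.now() + timeoutMs;
  for (;;) {
    const value = await probe();
    if (value) return value;
    if (Date.now() >= deadline) {
      throw new Error(`waitFor timed out after ${timeoutMs}ms`);
    }
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

  A test then waits only as long as the condition takes to become true, instead of always paying the full fixed sleep.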

  Schema:
  - runs.runtime_session_id is now populated at INSERT (so GET /runs/:id
    returns the sessionId immediately after POST /runs).
  - runtime_sessions.initiator_participant_id is populated from the runtime's
    GetSession snapshot (was forged by chooseInitiator). Used by run
    recovery as the subscriberId fallback.

  Docs:
  - README.md, docs/API.md, docs/ARCHITECTURE.md, docs/INTEGRATION.md, CLAUDE.md
    rewritten for observer model. Fixed test counts (407→577) and endpoint
    table (write paths now 410 Gone).
@ajit-zer07 ajit-zer07 merged commit fad005c into main Apr 16, 2026
7 checks passed
@ajit-zer07 ajit-zer07 deleted the direct-agent-auth branch April 16, 2026 01:39