feat: push-based kill switch for instant badge revocation #78

@beonde

Problem

The MCP guard currently has two extremes for revocation:

  1. Online verification (previous behavior): three HTTPS round trips per call_tool (JWKS fetch, badge status, agent status). This adds ~90ms per call, defeats the sub-millisecond design goal, and makes every tool invocation depend on the registry.

  2. Local-only verification (current fix in guard.go): pure local crypto (parse the JWS, verify the Ed25519 signature, check exp), under 5ms. But it has no revocation awareness: a revoked badge stays valid until its TTL expires (up to 5 minutes).

Neither is acceptable for the "instant kill switch" scenario: a compromised agent is actively exfiltrating data and needs to be stopped NOW.

Design Direction

Two-tier revocation model

Tier 1 — Routine revocation (no infrastructure needed):

  • Admin suspends agent in dashboard → server marks agent as suspended
  • BadgeKeeper's next refresh attempt fails → no new badge issued
  • Existing badge expires within TTL (5 min) → agent locked out
  • Zero additional infrastructure, zero per-request latency
  • Sufficient for 95%+ of cases (offboarding, policy violation, routine access changes)

Tier 2 — Kill switch (push-based, sub-second):

  • Admin triggers kill switch → registry publishes revocation event
  • Push channel delivers event to connected sidecars
  • Sidecar updates local RevocationCache (in-memory map[string]bool)
  • Guard's next local check catches the revoked JTI — still sub-millisecond
  • Revocation propagation: < 1 second
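
A minimal sketch of the in-memory cache the sidecar would update, assuming a simple revoke/lookup pair. `MemoryRevocationCache` and its method names are hypothetical; the actual `RevocationCache` interface in pkg/badge/verifier.go is not shown in this issue and may differ.

```go
package main

import (
	"fmt"
	"sync"
)

// MemoryRevocationCache is a hypothetical in-memory set of revoked badge JTIs.
// It is NOT the RevocationCache interface from pkg/badge/verifier.go, whose
// method set is not shown in this issue.
type MemoryRevocationCache struct {
	mu      sync.RWMutex
	revoked map[string]bool
}

// NewMemoryRevocationCache returns an empty cache.
func NewMemoryRevocationCache() *MemoryRevocationCache {
	return &MemoryRevocationCache{revoked: make(map[string]bool)}
}

// Revoke marks a single JTI as revoked (the push-delivered kill-switch path).
func (c *MemoryRevocationCache) Revoke(jti string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.revoked[jti] = true
}

// IsRevoked is the hot-path lookup: one read lock plus one map access.
func (c *MemoryRevocationCache) IsRevoked(jti string) bool {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.revoked[jti]
}

func main() {
	cache := NewMemoryRevocationCache()
	fmt.Println(cache.IsRevoked("badge-123")) // false
	cache.Revoke("badge-123")
	fmt.Println(cache.IsRevoked("badge-123")) // true
}
```

The read/write lock lets the background push goroutine write while guard checks read concurrently, which keeps the lookup off the network entirely.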

Push channel options

| Option | Pros | Cons |
|---|---|---|
| OPA bundle piggyback | Already exists, conditional fetching (304), zero new infrastructure | Revocation latency = bundle sync interval (10-30s) |
| SSE from registry | Simple, HTTP-based, true real-time | Server manages persistent connections |
| gRPC streaming | Bidirectional, type-safe | New transport path from sidecar to registry |
| External pub/sub (Redis/NATS) | Decoupled, scalable fan-out | New infrastructure dependency |

Recommended approach

Start with OPA bundle piggyback — include revoked JTIs as a data document in the policy bundle. Zero new infrastructure, bounded revocation latency (configurable sync interval). The RevocationCache interface already exists in pkg/badge/verifier.go.
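
One way the bundle piggyback could populate the cache on each sync, as a rough sketch. The `revoked_jtis` field name and the `SnapshotCache` type are assumptions for illustration; the real data document layout is not defined in this issue.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sync"
)

// revocationDoc is an ASSUMED shape for the bundle data document.
type revocationDoc struct {
	RevokedJTIs []string `json:"revoked_jtis"`
}

// SnapshotCache (hypothetical) holds a set that is replaced wholesale on each
// bundle sync, so entries the server has dropped disappear locally too.
type SnapshotCache struct {
	mu  sync.RWMutex
	set map[string]struct{}
}

// LoadFromBundle decodes the data document and swaps in the new set.
// On a decode error the previous set is left untouched.
func (c *SnapshotCache) LoadFromBundle(raw []byte) error {
	var doc revocationDoc
	if err := json.Unmarshal(raw, &doc); err != nil {
		return fmt.Errorf("decode revocation document: %w", err)
	}
	next := make(map[string]struct{}, len(doc.RevokedJTIs))
	for _, jti := range doc.RevokedJTIs {
		next[jti] = struct{}{}
	}
	c.mu.Lock()
	c.set = next
	c.mu.Unlock()
	return nil
}

// IsRevoked reports whether the JTI was in the most recent bundle.
func (c *SnapshotCache) IsRevoked(jti string) bool {
	c.mu.RLock()
	defer c.mu.RUnlock()
	_, ok := c.set[jti]
	return ok
}

func main() {
	var cache SnapshotCache
	err := cache.LoadFromBundle([]byte(`{"revoked_jtis":["badge-123"]}`))
	fmt.Println(err, cache.IsRevoked("badge-123"), cache.IsRevoked("badge-999"))
	// <nil> true false
}
```

Replacing the whole set on each sync, rather than merging, keeps the cache bounded if the server only lists JTIs whose badges have not yet expired.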

If sub-second revocation is required, upgrade to SSE push from the registry alongside the bundle approach (SSE for real-time, bundle as fallback/reconciliation).

Implementation scope

capiscio-core

  • Concrete RevocationCache implementation backed by in-memory set
  • Wire RevocationCache into guard's VerifyOptions (replace current SkipRevocationCheck: true)
  • Populate cache from OPA bundle data document on each sync
  • (Future) SSE subscriber goroutine as alternative cache source

capiscio-server

  • Include revoked JTIs in OPA bundle response (data document)
  • Bump bundle ETag when badge is revoked (triggers sidecar re-fetch)
  • (Future) SSE endpoint GET /v1/events/revocations
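
The ETag bump can be illustrated with a small sketch: if the tag is derived from the revoked set, any new revocation changes it and the sidecar's conditional request stops getting 304. The `/bundle` path, `bundleETag`, and `bundleHandler` are hypothetical, and a real ETag would also need to cover the policy content.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"net/http"
	"net/http/httptest"
	"sort"
	"strings"
)

// bundleETag derives an ETag from the sorted revoked JTIs, so any change to
// the set yields a new tag. Hypothetical: a real tag would also hash policies.
func bundleETag(revoked []string) string {
	sorted := append([]string(nil), revoked...)
	sort.Strings(sorted)
	sum := sha256.Sum256([]byte(strings.Join(sorted, "\n")))
	return `"` + hex.EncodeToString(sum[:8]) + `"`
}

// bundleHandler answers 304 when the sidecar's If-None-Match still matches.
func bundleHandler(revoked []string) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		etag := bundleETag(revoked)
		w.Header().Set("ETag", etag)
		if r.Header.Get("If-None-Match") == etag {
			w.WriteHeader(http.StatusNotModified)
			return
		}
		fmt.Fprint(w, "bundle bytes would go here")
	}
}

// fetchStatus issues an in-process conditional GET and returns the status.
func fetchStatus(h http.Handler, ifNoneMatch string) int {
	req := httptest.NewRequest(http.MethodGet, "/bundle", nil)
	if ifNoneMatch != "" {
		req.Header.Set("If-None-Match", ifNoneMatch)
	}
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	before := []string{"badge-123"}
	after := []string{"badge-123", "badge-456"}
	cached := bundleETag(before)
	fmt.Println(fetchStatus(bundleHandler(before), cached)) // 304
	fmt.Println(fetchStatus(bundleHandler(after), cached))  // 200
}
```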

Guard verification flow (target state)

call_tool → parse JWS (local) → verify Ed25519 (local) → check exp (local)
          → check JTI against RevocationCache (local, in-memory)
          → evaluate policy (local, OPA)
          → ALLOW/DENY (sub-millisecond total)

All network activity happens in background goroutines (bundle sync, SSE subscription), never on the hot path.

Context

  • Guard fix PR: local-only verification (skips online revocation/agent status checks)
  • RevocationCache interface: pkg/badge/verifier.go lines 96-101
  • OPA bundle sync: already runs on configurable interval in embedded sidecar
  • Demo Scenario 5: reframed to use agent suspension (Tier 1) instead of JTI revocation


Labels: enhancement
