High availability — what happens when the broker restarts #37

@devonartis

Current state: A single broker instance backed by SQLite on disk. The signing key is persistent, so tokens already issued survive a restart. The audit trail and revocation lists are persisted to SQLite and reloaded on startup. What's lost on restart: challenge nonces (30s TTL) and in-memory agent records.
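To make the "what's lost" part concrete, here is a minimal sketch of an in-memory nonce store with a 30s TTL. The class and method names (`InMemoryNonceStore`, `issue`, `consume`) are hypothetical and not taken from the broker's code; the only detail from this issue is the 30s TTL. Because the state lives in a process-local dict, a restart behaves like constructing a fresh store.

```python
import secrets
import time
from typing import Callable, Dict


class InMemoryNonceStore:
    """Hypothetical single-use challenge nonce store held only in process memory."""

    def __init__(self, ttl_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._expires_at: Dict[str, float] = {}

    def issue(self) -> str:
        """Create a random nonce and remember when it expires."""
        nonce = secrets.token_urlsafe(32)
        self._expires_at[nonce] = self._clock() + self._ttl
        return nonce

    def consume(self, nonce: str) -> bool:
        """Return True at most once per nonce, and only before it expires."""
        expires_at = self._expires_at.pop(nonce, None)
        return expires_at is not None and self._clock() < expires_at
```

Under these assumptions, an agent that received a challenge just before a restart would fail `consume` afterwards and would simply request a new challenge; at most 30 seconds of in-flight registrations are affected.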

The question from the community: "Broker goes down, every agent loses its credential source. What is the HA plan?"

The real answer today: Agents with valid tokens keep working during a restart, because the tokens are self-contained JWTs verified against the persistent signing key. What they can't do until the broker is back is register NEW agents.
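A rough sketch of why a valid token keeps working while the broker is down: verification needs only the token and the key, not a round trip to the broker. This example assumes HS256 purely so it runs on the Python standard library; the issue does not say which algorithm the broker uses, and with an asymmetric scheme the verifier would hold only the public key. The function names are hypothetical, and revocation checks are deliberately left out.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def _b64url_encode(raw: bytes) -> str:
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    # JWTs strip base64 padding; restore it before decoding.
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def sign_token(claims: dict, key: bytes) -> str:
    """Issue a compact HS256 JWT (stand-in for what the broker does)."""
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = ".".join(
        _b64url_encode(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, claims)
    )
    signature = hmac.new(key, signing_input.encode("ascii"), hashlib.sha256).digest()
    return signing_input + "." + _b64url_encode(signature)


def verify_token(token: str, key: bytes, now: Optional[float] = None) -> dict:
    """Verify signature and expiry locally; no call to the broker is made."""
    header_b64, payload_b64, signature_b64 = token.split(".")
    header = json.loads(_b64url_decode(header_b64))
    if header.get("alg") != "HS256":
        raise ValueError("unexpected algorithm")
    expected = hmac.new(
        key, f"{header_b64}.{payload_b64}".encode("ascii"), hashlib.sha256
    ).digest()
    if not hmac.compare_digest(expected, _b64url_decode(signature_b64)):
        raise ValueError("bad signature")
    claims = json.loads(_b64url_decode(payload_b64))
    current = time.time() if now is None else now
    if "exp" in claims and current >= claims["exp"]:
        raise ValueError("token expired")
    return claims
```

The practical limit is token lifetime: once `exp` passes, an agent needs the broker again, so the outage a deployment can tolerate is bounded by how long its tokens remain valid.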

What's needed:

  • Document the restart story clearly (what survives, what doesn't)
  • Investigate PostgreSQL backend as alternative to SQLite (enables multi-instance)
  • Investigate Redis for transient state (nonces, agent records) to enable shared state across instances
  • Health check integration with orchestrators (Kubernetes readiness/liveness probes already work via /v1/health)
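On the agent side, the health check item could be paired with a small wait-and-retry loop before registering. The sketch below is an assumption-heavy illustration: only the `/v1/health` path comes from this issue, treating HTTP 200 as "healthy" is a guess about the endpoint's contract, and `probe_health` and `wait_until_healthy` are hypothetical helper names.

```python
import http.client
import time
import urllib.error
import urllib.request
from typing import Callable


def probe_health(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if GET {base_url}/v1/health answers with HTTP 200 (assumed contract)."""
    try:
        with urllib.request.urlopen(f"{base_url}/v1/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, or non-2xx: treat the broker as not ready.
        return False


def wait_until_healthy(probe: Callable[[], bool], attempts: int = 6,
                       base_delay: float = 0.5, max_delay: float = 8.0,
                       sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call probe up to `attempts` times, doubling the delay between tries."""
    delay = base_delay
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay)
            delay = min(delay * 2, max_delay)
    return False
```

A caller might use `wait_until_healthy(lambda: probe_health("http://localhost:8080"))` before attempting registration; the URL here is a placeholder, not the broker's documented address.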

Who needs this: Any small company running agents in production where broker downtime means agents can't authenticate.
