[arch] Design discussion: shared pluggable service for cross-agent state (feedback, sessions, traces) #112

@rdwj

Background

The v0.12.0 enterprise feature track added several stateful surfaces to BaseAgent's server layer:

  • POST/GET /v1/sessions — conversation persistence
  • GET /v1/traces — trace inspection
  • POST/GET/PATCH /v1/feedback — user feedback collection
  • GET /metrics — Prometheus

All four follow the same shape: BaseAgent owns a pluggable store (Null / SQLite / Postgres), the server layer exposes REST endpoints, and the gateway-template proxies through. This was the right call when most deployments had one or two agents.
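For readers who haven't looked at the store layer, here is a minimal sketch of what "pluggable store" means in practice. The `SessionStore` name comes from this issue, but the method names (`save`, `load`), the table layout, and both concrete classes are assumptions made for illustration, not the actual BaseAgent interface.

```python
import sqlite3
from abc import ABC, abstractmethod
from typing import Optional


class SessionStore(ABC):
    """Hypothetical store interface; the real ABC may look different."""

    @abstractmethod
    def save(self, session_id: str, payload: str) -> None: ...

    @abstractmethod
    def load(self, session_id: str) -> Optional[str]: ...


class NullSessionStore(SessionStore):
    """Discards writes; models the 'Null' backend."""

    def save(self, session_id: str, payload: str) -> None:
        pass

    def load(self, session_id: str) -> Optional[str]:
        return None


class SQLiteSessionStore(SessionStore):
    """Persists sessions in SQLite; the table layout is an assumption."""

    def __init__(self, path: str = ":memory:") -> None:
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS sessions "
            "(id TEXT PRIMARY KEY, payload TEXT NOT NULL)"
        )

    def save(self, session_id: str, payload: str) -> None:
        # The connection context manager commits on success.
        with self._conn:
            self._conn.execute(
                "INSERT OR REPLACE INTO sessions (id, payload) VALUES (?, ?)",
                (session_id, payload),
            )

    def load(self, session_id: str) -> Optional[str]:
        row = self._conn.execute(
            "SELECT payload FROM sessions WHERE id = ?", (session_id,)
        ).fetchone()
        return row[0] if row else None
```

In this sketch every agent process that picks the SQLite or Postgres backend opens its own connection and creates its own tables, which is the per-agent ownership discussed next.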

In multi-agent deployments (e.g. 10 agents fronted by a single gateway and UI), the per-agent ownership starts to chafe:

  • Duplication — 10 Postgres pools, 10 schema migrations, 10 housekeeping loops, all writing the same tables
  • Fan-out for cross-agent queries — "show me all thumbs-down feedback this week" requires the dashboard to hit 10 endpoints and merge client-side, OR query the shared Postgres directly out-of-band (the schema becomes a de-facto API)
  • Schema becomes a contract — once N agents write the same table, schema changes need coordinated rollouts
  • No auth boundary between agents sharing storage
  • Sessions are conceptually cross-agent: a user talks to "the system," not to agent #4. Today there's no clean way to follow a conversation that gets routed to different agents
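To make the fan-out cost concrete, here is a rough sketch of the client-side merge a dashboard has to do today for the "all thumbs-down feedback this week" query. The `Feedback` fields and the `Fetcher` callables are hypothetical; each fetcher stands in for one HTTP GET against a single agent's `/v1/feedback` endpoint, whose real response shape is not specified in this issue.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass(frozen=True)
class Feedback:
    # Hypothetical record shape, for illustration only.
    agent: str
    rating: str        # "up" or "down"
    created_at: float  # Unix timestamp


# One fetcher per agent; stands in for GET <agent>/v1/feedback.
Fetcher = Callable[[], Iterable[Feedback]]


def thumbs_down_since(fetchers: Dict[str, Fetcher], since: float) -> List[Feedback]:
    """Query every agent, filter, and merge on the client, newest first."""
    merged: List[Feedback] = []
    for fetch in fetchers.values():
        for item in fetch():
            if item.rating == "down" and item.created_at >= since:
                merged.append(item)
    return sorted(merged, key=lambda f: f.created_at, reverse=True)
```

With 10 agents this is 10 round trips per dashboard query, and the filtering and ordering logic lives in the client rather than behind one endpoint.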

What to design

Open question: what does the shape of a shared 'agent platform' service look like, and which surfaces move there?

Initial options to discuss (not a decision, a starting point):

  1. Status quo + documentation. Document that multi-agent deployments should point all agents at the same Postgres and treat the shared schema as a stable join point. Cheapest, but the rough edges remain.

  2. Full extraction. A new FastAPI service (working name: `fipsagents-platform`) owns sessions + traces + feedback. BaseAgent becomes a thin client. Gateway routes `/v1/sessions`, `/v1/traces`, `/v1/feedback` to the platform service rather than fanning out to per-agent endpoints. One Postgres pool, one REST surface, one dashboard backend.

  3. Partial extraction. Move feedback + sessions (genuinely cross-agent) but leave traces in BaseAgent shipping to an OTel collector (industry-standard answer, already partially done via `OTELTraceStore`). Fewer moving parts, addresses the highest-value duplication.

  4. Something else. Maybe BaseAgent keeps everything but grows a 'remote store' adapter for each — same code, configurable backend (in-process vs HTTP). Lets a deployer choose per-feature without forcing a topology.
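As a discussion aid for option 4, here is a minimal sketch of a "remote store" adapter that sits behind the same interface as an in-process store. Everything here is an assumption: the method names (`add`, `list_all`), the config keys (`backend`, `url`), and the idea that `GET /v1/feedback` returns a JSON array. Only the `FeedbackStore` name and the `/v1/feedback` path come from this issue.

```python
import json
import urllib.request
from abc import ABC, abstractmethod
from typing import Callable, Dict, List


class FeedbackStore(ABC):
    """Hypothetical interface; the real ABC may differ."""

    @abstractmethod
    def add(self, record: Dict) -> None: ...

    @abstractmethod
    def list_all(self) -> List[Dict]: ...


class InProcessFeedbackStore(FeedbackStore):
    """Keeps records in the agent's own process (today's topology)."""

    def __init__(self) -> None:
        self._rows: List[Dict] = []

    def add(self, record: Dict) -> None:
        self._rows.append(dict(record))

    def list_all(self) -> List[Dict]:
        return list(self._rows)


class RemoteFeedbackStore(FeedbackStore):
    """Forwards calls over HTTP to a shared service (assumed JSON API)."""

    def __init__(self, base_url: str, opener: Callable = urllib.request.urlopen) -> None:
        self._base_url = base_url.rstrip("/")
        self._opener = opener  # injectable so the adapter can be tested offline

    def add(self, record: Dict) -> None:
        req = urllib.request.Request(
            f"{self._base_url}/v1/feedback",
            data=json.dumps(record).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with self._opener(req) as resp:
            resp.read()

    def list_all(self) -> List[Dict]:
        with self._opener(f"{self._base_url}/v1/feedback") as resp:
            return json.loads(resp.read())


def make_feedback_store(config: Dict) -> FeedbackStore:
    """Pick a backend per feature from deployer config (keys are assumptions)."""
    if config.get("backend") == "remote":
        return RemoteFeedbackStore(config["url"])
    return InProcessFeedbackStore()
```

The agent code only ever sees `FeedbackStore`, so a deployer could in principle flip one feature to the shared service without touching the others. Whether a synchronous HTTP hop is acceptable on the write path is one of the things this discussion would need to settle.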

Things to think about during the discussion

  • Migration story. The longer we wait, the more deployments depend on the per-agent endpoints. Cheap to do now while there's effectively one production user; a user-visible migration if we do it later
  • Memory is intentionally NOT in this list — `self.memory` is per-agent by design, and MemoryHub already provides the centralized option
  • Metrics is also separate — Prometheus scrape targets are inherently per-pod, that's fine
  • Auth — if multiple agents share a backend, who's allowed to write what? Today there's no model for this
  • Deployment friction — every service we extract is another Helm chart, another readiness probe, another thing for ops to think about. Worth it iff the cross-agent benefits land
  • Pluggability shape — same `FeedbackStore`/`SessionStore`/`TraceStore` ABCs we have today, just running in a different process? Or a different abstraction entirely?
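On the auth question, one possible starting point (purely a strawman, since there is no model today) is a per-agent write grant checked by whatever owns the shared backend. The `WritePolicy` class, the surface names, and the "an agent may only write records it owns" rule are all assumptions invented for this sketch.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet


@dataclass
class WritePolicy:
    """Strawman policy: maps an agent name to the surfaces it may write."""

    grants: Dict[str, FrozenSet[str]] = field(default_factory=dict)

    def can_write(self, agent: str, surface: str, record_owner: str) -> bool:
        # Unknown agents get no grants at all.
        allowed = self.grants.get(agent, frozenset())
        # Assumed rule: an agent writes only to granted surfaces,
        # and only to records it owns.
        return surface in allowed and agent == record_owner
```

A usage example under the same assumptions:

```python
policy = WritePolicy({
    "agent-a": frozenset({"feedback", "sessions"}),
    "agent-b": frozenset({"feedback"}),
})
policy.can_write("agent-a", "sessions", "agent-a")  # True
policy.can_write("agent-b", "sessions", "agent-b")  # False, no grant
policy.can_write("agent-a", "feedback", "agent-b")  # False, not the owner
```

The ownership rule is the part most likely to be wrong for sessions, since a conversation routed across agents would need several writers for the same record.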

Out of scope for this issue

This is a design discussion issue, not an implementation. The goal is to come out with a written architecture decision (in `docs/architecture.md` or similar) that we can point at when implementing.

Captured during the v0.12.0 feedback feature track. Conversation context: the per-agent ownership felt fine for one or two agents but the smell got louder once we considered the 10-agent case.

Labels: enhancement (New feature or request)