Architecture Overview

Context

AgentFlow is an event-native metrics layer: business metrics are generated by operational events and stay live — the serving cache is invalidated when events arrive, with a measured 1.06 s p50 / 1.99 s p95 event-to-metric delay on production defaults (freshness benchmark). Each metric declares which events move it through versioned contracts, so the event→metric graph is a tested artifact rather than tribal knowledge.

Consumers are anything that needs the current number at the moment of decision: humans, dashboards, downstream services, and AI agents. Agents stress the same properties hardest — seconds-level freshness, semantic context, machine-readable contracts — but they are one consumer of the boundary, not its definition.

System Context (C4 Level 1)

┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│  Web Store   │     │   Payment    │     │  Inventory   │
│  (events)    │     │   Gateway    │     │   Service    │
└──────┬───────┘     └──────┬───────┘     └──────┬───────┘
       │                    │                    │
       ▼                    ▼                    ▼
┌─────────────────────────────────────────────────────────┐
│                    AgentFlow Platform                    │
│                                                         │
│  Kafka → Flink → Iceberg → Semantic Layer → Agent API   │
└─────────────────────────────┬───────────────���───────────┘
                              │
                    ┌─────────▼──────────┐
                    │     Consumers      │
                    │ (humans, dashboards│
                    │  services, agents) │
                    └────────────────────┘

Key Design Principles

Streaming-first: Batch is a special case of streaming (bounded stream). One codebase, one semantics.
Quality gates before storage: Bad data never reaches the serving layer. Agents never see it.
Semantic over raw: Agents query entities and metrics, not tables and columns.
Cost-aware: Every component has autoscaling and lifecycle policies. We measure $/GB processed.
Observable: If you can't measure it, you can't operate it. Every stage emits latency, throughput, and error metrics.

Data Flow

Production: Kafka → Flink → Iceberg (p50 latency: ~220ms)

Ingestion: Events arrive via Kafka producers (orders, payments, clicks) or Debezium CDC connectors running on Kafka Connect
Processing: Flink validates (schema + semantic), enriches, deduplicates, and routes events
Storage: Valid events land in Iceberg tables; production uses AWS Glue as the catalog over object storage
Quality: Pre-storage gates check schema + semantic rules. Failures → dead letter topic
Serving: Agent API reads from the configured serving backend; the checked-in paths are DuckDB by default and optional ClickHouse for higher read concurrency, while Iceberg remains the lakehouse storage layer

For CDC sources, Debezium/Kafka Connect handles source capture while a shared normalizer converts Postgres/MySQL envelopes into one canonical AgentFlow CDC contract before validation. See ADR 0005.

Local: Generate → Validate → Enrich → DuckDB + Iceberg

Same pipeline logic, no infrastructure dependencies:

Generate: local_pipeline.py creates realistic e-commerce events
Validate: Schema validation (Pydantic) + semantic validation (business rules)
Enrich: Domain enrichment per event type (order sizing, click classification, payment risk)
Store: Validated events written to DuckDB for serving and to Iceberg via PyIceberg
Serve: Agent API reads from DuckDB while /v1/health reports Iceberg row counts
Catalog: Development uses a MinIO-backed REST catalog from docker-compose.iceberg.yml (writing to the same agentflow-lake S3 object store as the Flink stack); production uses AWS Glue

Both paths use the same validator and enrichment code (src/quality/, src/processing/transformations/).

Batch Path (daily)

Orchestration: Dagster triggers compaction, aggregation, and quality reports
User profiles: Materialized from orders_v2 → users_enriched
Quality report: Row counts, null rates, dead letter ratio
Compaction: Iceberg snapshot expiry + data file compaction (production only)

Serving & Control Plane

The serving layer has grown beyond the original read-only surface. The current API groups into four slices:

Core agent reads: /v1/entity, /v1/metrics, /v1/query, /v1/catalog, /v1/health
Discovery and audit: /v1/search, /v1/contracts, /v1/lineage, /v1/changelog
Operational workflows: /v1/batch, /v1/stream/events, /v1/deadletter, /v1/webhooks, /v1/alerts, /v1/slo
SDK contract: Python sync/async clients and a TypeScript client wrap the same HTTP surface

The API process also starts several background components:

DuckDBPool: shared read cursors plus serialized writes for the local serving path
QueryCache: Redis-backed metric cache with invalidation on new events
WebhookDispatcher: polls validated pipeline events and delivers signed webhook callbacks
AlertDispatcher: evaluates metric thresholds and records alert history
OutboxProcessor: retries replay delivery from DuckDB outbox rows to Kafka

These components keep the local demo and the production architecture aligned around one agent-facing contract, even when the backing infrastructure differs.

Technology Choices

See Architecture Decision Records for detailed trade-off analysis.

Component	Choice	Runner-up	Key differentiator
Streaming	Kafka 3.7 (KRaft)	Pulsar	Ecosystem maturity, MSK managed service
CDC capture	Debezium + Kafka Connect	Python-native connectors	Mature Postgres/MySQL CDC, built-in offsets/schema history, one ops model
Processing	Flink 2.2	Spark Structured Streaming	True event-time, lower latency, native watermarks
Storage	Iceberg 1.5	Delta Lake	Vendor-neutral, hidden partitioning, time-travel
Local query	DuckDB	SQLite	Columnar, fast analytics, Iceberg support
Orchestration	Dagster	Airflow	Software-defined assets, better testing, type safety
API	FastAPI	Flask	Async, auto-docs, Pydantic integration
IaC	Terraform	Pulumi	Team familiarity, HCL readability, module ecosystem

Failure Modes & Mitigations

Failure	Impact	Mitigation
Kafka broker down	Reduced throughput	3-broker cluster, replication factor 3, min.insync.replicas=2
Flink job crash	Processing stops	Exactly-once checkpointing (30s), auto-restart on failure
Bad data in source	Incorrect agent answers	Pre-storage quality gates, dead letter topic, alerting
S3 outage	No new data in serving	Flink checkpoints to S3 pause; resumes on recovery
API overload	Agent queries fail	Per-key rate limiting, DuckDB connection pooling, horizontal scaling
Redis unavailable	Cache or rate-limit degradation	Metric cache misses fall back to source queries; rate limiter fail-opens instead of blocking the API
Webhook target failure	Lost downstream notification	Signed delivery logs, retries with backoff, alert/webhook history in DuckDB

Security

Implemented

API authentication: API key via X-API-Key header (set AGENTFLOW_API_KEYS env var)
Rate limiting: Per-key sliding window with Redis backing when available and in-memory fallback for local/test, configurable via AGENTFLOW_RATE_LIMIT_RPM (default: 120/min)
Health/docs exempt: /v1/health, /docs, /metrics don't require auth
No secrets in code: All credentials via environment variables
Terraform state: Encrypted S3 backend with DynamoDB locking

Production (via infrastructure)

Kafka: TLS in-transit, SASL authentication (MSK config)
S3: SSE-KMS encryption, bucket policy restricts to VPC endpoints
Network: Private subnets, security groups per component

Observability & Operations

Metrics: Prometheus scrapes /metrics; Grafana dashboards cover pipeline health plus support, ops, and merch journeys.
Tracing: OpenTelemetry spans export to Jaeger through OTEL_EXPORTER_OTLP_ENDPOINT; the production-like compose stack exposes Jaeger on :16686.
Logs: Structlog emits JSON logs with trace_id, span_id, correlation_id, and tenant context so incidents can be traced across API, cache, and background loops.
Operational APIs: /v1/alerts, /v1/webhooks, /v1/deadletter, /v1/slo, and /v1/stream/events are part of the control plane, not side tooling.

Deployment Topologies

Environment	Primary components	Purpose
Local demo	`src.processing.local_pipeline` + DuckDB + FastAPI	Fastest path for developers and SDK examples
Prod-like Docker	`docker-compose.prod.yml` with Kafka, Redis, Jaeger, Prometheus, Grafana, API, and optional ClickHouse	Observability and production-shaped debugging against a realistic local stack
Lite E2E Docker	`docker-compose.e2e.yml` with the narrowed CI service set	Faster E2E and smoke coverage without the full observability stack
Chaos harness	`docker-compose.chaos.yml` + Toxiproxy + pytest chaos suite	Validate graceful degradation under Kafka/Redis failures
kind staging	`helm/agentflow`, `k8s/`, `scripts/k8s_staging_up.sh`	Production-shaped staging on a local Kubernetes cluster
Production	Managed Kafka/Flink/Iceberg/object storage + Helm/Terraform	Durable, autoscaled multi-service deployment

v1-v6 Capability Map

Capability	Implementation	Architectural impact
Durable replay and outbox	`OutboxProcessor`, dead-letter replay, DuckDB outbox rows	Background delivery is retried without coupling API latency to Kafka availability
Redis-backed rate limiting	`AuthManager` + `RateLimiter` with fail-open fallback	Per-key throttling stays centralized in prod, while local/test can continue when Redis is absent
Typed SDK surface	`sdk/` and `sdk-ts/`	Python and TypeScript agents consume one HTTP contract instead of bespoke adapters
Distributed observability	OTel tracing, structlog correlation, `/metrics`, Grafana, Jaeger	API, background jobs, and streaming workflows share the same debugging context
Chaos engineering	`tests/chaos/`, `docker-compose.chaos.yml`, `config/toxiproxy.json`	Failure handling is exercised continuously rather than assumed from code review
Kubernetes staging	`helm/agentflow`, `k8s/kind-config.yaml`, staging scripts	Helm releases, image loading, and smoke validation can be rehearsed before production
DevContainer DX	`.devcontainer/` with Docker-in-Docker, Helm/kubectl, `kind`, `toxiproxy-cli`	Contributors get one workspace that can run local demo, chaos, and staging workflows

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Architecture Overview

Context

System Context (C4 Level 1)

Key Design Principles

Data Flow

Production: Kafka → Flink → Iceberg (p50 latency: ~220ms)

Local: Generate → Validate → Enrich → DuckDB + Iceberg

Batch Path (daily)

Serving & Control Plane

Technology Choices

Failure Modes & Mitigations

Security

Implemented

Production (via infrastructure)

Observability & Operations

Deployment Topologies

v1-v6 Capability Map

FilesExpand file tree

architecture.md

Latest commit

History

architecture.md

File metadata and controls

Architecture Overview

Context

System Context (C4 Level 1)

Key Design Principles

Data Flow

Production: Kafka → Flink → Iceberg (p50 latency: ~220ms)

Local: Generate → Validate → Enrich → DuckDB + Iceberg

Batch Path (daily)

Serving & Control Plane

Technology Choices

Failure Modes & Mitigations

Security

Implemented

Production (via infrastructure)

Observability & Operations

Deployment Topologies

v1-v6 Capability Map