AgentFlow is an event-native metrics layer: business metrics are generated by operational events and stay live — the serving cache is invalidated when events arrive, with a measured 1.06 s p50 / 1.99 s p95 event-to-metric delay on production defaults (freshness benchmark). Each metric declares which events move it through versioned contracts, so the event→metric graph is a tested artifact rather than tribal knowledge.
Consumers are anything that needs the current number at the moment of decision: humans, dashboards, downstream services, and AI agents. Agents stress the same properties hardest — seconds-level freshness, semantic context, machine-readable contracts — but they are one consumer of the boundary, not its definition.
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Web Store │ │ Payment │ │ Inventory │
│ (events) │ │ Gateway │ │ Service │
└──────┬───────┘ └──────┬───────┘ └──────┬───────┘
│ │ │
▼ ▼ ▼
┌─────────────────────────────────────────────────────────┐
│ AgentFlow Platform │
│ │
│ Kafka → Flink → Iceberg → Semantic Layer → Agent API │
└─────────────────────────────┬───────────────���───────────┘
│
┌─────────▼──────────┐
│ Consumers │
│ (humans, dashboards│
│ services, agents) │
└────────────────────┘
- Streaming-first: Batch is a special case of streaming (bounded stream). One codebase, one semantics.
- Quality gates before storage: Bad data never reaches the serving layer. Agents never see it.
- Semantic over raw: Agents query entities and metrics, not tables and columns.
- Cost-aware: Every component has autoscaling and lifecycle policies. We measure $/GB processed.
- Observable: If you can't measure it, you can't operate it. Every stage emits latency, throughput, and error metrics.
- Ingestion: Events arrive via Kafka producers (orders, payments, clicks) or Debezium CDC connectors running on Kafka Connect
- Processing: Flink validates (schema + semantic), enriches, deduplicates, and routes events
- Storage: Valid events land in Iceberg tables; production uses AWS Glue as the catalog over object storage
- Quality: Pre-storage gates check schema + semantic rules. Failures → dead letter topic
- Serving: Agent API reads from the configured serving backend; the checked-in paths are DuckDB by default and optional ClickHouse for higher read concurrency, while Iceberg remains the lakehouse storage layer
For CDC sources, Debezium/Kafka Connect handles source capture while a shared normalizer converts Postgres/MySQL envelopes into one canonical AgentFlow CDC contract before validation. See ADR 0005.
Same pipeline logic, no infrastructure dependencies:
- Generate:
local_pipeline.pycreates realistic e-commerce events - Validate: Schema validation (Pydantic) + semantic validation (business rules)
- Enrich: Domain enrichment per event type (order sizing, click classification, payment risk)
- Store: Validated events written to DuckDB for serving and to Iceberg via PyIceberg
- Serve: Agent API reads from DuckDB while
/v1/healthreports Iceberg row counts - Catalog: Development uses a MinIO-backed REST catalog from
docker-compose.iceberg.yml(writing to the sameagentflow-lakeS3 object store as the Flink stack); production uses AWS Glue
Both paths use the same validator and enrichment code (src/quality/, src/processing/transformations/).
- Orchestration: Dagster triggers compaction, aggregation, and quality reports
- User profiles: Materialized from
orders_v2→users_enriched - Quality report: Row counts, null rates, dead letter ratio
- Compaction: Iceberg snapshot expiry + data file compaction (production only)
The serving layer has grown beyond the original read-only surface. The current API groups into four slices:
- Core agent reads:
/v1/entity,/v1/metrics,/v1/query,/v1/catalog,/v1/health - Discovery and audit:
/v1/search,/v1/contracts,/v1/lineage,/v1/changelog - Operational workflows:
/v1/batch,/v1/stream/events,/v1/deadletter,/v1/webhooks,/v1/alerts,/v1/slo - SDK contract: Python sync/async clients and a TypeScript client wrap the same HTTP surface
The API process also starts several background components:
- DuckDBPool: shared read cursors plus serialized writes for the local serving path
- QueryCache: Redis-backed metric cache with invalidation on new events
- WebhookDispatcher: polls validated pipeline events and delivers signed webhook callbacks
- AlertDispatcher: evaluates metric thresholds and records alert history
- OutboxProcessor: retries replay delivery from DuckDB outbox rows to Kafka
These components keep the local demo and the production architecture aligned around one agent-facing contract, even when the backing infrastructure differs.
See Architecture Decision Records for detailed trade-off analysis.
| Component | Choice | Runner-up | Key differentiator |
|---|---|---|---|
| Streaming | Kafka 3.7 (KRaft) | Pulsar | Ecosystem maturity, MSK managed service |
| CDC capture | Debezium + Kafka Connect | Python-native connectors | Mature Postgres/MySQL CDC, built-in offsets/schema history, one ops model |
| Processing | Flink 2.2 | Spark Structured Streaming | True event-time, lower latency, native watermarks |
| Storage | Iceberg 1.5 | Delta Lake | Vendor-neutral, hidden partitioning, time-travel |
| Local query | DuckDB | SQLite | Columnar, fast analytics, Iceberg support |
| Orchestration | Dagster | Airflow | Software-defined assets, better testing, type safety |
| API | FastAPI | Flask | Async, auto-docs, Pydantic integration |
| IaC | Terraform | Pulumi | Team familiarity, HCL readability, module ecosystem |
| Failure | Impact | Mitigation |
|---|---|---|
| Kafka broker down | Reduced throughput | 3-broker cluster, replication factor 3, min.insync.replicas=2 |
| Flink job crash | Processing stops | Exactly-once checkpointing (30s), auto-restart on failure |
| Bad data in source | Incorrect agent answers | Pre-storage quality gates, dead letter topic, alerting |
| S3 outage | No new data in serving | Flink checkpoints to S3 pause; resumes on recovery |
| API overload | Agent queries fail | Per-key rate limiting, DuckDB connection pooling, horizontal scaling |
| Redis unavailable | Cache or rate-limit degradation | Metric cache misses fall back to source queries; rate limiter fail-opens instead of blocking the API |
| Webhook target failure | Lost downstream notification | Signed delivery logs, retries with backoff, alert/webhook history in DuckDB |
- API authentication: API key via
X-API-Keyheader (setAGENTFLOW_API_KEYSenv var) - Rate limiting: Per-key sliding window with Redis backing when available and in-memory fallback for local/test, configurable via
AGENTFLOW_RATE_LIMIT_RPM(default: 120/min) - Health/docs exempt:
/v1/health,/docs,/metricsdon't require auth - No secrets in code: All credentials via environment variables
- Terraform state: Encrypted S3 backend with DynamoDB locking
- Kafka: TLS in-transit, SASL authentication (MSK config)
- S3: SSE-KMS encryption, bucket policy restricts to VPC endpoints
- Network: Private subnets, security groups per component
- Metrics: Prometheus scrapes
/metrics; Grafana dashboards cover pipeline health plus support, ops, and merch journeys. - Tracing: OpenTelemetry spans export to Jaeger through
OTEL_EXPORTER_OTLP_ENDPOINT; the production-like compose stack exposes Jaeger on:16686. - Logs: Structlog emits JSON logs with
trace_id,span_id,correlation_id, and tenant context so incidents can be traced across API, cache, and background loops. - Operational APIs:
/v1/alerts,/v1/webhooks,/v1/deadletter,/v1/slo, and/v1/stream/eventsare part of the control plane, not side tooling.
| Environment | Primary components | Purpose |
|---|---|---|
| Local demo | src.processing.local_pipeline + DuckDB + FastAPI |
Fastest path for developers and SDK examples |
| Prod-like Docker | docker-compose.prod.yml with Kafka, Redis, Jaeger, Prometheus, Grafana, API, and optional ClickHouse |
Observability and production-shaped debugging against a realistic local stack |
| Lite E2E Docker | docker-compose.e2e.yml with the narrowed CI service set |
Faster E2E and smoke coverage without the full observability stack |
| Chaos harness | docker-compose.chaos.yml + Toxiproxy + pytest chaos suite |
Validate graceful degradation under Kafka/Redis failures |
| kind staging | helm/agentflow, k8s/, scripts/k8s_staging_up.sh |
Production-shaped staging on a local Kubernetes cluster |
| Production | Managed Kafka/Flink/Iceberg/object storage + Helm/Terraform | Durable, autoscaled multi-service deployment |
| Capability | Implementation | Architectural impact |
|---|---|---|
| Durable replay and outbox | OutboxProcessor, dead-letter replay, DuckDB outbox rows |
Background delivery is retried without coupling API latency to Kafka availability |
| Redis-backed rate limiting | AuthManager + RateLimiter with fail-open fallback |
Per-key throttling stays centralized in prod, while local/test can continue when Redis is absent |
| Typed SDK surface | sdk/ and sdk-ts/ |
Python and TypeScript agents consume one HTTP contract instead of bespoke adapters |
| Distributed observability | OTel tracing, structlog correlation, /metrics, Grafana, Jaeger |
API, background jobs, and streaming workflows share the same debugging context |
| Chaos engineering | tests/chaos/, docker-compose.chaos.yml, config/toxiproxy.json |
Failure handling is exercised continuously rather than assumed from code review |
| Kubernetes staging | helm/agentflow, k8s/kind-config.yaml, staging scripts |
Helm releases, image loading, and smoke validation can be rehearsed before production |
| DevContainer DX | .devcontainer/ with Docker-in-Docker, Helm/kubectl, kind, toxiproxy-cli |
Contributors get one workspace that can run local demo, chaos, and staging workflows |