From 67389b16d4f5c663fff5730151646b1e05a868bd Mon Sep 17 00:00:00 2001 From: Ajayvardhanreddy Date: Mon, 25 May 2026 10:59:16 -0400 Subject: [PATCH] =?UTF-8?q?docs:=20rewrite=20README=20=E2=80=94=20complete?= =?UTF-8?q?=20architecture=20diagram,=20all=206=20phases,=20MCP=20setup,?= =?UTF-8?q?=20layer=20links?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- README.md | 693 +++++++++++++++++++++++++++--------------------------- 1 file changed, 343 insertions(+), 350 deletions(-) diff --git a/README.md b/README.md index 63f3e31..a992fdb 100644 --- a/README.md +++ b/README.md @@ -1,522 +1,515 @@ # Agent Execution Engine -> Production runtime infrastructure for LLM agents — explicit state-machine orchestration, typed tool execution, persistent memory, per-run trace observability, and automated evals. Self-hostable. No LangChain. No LangGraph. +> Production runtime infrastructure for LLM agents — explicit state-machine orchestration, typed tool execution, persistent memory, per-run observability, automated evals, and MCP integration. Self-hostable. No LangChain. No LangGraph. [![CI](https://github.com/Ajayvardhanreddy/agent-execution-engine/actions/workflows/ci.yml/badge.svg)](https://github.com/Ajayvardhanreddy/agent-execution-engine/actions/workflows/ci.yml) ![Python](https://img.shields.io/badge/python-3.11%2B-blue) +![Tests](https://img.shields.io/badge/tests-140%20passing-brightgreen) ![uv](https://img.shields.io/badge/package%20manager-uv-purple) --- +> **Part of a 3-layer AI infrastructure portfolio.** +> This is Layer 3. The full stack — all three layers running together as a distributed system — is being assembled in a separate deployment repo. 
+> +> | Layer | Repo | What it does | +> |---|---|---| +> | Layer 1 | [distributed-kv-store](https://github.com/Ajayvardhanreddy/distributed-kv-store) | Fault-tolerant distributed KV storage with consistent hashing and node failover | +> | Layer 2 | [agent-memory-service](https://github.com/Ajayvardhanreddy/agent-memory-service) | Multi-namespace memory service (session, user, working, audit) backed by Layer 1 | +> | **Layer 3** | **agent-execution-engine (this)** | Agent runtime — orchestration, tool execution, memory, observability, evals, MCP | + +--- + ## What This Is This is not a chatbot. It is not a framework wrapper. It is not a tutorial project. It is a **production runtime** for LLM agents — the same category of software as a process manager, a job queue, or a workflow engine, but purpose-built for agents that call tools, maintain memory, and need to be observable and recoverable when things go wrong. -The engine takes a structured `AgentRun` request (agent type, user input, available tools, budget limits), drives execution through an explicit 12-state state machine, and returns a structured `AgentRunResult` with a full trace of every decision made along the way. +The engine takes a structured request (agent ID, user input), looks up the agent's configuration from a server-side registry, drives execution through an explicit 12-state state machine, and returns a structured result with a full trace of every decision made along the way. 
-**The three-layer portfolio this completes:** +--- -| Layer | Project | What it does | -|---|---|---| -| Layer 1 | [distributed-kv-store](https://github.com/Ajayvardhanreddy/distributed-kv-store) | Fault-tolerant distributed KV storage | -| Layer 2 | [agent-memory-service](https://github.com/Ajayvardhanreddy/agent-memory-service) | Multi-namespace memory service backed by Layer 1 | -| **Layer 3** | **agent-execution-engine (this)** | Agent runtime on top of Layers 1 and 2 | +## Architecture + +### Full Three-Layer Stack + +``` + ┌─────────────────────────────────────────────────────────────────────┐ + │ Client Layer │ + │ │ + │ Web UI / Mobile App Claude Code (via MCP) │ + │ POST /runs ─────────────▶ run_agent tool ──────────────────┐ │ + └────────────────────────────────────────────────────────────────--|--┘ + │ + ┌──────────────────────────────────────────────────────────────────▼──┐ + │ Layer 3 — Agent Execution Engine │ + │ github.com/Ajayvardhanreddy/ │ + │ agent-execution-engine (this) │ + │ │ + │ ┌────────────────┐ ┌─────────────────────────────────────────┐ │ + │ │ REST API │ │ State Machine │ │ + │ │ FastAPI :9000 │ │ │ │ + │ │ │ │ START → LOAD_MEMORY → BUILD_CONTEXT │ │ + │ │ POST /runs │ │ ↓ │ │ + │ │ GET /agents │ │ CALL_LLM ◀──┐ │ │ + │ │ GET /tools │ │ ↓ │ │ │ + │ │ GET /traces │ │ PROCESS_RESPONSE │ │ │ + │ │ GET /metrics │ │ ↙ ↘ │ │ │ + │ └────────────────┘ │ EXECUTE_TOOL RESPOND ● │ │ + │ │ ↓ ESCALATE ● │ │ + │ ┌────────────────┐ │ OBSERVE_RESULT FAIL ● │ │ + │ │ MCP Server │ │ ↓ │ │ + │ │ FastMCP :9001 │ │ WRITE_MEMORY │ │ + │ │ │ │ ↓ │ │ + │ │ run_agent │ │ CHECK_TERMINATION ──────────────┘ │ + │ │ list_agents │ └─────────────────────────────────────────┘ │ + │ └────────────────┘ │ + │ │ + │ ┌──────────────┐ ┌──────────────┐ ┌──────────────────────────┐ │ + │ │ToolRegistry │ │AgentRegistry │ │ TraceCollector │ │ + │ │Pydantic I/O │ │Server-side │ │ One JSON span per │ │ + │ │Timeout/Retry │ │agent configs │ │ state transition │ │ + │ 
└──────────────┘ └──────────────┘ └──────────────────────────┘ │ + └───────────────────────────────┬─────────────────────────────────────┘ + │ HTTP (MEMORY_SERVICE_URL) + ┌───────────────────────────────▼─────────────────────────────────────┐ + │ Layer 2 — Agent Memory Service │ + │ github.com/Ajayvardhanreddy/agent-memory-service │ + │ │ + │ session:{id} user:{id} working:{run_id} audit:{run_id} │ + │ 24h TTL 30d TTL run duration + 1h 90d immutable │ + └───────────────────────────────┬─────────────────────────────────────┘ + │ HTTP (internal) + ┌───────────────────────────────▼─────────────────────────────────────┐ + │ Layer 1 — Distributed KV Store │ + │ github.com/Ajayvardhanreddy/distributed-kv-store │ + │ │ + │ Node 0 :8000 Node 1 :8001 Node 2 :8002 │ + │ Consistent hashing · Replication · Automatic failover │ + └─────────────────────────────────────────────────────────────────────┘ +``` + +### Request Flow (end to end) + +``` +1. Client sends: POST /runs { agent_id, session_id, user_id, input } +2. API layer: Looks up AgentDefinition from AgentRegistry +3. Engine: START → LOAD_MEMORY +4. Layer 2: Returns session history + user facts for this user +5. Layer 1: Serves the KV reads behind Layer 2 +6. Engine: BUILD_CONTEXT → CALL_LLM +7. Anthropic API: Returns tool_use or end_turn +8. Engine: PROCESS_RESPONSE → EXECUTE_TOOL +9. Tool: Runs with Pydantic-validated input, timeout, retry +10. Engine: OBSERVE_RESULT → WRITE_MEMORY +11. Layer 2: Appends tool result to working memory +12. Engine: CHECK_TERMINATION → (loop or RESPOND) +13. Engine: WRITE_MEMORY (final answer) → RESPOND +14. Client gets: { status, final_answer, trace_id, steps, tokens, cost, latency } +``` --- -## Why Agent Execution Needs a Runtime +## Why Not LangChain or LangGraph? + +**Full control over failure handling.** Every failure mode is a modeled terminal state (`FAIL`, `ESCALATE`) with a specific `failure_reason`. 
The behavior on tool timeout, budget exhaustion, memory unavailability, and loop detection is explicit code you can read, test in isolation, and change without touching a framework. -A raw while loop is the first thing every engineer writes when building an agent. It breaks in production for predictable reasons: +**Native integration with own storage.** [Layer 2](https://github.com/Ajayvardhanreddy/agent-memory-service) and [Layer 1](https://github.com/Ajayvardhanreddy/distributed-kv-store) are custom-built distributed systems. Wiring them into a framework designed around its own state persistence adds an abstraction layer between the engine and the storage backend. -- **No state visibility** — you can't tell what the agent was doing when it failed -- **No failure recovery** — a tool timeout crashes the whole run -- **No budget enforcement** — a misbehaving agent burns your API budget silently -- **No loop detection** — agents get stuck calling the same tool forever -- **No memory** — every run starts from scratch, the agent can't recall prior context -- **No observability** — you can't replay a run to understand why the agent made a decision +**Traceability as a first-class concern.** Every state transition emits a typed `Span` with duration, metadata, and cost. This is structural — not bolted on. In a framework, this level of observability requires fighting the framework's own logging mechanisms. -This engine solves all of these. Every execution follows an explicit state machine. Every state transition is logged as a structured JSON span. Tools are typed contracts with timeout, retry, and validated input/output. Budget limits on steps, tokens, cost, and wall-clock time are enforced at every step. Memory persists across runs through Layer 2. +**Explainability.** Every line of the orchestration loop can be explained from first principles. When an interviewer asks "how does your agent handle a tool timeout?" — the answer is one file name and a few lines of code. 
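As a rough illustration of the failure-handling and traceability points above, the sketch below models the twelve states as an enum with an explicit transition table. It raises on any edge not in the table and records one span per transition. The edge set is approximated from the architecture diagram and is an assumption, as are the names `ALLOWED`, `Span`, `InvalidTransition`, and `transition`; the engine's actual code may differ.

```python
import time
from dataclasses import dataclass, field
from enum import Enum


class State(str, Enum):
    START = "START"
    LOAD_MEMORY = "LOAD_MEMORY"
    BUILD_CONTEXT = "BUILD_CONTEXT"
    CALL_LLM = "CALL_LLM"
    PROCESS_RESPONSE = "PROCESS_RESPONSE"
    EXECUTE_TOOL = "EXECUTE_TOOL"
    OBSERVE_RESULT = "OBSERVE_RESULT"
    WRITE_MEMORY = "WRITE_MEMORY"
    CHECK_TERMINATION = "CHECK_TERMINATION"
    RESPOND = "RESPOND"
    ESCALATE = "ESCALATE"
    FAIL = "FAIL"


# Hypothetical edge set, read off the diagram; terminal states have no outgoing edges.
ALLOWED = {
    State.START: {State.LOAD_MEMORY},
    State.LOAD_MEMORY: {State.BUILD_CONTEXT, State.FAIL},
    State.BUILD_CONTEXT: {State.CALL_LLM},
    State.CALL_LLM: {State.PROCESS_RESPONSE, State.FAIL},
    State.PROCESS_RESPONSE: {State.EXECUTE_TOOL, State.RESPOND, State.ESCALATE, State.FAIL},
    State.EXECUTE_TOOL: {State.OBSERVE_RESULT, State.ESCALATE, State.FAIL},
    State.OBSERVE_RESULT: {State.WRITE_MEMORY},
    State.WRITE_MEMORY: {State.CHECK_TERMINATION},
    State.CHECK_TERMINATION: {State.CALL_LLM, State.RESPOND, State.FAIL},
}


@dataclass
class Span:
    step: int
    from_state: str
    to_state: str
    duration_ms: int
    metadata: dict = field(default_factory=dict)


class InvalidTransition(Exception):
    """Raised when a transition is not in the ALLOWED table."""


def transition(current, target, step, started_at, spans, **metadata):
    """Validate the edge, record one span for it, and return the new state."""
    if target not in ALLOWED.get(current, set()):
        raise InvalidTransition(f"{current.value} -> {target.value} is not allowed")
    duration_ms = int((time.monotonic() - started_at) * 1000)
    spans.append(Span(step, current.value, target.value, duration_ms, metadata))
    return target


# A failure is an ordinary transition into a terminal state with a reason attached.
demo_spans = []
demo_state = transition(
    State.CHECK_TERMINATION, State.FAIL, 4, time.monotonic(), demo_spans,
    failure_reason="budget_exceeded",
)
```

Keeping the edges in a plain dictionary means each failure path can be unit-tested by asserting on a single lookup, without running the engine.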
--- -## Why Not LangGraph? +## Quickstart -LangGraph is a mature framework with good documentation. The choice to build a custom orchestration layer comes down to a few concrete reasons: +**Prerequisites:** Python 3.11+, [uv](https://docs.astral.sh/uv/), Anthropic API key. -**Full control over failure handling.** In LangGraph, failures surface through framework-level abstractions. Here, every failure mode is a modeled terminal state (`FAIL`, `ESCALATE`) with a specific `failure_reason` string. The behavior on tool timeout, memory unavailability, budget exhaustion, and loop detection is explicit code you can read, test in isolation, and change without touching the framework. +```bash +git clone https://github.com/Ajayvardhanreddy/agent-execution-engine.git +cd agent-execution-engine -**Native integration with own storage.** The memory layer (Layer 2) and KV store (Layer 1) are custom-built distributed systems. Wiring them into a framework designed around its own state persistence adds an abstraction layer between the engine and the storage backend. Direct integration keeps the code readable and the data model obvious. +uv sync +cp .env.example .env +# Set ANTHROPIC_API_KEY in .env -**Traceability as a first-class concern.** Every state transition in the custom state machine emits a typed `Span` with duration, metadata, and cost. This isn't bolted on — it's structural. In a framework, adding this level of observability requires fighting the framework's own logging and tracing mechanisms. +# Run the support agent demo +make demo -**Explainability.** Every line of the orchestration loop can be explained from first principles. When an interviewer asks "how does your agent handle a tool timeout?" — the answer is one file name and a few lines of code, not "it depends on the LangGraph version." 
+# Run the engineering agent demo +make demo-eng ---- +# Start the REST API (auto-registers both agents on startup) +make serve # http://localhost:9000 -## Architecture +# Run tests +make test # 140 tests +# Run evals (makes real LLM calls — costs money) +make eval # 20 support benchmark cases +make eval-engineering # 10 engineering benchmark cases ``` - ┌─────────────┐ ┌────────────────────────────────────────────────┐ - │ Caller │ AgentRun │ AgentEngine │ - │ (CLI/API) │ ──────────────────▶ │ │ - └─────────────┘ │ ┌──────────────────────────────────────────┐ │ - │ │ State Machine │ │ - │ │ │ │ - │ │ START → LOAD_MEMORY → BUILD_CONTEXT │ │ - │ │ ↓ │ │ - │ │ CALL_LLM ◀──┐ │ │ - │ │ ↓ │ │ │ - │ │ PROCESS_RESPONSE │ │ │ - │ │ ↙ ↘ │ │ │ - │ │ EXECUTE_TOOL RESPOND ● │ │ - │ │ ↓ ESCALATE ● │ │ - │ │ OBSERVE_RESULT FAIL ● │ │ - │ │ ↓ │ │ - │ │ WRITE_MEMORY │ │ - │ │ ↓ │ │ - │ │ CHECK_TERMINATION ───────────────┘ │ - │ └──────────────────────────────────────────┘ │ - │ ↓ ↓ ↓ │ - │ ┌──────────┐ ┌───────────┐ ┌──────────┐ │ - │ │ Tool │ │ Layer 2 │ │Anthropic │ │ - │ │ Registry │ │ Memory │ │ LLM │ │ - │ │ Executor │ │ Service │ │ API │ │ - │ └──────────┘ └───────────┘ └──────────┘ │ - │ │ - │ TraceCollector → one JSON Span per transition │ - └────────────────────────────────────────────────┘ - ↓ - AgentRunResult - { status, final_answer, trace_id, - steps_taken, tokens_used, cost_usd } + +### Docker (Layer 3 only) + +```bash +cp .env.example .env # set ANTHROPIC_API_KEY + +docker compose up +# Engine API: http://localhost:9000 +# MCP Server: http://localhost:9001 (SSE transport) ``` --- ## Core Concepts -### AgentRun — the input contract +### AgentRegistry — server-side configuration -Every execution starts with a structured request: +Agents are registered once server-side. Callers send only `agent_id + input` — they never control system prompts, tools, or budgets. 
```python -AgentRun( - agent_id = "support_agent", - session_id = "sess_abc123", - user_id = "user_456", - input = "I need a refund for order ORD-789.", - tools = ["order_lookup", "refund_request", "escalate_to_human"], - system_prompt = "You are a customer support agent...", - budget = RunBudget(max_steps=15, max_tokens=10_000, max_cost_usd=0.50), +# Registered at startup (engine/api/app.py) +AgentDefinition( + agent_id = "support_agent", + description = "Customer support for ShopEasy", + system_prompt = "...", + tools = ["order_lookup", "refund_request", "escalate_to_human"], + default_budget = RunBudget(max_steps=10, max_tokens=8_000, max_cost_usd=0.30), ) + +# Caller sends: +POST /runs { "agent_id": "support_agent", "session_id": "...", "user_id": "...", "input": "..." } ``` -### State machine — the execution model +This is analogous to how AWS Lambda separates function configuration from invocation — the caller triggers execution but cannot redefine it. + +### State Machine — the execution model -The agent never runs in a free-form loop. Every execution follows this exact state machine. Each arrow is a valid transition; invalid transitions raise immediately. +Every run follows this exact state machine. Invalid transitions raise immediately. ``` START → LOAD_MEMORY → BUILD_CONTEXT → CALL_LLM → PROCESS_RESPONSE - ↑ ↓ ↘ - CHECK_TERMINATION EXECUTE_TOOL RESPOND ● - ↑ ↓ ↗ - WRITE_MEMORY ← OBSERVE_RESULT ESCALATE ● + ↑ ↓ ↘ + CHECK_TERMINATION EXECUTE_TOOL RESPOND ● + ↑ ↓ ↗ + WRITE_MEMORY ← OBSERVE_RESULT ESCALATE ● FAIL ● ``` Terminal states: `RESPOND`, `ESCALATE`, `FAIL`. Every non-terminal state emits a JSON span before transitioning. -### Tools — the typed contract +### ToolRegistry — typed contracts -Every tool registered in the engine must provide: +Every tool is a typed contract with validated input/output, timeout, retry, and explicit error policy. 
```python ToolDefinition( name = "order_lookup", - description = "Look up an order by ID...", # shown to the LLM - input_schema = OrderLookupInput, # Pydantic model — validated before call - output_schema = OrderDetails, # Pydantic model — validated after call - fn = _order_lookup, # async callable + description = "Look up an order by ID. Returns status, items, dates.", + input_schema = OrderLookupInput, # Pydantic — validated before call + output_schema = OrderDetails, # Pydantic — validated after call + fn = _order_lookup, # async callable timeout_seconds = 5, max_retries = 2, - permission_level = "read", # read | write | escalate - on_timeout = "fail", # fail | escalate | skip - on_error = "return_structured_error", # fail | escalate | return_structured_error + on_timeout = "fail", # fail | escalate | skip + on_error = "return_structured_error", ) ``` -Tools never raise into the agent loop. Every call returns a `ToolResult` with `success`, `output`, `error`, `latency_ms`, and `retries_used`. Failures are structured and returned to the LLM as context, not exceptions. +Tools never raise into the agent loop. Every call returns a `ToolResult` — failures are structured and returned to the LLM as context, not exceptions. -### Memory — four namespaces +### Memory — four namespaces via [Layer 2](https://github.com/Ajayvardhanreddy/agent-memory-service) -All memory goes through the [Layer 2 Memory Service](https://github.com/Ajayvardhanreddy/agent-memory-service) HTTP API. 
The engine uses four distinct namespaces: - -| Namespace | Key convention | Contents | Lifetime | +| Namespace | Key | Contents | Lifetime | |---|---|---|---| -| Session | `session:{session_id}` | Conversation history for this session | 24 hours | -| User | `user:{user_id}` | Persistent facts about the user across all sessions | 30 days | -| Working | `working:{run_id}` | Tool results for the current run | Run duration + 1h; deleted explicitly | -| Audit | `audit:{run_id}` | Immutable run record (status, cost, trace_id) | 90 days | +| Session | `session:{session_id}` | Conversation history for this session | 24h | +| User | `user:{user_id}` | Persistent facts across all sessions | 30 days | +| Working | `working:{run_id}` | Tool results for the current run | Run + 1h | +| Audit | `audit:{run_id}` | Immutable run record | 90 days | + +Layer 2 writes these to [Layer 1](https://github.com/Ajayvardhanreddy/distributed-kv-store) — the distributed KV store with consistent hashing and automatic failover. -**Memory availability:** If Layer 2 returns 503 (KV store down), the engine fails fast with `status="failed"`. If Layer 2 is unreachable (not running), the engine degrades gracefully and runs without memory — useful for local development. +**Memory failure modes:** 503 from Layer 2 → `FAIL` immediately. Layer 2 unreachable → graceful degrade (run without memory, useful for local dev). -### Traces — observability contract +### Traces — per-step observability -Every run produces one `Trace` stored in memory. 
Every state transition produces one `Span`: +Every state transition produces one JSON span: ```json { - "event": "span", "span_id": "sp_000004", "trace_id": "trace_ab70283be2e2", - "step": 1, + "step": 2, "from_state": "CALL_LLM", "to_state": "PROCESS_RESPONSE", - "timestamp_ms": 1779340477548, "duration_ms": 1495, "metadata": { "model": "claude-haiku-4-5-20251001", "input_tokens": 1408, "output_tokens": 129, "stop_reason": "tool_use", - "total_cost_usd": 0.001642 + "total_cost_usd": 0.0016 } } ``` +Replay any run step-by-step: + +```bash +curl http://localhost:9000/traces/$TRACE_ID/replay +``` + ### Budget — hard limits per run ```python RunBudget( - max_steps = 15, # maximum state machine iterations - max_tokens = 10000, # total tokens across all LLM calls - max_cost_usd = 0.50, # maximum USD spend for this run - timeout_seconds = 90, # wall-clock timeout + max_steps = 10, # state machine iterations + max_tokens = 8_000, # total tokens across all LLM calls + max_cost_usd = 0.30, # maximum USD spend + timeout_seconds = 120, # wall-clock timeout ) ``` -Budget is checked at `CHECK_TERMINATION` after every tool round-trip. If any limit is hit, the run transitions to `FAIL` with `status="budget_exceeded"` or `status="timeout"`. +Limits are enforced at `CHECK_TERMINATION` after every tool round-trip. Any limit hit → `FAIL` with `status="budget_exceeded"` or `status="timeout"`. --- -## Quickstart +## REST API -**Prerequisites:** Python 3.11+, [uv](https://docs.astral.sh/uv/), an Anthropic API key. 
- -```bash -git clone https://github.com/Ajayvardhanreddy/agent-execution-engine.git -cd agent-execution-engine - -# Install dependencies -uv sync +``` +POST /runs Submit a run (agent_id + session_id + user_id + input) +GET /runs/{run_id} Get a completed run result -# Set up environment -cp .env.example .env -# Edit .env and set ANTHROPIC_API_KEY=your-key-here +GET /agents List registered agents (summary) +GET /agents/{agent_id} Full agent definition +POST /agents Register a new agent +PUT /agents/{agent_id} Update agent definition +DELETE /agents/{agent_id} Remove agent -# Run a demo scenario -PYTHONPATH=. uv run python demos/support_agent/agent.py --scenario eligible_refund +GET /tools List all registered tools with JSON schemas +GET /tools/{name} One tool schema +POST /tools/{name}/test Run a tool directly with test input -# Run all 7 scenarios -PYTHONPATH=. uv run python demos/support_agent/agent.py +GET /traces/{trace_id} Raw trace +GET /traces/{trace_id}/replay Step-by-step replay -# Run tests -uv run pytest +GET /health Liveness probe +GET /ready Readiness probe (checks Layer 2) +GET /metrics Prometheus metrics ``` -> **Docker Compose** (full stack with Layer 1 + Layer 2): wired up in Phase 6. -> Until then, run the engine directly as shown above. +Interactive docs at `http://localhost:9000/docs` when the server is running. --- -## Demo 1: Customer Support Agent - -**Location:** `demos/support_agent/` - -An e-commerce customer support agent with 5 tools and 7 demo scenarios. Proves multi-turn tool execution, policy-grounded decisions, escalation logic, and retry on tool failure. +## MCP Integration -### Tools +The engine ships an MCP server — any MCP-compatible AI client can discover and call agents without writing integration code. 
-| Tool | Permission | What it does | -|---|---|---| -| `order_lookup` | read | Look up order status, items, dates by order ID | -| `refund_policy_search` | read | Retrieve applicable policy for a given situation | -| `refund_request` | write | Submit a refund for an eligible order | -| `ticket_create` | write | Open a support ticket | -| `escalate_to_human` | **escalate** | Hand off to a human — triggers immediate `ESCALATE` state | +### Claude Code (stdio — local dev) -All tools are mocked with deterministic fixture data. No external dependencies. - -### Scenarios +The repo includes `.mcp.json` — Claude Code picks it up automatically: ```bash -PYTHONPATH=. uv run python demos/support_agent/agent.py --scenario +make serve # engine on :9000 +# restart Claude Code in this directory ``` -| Scenario | What it tests | Expected status | -|---|---|---| -| `eligible_refund` | Happy path: damaged item, refund approved | `completed` | -| `ineligible_refund` | Order outside 30-day window | `completed` | -| `missing_order_id` | Agent must ask for order ID before calling tools | `completed` | -| `frustrated_customer` | Angry user, valid claim — agent stays professional | `completed` | -| `tool_timeout_retry` | `order_lookup` times out once, succeeds on retry | `completed` | -| `policy_conflict_nonrefundable` | Digital product — non-refundable exception | `completed` | -| `fraud_risk_escalation` | High-value order + repeated claim → escalate | `escalated` | - -### Sample run output - +Then in Claude Code: ``` -SCENARIO: eligible_refund -INPUT: Hi, I received my order ORD-789 last week and the item arrived damaged... 
- -{"event":"span","from_state":"CALL_LLM","to_state":"PROCESS_RESPONSE","step":1, - "duration_ms":1495,"metadata":{"input_tokens":1408,"stop_reason":"tool_use","total_cost_usd":0.0016}} - -{"event":"span","from_state":"EXECUTE_TOOL","to_state":"OBSERVE_RESULT","step":1, - "metadata":{"tools_called":["order_lookup","refund_policy_search"],"all_succeeded":true}} - -{"event":"span","from_state":"CALL_LLM","to_state":"PROCESS_RESPONSE","step":3, - "metadata":{"stop_reason":"end_turn","total_cost_usd":0.0064}} - -RESULT -{ - "status": "completed", - "final_answer": "Refund Approved ✓\n- Refund ID: REF-29637\n- Amount: $89.99\n- Timeline: 5 business days", - "steps_taken": 3, - "total_tokens_used": 5817, - "total_cost_usd": 0.0064, - "latency_ms": 5284 -} -Match: ✓ PASS +List the available agents. +Use the support agent to check order ORD-789 for user u-001. ``` -### Phase 2 memory demo +Claude Code calls `list_agents` and `run_agent` via your MCP server — no curl, no integration code. -Requires Layer 2 running on `localhost:8080`: +### Docker (SSE — remote clients) ```bash -PYTHONPATH=. uv run python demos/support_agent/agent.py --memory-demo +docker compose up +# MCP server on :9001 with SSE transport ``` -Run 1: user introduces themselves as Alex → name stored in user memory. -Run 2: same user, new session → agent greets Alex by name without being told again. - ---- +Connect any SSE-capable MCP client to `http://localhost:9001/sse`. -## Demo 2: Engineering Assistant Agent +See [docs/mcp-setup.md](docs/mcp-setup.md) for Claude Desktop config. -> **Status: Phase 5 — not yet built.** Placeholder directory exists at `demos/engineering_agent/`. +--- -Will prove the same engine runs in a completely different domain. Tools: `github_issue_read`, `repo_file_search`, `code_context_retrieval`, `fix_plan_create`. Five scenarios: clear bug, missing context, large codebase, conflicting comments, test failure triage. 
+## Demo Agents ---- +### Demo 1 — Customer Support Agent -## Trace Replay +**Location:** `demos/support_agent/` -> **Status: Phase 3 — not yet built.** +Five tools, seven scenarios, 20 benchmark eval cases. Proves multi-turn tool execution, policy-grounded decisions, escalation logic, retry on tool failure, and cross-session memory. -Once built, the REST API will expose: +| Tool | What it does | +|---|---| +| `order_lookup` | Look up order status, items, and dates | +| `refund_policy_search` | Retrieve applicable refund policy | +| `refund_request` | Submit a refund for an eligible order | +| `ticket_create` | Open a support ticket | +| `escalate_to_human` | Hand off to human — triggers `ESCALATE` state immediately | ```bash -# Run the agent -curl -X POST http://localhost:9000/runs -d '{"agent_id":"support_agent",...}' -# → {"run_id":"run_abc","trace_id":"trace_xyz",...} - -# Replay every decision -curl http://localhost:9000/traces/trace_xyz/replay +make demo # eligible_refund scenario +PYTHONPATH=. uv run python demos/support_agent/agent.py --list # all scenarios +PYTHONPATH=. uv run python demos/support_agent/agent.py --scenario fraud_risk_escalation ``` -The replay endpoint returns an ordered, human-readable list of every state the agent visited, every tool call made, and every LLM decision with inputs, outputs, latency, and cost at each step. 
+| Scenario | Tests | Expected | +|---|---|---| +| `eligible_refund` | Happy path — damaged item, refund approved | `completed` | +| `ineligible_refund` | Order outside 30-day window | `completed` | +| `missing_order_id` | Agent asks for order ID before calling tools | `completed` | +| `frustrated_customer` | Angry user, valid claim | `completed` | +| `tool_timeout_retry` | `order_lookup` times out once, succeeds on retry | `completed` | +| `policy_conflict_nonrefundable` | Digital product — non-refundable exception | `completed` | +| `fraud_risk_escalation` | High-value order + repeated claim → escalate | `escalated` | ---- +### Demo 2 — Engineering Assistant Agent -## Eval Methodology +**Location:** `demos/engineering_agent/` -> **Status: Phase 4 — not yet built.** +Four tools, five scenarios, 10 benchmark eval cases. Same engine, completely different domain — proves the runtime is domain-agnostic. -Once built: +| Tool | What it does | +|---|---| +| `code_search` | Search codebase for functions, classes, patterns | +| `file_read` | Read a specific file by path | +| `pr_review` | Review a pull request for bugs, security issues, style | +| `dependency_check` | Audit dependencies for outdated versions and CVEs | ```bash -make eval AGENT=support_agent # run 20 benchmark cases, emit regression report -make eval AGENT=engineering_agent # run 10 benchmark cases -make eval-compare AGENT=support_agent PREV=reports/report_20260514.md -``` - -**Metrics measured:** - -| Metric | What it measures | -|---|---| -| Task success | Did the agent complete the user's goal? | -| Tool correctness | Right tools in the right order? | -| Escalation accuracy | Escalated exactly when it should have? | -| Groundedness | Final answer based on tool outputs, not hallucination? | -| Step efficiency | Minimum necessary steps? | -| Cost per run | USD spent per execution | -| Latency | Wall-clock time to completion | -| Failure recovery | Tool failures handled gracefully? 
| - -**Regression report format:** -``` -Task success: 17/20 (85.0%) [prev: 16/20 +1] -Tool correctness: 19/20 (95.0%) [prev: 19/20 0] -Escalation accuracy: 17/20 (85.0%) [prev: 15/20 +2] -Avg cost per run: $0.038 [prev: $0.041 improved] - -REGRESSIONS (1) -- case_012: task_success 1.0 → 0.0 - Reason: agent called refund_request before order_lookup +make demo-eng # review_pr scenario +PYTHONPATH=. uv run python demos/engineering_agent/agent.py --list # all scenarios +PYTHONPATH=. uv run python demos/engineering_agent/agent.py --scenario security_investigation ``` ---- - -## Failure Modes - -Every failure mode has explicit handling. None of them crash the engine or produce an infinite loop. - -| Failure mode | Trigger | Engine behavior | +| Scenario | Tests | Expected | |---|---|---| -| Tool timeout | Tool exceeds `timeout_seconds` | `ToolError(error_type="timeout")`, retry up to `max_retries`, then `on_timeout` policy | -| Tool malformed output | Output fails Pydantic validation | `ToolError(error_type="validation_error")`, no retry, structured error returned to LLM | -| Tool empty result | Tool returns `None` or `{}` | `ToolError(error_type="empty_result", recoverable=True)`, LLM decides next step | -| Unknown tool called | LLM calls unregistered tool name | Error message injected: `"Tool X is not available. Available: [...]"` | -| Duplicate tool call | Same tool + same input called twice | Loop guard injects: `"You already called this tool. 
Use a different approach."` | -| Context window overflow | Token budget approaching limit | Compressor trims oldest messages, preserves last 6 + original user message | -| Budget exceeded | Steps / tokens / cost / time limit hit | `FAIL` with `status="budget_exceeded"` or `status="timeout"` | -| Memory service down (503) | Layer 2 returns 503 | `FAIL` with `failure_reason="memory_unavailable"` — do not continue | -| Memory service unreachable | Layer 2 not running | Degrade gracefully — run without memory (dev mode) | -| LLM API error | Anthropic returns 5xx | Retry up to 3× with exponential backoff, then `FAIL` | -| Infinite loop detected | Same state sequence repeats 3× | Loop guard forces `FAIL` with `failure_reason="Infinite loop detected"` | -| Escalation required | Tool with `permission_level="escalate"` called | Immediate `ESCALATE` from `EXECUTE_TOOL`, bypasses `CHECK_TERMINATION` | +| `find_function` | Locate function definition across codebase | `completed` | +| `review_pr` | PR-42 has critical timing-attack — must be called out | `completed` | +| `dependency_audit` | cryptography CRITICAL CVE must be surfaced | `completed` | +| `inspect_file` | Read and explain a source file | `completed` | +| `security_investigation` | Multi-step: find SQL injection, then read vulnerable file | `completed` | --- -## MCP Integration - -> **Status: Phase 6 — not yet built.** +## Eval Harness -Once built, Claude Desktop can run agents and replay traces directly: +30 benchmark conversations scored across 6 weighted dimensions with hard-fail semantics for safety-critical cases. 
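To make the combination of weighted scoring and hard-fail semantics concrete, here is a minimal sketch. The weights, the two hard-fail scorers, and the 0.70 threshold come from the scorer table in this section. The rest is an assumption for illustration: that each scorer returns a value between 0.0 and 1.0, that the overall score is a weight-normalized average, that a hard-fail scorer triggers at 0.0, and the name `score_case`.

```python
# Weights and hard-fail flags as listed in the scorer table.
WEIGHTS = {
    "task_completion": 2.0,
    "tool_selection": 1.5,
    "answer_quality": 1.5,
    "escalation_accuracy": 1.5,
    "cost_efficiency": 0.5,
    "latency": 0.5,
}
HARD_FAIL = {"task_completion", "escalation_accuracy"}
PASS_THRESHOLD = 0.70


def score_case(scores):
    """Combine per-scorer results (assumed to lie in [0.0, 1.0]) into a verdict."""
    total_weight = sum(WEIGHTS.values())
    overall = sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / total_weight
    # Assumption: a hard-fail scorer vetoes the case when it scores 0.0.
    hard_failed = sorted(name for name in HARD_FAIL if scores[name] == 0.0)
    passed = not hard_failed and overall >= PASS_THRESHOLD
    return {"overall": round(overall, 3), "passed": passed, "hard_failed": hard_failed}


# A fraud case that completed instead of escalating: every other scorer is perfect.
missed_escalation = {name: 1.0 for name in WEIGHTS}
missed_escalation["escalation_accuracy"] = 0.0
verdict = score_case(missed_escalation)
```

Under these assumptions the missed escalation still averages 0.8, above the pass threshold, so only the hard-fail veto causes it to fail. That is the situation the hard-fail rule exists for.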
-```json -{ - "mcpServers": { - "agent-execution-engine": { - "command": "python", - "args": ["-m", "mcp_server.server"], - "cwd": "/path/to/agent-execution-engine" - } - } -} +```bash +make eval # 20 support cases +make eval-engineering # 10 engineering cases +make eval-all # all 30 cases +make eval-case CASE=support_005 # single case ``` -Exposed tools: `run_agent`, `get_trace`, `replay_trace`, `run_eval`, `list_tools`. - ---- +| Scorer | Weight | Hard fail? | What it measures | +|---|---|---|---| +| `task_completion` | 2.0 | Yes | Did the run reach the expected terminal state? | +| `tool_selection` | 1.5 | No | Right tools called, no unexpected tools? | +| `answer_quality` | 1.5 | No | Required keywords present, forbidden terms absent? | +| `escalation_accuracy` | 1.5 | Yes | Escalated when required? False negative = security failure | +| `cost_efficiency` | 0.5 | No | Cost within the case budget? | +| `latency` | 0.5 | No | Response within latency threshold? | -## Layer Integration +**Hard-fail semantics:** If a case misses a required escalation (e.g. fraud scenario completed instead of escalating), it auto-fails regardless of other scores. A false negative on escalation is a security failure — weighted average alone cannot pass it. -The engine is the top layer of a three-layer AI infrastructure stack. Every layer is independently deployable and tested. 
- -``` -┌─────────────────────────────────┐ -│ Layer 3: Agent Execution │ ← this repo -│ Engine (port 9000) │ -│ Orchestration, tools, evals │ -└─────────────┬───────────────────┘ - │ HTTP (localhost:8080) - ▼ -┌─────────────────────────────────┐ -│ Layer 2: Agent Memory │ github.com/Ajayvardhanreddy/agent-memory-service -│ Service (port 8080) │ -│ 4 memory namespaces, streams │ -└─────────────┬───────────────────┘ - │ internal - ▼ -┌─────────────────────────────────┐ -│ Layer 1: Distributed KV │ github.com/Ajayvardhanreddy/distributed-kv-store -│ Store (ports 8000–8002) │ -│ Consistent hashing, failover │ -└─────────────────────────────────┘ -``` +Pass threshold: **0.70** overall weighted score. --- -## Build Status +## Failure Modes -| Phase | What | Status | +Every failure mode is modeled. None crash the engine or produce an infinite loop. + +| Failure | Trigger | Behavior | |---|---|---| -| Phase 1 | Orchestrator, state machine, tool execution, support agent demo | ✅ Complete | -| Phase 2 | Memory integration (session, user, working, audit) | ✅ Complete | -| Phase 3 | REST API, trace replay, Prometheus metrics | 🔲 Not started | -| Phase 4 | Eval harness, 6 scorers, 30 benchmark cases, regression reports | 🔲 Not started | -| Phase 5 | Engineering assistant agent demo | 🔲 Not started | -| Phase 6 | MCP server, Docker Compose full stack | 🔲 Not started | +| Tool timeout | Exceeds `timeout_seconds` | Retry up to `max_retries`, then apply `on_timeout` policy | +| Malformed tool output | Fails Pydantic validation | Structured error returned to LLM — no retry | +| Empty tool result | Returns `None` or `{}` | Recoverable error — LLM decides next step | +| Unknown tool called | LLM calls unregistered name | `"Tool X not available. 
Available: [...]"` injected | +| Duplicate tool call | Same tool + input called twice | Loop guard injects alternative approach hint | +| Context overflow | Token budget approaching limit | Compressor trims oldest messages, preserves last 6 + original | +| Budget exceeded | Steps / tokens / cost / time | `FAIL` with `status="budget_exceeded"` or `"timeout"` | +| Infinite loop | Same state sequence 3× | Loop guard forces `FAIL` | +| Layer 2 down (503) | Memory service returns 503 | `FAIL` — do not continue without memory | +| Layer 2 unreachable | Not running | Graceful degrade — run without memory (dev mode) | +| LLM API error | Anthropic 5xx | Retry 3× with exponential backoff, then `FAIL` | +| Escalation triggered | Tool with `permission_level="escalate"` | Immediate `ESCALATE` from `EXECUTE_TOOL` | --- -## Architectural Decision Records - -Written as each phase is stabilised: +## Tech Stack -| ADR | Decision | +| Concern | Choice | |---|---| -| [001](docs/decisions/001-custom-orchestration-over-langgraph.md) | Custom orchestration over LangGraph | -| [002](docs/decisions/002-anthropic-sdk-direct-over-langchain.md) | Anthropic SDK direct over LangChain | -| [003](docs/decisions/003-state-machine-over-while-loop.md) | Explicit state machine over while loop | -| [004](docs/decisions/004-typed-tool-registry.md) | Typed tool registry with Pydantic | -| [005](docs/decisions/005-eval-methodology.md) | Structured eval methodology | - -> ADR documents are written as phases complete. Pending Phase 1 + 2 stabilisation. 
+| Language | Python 3.11+ | +| LLM | Anthropic SDK — direct, no framework | +| API | FastAPI + Uvicorn | +| Validation | Pydantic v2 | +| MCP server | FastMCP | +| Memory backend | [Layer 2 — agent-memory-service](https://github.com/Ajayvardhanreddy/agent-memory-service) | +| KV backend | [Layer 1 — distributed-kv-store](https://github.com/Ajayvardhanreddy/distributed-kv-store) | +| HTTP client | httpx (async) | +| Observability | Structured JSON spans + Prometheus metrics | +| Testing | pytest + pytest-asyncio + pytest-httpx | +| Package manager | uv | +| Containerisation | Docker + Docker Compose | --- -## Known Limitations +## Project Structure -| Limitation | Notes | -|---|---| -| No REST API yet | Engine is runnable via Python only until Phase 3 | -| No eval scoring yet | Eval harness built in Phase 4 | -| No MCP server yet | Claude Desktop integration in Phase 6 | -| Docker Compose not wired | Skeletons exist; full stack in Phase 6 | -| Tools are mocked | Support agent uses fixture data, no real order database | -| No auth on any endpoint | By design for portfolio; production would add API key auth | -| Single-node memory in dev mode | When Layer 2 is unavailable, no memory persistence | -| User fact extraction is regex-only | Name patterns only; no LLM-based extraction yet | -| No streaming responses | All runs are synchronous; streaming added in Phase 3 | +``` +engine/ + api/ REST API — routes, agent registry, dependencies, middleware + evals/ Eval framework — base contracts, 6 scorers, EvalSuite + memory/ Layer 2 client — 4 namespaces, graceful degrade + mcp/ MCP server — run_agent, list_agents, stdio + SSE + observability/ Trace collection, Prometheus metrics + orchestrator/ State machine, engine loop, step runner, loop guard, budget + tools/ ToolDefinition, ToolRegistry, ToolExecutor, error types + +demos/ + support_agent/ 5 tools, 7 scenarios, system prompt + engineering_agent/ 4 tools, 5 scenarios, system prompt + +evals/ + dataset/ 30 benchmark 
EvalCases (support + engineering) + reports/ JSON regression reports (gitignored) + runner.py CLI runner — --suite, --case, --no-save + report.py Terminal output + JSON report formatter + +docs/ + mcp-setup.md Claude Code + Docker MCP setup guide + adding-tools.md How to write and register custom tools + registering-agents.md How to register agents via the API + tool_template.py Copy-paste starting point for tool authors + +tests/ + unit/ State machine, scorers, registry, executor, budget, traces + integration/ API routes, MCP server, memory client +``` --- -## Roadmap - -Phases 3–6 in order: - -**Phase 3 — REST API + Trace Replay** -`POST /runs`, `GET /runs/{id}`, `GET /traces/{id}/replay`, `GET /metrics` (Prometheus). Full trace reconstructed as human-readable ordered event list. - -**Phase 4 — Eval Harness** -`make eval AGENT=support_agent`. 6 scorers (task success, tool correctness, escalation, groundedness, cost, latency). 30 benchmark cases. Regression report with diff from previous run. - -**Phase 5 — Engineering Assistant Agent** -Same engine, different domain. 4 tools (GitHub issue read, repo file search, code context retrieval, fix plan creation). 5 scenarios. 10 eval cases. - -**Phase 6 — MCP Server + Docker Compose** -FastMCP server exposing engine as 5 MCP tools. `docker-compose up` brings up all three layers in one command. - ---- +## Build Status -## Tech Stack +All 6 phases complete. 
-| Concern | Choice | -|---|---| -| Language | Python 3.11+ | -| LLM | Anthropic SDK (direct — no LangChain) | -| API framework | FastAPI + Uvicorn (Phase 3) | -| Data validation | Pydantic v2 | -| Memory backend | Layer 2 Memory Service (HTTP) | -| KV backend | Layer 1 Distributed KV Store | -| HTTP client | httpx (async) | -| MCP server | fastmcp (Phase 6) | -| Testing | pytest + pytest-asyncio + pytest-httpx | -| Package manager | uv | -| Observability | Structured JSON spans, Prometheus metrics (Phase 3) | +| Phase | What | Status | +|---|---|---| +| Phase 1 | State machine, tool execution, budget, loop detection, support agent demo | ✅ | +| Phase 2 | Memory integration — session, user, working, audit namespaces | ✅ | +| Phase 3 | REST API, AgentRegistry, trace replay, Prometheus metrics | ✅ | +| Phase 4 | Eval harness — 6 scorers, hard-fail semantics, 20 support cases, regression reports | ✅ | +| Phase 5 | Engineering agent — 4 tools, 5 scenarios, 10 eval cases | ✅ | +| Phase 6 | MCP server, Docker Compose, Claude Code `.mcp.json`, startup agent registration | ✅ |