From 67389b16d4f5c663fff5730151646b1e05a868bd Mon Sep 17 00:00:00 2001
From: Ajayvardhanreddy <ajayvardhanreddyrondla@gmail.com>
Date: Mon, 25 May 2026 10:59:16 -0400
Subject: [PATCH] =?UTF-8?q?docs:=20rewrite=20README=20=E2=80=94=20complete?=
 =?UTF-8?q?=20architecture=20diagram,=20all=206=20phases,=20MCP=20setup,?=
 =?UTF-8?q?=20layer=20links?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 README.md | 693 +++++++++++++++++++++++++++---------------------------
 1 file changed, 343 insertions(+), 350 deletions(-)

diff --git a/README.md b/README.md
index 63f3e31..a992fdb 100644
--- a/README.md
+++ b/README.md
@@ -1,522 +1,515 @@
 # Agent Execution Engine
 
-> Production runtime infrastructure for LLM agents — explicit state-machine orchestration, typed tool execution, persistent memory, per-run trace observability, and automated evals. Self-hostable. No LangChain. No LangGraph.
+> Production runtime infrastructure for LLM agents — explicit state-machine orchestration, typed tool execution, persistent memory, per-run observability, automated evals, and MCP integration. Self-hostable. No LangChain. No LangGraph.
 
 [![CI](https://github.com/Ajayvardhanreddy/agent-execution-engine/actions/workflows/ci.yml/badge.svg)](https://github.com/Ajayvardhanreddy/agent-execution-engine/actions/workflows/ci.yml)
 ![Python](https://img.shields.io/badge/python-3.11%2B-blue)
+![Tests](https://img.shields.io/badge/tests-140%20passing-brightgreen)
 ![uv](https://img.shields.io/badge/package%20manager-uv-purple)
 
 ---
 
+> **Part of a 3-layer AI infrastructure portfolio.**
+> This is Layer 3. The full stack — all three layers running together as a distributed system — is being assembled in a separate deployment repo.
+>
+> | Layer | Repo | What it does |
+> |---|---|---|
+> | Layer 1 | [distributed-kv-store](https://github.com/Ajayvardhanreddy/distributed-kv-store) | Fault-tolerant distributed KV storage with consistent hashing and node failover |
+> | Layer 2 | [agent-memory-service](https://github.com/Ajayvardhanreddy/agent-memory-service) | Multi-namespace memory service (session, user, working, audit) backed by Layer 1 |
+> | **Layer 3** | **agent-execution-engine (this)** | Agent runtime — orchestration, tool execution, memory, observability, evals, MCP |
+
+---
+
 ## What This Is
 
 This is not a chatbot. It is not a framework wrapper. It is not a tutorial project.
 
 It is a **production runtime** for LLM agents — the same category of software as a process manager, a job queue, or a workflow engine, but purpose-built for agents that call tools, maintain memory, and need to be observable and recoverable when things go wrong.
 
-The engine takes a structured `AgentRun` request (agent type, user input, available tools, budget limits), drives execution through an explicit 12-state state machine, and returns a structured `AgentRunResult` with a full trace of every decision made along the way.
+The engine takes a structured request (agent ID, session ID, user ID, and user input), looks up the agent's configuration from a server-side registry, drives execution through an explicit 12-state state machine, and returns a structured result with a full trace of every decision made along the way.
 
-**The three-layer portfolio this completes:**
+---
 
-| Layer | Project | What it does |
-|---|---|---|
-| Layer 1 | [distributed-kv-store](https://github.com/Ajayvardhanreddy/distributed-kv-store) | Fault-tolerant distributed KV storage |
-| Layer 2 | [agent-memory-service](https://github.com/Ajayvardhanreddy/agent-memory-service) | Multi-namespace memory service backed by Layer 1 |
-| **Layer 3** | **agent-execution-engine (this)** | Agent runtime on top of Layers 1 and 2 |
+## Architecture
+
+### Full Three-Layer Stack
+
+```
+  ┌─────────────────────────────────────────────────────────────────────┐
+  │                         Client Layer                                │
+  │                                                                     │
+  │   Web UI / Mobile App         Claude Code (via MCP)                 │
+  │   POST /runs  ─────────────▶  run_agent tool  ──────────────────┐  │
+  └──────────────────────────────────────────────────────────────────┼──┘
+                                                                     │
+  ┌──────────────────────────────────────────────────────────────────▼──┐
+  │                    Layer 3 — Agent Execution Engine                 │
+  │                         github.com/Ajayvardhanreddy/               │
+  │                         agent-execution-engine  (this)             │
+  │                                                                     │
+  │  ┌────────────────┐   ┌─────────────────────────────────────────┐  │
+  │  │  REST API      │   │           State Machine                 │  │
+  │  │  FastAPI :9000 │   │                                         │  │
+  │  │                │   │  START → LOAD_MEMORY → BUILD_CONTEXT    │  │
+  │  │  POST /runs    │   │                            ↓            │  │
+  │  │  GET  /agents  │   │                        CALL_LLM ◀──┐   │  │
+  │  │  GET  /tools   │   │                            ↓        │   │  │
+  │  │  GET  /traces  │   │                   PROCESS_RESPONSE  │   │  │
+  │  │  GET  /metrics │   │                    ↙           ↘     │   │  │
+  │  └────────────────┘   │            EXECUTE_TOOL      RESPOND ●  │  │
+  │                        │                 ↓          ESCALATE ●  │  │
+  │  ┌────────────────┐   │          OBSERVE_RESULT      FAIL   ●  │  │
+  │  │  MCP Server    │   │                 ↓                       │  │
+  │  │  FastMCP :9001 │   │          WRITE_MEMORY                   │  │
+  │  │                │   │                 ↓                       │  │
+  │  │  run_agent     │   │         CHECK_TERMINATION ──────────────┘  │
+  │  │  list_agents   │   └─────────────────────────────────────────┘  │
+  │  └────────────────┘                                                 │
+  │                                                                     │
+  │  ┌──────────────┐  ┌──────────────┐  ┌──────────────────────────┐  │
+  │  │ToolRegistry  │  │AgentRegistry │  │   TraceCollector         │  │
+  │  │Pydantic I/O  │  │Server-side   │  │   One JSON span per      │  │
+  │  │Timeout/Retry │  │agent configs │  │   state transition       │  │
+  │  └──────────────┘  └──────────────┘  └──────────────────────────┘  │
+  └───────────────────────────────┬─────────────────────────────────────┘
+                                  │ HTTP (MEMORY_SERVICE_URL)
+  ┌───────────────────────────────▼─────────────────────────────────────┐
+  │                    Layer 2 — Agent Memory Service                   │
+  │              github.com/Ajayvardhanreddy/agent-memory-service       │
+  │                                                                     │
+  │   session:{id}   user:{id}   working:{run_id}   audit:{run_id}      │
+  │   24h TTL        30d TTL     run duration + 1h  90d immutable       │
+  └───────────────────────────────┬─────────────────────────────────────┘
+                                  │ HTTP (internal)
+  ┌───────────────────────────────▼─────────────────────────────────────┐
+  │                    Layer 1 — Distributed KV Store                   │
+  │              github.com/Ajayvardhanreddy/distributed-kv-store       │
+  │                                                                     │
+  │   Node 0 :8000    Node 1 :8001    Node 2 :8002                      │
+  │   Consistent hashing · Replication · Automatic failover             │
+  └─────────────────────────────────────────────────────────────────────┘
+```
+
+### Request Flow (end to end)
+
+```
+1.  Client sends:  POST /runs  { agent_id, session_id, user_id, input }
+2.  API layer:     Looks up AgentDefinition from AgentRegistry
+3.  Engine:        START → LOAD_MEMORY
+4.  Layer 2:       Returns session history + user facts for this user
+5.  Layer 1:       Serves the KV reads behind Layer 2
+6.  Engine:        BUILD_CONTEXT → CALL_LLM
+7.  Anthropic API: Returns tool_use or end_turn
+8.  Engine:        PROCESS_RESPONSE → EXECUTE_TOOL
+9.  Tool:          Runs with Pydantic-validated input, timeout, retry
+10. Engine:        OBSERVE_RESULT → WRITE_MEMORY
+11. Layer 2:       Appends tool result to working memory
+12. Engine:        CHECK_TERMINATION → (loop or RESPOND)
+13. Engine:        WRITE_MEMORY (final answer) → RESPOND
+14. Client gets:   { status, final_answer, trace_id, steps, tokens, cost, latency }
+```
 
 ---
 
-## Why Agent Execution Needs a Runtime
+## Why Not LangChain or LangGraph?
+
+**Full control over failure handling.** Every failure mode is a modeled terminal state (`FAIL`, `ESCALATE`) with a specific `failure_reason`. The behavior on tool timeout, budget exhaustion, memory unavailability, and loop detection is explicit code you can read, test in isolation, and change without touching a framework.
 
-A raw while loop is the first thing every engineer writes when building an agent. It breaks in production for predictable reasons:
+**Native integration with the project's own storage.** [Layer 2](https://github.com/Ajayvardhanreddy/agent-memory-service) and [Layer 1](https://github.com/Ajayvardhanreddy/distributed-kv-store) are custom-built distributed systems. Wiring them into a framework designed around its own state persistence would add an abstraction layer between the engine and the storage backend.
 
-- **No state visibility** — you can't tell what the agent was doing when it failed
-- **No failure recovery** — a tool timeout crashes the whole run
-- **No budget enforcement** — a misbehaving agent burns your API budget silently
-- **No loop detection** — agents get stuck calling the same tool forever
-- **No memory** — every run starts from scratch, the agent can't recall prior context
-- **No observability** — you can't replay a run to understand why the agent made a decision
+**Traceability as a first-class concern.** Every state transition emits a typed `Span` with duration, metadata, and cost. This is structural — not bolted on. In a framework, this level of observability requires fighting the framework's own logging mechanisms.
 
-This engine solves all of these. Every execution follows an explicit state machine. Every state transition is logged as a structured JSON span. Tools are typed contracts with timeout, retry, and validated input/output. Budget limits on steps, tokens, cost, and wall-clock time are enforced at every step. Memory persists across runs through Layer 2.
+**Explainability.** Every line of the orchestration loop can be explained from first principles. When an interviewer asks "how does your agent handle a tool timeout?" — the answer is one file name and a few lines of code.
 
 ---
 
-## Why Not LangGraph?
+## Quickstart
 
-LangGraph is a mature framework with good documentation. The choice to build a custom orchestration layer comes down to a few concrete reasons:
+**Prerequisites:** Python 3.11+, [uv](https://docs.astral.sh/uv/), Anthropic API key.
 
-**Full control over failure handling.** In LangGraph, failures surface through framework-level abstractions. Here, every failure mode is a modeled terminal state (`FAIL`, `ESCALATE`) with a specific `failure_reason` string. The behavior on tool timeout, memory unavailability, budget exhaustion, and loop detection is explicit code you can read, test in isolation, and change without touching the framework.
+```bash
+git clone https://github.com/Ajayvardhanreddy/agent-execution-engine.git
+cd agent-execution-engine
 
-**Native integration with own storage.** The memory layer (Layer 2) and KV store (Layer 1) are custom-built distributed systems. Wiring them into a framework designed around its own state persistence adds an abstraction layer between the engine and the storage backend. Direct integration keeps the code readable and the data model obvious.
+uv sync
+cp .env.example .env
+# Set ANTHROPIC_API_KEY in .env
 
-**Traceability as a first-class concern.** Every state transition in the custom state machine emits a typed `Span` with duration, metadata, and cost. This isn't bolted on — it's structural. In a framework, adding this level of observability requires fighting the framework's own logging and tracing mechanisms.
+# Run the support agent demo
+make demo
 
-**Explainability.** Every line of the orchestration loop can be explained from first principles. When an interviewer asks "how does your agent handle a tool timeout?" — the answer is one file name and a few lines of code, not "it depends on the LangGraph version."
+# Run the engineering agent demo
+make demo-eng
 
----
+# Start the REST API (auto-registers both agents on startup)
+make serve          # http://localhost:9000
 
-## Architecture
+# Run tests
+make test           # 140 tests
 
+# Run evals (makes real LLM calls — costs money)
+make eval           # 20 support benchmark cases
+make eval-engineering  # 10 engineering benchmark cases
 ```
- ┌─────────────┐                     ┌────────────────────────────────────────────────┐
- │   Caller    │   AgentRun          │                  AgentEngine                   │
- │  (CLI/API)  │ ──────────────────▶ │                                                │
- └─────────────┘                     │  ┌──────────────────────────────────────────┐  │
-                                     │  │              State Machine               │  │
-                                     │  │                                          │  │
-                                     │  │  START → LOAD_MEMORY → BUILD_CONTEXT     │  │
-                                     │  │                              ↓            │  │
-                                     │  │                         CALL_LLM ◀──┐    │  │
-                                     │  │                              ↓       │    │  │
-                                     │  │                     PROCESS_RESPONSE  │    │  │
-                                     │  │                       ↙         ↘     │    │  │
-                                     │  │              EXECUTE_TOOL      RESPOND ●   │  │
-                                     │  │                   ↓          ESCALATE ●    │  │
-                                     │  │            OBSERVE_RESULT      FAIL   ●    │  │
-                                     │  │                   ↓                        │  │
-                                     │  │            WRITE_MEMORY                    │  │
-                                     │  │                   ↓                        │  │
-                                     │  │           CHECK_TERMINATION ───────────────┘  │
-                                     │  └──────────────────────────────────────────┘  │
-                                     │        ↓                ↓              ↓        │
-                                     │  ┌──────────┐   ┌───────────┐   ┌──────────┐  │
-                                     │  │   Tool   │   │  Layer 2  │   │Anthropic │  │
-                                     │  │ Registry │   │  Memory   │   │   LLM    │  │
-                                     │  │ Executor │   │  Service  │   │   API    │  │
-                                     │  └──────────┘   └───────────┘   └──────────┘  │
-                                     │                                                │
-                                     │  TraceCollector → one JSON Span per transition │
-                                     └────────────────────────────────────────────────┘
-                                                           ↓
-                                                    AgentRunResult
-                                           { status, final_answer, trace_id,
-                                             steps_taken, tokens_used, cost_usd }
+
+### Docker (Layer 3 only)
+
+```bash
+cp .env.example .env   # set ANTHROPIC_API_KEY
+
+docker compose up
+# Engine API:  http://localhost:9000
+# MCP Server:  http://localhost:9001  (SSE transport)
 ```
 
 ---
 
 ## Core Concepts
 
-### AgentRun — the input contract
+### AgentRegistry — server-side configuration
 
-Every execution starts with a structured request:
+Agents are registered once, server-side. Callers send only `agent_id`, `session_id`, `user_id`, and `input`; they never control system prompts, tools, or budgets.
 
 ```python
-AgentRun(
-    agent_id    = "support_agent",
-    session_id  = "sess_abc123",
-    user_id     = "user_456",
-    input       = "I need a refund for order ORD-789.",
-    tools       = ["order_lookup", "refund_request", "escalate_to_human"],
-    system_prompt = "You are a customer support agent...",
-    budget      = RunBudget(max_steps=15, max_tokens=10_000, max_cost_usd=0.50),
+# Registered at startup (engine/api/app.py)
+AgentDefinition(
+    agent_id      = "support_agent",
+    description   = "Customer support for ShopEasy",
+    system_prompt = "...",
+    tools         = ["order_lookup", "refund_request", "escalate_to_human"],
+    default_budget = RunBudget(max_steps=10, max_tokens=8_000, max_cost_usd=0.30),
 )
+
+# Caller sends (HTTP request):
+# POST /runs  { "agent_id": "support_agent", "session_id": "...", "user_id": "...", "input": "..." }
 ```
 
-### State machine — the execution model
+This is analogous to how AWS Lambda separates function configuration from invocation — the caller triggers execution but cannot redefine it.
+
+### State Machine — the execution model
 
-The agent never runs in a free-form loop. Every execution follows this exact state machine. Each arrow is a valid transition; invalid transitions raise immediately.
+Every run follows this exact state machine. Invalid transitions raise immediately.
 
 ```
 START → LOAD_MEMORY → BUILD_CONTEXT → CALL_LLM → PROCESS_RESPONSE
-                                          ↑              ↓         ↘
-                               CHECK_TERMINATION    EXECUTE_TOOL   RESPOND ●
-                                          ↑              ↓         ↗
-                                    WRITE_MEMORY  ← OBSERVE_RESULT     ESCALATE ●
+                                           ↑              ↓          ↘
+                                  CHECK_TERMINATION  EXECUTE_TOOL   RESPOND ●
+                                           ↑              ↓          ↗
+                                     WRITE_MEMORY ← OBSERVE_RESULT     ESCALATE ●
                                                                         FAIL ●
 ```
 
 Terminal states: `RESPOND`, `ESCALATE`, `FAIL`. Every non-terminal state emits a JSON span before transitioning.
 
-### Tools — the typed contract
+### ToolRegistry — typed contracts
 
-Every tool registered in the engine must provide:
+Every tool is a typed contract with validated input/output, timeout, retry, and explicit error policy.
 
 ```python
 ToolDefinition(
     name             = "order_lookup",
-    description      = "Look up an order by ID...",   # shown to the LLM
-    input_schema     = OrderLookupInput,               # Pydantic model — validated before call
-    output_schema    = OrderDetails,                   # Pydantic model — validated after call
-    fn               = _order_lookup,                  # async callable
+    description      = "Look up an order by ID. Returns status, items, dates.",
+    input_schema     = OrderLookupInput,    # Pydantic — validated before call
+    output_schema    = OrderDetails,         # Pydantic — validated after call
+    fn               = _order_lookup,        # async callable
     timeout_seconds  = 5,
     max_retries      = 2,
-    permission_level = "read",                         # read | write | escalate
-    on_timeout       = "fail",                         # fail | escalate | skip
-    on_error         = "return_structured_error",      # fail | escalate | return_structured_error
+    on_timeout       = "fail",               # fail | escalate | skip
+    on_error         = "return_structured_error",
 )
 ```
 
-Tools never raise into the agent loop. Every call returns a `ToolResult` with `success`, `output`, `error`, `latency_ms`, and `retries_used`. Failures are structured and returned to the LLM as context, not exceptions.
+Tools never raise into the agent loop. Every call returns a `ToolResult` — failures are structured and returned to the LLM as context, not exceptions.
 
-### Memory — four namespaces
+### Memory — four namespaces via [Layer 2](https://github.com/Ajayvardhanreddy/agent-memory-service)
 
-All memory goes through the [Layer 2 Memory Service](https://github.com/Ajayvardhanreddy/agent-memory-service) HTTP API. The engine uses four distinct namespaces:
-
-| Namespace | Key convention | Contents | Lifetime |
+| Namespace | Key | Contents | Lifetime |
 |---|---|---|---|
-| Session | `session:{session_id}` | Conversation history for this session | 24 hours |
-| User | `user:{user_id}` | Persistent facts about the user across all sessions | 30 days |
-| Working | `working:{run_id}` | Tool results for the current run | Run duration + 1h; deleted explicitly |
-| Audit | `audit:{run_id}` | Immutable run record (status, cost, trace_id) | 90 days |
+| Session | `session:{session_id}` | Conversation history for this session | 24h |
+| User | `user:{user_id}` | Persistent facts across all sessions | 30 days |
+| Working | `working:{run_id}` | Tool results for the current run | Run + 1h |
+| Audit | `audit:{run_id}` | Immutable run record | 90 days |
+
+Layer 2 writes these to [Layer 1](https://github.com/Ajayvardhanreddy/distributed-kv-store) — the distributed KV store with consistent hashing and automatic failover.
 
-**Memory availability:** If Layer 2 returns 503 (KV store down), the engine fails fast with `status="failed"`. If Layer 2 is unreachable (not running), the engine degrades gracefully and runs without memory — useful for local development.
+**Memory failure modes:** If Layer 2 returns a 503, the run transitions to `FAIL` immediately. If Layer 2 is unreachable, the engine degrades gracefully and runs without memory, which is useful for local development.
 
-### Traces — observability contract
+### Traces — per-step observability
 
-Every run produces one `Trace` stored in memory. Every state transition produces one `Span`:
+Every state transition produces one JSON span:
 
 ```json
 {
-  "event": "span",
   "span_id": "sp_000004",
   "trace_id": "trace_ab70283be2e2",
-  "step": 1,
+  "step": 2,
   "from_state": "CALL_LLM",
   "to_state": "PROCESS_RESPONSE",
-  "timestamp_ms": 1779340477548,
   "duration_ms": 1495,
   "metadata": {
     "model": "claude-haiku-4-5-20251001",
     "input_tokens": 1408,
     "output_tokens": 129,
     "stop_reason": "tool_use",
-    "total_cost_usd": 0.001642
+    "total_cost_usd": 0.0016
   }
 }
 ```
 
+Replay any run step-by-step:
+
+```bash
+curl "http://localhost:9000/traces/$TRACE_ID/replay"   # TRACE_ID holds the trace_id returned by POST /runs
+```
+
 ### Budget — hard limits per run
 
 ```python
 RunBudget(
-    max_steps       = 15,    # maximum state machine iterations
-    max_tokens      = 10000, # total tokens across all LLM calls
-    max_cost_usd    = 0.50,  # maximum USD spend for this run
-    timeout_seconds = 90,    # wall-clock timeout
+    max_steps       = 10,    # state machine iterations
+    max_tokens      = 8_000, # total tokens across all LLM calls
+    max_cost_usd    = 0.30,  # maximum USD spend
+    timeout_seconds = 120,   # wall-clock timeout
 )
 ```
 
-Budget is checked at `CHECK_TERMINATION` after every tool round-trip. If any limit is hit, the run transitions to `FAIL` with `status="budget_exceeded"` or `status="timeout"`.
+Enforced at `CHECK_TERMINATION` after every tool round-trip. Any limit hit → `FAIL` with `status="budget_exceeded"`.
 
 ---
 
-## Quickstart
+## REST API
 
-**Prerequisites:** Python 3.11+, [uv](https://docs.astral.sh/uv/), an Anthropic API key.
-
-```bash
-git clone https://github.com/Ajayvardhanreddy/agent-execution-engine.git
-cd agent-execution-engine
-
-# Install dependencies
-uv sync
+```
+POST   /runs                      Submit a run (agent_id + session_id + user_id + input)
+GET    /runs/{run_id}             Get a completed run result
 
-# Set up environment
-cp .env.example .env
-# Edit .env and set ANTHROPIC_API_KEY=your-key-here
+GET    /agents                    List registered agents (summary)
+GET    /agents/{agent_id}         Full agent definition
+POST   /agents                    Register a new agent
+PUT    /agents/{agent_id}         Update agent definition
+DELETE /agents/{agent_id}         Remove agent
 
-# Run a demo scenario
-PYTHONPATH=. uv run python demos/support_agent/agent.py --scenario eligible_refund
+GET    /tools                     List all registered tools with JSON schemas
+GET    /tools/{name}              One tool schema
+POST   /tools/{name}/test         Run a tool directly with test input
 
-# Run all 7 scenarios
-PYTHONPATH=. uv run python demos/support_agent/agent.py
+GET    /traces/{trace_id}         Raw trace
+GET    /traces/{trace_id}/replay  Step-by-step replay
 
-# Run tests
-uv run pytest
+GET    /health                    Liveness probe
+GET    /ready                     Readiness probe (checks Layer 2)
+GET    /metrics                   Prometheus metrics
 ```
 
-> **Docker Compose** (full stack with Layer 1 + Layer 2): wired up in Phase 6.
-> Until then, run the engine directly as shown above.
+Interactive docs at `http://localhost:9000/docs` when the server is running.
 
 ---
 
-## Demo 1: Customer Support Agent
-
-**Location:** `demos/support_agent/`
-
-An e-commerce customer support agent with 5 tools and 7 demo scenarios. Proves multi-turn tool execution, policy-grounded decisions, escalation logic, and retry on tool failure.
+## MCP Integration
 
-### Tools
+The engine ships an MCP server — any MCP-compatible AI client can discover and call agents without writing integration code.
 
-| Tool | Permission | What it does |
-|---|---|---|
-| `order_lookup` | read | Look up order status, items, dates by order ID |
-| `refund_policy_search` | read | Retrieve applicable policy for a given situation |
-| `refund_request` | write | Submit a refund for an eligible order |
-| `ticket_create` | write | Open a support ticket |
-| `escalate_to_human` | **escalate** | Hand off to a human — triggers immediate `ESCALATE` state |
+### Claude Code (stdio — local dev)
 
-All tools are mocked with deterministic fixture data. No external dependencies.
-
-### Scenarios
+The repo includes `.mcp.json` — Claude Code picks it up automatically:
 
 ```bash
-PYTHONPATH=. uv run python demos/support_agent/agent.py --scenario <name>
+make serve   # engine on :9000
+# restart Claude Code in this directory
 ```
 
-| Scenario | What it tests | Expected status |
-|---|---|---|
-| `eligible_refund` | Happy path: damaged item, refund approved | `completed` |
-| `ineligible_refund` | Order outside 30-day window | `completed` |
-| `missing_order_id` | Agent must ask for order ID before calling tools | `completed` |
-| `frustrated_customer` | Angry user, valid claim — agent stays professional | `completed` |
-| `tool_timeout_retry` | `order_lookup` times out once, succeeds on retry | `completed` |
-| `policy_conflict_nonrefundable` | Digital product — non-refundable exception | `completed` |
-| `fraud_risk_escalation` | High-value order + repeated claim → escalate | `escalated` |
-
-### Sample run output
-
+Then in Claude Code:
 ```
-SCENARIO: eligible_refund
-INPUT: Hi, I received my order ORD-789 last week and the item arrived damaged...
-
-{"event":"span","from_state":"CALL_LLM","to_state":"PROCESS_RESPONSE","step":1,
- "duration_ms":1495,"metadata":{"input_tokens":1408,"stop_reason":"tool_use","total_cost_usd":0.0016}}
-
-{"event":"span","from_state":"EXECUTE_TOOL","to_state":"OBSERVE_RESULT","step":1,
- "metadata":{"tools_called":["order_lookup","refund_policy_search"],"all_succeeded":true}}
-
-{"event":"span","from_state":"CALL_LLM","to_state":"PROCESS_RESPONSE","step":3,
- "metadata":{"stop_reason":"end_turn","total_cost_usd":0.0064}}
-
-RESULT
-{
-  "status": "completed",
-  "final_answer": "Refund Approved ✓\n- Refund ID: REF-29637\n- Amount: $89.99\n- Timeline: 5 business days",
-  "steps_taken": 3,
-  "total_tokens_used": 5817,
-  "total_cost_usd": 0.0064,
-  "latency_ms": 5284
-}
-Match: ✓ PASS
+List the available agents.
+Use the support agent to check order ORD-789 for user u-001.
 ```
 
-### Phase 2 memory demo
+Claude Code calls `list_agents` and `run_agent` via your MCP server — no curl, no integration code.
 
-Requires Layer 2 running on `localhost:8080`:
+### Docker (SSE — remote clients)
 
 ```bash
-PYTHONPATH=. uv run python demos/support_agent/agent.py --memory-demo
+docker compose up
+# MCP server on :9001 with SSE transport
 ```
 
-Run 1: user introduces themselves as Alex → name stored in user memory.
-Run 2: same user, new session → agent greets Alex by name without being told again.
-
----
+Connect any SSE-capable MCP client to `http://localhost:9001/sse`.
 
-## Demo 2: Engineering Assistant Agent
+See [docs/mcp-setup.md](docs/mcp-setup.md) for Claude Desktop config.
 
-> **Status: Phase 5 — not yet built.** Placeholder directory exists at `demos/engineering_agent/`.
+---
 
-Will prove the same engine runs in a completely different domain. Tools: `github_issue_read`, `repo_file_search`, `code_context_retrieval`, `fix_plan_create`. Five scenarios: clear bug, missing context, large codebase, conflicting comments, test failure triage.
+## Demo Agents
 
----
+### Demo 1 — Customer Support Agent
 
-## Trace Replay
+**Location:** `demos/support_agent/`
 
-> **Status: Phase 3 — not yet built.**
+Five tools, seven scenarios, 20 benchmark eval cases. Proves multi-turn tool execution, policy-grounded decisions, escalation logic, retry on tool failure, and cross-session memory.
 
-Once built, the REST API will expose:
+| Tool | What it does |
+|---|---|
+| `order_lookup` | Look up order status, items, and dates |
+| `refund_policy_search` | Retrieve applicable refund policy |
+| `refund_request` | Submit a refund for an eligible order |
+| `ticket_create` | Open a support ticket |
+| `escalate_to_human` | Hand off to a human; triggers the `ESCALATE` state immediately |
 
 ```bash
-# Run the agent
-curl -X POST http://localhost:9000/runs -d '{"agent_id":"support_agent",...}'
-# → {"run_id":"run_abc","trace_id":"trace_xyz",...}
-
-# Replay every decision
-curl http://localhost:9000/traces/trace_xyz/replay
+make demo                                                    # eligible_refund scenario
+PYTHONPATH=. uv run python demos/support_agent/agent.py --list   # all scenarios
+PYTHONPATH=. uv run python demos/support_agent/agent.py --scenario fraud_risk_escalation
 ```
 
-The replay endpoint returns an ordered, human-readable list of every state the agent visited, every tool call made, and every LLM decision with inputs, outputs, latency, and cost at each step.
+| Scenario | Tests | Expected |
+|---|---|---|
+| `eligible_refund` | Happy path — damaged item, refund approved | `completed` |
+| `ineligible_refund` | Order outside 30-day window | `completed` |
+| `missing_order_id` | Agent asks for order ID before calling tools | `completed` |
+| `frustrated_customer` | Angry user, valid claim | `completed` |
+| `tool_timeout_retry` | `order_lookup` times out once, succeeds on retry | `completed` |
+| `policy_conflict_nonrefundable` | Digital product — non-refundable exception | `completed` |
+| `fraud_risk_escalation` | High-value order + repeated claim → escalate | `escalated` |
 
----
+### Demo 2 — Engineering Assistant Agent
 
-## Eval Methodology
+**Location:** `demos/engineering_agent/`
 
-> **Status: Phase 4 — not yet built.**
+Four tools, five scenarios, 10 benchmark eval cases. Same engine, completely different domain — proves the runtime is domain-agnostic.
 
-Once built:
+| Tool | What it does |
+|---|---|
+| `code_search` | Search codebase for functions, classes, patterns |
+| `file_read` | Read a specific file by path |
+| `pr_review` | Review a pull request for bugs, security issues, style |
+| `dependency_check` | Audit dependencies for outdated versions and CVEs |
 
 ```bash
-make eval AGENT=support_agent          # run 20 benchmark cases, emit regression report
-make eval AGENT=engineering_agent      # run 10 benchmark cases
-make eval-compare AGENT=support_agent PREV=reports/report_20260514.md
-```
-
-**Metrics measured:**
-
-| Metric | What it measures |
-|---|---|
-| Task success | Did the agent complete the user's goal? |
-| Tool correctness | Right tools in the right order? |
-| Escalation accuracy | Escalated exactly when it should have? |
-| Groundedness | Final answer based on tool outputs, not hallucination? |
-| Step efficiency | Minimum necessary steps? |
-| Cost per run | USD spent per execution |
-| Latency | Wall-clock time to completion |
-| Failure recovery | Tool failures handled gracefully? |
-
-**Regression report format:**
-```
-Task success:        17/20  (85.0%)  [prev: 16/20  +1]
-Tool correctness:    19/20  (95.0%)  [prev: 19/20   0]
-Escalation accuracy: 17/20  (85.0%)  [prev: 15/20  +2]
-Avg cost per run:    $0.038          [prev: $0.041 improved]
-
-REGRESSIONS (1)
-- case_012: task_success 1.0 → 0.0
-  Reason: agent called refund_request before order_lookup
+make demo-eng                                                        # review_pr scenario
+PYTHONPATH=. uv run python demos/engineering_agent/agent.py --list  # all scenarios
+PYTHONPATH=. uv run python demos/engineering_agent/agent.py --scenario security_investigation
 ```
 
----
-
-## Failure Modes
-
-Every failure mode has explicit handling. None of them crash the engine or produce an infinite loop.
-
-| Failure mode | Trigger | Engine behavior |
+| Scenario | Tests | Expected |
 |---|---|---|
-| Tool timeout | Tool exceeds `timeout_seconds` | `ToolError(error_type="timeout")`, retry up to `max_retries`, then `on_timeout` policy |
-| Tool malformed output | Output fails Pydantic validation | `ToolError(error_type="validation_error")`, no retry, structured error returned to LLM |
-| Tool empty result | Tool returns `None` or `{}` | `ToolError(error_type="empty_result", recoverable=True)`, LLM decides next step |
-| Unknown tool called | LLM calls unregistered tool name | Error message injected: `"Tool X is not available. Available: [...]"` |
-| Duplicate tool call | Same tool + same input called twice | Loop guard injects: `"You already called this tool. Use a different approach."` |
-| Context window overflow | Token budget approaching limit | Compressor trims oldest messages, preserves last 6 + original user message |
-| Budget exceeded | Steps / tokens / cost / time limit hit | `FAIL` with `status="budget_exceeded"` or `status="timeout"` |
-| Memory service down (503) | Layer 2 returns 503 | `FAIL` with `failure_reason="memory_unavailable"` — do not continue |
-| Memory service unreachable | Layer 2 not running | Degrade gracefully — run without memory (dev mode) |
-| LLM API error | Anthropic returns 5xx | Retry up to 3× with exponential backoff, then `FAIL` |
-| Infinite loop detected | Same state sequence repeats 3× | Loop guard forces `FAIL` with `failure_reason="Infinite loop detected"` |
-| Escalation required | Tool with `permission_level="escalate"` called | Immediate `ESCALATE` from `EXECUTE_TOOL`, bypasses `CHECK_TERMINATION` |
+| `find_function` | Locate function definition across codebase | `completed` |
+| `review_pr` | PR-42 contains a critical timing-attack vulnerability that the agent must call out | `completed` |
+| `dependency_audit` | The `cryptography` dependency has a CRITICAL CVE that the agent must surface | `completed` |
+| `inspect_file` | Read and explain a source file | `completed` |
+| `security_investigation` | Multi-step: find SQL injection, then read vulnerable file | `completed` |
 
 ---
 
-## MCP Integration
-
-> **Status: Phase 6 — not yet built.**
+## Eval Harness
 
-Once built, Claude Desktop can run agents and replay traces directly:
+30 benchmark conversations scored across 6 weighted dimensions with hard-fail semantics for safety-critical cases.
 
-```json
-{
-  "mcpServers": {
-    "agent-execution-engine": {
-      "command": "python",
-      "args": ["-m", "mcp_server.server"],
-      "cwd": "/path/to/agent-execution-engine"
-    }
-  }
-}
+```bash
+make eval              # 20 support cases
+make eval-engineering  # 10 engineering cases
+make eval-all          # all 30 cases
+make eval-case CASE=support_005   # single case
 ```
 
-Exposed tools: `run_agent`, `get_trace`, `replay_trace`, `run_eval`, `list_tools`.
-
----
+| Scorer | Weight | Hard fail? | What it measures |
+|---|---|---|---|
+| `task_completion` | 2.0 | Yes | Did the run reach the expected terminal state? |
+| `tool_selection` | 1.5 | No | Right tools called, no unexpected tools? |
+| `answer_quality` | 1.5 | No | Required keywords present, forbidden terms absent? |
+| `escalation_accuracy` | 1.5 | Yes | Escalated when required? False negative = security failure |
+| `cost_efficiency` | 0.5 | No | Cost within the case budget? |
+| `latency` | 0.5 | No | Response within latency threshold? |
 
-## Layer Integration
+**Hard-fail semantics:** A case that fails a hard-fail scorer fails outright, regardless of its other scores. For example, a fraud scenario that completes instead of escalating has missed a required escalation; that false negative is treated as a security failure, so a high weighted average cannot rescue the case.
 
-The engine is the top layer of a three-layer AI infrastructure stack. Every layer is independently deployable and tested.
-
-```
-┌─────────────────────────────────┐
-│   Layer 3: Agent Execution      │  ← this repo
-│   Engine (port 9000)            │
-│   Orchestration, tools, evals   │
-└─────────────┬───────────────────┘
-              │ HTTP (localhost:8080)
-              ▼
-┌─────────────────────────────────┐
-│   Layer 2: Agent Memory         │  github.com/Ajayvardhanreddy/agent-memory-service
-│   Service (port 8080)           │
-│   4 memory namespaces, streams  │
-└─────────────┬───────────────────┘
-              │ internal
-              ▼
-┌─────────────────────────────────┐
-│   Layer 1: Distributed KV       │  github.com/Ajayvardhanreddy/distributed-kv-store
-│   Store (ports 8000–8002)       │
-│   Consistent hashing, failover  │
-└─────────────────────────────────┘
-```
+Pass threshold: **0.70** overall weighted score.
 
 ---
 
-## Build Status
+## Failure Modes
 
-| Phase | What | Status |
+Every failure mode below has explicit handling. None of them crashes the engine or produces an infinite loop.
+
+| Failure | Trigger | Behavior |
 |---|---|---|
-| Phase 1 | Orchestrator, state machine, tool execution, support agent demo | ✅ Complete |
-| Phase 2 | Memory integration (session, user, working, audit) | ✅ Complete |
-| Phase 3 | REST API, trace replay, Prometheus metrics | 🔲 Not started |
-| Phase 4 | Eval harness, 6 scorers, 30 benchmark cases, regression reports | 🔲 Not started |
-| Phase 5 | Engineering assistant agent demo | 🔲 Not started |
-| Phase 6 | MCP server, Docker Compose full stack | 🔲 Not started |
+| Tool timeout | Exceeds `timeout_seconds` | Retry up to `max_retries`, then apply `on_timeout` policy |
+| Malformed tool output | Fails Pydantic validation | Structured error returned to LLM — no retry |
+| Empty tool result | Returns `None` or `{}` | Recoverable error — LLM decides next step |
+| Unknown tool called | LLM calls an unregistered tool name | Error message injected: `"Tool X not available. Available: [...]"` |
+| Duplicate tool call | Same tool + same input called twice | Loop guard injects a hint telling the LLM to try a different approach |
+| Context overflow | Token budget approaching limit | Compressor trims oldest messages, preserves the last 6 plus the original user message |
+| Budget exceeded | Steps / tokens / cost / time limit hit | `FAIL` with `status="budget_exceeded"` or `status="timeout"` |
+| Infinite loop | Same state sequence repeats 3× | Loop guard forces `FAIL` |
+| Layer 2 down (503) | Memory service returns 503 | `FAIL`; do not continue without memory |
+| Layer 2 unreachable | Memory service not running | Graceful degrade: run without memory (dev mode) |
+| LLM API error | Anthropic returns 5xx | Retry up to 3× with exponential backoff, then `FAIL` |
+| Escalation triggered | Tool with `permission_level="escalate"` called | Immediate `ESCALATE` from `EXECUTE_TOOL` |
 
 ---
 
-## Architectural Decision Records
-
-Written as each phase is stabilised:
+## Tech Stack
 
-| ADR | Decision |
+| Concern | Choice |
 |---|---|
-| [001](docs/decisions/001-custom-orchestration-over-langgraph.md) | Custom orchestration over LangGraph |
-| [002](docs/decisions/002-anthropic-sdk-direct-over-langchain.md) | Anthropic SDK direct over LangChain |
-| [003](docs/decisions/003-state-machine-over-while-loop.md) | Explicit state machine over while loop |
-| [004](docs/decisions/004-typed-tool-registry.md) | Typed tool registry with Pydantic |
-| [005](docs/decisions/005-eval-methodology.md) | Structured eval methodology |
-
-> ADR documents are written as phases complete. Pending Phase 1 + 2 stabilisation.
+| Language | Python 3.11+ |
+| LLM | Anthropic SDK — direct, no framework |
+| API | FastAPI + Uvicorn |
+| Validation | Pydantic v2 |
+| MCP server | FastMCP |
+| Memory backend | [Layer 2 — agent-memory-service](https://github.com/Ajayvardhanreddy/agent-memory-service) |
+| KV backend | [Layer 1 — distributed-kv-store](https://github.com/Ajayvardhanreddy/distributed-kv-store) |
+| HTTP client | httpx (async) |
+| Observability | Structured JSON spans + Prometheus metrics |
+| Testing | pytest + pytest-asyncio + pytest-httpx |
+| Package manager | uv |
+| Containerisation | Docker + Docker Compose |
 
 ---
 
-## Known Limitations
+## Project Structure
 
-| Limitation | Notes |
-|---|---|
-| No REST API yet | Engine is runnable via Python only until Phase 3 |
-| No eval scoring yet | Eval harness built in Phase 4 |
-| No MCP server yet | Claude Desktop integration in Phase 6 |
-| Docker Compose not wired | Skeletons exist; full stack in Phase 6 |
-| Tools are mocked | Support agent uses fixture data, no real order database |
-| No auth on any endpoint | By design for portfolio; production would add API key auth |
-| Single-node memory in dev mode | When Layer 2 is unavailable, no memory persistence |
-| User fact extraction is regex-only | Name patterns only; no LLM-based extraction yet |
-| No streaming responses | All runs are synchronous; streaming added in Phase 3 |
+```
+engine/
+  api/           REST API — routes, agent registry, dependencies, middleware
+  evals/         Eval framework — base contracts, 6 scorers, EvalSuite
+  memory/        Layer 2 client — 4 namespaces, graceful degrade
+  mcp/           MCP server — run_agent, list_agents, stdio + SSE
+  observability/ Trace collection, Prometheus metrics
+  orchestrator/  State machine, engine loop, step runner, loop guard, budget
+  tools/         ToolDefinition, ToolRegistry, ToolExecutor, error types
+
+demos/
+  support_agent/    5 tools, 7 scenarios, system prompt
+  engineering_agent/ 4 tools, 5 scenarios, system prompt
+
+evals/
+  dataset/       30 benchmark EvalCases (support + engineering)
+  reports/       JSON regression reports (gitignored)
+  runner.py      CLI runner — --suite, --case, --no-save
+  report.py      Terminal output + JSON report formatter
+
+docs/
+  mcp-setup.md         Claude Code + Docker MCP setup guide
+  adding-tools.md      How to write and register custom tools
+  registering-agents.md How to register agents via the API
+  tool_template.py     Copy-paste starting point for tool authors
+
+tests/
+  unit/          State machine, scorers, registry, executor, budget, traces
+  integration/   API routes, MCP server, memory client
+```
 
 ---
 
-## Roadmap
-
-Phases 3–6 in order:
-
-**Phase 3 — REST API + Trace Replay**
-`POST /runs`, `GET /runs/{id}`, `GET /traces/{id}/replay`, `GET /metrics` (Prometheus). Full trace reconstructed as human-readable ordered event list.
-
-**Phase 4 — Eval Harness**
-`make eval AGENT=support_agent`. 6 scorers (task success, tool correctness, escalation, groundedness, cost, latency). 30 benchmark cases. Regression report with diff from previous run.
-
-**Phase 5 — Engineering Assistant Agent**
-Same engine, different domain. 4 tools (GitHub issue read, repo file search, code context retrieval, fix plan creation). 5 scenarios. 10 eval cases.
-
-**Phase 6 — MCP Server + Docker Compose**
-FastMCP server exposing engine as 5 MCP tools. `docker-compose up` brings up all three layers in one command.
-
----
+## Build Status
 
-## Tech Stack
+All 6 phases complete.
 
-| Concern | Choice |
-|---|---|
-| Language | Python 3.11+ |
-| LLM | Anthropic SDK (direct — no LangChain) |
-| API framework | FastAPI + Uvicorn (Phase 3) |
-| Data validation | Pydantic v2 |
-| Memory backend | Layer 2 Memory Service (HTTP) |
-| KV backend | Layer 1 Distributed KV Store |
-| HTTP client | httpx (async) |
-| MCP server | fastmcp (Phase 6) |
-| Testing | pytest + pytest-asyncio + pytest-httpx |
-| Package manager | uv |
-| Observability | Structured JSON spans, Prometheus metrics (Phase 3) |
+| Phase | What | Status |
+|---|---|---|
+| Phase 1 | State machine, tool execution, budget, loop detection, support agent demo | ✅ |
+| Phase 2 | Memory integration — session, user, working, audit namespaces | ✅ |
+| Phase 3 | REST API, AgentRegistry, trace replay, Prometheus metrics | ✅ |
+| Phase 4 | Eval harness — 6 scorers, hard-fail semantics, 20 support cases, regression reports | ✅ |
+| Phase 5 | Engineering agent — 4 tools, 5 scenarios, 10 eval cases | ✅ |
+| Phase 6 | MCP server, Docker Compose, Claude Code `.mcp.json`, startup agent registration | ✅ |