khalilCodeX · khalilCodeX · Jan 14, 2026 · Jan 14, 2026
diff --git a/README.md b/README.md
@@ -197,7 +197,7 @@ result = client.replay(
 | `max_revisions` | `int` | `3` | Max revision attempts before fallback |
 | `strictness` | `Strictness` | `BALANCED` | Eval gate strictness (LENIENT, BALANCED, STRICT) |
 | `retriever_fn` | `Callable` | `None` | Custom retriever callback for RAG |
-| `enable_cache` | `bool` | `True` | Enable response caching (roadmap) |
+| `enable_cache` | `bool` | `True` | Enable LLM response caching |
 
 ---
 
@@ -264,10 +264,16 @@ pytest tests/test_client.py::test_basic_run_without_retriever -v
 
 ## Roadmap
 
-- [ ] Multi-provider fallback (Anthropic, Gemini)
-- [ ] Response caching implementation
-- [ ] Streamlit ops dashboard
+- [x] Multi-provider support (OpenAI, Anthropic)
+- [x] Response caching implementation
+- [x] Streamlit ops dashboard
+- [x] Trace persistence with SQLite + WAL mode
+- [x] Eval gate pattern with revision loop
+- [x] Cost tracking per request
+- [x] CI/CD with GitHub Actions
+- [x] Architecture Decision Records (ADRs)
 - [ ] CLI tool (`traceflow run "query"`)
+- [ ] Budget-aware model fallback
 - [ ] Advanced evaluators (relevance scoring, citation validation)
 - [ ] Async execution support
 - [ ] OpenTelemetry export integration

diff --git a/docs/adr/0001-use-sqlite-with-wal-mode.md b/docs/adr/0001-use-sqlite-with-wal-mode.md
@@ -0,0 +1,72 @@
+# ADR-0001: Use SQLite with WAL Mode for Persistence
+
+## Status
+
+Accepted
+
+## Context
+
+TraceFlow Lite needs a persistence layer to store traces, steps, evaluations, and cached LLM responses. The system is designed as a lightweight, single-node observability tool that developers can run locally or in small-scale deployments.
+
+Requirements:
+- Zero external dependencies (no separate database server)
+- ACID compliance for data integrity
+- Support for concurrent reads during writes (Streamlit UI + agent execution)
+- Simple deployment and backup (single file)
+- Good performance for read-heavy workloads (trace replay, analytics)
+
+Options considered:
+1. **PostgreSQL/MySQL** - Full-featured but requires external server
+2. **SQLite (default journal mode)** - Simple but blocks readers during writes
+3. **SQLite with WAL mode** - Simple with concurrent read/write support
+4. **File-based JSON/JSONL** - No schema, poor query performance
+5. **Redis** - In-memory, requires external server
+
+## Decision
+
+Use SQLite with Write-Ahead Logging (WAL) mode enabled.
+
+Configuration:
+```python
+connection = sqlite3.connect(
+    db_path,
+    check_same_thread=False  # Required for Streamlit's threading model
+)
+connection.execute("PRAGMA journal_mode=WAL")
+connection.execute("PRAGMA busy_timeout=5000")
+```
+
+Schema design:
+- `traces` - Parent records for agent runs
+- `steps` - Individual LLM calls within a trace
+- `evals` - Evaluation results linked to steps
+- `llm_cache` - Cached LLM responses keyed by hash
+
+## Consequences
+
+### Positive
+
+- **Zero infrastructure** - No database server to install, configure, or maintain
+- **Portable** - Single `.db` file can be copied, backed up, or shared
+- **Concurrent access** - WAL mode allows reads during writes (critical for live UI)
+- **ACID compliance** - Full transaction support with rollback capability
+- **Fast reads** - Excellent performance for the read-heavy trace viewer
+- **Python stdlib** - No additional dependencies beyond `sqlite3`
+
+### Negative
+
+- **Single-node only** - Cannot scale horizontally (acceptable for target use case)
+- **Write throughput** - Lower than dedicated databases (sufficient for observability)
+- **No built-in replication** - Manual backup required for durability
+- **Threading complexity** - Requires `check_same_thread=False` for Streamlit
+
+### Neutral
+
+- WAL mode creates additional `-wal` and `-shm` files alongside the database
+- Database file grows monotonically; periodic `VACUUM` may be needed
+
+## References
+
+- [SQLite WAL Mode Documentation](https://www.sqlite.org/wal.html)
+- [SQLite Threading Modes](https://www.sqlite.org/threadsafe.html)
+- [Streamlit Threading Model](https://docs.streamlit.io/library/api-reference/performance/st.cache_data)
diff --git a/docs/adr/0002-langgraph-for-agent-orchestration.md b/docs/adr/0002-langgraph-for-agent-orchestration.md
@@ -0,0 +1,71 @@
+# ADR-0002: Use LangGraph for Agent Orchestration
+
+## Status
+
+Accepted
+
+## Context
+
+TraceFlow Lite implements a multi-step agent workflow with distinct phases: intake, planning, execution, evaluation, and revision. The workflow requires:
+
+- Conditional branching based on evaluation results
+- State management across nodes
+- Clear separation between planning and execution phases
+- Support for iterative refinement (revision loops)
+- Observable, debuggable execution flow
+
+Options considered:
+1. **Custom state machine** - Full control but significant implementation effort
+2. **LangChain LCEL** - Good for linear chains, limited branching support
+3. **LangGraph** - Graph-based orchestration with conditional edges
+4. **Temporal/Prefect** - Enterprise workflow engines, heavy dependencies
+5. **Simple function composition** - Minimal overhead but poor observability
+
+## Decision
+
+Use LangGraph for agent workflow orchestration.
+
+Graph structure:
+```
+intake_node → planner_node → executor_node → eval_node
+                   ↑                              ↓
+                   └──── revision_node ←──────────┘
+                              ↓
+                         (conditional: retry or end)
+```
+
+Key design choices:
+- **TypedDict state** - Strongly typed state passed between nodes
+- **Conditional edges** - Route based on evaluation pass/fail
+- **Node isolation** - Each node has single responsibility
+- **Explicit wiring** - Graph structure defined declaratively
+
+## Consequences
+
+### Positive
+
+- **Visual clarity** - Graph structure maps directly to agent workflow
+- **Conditional routing** - Native support for eval-based branching
+- **State typing** - TypedDict provides IDE support and validation
+- **Debugging** - Clear node boundaries for step-by-step inspection
+- **LangChain ecosystem** - Compatible with LangChain tools and callbacks
+- **Extensibility** - Easy to add new nodes or modify routing logic
+
+### Negative
+
+- **Dependency** - Adds `langgraph` as a required dependency
+- **Learning curve** - Graph concepts may be unfamiliar to some developers
+- **Overhead** - More abstraction than simple function calls
+- **Version coupling** - Must track LangGraph API changes
+
+### Neutral
+
+- Graph compilation creates a static execution plan
+- State is passed by value between nodes (immutable pattern)
+- Requires explicit `END` node for termination
+
+## References
+
+- [LangGraph Documentation](https://python.langchain.com/docs/langgraph)
+- [LangGraph Conditional Edges](https://langchain-ai.github.io/langgraph/concepts/low_level/#conditional-edges)
+- [TraceFlow Workflow Diagram](../Architecture/traceflow_workflow.md)
diff --git a/docs/adr/0003-provider-abstraction-pattern.md b/docs/adr/0003-provider-abstraction-pattern.md
@@ -0,0 +1,77 @@
+# ADR-0003: Provider Abstraction Pattern for LLM Integration
+
+## Status
+
+Accepted
+
+## Context
+
+TraceFlow Lite needs to support multiple LLM providers (OpenAI, Anthropic) with the ability to:
+
+- Switch providers without changing application code
+- Route different tasks to different models (e.g., fast model for planning, powerful model for execution)
+- Add new providers with minimal effort
+- Apply cross-cutting concerns (caching, retry, cost tracking) uniformly
+
+Options considered:
+1. **Direct SDK calls** - Simple but tightly coupled, no abstraction
+2. **LangChain ChatModel** - Standard interface but heavy dependency
+3. **Custom Protocol/ABC** - Lightweight abstraction, full control
+4. **litellm** - Unified interface but additional dependency
+5. **Provider factory pattern** - Abstraction with runtime selection
+
+## Decision
+
+Implement a custom provider abstraction using Python's Protocol pattern.
+
+Architecture:
+```python
+class LLMProvider(Protocol):
+    """Protocol defining the LLM provider interface."""
+
+    def complete(
+        self,
+        messages: list[dict[str, str]],
+        model: str,
+        temperature: float = 0.7,
+        max_tokens: int = 2048,
+    ) -> LLMResponse: ...
+```
+
+Components:
+- **`base.py`** - Protocol definition and response types
+- **`openai_provider.py`** - OpenAI implementation
+- **`anthropic_provider.py`** - Anthropic implementation
+- **`router.py`** - Model-to-provider routing logic
+- **`cache_provider.py`** - Decorator for caching responses
+- **`retry.py`** - Retry logic with exponential backoff
+- **`cost.py`** - Token counting and cost calculation
+
+## Consequences
+
+### Positive
+
+- **Loose coupling** - Application code depends on Protocol, not implementations
+- **Easy testing** - Mock providers for unit tests without API calls
+- **Uniform interface** - Same `complete()` signature across all providers
+- **Decorator pattern** - Caching, retry, logging applied transparently
+- **No heavy dependencies** - Only `openai` and `anthropic` SDKs needed
+- **Type safety** - Protocol provides IDE autocomplete and type checking
+
+### Negative
+
+- **Maintenance burden** - Must update implementations for SDK changes
+- **Feature gaps** - May not expose all provider-specific features
+- **Translation overhead** - Must convert between provider message formats
+
+### Neutral
+
+- Each provider handles its own message format translation
+- Response normalization strips provider-specific metadata
+- Cost tracking requires manual price table maintenance
+
+## References
+
+- [Python Protocol (PEP 544)](https://peps.python.org/pep-0544/)
+- [OpenAI Python SDK](https://github.com/openai/openai-python)
+- [Anthropic Python SDK](https://github.com/anthropics/anthropic-sdk-python)
diff --git a/docs/adr/0004-llm-response-caching.md b/docs/adr/0004-llm-response-caching.md
@@ -0,0 +1,104 @@
+# ADR-0004: LLM Response Caching Strategy
+
+## Status
+
+Accepted
+
+## Context
+
+LLM API calls are expensive (cost) and slow (latency). During development and testing, the same prompts are often sent repeatedly. TraceFlow Lite needs a caching mechanism to:
+
+- Reduce API costs during development iterations
+- Speed up test execution
+- Enable deterministic replay for debugging
+- Support offline development with cached responses
+
+Requirements:
+- Cache key must uniquely identify equivalent requests
+- Cache must be persistent across sessions
+- Cache hits must be trackable for observability
+- Caching must be toggleable (disable for production)
+
+Options considered:
+1. **In-memory LRU cache** - Fast but not persistent, lost on restart
+2. **Redis** - Persistent but requires external service
+3. **File-based (JSON)** - Simple but poor query performance
+4. **SQLite table** - Persistent, queryable, no new dependencies
+5. **LangChain caching** - Built-in but couples to LangChain
+
+## Decision
+
+Implement SQLite-based LLM response caching with content-addressable keys.
+
+Cache key computation:
+```python
+def compute_key(
+    model: str,
+    messages: list[dict],
+    temperature: float,
+    max_tokens: int
+) -> str:
+    """SHA256 hash of normalized request parameters."""
+    payload = json.dumps({
+        "model": model,
+        "messages": messages,
+        "temperature": temperature,
+        "max_tokens": max_tokens,
+    }, sort_keys=True)
+    return hashlib.sha256(payload.encode()).hexdigest()
+```
+
+Schema:
+```sql
+CREATE TABLE llm_cache (
+    cache_key TEXT PRIMARY KEY,
+    model TEXT,
+    response_content TEXT,
+    prompt_tokens INTEGER,
+    completion_tokens INTEGER,
+    created_at TEXT
+)
+```
+
+Integration via decorator pattern:
+```python
+class CachedProvider:
+    """Wrapper that checks cache before calling underlying provider."""
+
+    def complete(self, messages, model, **kwargs) -> LLMResponse:
+        key = self.cache.compute_key(model, messages, **kwargs)
+        if cached := self.cache.get(key):
+            return LLMResponse(..., cache_hit=True)
+        response = self.provider.complete(messages, model, **kwargs)
+        self.cache.set(key, response)
+        return response
+```
+
+## Consequences
+
+### Positive
+
+- **Cost savings** - Identical requests return cached responses instantly
+- **Faster iteration** - Development loop accelerated significantly
+- **Deterministic tests** - Same input always produces same output
+- **Observability** - `cache_hit` field tracks cache effectiveness
+- **No new dependencies** - Reuses existing SQLite infrastructure
+- **Toggle support** - Enable/disable via UI or configuration
+
+### Negative
+
+- **Stale responses** - Cache doesn't invalidate when models update
+- **Storage growth** - Cache table grows unbounded (manual cleanup needed)
+- **Temperature sensitivity** - Different temperatures = different cache keys
+- **Not suitable for production** - Should be disabled for real user requests
+
+### Neutral
+
+- Cache key includes all parameters affecting output
+- JSON normalization with `sort_keys=True` ensures consistent hashing
+- Cache miss incurs small overhead for key computation
+
+## References
+
+- [Content-addressable storage](https://en.wikipedia.org/wiki/Content-addressable_storage)
+- [SHA-256 Hash Function](https://docs.python.org/3/library/hashlib.html)