From e636e31ed3dfcf790f763d75470e4caebc9108d0 Mon Sep 17 00:00:00 2001
From: khalilcodex <mkhalilsm@gmail.com>
Date: Tue, 13 Jan 2026 19:17:21 -0800
Subject: [PATCH] add adr section

---
 README.md                                     |  14 ++-
 docs/adr/0001-use-sqlite-with-wal-mode.md     |  72 ++++++++++++
 .../0002-langgraph-for-agent-orchestration.md |  71 ++++++++++++
 docs/adr/0003-provider-abstraction-pattern.md |  77 +++++++++++++
 docs/adr/0004-llm-response-caching.md         | 104 ++++++++++++++++++
 docs/adr/0005-evaluation-gate-pattern.md      |  88 +++++++++++++++
 docs/adr/0006-streamlit-for-ui.md             |  87 +++++++++++++++
 docs/adr/README.md                            |  25 +++++
 docs/adr/template.md                          |  33 ++++++
 9 files changed, 567 insertions(+), 4 deletions(-)
 create mode 100644 docs/adr/0001-use-sqlite-with-wal-mode.md
 create mode 100644 docs/adr/0002-langgraph-for-agent-orchestration.md
 create mode 100644 docs/adr/0003-provider-abstraction-pattern.md
 create mode 100644 docs/adr/0004-llm-response-caching.md
 create mode 100644 docs/adr/0005-evaluation-gate-pattern.md
 create mode 100644 docs/adr/0006-streamlit-for-ui.md
 create mode 100644 docs/adr/README.md
 create mode 100644 docs/adr/template.md

diff --git a/README.md b/README.md
index 2fe5284..621f8b0 100644
--- a/README.md
+++ b/README.md
@@ -197,7 +197,7 @@ result = client.replay(
 | `max_revisions` | `int` | `3` | Max revision attempts before fallback |
 | `strictness` | `Strictness` | `BALANCED` | Eval gate strictness (LENIENT, BALANCED, STRICT) |
 | `retriever_fn` | `Callable` | `None` | Custom retriever callback for RAG |
-| `enable_cache` | `bool` | `True` | Enable response caching (roadmap) |
+| `enable_cache` | `bool` | `True` | Enable LLM response caching |
 
 ---
 
@@ -264,10 +264,16 @@ pytest tests/test_client.py::test_basic_run_without_retriever -v
 
 ## Roadmap
 
-- [ ] Multi-provider fallback (Anthropic, Gemini)
-- [ ] Response caching implementation
-- [ ] Streamlit ops dashboard
+- [x] Multi-provider support (OpenAI, Anthropic)
+- [x] Response caching implementation
+- [x] Streamlit ops dashboard
+- [x] Trace persistence with SQLite + WAL mode
+- [x] Eval gate pattern with revision loop
+- [x] Cost tracking per request
+- [x] CI/CD with GitHub Actions
+- [x] Architecture Decision Records (ADRs)
 - [ ] CLI tool (`traceflow run "query"`)
+- [ ] Budget-aware model fallback
 - [ ] Advanced evaluators (relevance scoring, citation validation)
 - [ ] Async execution support
 - [ ] OpenTelemetry export integration
diff --git a/docs/adr/0001-use-sqlite-with-wal-mode.md b/docs/adr/0001-use-sqlite-with-wal-mode.md
new file mode 100644
index 0000000..f4bf459
--- /dev/null
+++ b/docs/adr/0001-use-sqlite-with-wal-mode.md
@@ -0,0 +1,72 @@
+# ADR-0001: Use SQLite with WAL Mode for Persistence
+
+## Status
+
+Accepted
+
+## Context
+
+TraceFlow Lite needs a persistence layer to store traces, steps, evaluations, and cached LLM responses. The system is designed as a lightweight, single-node observability tool that developers can run locally or in small-scale deployments.
+
+Requirements:
+- Zero external dependencies (no separate database server)
+- ACID compliance for data integrity
+- Support for concurrent reads during writes (Streamlit UI + agent execution)
+- Simple deployment and backup (single file)
+- Good performance for read-heavy workloads (trace replay, analytics)
+
+Options considered:
+1. **PostgreSQL/MySQL** - Full-featured but requires external server
+2. **SQLite (default journal mode)** - Simple but blocks readers during writes
+3. **SQLite with WAL mode** - Simple with concurrent read/write support
+4. **File-based JSON/JSONL** - No schema, poor query performance
+5. **Redis** - In-memory, requires external server
+
+## Decision
+
+Use SQLite with Write-Ahead Logging (WAL) mode enabled.
+
+Configuration:
+```python
+import sqlite3
+
+# check_same_thread=False is required for Streamlit's threading model
+connection = sqlite3.connect(db_path, check_same_thread=False)
+connection.execute("PRAGMA journal_mode=WAL")
+connection.execute("PRAGMA busy_timeout=5000")
+```
+
+Schema design:
+- `traces` - Parent records for agent runs
+- `steps` - Individual LLM calls within a trace
+- `evals` - Evaluation results linked to steps
+- `llm_cache` - Cached LLM responses keyed by hash
+
+## Consequences
+
+### Positive
+
+- **Zero infrastructure** - No database server to install, configure, or maintain
+- **Portable** - Single `.db` file can be copied, backed up, or shared
+- **Concurrent access** - WAL mode lets readers proceed while a write is in progress (critical for live UI)
+- **ACID compliance** - Full transaction support with rollback capability
+- **Fast reads** - Excellent performance for the read-heavy trace viewer
+- **Python stdlib** - No additional dependencies beyond `sqlite3`
+
+### Negative
+
+- **Single-node only** - Cannot scale horizontally (acceptable for target use case)
+- **Write throughput** - One writer at a time, so lower than dedicated databases (sufficient for observability)
+- **No built-in replication** - Manual backups required for redundancy
+- **Threading complexity** - Requires `check_same_thread=False` for Streamlit
+
+### Neutral
+
+- WAL mode creates additional `-wal` and `-shm` files alongside the database
+- Database file does not shrink when rows are deleted; periodic `VACUUM` may be needed to reclaim space
+
+## References
+
+- [SQLite WAL Mode Documentation](https://www.sqlite.org/wal.html)
+- [SQLite Threading Modes](https://www.sqlite.org/threadsafe.html)
+- [Streamlit `st.cache_data` Reference](https://docs.streamlit.io/library/api-reference/performance/st.cache_data)
diff --git a/docs/adr/0002-langgraph-for-agent-orchestration.md b/docs/adr/0002-langgraph-for-agent-orchestration.md
new file mode 100644
index 0000000..c7a6a0b
--- /dev/null
+++ b/docs/adr/0002-langgraph-for-agent-orchestration.md
@@ -0,0 +1,71 @@
+# ADR-0002: Use LangGraph for Agent Orchestration
+
+## Status
+
+Accepted
+
+## Context
+
+TraceFlow Lite implements a multi-step agent workflow with distinct phases: intake, planning, execution, evaluation, and revision. The workflow requires:
+
+- Conditional branching based on evaluation results
+- State management across nodes
+- Clear separation between planning and execution phases
+- Support for iterative refinement (revision loops)
+- Observable, debuggable execution flow
+
+Options considered:
+1. **Custom state machine** - Full control but significant implementation effort
+2. **LangChain LCEL** - Good for linear chains, limited branching support
+3. **LangGraph** - Graph-based orchestration with conditional edges
+4. **Temporal/Prefect** - Enterprise workflow engines, heavy dependencies
+5. **Simple function composition** - Minimal overhead but poor observability
+
+## Decision
+
+Use LangGraph for agent workflow orchestration.
+
+Graph structure:
+```
+intake_node → planner_node → executor_node → eval_node
+                   ↑                              ↓
+                   └──── revision_node ←──────────┘
+                              ↓
+                         (conditional: retry or end)
+```
+
+Key design choices:
+- **TypedDict state** - Strongly typed state passed between nodes
+- **Conditional edges** - Route based on evaluation pass/fail
+- **Node isolation** - Each node has single responsibility
+- **Explicit wiring** - Graph structure defined declaratively
+
+## Consequences
+
+### Positive
+
+- **Visual clarity** - Graph structure maps directly to agent workflow
+- **Conditional routing** - Native support for eval-based branching
+- **State typing** - TypedDict provides IDE support and static type checking (no runtime validation)
+- **Debugging** - Clear node boundaries for step-by-step inspection
+- **LangChain ecosystem** - Compatible with LangChain tools and callbacks
+- **Extensibility** - Easy to add new nodes or modify routing logic
+
+### Negative
+
+- **Dependency** - Adds `langgraph` as a required dependency
+- **Learning curve** - Graph concepts may be unfamiliar to some developers
+- **Overhead** - More abstraction than simple function calls
+- **Version coupling** - Must track LangGraph API changes
+
+### Neutral
+
+- Graph compilation creates a static execution plan
+- Nodes return state updates that LangGraph merges into the graph state, rather than mutating it in place
+- Requires an explicit edge to the `END` node for termination
+
+## References
+
+- [LangGraph Documentation](https://python.langchain.com/docs/langgraph)
+- [LangGraph Conditional Edges](https://langchain-ai.github.io/langgraph/concepts/low_level/#conditional-edges)
+- [TraceFlow Workflow Diagram](../Architecture/traceflow_workflow.md)
diff --git a/docs/adr/0003-provider-abstraction-pattern.md b/docs/adr/0003-provider-abstraction-pattern.md
new file mode 100644
index 0000000..fd2f3f0
--- /dev/null
+++ b/docs/adr/0003-provider-abstraction-pattern.md
@@ -0,0 +1,77 @@
+# ADR-0003: Provider Abstraction Pattern for LLM Integration
+
+## Status
+
+Accepted
+
+## Context
+
+TraceFlow Lite needs to support multiple LLM providers (OpenAI, Anthropic) with the ability to:
+
+- Switch providers without changing application code
+- Route different tasks to different models (e.g., fast model for planning, powerful model for execution)
+- Add new providers with minimal effort
+- Apply cross-cutting concerns (caching, retry, cost tracking) uniformly
+
+Options considered:
+1. **Direct SDK calls** - Simple but tightly coupled, no abstraction
+2. **LangChain ChatModel** - Standard interface but heavy dependency
+3. **Custom Protocol/ABC** - Lightweight abstraction, full control
+4. **litellm** - Unified interface but additional dependency
+5. **Provider factory pattern** - Abstraction with runtime selection
+
+## Decision
+
+Implement a custom provider abstraction using Python's Protocol pattern.
+
+Architecture:
+```python
+from typing import Protocol
+
+# LLMResponse is one of the response types defined alongside this Protocol in base.py
+class LLMProvider(Protocol):
+    """Protocol defining the LLM provider interface."""
+
+    def complete(
+        self, messages: list[dict[str, str]], model: str,
+        temperature: float = 0.7, max_tokens: int = 2048,
+    ) -> LLMResponse: ...
+```
+
+Components:
+- **`base.py`** - Protocol definition and response types
+- **`openai_provider.py`** - OpenAI implementation
+- **`anthropic_provider.py`** - Anthropic implementation
+- **`router.py`** - Model-to-provider routing logic
+- **`cache_provider.py`** - Decorator for caching responses
+- **`retry.py`** - Retry logic with exponential backoff
+- **`cost.py`** - Token counting and cost calculation
+
+## Consequences
+
+### Positive
+
+- **Loose coupling** - Application code depends on Protocol, not implementations
+- **Easy testing** - Mock providers for unit tests without API calls
+- **Uniform interface** - Same `complete()` signature across all providers
+- **Decorator pattern** - Caching, retry, logging applied transparently
+- **No heavy dependencies** - Only `openai` and `anthropic` SDKs needed
+- **Type safety** - Protocol provides IDE autocomplete and type checking
+
+### Negative
+
+- **Maintenance burden** - Must update implementations for SDK changes
+- **Feature gaps** - May not expose all provider-specific features
+- **Translation overhead** - Must convert between provider message formats
+
+### Neutral
+
+- Each provider handles its own message format translation
+- Response normalization strips provider-specific metadata
+- Cost tracking requires manual price table maintenance
+
+## References
+
+- [Python Protocol (PEP 544)](https://peps.python.org/pep-0544/)
+- [OpenAI Python SDK](https://github.com/openai/openai-python)
+- [Anthropic Python SDK](https://github.com/anthropics/anthropic-sdk-python)
diff --git a/docs/adr/0004-llm-response-caching.md b/docs/adr/0004-llm-response-caching.md
new file mode 100644
index 0000000..5a7cc6b
--- /dev/null
+++ b/docs/adr/0004-llm-response-caching.md
@@ -0,0 +1,104 @@
+# ADR-0004: LLM Response Caching Strategy
+
+## Status
+
+Accepted
+
+## Context
+
+LLM API calls are expensive (cost) and slow (latency). During development and testing, the same prompts are often sent repeatedly. TraceFlow Lite needs a caching mechanism to:
+
+- Reduce API costs during development iterations
+- Speed up test execution
+- Enable deterministic replay for debugging
+- Support offline development with cached responses
+
+Requirements:
+- Cache key must uniquely identify equivalent requests
+- Cache must be persistent across sessions
+- Cache hits must be trackable for observability
+- Caching must be toggleable (disable for production)
+
+Options considered:
+1. **In-memory LRU cache** - Fast but not persistent, lost on restart
+2. **Redis** - Persistent but requires external service
+3. **File-based (JSON)** - Simple but poor query performance
+4. **SQLite table** - Persistent, queryable, no new dependencies
+5. **LangChain caching** - Built-in but couples to LangChain
+
+## Decision
+
+Implement SQLite-based LLM response caching with content-addressable keys.
+
+Cache key computation:
+```python
+import hashlib
+import json
+
+def compute_key(
+    model: str, messages: list[dict], temperature: float, max_tokens: int
+) -> str:
+    """SHA256 hash of normalized request parameters."""
+    payload = json.dumps({
+        "model": model,
+        "messages": messages,
+        "temperature": temperature,
+        "max_tokens": max_tokens,
+    }, sort_keys=True)
+    return hashlib.sha256(payload.encode()).hexdigest()
+```
+
+Schema:
+```sql
+CREATE TABLE llm_cache (
+    cache_key TEXT PRIMARY KEY,
+    model TEXT,
+    response_content TEXT,
+    prompt_tokens INTEGER,
+    completion_tokens INTEGER,
+    created_at TEXT
+)
+```
+
+Integration via decorator pattern:
+```python
+class CachedProvider:
+    """Wrapper that checks cache before calling underlying provider."""
+    
+    def complete(self, messages, model, **kwargs) -> LLMResponse:
+        key = self.cache.compute_key(model, messages, **kwargs)
+        if cached := self.cache.get(key):
+            return LLMResponse(..., cache_hit=True)
+        response = self.provider.complete(messages, model, **kwargs)
+        self.cache.set(key, response)
+        return response
+```
+
+## Consequences
+
+### Positive
+
+- **Cost savings** - Identical requests return cached responses instantly
+- **Faster iteration** - Development loop accelerated significantly
+- **Deterministic tests** - A cached request always returns the same response
+- **Observability** - `cache_hit` field tracks cache effectiveness
+- **No new dependencies** - Reuses existing SQLite infrastructure
+- **Toggle support** - Enable/disable via UI or configuration
+
+### Negative
+
+- **Stale responses** - Cache doesn't invalidate when models update
+- **Storage growth** - Cache table grows unbounded (manual cleanup needed)
+- **Temperature sensitivity** - Different temperatures = different cache keys
+- **Not suitable for production** - Should be disabled for real user requests
+
+### Neutral
+
+- Cache key includes all parameters affecting output
+- JSON normalization with `sort_keys=True` keeps the hash stable regardless of dict key order
+- A cache miss adds a small overhead for key computation and lookup before the API call
+
+## References
+
+- [Content-addressable storage](https://en.wikipedia.org/wiki/Content-addressable_storage)
+- [Python `hashlib` Documentation (SHA-256)](https://docs.python.org/3/library/hashlib.html)
diff --git a/docs/adr/0005-evaluation-gate-pattern.md b/docs/adr/0005-evaluation-gate-pattern.md
new file mode 100644
index 0000000..f486556
--- /dev/null
+++ b/docs/adr/0005-evaluation-gate-pattern.md
@@ -0,0 +1,88 @@
+# ADR-0005: Evaluation Gate Pattern for Quality Control
+
+## Status
+
+Accepted
+
+## Context
+
+LLMs are non-deterministic and can produce incorrect, unsafe, or low-quality outputs. TraceFlow Lite needs a mechanism to:
+
+- Automatically assess output quality before proceeding
+- Block unsafe or incorrect outputs from reaching users
+- Provide structured feedback for iterative improvement
+- Track evaluation metrics for observability
+
+The agent workflow includes execution steps whose code or text output must meet quality thresholds before being accepted.
+
+Options considered:
+1. **No evaluation** - Fast but no quality control
+2. **Human-in-the-loop** - High quality but blocks automation
+3. **Rule-based validation** - Fast but limited to known patterns
+4. **LLM-as-judge** - Flexible but adds latency and cost
+5. **Hybrid (rules + LLM)** - Balance of speed and flexibility
+
+## Decision
+
+Implement an evaluation gate pattern using LLM-as-judge with structured output.
+
+Evaluation dimensions:
+```python
+class EvalResult(TypedDict):
+    correctness: float      # 0.0-1.0: Does it solve the problem?
+    code_quality: float     # 0.0-1.0: Is it well-structured?
+    safety: float           # 0.0-1.0: Is it safe to execute?
+    change_safety: float    # 0.0-1.0: Are changes minimal and reversible?
+    passed: bool            # Overall pass/fail
+    feedback: str           # Specific improvement suggestions
+```
+
+Gate logic:
+```python
+def should_pass(eval_result: EvalResult) -> bool:
+    return (
+        eval_result["correctness"] >= 0.7 and
+        eval_result["safety"] >= 0.8 and
+        eval_result["passed"]
+    )
+```
+
+Workflow integration:
+```
+executor_node → eval_node → [pass] → END
+                    ↓
+                 [fail]
+                    ↓
+              revision_node → planner_node (retry loop)
+```
+
+## Consequences
+
+### Positive
+
+- **Quality assurance** - Catches errors before they reach users
+- **Structured feedback** - Specific suggestions enable targeted revision
+- **Multi-dimensional** - Separate scores for correctness, code quality, safety, and change safety
+- **Auditable** - All evaluations stored in `evals` table
+- **Iterative improvement** - Failed evals trigger revision loop
+- **Configurable thresholds** - Adjust strictness per use case
+
+### Negative
+
+- **Added latency** - Extra LLM call for evaluation
+- **Added cost** - Evaluation consumes tokens
+- **False negatives** - May reject valid outputs
+- **False positives** - May accept flawed outputs
+- **Revision loops** - Can exhaust all revision attempts if evaluation is too strict
+
+### Neutral
+
+- Maximum revision attempts configurable (default: 3)
+- Evaluation prompts tuned for specific output types
+- Feedback quality depends on evaluation model capability
+
+## References
+
+- [LLM-as-Judge Paper](https://arxiv.org/abs/2306.05685)
+- [Constitutional AI](https://arxiv.org/abs/2212.08073)
+- [OpenAI Evals Framework](https://github.com/openai/evals)
diff --git a/docs/adr/0006-streamlit-for-ui.md b/docs/adr/0006-streamlit-for-ui.md
new file mode 100644
index 0000000..89b534b
--- /dev/null
+++ b/docs/adr/0006-streamlit-for-ui.md
@@ -0,0 +1,87 @@
+# ADR-0006: Use Streamlit for Operations UI
+
+## Status
+
+Accepted
+
+## Context
+
+TraceFlow Lite needs a user interface for:
+
+- Viewing trace history and step details
+- Inspecting LLM inputs/outputs for debugging
+- Monitoring costs and token usage
+- Analyzing evaluation results
+- Toggling runtime settings (e.g., caching)
+
+The UI is primarily for developers and operators, not end users. Requirements:
+
+- Rapid development (observability tool, not the core product)
+- Python-native (same language as the agent)
+- Real-time data display
+- Interactive filtering and selection
+- Minimal frontend expertise required
+
+Options considered:
+1. **Flask/FastAPI + React** - Full control but high development cost
+2. **Jupyter notebooks** - Interactive but poor UX for dashboards
+3. **Streamlit** - Rapid Python-native dashboards
+4. **Gradio** - ML-focused, less suitable for data dashboards
+5. **Panel/Dash** - More complex than Streamlit
+
+## Decision
+
+Use Streamlit for the operations UI.
+
+Architecture:
+```
+ui/
+├── app.py          # Main Streamlit application
+└── README.md       # UI-specific documentation
+```
+
+Key features implemented:
+- **Trace list** - Filterable table of all traces
+- **Step viewer** - Detailed view of individual steps with syntax highlighting
+- **Cost analytics** - Token usage and cost breakdown
+- **Eval dashboard** - Evaluation scores and feedback
+- **Cache toggle** - Runtime enable/disable of LLM caching
+- **Dark theme** - Developer-friendly appearance
+
+State management:
+```python
+import streamlit as st  # session state keeps UI selections across reruns
+if "selected_trace" not in st.session_state:
+    st.session_state.selected_trace = None
+```
+
+## Consequences
+
+### Positive
+
+- **Rapid development** - UI built in hours, not days
+- **Python-native** - No JavaScript/TypeScript required
+- **Reactive** - Script reruns automatically on each user interaction
+- **Built-in components** - Tables, charts, code blocks included
+- **Easy deployment** - `streamlit run ui/app.py`
+- **Session state** - Maintains UI state across interactions
+
+### Negative
+
+- **Limited customization** - Constrained to Streamlit's component model
+- **Performance** - Full script rerun on each interaction
+- **Threading model** - Requires careful handling of shared resources
+- **Not production-grade** - Suitable for internal tools, not customer-facing
+- **Scaling limits** - Single-process, not designed for high concurrency
+
+### Neutral
+
+- Streamlit Cloud available for hosted deployments
+- Custom components possible but add complexity
+- Mobile support is limited
+
+## References
+
+- [Streamlit Documentation](https://docs.streamlit.io/)
+- [Streamlit Session State](https://docs.streamlit.io/library/api-reference/session-state)
+- [Streamlit Theming](https://docs.streamlit.io/library/advanced-features/theming)
diff --git a/docs/adr/README.md b/docs/adr/README.md
new file mode 100644
index 0000000..638e0b3
--- /dev/null
+++ b/docs/adr/README.md
@@ -0,0 +1,25 @@
+# Architecture Decision Records
+
+This directory contains Architecture Decision Records (ADRs) for TraceFlow Lite.
+
+ADRs capture important architectural decisions made during development, along with their context and consequences.
+
+## Index
+
+| ADR | Title | Status |
+|-----|-------|--------|
+| [ADR-0001](0001-use-sqlite-with-wal-mode.md) | Use SQLite with WAL Mode for Persistence | Accepted |
+| [ADR-0002](0002-langgraph-for-agent-orchestration.md) | Use LangGraph for Agent Orchestration | Accepted |
+| [ADR-0003](0003-provider-abstraction-pattern.md) | Provider Abstraction Pattern for LLM Integration | Accepted |
+| [ADR-0004](0004-llm-response-caching.md) | LLM Response Caching Strategy | Accepted |
+| [ADR-0005](0005-evaluation-gate-pattern.md) | Evaluation Gate Pattern for Quality Control | Accepted |
+| [ADR-0006](0006-streamlit-for-ui.md) | Use Streamlit for Operations UI | Accepted |
+
+## ADR Template
+
+See [template.md](template.md) for creating new ADRs.
+
+## References
+
+- [Michael Nygard's ADR Article](https://cognitect.com/blog/2011/11/15/documenting-architecture-decisions)
+- [ADR GitHub Organization](https://adr.github.io/)
diff --git a/docs/adr/template.md b/docs/adr/template.md
new file mode 100644
index 0000000..64a25bf
--- /dev/null
+++ b/docs/adr/template.md
@@ -0,0 +1,33 @@
+# ADR-XXXX: Title
+
+## Status
+
+Proposed | Accepted | Deprecated | Superseded by [ADR-XXXX](XXXX-title.md)
+
+## Context
+
+What is the issue that we're seeing that is motivating this decision or change?
+
+## Decision
+
+What is the change that we're proposing and/or doing?
+
+## Consequences
+
+What becomes easier or more difficult to do because of this change?
+
+### Positive
+
+- List positive outcomes
+
+### Negative
+
+- List negative outcomes or trade-offs
+
+### Neutral
+
+- List neutral observations
+
+## References
+
+- Link to relevant documentation, issues, or discussions