From e636e31ed3dfcf790f763d75470e4caebc9108d0 Mon Sep 17 00:00:00 2001 From: khalilcodex Date: Tue, 13 Jan 2026 19:17:21 -0800 Subject: [PATCH] add adr section --- README.md | 14 ++- docs/adr/0001-use-sqlite-with-wal-mode.md | 72 ++++++++++++ .../0002-langgraph-for-agent-orchestration.md | 71 ++++++++++++ docs/adr/0003-provider-abstraction-pattern.md | 77 +++++++++++++ docs/adr/0004-llm-response-caching.md | 104 ++++++++++++++++++ docs/adr/0005-evaluation-gate-pattern.md | 88 +++++++++++++++ docs/adr/0006-streamlit-for-ui.md | 87 +++++++++++++++ docs/adr/README.md | 25 +++++ docs/adr/template.md | 33 ++++++ 9 files changed, 567 insertions(+), 4 deletions(-) create mode 100644 docs/adr/0001-use-sqlite-with-wal-mode.md create mode 100644 docs/adr/0002-langgraph-for-agent-orchestration.md create mode 100644 docs/adr/0003-provider-abstraction-pattern.md create mode 100644 docs/adr/0004-llm-response-caching.md create mode 100644 docs/adr/0005-evaluation-gate-pattern.md create mode 100644 docs/adr/0006-streamlit-for-ui.md create mode 100644 docs/adr/README.md create mode 100644 docs/adr/template.md diff --git a/README.md b/README.md index 2fe5284..621f8b0 100644 --- a/README.md +++ b/README.md @@ -197,7 +197,7 @@ result = client.replay( | `max_revisions` | `int` | `3` | Max revision attempts before fallback | | `strictness` | `Strictness` | `BALANCED` | Eval gate strictness (LENIENT, BALANCED, STRICT) | | `retriever_fn` | `Callable` | `None` | Custom retriever callback for RAG | -| `enable_cache` | `bool` | `True` | Enable response caching (roadmap) | +| `enable_cache` | `bool` | `True` | Enable LLM response caching | --- @@ -264,10 +264,16 @@ pytest tests/test_client.py::test_basic_run_without_retriever -v ## Roadmap -- [ ] Multi-provider fallback (Anthropic, Gemini) -- [ ] Response caching implementation -- [ ] Streamlit ops dashboard +- [x] Multi-provider support (OpenAI, Anthropic) +- [x] Response caching implementation +- [x] Streamlit ops dashboard +- [x] Trace persistence with SQLite + WAL mode +- [x] Eval gate pattern with revision loop +- [x] Cost tracking per request +- [x] CI/CD with GitHub Actions +- [x] Architecture Decision Records (ADRs) - [ ] CLI tool (`traceflow run "query"`) +- [ ] Budget-aware model fallback - [ ] Advanced evaluators (relevance scoring, citation validation) - [ ] Async execution support - [ ] OpenTelemetry export integration diff --git a/docs/adr/0001-use-sqlite-with-wal-mode.md b/docs/adr/0001-use-sqlite-with-wal-mode.md new file mode 100644 index 0000000..f4bf459 --- /dev/null +++ b/docs/adr/0001-use-sqlite-with-wal-mode.md @@ -0,0 +1,72 @@ +# ADR-0001: Use SQLite with WAL Mode for Persistence + +## Status + +Accepted + +## Context + +TraceFlow Lite needs a persistence layer to store traces, steps, evaluations, and cached LLM responses. The system is designed as a lightweight, single-node observability tool that developers can run locally or in small-scale deployments. + +Requirements: +- Zero external dependencies (no separate database server) +- ACID compliance for data integrity +- Support for concurrent reads during writes (Streamlit UI + agent execution) +- Simple deployment and backup (single file) +- Good performance for read-heavy workloads (trace replay, analytics) + +Options considered: +1. **PostgreSQL/MySQL** - Full-featured but requires external server +2. **SQLite (default journal mode)** - Simple but blocks readers during writes +3. **SQLite with WAL mode** - Simple with concurrent read/write support +4. **File-based JSON/JSONL** - No schema, poor query performance +5. **Redis** - In-memory, requires external server + +## Decision + +Use SQLite with Write-Ahead Logging (WAL) mode enabled. + +Configuration: +```python +connection = sqlite3.connect( + db_path, + check_same_thread=False # Required for Streamlit's threading model +) +connection.execute("PRAGMA journal_mode=WAL") +connection.execute("PRAGMA busy_timeout=5000") +``` + +Schema design: +- `traces` - Parent records for agent runs +- `steps` - Individual LLM calls within a trace +- `evals` - Evaluation results linked to steps +- `llm_cache` - Cached LLM responses keyed by hash + +## Consequences + +### Positive + +- **Zero infrastructure** - No database server to install, configure, or maintain +- **Portable** - Single `.db` file can be copied, backed up, or shared +- **Concurrent access** - WAL mode allows reads during writes (critical for live UI) +- **ACID compliance** - Full transaction support with rollback capability +- **Fast reads** - Excellent performance for the read-heavy trace viewer +- **Python stdlib** - No additional dependencies beyond `sqlite3` + +### Negative + +- **Single-node only** - Cannot scale horizontally (acceptable for target use case) +- **Write throughput** - Lower than dedicated databases (sufficient for observability) +- **No built-in replication** - Manual backup required for durability +- **Threading complexity** - Requires `check_same_thread=False` for Streamlit + +### Neutral + +- WAL mode creates additional `-wal` and `-shm` files alongside the database +- Database file grows monotonically; periodic `VACUUM` may be needed + +## References + +- [SQLite WAL Mode Documentation](https://www.sqlite.org/wal.html) +- [SQLite Threading Modes](https://www.sqlite.org/threadsafe.html) +- [Streamlit Threading Model](https://docs.streamlit.io/library/api-reference/performance/st.cache_data) diff --git a/docs/adr/0002-langgraph-for-agent-orchestration.md b/docs/adr/0002-langgraph-for-agent-orchestration.md new file mode 100644 index 0000000..c7a6a0b --- /dev/null +++ b/docs/adr/0002-langgraph-for-agent-orchestration.md @@ -0,0 +1,71 @@ +# ADR-0002: Use LangGraph for Agent Orchestration + +## Status + +Accepted + +## Context + +TraceFlow Lite implements a multi-step agent workflow with distinct phases: intake, planning, execution, evaluation, and revision. The workflow requires: + +- Conditional branching based on evaluation results +- State management across nodes +- Clear separation between planning and execution phases +- Support for iterative refinement (revision loops) +- Observable, debuggable execution flow + +Options considered: +1. **Custom state machine** - Full control but significant implementation effort +2. **LangChain LCEL** - Good for linear chains, limited branching support +3. **LangGraph** - Graph-based orchestration with conditional edges +4. **Temporal/Prefect** - Enterprise workflow engines, heavy dependencies +5. **Simple function composition** - Minimal overhead but poor observability + +## Decision + +Use LangGraph for agent workflow orchestration. + +Graph structure: +``` +intake_node → planner_node → executor_node → eval_node + ↑ ↓ + └──── revision_node ←──────────┘ + ↓ + (conditional: retry or end) +``` + +Key design choices: +- **TypedDict state** - Strongly typed state passed between nodes +- **Conditional edges** - Route based on evaluation pass/fail +- **Node isolation** - Each node has single responsibility +- **Explicit wiring** - Graph structure defined declaratively + +## Consequences + +### Positive + +- **Visual clarity** - Graph structure maps directly to agent workflow +- **Conditional routing** - Native support for eval-based branching +- **State typing** - TypedDict provides IDE support and validation +- **Debugging** - Clear node boundaries for step-by-step inspection +- **LangChain ecosystem** - Compatible with LangChain tools and callbacks +- **Extensibility** - Easy to add new nodes or modify routing logic + +### Negative + +- **Dependency** - Adds `langgraph` as a required dependency +- **Learning curve** - Graph concepts may be unfamiliar to some developers +- **Overhead** - More abstraction than simple function calls +- **Version coupling** - Must track LangGraph API changes + +### Neutral + +- Graph compilation creates a static execution plan +- State is passed by value between nodes (immutable pattern) +- Requires explicit `END` node for termination + +## References + +- [LangGraph Documentation](https://python.langchain.com/docs/langgraph) +- [LangGraph Conditional Edges](https://langchain-ai.github.io/langgraph/concepts/low_level/#conditional-edges) +- [TraceFlow Workflow Diagram](../Architecture/traceflow_workflow.md) diff --git a/docs/adr/0003-provider-abstraction-pattern.md b/docs/adr/0003-provider-abstraction-pattern.md new file mode 100644 index 0000000..fd2f3f0 --- /dev/null +++ b/docs/adr/0003-provider-abstraction-pattern.md @@ -0,0 +1,77 @@ +# ADR-0003: Provider Abstraction Pattern for LLM Integration + +## Status + +Accepted + +## Context + +TraceFlow Lite needs to support multiple LLM providers (OpenAI, Anthropic) with the ability to: + +- Switch providers without changing application code +- Route different tasks to different models (e.g., fast model for planning, powerful model for execution) +- Add new providers with minimal effort +- Apply cross-cutting concerns (caching, retry, cost tracking) uniformly + +Options considered: +1. **Direct SDK calls** - Simple but tightly coupled, no abstraction +2. **LangChain ChatModel** - Standard interface but heavy dependency +3. **Custom Protocol/ABC** - Lightweight abstraction, full control +4. **litellm** - Unified interface but additional dependency +5. **Provider factory pattern** - Abstraction with runtime selection + +## Decision + +Implement a custom provider abstraction using Python's Protocol pattern. + +Architecture: +```python +class LLMProvider(Protocol): + """Protocol defining the LLM provider interface.""" + + def complete( + self, + messages: list[dict[str, str]], + model: str, + temperature: float = 0.7, + max_tokens: int = 2048, + ) -> LLMResponse: ... +``` + +Components: +- **`base.py`** - Protocol definition and response types +- **`openai_provider.py`** - OpenAI implementation +- **`anthropic_provider.py`** - Anthropic implementation +- **`router.py`** - Model-to-provider routing logic +- **`cache_provider.py`** - Decorator for caching responses +- **`retry.py`** - Retry logic with exponential backoff +- **`cost.py`** - Token counting and cost calculation + +## Consequences + +### Positive + +- **Loose coupling** - Application code depends on Protocol, not implementations +- **Easy testing** - Mock providers for unit tests without API calls +- **Uniform interface** - Same `complete()` signature across all providers +- **Decorator pattern** - Caching, retry, logging applied transparently +- **No heavy dependencies** - Only `openai` and `anthropic` SDKs needed +- **Type safety** - Protocol provides IDE autocomplete and type checking + +### Negative + +- **Maintenance burden** - Must update implementations for SDK changes +- **Feature gaps** - May not expose all provider-specific features +- **Translation overhead** - Must convert between provider message formats + +### Neutral + +- Each provider handles its own message format translation +- Response normalization strips provider-specific metadata +- Cost tracking requires manual price table maintenance + +## References + +- [Python Protocol (PEP 544)](https://peps.python.org/pep-0544/) +- [OpenAI Python SDK](https://github.com/openai/openai-python) +- [Anthropic Python SDK](https://github.com/anthropics/anthropic-sdk-python) diff --git a/docs/adr/0004-llm-response-caching.md b/docs/adr/0004-llm-response-caching.md new file mode 100644 index 0000000..5a7cc6b --- /dev/null +++ b/docs/adr/0004-llm-response-caching.md @@ -0,0 +1,104 @@ +# ADR-0004: LLM Response Caching Strategy + +## Status + +Accepted + +## Context + +LLM API calls are expensive (cost) and slow (latency). During development and testing, the same prompts are often sent repeatedly. TraceFlow Lite needs a caching mechanism to: + +- Reduce API costs during development iterations +- Speed up test execution +- Enable deterministic replay for debugging +- Support offline development with cached responses + +Requirements: +- Cache key must uniquely identify equivalent requests +- Cache must be persistent across sessions +- Cache hits must be trackable for observability +- Caching must be toggleable (disable for production) + +Options considered: +1. **In-memory LRU cache** - Fast but not persistent, lost on restart +2. **Redis** - Persistent but requires external service +3. **File-based (JSON)** - Simple but poor query performance +4. **SQLite table** - Persistent, queryable, no new dependencies +5. **LangChain caching** - Built-in but couples to LangChain + +## Decision + +Implement SQLite-based LLM response caching with content-addressable keys. + +Cache key computation: +```python +def compute_key( + model: str, + messages: list[dict], + temperature: float, + max_tokens: int +) -> str: + """SHA256 hash of normalized request parameters.""" + payload = json.dumps({ + "model": model, + "messages": messages, + "temperature": temperature, + "max_tokens": max_tokens, + }, sort_keys=True) + return hashlib.sha256(payload.encode()).hexdigest() +``` + +Schema: +```sql +CREATE TABLE llm_cache ( + cache_key TEXT PRIMARY KEY, + model TEXT, + response_content TEXT, + prompt_tokens INTEGER, + completion_tokens INTEGER, + created_at TEXT +) +``` + +Integration via decorator pattern: +```python +class CachedProvider: + """Wrapper that checks cache before calling underlying provider.""" + + def complete(self, messages, model, **kwargs) -> LLMResponse: + key = self.cache.compute_key(model, messages, **kwargs) + if cached := self.cache.get(key): + return LLMResponse(..., cache_hit=True) + response = self.provider.complete(messages, model, **kwargs) + self.cache.set(key, response) + return response +``` + +## Consequences + +### Positive + +- **Cost savings** - Identical requests return cached responses instantly +- **Faster iteration** - Development loop accelerated significantly +- **Deterministic tests** - Same input always produces same output +- **Observability** - `cache_hit` field tracks cache effectiveness +- **No new dependencies** - Reuses existing SQLite infrastructure +- **Toggle support** - Enable/disable via UI or configuration + +### Negative + +- **Stale responses** - Cache doesn't invalidate when models update +- **Storage growth** - Cache table grows unbounded (manual cleanup needed) +- **Temperature sensitivity** - Different temperatures = different cache keys +- **Not suitable for production** - Should be disabled for real user requests + +### Neutral + +- Cache key includes all parameters affecting output +- JSON normalization with `sort_keys=True` ensures consistent hashing +- Cache miss incurs small overhead for key computation + +## References + +- [Content-addressable storage](https://en.wikipedia.org/wiki/Content-addressable_storage) +- [SHA-256 Hash Function](https://docs.python.org/3/library/hashlib.html) diff --git a/docs/adr/0005-evaluation-gate-pattern.md b/docs/adr/0005-evaluation-gate-pattern.md new file mode 100644 index 0000000..f486556 --- /dev/null +++ b/docs/adr/0005-evaluation-gate-pattern.md @@ -0,0 +1,88 @@ +# ADR-0005: Evaluation Gate Pattern for Quality Control + +## Status + +Accepted + +## Context + +LLM outputs are non-deterministic and can produce incorrect, unsafe, or low-quality results. TraceFlow Lite needs a mechanism to: + +- Automatically assess output quality before proceeding +- Block unsafe or incorrect outputs from reaching users +- Provide structured feedback for iterative improvement +- Track evaluation metrics for observability + +The agent workflow includes execution steps that produce code or text that must meet quality thresholds before being accepted. + +Options considered: +1. **No evaluation** - Fast but no quality control +2. **Human-in-the-loop** - High quality but blocks automation +3. **Rule-based validation** - Fast but limited to known patterns +4. **LLM-as-judge** - Flexible but adds latency and cost +5. **Hybrid (rules + LLM)** - Balance of speed and flexibility + +## Decision + +Implement an evaluation gate pattern using LLM-as-judge with structured output. + +Evaluation dimensions: +```python +class EvalResult(TypedDict): + correctness: float # 0.0-1.0: Does it solve the problem? + code_quality: float # 0.0-1.0: Is it well-structured? + safety: float # 0.0-1.0: Is it safe to execute? + change_safety: float # 0.0-1.0: Are changes minimal and reversible? + passed: bool # Overall pass/fail + feedback: str # Specific improvement suggestions +``` + +Gate logic: +```python +def should_pass(eval_result: EvalResult) -> bool: + return ( + eval_result["correctness"] >= 0.7 and + eval_result["safety"] >= 0.8 and + eval_result["passed"] + ) +``` + +Workflow integration: +``` +executor_node → eval_node → [pass] → END + ↓ + [fail] + ↓ + revision_node → planner_node (retry loop) +``` + +## Consequences + +### Positive + +- **Quality assurance** - Catches errors before they reach users +- **Structured feedback** - Specific suggestions enable targeted revision +- **Multi-dimensional** - Separate scores for correctness, safety, quality +- **Auditable** - All evaluations stored in `evals` table +- **Iterative improvement** - Failed evals trigger revision loop +- **Configurable thresholds** - Adjust strictness per use case + +### Negative + +- **Added latency** - Extra LLM call for evaluation +- **Added cost** - Evaluation consumes tokens +- **False negatives** - May reject valid outputs +- **False positives** - May accept flawed outputs +- **Revision loops** - Can get stuck if evaluation is too strict + +### Neutral + +- Maximum revision attempts configurable (default: 3) +- Evaluation prompts tuned for specific output types +- Feedback quality depends on evaluation model capability + +## References + +- [LLM-as-Judge Paper](https://arxiv.org/abs/2306.05685) +- [Constitutional AI](https://arxiv.org/abs/2212.08073) +- [OpenAI Evals Framework](https://github.com/openai/evals) diff --git a/docs/adr/0006-streamlit-for-ui.md b/docs/adr/0006-streamlit-for-ui.md new file mode 100644 index 0000000..89b534b --- /dev/null +++ b/docs/adr/0006-streamlit-for-ui.md @@ -0,0 +1,87 @@ +# ADR-0006: Use Streamlit for Operations UI + +## Status + +Accepted + +## Context + +TraceFlow Lite needs a user interface for: + +- Viewing trace history and step details +- Inspecting LLM inputs/outputs for debugging +- Monitoring costs and token usage +- Analyzing evaluation results +- Toggling runtime settings (e.g., caching) + +The UI is primarily for developers and operators, not end users. Requirements: + +- Rapid development (observability tool, not the core product) +- Python-native (same language as the agent) +- Real-time data display +- Interactive filtering and selection +- Minimal frontend expertise required + +Options considered: +1. **Flask/FastAPI + React** - Full control but high development cost +2. **Jupyter notebooks** - Interactive but poor UX for dashboards +3. **Streamlit** - Rapid Python-native dashboards +4. **Gradio** - ML-focused, less suitable for data dashboards +5. **Panel/Dash** - More complex than Streamlit + +## Decision + +Use Streamlit for the operations UI. + +Architecture: +``` +ui/ +├── app.py # Main Streamlit application +└── README.md # UI-specific documentation +``` + +Key features implemented: +- **Trace list** - Filterable table of all traces +- **Step viewer** - Detailed view of individual steps with syntax highlighting +- **Cost analytics** - Token usage and cost breakdown +- **Eval dashboard** - Evaluation scores and feedback +- **Cache toggle** - Runtime enable/disable of LLM caching +- **Dark theme** - Developer-friendly appearance + +State management: +```python +# Session state for UI persistence +if "selected_trace" not in st.session_state: + st.session_state.selected_trace = None +``` + +## Consequences + +### Positive + +- **Rapid development** - UI built in hours, not days +- **Python-native** - No JavaScript/TypeScript required +- **Reactive** - Automatic re-rendering on data changes +- **Built-in components** - Tables, charts, code blocks included +- **Easy deployment** - `streamlit run ui/app.py` +- **Session state** - Maintains UI state across interactions + +### Negative + +- **Limited customization** - Constrained to Streamlit's component model +- **Performance** - Full page re-render on each interaction +- **Threading model** - Requires careful handling of shared resources +- **Not production-grade** - Suitable for internal tools, not customer-facing +- **Scaling limits** - Single-process, not designed for high concurrency + +### Neutral + +- Streamlit Cloud available for hosted deployments +- Custom components possible but add complexity +- Mobile support is limited + +## References + +- [Streamlit Documentation](https://docs.streamlit.io/) +- [Streamlit Session State](https://docs.streamlit.io/library/api-reference/session-state) +- [Streamlit Theming](https://docs.streamlit.io/library/advanced-features/theming) diff --git a/docs/adr/README.md b/docs/adr/README.md new file mode 100644 index 0000000..638e0b3 --- /dev/null +++ b/docs/adr/README.md @@ -0,0 +1,25 @@ +# Architecture Decision Records + +This directory contains Architecture Decision Records (ADRs) for TraceFlow Lite. + +ADRs are documents that capture important architectural decisions made during development, including the context, the decision itself, and its consequences. + +## Index + +| ADR | Title | Status | +|-----|-------|--------| +| [ADR-0001](0001-use-sqlite-with-wal-mode.md) | Use SQLite with WAL Mode for Persistence | Accepted | +| [ADR-0002](0002-langgraph-for-agent-orchestration.md) | Use LangGraph for Agent Orchestration | Accepted | +| [ADR-0003](0003-provider-abstraction-pattern.md) | Provider Abstraction Pattern for LLM Integration | Accepted | +| [ADR-0004](0004-llm-response-caching.md) | LLM Response Caching Strategy | Accepted | +| [ADR-0005](0005-evaluation-gate-pattern.md) | Evaluation Gate Pattern for Quality Control | Accepted | +| [ADR-0006](0006-streamlit-for-ui.md) | Use Streamlit for Operations UI | Accepted | + +## ADR Template + +See [template.md](template.md) for creating new ADRs. + +## References + +- [Michael Nygard's ADR Article](https://cognitect.com/blog/2011/11/15/documenting-architecture-decisions) +- [ADR GitHub Organization](https://adr.github.io/) diff --git a/docs/adr/template.md b/docs/adr/template.md new file mode 100644 index 0000000..64a25bf --- /dev/null +++ b/docs/adr/template.md @@ -0,0 +1,33 @@ +# ADR-XXXX: Title + +## Status + +Proposed | Accepted | Deprecated | Superseded by [ADR-XXXX](XXXX-title.md) + +## Context + +What is the issue that we're seeing that is motivating this decision or change? + +## Decision + +What is the change that we're proposing and/or doing? + +## Consequences + +What becomes easier or more difficult to do because of this change? + +### Positive + +- List positive outcomes + +### Negative + +- List negative outcomes or trade-offs + +### Neutral + +- List neutral observations + +## References + +- Link to relevant documentation, issues, or discussions