Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
14 changes: 10 additions & 4 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -197,7 +197,7 @@ result = client.replay(
| `max_revisions` | `int` | `3` | Max revision attempts before fallback |
| `strictness` | `Strictness` | `BALANCED` | Eval gate strictness (LENIENT, BALANCED, STRICT) |
| `retriever_fn` | `Callable` | `None` | Custom retriever callback for RAG |
| `enable_cache` | `bool` | `True` | Enable response caching (roadmap) |
| `enable_cache` | `bool` | `True` | Enable LLM response caching |

---

Expand Down Expand Up @@ -264,10 +264,16 @@ pytest tests/test_client.py::test_basic_run_without_retriever -v

## Roadmap

- [ ] Multi-provider fallback (Anthropic, Gemini)
- [ ] Response caching implementation
- [ ] Streamlit ops dashboard
- [x] Multi-provider support (OpenAI, Anthropic)
- [x] Response caching implementation
- [x] Streamlit ops dashboard
- [x] Trace persistence with SQLite + WAL mode
- [x] Eval gate pattern with revision loop
- [x] Cost tracking per request
- [x] CI/CD with GitHub Actions
- [x] Architecture Decision Records (ADRs)
- [ ] CLI tool (`traceflow run "query"`)
- [ ] Budget-aware model fallback
- [ ] Advanced evaluators (relevance scoring, citation validation)
- [ ] Async execution support
- [ ] OpenTelemetry export integration
Expand Down
72 changes: 72 additions & 0 deletions docs/adr/0001-use-sqlite-with-wal-mode.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,72 @@
# ADR-0001: Use SQLite with WAL Mode for Persistence

## Status

Accepted

## Context

TraceFlow Lite needs a persistence layer to store traces, steps, evaluations, and cached LLM responses. The system is designed as a lightweight, single-node observability tool that developers can run locally or in small-scale deployments.

Requirements:
- Zero external dependencies (no separate database server)
- ACID compliance for data integrity
- Support for concurrent reads during writes (Streamlit UI + agent execution)
- Simple deployment and backup (single file)
- Good performance for read-heavy workloads (trace replay, analytics)

Options considered:
1. **PostgreSQL/MySQL** - Full-featured but requires external server
2. **SQLite (default journal mode)** - Simple but blocks readers during writes
3. **SQLite with WAL mode** - Simple with concurrent read/write support
4. **File-based JSON/JSONL** - No schema, poor query performance
5. **Redis** - In-memory, requires external server

## Decision

Use SQLite with Write-Ahead Logging (WAL) mode enabled.

Configuration:
```python
connection = sqlite3.connect(
db_path,
check_same_thread=False # Required for Streamlit's threading model
)
connection.execute("PRAGMA journal_mode=WAL")
connection.execute("PRAGMA busy_timeout=5000")
```

Schema design:
- `traces` - Parent records for agent runs
- `steps` - Individual LLM calls within a trace
- `evals` - Evaluation results linked to steps
- `llm_cache` - Cached LLM responses keyed by hash

## Consequences

### Positive

- **Zero infrastructure** - No database server to install, configure, or maintain
- **Portable** - Single `.db` file can be copied, backed up, or shared
- **Concurrent access** - WAL mode allows reads during writes (critical for live UI)
- **ACID compliance** - Full transaction support with rollback capability
- **Fast reads** - Excellent performance for the read-heavy trace viewer
- **Python stdlib** - No additional dependencies beyond `sqlite3`

### Negative

- **Single-node only** - Cannot scale horizontally (acceptable for target use case)
- **Write throughput** - Lower than dedicated databases (sufficient for observability)
- **No built-in replication** - Manual backup required for durability
- **Threading complexity** - Requires `check_same_thread=False` for Streamlit

### Neutral

- WAL mode creates additional `-wal` and `-shm` files alongside the database
- Database file grows monotonically; periodic `VACUUM` may be needed

## References

- [SQLite WAL Mode Documentation](https://www.sqlite.org/wal.html)
- [SQLite Threading Modes](https://www.sqlite.org/threadsafe.html)
- [Streamlit Threading Model](https://docs.streamlit.io/library/api-reference/performance/st.cache_data)
71 changes: 71 additions & 0 deletions docs/adr/0002-langgraph-for-agent-orchestration.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,71 @@
# ADR-0002: Use LangGraph for Agent Orchestration

## Status

Accepted

## Context

TraceFlow Lite implements a multi-step agent workflow with distinct phases: intake, planning, execution, evaluation, and revision. The workflow requires:

- Conditional branching based on evaluation results
- State management across nodes
- Clear separation between planning and execution phases
- Support for iterative refinement (revision loops)
- Observable, debuggable execution flow

Options considered:
1. **Custom state machine** - Full control but significant implementation effort
2. **LangChain LCEL** - Good for linear chains, limited branching support
3. **LangGraph** - Graph-based orchestration with conditional edges
4. **Temporal/Prefect** - Enterprise workflow engines, heavy dependencies
5. **Simple function composition** - Minimal overhead but poor observability

## Decision

Use LangGraph for agent workflow orchestration.

Graph structure:
```
intake_node → planner_node → executor_node → eval_node
↑ ↓
└──── revision_node ←──────────┘
(conditional: retry or end)
```

Key design choices:
- **TypedDict state** - Strongly typed state passed between nodes
- **Conditional edges** - Route based on evaluation pass/fail
- **Node isolation** - Each node has single responsibility
- **Explicit wiring** - Graph structure defined declaratively

## Consequences

### Positive

- **Visual clarity** - Graph structure maps directly to agent workflow
- **Conditional routing** - Native support for eval-based branching
- **State typing** - TypedDict provides IDE support and validation
- **Debugging** - Clear node boundaries for step-by-step inspection
- **LangChain ecosystem** - Compatible with LangChain tools and callbacks
- **Extensibility** - Easy to add new nodes or modify routing logic

### Negative

- **Dependency** - Adds `langgraph` as a required dependency
- **Learning curve** - Graph concepts may be unfamiliar to some developers
- **Overhead** - More abstraction than simple function calls
- **Version coupling** - Must track LangGraph API changes

### Neutral

- Graph compilation creates a static execution plan
- State is passed by value between nodes (immutable pattern)
- Requires explicit `END` node for termination

## References

- [LangGraph Documentation](https://python.langchain.com/docs/langgraph)
- [LangGraph Conditional Edges](https://langchain-ai.github.io/langgraph/concepts/low_level/#conditional-edges)
- [TraceFlow Workflow Diagram](../Architecture/traceflow_workflow.md)
77 changes: 77 additions & 0 deletions docs/adr/0003-provider-abstraction-pattern.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,77 @@
# ADR-0003: Provider Abstraction Pattern for LLM Integration

## Status

Accepted

## Context

TraceFlow Lite needs to support multiple LLM providers (OpenAI, Anthropic) with the ability to:

- Switch providers without changing application code
- Route different tasks to different models (e.g., fast model for planning, powerful model for execution)
- Add new providers with minimal effort
- Apply cross-cutting concerns (caching, retry, cost tracking) uniformly

Options considered:
1. **Direct SDK calls** - Simple but tightly coupled, no abstraction
2. **LangChain ChatModel** - Standard interface but heavy dependency
3. **Custom Protocol/ABC** - Lightweight abstraction, full control
4. **litellm** - Unified interface but additional dependency
5. **Provider factory pattern** - Abstraction with runtime selection

## Decision

Implement a custom provider abstraction using Python's Protocol pattern.

Architecture:
```python
class LLMProvider(Protocol):
"""Protocol defining the LLM provider interface."""

def complete(
self,
messages: list[dict[str, str]],
model: str,
temperature: float = 0.7,
max_tokens: int = 2048,
) -> LLMResponse: ...
```

Components:
- **`base.py`** - Protocol definition and response types
- **`openai_provider.py`** - OpenAI implementation
- **`anthropic_provider.py`** - Anthropic implementation
- **`router.py`** - Model-to-provider routing logic
- **`cache_provider.py`** - Decorator for caching responses
- **`retry.py`** - Retry logic with exponential backoff
- **`cost.py`** - Token counting and cost calculation

## Consequences

### Positive

- **Loose coupling** - Application code depends on Protocol, not implementations
- **Easy testing** - Mock providers for unit tests without API calls
- **Uniform interface** - Same `complete()` signature across all providers
- **Decorator pattern** - Caching, retry, logging applied transparently
- **No heavy dependencies** - Only `openai` and `anthropic` SDKs needed
- **Type safety** - Protocol provides IDE autocomplete and type checking

### Negative

- **Maintenance burden** - Must update implementations for SDK changes
- **Feature gaps** - May not expose all provider-specific features
- **Translation overhead** - Must convert between provider message formats

### Neutral

- Each provider handles its own message format translation
- Response normalization strips provider-specific metadata
- Cost tracking requires manual price table maintenance

## References

- [Python Protocol (PEP 544)](https://peps.python.org/pep-0544/)
- [OpenAI Python SDK](https://github.com/openai/openai-python)
- [Anthropic Python SDK](https://github.com/anthropics/anthropic-sdk-python)
104 changes: 104 additions & 0 deletions docs/adr/0004-llm-response-caching.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,104 @@
# ADR-0004: LLM Response Caching Strategy

## Status

Accepted

## Context

LLM API calls are expensive (cost) and slow (latency). During development and testing, the same prompts are often sent repeatedly. TraceFlow Lite needs a caching mechanism to:

- Reduce API costs during development iterations
- Speed up test execution
- Enable deterministic replay for debugging
- Support offline development with cached responses

Requirements:
- Cache key must uniquely identify equivalent requests
- Cache must be persistent across sessions
- Cache hits must be trackable for observability
- Caching must be toggleable (disable for production)

Options considered:
1. **In-memory LRU cache** - Fast but not persistent, lost on restart
2. **Redis** - Persistent but requires external service
3. **File-based (JSON)** - Simple but poor query performance
4. **SQLite table** - Persistent, queryable, no new dependencies
5. **LangChain caching** - Built-in but couples to LangChain

## Decision

Implement SQLite-based LLM response caching with content-addressable keys.

Cache key computation:
```python
def compute_key(
model: str,
messages: list[dict],
temperature: float,
max_tokens: int
) -> str:
"""SHA256 hash of normalized request parameters."""
payload = json.dumps({
"model": model,
"messages": messages,
"temperature": temperature,
"max_tokens": max_tokens,
}, sort_keys=True)
return hashlib.sha256(payload.encode()).hexdigest()
```

Schema:
```sql
CREATE TABLE llm_cache (
cache_key TEXT PRIMARY KEY,
model TEXT,
response_content TEXT,
prompt_tokens INTEGER,
completion_tokens INTEGER,
created_at TEXT
)
```

Integration via decorator pattern:
```python
class CachedProvider:
"""Wrapper that checks cache before calling underlying provider."""

def complete(self, messages, model, **kwargs) -> LLMResponse:
key = self.cache.compute_key(model, messages, **kwargs)
if cached := self.cache.get(key):
return LLMResponse(..., cache_hit=True)
response = self.provider.complete(messages, model, **kwargs)
self.cache.set(key, response)
return response
```

## Consequences

### Positive

- **Cost savings** - Identical requests return cached responses instantly
- **Faster iteration** - Development loop accelerated significantly
- **Deterministic tests** - Same input always produces same output
- **Observability** - `cache_hit` field tracks cache effectiveness
- **No new dependencies** - Reuses existing SQLite infrastructure
- **Toggle support** - Enable/disable via UI or configuration

### Negative

- **Stale responses** - Cache doesn't invalidate when models update
- **Storage growth** - Cache table grows unbounded (manual cleanup needed)
- **Temperature sensitivity** - Different temperatures = different cache keys
- **Not suitable for production** - Should be disabled for real user requests

### Neutral

- Cache key includes all parameters affecting output
- JSON normalization with `sort_keys=True` ensures consistent hashing
- Cache miss incurs small overhead for key computation

## References

- [Content-addressable storage](https://en.wikipedia.org/wiki/Content-addressable_storage)
- [SHA-256 Hash Function](https://docs.python.org/3/library/hashlib.html)
Loading