evansenter · Jan 8, 2026
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -4,6 +4,8 @@ Queryable analytics for Claude Code session logs, exposed as an MCP server and C
 
 **API Reference**: `session-analytics-cli --help` or `src/session_analytics/guide.md` (MCP resource: `session-analytics://guide`).
 
+**Schema Design**: See [docs/SCHEMA.md](docs/SCHEMA.md) for database tables, indexes, and migration history.
+
 ---
 
 ## ⚠️ DATABASE PROTECTION

diff --git a/README.md b/README.md
@@ -138,6 +138,8 @@ make check
 3. **Auto-refresh**: Queries detect stale data (>5 min) and trigger re-ingestion
 4. **Patterns**: Pre-computes tool sequences and permission gaps for fast queries
 
+See [docs/SCHEMA.md](docs/SCHEMA.md) for detailed database schema documentation.
+
 ## Architecture
 
 Key patterns used in the codebase:

diff --git a/docs/SCHEMA.md b/docs/SCHEMA.md
@@ -0,0 +1,265 @@
+# Database Schema Design
+
+This document describes the SQLite database schema for session-analytics.
+
+**Location**: `~/.claude/contrib/analytics/data.db`
+
+---
+
+## Design Principles
+
+1. **Don't over-distill** - Store raw signals (error counts, timestamps, parameters) rather than pre-computed interpretations. The consuming LLM handles context.
+
+2. **Aggregate → drill-down** - Every aggregate must be traceable to specifics. If "821 Bash errors" appears, the schema must support finding which commands failed.
+
+3. **Denormalize for common queries** - Extract frequently-filtered fields (command, file_path, skill_name) into columns rather than requiring JSON parsing.
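A minimal sketch of the kind of extraction principle 3 implies for Bash calls, assuming the command is split on the first run of whitespace. The helper name `split_bash_command` is hypothetical and is not taken from the codebase; the real ingestion code may handle quoting, environment prefixes, or pipelines differently.

```python
from typing import Optional, Tuple


def split_bash_command(command_line: str) -> Tuple[Optional[str], Optional[str]]:
    """Split a Bash command line into (command, command_args).

    Hypothetical helper: the first whitespace-delimited word becomes
    `command`, everything after it becomes `command_args`.
    """
    parts = command_line.strip().split(None, 1)
    if not parts:
        # Empty or whitespace-only input: nothing to extract.
        return None, None
    command = parts[0]
    command_args = parts[1] if len(parts) > 1 else None
    return command, command_args
```

With values stored this way, a filter such as `WHERE command = 'git'` is a plain column comparison and needs no parsing of `tool_input_json`.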
+
+---
+
+## Tables Overview
+
+| Table | Purpose | Rows (typical) |
+|-------|---------|----------------|
+| `events` | All tool calls, messages, and summaries from JSONL logs | 100K+ |
+| `sessions` | Aggregated session metadata | 1K+ |
+| `ingestion_state` | Tracks which JSONL files have been processed | ~100 |
+| `patterns` | Pre-computed patterns (re-computable, safe to drop) | ~1K |
+| `git_commits` | Git history for correlation | ~5K |
+| `session_commits` | Junction table linking sessions to commits | ~3K |
+| `bus_events` | Cross-session events from event-bus | ~2K |
+| `events_fts` | FTS5 virtual table for user message search | N/A |
+
+---
+
+## Core Tables
+
+### events
+
+The primary table storing all parsed JSONL entries.
+
+```sql
+CREATE TABLE events (
+    id INTEGER PRIMARY KEY,
+    uuid TEXT NOT NULL,        -- Unique within session (see UNIQUE constraint)
+    timestamp TIMESTAMP NOT NULL,
+    session_id TEXT NOT NULL,
+    project_path TEXT,
+    entry_type TEXT,           -- 'user', 'assistant', 'summary', 'tool_use', 'tool_result'
+
+    -- Tool-specific (null if not a tool call)
+    tool_name TEXT,
+    tool_input_json TEXT,      -- Full JSON for drill-down
+    tool_id TEXT,              -- Correlates tool_use with tool_result
+    is_error INTEGER DEFAULT 0,
+
+    -- Denormalized for common filters
+    command TEXT,              -- Bash: first word (e.g., "git")
+    command_args TEXT,         -- Bash: remaining args
+    file_path TEXT,            -- Read/Edit/Write target
+    skill_name TEXT,           -- Skill invocation name
+
+    -- Token tracking (only on assistant events to avoid duplication)
+    input_tokens INTEGER,
+    output_tokens INTEGER,
+    cache_read_tokens INTEGER,
+    cache_creation_tokens INTEGER,
+    model TEXT,
+
+    -- Context
+    git_branch TEXT,
+    cwd TEXT,
+
+    -- User journey (RFC #17)
+    user_message_text TEXT,    -- For FTS search
+    exit_code INTEGER,         -- Reserved for future extraction
+
+    -- Agent tracking (RFC #41)
+    parent_uuid TEXT,          -- Links tool_use to parent assistant event
+    agent_id TEXT,             -- Task subagent ID from agent-*.jsonl
+    is_sidechain INTEGER DEFAULT 0,
+    version TEXT,              -- Claude Code version
+
+    UNIQUE(session_id, uuid)   -- UUID unique within each session
+)
+```
+
+**Key patterns**:
+- `entry_type='tool_use'` and `entry_type='tool_result'` rows are correlated by `tool_id`
+- Token columns are populated only on `entry_type='assistant'` rows to avoid double-counting
+- `user_message_text` enables FTS via `events_fts` virtual table
+- `tool_input_json` preserves full parameters for drill-down queries
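The `tool_id` correlation described above can be sketched as a self-join. This example uses a reduced subset of the `events` columns and invented sample rows, and it assumes that `is_error` is set on the `tool_result` row while `tool_name` and `tool_input_json` live on the matching `tool_use` row; the schema comments do not state this explicitly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Reduced subset of the events table: only the columns this query touches.
conn.execute("""
    CREATE TABLE events (
        id INTEGER PRIMARY KEY,
        uuid TEXT NOT NULL,
        session_id TEXT NOT NULL,
        entry_type TEXT,
        tool_name TEXT,
        tool_input_json TEXT,
        tool_id TEXT,
        is_error INTEGER DEFAULT 0,
        UNIQUE(session_id, uuid)
    )
""")

# Invented sample data: one failing Bash call and one successful Read call.
conn.executemany(
    "INSERT INTO events (uuid, session_id, entry_type, tool_name,"
    " tool_input_json, tool_id, is_error) VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("u1", "s1", "tool_use", "Bash", '{"command": "git push"}', "t1", 0),
        ("u2", "s1", "tool_result", None, None, "t1", 1),
        ("u3", "s1", "tool_use", "Read", '{"file_path": "a.py"}', "t2", 0),
        ("u4", "s1", "tool_result", None, None, "t2", 0),
    ],
)

# Drill down from "an error happened" to the parameters that caused it:
# start from failing tool_result rows, join back to their tool_use rows.
failed_calls = conn.execute("""
    SELECT tu.tool_name, tu.tool_input_json
    FROM events AS tr
    JOIN events AS tu
      ON tu.tool_id = tr.tool_id
     AND tu.entry_type = 'tool_use'
    WHERE tr.entry_type = 'tool_result'
      AND tr.is_error = 1
""").fetchall()
```

Here `failed_calls` holds the tool name and full input JSON of each failing call, which is the aggregate-to-specifics path that design principle 2 asks for.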
+
+### sessions
+
+Aggregated metadata per session.
+
+```sql
+CREATE TABLE sessions (
+    id TEXT PRIMARY KEY,       -- UUID from session file
+    project_path TEXT,
+    first_seen TIMESTAMP,
+    last_seen TIMESTAMP,
+    entry_count INTEGER DEFAULT 0,
+    tool_use_count INTEGER DEFAULT 0,
+    total_input_tokens INTEGER DEFAULT 0,
+    total_output_tokens INTEGER DEFAULT 0,
+    primary_branch TEXT,
+    slug TEXT,                 -- Human-readable session name
+    context_switch_count INTEGER DEFAULT 0  -- RFC #26
+)
+```
+
+### git_commits
+
+Git history for session correlation.
+
+```sql
+CREATE TABLE git_commits (
+    sha TEXT PRIMARY KEY,
+    timestamp TIMESTAMP,
+    message TEXT,
+    session_id TEXT,           -- Inferred from timestamp proximity
+    project_path TEXT
+)
+```
+
+### session_commits
+
+Junction table for time-to-commit analysis.
+
+```sql
+CREATE TABLE session_commits (
+    session_id TEXT NOT NULL,
+    commit_sha TEXT NOT NULL,
+    time_to_commit_seconds INTEGER,
+    is_first_commit INTEGER DEFAULT 0,
+    PRIMARY KEY (session_id, commit_sha)
+)
+```
+
+### bus_events
+
+Events from the event-bus for cross-session insights.
+
+```sql
+CREATE TABLE bus_events (
+    id INTEGER PRIMARY KEY,
+    event_id INTEGER UNIQUE NOT NULL,  -- Original ID from event-bus
+    timestamp TIMESTAMP NOT NULL,
+    event_type TEXT NOT NULL,          -- 'gotcha_discovered', 'pattern_found', etc.
+    channel TEXT,
+    session_id TEXT,
+    repo TEXT,                         -- Extracted from channel
+    payload TEXT,
+    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
+)
+```
+
+### ingestion_state
+
+Tracks which JSONL files have been processed for incremental ingestion.
+
+```sql
+CREATE TABLE ingestion_state (
+    file_path TEXT PRIMARY KEY,
+    file_size INTEGER,
+    last_modified TIMESTAMP,
+    entries_processed INTEGER,
+    last_processed TIMESTAMP
+)
+```
+
+### patterns
+
+Pre-computed patterns for fast querying (re-computable, safe to delete).
+
+```sql
+CREATE TABLE patterns (
+    id INTEGER PRIMARY KEY,
+    pattern_type TEXT NOT NULL,        -- 'tool_frequency', 'sequence', etc.
+    pattern_key TEXT NOT NULL,         -- e.g., "Bash" or "Read → Edit"
+    count INTEGER DEFAULT 0,
+    last_seen TIMESTAMP,
+    metadata_json TEXT,
+    computed_at TIMESTAMP,
+    UNIQUE(pattern_type, pattern_key)
+)
+```
+
+---
+
+## Indexes
+
+Performance-critical indexes on the `events` table:
+
+| Index | Columns | Purpose |
+|-------|---------|---------|
+| `idx_events_timestamp` | `timestamp` | Time-range queries (days parameter) |
+| `idx_events_session` | `session_id` | Session-specific event lookup |
+| `idx_events_tool` | `tool_name` | Tool frequency analysis |
+| `idx_events_project` | `project_path` | Project filtering |
+| `idx_events_tool_id` | `tool_id` | Self-join for tool_use ↔ tool_result correlation |
+| `idx_events_parent_uuid` | `parent_uuid` | Token deduplication queries |
+| `idx_events_agent_id` | `agent_id` | Agent activity breakdown |
+| `idx_events_has_user_message` | Partial on `id` | FTS join optimization |
+
+**Performance note**: The `idx_events_tool_id` index is critical for `query_error_details()`, which self-joins `events` to correlate errors with their input parameters. Without it, queries take ~25s on 160K rows; with it, ~0.3s.
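One way to check that a self-join of this shape actually uses the index is `EXPLAIN QUERY PLAN`. The sketch below runs against an empty, reduced `events` table, and the query text is an assumption about what `query_error_details()` does rather than a copy of it. The exact wording of the plan varies between SQLite versions, so only the presence of the index name is inspected.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE events (
        id INTEGER PRIMARY KEY,
        entry_type TEXT,
        tool_id TEXT,
        is_error INTEGER DEFAULT 0,
        tool_input_json TEXT
    )
""")

# Assumed shape of the error drill-down query (not the project's actual SQL).
SELF_JOIN = """
    SELECT tu.tool_input_json
    FROM events AS tr
    JOIN events AS tu
      ON tu.tool_id = tr.tool_id
     AND tu.entry_type = 'tool_use'
    WHERE tr.entry_type = 'tool_result'
      AND tr.is_error = 1
"""


def query_plan(connection: sqlite3.Connection) -> str:
    """Return the human-readable plan steps joined into one string."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + SELF_JOIN).fetchall()
    # The fourth column of each row holds the plan detail text.
    return " | ".join(row[3] for row in rows)


plan_without_index = query_plan(conn)
conn.execute("CREATE INDEX IF NOT EXISTS idx_events_tool_id ON events(tool_id)")
plan_with_index = query_plan(conn)

print(plan_without_index)
print(plan_with_index)
```

After the index is created, the plan should mention `idx_events_tool_id` for one side of the join instead of a full scan or a temporary automatic index.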
+
+### Other Table Indexes
+
+| Table | Index | Columns |
+|-------|-------|---------|
+| `git_commits` | `idx_git_commits_timestamp` | `timestamp` |
+| `git_commits` | `idx_git_commits_session` | `session_id` |
+| `git_commits` | `idx_git_commits_project` | `project_path` |
+| `session_commits` | `idx_session_commits_session` | `session_id` |
+| `session_commits` | `idx_session_commits_commit` | `commit_sha` |
+| `bus_events` | `idx_bus_events_timestamp` | `timestamp` |
+| `bus_events` | `idx_bus_events_type` | `event_type` |
+| `bus_events` | `idx_bus_events_session` | `session_id` |
+| `bus_events` | `idx_bus_events_repo` | `repo` |
+
+---
+
+## Full-Text Search
+
+User messages are indexed via FTS5:
+
+```sql
+CREATE VIRTUAL TABLE events_fts USING fts5(
+    user_message_text,
+    content='events',
+    content_rowid='id'
+)
+```
+
+Sync triggers maintain index consistency:
+- `events_fts_insert`: Populates FTS on new events
+- `events_fts_delete`: Removes from FTS on delete
+- `events_fts_update`: Handles message text changes
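The trigger bodies are not shown in this document. The sketch below follows the standard external-content pattern from the SQLite FTS5 documentation and reuses the three trigger names listed above; the bodies themselves are an assumption, and the real triggers may add conditions such as skipping rows without message text. It needs a SQLite build with FTS5 enabled.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Reduced events table: only the columns the FTS index depends on.
    CREATE TABLE events (
        id INTEGER PRIMARY KEY,
        user_message_text TEXT
    );

    CREATE VIRTUAL TABLE events_fts USING fts5(
        user_message_text,
        content='events',
        content_rowid='id'
    );

    -- New rows are added to the index.
    CREATE TRIGGER events_fts_insert AFTER INSERT ON events BEGIN
        INSERT INTO events_fts(rowid, user_message_text)
        VALUES (new.id, new.user_message_text);
    END;

    -- External-content tables remove entries via the special 'delete' command.
    CREATE TRIGGER events_fts_delete AFTER DELETE ON events BEGIN
        INSERT INTO events_fts(events_fts, rowid, user_message_text)
        VALUES ('delete', old.id, old.user_message_text);
    END;

    -- An update is a delete of the old text followed by an insert of the new.
    CREATE TRIGGER events_fts_update AFTER UPDATE OF user_message_text ON events BEGIN
        INSERT INTO events_fts(events_fts, rowid, user_message_text)
        VALUES ('delete', old.id, old.user_message_text);
        INSERT INTO events_fts(rowid, user_message_text)
        VALUES (new.id, new.user_message_text);
    END;
""")


def search(term: str) -> list:
    """Return the ids of events whose message text matches the FTS query."""
    rows = conn.execute(
        "SELECT rowid FROM events_fts WHERE events_fts MATCH ? ORDER BY rowid",
        (term,),
    )
    return [row[0] for row in rows]


conn.execute("INSERT INTO events (id, user_message_text) VALUES (1, 'fix the failing migration test')")
conn.execute("INSERT INTO events (id, user_message_text) VALUES (2, 'add an index on tool_id')")
hits_after_insert = search("migration")

conn.execute("UPDATE events SET user_message_text = 'rename the schema doc' WHERE id = 1")
hits_after_update = search("migration")
hits_for_new_text = search("schema")

conn.execute("DELETE FROM events WHERE id = 2")
hits_after_delete = search("index")
```

Because `events_fts` stores no text of its own, the old column values must be passed to the `'delete'` command exactly as they were indexed, which is why the delete and update triggers reference `old.user_message_text`.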
+
+---
+
+## Migration History
+
+| Version | Name | Changes |
+|---------|------|---------|
+| 1 | Initial | Core tables: events, sessions, ingestion_state, patterns |
+| 2 | add_rfc17_phase1_columns | user_message_text, exit_code, git_commits table |
+| 3 | add_user_message_fts | FTS5 virtual table and sync triggers |
+| 4 | add_session_enrichment | session_commits junction, context_switch_count |
+| 5 | add_agent_tracking | parent_uuid, agent_id, is_sidechain, version |
+| 6 | add_event_bus_integration | bus_events table |
+| 7 | add_tool_id_index | Performance index for self-joins |
+
+---
+
+## Schema Evolution
+
+When changing the schema:
+
+1. Add a migration function with the `@migration(N, "name")` decorator
+2. Update the `SCHEMA_VERSION = N` constant
+3. Add the same change to `_init_db()` for fresh installs
+4. Use `IF NOT EXISTS` for idempotency
+5. Test both a fresh DB and the migration path
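The steps above can be sketched as a small registry. Only the `@migration(N, "name")` decorator form, the `SCHEMA_VERSION` constant, and the version 7 migration name come from this document. The `MIGRATIONS` dict, the `apply_migrations` runner, and the use of `PRAGMA user_version` to track the current version are assumptions made for illustration; the project may record its version differently.

```python
import sqlite3

SCHEMA_VERSION = 7

# Hypothetical registry mapping version number -> (name, migration function).
MIGRATIONS = {}


def migration(version: int, name: str):
    """Register the decorated function as the migration for `version`."""
    def register(fn):
        MIGRATIONS[version] = (name, fn)
        return fn
    return register


@migration(7, "add_tool_id_index")
def add_tool_id_index(conn: sqlite3.Connection) -> None:
    # IF NOT EXISTS keeps the migration idempotent (step 4).
    conn.execute("CREATE INDEX IF NOT EXISTS idx_events_tool_id ON events(tool_id)")


def apply_migrations(conn: sqlite3.Connection) -> int:
    """Run every registered migration newer than the stored version."""
    current = conn.execute("PRAGMA user_version").fetchone()[0]
    for version in sorted(MIGRATIONS):
        if current < version <= SCHEMA_VERSION:
            _name, fn = MIGRATIONS[version]
            fn(conn)
            # PRAGMA does not accept bound parameters; version is an int.
            conn.execute(f"PRAGMA user_version = {int(version)}")
    conn.commit()
    return conn.execute("PRAGMA user_version").fetchone()[0]


# Migration path (step 5): start from a database that is at version 6.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, tool_id TEXT)")
conn.execute("PRAGMA user_version = 6")

version_after_first_run = apply_migrations(conn)
version_after_second_run = apply_migrations(conn)  # no-op: already current
index_names = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'index' AND tbl_name = 'events'"
    )
]
```

Running the runner twice exercises the idempotency requirement: the second call finds nothing newer than the stored version and leaves the schema unchanged.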