RFC: Data enrichment for deeper session analysis #26

## Summary

Add data collection and enrichment capabilities that enable deeper analysis of session patterns. These capabilities address the foundational data gaps identified during the RFC #25 analysis process.

## Motivation

While analyzing session data for RFC #25 (enhance /status-report), several data gaps limited the depth of insights possible. The current infrastructure captures tool usage and timing, but lacks semantic understanding, outcome tracking, and cross-system correlation.

## Proposed Capabilities

### 1. Semantic Clustering of User Messages
**Priority: High**

Currently: FTS5 search on keywords
Needed: Automatic clustering by topic/intent

```
"50% of tasks are refactoring, 30% are features, 20% are debugging"
```

Implementation options:
- Embed messages with a local model (e.g., sentence-transformers)
- Cluster with k-means or HDBSCAN
- Store cluster assignments in events table
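The cluster-and-store steps could look roughly like the sketch below. It assumes embeddings have already been computed elsewhere (for example with sentence-transformers) and are passed in as plain lists of floats. The `events.cluster_id` column, the toy 2-D vectors, and the tiny pure-Python k-means (standing in for a real k-means or HDBSCAN library) are all illustrative assumptions, not part of the current schema.

```python
import sqlite3


def kmeans(vectors, k, iterations=20):
    """Tiny k-means on lists of floats; returns one cluster index per vector."""
    # Deterministic init for this sketch: the first k vectors become centroids.
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centroids[c])))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def store_clusters(conn, event_ids, labels):
    """Write cluster assignments to the (assumed) events.cluster_id column."""
    conn.executemany(
        "UPDATE events SET cluster_id = ? WHERE id = ?",
        list(zip(labels, event_ids)),
    )


def cluster_shares(conn):
    """Return {cluster_id: fraction of clustered events}."""
    rows = conn.execute(
        "SELECT cluster_id, COUNT(*) FROM events "
        "WHERE cluster_id IS NOT NULL GROUP BY cluster_id"
    ).fetchall()
    total = sum(count for _, count in rows)
    return {cluster: count / total for cluster, count in rows}


# Demo with a hypothetical events table and made-up 2-D "embeddings".
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, message TEXT, cluster_id INTEGER)")
messages = ["rename module", "fix crash", "extract helper", "fix timeout", "inline function"]
conn.executemany("INSERT INTO events (id, message) VALUES (?, ?)", list(enumerate(messages, 1)))
embeddings = [[0.0, 0.0], [10.0, 10.0], [0.0, 1.0], [10.0, 11.0], [1.0, 0.0]]
labels = kmeans(embeddings, k=2)
store_clusters(conn, [1, 2, 3, 4, 5], labels)
shares = cluster_shares(conn)
```

Cluster ids are only numbers here; turning them into labels such as "refactoring" or "debugging" would still need a naming step (manual or LLM-assisted).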

### 2. Task Outcome Tracking
**Priority: High**

Currently: We know when errors occur during execution
Needed: Track whether the *overall task* succeeded

Signals to capture:
- User explicitly says "done", "thanks", "perfect"
- Session ends with commit/PR creation
- User abandons mid-task (long gap, then new topic)
- Explicit frustration signals ("this isn't working", "never mind")

Schema addition:
```sql
ALTER TABLE sessions ADD COLUMN outcome TEXT; -- 'success', 'abandoned', 'frustrated', 'unknown'
ALTER TABLE sessions ADD COLUMN outcome_confidence REAL;
```
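A first-pass heuristic over the signals above might look like the following sketch. The function name `classify_outcome`, the rule ordering, and the confidence numbers are placeholder assumptions meant to show the shape of the `outcome` / `outcome_confidence` pair, not calibrated values.

```python
import re

# Phrases taken from the signal list above; real data would need a broader set.
SUCCESS = re.compile(r"\b(done|thanks|perfect)\b", re.IGNORECASE)
FRUSTRATION = re.compile(r"this isn['’]t working|never mind", re.IGNORECASE)


def classify_outcome(user_messages, ended_with_commit=False, abandoned=False):
    """Return (outcome, confidence) for one session.

    user_messages: user message texts in chronological order.
    ended_with_commit: True if the session ended with a commit or PR.
    abandoned: True if a long gap followed by a new topic was detected.
    """
    closing = user_messages[-3:]  # only the end of the session is inspected
    if any(FRUSTRATION.search(m) for m in closing):
        return ("frustrated", 0.7)
    if ended_with_commit:
        return ("success", 0.9)
    if any(SUCCESS.search(m) for m in closing):
        return ("success", 0.6)
    if abandoned:
        return ("abandoned", 0.5)
    return ("unknown", 0.0)
```

Checking frustration before success is a deliberate choice in this sketch: a closing "never mind, thanks" would otherwise be counted as a success.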

### 3. Time-to-Completion by Task Type
**Priority: Medium**

Track duration from first user message to task completion, segmented by:
- Task type (from semantic clustering)
- Session classification (debugging, development, etc.)
- Project

Would enable: "Debugging sessions take 2.3x longer than feature work"
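As a rough illustration, a comparison like that could come from a grouped query. The `task_type`, `first_message_at`, and `completed_at` columns (epoch seconds) are hypothetical; the RFC does not define them.

```python
import sqlite3


def avg_duration_by_task_type(conn):
    """Return {task_type: average seconds from first user message to completion}."""
    rows = conn.execute(
        """
        SELECT task_type, AVG(completed_at - first_message_at) AS avg_seconds
        FROM sessions
        WHERE completed_at IS NOT NULL
        GROUP BY task_type
        """
    ).fetchall()
    return dict(rows)


def duration_ratio(averages, numerator, denominator):
    """How many times longer `numerator` sessions take than `denominator` ones."""
    return averages[numerator] / averages[denominator]


# Demo data: two debugging sessions (600 s and 1000 s) and one feature session (400 s).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sessions (id TEXT PRIMARY KEY, task_type TEXT, "
    "first_message_at INTEGER, completed_at INTEGER)"
)
conn.executemany(
    "INSERT INTO sessions VALUES (?, ?, ?, ?)",
    [
        ("s1", "debugging", 0, 600),
        ("s2", "debugging", 100, 1100),
        ("s3", "feature", 0, 400),
        ("s4", "feature", 0, None),  # unfinished sessions are excluded
    ],
)
averages = avg_duration_by_task_type(conn)
ratio = duration_ratio(averages, "debugging", "feature")
```

The same query could group by session classification or project instead of `task_type`.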

### 4. Git-Session Correlation
**Priority: Medium**

Currently: Commits ingested but not linked to sessions
Needed: Explicit session→commit relationships

```sql
CREATE TABLE session_commits (
    session_id TEXT,
    commit_sha TEXT,
    time_to_commit_seconds INTEGER,
    PRIMARY KEY (session_id, commit_sha)
);
```

Would enable:
- "Sessions that produce commits vs. sessions that don't"
- "Average time from session start to first commit"
- "Commits per session by project"
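One possible way to populate `session_commits` is a time-window join, sketched below. The `sessions` and `commits` column names (`started_at`, `ended_at`, `committed_at`, `project`, all in epoch seconds) are assumptions about the existing tables, and "same project, committed while the session was open" is only one candidate linking rule.

```python
import sqlite3

# Link a commit to a session when it landed in the same project during the session.
LINK_SQL = """
INSERT OR IGNORE INTO session_commits (session_id, commit_sha, time_to_commit_seconds)
SELECT s.id, c.sha, c.committed_at - s.started_at
FROM sessions AS s
JOIN commits AS c
  ON c.project = s.project
 AND c.committed_at BETWEEN s.started_at AND s.ended_at
"""


def link_commits(conn):
    """Populate session_commits; safe to re-run because of the composite primary key."""
    conn.execute(LINK_SQL)
    conn.commit()


# Demo with hypothetical rows.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE sessions (id TEXT PRIMARY KEY, project TEXT, started_at INTEGER, ended_at INTEGER);
    CREATE TABLE commits (sha TEXT PRIMARY KEY, project TEXT, committed_at INTEGER);
    CREATE TABLE session_commits (
        session_id TEXT,
        commit_sha TEXT,
        time_to_commit_seconds INTEGER,
        PRIMARY KEY (session_id, commit_sha)
    );
    INSERT INTO sessions VALUES ('s1', 'proj', 1000, 2000);
    INSERT INTO commits VALUES ('abc', 'proj', 1500);   -- inside the session window
    INSERT INTO commits VALUES ('def', 'proj', 2500);   -- after the session ended
    INSERT INTO commits VALUES ('ghi', 'other', 1200);  -- different project
    """
)
link_commits(conn)
link_commits(conn)  # second run adds nothing
linked = conn.execute("SELECT * FROM session_commits").fetchall()
```

Commits made shortly after a session ends would be missed by this rule; a grace period on `ended_at` is one option to consider.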

### 5. Context Switch Detection
**Priority: Medium**

Currently: Parallel sessions are visible via `detect_parallel_sessions()`
Needed: Detect mid-task context switches within a session

Signals:
- Topic change in user messages
- Long gap followed by different file/tool patterns
- Explicit "actually, let's do X instead"
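The "long gap followed by different file patterns" signal could be approximated as in this sketch. The `Turn` structure, the 15-minute gap, and the Jaccard overlap threshold are illustrative assumptions; topic-change detection on message text is not covered here.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    timestamp: float   # seconds since epoch of the user message
    files: frozenset   # files touched while handling that message


def jaccard(a, b):
    """Overlap between two file sets: 1.0 means identical, 0.0 means disjoint."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def detect_context_switches(turns, gap_seconds=900, max_overlap=0.2):
    """Return indexes of turns that start a new context.

    A switch is flagged when a turn follows a long gap AND touches mostly
    different files than the previous turn.
    """
    switches = []
    for i in range(1, len(turns)):
        prev, cur = turns[i - 1], turns[i]
        long_gap = cur.timestamp - prev.timestamp >= gap_seconds
        different_files = jaccard(prev.files, cur.files) <= max_overlap
        if long_gap and different_files:
            switches.append(i)
    return switches


turns = [
    Turn(0, frozenset({"a.py"})),
    Turn(60, frozenset({"a.py", "b.py"})),
    Turn(2000, frozenset({"README.md"})),  # long gap, unrelated file
    Turn(2100, frozenset({"README.md"})),
]
switches = detect_context_switches(turns)
```

Requiring both conditions keeps a lunch break on the same files, or a quick detour without a pause, from being flagged.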

### 6. Session Purpose Summarization
**Priority: Medium**

Currently: Infer purpose from tool patterns
Needed: LLM-generated summary stored per session

```sql
ALTER TABLE sessions ADD COLUMN purpose_summary TEXT;
ALTER TABLE sessions ADD COLUMN purpose_generated_at TIMESTAMP;
```

Could run as a background job or on demand via a `summarize_session(session_id)` tool.
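A minimal sketch of such a tool is shown below. The LLM call is deliberately left out: `summarize` is any callable that turns a list of messages into text, and the `events.session_id` / `events.content` columns are assumptions about the existing schema.

```python
import sqlite3
from datetime import datetime, timezone


def summarize_session(conn, session_id, summarize):
    """Generate and store a purpose summary for one session.

    summarize: callable(list[str]) -> str, e.g. a wrapper around an LLM call.
    """
    messages = [
        row[0]
        for row in conn.execute(
            "SELECT content FROM events WHERE session_id = ? ORDER BY id",
            (session_id,),
        )
    ]
    summary = summarize(messages)
    conn.execute(
        "UPDATE sessions SET purpose_summary = ?, purpose_generated_at = ? WHERE id = ?",
        (summary, datetime.now(timezone.utc).isoformat(), session_id),
    )
    conn.commit()
    return summary


# Demo with a stub summarizer instead of a real model.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE sessions (id TEXT PRIMARY KEY, purpose_summary TEXT, purpose_generated_at TIMESTAMP);
    CREATE TABLE events (id INTEGER PRIMARY KEY, session_id TEXT, content TEXT);
    INSERT INTO sessions (id) VALUES ('s1');
    INSERT INTO events (session_id, content) VALUES ('s1', 'fix the login bug');
    INSERT INTO events (session_id, content) VALUES ('s1', 'add a regression test');
    """
)
result = summarize_session(conn, "s1", lambda msgs: f"{len(msgs)} messages, starting with: {msgs[0]}")
stored = conn.execute(
    "SELECT purpose_summary, purpose_generated_at FROM sessions WHERE id = 's1'"
).fetchone()
```

Passing the summarizer in as a parameter keeps the storage logic testable and leaves the model choice open.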

### 7. User Satisfaction Signals
**Priority: Low**

No current way to capture subjective experience. Options:
- Parse sentiment from user messages
- Detect frustration patterns (repeated attempts, undo sequences)
- Eventually: explicit user feedback mechanism
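The "repeated attempts" pattern could be detected without any sentiment model, for instance by looking for runs of identical failing tool calls. The `(tool, target, succeeded)` tuple shape and the threshold of three are assumptions for this sketch.

```python
from itertools import groupby


def max_repeated_failures(tool_calls):
    """Longest run of consecutive identical failing calls.

    tool_calls: ordered iterable of (tool, target, succeeded) tuples.
    """
    longest = 0
    # groupby with no key groups consecutive equal tuples together.
    for (tool, target, succeeded), run in groupby(tool_calls):
        if not succeeded:
            longest = max(longest, sum(1 for _ in run))
    return longest


def looks_frustrating(tool_calls, threshold=3):
    """Flag a session when the same call fails `threshold` or more times in a row."""
    return max_repeated_failures(tool_calls) >= threshold


stuck = [("Edit", "a.py", False)] * 3 + [("Edit", "a.py", True)]
fine = [("Bash", "pytest", False), ("Edit", "a.py", True), ("Bash", "pytest", False)]
```

Here `looks_frustrating(stuck)` is true and `looks_frustrating(fine)` is false, since the two failing `pytest` runs are separated by a successful edit.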

## Implementation Plan

1. **Phase 1**: Task outcome tracking + git-session correlation (high impact, moderate effort)
2. **Phase 2**: Semantic clustering + session summarization (requires embedding model decision)
3. **Phase 3**: Context switch detection + satisfaction signals (refinement)

## Open Questions

1. Should embedding/clustering run locally or use the Claude API?
2. How to handle outcome tracking for sessions that span multiple conversations?
3. Storage implications for embeddings (vector DB vs. SQLite with numpy)?
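To make the SQLite option in question 3 concrete, here is a sketch that stores each embedding as a float32 BLOB. It uses the standard-library `array` module so it runs without dependencies. The `message_embeddings` table is hypothetical, and the sketch assumes the platform's C `float` is 4 bytes.

```python
import sqlite3
from array import array


def pack_embedding(vector):
    """Serialize a list of floats as raw float32 bytes (4 bytes per dimension)."""
    return array("f", vector).tobytes()


def unpack_embedding(blob):
    """Inverse of pack_embedding: raw float32 bytes back to a list of floats."""
    values = array("f")
    values.frombytes(blob)
    return values.tolist()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE message_embeddings (event_id INTEGER PRIMARY KEY, dim INTEGER, vector BLOB)"
)
vector = [0.5, -1.25, 2.0]  # values chosen to be exactly representable in float32
conn.execute(
    "INSERT INTO message_embeddings VALUES (?, ?, ?)",
    (1, len(vector), pack_embedding(vector)),
)
blob = conn.execute("SELECT vector FROM message_embeddings WHERE event_id = 1").fetchone()[0]
restored = unpack_embedding(blob)
```

With numpy, the equivalent round trip would be `np.asarray(v, dtype=np.float32).tobytes()` and `np.frombuffer(blob, dtype=np.float32)`. At 4 bytes per dimension, a 384-dimensional embedding takes 1,536 bytes per message. That is small enough for brute-force similarity at modest scale, but a vector index would be needed for fast nearest-neighbor search.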

## Worklog Reference

The data gaps behind this RFC were identified during the RFC #25 analysis process. See Notion worklog: https://www.notion.so/2db0eedcd74e80838d7eee59515fd439

---

🤖 Generated with [Claude Code](https://claude.ai/code)
