Roadmap: world-class coding-agent analytics — 17 specs in 6 waves

## What this is

A 6-wave roadmap to take StackUnderflow from "great local cost dashboard with self-referential discovery" to **world-class coding-agent analytics + blackbox**. 17 spec issues (#86 through #102), each scoped for a single agent run, with pre-assigned schema slots so migrations don't collide.

## The thesis

Today the project is a passive viewer + a Q&A meta-agent. The leap is:

1. **Outcome attribution that holds up** (PR / CI / static analysis / LLM grading)
2. **Comparative benchmark + mode recommender** (the killer feature: empirical "model X wins on YOUR work at $Y per outcome")
3. **Replay + fork** (the blackbox: rebuild what the model saw, branch from any point)
4. **Active brain** (proactive nudges, multi-device, open exchange)

## Wave structure

Each wave can run ~3-4 agents in parallel on independent specs. The waves themselves run sequentially so the dependency graph holds.

### Wave 1 — Independent foundations (4 agents)

Picks the lowest-hanging fruit; no spec in this wave depends on any other spec.

- #86 — Spec 16: file-risk recommender (S)
- #87 — Spec 17: burn projector v2 (S)
- #88 — Spec 18: mode recommender heuristic v1 (M)
- #89 — Spec 19: skill recommender (M)
- #90 — Spec 13: real-time observability tab (M)
- #91 — Spec 12: open session-schema spec doc (XS)

**Wall-clock estimate:** ~1.5h.

### Wave 2 — Outcome-attribution rails (3 agents)

Brings external signals into the store. Foundation for waves 3 + 5.

- #92 — Spec 20: PR / CI webhook ingest (L)
- #93 — Spec 21: per-session static analysis pass (L)

**Wall-clock estimate:** ~2.5h.

### Wave 3 — Outcome attribution + grading (2 agents, mostly sequential)

Combines wave 2's data into trustworthy attribution + adds the LLM grading dimension.

- #94 — Spec 22: outcome attribution v2 (XL — depends on #92)
- #95 — Spec 23: LLM-graded session quality (M — independent of #94)

**Wall-clock estimate:** ~3-4h.

### Wave 4 — Replay + active surfacing (3 agents)

The blackbox + the proactive nudge.

- #96 — Spec 24: context-window replay (L)
- #97 — Spec 27: active-surfacing meta-agent (L — depends on #86)

**Wall-clock estimate:** ~2.5h.

### Wave 5 — Fork + comparative benchmark (2 agents, sequential)

The two big swings. Each is XL.

- #98 — Spec 25: fork mode (XL — depends on #96)
- #99 — Spec 26: comparative benchmark engine (XL — depends on #93 + #95 + #98 + maintainer rubric design)

**Wall-clock estimate:** ~6-8h. Spec 26 needs a maintainer-written scoring rubric before dispatch.

### Wave 6 — Sensitive / long-tail (sequential, with design pauses)

- #100 — Spec 28: multi-device sync (XL — needs maintainer design call on encryption + conflict resolution)
- #101 — Spec 29: Windows test-fixture port (L — HANDOFF #4 carryover)
- #102 — Spec 30: real-world beta-normalizer fixtures (M — HANDOFF #6 carryover)

**Wall-clock estimate:** open-ended, depends on design pace.
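
The wave ordering above can be sanity-checked mechanically. The sketch below transcribes the issue numbers and the "depends on" notes from the wave lists, then confirms that every dependency is dispatched before the spec that needs it. It assumes the list order inside a wave is the dispatch order, which only matters for wave 5, where #99 depends on #98. The function names are hypothetical and not part of the codebase.

```python
# Wave membership and dependencies, transcribed from the wave lists above.
WAVES = {
    1: [86, 87, 88, 89, 90, 91],
    2: [92, 93],
    3: [94, 95],
    4: [96, 97],
    5: [98, 99],
    6: [100, 101, 102],
}
DEPENDS_ON = {94: [92], 97: [86], 98: [96], 99: [93, 95, 98]}


def dispatch_order(waves):
    """Flatten the waves into one order: wave by wave, list order within a wave."""
    return [issue for wave in sorted(waves) for issue in waves[wave]]


def ordering_violations(waves, depends_on):
    """Return (issue, dependency) pairs where the dependency is not dispatched first."""
    position = {issue: i for i, issue in enumerate(dispatch_order(waves))}
    return [
        (issue, dep)
        for issue, deps in depends_on.items()
        for dep in deps
        if dep not in position or position[dep] >= position[issue]
    ]


violations = ordering_violations(WAVES, DEPENDS_ON)
print(violations)  # an empty list means the wave order respects every dependency
```

If a spec moves between waves during a reassessment, rerunning this check shows whether the move breaks a stated dependency.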

## Schema-version pre-assignment

| Spec | Slot | Tables / columns added |
|---|---|---|
| 16 | v015 (only if needed) | maybe `session_mart.outcome` ALTER |
| 18 | v016 | `mode_recommendations` |
| 20 | v017 | `pr_outcomes`, `ci_runs` |
| 21 | v018 | `static_analysis_findings` |
| 22 | v019 | `commit_session_link` + `pr_outcomes` extensions |
| 23 | v020 | `session_quality_grades` |
| 25 | v021 | `session_forks` + `sessions.is_fork` / `parent_session_id` |
| 26 | v022 | `benchmark_runs`, `benchmark_outcomes` |
| 28 | v023 | `sync_state` |

Specs 12, 13, 17, 19, 24, 27, 29, 30 introduce no schema. Total new schema versions: 9 (v015 through v023).
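
As an illustration, the table can be encoded as a small lookup that an agent or a CI check consults instead of picking a version by hand. This is a sketch under stated assumptions: the mapping is copied from the table above, while `slot_name` and `check_slots` are hypothetical helpers, not existing project code.

```python
# Spec number -> pre-assigned schema version, copied from the table above.
SCHEMA_SLOTS = {16: 15, 18: 16, 20: 17, 21: 18, 22: 19, 23: 20, 25: 21, 26: 22, 28: 23}
# Specs that introduce no schema, per the sentence above.
NO_SCHEMA_SPECS = {12, 13, 17, 19, 24, 27, 29, 30}


def slot_name(spec):
    """Return the slot label (for example 'v019') reserved for a spec."""
    if spec in NO_SCHEMA_SPECS:
        raise ValueError(f"spec {spec} has no schema slot")
    return f"v{SCHEMA_SLOTS[spec]:03d}"


def check_slots(slots):
    """Verify that no two specs share a slot and that the slots leave no gaps."""
    versions = sorted(slots.values())
    unique = len(set(versions)) == len(versions)
    contiguous = versions == list(range(versions[0], versions[-1] + 1))
    return unique and contiguous


print(slot_name(22), check_slots(SCHEMA_SLOTS))
```

A check like this fails as soon as two specs claim the same slot, which is the collision the pre-assignment is meant to prevent.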

## Hard rules every implementing agent must follow

These are duplicated in each spec issue body for safety:

- **DO NOT touch versions** (`__version__.py`, `pyproject.toml`, `package.json`, `package-lock.json`)
- **DO NOT move CHANGELOG `## [N.N.N]` headings** — entries go under `## [Unreleased]` only
- **DO NOT touch `~/.stackunderflow/store.db`** — tests use `tmp_path` / `:memory:`
- **Use the pre-assigned schema slot** in the spec body — do not invent another version number
- **Branch off `main`**, named `feat/<spec-slug>` or `fix/<spec-slug>`
- **DO NOT open a PR** — maintainer handles the merge
- **Preserve ruff baseline (38)** — no new lint errors
- **All tests green before push** — `pytest tests/ -q`, `cd stackunderflow-ui && npm run typecheck && npm run build && node --test tests/services/*.test.ts`
- **See `docs/HANDOFF.md`** for the standing rules
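
The store-isolation rule might look like this in a test module. This is a minimal sketch: `open_store` and the `demo` table are hypothetical stand-ins for whatever the project's store layer actually exposes; only `sqlite3`, `:memory:`, and pytest's `tmp_path` fixture are real.

```python
import sqlite3


def open_store(path):
    """Hypothetical stand-in for the project's store opener."""
    conn = sqlite3.connect(str(path))
    conn.execute("CREATE TABLE IF NOT EXISTS demo (k TEXT PRIMARY KEY, v INTEGER)")
    return conn


def test_in_memory_store():
    # ":memory:" keeps the database in RAM; nothing is written to disk.
    conn = open_store(":memory:")
    conn.execute("INSERT INTO demo VALUES (?, ?)", ("a", 1))
    assert conn.execute("SELECT COUNT(*) FROM demo").fetchone()[0] == 1
    conn.close()


def test_on_disk_store(tmp_path):
    # tmp_path is a per-test temporary directory supplied by pytest,
    # so the file lands there and never under ~/.stackunderflow/.
    db = tmp_path / "store.db"
    conn = open_store(db)
    conn.execute("INSERT INTO demo VALUES (?, ?)", ("a", 1))
    conn.commit()
    conn.close()
    assert db.exists()
    assert db.parent == tmp_path
```

Under pytest both functions are collected automatically; the second only needs the `tmp_path` argument declared to receive the fixture.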

## What to expect

Each wave ships a release (v0.9.0 → v0.14.0 roughly), bundling its specs into one PyPI publish. Total: ~6 releases over 4-7 calendar days at typical agent-orchestration pace, with maintainer review + design calls between waves.

## When to stop / redirect

This is a roadmap, not a contract. After each wave, reassess:
- Did the wave's specs deliver real user value, or just code?
- Did downstream waves' assumptions hold?
- Is something more important emerging that should reprioritize?
