What this is
A 6-wave roadmap to take StackUnderflow from "great local cost dashboard with self-referential discovery" to a world-class coding-agent analytics + blackbox. It comprises 14 spec issues, each scoped for a single agent run, with pre-assigned schema slots so parallel agents don't collide on schema version numbers.
The thesis
Today the project is a passive viewer + a Q&A meta-agent. The leap is:
- Outcome attribution that holds up (PR / CI / static analysis / LLM grading)
- Comparative benchmark + mode recommender (the killer feature: empirical "model X wins on YOUR work at $Y per outcome")
- Replay + fork (the blackbox: rebuild what the model saw, branch from any point)
- Active brain (proactive nudges, multi-device, open exchange)
Wave structure
Each wave can run ~3-4 agents in parallel on independent specs. The waves themselves run sequentially so the dependency graph holds.
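As a rough illustration of that shape (parallel inside a wave, sequential between waves), here is a minimal sketch. The `WAVES` plan, the spec labels, and `run_spec` are all hypothetical placeholders, not the real spec-to-wave assignment or the actual dispatch mechanism.

```python
import asyncio

# Hypothetical plan: wave number -> spec labels (placeholders only).
WAVES = {
    1: ["spec-a", "spec-b"],
    2: ["spec-c"],
}


async def run_spec(spec: str) -> str:
    # Stand-in for dispatching one agent run on one spec.
    await asyncio.sleep(0)
    return f"{spec}: done"


async def run_waves(waves: dict[int, list[str]]) -> list[list[str]]:
    results = []
    for wave in sorted(waves):
        # Specs inside a wave run concurrently; the next wave starts
        # only after every spec in the current wave has finished.
        finished = await asyncio.gather(*(run_spec(s) for s in waves[wave]))
        results.append(list(finished))
    return results


if __name__ == "__main__":
    print(asyncio.run(run_waves(WAVES)))
```

The barrier at the end of each loop iteration is what keeps a later wave from starting on data an earlier wave has not produced yet.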
Wave 1 — Independent foundations (4 agents)
Picks the lowest-hanging fruit; no spec in this wave depends on another.
Wall-clock estimate: ~1.5h.
Wave 2 — Outcome-attribution rails (3 agents)
Brings external signals into the store. Foundation for waves 3 + 5.
Wall-clock estimate: ~2.5h.
Wave 3 — Outcome attribution + grading (2 agents, mostly sequential)
Combines wave 2's data into trustworthy attribution + adds the LLM grading dimension.
Wall-clock estimate: ~3-4h.
Wave 4 — Replay + active surfacing (3 agents)
The blackbox + the proactive nudge.
Wall-clock estimate: ~2.5h.
Wave 5 — Fork + comparative benchmark (2 agents, sequential)
The two big swings. Each is XL.
Wall-clock estimate: ~6-8h. Spec 26 needs a maintainer-written scoring rubric before dispatch.
Wave 6 — Sensitive / long-tail (sequential, with design pauses)
Wall-clock estimate: open-ended, depends on design pace.
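Adding up the per-wave estimates for waves 1 through 5 gives a sense of total agent wall-clock time, excluding wave 6 (open-ended) and any maintainer review between waves. A small sketch of that arithmetic, with the dictionary name chosen here for illustration:

```python
# (low, high) wall-clock hours per wave, copied from the estimates above.
ESTIMATES_H = {
    1: (1.5, 1.5),
    2: (2.5, 2.5),
    3: (3.0, 4.0),
    4: (2.5, 2.5),
    5: (6.0, 8.0),
}


def total_range(estimates: dict[int, tuple[float, float]]) -> tuple[float, float]:
    """Sum the low and high ends of each wave's estimate."""
    low = sum(lo for lo, _ in estimates.values())
    high = sum(hi for _, hi in estimates.values())
    return low, high


if __name__ == "__main__":
    print(total_range(ESTIMATES_H))  # (15.5, 18.5)
```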
Schema-version pre-assignment
| Spec | Slot | Tables / columns added |
| --- | --- | --- |
| 16 | v015 (only if needed) | maybe session_mart.outcome ALTER |
| 18 | v016 | mode_recommendations |
| 20 | v017 | pr_outcomes, ci_runs |
| 21 | v018 | static_analysis_findings |
| 22 | v019 | commit_session_link + pr_outcomes extensions |
| 23 | v020 | session_quality_grades |
| 25 | v021 | session_forks + sessions.is_fork / parent_session_id |
| 26 | v022 | benchmark_runs, benchmark_outcomes |
| 28 | v023 | sync_state |
Specs 12, 13, 17, 19, 24, 27, 29, 30 introduce no schema. Total new schema versions: 9 (v015 through v023).
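One way to make the pre-assignment checkable is to keep the slot map in code and assert that it has no collisions or gaps. The sketch below is an assumption about how that could look; `SLOTS`, `slot_name`, and `check_slots` are illustrative names, not part of the project.

```python
# Spec number -> pre-assigned schema version, taken from the table above.
SLOTS = {16: 15, 18: 16, 20: 17, 21: 18, 22: 19, 23: 20, 25: 21, 26: 22, 28: 23}


def slot_name(spec: int) -> str:
    """Format a spec's slot the way the table writes it, e.g. 'v022'."""
    return f"v{SLOTS[spec]:03d}"


def check_slots(slots: dict[int, int]) -> None:
    """Raise if two specs share a slot or the slots skip a version."""
    versions = sorted(slots.values())
    if len(set(versions)) != len(versions):
        raise ValueError("two specs share a schema slot")
    if versions != list(range(versions[0], versions[-1] + 1)):
        raise ValueError("schema slots are not contiguous")


if __name__ == "__main__":
    check_slots(SLOTS)
    print(len(SLOTS), slot_name(16), slot_name(28))  # 9 v015 v023
```

If spec 16 ends up not needing v015, the contiguity check would have to be relaxed; that choice is left open here.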
Hard rules every implementing agent must follow
These are duplicated in each spec issue body for safety:
- DO NOT touch versions (`__version__.py`, `pyproject.toml`, `package.json`, `package-lock.json`)
- DO NOT move CHANGELOG `## [N.N.N]` headings; entries go under `## [Unreleased]` only
- DO NOT touch `~/.stackunderflow/store.db`; tests use `tmp_path` / `:memory:`
- Use the pre-assigned schema slot in the spec body; do not invent another version number
- Branch off `main`, named `feat/<spec-slug>` or `fix/<spec-slug>`
- DO NOT open a PR; the maintainer handles the merge
- Preserve ruff baseline (38): no new lint errors
- All tests green before push: `pytest tests/ -q`, `cd stackunderflow-ui && npm run typecheck && npm run build && node --test tests/services/*.test.ts`
- See `docs/HANDOFF.md` for the standing rules
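To illustrate the store rule, here is a minimal sketch of tests that never touch the real database. `open_store` and the `sync_state` columns are hypothetical stand-ins, not the project's actual store API or schema; the only real pieces are the standard-library `sqlite3` module and pytest's `tmp_path` fixture.

```python
import sqlite3
from pathlib import Path


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    # Hypothetical helper: opens a store at the given path and creates
    # one illustrative table. The column layout is an assumption.
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sync_state (key TEXT PRIMARY KEY, value TEXT)"
    )
    return conn


def test_store_in_memory() -> None:
    # ":memory:" keeps the database entirely in RAM.
    conn = open_store()
    conn.execute("INSERT INTO sync_state VALUES (?, ?)", ("cursor", "0"))
    row = conn.execute(
        "SELECT value FROM sync_state WHERE key = ?", ("cursor",)
    ).fetchone()
    conn.close()
    assert row == ("0",)


def test_store_on_disk(tmp_path: Path) -> None:
    # pytest injects tmp_path, a per-test temporary directory.
    db = tmp_path / "store.db"
    conn = open_store(str(db))
    conn.close()
    assert db.exists()
```

Both tests pass the path in explicitly, so nothing in them can resolve to `~/.stackunderflow/store.db`.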
What to expect
Each wave ships a release (v0.9.0 → v0.14.0 roughly), bundling its specs into one PyPI publish. Total: ~6 releases over 4-7 calendar days at typical agent-orchestration pace, with maintainer review + design calls between waves.
When to stop / redirect
This is a roadmap, not a contract. After each wave, reassess:
- Did the wave's specs deliver real user value, or just code?
- Did downstream waves' assumptions hold?
- Is something more important emerging that should reprioritize?