Goal
For every session that touched code, compute pre/post static-analysis deltas for cyclomatic complexity, test coverage, lint findings, and type completeness (coverage is deferred past v1; see Hard parts). Surface "sessions where the agent reduced complexity by 20%+" vs "increased it by 20%+": the first metric that lets you say agent X is actually better than agent Y on YOUR code.
Why now
Outcome attribution today is git-correlation. That tells you the code shipped; it doesn't tell you whether the code is good. Static analysis is the cheapest objective signal.
Schema
v018 — static_analysis_findings table:
CREATE TABLE IF NOT EXISTS static_analysis_findings (
    id INTEGER PRIMARY KEY,
    session_id TEXT NOT NULL,
    file_path TEXT NOT NULL,
    language TEXT NOT NULL, -- 'python' | 'typescript' | 'go' | ...
    ts TEXT NOT NULL, -- when the analysis ran
    metric TEXT NOT NULL, -- 'complexity' | 'coverage' | 'lint_count' | 'type_completeness'
    pre_value REAL, -- before the session's edits
    post_value REAL, -- after
    delta REAL, -- post - pre (NULL when one side is unobservable)
    details_json TEXT, -- per-metric extras (e.g. lint rule ids)
    UNIQUE (session_id, file_path, metric)
);
CREATE INDEX IF NOT EXISTS idx_sa_session ON static_analysis_findings(session_id);
CREATE INDEX IF NOT EXISTS idx_sa_file ON static_analysis_findings(file_path);
Additive, IF NOT EXISTS-guarded.
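A minimal sketch of how the migration and an idempotent write could look. It assumes the store is SQLite (3.24+ for upsert), which the column types suggest but the spec does not state. The names apply_migration and upsert_finding are hypothetical, not existing project functions.

```python
import sqlite3

# Assumption: SQLite backend. DDL mirrors the spec's table, with guards.
MIGRATION_V018 = """
CREATE TABLE IF NOT EXISTS static_analysis_findings (
    id INTEGER PRIMARY KEY,
    session_id TEXT NOT NULL,
    file_path TEXT NOT NULL,
    language TEXT NOT NULL,
    ts TEXT NOT NULL,
    metric TEXT NOT NULL,
    pre_value REAL,
    post_value REAL,
    delta REAL,
    details_json TEXT,
    UNIQUE (session_id, file_path, metric)
);
CREATE INDEX IF NOT EXISTS idx_sa_session ON static_analysis_findings(session_id);
CREATE INDEX IF NOT EXISTS idx_sa_file ON static_analysis_findings(file_path);
"""


def apply_migration(conn):
    """Safe to run repeatedly because every statement is guarded."""
    conn.executescript(MIGRATION_V018)


def upsert_finding(conn, session_id, file_path, language, ts, metric,
                   pre, post, details_json=None):
    """Insert or overwrite the row keyed by (session_id, file_path, metric)."""
    # delta stays NULL when either side is unobservable.
    delta = None if pre is None or post is None else post - pre
    conn.execute(
        """
        INSERT INTO static_analysis_findings
            (session_id, file_path, language, ts, metric,
             pre_value, post_value, delta, details_json)
        VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)
        ON CONFLICT (session_id, file_path, metric) DO UPDATE SET
            language = excluded.language,
            ts = excluded.ts,
            pre_value = excluded.pre_value,
            post_value = excluded.post_value,
            delta = excluded.delta,
            details_json = excluded.details_json
        """,
        (session_id, file_path, language, ts, metric,
         pre, post, delta, details_json),
    )
    conn.commit()
```

Keying the upsert on the UNIQUE constraint is one way to get the backfill idempotency the Tests section asks for: re-running analysis overwrites the existing row instead of adding a second one.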
User-visible surface
- CLI: stackunderflow analyze session <id> runs analysis on a single session's touched files (using Playback v2 to reconstruct pre/post states).
- CLI: stackunderflow analyze backfill [--since 30d] [--limit N] runs analysis on every recent session lacking findings.
- API: GET /api/static-analysis/session/{id} returns findings for a session.
- Meta-agent tool: get_session_quality(session_id) returns a structured quality summary.
- UI: Quality column on Sessions tab + a "Quality" panel on the per-session detail view.
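The spec does not define how a session's findings roll up into the "reduced complexity by 20%+" bucket. One possible reading, sketched below, sums complexity across files where both sides were observed and compares the relative change to the threshold. The Finding dataclass, the classify_complexity name, and the "neutral"/"unknown" labels are all illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Finding:
    """Hypothetical in-memory view of one static_analysis_findings row."""
    file_path: str
    metric: str
    pre_value: Optional[float]
    post_value: Optional[float]


def classify_complexity(findings: Iterable[Finding], threshold: float = 0.20) -> str:
    """Bucket a session by relative change in total cyclomatic complexity."""
    pre_total = 0.0
    post_total = 0.0
    for f in findings:
        # Skip other metrics and rows where one side is unobservable.
        if f.metric != "complexity" or f.pre_value is None or f.post_value is None:
            continue
        pre_total += f.pre_value
        post_total += f.post_value
    if pre_total == 0:
        return "unknown"  # nothing comparable, e.g. only newly created files
    change = (post_total - pre_total) / pre_total
    if change <= -threshold:
        return "reduced"
    if change >= threshold:
        return "increased"
    return "neutral"
```

Summing before dividing weights large files more heavily than averaging per-file percentages would; either choice is defensible, and the spec leaves it open.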
Implementation plan
- v018 migration.
- New module stackunderflow/services/static_analysis/ with one analyzer per language:
  - python_analyzer.py: radon for complexity (widely used, MIT, optional dep), coverage.py parse, ruff --output-format=json for lint, mypy --no-error-summary for type completeness.
  - typescript_analyzer.py: tsc --noEmit --pretty false for type errors, eslint --format json for lint. Complexity: defer (no clean cross-toolchain answer).
  - go_analyzer.py: go vet, gocyclo, go test -coverprofile. Defer if go not on PATH.
- Coordinator in services/static_analysis/runner.py: reconstruct pre/post via Playback v2's reconstruct_fs_at(at_pre) / reconstruct_fs_at(at_post), write to a tmpdir, run the analyzer, persist deltas.
- Optional dep: add [analysis] extra in pyproject.toml with radon, coverage, mypy. Check for binaries (tsc, eslint, go) at runtime; skip cleanly if missing.
- CLI + API + meta-agent wiring.
- Backfill batch with concurrency cap (analyzers fork shell processes, so cap at min(4, cpu_count)).
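The runtime binary check and the concurrency cap from the plan above could be sketched as follows. REQUIRED_BINARIES, missing_binaries, max_workers, and run_backfill are hypothetical names, and the per-language binary lists only include the tools the plan names for runtime checks. A thread pool is assumed to be adequate because the heavy work happens in the child processes the analyzers spawn.

```python
import os
import shutil
from concurrent.futures import ThreadPoolExecutor

# Assumption: binaries each analyzer needs on PATH, per the plan above.
REQUIRED_BINARIES = {
    "typescript": ("tsc", "eslint"),
    "go": ("go",),
}


def missing_binaries(language, which=shutil.which):
    """Return required binaries not found on PATH (empty list means runnable)."""
    return [b for b in REQUIRED_BINARIES.get(language, ()) if which(b) is None]


def max_workers():
    """Concurrency cap from the plan: min(4, cpu_count)."""
    # os.cpu_count() may return None, so fall back to a single worker.
    return min(4, os.cpu_count() or 1)


def run_backfill(session_ids, analyze_session):
    """Run analyze_session over each id with bounded concurrency, in order."""
    with ThreadPoolExecutor(max_workers=max_workers()) as pool:
        return list(pool.map(analyze_session, session_ids))
```

An analyzer would call missing_binaries first and return an empty result when the list is non-empty, which is what "skip cleanly" amounts to here. The injectable which parameter keeps the missing-binary test independent of the machine's real PATH.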
Tests
- Each analyzer: synthetic file with known complexity/coverage/lint result, assert metric.
- Coordinator: pre + post fixture, assert delta computation.
- Missing-binary handling: TS analyzer skips cleanly when tsc not on PATH.
- Backfill: idempotent (re-running doesn't duplicate findings).
Hard parts
- Cross-language is genuinely hard. Python / TS / Go cover ~80% of usage; the long tail (Rust, Ruby, Java, Swift, etc.) is per-language adapter work. Document explicitly which languages are supported in v1.
- "pre" state for a session sometimes doesn't exist (the file was created in the session). Handle: pre_value = NULL, delta = NULL, details_json = {"reason": "file_created_in_session"}.
- Some analyzers are slow (mypy on a big project can be 30s+). Use timeouts (default 60s per file) and cache results.
- Coverage requires running tests — that's a SEPARATE deliverable, defer (Spec 22 sub-task). v1 handles complexity + lint + types only.
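A rough sketch of two of the cases above: the missing "pre" state and the per-file timeout. build_finding and run_analyzer are hypothetical helper names, and returning None on timeout is one assumed way to signal "skip this file". Caching is left out.

```python
import json
import subprocess
import sys
from typing import Optional, Sequence


def build_finding(pre_value: Optional[float], post_value: Optional[float],
                  created_in_session: bool = False) -> dict:
    """Compute delta, leaving it NULL when either side is unobservable."""
    details = None
    if pre_value is None and created_in_session:
        # Reason string taken from the spec.
        details = json.dumps({"reason": "file_created_in_session"})
    if pre_value is None or post_value is None:
        delta = None
    else:
        delta = post_value - pre_value
    return {
        "pre_value": pre_value,
        "post_value": post_value,
        "delta": delta,
        "details_json": details,
    }


def run_analyzer(cmd: Sequence[str], timeout_s: float = 60.0) -> Optional[str]:
    """Run an analyzer command; return stdout, or None if it timed out."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing lingers.
        return None
    return proc.stdout


if __name__ == "__main__":
    print(build_finding(None, 4.0, created_in_session=True))
    print(run_analyzer([sys.executable, "-c", "print('ok')"]))
```

The return code is deliberately not checked here, since linters and type checkers commonly exit non-zero when they report findings.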
Out of scope
- Test-running for coverage measurement (separate spec — needs sandboxing).
- Rust / Java / Swift / Ruby analyzers.
- Real-time analysis as the agent edits (defer; this is offline backfill).
Dependencies
- None blocking. Playback v2 (shipped) provides pre/post reconstruction.
- Consumed by Spec 22 (outcome attribution v2) and Spec 26 (comparative benchmark).
Estimated effort
Size L — single agent, ~2-2.5 hr.
Hard rules
- DO NOT touch versions / CHANGELOG headings.
- Pre-assigned schema slot: v018.
- Branch: feat/static-analysis-pass off main.
- New optional dep [analysis] in pyproject.toml is allowed (similar to [embeddings]).