0xMMA · Apr 1, 2026 · Mar 8, 2026 · Mar 9, 2026
diff --git a/.claude/commands/refine-requirements.md b/.claude/commands/refine-requirements.md
@@ -0,0 +1,165 @@
+# Refine Requirements — Interactive Requirements Engineering
+
+Take a plan file and refine its requirements through structured exploration, interactive Q&A with ASCII mockups, and play-pretend walkthroughs that surface gaps naturally.
+
+**Input:** `$ARGUMENTS` is the path to the plan file (e.g. `PYRAMIDIZE.md`, `docs/plan.md`).
+
+If `$ARGUMENTS` is empty, use `AskUserQuestion` to ask: "Which plan file should I refine? (relative path from project root)"
+
+---
+
+## Rules
+
+- **Max 3-4 questions per round.** Never wall-of-text the user.
+- **Never assume — ask when ambiguous.** A wrong assumption costs more than a question.
+- **Never copy v1 blindly.** If there's prior art, question whether old decisions still apply.
+- **Always show, don't tell.** Every question with a UI or layout implication gets an ASCII mockup. Abstract descriptions are not acceptable — make it concrete.
+- **Stay in character during Phase 3.** Narrate as if the feature exists. Break character only to surface a gap, then resume.
+- **Requirements and design only.** Never ask about implementation details (JSON parsing, HTTP clients, DI wiring, test scaffolding) — those are the developer's domain.
+- **Capture decisions immediately.** After each round, note what was decided before moving on.
+
+---
+
+## Phase 1 — Deep Exploration (silent)
+
+Do all of this silently. Do NOT output anything to the user yet.
+
+1. Read the plan file at `$ARGUMENTS`.
+2. Read `CLAUDE.md` and any architecture docs it references (`.claude/docs/architecture.md`, `.claude/docs/testing.md`, etc.).
+3. Search the codebase for existing implementations related to the plan's feature area — look at the code, not just file names. Check for archived or previous versions if referenced.
+4. Read any related Angular components, Go services, and shared utilities that the feature will touch or extend.
+5. Build a mental model of:
+   - What exists today that this feature builds on
+   - What constraints the current architecture imposes
+   - What patterns the codebase already uses (and should be followed)
+   - Where the plan has gaps, ambiguities, or implicit assumptions
+
+When done, output a single short message: "I've explored the codebase and the plan. Starting requirements review — Phase 2."
+
+---
+
+## Phase 2 — Structured Requirements Review
+
+Go through the plan section by section. For each section that has decisions to make or ambiguities to resolve:
+
+1. Present 3-4 questions (never more per round).
+2. Every question MUST include:
+   - **2-4 concrete options** (labelled A, B, C, D)
+   - **ASCII preview mockup** for any option that affects layout, UI, or user-visible behaviour
+   - **Your recommendation** with a brief rationale (1 sentence)
+3. Use `AskUserQuestion` to collect the user's choices.
+4. After each round, summarize decisions made in a compact list before moving to the next section.
+
+Example question format:
+```
+**Q2: How should the error state appear?**
+
+Option A — Inline below the action area:
+┌─────────────────────────────────────┐
+│  [Action Button]                    │
+│  ❌ Step 2/3 failed: timeout.       │
+│     [Retry] [Settings →]           │
+└─────────────────────────────────────┘
+
+Option B — Toast notification:
+┌─────────────────────────────────────┐
+│  [Action Button]                    │
+│                    ┌──────────────┐ │
+│                    │ ❌ Timeout   │ │
+│                    │ [Retry]      │ │
+│                    └──────────────┘ │
+└─────────────────────────────────────┘
+
+Recommendation: A — keeps error context near the action.
+```
+
+Continue until all sections have been reviewed. Then announce: "Requirements review complete. Moving to play-pretend walkthrough — Phase 3."
+
+---
+
+## Phase 3 — Play-Pretend Walkthrough
+
+Walk through the feature as if it's already built and shipping. You narrate in present tense. The user is the product owner / architect — you ask them requirements and design questions, never code-level implementation details.
+
+### How to narrate
+
+Speak as if you're a QA tester or product reviewer using the finished feature for the first time:
+
+> "I open the app and navigate to the Pyramidize page. The left panel shows a doc type selector set to AUTO, a style dropdown, and a relationship dropdown. Below them is a large Pyramidize button with a Ctrl+Enter hint. The canvas area is empty — I see a placeholder with ghost text showing a sample pyramidized email..."
+
+### When to pause and ask
+
+Pause the narration whenever:
+- The spec doesn't say what should happen → surface the gap
+- Two requirements seem to conflict → ask which takes priority
+- A behaviour feels wrong from a UX perspective → propose an alternative
+- An edge case isn't covered → ask for the desired behaviour
+
+When pausing, break character briefly:
+
+> "**Gap found:** The spec doesn't say what happens when the user clicks Pyramidize again while the canvas already has edits from a previous run. Should it:
+> A) Overwrite canvasText with the new result (edits lost)
+> B) Ask 'Re-pyramidize from original? Your canvas edits will be lost' with [Yes] [No]
+> C) Create a new trace entry and overwrite silently (edits recoverable via trace log)
+>
+> Recommendation: C — edits are never truly lost thanks to the trace log."
+
+Then use `AskUserQuestion` to get the decision, note it, and resume narrating.
+
+### Minimum scenarios to walk through
+
+Cover ALL of these angles (not just UI walkthroughs):
+
+1. **Happy path** — the golden scenario, start to finish
+2. **First-time user** — no config, no presets, empty state
+3. **Returning user** — presets exist, muscle memory, what's faster now
+4. **Error / timeout** — API fails mid-pipeline, what does the user see and do
+5. **Interruption / cancel** — user cancels during processing, closes mid-edit
+6. **Edge cases** — empty input, very long input, mixed languages, rapid repeated actions
+7. **State & lifecycle** — navigate away and back, minimize to tray, close window
+8. **Hotkey vs manual** — differences in flow, what's available vs hidden
+9. **Architecture choices** — new service vs extending existing, separate route vs same page, shared state
+10. **Scope boundaries** — "this could grow into X — is X in scope or deferred?"
+
+When all scenarios are covered, announce: "Play-pretend walkthrough complete. Moving to gap resolution — Phase 4."
+
+---
+
+## Phase 4 — Gap Resolution
+
+1. Compile all remaining gaps, open questions, and ambiguities discovered during Phases 2 and 3.
+2. Present them as a numbered list, grouped by theme (UI, behaviour, state, error handling, scope).
+3. Ask in batches of 3-4 using `AskUserQuestion`.
+4. For each gap, either:
+   - Resolve it with the user's decision, OR
+   - Mark it explicitly as **out of scope** with a reason
+
+Continue until all gaps are resolved or marked out-of-scope. Then announce: "All gaps resolved. Updating the plan — Phase 5."
+
+---
+
+## Phase 5 — Document Update
+
+1. Update the plan file (`$ARGUMENTS`) with all decisions made during this session:
+   - Add/modify requirement entries
+   - Update scoping decisions table
+   - Update out-of-scope table
+   - Add any new sections needed (e.g., new user stories, new NFRs)
+   - Add a "Last updated" timestamp
+2. Present a completion checklist:
+
+```
+Requirements Refinement Complete
+────────────────────────────────
+✅ Plan explored and understood
+✅ N questions resolved across M rounds
+✅ K scenarios walked through
+✅ J gaps resolved, L marked out-of-scope
+✅ Plan file updated: [filename]
+
+Want to do one more round? (e.g., "walk through the admin scenario" or "what about offline mode?")
+```
+
+3. Use `AskUserQuestion` to ask if the user wants one more round.
+   - If yes: return to Phase 3 or Phase 4 as appropriate, then repeat Phase 5.
+   - If no: end the session.
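
The "max 3-4 questions per round" rule and the Phase 4 instruction to ask in batches can be pictured as a simple chunking step. The following is a minimal Go sketch, not part of the command file; `batchQuestions` and the sample gap strings are hypothetical names chosen for illustration.

```go
package main

import "fmt"

// batchQuestions splits questions into rounds of at most maxPerRound items,
// preserving their order. Hypothetical helper, used only to illustrate the
// "never more than 3-4 questions per round" rule.
func batchQuestions(questions []string, maxPerRound int) [][]string {
	if maxPerRound < 1 {
		maxPerRound = 1
	}
	var rounds [][]string
	for start := 0; start < len(questions); start += maxPerRound {
		end := start + maxPerRound
		if end > len(questions) {
			end = len(questions)
		}
		rounds = append(rounds, questions[start:end])
	}
	return rounds
}

func main() {
	// Sample gaps, already grouped by theme (illustrative data only).
	gaps := []string{
		"UI: empty state",
		"UI: error banner",
		"State: re-run with edits",
		"Errors: timeout mid-pipeline",
		"Scope: offline mode",
	}
	for i, round := range batchQuestions(gaps, 4) {
		fmt.Printf("Round %d: %v\n", i+1, round)
	}
}
```

With five gaps and a limit of four, the sketch yields two rounds, so the last question is deferred rather than squeezed into an oversized round.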
diff --git a/.gitignore b/.gitignore
@@ -5,5 +5,7 @@ frontend/dist
 frontend/node_modules
 build/linux/appimage/build
 build/windows/nsis/MicrosoftEdgeWebview2Setup.exe
+test-data/eval-runs/
+.superpowers/
 .claude/settings.local.json
-frontend/test-results
+frontend/test-results
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -6,7 +6,8 @@
 
 **Key directories:**
 - `main.go` — Wails entry point, service registration, event loop
-- `internal/features/` — vertical slices: settings, shortcut, clipboard, tray, enhance, welcome, logger, updater
+- `internal/features/` — vertical slices: settings, shortcut, clipboard, tray, enhance, welcome, logger, updater, pyramidize
+- `internal/cli/` — headless CLI commands (`-fix`, `-pyramidize`), dispatched from `main.go` before Wails boots
 - `internal/app/wire.go` + `wire_gen.go` — Wire DI (never edit `wire_gen.go` manually)
 - `frontend/src/app/core/wails.service.ts` — sole RPC bridge; all Go calls go through here
 - `frontend/src/app/features/` — folder-per-component, subcomponents in nested folders → See `.claude/rules/architecture.md#component-structure`
@@ -15,6 +16,33 @@
 
 **Build / run / test:** `cd frontend && npm run build` (Angular), `go build -o bin/KeyLint .`, `wails3 dev` (hot-reload). Tests: `cd frontend && npm test` (Vitest), `go test ./internal/...` (Go), `cd frontend && npx playwright test` (E2E, needs `ng serve` on :4200). → See `.claude/rules/workflows.md` for post-change steps.
 
+**CLI commands (headless, no GUI):**
+```
+./bin/KeyLint -fix "text to fix"                       # silent grammar fix
+./bin/KeyLint -fix -f input.txt                        # fix from file
+cat input.txt | ./bin/KeyLint -fix                     # fix from stdin
+./bin/KeyLint -pyramidize -type email -f input.md      # pyramidize from file
+./bin/KeyLint -pyramidize --json -f input.md           # JSON output with quality score
+./bin/KeyLint -pyramidize --provider claude --model claude-sonnet-4-6 -f input.md
+./bin/KeyLint -pyramidize --variant 1 -f input.md     # use prompt variant v1
+./bin/KeyLint -pyramidize --variant 2 -f input.md     # use prompt variant v2 (0=latest)
+```
+
+**Evaluation tests (real API calls — NOT run by default):**
+```
+# Requires .env with ANTHROPIC_API_KEY (or OPENAI_API_KEY) in project root.
+# Uses //go:build eval tag — never included in normal `go test` runs.
+# Results are logged to test-data/eval-runs/<timestamp>/ with summary.json.
+go test -tags eval ./internal/features/pyramidize/ -v -timeout 300s
+EVAL_PROVIDER=claude go test -tags eval ./internal/features/pyramidize/ -v -timeout 600s
+EVAL_PROVIDER=claude EVAL_MODEL=claude-sonnet-4-6 go test -tags eval ...
+./scripts/eval.sh                                      # automated eval with summary
+./scripts/eval.sh --provider claude --model claude-sonnet-4-6
+./scripts/eval.sh --variant 1                          # compare v1 vs v2 prompts
+EVAL_VARIANT=2 go test -tags eval ...                  # variant via env var
+./scripts/eval-human.sh                                # interactive human review mode
+```
+
 ## Why (The Context)
 
 KeyLint is a desktop app that fixes/enhances clipboard text via AI (OpenAI, Anthropic, Ollama, Bedrock). A global hotkey silently grabs clipboard text, enhances it, and writes it back. The main UI provides manual fix and advanced enhancement modes.
@@ -23,6 +51,8 @@ KeyLint is a desktop app that fixes/enhances clipboard text via AI (OpenAI, Anth
 - AI API calls go through the Go backend (`internal/features/enhance/service.go:1`) — WebKit2GTK on Linux blocks external HTTPS fetch from the webview
 - API keys stored in OS keyring (`github.com/zalando/go-keyring`); env vars take priority over keyring — see `internal/features/settings/service.go`
 - PrimeNG Stepper was replaced with a custom `@switch`-based wizard (`welcome-wizard.component.ts`) because PrimeNG v21 StepPanel animations broke DOM visibility
+- CLI mode (`-fix`, `-pyramidize`) is dispatched in `main.go` before Wails boots and uses the same service layer with manual wiring (no Wire/Wails). Prompts are identical between CLI and GUI; output formatting is the caller's concern.
+- Evaluation tests use `//go:build eval` build tag to isolate from normal `go test` runs. They make real API calls and write results to `test-data/eval-runs/<timestamp>/`.
 
 **Component flow:** `main.go` → Wire DI initializes services → Wails registers them → `wails3 generate bindings` generates JS → `wails.service.ts` wraps bindings → Angular components call `WailsService`
 
@@ -54,3 +84,5 @@ KeyLint is a desktop app that fixes/enhances clipboard text via AI (OpenAI, Anth
 **Rules (active steering):** `.claude/rules/architecture.md`, `.claude/rules/testing.md`, `.claude/rules/workflows.md`
 
 **Reference docs:** `.claude/docs/architecture.md` (service wiring, RPC bridge, platform differences), `.claude/docs/testing.md` (detailed patterns), `.claude/docs/versioning.md` (release pipeline, CI)
+
+**Pyramidize docs:** `docs/pyramidize/` (requirements, ADR, quality status, NLP/LangChain research, UX roadmap)
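
The CLAUDE.md changes above say that CLI mode is dispatched from `main.go` before Wails boots. A minimal Go sketch of that ordering follows, assuming the standard `flag` package. The function name `cliMode`, the flag types, and the rule that `-fix` and `-pyramidize` are mutually exclusive are all assumptions made for illustration; the real `internal/cli/` code is not shown in this diff. Only three of the documented flags are declared here, so others such as `--json` or `-type` would be rejected by this sketch.

```go
package main

import (
	"flag"
	"fmt"
	"io"
	"os"
)

// cliMode reports which headless command the arguments request.
// It returns "" when no CLI flag is present, meaning the GUI should start.
// Hypothetical helper: the real dispatch logic may differ.
func cliMode(args []string) (string, error) {
	fs := flag.NewFlagSet("KeyLint", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // keep parse errors out of stderr in this sketch
	fix := fs.Bool("fix", false, "silent grammar fix")
	pyramidize := fs.Bool("pyramidize", false, "pyramidize the input")
	fs.String("f", "", "read input from file")
	if err := fs.Parse(args); err != nil {
		return "", err
	}
	switch {
	case *fix && *pyramidize:
		// Assumption: the two commands cannot be combined.
		return "", fmt.Errorf("-fix and -pyramidize cannot be combined")
	case *fix:
		return "fix", nil
	case *pyramidize:
		return "pyramidize", nil
	}
	return "", nil
}

func main() {
	mode, err := cliMode(os.Args[1:])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(2)
	}
	if mode != "" {
		// A real CLI would wire the services manually and run them here.
		fmt.Println("headless mode:", mode)
		return
	}
	// Only reached when no CLI flag was given.
	fmt.Println("no CLI flag given; the GUI would start here")
}
```

Checking for a CLI mode first means a headless run never touches GUI initialisation, which matches the "no Wire/Wails" note in the diff.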
diff --git a/README.md b/README.md
@@ -101,6 +101,27 @@ go test ./internal/...
 npx playwright test
 ```
 
+### Evaluation (Pyramidize Quality)
+
+The pyramidize feature has an automated eval pipeline that measures output quality against baseline test data using deterministic checks (structure, info coverage, hallucination detection) and LLM-as-judge scoring.
+
+```bash
+# Setup: create .env in project root with your API key
+echo "ANTHROPIC_API_KEY=sk-ant-..." > .env
+
+# Run eval (uses //go:build eval tag — isolated from normal tests)
+EVAL_PROVIDER=claude go test -tags eval ./internal/features/pyramidize/ -v -timeout 600s
+
+# Or use the wrapper script (supports --provider / --model flags)
+./scripts/eval.sh --provider claude
+
+# Interactive human review mode (side-by-side comparison)
+./scripts/eval-human.sh --provider claude
+
+# Results are logged to test-data/eval-runs/<timestamp>/
+# Each run produces: summary.json, results.jsonl, samples/
+```
+
 ### Wire DI Regeneration
 
 ```bash
@@ -116,7 +137,8 @@ KeyLint/
 ├── main.go                         # Entry point — CLI flags, Wails app setup
 ├── internal/
 │   ├── app/                        # Wire DI (wire.go + wire_gen.go)
-│   └── features/                   # Vertical slices: settings, shortcut, clipboard, tray, enhance, welcome
+│   ├── cli/                        # Headless CLI commands (-fix, -pyramidize)
+│   └── features/                   # Vertical slices: settings, shortcut, clipboard, tray, enhance, welcome, pyramidize
 ├── frontend/
 │   ├── src/app/
 │   │   ├── core/                   # WailsService (bindings bridge), MessageBus, guards

diff --git a/TODO.md b/TODO.md
@@ -1,96 +1,33 @@
 # KeyLint — Feature Parity TODO
 
-Audit of gaps between v1 (Rust/Tauri) and v2 (Go/Wails).
-Focus: the two core features — **Silent Fix** and **Pyramidize**.
-
----
-
-## System Tray & Window Lifecycle
-
-- [x] **Minimize to tray on close** — `ApplicationShouldTerminateAfterLastWindowClosed: false` set in
-      `main.go`; window-close event calls `window.Hide()`.
-
-- [x] **Tray icon click / double-click brings window to front** — `tray.OnClick` and
-      `tray.OnDoubleClick` handlers added in `internal/features/tray/service.go`.
+Remaining gaps between v1 (Rust/Tauri) and v2 (Go/Wails).
+For Pyramidize-specific status, see `docs/pyramidize/`.
 
 ---
 
 ## Silent Fix
 
-- [x] **Auto-paste to source app** — `PasteToForeground` implemented on both platforms:
-      Windows via Win32 `SendInput` (`paste_windows.go`), Linux via `xdotool` (`paste_linux.go`).
-
+- [x] **Auto-paste to source app** — `PasteToForeground` on both platforms
 - [ ] **Linux hotkey** — currently a no-op stub (`service_linux.go`). Wire up a real global
       shortcut (e.g. `github.com/robotn/gohook` or `xbindkeys` integration).
-
-- [ ] **HTML clipboard support** — detect foreground app (Outlook, Word, LibreOffice, etc.),
+- [ ] **HTML clipboard support** — detect foreground app (Outlook, Word, LibreOffice),
       convert Markdown output to HTML, write both CF_HTML and CF_TEXT to clipboard.
-      v1 had `HtmlClipboardService` with app-name regex matching.
-
----
-
-## Version & Updates
-
-- [x] **Version + update indicator in main nav** *(v4.0.0-alpha finding)* — display the app version
-      in small text at the bottom-left of the shell nav alongside a single icon that lights up when
-      an update is available. Clicking the icon (or version text) should navigate to Settings → About.
-      The version string is already available via `wails.getVersion()`; update status via
-      `wails.checkForUpdate()`. Currently only visible in Settings → About.
 
 ---
 
-## Pyramidize (Advanced Mode)
-
-The current `TextEnhancementComponent` is a single-pass generic fix with no pyramidal logic.
-The entire v1 `PyramidalAgentService` pipeline needs to be rebuilt in Go + Angular.
-
-### Pipeline (Generate → Specialists → QA)
-
-- [ ] **Document type detection** — LLM classifies input as EMAIL / WIKI / POWERPOINT / MEMO
-      (or user selects manually). Returns `{type, language, confidence}`.
-
-- [ ] **Oneshot foundation generator** — document-type-specific prompt templates (German + English)
-      that convert raw text into a structured document: subject + headers + body.
-      Output: `{subject, headers[], fullDocument, documentType, language}`.
-
-- [ ] **Parallel specialist agents** — run concurrently after the foundation step:
-  - Subject Line Specialist — validates format + information density
-  - Header Structure Specialist — MECE principle + pyramidal hierarchy
-  - Information Completeness Specialist — detects info loss vs original
-  - Style & Language Specialist — tone, consistency, professional polish
-  - Each returns a confidence score (0.0–1.0).
-
-- [ ] **Integration coordinator** — selectively applies specialist improvements where
-      confidence > 0.7; preserves baseline on low-confidence suggestions.
-
-- [ ] **Quality assurance check** — final pass returns
-      `{informationLoss[], accuracyIssues[], missingElements[], overallScore, passed}`.
-
-### UI Controls (missing from v2)
-
-- [ ] Document type selector (AUTO / EMAIL / WIKI / POWERPOINT / MEMO)
-- [ ] Communication style selector (concise / detailed / persuasive / neutral /
-      diplomatic / direct / casual / professional)
-- [ ] Relationship level selector (formal / professional / casual / friendly)
-- [ ] Custom instructions textarea
-- [ ] Markdown rendering for output (replace readonly `<textarea>`)
-- [ ] Editable output (allow manual tweaks after AI generation)
-- [ ] Tab view: Draft vs Original
-
-### Clipboard integration
+## Pyramidize
 
-- [ ] **HTML clipboard paste-back** — same as Silent Fix: convert Markdown output to HTML
-      and paste to source app with proper MIME types.
+- [x] Full pipeline (detect -> foundation -> self-QA -> refine)
+- [x] CLI mode (`-fix`, `-pyramidize`)
+- [x] Evaluation framework (deterministic + LLM-as-judge)
+- [x] All UI controls (doc type, style, relationship, custom instructions, canvas, trace log)
+- [ ] **HTML clipboard paste-back** — convert Markdown to HTML for Outlook/Teams
+- [ ] **Parallel specialist agents** — v1 had 4 independent specialists; currently simplified as self-eval. See `docs/pyramidize/adr-001-pipeline-architecture.md`.
 
 ---
 
-## Priority Order
+## Platform
 
-1. ~~**Auto-paste to source app**~~ ✓ done
-2. ~~**Minimize to tray on close**~~ ✓ done
-3. ~~**Tray icon click brings window to front**~~ ✓ done
-4. ~~**Version + update indicator in nav**~~ ✓ done
-5. Pyramidize pipeline (core value proposition)
-6. Pyramidize UI controls
-7. Linux hotkey
-8. HTML clipboard support
+- [x] Minimize to tray on close
+- [x] Tray icon click brings window to front
+- [x] Version + update indicator in nav
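
The still-open "Parallel specialist agents" item refers to the v1 design described in the lines this diff removes from TODO.md: specialists run concurrently after the foundation step, each returns a confidence score between 0.0 and 1.0, and an integration step applies improvements only where confidence > 0.7. A minimal Go sketch of that shape follows. The types `Suggestion` and `Specialist`, the functions `runSpecialists` and `accept`, and the stub specialists are hypothetical; this is not the v1 or v2 implementation.

```go
package main

import (
	"fmt"
	"sync"
)

// Suggestion is a hypothetical specialist result.
type Suggestion struct {
	Specialist string
	Text       string
	Confidence float64 // 0.0 to 1.0
}

// Specialist reviews a draft and proposes an improvement.
type Specialist func(draft string) Suggestion

// runSpecialists runs all specialists concurrently and returns their
// results in the same order as the input slice.
func runSpecialists(draft string, specialists []Specialist) []Suggestion {
	results := make([]Suggestion, len(specialists))
	var wg sync.WaitGroup
	for i, s := range specialists {
		wg.Add(1)
		go func(i int, s Specialist) {
			defer wg.Done()
			results[i] = s(draft) // each goroutine writes to its own slot
		}(i, s)
	}
	wg.Wait()
	return results
}

// accept keeps only suggestions whose confidence exceeds the threshold.
func accept(all []Suggestion, threshold float64) []Suggestion {
	var kept []Suggestion
	for _, s := range all {
		if s.Confidence > threshold {
			kept = append(kept, s)
		}
	}
	return kept
}

func main() {
	// Stub specialists with fixed scores; real ones would call an LLM.
	specialists := []Specialist{
		func(d string) Suggestion { return Suggestion{"subject", "tighter subject line", 0.9} },
		func(d string) Suggestion { return Suggestion{"style", "more formal tone", 0.4} },
	}
	for _, s := range accept(runSpecialists("draft text", specialists), 0.7) {
		fmt.Printf("apply %s: %s (%.1f)\n", s.Specialist, s.Text, s.Confidence)
	}
}
```

Writing each result into a preallocated slot avoids a mutex and keeps the output order deterministic, which makes the later integration step easier to test.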