Merged
148 changes: 56 additions & 92 deletions memory-bank/activeContext.md
@@ -1,104 +1,68 @@
# Active Context

**Last Updated**: 2026-02-17
**Current Phase**: v0.5 + Export Feature
**Next Action**: Merge v0.5.0 to main. Export to Markdown & PDF feature implemented.

## Next Task: Model Selection Controls + Provider Updates

### Context
Users can't control which models participate in consensus. `select_proposer()` picks highest `output_cost_per_mtok`, `select_challengers()` picks next-costliest. Problems: no user control (`ConsensusConfig.panel` exists but unused), Google catalog outdated, Perplexity should be challengers-only (search-grounded), Anthropic missing `claude-sonnet-4-6`.

### Changes (7 steps)

1. **Update provider model catalogs**
- `src/duh/providers/google.py:34-67` — Gemini 3 GA + early-access models (web search for latest)
- `src/duh/providers/anthropic.py:36-61` — Add `claude-sonnet-4-6`
- `src/duh/providers/perplexity.py:35-60` — Verify current model IDs/pricing

2. **Add `proposer_eligible` flag to ModelInfo**
- `src/duh/providers/base.py:28-45` — Add `proposer_eligible: bool = True`
- `src/duh/providers/perplexity.py` — Set `proposer_eligible=False` (challengers only, user decision)

3. **Wire `ConsensusConfig.panel` + update selection functions**
- `src/duh/consensus/handlers.py:185-202` (`select_proposer`) — Accept optional `panel`, filter to `proposer_eligible=True`
- `src/duh/consensus/handlers.py:322-356` (`select_challengers`) — Accept optional `panel`
- `src/duh/cli/app.py:236-246`, `src/duh/api/routes/ws.py:108,128`, `src/duh/api/routes/ask.py` — Pass panel

4. **Add CLI flags**: `--proposer MODEL_REF`, `--challengers MODEL_REF,MODEL_REF`, `--panel MODEL_REF,...`
- `src/duh/cli/app.py` (ask command)

5. **Add to REST API**: Optional `panel`, `proposer`, `challengers` fields in ask request body
- `src/duh/api/routes/ask.py`

6. **Tests**: Update `test_propose_handler.py`, `test_challenge_handler.py` for panel filtering + proposer_eligible. Test CLI flags. Fix any tests with hardcoded model catalogs.

7. **Documentation + CLI help**
- `docs/cli/ask.md` — Document `--proposer`, `--challengers`, `--panel` flags
- `docs/api-reference.md` — Document panel/proposer/challengers in `/api/ask`
- `docs/concepts/providers-and-models.md` — Update model lists, model selection explanation
- `docs/getting-started/configuration.md` — Document `[consensus] panel` config
- `docs/reference/config-reference.md` — Add panel, proposer_strategy fields
- `src/duh/cli/app.py` — Update Click help strings for new flags
- `docs/index.md` — Update feature list if needed
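
The selection behavior planned in steps 2 and 3 can be sketched as follows. `ModelInfo`, `proposer_eligible`, `select_proposer`, and `select_challengers` mirror the names above, but the simplified signatures and filtering logic here are assumptions, not the actual `handlers.py` implementation.

```python
# Sketch of panel-aware model selection (assumed shapes, not the real code).
from dataclasses import dataclass

@dataclass
class ModelInfo:
    ref: str                      # e.g. "anthropic/claude-sonnet-4-6"
    output_cost_per_mtok: float   # USD per million output tokens
    proposer_eligible: bool = True

def select_proposer(models, panel=None):
    """Costliest proposer-eligible model, restricted to `panel` if given."""
    pool = [m for m in models if m.proposer_eligible]
    if panel:
        pool = [m for m in pool if m.ref in panel]
    if not pool:
        raise ValueError("no proposer-eligible model in panel")
    return max(pool, key=lambda m: m.output_cost_per_mtok)

def select_challengers(models, proposer, n=2, panel=None):
    """Next-costliest models after the proposer; search-grounded models
    (proposer_eligible=False) remain eligible as challengers."""
    pool = [m for m in models if m.ref != proposer.ref]
    if panel:
        pool = [m for m in pool if m.ref in panel]
    return sorted(pool, key=lambda m: m.output_cost_per_mtok, reverse=True)[:n]
```

Under this sketch, a Perplexity model with `proposer_eligible=False` can never propose but still ranks normally among challengers, and an explicit `panel` narrows both pools.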

### Current model cost ranking (for reference)
| Model | output_cost ($/Mtok) | Provider |
|-------|------------|----------|
| Opus 4.6 | $25.00 | anthropic |
| Sonar Pro | $15.00 | perplexity |
| Sonnet 4.5 | $15.00 | anthropic |
| GPT-5.2 | $14.00 | openai |
| Gemini 3 Pro | $12.00 | google |
| Gemini 2.5 Pro | $10.00 | google |
| Mistral Medium | $8.10 | mistral |
| o3 | $8.00 | openai |
| Sonar Deep Research | $8.00 | perplexity |
| Mistral Large | $6.00 | mistral |
| Haiku 4.5 | $5.00 | anthropic |
**Last Updated**: 2026-02-18
**Current Phase**: Epistemic Confidence (Phase A) — on branch `epistemic-confidence-phase-a`
**Next Action**: Commit, push, create PR to merge to main.

## What Just Shipped: Epistemic Confidence Phase A

### Core Change
Confidence scoring is now **epistemic** — it reflects inherent uncertainty of the question domain, not just challenge quality.

**Before**: `confidence = _compute_confidence(challenges)` — measured rigor only (0.5–1.0 based on sycophancy ratio).
**After**: Two separate scores:
- **Rigor** (renamed from old confidence) — how genuine the challenges were (0.5–1.0)
- **Confidence** — `min(domain_cap(intent), rigor)` — rigor clamped by question type ceiling

### Domain Caps
| Intent | Cap | Rationale |
|--------|-----|-----------|
| factual | 0.95 | Verifiable answers, near-certain |
| technical | 0.90 | Strong consensus possible |
| creative | 0.85 | Subjective, multiple valid answers |
| judgment | 0.80 | Requires weighing trade-offs |
| strategic | 0.70 | Inherent future uncertainty |
| unknown/None | 0.85 | Default conservative cap |
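
The cap table and formula above reduce to a few lines. `DOMAIN_CAPS` matches the constant named elsewhere in these notes; `domain_cap` and `epistemic_confidence` are illustrative helpers, not the exact `handlers.py` code.

```python
# Minimal sketch of epistemic confidence: rigor clamped by a per-intent ceiling.
DOMAIN_CAPS = {
    "factual": 0.95,
    "technical": 0.90,
    "creative": 0.85,
    "judgment": 0.80,
    "strategic": 0.70,
}

def domain_cap(intent):
    # Unknown or missing intent falls back to the conservative default cap.
    return DOMAIN_CAPS.get(intent, 0.85)

def epistemic_confidence(rigor, intent):
    """rigor is in [0.5, 1.0] (challenge quality); the result never
    exceeds the ceiling for the question's domain."""
    return min(domain_cap(intent), rigor)
```

So a strategic question with perfect rigor reports 0.70, while a factual question with mediocre rigor (say 0.6) reports 0.6: whichever signal is lower wins.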

### Files Changed (47 files, +997, -230)
**New files:**
- `src/duh/calibration.py` — ECE (Expected Calibration Error) computation
- `src/duh/memory/migrations.py` — SQLite schema migration (adds rigor column)
- `tests/unit/test_calibration.py` — 15 calibration tests
- `tests/unit/test_confidence_scoring.py` — 20 epistemic confidence tests
- `tests/unit/test_cli_calibration.py` — 4 CLI calibration tests
- `web/src/components/calibration/CalibrationDashboard.tsx` — Calibration viz
- `web/src/pages/CalibrationPage.tsx` — Calibration page
- `web/src/stores/calibration.ts` — Calibration Zustand store

**Modified across full stack:**
- `consensus/handlers.py` — Renamed `_compute_confidence` → `_compute_rigor`, added `_domain_cap()`, `DOMAIN_CAPS`, epistemic formula
- `consensus/machine.py` — Added `rigor` to ConsensusContext, RoundResult
- `consensus/scheduler.py` — Propagates rigor through subtask results
- `consensus/synthesis.py` — Averages rigor across subtask results
- `consensus/voting.py` — Added rigor to VoteResult, VotingAggregation
- `memory/models.py` — Added `rigor` column to Decision ORM
- `memory/repository.py` — Accepts `rigor` param in `save_decision()`
- `memory/context.py` — Shows rigor in context builder output
- `cli/app.py` — All output paths show rigor; new `duh calibration` command; PDF export enhanced
- `cli/display.py` — `show_commit()` and `show_final_decision()` show rigor
- `api/routes/crud.py` — `GET /api/calibration` endpoint; rigor in decision space
- `api/routes/ask.py`, `ws.py`, `threads.py` — Propagate rigor
- `mcp/server.py` — Propagates rigor
- Frontend: ConfidenceMeter, ConsensusComplete, ConsensusPanel, ThreadDetail, TurnCard, ExportMenu, Sidebar, DecisionCloud, stores updated

---

## Current State

- **v0.5 + Export feature on branch `v0.5.0`.** All v0.5 tasks done + export feature added.
- **6 providers shipping**: Anthropic (3 models), OpenAI (3 models), Google (4 models), Mistral (4 models), Perplexity (3 models) — 17 total.
- **1539 Python unit/load tests + 122 Vitest tests** (1661 total), ruff clean.
- **~60 Python source files + 67 frontend source files** (~127 total).
- REST API, WebSocket streaming, MCP server, Python client library, web UI all built.
- Multi-user auth (JWT + RBAC), PostgreSQL support, Prometheus metrics, backup/restore, Playwright E2E.
- CLI commands: `duh ask`, `duh recall`, `duh threads`, `duh show`, `duh models`, `duh cost`, `duh serve`, `duh mcp`, `duh batch`, `duh export`, `duh feedback`, `duh backup`, `duh restore`, `duh user-create`, `duh user-list`.
- Export: `duh export <id> --format pdf/markdown --content full/decision --no-dissent -o file`
- Docs: production-deployment.md, monitoring.md, authentication.md added.
- MkDocs docs site: https://msitarzewski.github.io/duh/
- GitHub repo: https://github.com/msitarzewski/duh
- **Branch `epistemic-confidence-phase-a`** — all changes uncommitted, ready to commit.
- **1586 Python tests + 126 Vitest tests** (1712 total), ruff clean, mypy strict clean.
- **~62 Python source files + 70 frontend source files** (~132 total).
- All previous features intact (v0.1–v0.5 + export).

## v0.5 Delivered

**Theme**: Production hardening, multi-user, enterprise readiness.
**18 tasks across 7 phases** — all complete.

### What Shipped
- User accounts + JWT auth + RBAC (admin/contributor/viewer) — `api/auth.py`, `api/rbac.py`, `models.py:User`
- PostgreSQL support (asyncpg) with connection pooling (`pool_pre_ping`, compound indexes)
- Perplexity provider adapter (6th provider, search-grounded) — `providers/perplexity.py`
- Prometheus metrics (`/api/metrics`) + extended health checks (`/api/health/detailed`)
- Backup/restore CLI (`duh backup`, `duh restore`) with SQLite copy + JSON export/import
- Playwright E2E browser tests (`web/e2e/`)
- Per-user + per-provider rate limiting (middleware keys by user_id > api_key > IP)
- Production deployment documentation (3 new guides)
- 26 multi-user integration tests + 12 load tests (latency, concurrency, rate limiting)
- Alembic migration `005_v05_users.py` (users table, user_id FKs on threads/decisions/api_keys)
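
The rate-limiting precedence noted above (user_id over api_key over IP) can be sketched as a key-derivation helper; the function name and signature are hypothetical, not the middleware's actual API.

```python
# Illustrative rate-limit key precedence: authenticated user wins over
# API key, which wins over falling back to the client IP.
def rate_limit_key(user_id=None, api_key=None, client_ip="0.0.0.0"):
    if user_id is not None:
        return f"user:{user_id}"
    if api_key is not None:
        return f"key:{api_key}"
    return f"ip:{client_ip}"
```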
## Next Task: Model Selection Controls + Provider Updates

### New Source Files (v0.5)
- `src/duh/api/auth.py` — JWT authentication endpoints
- `src/duh/api/rbac.py` — Role-based access control
- `src/duh/api/metrics.py` — Prometheus metrics endpoint
- `src/duh/api/health.py` — Extended health checks
- `src/duh/memory/backup.py` — Backup/restore utilities
- `src/duh/providers/perplexity.py` — Perplexity provider adapter
- `alembic/versions/005_v05_users.py` — User migration
- `docs/guides/production-deployment.md`, `authentication.md`, `monitoring.md`
Deferred from before Phase A. See `progress.md` for details.

## Open Questions (Still Unresolved)

32 changes: 31 additions & 1 deletion memory-bank/decisions.md
@@ -1,6 +1,6 @@
# Architectural Decisions

**Last Updated**: 2026-02-17
**Last Updated**: 2026-02-18

---

@@ -324,3 +324,33 @@
- Remove `create_all` entirely — breaks in-memory test fixtures that don't run alembic
**Consequences**: Tests continue to work (in-memory SQLite still uses `create_all`). Production databases must run `alembic upgrade head` after code updates. This was already the expected workflow but is now enforced.
**References**: `src/duh/cli/app.py:101-104`

---

## 2026-02-18: Epistemic Confidence — Separate Rigor from Confidence

**Status**: Approved
**Context**: The original `_compute_confidence()` in `handlers.py` measured challenge quality (ratio of genuine vs sycophantic challenges), producing a score in [0.5, 1.0]. This was misleading: a factual question ("What is the capital of France?") and a strategic question ("Will AI replace software engineers by 2035?") could both score 1.0 confidence if all challenges were genuine. But inherently uncertain questions should never report near-certain confidence.
**Decision**: Split into two metrics:
- **Rigor** (renamed from old confidence): measures challenge quality, [0.5, 1.0]
- **Confidence** (epistemic): `min(domain_cap(intent), rigor)` — rigor clamped by a per-domain ceiling based on question intent (factual=0.95, technical=0.90, creative=0.85, judgment=0.80, strategic=0.70, default=0.85).
**Alternatives**:
- Single blended score (simpler, but hides the two distinct signals)
- User-configurable caps (more flexible, but adds UX complexity without clear benefit)
- LLM-estimated confidence (model judges own uncertainty — unreliable, circular)
**Consequences**: Confidence scores are now more honest. Strategic questions max out at 70% even with perfect rigor. Rigor is preserved as a separate signal for calibration analysis. Requires `rigor` column added to Decision model. Full-stack change: ORM, handlers, CLI, API, WebSocket, MCP, frontend all updated.
**References**: `src/duh/consensus/handlers.py:641-670`, `src/duh/calibration.py`

---

## 2026-02-18: Lightweight SQLite Migrations (Not Alembic)

**Status**: Approved
**Context**: Adding the `rigor` column to the `decisions` table requires a migration for existing file-based SQLite databases. Alembic handles PostgreSQL migrations, but for SQLite (the default local dev DB), running `alembic upgrade head` is a friction point for casual users.
**Decision**: Created `src/duh/memory/migrations.py` with `ensure_schema()` that runs on startup for file-based SQLite only. Uses `PRAGMA table_info()` to detect missing columns and `ALTER TABLE` to add them. In-memory SQLite uses `create_all` (unchanged). PostgreSQL uses Alembic (unchanged).
**Alternatives**:
- Alembic-only (requires users to run migration command)
- create_all for all databases (can't alter existing tables)
- Manual migration instructions in docs (user friction)
**Consequences**: File-based SQLite databases auto-migrate on startup. Zero friction for local users. PostgreSQL still requires `alembic upgrade head`. Lightweight and self-contained.
**References**: `src/duh/memory/migrations.py`, `src/duh/cli/app.py:107-110`
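
A self-contained sketch of the approach this ADR describes, using stdlib `sqlite3`. The real `ensure_schema()` in `memory/migrations.py` is wired into engine startup and likely differs in detail; this shows only the PRAGMA-detect-then-ALTER pattern.

```python
# Lightweight SQLite migration sketch: detect a missing column with
# PRAGMA table_info, add it with ALTER TABLE. Safe to run repeatedly.
import sqlite3

def ensure_schema(db_path: str) -> None:
    """Add the rigor column to decisions if an older database lacks it."""
    conn = sqlite3.connect(db_path)
    try:
        cols = {row[1] for row in conn.execute("PRAGMA table_info(decisions)")}
        if "rigor" not in cols:
            conn.execute("ALTER TABLE decisions ADD COLUMN rigor REAL")
            conn.commit()
    finally:
        conn.close()
```

Because the column check runs first, the call is idempotent: a database that already has `rigor` is left untouched, which is what makes running it on every startup cheap.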
21 changes: 19 additions & 2 deletions memory-bank/progress.md
@@ -1,10 +1,26 @@
# Progress

**Last Updated**: 2026-02-17
**Last Updated**: 2026-02-18

---

## Current State: v0.5 COMPLETE — Production Hardening & Multi-User
## Current State: Epistemic Confidence Phase A COMPLETE

### Epistemic Confidence Phase A

- **Renamed `_compute_confidence` → `_compute_rigor`** — old "confidence" measured challenge quality, now called "rigor"
- **Added `rigor` field** to Decision ORM model, ConsensusContext, RoundResult, SubtaskResult, VoteResult, VotingAggregation, SynthesisResult
- **Domain caps** — confidence capped by question intent: factual (0.95), technical (0.90), creative (0.85), judgment (0.80), strategic (0.70), default (0.85)
- **Epistemic formula**: `confidence = min(domain_cap(intent), rigor)` — rigor clamped by domain ceiling
- **Calibration module** — `src/duh/calibration.py` computes ECE (Expected Calibration Error) from decisions with outcomes
- **`duh calibration` CLI command** — shows calibration analysis with bucket breakdown
- **`GET /api/calibration` endpoint** — serves calibration data with category/date filters
- **Calibration frontend** — CalibrationDashboard, CalibrationPage, calibration Zustand store
- **SQLite migration** — `src/duh/memory/migrations.py` adds rigor column on startup for file-based SQLite
- **Full-stack propagation** — rigor shown in CLI, API, WebSocket, MCP, frontend across all views
- **Enhanced PDF export** — research-paper quality: header/footer, TOC, provider callouts, confidence meter, Unicode TTF
- 1586 Python tests + 126 Vitest tests (1712 total), ruff clean, mypy strict clean
- New files: calibration.py, migrations.py, test_calibration.py, test_confidence_scoring.py, test_cli_calibration.py, CalibrationDashboard.tsx, CalibrationPage.tsx, calibration.ts

### v0.5 Additions

@@ -151,3 +167,4 @@ Phase 0 benchmark framework — fully functional, pilot-tested on 5 questions.
| 2026-02-17 | v0.5 T14-T18 (Phase 7: Ship) — multi-user integration tests, load tests, docs, migration finalized, version bump | Done |
| 2026-02-17 | v0.5.0 — "It Scales" | **Complete** |
| 2026-02-17 | Export to Markdown & PDF (CLI + API + Web UI) | Done |
| 2026-02-18 | Epistemic Confidence Phase A (rigor + domain caps + calibration) | Done |
37 changes: 37 additions & 0 deletions memory-bank/tasks/2026-02/README.md
@@ -462,3 +462,40 @@
- Manual override classes: `.theme-dark` / `.theme-light` on any ancestor element
- Light mode code block overrides in `animations.css`
- Variables: backgrounds (5), text (3), primary accent, semantic colors (3), borders (3), glass (2), layout (3), typography (1)

---

## Epistemic Confidence Phase A — "Honest Confidence"

### 2026-02-18: Epistemic Confidence Scoring
- Renamed `_compute_confidence()` → `_compute_rigor()` — old metric now properly named
- Added `DOMAIN_CAPS` dict and `_domain_cap(intent)` lookup
- New formula: `confidence = min(domain_cap(intent), rigor)`
- Domain caps: factual (0.95), technical (0.90), creative (0.85), judgment (0.80), strategic (0.70), default (0.85)
- `handle_commit()` now always attempts taxonomy classification to get intent for capping
- Files: `src/duh/consensus/handlers.py`

### 2026-02-18: Rigor Field Propagation (Full Stack)
- Added `rigor: float` to Decision ORM, ConsensusContext, RoundResult, SubtaskResult, VoteResult, VotingAggregation, SynthesisResult
- Updated save_decision(), scheduler, synthesis, voting to propagate rigor
- Updated all CLI outputs (ask, recall, show, export JSON/markdown/PDF)
- Updated API responses (crud, ask, ws, threads) and MCP server
- Updated display (show_commit, show_final_decision)
- Updated context builder to show rigor alongside confidence
- Frontend: ConfidenceMeter, ConsensusComplete, ConsensusPanel, ThreadDetail, TurnCard, ExportMenu, DecisionCloud, stores
- Files: 47 files changed, +997 insertions, -230 deletions

### 2026-02-18: SQLite Schema Migration
- Created `src/duh/memory/migrations.py` — `ensure_schema()` adds rigor column on startup
- Runs for file-based SQLite only (PRAGMA table_info check → ALTER TABLE)
- In-memory SQLite: create_all handles it. PostgreSQL: Alembic handles it.
- Wired into `_create_db()` in `cli/app.py`

### 2026-02-18: Calibration Module + CLI + API + Frontend
- Created `src/duh/calibration.py` — `compute_calibration()` buckets decisions by confidence, computes ECE
- `CalibrationBucket` and `CalibrationResult` dataclasses
- `duh calibration [--category CAT]` CLI command
- `GET /api/calibration` endpoint with category/since/until filters
- Frontend: CalibrationDashboard (metric cards + bar chart + bucket table), CalibrationPage, calibration Zustand store
- Tests: 15 calibration tests, 20 confidence scoring tests, 4 CLI calibration tests
- **Total: 1586 Python + 126 Vitest = 1712 tests**
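
The bucketing-plus-ECE computation described above can be sketched as follows. The function and its `(confidence, outcome)` input shape are assumptions for illustration; the real `compute_calibration()` returns `CalibrationBucket`/`CalibrationResult` dataclasses rather than a bare float.

```python
# Illustrative Expected Calibration Error: bucket decisions by predicted
# confidence, compare each bucket's mean confidence to its observed
# accuracy, and average the gaps weighted by bucket size.
def expected_calibration_error(decisions, n_buckets=10):
    """decisions: iterable of (confidence, outcome), outcome in {0, 1}."""
    buckets = [[] for _ in range(n_buckets)]
    for conf, outcome in decisions:
        idx = min(int(conf * n_buckets), n_buckets - 1)  # conf=1.0 -> last bucket
        buckets[idx].append((conf, outcome))
    total = sum(len(b) for b in buckets)
    if total == 0:
        return 0.0
    ece = 0.0
    for b in buckets:
        if not b:
            continue
        mean_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(o for _, o in b) / len(b)
        ece += (len(b) / total) * abs(mean_conf - accuracy)
    return ece
```

A perfectly calibrated system (70%-confident decisions correct 70% of the time) scores 0; the gap grows as stated confidence drifts from observed accuracy.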
4 changes: 2 additions & 2 deletions memory-bank/toc.md
@@ -3,8 +3,8 @@
## Core Files
- [projectbrief.md](./projectbrief.md) — Vision, tenets, architecture, build sequence
- [techContext.md](./techContext.md) — Tech stack decisions with rationale (Python, Docker, SQLAlchemy, frontend, tools, etc.)
- [decisions.md](./decisions.md) — Architectural decisions with context, alternatives, and consequences (18 ADRs)
- [activeContext.md](./activeContext.md) — Current state, v0.5 complete, ready to merge to main
- [decisions.md](./decisions.md) — Architectural decisions with context, alternatives, and consequences (20 ADRs)
- [activeContext.md](./activeContext.md) — Current state, epistemic confidence Phase A complete
- [progress.md](./progress.md) — Milestone tracking, what's built, what's next
- [competitive-landscape.md](./competitive-landscape.md) — Research on existing tools, frameworks, and academic work
- [quick-start.md](./quick-start.md) — Session entry point, v0.5 complete, key file references