msitarzewski · Mar 7, 2026
diff --git a/.env.example b/.env.example
@@ -0,0 +1,14 @@
+# duh environment variables
+# Copy to .env and fill in values. Never commit .env to git.
+
+# JWT secret for auth tokens (required in production; auto-generated in dev if empty)
+DUH_JWT_SECRET=
+
+# Mail (SMTP) configuration for password reset emails
+DUH_MAIL_HOST=localhost
+DUH_MAIL_PORT=1025
+DUH_MAIL_USERNAME=
+DUH_MAIL_PASSWORD=
+DUH_MAIL_ENCRYPTION=
+DUH_MAIL_FROM_ADDRESS=noreply@example.com
+DUH_MAIL_FROM_NAME=duh
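
The comment above `DUH_JWT_SECRET` says the secret is required in production and auto-generated in dev when empty. A minimal sketch of that rule follows. The function name `resolve_jwt_secret`, the `production` flag, and the choice to raise `RuntimeError` are assumptions for illustration, not duh's actual startup code.

```python
import secrets


def resolve_jwt_secret(env: dict, production: bool) -> str:
    """Return the JWT secret: explicit value if set, random value in dev only."""
    secret = env.get("DUH_JWT_SECRET", "").strip()
    if secret:
        return secret
    if production:
        # Assumed failure mode: refuse to run without an explicit secret.
        raise RuntimeError("DUH_JWT_SECRET must be set in production")
    # Dev fallback: 32 random bytes rendered as 64 hex characters,
    # comparable in strength to `openssl rand -hex 32`.
    return secrets.token_hex(32)
```

Passing `os.environ` as `env` would wire this to the variables in `.env.example`. A secret generated this way changes on every restart, so tokens issued earlier would stop validating, which is acceptable only in dev.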
diff --git a/.gitignore b/.gitignore
@@ -24,6 +24,7 @@ ENV/
 # Environment
 .env
 .env.*
+!.env.example
 
 # Phase 0 results
 results/

diff --git a/AGENTS.md b/AGENTS.md
@@ -1,6 +1,6 @@
 # AGENTS.md
 
-**Version**: 2.1 (2025-10-25) | **Compatibility**: Claude, Cursor, Copilot, Cline, Aider, all AGENTS.md-compatible tools
+**Version**: 2.2 (2026-03-04) | **Compatibility**: Claude, Cursor, Copilot, Cline, Aider, all AGENTS.md-compatible tools
 **Status**: Canonical single-file guide for AI-assisted development
 
 ---
@@ -9,6 +9,7 @@
 
 1. [Compliance & Core Rules](#1-compliance--core-rules)
 2. [Session Startup](#2-session-startup)
+   - [Compaction Protocol](#compaction-protocol-mid-session-context-preservation)
 3. [Memory Bank](#3-memory-bank)
 4. [State Machine](#4-state-machine)
 5. [Task Contract & Budgets](#5-task-contract--budgets)
@@ -106,6 +107,45 @@ Append-only JSONL format:
 {"timestamp":"2025-10-25T11:00:00Z","session_id":"uuid","event":"approval_requested","state":"APPROVAL"}
 ```
 
+### Compaction Protocol (Mid-Session Context Preservation)
+
+Compaction (context compression) can happen at any time — triggered by the system automatically, by the user via `/compact`, or by platform-level context management. **The agent does not control compaction timing and may not get advance notice.** Therefore, state persistence must be continuous, not deferred to a pre-compaction moment.
+
+#### Continuous State Persistence (At Every State Transition)
+
+At each state transition (`PLAN → BUILD → DIFF → QA → APPROVAL → APPLY → DOCS`), persist the following to the Memory Bank:
+
+1. **State machine position**: Update `activeContext.md` with current state, substate, and working context
+2. **Task progress**: Append current status to `tasks/YYYY-MM/README.md` with `[IN-PROGRESS]` tag
+3. **Decisions**: Append any new architectural decisions to `decisions.md`
+4. **Operational log**: Append the transition to the operational log:
+   ```json
+   {"timestamp":"...","session_id":"uuid","event":"state_transition","from":"PLAN","to":"BUILD"}
+   ```
+5. **Loose context**: Capture any information that exists only in conversation (user preferences, verbal requirements, pending questions) into `activeContext.md`
+
+This ensures that when compaction occurs — without warning — the Memory Bank already reflects the latest state.
+
+#### After Compaction (Recovery)
+
+When context has been compressed (detected by loss of earlier conversation detail, or after `/compact`):
+
+1. Re-enter **Session Startup** (Section 2) using **Fast Track** mode — the Memory Bank was just updated via continuous persistence, so full discovery is unnecessary
+2. Confirm state machine position from `activeContext.md`
+3. Resume from saved state — do not restart the current task from scratch
+4. Output recovery confirmation:
+   ```
+   COMPACTION RECOVERY: Resumed [STATE] for [task name]
+   Context restored from: activeContext.md, tasks/YYYY-MM/README.md
+   ```
+
+#### Rules
+
+- State persistence happens at every transition, not "before compaction" — you cannot rely on advance notice
+- After detecting compaction, always re-read Memory Bank before taking any action
+- If the current state is `APPROVAL` or `DIFF`, the diff summary should already be in `activeContext.md` from the transition save
+- Compaction does not reset budgets — carry forward cycle/token/minute counts from the operational log
+
 ---
 
 ## 3. Memory Bank
@@ -577,7 +617,7 @@ Create outline for approval. After approval, do work. Do not document until I ap
 2. **Task** (current task): Files being modified, direct dependencies, related tests
 3. **Reference** (on-demand): Arch patterns, similar implementations, historical decisions
 
-**Context Rotation**: After each state transition, drop Task Context, reload only what's needed for next state. Keep Core Context persistent.
+**Context Rotation**: After each state transition, drop Task Context, reload only what's needed for next state. Keep Core Context persistent. State is persisted to Memory Bank at every transition per **Compaction Protocol** (Section 2), so recovery after compaction only requires re-reading the Memory Bank.
 
 **Parallel Execution**:
 ```
@@ -896,7 +936,7 @@ Stuck? → Cycles ≥3?
 | Issue | Symptoms | Resolution |
 |-------|----------|------------|
 | **Loop** | Same diff multiple times, QA fails repeatedly, no progress after 3+ cycles | Check budgets → Load more MB → Clarify requirements → Check environment → Agent swap |
-| **Context Exceeded** | Token limit approaching, slow/truncated responses, forgetting earlier info | Rotate context (drop Task, reload essentials) → Focused mode (MB summaries only) → Break into subtasks → Agent swap |
+| **Context Exceeded** | Token limit approaching, slow/truncated responses, forgetting earlier info | State already persisted via **Compaction Protocol** (Section 2) → Rotate context (drop Task, reload essentials) → Focused mode (MB summaries only) → Break into subtasks → Agent swap |
 | **CI ≠ Local** | QA passes, CI fails | Compare environments → Verify dependency versions → Check timing/concurrency → Check state cleanup → Document waiver if CI issue |
 | **Security Fail** | Security checklist incomplete, sensitive data exposed, auth/authz bypassed | Never bypass → Return to BUILD → Fix all issues → Re-test → Document pattern if new |
 

diff --git a/README.md b/README.md
@@ -21,19 +21,22 @@ duh ask "What database should I use for a new SaaS product?"
 
 ## Features
 
-- **Multi-model consensus** -- Claude, GPT, Gemini, and Mistral debate. Sycophantic challenges are detected and flagged.
+- **Multi-model consensus** -- Claude, GPT, Gemini, Mistral, and Perplexity debate. Sycophantic challenges are detected and flagged.
+- **Web UI** -- Real-time consensus streaming, thread browser, 3D decision space, calibration dashboard. `duh serve` serves both API and frontend.
+- **Epistemic confidence** -- Rigor scoring + domain-capped confidence. Calibration analysis with ECE tracking.
+- **Authentication** -- JWT auth with user accounts, RBAC (admin/contributor/viewer), password reset via email.
 - **Voting protocol** -- Fan out to all models in parallel, aggregate answers via majority or weighted synthesis.
 - **Query decomposition** -- Break complex questions into subtask DAGs, solve in parallel, synthesize results.
-- **REST API** -- Full HTTP API via `duh serve` with API key auth, rate limiting, and WebSocket streaming.
+- **REST API** -- Full HTTP API with API key auth, rate limiting, WebSocket streaming, and Prometheus metrics.
 - **MCP server** -- AI agent integration via `duh mcp` (Model Context Protocol).
 - **Python client** -- Async and sync client library for the REST API (`pip install duh-client`).
 - **Batch processing** -- Process multiple questions from a file (`duh batch`).
-- **Export** -- Export threads as JSON or Markdown (`duh export`).
-- **Mistral provider** -- Native Mistral AI support alongside Anthropic, OpenAI, and Google.
+- **Export** -- Export threads as JSON, Markdown, or PDF (`duh export`).
 - **Decision taxonomy** -- Auto-classify decisions by intent, category, and genus for structured recall.
 - **Outcome tracking** -- Record success/failure/partial feedback on past decisions.
 - **Tool-augmented reasoning** -- Models can call web search, read files, and execute code during consensus.
-- **Persistent memory** -- Every thread, contribution, decision, vote, and subtask stored in SQLite. Search with `duh recall`.
+- **Persistent memory** -- SQLite or PostgreSQL. Every thread, contribution, decision, vote, and subtask stored. Search with `duh recall`.
+- **Backup & restore** -- `duh backup` / `duh restore` with merge mode for SQLite and JSON export.
 - **Cost tracking** -- Per-model token costs in real-time. Configurable warn threshold and hard limit.
 - **Local models** -- Ollama and LM Studio via the OpenAI-compatible API. Mix cloud + local.
 - **Rich CLI** -- Styled panels, spinners, and formatted output.
@@ -59,6 +62,12 @@ duh batch questions.txt                 # Process multiple questions
 duh batch questions.jsonl --format json # Batch with JSON output
 duh export <thread-id>                  # Export thread as JSON
 duh export <thread-id> --format markdown # Export as Markdown
+duh export <thread-id> --format pdf     # Export as PDF
+duh backup ./backup.db                  # Backup database
+duh restore ./backup.db                 # Restore database
+duh calibration                         # Show confidence calibration
+duh user-create --email u@x.com --password ... # Create user
+duh user-list                           # List users
 ```
 
 ## How consensus works
@@ -109,8 +118,13 @@ Full documentation: [docs/](docs/index.md)
 - [Export](docs/export.md)
 - [Python API](docs/python-api/library-usage.md)
 - [Docker Guide](docs/guides/docker.md)
+- [Authentication](docs/guides/authentication.md)
 - [Config Reference](docs/reference/config-reference.md)
 
+## Sponsor
+
+If duh is useful to you, consider [sponsoring the project](https://github.com/sponsors/msitarzewski).
+
 ## License
 
 TBD
diff --git a/docs/concepts/epistemic-confidence.md b/docs/concepts/epistemic-confidence.md
@@ -0,0 +1,99 @@
+# Epistemic Confidence
+
+duh uses a two-dimensional confidence system that separates the quality of the deliberation process (**rigor**) from the theoretical limits of the question domain (**domain cap**). The final **confidence** score reflects both.
+
+## Key concepts
+
+### Rigor
+
+Rigor measures how well the consensus process challenged the answer. It ranges from 0.5 to 1.0 and is computed during the COMMIT phase based on:
+
+- How substantive the challenges were
+- Whether the revision addressed the challenges
+- Whether multiple rounds of deliberation improved the answer
+
+A high rigor score means the answer survived meaningful scrutiny. A low rigor score means the challenges were weak or the proposer didn't adequately respond.
+
+### Domain caps
+
+Not all questions can be answered with equal certainty. duh classifies each question's **intent** (via taxonomy) and applies a ceiling:
+
+| Domain | Cap | Rationale |
+|--------|-----|-----------|
+| Factual | 0.95 | Verifiable facts can be highly certain |
+| Technical | 0.90 | Technical analysis has some inherent uncertainty |
+| Creative | 0.85 | Creative work is subjective |
+| Judgment | 0.80 | Value judgments vary by perspective |
+| Strategic | 0.70 | Strategy involves unpredictable futures |
+| Default | 0.85 | When classification is uncertain |
+
+### Confidence
+
+The final confidence score is:
+
+```
+confidence = min(domain_cap(intent), rigor)
+```
+
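+This means even a perfect deliberation process (rigor 1.0) can't claim 95% confidence on a strategic question -- the domain cap limits it to 70%. Conversely, a factual question with rigor 0.6 gets confidence 0.6, because rigor is then the smaller term.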
+
+### Intent classification
+
+During the COMMIT phase, duh classifies the question's intent using a taxonomy. The classification determines which domain cap applies. Examples:
+
+- "What year was Python released?" -- factual (cap: 0.95)
+- "Should I use PostgreSQL or MongoDB?" -- technical (cap: 0.90)
+- "Write a poem about the ocean" -- creative (cap: 0.85)
+- "Is remote work better than office work?" -- judgment (cap: 0.80)
+- "Should we expand into the European market?" -- strategic (cap: 0.70)
+
+## Calibration
+
+Calibration measures whether confidence scores match observed outcomes over time. In a well-calibrated system:
+
+- Decisions with 90% confidence should be correct ~90% of the time
+- Decisions with 70% confidence should be correct ~70% of the time
+
+### Recording outcomes
+
+To build calibration data, record whether decisions were correct:
+
+**CLI**:
+```bash
+duh feedback <thread-id> success    # Decision was correct
+duh feedback <thread-id> failure    # Decision was wrong
+duh feedback <thread-id> partial    # Decision was partially correct
+```
+
+**Web UI**: Use the inline Pass/Partial/Fail buttons on the Threads page (/threads).
+
+**API**:
+```bash
+curl -X POST http://localhost:8080/api/feedback \
+  -H "Content-Type: application/json" \
+  -d '{"thread_id": "abc123", "result": "success"}'
+```
+
+### ECE (Expected Calibration Error)
+
+The calibration page shows ECE -- a single number that summarizes how well-calibrated the system is. Lower is better:
+
+- **ECE < 0.05**: Excellent calibration
+- **ECE 0.05--0.10**: Good calibration
+- **ECE > 0.10**: Needs more data or model adjustment
+
+ECE is computed by:
+1. Bucketing decisions by confidence (e.g., 0.7--0.8)
+2. Comparing mean predicted confidence to actual success rate in each bucket
+3. Averaging the absolute difference, weighted by bucket size
+
+### Viewing calibration
+
+- **CLI**: `duh calibration` shows a table of calibration buckets
+- **Web UI**: Visit `/calibration` for a visual calibration curve
+- **API**: `GET /api/calibration` returns bucket data
+
+## Related
+
+- [How Consensus Works](how-consensus-works.md) -- The deliberation process that produces rigor scores
+- [Web UI](../web-ui.md) -- Calibration page and batch feedback
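
The confidence formula and the three ECE steps in `docs/concepts/epistemic-confidence.md` can be sketched as follows. The cap values come from the domain table; the function names, the ten equal-width buckets, and the treatment of outcomes as plain success/failure booleans are assumptions. How duh scores `partial` feedback is not stated, so it is left out here.

```python
from collections import defaultdict

# Caps copied from the domain table in the document.
DOMAIN_CAPS = {
    "factual": 0.95,
    "technical": 0.90,
    "creative": 0.85,
    "judgment": 0.80,
    "strategic": 0.70,
}
DEFAULT_CAP = 0.85


def confidence(intent: str, rigor: float) -> float:
    """confidence = min(domain_cap(intent), rigor)."""
    return min(DOMAIN_CAPS.get(intent, DEFAULT_CAP), rigor)


def expected_calibration_error(records, n_buckets: int = 10) -> float:
    """records: iterable of (confidence, succeeded) pairs."""
    buckets = defaultdict(list)
    for conf, succeeded in records:
        # Step 1: bucket by confidence; 1.0 falls into the top bucket.
        index = min(int(conf * n_buckets), n_buckets - 1)
        buckets[index].append((conf, 1.0 if succeeded else 0.0))
    total = sum(len(items) for items in buckets.values())
    if total == 0:
        return 0.0
    ece = 0.0
    for items in buckets.values():
        # Step 2: mean predicted confidence vs. actual success rate.
        mean_conf = sum(c for c, _ in items) / len(items)
        success_rate = sum(o for _, o in items) / len(items)
        # Step 3: absolute gap, weighted by the bucket's share of decisions.
        ece += (len(items) / total) * abs(mean_conf - success_rate)
    return ece
```

For example, two decisions at confidence 0.8 where only one succeeded give a single bucket with mean confidence 0.8 and success rate 0.5, so the ECE is 0.3.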
diff --git a/docs/guides/authentication.md b/docs/guides/authentication.md
@@ -282,6 +282,35 @@ X-RateLimit-Remaining: 57
 X-RateLimit-Key: user:a1b2c3d4-...
 ```
 
+## Web UI Authentication
+
+The web UI integrates with the backend authentication system. It detects whether auth is required and adapts accordingly.
+
+### Dev mode (no auth required)
+
+When no API keys or users exist in the database, the API runs in open mode. The web UI detects this via `GET /api/auth/status` and automatically logs in as a guest user. No login page is shown.
+
+This is the default behavior when you first run `duh serve` -- you can start using the web UI immediately without setting up users.
+
+### Production mode (auth required)
+
+Once you create a user or API key, the web UI requires authentication:
+
+1. **Redirect to login**: All routes except `/share/:id` redirect to `/login` if the user is not authenticated
+2. **Login form**: Enter email and password to receive a JWT token
+3. **Token storage**: The JWT token is stored in `localStorage` (key: `duh_token`)
+4. **Auto-injection**: The API client automatically includes the token in all requests via the `Authorization: Bearer` header
+5. **WebSocket auth**: The token is included in the initial WebSocket handshake message
+6. **Session expiry**: On 401 responses, the stored token is cleared and the user is redirected to login
+
+### User menu
+
+When authenticated, the top bar shows the user's display name and role badge. Clicking it reveals a dropdown with the user's email and a sign-out button.
+
+### Registration
+
+The login page includes a toggle to switch between "Sign In" and "Create Account" modes. Registration can be disabled server-side by setting `registration_enabled = false` in `config.toml`.
+
 ## Security recommendations
 
 1. **Generate a strong JWT secret**: `openssl rand -hex 32`
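
The token storage, auto-injection, and session-expiry steps in the Web UI Authentication section can be modelled in a few lines. This is a language-neutral sketch written in Python, not the frontend's TypeScript code: the class `TokenSession` and its methods are hypothetical, and a plain dict stands in for the browser's `localStorage`.

```python
from typing import Optional

TOKEN_KEY = "duh_token"  # localStorage key named in the guide


class TokenSession:
    """Models how a client stores, sends, and discards the JWT."""

    def __init__(self, storage: dict):
        self.storage = storage  # stand-in for browser localStorage

    def login(self, token: str) -> None:
        self.storage[TOKEN_KEY] = token

    def auth_headers(self) -> dict:
        # Auto-injection: attach the token only when one is stored.
        token = self.storage.get(TOKEN_KEY)
        return {"Authorization": f"Bearer {token}"} if token else {}

    def handle_status(self, status: int) -> Optional[str]:
        # Session expiry: a 401 clears the token and sends the user to login.
        if status == 401:
            self.storage.pop(TOKEN_KEY, None)
            return "/login"
        return None
```

Keeping the 401 handling in one place means every request path reacts to an expired token the same way.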

diff --git a/docs/web-ui.md b/docs/web-ui.md
@@ -44,6 +44,7 @@ A live cost ticker shows cumulative spend during streaming. When consensus compl
 Browse and search all past consensus sessions.
 
 - **Thread list** -- Shows question, status, and creation date for each thread
+- **Batch feedback** -- Completed threads show inline Pass/Partial/Fail buttons for quick outcome recording, so you can feed the calibration system without opening each thread individually.
 - **Search** -- Filter threads by keyword
 - **Status filter** -- Filter by `active`, `complete`, or `failed`
 - **Thread detail** (/threads/:id) -- Drill into a specific thread to see the full debate history: every round's proposal, challenges, revision, and decision
@@ -75,6 +76,25 @@ Persistent UI settings stored in `localStorage`:
 | Cost threshold | USD limit before warning |
 | Sound effects | Toggle phase transition sounds |
 
+### Calibration (/calibration)
+
+Confidence calibration analysis. Shows how well duh's confidence scores predict actual outcomes.
+
+- **Calibration curve**: Buckets of predicted confidence vs. actual success rate
+- **ECE (Expected Calibration Error)**: How well-calibrated the predictions are (lower is better)
+- **Category filter**: Break down calibration by decision category (factual, technical, creative, etc.)
+
+The page requires outcome data to be useful. Use the batch feedback buttons on the Threads page or the `duh feedback` CLI command to record whether decisions were correct.
+
+### Login (/login)
+
+The login page appears when authentication is required (production mode). Features:
+
+- Toggle between "Sign In" and "Create Account" modes
+- Email and password form using the glassmorphism design system
+- Inline error messages for failed authentication
+- Automatic redirect to `/` on successful login
+
 ### Share (/share/:id)
 
 A standalone page (no sidebar) for viewing shared consensus results via a public share token. Accessible without authentication.
@@ -117,12 +137,14 @@ The frontend uses a typed API client at `web/src/api/client.ts` that wraps `fetc
 
 The WebSocket client (`web/src/api/websocket.ts`) handles consensus streaming. It auto-detects `ws:` vs `wss:` based on the page protocol and connects to `/ws/ask` on the current host.
 
-State management uses four Zustand stores:
+State management uses six Zustand stores:
 
 | Store | Purpose |
 |-------|---------|
+| `auth` | JWT token, user info, login/logout, dev mode detection |
 | `consensus` | WebSocket connection, phase tracking, round data, final result |
-| `threads` | Thread list, search, pagination |
+| `threads` | Thread list, search, pagination, batch feedback |
+| `calibration` | Calibration buckets, ECE, accuracy metrics |
 | `decision-space` | Decision data, filters, timeline position |
 | `preferences` | User settings (persisted to localStorage) |