Commit c034101

Merge pull request #11 from msitarzewski/ux-cleanup
v0.6.0: Auth, password reset, epistemic confidence, UX polish
2 parents 2a72c45 + eec2d4b commit c034101

67 files changed

Lines changed: 2423 additions & 336 deletions

.env.example

Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
+# duh environment variables
+# Copy to .env and fill in values. Never commit .env to git.
+
+# JWT secret for auth tokens (required in production; auto-generated in dev if empty)
+DUH_JWT_SECRET=
+
+# Mail (SMTP) configuration for password reset emails
+DUH_MAIL_HOST=localhost
+DUH_MAIL_PORT=1025
+DUH_MAIL_USERNAME=
+DUH_MAIL_PASSWORD=
+DUH_MAIL_ENCRYPTION=
+DUH_MAIL_FROM_ADDRESS=noreply@example.com
+DUH_MAIL_FROM_NAME=duh
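
The variables above could be consumed roughly as follows. This is a minimal sketch, not duh's actual loader: `Settings`, `load_settings`, and the `production` flag are hypothetical names introduced for illustration, and the dev fallback only mirrors the comment on `DUH_JWT_SECRET`.

```python
import os
import secrets
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for a subset of the variables in .env.example."""
    jwt_secret: str
    mail_host: str
    mail_port: int
    mail_from_address: str
    mail_from_name: str


def load_settings(env=None, production=False):
    """Read DUH_* variables from a mapping (defaults to os.environ).

    `production` is an assumed switch for this sketch, not a documented duh option.
    """
    env = os.environ if env is None else env
    secret = env.get("DUH_JWT_SECRET", "")
    if not secret:
        if production:
            raise RuntimeError("DUH_JWT_SECRET is required in production")
        # Dev fallback: random 256-bit secret, valid for this process only.
        secret = secrets.token_hex(32)
    return Settings(
        jwt_secret=secret,
        mail_host=env.get("DUH_MAIL_HOST", "localhost"),
        mail_port=int(env.get("DUH_MAIL_PORT", "1025")),
        mail_from_address=env.get("DUH_MAIL_FROM_ADDRESS", "noreply@example.com"),
        mail_from_name=env.get("DUH_MAIL_FROM_NAME", "duh"),
    )
```

A generated dev secret changes on every restart, so tokens issued before a restart would stop validating; that is one reason to set the variable explicitly outside local development.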

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -24,6 +24,7 @@ ENV/
 # Environment
 .env
 .env.*
+!.env.example

 # Phase 0 results
 results/

AGENTS.md

Lines changed: 43 additions & 3 deletions
@@ -1,6 +1,6 @@
 # AGENTS.md

-**Version**: 2.1 (2025-10-25) | **Compatibility**: Claude, Cursor, Copilot, Cline, Aider, all AGENTS.md-compatible tools
+**Version**: 2.2 (2025-03-04) | **Compatibility**: Claude, Cursor, Copilot, Cline, Aider, all AGENTS.md-compatible tools
 **Status**: Canonical single-file guide for AI-assisted development

 ---
@@ -9,6 +9,7 @@

 1. [Compliance & Core Rules](#1-compliance--core-rules)
 2. [Session Startup](#2-session-startup)
+- [Compaction Protocol](#compaction-protocol-mid-session-context-preservation)
 3. [Memory Bank](#3-memory-bank)
 4. [State Machine](#4-state-machine)
 5. [Task Contract & Budgets](#5-task-contract--budgets)
@@ -106,6 +107,45 @@ Append-only JSONL format:
 {"timestamp":"2025-10-25T11:00:00Z","session_id":"uuid","event":"approval_requested","state":"APPROVAL"}
 ```

+### Compaction Protocol (Mid-Session Context Preservation)
+
+Compaction (context compression) can happen at any time — triggered by the system automatically, by the user via `/compact`, or by platform-level context management. **The agent does not control compaction timing and may not get advance notice.** Therefore, state persistence must be continuous, not deferred to a pre-compaction moment.
+
+#### Continuous State Persistence (At Every State Transition)
+
+At each state transition (`PLAN → BUILD → DIFF → QA → APPROVAL → APPLY → DOCS`), persist the following to the Memory Bank:
+
+1. **State machine position**: Update `activeContext.md` with current state, substate, and working context
+2. **Task progress**: Append current status to `tasks/YYYY-MM/README.md` with `[IN-PROGRESS]` tag
+3. **Decisions**: Append any new architectural decisions to `decisions.md`
+4. **Log transition** to operational log:
+```json
+{"timestamp":"...","session_id":"uuid","event":"state_transition","from":"PLAN","to":"BUILD"}
+```
+5. **Loose context**: Capture any information that exists only in conversation (user preferences, verbal requirements, pending questions) into `activeContext.md`
+
+This ensures that when compaction occurs — without warning — the Memory Bank already reflects the latest state.
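
Step 4 of the list above (logging a transition to the append-only JSONL log) can be sketched as below. This is an illustration under assumptions: `log_transition` and the log path are hypothetical, and only the event fields shown in the JSON example are taken from the document.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

# State names as listed in the protocol text.
STATES = ["PLAN", "BUILD", "DIFF", "QA", "APPROVAL", "APPLY", "DOCS"]


def log_transition(log_path, session_id, from_state, to_state):
    """Append one state_transition event to a JSONL log and return it."""
    if from_state not in STATES or to_state not in STATES:
        raise ValueError(f"unknown state: {from_state!r} -> {to_state!r}")
    event = {
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "session_id": session_id,
        "event": "state_transition",
        "from": from_state,
        "to": to_state,
    }
    # Append mode: earlier events are never rewritten, one JSON object per line.
    with Path(log_path).open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(event, separators=(",", ":")) + "\n")
    return event
```

Because each call writes a complete line before returning, the log already holds the latest transition whenever compaction happens.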
+
+#### After Compaction (Recovery)
+
+When context has been compressed (detected by loss of earlier conversation detail, or after `/compact`):
+
+1. Re-enter **Session Startup** (Section 2) using **Fast Track** mode — the Memory Bank was just updated via continuous persistence, so full discovery is unnecessary
+2. Confirm state machine position from `activeContext.md`
+3. Resume from saved state — do not restart the current task from scratch
+4. Output recovery confirmation:
+```
+COMPACTION RECOVERY: Resumed [STATE] for [task name]
+Context restored from: activeContext.md, tasks/YYYY-MM/README.md
+```
+
+#### Rules
+
+- State persistence happens at every transition, not "before compaction" — you cannot rely on advance notice
+- After detecting compaction, always re-read Memory Bank before taking any action
+- If the current state is `APPROVAL` or `DIFF`, the diff summary should already be in `activeContext.md` from the transition save
+- Compaction does not reset budgets — carry forward cycle/token/minute counts from the operational log
+
 ---

 ## 3. Memory Bank
@@ -577,7 +617,7 @@ Create outline for approval. After approval, do work. Do not document until I ap
 2. **Task** (current task): Files being modified, direct dependencies, related tests
 3. **Reference** (on-demand): Arch patterns, similar implementations, historical decisions

-**Context Rotation**: After each state transition, drop Task Context, reload only what's needed for next state. Keep Core Context persistent.
+**Context Rotation**: After each state transition, drop Task Context, reload only what's needed for next state. Keep Core Context persistent. State is persisted to Memory Bank at every transition per **Compaction Protocol** (Section 2), so compaction recovery is automatic.

 **Parallel Execution**:
 ```
@@ -896,7 +936,7 @@ Stuck? → Cycles ≥3?
 | Issue | Symptoms | Resolution |
 |-------|----------|------------|
 | **Loop** | Same diff multiple times, QA fails repeatedly, no progress after 3+ cycles | Check budgets → Load more MB → Clarify requirements → Check environment → Agent swap |
-| **Context Exceeded** | Token limit approaching, slow/truncated responses, forgetting earlier info | Rotate context (drop Task, reload essentials) → Focused mode (MB summaries only) → Break into subtasks → Agent swap |
+| **Context Exceeded** | Token limit approaching, slow/truncated responses, forgetting earlier info | State already persisted via **Compaction Protocol** (Section 2) → Rotate context (drop Task, reload essentials) → Focused mode (MB summaries only) → Break into subtasks → Agent swap |
 | **CI ≠ Local** | QA passes, CI fails | Compare environments → Verify dependency versions → Check timing/concurrency → Check state cleanup → Document waiver if CI issue |
 | **Security Fail** | Security checklist incomplete, sensitive data exposed, auth/authz bypassed | Never bypass → Return to BUILD → Fix all issues → Re-test → Document pattern if new |

README.md

Lines changed: 19 additions & 5 deletions
@@ -21,19 +21,22 @@ duh ask "What database should I use for a new SaaS product?"

 ## Features

-- **Multi-model consensus** -- Claude, GPT, Gemini, and Mistral debate. Sycophantic challenges are detected and flagged.
+- **Multi-model consensus** -- Claude, GPT, Gemini, Mistral, and Perplexity debate. Sycophantic challenges are detected and flagged.
+- **Web UI** -- Real-time consensus streaming, thread browser, 3D decision space, calibration dashboard. `duh serve` serves both API and frontend.
+- **Epistemic confidence** -- Rigor scoring + domain-capped confidence. Calibration analysis with ECE tracking.
+- **Authentication** -- JWT auth with user accounts, RBAC (admin/contributor/viewer), password reset via email.
 - **Voting protocol** -- Fan out to all models in parallel, aggregate answers via majority or weighted synthesis.
 - **Query decomposition** -- Break complex questions into subtask DAGs, solve in parallel, synthesize results.
-- **REST API** -- Full HTTP API via `duh serve` with API key auth, rate limiting, and WebSocket streaming.
+- **REST API** -- Full HTTP API with API key auth, rate limiting, WebSocket streaming, and Prometheus metrics.
 - **MCP server** -- AI agent integration via `duh mcp` (Model Context Protocol).
 - **Python client** -- Async and sync client library for the REST API (`pip install duh-client`).
 - **Batch processing** -- Process multiple questions from a file (`duh batch`).
-- **Export** -- Export threads as JSON or Markdown (`duh export`).
-- **Mistral provider** -- Native Mistral AI support alongside Anthropic, OpenAI, and Google.
+- **Export** -- Export threads as JSON, Markdown, or PDF (`duh export`).
 - **Decision taxonomy** -- Auto-classify decisions by intent, category, and genus for structured recall.
 - **Outcome tracking** -- Record success/failure/partial feedback on past decisions.
 - **Tool-augmented reasoning** -- Models can call web search, read files, and execute code during consensus.
-- **Persistent memory** -- Every thread, contribution, decision, vote, and subtask stored in SQLite. Search with `duh recall`.
+- **Persistent memory** -- SQLite or PostgreSQL. Every thread, contribution, decision, vote, and subtask stored. Search with `duh recall`.
+- **Backup & restore** -- `duh backup` / `duh restore` with merge mode for SQLite and JSON export.
 - **Cost tracking** -- Per-model token costs in real-time. Configurable warn threshold and hard limit.
 - **Local models** -- Ollama and LM Studio via the OpenAI-compatible API. Mix cloud + local.
 - **Rich CLI** -- Styled panels, spinners, and formatted output.
@@ -59,6 +62,12 @@ duh batch questions.txt # Process multiple questions
 duh batch questions.jsonl --format json # Batch with JSON output
 duh export <thread-id> # Export thread as JSON
 duh export <thread-id> --format markdown # Export as Markdown
+duh export <thread-id> --format pdf # Export as PDF
+duh backup ./backup.db # Backup database
+duh restore ./backup.db # Restore database
+duh calibration # Show confidence calibration
+duh user-create --email u@x.com --password ... # Create user
+duh user-list # List users
 ```

 ## How consensus works
@@ -109,8 +118,13 @@ Full documentation: [docs/](docs/index.md)
 - [Export](docs/export.md)
 - [Python API](docs/python-api/library-usage.md)
 - [Docker Guide](docs/guides/docker.md)
+- [Authentication](docs/guides/authentication.md)
 - [Config Reference](docs/reference/config-reference.md)

+## Sponsor
+
+If duh is useful to you, consider [sponsoring the project](https://github.com/sponsors/msitarzewski).
+
 ## License

 TBD

Lines changed: 99 additions & 0 deletions
@@ -0,0 +1,99 @@
+# Epistemic Confidence
+
+duh uses a two-dimensional confidence system that separates the quality of the deliberation process (**rigor**) from the theoretical limits of the question domain (**domain cap**). The final **confidence** score reflects both.
+
+## Key concepts
+
+### Rigor
+
+Rigor measures how well the consensus process challenged the answer. It ranges from 0.5 to 1.0 and is computed during the COMMIT phase based on:
+
+- How substantive the challenges were
+- Whether the revision addressed the challenges
+- Whether multiple rounds of deliberation improved the answer
+
+A high rigor score means the answer survived meaningful scrutiny. A low rigor score means the challenges were weak or the proposer didn't adequately respond.
+
+### Domain caps
+
+Not all questions can be answered with equal certainty. duh classifies each question's **intent** (via taxonomy) and applies a ceiling:
+
+| Domain | Cap | Rationale |
+|--------|-----|-----------|
+| Factual | 0.95 | Verifiable facts can be highly certain |
+| Technical | 0.90 | Technical analysis has some inherent uncertainty |
+| Creative | 0.85 | Creative work is subjective |
+| Judgment | 0.80 | Value judgments vary by perspective |
+| Strategic | 0.70 | Strategy involves unpredictable futures |
+| Default | 0.85 | When classification is uncertain |
+
+### Confidence
+
+The final confidence score is:
+
+```
+confidence = min(domain_cap(intent), rigor)
+```
+
+This means even a perfect deliberation process can't claim 95% confidence on a strategic question -- the domain cap limits it to 70%.
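
The formula and the cap table can be written out as a short Python sketch. The cap values and the 0.5 to 1.0 rigor range come from the text above; the function and constant names are illustrative and are not taken from duh's source.

```python
# Caps copied from the "Domain caps" table; keys are lowercase intent labels.
DOMAIN_CAPS = {
    "factual": 0.95,
    "technical": 0.90,
    "creative": 0.85,
    "judgment": 0.80,
    "strategic": 0.70,
}
DEFAULT_CAP = 0.85  # used when classification is uncertain


def domain_cap(intent):
    """Return the confidence ceiling for an intent, or the default cap."""
    return DOMAIN_CAPS.get((intent or "").lower(), DEFAULT_CAP)


def confidence(intent, rigor):
    """confidence = min(domain_cap(intent), rigor)."""
    if not 0.5 <= rigor <= 1.0:
        raise ValueError("rigor is documented as ranging from 0.5 to 1.0")
    return min(domain_cap(intent), rigor)
```

For example, `confidence("strategic", 1.0)` yields the strategic cap, while `confidence("factual", 0.6)` is limited by rigor instead.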
+
+### Intent classification
+
+During the COMMIT phase, duh classifies the question's intent using a taxonomy. The classification determines which domain cap applies. Examples:
+
+- "What year was Python released?" -- factual (cap: 0.95)
+- "Should I use PostgreSQL or MongoDB?" -- technical (cap: 0.90)
+- "Write a poem about the ocean" -- creative (cap: 0.85)
+- "Is remote work better than office work?" -- judgment (cap: 0.80)
+- "Should we expand into the European market?" -- strategic (cap: 0.70)
+
+## Calibration
+
+Calibration measures whether confidence scores are accurate over time. A well-calibrated system means:
+
+- Decisions with 90% confidence should be correct ~90% of the time
+- Decisions with 70% confidence should be correct ~70% of the time
+
+### Recording outcomes
+
+To build calibration data, record whether decisions were correct:
+
+**CLI**:
+```bash
+duh feedback <thread-id> success # Decision was correct
+duh feedback <thread-id> failure # Decision was wrong
+duh feedback <thread-id> partial # Decision was partially correct
+```
+
+**Web UI**: Use the inline Pass/Partial/Fail buttons on the Threads page (/threads).
+
+**API**:
+```bash
+curl -X POST http://localhost:8080/api/feedback \
+-H "Content-Type: application/json" \
+-d '{"thread_id": "abc123", "result": "success"}'
+```
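
The same request can be built from Python with only the standard library. This is a sketch: `build_feedback_request` is a hypothetical helper, and it assumes the API accepts the same three result values as the CLI (`success`, `failure`, `partial`), which the example above does not confirm.

```python
import json
import urllib.request


def build_feedback_request(thread_id, result, base_url="http://localhost:8080"):
    """Build (but do not send) the POST /api/feedback request shown above."""
    # Assumption: the API mirrors the CLI's three outcome values.
    if result not in {"success", "failure", "partial"}:
        raise ValueError(f"unexpected result: {result!r}")
    body = json.dumps({"thread_id": thread_id, "result": result}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/feedback",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` sends it to a running server; in production mode an `Authorization` header would presumably be needed as well.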
+
+### ECE (Expected Calibration Error)
+
+The calibration page shows ECE -- a single number that summarizes how well-calibrated the system is. Lower is better:
+
+- **ECE < 0.05**: Excellent calibration
+- **ECE 0.05--0.10**: Good calibration
+- **ECE > 0.10**: Needs more data or model adjustment
+
+ECE is computed by:
+1. Bucketing decisions by confidence (e.g., 0.7--0.8)
+2. Comparing mean predicted confidence to actual success rate in each bucket
+3. Averaging the absolute difference, weighted by bucket size
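
The three steps above correspond to the standard ECE calculation, sketched here in Python. The equal-width bucketing, the default of ten buckets, and the boolean treatment of outcomes are assumptions of this sketch; how duh scores `partial` outcomes is not stated in the text.

```python
def expected_calibration_error(records, n_buckets=10):
    """Compute ECE from (confidence, correct) pairs.

    confidence is a float in [0, 1]; correct is truthy for a successful decision.
    """
    # Step 1: bucket decisions by confidence (equal-width buckets).
    buckets = [[] for _ in range(n_buckets)]
    for conf, correct in records:
        idx = min(int(conf * n_buckets), n_buckets - 1)  # conf == 1.0 goes in the top bucket
        buckets[idx].append((conf, 1.0 if correct else 0.0))

    total = sum(len(b) for b in buckets)
    if total == 0:
        return 0.0

    ece = 0.0
    for b in buckets:
        if not b:
            continue
        # Step 2: mean predicted confidence vs. observed success rate.
        mean_conf = sum(c for c, _ in b) / len(b)
        success_rate = sum(o for _, o in b) / len(b)
        # Step 3: absolute gap, weighted by the bucket's share of all decisions.
        ece += (len(b) / total) * abs(mean_conf - success_rate)
    return ece
```

With few recorded outcomes, a single bucket can dominate the result, which is consistent with the note that a high ECE may simply mean more data is needed.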
+
+### Viewing calibration
+
+- **CLI**: `duh calibration` shows a table of calibration buckets
+- **Web UI**: Visit `/calibration` for a visual calibration curve
+- **API**: `GET /api/calibration` returns bucket data
+
+## Related
+
+- [How Consensus Works](how-consensus-works.md) -- The deliberation process that produces rigor scores
+- [Web UI](../web-ui.md) -- Calibration page and batch feedback

docs/guides/authentication.md

Lines changed: 29 additions & 0 deletions
@@ -282,6 +282,35 @@ X-RateLimit-Remaining: 57
 X-RateLimit-Key: user:a1b2c3d4-...
 ```

+## Web UI Authentication
+
+The web UI integrates with the backend authentication system. It detects whether auth is required and adapts accordingly.
+
+### Dev mode (no auth required)
+
+When no API keys or users exist in the database, the API runs in open mode. The web UI detects this via `GET /api/auth/status` and automatically logs in as a guest user. No login page is shown.
+
+This is the default behavior when you first run `duh serve` -- you can start using the web UI immediately without setting up users.
+
+### Production mode (auth required)
+
+Once you create a user or API key, the web UI requires authentication:
+
+1. **Redirect to login**: All routes except `/share/:id` redirect to `/login` if the user is not authenticated
+2. **Login form**: Enter email and password to receive a JWT token
+3. **Token storage**: The JWT token is stored in `localStorage` (key: `duh_token`)
+4. **Auto-injection**: The API client automatically includes the token in all requests via the `Authorization: Bearer` header
+5. **WebSocket auth**: The token is included in the initial WebSocket handshake message
+6. **Session expiry**: On 401 responses, the stored token is cleared and the user is redirected to login
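
The redirect, storage, injection, and expiry steps above can be modeled in a few lines. The real logic lives in the TypeScript frontend; this Python sketch only illustrates the flow, with a dict standing in for `localStorage` and with `TokenSession` and `route_guard` as hypothetical names.

```python
class TokenSession:
    """Illustrative stand-in for the browser's localStorage-backed auth state."""

    STORAGE_KEY = "duh_token"  # key name taken from the text

    def __init__(self, storage=None):
        self.storage = {} if storage is None else storage

    def login(self, token):
        """Step 3: store the JWT returned by the login form."""
        self.storage[self.STORAGE_KEY] = token

    def auth_headers(self):
        """Step 4: headers to attach to every API request."""
        token = self.storage.get(self.STORAGE_KEY)
        return {"Authorization": f"Bearer {token}"} if token else {}

    def handle_response(self, status):
        """Step 6: on 401, clear the token and return the redirect target."""
        if status == 401:
            self.storage.pop(self.STORAGE_KEY, None)
            return "/login"
        return None


def route_guard(path, session):
    """Step 1: send unauthenticated users to /login, except on /share/:id."""
    if path == "/login" or path.startswith("/share/"):
        return path
    return path if session.auth_headers() else "/login"
```

Keeping the 401 handling in one place means an expired token is cleared the same way regardless of which request discovers it.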
+
+### User menu
+
+When authenticated, the top bar shows the user's display name and role badge. Clicking it reveals a dropdown with the user's email and a sign-out button.
+
+### Registration
+
+The login page includes a toggle to switch between "Sign In" and "Create Account" modes. Registration can be disabled server-side by setting `registration_enabled = false` in `config.toml`.
+
 ## Security recommendations

 1. **Generate a strong JWT secret**: `openssl rand -hex 32`

docs/web-ui.md

Lines changed: 24 additions & 2 deletions
@@ -44,6 +44,7 @@ A live cost ticker shows cumulative spend during streaming. When consensus compl
 Browse and search all past consensus sessions.

 - **Thread list** -- Shows question, status, and creation date for each thread
+- **Batch feedback** -- Completed threads show inline Pass/Partial/Fail buttons for quick outcome recording. This feeds the calibration system without having to open each thread individually.
 - **Search** -- Filter threads by keyword
 - **Status filter** -- Filter by `active`, `complete`, or `failed`
 - **Thread detail** (/threads/:id) -- Drill into a specific thread to see the full debate history: every round's proposal, challenges, revision, and decision
@@ -75,6 +76,25 @@ Persistent UI settings stored in `localStorage`:
 | Cost threshold | USD limit before warning |
 | Sound effects | Toggle phase transition sounds |

+### Calibration (/calibration)
+
+Confidence calibration analysis. Shows how well duh's confidence scores predict actual outcomes.
+
+- **Calibration curve**: Buckets of predicted confidence vs. actual success rate
+- **ECE (Expected Calibration Error)**: How well-calibrated the predictions are (lower is better)
+- **Category filter**: Break down calibration by decision category (factual, technical, creative, etc.)
+
+The page requires outcome data to be useful. Use the batch feedback buttons on the Threads page or `duh feedback` CLI to record whether decisions were correct.
+
+### Login (/login)
+
+The login page appears when authentication is required (production mode). Features:
+
+- Toggle between "Sign In" and "Create Account" modes
+- Email and password form using the glassmorphism design system
+- Inline error messages for failed authentication
+- Automatic redirect to `/` on successful login
+
 ### Share (/share/:id)

 A standalone page (no sidebar) for viewing shared consensus results via a public share token. Accessible without authentication.
@@ -117,12 +137,14 @@ The frontend uses a typed API client at `web/src/api/client.ts` that wraps `fetc

 The WebSocket client (`web/src/api/websocket.ts`) handles consensus streaming. It auto-detects `ws:` vs `wss:` based on the page protocol and connects to `/ws/ask` on the current host.

-State management uses four Zustand stores:
+State management uses six Zustand stores:

 | Store | Purpose |
 |-------|---------|
+| `auth` | JWT token, user info, login/logout, dev mode detection |
 | `consensus` | WebSocket connection, phase tracking, round data, final result |
-| `threads` | Thread list, search, pagination |
+| `threads` | Thread list, search, pagination, batch feedback |
+| `calibration` | Calibration buckets, ECE, accuracy metrics |
 | `decision-space` | Decision data, filters, timeline position |
 | `preferences` | User settings (persisted to localStorage) |
