Merged
14 changes: 14 additions & 0 deletions .env.example
Original file line number Diff line number Diff line change
@@ -0,0 +1,14 @@
# duh environment variables
# Copy to .env and fill in values. Never commit .env to git.

# JWT secret for auth tokens (required in production; auto-generated in dev if empty)
DUH_JWT_SECRET=

# Mail (SMTP) configuration for password reset emails
DUH_MAIL_HOST=localhost
DUH_MAIL_PORT=1025
DUH_MAIL_USERNAME=
DUH_MAIL_PASSWORD=
DUH_MAIL_ENCRYPTION=
DUH_MAIL_FROM_ADDRESS=noreply@example.com
DUH_MAIL_FROM_NAME=duh
1 change: 1 addition & 0 deletions .gitignore
@@ -24,6 +24,7 @@ ENV/
# Environment
.env
.env.*
!.env.example

# Phase 0 results
results/
46 changes: 43 additions & 3 deletions AGENTS.md
@@ -1,6 +1,6 @@
# AGENTS.md

-**Version**: 2.1 (2025-10-25) | **Compatibility**: Claude, Cursor, Copilot, Cline, Aider, all AGENTS.md-compatible tools
+**Version**: 2.2 (2025-03-04) | **Compatibility**: Claude, Cursor, Copilot, Cline, Aider, all AGENTS.md-compatible tools
**Status**: Canonical single-file guide for AI-assisted development

---
@@ -9,6 +9,7 @@

1. [Compliance & Core Rules](#1-compliance--core-rules)
2. [Session Startup](#2-session-startup)
- [Compaction Protocol](#compaction-protocol-mid-session-context-preservation)
3. [Memory Bank](#3-memory-bank)
4. [State Machine](#4-state-machine)
5. [Task Contract & Budgets](#5-task-contract--budgets)
@@ -106,6 +107,45 @@ Append-only JSONL format:
{"timestamp":"2025-10-25T11:00:00Z","session_id":"uuid","event":"approval_requested","state":"APPROVAL"}
```

### Compaction Protocol (Mid-Session Context Preservation)

Compaction (context compression) can happen at any time — triggered by the system automatically, by the user via `/compact`, or by platform-level context management. **The agent does not control compaction timing and may not get advance notice.** Therefore, state persistence must be continuous, not deferred to a pre-compaction moment.

#### Continuous State Persistence (At Every State Transition)

At each state transition (`PLAN → BUILD → DIFF → QA → APPROVAL → APPLY → DOCS`), persist the following to the Memory Bank:

1. **State machine position**: Update `activeContext.md` with current state, substate, and working context
2. **Task progress**: Append current status to `tasks/YYYY-MM/README.md` with `[IN-PROGRESS]` tag
3. **Decisions**: Append any new architectural decisions to `decisions.md`
4. **Log transition** to operational log:
```json
{"timestamp":"...","session_id":"uuid","event":"state_transition","from":"PLAN","to":"BUILD"}
```
5. **Loose context**: Capture any information that exists only in conversation (user preferences, verbal requirements, pending questions) into `activeContext.md`

This ensures that when compaction occurs — without warning — the Memory Bank already reflects the latest state.
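
Step 4 above can be sketched as a small append-only logger. This is a minimal Python sketch rather than anything AGENTS.md prescribes: the function name `log_transition` and the log path passed in are hypothetical, and only the event fields are taken from the JSON example in step 4.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

# State names as listed in the transition chain above.
STATES = ["PLAN", "BUILD", "DIFF", "QA", "APPROVAL", "APPLY", "DOCS"]


def log_transition(log_path: Path, session_id: str, from_state: str, to_state: str) -> dict:
    """Append one state_transition event to a JSONL log (hypothetical helper)."""
    if from_state not in STATES or to_state not in STATES:
        raise ValueError(f"unknown state: {from_state!r} -> {to_state!r}")
    event = {
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "session_id": session_id,
        "event": "state_transition",
        "from": from_state,
        "to": to_state,
    }
    log_path.parent.mkdir(parents=True, exist_ok=True)
    # Mode "a" only ever adds lines, which keeps the log append-only.
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(event, separators=(",", ":")) + "\n")
    return event
```

Writing one complete line per transition means the log stays readable even if compaction interrupts the session immediately afterwards.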

#### After Compaction (Recovery)

When context has been compressed (detected by loss of earlier conversation detail, or after `/compact`):

1. Re-enter **Session Startup** (Section 2) using **Fast Track** mode — the Memory Bank was just updated via continuous persistence, so full discovery is unnecessary
2. Confirm state machine position from `activeContext.md`
3. Resume from saved state — do not restart the current task from scratch
4. Output recovery confirmation:
```
COMPACTION RECOVERY: Resumed [STATE] for [task name]
Context restored from: activeContext.md, tasks/YYYY-MM/README.md
```

#### Rules

- State persistence happens at every transition, not "before compaction" — you cannot rely on advance notice
- After detecting compaction, always re-read Memory Bank before taking any action
- If the current state is `APPROVAL` or `DIFF`, the diff summary should already be in `activeContext.md` from the transition save
- Compaction does not reset budgets — carry forward cycle/token/minute counts from the operational log
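
As a cross-check during recovery, the most recent state can also be read back from the operational log. The sketch below is an assumption-laden illustration: `last_logged_state` is a hypothetical helper, and it assumes events carry either a `to` field (for `state_transition`) or a `state` field, as in the two JSON examples in this section. `activeContext.md` remains the primary source described above.

```python
import json
from typing import Iterable, Optional


def last_logged_state(jsonl_lines: Iterable[str]) -> Optional[str]:
    """Return the most recent state recorded in JSONL log lines, or None."""
    state: Optional[str] = None
    for raw in jsonl_lines:
        raw = raw.strip()
        if not raw:
            continue
        try:
            event = json.loads(raw)
        except json.JSONDecodeError:
            # Skip a partially written line instead of failing recovery.
            continue
        if event.get("event") == "state_transition" and "to" in event:
            state = event["to"]
        elif "state" in event:
            state = event["state"]
    return state
```

If the value returned here disagrees with `activeContext.md`, that mismatch is worth surfacing to the user before resuming.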

---

## 3. Memory Bank
@@ -577,7 +617,7 @@ Create outline for approval. After approval, do work. Do not document until I ap
2. **Task** (current task): Files being modified, direct dependencies, related tests
3. **Reference** (on-demand): Arch patterns, similar implementations, historical decisions

-**Context Rotation**: After each state transition, drop Task Context, reload only what's needed for next state. Keep Core Context persistent.
+**Context Rotation**: After each state transition, drop Task Context, reload only what's needed for next state. Keep Core Context persistent. State is persisted to Memory Bank at every transition per **Compaction Protocol** (Section 2), so compaction recovery is automatic.

**Parallel Execution**:
```
@@ -896,7 +936,7 @@ Stuck? → Cycles ≥3?
| Issue | Symptoms | Resolution |
|-------|----------|------------|
| **Loop** | Same diff multiple times, QA fails repeatedly, no progress after 3+ cycles | Check budgets → Load more MB → Clarify requirements → Check environment → Agent swap |
-| **Context Exceeded** | Token limit approaching, slow/truncated responses, forgetting earlier info | Rotate context (drop Task, reload essentials) → Focused mode (MB summaries only) → Break into subtasks → Agent swap |
+| **Context Exceeded** | Token limit approaching, slow/truncated responses, forgetting earlier info | State already persisted via **Compaction Protocol** (Section 2) → Rotate context (drop Task, reload essentials) → Focused mode (MB summaries only) → Break into subtasks → Agent swap |
| **CI ≠ Local** | QA passes, CI fails | Compare environments → Verify dependency versions → Check timing/concurrency → Check state cleanup → Document waiver if CI issue |
| **Security Fail** | Security checklist incomplete, sensitive data exposed, auth/authz bypassed | Never bypass → Return to BUILD → Fix all issues → Re-test → Document pattern if new |

24 changes: 19 additions & 5 deletions README.md
@@ -21,19 +21,22 @@ duh ask "What database should I use for a new SaaS product?"

## Features

-- **Multi-model consensus** -- Claude, GPT, Gemini, and Mistral debate. Sycophantic challenges are detected and flagged.
+- **Multi-model consensus** -- Claude, GPT, Gemini, Mistral, and Perplexity debate. Sycophantic challenges are detected and flagged.
- **Web UI** -- Real-time consensus streaming, thread browser, 3D decision space, calibration dashboard. `duh serve` serves both API and frontend.
- **Epistemic confidence** -- Rigor scoring + domain-capped confidence. Calibration analysis with ECE tracking.
- **Authentication** -- JWT auth with user accounts, RBAC (admin/contributor/viewer), password reset via email.
- **Voting protocol** -- Fan out to all models in parallel, aggregate answers via majority or weighted synthesis.
- **Query decomposition** -- Break complex questions into subtask DAGs, solve in parallel, synthesize results.
-- **REST API** -- Full HTTP API via `duh serve` with API key auth, rate limiting, and WebSocket streaming.
+- **REST API** -- Full HTTP API with API key auth, rate limiting, WebSocket streaming, and Prometheus metrics.
- **MCP server** -- AI agent integration via `duh mcp` (Model Context Protocol).
- **Python client** -- Async and sync client library for the REST API (`pip install duh-client`).
- **Batch processing** -- Process multiple questions from a file (`duh batch`).
-- **Export** -- Export threads as JSON or Markdown (`duh export`).
-- **Mistral provider** -- Native Mistral AI support alongside Anthropic, OpenAI, and Google.
+- **Export** -- Export threads as JSON, Markdown, or PDF (`duh export`).
- **Decision taxonomy** -- Auto-classify decisions by intent, category, and genus for structured recall.
- **Outcome tracking** -- Record success/failure/partial feedback on past decisions.
- **Tool-augmented reasoning** -- Models can call web search, read files, and execute code during consensus.
-- **Persistent memory** -- Every thread, contribution, decision, vote, and subtask stored in SQLite. Search with `duh recall`.
+- **Persistent memory** -- SQLite or PostgreSQL. Every thread, contribution, decision, vote, and subtask stored. Search with `duh recall`.
- **Backup & restore** -- `duh backup` / `duh restore` with merge mode for SQLite and JSON export.
- **Cost tracking** -- Per-model token costs in real-time. Configurable warn threshold and hard limit.
- **Local models** -- Ollama and LM Studio via the OpenAI-compatible API. Mix cloud + local.
- **Rich CLI** -- Styled panels, spinners, and formatted output.
@@ -59,6 +62,12 @@ duh batch questions.txt # Process multiple questions
duh batch questions.jsonl --format json # Batch with JSON output
duh export <thread-id> # Export thread as JSON
duh export <thread-id> --format markdown # Export as Markdown
duh export <thread-id> --format pdf # Export as PDF
duh backup ./backup.db # Backup database
duh restore ./backup.db # Restore database
duh calibration # Show confidence calibration
duh user-create --email u@x.com --password ... # Create user
duh user-list # List users
```

## How consensus works
@@ -109,8 +118,13 @@ Full documentation: [docs/](docs/index.md)
- [Export](docs/export.md)
- [Python API](docs/python-api/library-usage.md)
- [Docker Guide](docs/guides/docker.md)
- [Authentication](docs/guides/authentication.md)
- [Config Reference](docs/reference/config-reference.md)

## Sponsor

If duh is useful to you, consider [sponsoring the project](https://github.com/sponsors/msitarzewski).

## License

TBD
99 changes: 99 additions & 0 deletions docs/concepts/epistemic-confidence.md
@@ -0,0 +1,99 @@
# Epistemic Confidence

duh uses a two-dimensional confidence system that separates the quality of the deliberation process (**rigor**) from the theoretical limits of the question domain (**domain cap**). The final **confidence** score reflects both.

## Key concepts

### Rigor

Rigor measures how well the consensus process challenged the answer. It ranges from 0.5 to 1.0 and is computed during the COMMIT phase based on:

- How substantive the challenges were
- Whether the revision addressed the challenges
- Whether multiple rounds of deliberation improved the answer

A high rigor score means the answer survived meaningful scrutiny. A low rigor score means the challenges were weak or the proposer didn't adequately respond.

### Domain caps

Not all questions can be answered with equal certainty. duh classifies each question's **intent** (via taxonomy) and applies a ceiling:

| Domain | Cap | Rationale |
|--------|-----|-----------|
| Factual | 0.95 | Verifiable facts can be highly certain |
| Technical | 0.90 | Technical analysis has some inherent uncertainty |
| Creative | 0.85 | Creative work is subjective |
| Judgment | 0.80 | Value judgments vary by perspective |
| Strategic | 0.70 | Strategy involves unpredictable futures |
| Default | 0.85 | When classification is uncertain |

### Confidence

The final confidence score is:

```
confidence = min(domain_cap(intent), rigor)
```

This means even a perfect deliberation process can't claim 95% confidence on a strategic question -- the domain cap limits it to 70%.
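
The formula and the cap table can be expressed in a few lines of Python. This is an illustrative sketch, not duh's actual implementation: the names `DOMAIN_CAPS`, `DEFAULT_CAP`, and `confidence` are hypothetical, while the cap values and the 0.5 to 1.0 rigor range come from this page.

```python
from typing import Optional

# Cap values copied from the "Domain caps" table above.
DOMAIN_CAPS = {
    "factual": 0.95,
    "technical": 0.90,
    "creative": 0.85,
    "judgment": 0.80,
    "strategic": 0.70,
}
DEFAULT_CAP = 0.85  # used when classification is uncertain


def confidence(intent: Optional[str], rigor: float) -> float:
    """Return min(domain_cap(intent), rigor) for a rigor score in [0.5, 1.0]."""
    if not 0.5 <= rigor <= 1.0:
        raise ValueError(f"rigor must be within [0.5, 1.0], got {rigor}")
    cap = DOMAIN_CAPS.get(intent.lower(), DEFAULT_CAP) if intent else DEFAULT_CAP
    return min(cap, rigor)
```

For example, `confidence("strategic", 1.0)` is limited by the cap, while `confidence("factual", 0.6)` is limited by the rigor score.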

### Intent classification

During the COMMIT phase, duh classifies the question's intent using a taxonomy. The classification determines which domain cap applies. Examples:

- "What year was Python released?" -- factual (cap: 0.95)
- "Should I use PostgreSQL or MongoDB?" -- technical (cap: 0.90)
- "Write a poem about the ocean" -- creative (cap: 0.85)
- "Is remote work better than office work?" -- judgment (cap: 0.80)
- "Should we expand into the European market?" -- strategic (cap: 0.70)

## Calibration

Calibration measures whether confidence scores match observed outcomes over time. In a well-calibrated system:

- Decisions with 90% confidence should be correct ~90% of the time
- Decisions with 70% confidence should be correct ~70% of the time

### Recording outcomes

To build calibration data, record whether decisions were correct:

**CLI**:
```bash
duh feedback <thread-id> success # Decision was correct
duh feedback <thread-id> failure # Decision was wrong
duh feedback <thread-id> partial # Decision was partially correct
```

**Web UI**: Use the inline Pass/Partial/Fail buttons on the Threads page (/threads).

**API**:
```bash
curl -X POST http://localhost:8080/api/feedback \
-H "Content-Type: application/json" \
-d '{"thread_id": "abc123", "result": "success"}'
```
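
The same request can be built from Python with only the standard library. This is a hedged sketch: the URL and JSON fields mirror the curl example above, the helper names are hypothetical, and any authentication headers a production deployment requires are omitted.

```python
import json
from urllib.request import Request, urlopen

# Host, port, and path taken from the curl example above.
FEEDBACK_URL = "http://localhost:8080/api/feedback"
# Result values taken from the `duh feedback` CLI examples above.
VALID_RESULTS = {"success", "failure", "partial"}


def build_feedback_request(thread_id: str, result: str, url: str = FEEDBACK_URL) -> Request:
    """Build (but do not send) the POST request for recording an outcome."""
    if result not in VALID_RESULTS:
        raise ValueError(f"result must be one of {sorted(VALID_RESULTS)}")
    body = json.dumps({"thread_id": thread_id, "result": result}).encode("utf-8")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_feedback(thread_id: str, result: str) -> int:
    """Send the request and return the HTTP status code (needs a running server)."""
    with urlopen(build_feedback_request(thread_id, result), timeout=10) as resp:
        return resp.status
```

Separating request construction from sending keeps the payload easy to inspect without a running server.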

### ECE (Expected Calibration Error)

The calibration page shows ECE -- a single number that summarizes how well-calibrated the system is. Lower is better:

- **ECE < 0.05**: Excellent calibration
- **ECE 0.05--0.10**: Good calibration
- **ECE > 0.10**: Needs more data or model adjustment

ECE is computed by:
1. Bucketing decisions by confidence (e.g., 0.7--0.8)
2. Comparing mean predicted confidence to actual success rate in each bucket
3. Averaging the absolute difference, weighted by bucket size
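
The three steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not duh's own code: it uses ten equal-width buckets, treats each outcome as simply correct or incorrect, and does not model how `partial` feedback is scored, since this page does not specify that.

```python
from typing import Iterable, List, Tuple


def expected_calibration_error(
    records: Iterable[Tuple[float, bool]], n_buckets: int = 10
) -> float:
    """Compute ECE from (confidence, was_correct) pairs."""
    buckets: List[List[Tuple[float, float]]] = [[] for _ in range(n_buckets)]
    # Step 1: bucket decisions by confidence; 1.0 goes in the last bucket.
    for conf, correct in records:
        idx = min(int(conf * n_buckets), n_buckets - 1)
        buckets[idx].append((conf, 1.0 if correct else 0.0))

    total = sum(len(b) for b in buckets)
    if total == 0:
        return 0.0

    ece = 0.0
    for bucket in buckets:
        if not bucket:
            continue
        # Step 2: mean predicted confidence vs. actual success rate.
        mean_conf = sum(c for c, _ in bucket) / len(bucket)
        success_rate = sum(o for _, o in bucket) / len(bucket)
        # Step 3: absolute difference, weighted by bucket size.
        ece += (len(bucket) / total) * abs(mean_conf - success_rate)
    return ece
```

With few recorded outcomes, individual buckets hold only a handful of decisions, so the result is noisy until more feedback accumulates.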

### Viewing calibration

- **CLI**: `duh calibration` shows a table of calibration buckets
- **Web UI**: Visit `/calibration` for a visual calibration curve
- **API**: `GET /api/calibration` returns bucket data

## Related

- [How Consensus Works](how-consensus-works.md) -- The deliberation process that produces rigor scores
- [Web UI](../web-ui.md) -- Calibration page and batch feedback
29 changes: 29 additions & 0 deletions docs/guides/authentication.md
@@ -282,6 +282,35 @@ X-RateLimit-Remaining: 57
X-RateLimit-Key: user:a1b2c3d4-...
```

## Web UI Authentication

The web UI integrates with the backend authentication system. It detects whether auth is required and adapts accordingly.

### Dev mode (no auth required)

When no API keys or users exist in the database, the API runs in open mode. The web UI detects this via `GET /api/auth/status` and automatically logs in as a guest user. No login page is shown.

This is the default behavior when you first run `duh serve` -- you can start using the web UI immediately without setting up users.

### Production mode (auth required)

Once you create a user or API key, the web UI requires authentication:

1. **Redirect to login**: All routes except `/share/:id` redirect to `/login` if the user is not authenticated
2. **Login form**: Enter email and password to receive a JWT token
3. **Token storage**: The JWT token is stored in `localStorage` (key: `duh_token`)
4. **Auto-injection**: The API client automatically includes the token in all requests via the `Authorization: Bearer` header
5. **WebSocket auth**: The token is included in the initial WebSocket handshake message
6. **Session expiry**: On 401 responses, the stored token is cleared and the user is redirected to login
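
Steps 3, 4, and 6 can be modeled in a few lines. The real frontend is a browser application, so the Python sketch below is only a language-neutral illustration of the flow: the `AuthSession` class is hypothetical and a plain dict stands in for `localStorage`. Only the `duh_token` key, the `Authorization: Bearer` header, and the `/login` redirect come from the list above.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

TOKEN_KEY = "duh_token"  # storage key named in step 3


@dataclass
class AuthSession:
    """Hypothetical model of the client-side token lifecycle."""

    storage: Dict[str, str] = field(default_factory=dict)  # stand-in for localStorage

    def login(self, token: str) -> None:
        # Step 3: persist the JWT after a successful login.
        self.storage[TOKEN_KEY] = token

    def headers(self) -> Dict[str, str]:
        # Step 4: attach the token to every request when one is stored.
        token = self.storage.get(TOKEN_KEY)
        return {"Authorization": f"Bearer {token}"} if token else {}

    def handle_status(self, status: int) -> Optional[str]:
        # Step 6: on 401, clear the token and return the redirect target.
        if status == 401:
            self.storage.pop(TOKEN_KEY, None)
            return "/login"
        return None
```

Keeping the 401 handling in one place means every request path clears a stale token the same way.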

### User menu

When authenticated, the top bar shows the user's display name and role badge. Clicking it reveals a dropdown with the user's email and a sign-out button.

### Registration

The login page includes a toggle to switch between "Sign In" and "Create Account" modes. Registration can be disabled server-side by setting `registration_enabled = false` in `config.toml`.

## Security recommendations

1. **Generate a strong JWT secret**: `openssl rand -hex 32`
26 changes: 24 additions & 2 deletions docs/web-ui.md
@@ -44,6 +44,7 @@ A live cost ticker shows cumulative spend during streaming. When consensus compl
Browse and search all past consensus sessions.

- **Thread list** -- Shows question, status, and creation date for each thread
- **Batch feedback** -- Completed threads show inline Pass/Partial/Fail buttons for quick outcome recording. This feeds the calibration system without having to open each thread individually.
- **Search** -- Filter threads by keyword
- **Status filter** -- Filter by `active`, `complete`, or `failed`
- **Thread detail** (/threads/:id) -- Drill into a specific thread to see the full debate history: every round's proposal, challenges, revision, and decision
@@ -75,6 +76,25 @@ Persistent UI settings stored in `localStorage`:
| Cost threshold | USD limit before warning |
| Sound effects | Toggle phase transition sounds |

### Calibration (/calibration)

Confidence calibration analysis. Shows how well duh's confidence scores predict actual outcomes.

- **Calibration curve**: Buckets of predicted confidence vs. actual success rate
- **ECE (Expected Calibration Error)**: How well-calibrated the predictions are (lower is better)
- **Category filter**: Break down calibration by decision category (factual, technical, creative, etc.)

The page requires outcome data to be useful. Use the batch feedback buttons on the Threads page or `duh feedback` CLI to record whether decisions were correct.

### Login (/login)

The login page appears when authentication is required (production mode). Features:

- Toggle between "Sign In" and "Create Account" modes
- Email and password form using the glassmorphism design system
- Inline error messages for failed authentication
- Automatic redirect to `/` on successful login

### Share (/share/:id)

A standalone page (no sidebar) for viewing shared consensus results via a public share token. Accessible without authentication.
@@ -117,12 +137,14 @@ The frontend uses a typed API client at `web/src/api/client.ts` that wraps `fetc

The WebSocket client (`web/src/api/websocket.ts`) handles consensus streaming. It auto-detects `ws:` vs `wss:` based on the page protocol and connects to `/ws/ask` on the current host.
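
The protocol auto-detection described above amounts to a small URL transformation. The Python sketch below illustrates the rule only; the actual client lives in `web/src/api/websocket.ts` and is not reproduced here, and the function name `websocket_url` is hypothetical.

```python
from urllib.parse import urlsplit


def websocket_url(page_url: str, path: str = "/ws/ask") -> str:
    """Derive the WebSocket URL from the page URL: https -> wss, otherwise ws."""
    parts = urlsplit(page_url)
    scheme = "wss" if parts.scheme == "https" else "ws"
    # Keep the current host (and port); replace the path with the endpoint.
    return f"{scheme}://{parts.netloc}{path}"
```

Deriving the scheme from the page avoids mixed-content errors when the UI is served over HTTPS.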

-State management uses four Zustand stores:
+State management uses six Zustand stores:

| Store | Purpose |
|-------|---------|
| `auth` | JWT token, user info, login/logout, dev mode detection |
| `consensus` | WebSocket connection, phase tracking, round data, final result |
-| `threads` | Thread list, search, pagination |
+| `threads` | Thread list, search, pagination, batch feedback |
| `calibration` | Calibration buckets, ECE, accuracy metrics |
| `decision-space` | Decision data, filters, timeline position |
| `preferences` | User settings (persisted to localStorage) |
