Merged
148 changes: 56 additions & 92 deletions memory-bank/activeContext.md
@@ -1,104 +1,68 @@
# Active Context

**Last Updated**: 2026-02-17
**Current Phase**: v0.5 + Export Feature
**Next Action**: Merge v0.5.0 to main. Export to Markdown & PDF feature implemented.

## Next Task: Model Selection Controls + Provider Updates

### Context
Users can't control which models participate in consensus. `select_proposer()` picks highest `output_cost_per_mtok`, `select_challengers()` picks next-costliest. Problems: no user control (`ConsensusConfig.panel` exists but unused), Google catalog outdated, Perplexity should be challengers-only (search-grounded), Anthropic missing `claude-sonnet-4-6`.

### Changes (7 steps)

1. **Update provider model catalogs**
- `src/duh/providers/google.py:34-67` — Gemini 3 GA + early-access models (web search for latest)
- `src/duh/providers/anthropic.py:36-61` — Add `claude-sonnet-4-6`
- `src/duh/providers/perplexity.py:35-60` — Verify current model IDs/pricing

2. **Add `proposer_eligible` flag to ModelInfo**
- `src/duh/providers/base.py:28-45` — Add `proposer_eligible: bool = True`
- `src/duh/providers/perplexity.py` — Set `proposer_eligible=False` (challengers only, user decision)

3. **Wire `ConsensusConfig.panel` + update selection functions**
- `src/duh/consensus/handlers.py:185-202` (`select_proposer`) — Accept optional `panel`, filter to `proposer_eligible=True`
- `src/duh/consensus/handlers.py:322-356` (`select_challengers`) — Accept optional `panel`
- `src/duh/cli/app.py:236-246`, `src/duh/api/routes/ws.py:108,128`, `src/duh/api/routes/ask.py` — Pass panel

4. **Add CLI flags**: `--proposer MODEL_REF`, `--challengers MODEL_REF,MODEL_REF`, `--panel MODEL_REF,...`
- `src/duh/cli/app.py` (ask command)

5. **Add to REST API**: Optional `panel`, `proposer`, `challengers` fields in ask request body
- `src/duh/api/routes/ask.py`

6. **Tests**: Update `test_propose_handler.py`, `test_challenge_handler.py` for panel filtering + proposer_eligible. Test CLI flags. Fix any tests with hardcoded model catalogs.

7. **Documentation + CLI help**
- `docs/cli/ask.md` — Document `--proposer`, `--challengers`, `--panel` flags
- `docs/api-reference.md` — Document panel/proposer/challengers in `/api/ask`
- `docs/concepts/providers-and-models.md` — Update model lists, model selection explanation
- `docs/getting-started/configuration.md` — Document `[consensus] panel` config
- `docs/reference/config-reference.md` — Add panel, proposer_strategy fields
- `src/duh/cli/app.py` — Update Click help strings for new flags
- `docs/index.md` — Update feature list if needed
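
The selection behavior planned in steps 2 and 3 can be sketched as follows. `ModelInfo`, `proposer_eligible`, `select_proposer`, and `select_challengers` mirror the names above, but the simplified signatures and filtering logic here are assumptions, not the actual `handlers.py` implementation.

```python
# Sketch of panel-aware model selection (assumed shapes, not the real code).
from dataclasses import dataclass

@dataclass
class ModelInfo:
    ref: str                      # e.g. "anthropic/claude-sonnet-4-6"
    output_cost_per_mtok: float   # USD per million output tokens
    proposer_eligible: bool = True

def select_proposer(models, panel=None):
    """Costliest proposer-eligible model, restricted to `panel` if given."""
    pool = [m for m in models if m.proposer_eligible]
    if panel:
        pool = [m for m in pool if m.ref in panel]
    if not pool:
        raise ValueError("no proposer-eligible model in panel")
    return max(pool, key=lambda m: m.output_cost_per_mtok)

def select_challengers(models, proposer, n=2, panel=None):
    """Next-costliest models after the proposer; search-grounded models
    (proposer_eligible=False) remain eligible as challengers."""
    pool = [m for m in models if m.ref != proposer.ref]
    if panel:
        pool = [m for m in pool if m.ref in panel]
    return sorted(pool, key=lambda m: m.output_cost_per_mtok, reverse=True)[:n]
```

Under this sketch, a Perplexity model with `proposer_eligible=False` can never propose but still ranks normally among challengers, and an explicit `panel` narrows both pools.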

### Current model cost ranking (for reference)
| Model | output_cost ($/Mtok) | Provider |
|-------|------------|----------|
| Opus 4.6 | $25.00 | anthropic |
| Sonar Pro | $15.00 | perplexity |
| Sonnet 4.5 | $15.00 | anthropic |
| GPT-5.2 | $14.00 | openai |
| Gemini 3 Pro | $12.00 | google |
| Gemini 2.5 Pro | $10.00 | google |
| Mistral Medium | $8.10 | mistral |
| o3 | $8.00 | openai |
| Sonar Deep Research | $8.00 | perplexity |
| Mistral Large | $6.00 | mistral |
| Haiku 4.5 | $5.00 | anthropic |
**Last Updated**: 2026-02-18
**Current Phase**: Epistemic Confidence (Phase A) — on branch `epistemic-confidence-phase-a`
**Next Action**: Commit, push, create PR to merge to main.

## What Just Shipped: Epistemic Confidence Phase A

### Core Change
Confidence scoring is now **epistemic** — it reflects inherent uncertainty of the question domain, not just challenge quality.

**Before**: `confidence = _compute_confidence(challenges)` — measured rigor only (0.5–1.0 based on sycophancy ratio).
**After**: Two separate scores:
- **Rigor** (renamed from old confidence) — how genuine the challenges were (0.5–1.0)
- **Confidence** — `min(domain_cap(intent), rigor)` — rigor clamped by question type ceiling

### Domain Caps
| Intent | Cap | Rationale |
|--------|-----|-----------|
| factual | 0.95 | Verifiable answers, near-certain |
| technical | 0.90 | Strong consensus possible |
| creative | 0.85 | Subjective, multiple valid answers |
| judgment | 0.80 | Requires weighing trade-offs |
| strategic | 0.70 | Inherent future uncertainty |
| unknown/None | 0.85 | Default conservative cap |
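
The cap table and formula above reduce to a few lines. `DOMAIN_CAPS` matches the constant named elsewhere in these notes; `domain_cap` and `epistemic_confidence` are illustrative helpers, not the exact `handlers.py` code.

```python
# Minimal sketch of epistemic confidence: rigor clamped by a per-intent ceiling.
DOMAIN_CAPS = {
    "factual": 0.95,
    "technical": 0.90,
    "creative": 0.85,
    "judgment": 0.80,
    "strategic": 0.70,
}

def domain_cap(intent):
    # Unknown or missing intent falls back to the conservative default cap.
    return DOMAIN_CAPS.get(intent, 0.85)

def epistemic_confidence(rigor, intent):
    """rigor is in [0.5, 1.0] (challenge quality); the result never
    exceeds the ceiling for the question's domain."""
    return min(domain_cap(intent), rigor)
```

So a strategic question with perfect rigor reports 0.70, while a factual question with mediocre rigor (say 0.6) reports 0.6: whichever signal is lower wins.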

### Files Changed (47 files, +997, -230)
**New files:**
- `src/duh/calibration.py` — ECE (Expected Calibration Error) computation
- `src/duh/memory/migrations.py` — SQLite schema migration (adds rigor column)
- `tests/unit/test_calibration.py` — 15 calibration tests
- `tests/unit/test_confidence_scoring.py` — 20 epistemic confidence tests
- `tests/unit/test_cli_calibration.py` — 4 CLI calibration tests
- `web/src/components/calibration/CalibrationDashboard.tsx` — Calibration viz
- `web/src/pages/CalibrationPage.tsx` — Calibration page
- `web/src/stores/calibration.ts` — Calibration Zustand store

**Modified across full stack:**
- `consensus/handlers.py` — Renamed `_compute_confidence` → `_compute_rigor`, added `_domain_cap()`, `DOMAIN_CAPS`, epistemic formula
- `consensus/machine.py` — Added `rigor` to ConsensusContext, RoundResult
- `consensus/scheduler.py` — Propagates rigor through subtask results
- `consensus/synthesis.py` — Averages rigor across subtask results
- `consensus/voting.py` — Added rigor to VoteResult, VotingAggregation
- `memory/models.py` — Added `rigor` column to Decision ORM
- `memory/repository.py` — Accepts `rigor` param in `save_decision()`
- `memory/context.py` — Shows rigor in context builder output
- `cli/app.py` — All output paths show rigor; new `duh calibration` command; PDF export enhanced
- `cli/display.py` — `show_commit()` and `show_final_decision()` show rigor
- `api/routes/crud.py` — `GET /api/calibration` endpoint; rigor in decision space
- `api/routes/ask.py`, `ws.py`, `threads.py` — Propagate rigor
- `mcp/server.py` — Propagates rigor
- Frontend: ConfidenceMeter, ConsensusComplete, ConsensusPanel, ThreadDetail, TurnCard, ExportMenu, Sidebar, DecisionCloud, stores updated

---

## Current State

- **v0.5 + Export feature on branch `v0.5.0`.** All v0.5 tasks done + export feature added.
- **6 providers shipping**: Anthropic (3 models), OpenAI (3 models), Google (4 models), Mistral (4 models), Perplexity (3 models) — 17 total.
- **1539 Python unit/load tests + 122 Vitest tests** (1661 total), ruff clean.
- **~60 Python source files + 67 frontend source files** (~127 total).
- REST API, WebSocket streaming, MCP server, Python client library, web UI all built.
- Multi-user auth (JWT + RBAC), PostgreSQL support, Prometheus metrics, backup/restore, Playwright E2E.
- CLI commands: `duh ask`, `duh recall`, `duh threads`, `duh show`, `duh models`, `duh cost`, `duh serve`, `duh mcp`, `duh batch`, `duh export`, `duh feedback`, `duh backup`, `duh restore`, `duh user-create`, `duh user-list`.
- Export: `duh export <id> --format pdf/markdown --content full/decision --no-dissent -o file`
- Docs: production-deployment.md, monitoring.md, authentication.md added.
- MkDocs docs site: https://msitarzewski.github.io/duh/
- GitHub repo: https://github.com/msitarzewski/duh
- **Branch `epistemic-confidence-phase-a`** — all changes uncommitted, ready to commit.
- **1586 Python tests + 126 Vitest tests** (1712 total), ruff clean, mypy strict clean.
- **~62 Python source files + 70 frontend source files** (~132 total).
- All previous features intact (v0.1–v0.5 + export).

## v0.5 Delivered

**Theme**: Production hardening, multi-user, enterprise readiness.
**18 tasks across 7 phases** — all complete.

### What Shipped
- User accounts + JWT auth + RBAC (admin/contributor/viewer) — `api/auth.py`, `api/rbac.py`, `models.py:User`
- PostgreSQL support (asyncpg) with connection pooling (`pool_pre_ping`, compound indexes)
- Perplexity provider adapter (6th provider, search-grounded) — `providers/perplexity.py`
- Prometheus metrics (`/api/metrics`) + extended health checks (`/api/health/detailed`)
- Backup/restore CLI (`duh backup`, `duh restore`) with SQLite copy + JSON export/import
- Playwright E2E browser tests (`web/e2e/`)
- Per-user + per-provider rate limiting (middleware keys by user_id > api_key > IP)
- Production deployment documentation (3 new guides)
- 26 multi-user integration tests + 12 load tests (latency, concurrency, rate limiting)
- Alembic migration `005_v05_users.py` (users table, user_id FKs on threads/decisions/api_keys)
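
The rate-limiting precedence noted above (user_id over api_key over IP) can be sketched as a key-derivation helper; the function name and signature are hypothetical, not the middleware's actual API.

```python
# Illustrative rate-limit key precedence: authenticated user wins over
# API key, which wins over falling back to the client IP.
def rate_limit_key(user_id=None, api_key=None, client_ip="0.0.0.0"):
    if user_id is not None:
        return f"user:{user_id}"
    if api_key is not None:
        return f"key:{api_key}"
    return f"ip:{client_ip}"
```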
## Next Task: Model Selection Controls + Provider Updates

### New Source Files (v0.5)
- `src/duh/api/auth.py` — JWT authentication endpoints
- `src/duh/api/rbac.py` — Role-based access control
- `src/duh/api/metrics.py` — Prometheus metrics endpoint
- `src/duh/api/health.py` — Extended health checks
- `src/duh/memory/backup.py` — Backup/restore utilities
- `src/duh/providers/perplexity.py` — Perplexity provider adapter
- `alembic/versions/005_v05_users.py` — User migration
- `docs/guides/production-deployment.md`, `authentication.md`, `monitoring.md`
Deferred from before Phase A. See `progress.md` for details.

## Open Questions (Still Unresolved)

32 changes: 31 additions & 1 deletion memory-bank/decisions.md
@@ -1,6 +1,6 @@
# Architectural Decisions

**Last Updated**: 2026-02-17
**Last Updated**: 2026-02-18

---

@@ -324,3 +324,33 @@
- Remove `create_all` entirely — breaks in-memory test fixtures that don't run alembic
**Consequences**: Tests continue to work (in-memory SQLite still uses `create_all`). Production databases must run `alembic upgrade head` after code updates. This was already the expected workflow but is now enforced.
**References**: `src/duh/cli/app.py:101-104`

---

## 2026-02-18: Epistemic Confidence — Separate Rigor from Confidence

**Status**: Approved
**Context**: The original `_compute_confidence()` in `handlers.py` measured challenge quality (ratio of genuine vs sycophantic challenges), producing a score in [0.5, 1.0]. This was misleading: a factual question ("What is the capital of France?") and a strategic question ("Will AI replace software engineers by 2035?") could both score 1.0 confidence if all challenges were genuine. But inherently uncertain questions should never report near-certain confidence.
**Decision**: Split into two metrics:
- **Rigor** (renamed from old confidence): measures challenge quality, [0.5, 1.0]
- **Confidence** (epistemic): `min(domain_cap(intent), rigor)` — rigor clamped by a per-domain ceiling based on question intent (factual=0.95, technical=0.90, creative=0.85, judgment=0.80, strategic=0.70, default=0.85).
**Alternatives**:
- Single blended score (simpler, but hides the two distinct signals)
- User-configurable caps (more flexible, but adds UX complexity without clear benefit)
- LLM-estimated confidence (model judges own uncertainty — unreliable, circular)
**Consequences**: Confidence scores are now more honest. Strategic questions max out at 70% even with perfect rigor. Rigor is preserved as a separate signal for calibration analysis. Requires `rigor` column added to Decision model. Full-stack change: ORM, handlers, CLI, API, WebSocket, MCP, frontend all updated.
**References**: `src/duh/consensus/handlers.py:641-670`, `src/duh/calibration.py`

---

## 2026-02-18: Lightweight SQLite Migrations (Not Alembic)

**Status**: Approved
**Context**: Adding the `rigor` column to the `decisions` table requires a migration for existing file-based SQLite databases. Alembic handles PostgreSQL migrations, but for SQLite (the default local dev DB), running `alembic upgrade head` is a friction point for casual users.
**Decision**: Created `src/duh/memory/migrations.py` with `ensure_schema()` that runs on startup for file-based SQLite only. Uses `PRAGMA table_info()` to detect missing columns and `ALTER TABLE` to add them. In-memory SQLite uses `create_all` (unchanged). PostgreSQL uses Alembic (unchanged).
**Alternatives**:
- Alembic-only (requires users to run migration command)
- create_all for all databases (can't alter existing tables)
- Manual migration instructions in docs (user friction)
**Consequences**: File-based SQLite databases auto-migrate on startup. Zero friction for local users. PostgreSQL still requires `alembic upgrade head`. Lightweight and self-contained.
**References**: `src/duh/memory/migrations.py`, `src/duh/cli/app.py:107-110`
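
A self-contained sketch of the approach this ADR describes, using stdlib `sqlite3`. The real `ensure_schema()` in `memory/migrations.py` is wired into engine startup and likely differs in detail; this shows only the PRAGMA-detect-then-ALTER pattern.

```python
# Lightweight SQLite migration sketch: detect a missing column with
# PRAGMA table_info, add it with ALTER TABLE. Safe to run repeatedly.
import sqlite3

def ensure_schema(db_path: str) -> None:
    """Add the rigor column to decisions if an older database lacks it."""
    conn = sqlite3.connect(db_path)
    try:
        cols = {row[1] for row in conn.execute("PRAGMA table_info(decisions)")}
        if "rigor" not in cols:
            conn.execute("ALTER TABLE decisions ADD COLUMN rigor REAL")
            conn.commit()
    finally:
        conn.close()
```

Because the column check runs first, the call is idempotent: a database that already has `rigor` is left untouched, which is what makes running it on every startup cheap.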
21 changes: 19 additions & 2 deletions memory-bank/progress.md
@@ -1,10 +1,26 @@
# Progress

**Last Updated**: 2026-02-17
**Last Updated**: 2026-02-18

---

## Current State: v0.5 COMPLETE — Production Hardening & Multi-User
## Current State: Epistemic Confidence Phase A COMPLETE

### Epistemic Confidence Phase A

- **Renamed `_compute_confidence` → `_compute_rigor`** — old "confidence" measured challenge quality, now called "rigor"
- **Added `rigor` field** to Decision ORM model, ConsensusContext, RoundResult, SubtaskResult, VoteResult, VotingAggregation, SynthesisResult
- **Domain caps** — confidence capped by question intent: factual (0.95), technical (0.90), creative (0.85), judgment (0.80), strategic (0.70), default (0.85)
- **Epistemic formula**: `confidence = min(domain_cap(intent), rigor)` — rigor clamped by domain ceiling
- **Calibration module** — `src/duh/calibration.py` computes ECE (Expected Calibration Error) from decisions with outcomes
- **`duh calibration` CLI command** — shows calibration analysis with bucket breakdown
- **`GET /api/calibration` endpoint** — serves calibration data with category/date filters
- **Calibration frontend** — CalibrationDashboard, CalibrationPage, calibration Zustand store
- **SQLite migration** — `src/duh/memory/migrations.py` adds rigor column on startup for file-based SQLite
- **Full-stack propagation** — rigor shown in CLI, API, WebSocket, MCP, frontend across all views
- **Enhanced PDF export** — research-paper quality: header/footer, TOC, provider callouts, confidence meter, Unicode TTF
- 1586 Python tests + 126 Vitest tests (1712 total), ruff clean, mypy strict clean
- New files: calibration.py, migrations.py, test_calibration.py, test_confidence_scoring.py, test_cli_calibration.py, CalibrationDashboard.tsx, CalibrationPage.tsx, calibration.ts

### v0.5 Additions

@@ -151,3 +167,4 @@ Phase 0 benchmark framework — fully functional, pilot-tested on 5 questions.
| 2026-02-17 | v0.5 T14-T18 (Phase 7: Ship) — multi-user integration tests, load tests, docs, migration finalized, version bump | Done |
| 2026-02-17 | v0.5.0 — "It Scales" | **Complete** |
| 2026-02-17 | Export to Markdown & PDF (CLI + API + Web UI) | Done |
| 2026-02-18 | Epistemic Confidence Phase A (rigor + domain caps + calibration) | Done |
37 changes: 37 additions & 0 deletions memory-bank/tasks/2026-02/README.md
@@ -462,3 +462,40 @@
- Manual override classes: `.theme-dark` / `.theme-light` on any ancestor element
- Light mode code block overrides in `animations.css`
- Variables: backgrounds (5), text (3), primary accent, semantic colors (3), borders (3), glass (2), layout (3), typography (1)

---

## Epistemic Confidence Phase A — "Honest Confidence"

### 2026-02-18: Epistemic Confidence Scoring
- Renamed `_compute_confidence()` → `_compute_rigor()` — old metric now properly named
- Added `DOMAIN_CAPS` dict and `_domain_cap(intent)` lookup
- New formula: `confidence = min(domain_cap(intent), rigor)`
- Domain caps: factual (0.95), technical (0.90), creative (0.85), judgment (0.80), strategic (0.70), default (0.85)
- `handle_commit()` now always attempts taxonomy classification to get intent for capping
- Files: `src/duh/consensus/handlers.py`

### 2026-02-18: Rigor Field Propagation (Full Stack)
- Added `rigor: float` to Decision ORM, ConsensusContext, RoundResult, SubtaskResult, VoteResult, VotingAggregation, SynthesisResult
- Updated save_decision(), scheduler, synthesis, voting to propagate rigor
- Updated all CLI outputs (ask, recall, show, export JSON/markdown/PDF)
- Updated API responses (crud, ask, ws, threads) and MCP server
- Updated display (show_commit, show_final_decision)
- Updated context builder to show rigor alongside confidence
- Frontend: ConfidenceMeter, ConsensusComplete, ConsensusPanel, ThreadDetail, TurnCard, ExportMenu, DecisionCloud, stores
- Files: 47 files changed, +997 insertions, -230 deletions

### 2026-02-18: SQLite Schema Migration
- Created `src/duh/memory/migrations.py` — `ensure_schema()` adds rigor column on startup
- Runs for file-based SQLite only (PRAGMA table_info check → ALTER TABLE)
- In-memory SQLite: create_all handles it. PostgreSQL: Alembic handles it.
- Wired into `_create_db()` in `cli/app.py`

### 2026-02-18: Calibration Module + CLI + API + Frontend
- Created `src/duh/calibration.py` — `compute_calibration()` buckets decisions by confidence, computes ECE
- `CalibrationBucket` and `CalibrationResult` dataclasses
- `duh calibration [--category CAT]` CLI command
- `GET /api/calibration` endpoint with category/since/until filters
- Frontend: CalibrationDashboard (metric cards + bar chart + bucket table), CalibrationPage, calibration Zustand store
- Tests: 15 calibration tests, 20 confidence scoring tests, 4 CLI calibration tests
- **Total: 1586 Python + 126 Vitest = 1712 tests**
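
The bucketing-plus-ECE computation described above can be sketched as follows. The function and its `(confidence, outcome)` input shape are assumptions for illustration; the real `compute_calibration()` returns `CalibrationBucket`/`CalibrationResult` dataclasses rather than a bare float.

```python
# Illustrative Expected Calibration Error: bucket decisions by predicted
# confidence, compare each bucket's mean confidence to its observed
# accuracy, and average the gaps weighted by bucket size.
def expected_calibration_error(decisions, n_buckets=10):
    """decisions: iterable of (confidence, outcome), outcome in {0, 1}."""
    buckets = [[] for _ in range(n_buckets)]
    for conf, outcome in decisions:
        idx = min(int(conf * n_buckets), n_buckets - 1)  # conf=1.0 -> last bucket
        buckets[idx].append((conf, outcome))
    total = sum(len(b) for b in buckets)
    if total == 0:
        return 0.0
    ece = 0.0
    for b in buckets:
        if not b:
            continue
        mean_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(o for _, o in b) / len(b)
        ece += (len(b) / total) * abs(mean_conf - accuracy)
    return ece
```

A perfectly calibrated system (70%-confident decisions correct 70% of the time) scores 0; the gap grows as stated confidence drifts from observed accuracy.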
4 changes: 2 additions & 2 deletions memory-bank/toc.md
@@ -3,8 +3,8 @@
## Core Files
- [projectbrief.md](./projectbrief.md) — Vision, tenets, architecture, build sequence
- [techContext.md](./techContext.md) — Tech stack decisions with rationale (Python, Docker, SQLAlchemy, frontend, tools, etc.)
- [decisions.md](./decisions.md) — Architectural decisions with context, alternatives, and consequences (18 ADRs)
- [activeContext.md](./activeContext.md) — Current state, v0.5 complete, ready to merge to main
- [decisions.md](./decisions.md) — Architectural decisions with context, alternatives, and consequences (20 ADRs)
- [activeContext.md](./activeContext.md) — Current state, epistemic confidence Phase A complete
- [progress.md](./progress.md) — Milestone tracking, what's built, what's next
- [competitive-landscape.md](./competitive-landscape.md) — Research on existing tools, frameworks, and academic work
- [quick-start.md](./quick-start.md) — Session entry point, v0.5 complete, key file references