Merged
73 changes: 54 additions & 19 deletions memory-bank/activeContext.md
@@ -1,37 +1,72 @@
# Active Context

**Last Updated**: 2026-03-08
**Current Phase**: `question-refinement` branch: pre-consensus question refinement, native web search, citations, tools-by-default
**Next Action**: Branch in progress, uncommitted changes staged

## Latest Work (2026-03-08)

### Question Refinement
- Pre-consensus clarification step: analyze question → ask clarifying questions → enrich with answers → proceed to consensus
- `src/duh/consensus/refine.py` — `analyze_question()` + `enrich_question()`, uses MOST EXPENSIVE model (not cheapest)
- API: `POST /api/refine` → `RefineResponse{needs_refinement, questions[]}`, `POST /api/enrich` → `EnrichResponse{enriched_question}`
- CLI: `duh ask --refine "question"` — interactive `click.prompt()` loop, default `--no-refine`
- Frontend: consensus store `'refining'` status, `submitQuestion` → refine → clarify → enrich → `startConsensus`
- `RefinementPanel.tsx` — tabbed UI inside GlassPanel, checkmarks on answered tabs, Skip + Start Consensus buttons
- Graceful fallback: any failure → proceed to consensus with original question
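
The flow above (analyze, clarify, enrich, fall back on any failure) can be sketched as plain orchestration. This is a minimal illustration, not the real `refine.py`: the callables are injected stand-ins for the model calls and the `click.prompt()` loop, and `RefinementResult` mirrors the `RefineResponse{needs_refinement, questions[]}` shape described above.

```python
from dataclasses import dataclass, field


@dataclass
class RefinementResult:
    """Mirrors the RefineResponse payload: does the question need
    clarification, and if so, which questions to ask the user."""
    needs_refinement: bool
    questions: list[str] = field(default_factory=list)


def refine_flow(question: str, analyze, enrich, ask_user) -> str:
    """Pre-consensus refinement: analyze -> clarify -> enrich.

    `analyze`, `enrich`, and `ask_user` are injected callables
    (model calls and the interactive prompt in the real code).
    Any failure falls back to the original question unchanged.
    """
    try:
        result: RefinementResult = analyze(question)
        if not result.needs_refinement:
            return question
        # Collect one answer per clarifying question (CLI prompt loop
        # or RefinementPanel tabs in the real implementation).
        answers = {q: ask_user(q) for q in result.questions}
        return enrich(question, answers)
    except Exception:
        # Graceful fallback: proceed to consensus with the original question.
        return question
```

The key design point is that refinement is strictly best-effort: no exception from the analyze or enrich calls can block the consensus round.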

### Native Provider Web Search
- Providers use server-side search instead of DDG proxy when `config.tools.web_search.native` is true
- `web_search: bool` param added to `ModelProvider.send()` protocol
- Anthropic: `web_search_20250305` server tool in tools[]
- Google: `GoogleSearch()` grounding (replaces function tools — can't coexist)
- Mistral: `{"type": "web_search"}` appended to tools
- OpenAI: `web_search_options={}` only for `_SEARCH_MODELS` set; others fall back to DDG
- Perplexity: no-op (always searches natively)
- `tool_augmented_send`: filters DDG `web_search` tool when native=True, passes flag to provider
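
The `tool_augmented_send` behavior in the last bullet reduces to a small filtering step. A hedged sketch (names are illustrative, not the real function signature): when native search is on, the DDG-backed `web_search` function tool is dropped from the tool list and the provider is instead told, via a flag, to enable its server-side search.

```python
def prepare_tools(tools: list[dict], native_search: bool) -> tuple[list[dict], bool]:
    """Split tool handling for native provider search.

    Returns (tools_to_send, web_search_flag). With native search off,
    the DDG `web_search` function tool stays in the list and the flag
    is False; with it on, the tool is filtered out and the flag tells
    the provider's send() to use its own server-side search.
    """
    if not native_search:
        return tools, False
    filtered = [t for t in tools if t.get("name") != "web_search"]
    return filtered, True
```

This keeps the two mechanisms mutually exclusive, which matters for Google in particular, where grounding and function tools cannot coexist.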

### Citations — Persisted + Domain-Grouped
- `Citation` dataclass (url, title, snippet) on `ModelResponse.citations`
- Extraction per provider: Anthropic (`web_search_tool_result`), Google (grounding metadata), Perplexity (`response.citations`)
- **Persistence**: `citations_json` TEXT column on `Contribution` model, SQLite auto-migration via `ensure_schema()`
- `proposal_citations` tracked on `ConsensusContext` → archived to `RoundResult` → persisted via `_persist_consensus`
- Thread detail API returns `citations` on `ContributionResponse`
- **Domain-grouped Sources nav**: ConsensusNav (live) + ThreadNav (stored) group citations by hostname
- Nested Disclosure: outer "Sources (17)" → inner "wikipedia.org (3)" → P/C/R role badges per citation
- P (green) = propose, C (amber) = challenge, R (blue) = revise
- `CitationList` shared component for inline display below content
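
The domain grouping behind the Sources nav ("Sources (17)" → "wikipedia.org (3)") is straightforward to sketch. The `Citation` fields match the dataclass described above; the grouping helper and the `www.` stripping are assumptions about the UI behavior, shown here in Python although the actual nav components are TypeScript.

```python
from collections import defaultdict
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass
class Citation:
    url: str
    title: str = ""
    snippet: str = ""


def group_by_domain(citations: list[Citation]) -> dict[str, list[Citation]]:
    """Group citations by hostname for the nested Sources disclosure.

    A leading "www." is stripped so www.example.com and example.com
    collapse into one group (an assumption about the real grouping).
    """
    groups: dict[str, list[Citation]] = defaultdict(list)
    for c in citations:
        host = urlparse(c.url).hostname or "unknown"
        groups[host.removeprefix("www.")].append(c)
    return dict(groups)
```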

### Anthropic Streaming + max_tokens
- `AnthropicProvider.send()` now uses streaming internally via `_collect_stream()` — avoids 10-minute timeout
- `max_tokens` bumped from 16384 → 32768 across all 6 handler defaults (propose, challenge, revise, commit, voting, decomposition)
- Citations are part of the value — truncating them undermines trust

### Parallel Challenge Streaming
- `_stream_challenges()` in `ws.py` uses `asyncio.as_completed()` to send each challenge result to the frontend as it finishes
- Previously: all challengers ran in parallel but results were batched after all completed
- Now: first challenger to respond appears immediately in the UI

### Tools Enabled by Default
- `web_search` tool wired through CLI, REST, and WebSocket paths by default
- Provider tool format fix: `tool_augmented_send` builds generic `{name, description, parameters}` — each provider transforms to native format in `send()`
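
The generic-to-native transform in the last bullet can be shown for two providers. These two target shapes reflect the public Anthropic Messages API (`input_schema`) and OpenAI chat completions (`{"type": "function", ...}` wrapper) tool formats at the time of writing; the function names are illustrative, not the project's actual helpers.

```python
def to_anthropic(tool: dict) -> dict:
    """Anthropic's Messages API expects the JSON schema under "input_schema"."""
    return {
        "name": tool["name"],
        "description": tool["description"],
        "input_schema": tool["parameters"],
    }


def to_openai(tool: dict) -> dict:
    """OpenAI chat completions wrap the definition in a "function" object."""
    return {
        "type": "function",
        "function": {
            "name": tool["name"],
            "description": tool["description"],
            "parameters": tool["parameters"],
        },
    }
```

Keeping one generic `{name, description, parameters}` shape upstream means `tool_augmented_send` never needs provider-specific knowledge; each provider's `send()` applies its own transform.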

### Sidebar UX
- New-question button (Heroicons pencil-square) + collapsible sidebar toggle
- Shell manages `desktopSidebarOpen` (default true) + `mobileSidebarOpen` separately
- TopBar shows sidebar toggle when desktop sidebar collapsed or always on mobile

### Test Results
- 1641 Python tests + 194 Vitest tests (1835 total)
- Build clean, all tests pass

---

## Current State

- **Branch `question-refinement`** — in progress, not yet merged
- **1641 Python tests + 194 Vitest tests** (1835 total)
- All previous features intact (v0.1–v0.6)
- Prior work merged: z-index fix, GPT-5.4, .env docs, password reset

## Open Questions (Still Unresolved)

86 changes: 85 additions & 1 deletion memory-bank/decisions.md
@@ -1,6 +1,6 @@
# Architectural Decisions

**Last Updated**: 2026-03-08

---

@@ -354,3 +354,87 @@
- Manual migration instructions in docs (user friction)
**Consequences**: File-based SQLite databases auto-migrate on startup. Zero friction for local users. PostgreSQL still requires `alembic upgrade head`. Lightweight and self-contained.
**References**: `src/duh/memory/migrations.py`, `src/duh/cli/app.py:107-110`

---

## 2026-03-08: Native Provider Web Search Over DDG Proxy

**Status**: Approved
**Context**: The original web search tool used DuckDuckGo as a proxy — every provider's tool calls went through DDG, which returned index pages rather than real content. Most major providers now offer server-side web search that returns higher-quality results with citations.
**Decision**: Add `web_search: bool` parameter to the `ModelProvider.send()` protocol. When `config.tools.web_search.native` is true, each provider uses its native search capability: Anthropic (`web_search_20250305` server tool), Google (`GoogleSearch()` grounding), Mistral (`{"type": "web_search"}`), OpenAI (`web_search_options`), Perplexity (always native). DDG proxy remains as fallback for providers/models that don't support native search.
**Alternatives**:
- DDG-only (simpler, but returns low-quality index pages instead of real content)
- Single search provider for all (e.g., Bing API — adds external dependency and API key)
- Remove web search entirely (loses grounding capability)
**Consequences**: Higher quality search results with real content. Citations extractable from provider responses. Each provider has different native search API shape — increases per-provider complexity. Google grounding and function declarations can't coexist (grounding replaces function tools).
**References**: `src/duh/providers/anthropic.py`, `src/duh/providers/google.py`, `src/duh/providers/mistral.py`, `src/duh/providers/openai.py`, `src/duh/tools/augmented_send.py`

---

## 2026-03-08: Question Refinement Uses Most Expensive Model

**Status**: Approved
**Context**: Question refinement analyzes user questions before consensus to determine if clarification is needed. The analysis quality directly impacts downstream consensus quality — a poorly refined question wastes all subsequent model calls.
**Decision**: `analyze_question()` and `enrich_question()` in `src/duh/consensus/refine.py` use the most expensive configured model (sorted by cost), not the cheapest. The refinement step is a single model call, so the cost difference is minimal compared to the full consensus round it precedes.
**Alternatives**:
- Cheapest model (saves tokens, but poor analysis leads to poor consensus)
- User-configurable refinement model (adds UX complexity)
- Multi-model refinement (overkill — single strong model is sufficient for question analysis)
**Consequences**: Better question analysis quality. Marginal cost increase (one extra expensive model call). Graceful fallback on failure — original question proceeds to consensus unchanged.
**References**: `src/duh/consensus/refine.py`, `src/duh/api/routes/ask.py`, `src/duh/cli/app.py`
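
The model-selection rule is a one-liner over the catalog. A minimal sketch, assuming a simple cost metric of summed input and output $/MTok; the real catalog may weight or structure costs differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelSpec:
    name: str
    input_cost: float   # $ per MTok
    output_cost: float  # $ per MTok


def pick_refinement_model(models: list[ModelSpec]) -> ModelSpec:
    """Pick the most expensive configured model for refinement.

    Cost is a proxy for capability here: one strong analysis call
    is cheap relative to the full consensus round it precedes.
    """
    return max(models, key=lambda m: m.input_cost + m.output_cost)
```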

---

## 2026-03-08: Tools Enabled by Default

**Status**: Approved
**Context**: Web search was originally opt-in. Users who didn't know about the `--tools` flag got ungrounded responses. Most queries benefit from web search grounding.
**Decision**: `web_search` tool is enabled by default across CLI, REST API, and WebSocket paths. The `config.tools.web_search` section controls behavior. Native provider search is preferred when available.
**Alternatives**:
- Opt-in only (simpler, but most users miss it)
- Always-on with no config (inflexible for cost-sensitive users)
- Per-question tool selection (too much UX friction)
**Consequences**: Better default experience — responses are grounded in current information. Slightly higher cost per query (search tool calls). Users can disable via config if needed.
**References**: `src/duh/config/schema.py`, `src/duh/cli/app.py`, `src/duh/api/routes/ws.py`
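
The config surface this decision describes might look roughly like the following. This is a guess at the schema shape: only the `web_search` section name and the `native` flag come from the source; the `enabled` field and defaults are assumptions, and the real schema lives in `src/duh/config/schema.py` (likely pydantic rather than dataclasses).

```python
from dataclasses import dataclass, field


@dataclass
class WebSearchConfig:
    enabled: bool = True   # on by default per this decision (assumed field name)
    native: bool = True    # prefer provider-native search; DDG proxy as fallback


@dataclass
class ToolsConfig:
    web_search: WebSearchConfig = field(default_factory=WebSearchConfig)
```

Cost-sensitive users flip `enabled` off in config rather than hunting for a per-invocation flag.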

---

## 2026-03-08: Citation Persistence on Contributions

**Status**: Approved
**Context**: Citations were emitted over WebSocket during live consensus but never persisted. Viewing a thread later from the Threads section showed no sources — undermining the trust value of native web search.
**Decision**: Add `citations_json` TEXT column to the `Contribution` model (nullable, JSON-encoded list of `{url, title}`). Track `proposal_citations` on `ConsensusContext` and archive to `RoundResult`. Serialize and persist during `_persist_consensus`. Thread detail API returns parsed citations on `ContributionResponse`. ThreadNav shows domain-grouped sources matching ConsensusNav.
**Alternatives**:
- Separate Citation table with FK to Contribution (more normalized, but adds query complexity for marginal benefit)
- Store citations only on Decision (loses per-role attribution)
- Don't persist (simpler, but citations are essential to trust)
**Consequences**: Citations survive beyond the WebSocket session. Thread detail view shows sources grouped by domain with role attribution (P/C/R). SQLite auto-migration handles existing databases. Slightly larger DB rows due to JSON text.
**References**: `src/duh/memory/models.py:146`, `src/duh/api/routes/threads.py`, `src/duh/api/routes/ws.py`, `web/src/components/threads/ThreadNav.tsx`
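
The additive SQLite auto-migration plus JSON round-trip can be sketched end to end. The table name `contributions` and the helper are assumptions for illustration; the real migration runs through `ensure_schema()` in `src/duh/memory/migrations.py`.

```python
import json
import sqlite3


def ensure_citations_column(conn: sqlite3.Connection) -> None:
    """Additive auto-migration: add citations_json if it is missing.

    PRAGMA table_info rows are (cid, name, type, ...); column name is
    index 1. ALTER TABLE ADD COLUMN is safe and idempotent this way.
    """
    cols = {row[1] for row in conn.execute("PRAGMA table_info(contributions)")}
    if "citations_json" not in cols:
        conn.execute("ALTER TABLE contributions ADD COLUMN citations_json TEXT")


# Usage: migrate an existing database, then persist and read back citations.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contributions (id INTEGER PRIMARY KEY, content TEXT)")
ensure_citations_column(conn)
ensure_citations_column(conn)  # second call is a no-op

conn.execute(
    "INSERT INTO contributions (content, citations_json) VALUES (?, ?)",
    ("answer", json.dumps([{"url": "https://example.com", "title": "Example"}])),
)
row = conn.execute("SELECT citations_json FROM contributions").fetchone()
citations = json.loads(row[0])
```

Nullable TEXT keeps old rows valid with no backfill, which is why the migration stays a one-statement ALTER.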

---

## 2026-03-08: Anthropic Streaming Internally in send()

**Status**: Approved
**Context**: Increasing `max_tokens` to 32768 triggered Anthropic SDK's 10-minute timeout error: "Streaming is required for operations that may take longer than 10 minutes." The `send()` method used non-streaming `messages.create()`.
**Decision**: `send()` now calls `_collect_stream()` which uses `messages.stream()` as a context manager and collects the final `Message` via `get_final_message()`. The returned object is identical to `messages.create()` output, so all downstream parsing (citations, tool calls, text concatenation) works unchanged.
**Alternatives**:
- Keep non-streaming and lower max_tokens (loses citation content to truncation)
- Full streaming to frontend (larger change, separate concern)
- Increase Anthropic client timeout (fragile, doesn't scale)
**Consequences**: No more timeout errors at any max_tokens value. Test mocks must mock `messages.stream` context manager instead of `messages.create`. Marginal latency increase from stream overhead (negligible vs network time).
**References**: `src/duh/providers/anthropic.py:222-229`
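
The collect-from-stream pattern is easy to show without the SDK. `FakeStream` below is a stand-in for what `client.messages.stream(...)` returns in the Anthropic SDK (a context manager that yields events and exposes `get_final_message()`); the real `_collect_stream` drains events the same way, and the final message has the same shape as a `messages.create()` result.

```python
class FakeStream:
    """Stand-in for the SDK's streaming context manager."""

    def __init__(self, chunks):
        self._chunks = list(chunks)

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        return False

    def __iter__(self):
        # Yields incremental events; consuming them keeps the
        # connection active past the non-streaming time cap.
        return iter(self._chunks)

    def get_final_message(self):
        # The SDK assembles the complete Message from the events.
        return "".join(self._chunks)


def collect_stream(chunks) -> str:
    """Mirror of _collect_stream: drain the stream, return the result."""
    with FakeStream(chunks) as stream:
        for _event in stream:
            pass
        return stream.get_final_message()
```

Because the collected result matches the non-streaming response shape, downstream citation and tool-call parsing needs no changes; only test mocks move from `messages.create` to `messages.stream`.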

---

## 2026-03-08: Parallel Challenge Streaming via as_completed

**Status**: Approved
**Context**: Challengers were already running in parallel via `asyncio.gather` in `handle_challenge`, but the WebSocket handler sent all results after ALL challengers finished. Users saw nothing until the slowest challenger responded.
**Decision**: New `_stream_challenges()` function in `ws.py` uses `asyncio.as_completed()` to send each challenge result to the frontend immediately as each completes. Builds `ChallengeResult` objects and updates `ctx.challenges` directly, bypassing `handle_challenge`.
**Alternatives**:
- Keep batched approach (simpler, but poor UX — users wait for slowest model)
- Token-level streaming per challenger (much more complex, requires protocol changes)
- Sequential challengers (defeats the purpose of multi-model)
**Consequences**: First challenger to respond appears immediately. More engaging real-time experience. WS test mocks now patch `_stream_challenges` instead of `handle_challenge`. Challenge order in UI reflects completion speed, not configuration order.
**References**: `src/duh/api/routes/ws.py:253-347`, `tests/unit/test_api_ws.py`
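
The pattern behind `_stream_challenges` is plain `asyncio.as_completed`. A self-contained sketch with simulated latency and an injected `send` callable standing in for the WebSocket: each result is forwarded the moment it finishes instead of batching after the slowest model.

```python
import asyncio
import random


async def challenge(model: str) -> str:
    # Simulated model latency; fast models finish (and appear) first.
    await asyncio.sleep(random.uniform(0.01, 0.05))
    return f"{model}: challenge"


async def stream_challenges(models: list[str], send) -> list[str]:
    """Run all challengers in parallel; push each result as it completes."""
    tasks = [asyncio.create_task(challenge(m)) for m in models]
    results = []
    for fut in asyncio.as_completed(tasks):
        result = await fut
        await send(result)          # first finisher reaches the UI first
        results.append(result)
    return results


async def main():
    sent = []

    async def send(msg):
        sent.append(msg)

    results = await stream_challenges(["claude", "gpt", "gemini"], send)
    return sent, results


sent, results = asyncio.run(main())
```

Note the consequence called out above: `results` is ordered by completion speed, not configuration order, which is exactly what the UI now reflects.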
37 changes: 36 additions & 1 deletion memory-bank/progress.md
@@ -4,7 +4,36 @@

---

## Current State: Post v0.6.0 — `question-refinement` Branch In Progress

### Question Refinement + Native Web Search + Citations (2026-03-08)

- **Question refinement**: pre-consensus clarification step (analyze → clarify → enrich → consensus)
- `src/duh/consensus/refine.py`, API routes (`/api/refine`, `/api/enrich`), CLI `--refine` flag
- Frontend: `RefinementPanel.tsx` tabbed UI, consensus store `'refining'` status
- Graceful fallback on failure → original question proceeds to consensus
- **Native provider web search**: Anthropic/Google/Mistral/OpenAI/Perplexity use server-side search
- `web_search: bool` param on `ModelProvider.send()` protocol
- `config.tools.web_search.native` flag controls behavior
- DDG proxy still available as fallback for non-native providers
- **Citations**: `Citation` dataclass on `ModelResponse`, extracted per provider, displayed in frontend
- `CitationList` shared component, `ConsensusNav` collapsible Sources sidebar section
- WS events include `citations` array for PROPOSE and CHALLENGE phases
- **Tools enabled by default**: `web_search` (DuckDuckGo) wired through all paths (CLI, REST, WS)
- **Provider tool format fix**: each provider transforms generic tool defs to native API format
- **Sidebar UX**: new-question button + collapsible sidebar toggle
- **Citation persistence**: `citations_json` on Contribution model, SQLite migration, thread detail API returns citations
- **Domain-grouped Sources**: ConsensusNav + ThreadNav group citations by hostname with Disclosure, P/C/R role badges
- **Anthropic streaming**: `send()` uses `_collect_stream()` internally to avoid 10-min timeout on large max_tokens
- **Parallel challenge streaming**: `_stream_challenges()` sends each result to frontend as it completes via `asyncio.as_completed`
- **max_tokens 32768**: bumped from 16384 across all handlers — citations are essential to trust
- 1641 Python tests + 194 Vitest tests (1835 total), build clean

### Z-index Fix + GPT-5.4 + .env Docs (2026-03-07)

- Z-index stacking context fix, GPT-5.4 model catalog entry, .env.example provider keys
- Password reset flow, SMTP mail module, JWT-scoped tokens
- 1603 Python tests + 185 Vitest tests (1788 total)

### Consensus Navigation & Collapsible Sections

@@ -195,3 +224,9 @@
Phase 0 benchmark framework — fully functional, pilot-tested on 5 questions.
| 2026-03-07 | GPT-5.4 added to model catalog (1M ctx, $2.50/$15.00, no-temperature) | Done |
| 2026-03-07 | .env.example updated with provider API key placeholders | Done |
| 2026-03-07 | README updated with all provider env vars | Done |
| 2026-03-08 | Question refinement (analyze → clarify → enrich → consensus) | In Progress |
| 2026-03-08 | Native provider web search (Anthropic/Google/Mistral/OpenAI/Perplexity) | In Progress |
| 2026-03-08 | Citations extraction + frontend CitationList + ConsensusNav Sources | In Progress |
| 2026-03-08 | Tools enabled by default (web_search wired through CLI/REST/WS) | In Progress |
| 2026-03-08 | Provider tool format fix (generic → native transform per provider) | In Progress |
| 2026-03-08 | Sidebar UX (new-question button, collapsible toggle) | In Progress |