Merged
57 changes: 19 additions & 38 deletions memory-bank/activeContext.md
Original file line number Diff line number Diff line change
@@ -1,45 +1,27 @@
# Active Context

**Last Updated**: 2026-02-18
**Current Phase**: Consensus UX — right-side nav, collapsible sections, decision-first layout
**Next Action**: PR open for review.
**Last Updated**: 2026-02-19
**Current Phase**: UX cleanup and consensus engine hardening
**Next Action**: PR ready for review.

## What Just Shipped: Consensus Navigation & Collapsible Sections
## What Just Shipped: UX Cleanup + Consensus Engine Improvements

### Core Changes
The consensus page and thread detail view now have proper navigation and information hierarchy for multi-round deliberations.
### Thread Detail UX
- All round sections collapsed by default when thread loads — decision stays open
- Dissent inside decision block collapsed by default
- `DissentBanner` gained `defaultOpen` prop for caller control

**Before**: Long vertical scroll of rounds with no way to navigate or collapse. Decision buried at the bottom after all rounds.
**After**:
- Sticky right-side nav panel shows progress through rounds/phases
- All sections are independently collapsible via a shared `Disclosure` primitive
- Decision surfaces to the **top** when consensus is complete (both live + stored threads)
- Individual challengers shown by model name in nav and each collapsible
- Dissent gets equal treatment: collapsible `DissentBanner` with model attribution parsed from `[model:name]:` prefix
### Consensus Engine Hardening
- **max_tokens bumped 4096 -> 16384** for propose/challenge/revise phases — prevents LLM output truncation on long responses
- **Token budget in system prompts** — LLMs now told their output budget so they can self-regulate length and end on complete thoughts
- **Truncation detection** — `finish_reason` checked after each handler call; `truncated` flag sent via WebSocket; amber warning shown in PhaseCard UI
- **Cross-provider challenger selection** — `select_challengers()` now prefers models from different providers (one per provider first, then fills). Prevents e.g. Opus proposing + two Sonnet variants challenging (same training biases)

### New Shared Component: `Disclosure`
Reusable chevron + toggle primitive (`web/src/components/shared/Disclosure.tsx`):
- Props: `header`, `defaultOpen`, `forceOpen`, `className`
- Used by: PhaseCard, TurnCard, ConsensusComplete, DissentBanner, ThreadDetail
### Visual Polish
- Export dropdown menus (both `ConsensusComplete` and `ExportMenu`) now use glass styling matching the design system (`glass-bg` + `backdrop-blur`)

### Files Changed (17 files)
**New files:**
- `web/src/components/shared/Disclosure.tsx` — Shared collapsible primitive
- `web/src/components/consensus/ConsensusNav.tsx` — Sticky nav for live consensus
- `web/src/components/threads/ThreadNav.tsx` — Sticky nav for thread detail
- `web/src/__tests__/consensus-nav.test.tsx` — 32 tests (Disclosure, PhaseCard, DissentBanner, TurnCard, ConsensusNav)
- `web/src/__tests__/thread-nav.test.tsx` — 8 tests (ThreadNav)

**Modified:**
- `PhaseCard.tsx` — Uses Disclosure for outer collapse + per-challenger Disclosure
- `TurnCard.tsx` — Uses Disclosure for outer collapse + per-contribution Disclosure
- `ConsensusComplete.tsx` — Collapsible via Disclosure, dissent moved inside panel
- `DissentBanner.tsx` — Uses Disclosure, parses `[model:name]:` prefix for ModelBadge
- `ConsensusPanel.tsx` — Decision at top when complete, scroll target IDs
- `ConsensusPage.tsx` — Flex-row layout with sticky ConsensusNav sidebar
- `ThreadDetail.tsx` — Decision surfaced to top, DissentBanner for dissent, scroll IDs
- `ThreadDetailPage.tsx` — Flex-row layout with sticky ThreadNav sidebar
- Barrel exports: `consensus/index.ts`, `threads/index.ts`, `shared/index.ts`
### PDF Export Bug Fix
- `_setup_fonts()` was missing the bold-italic (`BI`) TTF font variant, which caused a crash when dissent content contained bold markdown rendered in an italic context

### Test Results
- 1586 Python tests + 166 Vitest tests (1752 total)
@@ -49,10 +31,9 @@ Reusable chevron + toggle primitive (`web/src/components/shared/Disclosure.tsx`)

## Current State

- **Branch `consensus-nav-collapsible`** — ready for PR.
- **Branch `ux-cleanup`** — ready for PR.
- **1586 Python tests + 166 Vitest tests** (1752 total).
- **~62 Python source files + 75 frontend source files** (~137 total).
- All previous features intact (v0.1–v0.5 + export + epistemic confidence).
- All previous features intact (v0.1–v0.5 + export + epistemic confidence + consensus nav).

## Open Questions (Still Unresolved)

1 change: 1 addition & 0 deletions memory-bank/progress.md
@@ -181,3 +181,4 @@ Phase 0 benchmark framework — fully functional, pilot-tested on 5 questions.
| 2026-02-17 | Export to Markdown & PDF (CLI + API + Web UI) | Done |
| 2026-02-18 | Epistemic Confidence Phase A (rigor + domain caps + calibration) | Done |
| 2026-02-18 | Consensus nav + collapsible sections + decision-first layout | Done |
| 2026-02-19 | UX cleanup: collapse defaults, max_tokens 16384, cross-provider challengers, truncation detection, glass exports, PDF BI font fix | Done |
44 changes: 44 additions & 0 deletions memory-bank/tasks/2026-02/190219_ux-cleanup.md
@@ -0,0 +1,44 @@
# 190219_ux-cleanup

## Objective
UX polish and bug fixes: thread detail collapse defaults, consensus engine improvements (token limits, cross-provider challengers, truncation detection), export menu glass styling, PDF export crash fix.

## Outcome
- All thread sections collapsed by default except decision (with dissent)
- Consensus `max_tokens` bumped 4096 -> 16384 for propose/challenge/revise
- Token budget communicated to LLMs in system prompts to prevent truncation
- Truncation detection: `finish_reason` checked after each phase, `truncated` flag sent via WebSocket, amber warning shown in PhaseCard UI
- Challenger selection prefers cross-provider diversity (one per provider first, then fill)
- Export dropdown menus use glass styling (`glass-bg` + `backdrop-blur`)
- PDF export crash fixed: missing bold-italic (`BI`) TTF font variant
- All 1586 Python + 166 Vitest tests pass
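The truncation check described above can be sketched as follows. `ModelResponse` here is a hypothetical stand-in for the provider response object, not the project's actual class; real field names may differ.

```python
from dataclasses import dataclass


@dataclass
class ModelResponse:
    # Hypothetical stand-in for the provider response; real fields may differ.
    content: str
    finish_reason: str  # "stop" when the model ended its output naturally


def is_truncated(resp: ModelResponse) -> bool:
    # Any finish_reason other than "stop" (e.g. "length") means the output
    # hit the token limit mid-thought.
    return resp.finish_reason != "stop"


def phase_complete_event(phase: str, resp: ModelResponse) -> dict:
    # The WebSocket event carries the flag so the UI can render a warning.
    return {
        "type": "phase_complete",
        "phase": phase,
        "content": resp.content,
        "truncated": is_truncated(resp),
    }
```

The frontend then keys its amber "Output truncated" warning off the `truncated` field of this event.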

## Files Modified

### Backend
- `src/duh/consensus/handlers.py` — `max_tokens` 4096 -> 16384; `_token_budget_note()` helper appended to all system prompts; `select_challengers()` rewritten for cross-provider diversity (prefers one model per different provider, then fills same-provider, then self-ensemble)
- `src/duh/api/routes/ws.py` — Captures `ModelResponse` from propose/challenge/revise handlers; sends `truncated` boolean in `phase_complete` and `challenge` WebSocket events
- `src/duh/cli/app.py` — Added `self.add_font("DuhSans", "BI", path)` to fix bold-italic crash in PDF export

### Frontend
- `web/src/components/threads/ThreadDetail.tsx` — All rounds `defaultOpen={false}`; dissent in decision block `defaultOpen={false}`
- `web/src/components/consensus/DissentBanner.tsx` — Added `defaultOpen` prop (defaults `true` for backward compat)
- `web/src/components/consensus/PhaseCard.tsx` — Added `truncated` prop; renders amber "Output truncated" warning when content hit token limit; `challenges` type updated to include `truncated` field
- `web/src/components/consensus/ConsensusPanel.tsx` — Passes `truncated` flag from round data to PROPOSE and REVISE PhaseCards
- `web/src/components/consensus/ConsensusComplete.tsx` — Export dropdown uses glass styling
- `web/src/components/shared/ExportMenu.tsx` — Export dropdown uses glass styling
- `web/src/stores/consensus.ts` — Added `truncated: string[]` to `RoundData`; `ChallengeEntry` gains `truncated` field; `handleEvent` tracks truncation per phase
- `web/src/api/types.ts` — Added `truncated?: boolean` to `WSPhaseComplete` and `WSChallenge`

## Patterns Applied
- `systemPatterns.md#Disclosure` — reused for DissentBanner defaultOpen prop
- Cross-provider challenger selection follows existing `select_challengers` pattern but adds provider diversity layer
- Token budget note follows existing `_grounding_prefix()` pattern for system prompt composition

## Architectural Decisions
- **Token budget in system prompt**: LLMs don't know their `max_tokens` limit. Adding a budget instruction to the system prompt lets models self-regulate output length. Not a guarantee (models can't count tokens precisely), but it dramatically reduces truncation.
- **Cross-provider challengers**: Prefers models from different providers for genuine intellectual diversity. Same-provider models may share training data biases, reducing challenge quality.
- **16384 max_tokens**: 4x increase from 4096. Balances thorough responses against cost (output tokens dominate cost for expensive models).
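The selection order behind the cross-provider decision can be sketched as follows. The `Model` record is a hypothetical stand-in, and the cost-based ordering within each tier is omitted for brevity.

```python
from dataclasses import dataclass


@dataclass
class Model:
    # Hypothetical stand-in for the project's panel model record.
    model_ref: str    # e.g. "anthropic:claude-opus"
    provider_id: str  # e.g. "anthropic"


def select_challengers_sketch(models: list[Model], proposer: str, count: int) -> list[str]:
    proposer_provider = proposer.split(":")[0]
    others = [m for m in models if m.model_ref != proposer]
    cross = [m for m in others if m.provider_id != proposer_provider]
    same = [m for m in others if m.provider_id == proposer_provider]

    selected: list[str] = []
    used: set[str] = set()
    # Pass 1: at most one model per distinct provider, for maximum diversity.
    for m in cross:
        if len(selected) < count and m.provider_id not in used:
            selected.append(m.model_ref)
            used.add(m.provider_id)
    # Pass 2: any remaining cross-provider models.
    for m in cross:
        if len(selected) < count and m.model_ref not in selected:
            selected.append(m.model_ref)
    # Pass 3: same-provider models.
    for m in same:
        if len(selected) < count:
            selected.append(m.model_ref)
    # Pass 4: fill leftover slots with the proposer (self-ensemble).
    while len(selected) < count:
        selected.append(proposer)
    return selected
```

With one Anthropic proposer and a panel spanning three providers, the first two challengers come from the two other providers before any same-provider model is considered.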

## Artifacts
- Branch: `ux-cleanup`
15 changes: 15 additions & 0 deletions memory-bank/tasks/2026-02/README.md
@@ -499,3 +499,18 @@
- Frontend: CalibrationDashboard (metric cards + bar chart + bucket table), CalibrationPage, calibration Zustand store
- Tests: 15 calibration tests, 20 confidence scoring tests, 4 CLI calibration tests
- **Total: 1586 Python + 126 Vitest = 1712 tests**

---

## UX Cleanup + Consensus Engine Improvements

### 2026-02-19: UX Cleanup
- Thread detail: all sections collapsed by default except decision (with dissent)
- `DissentBanner` gained `defaultOpen` prop
- Export dropdown menus use glass styling (`glass-bg` + `backdrop-blur`)
- PDF export crash fix: missing bold-italic (`BI`) TTF font variant in `_setup_fonts()`
- `max_tokens` bumped 4096 -> 16384 for propose/challenge/revise
- Token budget communicated to LLMs in system prompts via `_token_budget_note()`
- Truncation detection: `finish_reason` checked, `truncated` flag sent via WebSocket, amber warning in PhaseCard
- Cross-provider challenger selection: prefers one model per different provider for diversity
- See: [190219_ux-cleanup.md](./190219_ux-cleanup.md)
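The budget note composes into the system prompt the same way the grounding prefix does. A minimal sketch with paraphrased wording (not the project's exact prompt text, and hypothetical function names):

```python
def token_budget_note(max_tokens: int) -> str:
    # Paraphrased budget instruction; the project's exact wording differs.
    return (
        f"\n\nYour response budget is approximately {max_tokens:,} tokens. "
        "Prioritize the most important points and end on a complete thought."
    )


def build_system_prompt(grounding: str, role_system: str, max_tokens: int) -> str:
    # Composition order used by the propose/challenge/revise builders:
    # grounding prefix, then role system text, then the budget note.
    return f"{grounding}\n\n{role_system}{token_budget_note(max_tokens)}"
```

Because the note is appended to every phase's system prompt, all three handlers pass their `max_tokens` value through to the prompt builder rather than hard-coding it.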
14 changes: 10 additions & 4 deletions src/duh/api/routes/ws.py
@@ -131,12 +131,13 @@ async def _stream_consensus(
"round": ctx.current_round,
}
)
await handle_propose(ctx, pm, proposer)
propose_resp = await handle_propose(ctx, pm, proposer)
await ws.send_json(
{
"type": "phase_complete",
"phase": "PROPOSE",
"content": ctx.proposal or "",
"truncated": propose_resp.finish_reason != "stop",
}
)

@@ -153,13 +154,17 @@
"round": ctx.current_round,
}
)
await handle_challenge(ctx, pm, challengers)
for ch in ctx.challenges:
challenge_resps = await handle_challenge(ctx, pm, challengers)
for i, ch in enumerate(ctx.challenges):
resp_truncated = (
i < len(challenge_resps) and challenge_resps[i].finish_reason != "stop"
)
await ws.send_json(
{
"type": "challenge",
"model": ch.model_ref,
"content": ch.content,
"truncated": resp_truncated,
}
)
await ws.send_json({"type": "phase_complete", "phase": "CHALLENGE"})
@@ -175,12 +180,13 @@
"round": ctx.current_round,
}
)
await handle_revise(ctx, pm)
revise_resp = await handle_revise(ctx, pm)
await ws.send_json(
{
"type": "phase_complete",
"phase": "REVISE",
"content": ctx.revision or "",
"truncated": revise_resp.finish_reason != "stop",
}
)

1 change: 1 addition & 0 deletions src/duh/cli/app.py
@@ -1307,6 +1307,7 @@ def _setup_fonts(self) -> None:
self.add_font("DuhSans", "", path)
self.add_font("DuhSans", "B", path)
self.add_font("DuhSans", "I", path)
self.add_font("DuhSans", "BI", path)
self._use_ttf = True
self._font_family = "DuhSans"
break
77 changes: 63 additions & 14 deletions src/duh/consensus/handlers.py
@@ -149,17 +149,30 @@ def _grounding_prefix() -> str:
return f"Today's date is {today}. {_GROUNDING}"


def _token_budget_note(max_tokens: int) -> str:
"""Instruction telling the model its output token budget."""
return (
f"\n\nYour response budget is approximately {max_tokens:,} tokens. "
"Structure your answer to fit within this budget — be thorough but "
"concise. If the topic requires extensive detail, prioritize the most "
"important points and ensure your response ends with a complete thought."
)


# ── Prompt building ───────────────────────────────────────────


def build_propose_prompt(ctx: ConsensusContext) -> list[PromptMessage]:
def build_propose_prompt(
ctx: ConsensusContext, *, max_tokens: int = 16384
) -> list[PromptMessage]:
"""Build prompt messages for the PROPOSE phase.

Round 1: system prompt + question.
Round > 1: system prompt + question + previous round context
(decision and challenges) so the proposer can improve.
"""
system = f"{_grounding_prefix()}\n\n{_PROPOSER_SYSTEM}"
budget = _token_budget_note(max_tokens)
system = f"{_grounding_prefix()}\n\n{_PROPOSER_SYSTEM}{budget}"

if ctx.current_round <= 1 or not ctx.round_history:
user_content = ctx.question
@@ -243,7 +256,7 @@ async def handle_propose(
model_ref: str,
*,
temperature: float = 0.7,
max_tokens: int = 4096,
max_tokens: int = 16384,
tool_registry: ToolRegistry | None = None,
) -> ModelResponse:
"""Execute the PROPOSE phase of consensus.
@@ -279,7 +292,7 @@
msg = f"handle_propose requires PROPOSE state, got {ctx.state.value}"
raise ConsensusError(msg)

messages = build_propose_prompt(ctx)
messages = build_propose_prompt(ctx, max_tokens=max_tokens)
provider, model_id = provider_manager.get_provider(model_ref)

if tool_registry is not None:
@@ -316,6 +329,8 @@ def build_challenge_prompt(
def build_challenge_prompt(
ctx: ConsensusContext,
framing: str = "flaw",
*,
max_tokens: int = 16384,
) -> list[PromptMessage]:
"""Build prompt messages for the CHALLENGE phase.

@@ -325,9 +340,10 @@ def build_challenge_prompt(
Args:
ctx: Consensus context with the proposal to challenge.
framing: One of the challenge framing types.
max_tokens: Token budget communicated to the model.
"""
system_text = _CHALLENGE_FRAMINGS.get(framing, _CHALLENGE_FRAMINGS["flaw"])
system = f"{_grounding_prefix()}\n\n{system_text}"
system = f"{_grounding_prefix()}\n\n{system_text}{_token_budget_note(max_tokens)}"
user_content = (
f"Question: {ctx.question}\n\n"
f"Answer from another expert (do NOT defer to this -- challenge it):\n"
@@ -374,13 +390,43 @@ def select_challengers(
msg = "No panel models available for challenge"
raise InsufficientModelsError(msg)

others = sorted(
(m for m in models if m.model_ref != proposer_model),
proposer_provider = proposer_model.split(":")[0]

others = [m for m in models if m.model_ref != proposer_model]

# Prefer models from different providers for true cross-provider challenge
cross_provider = sorted(
(m for m in others if m.provider_id != proposer_provider),
key=lambda m: m.output_cost_per_mtok,
reverse=True,
)
same_provider = sorted(
(m for m in others if m.provider_id == proposer_provider),
key=lambda m: m.output_cost_per_mtok,
reverse=True,
)

selected = [m.model_ref for m in others[:count]]
# Pick cross-provider first, then fill with same-provider
selected: list[str] = []
used_providers: set[str] = set()
for m in cross_provider:
if len(selected) >= count:
break
# Prefer one model per provider for maximum diversity
if m.provider_id not in used_providers:
selected.append(m.model_ref)
used_providers.add(m.provider_id)
# If still not enough, add remaining cross-provider models
for m in cross_provider:
if len(selected) >= count:
break
if m.model_ref not in selected:
selected.append(m.model_ref)
# Then same-provider models
for m in same_provider:
if len(selected) >= count:
break
selected.append(m.model_ref)
# Fill remaining slots with proposer (same-model ensemble)
while len(selected) < count:
selected.append(proposer_model)
@@ -415,7 +461,7 @@ async def _call_challenger(

Returns (model_ref, framing, response).
"""
messages = build_challenge_prompt(ctx, framing=framing)
messages = build_challenge_prompt(ctx, framing=framing, max_tokens=max_tokens)
provider, model_id = provider_manager.get_provider(model_ref)

if tool_registry is not None:
@@ -446,7 +492,7 @@ async def handle_challenge(
challenger_models: list[str],
*,
temperature: float = 0.7,
max_tokens: int = 4096,
max_tokens: int = 16384,
tool_registry: ToolRegistry | None = None,
) -> list[ModelResponse]:
"""Execute the CHALLENGE phase of consensus.
@@ -527,14 +573,17 @@
# ── REVISE prompt + handler ───────────────────────────────────


def build_revise_prompt(ctx: ConsensusContext) -> list[PromptMessage]:
def build_revise_prompt(
ctx: ConsensusContext, *, max_tokens: int = 16384
) -> list[PromptMessage]:
"""Build prompt messages for the REVISE phase.

System prompt instructs the reviser to address challenges.
User prompt includes the question, original proposal, and all
challenges so the revision addresses each one.
"""
system = f"{_grounding_prefix()}\n\n{_REVISER_SYSTEM}"
budget = _token_budget_note(max_tokens)
system = f"{_grounding_prefix()}\n\n{_REVISER_SYSTEM}{budget}"

challenges_text = "\n\n".join(
f"Challenge from {c.model_ref}:\n{c.content}" for c in ctx.challenges
@@ -557,7 +606,7 @@ async def handle_revise(
model_ref: str | None = None,
*,
temperature: float = 0.7,
max_tokens: int = 4096,
max_tokens: int = 16384,
) -> ModelResponse:
"""Execute the REVISE phase of consensus.

@@ -604,7 +653,7 @@
msg = "handle_revise requires a model_ref or proposal_model"
raise ConsensusError(msg)

messages = build_revise_prompt(ctx)
messages = build_revise_prompt(ctx, max_tokens=max_tokens)
provider, model_id = provider_manager.get_provider(reviser_ref)

response = await provider.send(
2 changes: 2 additions & 0 deletions web/src/api/types.ts
@@ -197,12 +197,14 @@ export interface WSPhaseComplete {
type: 'phase_complete'
phase: ConsensusPhase
content?: string
truncated?: boolean
}

export interface WSChallenge {
type: 'challenge'
model: string
content: string
truncated?: boolean
}

export interface WSCommit {