Commit bdcd007

christso, Copilot, and claude authored
feat(eval): multi-turn conversation mode with turn-by-turn evaluation (#1054)
* feat(eval): add multi-turn conversation mode with turn-by-turn evaluation

Implements issue #1052: support for evaluating multi-turn conversations where the agent generates each assistant turn with per-turn grading.

- Add ConversationTurn type, mode, turns, aggregation, on_turn_failure, window_size to EvalTest
- Zod schema and YAML parser updates for new fields
- Turn-by-turn loop in orchestrator: accumulate messages, call provider, grade, repeat
- Conversation assertions run after all turns
- Aggregation: mean (default), min (weakest-link), max
- String shorthand in per-turn assertions works identically to top-level
- Cross-field validation (turns requires mode:conversation, etc.)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* feat(eval): add multi-turn conversation-live example for UAT

Adds examples/features/multi-turn-conversation-live/ with 5 test cases exercising conversation mode features: context retention, aggregation modes, on_turn_failure, mixed assertions, and conversation-level assertions.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* test(eval): add unit tests for multi-turn conversation mode

Tests for conversation-mode orchestrator, validation rules, and score aggregation (mean/min/max).

Also fixes buildTurnAssertions to emit type: 'llm-grader' with rubrics instead of type: 'rubrics' (which is not registered in the builtin registry). The evaluator-parser uses the same pattern for YAML-sourced rubrics.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix(eval): correct conversation mode scoring, loader, and serialization

- YAML loader: include `turns` in completeness gate so conversation-only cases (no top-level criteria/assertions) are not silently skipped
- Orchestrator: stop falling back to evalCase.assertions per-turn; turns without own assertions score 1.0 instead of double-counting top-level
- Orchestrator: pass full transcript as candidate for conversation-level grading instead of only the last assistant reply
- Orchestrator: serialize structured message content with JSON.stringify instead of producing [object Object] in transcript strings
- Validator: reject whitespace-only and empty-array turn inputs
- Tests: add regression coverage for double-counting, transcript candidate, and whitespace input validation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
1 parent e659bd7 commit bdcd007

10 files changed

Lines changed: 7486 additions & 3236 deletions

Lines changed: 55 additions & 0 deletions
@@ -0,0 +1,55 @@
# Issue #1052: Multi-turn Conversational Test Case: Live Turn-by-Turn Evaluation

## Problem

Today, multi-turn evals script all intermediate assistant responses in `input`, so the LLM generates only the last response. As a result, conversation context retention, progressive reasoning, and turn-by-turn quality cannot be measured independently.

## Solution

Add `mode: conversation` with a `turns` array that drives turn-by-turn LLM evaluation with per-turn and conversation-level grading.

### New Schema Fields

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `mode` | `'conversation'` | - | Enables conversation evaluation mode |
| `turns` | `ConversationTurn[]` | - | Ordered user messages; each generates an LLM call |
| `aggregation` | `'mean' \| 'min' \| 'max'` | `'mean'` | How turn scores combine into the final score |
| `on_turn_failure` | `'continue' \| 'stop'` | `'continue'` | What to do when a turn's assertions fail |
| `window_size` | `number` | all turns | Sliding window for context passed to graders |
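
As a rough illustration of the table above, the fields could be typed and defaulted as follows. This is a sketch, not the code in `types.ts`: the interface names other than `ConversationTurn`, the `Assertion` shape, and the `resolveConversationOptions` helper are all hypothetical, and the turn fields are inferred from the rest of this plan.

```typescript
// Hypothetical shapes inferred from the field table; the real definitions live in types.ts.
type Aggregation = "mean" | "min" | "max";
type OnTurnFailure = "continue" | "stop";

// Assumption: an assertion is either a rubric string or a structured evaluator config.
type Assertion = string | { type: string; [key: string]: unknown };

interface ConversationTurn {
  input: string;
  expected_output?: string;
  assertions?: Assertion[];
}

interface ConversationFields {
  mode: "conversation";
  turns: ConversationTurn[];
  aggregation?: Aggregation;
  on_turn_failure?: OnTurnFailure;
  window_size?: number;
}

interface ResolvedConversationOptions {
  aggregation: Aggregation;
  onTurnFailure: OnTurnFailure;
  windowSize: number;
}

// Apply the defaults listed in the table: mean, continue, and a window covering all turns.
function resolveConversationOptions(test: ConversationFields): ResolvedConversationOptions {
  return {
    aggregation: test.aggregation ?? "mean",
    onTurnFailure: test.on_turn_failure ?? "continue",
    windowSize: test.window_size ?? test.turns.length,
  };
}
```

Resolving defaults once up front keeps the rest of a runner free of `undefined` checks; whether the actual implementation does this is not stated in the plan.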

### How It Works

1. `input` provides the system prompt and initial context (same as today).
2. For each entry in `turns`:
   a. Append the user message to the accumulated history.
   b. Call the provider with the full history; the LLM generates the assistant response.
   c. Grade the response against the turn's `assertions` and `expected_output`.
   d. Append the actual LLM response (not `expected_output`) to the history.
3. After all turns, run the top-level `assertions` over the full transcript.
4. Final score = aggregation of per-turn and conversation-level assertion scores.
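
The steps above can be sketched as a small loop. This is a minimal illustration under stated assumptions, not the orchestrator code: `generate`, `gradeTurn`, and `gradeConversation` are hypothetical injected functions, the calls are synchronous for brevity (a real provider call would be asynchronous), a failed turn is signalled by a `passed` flag, and the conversation-level score is assumed to be aggregated as one more score alongside the turn scores.

```typescript
type Message = { role: "system" | "user" | "assistant"; content: string };
type Turn = { input: string; assertions?: string[] };
type Grade = { score: number; passed: boolean };
type Aggregation = "mean" | "min" | "max";

// Hypothetical dependencies; the real provider and graders are not shown in this plan.
interface ConversationDeps {
  generate: (history: Message[]) => string;
  gradeTurn: (reply: string, assertions: string[]) => Grade;
  gradeConversation?: (history: Message[]) => number;
}

// Combine scores with mean (default), min (weakest-link), or max.
function aggregate(scores: number[], mode: Aggregation): number {
  if (scores.length === 0) return 0;
  if (mode === "min") return Math.min(...scores);
  if (mode === "max") return Math.max(...scores);
  return scores.reduce((sum, s) => sum + s, 0) / scores.length;
}

function runConversation(
  initial: Message[],
  turns: Turn[],
  deps: ConversationDeps,
  options: { aggregation: Aggregation; onTurnFailure: "continue" | "stop" },
) {
  const history: Message[] = [...initial];
  const turnScores: number[] = [];

  for (const turn of turns) {
    // 2a. Append the user message to the accumulated history.
    history.push({ role: "user", content: turn.input });
    // 2b. Call the provider with the full history.
    const reply = deps.generate(history);
    // 2c. Grade the reply; a turn without its own assertions scores 1.0.
    const assertions = turn.assertions ?? [];
    const grade = assertions.length > 0 ? deps.gradeTurn(reply, assertions) : { score: 1, passed: true };
    turnScores.push(grade.score);
    // 2d. Append the actual reply, not expected_output, to the history.
    history.push({ role: "assistant", content: reply });
    if (!grade.passed && options.onTurnFailure === "stop") break;
  }

  // 3. Conversation-level grading over the full transcript (assumed to add one score).
  const allScores = [...turnScores];
  if (deps.gradeConversation) allScores.push(deps.gradeConversation(history));

  // 4. Final score is the aggregation of all collected scores.
  return { history, turnScores, score: aggregate(allScores, options.aggregation) };
}
```

With `aggregation: "min"`, a single weak turn determines the final score, which is why the plan calls it weakest-link scoring.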

### Validation Rules

- `turns` requires `mode: conversation`
- `mode: conversation` requires `turns`
- `turns` is incompatible with top-level `expected_output`
- `aggregation` is only valid with `mode: conversation`
- Each turn must have a non-empty `input`
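
The rules above could be checked with a plain function like the one below. This is a sketch under assumptions, not the Zod schema from `eval-file.schema.ts`: `validateConversationFields`, `isEmptyInput`, and the error strings are hypothetical, and treating an empty `turns` array as missing is an assumption.

```typescript
// Hypothetical loose input shape, as it might look straight after YAML parsing.
interface RawEvalTest {
  mode?: string;
  turns?: Array<{ input?: unknown }>;
  aggregation?: string;
  expected_output?: unknown;
}

// A turn input is empty if it is missing, a whitespace-only string, or an empty array.
function isEmptyInput(input: unknown): boolean {
  if (typeof input === "string") return input.trim().length === 0;
  if (Array.isArray(input)) return input.length === 0;
  return input === undefined || input === null;
}

// Returns a list of human-readable errors; an empty list means the test is valid.
function validateConversationFields(test: RawEvalTest): string[] {
  const errors: string[] = [];
  const isConversation = test.mode === "conversation";

  if (test.turns !== undefined && !isConversation) {
    errors.push("`turns` requires `mode: conversation`");
  }
  // Assumption: an empty turns array counts as missing.
  if (isConversation && (test.turns === undefined || test.turns.length === 0)) {
    errors.push("`mode: conversation` requires `turns`");
  }
  if (test.turns !== undefined && test.expected_output !== undefined) {
    errors.push("`turns` is incompatible with top-level `expected_output`");
  }
  if (test.aggregation !== undefined && !isConversation) {
    errors.push("`aggregation` is only valid with `mode: conversation`");
  }
  (test.turns ?? []).forEach((turn, index) => {
    if (isEmptyInput(turn.input)) {
      errors.push(`turns[${index}].input must be non-empty`);
    }
  });

  return errors;
}
```

Collecting every violation instead of stopping at the first one lets a single run report all problems in an eval file.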

### Files Modified

| File | Change |
|------|--------|
| `packages/core/src/evaluation/types.ts` | ConversationTurn, mode, turns, etc. on EvalTest |
| `packages/core/src/evaluation/validation/eval-file.schema.ts` | Zod schema for new fields |
| `packages/core/src/evaluation/yaml-parser.ts` | Parse conversation fields |
| `packages/core/src/evaluation/orchestrator.ts` | Conversation runner in runEvalCase |
| `packages/core/test/evaluation/conversation-mode.test.ts` | Unit tests |
| `examples/features/multi-turn-conversation-live/` | UAT example |

## References

- Issue: #1052
- Research: agentevals-research PR #57
- Prior art: #505 / PR #507 (scripted multi-turn), #331 / PR #1051 (depends_on)
Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
# Multi-Turn Conversation (Live)

This example demonstrates **live turn-by-turn conversation evaluation**, where the LLM generates each assistant response (unlike `multi-turn-conversation/`, which scripts intermediate turns).

## Features Shown

- `mode: conversation`: enables live turn-by-turn evaluation
- `turns[]`: each entry is a user message that generates an LLM call
- Per-turn `assertions`: string shorthand (rubric) and structured evaluators
- `aggregation: mean | min | max`: how turn scores combine
- `on_turn_failure: stop | continue`: behavior on assertion failure
- Top-level `assertions`: conversation-level grading after all turns

## Running

```bash
# With default target
bun apps/cli/src/cli.ts eval examples/features/multi-turn-conversation-live/evals/dataset.eval.yaml

# With specific test
bun apps/cli/src/cli.ts eval examples/features/multi-turn-conversation-live/evals/dataset.eval.yaml --test-id context-retention
```
Lines changed: 105 additions & 0 deletions
@@ -0,0 +1,105 @@
# Multi-turn conversation evaluation (live turn-by-turn)
# Each turn generates a fresh LLM call; per-turn assertions grade each response.
# This is different from multi-turn-conversation/ which scripts intermediate turns.

description: Live multi-turn conversation evaluation with per-turn grading

execution:
  target: llm

tests:
  # Test 1: Basic context retention across turns
  - id: context-retention
    mode: conversation
    criteria: Agent maintains context and provides relevant responses across turns
    aggregation: mean
    input:
      - role: system
        content: |-
          You are a helpful math tutor. Be concise and accurate.
          Always show your work step by step.
    turns:
      - input: What is 15% of 200?
        assertions:
          - Correctly calculates 15% of 200 as 30
          - Shows the calculation steps
      - input: Now double that result.
        assertions:
          - References the previous answer of 30
          - Correctly calculates double as 60
      - input: What were the original numbers I asked about?
        assertions:
          - Recalls that the user asked about 15% and 200
          - Demonstrates memory of the conversation context

  # Test 2: With aggregation: min (weakest-link scoring)
  - id: weakest-link-scoring
    mode: conversation
    criteria: Agent provides accurate, well-structured responses
    aggregation: min
    input:
      - role: system
        content: You are a concise geography expert. Answer in 1-2 sentences.
    turns:
      - input: What is the capital of France?
        assertions:
          - Correctly identifies Paris as the capital of France
      - input: What country is it in?
        assertions:
          - Recognizes the question refers to Paris from the previous turn
          - Confirms Paris is in France

  # Test 3: With on_turn_failure: stop
  - id: stop-on-failure
    mode: conversation
    on_turn_failure: stop
    criteria: Agent follows instructions precisely
    input:
      - role: system
        content: You are a helpful assistant. Be precise and accurate.
    turns:
      - input: What is 2 + 2?
        assertions:
          - Answers with 4
      - input: Multiply that by 3.
        assertions:
          - References the previous answer
          - Calculates 12 correctly

  # Test 4: Mixed string and structured assertions
  - id: mixed-assertions
    mode: conversation
    criteria: Agent writes correct, well-formed Python code
    input:
      - role: system
        content: You are a helpful coding assistant.
    turns:
      - input: Write a Python function that adds two numbers.
        assertions:
          - Contains a Python function definition
          - type: contains
            value: def
      - input: Now add type hints to the function.
        assertions:
          - Includes type hints (int, float, or similar)
          - type: contains
            value: "->"

  # Test 5: Conversation-level assertions
  - id: conversation-coherence
    mode: conversation
    criteria: Agent maintains a coherent, helpful conversation
    input:
      - role: system
        content: You are a helpful travel advisor. Be concise.
    turns:
      - input: I want to visit somewhere warm in December.
        assertions:
          - Suggests at least one warm destination
      - input: I prefer beaches over cities.
        assertions:
          - Adjusts recommendations toward beach destinations
          - Does not suggest purely urban destinations
    assertions:
      - Agent maintains consistency — later suggestions align with earlier preferences
      - Agent does not contradict its own prior recommendations
