Commit bdcd007
feat(eval): multi-turn conversation mode with turn-by-turn evaluation (#1054)
* feat(eval): add multi-turn conversation mode with turn-by-turn evaluation
Implements issue #1052: support for evaluating multi-turn conversations
where the agent generates each assistant turn with per-turn grading.
- Add ConversationTurn type, mode, turns, aggregation, on_turn_failure, window_size to EvalTest
- Zod schema and YAML parser updates for new fields
- Turn-by-turn loop in orchestrator: accumulate messages, call provider, grade, repeat
- Conversation assertions run after all turns
- Aggregation: mean (default), min (weakest-link), max
- String shorthand in per-turn assertions works identically to top-level
- Cross-field validation (turns requires mode:conversation, etc.)
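
The turn-by-turn loop and aggregation described above can be sketched roughly as follows. This is a minimal illustration, not the orchestrator's actual code: the `Message` shape, the synchronous `provider` callback, the per-turn `grade` function, and the names `runConversation` and `aggregateScores` are all assumptions made for the example, and `on_turn_failure` and `window_size` are left out because their semantics are not spelled out here.

```typescript
// Hypothetical shapes; the real EvalTest / ConversationTurn types may differ.
type Aggregation = "mean" | "min" | "max";
type Message = { role: "user" | "assistant"; content: string };

interface ConversationTurn {
  input: string;
  // Hypothetical per-turn grader returning a score in [0, 1].
  grade?: (reply: string) => number;
}

// Combine per-turn scores: mean (default), min (weakest-link), or max.
function aggregateScores(scores: number[], mode: Aggregation = "mean"): number {
  if (scores.length === 0) return 0; // assumption: empty input scores 0
  switch (mode) {
    case "min":
      return Math.min(...scores);
    case "max":
      return Math.max(...scores);
    default:
      return scores.reduce((sum, s) => sum + s, 0) / scores.length;
  }
}

// Accumulate messages, call the provider, grade the reply, repeat.
// A real provider call would be async; it is synchronous here for brevity.
function runConversation(
  turns: ConversationTurn[],
  provider: (messages: Message[]) => string,
  aggregation: Aggregation = "mean",
): { messages: Message[]; turnScores: number[]; score: number } {
  const messages: Message[] = [];
  const turnScores: number[] = [];
  for (const turn of turns) {
    messages.push({ role: "user", content: turn.input });
    const reply = provider(messages);
    messages.push({ role: "assistant", content: reply });
    // A turn without its own grader contributes a full score of 1.0.
    turnScores.push(turn.grade ? turn.grade(reply) : 1.0);
  }
  return { messages, turnScores, score: aggregateScores(turnScores, aggregation) };
}
```

With `min` aggregation a single weak turn determines the overall score, which is why it is called weakest-link; `mean` smooths over isolated failures.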
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* feat(eval): add multi-turn conversation-live example for UAT
Adds examples/features/multi-turn-conversation-live/ with 5 test cases
exercising conversation mode features: context retention, aggregation
modes, on_turn_failure, mixed assertions, and conversation-level assertions.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* test(eval): add unit tests for multi-turn conversation mode
Tests for conversation-mode orchestrator, validation rules,
and score aggregation (mean/min/max).
Also fixes buildTurnAssertions to emit type: 'llm-grader' with rubrics
instead of type: 'rubrics' (which is not registered in the builtin registry).
The evaluator-parser uses the same pattern for YAML-sourced rubrics.
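
As a rough illustration of that fix, the sketch below folds string-shorthand assertions into a single `llm-grader` assertion carrying `rubrics`, rather than emitting an unregistered `rubrics` type. Only the two type names come from this commit; the `RubricItem` fields, the `AssertionConfig` shape, the id scheme, and the function name `buildTurnAssertionsSketch` are hypothetical and may not match the real `buildTurnAssertions`.

```typescript
// Hypothetical shapes for illustration only.
type RubricItem = { id: string; outcome: string };
type AssertionConfig = { type: string; [key: string]: unknown };
type TurnAssertion = string | AssertionConfig;

// Pass explicit assertion objects through unchanged and collect string
// shorthand into one 'llm-grader' assertion with a rubrics list.
function buildTurnAssertionsSketch(assertions: TurnAssertion[]): AssertionConfig[] {
  const rubrics: RubricItem[] = [];
  const result: AssertionConfig[] = [];
  for (const assertion of assertions) {
    if (typeof assertion === "string") {
      rubrics.push({ id: `rubric-${rubrics.length + 1}`, outcome: assertion });
    } else {
      result.push(assertion);
    }
  }
  if (rubrics.length > 0) {
    // Emit a registered grader type instead of type: 'rubrics'.
    result.push({ type: "llm-grader", rubrics });
  }
  return result;
}
```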
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* fix(eval): correct conversation mode scoring, loader, and serialization
- YAML loader: include `turns` in completeness gate so conversation-only
cases (no top-level criteria/assertions) are not silently skipped
- Orchestrator: stop falling back to evalCase.assertions per-turn; turns
without their own assertions score 1.0 instead of double-counting the top-level assertions
- Orchestrator: pass full transcript as candidate for conversation-level
grading instead of only the last assistant reply
- Orchestrator: serialize structured message content with JSON.stringify
instead of producing [object Object] in transcript strings
- Validator: reject whitespace-only and empty-array turn inputs
- Tests: add regression coverage for double-counting, transcript candidate,
and whitespace input validation
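
Two of the fixes above lend themselves to a short sketch: serializing structured message content with `JSON.stringify`, and rejecting whitespace-only or empty-array turn inputs. The `role: content` transcript layout, the `TranscriptMessage` type, and the function names are assumptions for illustration; the actual orchestrator and validator may be structured differently.

```typescript
// Hypothetical message shape: content is either plain text or structured data.
type MessageContent = string | unknown[] | Record<string, unknown>;
type TranscriptMessage = { role: string; content: MessageContent };

// String(obj) or template interpolation yields "[object Object]" for
// structured content; JSON.stringify keeps the data readable for a grader.
function contentToString(content: MessageContent): string {
  return typeof content === "string" ? content : JSON.stringify(content);
}

// Build the full transcript that is handed to conversation-level grading.
function formatTranscript(messages: TranscriptMessage[]): string {
  return messages.map((m) => `${m.role}: ${contentToString(m.content)}`).join("\n");
}

// Reject whitespace-only strings and empty arrays as turn inputs.
function isValidTurnInput(input: unknown): boolean {
  if (typeof input === "string") return input.trim().length > 0;
  if (Array.isArray(input)) return input.length > 0;
  return false;
}
```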
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
1 parent e659bd7 commit bdcd007
10 files changed
Lines changed: 7486 additions & 3236 deletions
File tree
- docs/plans
- examples/features/multi-turn-conversation-live
- evals
- packages/core
- src/evaluation
- validation
- test/evaluation
- plugins/agentv-dev/skills/agentv-eval-writer/references
Lines changed: 105 additions & 0 deletions