
Python: feat(evals): add ground_truth support for similarity evaluator #5234

Open
chetantoshniwal wants to merge 3 commits into main from feature/similarity-ground-truth

Conversation

@chetantoshniwal
Contributor

  • Include expected_output as ground_truth in Foundry JSONL dataset rows
  • Add ground_truth to item schema and data mapping for similarity evaluator
  • Add expected_output parameter to evaluate_workflow
  • Add similarity Pattern 3 to evaluate_agent and evaluate_workflow samples
  • Add tests for ground_truth in dataset, schema, and evaluate_workflow
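For illustration, the expected_output → ground_truth mapping described above would produce Foundry JSONL rows along these lines (field names are taken from the diff; the concrete values are hypothetical):

```python
import json

# Hypothetical evaluation item: expected_output is the reference answer
# the similarity evaluator compares the response against.
item = {
    "query": "What is the capital of France?",
    "response": "The capital of France is Paris.",
    "expected_output": "Paris is the capital of France.",
}

# Mirror of the dataset-building step: expected_output, when present,
# is emitted into the JSONL row under the Foundry field name "ground_truth".
row = {"query": item["query"], "response": item["response"]}
if item.get("expected_output"):
    row["ground_truth"] = item["expected_output"]

print(json.dumps(row))
```

Rows without an expected_output simply omit the ground_truth key, which is why the item schema is only extended when at least one row carries it.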

Motivation and Context

#5135

Contribution Checklist

  • The code builds clean without any errors or warnings
  • The PR follows the Contribution Guidelines
  • All unit tests pass, and I have added new tests where possible
  • This is not a breaking change, so the title carries no "[BREAKING]" prefix
Copilot AI review requested due to automatic review settings April 13, 2026 22:07
@chetantoshniwal chetantoshniwal changed the title from "feat(evals): add ground_truth support for similarity evaluator" to ".NET feat(evals): add ground_truth support for similarity evaluator" on Apr 13, 2026
@github-actions github-actions bot changed the title from ".NET feat(evals): add ground_truth support for similarity evaluator" to "Python: .NET feat(evals): add ground_truth support for similarity evaluator" on Apr 13, 2026
@moonbox3
Contributor

moonbox3 commented Apr 13, 2026

Python Test Coverage

Python Test Coverage Report

| File | Stmts | Miss | Cover | Missing |
| --- | ---: | ---: | ---: | --- |
| packages/core/agent_framework/_evaluation.py | 655 | 72 | 89% | 164, 172, 485, 487, 615, 618, 697–699, 704, 741–744, 800–801, 804, 810–812, 816, 849–851, 907, 943, 955–957, 962, 986–991, 1084, 1162–1163, 1165–1169, 1175, 1214, 1562, 1564, 1572, 1582, 1586, 1631, 1649–1650, 1728, 1730, 1736, 1744, 1759, 1797, 1803–1807, 1839, 1870–1871, 1873, 1898–1899, 1904 |
| packages/foundry/agent_framework_foundry/_foundry_evals.py | 249 | 4 | 98% | 444, 449, 629, 694 |
| TOTAL | 27220 | 3183 | 88% | |

Python Unit Test Overview

| Tests | Skipped | Failures | Errors | Time |
| ---: | ---: | ---: | ---: | --- |
| 5449 | 20 💤 | 0 ❌ | 0 🔥 | 1m 23s ⏱️ |

Contributor

Copilot AI left a comment


Pull request overview

Adds “ground truth” support to Foundry-backed similarity evaluation by mapping EvalItem.expected_output into the Foundry JSONL field ground_truth, and exposing expected_output on evaluate_workflow() to enable reference-based evaluators (e.g., SIMILARITY) for workflows.

Changes:

  • Extend Foundry JSONL item schema + data mapping to include ground_truth for similarity evaluation.
  • Add expected_output parameter to evaluate_workflow() and stamp it onto overall workflow EvalItems when running via queries.
  • Update Foundry eval samples to include a similarity/ground-truth pattern; add tests covering schema/mapping/dataset output.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

Summary per file:

| File | Description |
| --- | --- |
| python/packages/foundry/agent_framework_foundry/_foundry_evals.py | Adds ground-truth evaluator mapping + schema support; stamps expected_output into JSONL ground_truth. |
| python/packages/core/agent_framework/_evaluation.py | Adds expected_output to evaluate_workflow() and stamps it onto overall items in the run+evaluate path. |
| python/packages/foundry/tests/test_foundry_evals.py | Adds tests for ground-truth mapping/schema and workflow expected_output stamping/validation. |
| python/samples/05-end-to-end/evaluation/foundry_evals/evaluate_workflow_sample.py | Adds Pattern 3 sample demonstrating similarity evaluation with ground truth. |
| python/samples/05-end-to-end/evaluation/foundry_evals/evaluate_agent_sample.py | Adds Pattern 3 similarity sample with expected_output. |
Comments suppressed due to low confidence (1)

python/packages/foundry/agent_framework_foundry/_foundry_evals.py:717

  • Similarity (and any evaluator in _GROUND_TRUTH_EVALUATORS) always gets a data_mapping for ground_truth, but there’s no local validation that all items provide expected_output/ground_truth. If a caller requests similarity without expected_output, this will still create an eval definition and fail provider-side with a less actionable error. Add a preflight check here to raise a clear ValueError when a ground-truth-required evaluator is selected and any item is missing expected_output (or alternatively filter out those evaluators similarly to _filter_tool_evaluators).
```python
        # Resolve evaluators with auto-detection
        resolved = _resolve_default_evaluators(self._evaluators, items=items)
        # Filter tool evaluators if items don't have tools
        resolved = _filter_tool_evaluators(resolved, items)

        # Standard JSONL dataset path
        return await self._evaluate_via_dataset(items, resolved, eval_name)

    # -- Internal evaluation paths --

    async def _evaluate_via_dataset(
        self,
        items: Sequence[EvalItem],
        evaluators: list[str],
        eval_name: str,
    ) -> EvalResults:
        """Evaluate using JSONL dataset upload path."""
        dicts: list[dict[str, Any]] = []
        for item in items:
            # Build JSONL dict directly from split_messages + converter
            # to avoid splitting the conversation twice.
            effective_split = item.split_strategy or self._conversation_split
            query_msgs, response_msgs = item.split_messages(effective_split)

            query_text = " ".join(m.text for m in query_msgs if m.role == "user" and m.text).strip()
            response_text = " ".join(m.text for m in response_msgs if m.role == "assistant" and m.text).strip()

            d: dict[str, Any] = {
                "query": query_text,
                "response": response_text,
                "query_messages": AgentEvalConverter.convert_messages(query_msgs),
                "response_messages": AgentEvalConverter.convert_messages(response_msgs),
            }
            if item.tools:
                d["tool_definitions"] = [
                    {"name": t.name, "description": t.description, "parameters": t.parameters()} for t in item.tools
                ]
            if item.context:
                d["context"] = item.context
            if item.expected_output:
                d["ground_truth"] = item.expected_output
            dicts.append(d)

        has_context = any("context" in d for d in dicts)
        has_ground_truth = any("ground_truth" in d for d in dicts)
        has_tools = any("tool_definitions" in d for d in dicts)

        eval_obj = await self._client.evals.create(
            name=eval_name,
            data_source_config={  # type: ignore[arg-type]  # pyright: ignore[reportArgumentType]
                "type": "custom",
                "item_schema": _build_item_schema(
                    has_context=has_context, has_ground_truth=has_ground_truth, has_tools=has_tools
                ),
                "include_sample_schema": True,
            },
            testing_criteria=_build_testing_criteria(  # type: ignore[arg-type]  # pyright: ignore[reportArgumentType]
                evaluators,
                self._model,
                include_data_mapping=True,
            ),
        )
```

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@chetantoshniwal chetantoshniwal changed the title from "Python: .NET feat(evals): add ground_truth support for similarity evaluator" to "Python: feat(evals): add ground_truth support for similarity evaluator" on Apr 14, 2026
@chetantoshniwal chetantoshniwal added this pull request to the merge queue Apr 14, 2026
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to no response for status checks Apr 15, 2026


5 participants