
Python: feat(evals): add ground_truth support for similarity evaluator #5234

Open
chetantoshniwal wants to merge 3 commits into main from feature/similarity-ground-truth

Conversation

@chetantoshniwal
Contributor

  • Include expected_output as ground_truth in Foundry JSONL dataset rows
  • Add ground_truth to item schema and data mapping for similarity evaluator
  • Add expected_output parameter to evaluate_workflow
  • Add similarity Pattern 3 to evaluate_agent and evaluate_workflow samples
  • Add tests for ground_truth in dataset, schema, and evaluate_workflow
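For illustration, the expected_output → ground_truth mapping described above would produce Foundry JSONL rows along these lines (field names are taken from the diff; the concrete values are hypothetical):

```python
import json

# Hypothetical evaluation item: expected_output is the reference answer
# the similarity evaluator compares the response against.
item = {
    "query": "What is the capital of France?",
    "response": "The capital of France is Paris.",
    "expected_output": "Paris is the capital of France.",
}

# Mirror of the dataset-building step: expected_output, when present,
# is emitted into the JSONL row under the Foundry field name "ground_truth".
row = {"query": item["query"], "response": item["response"]}
if item.get("expected_output"):
    row["ground_truth"] = item["expected_output"]

print(json.dumps(row))
```

Rows without an expected_output simply omit the ground_truth key, which is why the item schema is only extended when at least one row carries it.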

Motivation and Context

#5135

Contribution Checklist

  • The code builds clean without any errors or warnings
  • The PR follows the Contribution Guidelines
  • All unit tests pass, and I have added new tests where possible
  • This is not a breaking change, so the title carries no "[BREAKING]" prefix
Copilot AI review requested due to automatic review settings April 13, 2026 22:07
@chetantoshniwal chetantoshniwal changed the title from "feat(evals): add ground_truth support for similarity evaluator" to ".NET feat(evals): add ground_truth support for similarity evaluator" on Apr 13, 2026
@github-actions github-actions bot changed the title from ".NET feat(evals): add ground_truth support for similarity evaluator" to "Python: .NET feat(evals): add ground_truth support for similarity evaluator" on Apr 13, 2026
@moonbox3
Contributor

moonbox3 commented Apr 13, 2026

Python Test Coverage

Python Test Coverage Report

| File | Stmts | Miss | Cover | Missing |
| --- | ---: | ---: | ---: | --- |
| packages/core/agent_framework/_evaluation.py | 655 | 72 | 89% | 164, 172, 485, 487, 615, 618, 697–699, 704, 741–744, 800–801, 804, 810–812, 816, 849–851, 907, 943, 955–957, 962, 986–991, 1084, 1162–1163, 1165–1169, 1175, 1214, 1562, 1564, 1572, 1582, 1586, 1631, 1649–1650, 1728, 1730, 1736, 1744, 1759, 1797, 1803–1807, 1839, 1870–1871, 1873, 1898–1899, 1904 |
| packages/foundry/agent_framework_foundry/_foundry_evals.py | 249 | 4 | 98% | 444, 449, 629, 694 |
| TOTAL | 27220 | 3183 | 88% | |

Python Unit Test Overview

| Tests | Skipped | Failures | Errors | Time |
| ---: | ---: | ---: | ---: | --- |
| 5449 | 20 💤 | 0 ❌ | 0 🔥 | 1m 23s ⏱️ |

Contributor

Copilot AI left a comment


Pull request overview

Adds “ground truth” support to Foundry-backed similarity evaluation by mapping EvalItem.expected_output into the Foundry JSONL field ground_truth, and exposing expected_output on evaluate_workflow() to enable reference-based evaluators (e.g., SIMILARITY) for workflows.

Changes:

  • Extend Foundry JSONL item schema + data mapping to include ground_truth for similarity evaluation.
  • Add expected_output parameter to evaluate_workflow() and stamp it onto overall workflow EvalItems when running via queries.
  • Update Foundry eval samples to include a similarity/ground-truth pattern; add tests covering schema/mapping/dataset output.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

Summary per file:

| File | Description |
| --- | --- |
| python/packages/foundry/agent_framework_foundry/_foundry_evals.py | Adds ground-truth evaluator mapping + schema support; stamps expected_output into JSONL ground_truth. |
| python/packages/core/agent_framework/_evaluation.py | Adds expected_output to evaluate_workflow() and stamps it onto overall items in the run+evaluate path. |
| python/packages/foundry/tests/test_foundry_evals.py | Adds tests for ground-truth mapping/schema and workflow expected_output stamping/validation. |
| python/samples/05-end-to-end/evaluation/foundry_evals/evaluate_workflow_sample.py | Adds Pattern 3 sample demonstrating similarity evaluation with ground truth. |
| python/samples/05-end-to-end/evaluation/foundry_evals/evaluate_agent_sample.py | Adds Pattern 3 similarity sample with expected_output. |
Comments suppressed due to low confidence (1)

python/packages/foundry/agent_framework_foundry/_foundry_evals.py:717

  • Similarity (and any evaluator in _GROUND_TRUTH_EVALUATORS) always gets a data_mapping for ground_truth, but there’s no local validation that all items provide expected_output/ground_truth. If a caller requests similarity without expected_output, this will still create an eval definition and fail provider-side with a less actionable error. Add a preflight check here to raise a clear ValueError when a ground-truth-required evaluator is selected and any item is missing expected_output (or alternatively filter out those evaluators similarly to _filter_tool_evaluators).
```python
        # Resolve evaluators with auto-detection
        resolved = _resolve_default_evaluators(self._evaluators, items=items)
        # Filter tool evaluators if items don't have tools
        resolved = _filter_tool_evaluators(resolved, items)

        # Standard JSONL dataset path
        return await self._evaluate_via_dataset(items, resolved, eval_name)

    # -- Internal evaluation paths --

    async def _evaluate_via_dataset(
        self,
        items: Sequence[EvalItem],
        evaluators: list[str],
        eval_name: str,
    ) -> EvalResults:
        """Evaluate using JSONL dataset upload path."""
        dicts: list[dict[str, Any]] = []
        for item in items:
            # Build JSONL dict directly from split_messages + converter
            # to avoid splitting the conversation twice.
            effective_split = item.split_strategy or self._conversation_split
            query_msgs, response_msgs = item.split_messages(effective_split)

            query_text = " ".join(m.text for m in query_msgs if m.role == "user" and m.text).strip()
            response_text = " ".join(m.text for m in response_msgs if m.role == "assistant" and m.text).strip()

            d: dict[str, Any] = {
                "query": query_text,
                "response": response_text,
                "query_messages": AgentEvalConverter.convert_messages(query_msgs),
                "response_messages": AgentEvalConverter.convert_messages(response_msgs),
            }
            if item.tools:
                d["tool_definitions"] = [
                    {"name": t.name, "description": t.description, "parameters": t.parameters()} for t in item.tools
                ]
            if item.context:
                d["context"] = item.context
            if item.expected_output:
                d["ground_truth"] = item.expected_output
            dicts.append(d)

        has_context = any("context" in d for d in dicts)
        has_ground_truth = any("ground_truth" in d for d in dicts)
        has_tools = any("tool_definitions" in d for d in dicts)

        eval_obj = await self._client.evals.create(
            name=eval_name,
            data_source_config={  # type: ignore[arg-type]  # pyright: ignore[reportArgumentType]
                "type": "custom",
                "item_schema": _build_item_schema(
                    has_context=has_context, has_ground_truth=has_ground_truth, has_tools=has_tools
                ),
                "include_sample_schema": True,
            },
            testing_criteria=_build_testing_criteria(  # type: ignore[arg-type]  # pyright: ignore[reportArgumentType]
                evaluators,
                self._model,
                include_data_mapping=True,
            ),
        )
```

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@chetantoshniwal chetantoshniwal changed the title from "Python: .NET feat(evals): add ground_truth support for similarity evaluator" to "Python: feat(evals): add ground_truth support for similarity evaluator" on Apr 14, 2026
@chetantoshniwal chetantoshniwal added this pull request to the merge queue Apr 14, 2026
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to no response for status checks Apr 15, 2026


5 participants