fix: display eval status per metric type #305
Open
stefanoamorelli wants to merge 1 commit into google:main from
Conversation
Previously, when viewing eval results, all messages in an invocation showed the same pass/fail status. If `response_match_score` failed, tool calls would incorrectly show ❌ even when `tool_trajectory_avg_score` passed. Now, tool-related events (`functionCall`, `functionResponse`) display the `tool_trajectory_avg_score` result, while text responses display the `response_match_score` result. This gives accurate per-metric feedback in the eval UI. Fixes google#187
Force-pushed from 1a0a1bb to 4a8d456
When viewing eval results, if `response_match_score` failed but `tool_trajectory_avg_score` passed, all messages in the invocation (including tool calls) incorrectly showed ❌. This is confusing, because the tool trajectory is actually correct.

To address this, this PR introduces an `isToolRelatedEvent()` helper to identify events involving tool calls. The `addEvalCaseResultToEvents()` method now assigns the metric based on event type:

- tool-related events (`functionCall`, `functionResponse`) → `tool_trajectory_avg_score`
- text responses → `response_match_score`

This hardcodes the mapping above: it works for the two current default metrics, but it does not automatically support custom or future metrics.
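A minimal sketch of how the helper and the per-event assignment could look. `isToolRelatedEvent()` and `addEvalCaseResultToEvents()` are the names used in this PR, but the `AgentEvent` shape, the `EvalMetricResult` type, and the field names below are assumptions for illustration, not the actual adk-web source:

```typescript
// Assumed event shape; real adk-web events may differ.
interface EventPart {
  functionCall?: object;
  functionResponse?: object;
  text?: string;
}

interface AgentEvent {
  content?: {parts?: EventPart[]};
  evalStatus?: 'passed' | 'failed';
}

// Assumed per-metric result shape.
interface EvalMetricResult {
  metricName: string;  // 'tool_trajectory_avg_score' | 'response_match_score'
  evalStatus: 'passed' | 'failed';
}

/** True if any part of the event is a function call or function response. */
function isToolRelatedEvent(event: AgentEvent): boolean {
  return (event.content?.parts ?? []).some(
      (part) => part.functionCall !== undefined ||
          part.functionResponse !== undefined);
}

/** Attaches the metric result matching each event's type (hardcoded mapping). */
function addEvalCaseResultToEvents(
    events: AgentEvent[], metricResults: EvalMetricResult[]): void {
  for (const event of events) {
    const metricName = isToolRelatedEvent(event) ?
        'tool_trajectory_avg_score' :
        'response_match_score';
    const result = metricResults.find((r) => r.metricName === metricName);
    if (result) {
      event.evalStatus = result.evalStatus;
    }
  }
}
```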
Fixes #187 with minimal, frontend-only changes. Long term, I would recommend a backend API change for a more scalable solution, such as including metadata on each metric indicating which event types it evaluates (for example, something along the lines of `appliesTo: 'tool' | 'response'`).
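For illustration only, that metadata could look like the following (a hypothetical shape, not an existing API; it reuses `AgentEvent` and `isToolRelatedEvent()` from the sketch above):

```typescript
// Hypothetical metadata the backend could attach to each metric;
// no such field exists in the current API.
interface EvalMetricInfo {
  metricName: string;
  appliesTo: 'tool' | 'response';
}

// The frontend could then resolve the metric for an event generically,
// instead of hardcoding the two default metric names.
function metricNameForEvent(
    event: AgentEvent, metrics: EvalMetricInfo[]): string | undefined {
  const kind = isToolRelatedEvent(event) ? 'tool' : 'response';
  return metrics.find((m) => m.appliesTo === kind)?.metricName;
}
```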