```yaml
name: Streaming Compliance Benchmark

on:
  push:
```
Bug: Unrestricted Push Triggers Cause Excess CI Runs
The workflow trigger `on: push` without branch filters will run on every push to any branch, including feature and pull-request branches. This causes unnecessary CI runs and resource consumption. Other workflows in the repository, such as `ci.yml` and `fireworks-tracing-tests.yml`, restrict pushes to specific branches (e.g., `main`) or use path filters to avoid this issue.
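A sketch of the suggested fix; the branch name mirrors the `main`-only convention the comment attributes to `ci.yml`, and the exact filters are a maintainer choice, not prescribed here:

```yaml
# Run only on pushes to main (and on pull requests), instead of every push
# to every branch. Adjust the branches/paths filters to match repo policy.
on:
  push:
    branches:
      - main
  pull_request:
```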
```python
    ],
    rollout_processor=SingleTurnRolloutProcessor(),
    aggregation_method="mean",
    passed_threshold=0.0,
```
Bug: Require threshold to enforce failing on zero score
`passed_threshold=0.0` allows the test to pass even when every compliance check fails (score = 0.0). For a streaming compliance benchmark that validates tool-call behavior, this threshold should be higher (likely 1.0) so the test passes only when the model correctly handles streaming tool calls.
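To see why the reviewer flags this, here is a minimal standalone sketch; the `passes` helper is hypothetical, mirroring a mean-score-versus-threshold check rather than eval_protocol's actual API:

```python
def passes(aggregated_score: float, passed_threshold: float) -> bool:
    # Hypothetical pass check: the aggregated (mean) score must meet the threshold.
    return aggregated_score >= passed_threshold

# With passed_threshold=0.0, a run where every compliance check fails
# (mean score 0.0) still counts as passing:
print(passes(0.0, 0.0))   # True
# Raising the threshold to 1.0 makes the benchmark fail unless all checks pass:
print(passes(0.0, 1.0))   # False
print(passes(1.0, 1.0))   # True
```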
| """Check whether the assistant retries tool calls when instructed to recover.""" | ||
|
|
||
| assistant_msg = row.last_assistant_message() | ||
| print(f"assistant_msg: {assistant_msg}") |
Bug: Debug print statement left in test code
A `print()` debug statement is left in the `test_streaming_tool_retry_behavior` function and will clutter test output during CI/CD runs. The line `print(f"assistant_msg: {assistant_msg}")` appears to be temporary debugging code that was accidentally committed and should be removed.
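If some diagnostic output is still wanted, one option is debug-level logging instead of `print`. This is a hedged sketch: the `FakeRow` stub and `check_retry` function are illustrative stand-ins, not the real `EvaluationRow` API or test body:

```python
import logging

logger = logging.getLogger(__name__)

class FakeRow:
    """Minimal stand-in for the evaluation row used in the test (hypothetical)."""
    def last_assistant_message(self):
        return {"role": "assistant", "content": "retrying tool call"}

def check_retry(row):
    assistant_msg = row.last_assistant_message()
    # Debug-level logging stays out of CI output unless explicitly enabled,
    # unlike a bare print():
    logger.debug("assistant_msg: %s", assistant_msg)
    return assistant_msg

msg = check_retry(FakeRow())
```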
name: Pull Request
about: Propose changes to the codebase
title: "Brief description of changes"
labels: ''
assignees: ''
Description
Please include a summary of the change and which issue is fixed or feature is implemented. Please also include relevant motivation and context. List any dependencies that are required for this change.
Fixes # (issue)
Implements # (issue)
Type of change
Please delete options that are not relevant.
How Has This Been Tested?
Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration.
Test Configuration:
Checklist:
Run the formatters and linters (`black .`, `isort .`, `flake8 .`)
Screenshots (if applicable)
If applicable, add screenshots to help showcase your changes.
Additional context
Add any other context about the PR here.
Note
Adds comprehensive streaming-compliance tests (structured JSON, tools, reasoning, consistency) and updates rollout/metadata to capture finish_reason, reasoning_content, and tool call counts.
- Adds `eval_protocol/benchmarks/test_glm_streaming_compliance.py` with streaming and non-streaming compliance tests.
- Rollout processor (`eval_protocol/pytest/default_single_turn_rollout_process.py`): captures `finish_reason`, serializes `reasoning_content`, and normalizes `tool_calls` (with fallback conversion); records `execution_metadata.finish_reason` and `execution_metadata.tool_call_count`.
- Models (`eval_protocol/models.py`): extends `ExecutionMetadata` with `finish_reason` and `tool_call_count` fields; `Message.reasoning_content` and `ChatCompletionContentPartTextParam` support utilized by tests.

Written by Cursor Bugbot for commit fac4f37. This will update automatically on new commits.
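The metadata extension described above can be pictured as a small dataclass. This sketch is based only on the field names in the summary; the class shape and defaults are assumptions, not the actual `eval_protocol/models.py` definition:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ExecutionMetadata:
    # finish_reason as reported by the provider, e.g. "stop" or "tool_calls"
    finish_reason: Optional[str] = None
    # number of tool calls observed in the assistant response
    tool_call_count: int = 0

meta = ExecutionMetadata(finish_reason="tool_calls", tool_call_count=2)
```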