```yaml
name: Streaming Compliance Benchmark

on:
  push:
```
Bug: Unrestricted Push Triggers Cause Excess CI Runs
The workflow trigger `on: push` without branch filters will run on every push to any branch, including feature and pull-request branches. This causes unnecessary CI runs and resource consumption. Other workflows in the repository, such as `ci.yml` and `fireworks-tracing-tests.yml`, restrict pushes to specific branches (e.g., `main`) or use path filters to avoid this issue.
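A sketch of the suggested fix; the branch name mirrors the `main`-only convention the comment attributes to `ci.yml`, and the exact filters are a maintainer choice, not prescribed here:

```yaml
# Run only on pushes to main (and on pull requests), instead of every push
# to every branch. Adjust the branches/paths filters to match repo policy.
on:
  push:
    branches:
      - main
  pull_request:
```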
```python
    ],
    rollout_processor=SingleTurnRolloutProcessor(),
    aggregation_method="mean",
    passed_threshold=0.0,
```
Bug: Require threshold to enforce failing on zero score
`passed_threshold=0.0` allows the test to pass even when every compliance check fails (score = 0.0). For a streaming compliance benchmark that validates tool-call behavior, this threshold should be higher (likely 1.0) so the test passes only when the model correctly handles streaming tool calls.
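To see why the reviewer flags this, here is a minimal standalone sketch; the `passes` helper is hypothetical, mirroring a mean-score-versus-threshold check rather than eval_protocol's actual API:

```python
def passes(aggregated_score: float, passed_threshold: float) -> bool:
    # Hypothetical pass check: the aggregated (mean) score must meet the threshold.
    return aggregated_score >= passed_threshold

# With passed_threshold=0.0, a run where every compliance check fails
# (mean score 0.0) still counts as passing:
print(passes(0.0, 0.0))   # True
# Raising the threshold to 1.0 makes the benchmark fail unless all checks pass:
print(passes(0.0, 1.0))   # False
print(passes(1.0, 1.0))   # True
```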
| """Check whether the assistant retries tool calls when instructed to recover.""" | ||
|
|
||
| assistant_msg = row.last_assistant_message() | ||
| print(f"assistant_msg: {assistant_msg}") |
Bug: Debug print statement left in test code
A `print()` debug statement is left in the `test_streaming_tool_retry_behavior` function and will clutter test output during CI/CD runs. The line `print(f"assistant_msg: {assistant_msg}")` appears to be temporary debugging code that was accidentally committed and should be removed.
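If some diagnostic output is still wanted, one option is debug-level logging instead of `print`. This is a hedged sketch: the `FakeRow` stub and `check_retry` function are illustrative stand-ins, not the real `EvaluationRow` API or test body:

```python
import logging

logger = logging.getLogger(__name__)

class FakeRow:
    """Minimal stand-in for the evaluation row used in the test (hypothetical)."""
    def last_assistant_message(self):
        return {"role": "assistant", "content": "retrying tool call"}

def check_retry(row):
    assistant_msg = row.last_assistant_message()
    # Debug-level logging stays out of CI output unless explicitly enabled,
    # unlike a bare print():
    logger.debug("assistant_msg: %s", assistant_msg)
    return assistant_msg

msg = check_retry(FakeRow())
```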
name: Pull Request
about: Propose changes to the codebase
title: "Brief description of changes"
labels: ''
assignees: ''
Description
Please include a summary of the change and which issue is fixed or feature is implemented. Please also include relevant motivation and context. List any dependencies that are required for this change.
Fixes # (issue)
Implements # (issue)
Type of change
Please delete options that are not relevant.
How Has This Been Tested?
Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration.
Test Configuration:
Checklist:
Run the formatters and linters (`black .`, `isort .`, `flake8 .`)
Screenshots (if applicable)
If applicable, add screenshots to help showcase your changes.
Additional context
Add any other context about the PR here.
Note
Adds comprehensive streaming-compliance tests (structured JSON, tools, reasoning, consistency) and updates rollout/metadata to capture finish_reason, reasoning_content, and tool call counts.
- Adds `eval_protocol/benchmarks/test_glm_streaming_compliance.py` with streaming and non-streaming compliance tests.
- Rollout processor (`eval_protocol/pytest/default_single_turn_rollout_process.py`): captures `finish_reason`, serializes `reasoning_content`, and normalizes `tool_calls` (with fallback conversion); records `execution_metadata.finish_reason` and `execution_metadata.tool_call_count`.
- Models (`eval_protocol/models.py`): extends `ExecutionMetadata` with `finish_reason` and `tool_call_count` fields; `Message.reasoning_content` and `ChatCompletionContentPartTextParam` support utilized by tests.

Written by Cursor Bugbot for commit fac4f37. This will update automatically on new commits.
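The metadata extension described above can be pictured as a small dataclass. This sketch is based only on the field names in the summary; the class shape and defaults are assumptions, not the actual `eval_protocol/models.py` definition:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ExecutionMetadata:
    # finish_reason as reported by the provider, e.g. "stop" or "tool_calls"
    finish_reason: Optional[str] = None
    # number of tool calls observed in the assistant response
    tool_call_count: int = 0

meta = ExecutionMetadata(finish_reason="tool_calls", tool_call_count=2)
```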