Conversation
Bug: Inconsistent Exception Handling Across Evaluation Modes
The "all" mode (batch mode) lacks the exception handling added to pointwise and groupwise modes. While pointwise and groupwise modes now capture non-assert exceptions and convert them to evaluation results with score=0 when EP_CAPTURE_EVAL_EXCEPTIONS=1, the "all" mode has no try-except block around execute_pytest, causing non-assert exceptions to propagate and fail the entire test. This creates inconsistent exception handling behavior across the three evaluation modes.
eval_protocol/pytest/evaluation_test.py#L567-L600
python-sdk/eval_protocol/pytest/evaluation_test.py
Lines 567 to 600 in d5ea771
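One way to close the gap described in this comment is to route every mode through the same guard. The sketch below is illustrative only: `guarded_call` and `run_mode` are hypothetical names, not the code in evaluation_test.py, and it assumes that only the literal value `1` for EP_CAPTURE_EVAL_EXCEPTIONS enables capture.

```python
import os
from typing import Any, Callable, Dict, List


def guarded_call(fn: Callable[..., Any], *args: Any) -> Dict[str, Any]:
    """Hypothetical guard shared by all modes (stands in for a try/except around execute_pytest)."""
    try:
        return {"score": fn(*args), "error": None}
    except AssertionError:
        # Assertion failures always propagate so pytest still reports them.
        raise
    except Exception as exc:
        # Assumption: only the literal value "1" enables capture.
        if os.environ.get("EP_CAPTURE_EVAL_EXCEPTIONS") != "1":
            raise
        return {"score": 0.0, "error": f"{type(exc).__name__}: {exc}"}


def run_mode(mode: str, fn: Callable[..., Any], rows: List[Any]) -> List[Dict[str, Any]]:
    """Dispatch on mode; every branch goes through guarded_call."""
    if mode == "pointwise":
        # One evaluator call per row.
        return [guarded_call(fn, row) for row in rows]
    if mode in ("groupwise", "all"):
        # Simplified: one evaluator call for the whole group or batch.
        return [guarded_call(fn, rows)]
    raise ValueError(f"unknown mode: {mode}")
```

Because the guard lives in one place, "all" mode cannot drift from the other two: a non-assert exception is either captured in every mode or re-raised in every mode.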
Bug: Inconsistent Exception Handling Across Evaluation Modes
The "all" mode (batch mode) lacks the exception handling logic added to "pointwise" and "groupwise" modes. When execute_pytest is called in "all" mode, AssertionError and other exceptions aren't handled consistently with the other modes: there is no explicit AssertionError re-raise and no conditional exception capture based on EP_CAPTURE_EVAL_EXCEPTIONS. As a result, assertion failures and other exceptions are treated differently depending on the evaluation mode.
eval_protocol/pytest/evaluation_test.py#L594-L615
python-sdk/eval_protocol/pytest/evaluation_test.py
Lines 594 to 615 in decf061
        raise ValueError("Neither processed_row nor processed_dataset was provided")
# Default: raise exceptions unless explicitly disabled
else:
    raise
Bug: Assertion Errors Must Always Raise
AssertionError is caught and converted to error results when EP_RAISE_EVAL_EXCEPTIONS is "false", breaking pytest assertions and threshold checks. The PR title "raise on assert failure" and existing tests (e.g., test_assertion_error_no_new_rollouts.py) expect AssertionError to always be raised. The exception handler should check for AssertionError and re-raise it unconditionally before checking the environment variable.
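The handler ordering this comment asks for can be sketched as follows. This is a minimal illustration, not the PR's code: `run_with_policy` and `check_threshold` are hypothetical, and it assumes that raising is the default and that only the string `false` (case-insensitive) for EP_RAISE_EVAL_EXCEPTIONS disables it.

```python
import os
from typing import Callable, List, Optional, Tuple


def check_threshold(scores: List[float], threshold: float) -> float:
    """Toy threshold check that fails via a plain assert, as a pytest test would."""
    mean = sum(scores) / len(scores)
    assert mean >= threshold, f"mean score {mean:.2f} is below threshold {threshold:.2f}"
    return mean


def run_with_policy(fn: Callable[[], float]) -> Tuple[Optional[float], Optional[str]]:
    """Return (score, error). AssertionError is re-raised before the env var is consulted."""
    try:
        return fn(), None
    except AssertionError:
        # Unconditional: an assertion failure is never converted into an error result.
        raise
    except Exception as exc:
        # Assumption: raising is the default; only "false" turns it off.
        if os.environ.get("EP_RAISE_EVAL_EXCEPTIONS", "true").lower() != "false":
            raise
        return 0.0, f"{type(exc).__name__}: {exc}"
```

With EP_RAISE_EVAL_EXCEPTIONS set to `false`, a ZeroDivisionError from an empty score list is captured as a zero-score error result, while a failed threshold check still raises AssertionError and fails the test.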
Note
Introduce execute_pytest_with_exception_handling and wire it into evaluation_test to standardize evaluator error handling (configurable via EP_RAISE_EVAL_EXCEPTIONS), with comprehensive tests.
Update eval_protocol/pytest/evaluation_test.py with calls to execute_pytest_with_exception_handling for pointwise, groupwise, and all-mode execution.
Add execute_pytest_with_exception_handling in eval_protocol/pytest/execution.py to wrap execute_pytest and, when EP_RAISE_EVAL_EXCEPTIONS is false, convert exceptions into evaluation_result (score=0, invalid) and Status.error on affected EvaluationRow(s); otherwise re-raise.
Add the required imports (EvaluateResult, Status, Any, os).
Add tests/pytest/test_pytest_evaluator_error_handling.py covering pointwise/groupwise paths, multiple runs, custom/empty errors, input_rows, status codes, and reason formatting (with fixture forcing exception capture).
Written by Cursor Bugbot for commit 5fce8ab.
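The conversion described in this note (an exception becomes a zero-score, invalid result plus an error status on each affected row) might look roughly like the sketch below. The dataclasses are simplified stand-ins, not the real EvaluationRow, EvaluateResult, or Status from eval_protocol, and the field names and the reason format are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FakeResult:
    """Stand-in for EvaluateResult (field names are assumed)."""
    score: float
    is_score_valid: bool
    reason: str


@dataclass
class FakeRow:
    """Stand-in for EvaluationRow; status is a plain string instead of a Status object."""
    row_id: str
    evaluation_result: Optional[FakeResult] = None
    status: str = "ok"


def format_reason(exc: BaseException) -> str:
    """Assumed format: 'TypeName: message', or just the type name for an empty message."""
    message = str(exc)
    name = type(exc).__name__
    return f"{name}: {message}" if message else name


def mark_rows_errored(rows: List[FakeRow], exc: BaseException) -> List[FakeRow]:
    """Attach a zero-score, invalid result and an error status to every affected row."""
    reason = format_reason(exc)
    for row in rows:
        row.evaluation_result = FakeResult(score=0.0, is_score_valid=False, reason=reason)
        row.status = "error"
    return rows
```

In pointwise mode the affected list would hold a single row; in groupwise or all mode it would hold every row passed to the failing evaluator call.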