@@ -16,24 +16,30 @@ def build_python_grader_from_evaluation_test(test_fn) -> dict:
     Return an OpenAI Python grader spec from an Eval Protocol-style evaluation function.
 
     Assumptions:
-    - `test_fn` is the *core* evaluation function (not the @evaluation_test wrapper),
-      or an @evaluation_test-decorated function that carries _origin_func.
-      It should have a signature like:
+    - `test_fn` is either:
+        * the core evaluation function, or
+        * an @evaluation_test-decorated function that carries `_origin_func`.
+      Its effective signature looks like:
 
         def my_eval(row, **kwargs) -> EvaluateResult | float | EvaluationRow
 
-    - The function only relies on attributes that we provide on `EvaluationRowLike`
-      (you can extend that class as needed).
+    - The function treats `row` as an `EvaluationRow` and only relies on attributes
+      we provide in the duck-typed stand-in:
+        * row.ground_truth
+        * row.messages
+        * row.item (raw item dict)
+        * row.sample (raw sample dict)
 
-    - We map OpenAI's (sample, item) to a duck-typed `row`:
-      - item["reference_answer"] -> row.ground_truth
-      - sample["output_text"] -> appended as an assistant message
-      - raw dicts available as row.item / row.sample
+    - We map OpenAI's (sample, item) into that duck-typed `EvaluationRow` as follows:
+        * item["reference_answer"] -> row.ground_truth
+        * item["messages"] (if present) -> row.messages (normalized to Message-like objects)
+        * sample["output_text"] -> appended as the last assistant message in row.messages
+        * the original dicts are also available via row.item / row.sample
 
     - The function returns either:
-      - a numeric score, or
-      - an object/dict with a `score` field, or
-      - an EvaluationRow/EvaluateResult-like object with `.evaluation_result.score`.
+        * a numeric score, or
+        * an object/dict with a `score` field, or
+        * an EvaluationRow/EvaluateResult-like object with `.evaluation_result.score`.
     """
 
     # If the user passed an @evaluation_test wrapper, try to recover the original function
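The (sample, item) mapping and the score normalization that the revised docstring describes can be sketched roughly as below. This is a hedged illustration, not the module's actual implementation: the helper names `build_row` and `coerce_score` are hypothetical, and a plain `SimpleNamespace` stands in for the duck-typed `EvaluationRow`.

```python
from types import SimpleNamespace


def build_row(sample: dict, item: dict) -> SimpleNamespace:
    """Hypothetical sketch: build a duck-typed EvaluationRow stand-in
    from OpenAI's (sample, item) pair, per the docstring's mapping."""
    messages = list(item.get("messages", []))
    # sample["output_text"] is appended as the last assistant message.
    if "output_text" in sample:
        messages.append({"role": "assistant", "content": sample["output_text"]})
    return SimpleNamespace(
        ground_truth=item.get("reference_answer"),
        messages=messages,
        item=item,      # raw item dict, kept accessible as-is
        sample=sample,  # raw sample dict, kept accessible as-is
    )


def coerce_score(result) -> float:
    """Hypothetical sketch: normalize the evaluation function's return
    value (number, dict/object with `score`, or EvaluationRow-like) to a float."""
    if isinstance(result, (int, float)):
        return float(result)
    if isinstance(result, dict) and "score" in result:
        return float(result["score"])
    score = getattr(result, "score", None)
    if score is not None:
        return float(score)
    # EvaluationRow-like: the score lives under .evaluation_result.score
    return float(result.evaluation_result.score)
```

Keeping the stand-in duck-typed (rather than importing the real `EvaluationRow`) lets the generated grader spec stay self-contained when serialized for OpenAI's sandbox.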