There seems to be a bug in the evaluation code. Taking the first question in Video-MME as an example, the function `videomme_process_results(doc, results)` contains:
```python
pred_ans = extract_characters_regex(pred)
score_dict = compute_score(data_source="videomme", solution_str=pred.strip(), ground_truth=doc["answer"], extra_info=extra_info)
```
The content of `pred` looks like this:
```
<tool_call>crop_video {"video_path": "/root/.cache/huggingface/videomme/data/fFjv93ACGo8.mp4", "start_time": 0.0, "end_time": 54.0}</tool_call></tool_response><image_url><image_url><image_url><image_url><image_url><image_url><image_url><image_url><image_url><image_url><image_url><image_url><image_url><image_url><image_url><image_url><image_url><image_url><image_url><image_url><image_url><image_url><image_url><image_url><image_url><image_url><image_url><image_url><image_url><image_url><image_url><image_url><image_url><image_url><image_url><image_url><image_url><image_url><image_url><image_url><image_url><image_url><image_url><image_url><image_url><image_url><image_url><image_url><image_url><image_url><image_url><image_url><image_url><image_url></tool_response><answer>D</answer>
```
However, the `pred_ans` parsed by `extract_characters_regex` becomes `A`, because the function does not first extract the answer from the `<answer>...</answer>` tag and instead matches an option-like letter in the raw string (here, from the tool-call text preceding the tag).
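A possible fix (a sketch under assumed behavior; `extract_answer_tag` is a hypothetical helper, not code from the repository) is to pull out the content of the `<answer>...</answer>` tag before running the option-letter regex:

```python
import re

def extract_answer_tag(pred: str) -> str:
    """Return the content of the last <answer>...</answer> tag,
    falling back to the raw prediction if no tag is present."""
    matches = re.findall(r"<answer>(.*?)</answer>", pred, re.DOTALL)
    return matches[-1].strip() if matches else pred
```

For the prediction above, the option letter would then be parsed from `"D"` rather than from the surrounding tool-call text.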
In addition, inside `compute_score`, cases with `acc_score == 0.0` are re-evaluated by the judge model, while cases with `acc_score == 1.0` are accepted directly. The problem is that `acc_score == 1.0` does not necessarily mean the prediction was actually parsed correctly.
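One way to address this (a sketch; `needs_judge` and the option set are assumptions, not code from the repository) is to route a case to the judge model whenever the regex parse itself is untrustworthy, not only when `acc_score == 0.0`:

```python
def needs_judge(pred_ans: str, acc_score: float) -> bool:
    """Decide whether a case should be re-evaluated by the judge model.

    pred_ans  -- option letter returned by the regex-based extraction
    acc_score -- 1.0 if pred_ans matched the ground truth, else 0.0
    """
    # Only trust the exact-match score if a single valid option letter
    # was actually parsed; otherwise defer to the judge model.
    parsed_ok = pred_ans in {"A", "B", "C", "D"}
    return acc_score == 0.0 or not parsed_ok
```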
Therefore, this may affect the reliability of the evaluation results.