Questions about the evaluation. #19

@xuzq23

Description

There seems to be a bug in the evaluation code. Taking the first question in Video-MME as an example, the relevant lines in videomme_process_results(doc, results) are:

pred_ans = extract_characters_regex(pred)
score_dict = compute_score(data_source="videomme", solution_str=pred.strip(), ground_truth=doc["answer"], extra_info=extra_info)

The content of pred for this question looks like this:

<tool_call>crop_video {"video_path": "/root/.cache/huggingface/videomme/data/fFjv93ACGo8.mp4", "start_time": 0.0, "end_time": 54.0}</tool_call></tool_response><image_url> (long run of repeated <image_url> tags elided) </tool_response><answer>D</answer>

However, the pred_ans returned by extract_characters_regex is A, not D, because the function does not first extract the answer from the <answer>...</answer> tag before matching option letters; it matches against the full string, including the tool-call text.
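A minimal sketch of the fix I have in mind (the helper name extract_answer_tag is hypothetical, not from the repo): pull the content of the last <answer>...</answer> tag before running the option-letter regex, falling back to the raw prediction when no tag is present.

```python
import re

def extract_answer_tag(pred: str) -> str:
    """Return the content of the last <answer>...</answer> tag, if any.

    Falls back to the raw prediction so the downstream option-letter
    regex still has something to match against.
    """
    matches = re.findall(r"<answer>(.*?)</answer>", pred, flags=re.DOTALL)
    return matches[-1].strip() if matches else pred

# Hypothetical usage, mirroring the example above:
pred = '<tool_call>crop_video {...}</tool_call><image_url><answer>D</answer>'
print(extract_answer_tag(pred))  # D
```

With this applied first, extract_characters_regex would only see "D" rather than the whole trajectory string.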

In addition, inside compute_score, cases with acc_score == 0.0 are re-evaluated by the judge model, while cases with acc_score == 1.0 are accepted directly. The problem is that acc_score == 1.0 does not necessarily mean the prediction was actually parsed correctly: a mis-parsed letter can still coincide with the ground truth by chance, and such cases are never sent to the judge.
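One way to make the routing safer is to track whether the rule-based parse actually succeeded, and only skip the judge when it did. A sketch under that assumption (should_use_judge and parsed_ok are illustrative names, not from the repo):

```python
def should_use_judge(parsed_ok: bool, acc_score: float) -> bool:
    """Decide whether a sample needs the judge model.

    A rule-based score is only trustworthy when the parse succeeded;
    an acc_score of 1.0 obtained from a mis-parsed prediction should
    not be accepted directly.
    """
    if not parsed_ok:
        return True          # parse failed: rule-based score is meaningless
    return acc_score == 0.0  # parse succeeded but answer wrong: double-check

# Parse succeeded and matched -> trust the rule-based score:
print(should_use_judge(True, 1.0))   # False
# Parse failed -> always defer to the judge, even if the score is 1.0:
print(should_use_judge(False, 1.0))  # True
```

This keeps the current fast path for clean matches while closing the hole where an accidental letter match is counted as correct.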

Taken together, these two issues may affect the reliability of the reported evaluation results.
