There seems to be a bug in the evaluation code. Taking the first question in Video-MME as an example, the function `videomme_process_results(doc, results)` contains:
```python
pred_ans = extract_characters_regex(pred)
score_dict = compute_score(data_source="videomme", solution_str=pred.strip(), ground_truth=doc["answer"], extra_info=extra_info)
```
The content of `pred` looks like this:
```
<tool_call>crop_video {"video_path": "/root/.cache/huggingface/videomme/data/fFjv93ACGo8.mp4", "start_time": 0.0, "end_time": 54.0}</tool_call></tool_response><image_url><image_url><image_url><image_url><image_url><image_url><image_url><image_url><image_url><image_url><image_url><image_url><image_url><image_url><image_url><image_url><image_url><image_url><image_url><image_url><image_url><image_url><image_url><image_url><image_url><image_url><image_url><image_url><image_url><image_url><image_url><image_url><image_url><image_url><image_url><image_url><image_url><image_url><image_url><image_url><image_url><image_url><image_url><image_url><image_url><image_url><image_url><image_url><image_url><image_url><image_url><image_url><image_url><image_url></tool_response><answer>D</answer>
```
However, the `pred_ans` parsed by `extract_characters_regex` becomes `A`, because the function does not first extract the answer from the `<answer>...</answer>` tag and instead matches an option-like letter in the raw string (here, from the tool-call text preceding the tag).
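A possible fix (a sketch under assumed behavior; `extract_answer_tag` is a hypothetical helper, not code from the repository) is to pull out the content of the `<answer>...</answer>` tag before running the option-letter regex:

```python
import re

def extract_answer_tag(pred: str) -> str:
    """Return the content of the last <answer>...</answer> tag,
    falling back to the raw prediction if no tag is present."""
    matches = re.findall(r"<answer>(.*?)</answer>", pred, re.DOTALL)
    return matches[-1].strip() if matches else pred
```

For the prediction above, the option letter would then be parsed from `"D"` rather than from the surrounding tool-call text.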
In addition, inside `compute_score`, cases with `acc_score == 0.0` are re-evaluated by the judge model, while cases with `acc_score == 1.0` are accepted directly. The problem is that `acc_score == 1.0` does not necessarily mean the prediction was actually parsed correctly.
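One way to address this (a sketch; `needs_judge` and the option set are assumptions, not code from the repository) is to route a case to the judge model whenever the regex parse itself is untrustworthy, not only when `acc_score == 0.0`:

```python
def needs_judge(pred_ans: str, acc_score: float) -> bool:
    """Decide whether a case should be re-evaluated by the judge model.

    pred_ans  -- option letter returned by the regex-based extraction
    acc_score -- 1.0 if pred_ans matched the ground truth, else 0.0
    """
    # Only trust the exact-match score if a single valid option letter
    # was actually parsed; otherwise defer to the judge model.
    parsed_ok = pred_ans in {"A", "B", "C", "D"}
    return acc_score == 0.0 or not parsed_ok
```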
Therefore, this may affect the reliability of the evaluation results.