The evaluation code has serious issues #82

@nansun5410

I trained QwenVL with both the absolute_ee and delta_ee action representations. The evaluation results for the absolute version look far from reasonable: on the select_painting task, for example, the success rate (SR) is 0, and many episodes that look correct are still marked as failed. In addition, for select_chemistry_tube, even successful episodes sometimes receive an intention_score of 0 (this also happens with your pi checkpoints).

Have the authors observed such issues when running the evaluation?

{
  "episode_id": 1,
  "task": "select_chemistry_tube",
  "instruction": "Take out the FeCl3 solution",
  "success": true,
  "consumed_step": 73,
  "intention_score": 0,
  "progress_score": 1.0
}
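To make the inconsistency easier to reproduce, here is a minimal sketch that flags episodes marked successful but given an intention_score of 0. It assumes the per-episode results are available as a list of dicts with the fields shown above; the function name `find_inconsistent_episodes` is hypothetical and not part of the repository's evaluation code.

```python
def find_inconsistent_episodes(episodes):
    """Return episodes that are marked successful but have intention_score == 0.

    Assumes each episode is a dict with "success" and "intention_score" keys,
    as in the per-episode record above.
    """
    return [
        ep for ep in episodes
        if ep.get("success") is True and ep.get("intention_score") == 0
    ]


# The episode record quoted in this issue.
episodes = [
    {
        "episode_id": 1,
        "task": "select_chemistry_tube",
        "instruction": "Take out the FeCl3 solution",
        "success": True,
        "consumed_step": 73,
        "intention_score": 0,
        "progress_score": 1.0,
    },
]

flagged = find_inconsistent_episodes(episodes)
print([ep["episode_id"] for ep in flagged])  # prints [1]
```

Running this over the full per-episode log would show how many successful episodes are affected per task.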

{
  "select_poker": {
    "success_rate": 0.4,
    "intention_score": 0.92,
    "progress_score": 0.7333333333333333
  },
  "select_painting": {
    "success_rate": 0.0,
    "intention_score": 0.7,
    "progress_score": 0.0
  },
  "select_book": {
    "success_rate": 0.7,
    "intention_score": 0.94,
    "progress_score": 0.81
  },
  "select_chemistry_tube": {
    "success_rate": 0.52,
    "intention_score": 0.06,
    "progress_score": 0.74
  },
  "select_drink": {
    "success_rate": 0.32,
    "intention_score": 0.96,
    "progress_score": 0.48
  },
  "select_toy": {
    "success_rate": 0.6,
    "intention_score": 0.16,
    "progress_score": 0.75
  },
  "select_mahjong": {
    "success_rate": 0.4,
    "intention_score": 0.9,
    "progress_score": 0.43
  },
  "select_fruit": {
    "success_rate": 0.62,
    "intention_score": 0.98,
    "progress_score": 0.78
  },
  "insert_flower": {
    "success_rate": 0.22,
    "intention_score": 0.98,
    "progress_score": 0.59
  },
  "add_condiment": {
    "success_rate": 0.72,
    "intention_score": 1.0,
    "progress_score": 0.8133333333333335
  }
}

I've watched the rollout video, and the policy's behavior looks correct.
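The same check can be run on the per-task summary above. This sketch rests on my assumption (not confirmed by the authors) that a successful episode should also count as having the correct intention, so a task's mean intention_score should not fall below its success_rate. The function name `tasks_with_low_intention` is hypothetical.

```python
def tasks_with_low_intention(summary):
    """Return task names whose mean intention_score is below their success_rate.

    Assumption: success implies correct intention, so
    intention_score >= success_rate is expected for every task.
    """
    return sorted(
        task for task, metrics in summary.items()
        if metrics["intention_score"] < metrics["success_rate"]
    )


# Per-task summary copied from this issue (progress_score omitted for brevity).
summary = {
    "select_poker": {"success_rate": 0.4, "intention_score": 0.92},
    "select_painting": {"success_rate": 0.0, "intention_score": 0.7},
    "select_book": {"success_rate": 0.7, "intention_score": 0.94},
    "select_chemistry_tube": {"success_rate": 0.52, "intention_score": 0.06},
    "select_drink": {"success_rate": 0.32, "intention_score": 0.96},
    "select_toy": {"success_rate": 0.6, "intention_score": 0.16},
    "select_mahjong": {"success_rate": 0.4, "intention_score": 0.9},
    "select_fruit": {"success_rate": 0.62, "intention_score": 0.98},
    "insert_flower": {"success_rate": 0.22, "intention_score": 0.98},
    "add_condiment": {"success_rate": 0.72, "intention_score": 1.0},
}

print(tasks_with_low_intention(summary))
# prints ['select_chemistry_tube', 'select_toy']
```

Under that assumption, select_chemistry_tube and select_toy are the two tasks whose intention scores look inconsistent with their success rates.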