Hi, thanks for the excellent work and for releasing the codebase!
I followed the README instructions to set up the evaluation environment for RobotWin with Motus, and tested the rollout trajectories on the task “put the object in the cabinet.” However, I observed a noticeable discrepancy between the reported and reproduced performance.
Test results:
- Reproduced success rate: ~45%
- Reported in the paper (supplementary):
  - ~88% in the clean setting
  - ~71% in the randomized setting
I would like to ask:
- Are there any additional evaluation details (e.g., environment version, random seed handling, number of rollouts, or success criteria) that might affect this result?
- Were the reported numbers averaged over multiple runs or random seeds?
- Have you re-run this task multiple times in a fresh environment to confirm the reported numbers are reproducible?
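On the rollout-count question: a minimal sketch (assuming a standard binomial model; the rollout count of 20 below is hypothetical, not from the paper) of how I am estimating whether the gap could be sampling noise, using a Wilson score interval:

```python
import math

def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """95% Wilson score confidence interval for a binomial success rate."""
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2))
    return center - half, center + half

# Hypothetical: ~45% observed over 20 rollouts
lo, hi = wilson_interval(9, 20)
print(f"95% CI: [{lo:.2f}, {hi:.2f}]")  # → 95% CI: [0.26, 0.66]
```

Even under this rough model the upper bound stays well below ~88%, which is why I suspect a setup difference rather than evaluation variance alone.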
I would really appreciate any guidance on what might cause this gap, or pointers to specific evaluation settings to double-check. I’m happy to provide more details (logs, configs, seeds) if helpful.
Thanks again for sharing such a great project!