Hi, thanks for the excellent work and for releasing the codebase!
I followed the README instructions to set up the evaluation environment for RobotWin with Motus, and tested the rollout trajectories on the task “put the object in the cabinet.” However, I observed a noticeable discrepancy between the reported and reproduced performance.
Test results:
- Reproduced success rate: ~45%
- Reported in the paper (supplementary):
  - ~88% in the clean setting
  - ~71% in the randomized setting
I would like to ask:
- Are there any additional evaluation details (e.g., environment version, random seed handling, number of rollouts, or success criteria) that might affect this result?
- Were the reported numbers averaged over multiple runs or random seeds?
- Have you re-run this task multiple times in a fresh environment to confirm the reported numbers are reproducible?
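On the rollout-count question: a minimal sketch (assuming a standard binomial model; the rollout count of 20 below is hypothetical, not from the paper) of how I am estimating whether the gap could be sampling noise, using a Wilson score interval:

```python
import math

def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """95% Wilson score confidence interval for a binomial success rate."""
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2))
    return center - half, center + half

# Hypothetical: ~45% observed over 20 rollouts
lo, hi = wilson_interval(9, 20)
print(f"95% CI: [{lo:.2f}, {hi:.2f}]")  # → 95% CI: [0.26, 0.66]
```

Even under this rough model the upper bound stays well below ~88%, which is why I suspect a setup difference rather than evaluation variance alone.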
I would really appreciate any guidance on what might cause this gap, or pointers to specific evaluation settings to double-check. I’m happy to provide more details (logs, configs, seeds) if helpful.
Thanks again for sharing such a great project!