Hi, first of all, thank you for open-sourcing this project.
I’m currently trying to train the HOPE-mid model using 8×A100 GPUs, following the default mid_fsdp configuration provided in the repo.
However, I’m encountering a significant performance issue:
- Single-step time is around 5 minutes
- At this speed, finishing 250,000 steps would take roughly 2.4 years, which is impractical
So I wanted to ask a few questions:
- When you trained HOPE-mid, did you use the same configuration as the current YAML (e.g., mid_fsdp), or were there any important differences?
- What was your approximate step time per iteration during training?
- Are there any recommended optimizations (e.g., batch size, sequence length, FSDP settings, CPU offload, etc.) to improve training speed?
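For context, here is roughly how I'm measuring the per-step time (a minimal sketch; `step_fn` is a stand-in for one training step in my run, not code from the repo):

```python
import time

def mean_step_time(step_fn, n_steps=3):
    """Run step_fn n_steps times and return mean wall-clock seconds per step."""
    elapsed = []
    for _ in range(n_steps):
        t0 = time.perf_counter()
        step_fn()  # in my case: one forward/backward/optimizer step
        elapsed.append(time.perf_counter() - t0)
    return sum(elapsed) / len(elapsed)

# Dummy workload just to show usage; the real step_fn runs the HOPE-mid step.
seconds_per_step = mean_step_time(lambda: sum(range(100_000)))
print(f"{seconds_per_step:.6f} s/step")
```

With this measurement each step comes out at roughly 300 seconds on my setup, which is what the ~5 minutes figure above is based on.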
Any guidance would be greatly appreciated. Thanks again for sharing this excellent work!