Hi, first of all, thank you for open-sourcing this project.
I’m currently trying to train the HOPE-mid model using 8×A100 GPUs, following the default mid_fsdp configuration provided in the repo.
However, I’m encountering a significant performance issue:
- Single-step time is around 5 minutes
- At this speed, finishing 250,000 steps would take roughly 2.4 years, which is impractical
So I wanted to ask a few questions:
- When you trained HOPE-mid, did you use the same configuration as the current YAML (e.g., mid_fsdp), or were there any important differences?
- What was your approximate step time per iteration during training?
- Are there any recommended optimizations (e.g., batch size, sequence length, FSDP settings, CPU offload, etc.) to improve training speed?
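For context, here is roughly how I'm measuring the per-step time (a minimal sketch; `step_fn` is a stand-in for one training step in my run, not code from the repo):

```python
import time

def mean_step_time(step_fn, n_steps=3):
    """Run step_fn n_steps times and return mean wall-clock seconds per step."""
    elapsed = []
    for _ in range(n_steps):
        t0 = time.perf_counter()
        step_fn()  # in my case: one forward/backward/optimizer step
        elapsed.append(time.perf_counter() - t0)
    return sum(elapsed) / len(elapsed)

# Dummy workload just to show usage; the real step_fn runs the HOPE-mid step.
seconds_per_step = mean_step_time(lambda: sum(range(100_000)))
print(f"{seconds_per_step:.6f} s/step")
```

With this measurement each step comes out at roughly 300 seconds on my setup, which is what the ~5 minutes figure above is based on.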
Any guidance would be greatly appreciated. Thanks again for sharing this excellent work!