
[Perf] Training speed issue for HOPE-mid (very slow step time with 8×A100) #13

@Calvert0921

Description


Hi, first of all, thank you for open-sourcing this project.

I’m currently trying to train the HOPE-mid model on 8×A100 GPUs, following the default mid_fsdp configuration provided in the repo.

However, I’m encountering a significant performance issue:

  • A single step takes roughly 5 minutes

  • At this rate, completing 250,000 steps would take an impractically long time
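For concreteness, here is the back-of-the-envelope estimate behind that concern, assuming the ~5 minutes/step figure holds for the whole run:

```python
# Rough total-time estimate at the observed step time.
# The 5 min/step and 250,000-step figures are taken from the report above.
STEP_TIME_MIN = 5
TOTAL_STEPS = 250_000

total_minutes = STEP_TIME_MIN * TOTAL_STEPS
total_days = total_minutes / 60 / 24
print(f"{total_days:.0f} days (~{total_days / 365:.1f} years)")  # → 868 days (~2.4 years)
```

In other words, at the current step time the full schedule would take well over two years of wall-clock time, which is why I suspect something in my setup is off.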

So I wanted to ask a few questions:

  1. When you trained HOPE-mid, did you use the same configuration as the current YAML (e.g., mid_fsdp), or were there any important differences?
  2. What was your approximate step time per iteration during training?
  3. Are there any recommended optimizations (e.g., batch size, sequence length, FSDP settings, CPU offload, etc.) to improve training speed?

Any guidance would be greatly appreciated. Thanks again for sharing this excellent work!
