## 🎯 Goal (What & Why)
During the inference forward pass, Fast-LLM's peak memory footprint is currently approximately 2.5× higher than Hugging Face Transformers' for the same model and batch size.
🧪 Qwen2 1.5B, batch size 16×4096, Flash Attention, bfloat16, H100 GPU
| Test | Peak GPU Memory Usage (MB) |
|---|---|
| HF (no loss calculation) | 22,162.28 |
| HF (with loss calculation) | 40,962.78 |
| Fast-LLM (no loss calculation) | 59,013.70 |
| Fast-LLM (with loss calculation) | OOM |
What is a reasonable target for reducing Fast-LLM's memory usage?
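For context, the gap between the with-loss and no-loss rows is plausibly dominated by materializing the full logits tensor: at 16×4096 tokens and Qwen2's ~152k-token vocabulary, the logits alone come to roughly 20 GB in bfloat16. Below is a minimal sketch of how the HF-side numbers could be reproduced; the harness itself is an assumption (it is not the script behind the table above), but the model id and flags match the setup described here.

```python
# Hypothetical reproduction harness for the HF rows above; the actual
# script behind the table is not shown in this issue.
import torch
from transformers import AutoModelForCausalLM

def peak_forward_memory_mb(model, input_ids, labels=None):
    """Peak GPU memory (MB) for a single forward pass."""
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    with torch.no_grad():
        model(input_ids=input_ids, labels=labels)
    return torch.cuda.max_memory_allocated() / 1e6

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2-1.5B",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
).cuda()

# Batch size 16 x 4096, matching the table above.
input_ids = torch.randint(0, model.config.vocab_size, (16, 4096), device="cuda")

print("no loss:  ", peak_forward_memory_mb(model, input_ids))
print("with loss:", peak_forward_memory_mb(model, input_ids, labels=input_ids))
```

The same wrapper could then be pointed at a Fast-LLM forward pass for an apples-to-apples comparison.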
## 🚀 Execution Plan
(This section may start as an incomplete draft but must be defined before implementation begins.)
### Step 1: What is the smallest working version?
(Describe the simplest way to implement this feature with minimal effort.)
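One low-effort diagnostic, offered as an assumption about approach rather than a committed plan: record a PyTorch allocator snapshot around a single Fast-LLM forward pass and compare it with the equivalent HF trace, so that the fix targets whatever allocation actually dominates the peak.

```python
# Hypothetical diagnostic: record allocator history around one forward pass,
# then inspect the snapshot at https://pytorch.org/memory_viz.
# Note: torch.cuda.memory._record_memory_history is a private PyTorch API.
import torch

torch.cuda.memory._record_memory_history(max_entries=100_000)

run_one_forward_pass()  # placeholder for the Fast-LLM forward pass under test

torch.cuda.memory._dump_snapshot("fast_llm_forward.pickle")
torch.cuda.memory._record_memory_history(enabled=None)  # stop recording
```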
### Step 2: What additional optimizations are possible (but optional)?
(List potential refinements that can be added in later PRs if needed.)
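To make the kind of refinement meant here concrete: if profiling confirms that the with-loss OOM comes from materializing the full (batch, seq, vocab) logits tensor, one commonly used mitigation elsewhere is to compute the cross-entropy over sequence chunks so that only one slice of logits is alive at a time. A minimal sketch follows; `chunked_cross_entropy` and its arguments are hypothetical names, not existing Fast-LLM APIs, and label shifting is left to the caller.

```python
# Illustrative sketch only: chunked_cross_entropy and lm_head are hypothetical
# names, not existing Fast-LLM APIs.
import torch
import torch.nn.functional as F

def chunked_cross_entropy(hidden, lm_head, labels, chunk_size=1024):
    """Mean cross-entropy, projecting hidden states to logits chunk by chunk.

    hidden: (batch, seq, d_model); labels: (batch, seq), -100 = ignored.
    Peak memory holds one (batch, chunk_size, vocab) logits slice instead of
    the full (batch, seq, vocab) tensor.
    """
    total_loss, total_tokens = 0.0, 0
    for start in range(0, hidden.size(1), chunk_size):
        logits = lm_head(hidden[:, start : start + chunk_size])
        chunk_labels = labels[:, start : start + chunk_size]
        # Upcast to float32 for the loss, as HF modeling code typically does.
        total_loss = total_loss + F.cross_entropy(
            logits.flatten(0, 1).float(), chunk_labels.flatten(), reduction="sum"
        )
        total_tokens += (chunk_labels != -100).sum()
    return total_loss / total_tokens
```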
## 📏 Acceptance Criteria (Must-Haves for Completion)
- The feature must be functional and tested.
- The implementation must be documented in practical terms.
- The PR must include a performance/impact summary.
- No refactors unless directly necessary for feature completion.
## 🛠️ Project Management
- Assign the project to the Fast-LLM project.
- Set the `Estimate` field (in days) in the GitHub project.
- Use the `Size` field to categorize the PR size (Small/Medium/Large).
- Assign an owner when opening the issue.