*\*Note on Llama-2 7B VRAM: the 5.90 GB figure comes from a benchmark run with batch size 1 and sequence length 128. For the full training run (batch size 8, sequence length 512) used to produce the HuggingFace model, peak VRAM was 6.36 GB.*
*\*Note on GPT-2: for extremely small models, LEMA's fixed buffering overhead exceeds the model size itself; LEMA is optimized for large-scale models.*
*\*Note on Llama-2 7B: standard PEFT can load the model (13.99 GB) but fails immediately with an **Out-Of-Memory (OOM)** error when attempting a training step, due to gradients and activations. LEMA trains comfortably with >10 GB of headroom.*
**docs/LEMA Framework Proposal.md**
LEMA is a specialized framework designed to facilitate the fine-tuning of Large Language Models.
## **2\. Core Concepts**
### **2.1 Global Binary Index (GBI)**
Standard model loading (e.g., PyTorch `.bin` or `.pt`) fully deserializes the checkpoint into System RAM. LEMA instead uses a **Global Binary Index (GBI)**:
* **Zero-Copy Mapping:** uses `mmap` to map the model file (preferably in `.safetensors` format) into the process's virtual address space.
* **Header Indexing:** a JSON/binary header stores the `(offset, size, dtype, shape)` of every tensor, allowing O(1) access to specific layer weights without scanning the file (see the sketch below).
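
A minimal sketch of GBI-style access using the `safetensors` library's `safe_open`, which reads only the small JSON header up front and then pulls individual tensors by byte range through the OS page cache. The file path and tensor name here are hypothetical:

```python
from safetensors import safe_open

MODEL_PATH = "model.safetensors"  # hypothetical path

with safe_open(MODEL_PATH, framework="pt", device="cpu") as f:
    # Build an O(1) index of every tensor's shape from the header alone;
    # no tensor data is deserialized at this point.
    index = {name: f.get_slice(name).get_shape() for name in f.keys()}

    # Pull exactly one layer's weights on demand; the rest of the file
    # stays untouched on disk (hypothetical tensor name).
    w = f.get_tensor("model.layers.0.self_attn.q_proj.weight")
```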
### **2.2 Layer-wise Execution**
Instead of a monolithic `model.forward()`, LEMA decomposes the computational graph into a sequence of isolated layer blocks, as sketched below.
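
A rough sketch of that decomposition, assuming a hypothetical `load_layer` helper backed by the GBI: each block is pulled into VRAM, applied, and evicted before the next one loads.

```python
import torch

def streamed_forward(hidden, layer_names, load_layer):
    """Run one transformer block at a time so VRAM holds only the
    active block plus the running hidden state (illustrative only)."""
    for name in layer_names:
        block = load_layer(name).to("cuda")  # materialize this block's weights
        hidden = block(hidden)               # compute the block in isolation
        block.to("cpu")                      # evict before loading the next block
        torch.cuda.empty_cache()             # return the freed memory to the pool
    return hidden
```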
To save VRAM, LEMA implements **Segmented Gradient Checkpointing**:
4. Calculate gradients for Layer *N*'s LoRA adapters.
5. Offload Layer *N*'s weights; move to Layer *N−1* (sketched below).
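
A hedged sketch of one step of that backward sweep, assuming the base weights are frozen (only the LoRA adapters require gradients) and that `load_layer` and `saved_inputs` are hypothetical helpers holding the checkpointed block inputs:

```python
import torch

def backward_segment(layer_idx, grad_out, load_layer, saved_inputs):
    block = load_layer(layer_idx).to("cuda")            # re-materialize base weights
    x = saved_inputs[layer_idx].detach().requires_grad_(True)
    out = block(x)                                      # recompute the forward pass
    out.backward(grad_out)                              # grads land in the LoRA adapters
    grad_in = x.grad                                    # gradient to hand to Layer N-1
    block.to("cpu")                                     # offload before stepping back
    return grad_in
```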
### **4.3 Optimizer Offloading**
The **Adam Optimizer states** (Momentum and Variance) are stored in System RAM. During the weight update step, LEMA pulls only the specific optimizer slice for the current layer's adapters into VRAM, performs the update, and pushes it back.
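
An illustrative sketch of that round trip, assuming each layer's Adam state lives in a CPU-resident dict (`cpu_state`, a hypothetical structure); bias correction is omitted for brevity:

```python
import torch

def offloaded_adam_step(param, grad, cpu_state, lr=1e-4, b1=0.9, b2=0.999, eps=1e-8):
    # Pull only this layer's momentum/variance slices into VRAM.
    m = cpu_state["m"].to("cuda", non_blocking=True)
    v = cpu_state["v"].to("cuda", non_blocking=True)
    m.mul_(b1).add_(grad, alpha=1 - b1)                     # update momentum
    v.mul_(b2).addcmul_(grad, grad, value=1 - b2)           # update variance
    param.data.addcdiv_(m, v.sqrt().add_(eps), value=-lr)   # apply the step
    # Push the refreshed states back to System RAM.
    cpu_state["m"].copy_(m, non_blocking=True)
    cpu_state["v"].copy_(v, non_blocking=True)
```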
## **5\. Technical Implementation Stack**
* **Host Language:** Python (orchestration) / C++ (high-speed memory management).
| Metric | Standard Fine-Tuning | LEMA |
| :---- | :---- | :---- |
| **VRAM Requirement** | Full Model \+ Gradients | \~2 Layers \+ Buffers |
| **System RAM Usage** | Model Size | Model Size (via mmap/Page Cache) |