Commit a92a385

Refactor: unify versioning, standardize GBI naming, and correct performance claims
1 parent b6fb74b commit a92a385

8 files changed

Lines changed: 37 additions & 29 deletions

README.md

Lines changed: 5 additions & 2 deletions
@@ -16,7 +16,10 @@ LEMA is a specialized framework designed to facilitate the fine-tuning of Large
 | :--- | :--- | :--- | :--- | :--- |
 | **TinyLlama 1.1B** | 2.67 GB | **2.12 GB** | **20.5%** | **Stable** |
 | **SmolLM2 1.7B** | 3.88 GB | **3.20 GB** | **17.6%** | **Stable** |
-| **Llama-2 7B** | 13.99 GB* | **5.90 GB** | **~58%** | **LEMA Recommended** |
+| **Llama-2 7B** | 13.99 GB* | **5.90 GB*** | **~58%** | **LEMA Recommended** |
+
+*VRAM Note: 5.90 GB is for Seq 128. For Seq 512, peak VRAM is 6.36 GB.*
+*Note on Llama-2 7B: Standard PEFT can load the model but fails with OOM during training.*
 
 ![VRAM Benchmark](docs/assets/vram_benchmark.png)
 
@@ -33,7 +36,7 @@ The primary value of LEMA is not just "fitting" the model, but providing the **c
 
 ## Core Features
 
-- **Binary Indexed Engagement (GBI)**: Zero-copy mapping of `.safetensors` files using `mmap`.
+- **Global Binary Index (GBI)**: Zero-copy mapping of `.safetensors` files using `mmap`.
 - **Triple-Buffer Pipeline**: Pipelined data movement (Disk -> RAM -> VRAM) to hide PCIe latency.
 - **High-Level API**: Simplified `LemaModel` and `LemaTrainer` interfaces for fast integration.
 - **Automatic Checkpointing**: Built-in interval-based saving of LoRA adapters and optimizer states.
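For readers landing on this rename: the GBI is essentially an `mmap`-backed index over a `.safetensors` file. A minimal sketch of the idea, assuming the standard safetensors layout (an 8-byte little-endian header length, then a JSON header with `dtype`/`shape`/`data_offsets` per tensor); the function names are illustrative, not LEMA's actual API:

```python
import json
import mmap
import struct

import numpy as np

def build_gbi(path):
    """Build a name -> (offset, nbytes, dtype, shape) index from a .safetensors header."""
    f = open(path, "rb")
    (header_len,) = struct.unpack("<Q", f.read(8))   # first 8 bytes: JSON header size (LE u64)
    header = json.loads(f.read(header_len))
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)  # map the file; pages fault in on demand
    data_start = 8 + header_len
    index = {}
    for name, meta in header.items():
        if name == "__metadata__":
            continue
        begin, end = meta["data_offsets"]            # offsets relative to the data section
        index[name] = (data_start + begin, end - begin, meta["dtype"], meta["shape"])
    return mm, index

def read_tensor(mm, index, name, np_dtype=np.float32):
    """O(1) header lookup, then a zero-copy NumPy view over the mapped bytes."""
    offset, nbytes, _dtype_str, shape = index[name]  # a real loader would map _dtype_str -> np dtype
    count = nbytes // np.dtype(np_dtype).itemsize
    return np.frombuffer(mm, dtype=np_dtype, count=count, offset=offset).reshape(shape)
```

Because `mmap` only faults pages in on access, indexing a multi-GB checkpoint this way touches just the header bytes rather than deserializing the whole file.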

docs/ARCHITECTURE.md

Lines changed: 4 additions & 2 deletions
@@ -42,5 +42,7 @@ LEMA uses a specialized indexer to bypass standard PyTorch/Pickle deserializatio
 
 ## Performance Trade-offs
 - **VRAM Efficiency**: ~50-70% reduction for 7B+ models.
-- **Compute Overhead**: 1.2x - 1.8x slowdown compared to fully resident training, depending on PCIe bandwidth and disk speed.
-- **System RAM**: Requires space equal to the model size (or less if using aggressive disk streaming).
+- **Compute Overhead**: 1.5x - 3.5x slowdown compared to fully resident training, depending on PCIe bandwidth and disk speed.
+- **System RAM**:
+  - **STREAMING Mode**: ~2.5 GB (Pinned buffers).
+  - **RESIDENT Mode**: Requires space equal to the model size.
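The new ~2.5 GB STREAMING figure follows from the allocation pattern in the `src/lema/core/memory.py` diff below: two pinned fp32 slots, each sized to the largest packed layer. A back-of-envelope sketch, where the layer size is an illustrative placeholder rather than a measured value:

```python
# STREAMING host RAM ~= num_slots * max_layer_params * bytes_per_param
num_slots = 2                    # double-buffered pipeline slots (see memory.py below)
max_layer_params = 300_000_000   # hypothetical largest packed layer, NOT a measured value
bytes_per_param = 4              # the staging buffers are float32
print(f"{num_slots * max_layer_params * bytes_per_param / 2**30:.2f} GiB")  # -> 2.24 GiB
```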

docs/BENCHMARK_RESULTS.md

Lines changed: 3 additions & 2 deletions
@@ -1,4 +1,4 @@
-# LEMA Benchmark Results (v0.7 - Release Candidate)
+# LEMA Benchmark Results (v1.0.0)
 
 Benchmarks were performed on **Kaggle (Tesla P100 GPU, 16GB VRAM)**.
 Comparisons were made between **Standard PEFT (LoRA)** and **LEMA (Streaming Strategy)**.
@@ -16,8 +16,9 @@ LEMA demonstrates significant VRAM savings, particularly for larger models where
 | **GPT-2 (Small)** | 124M | 0.44 GB | 1.05 GB | N/A* |
 | **TinyLlama** | 1.1B | 2.67 GB | **2.12 GB** | **20.5%** |
 | **SmolLM2** | 1.7B | 3.88 GB | **3.20 GB** | **17.6%** |
-| **Llama-2** | 7B | **13.99 GB** (Load Only)** | **5.90 GB** | **57.9%** |
+| **Llama-2** | 7B | **13.99 GB** (Load Only)** | **5.90 GB*** | **57.9%** |
 
+*\*Note on Llama-2 7B VRAM: The 5.90 GB figure represents a benchmark run with Batch Size 1 and Sequence Length 128. For the full training run (Batch Size 8, Seq 512) used for the HuggingFace model, peak VRAM was 6.36 GB.*
 *\*Note on GPT-2: For extremely small models, LEMA's fixed buffering overhead exceeds the model size. LEMA is optimized for Large-scale models.*
 *\**Note on Llama-2 7B: Standard PEFT can load the model (13.99GB) but fails immediately with **Out-Of-Memory (OOM)** when attempting a training step due to gradients/activations. LEMA trains comfortably with >10GB headroom.*
 

docs/LEMA Framework Proposal.md

Lines changed: 3 additions & 7 deletions
@@ -8,14 +8,14 @@ LEMA is a specialized framework designed to facilitate the fine-tuning of Large
 
 ## **2\. Core Concepts**
 
-### **2.1 Binary Indexed Engagement**
+### **2.1 Global Binary Index (GBI)**
 
 Standard model loading (e.g., PyTorch .bin or .pt) involves full deserialization into System RAM. LEMA uses a **Global Binary Index (GBI)**.
 
 * **Zero-Copy Mapping:** Uses mmap to map the model file (preferably in .safetensors format) into the process's virtual address space.
 * **Header Indexing:** A JSON/Binary header stores the (offset, size, dtype, shape) for every tensor, allowing O(1) access to specific layer weights without scanning the file.
 
-### **2.2 Layer-wise Execution (Patchwork)**
+### **2.2 Layer-wise Execution**
 
 Instead of a monolithic model.forward(), LEMA decomposes the computational graph into a sequence of isolated layer blocks.
 
@@ -55,10 +55,6 @@ To save VRAM, LEMA implements **Segmented Gradient Checkpointing**:
 4. Calculate gradients for Layer *N* LoRA adapters.
 5. Offload Layer *N* weights; move to Layer *N-1*.
 
-### **4.3 Optimizer Offloading**
-
-The **Adam Optimizer states** (Momentum and Variance) are stored in System RAM. During the weight update step, LEMA pulls only the specific optimizer slice for the current layer's adapters into VRAM, performs the update, and pushes it back.
-
 ## **5\. Technical Implementation Stack**
 
 * **Host Language:** Python (Orchestration) / C++ (High-speed Memory Management).
@@ -73,5 +69,5 @@ The **Adam Optimizer states** (Momentum and Variance) are stored in System RAM.
 | :---- | :---- | :---- |
 | **VRAM Requirement** | Full Model \+ Gradients | \~2 Layers \+ Buffers |
 | **System RAM Usage** | Model Size | Model Size (via mmap/Page Cache) |
-| **Speed** | 100% (Baseline) | 60-80% (PCIe Latency) |
+| **Speed** | 100% (Baseline) | 30-70% (PCIe Latency) |
 | **Model Scalability** | Limited by GPU VRAM | Limited by Disk Space |
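The layer-wise execution of Section 2.2 reduces to a loop that keeps a single block resident at a time. A minimal sketch, where `load_layer` and `offload_layer` are hypothetical callbacks standing in for the GBI-backed streaming pipeline:

```python
def layerwise_forward(layer_ids, hidden, load_layer, offload_layer):
    """Sketch of layer-wise execution: only one layer block is VRAM-resident at a time.

    load_layer / offload_layer are hypothetical callbacks standing in for the
    GBI-backed transfer machinery described above.
    """
    for layer_id in layer_ids:
        block = load_layer(layer_id)        # materialize this block's weights in VRAM
        hidden = block(hidden)              # execute just this segment of the graph
        offload_layer(layer_id, block)      # release the weights before the next block
    return hidden
```

The trade-off table above follows directly from this loop: VRAM holds roughly two layer blocks plus buffers, while throughput is bounded by how fast the next block can be staged in.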

examples/kaggle/benchmark.ipynb

Lines changed: 3 additions & 3 deletions
@@ -24,7 +24,7 @@
 "\n",
 "## Core Features\n",
 "\n",
-"- **Binary Indexed Engagement (GBI)**: Zero-copy mapping of `.safetensors` files using `mmap`.\n",
+"- **Global Binary Index (GBI)**: Zero-copy mapping of `.safetensors` files using `mmap`.\n",
 "- **Triple-Buffer Pipeline**: Pipelined data movement (Disk -> RAM -> VRAM) to hide PCIe latency.\n",
 "- **High-Level API**: Simplified `LemaModel` and `LemaTrainer` interfaces for fast integration.\n",
 "- **Automatic Checkpointing**: Built-in interval-based saving of LoRA adapters and optimizer states.\n",
@@ -2304,7 +2304,7 @@
 "\n",
 "## **2\\. Core Concepts**\n",
 "\n",
-"### **2.1 Binary Indexed Engagement**\n",
+"### **2.1 Global Binary Index (GBI)**\n",
 "\n",
 "Standard model loading (e.g., PyTorch .bin or .pt) involves full deserialization into System RAM. LEMA uses a **Global Binary Index (GBI)**.\n",
 "\n",
@@ -2369,7 +2369,7 @@
 "| :---- | :---- | :---- |\n",
 "| **VRAM Requirement** | Full Model \\+ Gradients | \\~2 Layers \\+ Buffers |\n",
 "| **System RAM Usage** | Model Size | Model Size (via mmap/Page Cache) |\n",
-"| **Speed** | 100% (Baseline) | 60-80% (PCIe Latency) |\n",
+"| **Speed** | 100% (Baseline) | 30-70% (PCIe Latency) |\n",
 "| **Model Scalability** | Limited by GPU VRAM | Limited by Disk Space |"
 ]
 },

src/lema/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -2,4 +2,4 @@
 from .core.model import LemaModel
 from .engine.trainer import LemaTrainer
 
-__version__ = "0.1.1"
+__version__ = "1.0.0"

src/lema/core/memory.py

Lines changed: 8 additions & 7 deletions
@@ -32,12 +32,13 @@ def __init__(self, gbi, adapter, device="cuda", strategy=MemoryStrategy.STREAMIN
         # RAM Buffers
         if self.strategy == MemoryStrategy.RESIDENT:
             print(f"LEMA: Initializing RESIDENT strategy (Caching model in RAM)...")
-            self.ram_flat_buffers: Dict[int, torch.Tensor] = {}
+            # Dict[layer_id, Tensor]
+            self.ram_buffers: Dict[int, torch.Tensor] = {}
             self._initialize_full_ram_cache()
         else:
             print(f"LEMA: Initializing STREAMING strategy (Default)...")
-            # In streaming mode, we only need 2 RAM slots for the pipeline
-            self.ram_flat_buffers: List[torch.Tensor] = [
+            # List[Tensor] (Double buffering slots)
+            self.ram_buffers: List[torch.Tensor] = [
                 torch.empty(self.max_params, device="cpu", dtype=torch.float32).pin_memory() if self.is_cuda else torch.empty(self.max_params, device="cpu", dtype=torch.float32)
                 for _ in range(2)
             ]
@@ -69,7 +70,7 @@ def _pack_layer_to_ram(self, layer_id: int, slot: int = 0, is_resident: bool = F
             total_el = sum(w.numel() for w in weights.values())
             buf = torch.empty(total_el, device="cpu", dtype=torch.float32).pin_memory()
         else:
-            buf = self.ram_flat_buffers[slot]
+            buf = self.ram_buffers[slot]
 
         offset = 0
         for name in param_names:
@@ -79,7 +80,7 @@ def _pack_layer_to_ram(self, layer_id: int, slot: int = 0, is_resident: bool = F
             offset += numel
 
         if is_resident:
-            self.ram_flat_buffers[layer_id] = buf
+            self.ram_buffers[layer_id] = buf
         else:
             self.ram_layer_ids[slot] = layer_id
 
@@ -96,11 +97,11 @@ def prefetch_to_ram(self, layer_id: int, ram_slot: int):
     def async_transfer_to_vram(self, layer_id: int, vram_slot: int, ram_slot: Optional[int] = None):
         """Stage 2: Async transfer to GPU."""
         if self.strategy == MemoryStrategy.RESIDENT:
-            src_buf = self.ram_flat_buffers[layer_id]
+            src_buf = self.ram_buffers[layer_id]
         else:
             if ram_slot is None:
                 raise ValueError("ram_slot must be provided in streaming mode")
-            src_buf = self.ram_flat_buffers[ram_slot]
+            src_buf = self.ram_buffers[ram_slot]
 
         vram_dest = self.vram_flat_buffers[vram_slot]
 
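For orientation, the renamed `ram_buffers` list holds the two ping-pong slots that let Stage 1 (disk -> RAM) for layer N+1 overlap Stage 2 (pinned RAM -> VRAM) for layer N. A hypothetical driver loop using the method names from the diff above; this is a sketch of the pattern, not the framework's actual scheduler, and it assumes `prefetch_to_ram` runs asynchronously:

```python
def stream_layers(mem, layer_ids):
    """Ping-pong the two RAM slots so prefetch of layer N+1 overlaps transfer of layer N."""
    mem.prefetch_to_ram(layer_ids[0], ram_slot=0)          # prime the pipeline
    for i, layer_id in enumerate(layer_ids):
        if i + 1 < len(layer_ids):
            # Stage 1 for layer N+1 into the *other* slot
            mem.prefetch_to_ram(layer_ids[i + 1], ram_slot=(i + 1) % 2)
        # Stage 2 for layer N out of the slot it was prefetched into
        mem.async_transfer_to_vram(layer_id, vram_slot=i % 2, ram_slot=i % 2)
        # ... compute on layer N here while the next prefetch is in flight ...
```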

src/lema/engine/trainer.py

Lines changed: 10 additions & 5 deletions
@@ -32,6 +32,7 @@ def __init__(self,
         self.lora_manager = lora_manager
         self.optimizer = optimizer
         self.global_step = 0
+        self.accumulation_step = 0
 
     def save_checkpoint(self, save_directory: str):
         """Saves the model state (config + LoRA) and optionally optimizer state."""
@@ -139,13 +140,15 @@ def train_step(self, inputs: Any, labels: Optional[torch.Tensor] = None):
             if i == last_idx:
                 if labels is not None:
                     # Real Causal LM Loss
-                    # Shift so that tokens < n predict n
                     shift_logits = output[..., :-1, :].contiguous()
                     shift_labels = labels[..., 1:].contiguous()
                     loss = F.cross_entropy(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))
                     loss_val = loss.item()
+
+                    # Normalize for gradient accumulation
+                    loss = loss / self.config.gradient_accumulation_steps
                 else:
-                    loss = output.mean() # Dummy
+                    loss = output.mean() / self.config.gradient_accumulation_steps
 
                 loss.backward()
                 grad_output = layer_input.grad
@@ -161,14 +164,16 @@ def train_step(self, inputs: Any, labels: Optional[torch.Tensor] = None):
             self.adapter.release_layer_module(layer_module)
             del layer_module
 
-        if self.optimizer:
+        self.accumulation_step += 1
+
+        if self.optimizer and (self.accumulation_step % self.config.gradient_accumulation_steps == 0):
            self.optimizer.step()
            self.optimizer.zero_grad()
 
         self.global_step += 1
 
-        # Automatic checkpointing
-        if self.config.save_steps > 0 and self.global_step % self.config.save_steps == 0:
+        # Automatic checkpointing (only after optimizer step)
+        if self.config.save_steps > 0 and self.global_step % (self.config.save_steps * self.config.gradient_accumulation_steps) == 0:
             checkpoint_path = os.path.join(self.config.output_dir, f"checkpoint-{self.global_step}")
             self.save_checkpoint(checkpoint_path)
 
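The accumulation logic added here follows the standard pattern: each micro-batch loss is scaled by 1/`gradient_accumulation_steps` so that the summed gradients match one large batch, and the optimizer steps only on accumulation boundaries. A self-contained illustration in plain PyTorch, independent of LEMA's trainer:

```python
import torch
import torch.nn.functional as F

model = torch.nn.Linear(4, 1)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
accum_steps = 4  # plays the role of config.gradient_accumulation_steps

for step in range(16):
    x, y = torch.randn(2, 4), torch.randn(2, 1)
    loss = F.mse_loss(model(x), y) / accum_steps  # normalize: N scaled losses sum to one batch mean
    loss.backward()                               # gradients accumulate across micro-batches
    if (step + 1) % accum_steps == 0:             # boundary: one optimizer step per N micro-batches
        opt.step()
        opt.zero_grad()
```

Scaling the checkpoint interval by `gradient_accumulation_steps` in the diff keeps `save_steps` expressed in optimizer steps rather than micro-batches.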
