Commit a92a385

Refactor: unify versioning, standardize GBI naming, and correct performance claims
1 parent b6fb74b commit a92a385

8 files changed

Lines changed: 37 additions & 29 deletions

README.md

Lines changed: 5 additions & 2 deletions
@@ -16,7 +16,10 @@ LEMA is a specialized framework designed to facilitate the fine-tuning of Large
 | :--- | :--- | :--- | :--- | :--- |
 | **TinyLlama 1.1B** | 2.67 GB | **2.12 GB** | **20.5%** | **Stable** |
 | **SmolLM2 1.7B** | 3.88 GB | **3.20 GB** | **17.6%** | **Stable** |
-| **Llama-2 7B** | 13.99 GB* | **5.90 GB** | **~58%** | **LEMA Recommended** |
+| **Llama-2 7B** | 13.99 GB* | **5.90 GB*** | **~58%** | **LEMA Recommended** |
+
+*VRAM Note: 5.90 GB is for Seq 128. For Seq 512, peak VRAM is 6.36 GB.*
+*Note on Llama-2 7B: Standard PEFT can load the model but fails with OOM during training.*
 
 ![VRAM Benchmark](docs/assets/vram_benchmark.png)
 
@@ -33,7 +36,7 @@ The primary value of LEMA is not just "fitting" the model, but providing the **c
 
 ## Core Features
 
-- **Binary Indexed Engagement (GBI)**: Zero-copy mapping of `.safetensors` files using `mmap`.
+- **Global Binary Index (GBI)**: Zero-copy mapping of `.safetensors` files using `mmap`.
 - **Triple-Buffer Pipeline**: Pipelined data movement (Disk -> RAM -> VRAM) to hide PCIe latency.
 - **High-Level API**: Simplified `LemaModel` and `LemaTrainer` interfaces for fast integration.
 - **Automatic Checkpointing**: Built-in interval-based saving of LoRA adapters and optimizer states.
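For readers landing on this rename: the GBI is essentially an `mmap`-backed index over a `.safetensors` file. A minimal sketch of the idea, assuming the standard safetensors layout (an 8-byte little-endian header length, then a JSON header with `dtype`/`shape`/`data_offsets` per tensor); the function names are illustrative, not LEMA's actual API:

```python
import json
import mmap
import struct

import numpy as np

def build_gbi(path):
    """Build a name -> (offset, nbytes, dtype, shape) index from a .safetensors header."""
    f = open(path, "rb")
    (header_len,) = struct.unpack("<Q", f.read(8))   # first 8 bytes: JSON header size (LE u64)
    header = json.loads(f.read(header_len))
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)  # map the file; pages fault in on demand
    data_start = 8 + header_len
    index = {}
    for name, meta in header.items():
        if name == "__metadata__":
            continue
        begin, end = meta["data_offsets"]            # offsets relative to the data section
        index[name] = (data_start + begin, end - begin, meta["dtype"], meta["shape"])
    return mm, index

def read_tensor(mm, index, name, np_dtype=np.float32):
    """O(1) header lookup, then a zero-copy NumPy view over the mapped bytes."""
    offset, nbytes, _dtype_str, shape = index[name]  # a real loader would map _dtype_str -> np dtype
    count = nbytes // np.dtype(np_dtype).itemsize
    return np.frombuffer(mm, dtype=np_dtype, count=count, offset=offset).reshape(shape)
```

Because `mmap` only faults pages in on access, indexing a multi-GB checkpoint this way touches just the header bytes rather than deserializing the whole file.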

docs/ARCHITECTURE.md

Lines changed: 4 additions & 2 deletions
@@ -42,5 +42,7 @@ LEMA uses a specialized indexer to bypass standard PyTorch/Pickle deserializatio
 
 ## Performance Trade-offs
 - **VRAM Efficiency**: ~50-70% reduction for 7B+ models.
-- **Compute Overhead**: 1.2x - 1.8x slowdown compared to fully resident training, depending on PCIe bandwidth and disk speed.
-- **System RAM**: Requires space equal to the model size (or less if using aggressive disk streaming).
+- **Compute Overhead**: 1.5x - 3.5x slowdown compared to fully resident training, depending on PCIe bandwidth and disk speed.
+- **System RAM**:
+  - **STREAMING Mode**: ~2.5 GB (Pinned buffers).
+  - **RESIDENT Mode**: Requires space equal to the model size.
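The new ~2.5 GB STREAMING figure follows from the allocation pattern in the `src/lema/core/memory.py` diff below: two pinned fp32 slots, each sized to the largest packed layer. A back-of-envelope sketch, where the layer size is an illustrative placeholder rather than a measured value:

```python
# STREAMING host RAM ~= num_slots * max_layer_params * bytes_per_param
num_slots = 2                    # double-buffered pipeline slots (see memory.py below)
max_layer_params = 300_000_000   # hypothetical largest packed layer, NOT a measured value
bytes_per_param = 4              # the staging buffers are float32
print(f"{num_slots * max_layer_params * bytes_per_param / 2**30:.2f} GiB")  # -> 2.24 GiB
```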

docs/BENCHMARK_RESULTS.md

Lines changed: 3 additions & 2 deletions
@@ -1,4 +1,4 @@
-# LEMA Benchmark Results (v0.7 - Release Candidate)
+# LEMA Benchmark Results (v1.0.0)
 
 Benchmarks were performed on **Kaggle (Tesla P100 GPU, 16GB VRAM)**.
 Comparisons were made between **Standard PEFT (LoRA)** and **LEMA (Streaming Strategy)**.
@@ -16,8 +16,9 @@ LEMA demonstrates significant VRAM savings, particularly for larger models where
 | **GPT-2 (Small)** | 124M | 0.44 GB | 1.05 GB | N/A* |
 | **TinyLlama** | 1.1B | 2.67 GB | **2.12 GB** | **20.5%** |
 | **SmolLM2** | 1.7B | 3.88 GB | **3.20 GB** | **17.6%** |
-| **Llama-2** | 7B | **13.99 GB** (Load Only)** | **5.90 GB** | **57.9%** |
+| **Llama-2** | 7B | **13.99 GB** (Load Only)** | **5.90 GB*** | **57.9%** |
 
+*\*Note on Llama-2 7B VRAM: The 5.90 GB figure represents a benchmark run with Batch Size 1 and Sequence Length 128. For the full training run (Batch Size 8, Seq 512) used for the HuggingFace model, peak VRAM was 6.36 GB.*
 *\*Note on GPT-2: For extremely small models, LEMA's fixed buffering overhead exceeds the model size. LEMA is optimized for Large-scale models.*
 *\**Note on Llama-2 7B: Standard PEFT can load the model (13.99GB) but fails immediately with **Out-Of-Memory (OOM)** when attempting a training step due to gradients/activations. LEMA trains comfortably with >10GB headroom.*
 

docs/LEMA Framework Proposal.md

Lines changed: 3 additions & 7 deletions
@@ -8,14 +8,14 @@ LEMA is a specialized framework designed to facilitate the fine-tuning of Large
 
 ## **2\. Core Concepts**
 
-### **2.1 Binary Indexed Engagement**
+### **2.1 Global Binary Index (GBI)**
 
 Standard model loading (e.g., PyTorch .bin or .pt) involves full deserialization into System RAM. LEMA uses a **Global Binary Index (GBI)**.
 
 * **Zero-Copy Mapping:** Uses mmap to map the model file (preferably in .safetensors format) into the process's virtual address space.
 * **Header Indexing:** A JSON/Binary header stores the (offset, size, dtype, shape) for every tensor, allowing O(1) access to specific layer weights without scanning the file.
 
-### **2.2 Layer-wise Execution (Patchwork)**
+### **2.2 Layer-wise Execution**
 
 Instead of a monolithic model.forward(), LEMA decomposes the computational graph into a sequence of isolated layer blocks.
 
@@ -55,10 +55,6 @@ To save VRAM, LEMA implements **Segmented Gradient Checkpointing**:
 4. Calculate gradients for Layer *N* LoRA adapters.
 5. Offload Layer *N* weights; move to Layer *N-1*.
 
-### **4.3 Optimizer Offloading**
-
-The **Adam Optimizer states** (Momentum and Variance) are stored in System RAM. During the weight update step, LEMA pulls only the specific optimizer slice for the current layer's adapters into VRAM, performs the update, and pushes it back.
-
 ## **5\. Technical Implementation Stack**
 
 * **Host Language:** Python (Orchestration) / C++ (High-speed Memory Management).
@@ -73,5 +69,5 @@ The **Adam Optimizer states** (Momentum and Variance) are stored in System RAM.
 | :---- | :---- | :---- |
 | **VRAM Requirement** | Full Model \+ Gradients | \~2 Layers \+ Buffers |
 | **System RAM Usage** | Model Size | Model Size (via mmap/Page Cache) |
-| **Speed** | 100% (Baseline) | 60-80% (PCIe Latency) |
+| **Speed** | 100% (Baseline) | 30-70% (PCIe Latency) |
 | **Model Scalability** | Limited by GPU VRAM | Limited by Disk Space |
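The layer-wise execution of Section 2.2 reduces to a loop that keeps a single block resident at a time. A minimal sketch, where `load_layer` and `offload_layer` are hypothetical callbacks standing in for the GBI-backed streaming pipeline:

```python
def layerwise_forward(layer_ids, hidden, load_layer, offload_layer):
    """Sketch of layer-wise execution: only one layer block is VRAM-resident at a time.

    load_layer / offload_layer are hypothetical callbacks standing in for the
    GBI-backed transfer machinery described above.
    """
    for layer_id in layer_ids:
        block = load_layer(layer_id)        # materialize this block's weights in VRAM
        hidden = block(hidden)              # execute just this segment of the graph
        offload_layer(layer_id, block)      # release the weights before the next block
    return hidden
```

The trade-off table above follows directly from this loop: VRAM holds roughly two layer blocks plus buffers, while throughput is bounded by how fast the next block can be staged in.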

examples/kaggle/benchmark.ipynb

Lines changed: 3 additions & 3 deletions
@@ -24,7 +24,7 @@
 "\n",
 "## Core Features\n",
 "\n",
-"- **Binary Indexed Engagement (GBI)**: Zero-copy mapping of `.safetensors` files using `mmap`.\n",
+"- **Global Binary Index (GBI)**: Zero-copy mapping of `.safetensors` files using `mmap`.\n",
 "- **Triple-Buffer Pipeline**: Pipelined data movement (Disk -> RAM -> VRAM) to hide PCIe latency.\n",
 "- **High-Level API**: Simplified `LemaModel` and `LemaTrainer` interfaces for fast integration.\n",
 "- **Automatic Checkpointing**: Built-in interval-based saving of LoRA adapters and optimizer states.\n",
@@ -2304,7 +2304,7 @@
 "\n",
 "## **2\\. Core Concepts**\n",
 "\n",
-"### **2.1 Binary Indexed Engagement**\n",
+"### **2.1 Global Binary Index (GBI)**\n",
 "\n",
 "Standard model loading (e.g., PyTorch .bin or .pt) involves full deserialization into System RAM. LEMA uses a **Global Binary Index (GBI)**.\n",
 "\n",
@@ -2369,7 +2369,7 @@
 "| :---- | :---- | :---- |\n",
 "| **VRAM Requirement** | Full Model \\+ Gradients | \\~2 Layers \\+ Buffers |\n",
 "| **System RAM Usage** | Model Size | Model Size (via mmap/Page Cache) |\n",
-"| **Speed** | 100% (Baseline) | 60-80% (PCIe Latency) |\n",
+"| **Speed** | 100% (Baseline) | 30-70% (PCIe Latency) |\n",
 "| **Model Scalability** | Limited by GPU VRAM | Limited by Disk Space |"
 ]
 },

src/lema/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -2,4 +2,4 @@
 from .core.model import LemaModel
 from .engine.trainer import LemaTrainer
 
-__version__ = "0.1.1"
+__version__ = "1.0.0"

src/lema/core/memory.py

Lines changed: 8 additions & 7 deletions
@@ -32,12 +32,13 @@ def __init__(self, gbi, adapter, device="cuda", strategy=MemoryStrategy.STREAMIN
         # RAM Buffers
         if self.strategy == MemoryStrategy.RESIDENT:
             print(f"LEMA: Initializing RESIDENT strategy (Caching model in RAM)...")
-            self.ram_flat_buffers: Dict[int, torch.Tensor] = {}
+            # Dict[layer_id, Tensor]
+            self.ram_buffers: Dict[int, torch.Tensor] = {}
             self._initialize_full_ram_cache()
         else:
             print(f"LEMA: Initializing STREAMING strategy (Default)...")
-            # In streaming mode, we only need 2 RAM slots for the pipeline
-            self.ram_flat_buffers: List[torch.Tensor] = [
+            # List[Tensor] (Double buffering slots)
+            self.ram_buffers: List[torch.Tensor] = [
                 torch.empty(self.max_params, device="cpu", dtype=torch.float32).pin_memory() if self.is_cuda else torch.empty(self.max_params, device="cpu", dtype=torch.float32)
                 for _ in range(2)
             ]
@@ -69,7 +70,7 @@ def _pack_layer_to_ram(self, layer_id: int, slot: int = 0, is_resident: bool = F
             total_el = sum(w.numel() for w in weights.values())
             buf = torch.empty(total_el, device="cpu", dtype=torch.float32).pin_memory()
         else:
-            buf = self.ram_flat_buffers[slot]
+            buf = self.ram_buffers[slot]
 
         offset = 0
         for name in param_names:
@@ -79,7 +80,7 @@ def _pack_layer_to_ram(self, layer_id: int, slot: int = 0, is_resident: bool = F
             offset += numel
 
         if is_resident:
-            self.ram_flat_buffers[layer_id] = buf
+            self.ram_buffers[layer_id] = buf
         else:
             self.ram_layer_ids[slot] = layer_id
 
@@ -96,11 +97,11 @@ def prefetch_to_ram(self, layer_id: int, ram_slot: int):
     def async_transfer_to_vram(self, layer_id: int, vram_slot: int, ram_slot: Optional[int] = None):
         """Stage 2: Async transfer to GPU."""
         if self.strategy == MemoryStrategy.RESIDENT:
-            src_buf = self.ram_flat_buffers[layer_id]
+            src_buf = self.ram_buffers[layer_id]
         else:
             if ram_slot is None:
                 raise ValueError("ram_slot must be provided in streaming mode")
-            src_buf = self.ram_flat_buffers[ram_slot]
+            src_buf = self.ram_buffers[ram_slot]
 
         vram_dest = self.vram_flat_buffers[vram_slot]
 
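For orientation, the renamed `ram_buffers` list holds the two ping-pong slots that let Stage 1 (disk -> RAM) for layer N+1 overlap Stage 2 (pinned RAM -> VRAM) for layer N. A hypothetical driver loop using the method names from the diff above; this is a sketch of the pattern, not the framework's actual scheduler, and it assumes `prefetch_to_ram` runs asynchronously:

```python
def stream_layers(mem, layer_ids):
    """Ping-pong the two RAM slots so prefetch of layer N+1 overlaps transfer of layer N."""
    mem.prefetch_to_ram(layer_ids[0], ram_slot=0)          # prime the pipeline
    for i, layer_id in enumerate(layer_ids):
        if i + 1 < len(layer_ids):
            # Stage 1 for layer N+1 into the *other* slot
            mem.prefetch_to_ram(layer_ids[i + 1], ram_slot=(i + 1) % 2)
        # Stage 2 for layer N out of the slot it was prefetched into
        mem.async_transfer_to_vram(layer_id, vram_slot=i % 2, ram_slot=i % 2)
        # ... compute on layer N here while the next prefetch is in flight ...
```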

src/lema/engine/trainer.py

Lines changed: 10 additions & 5 deletions
@@ -32,6 +32,7 @@ def __init__(self,
         self.lora_manager = lora_manager
         self.optimizer = optimizer
         self.global_step = 0
+        self.accumulation_step = 0
 
     def save_checkpoint(self, save_directory: str):
         """Saves the model state (config + LoRA) and optionally optimizer state."""
@@ -139,13 +140,15 @@ def train_step(self, inputs: Any, labels: Optional[torch.Tensor] = None):
             if i == last_idx:
                 if labels is not None:
                     # Real Causal LM Loss
-                    # Shift so that tokens < n predict n
                     shift_logits = output[..., :-1, :].contiguous()
                     shift_labels = labels[..., 1:].contiguous()
                     loss = F.cross_entropy(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))
                     loss_val = loss.item()
+
+                    # Normalize for gradient accumulation
+                    loss = loss / self.config.gradient_accumulation_steps
                 else:
-                    loss = output.mean() # Dummy
+                    loss = output.mean() / self.config.gradient_accumulation_steps
 
                 loss.backward()
                 grad_output = layer_input.grad
@@ -161,14 +164,16 @@ def train_step(self, inputs: Any, labels: Optional[torch.Tensor] = None):
             self.adapter.release_layer_module(layer_module)
             del layer_module
 
-        if self.optimizer:
+        self.accumulation_step += 1
+
+        if self.optimizer and (self.accumulation_step % self.config.gradient_accumulation_steps == 0):
            self.optimizer.step()
            self.optimizer.zero_grad()
 
         self.global_step += 1
 
-        # Automatic checkpointing
-        if self.config.save_steps > 0 and self.global_step % self.config.save_steps == 0:
+        # Automatic checkpointing (only after optimizer step)
+        if self.config.save_steps > 0 and self.global_step % (self.config.save_steps * self.config.gradient_accumulation_steps) == 0:
             checkpoint_path = os.path.join(self.config.output_dir, f"checkpoint-{self.global_step}")
             self.save_checkpoint(checkpoint_path)
 
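The accumulation logic added here follows the standard pattern: each micro-batch loss is scaled by 1/`gradient_accumulation_steps` so that the summed gradients match one large batch, and the optimizer steps only on accumulation boundaries. A self-contained illustration in plain PyTorch, independent of LEMA's trainer:

```python
import torch
import torch.nn.functional as F

model = torch.nn.Linear(4, 1)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
accum_steps = 4  # plays the role of config.gradient_accumulation_steps

for step in range(16):
    x, y = torch.randn(2, 4), torch.randn(2, 1)
    loss = F.mse_loss(model(x), y) / accum_steps  # normalize: N scaled losses sum to one batch mean
    loss.backward()                               # gradients accumulate across micro-batches
    if (step + 1) % accum_steps == 0:             # boundary: one optimizer step per N micro-batches
        opt.step()
        opt.zero_grad()
```

Scaling the checkpoint interval by `gradient_accumulation_steps` in the diff keeps `save_steps` expressed in optimizer steps rather than micro-batches.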
