This document describes the internal mechanics of the Layer-wise Efficient Memory Abstraction (LEMA) framework.
Standard fine-tuning (even with PEFT/LoRA) requires all of the model's weights to be resident in VRAM. For a Llama-2 7B model in FP16, this is ~14GB. Adding optimizer states and activations quickly exceeds the capacity of consumer GPUs (e.g., 16GB).
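As a rough check on that figure (assuming 2 bytes per parameter in FP16): $7 \times 10^9 \ \text{params} \times 2 \ \text{bytes/param} \approx 14\ \text{GB}$ for the weights alone, before gradients, optimizer states, or activations.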
LEMA treats GPU VRAM not as a static storage for the model, but as a dynamic cache for execution.
LEMA hides data transfer latency by pipelining data movement across three memory tiers:
- Storage (NVMe): Weights reside in .safetensors files, accessed via mmap (zero-copy).
- System RAM (Pinned): Acts as a "Prefetch Buffer"; pinned memory enables high-speed Host-to-Device (H2D) transfers (see the staging sketch after this list).
- VRAM (Execution): Divided into two "Slots" (Active and Prefetch).
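A minimal sketch of the Disk-to-RAM staging step, assuming weights are read with the safetensors library into pre-allocated pinned buffers; `stage_layer`, `tensor_names`, and `pinned_buffers` are illustrative names, not LEMA's API:

```python
import torch
from safetensors import safe_open

def stage_layer(path, tensor_names, pinned_buffers):
    """Copy one layer's tensors from a memory-mapped .safetensors file into pinned RAM."""
    with safe_open(path, framework="pt", device="cpu") as f:  # mmap-backed access
        for name in tensor_names:
            pinned_buffers[name].copy_(f.get_tensor(name))    # Disk -> pinned RAM
    return pinned_buffers

# Pinned buffers are allocated once and reused across layers, e.g.:
# pinned_buffers[name] = torch.empty(shape, dtype=torch.float16, pin_memory=True)
```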
While the GPU is computing Layer $N$ in Slot A, LEMA is simultaneously:
- Asynchronously transferring Layer $N+1$ from RAM to Slot B (VRAM).
- Loading Layer $N+2$ from Disk to RAM (Staging).

When Layer $N$ finishes, the two slots swap roles and the cycle repeats, as the sketch below illustrates.
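A minimal sketch of the double-buffered H2D pipeline, assuming each layer's weights are already staged in pinned RAM as a dict of tensors; `apply_layer` and the slot/event bookkeeping are illustrative, not LEMA's actual API:

```python
import torch

copy_stream = torch.cuda.Stream()                  # dedicated H2D transfer stream
slots = [None, None]                               # Slot A / Slot B weight dicts in VRAM
ready = [torch.cuda.Event(), torch.cuda.Event()]   # "transfer finished" marker per slot

def prefetch(layer_cpu, slot):
    """Asynchronously copy one layer's pinned CPU tensors into a VRAM slot."""
    with torch.cuda.stream(copy_stream):
        slots[slot] = {k: t.to("cuda", non_blocking=True) for k, t in layer_cpu.items()}
        ready[slot].record(copy_stream)

def run_pipeline(pinned_layers, x, apply_layer):
    prefetch(pinned_layers[0], 0)                        # warm up Slot A
    for n in range(len(pinned_layers)):
        active = n % 2                                   # ping-pong between the two slots
        if n + 1 < len(pinned_layers):
            prefetch(pinned_layers[n + 1], 1 - active)   # fill the other slot ahead of time
        torch.cuda.current_stream().wait_event(ready[active])  # Layer N's weights are ready
        x = apply_layer(slots[active], x)                # compute Layer N while N+1 streams in
    return x
```

The compute stream only waits on the event for its own slot, so the copy of Layer $N+1$ overlaps with the matrix multiplies of Layer $N$.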
During the forward pass:
- The model is executed layer-by-layer.
- Only "Boundary Activations" (the output of each layer) are stored in VRAM.
- Intermediate activations are discarded (see the sketch after this list).
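A minimal sketch of this boundary-only forward pass; `layers` stands for the decoder blocks already managed by the streaming pipeline above, and the names are illustrative:

```python
import torch

def forward_boundaries(layers, x):
    """Run the model layer-by-layer, keeping only each layer's output."""
    boundaries = [x]
    with torch.no_grad():            # no autograd graph is built, so
        for layer in layers:         # intermediate activations are freed immediately
            x = layer(x)
            boundaries.append(x)     # only the boundary activation stays in VRAM
    return boundaries
```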
During the backward pass:
- LEMA traverses the layers in reverse.
- For each layer (condensed in the sketch after this list):
  - The weights are swapped back into VRAM.
  - The layer's forward pass is re-executed (Segmented Gradient Checkpointing) using the stored boundary activations.
  - Gradients are calculated for the LoRA adapters.
  - Optimizer states for those specific adapters are updated.
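A condensed sketch of that reverse traversal, continuing from `forward_boundaries` above; `adapter_optimizers[i]` stands in for a hypothetical per-layer optimizer over that layer's LoRA parameters, and the weight swap back into VRAM is elided:

```python
def backward_segmented(layers, boundaries, adapter_optimizers, loss_fn, target):
    grad_out = None
    for i in reversed(range(len(layers))):
        x_in = boundaries[i].detach().requires_grad_(True)
        out = layers[i](x_in)                  # re-execute the segment with autograd enabled
        if i == len(layers) - 1:
            loss_fn(out, target).backward()    # seed the chain at the final layer
        else:
            out.backward(grad_out)             # backprop the incoming boundary gradient
        grad_out = x_in.grad                   # gradient w.r.t. this layer's input
        adapter_optimizers[i].step()           # update only this layer's adapter states
        adapter_optimizers[i].zero_grad()
```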
LEMA uses a specialized indexer to bypass standard PyTorch/Pickle deserialization. By reading the .safetensors header, LEMA knows the exact byte offsets for every parameter, allowing it to "slice" the file and load only the parameters needed for the current layer module.
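A minimal sketch of this kind of header indexing, following the published .safetensors layout (an 8-byte little-endian header length, then a JSON header with dtype, shape, and data offsets per tensor); the function name and return format are illustrative:

```python
import json
import struct

def index_safetensors(path):
    """Map each tensor name to its dtype, shape, and absolute byte range in the file."""
    with open(path, "rb") as f:
        header_len = struct.unpack("<Q", f.read(8))[0]   # u64, little-endian
        header = json.loads(f.read(header_len))
    data_start = 8 + header_len                          # tensor offsets are relative to here
    index = {}
    for name, meta in header.items():
        if name == "__metadata__":                       # optional metadata entry
            continue
        begin, end = meta["data_offsets"]
        index[name] = {
            "dtype": meta["dtype"],
            "shape": meta["shape"],
            "byte_range": (data_start + begin, data_start + end),
        }
    return index

# With this index, a loader can read (or mmap-slice) only the byte ranges that
# belong to one layer, e.g. every name starting with "model.layers.0.".
```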
- VRAM Efficiency: ~50-70% reduction for 7B+ models.
- Compute Overhead: 1.5x - 3.5x slowdown compared to fully resident training, depending on PCIe bandwidth and disk speed.
- System RAM:
- STREAMING Mode: ~2.5 GB (Pinned buffers).
- RESIDENT Mode: Requires space equal to the model size.