A CLI utility that checks available GPU VRAM before you load AI models. Prevents OOM crashes that force a full system reboot.
If you run local inference on consumer GPUs, you know the pain:
| Without gpu-memory-guard | With gpu-memory-guard |
|---|---|
| Load 70B model on 24GB card | Check VRAM before loading |
| System freezes, GPU hangs | Get a clear warning in terminal |
| Force reboot, lose unsaved work | Pick a smaller model or free memory |
| Repeat next week | Avoid load-time OOM crashes |
One command saves you from constant reboots.
git clone https://github.com/CastelDazur/gpu-memory-guard.git
cd gpu-memory-guard
pip install -e .

# Check current GPU status
gpu-guard
# Check if an 18GB model fits with 2GB safety buffer
gpu-guard --model-size 18 --buffer 2

Example output:
GPU 0: NVIDIA GeForce RTX 5090
Total: 32.00 GB
Used: 4.12 GB
Available: 27.88 GB
Model size: 18.00 GB (buffer: 2.00 GB)
Status: OK - model fits with 7.88 GB to spare
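
The arithmetic behind the status line can be sketched in a few lines of Python. This illustrates the check implied by the example output and is not the tool's actual source. The function name `headroom_gb` is hypothetical, and the 1 GB default buffer mirrors the CLI default documented in the usage section.

```python
def headroom_gb(available_gb: float, model_size_gb: float, buffer_gb: float = 1.0) -> float:
    """Return the VRAM left after reserving the model plus a safety buffer.

    A negative result means the model does not fit.
    """
    return available_gb - (model_size_gb + buffer_gb)


# Values taken from the example output above.
spare = headroom_gb(available_gb=27.88, model_size_gb=18.0, buffer_gb=2.0)
status = "OK" if spare >= 0 else "WARNING"
print(f"Status: {status} ({spare:.2f} GB headroom)")
```

Treating the buffer as part of the requirement, rather than as an afterthought, leaves room for allocations that happen after the weights are loaded.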
- MODEL_COMPATIBILITY.md - Sizing reference for GPUs, models, and quantizations (Q4_K_M, Q5_K_M, Q8_0, FP16) with KV cache tables and the mmproj trap for vision-language models.
- TROUBLESHOOTING.md - Field guide to the five CUDA OOM errors you will actually see, with a diagnostic checklist and notes on vLLM, llama.cpp, and Ollama quirks.
git clone https://github.com/CastelDazur/gpu-memory-guard.git
cd gpu-memory-guard
pip install -e .

- Python 3.8+
- NVIDIA GPU with `nvidia-smi` installed, OR the `pynvml` Python package (`pip install pynvml`)
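
For readers curious how the `nvidia-smi` path could work, here is a minimal sketch that parses the CSV output of a standard `nvidia-smi --query-gpu` call. It is an assumption about one reasonable approach, not a copy of what gpu-memory-guard does internally. The names `GpuMemory`, `parse_nvidia_smi_csv`, and `query_gpus` are hypothetical, and the sample line is made up for illustration.

```python
import subprocess
from typing import List, NamedTuple


class GpuMemory(NamedTuple):
    index: int
    name: str
    total_gb: float
    used_gb: float

    @property
    def available_gb(self) -> float:
        return self.total_gb - self.used_gb


def parse_nvidia_smi_csv(text: str) -> List[GpuMemory]:
    """Parse 'index, name, memory.total, memory.used' rows (memory in MiB)."""
    gpus = []
    for line in text.strip().splitlines():
        index, name, total_mib, used_mib = [field.strip() for field in line.split(",")]
        # nvidia-smi reports MiB with nounits; dividing by 1024 gives GiB.
        gpus.append(GpuMemory(int(index), name, float(total_mib) / 1024, float(used_mib) / 1024))
    return gpus


def query_gpus() -> List[GpuMemory]:
    """Run nvidia-smi and return per-GPU memory; raises if nvidia-smi is missing."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,name,memory.total,memory.used",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_nvidia_smi_csv(out)


# Illustrative input only; on a real machine call query_gpus() instead.
sample = "0, NVIDIA GeForce RTX 5090, 32768, 4096\n"
gpus = parse_nvidia_smi_csv(sample)
print(f"GPU {gpus[0].index}: {gpus[0].available_gb:.2f} GB available")
```

Keeping the parser separate from the subprocess call makes it testable on machines without a GPU.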
# Basic VRAM check
gpu-guard
# Check if a model fits (size in GB)
gpu-guard --model-size 13
# Custom safety buffer (default: 1GB)
gpu-guard --model-size 18 --buffer 2
# JSON output for scripting
gpu-guard --model-size 13 --json
# Quiet mode: exit code only (0 = fits, 1 = doesn't)
gpu-guard --model-size 7 --quiet

from gpu_guard import check_vram, can_load_model, get_gpu_info

# Check current VRAM
gpu_info = get_gpu_info()
for gpu in gpu_info:
    print(f"GPU {gpu.device_id}: {gpu.available_memory_gb:.2f}GB available")

# Check if a model fits
result = can_load_model(model_size_gb=13.0, buffer_gb=2.0)
if result.fits:
    print("Safe to load")
else:
    print(f"Need {result.shortage_gb:.2f}GB more VRAM")

# Pre-check before launching inference
if gpu-guard --model-size 13 --quiet; then
    python run_inference.py --model llama-13b
else
    echo "Not enough VRAM, switching to 7B model"
    python run_inference.py --model llama-7b
fi

| Model | FP16 | Q4 (GGUF) |
|---|---|---|
| 7B params | ~14 GB | ~4 GB |
| 13B params | ~26 GB | ~7 GB |
| 33B params | ~66 GB | ~18 GB |
| 70B params | ~140 GB | ~35 GB |
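
The FP16 column follows directly from the parameter count: 16 bits is 2 bytes per weight. A rough sketch of that arithmetic is below. The helper name `estimate_weights_gb` is hypothetical and is not part of the `gpu_guard` API. It covers weights only, so KV cache and runtime overhead come on top.

```python
def estimate_weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (10^9 bytes)."""
    # (params_billion * 1e9 weights) * (bits / 8 bytes) / 1e9 bytes per GB
    return params_billion * bits_per_weight / 8


for params in (7, 13, 33, 70):
    fp16 = estimate_weights_gb(params, 16)
    q4 = estimate_weights_gb(params, 4)
    print(f"{params}B: FP16 ~{fp16:.0f} GB, plain 4-bit ~{q4:.1f} GB")
```

Practical Q4 GGUF variants generally store somewhat more than 4 bits per weight on average, so treat the plain 4-bit figure as a lower bound and pass a larger `bits_per_weight` when you want a more conservative estimate.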
- AMD ROCm support
- Memory estimation by model architecture
- Multi-GPU split recommendations
- PyPI package (`pip install gpu-memory-guard`)
- Integration with Ollama and vLLM
PRs welcome. If you want to add AMD ROCm support or model-specific memory estimation, open an issue first so we can discuss the approach.
MIT