
LLM Quantization Benchmark

A benchmarking tool for comparing LLM quantization techniques across different hardware platforms. Measures memory footprint, inference latency, and performance trade-offs with detailed comparative analysis.

Features

  • Multi-platform support: Automatically detects and uses the appropriate device (CUDA, MPS, CPU)
  • Quantization benchmarking: Compares 8-bit quantization (CUDA) and FP16 precision (MPS/CPU)
  • Memory profiling: Detailed memory footprint tracking and analysis
  • Performance metrics: Measures and compares inference latency across configurations
  • Hugging Face integration: Seamless model loading via transformers and accelerate

Requirements

  • Python 3.12 (this exact version is required; 3.14+ has compatibility issues with PyO3)
  • PyTorch
  • Transformers
  • Accelerate
  • BitsAndBytes
  • Hugging Face CLI
  • Hugging Face account with access to Meta-Llama-3.1-8B-Instruct
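
The packages above map to a requirements.txt along these lines (package names only; the repository's exact version pins are not reproduced here):

torch
transformers
accelerate
bitsandbytes
huggingface_hub[cli]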

Setup

  1. Install dependencies:
pip install -r requirements.txt
  2. Authenticate with Hugging Face:
huggingface-cli login

You'll need a Hugging Face token from https://huggingface.co/settings/tokens (requires accepting the model's license on the model card).
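
For non-interactive environments (e.g. CI), the token can also be supplied programmatically through huggingface_hub; a minimal sketch, assuming the token is exported as the HF_TOKEN environment variable:

import os
from huggingface_hub import login

# Authenticates without the interactive prompt used by huggingface-cli login
login(token=os.environ["HF_TOKEN"])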

Usage

Run Benchmark

python quantization.py

Runs a comparison benchmark of the Llama 3.1 8B model across the supported quantization/precision configurations:

  • CUDA devices: 8-bit quantization via BitsAndBytes
  • MPS (Apple Silicon): FP16 precision
  • CPU: FP16 precision

Outputs a detailed comparison table with memory and latency metrics.
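
The device-dependent loading described above amounts to roughly the following (a minimal sketch, not the exact contents of quantization.py; the model ID is taken from the Requirements section and generation settings are omitted):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_ID = "meta-llama/Meta-Llama-3.1-8B-Instruct"

def pick_device() -> str:
    # Prefer CUDA, then Apple Silicon MPS, then plain CPU
    if torch.cuda.is_available():
        return "cuda"
    if torch.backends.mps.is_available():
        return "mps"
    return "cpu"

device = pick_device()
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

if device == "cuda":
    # 8-bit quantization is only available through BitsAndBytes on CUDA
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        quantization_config=BitsAndBytesConfig(load_in_8bit=True),
        device_map="auto",
    )
else:
    # MPS and CPU fall back to FP16 half precision
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        torch_dtype=torch.float16,
    ).to(device)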

Example Results

MacBook Pro M1 16GB (Llama 3.1 8B)

Metric                    Original                  Quantized                 Difference
──────────────────────────────────────────────────────────────────────────────────────────
Model Size (RAM)          29.92 GB                  14.96 GB                  -50.0%
Inference Time            27m 50s                   1h 28m 14s                +217.0%
Throughput                0.09 tok/s                0.03 tok/s                -68.5%
Peak Memory (Inference)   29.92 GB                  14.96 GB                  -50.0%
Output Similarity         N/A                       0.672
Data Types                {torch.float32}           FP16

Key Insights:

  • 50% memory savings with FP16 quantization enables running 8B models on 16GB unified memory
  • Trade-off: Significant speed reduction on Apple Silicon (MPS not optimized for FP16)
  • Output quality: 0.672 semantic similarity indicates quantization preserves model meaning reasonably well
  • On NVIDIA GPUs, the speed trade-off is typically reversed (faster inference with 8-bit quantization)
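
The repository does not spell out how the output-similarity score is computed; one common approach that yields a score in this range is cosine similarity between sentence embeddings of the two generations. A sketch under that assumption (sentence-transformers is not listed in Requirements and is used here purely for illustration):

from sentence_transformers import SentenceTransformer, util

def output_similarity(original_text: str, quantized_text: str) -> float:
    # Cosine similarity of sentence embeddings; 1.0 means semantically identical outputs
    encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
    emb = encoder.encode([original_text, quantized_text], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()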

Project Structure

.
├── quantization.py      # Quantized model loading and inference
├── utils.py             # Helper utilities (byte humanization)
├── requirements.txt     # Project dependencies
└── README.md           # This file
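
As a reference for the byte-humanization helper mentioned above, a minimal version could look like this (a sketch; the actual function in utils.py may use a different name or formatting):

def humanize_bytes(num_bytes: float) -> str:
    # e.g. 16_059_637_760 bytes -> "14.96 GB"
    for unit in ("B", "KB", "MB", "GB", "TB"):
        if num_bytes < 1024 or unit == "TB":
            return f"{num_bytes:.2f} {unit}"
        num_bytes /= 1024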

Hardware Compatibility

Apple Silicon (M1/M2/M3)

  • GPU acceleration via Metal Performance Shaders (MPS)
  • Constraint: 8-bit quantization unavailable (BitsAndBytes is CUDA-only)
  • Falls back to FP16 half precision instead of integer quantization (~50% memory savings vs. FP32)
  • Optimal for 8B models with 16GB+ unified memory
  • Requires Python 3.12 for full compatibility

NVIDIA GPUs

  • 8-bit quantization via BitsAndBytes (~75% memory savings vs. FP32)
  • Superior inference performance with optimized CUDA kernels
  • Requires CUDA-compatible PyTorch installation
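
On either platform, a quick one-off check of which accelerator PyTorch can see (not part of the repository) is:

python -c "import torch; print('cuda:', torch.cuda.is_available(), '| mps:', torch.backends.mps.is_available())"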

License

MIT License - See LICENSE file for details.
