A benchmarking tool for comparing LLM quantization techniques across different hardware platforms. Measures memory footprint, inference latency, and performance trade-offs with detailed comparative analysis.
- Multi-platform support: Automatically detects and uses the appropriate device (CUDA, MPS, CPU)
- Quantization benchmarking: Compares 8-bit quantization (CUDA) and FP16 precision (MPS/CPU)
- Memory profiling: Detailed memory footprint tracking and analysis
- Performance metrics: Measures and compares inference latency across configurations
- Hugging Face integration: Seamless model loading via transformers and accelerate
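As a rough illustration, the device selection described above might look like the following (a minimal sketch; the actual logic lives in quantization.py and may differ):

```python
import torch


def detect_device() -> str:
    """Pick the best available backend: CUDA, then Apple Silicon (MPS), then CPU."""
    if torch.cuda.is_available():
        return "cuda"
    if torch.backends.mps.is_available():
        return "mps"
    return "cpu"


print(f"Benchmark device: {detect_device()}")
```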
- Python 3.12 (exact version; 3.14+ has compatibility issues with PyO3)
- PyTorch
- Transformers
- Accelerate
- BitsAndBytes
- Hugging Face CLI
- Hugging Face account with access to Meta-Llama-3.1-8B-Instruct
- Install dependencies:

```bash
pip install -r requirements.txt
```

- Authenticate with Hugging Face:

```bash
huggingface-cli login
```

You'll need a Hugging Face token from https://huggingface.co/settings/tokens (requires accepting the model's license on the model card).

```bash
python quantization.py
```

Executes a comprehensive comparison benchmark of the Llama 3.1 8B model across quantization methods:
- CUDA devices: 8-bit quantization via BitsAndBytes
- MPS (Apple Silicon): FP16 precision
- CPU: FP16 precision
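A minimal sketch of how this per-device loading could be wired up with transformers and BitsAndBytes (the repo id, function name, and exact arguments are assumptions; quantization.py is the authoritative implementation):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_ID = "meta-llama/Meta-Llama-3.1-8B-Instruct"  # assumed Hugging Face repo id


def load_for_device(device: str):
    """Load Llama 3.1 8B with 8-bit quantization on CUDA, FP16 elsewhere."""
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    if device == "cuda":
        # BitsAndBytes 8-bit quantization (CUDA-only)
        model = AutoModelForCausalLM.from_pretrained(
            MODEL_ID,
            quantization_config=BitsAndBytesConfig(load_in_8bit=True),
            device_map="auto",
        )
    else:
        # MPS / CPU fallback: half precision instead of 8-bit
        model = AutoModelForCausalLM.from_pretrained(
            MODEL_ID, torch_dtype=torch.float16
        ).to(device)
    return model, tokenizer
```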
Outputs a detailed comparison table with memory and latency metrics.
```
Metric                     Original            Quantized        Difference
───────────────────────────────────────────────────────────────────────────
Model Size (RAM)           29.92 GB            14.96 GB             -50.0%
Inference Time             27m 50s             1h 28m 14s          +217.0%
Throughput                 0.09 tok/s          0.03 tok/s           -68.5%
Peak Memory (Inference)    29.92 GB            14.96 GB             -50.0%
Output Similarity          N/A                 0.672
Data Types                 {torch.float32}     FP16
```
Key Insights:
- 50% memory savings with FP16 quantization enables running 8B models on 16GB unified memory
- Trade-off: Significant speed reduction on Apple Silicon (MPS not optimized for FP16)
- Output quality: 0.672 semantic similarity indicates quantization preserves model meaning reasonably well
- On NVIDIA GPUs, the speed trade-off is typically reversed (faster inference with 8-bit quantization)
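For context, a semantic-similarity score like the one above is commonly computed as cosine similarity between embeddings of the two outputs. The sketch below uses sentence-transformers purely for illustration; it is not listed in the requirements and may not be what quantization.py actually uses:

```python
from sentence_transformers import SentenceTransformer, util


def output_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between sentence embeddings of two generations."""
    embedder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative embedding model
    emb = embedder.encode([text_a, text_b], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()
```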
```
.
├── quantization.py     # Quantized model loading and inference
├── utils.py            # Helper utilities (byte humanization)
├── requirements.txt    # Project dependencies
└── README.md           # This file
```
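The byte-humanization helper in utils.py typically looks something like the sketch below (the function name and exact formatting are assumptions):

```python
def humanize_bytes(num_bytes: float) -> str:
    """Format a byte count for display, e.g. 16_059_000_000 -> '14.96 GB'."""
    for unit in ("B", "KB", "MB", "GB"):
        if num_bytes < 1024:
            return f"{num_bytes:.2f} {unit}"
        num_bytes /= 1024
    return f"{num_bytes:.2f} TB"
```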
- Apple Silicon: GPU acceleration via Metal Performance Shaders (MPS)
- Constraint: 8-bit quantization unavailable (BitsAndBytes is CUDA-only)
- Falls back to FP16 precision (~50% memory savings vs. FP32)
- Optimal for 8B models with 16GB+ unified memory
- Requires Python 3.12 for full compatibility
- NVIDIA GPUs: CUDA-enabled 8-bit quantization via BitsAndBytes (~75% memory savings vs. FP32)
- Superior inference performance with optimized CUDA kernels
- Requires a CUDA-compatible PyTorch installation
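For reference, per-configuration latency, throughput, and peak memory can be measured along these lines (a hedged sketch; the helper name is illustrative, and the peak-memory tracking shown is CUDA-specific):

```python
import time

import torch


def benchmark_generate(model, tokenizer, prompt: str, max_new_tokens: int = 128) -> dict:
    """Time one generation pass and report latency, throughput, and peak memory."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    if torch.cuda.is_available():
        torch.cuda.reset_peak_memory_stats()

    start = time.perf_counter()
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    elapsed = time.perf_counter() - start

    new_tokens = output.shape[-1] - inputs["input_ids"].shape[-1]
    peak = torch.cuda.max_memory_allocated() if torch.cuda.is_available() else None
    return {
        "latency_s": elapsed,
        "throughput_tok_s": new_tokens / elapsed,
        "peak_memory_bytes": peak,  # None off CUDA; model.get_memory_footprint() is another option
    }
```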
MIT License - See LICENSE file for details.