A benchmarking tool for comparing LLM quantization techniques across different hardware platforms. Measures memory footprint, inference latency, and performance trade-offs with detailed comparative analysis.
- Multi-platform support: Automatically detects and uses the appropriate device (CUDA, MPS, CPU)
- Quantization benchmarking: Compares 8-bit quantization (CUDA) and FP16 precision (MPS/CPU)
- Memory profiling: Detailed memory footprint tracking and analysis
- Performance metrics: Measures and compares inference latency across configurations
- Hugging Face integration: Seamless model loading via transformers and accelerate
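As a rough illustration, the device selection described above might look like the following (a minimal sketch; the actual logic lives in quantization.py and may differ):

```python
import torch


def detect_device() -> str:
    """Pick the best available backend: CUDA, then Apple Silicon (MPS), then CPU."""
    if torch.cuda.is_available():
        return "cuda"
    if torch.backends.mps.is_available():
        return "mps"
    return "cpu"


print(f"Benchmark device: {detect_device()}")
```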
- Python 3.12 (exact version; 3.14+ has compatibility issues with PyO3)
- PyTorch
- Transformers
- Accelerate
- BitsAndBytes
- Hugging Face CLI
- Hugging Face account with access to Meta-Llama-3.1-8B-Instruct
- Install dependencies:

```bash
pip install -r requirements.txt
```

- Authenticate with Hugging Face:

```bash
huggingface-cli login
```

You'll need a Hugging Face token from https://huggingface.co/settings/tokens (requires accepting the model's license on the model card).

```bash
python quantization.py
```

Executes a comprehensive comparison benchmark of the Llama 3.1 8B model across quantization methods:
- CUDA devices: 8-bit quantization via BitsAndBytes
- MPS (Apple Silicon): FP16 precision
- CPU: FP16 precision
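A minimal sketch of how this per-device loading could be wired up with transformers and BitsAndBytes (the repo id, function name, and exact arguments are assumptions; quantization.py is the authoritative implementation):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_ID = "meta-llama/Meta-Llama-3.1-8B-Instruct"  # assumed Hugging Face repo id


def load_for_device(device: str):
    """Load Llama 3.1 8B with 8-bit quantization on CUDA, FP16 elsewhere."""
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    if device == "cuda":
        # BitsAndBytes 8-bit quantization (CUDA-only)
        model = AutoModelForCausalLM.from_pretrained(
            MODEL_ID,
            quantization_config=BitsAndBytesConfig(load_in_8bit=True),
            device_map="auto",
        )
    else:
        # MPS / CPU fallback: half precision instead of 8-bit
        model = AutoModelForCausalLM.from_pretrained(
            MODEL_ID, torch_dtype=torch.float16
        ).to(device)
    return model, tokenizer
```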
Outputs a detailed comparison table with memory and latency metrics.
```
Metric                     Original            Quantized        Difference
───────────────────────────────────────────────────────────────────────────
Model Size (RAM)           29.92 GB            14.96 GB             -50.0%
Inference Time             27m 50s             1h 28m 14s          +217.0%
Throughput                 0.09 tok/s          0.03 tok/s           -68.5%
Peak Memory (Inference)    29.92 GB            14.96 GB             -50.0%
Output Similarity          N/A                 0.672
Data Types                 {torch.float32}     FP16
```
Key Insights:
- 50% memory savings with FP16 quantization enables running 8B models on 16GB unified memory
- Trade-off: Significant speed reduction on Apple Silicon (MPS not optimized for FP16)
- Output quality: 0.672 semantic similarity indicates quantization preserves model meaning reasonably well
- On NVIDIA GPUs, the speed trade-off is typically reversed (faster inference with 8-bit quantization)
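For context, a semantic-similarity score like the one above is commonly computed as cosine similarity between embeddings of the two outputs. The sketch below uses sentence-transformers purely for illustration; it is not listed in the requirements and may not be what quantization.py actually uses:

```python
from sentence_transformers import SentenceTransformer, util


def output_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between sentence embeddings of two generations."""
    embedder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative embedding model
    emb = embedder.encode([text_a, text_b], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()
```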
```
.
├── quantization.py     # Quantized model loading and inference
├── utils.py            # Helper utilities (byte humanization)
├── requirements.txt    # Project dependencies
└── README.md           # This file
```
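The byte-humanization helper in utils.py typically looks something like the sketch below (the function name and exact formatting are assumptions):

```python
def humanize_bytes(num_bytes: float) -> str:
    """Format a byte count for display, e.g. 16_059_000_000 -> '14.96 GB'."""
    for unit in ("B", "KB", "MB", "GB"):
        if num_bytes < 1024:
            return f"{num_bytes:.2f} {unit}"
        num_bytes /= 1024
    return f"{num_bytes:.2f} TB"
```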
- Apple Silicon: GPU acceleration via Metal Performance Shaders (MPS)
- Constraint: 8-bit quantization unavailable (BitsAndBytes is CUDA-only)
- Falls back to FP16 precision (~50% memory savings vs. FP32)
- Optimal for 8B models with 16GB+ unified memory
- Requires Python 3.12 for full compatibility
- NVIDIA GPUs: CUDA-enabled 8-bit quantization via BitsAndBytes (~75% memory savings vs. FP32)
- Superior inference performance with optimized CUDA kernels
- Requires a CUDA-compatible PyTorch installation
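For reference, per-configuration latency, throughput, and peak memory can be measured along these lines (a hedged sketch; the helper name is illustrative, and the peak-memory tracking shown is CUDA-specific):

```python
import time

import torch


def benchmark_generate(model, tokenizer, prompt: str, max_new_tokens: int = 128) -> dict:
    """Time one generation pass and report latency, throughput, and peak memory."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    if torch.cuda.is_available():
        torch.cuda.reset_peak_memory_stats()

    start = time.perf_counter()
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    elapsed = time.perf_counter() - start

    new_tokens = output.shape[-1] - inputs["input_ids"].shape[-1]
    peak = torch.cuda.max_memory_allocated() if torch.cuda.is_available() else None
    return {
        "latency_s": elapsed,
        "throughput_tok_s": new_tokens / elapsed,
        "peak_memory_bytes": peak,  # None off CUDA; model.get_memory_footprint() is another option
    }
```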
MIT License - See LICENSE file for details.