LLM Compression Toolkit

A collection of helper scripts for quantizing and evaluating large language models using various compression techniques.

Current Support

AWQ (Activation-aware Weight Quantization)

Framework: llm-compressor (vLLM ecosystem)
Quantization: W4A16 (4-bit weights, 16-bit activations)
Features:
- Automated oneshot calibration
- Configurable group size (default: 128)
- Parallel processing (CPU workers)
- Memory-optimized settings
- vLLM-compatible outputs
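
To make "4-bit weights with group size 128" concrete, here is a minimal sketch of group-wise round-to-nearest quantization in plain Python. It is not AWQ itself: the activation-aware channel scaling that AWQ applies before rounding is left out, and the sketch uses an asymmetric min/max range for simplicity. The function names (`quantize_group`, `dequantize_group`, `quantize_row`) are hypothetical and are not part of llm-compressor.

```python
import random

def quantize_group(weights, bits=4):
    """Round-to-nearest quantization of one group (asymmetric min/max range)."""
    qmax = (1 << bits) - 1                    # 15 for 4-bit codes
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / qmax or 1.0           # fall back to 1.0 for a constant group
    codes = [round((w - lo) / scale) for w in weights]  # integers in [0, qmax]
    return codes, scale, lo

def dequantize_group(codes, scale, lo):
    """Map integer codes back to approximate float weights."""
    return [c * scale + lo for c in codes]

def quantize_row(row, group_size=128, bits=4):
    """Quantize and restore a weight row, one group of `group_size` at a time."""
    restored = []
    for start in range(0, len(row), group_size):
        group = row[start:start + group_size]
        restored.extend(dequantize_group(*quantize_group(group, bits)))
    return restored

random.seed(0)
row = [random.gauss(0.0, 0.02) for _ in range(512)]   # made-up weights
restored = quantize_row(row)
max_err = max(abs(a - b) for a, b in zip(row, restored))
print(f"max reconstruction error: {max_err:.6f}")
```

Each group stores its own scale and offset, so a smaller group size tracks local weight ranges more closely at the cost of more metadata per weight.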

Quick Start

Prerequisites

pip install llmcompressor vllm lm-eval

Quantize with AWQ

Run quantization:

# Defaults: 2048 calibration samples, 3072 sequence length, WikiText-103
./compress.sh

# Custom settings
MAX_CALIB_SAMPLES=1024 MAX_SEQ_LEN=2048 ./compress.sh

Evaluate perplexity:

# Compare baseline vs quantized
./benchmark.sh
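
For reference, perplexity is the exponential of the average negative log-likelihood per token. The sketch below shows that formula and the relative increase between a baseline and a quantized model. How benchmark.sh computes its numbers is not shown in this README, so treat the helper names and the toy probabilities as illustrative assumptions only.

```python
import math

def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood (natural log) per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def relative_increase(baseline_ppl, quantized_ppl):
    """Perplexity change of the quantized model, in percent of the baseline."""
    return (quantized_ppl - baseline_ppl) / baseline_ppl * 100.0

# Toy inputs: every token is assigned the same probability.
baseline = perplexity([math.log(0.25)] * 8)    # p = 0.25 per token
quantized = perplexity([math.log(0.20)] * 8)   # p = 0.20 per token
print(f"baseline={baseline:.2f} quantized={quantized:.2f} "
      f"increase={relative_increase(baseline, quantized):.1f}%")
```

Lower is better, so a quantized model is expected to land slightly above the baseline.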

Test inference:

python eval.py

Outputs

- Quantized model: ./models/deepseek-r1-llama-8b-llmc/
- Benchmarks: ./outputs/<timestamp>/
- Logs: ./logs/

Configuration

Environment Variables

| Variable | Default | Description |
| --- | --- | --- |
| `MODEL_ID` | `deepseek-ai/DeepSeek-R1-Distill-Llama-8B` | Model to quantize |
| `OUT_DIR` | `./models/deepseek-r1-llama-8b-llmc` | Output directory |
| `DATASET_NAME` | `wikitext` | Calibration dataset (registry name) |
| `DATASET_CONFIG_NAME` | `wikitext-103-raw-v1` | Dataset config |
| `MAX_CALIB_SAMPLES` | `2048` | Number of calibration samples |
| `MAX_SEQ_LEN` | `3072` | Maximum sequence length (tokens) |
| `SCHEME` | `W4A16` | Quantization scheme |
| `PREPROC_WORKERS` | `$(nproc)` | CPU workers for preprocessing |
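
As an illustration of how these variables resolve, the sketch below reads them from an environment mapping and falls back to the defaults in the table. `load_config` is a hypothetical helper, not code from this repository, and `os.cpu_count()` only approximates `$(nproc)` (it ignores CPU affinity limits).

```python
import os

# Defaults copied from the table above.
DEFAULTS = {
    "MODEL_ID": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    "OUT_DIR": "./models/deepseek-r1-llama-8b-llmc",
    "DATASET_NAME": "wikitext",
    "DATASET_CONFIG_NAME": "wikitext-103-raw-v1",
    "MAX_CALIB_SAMPLES": "2048",
    "MAX_SEQ_LEN": "3072",
    "SCHEME": "W4A16",
}
INT_KEYS = ("MAX_CALIB_SAMPLES", "MAX_SEQ_LEN", "PREPROC_WORKERS")

def load_config(env=None):
    """Resolve settings from `env` (os.environ by default), applying defaults."""
    env = os.environ if env is None else env
    cfg = {key: env.get(key, default) for key, default in DEFAULTS.items()}
    # Approximation of $(nproc); at least one worker.
    cfg["PREPROC_WORKERS"] = env.get("PREPROC_WORKERS", str(os.cpu_count() or 1))
    for key in INT_KEYS:
        cfg[key] = int(cfg[key])              # numeric settings arrive as strings
    return cfg

# Equivalent of: MAX_CALIB_SAMPLES=1024 MAX_SEQ_LEN=2048 <script>
cfg = load_config({"MAX_CALIB_SAMPLES": "1024", "MAX_SEQ_LEN": "2048"})
print(cfg["MAX_CALIB_SAMPLES"], cfg["MAX_SEQ_LEN"], cfg["SCHEME"])
```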

Advanced Options

- PAD_TO_MAX_LEN=false: Avoid padding to save VRAM
- SHUFFLE_CALIB=true: Shuffle calibration samples
- PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True: Reduce CUDA fragmentation
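
To see why skipping padding helps, the sketch below counts how many tokens a calibration pass would push through the model with and without padding to the maximum length. It assumes samples are truncated to the maximum length and processed one at a time; the sample lengths are made up, `calibration_tokens` is a hypothetical helper, and token count is only a rough proxy for activation memory.

```python
def calibration_tokens(lengths, max_seq_len, pad_to_max_len):
    """Total tokens processed for samples with the given raw token lengths."""
    truncated = [min(n, max_seq_len) for n in lengths]   # long samples are cut
    if pad_to_max_len:
        return len(truncated) * max_seq_len              # every sample is padded
    return sum(truncated)                                # only real tokens count

lengths = [512, 1800, 3072, 4000, 950]                   # made-up token counts
padded = calibration_tokens(lengths, 3072, pad_to_max_len=True)
unpadded = calibration_tokens(lengths, 3072, pad_to_max_len=False)
print(padded, unpadded)   # 15360 vs 9406 tokens
```

The gap grows as the calibration set contains more short samples relative to the maximum sequence length.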

Future Support

Planned additions:

- GPTQ (Post-training quantization using approximate second-order information)
- SmoothQuant (Activation quantization)
- FP8 (8-bit floating-point quantization)
- SparseGPT (Weight sparsity)
- Custom dataset support (JSON/CSV)

Troubleshooting

Common Issues

Out of Memory (OOM):
- Reduce MAX_SEQ_LEN (e.g., 2048)
- Set PAD_TO_MAX_LEN=false
- Lower MAX_CALIB_SAMPLES
Dataset Not Found:
- Use registered datasets: wikitext, c4, open-platypus
- For custom: set DATASET_NAME=custom and DATASET_PATH=/path/to/data.json
High Perplexity After Quantization:
- Increase MAX_CALIB_SAMPLES
- Use domain-matched calibration data
- Try different schemes (W4A16_ASYM)

Performance Notes

- Inference speed: AWQ W4A16 typically achieves ~1.5x speedup vs FP16 (varies with hardware and batch size)
- Memory reduction: ~60–70% (about 5GB vs 15GB of weights for 8B models)
- Perplexity increase: Usually 5–10% on general tasks
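
The memory figure can be sanity-checked with simple arithmetic. The sketch below assumes 4-bit weights plus one 16-bit scale per group of 128, and it assumes that roughly 1B of the 8B parameters (for example embeddings and the output head) stay in FP16. That 1B split is an illustrative assumption, not a number measured from this repository.

```python
GIB = 2 ** 30

def fp16_gib(n_params):
    """Weight size in GiB at 2 bytes per parameter."""
    return n_params * 2 / GIB

def w4a16_gib(quantized_params, fp16_params=0.0, group_size=128):
    """Weight size in GiB: 4-bit codes plus one 16-bit scale per group."""
    bits_per_weight = 4 + 16 / group_size     # 4.125 bits at group size 128
    total_bytes = quantized_params * bits_per_weight / 8 + fp16_params * 2
    return total_bytes / GIB

baseline = fp16_gib(8e9)                              # about 14.9 GiB
compressed = w4a16_gib(7e9, fp16_params=1e9)          # about 5.2 GiB
reduction = 1 - compressed / baseline                 # about 0.65
print(f"{baseline:.1f} GiB -> {compressed:.1f} GiB ({reduction:.0%} smaller)")
```

Under these assumptions the estimate lands inside the ~60–70% range quoted above; quantizing every parameter would give a larger reduction.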

Contributing

Add new compression methods by:

- Creating a new script (e.g., run_gptq.sh)
- Updating this README
- Adding benchmark support

License

MIT License
