Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -25,12 +25,16 @@ Before you begin, make sure your environment meets these requirements:

This Learning Path was tested on a 96 core machine with 128-bit SVE, 192 GB of RAM and 500 GB of attached storage.

{{% notice Note %}}
Ubuntu 26.04 and later ship with Python 3.14 as the system default. vLLM does not currently support Python 3.14. This Learning Path explicitly installs and uses Python 3.12, so follow the steps as written regardless of your Ubuntu version.
{{% /notice %}}

## Install build dependencies

Install the following packages required for running inference with vLLM on Arm64:
```bash
sudo apt-get update -y
sudo apt install -y python3.12-venv python3.12-dev
sudo apt install -y python3.12-venv python3.12-dev gcc g++ build-essential
```

Now install tcmalloc, a fast memory allocator from Google’s gperftools, which improves performance under high concurrency:
Expand All @@ -49,6 +53,10 @@ python -m pip install --upgrade pip

## Install vLLM for CPU

{{% notice Note %}}
The following command installs vLLM version 0.20.0. The same steps work with other versions — replace the version number in the URL with your chosen release. To find the latest version, see the [vLLM releases page](https://github.com/vllm-project/vllm/releases).
{{% /notice %}}

Install a CPU-specific build of vLLM:
```bash
export VLLM_VERSION=0.20.0
Expand All @@ -60,11 +68,18 @@ If you wish to build vLLM from source you can follow the instructions in the [Bu

## Set up access to LLama3.1-8B models

To access the Llama models hosted by Hugging Face, you will need to install the Hugging Face CLI so that you can authenticate yourself and the harness can download what it needs. You should create an account on https://huggingface.co/ and follow the instructions [in the Hugging Face CLI guide](https://huggingface.co/docs/huggingface_hub/en/guides/cli) to install the CLI and setup your access token. You can then login to HF:
To access the Llama models hosted by Hugging Face, you need to install the Hugging Face CLI and authenticate with your access token. Install the CLI with:
```bash
curl -LsSf https://hf.co/cli/install.sh | bash
```

For more details and alternative installation methods, see the [Hugging Face CLI guide](https://huggingface.co/docs/huggingface_hub/en/guides/cli).

Create an account on [huggingface.co](https://huggingface.co/) if you don't already have one, then generate an access token in your account settings. Log in with:
```bash
hf auth login
```

Paste your access token into the terminal when prompted. To access Llama3.1-8B you need to request access on the Hugging Face website. Visit https://huggingface.co/meta-llama/Llama-3.1-8B and select "Expand to review and access". Complete the form and you should be granted access in a matter of minutes.
Paste your access token into the terminal when prompted. To access Llama3.1-8B you need to request access on the Hugging Face website. Visit [meta-llama/Llama-3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B) and select "Expand to review and access". Complete the form and you should be granted access in a matter of minutes.

Your environment is now setup to run inference with vLLM. Next, we'll review model quantisation and then you'll use vLLM to run inference on both quantised and non-quantised Llama and Whisper models.
Your environment is now setup to run inference with vLLM. Next, we'll review model quantization and then you'll use vLLM to run inference on both quantized and non-quantized Llama and Whisper models.
Original file line number Diff line number Diff line change
@@ -1,33 +1,41 @@
---
title: Quantisation Recipe
title: Quantization Recipe
weight: 3

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Understanding quantisation
## Understanding quantization

Quantised models have their weights converted to a lower precision data type which reduces the memory requirements of the model and can improve performance significantly. In the [Run vLLM inference with INT4 quantization on Arm servers](/learning-paths/servers-and-cloud-computing/vllm-acceleration/) Learning Path we have covered how to quantise a model yourself. There are also many publicly available quantised versions of popular models, such as https://huggingface.co/RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8 and https://huggingface.co/RedHatAI/whisper-large-v3-quantized.w8a8, which we will be using in this Learning Path.
Quantized models have their weights converted to a lower precision data type, which reduces the memory requirements of the model and can improve performance significantly. In the [Run vLLM inference with INT4 quantization on Arm servers](/learning-paths/servers-and-cloud-computing/vllm-acceleration/) Learning Path, you can learn how to quantize a model yourself. There are also many publicly available quantized versions of popular models, such as [RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8](https://huggingface.co/RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8) and [RedHatAI/whisper-large-v3-quantized.w8a8](https://huggingface.co/RedHatAI/whisper-large-v3-quantized.w8a8), which this Learning Path uses.

The notation w8a8 means that the weights have been quantised to 8-bit integers and the activations (the input data) are dynamically quantised to the same. This allows our kernels to utilise Arm's 8-bit integer matrix multiply feature I8MM. You can learn more about this in the [KleidiAI and matrix multiplication](/learning-paths/cross-platform/kleidiai-explainer/) Learning Path.
The notation w8a8 means that the weights have been quantized to 8-bit integers and the activations (the input data) are dynamically quantized to the same. This allows Arm's 8-bit integer matrix multiply feature I8MM to be used. You can learn more about this in the [KleidiAI and matrix multiplication](/learning-paths/cross-platform/kleidiai-explainer/) Learning Path.

The w8a8 models we are using in this Learning Path only apply quantisation to the weights and activations in the linear layers of the transformer blocks. The activation quantisations are applied per-token and the weights are quantised per-channel. That is, each output channel dimension has a scaling factor applied between INT8 and BF16 representations.
The w8a8 models used in this Learning Path only apply quantization to the weights and activations in the linear layers of the transformer blocks. The activation quantizations are applied per-token and the weights are quantized per-channel. That is, each output channel dimension has a scaling factor applied between INT8 and BF16 representations.

## Quantising your own models
## Quantizing your own models (optional)

If you would prefer to generate your own w8a8 quantised models, the recipe below is provided as an example. This is an optional activity and not a core part of this Learning Path, as it can take several hours to run.
{{% notice Note %}}
This section is optional. The rest of this Learning Path uses pre-quantized models from Hugging Face and does not require you to run this recipe. Quantizing a model yourself can take several hours.
{{% /notice %}}

If you prefer to generate your own w8a8 quantized model rather than using the pre-quantized RedHat models, the recipe below shows how. Install the required packages before running the quantization script.

{{% notice Note %}}
The following commands use specific package versions that were tested with this recipe. To find the latest versions, see [llmcompressor](https://github.com/vllm-project/llm-compressor/releases), [compressed-tensors](https://github.com/neuralmagic/compressed-tensors/releases), and [datasets](https://github.com/huggingface/datasets/releases) on GitHub.
{{% /notice %}}

You will need to install the required packages before running the quantisation script.
```bash
pip install compressed-tensors==0.14.0.1
pip install llmcompressor==0.10.0.1
pip install datasets==4.6.0

python w8a8_quant.py
```

Where w8a8_quant.py contains:
The script uses GPTQ (Generalized Post-Training Quantization) to calibrate the quantization scales. It loads 256 samples from a calibration dataset, runs a forward pass through each linear layer, and computes per-channel weight scales and per-token activation scales. The output is saved as a quantized model in the `Meta-Llama-3.1-8B-quantized.w8a8` directory.

Create a file named `w8a8_quant.py` with the following content:

```python
from transformers import AutoTokenizer
from datasets import Dataset, load_dataset
Expand All @@ -37,7 +45,7 @@ from llmcompressor.modifiers.quantization import GPTQModifier
from compressed_tensors.quantization import QuantizationType, QuantizationStrategy
import random

model_id = "meta-llama/Meta-Llama-3.1-8B"
model_id = "meta-llama/Meta-Llama-3.1-8B" # Note: this uses the Meta-prefixed model ID required by llmcompressor

num_samples = 256
max_seq_len = 4096
Expand Down Expand Up @@ -97,8 +105,19 @@ oneshot(
model.save_pretrained("Meta-Llama-3.1-8B-quantized.w8a8")
```

When this has completed you will need to copy over the tokeniser specific files from the original model before you can run inference on your quantised model.
Run the script. This step can take several hours depending on your hardware:

```bash
cp ~/.cache/huggingface/hub/models--meta-llama--Llama-3.1-8B/snapshots/*/*token* Meta-Llama-3.1-8B-quantized.w8a8/
python w8a8_quant.py
```

When this has completed, copy the tokenizer files from the original model into your quantized model directory before running inference:
```bash
for f in tokenizer.json tokenizer_config.json special_tokens_map.json tokenizer.model; do
cp ~/.cache/huggingface/hub/models--meta-llama--Meta-Llama-3.1-8B/snapshots/*/"$f" Meta-Llama-3.1-8B-quantized.w8a8/ 2>/dev/null || true
done
```

Your quantized model is ready. Next, you'll use vLLM to run inference on both the quantized and non-quantized models and compare their outputs.


Loading
Loading