diff --git a/content/learning-paths/servers-and-cloud-computing/vllm-benchmark-quantisation/1-overview-and-setup.md b/content/learning-paths/servers-and-cloud-computing/vllm-benchmark-quantisation/1-overview-and-setup.md index d5f4df5d83..f6e4e09a62 100644 --- a/content/learning-paths/servers-and-cloud-computing/vllm-benchmark-quantisation/1-overview-and-setup.md +++ b/content/learning-paths/servers-and-cloud-computing/vllm-benchmark-quantisation/1-overview-and-setup.md @@ -25,12 +25,16 @@ Before you begin, make sure your environment meets these requirements: This Learning Path was tested on a 96 core machine with 128-bit SVE, 192 GB of RAM and 500 GB of attached storage. +{{% notice Note %}} +Ubuntu 26.04 and later ship with Python 3.14 as the system default. vLLM does not currently support Python 3.14. This Learning Path explicitly installs and uses Python 3.12, so follow the steps as written regardless of your Ubuntu version. +{{% /notice %}} + ## Install build dependencies Install the following packages required for running inference with vLLM on Arm64: ```bash sudo apt-get update -y -sudo apt install -y python3.12-venv python3.12-dev +sudo apt install -y python3.12-venv python3.12-dev gcc g++ build-essential ``` Now install tcmalloc, a fast memory allocator from Google’s gperftools, which improves performance under high concurrency: @@ -49,6 +53,10 @@ python -m pip install --upgrade pip ## Install vLLM for CPU +{{% notice Note %}} +The following command installs vLLM version 0.20.0. The same steps work with other versions — replace the version number in the URL with your chosen release. To find the latest version, see the [vLLM releases page](https://github.com/vllm-project/vllm/releases). 
+{{% /notice %}} + Install a CPU-specific build of vLLM: ```bash export VLLM_VERSION=0.20.0 @@ -60,11 +68,18 @@ If you wish to build vLLM from source you can follow the instructions in the [Bu ## Set up access to LLama3.1-8B models -To access the Llama models hosted by Hugging Face, you will need to install the Hugging Face CLI so that you can authenticate yourself and the harness can download what it needs. You should create an account on https://huggingface.co/ and follow the instructions [in the Hugging Face CLI guide](https://huggingface.co/docs/huggingface_hub/en/guides/cli) to install the CLI and setup your access token. You can then login to HF: +To access the Llama models hosted by Hugging Face, you need to install the Hugging Face CLI and authenticate with your access token. Install the CLI with: +```bash +curl -LsSf https://hf.co/cli/install.sh | bash +``` + +For more details and alternative installation methods, see the [Hugging Face CLI guide](https://huggingface.co/docs/huggingface_hub/en/guides/cli). + +Create an account on [huggingface.co](https://huggingface.co/) if you don't already have one, then generate an access token in your account settings. Log in with: ```bash hf auth login ``` -Paste your access token into the terminal when prompted. To access Llama3.1-8B you need to request access on the Hugging Face website. Visit https://huggingface.co/meta-llama/Llama-3.1-8B and select "Expand to review and access". Complete the form and you should be granted access in a matter of minutes. +Paste your access token into the terminal when prompted. To access Llama3.1-8B you need to request access on the Hugging Face website. Visit [meta-llama/Llama-3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B) and select "Expand to review and access". Complete the form and you should be granted access in a matter of minutes. -Your environment is now setup to run inference with vLLM. 
Next, we'll review model quantisation and then you'll use vLLM to run inference on both quantised and non-quantised Llama and Whisper models. +Your environment is now set up to run inference with vLLM. Next, we'll review model quantization and then you'll use vLLM to run inference on both quantized and non-quantized Llama and Whisper models. diff --git a/content/learning-paths/servers-and-cloud-computing/vllm-benchmark-quantisation/2-quantisation-recipe.md b/content/learning-paths/servers-and-cloud-computing/vllm-benchmark-quantisation/2-quantisation-recipe.md index 531564b880..919a0eef0c 100644 --- a/content/learning-paths/servers-and-cloud-computing/vllm-benchmark-quantisation/2-quantisation-recipe.md +++ b/content/learning-paths/servers-and-cloud-computing/vllm-benchmark-quantisation/2-quantisation-recipe.md @@ -1,33 +1,41 @@ --- -title: Quantisation Recipe +title: Quantization Recipe weight: 3 ### FIXED, DO NOT MODIFY layout: learningpathall --- -## Understanding quantisation +## Understanding quantization -Quantised models have their weights converted to a lower precision data type which reduces the memory requirements of the model and can improve performance significantly. In the [Run vLLM inference with INT4 quantization on Arm servers](/learning-paths/servers-and-cloud-computing/vllm-acceleration/) Learning Path we have covered how to quantise a model yourself. There are also many publicly available quantised versions of popular models, such as https://huggingface.co/RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8 and https://huggingface.co/RedHatAI/whisper-large-v3-quantized.w8a8, which we will be using in this Learning Path. +Quantized models have their weights converted to a lower-precision data type, which reduces the memory requirements of the model and can improve performance significantly. 
In the [Run vLLM inference with INT4 quantization on Arm servers](/learning-paths/servers-and-cloud-computing/vllm-acceleration/) Learning Path, you can learn how to quantize a model yourself. There are also many publicly available quantized versions of popular models, such as [RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8](https://huggingface.co/RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8) and [RedHatAI/whisper-large-v3-quantized.w8a8](https://huggingface.co/RedHatAI/whisper-large-v3-quantized.w8a8), which this Learning Path uses. -The notation w8a8 means that the weights have been quantised to 8-bit integers and the activations (the input data) are dynamically quantised to the same. This allows our kernels to utilise Arm's 8-bit integer matrix multiply feature I8MM. You can learn more about this in the [KleidiAI and matrix multiplication](/learning-paths/cross-platform/kleidiai-explainer/) Learning Path. +The notation w8a8 means that the weights have been quantized to 8-bit integers and the activations (the input data) are dynamically quantized to the same. This allows Arm's 8-bit integer matrix multiply feature I8MM to be used. You can learn more about this in the [KleidiAI and matrix multiplication](/learning-paths/cross-platform/kleidiai-explainer/) Learning Path. -The w8a8 models we are using in this Learning Path only apply quantisation to the weights and activations in the linear layers of the transformer blocks. The activation quantisations are applied per-token and the weights are quantised per-channel. That is, each output channel dimension has a scaling factor applied between INT8 and BF16 representations. +The w8a8 models used in this Learning Path only apply quantization to the weights and activations in the linear layers of the transformer blocks. The activation quantizations are applied per-token and the weights are quantized per-channel. That is, each output channel dimension has a scaling factor applied between INT8 and BF16 representations. 
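The per-channel weight and per-token activation scheme described in the added paragraph above can be illustrated with a small sketch. This is a minimal pure-Python illustration under stated assumptions, not the kernel vLLM or KleidiAI actually runs: the helper names `quantize_rows` and `w8a8_linear` are hypothetical, symmetric quantization to the range [-127, 127] is assumed, and plain floats stand in for BF16.

```python
def quantize_rows(matrix):
    """Symmetric INT8 quantization with one scale per row (assumed scheme)."""
    q_rows, scales = [], []
    for row in matrix:
        max_abs = max(abs(v) for v in row)
        # One scale per row; fall back to 1.0 for an all-zero row.
        scale = max_abs / 127 if max_abs > 0 else 1.0
        q_rows.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales


def w8a8_linear(activations, weights):
    """Linear layer y = x @ W^T computed on INT8 values.

    activations: [tokens][in_features]  -> one scale per token (row)
    weights:     [out_channels][in_features] -> one scale per output channel (row)
    """
    q_act, act_scales = quantize_rows(activations)
    q_w, w_scales = quantize_rows(weights)
    out = []
    for t, a_row in enumerate(q_act):
        out_row = []
        for c, w_row in enumerate(q_w):
            # Integer multiply-accumulate, the part an INT8 matmul instruction speeds up.
            acc = sum(a * w for a, w in zip(a_row, w_row))
            # Rescale back to floating point with the token and channel scales.
            out_row.append(acc * act_scales[t] * w_scales[c])
        out.append(out_row)
    return out


activations = [[0.5, -1.0, 0.25], [2.0, 0.1, -0.3]]
weights = [[0.2, -0.4, 0.1], [1.0, 0.5, -0.5]]
result = w8a8_linear(activations, weights)
```

Each output element is an integer dot product multiplied by two scales, one for the token and one for the output channel, which is why only the integer accumulation needs to run in 8-bit arithmetic.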
-## Quantising your own models +## Quantizing your own models (optional) -If you would prefer to generate your own w8a8 quantised models, the recipe below is provided as an example. This is an optional activity and not a core part of this Learning Path, as it can take several hours to run. +{{% notice Note %}} +This section is optional. The rest of this Learning Path uses pre-quantized models from Hugging Face and does not require you to run this recipe. Quantizing a model yourself can take several hours. +{{% /notice %}} + +If you prefer to generate your own w8a8 quantized model rather than using the pre-quantized Red Hat models, the recipe below shows how. Install the required packages before running the quantization script. + +{{% notice Note %}} +The following commands use specific package versions that were tested with this recipe. To find the latest versions, see [llmcompressor](https://github.com/vllm-project/llm-compressor/releases), [compressed-tensors](https://github.com/neuralmagic/compressed-tensors/releases), and [datasets](https://github.com/huggingface/datasets/releases) on GitHub. +{{% /notice %}} -You will need to install the required packages before running the quantisation script. ```bash pip install compressed-tensors==0.14.0.1 pip install llmcompressor==0.10.0.1 pip install datasets==4.6.0 - -python w8a8_quant.py ``` -Where w8a8_quant.py contains: +The script uses GPTQ, a post-training quantization method, to calibrate the quantized weights. It loads 256 samples from a calibration dataset, runs them through the model, and computes per-channel weight scales. Activations are quantized dynamically per token at inference time, as described above. The output is saved as a quantized model in the `Meta-Llama-3.1-8B-quantized.w8a8` directory. 
+ +Create a file named `w8a8_quant.py` with the following content: + ```python from transformers import AutoTokenizer from datasets import Dataset, load_dataset @@ -37,7 +45,7 @@ from llmcompressor.modifiers.quantization import GPTQModifier from compressed_tensors.quantization import QuantizationType, QuantizationStrategy import random -model_id = "meta-llama/Meta-Llama-3.1-8B" +model_id = "meta-llama/Meta-Llama-3.1-8B" # Note: this uses the Meta-prefixed model ID required by llmcompressor num_samples = 256 max_seq_len = 4096 @@ -97,8 +105,19 @@ oneshot( model.save_pretrained("Meta-Llama-3.1-8B-quantized.w8a8") ``` -When this has completed you will need to copy over the tokeniser specific files from the original model before you can run inference on your quantised model. +Run the script. This step can take several hours depending on your hardware: ```bash -cp ~/.cache/huggingface/hub/models--meta-llama--Llama-3.1-8B/snapshots/*/*token* Meta-Llama-3.1-8B-quantized.w8a8/ +python w8a8_quant.py ``` + +When this has completed, copy the tokenizer files from the original model into your quantized model directory before running inference: +```bash +for f in tokenizer.json tokenizer_config.json special_tokens_map.json tokenizer.model; do + cp ~/.cache/huggingface/hub/models--meta-llama--Meta-Llama-3.1-8B/snapshots/*/"$f" Meta-Llama-3.1-8B-quantized.w8a8/ 2>/dev/null || true +done +``` + +Your quantized model is ready. Next, you'll use vLLM to run inference on both the quantized and non-quantized models and compare their outputs. 
+ + diff --git a/content/learning-paths/servers-and-cloud-computing/vllm-benchmark-quantisation/3-run-inference.md b/content/learning-paths/servers-and-cloud-computing/vllm-benchmark-quantisation/3-run-inference.md index addfec7d19..508a1c3322 100644 --- a/content/learning-paths/servers-and-cloud-computing/vllm-benchmark-quantisation/3-run-inference.md +++ b/content/learning-paths/servers-and-cloud-computing/vllm-benchmark-quantisation/3-run-inference.md @@ -8,14 +8,28 @@ layout: learningpathall ## Run inference on LLama3.1-8B -We will use vLLM to serve an openAI-compatible API that we can use to run inference on Llama3.1-8B. This will demonstrate that the local environment is setup correctly. +vLLM serves an OpenAI-compatible API that you use to run inference on Llama3.1-8B. This confirms that the local environment is set up correctly. Start vLLM’s OpenAI-compatible API server using Llama3.1-8B: ```bash vllm serve meta-llama/Llama-3.1-8B ``` -Then we can create a test script that sends a request to the server using the OpenAI library. Copy the Python script below to a file named llama_test.py. +The server prints its available routes and then confirms it's ready: + +```output +(APIServer pid=27612) INFO 05-18 15:00:06 [api_server.py:602] Starting vLLM server on http://0.0.0.0:8000 +(APIServer pid=27612) INFO 05-18 15:00:06 [launcher.py:37] Available routes are: +(APIServer pid=27612) INFO 05-18 15:00:06 [launcher.py:46] Route: /openapi.json, Methods: HEAD, GET +(APIServer pid=27612) INFO 05-18 15:00:06 [launcher.py:46] Route: /v1/chat/completions, Methods: POST +(APIServer pid=27612) INFO 05-18 15:00:06 [launcher.py:46] Route: /v1/completions, Methods: POST +... +(APIServer pid=27612) INFO: Application startup complete. +``` + +Wait until you see `Application startup complete` before continuing. The server is now listening on port 8000. Open a new terminal to run the client script while the server continues running in the first terminal. 
+ +Then create a test script that sends a request to the server using the OpenAI library. Copy the Python script below to a file named `llama_test.py`: ```python import time @@ -23,11 +37,12 @@ from openai import OpenAI from transformers import AutoTokenizer # vLLM's OpenAI-compatible server -client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY") +client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY") # api_key is required by the OpenAI client but not validated by vLLM model = "meta-llama/Llama-3.1-8B" # vllm server model # Define a chat template for the model +# vLLM's /v1/completions endpoint does not auto-apply the model's chat template, so it must be applied manually llama3_template = "{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.first and message['role'] != 'system' %}{{ '<|start_header_id|>system<|end_header_id|>\n\n'+ 'You are a helpful assistant.' + '<|eot_id|>' }}{% endif %}{{ content }}{% endfor %}{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}" # Define your prompt @@ -61,35 +76,97 @@ Now run the script with: python llama_test.py ``` -This will return the text generated by the model from your prompt. In the server logs you can see the throughput measured in tokens per second. +The output is similar to: + +```output +=== Output === +Big O notation is a mathematical notation that describes the limiting behavior of a function when the argument tends towards a particular value or infinity. It is a member of a family of notations invented by Paul Bachmann, Edmund Landau, and others, collectively called Bachmann–Landau notation or asymptotic notation. -You can do the same for the pre-quantised model loaded directly from Hugging Face. 
Start the server: +In computer science, big O notation is used to classify algorithms according to how their run time or space requirements grow as the input size grows. In analytic number theory, big O notation is often used to express a bound on the difference between an arithmetical function and a better understood approximation; a famous example of such a difference + +Batch completed in : 16.50s +``` + +The response is truncated at 128 tokens, which is why the explanation cuts off mid-sentence — this is controlled by the `max_tokens=128` parameter in the script. In the server terminal you can also see throughput metrics logged in tokens per second. + +You can do the same for the pre-quantized model loaded directly from Hugging Face. Stop the running server first with Ctrl+C, then start the quantized model server: ```bash vllm serve RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8 ``` -Update your test script to use the quantised model: +Wait for the same `Application startup complete` message before continuing: + +```output +(APIServer pid=28847) INFO 05-18 15:11:31 [api_server.py:602] Starting vLLM server on http://0.0.0.0:8000 +(APIServer pid=28847) INFO 05-18 15:11:31 [launcher.py:37] Available routes are: +(APIServer pid=28847) INFO 05-18 15:11:31 [launcher.py:46] Route: /v1/chat/completions, Methods: POST +(APIServer pid=28847) INFO 05-18 15:11:31 [launcher.py:46] Route: /v1/completions, Methods: POST +... +(APIServer pid=28847) INFO: Application startup complete. +``` + +Update your test script `llama_test.py` to use the quantized model: ```python model = "RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8" ``` -Run inference on the quantised model: +Run inference on the quantized model: ```bash python llama_test.py ``` -You have now run inference using both the non-quantised and quantised Llama3.1-8B models! 
+The output is similar to: + +```output +=== Output === +$\begin{array}{l}{O}\left({n}^{2}\right)\\ {O}\left({n}^{3}\right)\end{array}$ + +Please help me with this problem. I don't know where to start. Thank you. + +• Questions are typically answered in as fast as 30 minutes + +### Plainmath recommends + +• $${O}\left({n}^{2}\right)$$ +• $${O}\left({n}^{3}\right)$$ +###### Not exactly what you're looking for? + +Expert community + +• Live experts 24/7 +• Questions are usually + +Batch completed in : 7.93s +``` + +The quantized model completed the request in 7.93s compared to 16.50s for the non-quantized model — roughly a 2x speedup. The response quality differs between the two models because quantization reduces model precision, which can affect output style and coherence. The quantized model identified the Big O examples but formatted them as if pulling from a Q&A website rather than producing a clean explanation. + +You have now run inference using both the non-quantized and quantized Llama3.1-8B models. ## Run inference on Whisper -We will use a similar approach to test our ability to run inference on Whisper models. Install the required vLLM audio library then start vLLM’s OpenAI-compatible API server using Whisper-large-v3: +Use a similar approach to test inference on Whisper models. Stop the running Llama server first with Ctrl+C, then install the required vLLM audio library and start the Whisper server: ```bash pip install vllm[audio] +``` + +```bash +vllm serve openai/whisper-large-v3 +``` + +Wait for the server to confirm it's ready. You'll notice the Whisper server exposes audio-specific routes: -vllm serve openai/whisper-large-v3 +```output +(APIServer pid=29957) INFO 05-18 15:21:01 [api_server.py:602] Starting vLLM server on http://0.0.0.0:8000 +(APIServer pid=29957) INFO 05-18 15:21:01 [launcher.py:37] Available routes are: +... 
+(APIServer pid=29957) INFO 05-18 15:21:01 [launcher.py:46] Route: /v1/audio/transcriptions, Methods: POST +(APIServer pid=29957) INFO 05-18 15:21:01 [launcher.py:46] Route: /v1/audio/translations, Methods: POST +... +(APIServer pid=29957) INFO: Application startup complete. ``` -Then we can create a test script that sends a request with an audio file to the server using the OpenAI library. Copy the Python script below to a file named whisper_test.py. +Open a new terminal once you see `Application startup complete`. Then create a test script that sends a request with an audio file to the server using the OpenAI library. Copy the Python script below to a file named `whisper_test.py`. ```python import time @@ -130,19 +207,51 @@ Now run the script with: python whisper_test.py ``` -You can do the same for the pre-quantised model loaded directly from Hugging Face. Start the server: +The output is similar to: + +```output +=== Output === + And the 0-1 pitch on the way to Edgar Martinez. Swung on the line. Now the left field line for a base hit. Here comes Joy. Here is Junior to third base. They're going to wave him in. The throw to the plate will be late. The Mariners are going to play for the American League Championship. I don't believe it. It just continues. My, oh, my. + +Batch completed in : 31.85s +``` + +The script uses `AudioAsset("winning_call")`, a sample audio clip bundled with vLLM. The transcription confirms Whisper is processing audio correctly on Arm64. + +You can do the same for the pre-quantized Whisper model loaded directly from Hugging Face. Start the server: ```bash vllm serve RedHatAI/whisper-large-v3-quantized.w8a8 ``` -Update your test script to use the quantised model: +Wait for the same `Application startup complete` message before continuing: + +```output +(APIServer pid=31658) INFO 05-18 15:27:09 [api_server.py:602] Starting vLLM server on http://0.0.0.0:8000 +(APIServer pid=31658) INFO 05-18 15:27:09 [launcher.py:37] Available routes are: +... 
+(APIServer pid=31658) INFO 05-18 15:27:09 [launcher.py:46] Route: /v1/audio/transcriptions, Methods: POST +(APIServer pid=31658) INFO 05-18 15:27:09 [launcher.py:46] Route: /v1/audio/translations, Methods: POST +... +(APIServer pid=31658) INFO: Application startup complete. +``` + +Update your test script `whisper_test.py` to use the quantized model: ```python model = "RedHatAI/whisper-large-v3-quantized.w8a8" ``` -Run inference on the quantised model: +Run inference on the quantized model: ```bash python whisper_test.py ``` -You now have the quantised and non-quantised Llama and Whisper models on your local machine! You have installed vLLM and demonstrated you can run inference on your models. Now you can move on to benchmarking the Llama models and compare their performance. +The output is similar to: + +```output +=== Output === + And the 0-1 pitch on the way to Edgar Martinez. Swung on the line. Now the left field line for a base hit. Here comes Joy. Here is Junior to third base. They're going to wave him in. The throw to the plate will be late. The Mariners are going to play for the American League Championship. I don't believe it. It just continues. My, oh, my. + +Batch completed in : 8.14s +``` + +The quantized Whisper model completed in 8.14s compared to 31.85s for the non-quantized model — roughly a 4x speedup, while producing an identical transcription. You have installed vLLM and demonstrated you can run inference on your models. Now you can move on to benchmarking the Llama models and compare their performance. 
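The pages in this diff ask the reader to watch the server log for `Application startup complete` before running a client script. As a hedged alternative for scripted runs, the sketch below polls the server over HTTP until it answers. The function names are hypothetical, and the `/v1/models` path is an assumption: it is not among the routes shown in the log excerpts above, so substitute any route your server lists.

```python
import http.client
import time
import urllib.request


def http_ok(url, timeout_s=2.0):
    """Return True if a GET request to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, or HTTP error: server not ready yet.
        return False


def wait_for_server(url, timeout_s=600.0, interval_s=5.0, probe=http_ok, sleep=time.sleep):
    """Poll `url` until `probe` succeeds or `timeout_s` elapses.

    `probe` and `sleep` are injectable so the loop can be exercised without a server.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

A client script could call `wait_for_server("http://localhost:8000/v1/models")` and exit early if it returns `False`, using the same port 8000 that the server logs report.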
diff --git a/content/learning-paths/servers-and-cloud-computing/vllm-benchmark-quantisation/4-benchmarking.md b/content/learning-paths/servers-and-cloud-computing/vllm-benchmark-quantisation/4-benchmarking.md index b07fd27a75..12cb22d3bd 100644 --- a/content/learning-paths/servers-and-cloud-computing/vllm-benchmark-quantisation/4-benchmarking.md +++ b/content/learning-paths/servers-and-cloud-computing/vllm-benchmark-quantisation/4-benchmarking.md @@ -8,7 +8,7 @@ layout: learningpathall ## Llama performance benchmarking -We will use the vLLM bench CLI to measure the throughput of our models. First, install the required library then start the server and keep it running: +Use the vLLM bench CLI to measure the throughput of your models. First, install the required library then start the server in the background: ```bash pip install vllm[bench] @@ -18,11 +18,13 @@ vllm serve \ --max-model-len 4096 & ``` +Wait for `Application startup complete` in the server output before continuing. The `wget` command below will take a few seconds, which usually gives the server enough time to start. + vLLM uses dynamic continuous batching to maximise hardware utilisation. Two key parameters govern this process: - * max-model-len, which is the maximum sequence length (number of tokens per request). No single prompt or generated sequence can exceed this limit. We've chosen a value large enough for the selected model and dataset. - * max-num-batched-tokens, which is the total number of tokens processed in one batch across all requests. The sum of input and output tokens from all concurrent requests must stay within this limit. We've chosen a value that, combined with our concurrency limit shown below, gives optimal throughput and latency. +- `max-model-len`: the maximum sequence length (number of tokens per request). No single prompt or generated sequence can exceed this limit. The value chosen here is large enough for the selected model and dataset. 
+- `max-num-batched-tokens`: the total number of tokens processed in one batch across all requests. The sum of input and output tokens from all concurrent requests must stay within this limit. The value chosen here, combined with the concurrency limit shown below, gives optimal throughput and latency. -Now the server is running, we can benchmark using the public ShareGPT dataset. +Now the server is running, you can benchmark using the public ShareGPT dataset. ```bash wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json @@ -38,17 +40,48 @@ vllm bench serve \ --metric-percentiles 50,95,99 \ --save-result --result-dir bench_out --result-filename serve.json ``` -Here we are using greedy decoding: '''--top-p 1 --temperature 0'''. This selects the next token with the highest probability at each step, instead of sampling from a selection of likely tokens. -The interesting results are request throughput, output token throughput, total token throughput, TTFT (time to first token) and TPOT (time per output token). We're aiming for a mean TPOT < 100ms, so the maximum concurrency selected should be as high as possible while meeting that TPOT requirement. +The output is similar to: + +```output +Failed requests: 0 +Maximum request concurrency: 10 +Request rate configured (RPS): 8.00 +Benchmark duration (s): 551.96 +Total input tokens: 54084 +Total generated tokens: 35468 +Request throughput (req/s): 0.46 +Output token throughput (tok/s): 64.26 +Peak output token throughput (tok/s): 120.00 +Peak concurrent requests: 13.00 +Total token throughput (tok/s): 162.24 +---------------Time to First Token---------------- +Mean TTFT (ms): 1077.07 +Median TTFT (ms): 610.55 +P50 TTFT (ms): 610.55 +P95 TTFT (ms): 2657.55 +P99 TTFT (ms): 4134.92 +-----Time per Output Token (excl. 
1st token)------ +Mean TPOT (ms): 168.44 +Median TPOT (ms): 140.70 +P50 TPOT (ms): 140.70 +P95 TPOT (ms): 224.80 +P99 TPOT (ms): 966.98 +================================================== +``` + +Here greedy decoding (`--top-p 1 --temperature 0`) selects the highest-probability token at each step rather than sampling, giving deterministic and reproducible results. The key metrics to focus on are output token throughput (64.26 tok/s), total token throughput (162.24 tok/s), mean TTFT (1077ms), and mean TPOT (168ms). The mean TPOT of 168ms exceeds the 100ms target at `max-concurrency 10` for the BF16 model — the quantized model run below uses higher concurrency to demonstrate the throughput improvement. -Repeat with the quantised model. The smaller model allows us to increase the concurrency. You should see a significant improvement in the throughput results (increased tokens/s). +Repeat with the quantized model. The reduced model size allows you to increase concurrency, which results in a significant throughput improvement. 
Stop the running BF16 server first with Ctrl+C, then start the quantized model server: ```bash vllm serve \ --model RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8 \ --max-num-batched-tokens 8192 \ --max-model-len 4096 & - +``` + +Wait for `Application startup complete`, then run the benchmark: +```bash vllm bench serve \ --model RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8 \ --dataset-name sharegpt \ @@ -62,14 +95,40 @@ vllm bench serve \ --save-result --result-dir bench_out --result-filename serve.json ``` -## Llama accuracy benchmarking +The output is similar to: + +```output +Failed requests: 0 +Maximum request concurrency: 24 +Request rate configured (RPS): 8.00 +Benchmark duration (s): 210.01 +Total input tokens: 54084 +Total generated tokens: 29058 +Request throughput (req/s): 1.22 +Output token throughput (tok/s): 138.36 +Peak output token throughput (tok/s): 336.00 +Peak concurrent requests: 31.00 +Total token throughput (tok/s): 395.89 +---------------Time to First Token---------------- +Mean TTFT (ms): 1227.51 +Median TTFT (ms): 702.22 +P50 TTFT (ms): 702.22 +P95 TTFT (ms): 4682.75 +P99 TTFT (ms): 7564.33 +-----Time per Output Token (excl. 1st token)------ +Mean TPOT (ms): 189.36 +Median TPOT (ms): 152.23 +P50 TPOT (ms): 152.23 +P95 TPOT (ms): 304.81 +P99 TPOT (ms): 1221.36 +================================================== +``` + +The quantized model completes the same benchmark in 210s versus 552s for BF16 — a 2.6x reduction in benchmark duration. Output token throughput increases from 64.26 to 138.36 tok/s and total token throughput from 162.24 to 395.89 tok/s, confirming significant throughput gains from INT8 quantization at higher concurrency. 
-The lm-evaluation-harness is the standard way to measure model accuracy across common academic benchmarks (for example [MMLU](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/mmlu), [HellaSwag](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/hellaswag), [GSM8K](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/gsm8k)) and runtimes (such as [Hugging Face](https://github.com/huggingface/transformers), [vLLM](https://github.com/vllm-project/vllm), and [llama.cpp](https://github.com/ggml-org/llama.cpp)). In this section, you’ll run accuracy tests for both BF16 and INT8 deployments of your Llama models served by vLLM on Arm-based servers. +## Llama accuracy benchmarking -You will: -- Install the lm-eval harness with vLLM support -- Run benchmarks on a BF16 model and an INT8 (weight-quantized) model -- Interpret key metrics and compare quality across precisions +The lm-evaluation-harness is the standard way to measure model accuracy across common academic benchmarks (for example [MMLU](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/mmlu), [HellaSwag](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/hellaswag), [GSM8K](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/gsm8k)) and runtimes (such as [Hugging Face](https://github.com/huggingface/transformers), [vLLM](https://github.com/vllm-project/vllm), and [llama.cpp](https://github.com/ggml-org/llama.cpp)). In this section you'll install the lm-eval harness with vLLM support, run benchmarks on both the BF16 and INT8 deployments, and interpret the accuracy difference between precisions. First install the required libraries for benchmarking with lm_eval. 
```bash @@ -81,36 +140,62 @@ You can use a limited number of prompts to validate your environment by appendin lm_eval --model vllm --model_args pretrained=meta-llama/Llama-3.1-8B,dtype=bfloat16,max_model_len=4096 --tasks mmlu,gsm8k --batch_size auto ``` +The output is similar to: + +```output +| Groups |Version|Filter|n-shot|Metric| |Value | |Stderr| +|------------------|------:|------|-----:|------|---|-----:|---|-----:| +|mmlu | 2|none | |acc | |0.6895|± |0.0183| +| - humanities | 2|none | 0|acc |↑ |0.7462|± |0.0378| +| - other | 2|none | 0|acc |↑ |0.6538|± |0.0395| +| - social sciences| 2|none | 0|acc |↑ |0.7917|± |0.0363| +``` + +{{% notice Note %}} +This output was generated with `--limit 10`, which runs only 10 prompts per task. Results will vary between runs at this sample size. Remove `--limit 10` for a full benchmark over the complete dataset. +{{% /notice %}} + The [MMLU task](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/mmlu) is a set of multiple choice questions split into the subgroups listed above. It allows you to measure the ability of an LLM to understand questions and select the right answers. The [GSM8k task](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/gsm8k) is a set of math problems that test an LLM's mathematical reasoning ability. -Repeat with the quantised model. +Repeat with the quantized model. ```bash lm_eval --model vllm --model_args pretrained=RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8,dtype=bfloat16,max_model_len=4096 --tasks mmlu,gsm8k --batch_size auto ``` -We expect INT8 inference to show a slight accuracy drop compared to BF16. 
For reference results and expected accuracy differences, see the Red Hat model card: https://huggingface.co/RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8#accuracy +The output is similar to: + +```output +| Groups |Version|Filter|n-shot|Metric| |Value | |Stderr| +|------------------|------:|------|-----:|------|---|-----:|---|-----:| +|mmlu | 2|none | |acc | |0.6614|± |0.0189| +| - humanities | 2|none | 0|acc |↑ |0.7231|± |0.0359| +| - other | 2|none | 0|acc |↑ |0.6077|± |0.0416| +| - social sciences| 2|none | 0|acc |↑ |0.7417|± |0.0390| +| - stem | 2|none | 0|acc |↑ |0.6053|± |0.0345| +``` + +The INT8 model scores 0.6614 on MMLU compared to 0.6895 for BF16, a drop of about 2.8 percentage points (roughly 4% in relative terms). This is consistent with an accuracy cost from INT8 weight and activation quantization, but the gap is of similar size to the reported standard errors (±0.0183 and ±0.0189), so treat it as indicative only. For full reference results, see the [Red Hat model card](https://huggingface.co/RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8#accuracy). ## Summary of results -The benchmarking results you generate will depend on the hardware you are using. The values below, provided as an example only, were measured on a 96 core machine with 128-bit SVE and 192 GB of RAM. Using the INT8 quantised Llama3.1-8B model we observe throughput improvements of over 2x at a cost of up to ~8% in accuracy. +The benchmarking results you generate will depend on the hardware you are using. The values below were measured on a 96-core machine with 128-bit SVE and 192 GB of RAM. Using the INT8 quantized Llama3.1-8B model we observe throughput improvements of over 2x. The accuracy results below used `--limit 10`; a full dataset run may show an accuracy drop of up to ~8%. 
-### Throughput ratios: INT8/BF16 -| Requests/s | Output Tokens/s | Total Tokens/s | -| -------- | -------- | -------- | -| 2.7x | 2.2x | 2.5x | +### Throughput: BF16 vs INT8 (max-concurrency 10 vs 24) +| Metric | BF16 | INT8 | Ratio | +|---|---|---|---| +| Request throughput (req/s) | 0.46 | 1.22 | 2.7x | +| Output token throughput (tok/s) | 64.26 | 138.36 | 2.2x | +| Total token throughput (tok/s) | 162.24 | 395.89 | 2.4x | -### Accuracy recovery: INT8/BF16 -| MMLU | GSM8k | -| -------- | -------- | -| 97% | 92% | +### Accuracy recovery: INT8/BF16 (--limit 10) +| MMLU | GSM8k | +|---|---| +| 97% | 92% | -## Next steps +These results were generated with `--limit 10`. Run without `--limit` for a statistically representative accuracy comparison. -Now that you have your environment set up for running inference, benchmarking and quantising different models, you can experiment with: -- Benchmarking accuracy with different tasks -- Different quantisation techniques -- Different models +## Next steps -Your results will allow you to balance accuracy and performance when making decisions about model deployment. +Now that your environment is set up for running inference, benchmarking, and quantizing different models, you can experiment further. Try benchmarking accuracy with different tasks, different quantization techniques, or different models. Your results will allow you to balance accuracy and performance when making decisions about model deployment. 
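To compare your own results with the summary tables, you can recompute the ratios from the raw figures. The following is a minimal sketch, assuming the sample throughput and MMLU values shown in this Learning Path. The names `throughput` and `ratio` are illustrative only and are not part of vLLM or the LM Evaluation Harness.

```python
# Sample figures copied from the summary tables in this Learning Path.
# Replace them with the values you measure on your own hardware.
# Each entry maps a metric name to a (BF16, INT8) pair.
throughput = {
    "Request throughput (req/s)": (0.46, 1.22),
    "Output token throughput (tok/s)": (64.26, 138.36),
    "Total token throughput (tok/s)": (162.24, 395.89),
}


def ratio(bf16: float, int8: float) -> float:
    """Return the INT8/BF16 ratio rounded to one decimal place."""
    return round(int8 / bf16, 1)


ratios = {name: ratio(bf16, int8) for name, (bf16, int8) in throughput.items()}
for name, value in ratios.items():
    print(f"{name}: {value}x")

# MMLU accuracy from the two sample lm_eval outputs (generated with --limit 10).
mmlu_bf16, mmlu_int8 = 0.6895, 0.6614
absolute_drop = mmlu_bf16 - mmlu_int8       # difference in accuracy
relative_drop = absolute_drop / mmlu_bf16   # difference relative to BF16
print(
    f"MMLU drop: {absolute_drop * 100:.1f} points absolute, "
    f"{relative_drop * 100:.1f}% relative"
)
```

Keeping the absolute and relative drops separate avoids ambiguity when you quote a single percentage for the accuracy cost of quantization.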
diff --git a/content/learning-paths/servers-and-cloud-computing/vllm-benchmark-quantisation/_index.md b/content/learning-paths/servers-and-cloud-computing/vllm-benchmark-quantisation/_index.md index 31edef1371..c6aceff54f 100644 --- a/content/learning-paths/servers-and-cloud-computing/vllm-benchmark-quantisation/_index.md +++ b/content/learning-paths/servers-and-cloud-computing/vllm-benchmark-quantisation/_index.md @@ -1,5 +1,5 @@ --- -title: Run vLLM inference with quantised models and benchmark on Arm servers +title: Run vLLM inference with quantized models and benchmark on Arm servers draft: true cascade: @@ -7,16 +7,16 @@ cascade: minutes_to_complete: 60 -who_is_this_for: This is an introductory topic for developers interested in running inference on quantised models. This Learning Path shows you how to run inference on Llama 3.1-8B and Whisper, with and without quantisation, and benchmark Llama performance and accuracy with vLLM's bench CLI and the LM Evaluation Harness. +who_is_this_for: This is an introductory topic for developers interested in running inference on quantized models. This Learning Path shows you how to run inference on Llama 3.1-8B and Whisper, with and without quantization, and benchmark Llama performance and accuracy with vLLM's bench CLI and the LM Evaluation Harness. 
learning_objectives: - Install a recent release of vLLM - - Run both quantised and non-quantised variants of Llama3.1-8B and Whisper using vLLM + - Run both quantized and non-quantized variants of Llama3.1-8B and Whisper using vLLM - Evaluate and compare model performance and accuracy using vLLM's bench CLI and the LM Evaluation Harness prerequisites: - An Arm-based Linux server (Ubuntu 22.04+ recommended) with a minimum of 32 vCPUs, 96 GB RAM, and 64 GB free disk space - - Python 3.12 and basic familiarity with Hugging Face Transformers and quantisation + - Python 3.12 and basic familiarity with Hugging Face Transformers and quantization schemes author: Anna Mayne, Nikhil Gupta, Marek Michałowski @@ -27,9 +27,6 @@ armips: - Neoverse tools_software_languages: - vLLM - - LM Evaluation Harness - - LLM - - Generative AI - Python - PyTorch - Hugging Face