# KServe Autoscaling with KEDA and Custom Prometheus Metrics

This example demonstrates autoscaling a KServe InferenceService using
[KEDA](https://keda.sh/) with custom Prometheus metrics from vLLM.
It scales based on total token throughput rather than simple request count,
which is better suited for LLM inference workloads.

For full documentation, see the
[prokube autoscaling docs](https://docs.prokube.cloud/user_docs/model_serving_autoscaling/#keda-kubernetes-event-driven-autoscaling).

## Why Token Throughput?

LLM requests vary wildly in duration depending on prompt and output length.
Request-count metrics (concurrency, QPS) don't reflect actual GPU load.
Token throughput stays elevated as long as the model is under pressure,
making it a stable scaling signal.
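
Concretely, "token throughput" here means the rate of tokens the model
processes per second, summed over all replicas. A PromQL sketch of such a
signal (the exact query lives in `scaled-object.yaml` and may differ; the
label names and values below are assumptions to check against your vLLM
`/metrics` output):

```promql
# Prompt + generated tokens processed per second across all replicas
sum(rate(vllm:prompt_tokens_total{namespace="default", model_name="opt-125m"}[1m]))
+ sum(rate(vllm:generation_tokens_total{namespace="default", model_name="opt-125m"}[1m]))
```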

## Prerequisites

- KEDA installed in the cluster
  (`helm repo add kedacore https://kedacore.github.io/charts && helm install keda kedacore/keda -n keda --create-namespace`)
- Prometheus scraping vLLM metrics (prokube clusters include a cluster-wide
  PodMonitor; for other clusters, see the sketch below)
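
On clusters without an existing scrape configuration, a PodMonitor along these
lines makes the vLLM metrics visible to Prometheus. This is a sketch, not a
manifest shipped with this example: the pod label is KServe's standard
`serving.kserve.io/inferenceservice` label, but the port name is an assumption
to verify against your predictor pod spec.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: vllm-metrics
spec:
  selector:
    matchLabels:
      serving.kserve.io/inferenceservice: opt-125m
  podMetricsEndpoints:
    - path: /metrics
      # Use the named container port that serves vLLM's HTTP API; check the
      # predictor pod spec (kubectl get pod <pod> -o yaml) for the actual name.
      port: http1
```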

## Files

| File | Description |
|------|-------------|
| `inference-service.yaml` | KServe InferenceService (OPT-125M, RawDeployment mode) |
| `scaled-object.yaml` | KEDA ScaledObject that scales on token throughput |

## Quick Start

```bash
export NAMESPACE="default"

# 1. Deploy the InferenceService
kubectl apply -n $NAMESPACE -f inference-service.yaml

# 2. Wait for it to become ready
kubectl get isvc opt-125m -n $NAMESPACE -w

# 3. Deploy the KEDA ScaledObject
kubectl apply -n $NAMESPACE -f scaled-object.yaml

# 4. Verify
kubectl get scaledobject -n $NAMESPACE
kubectl get hpa -n $NAMESPACE
```

## Customization

**Namespace and model name**: replace `default` and `opt-125m` in the
Prometheus queries inside `scaled-object.yaml`.

**Threshold**: the `threshold: "5"` value means "scale up when each replica
handles more than 5 tokens/second on average" (`AverageValue` divides the
query result by replica count). Tune this based on load testing for your
model and hardware.
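
For orientation, the trigger these two settings belong to looks roughly like
the following. Treat it as a sketch: the authoritative query and Prometheus
address are the ones in `scaled-object.yaml`, and the token-rate expression
here is an assumption based on vLLM's counter metrics.

```yaml
# Sketch of a token-throughput trigger (compare with scaled-object.yaml)
- type: prometheus
  metricType: AverageValue   # HPA divides the query result by the replica count
  metadata:
    serverAddress: http://kube-prometheus-stack-prometheus.monitoring.svc.cluster.local:9090/prometheus
    query: >-
      sum(rate(vllm:generation_tokens_total{namespace="default",model_name="opt-125m"}[1m]))
    threshold: "5"           # target tokens/second per replica
```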

**GPU deployments**: remove `--dtype=float32` and `--max-model-len=512`
from the InferenceService args, add GPU resource requests, and consider
adding a second trigger for GPU KV-cache utilization:

```yaml
# Add to the triggers list in scaled-object.yaml
- type: prometheus
  # metricType is a trigger-level field; Value compares the averaged
  # cache usage directly against the threshold
  metricType: Value
  metadata:
    serverAddress: http://kube-prometheus-stack-prometheus.monitoring.svc.cluster.local:9090/prometheus
    query: >-
      avg(vllm:gpu_cache_usage_perc{namespace="my-namespace",model_name="my-model"})
    threshold: "0.75"
```
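
On the InferenceService side, the GPU request is the main addition. A sketch
of the relevant predictor fields (the exact structure depends on how
`inference-service.yaml` defines the predictor):

```yaml
# Sketch: GPU resources on the predictor (values are examples)
spec:
  predictor:
    model:
      resources:
        requests:
          nvidia.com/gpu: "1"
        limits:
          nvidia.com/gpu: "1"
```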

Beyond token throughput and KV-cache utilization, queue depth can serve as a
fallback signal: a trigger on `vllm:num_requests_running` with a threshold of
around `"2"` running requests per replica scales out when requests start to
pile up.

## vLLM Metrics Reference

vLLM exposes metrics at the `/metrics` endpoint. Note that vLLM uses colons in
metric names; this is unusual for Prometheus but valid.

| Metric | Description |
|--------|-------------|
| `vllm:num_requests_running` | Number of requests currently being processed |
| `vllm:num_requests_waiting` | Number of requests waiting in the queue |
| `vllm:gpu_cache_usage_perc` | GPU KV-cache utilization (0-1) |
| `vllm:time_to_first_token_seconds` | Histogram of time to first token (TTFT) |
| `vllm:time_per_output_token_seconds` | Histogram of time per output token (TPOT) |
| `vllm:generation_tokens_total` | Total number of generated tokens |

## Testing Autoscaling

Generate sustained load to trigger autoscaling:

```bash
# First, find the predictor service name
kubectl get svc -n $NAMESPACE | grep opt-125m

# Create a load generator pod (adjust the service name if yours differs)
kubectl run load-gen --image=curlimages/curl -n $NAMESPACE --restart=Never -- \
  sh -c 'while true; do for i in $(seq 1 10); do curl -s -X POST "http://opt-125m-predictor.'"$NAMESPACE"'.svc.cluster.local/openai/v1/completions" -H "Content-Type: application/json" -d "{\"model\": \"opt-125m\", \"prompt\": \"Tell me a story\", \"max_tokens\": 200}" & done; sleep 2; done'
```

Monitor scaling:
```bash
# Watch HPA status
kubectl get hpa -n $NAMESPACE -w

# Watch pods
kubectl get pods -n $NAMESPACE -l serving.kserve.io/inferenceservice=opt-125m -w
```

Clean up:
```bash
kubectl delete pod load-gen -n $NAMESPACE
```

## Troubleshooting

### KEDA not scaling
1. Check the ScaledObject status:
   ```bash
   kubectl describe scaledobject -n $NAMESPACE
   ```

2. Verify Prometheus connectivity (some deployments use a `/prometheus` path prefix):
   ```bash
   kubectl run curl-test --image=curlimages/curl --rm -it --restart=Never -- \
     curl -s 'http://kube-prometheus-stack-prometheus.monitoring.svc.cluster.local:9090/prometheus/api/v1/query?query=up'
   ```

3. Check the KEDA operator logs:
   ```bash
   kubectl logs -l app=keda-operator -n keda
   ```

### Metrics not appearing
1. Verify that a PodMonitor selects the predictor pods (prokube clusters ship a cluster-wide one):
   ```bash
   kubectl get podmonitor -A
   ```

2. Check whether vLLM metrics are being scraped:
   ```bash
   kubectl run curl-test --image=curlimages/curl --rm -it --restart=Never -- \
     curl -s 'http://kube-prometheus-stack-prometheus.monitoring.svc.cluster.local:9090/prometheus/api/v1/query?query={__name__=~"vllm:.*"}'
   ```

## References

- [prokube autoscaling documentation](https://docs.prokube.cloud/user_docs/model_serving_autoscaling/)
- [KServe KEDA autoscaler docs](https://kserve.github.io/website/docs/model-serving/predictive-inference/autoscaling/keda-autoscaler)
- [KEDA Prometheus scaler](https://keda.sh/docs/scalers/prometheus/)
- [vLLM metrics reference](https://docs.vllm.ai/en/latest/serving/metrics.html)