
Commit b69e04d

Update KEDA example with new insights
1 parent b1a1034 commit b69e04d

4 files changed

Lines changed: 91 additions & 261 deletions

serving/kserve-keda-autoscaling/README.md

Lines changed: 49 additions & 163 deletions
````diff
@@ -1,193 +1,79 @@
-# KServe Autoscaling with KEDA and Custom Metrics
+# KServe Autoscaling with KEDA and Custom Prometheus Metrics
 
-This example demonstrates how to autoscale KServe InferenceServices using [KEDA](https://keda.sh/) with custom Prometheus metrics. This is particularly useful for LLM inference workloads where request-based autoscaling (Knative default) is not optimal.
+This example demonstrates autoscaling a KServe InferenceService using
+[KEDA](https://keda.sh/) with custom Prometheus metrics from vLLM.
+It scales based on total token throughput rather than simple request count,
+which is better suited for LLM inference workloads.
 
-## Why Custom Metrics for LLM Autoscaling?
+For full documentation, see the
+[prokube autoscaling docs](https://docs.prokube.cloud/user_docs/model_serving_autoscaling/#keda-kubernetes-event-driven-autoscaling).
 
-Traditional request-based autoscaling doesn't work well for LLM inference because:
+## Why Token Throughput?
 
-- **Token-level work**: LLM inference operates at token level, not request level. A single request can generate hundreds of tokens.
-- **Variable latency**: Request latency varies significantly based on input/output token count.
-- **Memory pressure**: LLM models require significant GPU memory (KV cache), which fills up based on concurrent requests.
-
-Better metrics for LLM autoscaling include:
-- **Time To First Token (TTFT)**: Latency until first token is generated
-- **KV Cache utilization**: GPU memory used for attention cache
-- **Number of running/waiting requests**: Queue depth
+LLM requests vary wildly in duration depending on prompt and output length.
+Request-count metrics (concurrency, QPS) don't reflect actual GPU load.
+Token throughput stays elevated as long as the model is under pressure,
+making it a stable scaling signal.
 
 ## Prerequisites
 
-Install KEDA in the cluster:
-
-```bash
-helm repo add kedacore https://kedacore.github.io/charts
-helm install keda kedacore/keda --namespace keda --create-namespace
-```
+- KEDA installed in the cluster (`helm install keda kedacore/keda -n keda --create-namespace`)
+- Prometheus scraping vLLM metrics (prokube clusters include a cluster-wide PodMonitor)
 
 ## Files
 
 | File | Description |
 |------|-------------|
-| `inference-service.yaml` | KServe InferenceService for OPT-125M model with vLLM backend |
-| `scaled-object.yaml` | KEDA ScaledObject with TTFT, GPU cache, and request-based scaling |
-| `service-monitor.yaml` | PodMonitor for vLLM metrics collection |
-
-## Deployment
+| `inference-service.yaml` | KServe InferenceService (OPT-125M, RawDeployment mode) |
+| `scaled-object.yaml` | KEDA ScaledObject — scales on token throughput |
 
-### 1. Deploy the InferenceService
+## Quick Start
 
 ```bash
-kubectl apply -f inference-service.yaml -n <your-namespace>
-```
+export NAMESPACE="default"
 
-Wait for the model to be ready:
-```bash
-kubectl get inferenceservice opt-125m-vllm -n <your-namespace> -w
-```
-
-### 2. Configure Prometheus Metrics Collection
+# 1. Deploy the InferenceService
+kubectl apply -n $NAMESPACE -f inference-service.yaml
 
-Apply the PodMonitor to scrape vLLM metrics:
-```bash
-kubectl apply -f service-monitor.yaml -n <your-namespace>
-```
+# 2. Wait for it to become ready
+kubectl get isvc opt-125m -n $NAMESPACE -w
 
-**Note:** Update the `namespaceSelector` in `service-monitor.yaml` to match your namespace.
+# 3. Deploy the KEDA ScaledObject
+kubectl apply -n $NAMESPACE -f scaled-object.yaml
 
-### 3. Deploy KEDA ScaledObject
-
-First, identify the correct deployment name:
-```bash
-kubectl get deployments -n <your-namespace> | grep opt-125m-vllm
-```
-
-Update `scaled-object.yaml` with:
-- The correct deployment name in `scaleTargetRef`
-- Your Prometheus server URL (e.g., `http://prometheus.monitoring:9090` or with path prefix)
-- Your namespace in the Prometheus queries
-- Your InferenceService name in the pod selector (e.g., `pod=~"opt-125m-vllm-predictor-.*"`)
-
-Then apply:
-```bash
-kubectl apply -f scaled-object.yaml -n <your-namespace>
+# 4. Verify
+kubectl get scaledobject -n $NAMESPACE
+kubectl get hpa -n $NAMESPACE
 ```
 
-Verify the ScaledObject:
-```bash
-kubectl get scaledobject -n <your-namespace>
-kubectl describe scaledobject opt-125m-vllm-scaledobject -n <your-namespace>
-```
+## Customization
 
-## Autoscaling Strategies
+**Namespace and model name**: replace `default` and `opt-125m` in the
+Prometheus queries inside `scaled-object.yaml`.
 
-This example uses three triggers. KEDA evaluates all triggers and scales based on the highest desired replica count:
+**Threshold**: the `threshold: "5"` value means "scale up when each replica
+handles more than 5 tokens/second on average" (`AverageValue` divides the
+query result by replica count). Tune this based on load testing for your
+model and hardware.
 
-### 1. Time To First Token (TTFT) - P95
-Scales when the 95th percentile TTFT exceeds 200ms:
-```yaml
-triggers:
-  - type: prometheus
-    metadata:
-      query: |
-        histogram_quantile(0.95, sum(rate({"__name__"="vllm:time_to_first_token_seconds_bucket", namespace="<your-namespace>"}[2m])) by (le))
-      threshold: "0.2"
-```
+**GPU deployments**: remove `--dtype=float32` and `--max-model-len=512`
+from the InferenceService args, add GPU resource requests, and consider
+adding a second trigger for GPU KV-cache utilization:
 
-### 2. GPU KV-Cache Utilization
-Scales when GPU cache usage exceeds 70% (for GPU deployments):
 ```yaml
-triggers:
-  - type: prometheus
-    metadata:
-      query: |
-        avg({"__name__"="vllm:gpu_cache_usage_perc", namespace="<your-namespace>"})
-      threshold: "0.7"
+# Add to scaled-object.yaml triggers list
+- type: prometheus
+  metadata:
+    serverAddress: http://kube-prometheus-stack-prometheus.monitoring.svc.cluster.local:9090/prometheus
+    query: >-
+      avg(vllm:gpu_cache_usage_perc{namespace="my-namespace",model_name="my-model"})
+    metricType: AverageValue
+    threshold: "0.75"
 ```
 
-### 3. Running Requests (Fallback)
-Scales when average running requests per pod exceeds 2:
-```yaml
-triggers:
-  - type: prometheus
-    metadata:
-      query: |
-        avg({"__name__"="vllm:num_requests_running", namespace="<your-namespace>"})
-      threshold: "2"
-```
-
-## vLLM Metrics Reference
-
-vLLM exposes metrics at the `/metrics` endpoint. Note that vLLM uses colons in metric names (this is unusual but correct):
-
-| Metric | Description |
-|--------|-------------|
-| `vllm:num_requests_running` | Number of requests currently being processed |
-| `vllm:num_requests_waiting` | Number of requests waiting in queue |
-| `vllm:gpu_cache_usage_perc` | GPU KV cache utilization (0-1) |
-| `vllm:time_to_first_token_seconds` | Histogram of TTFT |
-| `vllm:time_per_output_token_seconds` | Histogram of TPOT |
-| `vllm:generation_tokens_total` | Total number of generated tokens |
-
-## Testing Autoscaling
-
-Generate load to trigger autoscaling:
-
-```bash
-# First, find the predictor service name
-kubectl get svc -n <your-namespace> | grep opt-125m-vllm
-
-# Create a load generator pod (adjust service name if needed)
-kubectl run load-gen --image=curlimages/curl -n <your-namespace> --restart=Never -- \
-  sh -c 'while true; do for i in $(seq 1 10); do curl -s -X POST "http://opt-125m-vllm-predictor-00001.<your-namespace>.svc.cluster.local/openai/v1/completions" -H "Content-Type: application/json" -d "{\"model\": \"opt-125m\", \"prompt\": \"Tell me a story\", \"max_tokens\": 200}" & done; sleep 2; done'
-```
-
-Monitor scaling:
-```bash
-# Watch HPA status
-kubectl get hpa -n <your-namespace> -w
-
-# Watch pods
-kubectl get pods -n <your-namespace> -l serving.kserve.io/inferenceservice=opt-125m-vllm -w
-```
-
-Clean up:
-```bash
-kubectl delete pod load-gen -n <your-namespace>
-```
-
-## Troubleshooting
-
-### KEDA not scaling
-1. Check ScaledObject status:
-```bash
-kubectl describe scaledobject opt-125m-vllm-scaledobject -n <your-namespace>
-```
-
-2. Verify Prometheus connectivity (some deployments use a `/prometheus` path prefix):
-```bash
-kubectl run curl-test --image=curlimages/curl --rm -it --restart=Never -- \
-  curl -s 'http://kube-prometheus-stack-prometheus.monitoring.svc.cluster.local:9090/prometheus/api/v1/query?query=up'
-```
-
-3. Check KEDA operator logs:
-```bash
-kubectl logs -l app=keda-operator -n keda
-```
-
-### Metrics not appearing
-1. Verify PodMonitor is picked up:
-```bash
-kubectl get podmonitor -n <your-namespace>
-```
-
-2. Check if vLLM metrics are being scraped:
-```bash
-kubectl run curl-test --image=curlimages/curl --rm -it --restart=Never -- \
-  curl -s 'http://kube-prometheus-stack-prometheus.monitoring.svc.cluster.local:9090/prometheus/api/v1/query?query={__name__=~"vllm:.*"}'
-```
-
 ## References
 
-- [KServe Issue #3561: Native KEDA integration](https://github.com/kserve/kserve/issues/3561)
-- [KEDA Prometheus Scaler](https://keda.sh/docs/scalers/prometheus/)
-- [vLLM Metrics](https://docs.vllm.ai/en/latest/serving/metrics.html)
+- [prokube autoscaling documentation](https://docs.prokube.cloud/user_docs/model_serving_autoscaling/)
+- [KServe KEDA autoscaler docs](https://kserve.github.io/website/docs/model-serving/predictive-inference/autoscaling/keda-autoscaler)
+- [KEDA Prometheus scaler](https://keda.sh/docs/scalers/prometheus/)
+- [vLLM metrics reference](https://docs.vllm.ai/en/latest/serving/metrics.html)
````
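The `AverageValue` arithmetic in the new Customization section works out as follows: the HPA divides the raw query result by the target, so a query returning 20 tokens/s across 2 replicas is an average of 10 tokens/s per replica, and the controller requests `ceil(20 / 5) = 4` replicas, which `maxReplicaCount: 3` then caps.

To exercise the trigger, the load-test snippet removed from the README above still applies with the new names. A sketch, assuming RawDeployment exposes the predictor as a Service named `opt-125m-predictor` on port 80 (verify with `kubectl get svc -n $NAMESPACE`):

```bash
# Hammer the completions endpoint so token throughput rises
kubectl run load-gen --image=curlimages/curl -n $NAMESPACE --restart=Never -- \
  sh -c "while true; do for i in \$(seq 1 10); do curl -s -X POST \
    'http://opt-125m-predictor.$NAMESPACE.svc.cluster.local/openai/v1/completions' \
    -H 'Content-Type: application/json' \
    -d '{\"model\": \"opt-125m\", \"prompt\": \"Tell me a story\", \"max_tokens\": 200}' & done; sleep 2; done"

# Watch the HPA that KEDA creates react to the metric
kubectl get hpa -n $NAMESPACE -w

# Clean up
kubectl delete pod load-gen -n $NAMESPACE
```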

serving/kserve-keda-autoscaling/inference-service.yaml

Lines changed: 18 additions & 6 deletions
````diff
@@ -1,15 +1,21 @@
+# KServe InferenceService for OPT-125M with vLLM backend.
+# Uses RawDeployment mode — required when scaling with KEDA.
+#
+# This example runs on CPU. For GPU, remove --dtype=float32 and
+# --max-model-len, and adjust resources to request nvidia.com/gpu.
 apiVersion: serving.kserve.io/v1beta1
 kind: InferenceService
 metadata:
-  name: opt-125m-vllm
+  name: opt-125m
   annotations:
-    huggingface.co/model-id: facebook/opt-125m
+    # RawDeployment mode — creates a plain Deployment instead of a Knative Revision.
+    serving.kserve.io/deploymentMode: "RawDeployment"
+    # Tell KServe not to create its own HPA (KEDA will manage scaling).
+    serving.kserve.io/autoscalerClass: "external"
 spec:
   predictor:
-    # Note: When using KEDA, replica limits are managed by the ScaledObject.
-    # These values serve as defaults if KEDA is not deployed.
     minReplicas: 1
-    maxReplicas: 10
+    maxReplicas: 3
     model:
       modelFormat:
         name: huggingface
@@ -18,7 +24,13 @@ spec:
         - --model_id=facebook/opt-125m
         - --backend=vllm
        - --dtype=float32
-        - --device=cpu
+        - --max-model-len=512
+      # Explicit port declaration is required in RawDeployment mode
+      # for the cluster-wide PodMonitor to discover the metrics endpoint.
+      ports:
+        - name: user-port
+          containerPort: 8080
+          protocol: TCP
       resources:
         requests:
           cpu: "2"
````
serving/kserve-keda-autoscaling/scaled-object.yaml

Lines changed: 24 additions & 65 deletions
````diff
@@ -1,85 +1,44 @@
-# KEDA ScaledObject for KServe InferenceService with vLLM backend
-# Scales based on custom Prometheus metrics from vLLM serving runtime
+# KEDA ScaledObject for KServe InferenceService with vLLM backend.
+# Scales based on total token throughput (prompt + generation) from Prometheus.
 #
 # Prerequisites:
-# - KEDA installed in cluster (https://keda.sh/docs/deploy/)
-# - Prometheus collecting vLLM metrics (see service-monitor.yaml)
+# - KEDA installed (https://keda.sh/docs/deploy/)
+# - Prometheus scraping vLLM metrics (prokube includes a cluster-wide PodMonitor)
 #
-# Note: vLLM uses colons in metric names (e.g., vllm:num_requests_running),
-# which is unusual but correct. Use {"__name__"="..."} syntax in PromQL.
-#
-# TODO: Replace the following before deploying:
-# - <your-namespace>: your actual namespace
-# - <your-prometheus-url>: your Prometheus server URL (may or may not have a path prefix)
-# - opt-125m-vllm: your InferenceService name (in queries and scaleTargetRef)
+# Before deploying, replace:
+# - "default" in the Prometheus queries with your namespace
+# - "opt-125m" in model_name with your --model_name value
+# - The serverAddress if your Prometheus uses a different URL
 #
 apiVersion: keda.sh/v1alpha1
 kind: ScaledObject
 metadata:
-  name: opt-125m-vllm-scaledobject
-  labels:
-    app: opt-125m-vllm
+  name: opt-125m-scaledobject
 spec:
-  # Target the KServe predictor deployment
   scaleTargetRef:
-    apiVersion: apps/v1
-    kind: Deployment
-    name: opt-125m-vllm-predictor-00001-deployment
-  # Polling interval for checking metrics (seconds)
-  pollingInterval: 15
-  # Cooldown period before scaling down (seconds)
-  cooldownPeriod: 60
-  # Min/max replicas
+    # In RawDeployment mode KServe names the Deployment {isvc-name}-predictor.
+    name: opt-125m-predictor
   minReplicaCount: 1
-  maxReplicaCount: 10
-  # Advanced scaling behavior
+  maxReplicaCount: 3
+  pollingInterval: 15  # how often KEDA checks the metric (seconds)
+  cooldownPeriod: 120  # seconds after the last trigger activation before scaling back to zero (relevant only when minReplicaCount is 0)
   advanced:
     horizontalPodAutoscalerConfig:
       behavior:
-        scaleDown:
-          stabilizationWindowSeconds: 120
-          policies:
-            - type: Percent
-              value: 25
-              periodSeconds: 60
         scaleUp:
           stabilizationWindowSeconds: 0
+        scaleDown:
+          stabilizationWindowSeconds: 120
           policies:
-            - type: Percent
-              value: 100
-              periodSeconds: 15
             - type: Pods
-              value: 4
-              periodSeconds: 15
-          selectPolicy: Max
+              value: 1  # remove at most 1 replica per minute
+              periodSeconds: 60
   triggers:
-  # Scale based on Time To First Token (TTFT) - P95
-  # Scale up when P95 TTFT exceeds 200ms (0.2s)
-  - type: prometheus
-    metadata:
-      # Adjust URL to your Prometheus setup (some have /prometheus path prefix)
-      serverAddress: <your-prometheus-url>
-      metricName: vllm_ttft_p95
-      query: |
-        histogram_quantile(0.95, sum(rate({"__name__"="vllm:time_to_first_token_seconds_bucket", namespace="<your-namespace>", pod=~"opt-125m-vllm-predictor-.*"}[2m])) by (le))
-      threshold: "0.2"
-      activationThreshold: "0.1"
-  # Scale based on GPU KV-cache usage (for GPU deployments)
-  # Scale up when cache usage exceeds 70%
-  - type: prometheus
-    metadata:
-      serverAddress: <your-prometheus-url>
-      metricName: vllm_gpu_cache_usage
-      query: |
-        avg({"__name__"="vllm:gpu_cache_usage_perc", namespace="<your-namespace>", pod=~"opt-125m-vllm-predictor-.*"})
-      threshold: "0.7"
-      activationThreshold: "0.5"
-  # Fallback: Scale based on running requests (always works)
   - type: prometheus
     metadata:
-      serverAddress: <your-prometheus-url>
-      metricName: vllm_num_requests_running
-      query: |
-        avg({"__name__"="vllm:num_requests_running", namespace="<your-namespace>", pod=~"opt-125m-vllm-predictor-.*"})
-      threshold: "2"
-      activationThreshold: "1"
+      serverAddress: http://kube-prometheus-stack-prometheus.monitoring.svc.cluster.local:9090/prometheus
+      query: >-
+        sum(rate(vllm:prompt_tokens_total{namespace="default",model_name="opt-125m"}[2m]))
+        + sum(rate(vllm:generation_tokens_total{namespace="default",model_name="opt-125m"}[2m]))
+      metricType: AverageValue
+      threshold: "5"
````
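Before deploying, it helps to confirm the trigger query actually returns data; the curl pattern from the removed troubleshooting section still applies, including the `/prometheus` path prefix in the serverAddress. A sketch, assuming the same Prometheus URL as above:

```bash
# Run the trigger's PromQL by hand and inspect the result
QUERY='sum(rate(vllm:prompt_tokens_total{namespace="default",model_name="opt-125m"}[2m])) + sum(rate(vllm:generation_tokens_total{namespace="default",model_name="opt-125m"}[2m]))'
kubectl run curl-test --image=curlimages/curl --rm -it --restart=Never -- \
  curl -sG 'http://kube-prometheus-stack-prometheus.monitoring.svc.cluster.local:9090/prometheus/api/v1/query' \
  --data-urlencode "query=$QUERY"
```

An empty `result` array usually means the namespace or `model_name` label doesn't match what vLLM is exporting.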
