-
Notifications
You must be signed in to change notification settings - Fork 1
Add KServe KEDA autoscaling example with custom metrics #41
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
bcf91f8
03fc3cd
471ba50
dcfea4f
93ae429
b1a1034
b69e04d
b3816d7
81e4b82
f2e6f96
75c9d0a
392a22a
35ef20d
bd2a6e4
ae78abb
d0635cd
5fec793
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,266 @@ | ||
| # KServe Autoscaling with KEDA and Custom Prometheus Metrics | ||
|
|
||
| This example demonstrates autoscaling a KServe InferenceService using | ||
| [KEDA](https://keda.sh/) with custom Prometheus metrics from vLLM. | ||
| It scales based on total token throughput rather than simple request count, | ||
| which is better suited for LLM inference workloads. | ||
|
|
||
| For full documentation, see the | ||
| [prokube autoscaling docs](https://docs.prokube.cloud/user_docs/model_serving_autoscaling/#keda-kubernetes-event-driven-autoscaling). | ||
|
|
||
| ## Architecture | ||
|
|
||
| ```mermaid | ||
| flowchart LR | ||
|
|
||
| %% ---------- STYLES ---------- | ||
| classDef kserve fill:#E8F0FE,stroke:#1A73E8,stroke-width:2px,color:#0B3D91 | ||
| classDef pod fill:#F1F8E9,stroke:#558B2F,stroke-width:2px | ||
| classDef infra fill:#FFF8E1,stroke:#FF8F00,stroke-width:2px | ||
| classDef traffic fill:#FCE4EC,stroke:#C2185B,stroke-width:2px | ||
|
|
||
| %% ---------- KSERVE ---------- | ||
| subgraph KServe["<b> KServe </b>"] | ||
| direction TB | ||
|
|
||
| ISVC["<b>InferenceService</b><br/>opt-125m"]:::kserve | ||
| DEP["<b>Deployment</b><br/>opt-125m-predictor"]:::kserve | ||
|
|
||
| subgraph POD["Predictor <b>Pod</b> ×1–3"] | ||
| CTR["<b>kserve-container</b><br/>HuggingFace runtime · vLLM engine<br/>facebook/opt-125m :8080"]:::pod | ||
| end | ||
|
|
||
| ISVC -->|creates| DEP | ||
| DEP -->|manages| CTR | ||
| end | ||
|
|
||
| %% ---------- OBSERVABILITY ---------- | ||
| subgraph Observability["<b> Observability </b>"] | ||
| PROM[("Prometheus")]:::infra | ||
| end | ||
|
|
||
| %% ---------- KEDA ---------- | ||
| subgraph KEDA["<b> KEDA Autoscaling </b>"] | ||
| direction TB | ||
| SO["<b>ScaledObject</b><br/>Prometheus trigger<br/>threshold: 5 tok/s"]:::infra | ||
| HPA["<b>HorizontalPodAutoscaler</b>"]:::infra | ||
|
|
||
| SO -->|creates & drives| HPA | ||
| end | ||
|
|
||
| %% ---------- LOAD ---------- | ||
| LG["⚡ load-generator.py"]:::traffic | ||
|
|
||
| %% ---------- FLOWS ---------- | ||
| CTR -->|/metrics| PROM | ||
| PROM -->|query every 15s| SO | ||
| HPA -->|scales| DEP | ||
| SO -. targets .-> DEP | ||
| LG -->|POST /completions| CTR | ||
| ``` | ||
|
|
||
| ## Why Token Throughput? | ||
|
|
||
| LLM requests vary wildly in duration depending on prompt and output length. | ||
| Request-count metrics (concurrency, QPS) don't reflect actual GPU load. | ||
| Token throughput stays elevated as long as the model is under pressure, | ||
| making it a stable scaling signal. | ||
|
|
||
| ## Prerequisites | ||
|
|
||
| - KEDA installed in the cluster — not available in all prokube clusters by default; see step 3 below | ||
| - Prometheus scraping vLLM metrics (prokube clusters include a cluster-wide PodMonitor) | ||
|
|
||
| ## Files | ||
|
|
||
| | File | Description | | ||
| |------|-------------| | ||
| | `inference-service.yaml` | KServe InferenceService (OPT-125M, RawDeployment mode) | | ||
| | `scaled-object.yaml` | KEDA ScaledObject — scales on token throughput | | ||
| | `load-generator.py` | Python load generator with presets for different scaling scenarios | | ||
|
|
||
| ## Quick Start | ||
|
|
||
| > [!NOTE] | ||
| > All of the examples below should be run in prokube notebook's terminal inside your cluster. The model created with RawDeployment is not accessible from outside the cluster by default. | ||
|
|
||
| ```bash | ||
| # 1. Deploy the InferenceService | ||
| kubectl apply -f inference-service.yaml | ||
|
|
||
| # 2. Wait for it to become ready | ||
| kubectl get isvc opt-125m -w | ||
|
|
||
| # 3. Deploy the KEDA ScaledObject | ||
| kubectl apply -f scaled-object.yaml | ||
|
|
||
| # 4. Verify | ||
| kubectl get scaledobject | ||
| kubectl get hpa | ||
| ``` | ||
|
|
||
| > **NOTE 1.** | ||
| > If step 3 fails with `no matches for kind "ScaledObject"`, KEDA is not installed in your cluster. | ||
| > Ask your admin to enable it. | ||
|
|
||
|
|
||
| > **NOTE 2.** | ||
| > The Prometheus query in `scaled-object.yaml` has no `namespace` filter, so it aggregates token | ||
| > throughput across **all namespaces**. This is fine for testing, but if multiple users deploy a | ||
| > model named `opt-125m` at the same time, their metrics will interfere and autoscaling will be | ||
| > incorrect for both. For any real use, add a namespace filter to both queries in `scaled-object.yaml`: | ||
| > | ||
| > ```yaml | ||
| > query: >- | ||
| > sum(rate(vllm:prompt_tokens_total{namespace="<your-namespace>",model_name="opt-125m"}[2m])) | ||
| > + sum(rate(vllm:generation_tokens_total{namespace="<your-namespace>",model_name="opt-125m"}[2m])) | ||
| > ``` | ||
|
|
||
| ## See It in Action | ||
|
|
||
| After deploying, you can trigger autoscaling and observe the full scale-up / scale-down cycle. | ||
|
|
||
| ### 1. Send a test inference request | ||
|
|
||
| Get the internal cluster address and send a request: | ||
|
|
||
| ```bash | ||
| # inference service name + "-predictor" | ||
| SERVICE_URL=opt-125m-predictor | ||
|
|
||
| curl -s "$SERVICE_URL/openai/v1/completions" \ | ||
| -H "Content-Type: application/json" \ | ||
| -d '{"model":"opt-125m","prompt":"What is AI?","max_tokens":64}' \ | ||
| | python -c 'import json,sys;print("\n", json.load(sys.stdin)["choices"][0]["text"].strip(), "\n")' | ||
| ``` | ||
|
|
||
| ### 2. Generate load to trigger scale-up | ||
|
|
||
| Use the included load generator to produce controlled, sustained load. | ||
| It has two presets calibrated for the opt-125m model on CPU: | ||
|
|
||
| | Mode | Workers | Sleep | Throughput | Scaling behavior | | ||
| |------|---------|-------|------------|------------------| | ||
| | `stable-2` | 1 | 8s | ~8 tok/s | Scales to 2 replicas and holds | | ||
| | `stable-3` | 2 | 2s | ~22 tok/s | Scales to 3 replicas and holds | | ||
|
|
||
| ```bash | ||
| # Scale to 2 replicas (moderate load) | ||
| python load-generator.py --mode stable-2 | ||
|
|
||
| # Scale to 3 replicas (heavy load) | ||
| python load-generator.py --mode stable-3 | ||
|
|
||
| # Custom: pick your own concurrency and pacing | ||
| python load-generator.py --mode custom --workers 3 --sleep 1.0 | ||
| ``` | ||
|
|
||
| Press `Ctrl+C` to stop the load at any time. By default the script runs for | ||
| 10 minutes; override with `--duration`. | ||
|
|
||
| ### 3. Observe autoscaling | ||
|
|
||
| You can use dashboards (recommended, see below) or get a compact summary in terminal: | ||
|
|
||
| ```bash | ||
| # polls every 10 seconds | ||
| watch -n10 ' | ||
| echo "Deployment:" | ||
| kubectl get deployment opt-125m-predictor | ||
|
|
||
| echo | ||
| echo "Autoscaler:" | ||
| kubectl get hpa keda-hpa-opt-125m-scaledobject | ||
| ' | ||
| ``` | ||
|
|
||
| **Grafana dashboards** (prokube clusters): to visualize token throughput and replica count over time, see: | ||
|
|
||
| - General vLLM Dashboard: | ||
| https://YOUR_DOMAIN/grafana/d/vllm-general/vllm | ||
|
|
||
| - vLLM Performance Statistics: | ||
| https://YOUR_DOMAIN/grafana/d/performance-statistics/vllm-performance-statistics | ||
|
|
||
| - vLLM Query Statistics: | ||
| https://YOUR_DOMAIN/grafana/d/query-statistics4/vllm-query-statistics | ||
|
|
||
| - Replica count and CPU load (you have to select your namespace/workload manually): | ||
| https://YOUR_DOMAIN/grafana/d/a164a7f0339f99e89cea5cb47e9be617/kubernetes-compute-resources-workload | ||
|
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. you have to select your namespace and workload manually, I don't think there's a way around that :) creating a separate dashboard that does the same but for KServe specifically sounds like an overkill we can add a hint to the README though?
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. added a note to readme |
||
|
|
||
| Replace `YOUR_DOMAIN` with your cluster domain. | ||
|
|
||
| ### Expected behavior | ||
|
|
||
| **Stable-2 mode** (~8 tok/s): | ||
| 1. **1 replica** at rest | ||
| 2. Load applied — metric rises to ~8 tok/s — `ceil(8/5) = 2` replicas needed | ||
| 3. **Scaled to 2 replicas** within ~1 minute | ||
| 4. Metric stabilizes at ~4 tok/s per replica (below threshold) — stays at 2 | ||
| 5. Load removed — metric drops to 0 — cooldown period (120s) + stabilization window (120s) | ||
| 6. **Scaled back to 1** replica | ||
|
|
||
| **Stable-3 mode** (~22 tok/s): | ||
| 1. **1 replica** at rest | ||
| 2. Load applied — metric rises quickly — `ceil(22/5) = 5`, capped at `maxReplicas=3` | ||
| 3. **Scaled to 3 replicas** within ~1-2 minutes | ||
| 4. Load removed — gradual scale-down: 3 → 2 → 1 (one pod removed per minute) | ||
|
|
||
| ## Scaling Math | ||
|
|
||
| The ScaledObject uses `metricType: AverageValue` with `threshold: 5`. For | ||
| external metrics, HPA computes: | ||
|
|
||
| ``` | ||
| desiredReplicas = ceil(totalMetricValue / threshold) | ||
| ``` | ||
|
|
||
| | Total tok/s | Desired replicas | Actual (capped 1-3) | | ||
| |-------------|------------------|---------------------| | ||
| | 0-5 | 1 | 1 | | ||
| | 5.1-10 | 2 | 2 | | ||
| | 10.1-15 | 3 | 3 | | ||
| | 15+ | 4+ | 3 (maxReplicas) | | ||
|
|
||
| ## Customization | ||
|
|
||
| **Model name**: the `model_name="opt-125m"` filter in the Prometheus queries inside | ||
| `scaled-object.yaml` must match the `--model_name` argument in `inference-service.yaml`. | ||
|
|
||
| **Threshold**: the `threshold: "5"` value means "scale up when each replica | ||
| handles more than 5 tokens/second on average" (`AverageValue` divides the | ||
| query result by replica count). Tune this based on load testing for your | ||
| model and hardware. | ||
|
|
||
| **Load generator presets**: the presets in `load-generator.py` are calibrated for | ||
| opt-125m on CPU. If you change the model, hardware, or threshold, you'll need to | ||
| recalibrate. Use `--mode custom` to experiment, and watch the Prometheus metric: | ||
|
|
||
| ```bash | ||
| # Check the actual metric value KEDA sees | ||
| kubectl run prom-check --rm -it --restart=Never --image=curlimages/curl -- \ | ||
| -s 'http://kube-prometheus-stack-prometheus.monitoring.svc.cluster.local:9090/prometheus/api/v1/query' \ | ||
| --data-urlencode 'query=sum(rate(vllm:prompt_tokens_total{model_name="opt-125m"}[2m])) + sum(rate(vllm:generation_tokens_total{model_name="opt-125m"}[2m]))' | ||
| ``` | ||
|
|
||
| **GPU deployments**: remove `--dtype=float32` and `--max-model-len=512` | ||
| from the InferenceService args, add GPU resource requests, and consider | ||
| adding a second trigger for GPU KV-cache utilization: | ||
|
|
||
| ```yaml | ||
| # Add to scaled-object.yaml triggers list | ||
| - type: prometheus | ||
| metadata: | ||
| serverAddress: http://kube-prometheus-stack-prometheus.monitoring.svc.cluster.local:9090/prometheus | ||
| query: >- | ||
| avg(vllm:gpu_cache_usage_perc{model_name="my-model"}) | ||
| metricType: AverageValue | ||
| threshold: "0.75" | ||
| ``` | ||
|
|
||
tmvfb marked this conversation as resolved.
Show resolved
Hide resolved
|
||
| ## References | ||
|
|
||
| - [prokube autoscaling documentation](https://docs.prokube.cloud/user_docs/model_serving_autoscaling/) | ||
| - [KServe KEDA autoscaler docs](https://kserve.github.io/website/docs/model-serving/predictive-inference/autoscaling/keda-autoscaler) | ||
| - [KEDA Prometheus scaler](https://keda.sh/docs/scalers/prometheus/) | ||
| - [vLLM metrics reference](https://docs.vllm.ai/en/latest/serving/metrics.html) | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,40 @@ | ||
| # KServe InferenceService for OPT-125M with vLLM backend. | ||
| # Uses RawDeployment mode — required when scaling with KEDA. | ||
| # | ||
| # This example runs on CPU. For GPU, remove --dtype=float32 and | ||
| # --max-model-len, and adjust resources to request nvidia.com/gpu. | ||
| apiVersion: serving.kserve.io/v1beta1 | ||
| kind: InferenceService | ||
| metadata: | ||
| name: opt-125m | ||
| annotations: | ||
| # RawDeployment mode — creates a plain Deployment instead of a Knative Revision. | ||
| serving.kserve.io/deploymentMode: "RawDeployment" | ||
| # Tell KServe not to create its own HPA (KEDA will manage scaling). | ||
| serving.kserve.io/autoscalerClass: "external" | ||
| spec: | ||
| predictor: | ||
| minReplicas: 1 | ||
| maxReplicas: 3 | ||
| model: | ||
| modelFormat: | ||
| name: huggingface | ||
| args: | ||
| - --model_name=opt-125m | ||
| - --model_id=facebook/opt-125m | ||
| - --backend=vllm | ||
| - --dtype=float32 | ||
| - --max-model-len=512 | ||
| # Explicit port declaration is required in RawDeployment mode | ||
| # for the cluster-wide PodMonitor to discover the metrics endpoint. | ||
| ports: | ||
| - name: user-port | ||
| containerPort: 8080 | ||
| protocol: TCP | ||
| resources: | ||
| requests: | ||
| cpu: "2" | ||
| memory: 4Gi | ||
| limits: | ||
| cpu: "4" | ||
| memory: 8Gi | ||
tmvfb marked this conversation as resolved.
Show resolved
Hide resolved
|
||

There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great in general, but some plots saying 'no data"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is a good question, this panel is always empty and I never questioned it. This is specific not to this PR, but to the dashboard. I will investigate (maybe KServe runtime version does not emit these metrics?), but I don't think it's a blocker for this PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Confirmed, this metric is just not exposed by our runtime version. However, users might use runtimes where this is available, so maybe it's better to leave it there