Add KServe KEDA autoscaling example with custom metrics#41
Conversation
- InferenceService for vLLM-based model serving
- KEDA ScaledObject with multiple scaling strategies (token throughput, GPU, power)
- ServiceMonitor and PrometheusRules for metrics collection
- README with setup instructions and troubleshooting
- Switch from DistilBERT to OPT-125M model with vLLM backend
- Fix Prometheus serverAddress to include /prometheus routePrefix
- Fix metric queries to handle vLLM's colon-namespaced metrics
- Simplify ScaledObject to focus on running/waiting requests
- Update PodMonitor and PrometheusRules for vLLM metrics

Tested on cluster: autoscaling triggers correctly when load increases
- Add Time To First Token (TTFT) P95 as primary scaling metric
- Add GPU KV-cache utilization scaling (for GPU deployments)
- Keep running requests as fallback metric
- Update README to match other examples in repo
- Replace hardcoded namespace with <your-namespace> placeholder
- Fix Prometheus URL to include /prometheus prefix for prokube
- Document vLLM's colon-namespaced metrics (vllm:*)
Pull request overview
This pull request adds a comprehensive example for autoscaling KServe InferenceServices using KEDA with custom Prometheus metrics from vLLM. The example addresses the limitation of traditional request-based autoscaling for LLM workloads by implementing scaling based on Time To First Token (TTFT), GPU KV-cache utilization, and running request count.
Changes:
- Adds InferenceService configuration for OPT-125M model with vLLM backend on CPU
- Implements KEDA ScaledObject with three Prometheus-based autoscaling triggers
- Configures Prometheus monitoring with PodMonitor and recording rules for vLLM metrics
- Provides comprehensive documentation covering deployment, testing, and troubleshooting
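The trigger wiring described above can be sketched as a KEDA ScaledObject of roughly this shape. Names, replica bounds, the threshold, and the TTFT query are illustrative assumptions, not the PR's exact manifest; only the `vllm:`-prefixed metric names and the `/prometheus` URL prefix come from this example:

```yaml
# Illustrative sketch -- names, bounds, and threshold are assumptions.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: opt-125m-predictor
spec:
  scaleTargetRef:
    name: opt-125m-predictor   # the InferenceService predictor deployment
  minReplicaCount: 1
  maxReplicaCount: 5
  triggers:
    - type: prometheus
      metricType: AverageValue
      metadata:
        serverAddress: http://<your-prometheus-url>/prometheus
        # P95 TTFT from vLLM's colon-namespaced histogram metric
        query: >-
          histogram_quantile(0.95,
            sum(rate(vllm:time_to_first_token_seconds_bucket{model_name="opt-125m"}[2m])) by (le))
        threshold: "1"
```

Additional triggers (KV-cache utilization, running requests) would be appended to the same `triggers` list; KEDA evaluates all of them and uses the highest resulting replica count.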
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 15 comments.
| File | Description |
|---|---|
| serving/kserve-keda-autoscaling/inference-service.yaml | Defines KServe InferenceService for OPT-125M model with vLLM backend and CPU deployment |
| serving/kserve-keda-autoscaling/scaled-object.yaml | Configures KEDA ScaledObject with three Prometheus-based triggers for autoscaling |
| serving/kserve-keda-autoscaling/service-monitor.yaml | Sets up PodMonitor for metrics collection and PrometheusRule for recording rules |
| serving/kserve-keda-autoscaling/README.md | Provides comprehensive documentation with deployment instructions, examples, and troubleshooting |
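For orientation, the InferenceService side might look roughly like the sketch below. The image, args, and field values are assumptions for illustration, not the file reviewed here; only the model (OPT-125M), the vLLM backend, and CPU deployment come from the PR description:

```yaml
# Sketch only -- image and args are assumptions, not the PR's manifest.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: opt-125m
spec:
  predictor:
    minReplicas: 1          # keep consistent with the KEDA ScaledObject bounds
    containers:
      - name: kserve-container
        image: vllm/vllm-openai:latest   # assumed image tag
        args: ["--model", "facebook/opt-125m"]
```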
- Remove unused PrometheusRules (vLLM metrics use colons natively)
- Fix trailing whitespace in scaled-object.yaml
- Clarify that vLLM uses colons in metric names (unusual but correct)
- Add note about minReplicas/maxReplicas when using KEDA
- Add step to find predictor service name before load testing
- Remove prokube-specific reference in troubleshooting
Response to Copilot Review

Thanks for the review! I've addressed most of the feedback, but want to clarify one point where Copilot's suggestion was incorrect:

vLLM metric naming (colons vs underscores)

Copilot suggested that vLLM uses underscores in metric names (e.g. `vllm_prompt_tokens_total`), but vLLM actually uses colons in its raw metric names (e.g. `vllm:prompt_tokens_total`). I verified this directly from the running pod. This is unusual (the Prometheus convention is underscores), but it's how vLLM implements it. That's why the queries use the colon-namespaced `vllm:*` names. As a result, I've:
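The colon-vs-underscore point is easy to check programmatically against a `/metrics` payload. The helper and the sample payload below are hypothetical sketches, but the `vllm:`-prefixed names match the ones used in this example's queries:

```python
import re

# Hypothetical sample of a vLLM /metrics exposition payload.
SAMPLE_METRICS = """\
# HELP vllm:prompt_tokens_total Number of prefill tokens processed.
# TYPE vllm:prompt_tokens_total counter
vllm:prompt_tokens_total{model_name="opt-125m"} 1024.0
vllm:generation_tokens_total{model_name="opt-125m"} 2048.0
"""

def metric_names(exposition_text):
    """Extract metric names from Prometheus exposition text.
    Colons are legal in Prometheus metric names, so the regex allows them."""
    names = set()
    for line in exposition_text.splitlines():
        if line.startswith("#") or not line.strip():
            continue  # skip HELP/TYPE comments and blank lines
        m = re.match(r"([a-zA-Z_:][a-zA-Z0-9_:]*)", line)
        if m:
            names.add(m.group(1))
    return names

names = metric_names(SAMPLE_METRICS)
```

Running this against a real pod's `/metrics` output would show the colon-namespaced names directly.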
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 9 comments.
- Fix KEDA trigger description (evaluates all, uses highest replica count)
- Make Prometheus URL configurable (<your-prometheus-url> placeholder)
- Add pod selector to queries to avoid cross-InferenceService metric aggregation
- Update README with additional configuration steps
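The pod-selector change could look like the fragment below. The pod name regex is an assumption and must match your own InferenceService's predictor pods; without it, metrics from other InferenceServices in the namespace would be summed into the same value:

```yaml
# Sketch -- the pod regex "opt-125m-predictor-.*" is an assumption.
query: >-
  sum(rate(vllm:prompt_tokens_total{pod=~"opt-125m-predictor-.*",model_name="opt-125m"}[2m]))
  + sum(rate(vllm:generation_tokens_total{pod=~"opt-125m-predictor-.*",model_name="opt-125m"}[2m]))
```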
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
AverageValue divides total token throughput by replica count, which means the per-replica value halves after a scale-up event. With stabilizationWindowSeconds: 0 this could cause flapping near the threshold. Setting it to 30s requires the metric to stay above threshold for two consecutive polling intervals before a scale-up is committed, while the existing 120s scaleDown window prevents premature scale-down.
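In a KEDA ScaledObject this maps onto the `advanced.horizontalPodAutoscalerConfig` section. The fragment below is a sketch using the window values from the discussion above:

```yaml
advanced:
  horizontalPodAutoscalerConfig:
    behavior:
      scaleUp:
        stabilizationWindowSeconds: 30    # metric must exceed threshold for two consecutive polls
      scaleDown:
        stabilizationWindowSeconds: 120   # existing window preventing premature scale-down
```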
hsteude left a comment
Thanks Igor, left comments here and there :)
## Quick Start

```bash
export NAMESPACE="default"
```
From within a notebook, this section would also work without specifying the namespace, which would make it slightly easier to run. However, I'm not sure if the Prometheus query can be adjusted accordingly... (see below)
adjusted and fixed for simplicity
query: >-
  sum(rate(vllm:prompt_tokens_total{namespace="default",model_name="opt-125m"}[2m]))
  + sum(rate(vllm:generation_tokens_total{namespace="default",model_name="opt-125m"}[2m]))
metricType: AverageValue
Can we find a way to do this without specifying the namespace here? Ideally a new user doesn't have to edit the files at all.
adjusted and fixed for simplicity. The caveat is that if multiple users run this example at the same time, their metrics will be aggregated together, so maybe we should recommend adjusting the queries, or think about some other approach.
UPD: Added a comment to the readme.
metricType: AverageValue
threshold: "0.75"
So what do I do next in order to see this in action? As a user I'd like to know a) how I send requests, b) how I send so many that it actually scales up, and c) how I can see that it actually did scale up :)
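One way to generate enough concurrent load to trigger a scale-up is a small driver like the sketch below. The function is a hypothetical helper (not part of the PR); the endpoint URL and request body in the comment are assumptions and depend on your predictor service name:

```python
import concurrent.futures

def generate_load(send_request, total_requests=200, concurrency=20):
    """Fire total_requests calls through send_request with bounded
    concurrency, returning the list of results (e.g. HTTP status codes)."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(lambda _: send_request(), range(total_requests)))

# In a real run, send_request would POST to the predictor, e.g. (assumed URL/body):
#   import requests
#   url = "http://<predictor-service>/v1/completions"
#   send = lambda: requests.post(url, json={"model": "opt-125m",
#                                           "prompt": "hello",
#                                           "max_tokens": 32}).status_code
#   generate_load(send)
```

While the load runs, `kubectl get pods -w` in the target namespace shows new predictor replicas appearing, which answers part c).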
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.
Summary
Adds an example for autoscaling KServe InferenceServices using KEDA with custom Prometheus metrics from vLLM. This addresses the need for better LLM autoscaling beyond simple request-based scaling.
Features
Files Added
- serving/kserve-keda-autoscaling/inference-service.yaml - Example InferenceService with OPT-125M model
- serving/kserve-keda-autoscaling/scaled-object.yaml - KEDA ScaledObject with multiple triggers
- serving/kserve-keda-autoscaling/service-monitor.yaml - PodMonitor and PrometheusRules for metrics
- serving/kserve-keda-autoscaling/README.md - Documentation

Testing
Tested on prokube cluster:
References