From bcf91f8760973ccafa86dfb8d3924285d26b3516 Mon Sep 17 00:00:00 2001
From: hsteude
Date: Mon, 16 Feb 2026 18:14:06 +0100
Subject: [PATCH 01/17] Add KServe KEDA autoscaling example with custom Prometheus metrics

- InferenceService for vLLM-based model serving
- KEDA ScaledObject with multiple scaling strategies (token throughput, GPU, power)
- ServiceMonitor and PrometheusRules for metrics collection
- README with setup instructions and troubleshooting
---
 serving/kserve-keda-autoscaling/README.md     | 206 ++++++++++++++++++
 .../inference-service.yaml                    |  27 +++
 .../scaled-object.yaml                        | 123 +++++++++++
 .../service-monitor.yaml                      |  64 ++++++
 4 files changed, 420 insertions(+)
 create mode 100644 serving/kserve-keda-autoscaling/README.md
 create mode 100644 serving/kserve-keda-autoscaling/inference-service.yaml
 create mode 100644 serving/kserve-keda-autoscaling/scaled-object.yaml
 create mode 100644 serving/kserve-keda-autoscaling/service-monitor.yaml

diff --git a/serving/kserve-keda-autoscaling/README.md b/serving/kserve-keda-autoscaling/README.md
new file mode 100644
index 0000000..c5efc24
--- /dev/null
+++ b/serving/kserve-keda-autoscaling/README.md
@@ -0,0 +1,206 @@
+# KServe Autoscaling with KEDA and Custom Metrics
+
+This example demonstrates how to autoscale KServe InferenceServices using [KEDA](https://keda.sh/) with custom Prometheus metrics. This is particularly useful for LLM inference workloads where request-based autoscaling (Knative default) is not optimal.
+
+## Why Custom Metrics for LLM Autoscaling?
+
+Traditional request-based autoscaling doesn't work well for LLM inference because:
+
+- **Token-level work**: LLM inference operates at token level, not request level. A single request can generate hundreds of tokens.
+- **Variable latency**: Request latency varies significantly based on input/output token count.
+- **Memory pressure**: LLM models require significant GPU memory (KV cache), which fills up based on concurrent requests.
+
+Better metrics for LLM autoscaling include:
+- **Token throughput**: Tokens generated per second
+- **Time To First Token (TTFT)**: Latency until first token is generated
+- **Time Per Output Token (TPOT)**: Average time per generated token
+- **KV Cache utilization**: GPU memory used for attention cache
+- **Number of running/waiting requests**: Queue depth
+
+## Prerequisites
+
+1. **KEDA** installed in the cluster:
+   ```bash
+   helm repo add kedacore https://kedacore.github.io/charts
+   helm install keda kedacore/keda --namespace keda --create-namespace
+   ```
+
+2. **Prometheus** (kube-prometheus-stack recommended):
+   ```bash
+   helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
+   helm install prometheus prometheus-community/kube-prometheus-stack --namespace monitoring --create-namespace
+   ```
+
+3. **KServe** with HuggingFace/vLLM runtime configured
+
+4. **HuggingFace Token** (optional, for gated models):
+   ```bash
+   kubectl create secret generic hf-secret --from-literal=HF_TOKEN= -n developer1
+   ```
+
+## Files
+
+| File | Description |
+|------|-------------|
+| `inference-service.yaml` | KServe InferenceService for Qwen2.5-0.5B model |
+| `scaled-object.yaml` | KEDA ScaledObject with multiple autoscaling strategies |
+| `service-monitor.yaml` | ServiceMonitor, PodMonitor, and PrometheusRules for metrics collection |
+
+## Deployment
+
+### 1. Deploy the InferenceService
+
+```bash
+kubectl apply -f inference-service.yaml -n developer1
+```
+
+Wait for the model to be ready:
+```bash
+kubectl get inferenceservice qwen25-05b -n developer1 -w
+```
+
+### 2. Configure Prometheus Metrics Collection
+
+Apply the ServiceMonitor to scrape vLLM metrics:
+```bash
+kubectl apply -f service-monitor.yaml -n developer1
+```
+
+Verify metrics are being scraped:
+```bash
+# Port-forward Prometheus
+kubectl port-forward svc/prometheus-kube-prometheus-prometheus 9090:9090 -n monitoring
+
+# Query for vLLM metrics
+curl -s 'http://localhost:9090/api/v1/query?query=vllm_num_requests_running' | jq .
+```
+
+### 3. Deploy KEDA ScaledObject
+
+First, identify the correct deployment name:
+```bash
+kubectl get deployments -n developer1 | grep qwen25-05b
+```
+
+Update `scaled-object.yaml` with the correct deployment name, then apply:
+```bash
+kubectl apply -f scaled-object.yaml -n developer1
+```
+
+Verify the ScaledObject:
+```bash
+kubectl get scaledobject -n developer1
+kubectl describe scaledobject qwen25-05b-scaledobject -n developer1
+```
+
+## Autoscaling Strategies
+
+This example includes three ScaledObject variants:
+
+### 1. Token Throughput Based (Default)
+Scales based on average token generation throughput and number of running requests:
+```yaml
+triggers:
+  - type: prometheus
+    metadata:
+      query: avg(rate(vllm:generation_tokens_total[1m]))
+      threshold: "100"
+```
+
+### 2. GPU Utilization Based
+Scales based on GPU memory utilization (requires DCGM exporter):
+```yaml
+triggers:
+  - type: prometheus
+    metadata:
+      query: avg(DCGM_FI_DEV_MEM_COPY_UTIL{pod=~"qwen25-05b-predictor.*"})
+      threshold: "80"
+```
+
+### 3. Power Consumption Based
+Scales based on power consumption metrics from [Kepler](https://github.com/sustainable-computing-io/kepler):
+```yaml
+triggers:
+  - type: prometheus
+    metadata:
+      query: sum(rate(kepler_container_joules_total{container_name=~"qwen25-05b.*"}[5m]))
+      threshold: "100"
+```
+
+## vLLM Metrics Reference
+
+vLLM exposes metrics at `/metrics` endpoint:
+
+| Metric | Description |
+|--------|-------------|
+| `vllm_num_requests_running` | Number of requests currently being processed |
+| `vllm_num_requests_waiting` | Number of requests waiting in queue |
+| `vllm_gpu_cache_usage_perc` | GPU KV cache utilization percentage |
+| `vllm_generation_tokens_total` | Total number of generated tokens |
+| `vllm_time_to_first_token_seconds` | Histogram of TTFT |
+| `vllm_time_per_output_token_seconds` | Histogram of TPOT |
+
+## Testing Autoscaling
+
+Generate load to trigger autoscaling:
+
+```bash
+# Get the inference URL
+ISVC_URL=$(kubectl get inferenceservice qwen25-05b -n developer1 -o jsonpath='{.status.url}')
+
+# Send requests in a loop
+for i in {1..100}; do
+  curl -X POST "${ISVC_URL}/v1/completions" \
+    -H "Content-Type: application/json" \
+    -d '{
+      "model": "qwen25-05b",
+      "prompt": "Write a long story about",
+      "max_tokens": 500
+    }' &
+done
+```
+
+Monitor scaling:
+```bash
+# Watch replica count
+kubectl get deployment -n developer1 -w
+
+# Check KEDA metrics
+kubectl get hpa -n developer1
+```
+
+## Troubleshooting
+
+### KEDA not scaling
+1. Check ScaledObject status:
+   ```bash
+   kubectl describe scaledobject qwen25-05b-scaledobject -n developer1
+   ```
+
+2. Verify Prometheus connectivity:
+   ```bash
+   kubectl run curl-test --image=curlimages/curl --rm -it -- \
+     curl -s 'http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090/api/v1/query?query=up'
+   ```
+
+3. Check KEDA operator logs:
+   ```bash
+   kubectl logs -l app=keda-operator -n keda
+   ```
+
+### Metrics not appearing
+1. Verify ServiceMonitor is picked up:
+   ```bash
+   kubectl get servicemonitor -n developer1
+   ```
+
+2. Check Prometheus targets:
+   - Open Prometheus UI -> Status -> Targets
+   - Look for `serviceMonitor/developer1/qwen25-05b-metrics`
+
+## References
+
+- [KServe Issue #3561: Native KEDA integration](https://github.com/kserve/kserve/issues/3561)
+- [KEDA Prometheus Scaler](https://keda.sh/docs/scalers/prometheus/)
+- [vLLM Metrics](https://docs.vllm.ai/en/latest/serving/metrics.html)
+- [Kepler: Kubernetes Energy Metering](https://github.com/sustainable-computing-io/kepler)
diff --git a/serving/kserve-keda-autoscaling/inference-service.yaml b/serving/kserve-keda-autoscaling/inference-service.yaml
new file mode 100644
index 0000000..f3b3805
--- /dev/null
+++ b/serving/kserve-keda-autoscaling/inference-service.yaml
@@ -0,0 +1,27 @@
+apiVersion: serving.kserve.io/v1beta1
+kind: InferenceService
+metadata:
+  name: distilbert-cpu
+  annotations:
+    # Model info for documentation
+    huggingface.co/model-id: distilbert-base-uncased-finetuned-sst-2-english
+spec:
+  predictor:
+    # KEDA will handle scaling, but we still set bounds
+    minReplicas: 1
+    maxReplicas: 10
+    scaleTarget: 1
+    scaleMetric: concurrency
+    model:
+      modelFormat:
+        name: huggingface
+      args:
+        - --model_name=distilbert
+        - --model_id=distilbert-base-uncased-finetuned-sst-2-english
+      resources:
+        requests:
+          cpu: "2"
+          memory: 4Gi
+        limits:
+          cpu: "4"
+          memory: 8Gi
diff --git a/serving/kserve-keda-autoscaling/scaled-object.yaml b/serving/kserve-keda-autoscaling/scaled-object.yaml
new file mode 100644
index 0000000..f9eae2d
--- /dev/null
+++ b/serving/kserve-keda-autoscaling/scaled-object.yaml
@@ -0,0 +1,123 @@
+# KEDA ScaledObject for KServe InferenceService
+# Scales based on custom Prometheus metrics from vLLM/HuggingFace serving runtime
+#
+# Prerequisites:
+# - KEDA installed in cluster (https://keda.sh/docs/deploy/)
+# - Prometheus collecting vLLM metrics
+# - ServiceMonitor configured (see service-monitor.yaml)
+#
+apiVersion: keda.sh/v1alpha1
+kind: ScaledObject
+metadata:
+  name: qwen25-05b-scaledobject
+  labels:
+    app: qwen25-05b
+spec:
+  # Target the KServe predictor deployment
+  # KServe creates a deployment with naming pattern: -predictor-
+  scaleTargetRef:
+    apiVersion: apps/v1
+    kind: Deployment
+    name: qwen25-05b-predictor-00001-deployment
+  # Polling interval for checking metrics (seconds)
+  pollingInterval: 15
+  # Cooldown period before scaling down (seconds)
+  cooldownPeriod: 60
+  # Min/max replicas
+  minReplicaCount: 1
+  maxReplicaCount: 10
+  # Advanced scaling behavior
+  advanced:
+    horizontalPodAutoscalerConfig:
+      behavior:
+        scaleDown:
+          stabilizationWindowSeconds: 120
+          policies:
+            - type: Percent
+              value: 25
+              periodSeconds: 60
+        scaleUp:
+          stabilizationWindowSeconds: 0
+          policies:
+            - type: Percent
+              value: 100
+              periodSeconds: 15
+            - type: Pods
+              value: 4
+              periodSeconds: 15
+          selectPolicy: Max
+  triggers:
+    # Scale based on average token throughput per second
+    - type: prometheus
+      metadata:
+        serverAddress: http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090
+        metricName: vllm_avg_generation_throughput
+        # Average tokens generated per second across all pods
+        query: |
+          avg(rate(vllm:generation_tokens_total[1m]))
+        threshold: "100"
+        activationThreshold: "10"
+    # Alternative: Scale based on number of running requests
+    - type: prometheus
+      metadata:
+        serverAddress: http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090
+        metricName: vllm_num_requests_running
+        # Average number of running requests per pod
+        query: |
+          avg(vllm:num_requests_running{model_name="qwen25-05b"})
+        threshold: "5"
+        activationThreshold: "1"
+---
+# Alternative ScaledObject using GPU utilization (if using GPUs)
+apiVersion: keda.sh/v1alpha1
+kind: ScaledObject
+metadata:
+  name: qwen25-05b-gpu-scaledobject
+  labels:
+    app: qwen25-05b
+spec:
+  scaleTargetRef:
+    apiVersion: apps/v1
+    kind: Deployment
+    name: qwen25-05b-predictor-00001-deployment
+  pollingInterval: 15
+  cooldownPeriod: 120
+  minReplicaCount: 1
+  maxReplicaCount: 10
+  triggers:
+    # Scale based on GPU memory utilization (requires DCGM exporter)
+    - type: prometheus
+      metadata:
+        serverAddress: http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090
+        metricName: dcgm_gpu_memory_used_percent
+        query: |
+          avg(DCGM_FI_DEV_MEM_COPY_UTIL{pod=~"qwen25-05b-predictor.*"})
+        threshold: "80"
+        activationThreshold: "20"
+---
+# Alternative ScaledObject using Kepler power consumption metrics
+apiVersion: keda.sh/v1alpha1
+kind: ScaledObject
+metadata:
+  name: qwen25-05b-power-scaledobject
+  labels:
+    app: qwen25-05b
+spec:
+  scaleTargetRef:
+    apiVersion: apps/v1
+    kind: Deployment
+    name: qwen25-05b-predictor-00001-deployment
+  pollingInterval: 30
+  cooldownPeriod: 180
+  minReplicaCount: 1
+  maxReplicaCount: 10
+  triggers:
+    # Scale based on power consumption (requires Kepler)
+    - type: prometheus
+      metadata:
+        serverAddress: http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090
+        metricName: kepler_container_joules
+        query: |
+          sum(rate(kepler_container_joules_total{container_name=~"qwen25-05b.*"}[5m]))
+        threshold: "100"
+        activationThreshold: "10"
diff --git a/serving/kserve-keda-autoscaling/service-monitor.yaml b/serving/kserve-keda-autoscaling/service-monitor.yaml
new file mode 100644
index 0000000..fcd8db1
--- /dev/null
+++ b/serving/kserve-keda-autoscaling/service-monitor.yaml
@@ -0,0 +1,64 @@
+# ServiceMonitor to scrape HuggingFace runtime metrics from KServe InferenceService
+# This enables Prometheus to collect the metrics used by KEDA for autoscaling
+#
+# Prerequisites:
+# - Prometheus Operator installed (kube-prometheus-stack)
+# - HuggingFace runtime exposes metrics on port 8080 at /metrics endpoint
+#
+apiVersion: monitoring.coreos.com/v1
+kind: PodMonitor
+metadata:
+  name: distilbert-cpu-metrics
+  labels:
+    app: distilbert-cpu
+    # Label to match Prometheus Operator's podMonitorSelector
+    release: kube-prometheus-stack
+spec:
+  selector:
+    matchLabels:
+      serving.kserve.io/inferenceservice: distilbert-cpu
+  namespaceSelector:
+    matchNames:
+      - developer1
+  podMetricsEndpoints:
+    - port: user-port  # HuggingFace runtime metrics port
+      path: /metrics
+      interval: 15s
+      scrapeTimeout: 10s
+---
+# PrometheusRule for creating recording rules
+# These pre-aggregate metrics for more efficient KEDA queries
+apiVersion: monitoring.coreos.com/v1
+kind: PrometheusRule
+metadata:
+  name: distilbert-cpu-recording-rules
+  labels:
+    app: distilbert-cpu
+    release: kube-prometheus-stack
+spec:
+  groups:
+    - name: kserve-hf-metrics
+      interval: 15s
+      rules:
+        # Request rate (requests per second)
+        - record: kserve:request_rate
+          expr: |
+            sum by (namespace, pod) (
+              rate(request_predict_seconds_count[1m])
+            )
+        # Average prediction latency
+        - record: kserve:predict_latency_avg
+          expr: |
+            avg by (namespace) (
+              rate(request_predict_seconds_sum[5m])
+              /
+              rate(request_predict_seconds_count[5m])
+            )
+        # P99 prediction latency
+        - record: kserve:predict_latency_p99
+          expr: |
+            histogram_quantile(0.99,
+              sum by (namespace, le) (
+                rate(request_predict_seconds_bucket[5m])
+              )
+            )

From 03fc3cd4fa1d04618845f99f08f1f325edd6d480 Mon Sep 17 00:00:00 2001
From: hsteude
Date: Mon, 16 Feb 2026 18:22:07 +0100
Subject: [PATCH 02/17] Update KEDA autoscaling example to use vLLM with OPT-125M

- Switch from DistilBERT to OPT-125M model with vLLM backend
- Fix Prometheus serverAddress to include /prometheus routePrefix
- Fix metric queries to handle vLLM's colon-namespaced metrics
- Simplify ScaledObject to focus on running/waiting requests
- Update PodMonitor and PrometheusRules for vLLM metrics

Tested on cluster: autoscaling triggers correctly when load increases
---
 .../inference-service.yaml                    | 15 ++-
 .../scaled-object.yaml                        | 96 +++++--------------
 .../service-monitor.yaml                      | 64 ++++++++-----
 3 files changed, 69 insertions(+), 106 deletions(-)

diff --git a/serving/kserve-keda-autoscaling/inference-service.yaml b/serving/kserve-keda-autoscaling/inference-service.yaml
index f3b3805..0b67dfa 100644
--- a/serving/kserve-keda-autoscaling/inference-service.yaml
+++ b/serving/kserve-keda-autoscaling/inference-service.yaml
@@ -1,23 +1,22 @@
 apiVersion: serving.kserve.io/v1beta1
 kind: InferenceService
 metadata:
-  name: distilbert-cpu
+  name: opt-125m-vllm
   annotations:
-    # Model info for documentation
-    huggingface.co/model-id: distilbert-base-uncased-finetuned-sst-2-english
+    huggingface.co/model-id: facebook/opt-125m
 spec:
   predictor:
-    # KEDA will handle scaling, but we still set bounds
     minReplicas: 1
     maxReplicas: 10
-    scaleTarget: 1
-    scaleMetric: concurrency
     model:
       modelFormat:
         name: huggingface
       args:
-        - --model_name=distilbert
-        - --model_id=distilbert-base-uncased-finetuned-sst-2-english
+        - --model_name=opt-125m
+        - --model_id=facebook/opt-125m
+        - --backend=vllm
+        - --dtype=float32
+        - --device=cpu
       resources:
         requests:
           cpu: "2"
diff --git a/serving/kserve-keda-autoscaling/scaled-object.yaml b/serving/kserve-keda-autoscaling/scaled-object.yaml
index f9eae2d..2a09a44 100644
--- a/serving/kserve-keda-autoscaling/scaled-object.yaml
+++ b/serving/kserve-keda-autoscaling/scaled-object.yaml
@@ -1,24 +1,25 @@
-# KEDA ScaledObject for KServe InferenceService
-# Scales based on custom Prometheus metrics from vLLM/HuggingFace serving runtime
+# KEDA ScaledObject for KServe InferenceService with vLLM backend
+# Scales based on custom Prometheus metrics from vLLM serving runtime
 #
 # Prerequisites:
 # - KEDA installed in cluster (https://keda.sh/docs/deploy/)
-# - Prometheus collecting vLLM metrics
-# - ServiceMonitor configured (see service-monitor.yaml)
+# - Prometheus collecting vLLM metrics (see service-monitor.yaml)
+#
+# Note: vLLM metrics use colons in names (e.g., vllm:num_requests_running)
+# which need to be quoted in PromQL queries
 #
 apiVersion: keda.sh/v1alpha1
 kind: ScaledObject
 metadata:
-  name: qwen25-05b-scaledobject
+  name: opt-125m-vllm-scaledobject
   labels:
-    app: qwen25-05b
+    app: opt-125m-vllm
 spec:
   # Target the KServe predictor deployment
-  # KServe creates a deployment with naming pattern: -predictor-
   scaleTargetRef:
     apiVersion: apps/v1
     kind: Deployment
-    name: qwen25-05b-predictor-00001-deployment
+    name: opt-125m-vllm-predictor-00001-deployment
   # Polling interval for checking metrics (seconds)
   pollingInterval: 15
   # Cooldown period before scaling down (seconds)
@@ -47,77 +48,24 @@ spec:
               periodSeconds: 15
           selectPolicy: Max
   triggers:
-    # Scale based on average token throughput per second
-    - type: prometheus
-      metadata:
-        serverAddress: http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090
-        metricName: vllm_avg_generation_throughput
-        # Average tokens generated per second across all pods
-        query: |
-          avg(rate(vllm:generation_tokens_total[1m]))
-        threshold: "100"
-        activationThreshold: "10"
-    # Alternative: Scale based on number of running requests
+    # Scale based on number of running requests per pod
+    # vLLM uses colons in metric names, so we use the actual metric name
     - type: prometheus
       metadata:
-        serverAddress: http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090
+        serverAddress: http://kube-prometheus-stack-prometheus.monitoring.svc.cluster.local:9090/prometheus
         metricName: vllm_num_requests_running
-        # Average number of running requests per pod
+        # Scale up when average running requests per pod > 2
         query: |
-          avg(vllm:num_requests_running{model_name="qwen25-05b"})
-        threshold: "5"
+          avg({"__name__"="vllm:num_requests_running", namespace="developer1"})
+        threshold: "2"
         activationThreshold: "1"
----
-# Alternative ScaledObject using GPU utilization (if using GPUs)
-apiVersion: keda.sh/v1alpha1
-kind: ScaledObject
-metadata:
-  name: qwen25-05b-gpu-scaledobject
-  labels:
-    app: qwen25-05b
-spec:
-  scaleTargetRef:
-    apiVersion: apps/v1
-    kind: Deployment
-    name: qwen25-05b-predictor-00001-deployment
-  pollingInterval: 15
-  cooldownPeriod: 120
-  minReplicaCount: 1
-  maxReplicaCount: 10
-  triggers:
-    # Scale based on GPU memory utilization (requires DCGM exporter)
-    - type: prometheus
-      metadata:
-        serverAddress: http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090
-        metricName: dcgm_gpu_memory_used_percent
-        query: |
-          avg(DCGM_FI_DEV_MEM_COPY_UTIL{pod=~"qwen25-05b-predictor.*"})
-        threshold: "80"
-        activationThreshold: "20"
----
-# Alternative ScaledObject using Kepler power consumption metrics
-apiVersion: keda.sh/v1alpha1
-kind: ScaledObject
-metadata:
-  name: qwen25-05b-power-scaledobject
-  labels:
-    app: qwen25-05b
-spec:
-  scaleTargetRef:
-    apiVersion: apps/v1
-    kind: Deployment
-    name: qwen25-05b-predictor-00001-deployment
-  pollingInterval: 30
-  cooldownPeriod: 180
-  minReplicaCount: 1
-  maxReplicaCount: 10
-  triggers:
-    # Scale based on power consumption (requires Kepler)
+    # Scale based on number of waiting requests (queue depth)
     - type: prometheus
       metadata:
-        serverAddress: http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090
-        metricName: kepler_container_joules
+        serverAddress: http://kube-prometheus-stack-prometheus.monitoring.svc.cluster.local:9090/prometheus
+        metricName: vllm_num_requests_waiting
+        # Scale up when there are waiting requests
         query: |
-          sum(rate(kepler_container_joules_total{container_name=~"qwen25-05b.*"}[5m]))
-        threshold: "100"
-        activationThreshold: "10"
+          sum({"__name__"="vllm:num_requests_waiting", namespace="developer1"})
+        threshold: "1"
+        activationThreshold: "0"
diff --git a/serving/kserve-keda-autoscaling/service-monitor.yaml b/serving/kserve-keda-autoscaling/service-monitor.yaml
index fcd8db1..a99ae3b 100644
--- a/serving/kserve-keda-autoscaling/service-monitor.yaml
+++ b/serving/kserve-keda-autoscaling/service-monitor.yaml
@@ -1,27 +1,27 @@
-# ServiceMonitor to scrape HuggingFace runtime metrics from KServe InferenceService
+# PodMonitor to scrape vLLM metrics from KServe InferenceService
 # This enables Prometheus to collect the metrics used by KEDA for autoscaling
 #
 # Prerequisites:
 # - Prometheus Operator installed (kube-prometheus-stack)
-# - HuggingFace runtime exposes metrics on port 8080 at /metrics endpoint
+# - vLLM runtime exposes metrics on port 8080 at /metrics endpoint
 #
 apiVersion: monitoring.coreos.com/v1
 kind: PodMonitor
 metadata:
-  name: distilbert-cpu-metrics
+  name: opt-125m-vllm-metrics
   labels:
-    app: distilbert-cpu
+    app: opt-125m-vllm
     # Label to match Prometheus Operator's podMonitorSelector
     release: kube-prometheus-stack
 spec:
   selector:
     matchLabels:
-      serving.kserve.io/inferenceservice: distilbert-cpu
+      serving.kserve.io/inferenceservice: opt-125m-vllm
   namespaceSelector:
     matchNames:
       - developer1
   podMetricsEndpoints:
-    - port: user-port  # HuggingFace runtime metrics port
+    - port: user-port  # vLLM runtime metrics port
       path: /metrics
       interval: 15s
       scrapeTimeout: 10s
@@ -31,34 +31,50 @@ spec:
 apiVersion: monitoring.coreos.com/v1
 kind: PrometheusRule
 metadata:
-  name: distilbert-cpu-recording-rules
+  name: opt-125m-vllm-recording-rules
   labels:
-    app: distilbert-cpu
+    app: opt-125m-vllm
     release: kube-prometheus-stack
 spec:
   groups:
-    - name: kserve-hf-metrics
+    - name: vllm-metrics
       interval: 15s
       rules:
-        # Request rate (requests per second)
-        - record: kserve:request_rate
+        # Number of running requests
+        - record: vllm:num_requests_running
           expr: |
-            sum by (namespace, pod) (
-              rate(request_predict_seconds_count[1m])
+            sum by (model_name, namespace) (
+              vllm_num_requests_running
             )
-        # Average prediction latency
-        - record: kserve:predict_latency_avg
+        # Number of waiting requests
+        - record: vllm:num_requests_waiting
           expr: |
-            avg by (namespace) (
-              rate(request_predict_seconds_sum[5m])
+            sum by (model_name, namespace) (
+              vllm_num_requests_waiting
+            )
+        # Token generation throughput
+        - record: vllm:generation_tokens_rate
+          expr: |
+            sum by (model_name, namespace) (
+              rate(vllm_generation_tokens_total[1m])
+            )
+        # Prompt tokens throughput
+        - record: vllm:prompt_tokens_rate
+          expr: |
+            sum by (model_name, namespace) (
+              rate(vllm_prompt_tokens_total[1m])
+            )
+        # Average time to first token (TTFT)
+        - record: vllm:time_to_first_token_avg
+          expr: |
+            avg by (model_name, namespace) (
+              rate(vllm_time_to_first_token_seconds_sum[5m])
               /
-              rate(request_predict_seconds_count[5m])
+              rate(vllm_time_to_first_token_seconds_count[5m])
             )
-        # P99 prediction latency
-        - record: kserve:predict_latency_p99
+        # GPU KV cache utilization
+        - record: vllm:gpu_cache_usage_percent
           expr: |
-            histogram_quantile(0.99,
-              sum by (namespace, le) (
-                rate(request_predict_seconds_bucket[5m])
-              )
+            avg by (model_name, namespace) (
+              vllm_gpu_cache_usage_perc
             )

From 471ba505547d66b88d44180c63602a26c34d9051 Mon Sep 17 00:00:00 2001
From: hsteude
Date: Mon, 16 Feb 2026 20:20:08 +0100
Subject: [PATCH 03/17] Update KEDA autoscaling example with TTFT scaling and documentation

- Add Time To First Token (TTFT) P95 as primary scaling metric
- Add GPU KV-cache utilization scaling (for GPU deployments)
- Keep running requests as fallback metric
- Update README to match other examples in repo
- Replace hardcoded namespace with placeholder
- Fix Prometheus URL to include /prometheus prefix for prokube
- Document vLLM's colon-namespaced metrics (vllm:*)
---
 serving/kserve-keda-autoscaling/README.md     | 154 ++++++++----------
 .../scaled-object.yaml                        |  36 ++--
 .../service-monitor.yaml                      |   2 +-
 3 files changed, 92 insertions(+), 100 deletions(-)

diff --git a/serving/kserve-keda-autoscaling/README.md b/serving/kserve-keda-autoscaling/README.md
index c5efc24..32325d1 100644
--- a/serving/kserve-keda-autoscaling/README.md
+++ b/serving/kserve-keda-autoscaling/README.md
@@ -11,162 +11,143 @@ Traditional request-based autoscaling doesn't work well for LLM inference becaus
 - **Memory pressure**: LLM models require significant GPU memory (KV cache), which fills up based on concurrent requests.
 
 Better metrics for LLM autoscaling include:
-- **Token throughput**: Tokens generated per second
 - **Time To First Token (TTFT)**: Latency until first token is generated
-- **Time Per Output Token (TPOT)**: Average time per generated token
 - **KV Cache utilization**: GPU memory used for attention cache
 - **Number of running/waiting requests**: Queue depth
 
 ## Prerequisites
 
-1. **KEDA** installed in the cluster:
-   ```bash
-   helm repo add kedacore https://kedacore.github.io/charts
-   helm install keda kedacore/keda --namespace keda --create-namespace
-   ```
-
-2. **Prometheus** (kube-prometheus-stack recommended):
-   ```bash
-   helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
-   helm install prometheus prometheus-community/kube-prometheus-stack --namespace monitoring --create-namespace
-   ```
+On prokube, Prometheus is already installed. You only need to install KEDA:
 
-3. **KServe** with HuggingFace/vLLM runtime configured
-
-4. **HuggingFace Token** (optional, for gated models):
-   ```bash
-   kubectl create secret generic hf-secret --from-literal=HF_TOKEN= -n developer1
-   ```
+```bash
+helm repo add kedacore https://kedacore.github.io/charts
+helm install keda kedacore/keda --namespace keda --create-namespace
+```
 
 ## Files
 
 | File | Description |
 |------|-------------|
-| `inference-service.yaml` | KServe InferenceService for Qwen2.5-0.5B model |
-| `scaled-object.yaml` | KEDA ScaledObject with multiple autoscaling strategies |
-| `service-monitor.yaml` | ServiceMonitor, PodMonitor, and PrometheusRules for metrics collection |
+| `inference-service.yaml` | KServe InferenceService for OPT-125M model with vLLM backend |
+| `scaled-object.yaml` | KEDA ScaledObject with TTFT, GPU cache, and request-based scaling |
+| `service-monitor.yaml` | PodMonitor and PrometheusRules for vLLM metrics collection |
 
 ## Deployment
 
 ### 1. Deploy the InferenceService
 
 ```bash
-kubectl apply -f inference-service.yaml -n developer1
+kubectl apply -f inference-service.yaml -n
 ```
 
 Wait for the model to be ready:
 ```bash
-kubectl get inferenceservice qwen25-05b -n developer1 -w
+kubectl get inferenceservice opt-125m-vllm -n -w
 ```
 
 ### 2. Configure Prometheus Metrics Collection
 
-Apply the ServiceMonitor to scrape vLLM metrics:
+Apply the PodMonitor to scrape vLLM metrics:
 ```bash
-kubectl apply -f service-monitor.yaml -n developer1
+kubectl apply -f service-monitor.yaml -n
 ```
 
-Verify metrics are being scraped:
-```bash
-# Port-forward Prometheus
-kubectl port-forward svc/prometheus-kube-prometheus-prometheus 9090:9090 -n monitoring
-
-# Query for vLLM metrics
-curl -s 'http://localhost:9090/api/v1/query?query=vllm_num_requests_running' | jq .
-```
+**Note:** You may need to update the `namespaceSelector` in `service-monitor.yaml` to match your namespace.
 
 ### 3. Deploy KEDA ScaledObject
 
 First, identify the correct deployment name:
 ```bash
-kubectl get deployments -n developer1 | grep qwen25-05b
+kubectl get deployments -n | grep opt-125m-vllm
 ```
 
-Update `scaled-object.yaml` with the correct deployment name, then apply:
+Update `scaled-object.yaml` with:
+- The correct deployment name
+- Your namespace in the Prometheus queries
+
+Then apply:
 ```bash
-kubectl apply -f scaled-object.yaml -n developer1
+kubectl apply -f scaled-object.yaml -n
 ```
 
 Verify the ScaledObject:
 ```bash
-kubectl get scaledobject -n developer1
-kubectl describe scaledobject qwen25-05b-scaledobject -n developer1
+kubectl get scaledobject -n
+kubectl describe scaledobject opt-125m-vllm-scaledobject -n
 ```
 
 ## Autoscaling Strategies
 
-This example includes three ScaledObject variants:
+This example uses three triggers (first one to exceed threshold wins):
 
-### 1. Token Throughput Based (Default)
-Scales based on average token generation throughput and number of running requests:
+### 1. Time To First Token (TTFT) - P95
+Scales when the 95th percentile TTFT exceeds 200ms:
 ```yaml
 triggers:
   - type: prometheus
     metadata:
-      query: avg(rate(vllm:generation_tokens_total[1m]))
-      threshold: "100"
+      query: |
+        histogram_quantile(0.95, sum(rate({"__name__"="vllm:time_to_first_token_seconds_bucket", namespace=""}[2m])) by (le))
+      threshold: "0.2"
 ```
 
-### 2. GPU Utilization Based
-Scales based on GPU memory utilization (requires DCGM exporter):
+### 2. GPU KV-Cache Utilization
+Scales when GPU cache usage exceeds 70% (for GPU deployments):
 ```yaml
 triggers:
   - type: prometheus
     metadata:
-      query: avg(DCGM_FI_DEV_MEM_COPY_UTIL{pod=~"qwen25-05b-predictor.*"})
-      threshold: "80"
+      query: |
+        avg({"__name__"="vllm:gpu_cache_usage_perc", namespace=""})
+      threshold: "0.7"
 ```
 
-### 3. Power Consumption Based
-Scales based on power consumption metrics from [Kepler](https://github.com/sustainable-computing-io/kepler):
+### 3. Running Requests (Fallback)
+Scales when average running requests per pod exceeds 2:
 ```yaml
 triggers:
   - type: prometheus
     metadata:
-      query: sum(rate(kepler_container_joules_total{container_name=~"qwen25-05b.*"}[5m]))
-      threshold: "100"
+      query: |
+        avg({"__name__"="vllm:num_requests_running", namespace=""})
+      threshold: "2"
 ```
 
 ## vLLM Metrics Reference
 
-vLLM exposes metrics at `/metrics` endpoint:
+vLLM exposes metrics at `/metrics` endpoint. Note that vLLM uses colons in metric names:
 
 | Metric | Description |
 |--------|-------------|
-| `vllm_num_requests_running` | Number of requests currently being processed |
-| `vllm_num_requests_waiting` | Number of requests waiting in queue |
-| `vllm_gpu_cache_usage_perc` | GPU KV cache utilization percentage |
-| `vllm_generation_tokens_total` | Total number of generated tokens |
-| `vllm_time_to_first_token_seconds` | Histogram of TTFT |
-| `vllm_time_per_output_token_seconds` | Histogram of TPOT |
+| `vllm:num_requests_running` | Number of requests currently being processed |
+| `vllm:num_requests_waiting` | Number of requests waiting in queue |
+| `vllm:gpu_cache_usage_perc` | GPU KV cache utilization (0-1) |
+| `vllm:time_to_first_token_seconds` | Histogram of TTFT |
+| `vllm:time_per_output_token_seconds` | Histogram of TPOT |
+| `vllm:generation_tokens_total` | Total number of generated tokens |
 
 ## Testing Autoscaling
 
 Generate load to trigger autoscaling:
 
 ```bash
-# Get the inference URL
-ISVC_URL=$(kubectl get inferenceservice qwen25-05b -n developer1 -o jsonpath='{.status.url}')
-
-# Send requests in a loop
-for i in {1..100}; do
-  curl -X POST "${ISVC_URL}/v1/completions" \
-    -H "Content-Type: application/json" \
-    -d '{
-      "model": "qwen25-05b",
-      "prompt": "Write a long story about",
-      "max_tokens": 500
-    }' &
-done
+# Create a load generator pod
+kubectl run load-gen --image=curlimages/curl -n --restart=Never -- \
+  sh -c 'while true; do for i in $(seq 1 10); do curl -s -X POST "http://opt-125m-vllm-predictor-00001..svc.cluster.local/openai/v1/completions" -H "Content-Type: application/json" -d "{\"model\": \"opt-125m\", \"prompt\": \"Tell me a story\", \"max_tokens\": 200}" & done; sleep 2; done'
 ```
 
 Monitor scaling:
 ```bash
-# Watch replica count
-kubectl get deployment -n developer1 -w
+# Watch HPA status
+kubectl get hpa -n -w
 
-# Check KEDA metrics
-kubectl get hpa -n developer1
+# Watch pods
+kubectl get pods -n -l serving.kserve.io/inferenceservice=opt-125m-vllm -w
+```
+
+Clean up:
+```bash
+kubectl delete pod load-gen -n
 ```
 
 ## Troubleshooting
@@ -174,13 +155,13 @@ kubectl get hpa -n developer1
 ### KEDA not scaling
 1. Check ScaledObject status:
    ```bash
-   kubectl describe scaledobject qwen25-05b-scaledobject -n developer1
+   kubectl describe scaledobject opt-125m-vllm-scaledobject -n
    ```
 
-2. Verify Prometheus connectivity:
+2. Verify Prometheus connectivity (note the `/prometheus` path prefix on prokube):
    ```bash
-   kubectl run curl-test --image=curlimages/curl --rm -it -- \
-     curl -s 'http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090/api/v1/query?query=up'
+   kubectl run curl-test --image=curlimages/curl --rm -it --restart=Never -- \
+     curl -s 'http://kube-prometheus-stack-prometheus.monitoring.svc.cluster.local:9090/prometheus/api/v1/query?query=up'
    ```
 
 3. Check KEDA operator logs:
@@ -189,18 +170,19 @@ kubectl get hpa -n developer1
    ```
 
 ### Metrics not appearing
-1. Verify ServiceMonitor is picked up:
+1. Verify PodMonitor is picked up:
    ```bash
-   kubectl get servicemonitor -n developer1
+   kubectl get podmonitor -n
    ```
 
-2. Check Prometheus targets:
-   - Open Prometheus UI -> Status -> Targets
-   - Look for `serviceMonitor/developer1/qwen25-05b-metrics`
+2. Check if vLLM metrics are being scraped:
+   ```bash
+   kubectl run curl-test --image=curlimages/curl --rm -it --restart=Never -- \
+     curl -s 'http://kube-prometheus-stack-prometheus.monitoring.svc.cluster.local:9090/prometheus/api/v1/query?query={__name__=~"vllm:.*"}'
+   ```
 
 ## References
 
 - [KServe Issue #3561: Native KEDA integration](https://github.com/kserve/kserve/issues/3561)
 - [KEDA Prometheus Scaler](https://keda.sh/docs/scalers/prometheus/)
 - [vLLM Metrics](https://docs.vllm.ai/en/latest/serving/metrics.html)
-- [Kepler: Kubernetes Energy Metering](https://github.com/sustainable-computing-io/kepler)
diff --git a/serving/kserve-keda-autoscaling/scaled-object.yaml b/serving/kserve-keda-autoscaling/scaled-object.yaml
index 2a09a44..b61d3fa 100644
--- a/serving/kserve-keda-autoscaling/scaled-object.yaml
+++ b/serving/kserve-keda-autoscaling/scaled-object.yaml
@@ -8,6 +8,8 @@
 # Note: vLLM metrics use colons in names (e.g., vllm:num_requests_running)
 # which need to be quoted in PromQL queries
 #
+# TODO: Replace with your actual namespace in the queries below
+#
 apiVersion: keda.sh/v1alpha1
 kind: ScaledObject
 metadata:
@@ -48,24 +50,32 @@ spec:
               periodSeconds: 15
           selectPolicy: Max
   triggers:
-    # Scale based on number of running requests per pod
-    # vLLM uses colons in metric names, so we use the actual metric name
+    # Scale based on Time To First Token (TTFT) - P95
+    # Scale up when P95 TTFT exceeds 200ms (0.2s)
     - type: prometheus
       metadata:
         serverAddress: http://kube-prometheus-stack-prometheus.monitoring.svc.cluster.local:9090/prometheus
-        metricName: vllm_num_requests_running
-        # Scale up when average running requests per pod > 2
+        metricName: vllm_ttft_p95
         query: |
-          avg({"__name__"="vllm:num_requests_running", namespace="developer1"})
-        threshold: "2"
-        activationThreshold: "1"
-    # Scale based on number of waiting requests (queue depth)
+          histogram_quantile(0.95, sum(rate({"__name__"="vllm:time_to_first_token_seconds_bucket", namespace=""}[2m])) by (le))
+
threshold: "0.2" + activationThreshold: "0.1" + # Scale based on GPU KV-cache usage (for GPU deployments) + # Scale up when cache usage exceeds 70% - type: prometheus metadata: serverAddress: http://kube-prometheus-stack-prometheus.monitoring.svc.cluster.local:9090/prometheus - metricName: vllm_num_requests_waiting - # Scale up when there are waiting requests + metricName: vllm_gpu_cache_usage query: | - sum({"__name__"="vllm:num_requests_waiting", namespace="developer1"}) - threshold: "1" - activationThreshold: "0" + avg({"__name__"="vllm:gpu_cache_usage_perc", namespace=""}) + threshold: "0.7" + activationThreshold: "0.5" + # Fallback: Scale based on running requests (always works) + - type: prometheus + metadata: + serverAddress: http://kube-prometheus-stack-prometheus.monitoring.svc.cluster.local:9090/prometheus + metricName: vllm_num_requests_running + query: | + avg({"__name__"="vllm:num_requests_running", namespace=""}) + threshold: "2" + activationThreshold: "1" diff --git a/serving/kserve-keda-autoscaling/service-monitor.yaml b/serving/kserve-keda-autoscaling/service-monitor.yaml index a99ae3b..2517000 100644 --- a/serving/kserve-keda-autoscaling/service-monitor.yaml +++ b/serving/kserve-keda-autoscaling/service-monitor.yaml @@ -19,7 +19,7 @@ spec: serving.kserve.io/inferenceservice: opt-125m-vllm namespaceSelector: matchNames: - - developer1 + - # TODO: Replace with your namespace podMetricsEndpoints: - port: user-port # vLLM runtime metrics port path: /metrics From dcfea4f7f330d2d2d2ea61a4e4d4745dd4ac0c3c Mon Sep 17 00:00:00 2001 From: hsteude Date: Mon, 16 Feb 2026 20:21:19 +0100 Subject: [PATCH 04/17] Remove prokube-specific Prometheus note from prerequisites --- serving/kserve-keda-autoscaling/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/serving/kserve-keda-autoscaling/README.md b/serving/kserve-keda-autoscaling/README.md index 32325d1..b29acb3 100644 --- a/serving/kserve-keda-autoscaling/README.md +++ 
b/serving/kserve-keda-autoscaling/README.md @@ -17,7 +17,7 @@ Better metrics for LLM autoscaling include: ## Prerequisites -On prokube, Prometheus is already installed. You only need to install KEDA: +Install KEDA in the cluster: ```bash helm repo add kedacore https://kedacore.github.io/charts From 93ae4297f0b55761d1c3e8d6b9274341f7b99af7 Mon Sep 17 00:00:00 2001 From: hsteude Date: Mon, 16 Feb 2026 20:41:22 +0100 Subject: [PATCH 05/17] Address Copilot review feedback - Remove unused PrometheusRules (vLLM metrics use colons natively) - Fix trailing whitespace in scaled-object.yaml - Clarify that vLLM uses colons in metric names (unusual but correct) - Add note about minReplicas/maxReplicas when using KEDA - Add step to find predictor service name before load testing - Remove prokube-specific reference in troubleshooting --- serving/kserve-keda-autoscaling/README.md | 13 +++-- .../inference-service.yaml | 2 + .../scaled-object.yaml | 6 +- .../service-monitor.yaml | 57 +------------------ 4 files changed, 15 insertions(+), 63 deletions(-) diff --git a/serving/kserve-keda-autoscaling/README.md b/serving/kserve-keda-autoscaling/README.md index b29acb3..69c4d94 100644 --- a/serving/kserve-keda-autoscaling/README.md +++ b/serving/kserve-keda-autoscaling/README.md @@ -30,7 +30,7 @@ helm install keda kedacore/keda --namespace keda --create-namespace |------|-------------| | `inference-service.yaml` | KServe InferenceService for OPT-125M model with vLLM backend | | `scaled-object.yaml` | KEDA ScaledObject with TTFT, GPU cache, and request-based scaling | -| `service-monitor.yaml` | PodMonitor and PrometheusRules for vLLM metrics collection | +| `service-monitor.yaml` | PodMonitor for vLLM metrics collection | ## Deployment @@ -52,7 +52,7 @@ Apply the PodMonitor to scrape vLLM metrics: kubectl apply -f service-monitor.yaml -n ``` -**Note:** You may need to update the `namespaceSelector` in `service-monitor.yaml` to match your namespace. 
+**Note:** Update the `namespaceSelector` in `service-monitor.yaml` to match your namespace. ### 3. Deploy KEDA ScaledObject @@ -115,7 +115,7 @@ triggers: ## vLLM Metrics Reference -vLLM exposes metrics at `/metrics` endpoint. Note that vLLM uses colons in metric names: +vLLM exposes metrics at the `/metrics` endpoint. Note that vLLM uses colons in metric names (this is unusual but correct): | Metric | Description | |--------|-------------| @@ -131,7 +131,10 @@ vLLM exposes metrics at `/metrics` endpoint. Note that vLLM uses colons in metri Generate load to trigger autoscaling: ```bash -# Create a load generator pod +# First, find the predictor service name +kubectl get svc -n | grep opt-125m-vllm + +# Create a load generator pod (adjust service name if needed) kubectl run load-gen --image=curlimages/curl -n --restart=Never -- \ sh -c 'while true; do for i in $(seq 1 10); do curl -s -X POST "http://opt-125m-vllm-predictor-00001..svc.cluster.local/openai/v1/completions" -H "Content-Type: application/json" -d "{\"model\": \"opt-125m\", \"prompt\": \"Tell me a story\", \"max_tokens\": 200}" & done; sleep 2; done' ``` @@ -158,7 +161,7 @@ kubectl delete pod load-gen -n kubectl describe scaledobject opt-125m-vllm-scaledobject -n ``` -2. Verify Prometheus connectivity (note the `/prometheus` path prefix on prokube): +2. 
Verify Prometheus connectivity (some deployments use a `/prometheus` path prefix): ```bash kubectl run curl-test --image=curlimages/curl --rm -it --restart=Never -- \ curl -s 'http://kube-prometheus-stack-prometheus.monitoring.svc.cluster.local:9090/prometheus/api/v1/query?query=up' diff --git a/serving/kserve-keda-autoscaling/inference-service.yaml b/serving/kserve-keda-autoscaling/inference-service.yaml index 0b67dfa..5135391 100644 --- a/serving/kserve-keda-autoscaling/inference-service.yaml +++ b/serving/kserve-keda-autoscaling/inference-service.yaml @@ -6,6 +6,8 @@ metadata: huggingface.co/model-id: facebook/opt-125m spec: predictor: + # Note: When using KEDA, replica limits are managed by the ScaledObject. + # These values serve as defaults if KEDA is not deployed. minReplicas: 1 maxReplicas: 10 model: diff --git a/serving/kserve-keda-autoscaling/scaled-object.yaml b/serving/kserve-keda-autoscaling/scaled-object.yaml index b61d3fa..1530843 100644 --- a/serving/kserve-keda-autoscaling/scaled-object.yaml +++ b/serving/kserve-keda-autoscaling/scaled-object.yaml @@ -5,8 +5,8 @@ # - KEDA installed in cluster (https://keda.sh/docs/deploy/) # - Prometheus collecting vLLM metrics (see service-monitor.yaml) # -# Note: vLLM metrics use colons in names (e.g., vllm:num_requests_running) -# which need to be quoted in PromQL queries +# Note: vLLM uses colons in metric names (e.g., vllm:num_requests_running), +# which is unusual but correct. Use {"__name__"="..."} syntax in PromQL. 
# # TODO: Replace with your actual namespace in the queries below # @@ -24,7 +24,7 @@ spec: name: opt-125m-vllm-predictor-00001-deployment # Polling interval for checking metrics (seconds) pollingInterval: 15 - # Cooldown period before scaling down (seconds) + # Cooldown period before scaling down (seconds) cooldownPeriod: 60 # Min/max replicas minReplicaCount: 1 diff --git a/serving/kserve-keda-autoscaling/service-monitor.yaml b/serving/kserve-keda-autoscaling/service-monitor.yaml index 2517000..2bccc93 100644 --- a/serving/kserve-keda-autoscaling/service-monitor.yaml +++ b/serving/kserve-keda-autoscaling/service-monitor.yaml @@ -3,7 +3,7 @@ # # Prerequisites: # - Prometheus Operator installed (kube-prometheus-stack) -# - vLLM runtime exposes metrics on port 8080 at /metrics endpoint +# - vLLM runtime exposes metrics at /metrics endpoint # apiVersion: monitoring.coreos.com/v1 kind: PodMonitor @@ -21,60 +21,7 @@ spec: matchNames: - # TODO: Replace with your namespace podMetricsEndpoints: - - port: user-port # vLLM runtime metrics port + - port: user-port path: /metrics interval: 15s scrapeTimeout: 10s ---- -# PrometheusRule for creating recording rules -# These pre-aggregate metrics for more efficient KEDA queries -apiVersion: monitoring.coreos.com/v1 -kind: PrometheusRule -metadata: - name: opt-125m-vllm-recording-rules - labels: - app: opt-125m-vllm - release: kube-prometheus-stack -spec: - groups: - - name: vllm-metrics - interval: 15s - rules: - # Number of running requests - - record: vllm:num_requests_running - expr: | - sum by (model_name, namespace) ( - vllm_num_requests_running - ) - # Number of waiting requests - - record: vllm:num_requests_waiting - expr: | - sum by (model_name, namespace) ( - vllm_num_requests_waiting - ) - # Token generation throughput - - record: vllm:generation_tokens_rate - expr: | - sum by (model_name, namespace) ( - rate(vllm_generation_tokens_total[1m]) - ) - # Prompt tokens throughput - - record: vllm:prompt_tokens_rate - expr: 
| - sum by (model_name, namespace) ( - rate(vllm_prompt_tokens_total[1m]) - ) - # Average time to first token (TTFT) - - record: vllm:time_to_first_token_avg - expr: | - avg by (model_name, namespace) ( - rate(vllm_time_to_first_token_seconds_sum[5m]) - / - rate(vllm_time_to_first_token_seconds_count[5m]) - ) - # GPU KV cache utilization - - record: vllm:gpu_cache_usage_percent - expr: | - avg by (model_name, namespace) ( - vllm_gpu_cache_usage_perc - ) From b1a1034ef5eb27d4ef936772c8589e576f403bd4 Mon Sep 17 00:00:00 2001 From: hsteude Date: Mon, 16 Feb 2026 20:58:21 +0100 Subject: [PATCH 06/17] Address additional Copilot review feedback - Fix KEDA trigger description (evaluates all, uses highest replica count) - Make Prometheus URL configurable ( placeholder) - Add pod selector to queries to avoid cross-InferenceService metric aggregation - Update README with additional configuration steps --- serving/kserve-keda-autoscaling/README.md | 6 ++++-- .../kserve-keda-autoscaling/scaled-object.yaml | 18 +++++++++++------- 2 files changed, 15 insertions(+), 9 deletions(-) diff --git a/serving/kserve-keda-autoscaling/README.md b/serving/kserve-keda-autoscaling/README.md index 69c4d94..b07387c 100644 --- a/serving/kserve-keda-autoscaling/README.md +++ b/serving/kserve-keda-autoscaling/README.md @@ -62,8 +62,10 @@ kubectl get deployments -n | grep opt-125m-vllm ``` Update `scaled-object.yaml` with: -- The correct deployment name +- The correct deployment name in `scaleTargetRef` +- Your Prometheus server URL (e.g., `http://prometheus.monitoring:9090` or with path prefix) - Your namespace in the Prometheus queries +- Your InferenceService name in the pod selector (e.g., `pod=~"opt-125m-vllm-predictor-.*"`) Then apply: ```bash @@ -78,7 +80,7 @@ kubectl describe scaledobject opt-125m-vllm-scaledobject -n ## Autoscaling Strategies -This example uses three triggers (first one to exceed threshold wins): +This example uses three triggers. 
KEDA evaluates all triggers and scales based on the highest desired replica count: ### 1. Time To First Token (TTFT) - P95 Scales when the 95th percentile TTFT exceeds 200ms: diff --git a/serving/kserve-keda-autoscaling/scaled-object.yaml b/serving/kserve-keda-autoscaling/scaled-object.yaml index 1530843..ca30d3e 100644 --- a/serving/kserve-keda-autoscaling/scaled-object.yaml +++ b/serving/kserve-keda-autoscaling/scaled-object.yaml @@ -8,7 +8,10 @@ # Note: vLLM uses colons in metric names (e.g., vllm:num_requests_running), # which is unusual but correct. Use {"__name__"="..."} syntax in PromQL. # -# TODO: Replace with your actual namespace in the queries below +# TODO: Replace the following before deploying: +# - : your actual namespace +# - : your Prometheus server URL (may or may not have a path prefix) +# - opt-125m-vllm: your InferenceService name (in queries and scaleTargetRef) # apiVersion: keda.sh/v1alpha1 kind: ScaledObject @@ -54,28 +57,29 @@ spec: # Scale up when P95 TTFT exceeds 200ms (0.2s) - type: prometheus metadata: - serverAddress: http://kube-prometheus-stack-prometheus.monitoring.svc.cluster.local:9090/prometheus + # Adjust URL to your Prometheus setup (some have /prometheus path prefix) + serverAddress: metricName: vllm_ttft_p95 query: | - histogram_quantile(0.95, sum(rate({"__name__"="vllm:time_to_first_token_seconds_bucket", namespace=""}[2m])) by (le)) + histogram_quantile(0.95, sum(rate({"__name__"="vllm:time_to_first_token_seconds_bucket", namespace="", pod=~"opt-125m-vllm-predictor-.*"}[2m])) by (le)) threshold: "0.2" activationThreshold: "0.1" # Scale based on GPU KV-cache usage (for GPU deployments) # Scale up when cache usage exceeds 70% - type: prometheus metadata: - serverAddress: http://kube-prometheus-stack-prometheus.monitoring.svc.cluster.local:9090/prometheus + serverAddress: metricName: vllm_gpu_cache_usage query: | - avg({"__name__"="vllm:gpu_cache_usage_perc", namespace=""}) + avg({"__name__"="vllm:gpu_cache_usage_perc", 
namespace="", pod=~"opt-125m-vllm-predictor-.*"}) threshold: "0.7" activationThreshold: "0.5" # Fallback: Scale based on running requests (always works) - type: prometheus metadata: - serverAddress: http://kube-prometheus-stack-prometheus.monitoring.svc.cluster.local:9090/prometheus + serverAddress: metricName: vllm_num_requests_running query: | - avg({"__name__"="vllm:num_requests_running", namespace=""}) + avg({"__name__"="vllm:num_requests_running", namespace="", pod=~"opt-125m-vllm-predictor-.*"}) threshold: "2" activationThreshold: "1" From b69e04df6933da3ffd497ddf960dd670f8330c72 Mon Sep 17 00:00:00 2001 From: Igor Kvachenok Date: Thu, 12 Mar 2026 14:17:33 +0100 Subject: [PATCH 07/17] Update KEDA example with new insights --- serving/kserve-keda-autoscaling/README.md | 212 ++++-------------- .../inference-service.yaml | 24 +- .../scaled-object.yaml | 89 ++------ .../service-monitor.yaml | 27 --- 4 files changed, 91 insertions(+), 261 deletions(-) delete mode 100644 serving/kserve-keda-autoscaling/service-monitor.yaml diff --git a/serving/kserve-keda-autoscaling/README.md b/serving/kserve-keda-autoscaling/README.md index b07387c..bbe6f32 100644 --- a/serving/kserve-keda-autoscaling/README.md +++ b/serving/kserve-keda-autoscaling/README.md @@ -1,193 +1,79 @@ -# KServe Autoscaling with KEDA and Custom Metrics +# KServe Autoscaling with KEDA and Custom Prometheus Metrics -This example demonstrates how to autoscale KServe InferenceServices using [KEDA](https://keda.sh/) with custom Prometheus metrics. This is particularly useful for LLM inference workloads where request-based autoscaling (Knative default) is not optimal. +This example demonstrates autoscaling a KServe InferenceService using +[KEDA](https://keda.sh/) with custom Prometheus metrics from vLLM. +It scales based on total token throughput rather than simple request count, +which is better suited for LLM inference workloads. -## Why Custom Metrics for LLM Autoscaling? 
+For full documentation, see the +[prokube autoscaling docs](https://docs.prokube.cloud/user_docs/model_serving_autoscaling/#keda-kubernetes-event-driven-autoscaling). -Traditional request-based autoscaling doesn't work well for LLM inference because: +## Why Token Throughput? -- **Token-level work**: LLM inference operates at token level, not request level. A single request can generate hundreds of tokens. -- **Variable latency**: Request latency varies significantly based on input/output token count. -- **Memory pressure**: LLM models require significant GPU memory (KV cache), which fills up based on concurrent requests. - -Better metrics for LLM autoscaling include: -- **Time To First Token (TTFT)**: Latency until first token is generated -- **KV Cache utilization**: GPU memory used for attention cache -- **Number of running/waiting requests**: Queue depth +LLM requests vary wildly in duration depending on prompt and output length. +Request-count metrics (concurrency, QPS) don't reflect actual GPU load. +Token throughput stays elevated as long as the model is under pressure, +making it a stable scaling signal. 
## Prerequisites -Install KEDA in the cluster: - -```bash -helm repo add kedacore https://kedacore.github.io/charts -helm install keda kedacore/keda --namespace keda --create-namespace -``` +- KEDA installed in the cluster (`helm install keda kedacore/keda -n keda --create-namespace`) +- Prometheus scraping vLLM metrics (prokube clusters include a cluster-wide PodMonitor) ## Files | File | Description | |------|-------------| -| `inference-service.yaml` | KServe InferenceService for OPT-125M model with vLLM backend | -| `scaled-object.yaml` | KEDA ScaledObject with TTFT, GPU cache, and request-based scaling | -| `service-monitor.yaml` | PodMonitor for vLLM metrics collection | - -## Deployment +| `inference-service.yaml` | KServe InferenceService (OPT-125M, RawDeployment mode) | +| `scaled-object.yaml` | KEDA ScaledObject — scales on token throughput | -### 1. Deploy the InferenceService +## Quick Start ```bash -kubectl apply -f inference-service.yaml -n -``` +export NAMESPACE="default" -Wait for the model to be ready: -```bash -kubectl get inferenceservice opt-125m-vllm -n -w -``` - -### 2. Configure Prometheus Metrics Collection +# 1. Deploy the InferenceService +kubectl apply -n $NAMESPACE -f inference-service.yaml -Apply the PodMonitor to scrape vLLM metrics: -```bash -kubectl apply -f service-monitor.yaml -n -``` +# 2. Wait for it to become ready +kubectl get isvc opt-125m -n $NAMESPACE -w -**Note:** Update the `namespaceSelector` in `service-monitor.yaml` to match your namespace. +# 3. Deploy the KEDA ScaledObject +kubectl apply -n $NAMESPACE -f scaled-object.yaml -### 3. 
Deploy KEDA ScaledObject - -First, identify the correct deployment name: -```bash -kubectl get deployments -n | grep opt-125m-vllm -``` - -Update `scaled-object.yaml` with: -- The correct deployment name in `scaleTargetRef` -- Your Prometheus server URL (e.g., `http://prometheus.monitoring:9090` or with path prefix) -- Your namespace in the Prometheus queries -- Your InferenceService name in the pod selector (e.g., `pod=~"opt-125m-vllm-predictor-.*"`) - -Then apply: -```bash -kubectl apply -f scaled-object.yaml -n +# 4. Verify +kubectl get scaledobject -n $NAMESPACE +kubectl get hpa -n $NAMESPACE ``` -Verify the ScaledObject: -```bash -kubectl get scaledobject -n -kubectl describe scaledobject opt-125m-vllm-scaledobject -n -``` +## Customization -## Autoscaling Strategies +**Namespace and model name**: replace `default` and `opt-125m` in the +Prometheus queries inside `scaled-object.yaml`. -This example uses three triggers. KEDA evaluates all triggers and scales based on the highest desired replica count: +**Threshold**: the `threshold: "5"` value means "scale up when each replica +handles more than 5 tokens/second on average" (`AverageValue` divides the +query result by replica count). Tune this based on load testing for your +model and hardware. -### 1. Time To First Token (TTFT) - P95 -Scales when the 95th percentile TTFT exceeds 200ms: -```yaml -triggers: - - type: prometheus - metadata: - query: | - histogram_quantile(0.95, sum(rate({"__name__"="vllm:time_to_first_token_seconds_bucket", namespace=""}[2m])) by (le)) - threshold: "0.2" -``` +**GPU deployments**: remove `--dtype=float32` and `--max-model-len=512` +from the InferenceService args, add GPU resource requests, and consider +adding a second trigger for GPU KV-cache utilization: -### 2. 
GPU KV-Cache Utilization -Scales when GPU cache usage exceeds 70% (for GPU deployments): ```yaml -triggers: - - type: prometheus - metadata: - query: | - avg({"__name__"="vllm:gpu_cache_usage_perc", namespace=""}) - threshold: "0.7" +# Add to scaled-object.yaml triggers list +- type: prometheus + metadata: + serverAddress: http://kube-prometheus-stack-prometheus.monitoring.svc.cluster.local:9090/prometheus + query: >- + avg(vllm:gpu_cache_usage_perc{namespace="my-namespace",model_name="my-model"}) + metricType: AverageValue + threshold: "0.75" ``` -### 3. Running Requests (Fallback) -Scales when average running requests per pod exceeds 2: -```yaml -triggers: - - type: prometheus - metadata: - query: | - avg({"__name__"="vllm:num_requests_running", namespace=""}) - threshold: "2" -``` - -## vLLM Metrics Reference - -vLLM exposes metrics at the `/metrics` endpoint. Note that vLLM uses colons in metric names (this is unusual but correct): - -| Metric | Description | -|--------|-------------| -| `vllm:num_requests_running` | Number of requests currently being processed | -| `vllm:num_requests_waiting` | Number of requests waiting in queue | -| `vllm:gpu_cache_usage_perc` | GPU KV cache utilization (0-1) | -| `vllm:time_to_first_token_seconds` | Histogram of TTFT | -| `vllm:time_per_output_token_seconds` | Histogram of TPOT | -| `vllm:generation_tokens_total` | Total number of generated tokens | - -## Testing Autoscaling - -Generate load to trigger autoscaling: - -```bash -# First, find the predictor service name -kubectl get svc -n | grep opt-125m-vllm - -# Create a load generator pod (adjust service name if needed) -kubectl run load-gen --image=curlimages/curl -n --restart=Never -- \ - sh -c 'while true; do for i in $(seq 1 10); do curl -s -X POST "http://opt-125m-vllm-predictor-00001..svc.cluster.local/openai/v1/completions" -H "Content-Type: application/json" -d "{\"model\": \"opt-125m\", \"prompt\": \"Tell me a story\", \"max_tokens\": 200}" & done; sleep 2; done' 
-``` - -Monitor scaling: -```bash -# Watch HPA status -kubectl get hpa -n -w - -# Watch pods -kubectl get pods -n -l serving.kserve.io/inferenceservice=opt-125m-vllm -w -``` - -Clean up: -```bash -kubectl delete pod load-gen -n -``` - -## Troubleshooting - -### KEDA not scaling -1. Check ScaledObject status: - ```bash - kubectl describe scaledobject opt-125m-vllm-scaledobject -n - ``` - -2. Verify Prometheus connectivity (some deployments use a `/prometheus` path prefix): - ```bash - kubectl run curl-test --image=curlimages/curl --rm -it --restart=Never -- \ - curl -s 'http://kube-prometheus-stack-prometheus.monitoring.svc.cluster.local:9090/prometheus/api/v1/query?query=up' - ``` - -3. Check KEDA operator logs: - ```bash - kubectl logs -l app=keda-operator -n keda - ``` - -### Metrics not appearing -1. Verify PodMonitor is picked up: - ```bash - kubectl get podmonitor -n - ``` - -2. Check if vLLM metrics are being scraped: - ```bash - kubectl run curl-test --image=curlimages/curl --rm -it --restart=Never -- \ - curl -s 'http://kube-prometheus-stack-prometheus.monitoring.svc.cluster.local:9090/prometheus/api/v1/query?query={__name__=~"vllm:.*"}' - ``` - ## References -- [KServe Issue #3561: Native KEDA integration](https://github.com/kserve/kserve/issues/3561) -- [KEDA Prometheus Scaler](https://keda.sh/docs/scalers/prometheus/) -- [vLLM Metrics](https://docs.vllm.ai/en/latest/serving/metrics.html) +- [prokube autoscaling documentation](https://docs.prokube.cloud/user_docs/model_serving_autoscaling/) +- [KServe KEDA autoscaler docs](https://kserve.github.io/website/docs/model-serving/predictive-inference/autoscaling/keda-autoscaler) +- [KEDA Prometheus scaler](https://keda.sh/docs/scalers/prometheus/) +- [vLLM metrics reference](https://docs.vllm.ai/en/latest/serving/metrics.html) diff --git a/serving/kserve-keda-autoscaling/inference-service.yaml b/serving/kserve-keda-autoscaling/inference-service.yaml index 5135391..f7a755b 100644 --- 
a/serving/kserve-keda-autoscaling/inference-service.yaml +++ b/serving/kserve-keda-autoscaling/inference-service.yaml @@ -1,15 +1,21 @@ +# KServe InferenceService for OPT-125M with vLLM backend. +# Uses RawDeployment mode — required when scaling with KEDA. +# +# This example runs on CPU. For GPU, remove --dtype=float32 and +# --max-model-len, and adjust resources to request nvidia.com/gpu. apiVersion: serving.kserve.io/v1beta1 kind: InferenceService metadata: - name: opt-125m-vllm + name: opt-125m annotations: - huggingface.co/model-id: facebook/opt-125m + # RawDeployment mode — creates a plain Deployment instead of a Knative Revision. + serving.kserve.io/deploymentMode: "RawDeployment" + # Tell KServe not to create its own HPA (KEDA will manage scaling). + serving.kserve.io/autoscalerClass: "external" spec: predictor: - # Note: When using KEDA, replica limits are managed by the ScaledObject. - # These values serve as defaults if KEDA is not deployed. minReplicas: 1 - maxReplicas: 10 + maxReplicas: 3 model: modelFormat: name: huggingface @@ -18,7 +24,13 @@ spec: - --model_id=facebook/opt-125m - --backend=vllm - --dtype=float32 - - --device=cpu + - --max-model-len=512 + # Explicit port declaration is required in RawDeployment mode + # for the cluster-wide PodMonitor to discover the metrics endpoint. + ports: + - name: user-port + containerPort: 8080 + protocol: TCP resources: requests: cpu: "2" diff --git a/serving/kserve-keda-autoscaling/scaled-object.yaml b/serving/kserve-keda-autoscaling/scaled-object.yaml index ca30d3e..004f038 100644 --- a/serving/kserve-keda-autoscaling/scaled-object.yaml +++ b/serving/kserve-keda-autoscaling/scaled-object.yaml @@ -1,85 +1,44 @@ -# KEDA ScaledObject for KServe InferenceService with vLLM backend -# Scales based on custom Prometheus metrics from vLLM serving runtime +# KEDA ScaledObject for KServe InferenceService with vLLM backend. +# Scales based on total token throughput (prompt + generation) from Prometheus. 
# # Prerequisites: -# - KEDA installed in cluster (https://keda.sh/docs/deploy/) -# - Prometheus collecting vLLM metrics (see service-monitor.yaml) +# - KEDA installed (https://keda.sh/docs/deploy/) +# - Prometheus scraping vLLM metrics (prokube includes a cluster-wide PodMonitor) # -# Note: vLLM uses colons in metric names (e.g., vllm:num_requests_running), -# which is unusual but correct. Use {"__name__"="..."} syntax in PromQL. -# -# TODO: Replace the following before deploying: -# - : your actual namespace -# - : your Prometheus server URL (may or may not have a path prefix) -# - opt-125m-vllm: your InferenceService name (in queries and scaleTargetRef) +# Before deploying, replace: +# - "default" in the Prometheus queries with your namespace +# - "opt-125m" in model_name with your --model_name value +# - The serverAddress if your Prometheus uses a different URL # apiVersion: keda.sh/v1alpha1 kind: ScaledObject metadata: - name: opt-125m-vllm-scaledobject - labels: - app: opt-125m-vllm + name: opt-125m-scaledobject spec: - # Target the KServe predictor deployment scaleTargetRef: - apiVersion: apps/v1 - kind: Deployment - name: opt-125m-vllm-predictor-00001-deployment - # Polling interval for checking metrics (seconds) - pollingInterval: 15 - # Cooldown period before scaling down (seconds) - cooldownPeriod: 60 - # Min/max replicas + # In RawDeployment mode KServe names the Deployment {isvc-name}-predictor. 
+ name: opt-125m-predictor minReplicaCount: 1 - maxReplicaCount: 10 - # Advanced scaling behavior + maxReplicaCount: 3 + pollingInterval: 15 # how often KEDA checks the metric (seconds) + cooldownPeriod: 120 # seconds after last trigger activation before scaling to minReplicaCount advanced: horizontalPodAutoscalerConfig: behavior: - scaleDown: - stabilizationWindowSeconds: 120 - policies: - - type: Percent - value: 25 - periodSeconds: 60 scaleUp: stabilizationWindowSeconds: 0 + scaleDown: + stabilizationWindowSeconds: 120 policies: - - type: Percent - value: 100 - periodSeconds: 15 - type: Pods - value: 4 - periodSeconds: 15 - selectPolicy: Max + value: 1 # remove at most 1 replica per minute + periodSeconds: 60 triggers: - # Scale based on Time To First Token (TTFT) - P95 - # Scale up when P95 TTFT exceeds 200ms (0.2s) - - type: prometheus - metadata: - # Adjust URL to your Prometheus setup (some have /prometheus path prefix) - serverAddress: - metricName: vllm_ttft_p95 - query: | - histogram_quantile(0.95, sum(rate({"__name__"="vllm:time_to_first_token_seconds_bucket", namespace="", pod=~"opt-125m-vllm-predictor-.*"}[2m])) by (le)) - threshold: "0.2" - activationThreshold: "0.1" - # Scale based on GPU KV-cache usage (for GPU deployments) - # Scale up when cache usage exceeds 70% - - type: prometheus - metadata: - serverAddress: - metricName: vllm_gpu_cache_usage - query: | - avg({"__name__"="vllm:gpu_cache_usage_perc", namespace="", pod=~"opt-125m-vllm-predictor-.*"}) - threshold: "0.7" - activationThreshold: "0.5" - # Fallback: Scale based on running requests (always works) - type: prometheus metadata: - serverAddress: - metricName: vllm_num_requests_running - query: | - avg({"__name__"="vllm:num_requests_running", namespace="", pod=~"opt-125m-vllm-predictor-.*"}) - threshold: "2" - activationThreshold: "1" + serverAddress: http://kube-prometheus-stack-prometheus.monitoring.svc.cluster.local:9090/prometheus + query: >- + 
sum(rate(vllm:prompt_tokens_total{namespace="default",model_name="opt-125m"}[2m])) + + sum(rate(vllm:generation_tokens_total{namespace="default",model_name="opt-125m"}[2m])) + metricType: AverageValue + threshold: "5" diff --git a/serving/kserve-keda-autoscaling/service-monitor.yaml b/serving/kserve-keda-autoscaling/service-monitor.yaml deleted file mode 100644 index 2bccc93..0000000 --- a/serving/kserve-keda-autoscaling/service-monitor.yaml +++ /dev/null @@ -1,27 +0,0 @@ -# PodMonitor to scrape vLLM metrics from KServe InferenceService -# This enables Prometheus to collect the metrics used by KEDA for autoscaling -# -# Prerequisites: -# - Prometheus Operator installed (kube-prometheus-stack) -# - vLLM runtime exposes metrics at /metrics endpoint -# -apiVersion: monitoring.coreos.com/v1 -kind: PodMonitor -metadata: - name: opt-125m-vllm-metrics - labels: - app: opt-125m-vllm - # Label to match Prometheus Operator's podMonitorSelector - release: kube-prometheus-stack -spec: - selector: - matchLabels: - serving.kserve.io/inferenceservice: opt-125m-vllm - namespaceSelector: - matchNames: - - # TODO: Replace with your namespace - podMetricsEndpoints: - - port: user-port - path: /metrics - interval: 15s - scrapeTimeout: 10s From b3816d7e4e5ad2601ff9705b67370a86bb2bdc70 Mon Sep 17 00:00:00 2001 From: Igor Kvachenok Date: Thu, 12 Mar 2026 14:31:06 +0100 Subject: [PATCH 08/17] Add scaleUp stabilization window to mitigate metric oscillation AverageValue divides total token throughput by replica count, which means the per-replica value halves after a scale-up event. With stabilizationWindowSeconds: 0 this could cause flapping near the threshold. Setting it to 30s requires the metric to stay above threshold for two consecutive polling intervals before a scale-up is committed, while the existing 120s scaleDown window prevents premature scale-down. 
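As a rough illustration of the arithmetic above, the sketch below models how an `AverageValue` target turns total token throughput into a replica count. It is a simplified model and not KEDA or HPA source code: the function names are hypothetical, the 0.1 tolerance is assumed to match the HPA default, and stabilization windows are ignored. The threshold of 5 and the cap of 3 replicas are taken from scaled-object.yaml.

```python
import math


def per_replica_value(total_tokens_per_s: float, replicas: int) -> float:
    """AverageValue semantics: the query result divided by the replica count."""
    return total_tokens_per_s / replicas


def desired_replicas(total_tokens_per_s: float, threshold: float, current: int,
                     min_replicas: int = 1, max_replicas: int = 3,
                     tolerance: float = 0.1) -> int:
    """Simplified replica recommendation (hypothetical helper, not a KEDA API)."""
    ratio = per_replica_value(total_tokens_per_s, current) / threshold
    # Assumption: inside the tolerance band the replica count is left unchanged.
    if abs(1.0 - ratio) <= tolerance:
        return current
    # Otherwise aim for enough replicas that each one sits at or below the threshold.
    desired = math.ceil(total_tokens_per_s / threshold)
    # Clamp to the minReplicaCount / maxReplicaCount bounds.
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # The same total load looks half as large per replica after going from 1 to 2.
    print(per_replica_value(8, 1), per_replica_value(8, 2))
    print(desired_replicas(8, threshold=5, current=1))
    print(desired_replicas(55, threshold=5, current=1))
```

With 8 tokens/second in total, the per-replica value is 8.0 on one replica and 4.0 on two, which is the halving described above. A total of 55 tokens/second would ask for 11 replicas and is capped at 3.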
--- serving/kserve-keda-autoscaling/scaled-object.yaml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/serving/kserve-keda-autoscaling/scaled-object.yaml b/serving/kserve-keda-autoscaling/scaled-object.yaml index 004f038..32bc365 100644 --- a/serving/kserve-keda-autoscaling/scaled-object.yaml +++ b/serving/kserve-keda-autoscaling/scaled-object.yaml @@ -26,7 +26,7 @@ spec: horizontalPodAutoscalerConfig: behavior: scaleUp: - stabilizationWindowSeconds: 0 + stabilizationWindowSeconds: 30 # short window to absorb metric noise before committing to scale-up scaleDown: stabilizationWindowSeconds: 120 policies: From 81e4b823b510b6186b82be70bba7f6960860ee73 Mon Sep 17 00:00:00 2001 From: Igor Kvachenok Date: Thu, 12 Mar 2026 18:05:04 +0100 Subject: [PATCH 09/17] Address reviewer's feedback --- serving/kserve-keda-autoscaling/README.md | 87 ++++++++++++++++--- .../scaled-object.yaml | 11 ++- 2 files changed, 82 insertions(+), 16 deletions(-) diff --git a/serving/kserve-keda-autoscaling/README.md b/serving/kserve-keda-autoscaling/README.md index bbe6f32..66d71d9 100644 --- a/serving/kserve-keda-autoscaling/README.md +++ b/serving/kserve-keda-autoscaling/README.md @@ -29,33 +29,100 @@ making it a stable scaling signal. ## Quick Start -```bash -export NAMESPACE="default" +> [!NOTE] +> All of the examples below should be run in prokube notebook inside your cluster. The model created with RawDeployment is not accessible from outside the cluster by default. +```bash # 1. Deploy the InferenceService -kubectl apply -n $NAMESPACE -f inference-service.yaml +kubectl apply -f inference-service.yaml # 2. Wait for it to become ready -kubectl get isvc opt-125m -n $NAMESPACE -w +kubectl get isvc opt-125m -w # 3. Deploy the KEDA ScaledObject -kubectl apply -n $NAMESPACE -f scaled-object.yaml +kubectl apply -f scaled-object.yaml # 4. 
Verify -kubectl get scaledobject -n $NAMESPACE -kubectl get hpa -n $NAMESPACE +kubectl get scaledobject +kubectl get hpa +``` + +## See It in Action + +After deploying, you can trigger autoscaling and observe the full scale-up / scale-down cycle. + +### 1. Send inference requests + +Get the service URL and send a request: + +```bash +SERVICE_URL=$(kubectl get isvc opt-125m -o jsonpath='{.status.url}') + +curl -s "$SERVICE_URL/openai/v1/completions" \ + -H "Content-Type: application/json" \ + -d '{"model": "opt-125m", "prompt": "Hello world", "max_tokens": 64}' +``` + +### 2. Generate enough load to trigger scale-up + +Run several concurrent workers to push token throughput above the threshold +(5 tokens/second per replica by default): + +```bash +# 5 parallel workers, each sending requests in a loop +for i in $(seq 1 5); do + (while true; do + curl -s "$SERVICE_URL/openai/v1/completions" \ + -H "Content-Type: application/json" \ + -d '{"model": "opt-125m", "prompt": "Write a long story about a dragon", "max_tokens": 200}' > /dev/null + done) & +done + +# Stop the load later with: +# kill $(jobs -p) ``` +### 3. Observe autoscaling + +Watch replicas scale up in response to load: + +```bash +# Watch pods scale up (and later scale down) +kubectl get pods -l serving.kserve.io/inferenceservice=opt-125m -w + +# Check KEDA's HPA +kubectl get hpa -w +``` + +**Grafana dashboards** (prokube clusters): to visualize token throughput and replica count over time, see: +- vLLM Performance Statistics: https:///grafana/d/performance-statistics/vllm-performance-statistics +- vLLM Query Statistics: https:///grafana/d/query-statistics4/vllm-query-statistics +- Replica count: https:///grafana/d/demqj48/kubernetes-compute-resources-workload-copy + +In our testing, the full cycle looked like: +1. **1 replica** at rest +2. Load applied (5 workers, ~55 tok/s total) — KEDA detects threshold breach +3. **Scaled to 3 replicas** within ~30 seconds +4. 
Load removed — metric drops to 0 — stabilization window (120s) +5. **Scaled back down** 3 → 2 → 1 gracefully (1 pod removed per minute) + ## Customization -**Namespace and model name**: replace `default` and `opt-125m` in the -Prometheus queries inside `scaled-object.yaml`. +**Model name**: the `model_name="opt-125m"` filter in the Prometheus queries inside +`scaled-object.yaml` must match the `--model_name` argument in `inference-service.yaml`. **Threshold**: the `threshold: "5"` value means "scale up when each replica handles more than 5 tokens/second on average" (`AverageValue` divides the query result by replica count). Tune this based on load testing for your model and hardware. +**Multi-tenant clusters**: if multiple users may deploy models with the same +name, add a `namespace` filter to the Prometheus queries: + +```promql +sum(rate(vllm:prompt_tokens_total{namespace="my-namespace",model_name="opt-125m"}[2m])) +``` + **GPU deployments**: remove `--dtype=float32` and `--max-model-len=512` from the InferenceService args, add GPU resource requests, and consider adding a second trigger for GPU KV-cache utilization: @@ -66,7 +133,7 @@ adding a second trigger for GPU KV-cache utilization: metadata: serverAddress: http://kube-prometheus-stack-prometheus.monitoring.svc.cluster.local:9090/prometheus query: >- - avg(vllm:gpu_cache_usage_perc{namespace="my-namespace",model_name="my-model"}) + avg(vllm:gpu_cache_usage_perc{model_name="my-model"}) metricType: AverageValue threshold: "0.75" ``` diff --git a/serving/kserve-keda-autoscaling/scaled-object.yaml b/serving/kserve-keda-autoscaling/scaled-object.yaml index 32bc365..40036b8 100644 --- a/serving/kserve-keda-autoscaling/scaled-object.yaml +++ b/serving/kserve-keda-autoscaling/scaled-object.yaml @@ -5,10 +5,9 @@ # - KEDA installed (https://keda.sh/docs/deploy/) # - Prometheus scraping vLLM metrics (prokube includes a cluster-wide PodMonitor) # -# Before deploying, replace: -# - "default" in the Prometheus queries 
with your namespace -# - "opt-125m" in model_name with your --model_name value -# - The serverAddress if your Prometheus uses a different URL +# Customization: +# - "opt-125m" in model_name must match the --model_name arg in inference-service.yaml +# - The serverAddress must match your Prometheus URL # apiVersion: keda.sh/v1alpha1 kind: ScaledObject @@ -38,7 +37,7 @@ spec: metadata: serverAddress: http://kube-prometheus-stack-prometheus.monitoring.svc.cluster.local:9090/prometheus query: >- - sum(rate(vllm:prompt_tokens_total{namespace="default",model_name="opt-125m"}[2m])) - + sum(rate(vllm:generation_tokens_total{namespace="default",model_name="opt-125m"}[2m])) + sum(rate(vllm:prompt_tokens_total{model_name="opt-125m"}[2m])) + + sum(rate(vllm:generation_tokens_total{model_name="opt-125m"}[2m])) metricType: AverageValue threshold: "5" From f2e6f96830a5fe5a968c8b0fa6aeb8541bb8676e Mon Sep 17 00:00:00 2001 From: Igor Kvachenok Date: Thu, 12 Mar 2026 18:26:35 +0100 Subject: [PATCH 10/17] Better readme and scaling watching instructions --- serving/kserve-keda-autoscaling/README.md | 24 +++++++++++++++++------ 1 file changed, 18 insertions(+), 6 deletions(-) diff --git a/serving/kserve-keda-autoscaling/README.md b/serving/kserve-keda-autoscaling/README.md index 66d71d9..582d6cd 100644 --- a/serving/kserve-keda-autoscaling/README.md +++ b/serving/kserve-keda-autoscaling/README.md @@ -30,7 +30,7 @@ making it a stable scaling signal. ## Quick Start > [!NOTE] -> All of the examples below should be run in prokube notebook inside your cluster. The model created with RawDeployment is not accessible from outside the cluster by default. +> All of the examples below should be run in prokube notebook's terminal inside your cluster. The model created with RawDeployment is not accessible from outside the cluster by default. ```bash # 1. Deploy the InferenceService @@ -84,14 +84,26 @@ done ### 3. 
Observe autoscaling -Watch replicas scale up in response to load: +You can use dashboards (see below) or check out any of these in terminal while the load is running: ```bash -# Watch pods scale up (and later scale down) -kubectl get pods -l serving.kserve.io/inferenceservice=opt-125m -w +# Deployment replica count (most direct signal) +kubectl get deployment opt-125m-predictor -w -# Check KEDA's HPA -kubectl get hpa -w +# HPA — shows current metric value vs threshold and desired replica count +kubectl get hpa keda-hpa-opt-125m-scaledobject -w + +# ScaledObject — shows Ready/Active/Paused conditions +kubectl get scaledobject opt-125m-scaledobject -w + +# Pods coming up and terminating +kubectl get pods -l app=isvc.opt-125m-predictor -w +``` + +Or poll a compact summary every 10 seconds: + +```bash +watch -n 10 kubectl get deployment/opt-125m-predictor hpa/keda-hpa-opt-125m-scaledobject ``` **Grafana dashboards** (prokube clusters): to visualize token throughput and replica count over time, see: From 75c9d0aa1b6fb4401f4497bc1b014c7725fd049a Mon Sep 17 00:00:00 2001 From: Igor Kvachenok Date: Thu, 12 Mar 2026 18:44:59 +0100 Subject: [PATCH 11/17] Fix service URL to use internal cluster address and simplify observe section --- serving/kserve-keda-autoscaling/README.md | 49 ++++++++++------------- 1 file changed, 22 insertions(+), 27 deletions(-) diff --git a/serving/kserve-keda-autoscaling/README.md b/serving/kserve-keda-autoscaling/README.md index 582d6cd..ffa0878 100644 --- a/serving/kserve-keda-autoscaling/README.md +++ b/serving/kserve-keda-autoscaling/README.md @@ -53,57 +53,52 @@ After deploying, you can trigger autoscaling and observe the full scale-up / sca ### 1. 
Send inference requests -Get the service URL and send a request: +Get the internal cluster address and send a request: ```bash -SERVICE_URL=$(kubectl get isvc opt-125m -o jsonpath='{.status.url}') +SERVICE_URL="$(kubectl get isvc opt-125m -o jsonpath='{.status.address.url}')" + +echo "\nService URL: $SERVICE_URL" curl -s "$SERVICE_URL/openai/v1/completions" \ -H "Content-Type: application/json" \ - -d '{"model": "opt-125m", "prompt": "Hello world", "max_tokens": 64}' + -d '{"model":"opt-125m","prompt":"Hello world","max_tokens":64}' \ + | python -m json.tool ``` ### 2. Generate enough load to trigger scale-up -Run several concurrent workers to push token throughput above the threshold +Run several concurrent workers (in the background!) to push token throughput above the threshold (5 tokens/second per replica by default): ```bash # 5 parallel workers, each sending requests in a loop for i in $(seq 1 5); do - (while true; do - curl -s "$SERVICE_URL/openai/v1/completions" \ - -H "Content-Type: application/json" \ - -d '{"model": "opt-125m", "prompt": "Write a long story about a dragon", "max_tokens": 200}' > /dev/null - done) & + ( + while true; do + curl -s "$SERVICE_URL/openai/v1/completions" \ + -H "Content-Type: application/json" \ + -d '{"model":"opt-125m","prompt":"Write a long story about a dragon","max_tokens":200}' \ + > /dev/null + sleep 1 + done + ) & done +echo "Load generation started." +echo "Stop it with: kill $(jobs -p)" + # Stop the load later with: # kill $(jobs -p) ``` ### 3. 
Observe autoscaling -You can use dashboards (see below) or check out any of these in terminal while the load is running: - -```bash -# Deployment replica count (most direct signal) -kubectl get deployment opt-125m-predictor -w - -# HPA — shows current metric value vs threshold and desired replica count -kubectl get hpa keda-hpa-opt-125m-scaledobject -w - -# ScaledObject — shows Ready/Active/Paused conditions -kubectl get scaledobject opt-125m-scaledobject -w - -# Pods coming up and terminating -kubectl get pods -l app=isvc.opt-125m-predictor -w -``` - -Or poll a compact summary every 10 seconds: +You can use dashboards (recommended, see below) or get a compact summary in terminal: ```bash -watch -n 10 kubectl get deployment/opt-125m-predictor hpa/keda-hpa-opt-125m-scaledobject +# polls every 10 seconds +watch -n 10 kubectl get deployment/opt-125m-predictor hpa/keda-hpa-opt-125m-scaledobject scaledobject/opt-125m-scaledobject ``` **Grafana dashboards** (prokube clusters): to visualize token throughput and replica count over time, see: From 392a22a823f0e01835216a4ba03eb91718ef2d88 Mon Sep 17 00:00:00 2001 From: Igor Kvachenok Date: Thu, 12 Mar 2026 19:06:00 +0100 Subject: [PATCH 12/17] Improve KEDA autoscaling documentation --- serving/kserve-keda-autoscaling/README.md | 45 ++++++++++++----------- 1 file changed, 24 insertions(+), 21 deletions(-) diff --git a/serving/kserve-keda-autoscaling/README.md b/serving/kserve-keda-autoscaling/README.md index ffa0878..3f49bed 100644 --- a/serving/kserve-keda-autoscaling/README.md +++ b/serving/kserve-keda-autoscaling/README.md @@ -39,7 +39,7 @@ kubectl apply -f inference-service.yaml # 2. Wait for it to become ready kubectl get isvc opt-125m -w -# 3. Deploy the KEDA ScaledObject +# 3. Deploy the KEDA ScaledObject (requires corresponding permissions) kubectl apply -f scaled-object.yaml # 4. 
Verify @@ -56,14 +56,13 @@ After deploying, you can trigger autoscaling and observe the full scale-up / sca Get the internal cluster address and send a request: ```bash -SERVICE_URL="$(kubectl get isvc opt-125m -o jsonpath='{.status.address.url}')" - -echo "\nService URL: $SERVICE_URL" +# inference service name + "-predictor" +SERVICE_URL=opt-125m-predictor curl -s "$SERVICE_URL/openai/v1/completions" \ -H "Content-Type: application/json" \ - -d '{"model":"opt-125m","prompt":"Hello world","max_tokens":64}' \ - | python -m json.tool + -d '{"model":"opt-125m","prompt":"What is AI?","max_tokens":64}' \ + | python -c 'import json,sys;print("\n", json.load(sys.stdin)["choices"][0]["text"].strip(), "\n")' ``` ### 2. Generate enough load to trigger scale-up @@ -73,23 +72,20 @@ Run several concurrent workers (in the background!) to push token throughput abo ```bash # 5 parallel workers, each sending requests in a loop +PIDS="" for i in $(seq 1 5); do - ( - while true; do - curl -s "$SERVICE_URL/openai/v1/completions" \ - -H "Content-Type: application/json" \ - -d '{"model":"opt-125m","prompt":"Write a long story about a dragon","max_tokens":200}' \ - > /dev/null - sleep 1 - done - ) & + (while true; do + curl -s "$SERVICE_URL/openai/v1/completions" \ + -H "Content-Type: application/json" \ + -d '{"model":"opt-125m","prompt":"Write a long story about a dragon","max_tokens":200}' > /dev/null + done) & + PIDS="$PIDS $!" done -echo "Load generation started." -echo "Stop it with: kill $(jobs -p)" - -# Stop the load later with: -# kill $(jobs -p) +echo +echo "Load running (PIDs:$PIDS)" +echo "Stop with: kill$PIDS" +echo ``` ### 3. 
Observe autoscaling @@ -98,7 +94,14 @@ You can use dashboards (recommended, see below) or get a compact summary in term ```bash # polls every 10 seconds -watch -n 10 kubectl get deployment/opt-125m-predictor hpa/keda-hpa-opt-125m-scaledobject scaledobject/opt-125m-scaledobject +watch -n10 ' +echo "Deployment:" +kubectl get deployment opt-125m-predictor + +echo +echo "Autoscaler:" +kubectl get hpa keda-hpa-opt-125m-scaledobject +' ``` **Grafana dashboards** (prokube clusters): to visualize token throughput and replica count over time, see: From 35ef20dc25b5578706c73e4fe80ad67e275627d3 Mon Sep 17 00:00:00 2001 From: Igor Kvachenok Date: Fri, 13 Mar 2026 18:32:58 +0100 Subject: [PATCH 13/17] Warn about KEDA availability and namespace metric collision --- serving/kserve-keda-autoscaling/README.md | 27 +++++++++++++++-------- 1 file changed, 18 insertions(+), 9 deletions(-) diff --git a/serving/kserve-keda-autoscaling/README.md b/serving/kserve-keda-autoscaling/README.md index 3f49bed..e8986dc 100644 --- a/serving/kserve-keda-autoscaling/README.md +++ b/serving/kserve-keda-autoscaling/README.md @@ -17,7 +17,7 @@ making it a stable scaling signal. ## Prerequisites -- KEDA installed in the cluster (`helm install keda kedacore/keda -n keda --create-namespace`) +- KEDA installed in the cluster — not available in all prokube clusters by default; see step 3 below - Prometheus scraping vLLM metrics (prokube clusters include a cluster-wide PodMonitor) ## Files @@ -39,7 +39,7 @@ kubectl apply -f inference-service.yaml # 2. Wait for it to become ready kubectl get isvc opt-125m -w -# 3. Deploy the KEDA ScaledObject (requires corresponding permissions) +# 3. Deploy the KEDA ScaledObject kubectl apply -f scaled-object.yaml # 4. Verify @@ -47,6 +47,22 @@ kubectl get scaledobject kubectl get hpa ``` +> [!WARNING] +> If step 3 fails with `no matches for kind "ScaledObject"`, KEDA is not installed in your cluster. +> Ask your admin to enable it. 
+ +> [!WARNING] +> The Prometheus query in `scaled-object.yaml` has no `namespace` filter, so it aggregates token +> throughput across **all namespaces**. This is fine for testing, but if multiple users deploy a +> model named `opt-125m` at the same time, their metrics will interfere and autoscaling will be +> incorrect for both. For any real use, add a namespace filter to both queries in `scaled-object.yaml`: +> +> ```yaml +> query: >- +> sum(rate(vllm:prompt_tokens_total{namespace="",model_name="opt-125m"}[2m])) +> + sum(rate(vllm:generation_tokens_total{namespace="",model_name="opt-125m"}[2m])) +> ``` + ## See It in Action After deploying, you can trigger autoscaling and observe the full scale-up / scale-down cycle. @@ -126,13 +142,6 @@ handles more than 5 tokens/second on average" (`AverageValue` divides the query result by replica count). Tune this based on load testing for your model and hardware. -**Multi-tenant clusters**: if multiple users may deploy models with the same -name, add a `namespace` filter to the Prometheus queries: - -```promql -sum(rate(vllm:prompt_tokens_total{namespace="my-namespace",model_name="opt-125m"}[2m])) -``` - **GPU deployments**: remove `--dtype=float32` and `--max-model-len=512` from the InferenceService args, add GPU resource requests, and consider adding a second trigger for GPU KV-cache utilization: From bd2a6e406446294a77afd71222f22f47a899e7be Mon Sep 17 00:00:00 2001 From: Igor Kvachenok Date: Thu, 19 Mar 2026 15:06:36 +0100 Subject: [PATCH 14/17] Improve load generation --- serving/kserve-keda-autoscaling/README.md | 84 ++++-- .../kserve-keda-autoscaling/load-generator.py | 261 ++++++++++++++++++ 2 files changed, 322 insertions(+), 23 deletions(-) create mode 100644 serving/kserve-keda-autoscaling/load-generator.py diff --git a/serving/kserve-keda-autoscaling/README.md b/serving/kserve-keda-autoscaling/README.md index e8986dc..77cbb6e 100644 --- a/serving/kserve-keda-autoscaling/README.md +++ 
b/serving/kserve-keda-autoscaling/README.md @@ -26,6 +26,7 @@ making it a stable scaling signal. |------|-------------| | `inference-service.yaml` | KServe InferenceService (OPT-125M, RawDeployment mode) | | `scaled-object.yaml` | KEDA ScaledObject — scales on token throughput | +| `load-generator.py` | Python load generator with presets for different scaling scenarios | ## Quick Start @@ -67,7 +68,7 @@ kubectl get hpa After deploying, you can trigger autoscaling and observe the full scale-up / scale-down cycle. -### 1. Send inference requests +### 1. Send a test inference request Get the internal cluster address and send a request: @@ -81,29 +82,30 @@ curl -s "$SERVICE_URL/openai/v1/completions" \ | python -c 'import json,sys;print("\n", json.load(sys.stdin)["choices"][0]["text"].strip(), "\n")' ``` -### 2. Generate enough load to trigger scale-up +### 2. Generate load to trigger scale-up -Run several concurrent workers (in the background!) to push token throughput above the threshold -(5 tokens/second per replica by default): +Use the included load generator to produce controlled, sustained load. +It has two presets calibrated for the opt-125m model on CPU: + +| Mode | Workers | Sleep | Throughput | Scaling behavior | +|------|---------|-------|------------|------------------| +| `stable-2` | 1 | 8s | ~8 tok/s | Scales to 2 replicas and holds | +| `stable-3` | 2 | 2s | ~22 tok/s | Scales to 3 replicas and holds | ```bash -# 5 parallel workers, each sending requests in a loop -PIDS="" -for i in $(seq 1 5); do - (while true; do - curl -s "$SERVICE_URL/openai/v1/completions" \ - -H "Content-Type: application/json" \ - -d '{"model":"opt-125m","prompt":"Write a long story about a dragon","max_tokens":200}' > /dev/null - done) & - PIDS="$PIDS $!" 
-done +# Scale to 2 replicas (moderate load) +python load-generator.py --mode stable-2 -echo -echo "Load running (PIDs:$PIDS)" -echo "Stop with: kill$PIDS" -echo +# Scale to 3 replicas (heavy load) +python load-generator.py --mode stable-3 + +# Custom: pick your own concurrency and pacing +python load-generator.py --mode custom --workers 3 --sleep 1.0 ``` +Press `Ctrl+C` to stop the load at any time. By default the script runs for +10 minutes; override with `--duration`. + ### 3. Observe autoscaling You can use dashboards (recommended, see below) or get a compact summary in terminal: @@ -125,12 +127,37 @@ kubectl get hpa keda-hpa-opt-125m-scaledobject - vLLM Query Statistics: https:///grafana/d/query-statistics4/vllm-query-statistics - Replica count: https:///grafana/d/demqj48/kubernetes-compute-resources-workload-copy -In our testing, the full cycle looked like: +### Expected behavior + +**Stable-2 mode** (~8 tok/s): +1. **1 replica** at rest +2. Load applied — metric rises to ~8 tok/s — `ceil(8/5) = 2` replicas needed +3. **Scaled to 2 replicas** within ~1 minute +4. Metric stabilizes at ~4 tok/s per replica (below threshold) — stays at 2 +5. Load removed — metric drops to 0 — cooldown period (120s) + stabilization window (120s) +6. **Scaled back to 1** replica + +**Stable-3 mode** (~22 tok/s): 1. **1 replica** at rest -2. Load applied (5 workers, ~55 tok/s total) — KEDA detects threshold breach -3. **Scaled to 3 replicas** within ~30 seconds -4. Load removed — metric drops to 0 — stabilization window (120s) -5. **Scaled back down** 3 → 2 → 1 gracefully (1 pod removed per minute) +2. Load applied — metric rises quickly — `ceil(22/5) = 5`, capped at `maxReplicas=3` +3. **Scaled to 3 replicas** within ~1-2 minutes +4. Load removed — gradual scale-down: 3 → 2 → 1 (one pod removed per minute) + +## Scaling Math + +The ScaledObject uses `metricType: AverageValue` with `threshold: 5`. 
For +external metrics, HPA computes: + +``` +desiredReplicas = ceil(totalMetricValue / threshold) +``` + +| Total tok/s | Desired replicas | Actual (capped 1-3) | +|-------------|------------------|---------------------| +| 0-5 | 1 | 1 | +| 5.1-10 | 2 | 2 | +| 10.1-15 | 3 | 3 | +| 15+ | 4+ | 3 (maxReplicas) | ## Customization @@ -142,6 +169,17 @@ handles more than 5 tokens/second on average" (`AverageValue` divides the query result by replica count). Tune this based on load testing for your model and hardware. +**Load generator presets**: the presets in `load-generator.py` are calibrated for +opt-125m on CPU. If you change the model, hardware, or threshold, you'll need to +recalibrate. Use `--mode custom` to experiment, and watch the Prometheus metric: + +```bash +# Check the actual metric value KEDA sees +kubectl run prom-check --rm -it --restart=Never --image=curlimages/curl -- \ + -s 'http://kube-prometheus-stack-prometheus.monitoring.svc.cluster.local:9090/prometheus/api/v1/query' \ + --data-urlencode 'query=sum(rate(vllm:prompt_tokens_total{model_name="opt-125m"}[2m])) + sum(rate(vllm:generation_tokens_total{model_name="opt-125m"}[2m]))' +``` + **GPU deployments**: remove `--dtype=float32` and `--max-model-len=512` from the InferenceService args, add GPU resource requests, and consider adding a second trigger for GPU KV-cache utilization: diff --git a/serving/kserve-keda-autoscaling/load-generator.py b/serving/kserve-keda-autoscaling/load-generator.py new file mode 100644 index 0000000..f33381a --- /dev/null +++ b/serving/kserve-keda-autoscaling/load-generator.py @@ -0,0 +1,261 @@ +#!/usr/bin/env python3 +""" +Load generator for vLLM-based KServe InferenceService autoscaling demos. 
+ +Provides two preset scenarios to demonstrate KEDA autoscaling behaviors: + + stable-2 - Sustained moderate load that triggers a stable scale-up to 2 replicas + stable-3 - Sustained heavy load that triggers a stable scale-up to 3 replicas + +You can also run in 'custom' mode to specify your own concurrency and sleep values. + +Usage (run from a terminal with network access to the service, e.g. a Kubeflow notebook): + python load-generator.py --mode stable-2 + python load-generator.py --mode stable-3 --duration 300 + python load-generator.py --mode custom --workers 3 --sleep 2.0 + +Press Ctrl+C to stop the load at any time. +""" + +import argparse +import json +import signal +import threading +import time +import urllib.request +import urllib.error + +# --------------------------------------------------------------------------- +# Preset configurations +# +# Each preset defines (workers, sleep_between_requests). +# "workers" is the number of concurrent request loops. +# "sleep_between_requests" is how long each worker pauses (seconds) after +# receiving a response before sending the next request. +# +# The ScaledObject uses: +# metricType: AverageValue (total metric value / current replicas) +# threshold: 5 (tokens/sec per replica) +# +# HPA computes: desiredReplicas = ceil(total_tok_per_sec / threshold) +# +# stable-2: target ~8 tok/s total -> ceil(8/5) = 2 +# stable-3: target ~22 tok/s total -> ceil(22/5) = 5, capped at maxReplicas=3 +# +# Calibrated for opt-125m on CPU (float32, --max-model-len=512). +# Each request averages ~116 tokens and ~6.7s of processing time. +# Effective rate per worker ≈ 116 / (6.7 + sleep) tok/s. 
+# --------------------------------------------------------------------------- + +PRESETS = { + "stable-2": {"workers": 1, "sleep": 8.0}, + "stable-3": {"workers": 2, "sleep": 2.0}, +} + +# Default model endpoint (Kubernetes service DNS name in RawDeployment mode) +DEFAULT_URL = "http://opt-125m-predictor/openai/v1/completions" +DEFAULT_DURATION = 600 # 10 minutes +DEFAULT_MAX_TOKENS = 200 +DEFAULT_PROMPT = ( + "Write a long detailed story about a dragon who discovers a hidden kingdom" +) + +# --------------------------------------------------------------------------- +# Globals for stats +# --------------------------------------------------------------------------- +stats_lock = threading.Lock() +total_requests = 0 +total_tokens = 0 +total_errors = 0 +start_time: float = 0.0 +stop_event = threading.Event() + + +def send_request(url: str, prompt: str, max_tokens: int) -> dict | None: + """Send a single completion request to the vLLM endpoint.""" + payload = json.dumps( + { + "model": "opt-125m", + "prompt": prompt, + "max_tokens": max_tokens, + } + ).encode("utf-8") + + req = urllib.request.Request( + url, + data=payload, + headers={"Content-Type": "application/json"}, + ) + try: + with urllib.request.urlopen(req, timeout=120) as resp: + return json.loads(resp.read().decode("utf-8")) + except (urllib.error.URLError, OSError, json.JSONDecodeError): + return None + + +def worker_loop( + worker_id: int, url: str, prompt: str, max_tokens: int, sleep_sec: float +): + """Continuously send requests with a sleep between each, until stop_event is set.""" + global total_requests, total_tokens, total_errors + + while not stop_event.is_set(): + result = send_request(url, prompt, max_tokens) + + with stats_lock: + if result and "usage" in result: + total_requests += 1 + total_tokens += result["usage"].get("total_tokens", 0) + else: + total_errors += 1 + + # Sleep between requests (interruptible via stop_event) + if sleep_sec > 0 and not stop_event.is_set(): + 
stop_event.wait(timeout=sleep_sec) + + +def print_stats(): + """Periodically print throughput stats.""" + while not stop_event.is_set(): + stop_event.wait(timeout=10) + if stop_event.is_set(): + break + elapsed = time.time() - start_time + with stats_lock: + tok_rate = total_tokens / elapsed if elapsed > 0 else 0 + req_rate = total_requests / elapsed if elapsed > 0 else 0 + print( + f" [{elapsed:6.0f}s] requests={total_requests} " + f"tokens={total_tokens} errors={total_errors} " + f"avg_tok/s={tok_rate:.1f} avg_req/s={req_rate:.2f}" + ) + + +def main(): + global start_time + + parser = argparse.ArgumentParser( + description="Load generator for KServe + KEDA autoscaling demo", + formatter_class=argparse.RawDescriptionHelpFormatter, + epilog=__doc__, + ) + parser.add_argument( + "--mode", + choices=["stable-2", "stable-3", "custom"], + default="stable-2", + help="Load preset (default: stable-2)", + ) + parser.add_argument( + "--workers", + type=int, + default=None, + help="Number of concurrent workers (custom mode)", + ) + parser.add_argument( + "--sleep", + type=float, + default=None, + help="Sleep seconds between requests per worker (custom mode)", + ) + parser.add_argument( + "--url", + default=DEFAULT_URL, + help=f"Service URL (default: {DEFAULT_URL})", + ) + parser.add_argument( + "--duration", + type=int, + default=DEFAULT_DURATION, + help=f"Duration in seconds (default: {DEFAULT_DURATION})", + ) + parser.add_argument( + "--max-tokens", + type=int, + default=DEFAULT_MAX_TOKENS, + help=f"Max tokens per request (default: {DEFAULT_MAX_TOKENS})", + ) + parser.add_argument( + "--prompt", + default=DEFAULT_PROMPT, + help="Prompt text", + ) + + args = parser.parse_args() + + # Resolve configuration + if args.mode == "custom": + if args.workers is None or args.sleep is None: + parser.error("--workers and --sleep are required in custom mode") + workers = args.workers + sleep_sec = args.sleep + else: + preset = PRESETS[args.mode] + workers = args.workers if args.workers 
is not None else preset["workers"] + sleep_sec = args.sleep if args.sleep is not None else preset["sleep"] + + print("=== Load Generator ===") + print(f" Mode: {args.mode}") + print(f" Workers: {workers}") + print(f" Sleep: {sleep_sec}s between requests") + print(f" URL: {args.url}") + print(f" Duration: {args.duration}s") + print(f" Max tokens: {args.max_tokens}") + print() + print("Starting load... (Ctrl+C to stop)") + print() + + # Handle Ctrl+C gracefully + def signal_handler(sig, frame): + print("\n\nStopping load...") + stop_event.set() + + signal.signal(signal.SIGINT, signal_handler) + signal.signal(signal.SIGTERM, signal_handler) + + start_time = time.time() + + # Start stats printer + stats_thread = threading.Thread(target=print_stats, daemon=True) + stats_thread.start() + + # Start worker threads + threads = [] + for i in range(workers): + t = threading.Thread( + target=worker_loop, + args=(i, args.url, args.prompt, args.max_tokens, sleep_sec), + daemon=True, + ) + t.start() + threads.append(t) + + # Wait for duration or Ctrl+C + try: + stop_event.wait(timeout=args.duration) + except KeyboardInterrupt: + pass + + stop_event.set() + + # Wait for threads to finish + for t in threads: + t.join(timeout=5) + + # Final stats + elapsed = time.time() - start_time + with stats_lock: + tok_rate = total_tokens / elapsed if elapsed > 0 else 0 + req_rate = total_requests / elapsed if elapsed > 0 else 0 + + print() + print("=== Final Stats ===") + print(f" Duration: {elapsed:.1f}s") + print(f" Requests: {total_requests}") + print(f" Tokens: {total_tokens}") + print(f" Errors: {total_errors}") + print(f" Avg tok/s: {tok_rate:.1f}") + print(f" Avg req/s: {req_rate:.2f}") + + +if __name__ == "__main__": + main() From ae78abb77f4e89e9e3aad1bbd4a51625d39aa5bd Mon Sep 17 00:00:00 2001 From: Igor Kvachenok Date: Thu, 19 Mar 2026 15:26:05 +0100 Subject: [PATCH 15/17] Update dashboards in readme --- serving/kserve-keda-autoscaling/README.md | 22 +++++++++++++++++----- 1 file 
changed, 17 insertions(+), 5 deletions(-) diff --git a/serving/kserve-keda-autoscaling/README.md b/serving/kserve-keda-autoscaling/README.md index 77cbb6e..0c27586 100644 --- a/serving/kserve-keda-autoscaling/README.md +++ b/serving/kserve-keda-autoscaling/README.md @@ -48,11 +48,12 @@ kubectl get scaledobject kubectl get hpa ``` -> [!WARNING] +> **NOTE 1.** > If step 3 fails with `no matches for kind "ScaledObject"`, KEDA is not installed in your cluster. > Ask your admin to enable it. -> [!WARNING] + +> **NOTE 2.** > The Prometheus query in `scaled-object.yaml` has no `namespace` filter, so it aggregates token > throughput across **all namespaces**. This is fine for testing, but if multiple users deploy a > model named `opt-125m` at the same time, their metrics will interfere and autoscaling will be @@ -123,9 +124,20 @@ kubectl get hpa keda-hpa-opt-125m-scaledobject ``` **Grafana dashboards** (prokube clusters): to visualize token throughput and replica count over time, see: -- vLLM Performance Statistics: https:///grafana/d/performance-statistics/vllm-performance-statistics -- vLLM Query Statistics: https:///grafana/d/query-statistics4/vllm-query-statistics -- Replica count: https:///grafana/d/demqj48/kubernetes-compute-resources-workload-copy + +- General vLLM Dashboard: + https://YOUR_DOMAIN/grafana/d/b281712d-8bff-41ef-9f3f-71ad43c05e9b/vllm + +- vLLM Performance Statistics: + https://YOUR_DOMAIN/grafana/d/performance-statistics/vllm-performance-statistics + +- vLLM Query Statistics: + https://YOUR_DOMAIN/grafana/d/query-statistics4/vllm-query-statistics + +- Replica count: + https://YOUR_DOMAIN/grafana/d/a164a7f0339f99e89cea5cb47e9be617/kubernetes-compute-resources-workload + +Replace `YOUR_DOMAIN` with your cluster domain. 
 ### Expected behavior

From d0635cde868e27e7ccc33b6e9d42010cf2487261 Mon Sep 17 00:00:00 2001
From: Igor Kvachenok
Date: Sun, 22 Mar 2026 12:38:32 +0100
Subject: [PATCH 16/17] Add a mermaid diagram to illustrate KEDA

---
 serving/kserve-keda-autoscaling/README.md | 51 +++++++++++++++++++++++
 1 file changed, 51 insertions(+)

diff --git a/serving/kserve-keda-autoscaling/README.md b/serving/kserve-keda-autoscaling/README.md
index 0c27586..11642af 100644
--- a/serving/kserve-keda-autoscaling/README.md
+++ b/serving/kserve-keda-autoscaling/README.md
@@ -8,6 +8,57 @@ which is better suited for LLM inference workloads.

 For full documentation, see the [prokube autoscaling docs](https://docs.prokube.cloud/user_docs/model_serving_autoscaling/#keda-kubernetes-event-driven-autoscaling).

+## Architecture
+
+```mermaid
+flowchart LR
+
+%% ---------- STYLES ----------
+classDef kserve fill:#E8F0FE,stroke:#1A73E8,stroke-width:2px,color:#0B3D91
+classDef pod fill:#F1F8E9,stroke:#558B2F,stroke-width:2px
+classDef infra fill:#FFF8E1,stroke:#FF8F00,stroke-width:2px
+classDef traffic fill:#FCE4EC,stroke:#C2185B,stroke-width:2px
+
+%% ---------- KSERVE ----------
+subgraph KServe[" KServe "]
+    direction TB
+
+    ISVC["InferenceService<br/>opt-125m"]:::kserve
+    DEP["Deployment<br/>opt-125m-predictor"]:::kserve
+
+    subgraph POD["Predictor Pod ×1–3"]
+        CTR["kserve-container<br/>HuggingFace runtime · vLLM engine<br/>facebook/opt-125m :8080"]:::pod
+    end
+
+    ISVC -->|creates| DEP
+    DEP -->|manages| CTR
+end
+
+%% ---------- OBSERVABILITY ----------
+subgraph Observability[" Observability "]
+    PROM[("Prometheus")]:::infra
+end
+
+%% ---------- KEDA ----------
+subgraph KEDA[" KEDA Autoscaling "]
+    direction TB
+    SO["ScaledObject<br/>Prometheus trigger<br/>threshold: 5 tok/s"]:::infra
+    HPA["HorizontalPodAutoscaler"]:::infra
+
+    SO -->|creates & drives| HPA
+end
+
+%% ---------- LOAD ----------
+LG["⚡ load-generator.py"]:::traffic
+
+%% ---------- FLOWS ----------
+CTR -->|/metrics| PROM
+PROM -->|query every 15s| SO
+HPA -->|scales| DEP
+SO -. targets .-> DEP
+LG -->|POST /completions| CTR
+```
+
 ## Why Token Throughput?

 LLM requests vary wildly in duration depending on prompt and output length.
From 5fec79301bbc5105e2cf76bdde9cdbe906d0f8cd Mon Sep 17 00:00:00 2001
From: Igor Kvachenok
Date: Mon, 23 Mar 2026 11:18:26 +0100
Subject: [PATCH 17/17] Prettify dashboard name and readme nitpick

---
 serving/kserve-keda-autoscaling/README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/serving/kserve-keda-autoscaling/README.md b/serving/kserve-keda-autoscaling/README.md
index 11642af..e34d5be 100644
--- a/serving/kserve-keda-autoscaling/README.md
+++ b/serving/kserve-keda-autoscaling/README.md
@@ -177,7 +177,7 @@ kubectl get hpa keda-hpa-opt-125m-scaledobject
 **Grafana dashboards** (prokube clusters): to visualize token throughput and replica count over time, see:

 - General vLLM Dashboard:
-  https://YOUR_DOMAIN/grafana/d/b281712d-8bff-41ef-9f3f-71ad43c05e9b/vllm
+  https://YOUR_DOMAIN/grafana/d/vllm-general/vllm

 - vLLM Performance Statistics:
   https://YOUR_DOMAIN/grafana/d/performance-statistics/vllm-performance-statistics
@@ -185,7 +185,7 @@ kubectl get hpa keda-hpa-opt-125m-scaledobject
 - vLLM Query Statistics:
   https://YOUR_DOMAIN/grafana/d/query-statistics4/vllm-query-statistics

-- Replica count:
+- Replica count and CPU load (you have to select your namespace/workload manually):
   https://YOUR_DOMAIN/grafana/d/a164a7f0339f99e89cea5cb47e9be617/kubernetes-compute-resources-workload

 Replace `YOUR_DOMAIN` with your cluster domain.
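
Note on the ScaledObject threshold shown in the Architecture diagram of PATCH 16/17: the following is a rough sketch, written in Python to match `load-generator.py`, of how the HPA that KEDA creates would turn the aggregated token throughput into a replica count. It assumes the Prometheus trigger uses KEDA's default `AverageValue` metric type, takes the 5 tok/s threshold and the 1 to 3 replica range from the diagram, and ignores the HPA tolerance band, stabilization windows, and KEDA's cooldown period. `desired_replicas` is a hypothetical helper for illustration and is not part of this patch series.

```python
import math


def desired_replicas(total_tok_per_s: float, threshold: float = 5.0,
                     min_replicas: int = 1, max_replicas: int = 3) -> int:
    """Approximate replica count for an AverageValue external metric.

    Assumption: the metric is the total token throughput returned by the
    Prometheus query, and the threshold is the per-replica target.
    """
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    # Total metric divided by the per-replica target, rounded up.
    raw = math.ceil(total_tok_per_s / threshold)
    # Clamp to the replica range shown in the diagram (1 to 3 pods).
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    for rate in (0.0, 4.0, 7.5, 12.0, 40.0):
        print(f"{rate:5.1f} tok/s -> {desired_replicas(rate)} replica(s)")
```

Under these assumptions, any sustained throughput above 10 tok/s already pins the deployment at the 3-replica maximum, which may be worth keeping in mind when choosing a load-generator mode for a demo.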