
Commit b69e04d

Update KEDA example with new insights
1 parent b1a1034 commit b69e04d

4 files changed

Lines changed: 91 additions & 261 deletions

serving/kserve-keda-autoscaling/README.md

Lines changed: 49 additions & 163 deletions
````diff
@@ -1,193 +1,79 @@
-# KServe Autoscaling with KEDA and Custom Metrics
+# KServe Autoscaling with KEDA and Custom Prometheus Metrics
 
-This example demonstrates how to autoscale KServe InferenceServices using [KEDA](https://keda.sh/) with custom Prometheus metrics. This is particularly useful for LLM inference workloads where request-based autoscaling (Knative default) is not optimal.
+This example demonstrates autoscaling a KServe InferenceService using
+[KEDA](https://keda.sh/) with custom Prometheus metrics from vLLM.
+It scales based on total token throughput rather than simple request count,
+which is better suited for LLM inference workloads.
 
-## Why Custom Metrics for LLM Autoscaling?
+For full documentation, see the
+[prokube autoscaling docs](https://docs.prokube.cloud/user_docs/model_serving_autoscaling/#keda-kubernetes-event-driven-autoscaling).
 
-Traditional request-based autoscaling doesn't work well for LLM inference because:
+## Why Token Throughput?
 
-- **Token-level work**: LLM inference operates at token level, not request level. A single request can generate hundreds of tokens.
-- **Variable latency**: Request latency varies significantly based on input/output token count.
-- **Memory pressure**: LLM models require significant GPU memory (KV cache), which fills up based on concurrent requests.
-
-Better metrics for LLM autoscaling include:
-- **Time To First Token (TTFT)**: Latency until first token is generated
-- **KV Cache utilization**: GPU memory used for attention cache
-- **Number of running/waiting requests**: Queue depth
+LLM requests vary wildly in duration depending on prompt and output length.
+Request-count metrics (concurrency, QPS) don't reflect actual GPU load.
+Token throughput stays elevated as long as the model is under pressure,
+making it a stable scaling signal.
 
 ## Prerequisites
 
-Install KEDA in the cluster:
-
-```bash
-helm repo add kedacore https://kedacore.github.io/charts
-helm install keda kedacore/keda --namespace keda --create-namespace
-```
+- KEDA installed in the cluster (`helm install keda kedacore/keda -n keda --create-namespace`)
+- Prometheus scraping vLLM metrics (prokube clusters include a cluster-wide PodMonitor)
 
 ## Files
 
 | File | Description |
 |------|-------------|
-| `inference-service.yaml` | KServe InferenceService for OPT-125M model with vLLM backend |
-| `scaled-object.yaml` | KEDA ScaledObject with TTFT, GPU cache, and request-based scaling |
-| `service-monitor.yaml` | PodMonitor for vLLM metrics collection |
-
-## Deployment
+| `inference-service.yaml` | KServe InferenceService (OPT-125M, RawDeployment mode) |
+| `scaled-object.yaml` | KEDA ScaledObject — scales on token throughput |
 
-### 1. Deploy the InferenceService
+## Quick Start
 
 ```bash
-kubectl apply -f inference-service.yaml -n <your-namespace>
-```
+export NAMESPACE="default"
 
-Wait for the model to be ready:
-```bash
-kubectl get inferenceservice opt-125m-vllm -n <your-namespace> -w
-```
-
-### 2. Configure Prometheus Metrics Collection
+# 1. Deploy the InferenceService
+kubectl apply -n $NAMESPACE -f inference-service.yaml
 
-Apply the PodMonitor to scrape vLLM metrics:
-```bash
-kubectl apply -f service-monitor.yaml -n <your-namespace>
-```
+# 2. Wait for it to become ready
+kubectl get isvc opt-125m -n $NAMESPACE -w
 
-**Note:** Update the `namespaceSelector` in `service-monitor.yaml` to match your namespace.
+# 3. Deploy the KEDA ScaledObject
+kubectl apply -n $NAMESPACE -f scaled-object.yaml
 
-### 3. Deploy KEDA ScaledObject
-
-First, identify the correct deployment name:
-```bash
-kubectl get deployments -n <your-namespace> | grep opt-125m-vllm
-```
-
-Update `scaled-object.yaml` with:
-- The correct deployment name in `scaleTargetRef`
-- Your Prometheus server URL (e.g., `http://prometheus.monitoring:9090` or with path prefix)
-- Your namespace in the Prometheus queries
-- Your InferenceService name in the pod selector (e.g., `pod=~"opt-125m-vllm-predictor-.*"`)
-
-Then apply:
-```bash
-kubectl apply -f scaled-object.yaml -n <your-namespace>
+# 4. Verify
+kubectl get scaledobject -n $NAMESPACE
+kubectl get hpa -n $NAMESPACE
 ```
 
-Verify the ScaledObject:
-```bash
-kubectl get scaledobject -n <your-namespace>
-kubectl describe scaledobject opt-125m-vllm-scaledobject -n <your-namespace>
-```
+## Customization
 
-## Autoscaling Strategies
+**Namespace and model name**: replace `default` and `opt-125m` in the
+Prometheus queries inside `scaled-object.yaml`.
 
-This example uses three triggers. KEDA evaluates all triggers and scales based on the highest desired replica count:
+**Threshold**: the `threshold: "5"` value means "scale up when each replica
+handles more than 5 tokens/second on average" (`AverageValue` divides the
+query result by replica count). Tune this based on load testing for your
+model and hardware.
 
-### 1. Time To First Token (TTFT) - P95
-Scales when the 95th percentile TTFT exceeds 200ms:
-```yaml
-triggers:
-  - type: prometheus
-    metadata:
-      query: |
-        histogram_quantile(0.95, sum(rate({"__name__"="vllm:time_to_first_token_seconds_bucket", namespace="<your-namespace>"}[2m])) by (le))
-      threshold: "0.2"
-```
+**GPU deployments**: remove `--dtype=float32` and `--max-model-len=512`
+from the InferenceService args, add GPU resource requests, and consider
+adding a second trigger for GPU KV-cache utilization:
 
-### 2. GPU KV-Cache Utilization
-Scales when GPU cache usage exceeds 70% (for GPU deployments):
 ```yaml
-triggers:
-  - type: prometheus
-    metadata:
-      query: |
-        avg({"__name__"="vllm:gpu_cache_usage_perc", namespace="<your-namespace>"})
-      threshold: "0.7"
+# Add to scaled-object.yaml triggers list
+- type: prometheus
+  metadata:
+    serverAddress: http://kube-prometheus-stack-prometheus.monitoring.svc.cluster.local:9090/prometheus
+    query: >-
+      avg(vllm:gpu_cache_usage_perc{namespace="my-namespace",model_name="my-model"})
+    metricType: AverageValue
+    threshold: "0.75"
 ```
 
-### 3. Running Requests (Fallback)
-Scales when average running requests per pod exceeds 2:
-```yaml
-triggers:
-  - type: prometheus
-    metadata:
-      query: |
-        avg({"__name__"="vllm:num_requests_running", namespace="<your-namespace>"})
-      threshold: "2"
-```
-
-## vLLM Metrics Reference
-
-vLLM exposes metrics at the `/metrics` endpoint. Note that vLLM uses colons in metric names (this is unusual but correct):
-
-| Metric | Description |
-|--------|-------------|
-| `vllm:num_requests_running` | Number of requests currently being processed |
-| `vllm:num_requests_waiting` | Number of requests waiting in queue |
-| `vllm:gpu_cache_usage_perc` | GPU KV cache utilization (0-1) |
-| `vllm:time_to_first_token_seconds` | Histogram of TTFT |
-| `vllm:time_per_output_token_seconds` | Histogram of TPOT |
-| `vllm:generation_tokens_total` | Total number of generated tokens |
-
-## Testing Autoscaling
-
-Generate load to trigger autoscaling:
-
-```bash
-# First, find the predictor service name
-kubectl get svc -n <your-namespace> | grep opt-125m-vllm
-
-# Create a load generator pod (adjust service name if needed)
-kubectl run load-gen --image=curlimages/curl -n <your-namespace> --restart=Never -- \
-  sh -c 'while true; do for i in $(seq 1 10); do curl -s -X POST "http://opt-125m-vllm-predictor-00001.<your-namespace>.svc.cluster.local/openai/v1/completions" -H "Content-Type: application/json" -d "{\"model\": \"opt-125m\", \"prompt\": \"Tell me a story\", \"max_tokens\": 200}" & done; sleep 2; done'
-```
-
-Monitor scaling:
-```bash
-# Watch HPA status
-kubectl get hpa -n <your-namespace> -w
-
-# Watch pods
-kubectl get pods -n <your-namespace> -l serving.kserve.io/inferenceservice=opt-125m-vllm -w
-```
-
-Clean up:
-```bash
-kubectl delete pod load-gen -n <your-namespace>
-```
-
-## Troubleshooting
-
-### KEDA not scaling
-1. Check ScaledObject status:
-```bash
-kubectl describe scaledobject opt-125m-vllm-scaledobject -n <your-namespace>
-```
-
-2. Verify Prometheus connectivity (some deployments use a `/prometheus` path prefix):
-```bash
-kubectl run curl-test --image=curlimages/curl --rm -it --restart=Never -- \
-  curl -s 'http://kube-prometheus-stack-prometheus.monitoring.svc.cluster.local:9090/prometheus/api/v1/query?query=up'
-```
-
-3. Check KEDA operator logs:
-```bash
-kubectl logs -l app=keda-operator -n keda
-```
-
-### Metrics not appearing
-1. Verify PodMonitor is picked up:
-```bash
-kubectl get podmonitor -n <your-namespace>
-```
-
-2. Check if vLLM metrics are being scraped:
-```bash
-kubectl run curl-test --image=curlimages/curl --rm -it --restart=Never -- \
-  curl -s 'http://kube-prometheus-stack-prometheus.monitoring.svc.cluster.local:9090/prometheus/api/v1/query?query={__name__=~"vllm:.*"}'
-```
-
 ## References
 
-- [KServe Issue #3561: Native KEDA integration](https://github.com/kserve/kserve/issues/3561)
-- [KEDA Prometheus Scaler](https://keda.sh/docs/scalers/prometheus/)
-- [vLLM Metrics](https://docs.vllm.ai/en/latest/serving/metrics.html)
+- [prokube autoscaling documentation](https://docs.prokube.cloud/user_docs/model_serving_autoscaling/)
+- [KServe KEDA autoscaler docs](https://kserve.github.io/website/docs/model-serving/predictive-inference/autoscaling/keda-autoscaler)
+- [KEDA Prometheus scaler](https://keda.sh/docs/scalers/prometheus/)
+- [vLLM metrics reference](https://docs.vllm.ai/en/latest/serving/metrics.html)
````
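The `AverageValue` arithmetic in the new Customization section works out as follows: the HPA divides the raw query result by the target, so a query returning 20 tokens/s across 2 replicas is an average of 10 tokens/s per replica, and the controller requests `ceil(20 / 5) = 4` replicas, which `maxReplicaCount: 3` then caps.

To exercise the trigger, the load-test snippet removed from the README above still applies with the new names. A sketch, assuming RawDeployment exposes the predictor as a Service named `opt-125m-predictor` on port 80 (verify with `kubectl get svc -n $NAMESPACE`):

```bash
# Hammer the completions endpoint so token throughput rises
kubectl run load-gen --image=curlimages/curl -n $NAMESPACE --restart=Never -- \
  sh -c "while true; do for i in \$(seq 1 10); do curl -s -X POST \
    'http://opt-125m-predictor.$NAMESPACE.svc.cluster.local/openai/v1/completions' \
    -H 'Content-Type: application/json' \
    -d '{\"model\": \"opt-125m\", \"prompt\": \"Tell me a story\", \"max_tokens\": 200}' & done; sleep 2; done"

# Watch the HPA that KEDA creates react to the metric
kubectl get hpa -n $NAMESPACE -w

# Clean up
kubectl delete pod load-gen -n $NAMESPACE
```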

serving/kserve-keda-autoscaling/inference-service.yaml

Lines changed: 18 additions & 6 deletions
````diff
@@ -1,15 +1,21 @@
+# KServe InferenceService for OPT-125M with vLLM backend.
+# Uses RawDeployment mode — required when scaling with KEDA.
+#
+# This example runs on CPU. For GPU, remove --dtype=float32 and
+# --max-model-len, and adjust resources to request nvidia.com/gpu.
 apiVersion: serving.kserve.io/v1beta1
 kind: InferenceService
 metadata:
-  name: opt-125m-vllm
+  name: opt-125m
   annotations:
-    huggingface.co/model-id: facebook/opt-125m
+    # RawDeployment mode — creates a plain Deployment instead of a Knative Revision.
+    serving.kserve.io/deploymentMode: "RawDeployment"
+    # Tell KServe not to create its own HPA (KEDA will manage scaling).
+    serving.kserve.io/autoscalerClass: "external"
 spec:
   predictor:
-    # Note: When using KEDA, replica limits are managed by the ScaledObject.
-    # These values serve as defaults if KEDA is not deployed.
     minReplicas: 1
-    maxReplicas: 10
+    maxReplicas: 3
     model:
       modelFormat:
         name: huggingface
@@ -18,7 +24,13 @@ spec:
         - --model_id=facebook/opt-125m
         - --backend=vllm
        - --dtype=float32
-        - --device=cpu
+        - --max-model-len=512
+      # Explicit port declaration is required in RawDeployment mode
+      # for the cluster-wide PodMonitor to discover the metrics endpoint.
+      ports:
+        - name: user-port
+          containerPort: 8080
+          protocol: TCP
       resources:
         requests:
           cpu: "2"
````
serving/kserve-keda-autoscaling/scaled-object.yaml

Lines changed: 24 additions & 65 deletions
````diff
@@ -1,85 +1,44 @@
-# KEDA ScaledObject for KServe InferenceService with vLLM backend
-# Scales based on custom Prometheus metrics from vLLM serving runtime
+# KEDA ScaledObject for KServe InferenceService with vLLM backend.
+# Scales based on total token throughput (prompt + generation) from Prometheus.
 #
 # Prerequisites:
-# - KEDA installed in cluster (https://keda.sh/docs/deploy/)
-# - Prometheus collecting vLLM metrics (see service-monitor.yaml)
+# - KEDA installed (https://keda.sh/docs/deploy/)
+# - Prometheus scraping vLLM metrics (prokube includes a cluster-wide PodMonitor)
 #
-# Note: vLLM uses colons in metric names (e.g., vllm:num_requests_running),
-# which is unusual but correct. Use {"__name__"="..."} syntax in PromQL.
-#
-# TODO: Replace the following before deploying:
-# - <your-namespace>: your actual namespace
-# - <your-prometheus-url>: your Prometheus server URL (may or may not have a path prefix)
-# - opt-125m-vllm: your InferenceService name (in queries and scaleTargetRef)
+# Before deploying, replace:
+# - "default" in the Prometheus queries with your namespace
+# - "opt-125m" in model_name with your --model_name value
+# - The serverAddress if your Prometheus uses a different URL
 #
 apiVersion: keda.sh/v1alpha1
 kind: ScaledObject
 metadata:
-  name: opt-125m-vllm-scaledobject
-  labels:
-    app: opt-125m-vllm
+  name: opt-125m-scaledobject
 spec:
-  # Target the KServe predictor deployment
   scaleTargetRef:
-    apiVersion: apps/v1
-    kind: Deployment
-    name: opt-125m-vllm-predictor-00001-deployment
-  # Polling interval for checking metrics (seconds)
-  pollingInterval: 15
-  # Cooldown period before scaling down (seconds)
-  cooldownPeriod: 60
-  # Min/max replicas
+    # In RawDeployment mode KServe names the Deployment {isvc-name}-predictor.
+    name: opt-125m-predictor
   minReplicaCount: 1
-  maxReplicaCount: 10
-  # Advanced scaling behavior
+  maxReplicaCount: 3
+  pollingInterval: 15  # how often KEDA checks the metric (seconds)
+  cooldownPeriod: 120  # seconds after the last trigger activation before scaling back to zero (relevant only when minReplicaCount is 0)
   advanced:
     horizontalPodAutoscalerConfig:
       behavior:
-        scaleDown:
-          stabilizationWindowSeconds: 120
-          policies:
-            - type: Percent
-              value: 25
-              periodSeconds: 60
         scaleUp:
           stabilizationWindowSeconds: 0
+        scaleDown:
+          stabilizationWindowSeconds: 120
           policies:
-            - type: Percent
-              value: 100
-              periodSeconds: 15
             - type: Pods
-              value: 4
-              periodSeconds: 15
-          selectPolicy: Max
+              value: 1  # remove at most 1 replica per minute
+              periodSeconds: 60
   triggers:
-  # Scale based on Time To First Token (TTFT) - P95
-  # Scale up when P95 TTFT exceeds 200ms (0.2s)
-  - type: prometheus
-    metadata:
-      # Adjust URL to your Prometheus setup (some have /prometheus path prefix)
-      serverAddress: <your-prometheus-url>
-      metricName: vllm_ttft_p95
-      query: |
-        histogram_quantile(0.95, sum(rate({"__name__"="vllm:time_to_first_token_seconds_bucket", namespace="<your-namespace>", pod=~"opt-125m-vllm-predictor-.*"}[2m])) by (le))
-      threshold: "0.2"
-      activationThreshold: "0.1"
-  # Scale based on GPU KV-cache usage (for GPU deployments)
-  # Scale up when cache usage exceeds 70%
-  - type: prometheus
-    metadata:
-      serverAddress: <your-prometheus-url>
-      metricName: vllm_gpu_cache_usage
-      query: |
-        avg({"__name__"="vllm:gpu_cache_usage_perc", namespace="<your-namespace>", pod=~"opt-125m-vllm-predictor-.*"})
-      threshold: "0.7"
-      activationThreshold: "0.5"
-  # Fallback: Scale based on running requests (always works)
   - type: prometheus
     metadata:
-      serverAddress: <your-prometheus-url>
-      metricName: vllm_num_requests_running
-      query: |
-        avg({"__name__"="vllm:num_requests_running", namespace="<your-namespace>", pod=~"opt-125m-vllm-predictor-.*"})
-      threshold: "2"
-      activationThreshold: "1"
+      serverAddress: http://kube-prometheus-stack-prometheus.monitoring.svc.cluster.local:9090/prometheus
+      query: >-
+        sum(rate(vllm:prompt_tokens_total{namespace="default",model_name="opt-125m"}[2m]))
+        + sum(rate(vllm:generation_tokens_total{namespace="default",model_name="opt-125m"}[2m]))
+      metricType: AverageValue
+      threshold: "5"
````
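Before deploying, it helps to confirm the trigger query actually returns data; the curl pattern from the removed troubleshooting section still applies, including the `/prometheus` path prefix in the serverAddress. A sketch, assuming the same Prometheus URL as above:

```bash
# Run the trigger's PromQL by hand and inspect the result
QUERY='sum(rate(vllm:prompt_tokens_total{namespace="default",model_name="opt-125m"}[2m])) + sum(rate(vllm:generation_tokens_total{namespace="default",model_name="opt-125m"}[2m]))'
kubectl run curl-test --image=curlimages/curl --rm -it --restart=Never -- \
  curl -sG 'http://kube-prometheus-stack-prometheus.monitoring.svc.cluster.local:9090/prometheus/api/v1/query' \
  --data-urlencode "query=$QUERY"
```

An empty `result` array usually means the namespace or `model_name` label doesn't match what vLLM is exporting.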
