From bcf91f8760973ccafa86dfb8d3924285d26b3516 Mon Sep 17 00:00:00 2001
From: hsteude <henrik.steude@prokube.ai>
Date: Mon, 16 Feb 2026 18:14:06 +0100
Subject: [PATCH 01/17] Add KServe KEDA autoscaling example with custom
 Prometheus metrics

- InferenceService for HuggingFace-runtime model serving (DistilBERT on CPU)
- KEDA ScaledObject with multiple scaling strategies (token throughput, GPU, power)
- PodMonitor and PrometheusRule recording rules for metrics collection
- README with setup instructions and troubleshooting
---
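Note on the token-throughput trigger: the query `avg(rate(vllm:generation_tokens_total[1m]))` turns a per-pod token counter into a per-second rate and then averages across pods. The following is a minimal sketch of that arithmetic, not Prometheus's implementation: it assumes each pod's window contains at least two samples and it ignores the counter-reset handling and window extrapolation that the real `rate()` performs. The sample data and function names are hypothetical.

```python
def per_second_rate(samples):
    """Approximate rate() for one pod.

    samples: list of (timestamp_seconds, counter_value), ordered by time.
    Assumption: no counter resets inside the window.
    """
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("need at least two samples with increasing timestamps")
    return (v1 - v0) / (t1 - t0)


def avg_throughput(per_pod_samples):
    """Approximate avg(rate(...)): mean of the per-pod rates."""
    rates = [per_second_rate(s) for s in per_pod_samples.values()]
    return sum(rates) / len(rates)


# Hypothetical one-minute windows for two predictor pods.
window = {
    "pod-a": [(0, 1000), (60, 7000)],  # 6000 tokens in 60 s
    "pod-b": [(0, 0), (60, 3000)],     # 3000 tokens in 60 s
}
print(avg_throughput(window))
```

The printed value is what the trigger compares against `threshold: "100"` in `scaled-object.yaml`.
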
 serving/kserve-keda-autoscaling/README.md     | 206 ++++++++++++++++++
 .../inference-service.yaml                    |  27 +++
 .../scaled-object.yaml                        | 123 +++++++++++
 .../service-monitor.yaml                      |  64 ++++++
 4 files changed, 420 insertions(+)
 create mode 100644 serving/kserve-keda-autoscaling/README.md
 create mode 100644 serving/kserve-keda-autoscaling/inference-service.yaml
 create mode 100644 serving/kserve-keda-autoscaling/scaled-object.yaml
 create mode 100644 serving/kserve-keda-autoscaling/service-monitor.yaml
diff --git a/serving/kserve-keda-autoscaling/README.md b/serving/kserve-keda-autoscaling/README.md
new file mode 100644
index 0000000..c5efc24
--- /dev/null
+++ b/serving/kserve-keda-autoscaling/README.md
@@ -0,0 +1,206 @@
+# KServe Autoscaling with KEDA and Custom Metrics
+
+This example demonstrates how to autoscale KServe InferenceServices using [KEDA](https://keda.sh/) with custom Prometheus metrics. This is particularly useful for LLM inference workloads where request-based autoscaling (Knative default) is not optimal.
+
+## Why Custom Metrics for LLM Autoscaling?
+
+Traditional request-based autoscaling doesn't work well for LLM inference because:
+
+- **Token-level work**: LLM inference operates at token level, not request level. A single request can generate hundreds of tokens.
+- **Variable latency**: Request latency varies significantly based on input/output token count.
+- **Memory pressure**: LLM models require significant GPU memory (KV cache), which fills up based on concurrent requests.
+
+Better metrics for LLM autoscaling include:
+- **Token throughput**: Tokens generated per second
+- **Time To First Token (TTFT)**: Latency until first token is generated
+- **Time Per Output Token (TPOT)**: Average time per generated token
+- **KV Cache utilization**: GPU memory used for attention cache
+- **Number of running/waiting requests**: Queue depth
+
+## Prerequisites
+
+1. **KEDA** installed in the cluster:
+   ```bash
+   helm repo add kedacore https://kedacore.github.io/charts
+   helm install keda kedacore/keda --namespace keda --create-namespace
+   ```
+
+2. **Prometheus** (kube-prometheus-stack recommended):
+   ```bash
+   helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
+   helm install prometheus prometheus-community/kube-prometheus-stack --namespace monitoring --create-namespace
+   ```
+
+3. **KServe** with HuggingFace/vLLM runtime configured
+
+4. **HuggingFace Token** (optional, for gated models):
+   ```bash
+   kubectl create secret generic hf-secret --from-literal=HF_TOKEN=<your-token> -n developer1
+   ```
+
+## Files
+
+| File | Description |
+|------|-------------|
+| `inference-service.yaml` | KServe InferenceService for Qwen2.5-0.5B model |
+| `scaled-object.yaml` | KEDA ScaledObject with multiple autoscaling strategies |
+| `service-monitor.yaml` | ServiceMonitor, PodMonitor, and PrometheusRules for metrics collection |
+
+## Deployment
+
+### 1. Deploy the InferenceService
+
+```bash
+kubectl apply -f inference-service.yaml -n developer1
+```
+
+Wait for the model to be ready:
+```bash
+kubectl get inferenceservice qwen25-05b -n developer1 -w
+```
+
+### 2. Configure Prometheus Metrics Collection
+
+Apply the ServiceMonitor to scrape vLLM metrics:
+```bash
+kubectl apply -f service-monitor.yaml -n developer1
+```
+
+Verify metrics are being scraped:
+```bash
+# Port-forward Prometheus
+kubectl port-forward svc/prometheus-kube-prometheus-prometheus 9090:9090 -n monitoring
+
+# Query for vLLM metrics
+curl -s 'http://localhost:9090/api/v1/query?query=vllm_num_requests_running' | jq .
+```
+
+### 3. Deploy KEDA ScaledObject
+
+First, identify the correct deployment name:
+```bash
+kubectl get deployments -n developer1 | grep qwen25-05b
+```
+
+Update `scaled-object.yaml` with the correct deployment name, then apply:
+```bash
+kubectl apply -f scaled-object.yaml -n developer1
+```
+
+Verify the ScaledObject:
+```bash
+kubectl get scaledobject -n developer1
+kubectl describe scaledobject qwen25-05b-scaledobject -n developer1
+```
+
+## Autoscaling Strategies
+
+This example includes three ScaledObject variants:
+
+### 1. Token Throughput Based (Default)
+Scales based on average token generation throughput and number of running requests:
+```yaml
+triggers:
+  - type: prometheus
+    metadata:
+      query: avg(rate(vllm:generation_tokens_total[1m]))
+      threshold: "100"
+```
+
+### 2. GPU Utilization Based
+Scales based on GPU memory utilization (requires DCGM exporter):
+```yaml
+triggers:
+  - type: prometheus
+    metadata:
+      query: avg(DCGM_FI_DEV_MEM_COPY_UTIL{pod=~"qwen25-05b-predictor.*"})
+      threshold: "80"
+```
+
+### 3. Power Consumption Based
+Scales based on power consumption metrics from [Kepler](https://github.com/sustainable-computing-io/kepler):
+```yaml
+triggers:
+  - type: prometheus
+    metadata:
+      query: sum(rate(kepler_container_joules_total{container_name=~"qwen25-05b.*"}[5m]))
+      threshold: "100"
+```
+
+## vLLM Metrics Reference
+
+vLLM exposes metrics at `/metrics` endpoint:
+
+| Metric | Description |
+|--------|-------------|
+| `vllm_num_requests_running` | Number of requests currently being processed |
+| `vllm_num_requests_waiting` | Number of requests waiting in queue |
+| `vllm_gpu_cache_usage_perc` | GPU KV cache utilization percentage |
+| `vllm_generation_tokens_total` | Total number of generated tokens |
+| `vllm_time_to_first_token_seconds` | Histogram of TTFT |
+| `vllm_time_per_output_token_seconds` | Histogram of TPOT |
+
+## Testing Autoscaling
+
+Generate load to trigger autoscaling:
+
+```bash
+# Get the inference URL
+ISVC_URL=$(kubectl get inferenceservice qwen25-05b -n developer1 -o jsonpath='{.status.url}')
+
+# Send requests in a loop
+for i in {1..100}; do
+  curl -X POST "${ISVC_URL}/v1/completions" \
+    -H "Content-Type: application/json" \
+    -d '{
+      "model": "qwen25-05b",
+      "prompt": "Write a long story about",
+      "max_tokens": 500
+    }' &
+done
+```
+
+Monitor scaling:
+```bash
+# Watch replica count
+kubectl get deployment -n developer1 -w
+
+# Check KEDA metrics
+kubectl get hpa -n developer1
+```
+
+## Troubleshooting
+
+### KEDA not scaling
+1. Check ScaledObject status:
+   ```bash
+   kubectl describe scaledobject qwen25-05b-scaledobject -n developer1
+   ```
+
+2. Verify Prometheus connectivity:
+   ```bash
+   kubectl run curl-test --image=curlimages/curl --rm -it -- \
+     curl -s 'http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090/api/v1/query?query=up'
+   ```
+
+3. Check KEDA operator logs:
+   ```bash
+   kubectl logs -l app=keda-operator -n keda
+   ```
+
+### Metrics not appearing
+1. Verify ServiceMonitor is picked up:
+   ```bash
+   kubectl get servicemonitor -n developer1
+   ```
+
+2. Check Prometheus targets:
+   - Open Prometheus UI -> Status -> Targets
+   - Look for `serviceMonitor/developer1/qwen25-05b-metrics`
+
+## References
+
+- [KServe Issue #3561: Native KEDA integration](https://github.com/kserve/kserve/issues/3561)
+- [KEDA Prometheus Scaler](https://keda.sh/docs/scalers/prometheus/)
+- [vLLM Metrics](https://docs.vllm.ai/en/latest/serving/metrics.html)
+- [Kepler: Kubernetes Energy Metering](https://github.com/sustainable-computing-io/kepler)
diff --git a/serving/kserve-keda-autoscaling/inference-service.yaml b/serving/kserve-keda-autoscaling/inference-service.yaml
new file mode 100644
index 0000000..f3b3805
--- /dev/null
+++ b/serving/kserve-keda-autoscaling/inference-service.yaml
@@ -0,0 +1,27 @@
+apiVersion: serving.kserve.io/v1beta1
+kind: InferenceService
+metadata:
+  name: distilbert-cpu
+  annotations:
+    # Model info for documentation
+    huggingface.co/model-id: distilbert-base-uncased-finetuned-sst-2-english
+spec:
+  predictor:
+    # KEDA will handle scaling, but we still set bounds
+    minReplicas: 1
+    maxReplicas: 10
+    scaleTarget: 1
+    scaleMetric: concurrency
+    model:
+      modelFormat:
+        name: huggingface
+      args:
+        - --model_name=distilbert
+        - --model_id=distilbert-base-uncased-finetuned-sst-2-english
+      resources:
+        requests:
+          cpu: "2"
+          memory: 4Gi
+        limits:
+          cpu: "4"
+          memory: 8Gi
diff --git a/serving/kserve-keda-autoscaling/scaled-object.yaml b/serving/kserve-keda-autoscaling/scaled-object.yaml
new file mode 100644
index 0000000..f9eae2d
--- /dev/null
+++ b/serving/kserve-keda-autoscaling/scaled-object.yaml
@@ -0,0 +1,123 @@
+# KEDA ScaledObject for KServe InferenceService
+# Scales based on custom Prometheus metrics from vLLM/HuggingFace serving runtime
+#
+# Prerequisites:
+# - KEDA installed in cluster (https://keda.sh/docs/deploy/)
+# - Prometheus collecting vLLM metrics
+# - ServiceMonitor configured (see service-monitor.yaml)
+#
+apiVersion: keda.sh/v1alpha1
+kind: ScaledObject
+metadata:
+  name: qwen25-05b-scaledobject
+  labels:
+    app: qwen25-05b
+spec:
+  # Target the KServe predictor deployment
+  # KServe creates a deployment with naming pattern: <isvc-name>-predictor-<revision>
+  scaleTargetRef:
+    apiVersion: apps/v1
+    kind: Deployment
+    name: qwen25-05b-predictor-00001-deployment
+  # Polling interval for checking metrics (seconds)
+  pollingInterval: 15
+  # Cooldown period before scaling down (seconds)  
+  cooldownPeriod: 60
+  # Min/max replicas
+  minReplicaCount: 1
+  maxReplicaCount: 10
+  # Advanced scaling behavior
+  advanced:
+    horizontalPodAutoscalerConfig:
+      behavior:
+        scaleDown:
+          stabilizationWindowSeconds: 120
+          policies:
+            - type: Percent
+              value: 25
+              periodSeconds: 60
+        scaleUp:
+          stabilizationWindowSeconds: 0
+          policies:
+            - type: Percent
+              value: 100
+              periodSeconds: 15
+            - type: Pods
+              value: 4
+              periodSeconds: 15
+          selectPolicy: Max
+  triggers:
+    # Scale based on average token throughput per second
+    - type: prometheus
+      metadata:
+        serverAddress: http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090
+        metricName: vllm_avg_generation_throughput
+        # Average tokens generated per second across all pods
+        query: |
+          avg(rate(vllm:generation_tokens_total[1m]))
+        threshold: "100"
+        activationThreshold: "10"
+    # Alternative: Scale based on number of running requests
+    - type: prometheus
+      metadata:
+        serverAddress: http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090
+        metricName: vllm_num_requests_running
+        # Average number of running requests per pod
+        query: |
+          avg(vllm:num_requests_running{model_name="qwen25-05b"})
+        threshold: "5"
+        activationThreshold: "1"
+---
+# Alternative ScaledObject using GPU utilization (if using GPUs)
+apiVersion: keda.sh/v1alpha1
+kind: ScaledObject
+metadata:
+  name: qwen25-05b-gpu-scaledobject
+  labels:
+    app: qwen25-05b
+spec:
+  scaleTargetRef:
+    apiVersion: apps/v1
+    kind: Deployment
+    name: qwen25-05b-predictor-00001-deployment
+  pollingInterval: 15
+  cooldownPeriod: 120
+  minReplicaCount: 1
+  maxReplicaCount: 10
+  triggers:
+    # Scale based on GPU memory utilization (requires DCGM exporter)
+    - type: prometheus
+      metadata:
+        serverAddress: http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090
+        metricName: dcgm_gpu_memory_used_percent
+        query: |
+          avg(DCGM_FI_DEV_MEM_COPY_UTIL{pod=~"qwen25-05b-predictor.*"})
+        threshold: "80"
+        activationThreshold: "20"
+---
+# Alternative ScaledObject using Kepler power consumption metrics
+apiVersion: keda.sh/v1alpha1
+kind: ScaledObject
+metadata:
+  name: qwen25-05b-power-scaledobject
+  labels:
+    app: qwen25-05b
+spec:
+  scaleTargetRef:
+    apiVersion: apps/v1
+    kind: Deployment
+    name: qwen25-05b-predictor-00001-deployment
+  pollingInterval: 30
+  cooldownPeriod: 180
+  minReplicaCount: 1
+  maxReplicaCount: 10
+  triggers:
+    # Scale based on power consumption (requires Kepler)
+    - type: prometheus
+      metadata:
+        serverAddress: http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090
+        metricName: kepler_container_joules
+        query: |
+          sum(rate(kepler_container_joules_total{container_name=~"qwen25-05b.*"}[5m]))
+        threshold: "100"
+        activationThreshold: "10"
diff --git a/serving/kserve-keda-autoscaling/service-monitor.yaml b/serving/kserve-keda-autoscaling/service-monitor.yaml
new file mode 100644
index 0000000..fcd8db1
--- /dev/null
+++ b/serving/kserve-keda-autoscaling/service-monitor.yaml
@@ -0,0 +1,64 @@
+# ServiceMonitor to scrape HuggingFace runtime metrics from KServe InferenceService
+# This enables Prometheus to collect the metrics used by KEDA for autoscaling
+#
+# Prerequisites:
+# - Prometheus Operator installed (kube-prometheus-stack)
+# - HuggingFace runtime exposes metrics on port 8080 at /metrics endpoint
+#
+apiVersion: monitoring.coreos.com/v1
+kind: PodMonitor
+metadata:
+  name: distilbert-cpu-metrics
+  labels:
+    app: distilbert-cpu
+    # Label to match Prometheus Operator's podMonitorSelector
+    release: kube-prometheus-stack
+spec:
+  selector:
+    matchLabels:
+      serving.kserve.io/inferenceservice: distilbert-cpu
+  namespaceSelector:
+    matchNames:
+      - developer1
+  podMetricsEndpoints:
+    - port: user-port  # HuggingFace runtime metrics port
+      path: /metrics
+      interval: 15s
+      scrapeTimeout: 10s
+---
+# PrometheusRule for creating recording rules
+# These pre-aggregate metrics for more efficient KEDA queries
+apiVersion: monitoring.coreos.com/v1
+kind: PrometheusRule
+metadata:
+  name: distilbert-cpu-recording-rules
+  labels:
+    app: distilbert-cpu
+    release: kube-prometheus-stack
+spec:
+  groups:
+    - name: kserve-hf-metrics
+      interval: 15s
+      rules:
+        # Request rate (requests per second)
+        - record: kserve:request_rate
+          expr: |
+            sum by (namespace, pod) (
+              rate(request_predict_seconds_count[1m])
+            )
+        # Average prediction latency
+        - record: kserve:predict_latency_avg
+          expr: |
+            avg by (namespace) (
+              rate(request_predict_seconds_sum[5m])
+              /
+              rate(request_predict_seconds_count[5m])
+            )
+        # P99 prediction latency
+        - record: kserve:predict_latency_p99
+          expr: |
+            histogram_quantile(0.99, 
+              sum by (namespace, le) (
+                rate(request_predict_seconds_bucket[5m])
+              )
+            )

From 03fc3cd4fa1d04618845f99f08f1f325edd6d480 Mon Sep 17 00:00:00 2001
From: hsteude <henrik.steude@prokube.ai>
Date: Mon, 16 Feb 2026 18:22:07 +0100
Subject: [PATCH 02/17] Update KEDA autoscaling example to use vLLM with
 OPT-125M

- Switch from DistilBERT to OPT-125M model with vLLM backend
- Fix Prometheus serverAddress to include /prometheus routePrefix
- Fix metric queries to handle vLLM's colon-namespaced metrics
- Simplify ScaledObject to focus on running/waiting requests
- Update PodMonitor and PrometheusRules for vLLM metrics

Tested on cluster: autoscaling triggers correctly when load increases
---
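Note on the `/prometheus` route prefix: when Prometheus is served under a route prefix, its HTTP API lives under that prefix too, so the instant-query endpoint becomes `<serverAddress>/api/v1/query`. The sketch below builds such a URL so a trigger's query can be checked by hand (for example with `curl`) before it goes into the ScaledObject. The helper name is hypothetical, and the snippet only constructs the URL; it does not contact a server.

```python
from urllib.parse import urlencode


def instant_query_url(server_address, promql):
    """Build the Prometheus instant-query URL for a given base address.

    server_address may already include a route prefix such as /prometheus.
    """
    base = server_address.rstrip("/")
    return f"{base}/api/v1/query?{urlencode({'query': promql})}"


url = instant_query_url(
    "http://kube-prometheus-stack-prometheus.monitoring.svc.cluster.local:9090/prometheus",
    "vllm:num_requests_running",
)
print(url)
```

The colon in the metric name is percent-encoded as `%3A` in the query string, which is ordinary URL encoding.
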
 .../inference-service.yaml                    | 15 ++-
 .../scaled-object.yaml                        | 96 +++++--------------
 .../service-monitor.yaml                      | 64 ++++++++-----
 3 files changed, 69 insertions(+), 106 deletions(-)

diff --git a/serving/kserve-keda-autoscaling/inference-service.yaml b/serving/kserve-keda-autoscaling/inference-service.yaml
index f3b3805..0b67dfa 100644
--- a/serving/kserve-keda-autoscaling/inference-service.yaml
+++ b/serving/kserve-keda-autoscaling/inference-service.yaml
@@ -1,23 +1,22 @@
 apiVersion: serving.kserve.io/v1beta1
 kind: InferenceService
 metadata:
-  name: distilbert-cpu
+  name: opt-125m-vllm
   annotations:
-    # Model info for documentation
-    huggingface.co/model-id: distilbert-base-uncased-finetuned-sst-2-english
+    huggingface.co/model-id: facebook/opt-125m
 spec:
   predictor:
-    # KEDA will handle scaling, but we still set bounds
     minReplicas: 1
     maxReplicas: 10
-    scaleTarget: 1
-    scaleMetric: concurrency
     model:
       modelFormat:
         name: huggingface
       args:
-        - --model_name=distilbert
-        - --model_id=distilbert-base-uncased-finetuned-sst-2-english
+        - --model_name=opt-125m
+        - --model_id=facebook/opt-125m
+        - --backend=vllm
+        - --dtype=float32
+        - --device=cpu
       resources:
         requests:
           cpu: "2"
diff --git a/serving/kserve-keda-autoscaling/scaled-object.yaml b/serving/kserve-keda-autoscaling/scaled-object.yaml
index f9eae2d..2a09a44 100644
--- a/serving/kserve-keda-autoscaling/scaled-object.yaml
+++ b/serving/kserve-keda-autoscaling/scaled-object.yaml
@@ -1,24 +1,25 @@
-# KEDA ScaledObject for KServe InferenceService
-# Scales based on custom Prometheus metrics from vLLM/HuggingFace serving runtime
+# KEDA ScaledObject for KServe InferenceService with vLLM backend
+# Scales based on custom Prometheus metrics from vLLM serving runtime
 #
 # Prerequisites:
 # - KEDA installed in cluster (https://keda.sh/docs/deploy/)
-# - Prometheus collecting vLLM metrics
-# - ServiceMonitor configured (see service-monitor.yaml)
+# - Prometheus collecting vLLM metrics (see service-monitor.yaml)
+#
+# Note: vLLM metrics use colons in names (e.g., vllm:num_requests_running)
+# which need to be quoted in PromQL queries
 #
 apiVersion: keda.sh/v1alpha1
 kind: ScaledObject
 metadata:
-  name: qwen25-05b-scaledobject
+  name: opt-125m-vllm-scaledobject
   labels:
-    app: qwen25-05b
+    app: opt-125m-vllm
 spec:
   # Target the KServe predictor deployment
-  # KServe creates a deployment with naming pattern: <isvc-name>-predictor-<revision>
   scaleTargetRef:
     apiVersion: apps/v1
     kind: Deployment
-    name: qwen25-05b-predictor-00001-deployment
+    name: opt-125m-vllm-predictor-00001-deployment
   # Polling interval for checking metrics (seconds)
   pollingInterval: 15
   # Cooldown period before scaling down (seconds)  
@@ -47,77 +48,24 @@ spec:
               periodSeconds: 15
           selectPolicy: Max
   triggers:
-    # Scale based on average token throughput per second
-    - type: prometheus
-      metadata:
-        serverAddress: http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090
-        metricName: vllm_avg_generation_throughput
-        # Average tokens generated per second across all pods
-        query: |
-          avg(rate(vllm:generation_tokens_total[1m]))
-        threshold: "100"
-        activationThreshold: "10"
-    # Alternative: Scale based on number of running requests
+    # Scale based on number of running requests per pod
+    # vLLM uses colons in metric names, so we use the actual metric name
     - type: prometheus
       metadata:
-        serverAddress: http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090
+        serverAddress: http://kube-prometheus-stack-prometheus.monitoring.svc.cluster.local:9090/prometheus
         metricName: vllm_num_requests_running
-        # Average number of running requests per pod
+        # Scale up when average running requests per pod > 2
         query: |
-          avg(vllm:num_requests_running{model_name="qwen25-05b"})
-        threshold: "5"
+          avg({"__name__"="vllm:num_requests_running", namespace="developer1"})
+        threshold: "2"
         activationThreshold: "1"
----
-# Alternative ScaledObject using GPU utilization (if using GPUs)
-apiVersion: keda.sh/v1alpha1
-kind: ScaledObject
-metadata:
-  name: qwen25-05b-gpu-scaledobject
-  labels:
-    app: qwen25-05b
-spec:
-  scaleTargetRef:
-    apiVersion: apps/v1
-    kind: Deployment
-    name: qwen25-05b-predictor-00001-deployment
-  pollingInterval: 15
-  cooldownPeriod: 120
-  minReplicaCount: 1
-  maxReplicaCount: 10
-  triggers:
-    # Scale based on GPU memory utilization (requires DCGM exporter)
-    - type: prometheus
-      metadata:
-        serverAddress: http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090
-        metricName: dcgm_gpu_memory_used_percent
-        query: |
-          avg(DCGM_FI_DEV_MEM_COPY_UTIL{pod=~"qwen25-05b-predictor.*"})
-        threshold: "80"
-        activationThreshold: "20"
----
-# Alternative ScaledObject using Kepler power consumption metrics
-apiVersion: keda.sh/v1alpha1
-kind: ScaledObject
-metadata:
-  name: qwen25-05b-power-scaledobject
-  labels:
-    app: qwen25-05b
-spec:
-  scaleTargetRef:
-    apiVersion: apps/v1
-    kind: Deployment
-    name: qwen25-05b-predictor-00001-deployment
-  pollingInterval: 30
-  cooldownPeriod: 180
-  minReplicaCount: 1
-  maxReplicaCount: 10
-  triggers:
-    # Scale based on power consumption (requires Kepler)
+    # Scale based on number of waiting requests (queue depth)
     - type: prometheus
       metadata:
-        serverAddress: http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090
-        metricName: kepler_container_joules
+        serverAddress: http://kube-prometheus-stack-prometheus.monitoring.svc.cluster.local:9090/prometheus
+        metricName: vllm_num_requests_waiting
+        # Scale up when there are waiting requests
         query: |
-          sum(rate(kepler_container_joules_total{container_name=~"qwen25-05b.*"}[5m]))
-        threshold: "100"
-        activationThreshold: "10"
+          sum({"__name__"="vllm:num_requests_waiting", namespace="developer1"})
+        threshold: "1"
+        activationThreshold: "0"
diff --git a/serving/kserve-keda-autoscaling/service-monitor.yaml b/serving/kserve-keda-autoscaling/service-monitor.yaml
index fcd8db1..a99ae3b 100644
--- a/serving/kserve-keda-autoscaling/service-monitor.yaml
+++ b/serving/kserve-keda-autoscaling/service-monitor.yaml
@@ -1,27 +1,27 @@
-# ServiceMonitor to scrape HuggingFace runtime metrics from KServe InferenceService
+# PodMonitor to scrape vLLM metrics from KServe InferenceService
 # This enables Prometheus to collect the metrics used by KEDA for autoscaling
 #
 # Prerequisites:
 # - Prometheus Operator installed (kube-prometheus-stack)
-# - HuggingFace runtime exposes metrics on port 8080 at /metrics endpoint
+# - vLLM runtime exposes metrics on port 8080 at /metrics endpoint
 #
 apiVersion: monitoring.coreos.com/v1
 kind: PodMonitor
 metadata:
-  name: distilbert-cpu-metrics
+  name: opt-125m-vllm-metrics
   labels:
-    app: distilbert-cpu
+    app: opt-125m-vllm
     # Label to match Prometheus Operator's podMonitorSelector
     release: kube-prometheus-stack
 spec:
   selector:
     matchLabels:
-      serving.kserve.io/inferenceservice: distilbert-cpu
+      serving.kserve.io/inferenceservice: opt-125m-vllm
   namespaceSelector:
     matchNames:
       - developer1
   podMetricsEndpoints:
-    - port: user-port  # HuggingFace runtime metrics port
+    - port: user-port  # vLLM runtime metrics port
       path: /metrics
       interval: 15s
       scrapeTimeout: 10s
@@ -31,34 +31,50 @@ spec:
 apiVersion: monitoring.coreos.com/v1
 kind: PrometheusRule
 metadata:
-  name: distilbert-cpu-recording-rules
+  name: opt-125m-vllm-recording-rules
   labels:
-    app: distilbert-cpu
+    app: opt-125m-vllm
     release: kube-prometheus-stack
 spec:
   groups:
-    - name: kserve-hf-metrics
+    - name: vllm-metrics
       interval: 15s
       rules:
-        # Request rate (requests per second)
-        - record: kserve:request_rate
+        # Number of running requests
+        - record: vllm:num_requests_running
           expr: |
-            sum by (namespace, pod) (
-              rate(request_predict_seconds_count[1m])
+            sum by (model_name, namespace) (
+              vllm_num_requests_running
             )
-        # Average prediction latency
-        - record: kserve:predict_latency_avg
+        # Number of waiting requests  
+        - record: vllm:num_requests_waiting
           expr: |
-            avg by (namespace) (
-              rate(request_predict_seconds_sum[5m])
+            sum by (model_name, namespace) (
+              vllm_num_requests_waiting
+            )
+        # Token generation throughput
+        - record: vllm:generation_tokens_rate
+          expr: |
+            sum by (model_name, namespace) (
+              rate(vllm_generation_tokens_total[1m])
+            )
+        # Prompt tokens throughput
+        - record: vllm:prompt_tokens_rate
+          expr: |
+            sum by (model_name, namespace) (
+              rate(vllm_prompt_tokens_total[1m])
+            )
+        # Average time to first token (TTFT)
+        - record: vllm:time_to_first_token_avg
+          expr: |
+            avg by (model_name, namespace) (
+              rate(vllm_time_to_first_token_seconds_sum[5m])
               /
-              rate(request_predict_seconds_count[5m])
+              rate(vllm_time_to_first_token_seconds_count[5m])
             )
-        # P99 prediction latency
-        - record: kserve:predict_latency_p99
+        # GPU KV cache utilization
+        - record: vllm:gpu_cache_usage_percent
           expr: |
-            histogram_quantile(0.99, 
-              sum by (namespace, le) (
-                rate(request_predict_seconds_bucket[5m])
-              )
+            avg by (model_name, namespace) (
+              vllm_gpu_cache_usage_perc
             )

From 471ba505547d66b88d44180c63602a26c34d9051 Mon Sep 17 00:00:00 2001
From: hsteude <henrik.steude@prokube.ai>
Date: Mon, 16 Feb 2026 20:20:08 +0100
Subject: [PATCH 03/17] Update KEDA autoscaling example with TTFT scaling and
 documentation

- Add Time To First Token (TTFT) P95 as primary scaling metric
- Add GPU KV-cache utilization scaling (for GPU deployments)
- Keep running requests as fallback metric
- Update README to match other examples in repo
- Replace hardcoded namespace with <your-namespace> placeholder
- Fix Prometheus URL to include /prometheus prefix for prokube
- Document vLLM's colon-namespaced metrics (vllm:*)
---
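Note on how the three triggers combine: a minimal sketch, assuming KEDA's default `AverageValue` metric type, under which the HPA asks for roughly `ceil(query_result / threshold)` replicas per trigger and then takes the largest request. The sketch ignores the HPA tolerance band and the `behavior` policies in `scaled-object.yaml`, and the readings are hypothetical.

```python
import math


def desired_replicas(metric_value, threshold, min_replicas=1, max_replicas=10):
    """Replica count one trigger asks for under an AverageValue target."""
    wanted = math.ceil(metric_value / threshold)
    # Clamp to the ScaledObject bounds (minReplicaCount / maxReplicaCount).
    return max(min_replicas, min(max_replicas, wanted))


def combined_replicas(triggers, min_replicas=1, max_replicas=10):
    """With several triggers, the largest per-trigger request wins."""
    return max(
        desired_replicas(value, threshold, min_replicas, max_replicas)
        for value, threshold in triggers
    )


# Hypothetical readings: (query result, threshold) per trigger.
readings = [
    (0.5, 0.2),   # TTFT P95 of 0.5 s against a 0.2 s threshold
    (0.35, 0.7),  # KV-cache usage of 0.35 against 0.7
    (3.0, 2.0),   # running requests against 2
]
print(combined_replicas(readings))
```

Under this assumption the requested count depends only on the query result and the threshold, not on the current number of pods. It is therefore worth checking whether a query that already averages across pods behaves as intended, or whether the trigger's `metricType` should be set differently.
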
 serving/kserve-keda-autoscaling/README.md     | 154 ++++++++----------
 .../scaled-object.yaml                        |  36 ++--
 .../service-monitor.yaml                      |   2 +-
 3 files changed, 92 insertions(+), 100 deletions(-)

diff --git a/serving/kserve-keda-autoscaling/README.md b/serving/kserve-keda-autoscaling/README.md
index c5efc24..32325d1 100644
--- a/serving/kserve-keda-autoscaling/README.md
+++ b/serving/kserve-keda-autoscaling/README.md
@@ -11,162 +11,143 @@ Traditional request-based autoscaling doesn't work well for LLM inference becaus
 - **Memory pressure**: LLM models require significant GPU memory (KV cache), which fills up based on concurrent requests.
 
 Better metrics for LLM autoscaling include:
-- **Token throughput**: Tokens generated per second
 - **Time To First Token (TTFT)**: Latency until first token is generated
-- **Time Per Output Token (TPOT)**: Average time per generated token
 - **KV Cache utilization**: GPU memory used for attention cache
 - **Number of running/waiting requests**: Queue depth
 
 ## Prerequisites
 
-1. **KEDA** installed in the cluster:
-   ```bash
-   helm repo add kedacore https://kedacore.github.io/charts
-   helm install keda kedacore/keda --namespace keda --create-namespace
-   ```
-
-2. **Prometheus** (kube-prometheus-stack recommended):
-   ```bash
-   helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
-   helm install prometheus prometheus-community/kube-prometheus-stack --namespace monitoring --create-namespace
-   ```
+On prokube, Prometheus is already installed. You only need to install KEDA:
 
-3. **KServe** with HuggingFace/vLLM runtime configured
-
-4. **HuggingFace Token** (optional, for gated models):
-   ```bash
-   kubectl create secret generic hf-secret --from-literal=HF_TOKEN=<your-token> -n developer1
-   ```
+```bash
+helm repo add kedacore https://kedacore.github.io/charts
+helm install keda kedacore/keda --namespace keda --create-namespace
+```
 
 ## Files
 
 | File | Description |
 |------|-------------|
-| `inference-service.yaml` | KServe InferenceService for Qwen2.5-0.5B model |
-| `scaled-object.yaml` | KEDA ScaledObject with multiple autoscaling strategies |
-| `service-monitor.yaml` | ServiceMonitor, PodMonitor, and PrometheusRules for metrics collection |
+| `inference-service.yaml` | KServe InferenceService for OPT-125M model with vLLM backend |
+| `scaled-object.yaml` | KEDA ScaledObject with TTFT, GPU cache, and request-based scaling |
+| `service-monitor.yaml` | PodMonitor and PrometheusRules for vLLM metrics collection |
 
 ## Deployment
 
 ### 1. Deploy the InferenceService
 
 ```bash
-kubectl apply -f inference-service.yaml -n developer1
+kubectl apply -f inference-service.yaml -n <your-namespace>
 ```
 
 Wait for the model to be ready:
 ```bash
-kubectl get inferenceservice qwen25-05b -n developer1 -w
+kubectl get inferenceservice opt-125m-vllm -n <your-namespace> -w
 ```
 
 ### 2. Configure Prometheus Metrics Collection
 
-Apply the ServiceMonitor to scrape vLLM metrics:
+Apply the PodMonitor to scrape vLLM metrics:
 ```bash
-kubectl apply -f service-monitor.yaml -n developer1
+kubectl apply -f service-monitor.yaml -n <your-namespace>
 ```
 
-Verify metrics are being scraped:
-```bash
-# Port-forward Prometheus
-kubectl port-forward svc/prometheus-kube-prometheus-prometheus 9090:9090 -n monitoring
-
-# Query for vLLM metrics
-curl -s 'http://localhost:9090/api/v1/query?query=vllm_num_requests_running' | jq .
-```
+**Note:** You may need to update the `namespaceSelector` in `service-monitor.yaml` to match your namespace.
 
 ### 3. Deploy KEDA ScaledObject
 
 First, identify the correct deployment name:
 ```bash
-kubectl get deployments -n developer1 | grep qwen25-05b
+kubectl get deployments -n <your-namespace> | grep opt-125m-vllm
 ```
 
-Update `scaled-object.yaml` with the correct deployment name, then apply:
+Update `scaled-object.yaml` with:
+- The correct deployment name
+- Your namespace in the Prometheus queries
+
+Then apply:
 ```bash
-kubectl apply -f scaled-object.yaml -n developer1
+kubectl apply -f scaled-object.yaml -n <your-namespace>
 ```
 
 Verify the ScaledObject:
 ```bash
-kubectl get scaledobject -n developer1
-kubectl describe scaledobject qwen25-05b-scaledobject -n developer1
+kubectl get scaledobject -n <your-namespace>
+kubectl describe scaledobject opt-125m-vllm-scaledobject -n <your-namespace>
 ```
 
 ## Autoscaling Strategies
 
-This example includes three ScaledObject variants:
+This example uses three triggers; the one that yields the highest desired replica count wins:
 
-### 1. Token Throughput Based (Default)
-Scales based on average token generation throughput and number of running requests:
+### 1. Time To First Token (TTFT) - P95
+Scales when the 95th percentile TTFT exceeds 200ms:
 ```yaml
 triggers:
   - type: prometheus
     metadata:
-      query: avg(rate(vllm:generation_tokens_total[1m]))
-      threshold: "100"
+      query: |
+        histogram_quantile(0.95, sum(rate({"__name__"="vllm:time_to_first_token_seconds_bucket", namespace="<your-namespace>"}[2m])) by (le))
+      threshold: "0.2"
 ```
 
-### 2. GPU Utilization Based
-Scales based on GPU memory utilization (requires DCGM exporter):
+### 2. GPU KV-Cache Utilization
+Scales when GPU cache usage exceeds 70% (for GPU deployments):
 ```yaml
 triggers:
   - type: prometheus
     metadata:
-      query: avg(DCGM_FI_DEV_MEM_COPY_UTIL{pod=~"qwen25-05b-predictor.*"})
-      threshold: "80"
+      query: |
+        avg({"__name__"="vllm:gpu_cache_usage_perc", namespace="<your-namespace>"})
+      threshold: "0.7"
 ```
 
-### 3. Power Consumption Based
-Scales based on power consumption metrics from [Kepler](https://github.com/sustainable-computing-io/kepler):
+### 3. Running Requests (Fallback)
+Scales when average running requests per pod exceeds 2:
 ```yaml
 triggers:
   - type: prometheus
     metadata:
-      query: sum(rate(kepler_container_joules_total{container_name=~"qwen25-05b.*"}[5m]))
-      threshold: "100"
+      query: |
+        avg({"__name__"="vllm:num_requests_running", namespace="<your-namespace>"})
+      threshold: "2"
 ```
 
 ## vLLM Metrics Reference
 
-vLLM exposes metrics at `/metrics` endpoint:
+vLLM exposes metrics at `/metrics` endpoint. Note that vLLM uses colons in metric names:
 
 | Metric | Description |
 |--------|-------------|
-| `vllm_num_requests_running` | Number of requests currently being processed |
-| `vllm_num_requests_waiting` | Number of requests waiting in queue |
-| `vllm_gpu_cache_usage_perc` | GPU KV cache utilization percentage |
-| `vllm_generation_tokens_total` | Total number of generated tokens |
-| `vllm_time_to_first_token_seconds` | Histogram of TTFT |
-| `vllm_time_per_output_token_seconds` | Histogram of TPOT |
+| `vllm:num_requests_running` | Number of requests currently being processed |
+| `vllm:num_requests_waiting` | Number of requests waiting in queue |
+| `vllm:gpu_cache_usage_perc` | GPU KV cache utilization (0-1) |
+| `vllm:time_to_first_token_seconds` | Histogram of TTFT |
+| `vllm:time_per_output_token_seconds` | Histogram of TPOT |
+| `vllm:generation_tokens_total` | Total number of generated tokens |
 
 ## Testing Autoscaling
 
 Generate load to trigger autoscaling:
 
 ```bash
-# Get the inference URL
-ISVC_URL=$(kubectl get inferenceservice qwen25-05b -n developer1 -o jsonpath='{.status.url}')
-
-# Send requests in a loop
-for i in {1..100}; do
-  curl -X POST "${ISVC_URL}/v1/completions" \
-    -H "Content-Type: application/json" \
-    -d '{
-      "model": "qwen25-05b",
-      "prompt": "Write a long story about",
-      "max_tokens": 500
-    }' &
-done
+# Create a load generator pod
+kubectl run load-gen --image=curlimages/curl -n <your-namespace> --restart=Never -- \
+  sh -c 'while true; do for i in $(seq 1 10); do curl -s -X POST "http://opt-125m-vllm-predictor-00001.<your-namespace>.svc.cluster.local/openai/v1/completions" -H "Content-Type: application/json" -d "{\"model\": \"opt-125m\", \"prompt\": \"Tell me a story\", \"max_tokens\": 200}" & done; sleep 2; done'
 ```
 
 Monitor scaling:
 ```bash
-# Watch replica count
-kubectl get deployment -n developer1 -w
+# Watch HPA status
+kubectl get hpa -n <your-namespace> -w
 
-# Check KEDA metrics
-kubectl get hpa -n developer1
+# Watch pods
+kubectl get pods -n <your-namespace> -l serving.kserve.io/inferenceservice=opt-125m-vllm -w
+```
+
+Clean up:
+```bash
+kubectl delete pod load-gen -n <your-namespace>
 ```
 
 ## Troubleshooting
@@ -174,13 +155,13 @@ kubectl get hpa -n developer1
 ### KEDA not scaling
 1. Check ScaledObject status:
    ```bash
-   kubectl describe scaledobject qwen25-05b-scaledobject -n developer1
+   kubectl describe scaledobject opt-125m-vllm-scaledobject -n <your-namespace>
    ```
 
-2. Verify Prometheus connectivity:
+2. Verify Prometheus connectivity (note the `/prometheus` path prefix on prokube):
    ```bash
-   kubectl run curl-test --image=curlimages/curl --rm -it -- \
-     curl -s 'http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090/api/v1/query?query=up'
+   kubectl run curl-test --image=curlimages/curl --rm -it --restart=Never -- \
+     curl -s 'http://kube-prometheus-stack-prometheus.monitoring.svc.cluster.local:9090/prometheus/api/v1/query?query=up'
    ```
 
 3. Check KEDA operator logs:
@@ -189,18 +170,19 @@ kubectl get hpa -n developer1
    ```
 
 ### Metrics not appearing
-1. Verify ServiceMonitor is picked up:
+1. Verify PodMonitor is picked up:
    ```bash
-   kubectl get servicemonitor -n developer1
+   kubectl get podmonitor -n <your-namespace>
    ```
 
-2. Check Prometheus targets:
-   - Open Prometheus UI -> Status -> Targets
-   - Look for `serviceMonitor/developer1/qwen25-05b-metrics`
+2. Check if vLLM metrics are being scraped:
+   ```bash
+   kubectl run curl-test --image=curlimages/curl --rm -it --restart=Never -- \
+     curl -s 'http://kube-prometheus-stack-prometheus.monitoring.svc.cluster.local:9090/prometheus/api/v1/query?query={__name__=~"vllm:.*"}'
+   ```
 
 ## References
 
 - [KServe Issue #3561: Native KEDA integration](https://github.com/kserve/kserve/issues/3561)
 - [KEDA Prometheus Scaler](https://keda.sh/docs/scalers/prometheus/)
 - [vLLM Metrics](https://docs.vllm.ai/en/latest/serving/metrics.html)
-- [Kepler: Kubernetes Energy Metering](https://github.com/sustainable-computing-io/kepler)
diff --git a/serving/kserve-keda-autoscaling/scaled-object.yaml b/serving/kserve-keda-autoscaling/scaled-object.yaml
index 2a09a44..b61d3fa 100644
--- a/serving/kserve-keda-autoscaling/scaled-object.yaml
+++ b/serving/kserve-keda-autoscaling/scaled-object.yaml
@@ -8,6 +8,8 @@
 # Note: vLLM metrics use colons in names (e.g., vllm:num_requests_running)
 # which need to be quoted in PromQL queries
 #
+# TODO: Replace <your-namespace> with your actual namespace in the queries below
+#
 apiVersion: keda.sh/v1alpha1
 kind: ScaledObject
 metadata:
@@ -48,24 +50,32 @@ spec:
               periodSeconds: 15
           selectPolicy: Max
   triggers:
-    # Scale based on number of running requests per pod
-    # vLLM uses colons in metric names, so we use the actual metric name
+    # Scale based on Time To First Token (TTFT) - P95
+    # Scale up when P95 TTFT exceeds 200ms (0.2s)
     - type: prometheus
       metadata:
         serverAddress: http://kube-prometheus-stack-prometheus.monitoring.svc.cluster.local:9090/prometheus
-        metricName: vllm_num_requests_running
-        # Scale up when average running requests per pod > 2
+        metricName: vllm_ttft_p95
         query: |
-          avg({"__name__"="vllm:num_requests_running", namespace="developer1"})
-        threshold: "2"
-        activationThreshold: "1"
-    # Scale based on number of waiting requests (queue depth)
+          histogram_quantile(0.95, sum(rate({"__name__"="vllm:time_to_first_token_seconds_bucket", namespace="<your-namespace>"}[2m])) by (le))
+        threshold: "0.2"
+        activationThreshold: "0.1"
+    # Scale based on GPU KV-cache usage (for GPU deployments)
+    # Scale up when cache usage exceeds 70%
     - type: prometheus
       metadata:
         serverAddress: http://kube-prometheus-stack-prometheus.monitoring.svc.cluster.local:9090/prometheus
-        metricName: vllm_num_requests_waiting
-        # Scale up when there are waiting requests
+        metricName: vllm_gpu_cache_usage
         query: |
-          sum({"__name__"="vllm:num_requests_waiting", namespace="developer1"})
-        threshold: "1"
-        activationThreshold: "0"
+          avg({"__name__"="vllm:gpu_cache_usage_perc", namespace="<your-namespace>"})
+        threshold: "0.7"
+        activationThreshold: "0.5"
+    # Fallback: Scale based on running requests (always works)
+    - type: prometheus
+      metadata:
+        serverAddress: http://kube-prometheus-stack-prometheus.monitoring.svc.cluster.local:9090/prometheus
+        metricName: vllm_num_requests_running
+        query: |
+          avg({"__name__"="vllm:num_requests_running", namespace="<your-namespace>"})
+        threshold: "2"
+        activationThreshold: "1"
diff --git a/serving/kserve-keda-autoscaling/service-monitor.yaml b/serving/kserve-keda-autoscaling/service-monitor.yaml
index a99ae3b..2517000 100644
--- a/serving/kserve-keda-autoscaling/service-monitor.yaml
+++ b/serving/kserve-keda-autoscaling/service-monitor.yaml
@@ -19,7 +19,7 @@ spec:
       serving.kserve.io/inferenceservice: opt-125m-vllm
   namespaceSelector:
     matchNames:
-      - developer1
+      - <your-namespace>  # TODO: Replace with your namespace
   podMetricsEndpoints:
     - port: user-port  # vLLM runtime metrics port
       path: /metrics

From dcfea4f7f330d2d2d2ea61a4e4d4745dd4ac0c3c Mon Sep 17 00:00:00 2001
From: hsteude <henrik.steude@prokube.ai>
Date: Mon, 16 Feb 2026 20:21:19 +0100
Subject: [PATCH 04/17] Remove prokube-specific Prometheus note from
 prerequisites

---
 serving/kserve-keda-autoscaling/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/serving/kserve-keda-autoscaling/README.md b/serving/kserve-keda-autoscaling/README.md
index 32325d1..b29acb3 100644
--- a/serving/kserve-keda-autoscaling/README.md
+++ b/serving/kserve-keda-autoscaling/README.md
@@ -17,7 +17,7 @@ Better metrics for LLM autoscaling include:
 
 ## Prerequisites
 
-On prokube, Prometheus is already installed. You only need to install KEDA:
+Install KEDA in the cluster:
 
 ```bash
 helm repo add kedacore https://kedacore.github.io/charts

From 93ae4297f0b55761d1c3e8d6b9274341f7b99af7 Mon Sep 17 00:00:00 2001
From: hsteude <henrik.steude@prokube.ai>
Date: Mon, 16 Feb 2026 20:41:22 +0100
Subject: [PATCH 05/17] Address Copilot review feedback

- Remove unused PrometheusRules (vLLM metrics use colons natively)
- Fix trailing whitespace in scaled-object.yaml
- Clarify that vLLM uses colons in metric names (unusual but correct)
- Add note about minReplicas/maxReplicas when using KEDA
- Add step to find predictor service name before load testing
- Remove prokube-specific reference in troubleshooting
---
 serving/kserve-keda-autoscaling/README.md     | 13 +++--
 .../inference-service.yaml                    |  2 +
 .../scaled-object.yaml                        |  6 +-
 .../service-monitor.yaml                      | 57 +------------------
 4 files changed, 15 insertions(+), 63 deletions(-)

diff --git a/serving/kserve-keda-autoscaling/README.md b/serving/kserve-keda-autoscaling/README.md
index b29acb3..69c4d94 100644
--- a/serving/kserve-keda-autoscaling/README.md
+++ b/serving/kserve-keda-autoscaling/README.md
@@ -30,7 +30,7 @@ helm install keda kedacore/keda --namespace keda --create-namespace
 |------|-------------|
 | `inference-service.yaml` | KServe InferenceService for OPT-125M model with vLLM backend |
 | `scaled-object.yaml` | KEDA ScaledObject with TTFT, GPU cache, and request-based scaling |
-| `service-monitor.yaml` | PodMonitor and PrometheusRules for vLLM metrics collection |
+| `service-monitor.yaml` | PodMonitor for vLLM metrics collection |
 
 ## Deployment
 
@@ -52,7 +52,7 @@ Apply the PodMonitor to scrape vLLM metrics:
 kubectl apply -f service-monitor.yaml -n <your-namespace>
 ```
 
-**Note:** You may need to update the `namespaceSelector` in `service-monitor.yaml` to match your namespace.
+**Note:** Update the `namespaceSelector` in `service-monitor.yaml` to match your namespace.
 
 ### 3. Deploy KEDA ScaledObject
 
@@ -115,7 +115,7 @@ triggers:
 
 ## vLLM Metrics Reference
 
-vLLM exposes metrics at `/metrics` endpoint. Note that vLLM uses colons in metric names:
+vLLM exposes metrics at the `/metrics` endpoint. Note that vLLM uses colons in metric names (this is unusual but correct):
 
 | Metric | Description |
 |--------|-------------|
@@ -131,7 +131,10 @@ vLLM exposes metrics at `/metrics` endpoint. Note that vLLM uses colons in metri
 Generate load to trigger autoscaling:
 
 ```bash
-# Create a load generator pod
+# First, find the predictor service name
+kubectl get svc -n <your-namespace> | grep opt-125m-vllm
+
+# Create a load generator pod (adjust service name if needed)
 kubectl run load-gen --image=curlimages/curl -n <your-namespace> --restart=Never -- \
   sh -c 'while true; do for i in $(seq 1 10); do curl -s -X POST "http://opt-125m-vllm-predictor-00001.<your-namespace>.svc.cluster.local/openai/v1/completions" -H "Content-Type: application/json" -d "{\"model\": \"opt-125m\", \"prompt\": \"Tell me a story\", \"max_tokens\": 200}" & done; sleep 2; done'
 ```
@@ -158,7 +161,7 @@ kubectl delete pod load-gen -n <your-namespace>
    kubectl describe scaledobject opt-125m-vllm-scaledobject -n <your-namespace>
    ```
 
-2. Verify Prometheus connectivity (note the `/prometheus` path prefix on prokube):
+2. Verify Prometheus connectivity (some deployments use a `/prometheus` path prefix):
    ```bash
    kubectl run curl-test --image=curlimages/curl --rm -it --restart=Never -- \
      curl -s 'http://kube-prometheus-stack-prometheus.monitoring.svc.cluster.local:9090/prometheus/api/v1/query?query=up'
diff --git a/serving/kserve-keda-autoscaling/inference-service.yaml b/serving/kserve-keda-autoscaling/inference-service.yaml
index 0b67dfa..5135391 100644
--- a/serving/kserve-keda-autoscaling/inference-service.yaml
+++ b/serving/kserve-keda-autoscaling/inference-service.yaml
@@ -6,6 +6,8 @@ metadata:
     huggingface.co/model-id: facebook/opt-125m
 spec:
   predictor:
+    # Note: When using KEDA, replica limits are managed by the ScaledObject.
+    # These values serve as defaults if KEDA is not deployed.
     minReplicas: 1
     maxReplicas: 10
     model:
diff --git a/serving/kserve-keda-autoscaling/scaled-object.yaml b/serving/kserve-keda-autoscaling/scaled-object.yaml
index b61d3fa..1530843 100644
--- a/serving/kserve-keda-autoscaling/scaled-object.yaml
+++ b/serving/kserve-keda-autoscaling/scaled-object.yaml
@@ -5,8 +5,8 @@
 # - KEDA installed in cluster (https://keda.sh/docs/deploy/)
 # - Prometheus collecting vLLM metrics (see service-monitor.yaml)
 #
-# Note: vLLM metrics use colons in names (e.g., vllm:num_requests_running)
-# which need to be quoted in PromQL queries
+# Note: vLLM uses colons in metric names (e.g., vllm:num_requests_running),
+# which is unusual but correct. Use {"__name__"="..."} syntax in PromQL.
 #
 # TODO: Replace <your-namespace> with your actual namespace in the queries below
 #
@@ -24,7 +24,7 @@ spec:
     name: opt-125m-vllm-predictor-00001-deployment
   # Polling interval for checking metrics (seconds)
   pollingInterval: 15
-  # Cooldown period before scaling down (seconds)  
+  # Cooldown period before scaling down (seconds)
   cooldownPeriod: 60
   # Min/max replicas
   minReplicaCount: 1
diff --git a/serving/kserve-keda-autoscaling/service-monitor.yaml b/serving/kserve-keda-autoscaling/service-monitor.yaml
index 2517000..2bccc93 100644
--- a/serving/kserve-keda-autoscaling/service-monitor.yaml
+++ b/serving/kserve-keda-autoscaling/service-monitor.yaml
@@ -3,7 +3,7 @@
 #
 # Prerequisites:
 # - Prometheus Operator installed (kube-prometheus-stack)
-# - vLLM runtime exposes metrics on port 8080 at /metrics endpoint
+# - vLLM runtime exposes metrics at /metrics endpoint
 #
 apiVersion: monitoring.coreos.com/v1
 kind: PodMonitor
@@ -21,60 +21,7 @@ spec:
     matchNames:
       - <your-namespace>  # TODO: Replace with your namespace
   podMetricsEndpoints:
-    - port: user-port  # vLLM runtime metrics port
+    - port: user-port
       path: /metrics
       interval: 15s
       scrapeTimeout: 10s
----
-# PrometheusRule for creating recording rules
-# These pre-aggregate metrics for more efficient KEDA queries
-apiVersion: monitoring.coreos.com/v1
-kind: PrometheusRule
-metadata:
-  name: opt-125m-vllm-recording-rules
-  labels:
-    app: opt-125m-vllm
-    release: kube-prometheus-stack
-spec:
-  groups:
-    - name: vllm-metrics
-      interval: 15s
-      rules:
-        # Number of running requests
-        - record: vllm:num_requests_running
-          expr: |
-            sum by (model_name, namespace) (
-              vllm_num_requests_running
-            )
-        # Number of waiting requests  
-        - record: vllm:num_requests_waiting
-          expr: |
-            sum by (model_name, namespace) (
-              vllm_num_requests_waiting
-            )
-        # Token generation throughput
-        - record: vllm:generation_tokens_rate
-          expr: |
-            sum by (model_name, namespace) (
-              rate(vllm_generation_tokens_total[1m])
-            )
-        # Prompt tokens throughput
-        - record: vllm:prompt_tokens_rate
-          expr: |
-            sum by (model_name, namespace) (
-              rate(vllm_prompt_tokens_total[1m])
-            )
-        # Average time to first token (TTFT)
-        - record: vllm:time_to_first_token_avg
-          expr: |
-            avg by (model_name, namespace) (
-              rate(vllm_time_to_first_token_seconds_sum[5m])
-              /
-              rate(vllm_time_to_first_token_seconds_count[5m])
-            )
-        # GPU KV cache utilization
-        - record: vllm:gpu_cache_usage_percent
-          expr: |
-            avg by (model_name, namespace) (
-              vllm_gpu_cache_usage_perc
-            )

From b1a1034ef5eb27d4ef936772c8589e576f403bd4 Mon Sep 17 00:00:00 2001
From: hsteude <henrik.steude@prokube.ai>
Date: Mon, 16 Feb 2026 20:58:21 +0100
Subject: [PATCH 06/17] Address additional Copilot review feedback

- Fix KEDA trigger description (evaluates all, uses highest replica count)
- Make Prometheus URL configurable (<your-prometheus-url> placeholder)
- Add pod selector to queries to avoid cross-InferenceService metric aggregation
- Update README with additional configuration steps
---
 serving/kserve-keda-autoscaling/README.md      |  6 ++++--
 .../kserve-keda-autoscaling/scaled-object.yaml | 18 +++++++++++-------
 2 files changed, 15 insertions(+), 9 deletions(-)
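
Reviewer note (below the three-dash line, so not part of the commit): the corrected trigger description can be illustrated with a small sketch. It assumes the standard HPA ratio rule for Value-type metrics, ceil(currentReplicas * value / target), and ignores tolerance, stabilization windows, and activation thresholds. The function names and sample readings are hypothetical and do not come from the example files.

```python
import math


def proposal_for_trigger(current_replicas, metric_value, threshold):
    """Replica count a single trigger would ask for (HPA ratio rule, Value-type metric)."""
    return math.ceil(current_replicas * metric_value / threshold)


def desired_replicas(current_replicas, triggers, min_replicas, max_replicas):
    """Evaluate every trigger and keep the highest proposal, clamped to the replica bounds."""
    proposals = [proposal_for_trigger(current_replicas, value, target)
                 for value, target in triggers]
    return max(min_replicas, min(max_replicas, max(proposals)))


# Hypothetical readings as (metric value, threshold):
# TTFT p95 in seconds, KV-cache usage, running requests.
triggers = [(0.15, 0.2), (0.35, 0.7), (5.0, 2.0)]
print(desired_replicas(2, triggers, min_replicas=1, max_replicas=10))
```

With these made-up readings only the running-requests trigger is above its threshold, and its proposal is the one that determines the result; the other two triggers cannot pull the replica count down.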

diff --git a/serving/kserve-keda-autoscaling/README.md b/serving/kserve-keda-autoscaling/README.md
index 69c4d94..b07387c 100644
--- a/serving/kserve-keda-autoscaling/README.md
+++ b/serving/kserve-keda-autoscaling/README.md
@@ -62,8 +62,10 @@ kubectl get deployments -n <your-namespace> | grep opt-125m-vllm
 ```
 
 Update `scaled-object.yaml` with:
-- The correct deployment name
+- The correct deployment name in `scaleTargetRef`
+- Your Prometheus server URL (e.g., `http://prometheus.monitoring:9090` or with path prefix)
 - Your namespace in the Prometheus queries
+- Your InferenceService name in the pod selector (e.g., `pod=~"opt-125m-vllm-predictor-.*"`)
 
 Then apply:
 ```bash
@@ -78,7 +80,7 @@ kubectl describe scaledobject opt-125m-vllm-scaledobject -n <your-namespace>
 
 ## Autoscaling Strategies
 
-This example uses three triggers (first one to exceed threshold wins):
+This example uses three triggers. KEDA evaluates all triggers and scales based on the highest desired replica count:
 
 ### 1. Time To First Token (TTFT) - P95
 Scales when the 95th percentile TTFT exceeds 200ms:
diff --git a/serving/kserve-keda-autoscaling/scaled-object.yaml b/serving/kserve-keda-autoscaling/scaled-object.yaml
index 1530843..ca30d3e 100644
--- a/serving/kserve-keda-autoscaling/scaled-object.yaml
+++ b/serving/kserve-keda-autoscaling/scaled-object.yaml
@@ -8,7 +8,10 @@
 # Note: vLLM uses colons in metric names (e.g., vllm:num_requests_running),
 # which is unusual but correct. Use {"__name__"="..."} syntax in PromQL.
 #
-# TODO: Replace <your-namespace> with your actual namespace in the queries below
+# TODO: Replace the following before deploying:
+# - <your-namespace>: your actual namespace
+# - <your-prometheus-url>: your Prometheus server URL (may or may not have a path prefix)
+# - opt-125m-vllm: your InferenceService name (in queries and scaleTargetRef)
 #
 apiVersion: keda.sh/v1alpha1
 kind: ScaledObject
@@ -54,28 +57,29 @@ spec:
     # Scale up when P95 TTFT exceeds 200ms (0.2s)
     - type: prometheus
       metadata:
-        serverAddress: http://kube-prometheus-stack-prometheus.monitoring.svc.cluster.local:9090/prometheus
+        # Adjust URL to your Prometheus setup (some have /prometheus path prefix)
+        serverAddress: <your-prometheus-url>
         metricName: vllm_ttft_p95
         query: |
-          histogram_quantile(0.95, sum(rate({"__name__"="vllm:time_to_first_token_seconds_bucket", namespace="<your-namespace>"}[2m])) by (le))
+          histogram_quantile(0.95, sum(rate({"__name__"="vllm:time_to_first_token_seconds_bucket", namespace="<your-namespace>", pod=~"opt-125m-vllm-predictor-.*"}[2m])) by (le))
         threshold: "0.2"
         activationThreshold: "0.1"
     # Scale based on GPU KV-cache usage (for GPU deployments)
     # Scale up when cache usage exceeds 70%
     - type: prometheus
       metadata:
-        serverAddress: http://kube-prometheus-stack-prometheus.monitoring.svc.cluster.local:9090/prometheus
+        serverAddress: <your-prometheus-url>
         metricName: vllm_gpu_cache_usage
         query: |
-          avg({"__name__"="vllm:gpu_cache_usage_perc", namespace="<your-namespace>"})
+          avg({"__name__"="vllm:gpu_cache_usage_perc", namespace="<your-namespace>", pod=~"opt-125m-vllm-predictor-.*"})
         threshold: "0.7"
         activationThreshold: "0.5"
     # Fallback: Scale based on running requests (always works)
     - type: prometheus
       metadata:
-        serverAddress: http://kube-prometheus-stack-prometheus.monitoring.svc.cluster.local:9090/prometheus
+        serverAddress: <your-prometheus-url>
         metricName: vllm_num_requests_running
         query: |
-          avg({"__name__"="vllm:num_requests_running", namespace="<your-namespace>"})
+          avg({"__name__"="vllm:num_requests_running", namespace="<your-namespace>", pod=~"opt-125m-vllm-predictor-.*"})
         threshold: "2"
         activationThreshold: "1"

From b69e04df6933da3ffd497ddf960dd670f8330c72 Mon Sep 17 00:00:00 2001
From: Igor Kvachenok <igor.kvachenok@prokube.ai>
Date: Thu, 12 Mar 2026 14:17:33 +0100
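Subject: [PATCH 07/17] Update KEDA example with new insights

- Scale on total token throughput instead of the TTFT, KV-cache, and running-request triggers
- Switch the InferenceService to RawDeployment mode with autoscalerClass "external", rename it to opt-125m, and lower maxReplicas to 3
- Declare the user-port container port explicitly so the cluster-wide PodMonitor can discover the metrics endpoint
- Delete service-monitor.yaml
- Condense the README into a quick start plus customization notes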
---
 serving/kserve-keda-autoscaling/README.md     | 212 ++++--------------
 .../inference-service.yaml                    |  24 +-
 .../scaled-object.yaml                        |  89 ++------
 .../service-monitor.yaml                      |  27 ---
 4 files changed, 91 insertions(+), 261 deletions(-)
 delete mode 100644 serving/kserve-keda-autoscaling/service-monitor.yaml
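
Reviewer note (not part of the commit): the `AverageValue` semantics described in the new README can be sketched as below. This assumes the HPA rule for AverageValue targets, ceil(total / threshold), with no tolerance or stabilization applied. The throughput figures are made up, the helper name is hypothetical, and the 1..3 bounds mirror `minReplicas`/`maxReplicas` in `inference-service.yaml`.

```python
import math


def replicas_for_average_value(total_metric, threshold, min_replicas=1, max_replicas=3):
    """Smallest replica count whose per-replica share is at or below the threshold."""
    wanted = math.ceil(total_metric / threshold)
    return max(min_replicas, min(max_replicas, wanted))


# Hypothetical cluster-wide token throughput (tokens/second) against threshold "5".
for total in (3.0, 7.0, 40.0):
    print(total, replicas_for_average_value(total, threshold=5.0))
```

Because the query result is divided across replicas, the threshold is a per-replica budget: doubling the total throughput roughly doubles the requested replica count until `maxReplicas` caps it.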

diff --git a/serving/kserve-keda-autoscaling/README.md b/serving/kserve-keda-autoscaling/README.md
index b07387c..bbe6f32 100644
--- a/serving/kserve-keda-autoscaling/README.md
+++ b/serving/kserve-keda-autoscaling/README.md
@@ -1,193 +1,79 @@
-# KServe Autoscaling with KEDA and Custom Metrics
+# KServe Autoscaling with KEDA and Custom Prometheus Metrics
 
-This example demonstrates how to autoscale KServe InferenceServices using [KEDA](https://keda.sh/) with custom Prometheus metrics. This is particularly useful for LLM inference workloads where request-based autoscaling (Knative default) is not optimal.
+This example demonstrates autoscaling a KServe InferenceService using
+[KEDA](https://keda.sh/) with custom Prometheus metrics from vLLM.
+It scales based on total token throughput rather than simple request count,
+which is better suited for LLM inference workloads.
 
-## Why Custom Metrics for LLM Autoscaling?
+For full documentation, see the
+[prokube autoscaling docs](https://docs.prokube.cloud/user_docs/model_serving_autoscaling/#keda-kubernetes-event-driven-autoscaling).
 
-Traditional request-based autoscaling doesn't work well for LLM inference because:
+## Why Token Throughput?
 
-- **Token-level work**: LLM inference operates at token level, not request level. A single request can generate hundreds of tokens.
-- **Variable latency**: Request latency varies significantly based on input/output token count.
-- **Memory pressure**: LLM models require significant GPU memory (KV cache), which fills up based on concurrent requests.
-
-Better metrics for LLM autoscaling include:
-- **Time To First Token (TTFT)**: Latency until first token is generated
-- **KV Cache utilization**: GPU memory used for attention cache
-- **Number of running/waiting requests**: Queue depth
+LLM requests vary wildly in duration depending on prompt and output length.
+Request-count metrics (concurrency, QPS) don't reflect actual GPU load.
+Token throughput stays elevated as long as the model is under pressure,
+making it a stable scaling signal.
 
 ## Prerequisites
 
-Install KEDA in the cluster:
-
-```bash
-helm repo add kedacore https://kedacore.github.io/charts
-helm install keda kedacore/keda --namespace keda --create-namespace
-```
+- KEDA installed in the cluster (`helm install keda kedacore/keda -n keda --create-namespace`)
+- Prometheus scraping vLLM metrics (prokube clusters include a cluster-wide PodMonitor)
 
 ## Files
 
 | File | Description |
 |------|-------------|
-| `inference-service.yaml` | KServe InferenceService for OPT-125M model with vLLM backend |
-| `scaled-object.yaml` | KEDA ScaledObject with TTFT, GPU cache, and request-based scaling |
-| `service-monitor.yaml` | PodMonitor for vLLM metrics collection |
-
-## Deployment
+| `inference-service.yaml` | KServe InferenceService (OPT-125M, RawDeployment mode) |
+| `scaled-object.yaml` | KEDA ScaledObject — scales on token throughput |
 
-### 1. Deploy the InferenceService
+## Quick Start
 
 ```bash
-kubectl apply -f inference-service.yaml -n <your-namespace>
-```
+export NAMESPACE="default"
 
-Wait for the model to be ready:
-```bash
-kubectl get inferenceservice opt-125m-vllm -n <your-namespace> -w
-```
-
-### 2. Configure Prometheus Metrics Collection
+# 1. Deploy the InferenceService
+kubectl apply -n $NAMESPACE -f inference-service.yaml
 
-Apply the PodMonitor to scrape vLLM metrics:
-```bash
-kubectl apply -f service-monitor.yaml -n <your-namespace>
-```
+# 2. Wait for it to become ready
+kubectl get isvc opt-125m -n $NAMESPACE -w
 
-**Note:** Update the `namespaceSelector` in `service-monitor.yaml` to match your namespace.
+# 3. Deploy the KEDA ScaledObject
+kubectl apply -n $NAMESPACE -f scaled-object.yaml
 
-### 3. Deploy KEDA ScaledObject
-
-First, identify the correct deployment name:
-```bash
-kubectl get deployments -n <your-namespace> | grep opt-125m-vllm
-```
-
-Update `scaled-object.yaml` with:
-- The correct deployment name in `scaleTargetRef`
-- Your Prometheus server URL (e.g., `http://prometheus.monitoring:9090` or with path prefix)
-- Your namespace in the Prometheus queries
-- Your InferenceService name in the pod selector (e.g., `pod=~"opt-125m-vllm-predictor-.*"`)
-
-Then apply:
-```bash
-kubectl apply -f scaled-object.yaml -n <your-namespace>
+# 4. Verify
+kubectl get scaledobject -n $NAMESPACE
+kubectl get hpa -n $NAMESPACE
 ```
 
-Verify the ScaledObject:
-```bash
-kubectl get scaledobject -n <your-namespace>
-kubectl describe scaledobject opt-125m-vllm-scaledobject -n <your-namespace>
-```
+## Customization
 
-## Autoscaling Strategies
+**Namespace and model name**: replace `default` and `opt-125m` in the
+Prometheus queries inside `scaled-object.yaml`.
 
-This example uses three triggers. KEDA evaluates all triggers and scales based on the highest desired replica count:
+**Threshold**: the `threshold: "5"` value means "scale up when each replica
+handles more than 5 tokens/second on average" (`AverageValue` divides the
+query result by replica count). Tune this based on load testing for your
+model and hardware.
 
-### 1. Time To First Token (TTFT) - P95
-Scales when the 95th percentile TTFT exceeds 200ms:
-```yaml
-triggers:
-  - type: prometheus
-    metadata:
-      query: |
-        histogram_quantile(0.95, sum(rate({"__name__"="vllm:time_to_first_token_seconds_bucket", namespace="<your-namespace>"}[2m])) by (le))
-      threshold: "0.2"
-```
+**GPU deployments**: remove `--dtype=float32` and `--max-model-len=512`
+from the InferenceService args, add GPU resource requests, and consider
+adding a second trigger for GPU KV-cache utilization:
 
-### 2. GPU KV-Cache Utilization
-Scales when GPU cache usage exceeds 70% (for GPU deployments):
 ```yaml
-triggers:
-  - type: prometheus
-    metadata:
-      query: |
-        avg({"__name__"="vllm:gpu_cache_usage_perc", namespace="<your-namespace>"})
-      threshold: "0.7"
+# Add to scaled-object.yaml triggers list
+- type: prometheus
+  metricType: Value
+  metadata:
+    serverAddress: http://kube-prometheus-stack-prometheus.monitoring.svc.cluster.local:9090/prometheus
+    query: >-
+      avg(vllm:gpu_cache_usage_perc{namespace="my-namespace",model_name="my-model"})
+    threshold: "0.75"
 ```
 
-### 3. Running Requests (Fallback)
-Scales when average running requests per pod exceeds 2:
-```yaml
-triggers:
-  - type: prometheus
-    metadata:
-      query: |
-        avg({"__name__"="vllm:num_requests_running", namespace="<your-namespace>"})
-      threshold: "2"
-```
-
-## vLLM Metrics Reference
-
-vLLM exposes metrics at the `/metrics` endpoint. Note that vLLM uses colons in metric names (this is unusual but correct):
-
-| Metric | Description |
-|--------|-------------|
-| `vllm:num_requests_running` | Number of requests currently being processed |
-| `vllm:num_requests_waiting` | Number of requests waiting in queue |
-| `vllm:gpu_cache_usage_perc` | GPU KV cache utilization (0-1) |
-| `vllm:time_to_first_token_seconds` | Histogram of TTFT |
-| `vllm:time_per_output_token_seconds` | Histogram of TPOT |
-| `vllm:generation_tokens_total` | Total number of generated tokens |
-
-## Testing Autoscaling
-
-Generate load to trigger autoscaling:
-
-```bash
-# First, find the predictor service name
-kubectl get svc -n <your-namespace> | grep opt-125m-vllm
-
-# Create a load generator pod (adjust service name if needed)
-kubectl run load-gen --image=curlimages/curl -n <your-namespace> --restart=Never -- \
-  sh -c 'while true; do for i in $(seq 1 10); do curl -s -X POST "http://opt-125m-vllm-predictor-00001.<your-namespace>.svc.cluster.local/openai/v1/completions" -H "Content-Type: application/json" -d "{\"model\": \"opt-125m\", \"prompt\": \"Tell me a story\", \"max_tokens\": 200}" & done; sleep 2; done'
-```
-
-Monitor scaling:
-```bash
-# Watch HPA status
-kubectl get hpa -n <your-namespace> -w
-
-# Watch pods
-kubectl get pods -n <your-namespace> -l serving.kserve.io/inferenceservice=opt-125m-vllm -w
-```
-
-Clean up:
-```bash
-kubectl delete pod load-gen -n <your-namespace>
-```
-
-## Troubleshooting
-
-### KEDA not scaling
-1. Check ScaledObject status:
-   ```bash
-   kubectl describe scaledobject opt-125m-vllm-scaledobject -n <your-namespace>
-   ```
-
-2. Verify Prometheus connectivity (some deployments use a `/prometheus` path prefix):
-   ```bash
-   kubectl run curl-test --image=curlimages/curl --rm -it --restart=Never -- \
-     curl -s 'http://kube-prometheus-stack-prometheus.monitoring.svc.cluster.local:9090/prometheus/api/v1/query?query=up'
-   ```
-
-3. Check KEDA operator logs:
-   ```bash
-   kubectl logs -l app=keda-operator -n keda
-   ```
-
-### Metrics not appearing
-1. Verify PodMonitor is picked up:
-   ```bash
-   kubectl get podmonitor -n <your-namespace>
-   ```
-
-2. Check if vLLM metrics are being scraped:
-   ```bash
-   kubectl run curl-test --image=curlimages/curl --rm -it --restart=Never -- \
-     curl -s 'http://kube-prometheus-stack-prometheus.monitoring.svc.cluster.local:9090/prometheus/api/v1/query?query={__name__=~"vllm:.*"}'
-   ```
-
 ## References
 
-- [KServe Issue #3561: Native KEDA integration](https://github.com/kserve/kserve/issues/3561)
-- [KEDA Prometheus Scaler](https://keda.sh/docs/scalers/prometheus/)
-- [vLLM Metrics](https://docs.vllm.ai/en/latest/serving/metrics.html)
+- [prokube autoscaling documentation](https://docs.prokube.cloud/user_docs/model_serving_autoscaling/)
+- [KServe KEDA autoscaler docs](https://kserve.github.io/website/docs/model-serving/predictive-inference/autoscaling/keda-autoscaler)
+- [KEDA Prometheus scaler](https://keda.sh/docs/scalers/prometheus/)
+- [vLLM metrics reference](https://docs.vllm.ai/en/latest/serving/metrics.html)
diff --git a/serving/kserve-keda-autoscaling/inference-service.yaml b/serving/kserve-keda-autoscaling/inference-service.yaml
index 5135391..f7a755b 100644
--- a/serving/kserve-keda-autoscaling/inference-service.yaml
+++ b/serving/kserve-keda-autoscaling/inference-service.yaml
@@ -1,15 +1,21 @@
+# KServe InferenceService for OPT-125M with vLLM backend.
+# Uses RawDeployment mode — required when scaling with KEDA.
+#
+# This example runs on CPU. For GPU, remove --dtype=float32 and
+# --max-model-len, and adjust resources to request nvidia.com/gpu.
 apiVersion: serving.kserve.io/v1beta1
 kind: InferenceService
 metadata:
-  name: opt-125m-vllm
+  name: opt-125m
   annotations:
-    huggingface.co/model-id: facebook/opt-125m
+    # RawDeployment mode — creates a plain Deployment instead of a Knative Revision.
+    serving.kserve.io/deploymentMode: "RawDeployment"
+    # Tell KServe not to create its own HPA (KEDA will manage scaling).
+    serving.kserve.io/autoscalerClass: "external"
 spec:
   predictor:
-    # Note: When using KEDA, replica limits are managed by the ScaledObject.
-    # These values serve as defaults if KEDA is not deployed.
     minReplicas: 1
-    maxReplicas: 10
+    maxReplicas: 3
     model:
       modelFormat:
         name: huggingface
@@ -18,7 +24,13 @@ spec:
         - --model_id=facebook/opt-125m
         - --backend=vllm
         - --dtype=float32
-        - --device=cpu
+        - --max-model-len=512
+      # Explicit port declaration is required in RawDeployment mode
+      # for the cluster-wide PodMonitor to discover the metrics endpoint.
+      ports:
+        - name: user-port
+          containerPort: 8080
+          protocol: TCP
       resources:
         requests:
           cpu: "2"
diff --git a/serving/kserve-keda-autoscaling/scaled-object.yaml b/serving/kserve-keda-autoscaling/scaled-object.yaml
index ca30d3e..004f038 100644
--- a/serving/kserve-keda-autoscaling/scaled-object.yaml
+++ b/serving/kserve-keda-autoscaling/scaled-object.yaml
@@ -1,85 +1,44 @@
-# KEDA ScaledObject for KServe InferenceService with vLLM backend
-# Scales based on custom Prometheus metrics from vLLM serving runtime
+# KEDA ScaledObject for KServe InferenceService with vLLM backend.
+# Scales based on total token throughput (prompt + generation) from Prometheus.
 #
 # Prerequisites:
-# - KEDA installed in cluster (https://keda.sh/docs/deploy/)
-# - Prometheus collecting vLLM metrics (see service-monitor.yaml)
+# - KEDA installed (https://keda.sh/docs/deploy/)
+# - Prometheus scraping vLLM metrics (prokube includes a cluster-wide PodMonitor)
 #
-# Note: vLLM uses colons in metric names (e.g., vllm:num_requests_running),
-# which is unusual but correct. Use {"__name__"="..."} syntax in PromQL.
-#
-# TODO: Replace the following before deploying:
-# - <your-namespace>: your actual namespace
-# - <your-prometheus-url>: your Prometheus server URL (may or may not have a path prefix)
-# - opt-125m-vllm: your InferenceService name (in queries and scaleTargetRef)
+# Before deploying, replace:
+# - "default" in the Prometheus queries with your namespace
+# - "opt-125m" in model_name with your --model_name value
+# - The serverAddress if your Prometheus uses a different URL
 #
 apiVersion: keda.sh/v1alpha1
 kind: ScaledObject
 metadata:
-  name: opt-125m-vllm-scaledobject
-  labels:
-    app: opt-125m-vllm
+  name: opt-125m-scaledobject
 spec:
-  # Target the KServe predictor deployment
   scaleTargetRef:
-    apiVersion: apps/v1
-    kind: Deployment
-    name: opt-125m-vllm-predictor-00001-deployment
-  # Polling interval for checking metrics (seconds)
-  pollingInterval: 15
-  # Cooldown period before scaling down (seconds)
-  cooldownPeriod: 60
-  # Min/max replicas
+    # In RawDeployment mode KServe names the Deployment {isvc-name}-predictor.
+    name: opt-125m-predictor
   minReplicaCount: 1
-  maxReplicaCount: 10
-  # Advanced scaling behavior
+  maxReplicaCount: 3
+  pollingInterval: 15          # how often KEDA checks the metric (seconds)
+  cooldownPeriod: 120          # seconds after the last active trigger before scaling to zero; unused while minReplicaCount >= 1
   advanced:
     horizontalPodAutoscalerConfig:
       behavior:
-        scaleDown:
-          stabilizationWindowSeconds: 120
-          policies:
-            - type: Percent
-              value: 25
-              periodSeconds: 60
         scaleUp:
           stabilizationWindowSeconds: 0
+        scaleDown:
+          stabilizationWindowSeconds: 120
           policies:
-            - type: Percent
-              value: 100
-              periodSeconds: 15
             - type: Pods
-              value: 4
-              periodSeconds: 15
-          selectPolicy: Max
+              value: 1          # remove at most 1 replica per minute
+              periodSeconds: 60
   triggers:
-    # Scale based on Time To First Token (TTFT) - P95
-    # Scale up when P95 TTFT exceeds 200ms (0.2s)
-    - type: prometheus
-      metadata:
-        # Adjust URL to your Prometheus setup (some have /prometheus path prefix)
-        serverAddress: <your-prometheus-url>
-        metricName: vllm_ttft_p95
-        query: |
-          histogram_quantile(0.95, sum(rate({"__name__"="vllm:time_to_first_token_seconds_bucket", namespace="<your-namespace>", pod=~"opt-125m-vllm-predictor-.*"}[2m])) by (le))
-        threshold: "0.2"
-        activationThreshold: "0.1"
-    # Scale based on GPU KV-cache usage (for GPU deployments)
-    # Scale up when cache usage exceeds 70%
-    - type: prometheus
-      metadata:
-        serverAddress: <your-prometheus-url>
-        metricName: vllm_gpu_cache_usage
-        query: |
-          avg({"__name__"="vllm:gpu_cache_usage_perc", namespace="<your-namespace>", pod=~"opt-125m-vllm-predictor-.*"})
-        threshold: "0.7"
-        activationThreshold: "0.5"
-    # Fallback: Scale based on running requests (always works)
     - type: prometheus
       metadata:
-        serverAddress: <your-prometheus-url>
-        metricName: vllm_num_requests_running
-        query: |
-          avg({"__name__"="vllm:num_requests_running", namespace="<your-namespace>", pod=~"opt-125m-vllm-predictor-.*"})
-        threshold: "2"
-        activationThreshold: "1"
+        serverAddress: http://kube-prometheus-stack-prometheus.monitoring.svc.cluster.local:9090/prometheus
+        query: >-
+          sum(rate(vllm:prompt_tokens_total{namespace="default",model_name="opt-125m"}[2m]))
+          + sum(rate(vllm:generation_tokens_total{namespace="default",model_name="opt-125m"}[2m]))
+        metricType: AverageValue
+        threshold: "5"
diff --git a/serving/kserve-keda-autoscaling/service-monitor.yaml b/serving/kserve-keda-autoscaling/service-monitor.yaml
deleted file mode 100644
index 2bccc93..0000000
--- a/serving/kserve-keda-autoscaling/service-monitor.yaml
+++ /dev/null
@@ -1,27 +0,0 @@
-# PodMonitor to scrape vLLM metrics from KServe InferenceService
-# This enables Prometheus to collect the metrics used by KEDA for autoscaling
-#
-# Prerequisites:
-# - Prometheus Operator installed (kube-prometheus-stack)
-# - vLLM runtime exposes metrics at /metrics endpoint
-#
-apiVersion: monitoring.coreos.com/v1
-kind: PodMonitor
-metadata:
-  name: opt-125m-vllm-metrics
-  labels:
-    app: opt-125m-vllm
-    # Label to match Prometheus Operator's podMonitorSelector
-    release: kube-prometheus-stack
-spec:
-  selector:
-    matchLabels:
-      serving.kserve.io/inferenceservice: opt-125m-vllm
-  namespaceSelector:
-    matchNames:
-      - <your-namespace>  # TODO: Replace with your namespace
-  podMetricsEndpoints:
-    - port: user-port
-      path: /metrics
-      interval: 15s
-      scrapeTimeout: 10s

From b3816d7e4e5ad2601ff9705b67370a86bb2bdc70 Mon Sep 17 00:00:00 2001
From: Igor Kvachenok <igor.kvachenok@prokube.ai>
Date: Thu, 12 Mar 2026 14:31:06 +0100
Subject: [PATCH 08/17] Add scaleUp stabilization window to mitigate metric
 oscillation

With metricType AverageValue, total token throughput is divided by the
replica count, so the per-replica value drops after a scale-up event (it
halves when going from 1 to 2 replicas). With stabilizationWindowSeconds: 0
this could cause flapping near the threshold. Setting it to 30s means the
metric has to stay above the threshold for about two consecutive 15s polling
intervals before a scale-up is committed, while the existing 120s scaleDown
window prevents premature scale-down.
---
 serving/kserve-keda-autoscaling/scaled-object.yaml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
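
The effect of the scale-up window can be illustrated with a simplified
Python model. This is a sketch, not the HPA implementation: it assumes the
behavior described in the Kubernetes HPA documentation, where the smallest
recommendation inside the scale-up window wins. The function name
`stabilized_scale_up`, the strict `<` window boundary, and the sample values
are hypothetical and chosen only for illustration.

```python
def stabilized_scale_up(current, latest, earlier, now, window_s):
    """Return the replica count after applying a scale-up stabilization window.

    current  -- replicas running right now
    latest   -- the recommendation computed at time `now`
    earlier  -- list of (timestamp_s, recommendation) from previous polls
    window_s -- stabilizationWindowSeconds for scaleUp (assumed semantics)
    """
    # Keep only earlier recommendations that are still inside the window.
    in_window = [rec for ts, rec in earlier if now - ts < window_s]
    # For scale-up, the lowest recommendation in the window is used,
    # and this function never goes below the current replica count.
    return max(current, min([latest] + in_window))


if __name__ == "__main__":
    # A single spike at t=15s, preceded by a calm sample at t=0s.
    spike = [(0, 1)]
    print(stabilized_scale_up(1, 3, spike, now=15, window_s=0))   # 3: scales at once
    print(stabilized_scale_up(1, 3, spike, now=15, window_s=30))  # 1: spike absorbed

    # The same recommendation seen at two consecutive 15s polls.
    sustained = [(0, 1), (15, 3)]
    print(stabilized_scale_up(1, 3, sustained, now=30, window_s=30))  # 3
```

With a 15s polling interval, a 30s window in this model commits a scale-up
only once two consecutive recommendations agree.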

diff --git a/serving/kserve-keda-autoscaling/scaled-object.yaml b/serving/kserve-keda-autoscaling/scaled-object.yaml
index 004f038..32bc365 100644
--- a/serving/kserve-keda-autoscaling/scaled-object.yaml
+++ b/serving/kserve-keda-autoscaling/scaled-object.yaml
@@ -26,7 +26,7 @@ spec:
     horizontalPodAutoscalerConfig:
       behavior:
         scaleUp:
-          stabilizationWindowSeconds: 0
+          stabilizationWindowSeconds: 30  # short window to absorb metric noise before committing to scale-up
         scaleDown:
           stabilizationWindowSeconds: 120
           policies:

From 81e4b823b510b6186b82be70bba7f6960860ee73 Mon Sep 17 00:00:00 2001
From: Igor Kvachenok <igor.kvachenok@prokube.ai>
Date: Thu, 12 Mar 2026 18:05:04 +0100
Subject: [PATCH 09/17] Address reviewer's feedback

---
 serving/kserve-keda-autoscaling/README.md     | 87 ++++++++++++++++---
 .../scaled-object.yaml                        | 11 ++-
 2 files changed, 82 insertions(+), 16 deletions(-)

diff --git a/serving/kserve-keda-autoscaling/README.md b/serving/kserve-keda-autoscaling/README.md
index bbe6f32..66d71d9 100644
--- a/serving/kserve-keda-autoscaling/README.md
+++ b/serving/kserve-keda-autoscaling/README.md
@@ -29,33 +29,100 @@ making it a stable scaling signal.
 
 ## Quick Start
 
-```bash
-export NAMESPACE="default"
+> [!NOTE]
+> All of the examples below should be run in prokube notebook inside your cluster. The model created with RawDeployment is not accessible from outside the cluster by default.
 
+```bash
 # 1. Deploy the InferenceService
-kubectl apply -n $NAMESPACE -f inference-service.yaml
+kubectl apply -f inference-service.yaml
 
 # 2. Wait for it to become ready
-kubectl get isvc opt-125m -n $NAMESPACE -w
+kubectl get isvc opt-125m -w
 
 # 3. Deploy the KEDA ScaledObject
-kubectl apply -n $NAMESPACE -f scaled-object.yaml
+kubectl apply -f scaled-object.yaml
 
 # 4. Verify
-kubectl get scaledobject -n $NAMESPACE
-kubectl get hpa -n $NAMESPACE
+kubectl get scaledobject
+kubectl get hpa
+```
+
+## See It in Action
+
+After deploying, you can trigger autoscaling and observe the full scale-up / scale-down cycle.
+
+### 1. Send inference requests
+
+Get the service URL and send a request:
+
+```bash
+SERVICE_URL=$(kubectl get isvc opt-125m -o jsonpath='{.status.url}')
+
+curl -s "$SERVICE_URL/openai/v1/completions" \
+  -H "Content-Type: application/json" \
+  -d '{"model": "opt-125m", "prompt": "Hello world", "max_tokens": 64}'
+```
+
+### 2. Generate enough load to trigger scale-up
+
+Run several concurrent workers to push token throughput above the threshold
+(5 tokens/second per replica by default):
+
+```bash
+# 5 parallel workers, each sending requests in a loop
+for i in $(seq 1 5); do
+  (while true; do
+    curl -s "$SERVICE_URL/openai/v1/completions" \
+      -H "Content-Type: application/json" \
+      -d '{"model": "opt-125m", "prompt": "Write a long story about a dragon", "max_tokens": 200}' > /dev/null
+  done) &
+done
+
+# Stop the load later with:
+# kill $(jobs -p)
 ```
 
+### 3. Observe autoscaling
+
+Watch replicas scale up in response to load:
+
+```bash
+# Watch pods scale up (and later scale down)
+kubectl get pods -l serving.kserve.io/inferenceservice=opt-125m -w
+
+# Check KEDA's HPA
+kubectl get hpa -w
+```
+
+**Grafana dashboards** (prokube clusters): to visualize token throughput and replica count over time, see:
+- vLLM Performance Statistics: https://<YOUR_DOMAIN>/grafana/d/performance-statistics/vllm-performance-statistics
+- vLLM Query Statistics: https://<YOUR_DOMAIN>/grafana/d/query-statistics4/vllm-query-statistics
+- Replica count: https://<YOUR_DOMAIN>/grafana/d/demqj48/kubernetes-compute-resources-workload-copy
+
+In our testing, the full cycle looked like:
+1. **1 replica** at rest
+2. Load applied (5 workers, ~55 tok/s total) — KEDA detects threshold breach
+3. **Scaled to 3 replicas** within ~30 seconds
+4. Load removed — metric drops to 0 — stabilization window (120s)
+5. **Scaled back down** 3 → 2 → 1 gracefully (1 pod removed per minute)
+
 ## Customization
 
-**Namespace and model name**: replace `default` and `opt-125m` in the
-Prometheus queries inside `scaled-object.yaml`.
+**Model name**: the `model_name="opt-125m"` filter in the Prometheus queries inside
+`scaled-object.yaml` must match the `--model_name` argument in `inference-service.yaml`.
 
 **Threshold**: the `threshold: "5"` value means "scale up when each replica
 handles more than 5 tokens/second on average" (`AverageValue` divides the
 query result by replica count). Tune this based on load testing for your
 model and hardware.
 
+**Multi-tenant clusters**: if multiple users may deploy models with the same
+name, add a `namespace` filter to the Prometheus queries:
+
+```promql
+sum(rate(vllm:prompt_tokens_total{namespace="my-namespace",model_name="opt-125m"}[2m]))
+```
+
 **GPU deployments**: remove `--dtype=float32` and `--max-model-len=512`
 from the InferenceService args, add GPU resource requests, and consider
 adding a second trigger for GPU KV-cache utilization:
@@ -66,7 +133,7 @@ adding a second trigger for GPU KV-cache utilization:
   metadata:
     serverAddress: http://kube-prometheus-stack-prometheus.monitoring.svc.cluster.local:9090/prometheus
     query: >-
-      avg(vllm:gpu_cache_usage_perc{namespace="my-namespace",model_name="my-model"})
+      avg(vllm:gpu_cache_usage_perc{model_name="my-model"})
     metricType: AverageValue
     threshold: "0.75"
 ```
diff --git a/serving/kserve-keda-autoscaling/scaled-object.yaml b/serving/kserve-keda-autoscaling/scaled-object.yaml
index 32bc365..40036b8 100644
--- a/serving/kserve-keda-autoscaling/scaled-object.yaml
+++ b/serving/kserve-keda-autoscaling/scaled-object.yaml
@@ -5,10 +5,9 @@
 # - KEDA installed (https://keda.sh/docs/deploy/)
 # - Prometheus scraping vLLM metrics (prokube includes a cluster-wide PodMonitor)
 #
-# Before deploying, replace:
-# - "default" in the Prometheus queries with your namespace
-# - "opt-125m" in model_name with your --model_name value
-# - The serverAddress if your Prometheus uses a different URL
+# Customization:
+# - "opt-125m" in model_name must match the --model_name arg in inference-service.yaml
+# - The serverAddress must match your Prometheus URL
 #
 apiVersion: keda.sh/v1alpha1
 kind: ScaledObject
@@ -38,7 +37,7 @@ spec:
       metadata:
         serverAddress: http://kube-prometheus-stack-prometheus.monitoring.svc.cluster.local:9090/prometheus
         query: >-
-          sum(rate(vllm:prompt_tokens_total{namespace="default",model_name="opt-125m"}[2m]))
-          + sum(rate(vllm:generation_tokens_total{namespace="default",model_name="opt-125m"}[2m]))
+          sum(rate(vllm:prompt_tokens_total{model_name="opt-125m"}[2m]))
+          + sum(rate(vllm:generation_tokens_total{model_name="opt-125m"}[2m]))
         metricType: AverageValue
         threshold: "5"

From f2e6f96830a5fe5a968c8b0fa6aeb8541bb8676e Mon Sep 17 00:00:00 2001
From: Igor Kvachenok <igor.kvachenok@prokube.ai>
Date: Thu, 12 Mar 2026 18:26:35 +0100
Subject: [PATCH 10/17] Better readme and scaling watching instructions

---
 serving/kserve-keda-autoscaling/README.md | 24 +++++++++++++++++------
 1 file changed, 18 insertions(+), 6 deletions(-)

diff --git a/serving/kserve-keda-autoscaling/README.md b/serving/kserve-keda-autoscaling/README.md
index 66d71d9..582d6cd 100644
--- a/serving/kserve-keda-autoscaling/README.md
+++ b/serving/kserve-keda-autoscaling/README.md
@@ -30,7 +30,7 @@ making it a stable scaling signal.
 ## Quick Start
 
 > [!NOTE]
-> All of the examples below should be run in prokube notebook inside your cluster. The model created with RawDeployment is not accessible from outside the cluster by default.
+> Run all of the examples below from a prokube notebook terminal inside your cluster. A model deployed in RawDeployment mode is not accessible from outside the cluster by default.
 
 ```bash
 # 1. Deploy the InferenceService
@@ -84,14 +84,26 @@ done
 
 ### 3. Observe autoscaling
 
-Watch replicas scale up in response to load:
+You can use dashboards (see below) or check out any of these in terminal while the load is running:
 
 ```bash
-# Watch pods scale up (and later scale down)
-kubectl get pods -l serving.kserve.io/inferenceservice=opt-125m -w
+# Deployment replica count (most direct signal)
+kubectl get deployment opt-125m-predictor -w
 
-# Check KEDA's HPA
-kubectl get hpa -w
+# HPA — shows current metric value vs threshold and desired replica count
+kubectl get hpa keda-hpa-opt-125m-scaledobject -w
+
+# ScaledObject — shows Ready/Active/Paused conditions
+kubectl get scaledobject opt-125m-scaledobject -w
+
+# Pods coming up and terminating
+kubectl get pods -l app=isvc.opt-125m-predictor -w
+```
+
+Or poll a compact summary every 10 seconds:
+
+```bash
+watch -n 10 kubectl get deployment/opt-125m-predictor hpa/keda-hpa-opt-125m-scaledobject
 ```
 
 **Grafana dashboards** (prokube clusters): to visualize token throughput and replica count over time, see:

From 75c9d0aa1b6fb4401f4497bc1b014c7725fd049a Mon Sep 17 00:00:00 2001
From: Igor Kvachenok <igor.kvachenok@prokube.ai>
Date: Thu, 12 Mar 2026 18:44:59 +0100
Subject: [PATCH 11/17] Fix service URL to use internal cluster address and
 simplify observe section

---
 serving/kserve-keda-autoscaling/README.md | 49 ++++++++++-------------
 1 file changed, 22 insertions(+), 27 deletions(-)

diff --git a/serving/kserve-keda-autoscaling/README.md b/serving/kserve-keda-autoscaling/README.md
index 582d6cd..ffa0878 100644
--- a/serving/kserve-keda-autoscaling/README.md
+++ b/serving/kserve-keda-autoscaling/README.md
@@ -53,57 +53,52 @@ After deploying, you can trigger autoscaling and observe the full scale-up / sca
 
 ### 1. Send inference requests
 
-Get the service URL and send a request:
+Get the internal cluster address and send a request:
 
 ```bash
-SERVICE_URL=$(kubectl get isvc opt-125m -o jsonpath='{.status.url}')
+SERVICE_URL="$(kubectl get isvc opt-125m -o jsonpath='{.status.address.url}')"
+
+echo "\nService URL: $SERVICE_URL"
 
 curl -s "$SERVICE_URL/openai/v1/completions" \
   -H "Content-Type: application/json" \
-  -d '{"model": "opt-125m", "prompt": "Hello world", "max_tokens": 64}'
+  -d '{"model":"opt-125m","prompt":"Hello world","max_tokens":64}' \
+  | python -m json.tool
 ```
 
 ### 2. Generate enough load to trigger scale-up
 
-Run several concurrent workers to push token throughput above the threshold
+Run several concurrent workers (in the background!) to push token throughput above the threshold
 (5 tokens/second per replica by default):
 
 ```bash
 # 5 parallel workers, each sending requests in a loop
 for i in $(seq 1 5); do
-  (while true; do
-    curl -s "$SERVICE_URL/openai/v1/completions" \
-      -H "Content-Type: application/json" \
-      -d '{"model": "opt-125m", "prompt": "Write a long story about a dragon", "max_tokens": 200}' > /dev/null
-  done) &
+  (
+    while true; do
+      curl -s "$SERVICE_URL/openai/v1/completions" \
+        -H "Content-Type: application/json" \
+        -d '{"model":"opt-125m","prompt":"Write a long story about a dragon","max_tokens":200}' \
+        > /dev/null
+      sleep 1
+    done
+  ) &
 done
 
+echo "Load generation started."
+echo "Stop it with: kill $(jobs -p)"
+
 # Stop the load later with:
 # kill $(jobs -p)
 ```
 
 ### 3. Observe autoscaling
 
-You can use dashboards (see below) or check out any of these in terminal while the load is running:
-
-```bash
-# Deployment replica count (most direct signal)
-kubectl get deployment opt-125m-predictor -w
-
-# HPA — shows current metric value vs threshold and desired replica count
-kubectl get hpa keda-hpa-opt-125m-scaledobject -w
-
-# ScaledObject — shows Ready/Active/Paused conditions
-kubectl get scaledobject opt-125m-scaledobject -w
-
-# Pods coming up and terminating
-kubectl get pods -l app=isvc.opt-125m-predictor -w
-```
-
-Or poll a compact summary every 10 seconds:
+You can use dashboards (recommended, see below) or get a compact summary in terminal:
 
 ```bash
-watch -n 10 kubectl get deployment/opt-125m-predictor hpa/keda-hpa-opt-125m-scaledobject
+# polls every 10 seconds
+watch -n 10 kubectl get deployment/opt-125m-predictor hpa/keda-hpa-opt-125m-scaledobject scaledobject/opt-125m-scaledobject
 ```
 
 **Grafana dashboards** (prokube clusters): to visualize token throughput and replica count over time, see:

From 392a22a823f0e01835216a4ba03eb91718ef2d88 Mon Sep 17 00:00:00 2001
From: Igor Kvachenok <igor.kvachenok@prokube.ai>
Date: Thu, 12 Mar 2026 19:06:00 +0100
Subject: [PATCH 12/17] Improve KEDA autoscaling documentation

---
 serving/kserve-keda-autoscaling/README.md | 45 ++++++++++++-----------
 1 file changed, 24 insertions(+), 21 deletions(-)

diff --git a/serving/kserve-keda-autoscaling/README.md b/serving/kserve-keda-autoscaling/README.md
index ffa0878..3f49bed 100644
--- a/serving/kserve-keda-autoscaling/README.md
+++ b/serving/kserve-keda-autoscaling/README.md
@@ -39,7 +39,7 @@ kubectl apply -f inference-service.yaml
 # 2. Wait for it to become ready
 kubectl get isvc opt-125m -w
 
-# 3. Deploy the KEDA ScaledObject
+# 3. Deploy the KEDA ScaledObject (requires corresponding permissions)
 kubectl apply -f scaled-object.yaml
 
 # 4. Verify
@@ -56,14 +56,13 @@ After deploying, you can trigger autoscaling and observe the full scale-up / sca
 Get the internal cluster address and send a request:
 
 ```bash
-SERVICE_URL="$(kubectl get isvc opt-125m -o jsonpath='{.status.address.url}')"
-
-echo "\nService URL: $SERVICE_URL"
+# inference service name + "-predictor"
+SERVICE_URL=opt-125m-predictor
 
 curl -s "$SERVICE_URL/openai/v1/completions" \
   -H "Content-Type: application/json" \
-  -d '{"model":"opt-125m","prompt":"Hello world","max_tokens":64}' \
-  | python -m json.tool
+  -d '{"model":"opt-125m","prompt":"What is AI?","max_tokens":64}' \
+  | python -c 'import json,sys;print("\n", json.load(sys.stdin)["choices"][0]["text"].strip(), "\n")'
 ```
 
 ### 2. Generate enough load to trigger scale-up
@@ -73,23 +72,20 @@ Run several concurrent workers (in the background!) to push token throughput abo
 
 ```bash
 # 5 parallel workers, each sending requests in a loop
+PIDS=""
 for i in $(seq 1 5); do
-  (
-    while true; do
-      curl -s "$SERVICE_URL/openai/v1/completions" \
-        -H "Content-Type: application/json" \
-        -d '{"model":"opt-125m","prompt":"Write a long story about a dragon","max_tokens":200}' \
-        > /dev/null
-      sleep 1
-    done
-  ) &
+  (while true; do
+    curl -s "$SERVICE_URL/openai/v1/completions" \
+      -H "Content-Type: application/json" \
+      -d '{"model":"opt-125m","prompt":"Write a long story about a dragon","max_tokens":200}' > /dev/null
+  done) &
+  PIDS="$PIDS $!"
 done
 
-echo "Load generation started."
-echo "Stop it with: kill $(jobs -p)"
-
-# Stop the load later with:
-# kill $(jobs -p)
+echo
+echo "Load running (PIDs:$PIDS)"
+echo "Stop with: kill$PIDS"
+echo
 ```
 
 ### 3. Observe autoscaling
@@ -98,7 +94,14 @@ You can use dashboards (recommended, see below) or get a compact summary in term
 
 ```bash
 # polls every 10 seconds
-watch -n 10 kubectl get deployment/opt-125m-predictor hpa/keda-hpa-opt-125m-scaledobject scaledobject/opt-125m-scaledobject
+watch -n10 '
+echo "Deployment:"
+kubectl get deployment opt-125m-predictor
+
+echo
+echo "Autoscaler:"
+kubectl get hpa keda-hpa-opt-125m-scaledobject
+'
 ```
 
 **Grafana dashboards** (prokube clusters): to visualize token throughput and replica count over time, see:

From 35ef20dc25b5578706c73e4fe80ad67e275627d3 Mon Sep 17 00:00:00 2001
From: Igor Kvachenok <igor.kvachenok@prokube.ai>
Date: Fri, 13 Mar 2026 18:32:58 +0100
Subject: [PATCH 13/17] Warn about KEDA availability and namespace metric
 collision

---
 serving/kserve-keda-autoscaling/README.md | 27 +++++++++++++++--------
 1 file changed, 18 insertions(+), 9 deletions(-)
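
The namespace warning added in this patch comes down to one extra label
matcher in both halves of the query. As a rough illustration, the Python
sketch below assembles the same PromQL string with and without that filter.
The helper name `token_throughput_query` and the namespace `team-a` are
hypothetical; the metric and label names are the ones used in
`scaled-object.yaml`.

```python
def token_throughput_query(model_name, namespace=None, window="2m"):
    """Build the prompt + generation token-rate query used by the ScaledObject."""
    labels = []
    if namespace:
        # Restrict the series to one namespace to avoid cross-tenant mixing.
        labels.append(f'namespace="{namespace}"')
    labels.append(f'model_name="{model_name}"')
    selector = "{" + ",".join(labels) + "}"

    metrics = ("vllm:prompt_tokens_total", "vllm:generation_tokens_total")
    parts = [f"sum(rate({metric}{selector}[{window}]))" for metric in metrics]
    return " + ".join(parts)


if __name__ == "__main__":
    # Cluster-wide query (the default in scaled-object.yaml).
    print(token_throughput_query("opt-125m"))
    # Namespace-scoped query, as recommended for shared clusters.
    print(token_throughput_query("opt-125m", namespace="team-a"))
```

The printed string can be pasted into the `query` field of the trigger or
into the Prometheus UI to compare both variants.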

diff --git a/serving/kserve-keda-autoscaling/README.md b/serving/kserve-keda-autoscaling/README.md
index 3f49bed..e8986dc 100644
--- a/serving/kserve-keda-autoscaling/README.md
+++ b/serving/kserve-keda-autoscaling/README.md
@@ -17,7 +17,7 @@ making it a stable scaling signal.
 
 ## Prerequisites
 
-- KEDA installed in the cluster (`helm install keda kedacore/keda -n keda --create-namespace`)
+- KEDA installed in the cluster — not available in all prokube clusters by default; see step 3 below
 - Prometheus scraping vLLM metrics (prokube clusters include a cluster-wide PodMonitor)
 
 ## Files
@@ -39,7 +39,7 @@ kubectl apply -f inference-service.yaml
 # 2. Wait for it to become ready
 kubectl get isvc opt-125m -w
 
-# 3. Deploy the KEDA ScaledObject (requires corresponding permissions)
+# 3. Deploy the KEDA ScaledObject
 kubectl apply -f scaled-object.yaml
 
 # 4. Verify
@@ -47,6 +47,22 @@ kubectl get scaledobject
 kubectl get hpa
 ```
 
+> [!WARNING]
+> If step 3 fails with `no matches for kind "ScaledObject"`, KEDA is not installed in your cluster.
+> Ask your admin to enable it.
+
+> [!WARNING]
+> The Prometheus query in `scaled-object.yaml` has no `namespace` filter, so it aggregates token
+> throughput across **all namespaces**. This is fine for testing, but if multiple users deploy a
+> model named `opt-125m` at the same time, their metrics will interfere and autoscaling will be
+> incorrect for both. For any real use, add a namespace filter to both queries in `scaled-object.yaml`:
+>
+> ```yaml
+> query: >-
+>   sum(rate(vllm:prompt_tokens_total{namespace="<your-namespace>",model_name="opt-125m"}[2m]))
+>   + sum(rate(vllm:generation_tokens_total{namespace="<your-namespace>",model_name="opt-125m"}[2m]))
+> ```
+
 ## See It in Action
 
 After deploying, you can trigger autoscaling and observe the full scale-up / scale-down cycle.
@@ -126,13 +142,6 @@ handles more than 5 tokens/second on average" (`AverageValue` divides the
 query result by replica count). Tune this based on load testing for your
 model and hardware.
 
-**Multi-tenant clusters**: if multiple users may deploy models with the same
-name, add a `namespace` filter to the Prometheus queries:
-
-```promql
-sum(rate(vllm:prompt_tokens_total{namespace="my-namespace",model_name="opt-125m"}[2m]))
-```
-
 **GPU deployments**: remove `--dtype=float32` and `--max-model-len=512`
 from the InferenceService args, add GPU resource requests, and consider
 adding a second trigger for GPU KV-cache utilization:

From bd2a6e406446294a77afd71222f22f47a899e7be Mon Sep 17 00:00:00 2001
From: Igor Kvachenok <igor.kvachenok@prokube.ai>
Date: Thu, 19 Mar 2026 15:06:36 +0100
Subject: [PATCH 14/17] Improve load generation

---
 serving/kserve-keda-autoscaling/README.md     |  84 ++++--
 .../kserve-keda-autoscaling/load-generator.py | 261 ++++++++++++++++++
 2 files changed, 322 insertions(+), 23 deletions(-)
 create mode 100644 serving/kserve-keda-autoscaling/load-generator.py
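
The "Scaling Math" table added to the README in this patch can be
reproduced with a few lines of Python. This is a minimal sketch of the
formula quoted there, `ceil(totalMetricValue / threshold)`, clamped to the
replica bounds of the ScaledObject. The function name `desired_replicas`
and the sample throughput values are illustrative assumptions; the
threshold (5) and the bounds (1 to 3) come from `scaled-object.yaml`.

```python
import math


def desired_replicas(total_tok_s, threshold=5.0, min_replicas=1, max_replicas=3):
    """Return (raw, capped) replica counts for an AverageValue target."""
    # Raw HPA recommendation for an AverageValue external metric.
    raw = math.ceil(total_tok_s / threshold)
    # The autoscaler never goes below min or above max replicas.
    capped = max(min_replicas, min(max_replicas, raw))
    return raw, capped


if __name__ == "__main__":
    print("total tok/s | raw | capped")
    for total in (0, 5, 8, 10, 15, 22):
        raw, capped = desired_replicas(total)
        print(f"{total:11} | {raw:3} | {capped:6}")
```

The two presets of the load generator correspond to the 8 tok/s row
(2 replicas) and the 22 tok/s row (5 requested, capped at 3).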

diff --git a/serving/kserve-keda-autoscaling/README.md b/serving/kserve-keda-autoscaling/README.md
index e8986dc..77cbb6e 100644
--- a/serving/kserve-keda-autoscaling/README.md
+++ b/serving/kserve-keda-autoscaling/README.md
@@ -26,6 +26,7 @@ making it a stable scaling signal.
 |------|-------------|
 | `inference-service.yaml` | KServe InferenceService (OPT-125M, RawDeployment mode) |
 | `scaled-object.yaml` | KEDA ScaledObject — scales on token throughput |
+| `load-generator.py` | Python load generator with presets for different scaling scenarios |
 
 ## Quick Start
 
@@ -67,7 +68,7 @@ kubectl get hpa
 
 After deploying, you can trigger autoscaling and observe the full scale-up / scale-down cycle.
 
-### 1. Send inference requests
+### 1. Send a test inference request
 
 Get the internal cluster address and send a request:
 
@@ -81,29 +82,30 @@ curl -s "$SERVICE_URL/openai/v1/completions" \
   | python -c 'import json,sys;print("\n", json.load(sys.stdin)["choices"][0]["text"].strip(), "\n")'
 ```
 
-### 2. Generate enough load to trigger scale-up
+### 2. Generate load to trigger scale-up
 
-Run several concurrent workers (in the background!) to push token throughput above the threshold
-(5 tokens/second per replica by default):
+Use the included load generator to produce controlled, sustained load.
+It has two presets calibrated for the opt-125m model on CPU:
+
+| Mode | Workers | Sleep | Throughput | Scaling behavior |
+|------|---------|-------|------------|------------------|
+| `stable-2` | 1 | 8s | ~8 tok/s | Scales to 2 replicas and holds |
+| `stable-3` | 2 | 2s | ~22 tok/s | Scales to 3 replicas and holds |
 
 ```bash
-# 5 parallel workers, each sending requests in a loop
-PIDS=""
-for i in $(seq 1 5); do
-  (while true; do
-    curl -s "$SERVICE_URL/openai/v1/completions" \
-      -H "Content-Type: application/json" \
-      -d '{"model":"opt-125m","prompt":"Write a long story about a dragon","max_tokens":200}' > /dev/null
-  done) &
-  PIDS="$PIDS $!"
-done
+# Scale to 2 replicas (moderate load)
+python load-generator.py --mode stable-2
 
-echo
-echo "Load running (PIDs:$PIDS)"
-echo "Stop with: kill$PIDS"
-echo
+# Scale to 3 replicas (heavy load)
+python load-generator.py --mode stable-3
+
+# Custom: pick your own concurrency and pacing
+python load-generator.py --mode custom --workers 3 --sleep 1.0
 ```
 
+Press `Ctrl+C` to stop the load at any time. By default the script runs for
+10 minutes; override with `--duration`.
+
 ### 3. Observe autoscaling
 
 You can use dashboards (recommended, see below) or get a compact summary in terminal:
@@ -125,12 +127,37 @@ kubectl get hpa keda-hpa-opt-125m-scaledobject
 - vLLM Query Statistics: https://<YOUR_DOMAIN>/grafana/d/query-statistics4/vllm-query-statistics
 - Replica count: https://<YOUR_DOMAIN>/grafana/d/demqj48/kubernetes-compute-resources-workload-copy
 
-In our testing, the full cycle looked like:
+### Expected behavior
+
+**Stable-2 mode** (~8 tok/s):
+1. **1 replica** at rest
+2. Load applied — metric rises to ~8 tok/s — `ceil(8/5) = 2` replicas needed
+3. **Scaled to 2 replicas** within ~1 minute
+4. Metric stabilizes at ~4 tok/s per replica (below threshold) — stays at 2
+5. Load removed: the metric decays to 0 as the 2m rate window drains, then the scaleDown stabilization window (120s) must pass (`cooldownPeriod` only governs scaling to zero)
+6. **Scaled back to 1** replica
+
+**Stable-3 mode** (~22 tok/s):
 1. **1 replica** at rest
-2. Load applied (5 workers, ~55 tok/s total) — KEDA detects threshold breach
-3. **Scaled to 3 replicas** within ~30 seconds
-4. Load removed — metric drops to 0 — stabilization window (120s)
-5. **Scaled back down** 3 → 2 → 1 gracefully (1 pod removed per minute)
+2. Load applied — metric rises quickly — `ceil(22/5) = 5`, capped at `maxReplicas=3`
+3. **Scaled to 3 replicas** within ~1-2 minutes
+4. Load removed — gradual scale-down: 3 → 2 → 1 (one pod removed per minute)
+
+## Scaling Math
+
+The ScaledObject uses `metricType: AverageValue` with `threshold: 5`. For
+external metrics, HPA computes:
+
+```
+desiredReplicas = ceil(totalMetricValue / threshold)
+```
+
+| Total tok/s | Desired replicas | Actual (capped 1-3) |
+|-------------|------------------|---------------------|
+| 0-5         | 1                | 1                   |
+| 5.1-10      | 2                | 2                   |
+| 10.1-15     | 3                | 3                   |
+| 15.1+       | 4+               | 3 (maxReplicas)     |
 
 ## Customization
 
@@ -142,6 +169,17 @@ handles more than 5 tokens/second on average" (`AverageValue` divides the
 query result by replica count). Tune this based on load testing for your
 model and hardware.
 
+**Load generator presets**: the presets in `load-generator.py` are calibrated for
+opt-125m on CPU. If you change the model, hardware, or threshold, you'll need to
+recalibrate. Use `--mode custom` to experiment, and watch the Prometheus metric:
+
+```bash
+# Check the actual metric value KEDA sees
+kubectl run prom-check --rm -it --restart=Never --image=curlimages/curl -- \
+  -s 'http://kube-prometheus-stack-prometheus.monitoring.svc.cluster.local:9090/prometheus/api/v1/query' \
+  --data-urlencode 'query=sum(rate(vllm:prompt_tokens_total{model_name="opt-125m"}[2m])) + sum(rate(vllm:generation_tokens_total{model_name="opt-125m"}[2m]))'
+```
+
 **GPU deployments**: remove `--dtype=float32` and `--max-model-len=512`
 from the InferenceService args, add GPU resource requests, and consider
 adding a second trigger for GPU KV-cache utilization:
diff --git a/serving/kserve-keda-autoscaling/load-generator.py b/serving/kserve-keda-autoscaling/load-generator.py
new file mode 100644
index 0000000..f33381a
--- /dev/null
+++ b/serving/kserve-keda-autoscaling/load-generator.py
@@ -0,0 +1,261 @@
+#!/usr/bin/env python3
+"""
+Load generator for vLLM-based KServe InferenceService autoscaling demos.
+
+Provides two preset scenarios to demonstrate KEDA autoscaling behaviors:
+
+  stable-2   - Sustained moderate load that triggers a stable scale-up to 2 replicas
+  stable-3   - Sustained heavy load that triggers a stable scale-up to 3 replicas
+
+You can also run in 'custom' mode to specify your own concurrency and sleep values.
+
+Usage (run from a terminal with network access to the service, e.g. a Kubeflow notebook):
+    python load-generator.py --mode stable-2
+    python load-generator.py --mode stable-3 --duration 300
+    python load-generator.py --mode custom --workers 3 --sleep 2.0
+
+Press Ctrl+C to stop the load at any time.
+"""
+
+import argparse
+import json
+import signal
+import threading
+import time
+import urllib.request
+import urllib.error
+
+# ---------------------------------------------------------------------------
+# Preset configurations
+#
+# Each preset defines (workers, sleep_between_requests).
+# "workers" is the number of concurrent request loops.
+# "sleep_between_requests" is how long each worker pauses (seconds) after
+# receiving a response before sending the next request.
+#
+# The ScaledObject uses:
+#   metricType: AverageValue   (total metric value / current replicas)
+#   threshold: 5               (tokens/sec per replica)
+#
+# HPA computes:  desiredReplicas = ceil(total_tok_per_sec / threshold)
+#
+#   stable-2:  target ~8 tok/s total   -> ceil(8/5) = 2
+#   stable-3:  target ~22 tok/s total  -> ceil(22/5) = 5, capped at maxReplicas=3
+#
+# Calibrated for opt-125m on CPU (float32, --max-model-len=512).
+# Each request averages ~116 tokens and ~6.7s of processing time.
+# Effective rate per worker ≈ 116 / (6.7 + sleep) tok/s.
+# ---------------------------------------------------------------------------
+
+PRESETS = {
+    "stable-2": {"workers": 1, "sleep": 8.0},
+    "stable-3": {"workers": 2, "sleep": 2.0},
+}
+
+# Default model endpoint (Kubernetes service DNS name in RawDeployment mode)
+DEFAULT_URL = "http://opt-125m-predictor/openai/v1/completions"
+DEFAULT_DURATION = 600  # 10 minutes
+DEFAULT_MAX_TOKENS = 200
+DEFAULT_PROMPT = (
+    "Write a long detailed story about a dragon who discovers a hidden kingdom"
+)
+
+# ---------------------------------------------------------------------------
+# Globals for stats
+# ---------------------------------------------------------------------------
+stats_lock = threading.Lock()
+total_requests = 0
+total_tokens = 0
+total_errors = 0
+start_time: float = 0.0
+stop_event = threading.Event()
+
+
+def send_request(url: str, prompt: str, max_tokens: int) -> dict | None:
+    """Send a single completion request to the vLLM endpoint."""
+    payload = json.dumps(
+        {
+            "model": "opt-125m",
+            "prompt": prompt,
+            "max_tokens": max_tokens,
+        }
+    ).encode("utf-8")
+
+    req = urllib.request.Request(
+        url,
+        data=payload,
+        headers={"Content-Type": "application/json"},
+    )
+    try:
+        with urllib.request.urlopen(req, timeout=120) as resp:
+            return json.loads(resp.read().decode("utf-8"))
+    except (urllib.error.URLError, OSError, json.JSONDecodeError):
+        return None
+
+
+def worker_loop(
+    worker_id: int, url: str, prompt: str, max_tokens: int, sleep_sec: float
+):
+    """Continuously send requests with a sleep between each, until stop_event is set."""
+    global total_requests, total_tokens, total_errors
+
+    while not stop_event.is_set():
+        result = send_request(url, prompt, max_tokens)
+
+        with stats_lock:
+            if result and "usage" in result:
+                total_requests += 1
+                total_tokens += result["usage"].get("total_tokens", 0)
+            else:
+                total_errors += 1
+
+        # Sleep between requests (interruptible via stop_event)
+        if sleep_sec > 0 and not stop_event.is_set():
+            stop_event.wait(timeout=sleep_sec)
+
+
+def print_stats():
+    """Periodically print throughput stats."""
+    while not stop_event.is_set():
+        stop_event.wait(timeout=10)
+        if stop_event.is_set():
+            break
+        elapsed = time.time() - start_time
+        with stats_lock:
+            tok_rate = total_tokens / elapsed if elapsed > 0 else 0
+            req_rate = total_requests / elapsed if elapsed > 0 else 0
+            print(
+                f"  [{elapsed:6.0f}s] requests={total_requests}  "
+                f"tokens={total_tokens}  errors={total_errors}  "
+                f"avg_tok/s={tok_rate:.1f}  avg_req/s={req_rate:.2f}"
+            )
+
+
+def main():
+    global start_time
+
+    parser = argparse.ArgumentParser(
+        description="Load generator for KServe + KEDA autoscaling demo",
+        formatter_class=argparse.RawDescriptionHelpFormatter,
+        epilog=__doc__,
+    )
+    parser.add_argument(
+        "--mode",
+        choices=["stable-2", "stable-3", "custom"],
+        default="stable-2",
+        help="Load preset (default: stable-2)",
+    )
+    parser.add_argument(
+        "--workers",
+        type=int,
+        default=None,
+        help="Number of concurrent workers (custom mode)",
+    )
+    parser.add_argument(
+        "--sleep",
+        type=float,
+        default=None,
+        help="Sleep seconds between requests per worker (custom mode)",
+    )
+    parser.add_argument(
+        "--url",
+        default=DEFAULT_URL,
+        help=f"Service URL (default: {DEFAULT_URL})",
+    )
+    parser.add_argument(
+        "--duration",
+        type=int,
+        default=DEFAULT_DURATION,
+        help=f"Duration in seconds (default: {DEFAULT_DURATION})",
+    )
+    parser.add_argument(
+        "--max-tokens",
+        type=int,
+        default=DEFAULT_MAX_TOKENS,
+        help=f"Max tokens per request (default: {DEFAULT_MAX_TOKENS})",
+    )
+    parser.add_argument(
+        "--prompt",
+        default=DEFAULT_PROMPT,
+        help="Prompt text",
+    )
+
+    args = parser.parse_args()
+
+    # Resolve configuration
+    if args.mode == "custom":
+        if args.workers is None or args.sleep is None:
+            parser.error("--workers and --sleep are required in custom mode")
+        workers = args.workers
+        sleep_sec = args.sleep
+    else:
+        preset = PRESETS[args.mode]
+        workers = args.workers if args.workers is not None else preset["workers"]
+        sleep_sec = args.sleep if args.sleep is not None else preset["sleep"]
+
+    print("=== Load Generator ===")
+    print(f"  Mode:       {args.mode}")
+    print(f"  Workers:    {workers}")
+    print(f"  Sleep:      {sleep_sec}s between requests")
+    print(f"  URL:        {args.url}")
+    print(f"  Duration:   {args.duration}s")
+    print(f"  Max tokens: {args.max_tokens}")
+    print()
+    print("Starting load... (Ctrl+C to stop)")
+    print()
+
+    # Handle Ctrl+C gracefully
+    def signal_handler(sig, frame):
+        print("\n\nStopping load...")
+        stop_event.set()
+
+    signal.signal(signal.SIGINT, signal_handler)
+    signal.signal(signal.SIGTERM, signal_handler)
+
+    start_time = time.time()
+
+    # Start stats printer
+    stats_thread = threading.Thread(target=print_stats, daemon=True)
+    stats_thread.start()
+
+    # Start worker threads
+    threads = []
+    for i in range(workers):
+        t = threading.Thread(
+            target=worker_loop,
+            args=(i, args.url, args.prompt, args.max_tokens, sleep_sec),
+            daemon=True,
+        )
+        t.start()
+        threads.append(t)
+
+    # Wait for duration or Ctrl+C
+    try:
+        stop_event.wait(timeout=args.duration)
+    except KeyboardInterrupt:
+        pass
+
+    stop_event.set()
+
+    # Wait for threads to finish
+    for t in threads:
+        t.join(timeout=5)
+
+    # Final stats
+    elapsed = time.time() - start_time
+    with stats_lock:
+        tok_rate = total_tokens / elapsed if elapsed > 0 else 0
+        req_rate = total_requests / elapsed if elapsed > 0 else 0
+
+    print()
+    print("=== Final Stats ===")
+    print(f"  Duration:   {elapsed:.1f}s")
+    print(f"  Requests:   {total_requests}")
+    print(f"  Tokens:     {total_tokens}")
+    print(f"  Errors:     {total_errors}")
+    print(f"  Avg tok/s:  {tok_rate:.1f}")
+    print(f"  Avg req/s:  {req_rate:.2f}")
+
+
+if __name__ == "__main__":
+    main()
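Review note (not part of the patch): the scaling math in the README and the calibration comment at the top of `load-generator.py` can be checked together with a few lines of Python. This is a minimal sketch, assuming the figures quoted in this patch (threshold of 5 tok/s, replicas bounded to 1-3, roughly 116 tokens and 6.7 s per request). The function and constant names are hypothetical, and the sketch ignores HPA tolerance, the cooldown period, and stabilization windows.

```python
import math

# Values quoted in the patch; adjust them if the ScaledObject or model changes.
THRESHOLD_TOK_PER_SEC = 5.0      # ScaledObject threshold (AverageValue)
MIN_REPLICAS, MAX_REPLICAS = 1, 3
TOKENS_PER_REQUEST = 116.0       # calibration average for opt-125m on CPU
SECONDS_PER_REQUEST = 6.7        # calibration average processing time


def estimated_tok_per_sec(workers: int, sleep_sec: float) -> float:
    """Rough total throughput: workers * tokens / (processing time + sleep)."""
    return workers * TOKENS_PER_REQUEST / (SECONDS_PER_REQUEST + sleep_sec)


def desired_replicas(total_tok_per_sec: float) -> int:
    """ceil(total / threshold), clamped to the replica bounds."""
    raw = math.ceil(total_tok_per_sec / THRESHOLD_TOK_PER_SEC)
    return max(MIN_REPLICAS, min(MAX_REPLICAS, raw))


if __name__ == "__main__":
    for name, workers, sleep_sec in [("stable-2", 1, 8.0), ("stable-3", 2, 2.0)]:
        rate = estimated_tok_per_sec(workers, sleep_sec)
        print(f"{name}: ~{rate:.1f} tok/s -> {desired_replicas(rate)} replicas")
```

For `stable-3` the formula gives a somewhat higher estimate than the ~22 tok/s target quoted in the comment. One plausible reason (an assumption, not something measured here) is that per-request processing time grows when two requests run concurrently. Either value lands above 15 tok/s, so the result is capped at `maxReplicas=3`.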

From ae78abb77f4e89e9e3aad1bbd4a51625d39aa5bd Mon Sep 17 00:00:00 2001
From: Igor Kvachenok <igor.kvachenok@prokube.ai>
Date: Thu, 19 Mar 2026 15:26:05 +0100
Subject: [PATCH 15/17] Update dashboards in readme

---
 serving/kserve-keda-autoscaling/README.md | 22 +++++++++++++++++-----
 1 file changed, 17 insertions(+), 5 deletions(-)

diff --git a/serving/kserve-keda-autoscaling/README.md b/serving/kserve-keda-autoscaling/README.md
index 77cbb6e..0c27586 100644
--- a/serving/kserve-keda-autoscaling/README.md
+++ b/serving/kserve-keda-autoscaling/README.md
@@ -48,11 +48,12 @@ kubectl get scaledobject
 kubectl get hpa
 ```
 
-> [!WARNING]
+> **NOTE 1.**  
 > If step 3 fails with `no matches for kind "ScaledObject"`, KEDA is not installed in your cluster.
 > Ask your admin to enable it.
 
-> [!WARNING]
+
+> **NOTE 2.**  
 > The Prometheus query in `scaled-object.yaml` has no `namespace` filter, so it aggregates token
 > throughput across **all namespaces**. This is fine for testing, but if multiple users deploy a
 > model named `opt-125m` at the same time, their metrics will interfere and autoscaling will be
@@ -123,9 +124,20 @@ kubectl get hpa keda-hpa-opt-125m-scaledobject
 ```
 
 **Grafana dashboards** (prokube clusters): to visualize token throughput and replica count over time, see:
-- vLLM Performance Statistics: https://<YOUR_DOMAIN>/grafana/d/performance-statistics/vllm-performance-statistics
-- vLLM Query Statistics: https://<YOUR_DOMAIN>/grafana/d/query-statistics4/vllm-query-statistics
-- Replica count: https://<YOUR_DOMAIN>/grafana/d/demqj48/kubernetes-compute-resources-workload-copy
+
+- General vLLM Dashboard:  
+  https://YOUR_DOMAIN/grafana/d/b281712d-8bff-41ef-9f3f-71ad43c05e9b/vllm
+
+- vLLM Performance Statistics:  
+  https://YOUR_DOMAIN/grafana/d/performance-statistics/vllm-performance-statistics
+
+- vLLM Query Statistics:  
+  https://YOUR_DOMAIN/grafana/d/query-statistics4/vllm-query-statistics
+
+- Replica count:  
+  https://YOUR_DOMAIN/grafana/d/a164a7f0339f99e89cea5cb47e9be617/kubernetes-compute-resources-workload
+
+Replace `YOUR_DOMAIN` with your cluster domain.
 
 ### Expected behavior
 

From d0635cde868e27e7ccc33b6e9d42010cf2487261 Mon Sep 17 00:00:00 2001
From: Igor Kvachenok <igor.kvachenok@prokube.ai>
Date: Sun, 22 Mar 2026 12:38:32 +0100
Subject: [PATCH 16/17] Add a mermaid diagram to illustrate KEDA

---
 serving/kserve-keda-autoscaling/README.md | 51 +++++++++++++++++++++++
 1 file changed, 51 insertions(+)

diff --git a/serving/kserve-keda-autoscaling/README.md b/serving/kserve-keda-autoscaling/README.md
index 0c27586..11642af 100644
--- a/serving/kserve-keda-autoscaling/README.md
+++ b/serving/kserve-keda-autoscaling/README.md
@@ -8,6 +8,57 @@ which is better suited for LLM inference workloads.
 For full documentation, see the
 [prokube autoscaling docs](https://docs.prokube.cloud/user_docs/model_serving_autoscaling/#keda-kubernetes-event-driven-autoscaling).
 
+## Architecture
+
+```mermaid
+flowchart LR
+
+%% ---------- STYLES ----------
+classDef kserve fill:#E8F0FE,stroke:#1A73E8,stroke-width:2px,color:#0B3D91
+classDef pod fill:#F1F8E9,stroke:#558B2F,stroke-width:2px
+classDef infra fill:#FFF8E1,stroke:#FF8F00,stroke-width:2px
+classDef traffic fill:#FCE4EC,stroke:#C2185B,stroke-width:2px
+
+%% ---------- KSERVE ----------
+subgraph KServe["<b> KServe </b>"]
+    direction TB
+
+    ISVC["<b>InferenceService</b><br/>opt-125m"]:::kserve
+    DEP["<b>Deployment</b><br/>opt-125m-predictor"]:::kserve
+
+    subgraph POD["Predictor <b>Pod</b> ×1–3"]
+        CTR["<b>kserve-container</b><br/>HuggingFace runtime · vLLM engine<br/>facebook/opt-125m :8080"]:::pod
+    end
+
+    ISVC -->|creates| DEP
+    DEP -->|manages| CTR
+end
+
+%% ---------- OBSERVABILITY ----------
+subgraph Observability["<b> Observability </b>"]
+    PROM[("Prometheus")]:::infra
+end
+
+%% ---------- KEDA ----------
+subgraph KEDA["<b> KEDA Autoscaling </b>"]
+    direction TB
+    SO["<b>ScaledObject</b><br/>Prometheus trigger<br/>threshold: 5 tok/s"]:::infra
+    HPA["<b>HorizontalPodAutoscaler</b>"]:::infra
+
+    SO -->|creates & drives| HPA
+end
+
+%% ---------- LOAD ----------
+LG["⚡ load-generator.py"]:::traffic
+
+%% ---------- FLOWS ----------
+CTR -->|/metrics| PROM
+PROM -->|query every 15s| SO
+HPA -->|scales| DEP
+SO -. targets .-> DEP
+LG -->|POST /completions| CTR
+```
+
 ## Why Token Throughput?
 
 LLM requests vary wildly in duration depending on prompt and output length.

From 5fec79301bbc5105e2cf76bdde9cdbe906d0f8cd Mon Sep 17 00:00:00 2001
From: Igor Kvachenok <igor.kvachenok@prokube.ai>
Date: Mon, 23 Mar 2026 11:18:26 +0100
Subject: [PATCH 17/17] Prettify dashboard name and readme nitpick

---
 serving/kserve-keda-autoscaling/README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/serving/kserve-keda-autoscaling/README.md b/serving/kserve-keda-autoscaling/README.md
index 11642af..e34d5be 100644
--- a/serving/kserve-keda-autoscaling/README.md
+++ b/serving/kserve-keda-autoscaling/README.md
@@ -177,7 +177,7 @@ kubectl get hpa keda-hpa-opt-125m-scaledobject
 **Grafana dashboards** (prokube clusters): to visualize token throughput and replica count over time, see:
 
 - General vLLM Dashboard:  
-  https://YOUR_DOMAIN/grafana/d/b281712d-8bff-41ef-9f3f-71ad43c05e9b/vllm
+  https://YOUR_DOMAIN/grafana/d/vllm-general/vllm
 
 - vLLM Performance Statistics:  
   https://YOUR_DOMAIN/grafana/d/performance-statistics/vllm-performance-statistics
@@ -185,7 +185,7 @@ kubectl get hpa keda-hpa-opt-125m-scaledobject
 - vLLM Query Statistics:  
   https://YOUR_DOMAIN/grafana/d/query-statistics4/vllm-query-statistics
 
-- Replica count:  
+- Replica count and CPU load (select your namespace and workload manually):  
   https://YOUR_DOMAIN/grafana/d/a164a7f0339f99e89cea5cb47e9be617/kubernetes-compute-resources-workload
 
 Replace `YOUR_DOMAIN` with your cluster domain.