diff --git a/serving/kserve-keda-autoscaling/README.md b/serving/kserve-keda-autoscaling/README.md
new file mode 100644
index 0000000..e34d5be
--- /dev/null
+++ b/serving/kserve-keda-autoscaling/README.md
@@ -0,0 +1,266 @@
+# KServe Autoscaling with KEDA and Custom Prometheus Metrics
+
+This example demonstrates autoscaling a KServe InferenceService using
+[KEDA](https://keda.sh/) with custom Prometheus metrics from vLLM.
+It scales based on total token throughput rather than simple request count,
+which is better suited for LLM inference workloads.
+
+For full documentation, see the
+[prokube autoscaling docs](https://docs.prokube.cloud/user_docs/model_serving_autoscaling/#keda-kubernetes-event-driven-autoscaling).
+
+## Architecture
+
+```mermaid
+flowchart LR
+
+%% ---------- STYLES ----------
+classDef kserve fill:#E8F0FE,stroke:#1A73E8,stroke-width:2px,color:#0B3D91
+classDef pod fill:#F1F8E9,stroke:#558B2F,stroke-width:2px
+classDef infra fill:#FFF8E1,stroke:#FF8F00,stroke-width:2px
+classDef traffic fill:#FCE4EC,stroke:#C2185B,stroke-width:2px
+
+%% ---------- KSERVE ----------
+subgraph KServe[" KServe "]
+ direction TB
+
+  ISVC["InferenceService
+  opt-125m"]:::kserve
+  DEP["Deployment
+  opt-125m-predictor"]:::kserve
+
+ subgraph POD["Predictor Pod ×1–3"]
+    CTR["kserve-container
+    HuggingFace runtime · vLLM engine
+    facebook/opt-125m :8080"]:::pod
+ end
+
+ ISVC -->|creates| DEP
+ DEP -->|manages| CTR
+end
+
+%% ---------- OBSERVABILITY ----------
+subgraph Observability[" Observability "]
+ PROM[("Prometheus")]:::infra
+end
+
+%% ---------- KEDA ----------
+subgraph KEDA[" KEDA Autoscaling "]
+ direction TB
+  SO["ScaledObject
+  Prometheus trigger
+  threshold: 5 tok/s"]:::infra
+ HPA["HorizontalPodAutoscaler"]:::infra
+
+ SO -->|creates & drives| HPA
+end
+
+%% ---------- LOAD ----------
+LG["⚡ load-generator.py"]:::traffic
+
+%% ---------- FLOWS ----------
+CTR -->|/metrics| PROM
+PROM -->|query every 15s| SO
+HPA -->|scales| DEP
+SO -. targets .-> DEP
+LG -->|POST /completions| CTR
+```
+
+## Why Token Throughput?
+
+LLM requests vary wildly in duration depending on prompt and output length.
+Request-count metrics (concurrency, QPS) don't reflect actual GPU load.
+Token throughput stays elevated as long as the model is under pressure,
+making it a stable scaling signal.
+
+## Prerequisites
+
+- KEDA installed in the cluster — not available in all prokube clusters by default; see step 3 below
+- Prometheus scraping vLLM metrics (prokube clusters include a cluster-wide PodMonitor)
+
+## Files
+
+| File | Description |
+|------|-------------|
+| `inference-service.yaml` | KServe InferenceService (OPT-125M, RawDeployment mode) |
+| `scaled-object.yaml` | KEDA ScaledObject — scales on token throughput |
+| `load-generator.py` | Python load generator with presets for different scaling scenarios |
+
+## Quick Start
+
+> [!NOTE]
+> All of the examples below should be run from a terminal inside the cluster, e.g. a prokube notebook. A model deployed in RawDeployment mode is not accessible from outside the cluster by default.
+
+```bash
+# 1. Deploy the InferenceService
+kubectl apply -f inference-service.yaml
+
+# 2. Wait for it to become ready
+kubectl get isvc opt-125m -w
+
+# 3. Deploy the KEDA ScaledObject
+kubectl apply -f scaled-object.yaml
+
+# 4. Verify
+kubectl get scaledobject
+kubectl get hpa
+```
+
+> **NOTE 1.**
+> If step 3 fails with `no matches for kind "ScaledObject"`, KEDA is not installed in your cluster.
+> Ask your admin to enable it.
+
+
+> **NOTE 2.**
+> The Prometheus query in `scaled-object.yaml` has no `namespace` filter, so it aggregates token
+> throughput across **all namespaces**. This is fine for testing, but if multiple users deploy a
+> model named `opt-125m` at the same time, their metrics will interfere and autoscaling will be
+> incorrect for both. For any real use, add a namespace filter to both queries in `scaled-object.yaml`:
+>
+> ```yaml
+> query: >-
+> sum(rate(vllm:prompt_tokens_total{namespace="",model_name="opt-125m"}[2m]))
+> + sum(rate(vllm:generation_tokens_total{namespace="",model_name="opt-125m"}[2m]))
+> ```
+
+## See It in Action
+
+After deploying, you can trigger autoscaling and observe the full scale-up / scale-down cycle.
+
+### 1. Send a test inference request
+
+Get the internal cluster address and send a request:
+
+```bash
+# In RawDeployment mode the predictor Service is named {isvc-name}-predictor
+SERVICE_URL=http://opt-125m-predictor
+
+curl -s "$SERVICE_URL/openai/v1/completions" \
+ -H "Content-Type: application/json" \
+ -d '{"model":"opt-125m","prompt":"What is AI?","max_tokens":64}' \
+ | python -c 'import json,sys;print("\n", json.load(sys.stdin)["choices"][0]["text"].strip(), "\n")'
+```
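The same request can be sent with only the Python standard library, mirroring how `load-generator.py` talks to the endpoint (service name and payload match the curl example above; the commented-out `urlopen` call only works from inside the cluster):

```python
import json
import urllib.request

SERVICE_URL = "http://opt-125m-predictor"  # in-cluster DNS name of the predictor Service

# Same payload as the curl example
payload = json.dumps({
    "model": "opt-125m",
    "prompt": "What is AI?",
    "max_tokens": 64,
}).encode("utf-8")

req = urllib.request.Request(
    f"{SERVICE_URL}/openai/v1/completions",
    data=payload,
    headers={"Content-Type": "application/json"},
)

# From inside the cluster:
# with urllib.request.urlopen(req, timeout=120) as resp:
#     print(json.load(resp)["choices"][0]["text"].strip())
```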
+
+### 2. Generate load to trigger scale-up
+
+Use the included load generator to produce controlled, sustained load.
+It has two presets calibrated for the opt-125m model on CPU:
+
+| Mode | Workers | Sleep | Throughput | Scaling behavior |
+|------|---------|-------|------------|------------------|
+| `stable-2` | 1 | 8s | ~8 tok/s | Scales to 2 replicas and holds |
+| `stable-3` | 2 | 2s | ~22 tok/s | Scales to 3 replicas and holds |
+
+```bash
+# Scale to 2 replicas (moderate load)
+python load-generator.py --mode stable-2
+
+# Scale to 3 replicas (heavy load)
+python load-generator.py --mode stable-3
+
+# Custom: pick your own concurrency and pacing
+python load-generator.py --mode custom --workers 3 --sleep 1.0
+```
+
+Press `Ctrl+C` to stop the load at any time. By default the script runs for
+10 minutes; override with `--duration`.
+
+### 3. Observe autoscaling
+
+You can use dashboards (recommended, see below) or get a compact summary in the terminal:
+
+```bash
+# polls every 10 seconds
+watch -n10 '
+echo "Deployment:"
+kubectl get deployment opt-125m-predictor
+
+echo
+echo "Autoscaler:"
+kubectl get hpa keda-hpa-opt-125m-scaledobject
+'
+```
+
+**Grafana dashboards** (prokube clusters): to visualize token throughput and replica count over time, see:
+
+- General vLLM Dashboard:
+ https://YOUR_DOMAIN/grafana/d/vllm-general/vllm
+
+- vLLM Performance Statistics:
+ https://YOUR_DOMAIN/grafana/d/performance-statistics/vllm-performance-statistics
+
+- vLLM Query Statistics:
+ https://YOUR_DOMAIN/grafana/d/query-statistics4/vllm-query-statistics
+
+- Replica count and CPU load (you have to select your namespace/workload manually):
+ https://YOUR_DOMAIN/grafana/d/a164a7f0339f99e89cea5cb47e9be617/kubernetes-compute-resources-workload
+
+Replace `YOUR_DOMAIN` with your cluster domain.
+
+### Expected behavior
+
+**Stable-2 mode** (~8 tok/s):
+1. **1 replica** at rest
+2. Load applied — metric rises to ~8 tok/s — `ceil(8/5) = 2` replicas needed
+3. **Scaled to 2 replicas** within ~1 minute
+4. Metric stabilizes at ~4 tok/s per replica (below threshold) — stays at 2
+5. Load removed — metric drops to 0 — the 120s scale-down stabilization window elapses
+6. **Scaled back to 1** replica
+
+**Stable-3 mode** (~22 tok/s):
+1. **1 replica** at rest
+2. Load applied — metric rises quickly — `ceil(22/5) = 5`, capped at `maxReplicas=3`
+3. **Scaled to 3 replicas** within ~1-2 minutes
+4. Load removed — gradual scale-down: 3 → 2 → 1 (one pod removed per minute)
+
+## Scaling Math
+
+The ScaledObject uses `metricType: AverageValue` with `threshold: 5`. For
+external metrics, HPA computes:
+
+```
+desiredReplicas = ceil(totalMetricValue / threshold)
+```
+
+| Total tok/s | Desired replicas | Actual (capped 1-3) |
+|-------------|------------------|---------------------|
+| 0-5 | 1 | 1 |
+| 5.1-10 | 2 | 2 |
+| 10.1-15 | 3 | 3 |
+| 15+ | 4+ | 3 (maxReplicas) |
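The table above can be reproduced with a few lines of Python (a sketch of the HPA external-metric formula, not KEDA's actual implementation; the `min`/`max` bounds mirror `minReplicaCount`/`maxReplicaCount` from `scaled-object.yaml`):

```python
import math

def desired_replicas(total_tok_per_s: float, threshold: float = 5.0,
                     min_replicas: int = 1, max_replicas: int = 3) -> int:
    """Replica count the HPA targets for an AverageValue external metric."""
    if total_tok_per_s <= 0:
        return min_replicas
    desired = math.ceil(total_tok_per_s / threshold)
    return max(min_replicas, min(max_replicas, desired))

for load in (0, 4, 8, 12, 22):
    print(f"{load:>4} tok/s -> {desired_replicas(load)} replicas")
```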
+
+## Customization
+
+**Model name**: the `model_name="opt-125m"` filter in the Prometheus queries inside
+`scaled-object.yaml` must match the `--model_name` argument in `inference-service.yaml`.
+
+**Threshold**: the `threshold: "5"` value means "scale up when each replica
+handles more than 5 tokens/second on average" (`AverageValue` divides the
+query result by replica count). Tune this based on load testing for your
+model and hardware.
+
+**Load generator presets**: the presets in `load-generator.py` are calibrated for
+opt-125m on CPU. If you change the model, hardware, or threshold, you'll need to
+recalibrate. Use `--mode custom` to experiment, and watch the Prometheus metric:
+
+```bash
+# Check the actual metric value KEDA sees
+kubectl run prom-check --rm -it --restart=Never --image=curlimages/curl -- \
+ -s 'http://kube-prometheus-stack-prometheus.monitoring.svc.cluster.local:9090/prometheus/api/v1/query' \
+ --data-urlencode 'query=sum(rate(vllm:prompt_tokens_total{model_name="opt-125m"}[2m])) + sum(rate(vllm:generation_tokens_total{model_name="opt-125m"}[2m]))'
+```
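The API returns JSON in the standard Prometheus instant-query format; a small helper can pull out the scalar that KEDA compares against the threshold (the sample response below is illustrative, not real cluster output):

```python
import json

def metric_value(resp: dict) -> float:
    """Extract the scalar from a Prometheus instant-query vector result (0.0 if empty)."""
    result = resp["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

# Illustrative response shape from /api/v1/query
sample = json.loads("""
{"status": "success",
 "data": {"resultType": "vector",
          "result": [{"metric": {}, "value": [1700000000.0, "7.8"]}]}}
""")

print(metric_value(sample))  # 7.8
```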
+
+**GPU deployments**: remove `--dtype=float32` and `--max-model-len=512`
+from the InferenceService args, add GPU resource requests, and consider
+adding a second trigger for GPU KV-cache utilization:
+
+```yaml
+# Add to scaled-object.yaml triggers list
+- type: prometheus
+ metadata:
+ serverAddress: http://kube-prometheus-stack-prometheus.monitoring.svc.cluster.local:9090/prometheus
+ query: >-
+ avg(vllm:gpu_cache_usage_perc{model_name="my-model"})
+ metricType: AverageValue
+ threshold: "0.75"
+```
+
+## References
+
+- [prokube autoscaling documentation](https://docs.prokube.cloud/user_docs/model_serving_autoscaling/)
+- [KServe KEDA autoscaler docs](https://kserve.github.io/website/docs/model-serving/predictive-inference/autoscaling/keda-autoscaler)
+- [KEDA Prometheus scaler](https://keda.sh/docs/scalers/prometheus/)
+- [vLLM metrics reference](https://docs.vllm.ai/en/latest/serving/metrics.html)
diff --git a/serving/kserve-keda-autoscaling/inference-service.yaml b/serving/kserve-keda-autoscaling/inference-service.yaml
new file mode 100644
index 0000000..f7a755b
--- /dev/null
+++ b/serving/kserve-keda-autoscaling/inference-service.yaml
@@ -0,0 +1,40 @@
+# KServe InferenceService for OPT-125M with vLLM backend.
+# Uses RawDeployment mode — required when scaling with KEDA.
+#
+# This example runs on CPU. For GPU, remove --dtype=float32 and
+# --max-model-len, and adjust resources to request nvidia.com/gpu.
+apiVersion: serving.kserve.io/v1beta1
+kind: InferenceService
+metadata:
+ name: opt-125m
+ annotations:
+ # RawDeployment mode — creates a plain Deployment instead of a Knative Revision.
+ serving.kserve.io/deploymentMode: "RawDeployment"
+ # Tell KServe not to create its own HPA (KEDA will manage scaling).
+ serving.kserve.io/autoscalerClass: "external"
+spec:
+ predictor:
+ minReplicas: 1
+ maxReplicas: 3
+ model:
+ modelFormat:
+ name: huggingface
+ args:
+ - --model_name=opt-125m
+ - --model_id=facebook/opt-125m
+ - --backend=vllm
+ - --dtype=float32
+ - --max-model-len=512
+ # Explicit port declaration is required in RawDeployment mode
+ # for the cluster-wide PodMonitor to discover the metrics endpoint.
+ ports:
+ - name: user-port
+ containerPort: 8080
+ protocol: TCP
+ resources:
+ requests:
+ cpu: "2"
+ memory: 4Gi
+ limits:
+ cpu: "4"
+ memory: 8Gi
diff --git a/serving/kserve-keda-autoscaling/load-generator.py b/serving/kserve-keda-autoscaling/load-generator.py
new file mode 100644
index 0000000..f33381a
--- /dev/null
+++ b/serving/kserve-keda-autoscaling/load-generator.py
@@ -0,0 +1,261 @@
+#!/usr/bin/env python3
+"""
+Load generator for vLLM-based KServe InferenceService autoscaling demos.
+
+Provides two preset scenarios to demonstrate KEDA autoscaling behaviors:
+
+ stable-2 - Sustained moderate load that triggers a stable scale-up to 2 replicas
+ stable-3 - Sustained heavy load that triggers a stable scale-up to 3 replicas
+
+You can also run in 'custom' mode to specify your own concurrency and sleep values.
+
+Usage (run from a terminal with network access to the service, e.g. a Kubeflow notebook):
+ python load-generator.py --mode stable-2
+ python load-generator.py --mode stable-3 --duration 300
+ python load-generator.py --mode custom --workers 3 --sleep 2.0
+
+Press Ctrl+C to stop the load at any time.
+"""
+
+import argparse
+import json
+import signal
+import threading
+import time
+import urllib.request
+import urllib.error
+
+# ---------------------------------------------------------------------------
+# Preset configurations
+#
+# Each preset defines (workers, sleep_between_requests).
+# "workers" is the number of concurrent request loops.
+# "sleep_between_requests" is how long each worker pauses (seconds) after
+# receiving a response before sending the next request.
+#
+# The ScaledObject uses:
+# metricType: AverageValue (total metric value / current replicas)
+# threshold: 5 (tokens/sec per replica)
+#
+# HPA computes: desiredReplicas = ceil(total_tok_per_sec / threshold)
+#
+# stable-2: target ~8 tok/s total -> ceil(8/5) = 2
+# stable-3: target ~22 tok/s total -> ceil(22/5) = 5, capped at maxReplicas=3
+#
+# Calibrated for opt-125m on CPU (float32, --max-model-len=512).
+# Each request averages ~116 tokens and ~6.7s of processing time.
+# Effective rate per worker ≈ 116 / (6.7 + sleep) tok/s.
+# ---------------------------------------------------------------------------
+
+PRESETS = {
+ "stable-2": {"workers": 1, "sleep": 8.0},
+ "stable-3": {"workers": 2, "sleep": 2.0},
+}
+
+# Default model endpoint (Kubernetes service DNS name in RawDeployment mode)
+DEFAULT_URL = "http://opt-125m-predictor/openai/v1/completions"
+DEFAULT_DURATION = 600 # 10 minutes
+DEFAULT_MAX_TOKENS = 200
+DEFAULT_PROMPT = (
+ "Write a long detailed story about a dragon who discovers a hidden kingdom"
+)
+
+# ---------------------------------------------------------------------------
+# Globals for stats
+# ---------------------------------------------------------------------------
+stats_lock = threading.Lock()
+total_requests = 0
+total_tokens = 0
+total_errors = 0
+start_time: float = 0.0
+stop_event = threading.Event()
+
+
+def send_request(url: str, prompt: str, max_tokens: int) -> dict | None:
+ """Send a single completion request to the vLLM endpoint."""
+ payload = json.dumps(
+ {
+ "model": "opt-125m",
+ "prompt": prompt,
+ "max_tokens": max_tokens,
+ }
+ ).encode("utf-8")
+
+ req = urllib.request.Request(
+ url,
+ data=payload,
+ headers={"Content-Type": "application/json"},
+ )
+ try:
+ with urllib.request.urlopen(req, timeout=120) as resp:
+ return json.loads(resp.read().decode("utf-8"))
+ except (urllib.error.URLError, OSError, json.JSONDecodeError):
+ return None
+
+
+def worker_loop(
+ worker_id: int, url: str, prompt: str, max_tokens: int, sleep_sec: float
+):
+ """Continuously send requests with a sleep between each, until stop_event is set."""
+ global total_requests, total_tokens, total_errors
+
+ while not stop_event.is_set():
+ result = send_request(url, prompt, max_tokens)
+
+ with stats_lock:
+ if result and "usage" in result:
+ total_requests += 1
+ total_tokens += result["usage"].get("total_tokens", 0)
+ else:
+ total_errors += 1
+
+ # Sleep between requests (interruptible via stop_event)
+ if sleep_sec > 0 and not stop_event.is_set():
+ stop_event.wait(timeout=sleep_sec)
+
+
+def print_stats():
+ """Periodically print throughput stats."""
+ while not stop_event.is_set():
+ stop_event.wait(timeout=10)
+ if stop_event.is_set():
+ break
+ elapsed = time.time() - start_time
+ with stats_lock:
+ tok_rate = total_tokens / elapsed if elapsed > 0 else 0
+ req_rate = total_requests / elapsed if elapsed > 0 else 0
+ print(
+ f" [{elapsed:6.0f}s] requests={total_requests} "
+ f"tokens={total_tokens} errors={total_errors} "
+ f"avg_tok/s={tok_rate:.1f} avg_req/s={req_rate:.2f}"
+ )
+
+
+def main():
+ global start_time
+
+ parser = argparse.ArgumentParser(
+ description="Load generator for KServe + KEDA autoscaling demo",
+ formatter_class=argparse.RawDescriptionHelpFormatter,
+ epilog=__doc__,
+ )
+ parser.add_argument(
+ "--mode",
+ choices=["stable-2", "stable-3", "custom"],
+ default="stable-2",
+ help="Load preset (default: stable-2)",
+ )
+ parser.add_argument(
+ "--workers",
+ type=int,
+ default=None,
+ help="Number of concurrent workers (custom mode)",
+ )
+ parser.add_argument(
+ "--sleep",
+ type=float,
+ default=None,
+ help="Sleep seconds between requests per worker (custom mode)",
+ )
+ parser.add_argument(
+ "--url",
+ default=DEFAULT_URL,
+ help=f"Service URL (default: {DEFAULT_URL})",
+ )
+ parser.add_argument(
+ "--duration",
+ type=int,
+ default=DEFAULT_DURATION,
+ help=f"Duration in seconds (default: {DEFAULT_DURATION})",
+ )
+ parser.add_argument(
+ "--max-tokens",
+ type=int,
+ default=DEFAULT_MAX_TOKENS,
+ help=f"Max tokens per request (default: {DEFAULT_MAX_TOKENS})",
+ )
+ parser.add_argument(
+ "--prompt",
+ default=DEFAULT_PROMPT,
+ help="Prompt text",
+ )
+
+ args = parser.parse_args()
+
+ # Resolve configuration
+ if args.mode == "custom":
+ if args.workers is None or args.sleep is None:
+ parser.error("--workers and --sleep are required in custom mode")
+ workers = args.workers
+ sleep_sec = args.sleep
+ else:
+ preset = PRESETS[args.mode]
+ workers = args.workers if args.workers is not None else preset["workers"]
+ sleep_sec = args.sleep if args.sleep is not None else preset["sleep"]
+
+ print("=== Load Generator ===")
+ print(f" Mode: {args.mode}")
+ print(f" Workers: {workers}")
+ print(f" Sleep: {sleep_sec}s between requests")
+ print(f" URL: {args.url}")
+ print(f" Duration: {args.duration}s")
+ print(f" Max tokens: {args.max_tokens}")
+ print()
+ print("Starting load... (Ctrl+C to stop)")
+ print()
+
+ # Handle Ctrl+C gracefully
+ def signal_handler(sig, frame):
+ print("\n\nStopping load...")
+ stop_event.set()
+
+ signal.signal(signal.SIGINT, signal_handler)
+ signal.signal(signal.SIGTERM, signal_handler)
+
+ start_time = time.time()
+
+ # Start stats printer
+ stats_thread = threading.Thread(target=print_stats, daemon=True)
+ stats_thread.start()
+
+ # Start worker threads
+ threads = []
+ for i in range(workers):
+ t = threading.Thread(
+ target=worker_loop,
+ args=(i, args.url, args.prompt, args.max_tokens, sleep_sec),
+ daemon=True,
+ )
+ t.start()
+ threads.append(t)
+
+ # Wait for duration or Ctrl+C
+ try:
+ stop_event.wait(timeout=args.duration)
+ except KeyboardInterrupt:
+ pass
+
+ stop_event.set()
+
+ # Wait for threads to finish
+ for t in threads:
+ t.join(timeout=5)
+
+ # Final stats
+ elapsed = time.time() - start_time
+ with stats_lock:
+ tok_rate = total_tokens / elapsed if elapsed > 0 else 0
+ req_rate = total_requests / elapsed if elapsed > 0 else 0
+
+ print()
+ print("=== Final Stats ===")
+ print(f" Duration: {elapsed:.1f}s")
+ print(f" Requests: {total_requests}")
+ print(f" Tokens: {total_tokens}")
+ print(f" Errors: {total_errors}")
+ print(f" Avg tok/s: {tok_rate:.1f}")
+ print(f" Avg req/s: {req_rate:.2f}")
+
+
+if __name__ == "__main__":
+ main()
diff --git a/serving/kserve-keda-autoscaling/scaled-object.yaml b/serving/kserve-keda-autoscaling/scaled-object.yaml
new file mode 100644
index 0000000..40036b8
--- /dev/null
+++ b/serving/kserve-keda-autoscaling/scaled-object.yaml
@@ -0,0 +1,43 @@
+# KEDA ScaledObject for KServe InferenceService with vLLM backend.
+# Scales based on total token throughput (prompt + generation) from Prometheus.
+#
+# Prerequisites:
+# - KEDA installed (https://keda.sh/docs/deploy/)
+# - Prometheus scraping vLLM metrics (prokube includes a cluster-wide PodMonitor)
+#
+# Customization:
+# - "opt-125m" in model_name must match the --model_name arg in inference-service.yaml
+# - The serverAddress must match your Prometheus URL
+#
+apiVersion: keda.sh/v1alpha1
+kind: ScaledObject
+metadata:
+ name: opt-125m-scaledobject
+spec:
+ scaleTargetRef:
+ # In RawDeployment mode KServe names the Deployment {isvc-name}-predictor.
+ name: opt-125m-predictor
+ minReplicaCount: 1
+ maxReplicaCount: 3
+ pollingInterval: 15 # how often KEDA checks the metric (seconds)
+  cooldownPeriod: 120   # seconds after the last trigger activation before scaling to zero (only applies when minReplicaCount is 0)
+ advanced:
+ horizontalPodAutoscalerConfig:
+ behavior:
+ scaleUp:
+ stabilizationWindowSeconds: 30 # short window to absorb metric noise before committing to scale-up
+ scaleDown:
+ stabilizationWindowSeconds: 120
+ policies:
+ - type: Pods
+ value: 1 # remove at most 1 replica per minute
+ periodSeconds: 60
+ triggers:
+ - type: prometheus
+ metadata:
+ serverAddress: http://kube-prometheus-stack-prometheus.monitoring.svc.cluster.local:9090/prometheus
+ query: >-
+ sum(rate(vllm:prompt_tokens_total{model_name="opt-125m"}[2m]))
+ + sum(rate(vllm:generation_tokens_total{model_name="opt-125m"}[2m]))
+ metricType: AverageValue
+ threshold: "5"