diff --git a/serving/kserve-keda-autoscaling/README.md b/serving/kserve-keda-autoscaling/README.md
new file mode 100644
index 0000000..e34d5be
--- /dev/null
+++ b/serving/kserve-keda-autoscaling/README.md
@@ -0,0 +1,266 @@
+# KServe Autoscaling with KEDA and Custom Prometheus Metrics
+
+This example demonstrates autoscaling a KServe InferenceService using
+[KEDA](https://keda.sh/) with custom Prometheus metrics from vLLM.
+It scales based on total token throughput rather than simple request count,
+which is better suited for LLM inference workloads.
+
+For full documentation, see the
+[prokube autoscaling docs](https://docs.prokube.cloud/user_docs/model_serving_autoscaling/#keda-kubernetes-event-driven-autoscaling).
+
+## Architecture
+
+```mermaid
+flowchart LR
+
+%% ---------- STYLES ----------
+classDef kserve fill:#E8F0FE,stroke:#1A73E8,stroke-width:2px,color:#0B3D91
+classDef pod fill:#F1F8E9,stroke:#558B2F,stroke-width:2px
+classDef infra fill:#FFF8E1,stroke:#FF8F00,stroke-width:2px
+classDef traffic fill:#FCE4EC,stroke:#C2185B,stroke-width:2px
+
+%% ---------- KSERVE ----------
+subgraph KServe[" KServe "]
+    direction TB
+
+    ISVC["InferenceService<br/>opt-125m"]:::kserve
+    DEP["Deployment<br/>opt-125m-predictor"]:::kserve
+
+    subgraph POD["Predictor Pod ×1–3"]
+        CTR["kserve-container<br/>HuggingFace runtime · vLLM engine<br/>facebook/opt-125m :8080"]:::pod
+    end
+
+    ISVC -->|creates| DEP
+    DEP -->|manages| CTR
+end
+
+%% ---------- OBSERVABILITY ----------
+subgraph Observability[" Observability "]
+    PROM[("Prometheus")]:::infra
+end
+
+%% ---------- KEDA ----------
+subgraph KEDA[" KEDA Autoscaling "]
+    direction TB
+    SO["ScaledObject<br/>Prometheus trigger<br/>threshold: 5 tok/s"]:::infra
+    HPA["HorizontalPodAutoscaler"]:::infra
+
+    SO -->|creates & drives| HPA
+end
+
+%% ---------- LOAD ----------
+LG["⚡ load-generator.py"]:::traffic
+
+%% ---------- FLOWS ----------
+CTR -->|/metrics| PROM
+PROM -->|query every 15s| SO
+HPA -->|scales| DEP
+SO -. targets .-> DEP
+LG -->|POST /completions| CTR
+```
+
+## Why Token Throughput?
+
+LLM requests vary wildly in duration depending on prompt and output length.
+Request-count metrics (concurrency, QPS) don't reflect actual GPU load.
+Token throughput stays elevated as long as the model is under pressure,
+making it a stable scaling signal.
+
+## Prerequisites
+
+- KEDA installed in the cluster — not available in all prokube clusters by default; see step 3 below
+- Prometheus scraping vLLM metrics (prokube clusters include a cluster-wide PodMonitor)
+
+## Files
+
+| File | Description |
+|------|-------------|
+| `inference-service.yaml` | KServe InferenceService (OPT-125M, RawDeployment mode) |
+| `scaled-object.yaml` | KEDA ScaledObject — scales on token throughput |
+| `load-generator.py` | Python load generator with presets for different scaling scenarios |
+
+## Quick Start
+
+> [!NOTE]
+> All of the examples below should be run in a prokube notebook terminal inside your cluster. A model created in RawDeployment mode is not accessible from outside the cluster by default.
+
+```bash
+# 1. Deploy the InferenceService
+kubectl apply -f inference-service.yaml
+
+# 2. Wait for it to become ready
+kubectl get isvc opt-125m -w
+
+# 3. Deploy the KEDA ScaledObject
+kubectl apply -f scaled-object.yaml
+
+# 4. Verify
+kubectl get scaledobject
+kubectl get hpa
+```
+
+> **NOTE 1.**
+> If step 3 fails with `no matches for kind "ScaledObject"`, KEDA is not installed in your cluster.
+> Ask your admin to enable it.
+
+> **NOTE 2.**
+> The Prometheus query in `scaled-object.yaml` has no `namespace` filter, so it aggregates token
+> throughput across **all namespaces**. This is fine for testing, but if multiple users deploy a
+> model named `opt-125m` at the same time, their metrics will interfere and autoscaling will be
+> incorrect for both. For any real use, add a namespace filter to both queries in `scaled-object.yaml`:
+>
+> ```yaml
+> query: >-
+>   sum(rate(vllm:prompt_tokens_total{namespace="",model_name="opt-125m"}[2m]))
+>   + sum(rate(vllm:generation_tokens_total{namespace="",model_name="opt-125m"}[2m]))
+> ```
+
+## See It in Action
+
+After deploying, you can trigger autoscaling and observe the full scale-up / scale-down cycle.
+
+### 1. Send a test inference request
+
+Get the internal cluster address and send a request:
+
+```bash
+# inference service name + "-predictor"
+SERVICE_URL=opt-125m-predictor
+
+curl -s "$SERVICE_URL/openai/v1/completions" \
+  -H "Content-Type: application/json" \
+  -d '{"model":"opt-125m","prompt":"What is AI?","max_tokens":64}' \
+  | python -c 'import json,sys;print("\n", json.load(sys.stdin)["choices"][0]["text"].strip(), "\n")'
+```
+
+### 2. Generate load to trigger scale-up
+
+Use the included load generator to produce controlled, sustained load.
+It has two presets calibrated for the opt-125m model on CPU:
+
+| Mode | Workers | Sleep | Throughput | Scaling behavior |
+|------|---------|-------|------------|------------------|
+| `stable-2` | 1 | 8s | ~8 tok/s | Scales to 2 replicas and holds |
+| `stable-3` | 2 | 2s | ~22 tok/s | Scales to 3 replicas and holds |
+
+```bash
+# Scale to 2 replicas (moderate load)
+python load-generator.py --mode stable-2
+
+# Scale to 3 replicas (heavy load)
+python load-generator.py --mode stable-3
+
+# Custom: pick your own concurrency and pacing
+python load-generator.py --mode custom --workers 3 --sleep 1.0
+```
+
+Press `Ctrl+C` to stop the load at any time. By default the script runs for
+10 minutes; override with `--duration`.
+
+### 3. Observe autoscaling
+
+You can use dashboards (recommended, see below) or get a compact summary in the terminal:
+
+```bash
+# polls every 10 seconds
+watch -n10 '
+echo "Deployment:"
+kubectl get deployment opt-125m-predictor
+
+echo
+echo "Autoscaler:"
+kubectl get hpa keda-hpa-opt-125m-scaledobject
+'
+```
+
+**Grafana dashboards** (prokube clusters): to visualize token throughput and replica count over time, see:
+
+- General vLLM Dashboard:
+  https://YOUR_DOMAIN/grafana/d/vllm-general/vllm
+
+- vLLM Performance Statistics:
+  https://YOUR_DOMAIN/grafana/d/performance-statistics/vllm-performance-statistics
+
+- vLLM Query Statistics:
+  https://YOUR_DOMAIN/grafana/d/query-statistics4/vllm-query-statistics
+
+- Replica count and CPU load (you have to select your namespace/workload manually):
+  https://YOUR_DOMAIN/grafana/d/a164a7f0339f99e89cea5cb47e9be617/kubernetes-compute-resources-workload
+
+Replace `YOUR_DOMAIN` with your cluster domain.
+
+### Expected behavior
+
+**Stable-2 mode** (~8 tok/s):
+1. **1 replica** at rest
+2. Load applied — metric rises to ~8 tok/s — `ceil(8/5) = 2` replicas needed
+3. **Scaled to 2 replicas** within ~1 minute
+4. Metric stabilizes at ~4 tok/s per replica (below threshold) — stays at 2
+5. Load removed — metric drops to 0 — scale-down waits out the stabilization window (120s)
+6. **Scaled back to 1** replica
+
+**Stable-3 mode** (~22 tok/s):
+1. **1 replica** at rest
+2. Load applied — metric rises quickly — `ceil(22/5) = 5`, capped at `maxReplicas=3`
+3. **Scaled to 3 replicas** within ~1-2 minutes
+4. Load removed — gradual scale-down: 3 → 2 → 1 (one pod removed per minute)
+
+## Scaling Math
+
+The ScaledObject uses `metricType: AverageValue` with `threshold: 5`. For
+external metrics, the HPA computes:
+
+```
+desiredReplicas = ceil(totalMetricValue / threshold)
+```
+
+| Total tok/s | Desired replicas | Actual (capped 1-3) |
+|-------------|------------------|---------------------|
+| 0-5 | 1 | 1 |
+| 5.1-10 | 2 | 2 |
+| 10.1-15 | 3 | 3 |
+| 15+ | 4+ | 3 (maxReplicas) |
+
+## Customization
+
+**Model name**: the `model_name="opt-125m"` filter in the Prometheus queries inside
+`scaled-object.yaml` must match the `--model_name` argument in `inference-service.yaml`.
+
+**Threshold**: the `threshold: "5"` value means "scale up when each replica
+handles more than 5 tokens/second on average" (`AverageValue` divides the
+query result by the replica count). Tune this based on load testing for your
+model and hardware.
+
+**Load generator presets**: the presets in `load-generator.py` are calibrated for
+opt-125m on CPU. If you change the model, hardware, or threshold, you'll need to
+recalibrate. Use `--mode custom` to experiment, and watch the Prometheus metric:
+
+```bash
+# Check the actual metric value KEDA sees
+kubectl run prom-check --rm -it --restart=Never --image=curlimages/curl -- \
+  -s 'http://kube-prometheus-stack-prometheus.monitoring.svc.cluster.local:9090/prometheus/api/v1/query' \
+  --data-urlencode 'query=sum(rate(vllm:prompt_tokens_total{model_name="opt-125m"}[2m])) + sum(rate(vllm:generation_tokens_total{model_name="opt-125m"}[2m]))'
+```
+
+**GPU deployments**: remove `--dtype=float32` and `--max-model-len=512`
+from the InferenceService args, add GPU resource requests, and consider
+adding a second trigger for GPU KV-cache utilization:
+
+```yaml
+# Add to scaled-object.yaml triggers list
+- type: prometheus
+  metricType: AverageValue
+  metadata:
+    serverAddress: http://kube-prometheus-stack-prometheus.monitoring.svc.cluster.local:9090/prometheus
+    query: >-
+      avg(vllm:gpu_cache_usage_perc{model_name="my-model"})
+    threshold: "0.75"
+```
+
+## References
+
+- [prokube autoscaling documentation](https://docs.prokube.cloud/user_docs/model_serving_autoscaling/)
+- [KServe KEDA autoscaler docs](https://kserve.github.io/website/docs/model-serving/predictive-inference/autoscaling/keda-autoscaler)
+- [KEDA Prometheus scaler](https://keda.sh/docs/scalers/prometheus/)
+- [vLLM metrics reference](https://docs.vllm.ai/en/latest/serving/metrics.html)
diff --git a/serving/kserve-keda-autoscaling/inference-service.yaml b/serving/kserve-keda-autoscaling/inference-service.yaml
new file mode 100644
index 0000000..f7a755b
--- /dev/null
+++ b/serving/kserve-keda-autoscaling/inference-service.yaml
@@ -0,0 +1,40 @@
+# KServe InferenceService for OPT-125M with vLLM backend.
+# Uses RawDeployment mode — required when scaling with KEDA.
+#
+# This example runs on CPU. For GPU, remove --dtype=float32 and
+# --max-model-len, and adjust resources to request nvidia.com/gpu.
+apiVersion: serving.kserve.io/v1beta1
+kind: InferenceService
+metadata:
+  name: opt-125m
+  annotations:
+    # RawDeployment mode — creates a plain Deployment instead of a Knative Revision.
+    serving.kserve.io/deploymentMode: "RawDeployment"
+    # Tell KServe not to create its own HPA (KEDA will manage scaling).
+    serving.kserve.io/autoscalerClass: "external"
+spec:
+  predictor:
+    minReplicas: 1
+    maxReplicas: 3
+    model:
+      modelFormat:
+        name: huggingface
+      args:
+        - --model_name=opt-125m
+        - --model_id=facebook/opt-125m
+        - --backend=vllm
+        - --dtype=float32
+        - --max-model-len=512
+      # Explicit port declaration is required in RawDeployment mode
+      # for the cluster-wide PodMonitor to discover the metrics endpoint.
+      ports:
+        - name: user-port
+          containerPort: 8080
+          protocol: TCP
+      resources:
+        requests:
+          cpu: "2"
+          memory: 4Gi
+        limits:
+          cpu: "4"
+          memory: 8Gi
diff --git a/serving/kserve-keda-autoscaling/load-generator.py b/serving/kserve-keda-autoscaling/load-generator.py
new file mode 100644
index 0000000..f33381a
--- /dev/null
+++ b/serving/kserve-keda-autoscaling/load-generator.py
@@ -0,0 +1,261 @@
+#!/usr/bin/env python3
+"""
+Load generator for vLLM-based KServe InferenceService autoscaling demos.
+
+Provides two preset scenarios to demonstrate KEDA autoscaling behaviors:
+
+    stable-2 - Sustained moderate load that triggers a stable scale-up to 2 replicas
+    stable-3 - Sustained heavy load that triggers a stable scale-up to 3 replicas
+
+You can also run in 'custom' mode to specify your own concurrency and sleep values.
+
+Usage (run from a terminal with network access to the service, e.g. a Kubeflow notebook):
+    python load-generator.py --mode stable-2
+    python load-generator.py --mode stable-3 --duration 300
+    python load-generator.py --mode custom --workers 3 --sleep 2.0
+
+Press Ctrl+C to stop the load at any time.
+"""
+
+import argparse
+import json
+import signal
+import threading
+import time
+import urllib.request
+import urllib.error
+
+# ---------------------------------------------------------------------------
+# Preset configurations
+#
+# Each preset defines (workers, sleep_between_requests).
+# "workers" is the number of concurrent request loops.
+# "sleep_between_requests" is how long each worker pauses (seconds) after
+# receiving a response before sending the next request.
+#
+# The ScaledObject uses:
+#   metricType: AverageValue  (total metric value / current replicas)
+#   threshold: 5              (tokens/sec per replica)
+#
+# HPA computes: desiredReplicas = ceil(total_tok_per_sec / threshold)
+#
+# stable-2: target ~8 tok/s total  -> ceil(8/5) = 2
+# stable-3: target ~22 tok/s total -> ceil(22/5) = 5, capped at maxReplicas=3
+#
+# Calibrated for opt-125m on CPU (float32, --max-model-len=512).
+# Each request averages ~116 tokens and ~6.7s of processing time.
+# Effective rate per worker ≈ 116 / (6.7 + sleep) tok/s.
+# ---------------------------------------------------------------------------
+
+PRESETS = {
+    "stable-2": {"workers": 1, "sleep": 8.0},
+    "stable-3": {"workers": 2, "sleep": 2.0},
+}
+
+# Default model endpoint (Kubernetes service DNS name in RawDeployment mode)
+DEFAULT_URL = "http://opt-125m-predictor/openai/v1/completions"
+DEFAULT_DURATION = 600  # 10 minutes
+DEFAULT_MAX_TOKENS = 200
+DEFAULT_PROMPT = (
+    "Write a long detailed story about a dragon who discovers a hidden kingdom"
+)
+
+# ---------------------------------------------------------------------------
+# Globals for stats
+# ---------------------------------------------------------------------------
+stats_lock = threading.Lock()
+total_requests = 0
+total_tokens = 0
+total_errors = 0
+start_time: float = 0.0
+stop_event = threading.Event()
+
+
+def send_request(url: str, prompt: str, max_tokens: int) -> dict | None:
+    """Send a single completion request to the vLLM endpoint."""
+    payload = json.dumps(
+        {
+            "model": "opt-125m",
+            "prompt": prompt,
+            "max_tokens": max_tokens,
+        }
+    ).encode("utf-8")
+
+    req = urllib.request.Request(
+        url,
+        data=payload,
+        headers={"Content-Type": "application/json"},
+    )
+    try:
+        with urllib.request.urlopen(req, timeout=120) as resp:
+            return json.loads(resp.read().decode("utf-8"))
+    except (urllib.error.URLError, OSError, json.JSONDecodeError):
+        return None
+
+
+def worker_loop(
+    worker_id: int, url: str, prompt: str, max_tokens: int, sleep_sec: float
+):
+    """Continuously send requests with a sleep between each, until stop_event is set."""
+    global total_requests, total_tokens, total_errors
+
+    while not stop_event.is_set():
+        result = send_request(url, prompt, max_tokens)
+
+        with stats_lock:
+            if result and "usage" in result:
+                total_requests += 1
+                total_tokens += result["usage"].get("total_tokens", 0)
+            else:
+                total_errors += 1
+
+        # Sleep between requests (interruptible via stop_event)
+        if sleep_sec > 0 and not stop_event.is_set():
+            stop_event.wait(timeout=sleep_sec)
+
+
+def print_stats():
+    """Periodically print throughput stats."""
+    while not stop_event.is_set():
+        stop_event.wait(timeout=10)
+        if stop_event.is_set():
+            break
+        elapsed = time.time() - start_time
+        with stats_lock:
+            tok_rate = total_tokens / elapsed if elapsed > 0 else 0
+            req_rate = total_requests / elapsed if elapsed > 0 else 0
+            print(
+                f" [{elapsed:6.0f}s] requests={total_requests} "
+                f"tokens={total_tokens} errors={total_errors} "
+                f"avg_tok/s={tok_rate:.1f} avg_req/s={req_rate:.2f}"
+            )
+
+
+def main():
+    global start_time
+
+    parser = argparse.ArgumentParser(
+        description="Load generator for KServe + KEDA autoscaling demo",
+        formatter_class=argparse.RawDescriptionHelpFormatter,
+        epilog=__doc__,
+    )
+    parser.add_argument(
+        "--mode",
+        choices=["stable-2", "stable-3", "custom"],
+        default="stable-2",
+        help="Load preset (default: stable-2)",
+    )
+    parser.add_argument(
+        "--workers",
+        type=int,
+        default=None,
+        help="Number of concurrent workers (custom mode)",
+    )
+    parser.add_argument(
+        "--sleep",
+        type=float,
+        default=None,
+        help="Sleep seconds between requests per worker (custom mode)",
+    )
+    parser.add_argument(
+        "--url",
+        default=DEFAULT_URL,
+        help=f"Service URL (default: {DEFAULT_URL})",
+    )
+    parser.add_argument(
+        "--duration",
+        type=int,
+        default=DEFAULT_DURATION,
+        help=f"Duration in seconds (default: {DEFAULT_DURATION})",
+    )
+    parser.add_argument(
+        "--max-tokens",
+        type=int,
+        default=DEFAULT_MAX_TOKENS,
+        help=f"Max tokens per request (default: {DEFAULT_MAX_TOKENS})",
+    )
+    parser.add_argument(
+        "--prompt",
+        default=DEFAULT_PROMPT,
+        help="Prompt text",
+    )
+
+    args = parser.parse_args()
+
+    # Resolve configuration
+    if args.mode == "custom":
+        if args.workers is None or args.sleep is None:
+            parser.error("--workers and --sleep are required in custom mode")
+        workers = args.workers
+        sleep_sec = args.sleep
+    else:
+        preset = PRESETS[args.mode]
+        workers = args.workers if args.workers is not None else preset["workers"]
+        sleep_sec = args.sleep if args.sleep is not None else preset["sleep"]
+
+    print("=== Load Generator ===")
+    print(f" Mode: {args.mode}")
+    print(f" Workers: {workers}")
+    print(f" Sleep: {sleep_sec}s between requests")
+    print(f" URL: {args.url}")
+    print(f" Duration: {args.duration}s")
+    print(f" Max tokens: {args.max_tokens}")
+    print()
+    print("Starting load... (Ctrl+C to stop)")
+    print()
+
+    # Handle Ctrl+C gracefully
+    def signal_handler(sig, frame):
+        print("\n\nStopping load...")
+        stop_event.set()
+
+    signal.signal(signal.SIGINT, signal_handler)
+    signal.signal(signal.SIGTERM, signal_handler)
+
+    start_time = time.time()
+
+    # Start stats printer
+    stats_thread = threading.Thread(target=print_stats, daemon=True)
+    stats_thread.start()
+
+    # Start worker threads
+    threads = []
+    for i in range(workers):
+        t = threading.Thread(
+            target=worker_loop,
+            args=(i, args.url, args.prompt, args.max_tokens, sleep_sec),
+            daemon=True,
+        )
+        t.start()
+        threads.append(t)
+
+    # Wait for duration or Ctrl+C
+    try:
+        stop_event.wait(timeout=args.duration)
+    except KeyboardInterrupt:
+        pass
+
+    stop_event.set()
+
+    # Wait for threads to finish
+    for t in threads:
+        t.join(timeout=5)
+
+    # Final stats
+    elapsed = time.time() - start_time
+    with stats_lock:
+        tok_rate = total_tokens / elapsed if elapsed > 0 else 0
+        req_rate = total_requests / elapsed if elapsed > 0 else 0
+
+    print()
+    print("=== Final Stats ===")
+    print(f" Duration: {elapsed:.1f}s")
+    print(f" Requests: {total_requests}")
+    print(f" Tokens: {total_tokens}")
+    print(f" Errors: {total_errors}")
+    print(f" Avg tok/s: {tok_rate:.1f}")
+    print(f" Avg req/s: {req_rate:.2f}")
+
+
+if __name__ == "__main__":
+    main()
diff --git a/serving/kserve-keda-autoscaling/scaled-object.yaml b/serving/kserve-keda-autoscaling/scaled-object.yaml
new file mode 100644
index 0000000..40036b8
--- /dev/null
+++ b/serving/kserve-keda-autoscaling/scaled-object.yaml
@@ -0,0 +1,43 @@
+# KEDA ScaledObject for KServe InferenceService with vLLM backend.
+# Scales based on total token throughput (prompt + generation) from Prometheus.
+#
+# Prerequisites:
+# - KEDA installed (https://keda.sh/docs/deploy/)
+# - Prometheus scraping vLLM metrics (prokube includes a cluster-wide PodMonitor)
+#
+# Customization:
+# - "opt-125m" in model_name must match the --model_name arg in inference-service.yaml
+# - The serverAddress must match your Prometheus URL
+#
apiVersion: keda.sh/v1alpha1
+kind: ScaledObject
+metadata:
+  name: opt-125m-scaledobject
+spec:
+  scaleTargetRef:
+    # In RawDeployment mode KServe names the Deployment {isvc-name}-predictor.
+    name: opt-125m-predictor
+  minReplicaCount: 1
+  maxReplicaCount: 3
+  pollingInterval: 15  # how often KEDA checks the metric (seconds)
+  cooldownPeriod: 120  # seconds after the last trigger activation before scaling to zero (only relevant when minReplicaCount is 0)
+  advanced:
+    horizontalPodAutoscalerConfig:
+      behavior:
+        scaleUp:
+          stabilizationWindowSeconds: 30  # short window to absorb metric noise before committing to scale-up
+        scaleDown:
+          stabilizationWindowSeconds: 120
+          policies:
+            - type: Pods
+              value: 1  # remove at most 1 replica per minute
+              periodSeconds: 60
+  triggers:
+    - type: prometheus
+      metricType: AverageValue
+      metadata:
+        serverAddress: http://kube-prometheus-stack-prometheus.monitoring.svc.cluster.local:9090/prometheus
+        query: >-
+          sum(rate(vllm:prompt_tokens_total{model_name="opt-125m"}[2m]))
+          + sum(rate(vllm:generation_tokens_total{model_name="opt-125m"}[2m]))
+        threshold: "5"
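+#
+# Illustrative variant (a sketch, not part of the example above): to scope the
+# queries to a single namespace, add a namespace label filter to both terms.
+# "my-namespace" is a placeholder, not a real value:
+#
+#   query: >-
+#     sum(rate(vllm:prompt_tokens_total{namespace="my-namespace",model_name="opt-125m"}[2m]))
+#     + sum(rate(vllm:generation_tokens_total{namespace="my-namespace",model_name="opt-125m"}[2m]))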