diff --git a/serving/kserve-keda-autoscaling/README.md b/serving/kserve-keda-autoscaling/README.md
new file mode 100644
index 0000000..e34d5be
--- /dev/null
+++ b/serving/kserve-keda-autoscaling/README.md
@@ -0,0 +1,266 @@
+# KServe Autoscaling with KEDA and Custom Prometheus Metrics
+
+This example demonstrates autoscaling a KServe InferenceService using
+[KEDA](https://keda.sh/) with custom Prometheus metrics from vLLM.
+It scales based on total token throughput rather than simple request count,
+which is better suited for LLM inference workloads.
+
+For full documentation, see the
+[prokube autoscaling docs](https://docs.prokube.cloud/user_docs/model_serving_autoscaling/#keda-kubernetes-event-driven-autoscaling).
+
+## Architecture
+
+```mermaid
+flowchart LR
+
+%% ---------- STYLES ----------
+classDef kserve fill:#E8F0FE,stroke:#1A73E8,stroke-width:2px,color:#0B3D91
+classDef pod fill:#F1F8E9,stroke:#558B2F,stroke-width:2px
+classDef infra fill:#FFF8E1,stroke:#FF8F00,stroke-width:2px
+classDef traffic fill:#FCE4EC,stroke:#C2185B,stroke-width:2px
+
+%% ---------- KSERVE ----------
+subgraph KServe[" KServe "]
+    direction TB
+
+    ISVC["InferenceService<br/>opt-125m"]:::kserve
+    DEP["Deployment<br/>opt-125m-predictor"]:::kserve
+
+    subgraph POD["Predictor Pod ×1–3"]
+        CTR["kserve-container<br/>HuggingFace runtime · vLLM engine<br/>facebook/opt-125m :8080"]:::pod
+    end
+
+    ISVC -->|creates| DEP
+    DEP -->|manages| CTR
+end
+
+%% ---------- OBSERVABILITY ----------
+subgraph Observability[" Observability "]
+    PROM[("Prometheus")]:::infra
+end
+
+%% ---------- KEDA ----------
+subgraph KEDA[" KEDA Autoscaling "]
+    direction TB
+    SO["ScaledObject<br/>Prometheus trigger<br/>threshold: 5 tok/s"]:::infra
+    HPA["HorizontalPodAutoscaler"]:::infra
+
+    SO -->|creates & drives| HPA
+end
+
+%% ---------- LOAD ----------
+LG["⚡ load-generator.py"]:::traffic
+
+%% ---------- FLOWS ----------
+CTR -->|/metrics| PROM
+PROM -->|query every 15s| SO
+HPA -->|scales| DEP
+SO -. targets .-> DEP
+LG -->|POST /completions| CTR
+```
+
+## Why Token Throughput?
+
+LLM requests vary wildly in duration depending on prompt and output length.
+Request-count metrics (concurrency, QPS) don't reflect actual GPU load.
+Token throughput stays elevated as long as the model is under pressure,
+making it a stable scaling signal.
+
+## Prerequisites
+
+- KEDA installed in the cluster — not available in all prokube clusters by default; see step 3 below
+- Prometheus scraping vLLM metrics (prokube clusters include a cluster-wide PodMonitor)
+
+## Files
+
+| File | Description |
+|------|-------------|
+| `inference-service.yaml` | KServe InferenceService (OPT-125M, RawDeployment mode) |
+| `scaled-object.yaml` | KEDA ScaledObject — scales on token throughput |
+| `load-generator.py` | Python load generator with presets for different scaling scenarios |
+
+## Quick Start
+
+> [!NOTE]
+> All of the examples below should be run in a prokube notebook terminal inside your cluster. A model created in RawDeployment mode is not accessible from outside the cluster by default.
+
+```bash
+# 1. Deploy the InferenceService
+kubectl apply -f inference-service.yaml
+
+# 2. Wait for it to become ready
+kubectl get isvc opt-125m -w
+
+# 3. Deploy the KEDA ScaledObject
+kubectl apply -f scaled-object.yaml
+
+# 4. Verify
+kubectl get scaledobject
+kubectl get hpa
+```
+
+> **NOTE 1.**
+> If step 3 fails with `no matches for kind "ScaledObject"`, KEDA is not installed in your cluster.
+> Ask your admin to enable it.
+
+> **NOTE 2.**
+> The Prometheus query in `scaled-object.yaml` has no `namespace` filter, so it aggregates token
+> throughput across **all namespaces**. This is fine for testing, but if multiple users deploy a
+> model named `opt-125m` at the same time, their metrics will interfere and autoscaling will be
+> incorrect for both. For any real use, add a namespace filter to both queries in `scaled-object.yaml`:
+>
+> ```yaml
+> query: >-
+>   sum(rate(vllm:prompt_tokens_total{namespace="",model_name="opt-125m"}[2m]))
+>   + sum(rate(vllm:generation_tokens_total{namespace="",model_name="opt-125m"}[2m]))
+> ```
+
+## See It in Action
+
+After deploying, you can trigger autoscaling and observe the full scale-up / scale-down cycle.
+
+### 1. Send a test inference request
+
+Get the internal cluster address and send a request:
+
+```bash
+# inference service name + "-predictor"
+SERVICE_URL=opt-125m-predictor
+
+curl -s "$SERVICE_URL/openai/v1/completions" \
+  -H "Content-Type: application/json" \
+  -d '{"model":"opt-125m","prompt":"What is AI?","max_tokens":64}' \
+  | python -c 'import json,sys;print("\n", json.load(sys.stdin)["choices"][0]["text"].strip(), "\n")'
+```
+
+### 2. Generate load to trigger scale-up
+
+Use the included load generator to produce controlled, sustained load.
+It has two presets calibrated for the opt-125m model on CPU:
+
+| Mode | Workers | Sleep | Throughput | Scaling behavior |
+|------|---------|-------|------------|------------------|
+| `stable-2` | 1 | 8s | ~8 tok/s | Scales to 2 replicas and holds |
+| `stable-3` | 2 | 2s | ~22 tok/s | Scales to 3 replicas and holds |
+
+```bash
+# Scale to 2 replicas (moderate load)
+python load-generator.py --mode stable-2
+
+# Scale to 3 replicas (heavy load)
+python load-generator.py --mode stable-3
+
+# Custom: pick your own concurrency and pacing
+python load-generator.py --mode custom --workers 3 --sleep 1.0
+```
+
+Press `Ctrl+C` to stop the load at any time. By default the script runs for
+10 minutes; override with `--duration`.
+
+### 3. Observe autoscaling
+
+You can use dashboards (recommended, see below) or get a compact summary in the terminal:
+
+```bash
+# polls every 10 seconds
+watch -n10 '
+echo "Deployment:"
+kubectl get deployment opt-125m-predictor
+
+echo
+echo "Autoscaler:"
+kubectl get hpa keda-hpa-opt-125m-scaledobject
+'
+```
+
+**Grafana dashboards** (prokube clusters): to visualize token throughput and replica count over time, see:
+
+- General vLLM Dashboard:
+  https://YOUR_DOMAIN/grafana/d/vllm-general/vllm
+
+- vLLM Performance Statistics:
+  https://YOUR_DOMAIN/grafana/d/performance-statistics/vllm-performance-statistics
+
+- vLLM Query Statistics:
+  https://YOUR_DOMAIN/grafana/d/query-statistics4/vllm-query-statistics
+
+- Replica count and CPU load (you have to select your namespace/workload manually):
+  https://YOUR_DOMAIN/grafana/d/a164a7f0339f99e89cea5cb47e9be617/kubernetes-compute-resources-workload
+
+Replace `YOUR_DOMAIN` with your cluster domain.
+
+### Expected behavior
+
+**Stable-2 mode** (~8 tok/s):
+1. **1 replica** at rest
+2. Load applied — metric rises to ~8 tok/s — `ceil(8/5) = 2` replicas needed
+3. **Scaled to 2 replicas** within ~1 minute
+4. Metric stabilizes at ~4 tok/s per replica (below threshold) — stays at 2
+5. Load removed — metric drops to 0 — scale-down waits out the stabilization window (120s)
+6. **Scaled back to 1** replica
+
+**Stable-3 mode** (~22 tok/s):
+1. **1 replica** at rest
+2. Load applied — metric rises quickly — `ceil(22/5) = 5`, capped at `maxReplicas=3`
+3. **Scaled to 3 replicas** within ~1-2 minutes
+4. Load removed — gradual scale-down: 3 → 2 → 1 (one pod removed per minute)
+
+## Scaling Math
+
+The ScaledObject uses `metricType: AverageValue` with `threshold: 5`. For
+external metrics, the HPA computes:
+
+```
+desiredReplicas = ceil(totalMetricValue / threshold)
+```
+
+| Total tok/s | Desired replicas | Actual (capped 1-3) |
+|-------------|------------------|---------------------|
+| 0-5 | 1 | 1 |
+| 5.1-10 | 2 | 2 |
+| 10.1-15 | 3 | 3 |
+| 15+ | 4+ | 3 (maxReplicas) |
+
+## Customization
+
+**Model name**: the `model_name="opt-125m"` filter in the Prometheus queries inside
+`scaled-object.yaml` must match the `--model_name` argument in `inference-service.yaml`.
+
+**Threshold**: the `threshold: "5"` value means "scale up when each replica
+handles more than 5 tokens/second on average" (`AverageValue` divides the
+query result by the replica count). Tune this based on load testing for your
+model and hardware.
+
+**Load generator presets**: the presets in `load-generator.py` are calibrated for
+opt-125m on CPU. If you change the model, hardware, or threshold, you'll need to
+recalibrate. Use `--mode custom` to experiment, and watch the Prometheus metric:
+
+```bash
+# Check the actual metric value KEDA sees
+kubectl run prom-check --rm -it --restart=Never --image=curlimages/curl -- \
+  -s 'http://kube-prometheus-stack-prometheus.monitoring.svc.cluster.local:9090/prometheus/api/v1/query' \
+  --data-urlencode 'query=sum(rate(vllm:prompt_tokens_total{model_name="opt-125m"}[2m])) + sum(rate(vllm:generation_tokens_total{model_name="opt-125m"}[2m]))'
+```
+
+**GPU deployments**: remove `--dtype=float32` and `--max-model-len=512`
+from the InferenceService args, add GPU resource requests, and consider
+adding a second trigger for GPU KV-cache utilization:
+
+```yaml
+# Add to scaled-object.yaml triggers list
+- type: prometheus
+  metricType: AverageValue
+  metadata:
+    serverAddress: http://kube-prometheus-stack-prometheus.monitoring.svc.cluster.local:9090/prometheus
+    query: >-
+      avg(vllm:gpu_cache_usage_perc{model_name="my-model"})
+    threshold: "0.75"
+```
+
+## References
+
+- [prokube autoscaling documentation](https://docs.prokube.cloud/user_docs/model_serving_autoscaling/)
+- [KServe KEDA autoscaler docs](https://kserve.github.io/website/docs/model-serving/predictive-inference/autoscaling/keda-autoscaler)
+- [KEDA Prometheus scaler](https://keda.sh/docs/scalers/prometheus/)
+- [vLLM metrics reference](https://docs.vllm.ai/en/latest/serving/metrics.html)
diff --git a/serving/kserve-keda-autoscaling/inference-service.yaml b/serving/kserve-keda-autoscaling/inference-service.yaml
new file mode 100644
index 0000000..f7a755b
--- /dev/null
+++ b/serving/kserve-keda-autoscaling/inference-service.yaml
@@ -0,0 +1,40 @@
+# KServe InferenceService for OPT-125M with vLLM backend.
+# Uses RawDeployment mode — required when scaling with KEDA.
+#
+# This example runs on CPU. For GPU, remove --dtype=float32 and
+# --max-model-len, and adjust resources to request nvidia.com/gpu.
+apiVersion: serving.kserve.io/v1beta1
+kind: InferenceService
+metadata:
+  name: opt-125m
+  annotations:
+    # RawDeployment mode — creates a plain Deployment instead of a Knative Revision.
+    serving.kserve.io/deploymentMode: "RawDeployment"
+    # Tell KServe not to create its own HPA (KEDA will manage scaling).
+    serving.kserve.io/autoscalerClass: "external"
+spec:
+  predictor:
+    minReplicas: 1
+    maxReplicas: 3
+    model:
+      modelFormat:
+        name: huggingface
+      args:
+        - --model_name=opt-125m
+        - --model_id=facebook/opt-125m
+        - --backend=vllm
+        - --dtype=float32
+        - --max-model-len=512
+      # Explicit port declaration is required in RawDeployment mode
+      # for the cluster-wide PodMonitor to discover the metrics endpoint.
+      ports:
+        - name: user-port
+          containerPort: 8080
+          protocol: TCP
+      resources:
+        requests:
+          cpu: "2"
+          memory: 4Gi
+        limits:
+          cpu: "4"
+          memory: 8Gi
diff --git a/serving/kserve-keda-autoscaling/load-generator.py b/serving/kserve-keda-autoscaling/load-generator.py
new file mode 100644
index 0000000..f33381a
--- /dev/null
+++ b/serving/kserve-keda-autoscaling/load-generator.py
@@ -0,0 +1,261 @@
+#!/usr/bin/env python3
+"""
+Load generator for vLLM-based KServe InferenceService autoscaling demos.
+
+Provides two preset scenarios to demonstrate KEDA autoscaling behaviors:
+
+    stable-2 - Sustained moderate load that triggers a stable scale-up to 2 replicas
+    stable-3 - Sustained heavy load that triggers a stable scale-up to 3 replicas
+
+You can also run in 'custom' mode to specify your own concurrency and sleep values.
+
+Usage (run from a terminal with network access to the service, e.g. a Kubeflow notebook):
+    python load-generator.py --mode stable-2
+    python load-generator.py --mode stable-3 --duration 300
+    python load-generator.py --mode custom --workers 3 --sleep 2.0
+
+Press Ctrl+C to stop the load at any time.
+"""
+
+import argparse
+import json
+import signal
+import threading
+import time
+import urllib.request
+import urllib.error
+
+# ---------------------------------------------------------------------------
+# Preset configurations
+#
+# Each preset defines (workers, sleep_between_requests).
+# "workers" is the number of concurrent request loops.
+# "sleep_between_requests" is how long each worker pauses (seconds) after
+# receiving a response before sending the next request.
+#
+# The ScaledObject uses:
+#   metricType: AverageValue  (total metric value / current replicas)
+#   threshold: 5              (tokens/sec per replica)
+#
+# HPA computes: desiredReplicas = ceil(total_tok_per_sec / threshold)
+#
+# stable-2: target ~8 tok/s total  -> ceil(8/5) = 2
+# stable-3: target ~22 tok/s total -> ceil(22/5) = 5, capped at maxReplicas=3
+#
+# Calibrated for opt-125m on CPU (float32, --max-model-len=512).
+# Each request averages ~116 tokens and ~6.7s of processing time.
+# Effective rate per worker ≈ 116 / (6.7 + sleep) tok/s.
+# ---------------------------------------------------------------------------
+
+PRESETS = {
+    "stable-2": {"workers": 1, "sleep": 8.0},
+    "stable-3": {"workers": 2, "sleep": 2.0},
+}
+
+# Default model endpoint (Kubernetes service DNS name in RawDeployment mode)
+DEFAULT_URL = "http://opt-125m-predictor/openai/v1/completions"
+DEFAULT_DURATION = 600  # 10 minutes
+DEFAULT_MAX_TOKENS = 200
+DEFAULT_PROMPT = (
+    "Write a long detailed story about a dragon who discovers a hidden kingdom"
+)
+
+# ---------------------------------------------------------------------------
+# Globals for stats
+# ---------------------------------------------------------------------------
+stats_lock = threading.Lock()
+total_requests = 0
+total_tokens = 0
+total_errors = 0
+start_time: float = 0.0
+stop_event = threading.Event()
+
+
+def send_request(url: str, prompt: str, max_tokens: int) -> dict | None:
+    """Send a single completion request to the vLLM endpoint."""
+    payload = json.dumps(
+        {
+            "model": "opt-125m",
+            "prompt": prompt,
+            "max_tokens": max_tokens,
+        }
+    ).encode("utf-8")
+
+    req = urllib.request.Request(
+        url,
+        data=payload,
+        headers={"Content-Type": "application/json"},
+    )
+    try:
+        with urllib.request.urlopen(req, timeout=120) as resp:
+            return json.loads(resp.read().decode("utf-8"))
+    except (urllib.error.URLError, OSError, json.JSONDecodeError):
+        return None
+
+
+def worker_loop(
+    worker_id: int, url: str, prompt: str, max_tokens: int, sleep_sec: float
+):
+    """Continuously send requests with a sleep between each, until stop_event is set."""
+    global total_requests, total_tokens, total_errors
+
+    while not stop_event.is_set():
+        result = send_request(url, prompt, max_tokens)
+
+        with stats_lock:
+            if result and "usage" in result:
+                total_requests += 1
+                total_tokens += result["usage"].get("total_tokens", 0)
+            else:
+                total_errors += 1
+
+        # Sleep between requests (interruptible via stop_event)
+        if sleep_sec > 0 and not stop_event.is_set():
+            stop_event.wait(timeout=sleep_sec)
+
+
+def print_stats():
+    """Periodically print throughput stats."""
+    while not stop_event.is_set():
+        stop_event.wait(timeout=10)
+        if stop_event.is_set():
+            break
+        elapsed = time.time() - start_time
+        with stats_lock:
+            tok_rate = total_tokens / elapsed if elapsed > 0 else 0
+            req_rate = total_requests / elapsed if elapsed > 0 else 0
+            print(
+                f" [{elapsed:6.0f}s] requests={total_requests} "
+                f"tokens={total_tokens} errors={total_errors} "
+                f"avg_tok/s={tok_rate:.1f} avg_req/s={req_rate:.2f}"
+            )
+
+
+def main():
+    global start_time
+
+    parser = argparse.ArgumentParser(
+        description="Load generator for KServe + KEDA autoscaling demo",
+        formatter_class=argparse.RawDescriptionHelpFormatter,
+        epilog=__doc__,
+    )
+    parser.add_argument(
+        "--mode",
+        choices=["stable-2", "stable-3", "custom"],
+        default="stable-2",
+        help="Load preset (default: stable-2)",
+    )
+    parser.add_argument(
+        "--workers",
+        type=int,
+        default=None,
+        help="Number of concurrent workers (custom mode)",
+    )
+    parser.add_argument(
+        "--sleep",
+        type=float,
+        default=None,
+        help="Sleep seconds between requests per worker (custom mode)",
+    )
+    parser.add_argument(
+        "--url",
+        default=DEFAULT_URL,
+        help=f"Service URL (default: {DEFAULT_URL})",
+    )
+    parser.add_argument(
+        "--duration",
+        type=int,
+        default=DEFAULT_DURATION,
+        help=f"Duration in seconds (default: {DEFAULT_DURATION})",
+    )
+    parser.add_argument(
+        "--max-tokens",
+        type=int,
+        default=DEFAULT_MAX_TOKENS,
+        help=f"Max tokens per request (default: {DEFAULT_MAX_TOKENS})",
+    )
+    parser.add_argument(
+        "--prompt",
+        default=DEFAULT_PROMPT,
+        help="Prompt text",
+    )
+
+    args = parser.parse_args()
+
+    # Resolve configuration
+    if args.mode == "custom":
+        if args.workers is None or args.sleep is None:
+            parser.error("--workers and --sleep are required in custom mode")
+        workers = args.workers
+        sleep_sec = args.sleep
+    else:
+        preset = PRESETS[args.mode]
+        workers = args.workers if args.workers is not None else preset["workers"]
+        sleep_sec = args.sleep if args.sleep is not None else preset["sleep"]
+
+    print("=== Load Generator ===")
+    print(f" Mode: {args.mode}")
+    print(f" Workers: {workers}")
+    print(f" Sleep: {sleep_sec}s between requests")
+    print(f" URL: {args.url}")
+    print(f" Duration: {args.duration}s")
+    print(f" Max tokens: {args.max_tokens}")
+    print()
+    print("Starting load... (Ctrl+C to stop)")
+    print()
+
+    # Handle Ctrl+C gracefully
+    def signal_handler(sig, frame):
+        print("\n\nStopping load...")
+        stop_event.set()
+
+    signal.signal(signal.SIGINT, signal_handler)
+    signal.signal(signal.SIGTERM, signal_handler)
+
+    start_time = time.time()
+
+    # Start stats printer
+    stats_thread = threading.Thread(target=print_stats, daemon=True)
+    stats_thread.start()
+
+    # Start worker threads
+    threads = []
+    for i in range(workers):
+        t = threading.Thread(
+            target=worker_loop,
+            args=(i, args.url, args.prompt, args.max_tokens, sleep_sec),
+            daemon=True,
+        )
+        t.start()
+        threads.append(t)
+
+    # Wait for duration or Ctrl+C
+    try:
+        stop_event.wait(timeout=args.duration)
+    except KeyboardInterrupt:
+        pass
+
+    stop_event.set()
+
+    # Wait for threads to finish
+    for t in threads:
+        t.join(timeout=5)
+
+    # Final stats
+    elapsed = time.time() - start_time
+    with stats_lock:
+        tok_rate = total_tokens / elapsed if elapsed > 0 else 0
+        req_rate = total_requests / elapsed if elapsed > 0 else 0
+
+    print()
+    print("=== Final Stats ===")
+    print(f" Duration: {elapsed:.1f}s")
+    print(f" Requests: {total_requests}")
+    print(f" Tokens: {total_tokens}")
+    print(f" Errors: {total_errors}")
+    print(f" Avg tok/s: {tok_rate:.1f}")
+    print(f" Avg req/s: {req_rate:.2f}")
+
+
+if __name__ == "__main__":
+    main()
diff --git a/serving/kserve-keda-autoscaling/scaled-object.yaml b/serving/kserve-keda-autoscaling/scaled-object.yaml
new file mode 100644
index 0000000..40036b8
--- /dev/null
+++ b/serving/kserve-keda-autoscaling/scaled-object.yaml
@@ -0,0 +1,43 @@
+# KEDA ScaledObject for KServe InferenceService with vLLM backend.
+# Scales based on total token throughput (prompt + generation) from Prometheus.
+#
+# Prerequisites:
+# - KEDA installed (https://keda.sh/docs/deploy/)
+# - Prometheus scraping vLLM metrics (prokube includes a cluster-wide PodMonitor)
+#
+# Customization:
+# - "opt-125m" in model_name must match the --model_name arg in inference-service.yaml
+# - The serverAddress must match your Prometheus URL
+#
apiVersion: keda.sh/v1alpha1
+kind: ScaledObject
+metadata:
+  name: opt-125m-scaledobject
+spec:
+  scaleTargetRef:
+    # In RawDeployment mode KServe names the Deployment {isvc-name}-predictor.
+    name: opt-125m-predictor
+  minReplicaCount: 1
+  maxReplicaCount: 3
+  pollingInterval: 15  # how often KEDA checks the metric (seconds)
+  cooldownPeriod: 120  # seconds after the last trigger activation before scaling to zero (only relevant when minReplicaCount is 0)
+  advanced:
+    horizontalPodAutoscalerConfig:
+      behavior:
+        scaleUp:
+          stabilizationWindowSeconds: 30  # short window to absorb metric noise before committing to scale-up
+        scaleDown:
+          stabilizationWindowSeconds: 120
+          policies:
+            - type: Pods
+              value: 1  # remove at most 1 replica per minute
+              periodSeconds: 60
+  triggers:
+    - type: prometheus
+      metricType: AverageValue
+      metadata:
+        serverAddress: http://kube-prometheus-stack-prometheus.monitoring.svc.cluster.local:9090/prometheus
+        query: >-
+          sum(rate(vllm:prompt_tokens_total{model_name="opt-125m"}[2m]))
+          + sum(rate(vllm:generation_tokens_total{model_name="opt-125m"}[2m]))
+        threshold: "5"
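+#
+# Illustrative variant (a sketch, not part of the example above): to scope the
+# queries to a single namespace, add a namespace label filter to both terms.
+# "my-namespace" is a placeholder, not a real value:
+#
+#   query: >-
+#     sum(rate(vllm:prompt_tokens_total{namespace="my-namespace",model_name="opt-125m"}[2m]))
+#     + sum(rate(vllm:generation_tokens_total{namespace="my-namespace",model_name="opt-125m"}[2m]))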