Module 6: Monitoring with Prometheus & Grafana
By completing this module, you will deliver:
Monitoring Infrastructure:
- ✅ Prometheus Server: Time-series database scraping metrics every 15s with 7-day retention
- ✅ 8 Production Alert Rules: High error rate, latency, service downtime, resource exhaustion
- ✅ Grafana Dashboard: Real-time visualization of request rates, errors, latency percentiles, resource usage
- ✅ Kubernetes Service Discovery: Automatic detection and monitoring of ML services
Real-World Impact:
- Incident Detection: Alert fires within 2 minutes of error rate exceeding 5%
- Debugging Speed: Reduce troubleshooting time from hours to minutes with correlated metrics
- Capacity Planning: Visualize CPU/memory trends to predict when to scale infrastructure
- SLA Monitoring: Track P95/P99 latency to ensure performance SLAs are met
By the end of this module, you will:
- ✅ Configure Prometheus for metrics collection
- ✅ Set up Kubernetes service discovery
- ✅ Create alerting rules with PromQL
- ✅ Build Grafana dashboards for ML monitoring
- ✅ Understand MLOps-specific observability patterns
This module teaches you to build production monitoring for ML services using Prometheus and Grafana. Complete three progressive exercises that cover metrics collection, alerting, and visualization for your MLOps stack.
| Challenge | Without Monitoring | With Monitoring |
|---|---|---|
| ML Latency | "Why is inference slow?" | P95/P99 latency tracked |
| Error Rate | "Are predictions failing?" | 5xx errors alerted |
| Resource Usage | "Pod OOM killed" | Memory usage trends visible |
| Scaling Issues | "HPA not working?" | CPU/memory vs replicas correlated |
| Incident Response | Hours to debug | Minutes with correlated metrics |
This module uses a scaffolded learning approach with three progressive exercises:
Exercise 1: Alerting Rules
├── Alert rule structure (see the structure sketch below)
├── PromQL expressions for alerts
├── Severity levels and thresholds
└── Time-based alert conditions
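For orientation before you open the file: Prometheus groups alerting rules into named rule groups. Below is a minimal structure sketch (the group name is an assumption; the rule mirrors the GatewayHighErrorRate example shown later in this module, and the workshop's prometheus-alerts.yaml may organize things differently):

```yaml
# Hypothetical sketch of an alerting-rules file; not the workshop's actual prometheus-alerts.yaml
groups:
  - name: gateway-alerts                  # assumed group name
    rules:
      - alert: GatewayHighErrorRate       # name taken from the example later in this module
        expr: |
          rate(gateway_http_requests_total{status=~"5.."}[5m])
            / rate(gateway_http_requests_total[5m]) > 0.05
        for: 2m                           # expression must stay true this long before firing
        labels:
          severity: warning               # severity drives routing and paging decisions
        annotations:
          summary: "Gateway 5xx error rate above 5%"
```

Since the rules are applied with kubectl in this module, they most likely live inside a ConfigMap that Prometheus loads through its rule_files setting.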
Exercise 2: Grafana Dashboard
├── Datasource configuration (see the provisioning sketch below)
├── Dashboard panel creation
├── PromQL queries for visualizations
└── Panel types and formats
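Likewise, Grafana datasources are normally provisioned declaratively. The sketch below shows what a Prometheus datasource definition typically looks like (the service name and port follow the kubectl port-forward commands used in this module; the workshop's grafana-dashboard.yaml presumably wraps something similar in a ConfigMap):

```yaml
# Hypothetical datasource provisioning sketch; adjust to the workshop's grafana-dashboard.yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy                    # Grafana's backend proxies queries to Prometheus
    url: http://prometheus:9090      # assumes a Service named "prometheus" on port 9090
    isDefault: true
```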
What does "scaffolded" mean?
- 80-90% of YAML is provided for you
- You fill in ~10-20% (critical configurations and queries)
- Focus on learning Prometheus/Grafana concepts
- Each TODO has inline hints showing exactly what to use
Prerequisites:
- Completed Module 4 (API Gateway deployment)
- Completed Module 3 (ML Service deployment)
- kubectl configured
- kind cluster running
Exercise 1: Alerting Rules
Goal: Create alerting rules for high error rates, latency, and service downtime.
# Open the file
open prometheus-alerts.yaml
# Key concepts: PromQL expressions, severity levels, time thresholds
Test alerts:
# Deploy alerts
kubectl apply -f prometheus-alerts.yaml
# Restart Prometheus to load rules
kubectl rollout restart deployment/prometheus
# View in UI
kubectl port-forward svc/prometheus 9090:9090
# Navigate to: Alerts tab
Exercise 2: Grafana Dashboard
Goal: Build a Grafana dashboard with panels for request rate, errors, latency, and resource usage.
# Open the file
open grafana-dashboard.yaml
# Key concepts: Datasource config, PromQL queries, panel configuration
Test dashboard:
# Deploy Grafana
kubectl apply -f grafana-dashboard.yaml
# Wait for ready
kubectl wait --for=condition=ready pod -l app=grafana --timeout=120s
# Access UI
kubectl port-forward svc/grafana 3000:3000
open http://localhost:3000
# Login: admin / admin
# Navigate to: Dashboards → MLOps Workshop → MLOps Overview
Generate traffic to populate the dashboard:
# Port-forward gateway
kubectl port-forward svc/api-gateway-service 8080:80
# Generate traffic
for i in {1..100}; do
curl -X POST http://localhost:8080/predict \
-H "Content-Type: application/json" \
-d '{"request": {"text": "Go is amazing!","request_id": null}}' &
done
# Watch metrics in Grafana
# Request rate, latency, and resource usage should update
How Prometheus works:
- Scrape Model: Pull metrics from targets every 15s (see the config sketch after this list)
- Service Discovery: Automatically find pods to monitor
- Relabeling: Filter and transform discovered targets
- TSDB: Time-series database for efficient storage
- PromQL: Query language for metrics
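These concepts map directly onto the Prometheus configuration. Here is a minimal sketch of the global settings behind the "15s scrape, 7-day retention" behaviour described earlier (exact values live in the workshop's prometheus-config.yaml; retention is set as a server flag rather than a key in prometheus.yml):

```yaml
# prometheus.yml sketch: scrape and rule-evaluation cadence
global:
  scrape_interval: 15s        # pull metrics from every target every 15 seconds
  evaluation_interval: 15s    # evaluate alerting rules on the same cadence

# Retention is a flag on the Prometheus container rather than a config key, e.g.:
#   --storage.tsdb.retention.time=7d
```

The kubernetes_sd_configs block below sits under a scrape job in the same file's scrape_configs section.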
kubernetes_sd_configs:
- role: pod
namespaces:
names:
- default
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
    regex: true
Pods opt-in with annotations:
metadata:
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "8080"
    prometheus.io/path: "/metrics"
Example alert rule:
- alert: GatewayHighErrorRate
expr: |
rate(gateway_http_requests_total{status=~"5.."}[5m])
/ rate(gateway_http_requests_total[5m]) > 0.05
for: 2m
labels:
severity: warning
annotations:
    summary: "Error rate is {{ $value }}"
Common PromQL patterns:
# Request rate (req/sec)
rate(metric[5m])
# Error rate (percentage)
rate(errors[5m]) / rate(requests[5m])
# Latency percentiles
histogram_quantile(0.95, rate(metric_bucket[5m]))
# Service down
absent(up{job="service"} == 1)
# Resource usage
(usage / limit) > 0.9
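In Exercise 2 the dashboard is applied with kubectl, so the dashboard JSON most likely ships inside a ConfigMap that Grafana picks up through a mounted dashboard provider or sidecar. A hedged sketch of that wrapper (resource name, label, and file name are assumptions):

```yaml
# Hypothetical wrapper; the workshop's grafana-dashboard.yaml may be organized differently
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-dashboard-mlops       # assumed name
  labels:
    grafana_dashboard: "1"            # label commonly watched by dashboard sidecars
data:
  mlops-overview.json: |
    { "title": "MLOps Overview", "panels": [] }   # the full dashboard JSON goes here
```

Inside that JSON, each panel pairs a PromQL query with display settings, for example: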
{
"panels": [
{
"title": "Request Rate",
"targets": [
{
"expr": "sum(rate(gateway_http_requests_total[5m]))",
"legendFormat": "Requests/sec"
}
]
}
]
}
Gateway Metrics (from Module 4):
gateway_http_requests_total{method,endpoint,status}
gateway_http_request_duration_seconds_bucket{le}
gateway_backend_requests_total{endpoint,status}
gateway_backend_request_duration_seconds_bucket{le}
ML Service Metrics (from BentoML):
bentoml_service_request_total
bentoml_service_request_duration_seconds
Kubernetes Metrics:
container_memory_usage_bytes
container_cpu_usage_seconds_total
kube_pod_status_phase
kube_horizontalpodautoscaler_status_current_replicas
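Note that the kube_* series above are exposed by kube-state-metrics and the container_* series by the kubelet's cAdvisor endpoint; the workshop's Prometheus config presumably scrapes both already. For reference, a minimal (assumed) static scrape job for kube-state-metrics looks like this:

```yaml
# Sketch only; the Service name and port assume a default kube-state-metrics install
scrape_configs:
  - job_name: kube-state-metrics
    static_configs:
      - targets: ["kube-state-metrics.kube-system.svc:8080"]
```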
# Prometheus
kubectl port-forward svc/prometheus 9090:9090
open http://localhost:9090
# Grafana
kubectl port-forward svc/grafana 3000:3000
open http://localhost:3000 # Login: admin/admin
# View targets
# Prometheus UI → Status → Targets
# View alerts
# Prometheus UI → Alerts
# Check Prometheus logs
kubectl logs -l app=prometheus
# Check Grafana logs
kubectl logs -l app=grafana
# Test PromQL query
# Prometheus UI → Graph → Enter query
Troubleshooting issue: Prometheus targets not being scraped
Symptoms:
- Prometheus UI → Status → Targets shows "0/0 up"
- Service discovery finds pods but doesn't scrape them
- Metrics not appearing in Prometheus
Root Cause: Missing pod annotations or incorrect relabel configuration
Step-by-step solution:
# 1. Check service discovery is finding pods
kubectl port-forward svc/prometheus 9090:9090
# Visit: http://localhost:9090/service-discovery
# Should see pods listed under "kubernetes-pods"
# 2. Verify pod annotations exist
kubectl get pods -l app=api-gateway -o yaml | grep -A 3 "prometheus.io"
# Should show:
# prometheus.io/scrape: "true"
# prometheus.io/port: "8080"
# prometheus.io/path: "/metrics"
# 3. If annotations missing, add them to deployment
kubectl patch deployment api-gateway -p '
spec:
template:
metadata:
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "8080"
prometheus.io/path: "/metrics"
'
# 4. Check relabel configs in prometheus-config.yaml
# Look for action: keep with source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
# 5. Check Prometheus logs for scrape errors
kubectl logs -l app=prometheus | grep -i error
kubectl logs -l app=prometheus | grep "scrape"
Troubleshooting issue: Grafana panels show "No Data"
Symptoms:
- Dashboard panels show "No Data" message
- Prometheus datasource shows green checkmark
- Time range is set correctly
Root Cause: No metrics exist yet, or wrong PromQL query
Step-by-step solution:
# 1. Test Prometheus datasource connection
# Grafana UI → Configuration → Data Sources → Prometheus → Save & Test
# Should show: "Data source is working"
# 2. Verify metrics exist in Prometheus
kubectl port-forward svc/prometheus 9090:9090
# Navigate to: http://localhost:9090/graph
# Query: gateway_http_requests_total
# Should return results
# 3. If no metrics, generate traffic
kubectl port-forward svc/api-gateway-service 8080:80 &
for i in {1..20}; do
curl -X POST http://localhost:8080/predict \
-H "Content-Type: application/json" \
-d '{"request": {"text": "Go is amazing!","request_id": null}}' || true
sleep 1
done
# 4. Wait 15-30 seconds for Prometheus to scrape
# 5. Check time range in Grafana
# Dashboard → Top right → Time picker → Last 15 minutes
# 6. Verify PromQL query syntax
# Test query directly in Prometheus UI first
# Example: rate(gateway_http_requests_total[5m])
# 7. Check panel datasource is set to Prometheus
# Panel → Edit → Query → Data source: Prometheus
Prometheus command reference:
# Deploy Prometheus
kubectl apply -f prometheus-config.yaml
# Check Prometheus deployment
kubectl get deployment prometheus
kubectl get pods -l app=prometheus
kubectl describe pod -l app=prometheus
# View Prometheus logs
kubectl logs -l app=prometheus
kubectl logs -l app=prometheus -f # Follow logs
kubectl logs -l app=prometheus --previous # Previous container
# Access Prometheus UI
kubectl port-forward svc/prometheus 9090:9090
open http://localhost:9090
# Restart Prometheus
kubectl rollout restart deployment/prometheus
kubectl wait --for=condition=ready pod -l app=prometheus --timeout=120s
# Check Prometheus configuration
kubectl get configmap prometheus-config -o yaml
# Update configuration
kubectl apply -f prometheus-config.yaml
kubectl rollout restart deployment/prometheus
# Check Prometheus metrics about itself
curl http://localhost:9090/metrics
# Verify scrape targets
# Prometheus UI β Status β Targets
# Or via API:
curl http://localhost:9090/api/v1/targets
# Access Prometheus UI for queries
kubectl port-forward svc/prometheus 9090:9090
open http://localhost:9090/graph
# Common queries for ML services:
# Request rate (requests per second)
rate(gateway_http_requests_total[5m])
sum(rate(gateway_http_requests_total[5m]))
# Error rate (percentage)
sum(rate(gateway_http_requests_total{status=~"5.."}[5m]))
/ sum(rate(gateway_http_requests_total[5m])) * 100
# Request breakdown by endpoint
sum(rate(gateway_http_requests_total[5m])) by (endpoint)
# Request breakdown by status code
sum(rate(gateway_http_requests_total[5m])) by (status)
# P95 latency
histogram_quantile(0.95,
rate(gateway_http_request_duration_seconds_bucket[5m]))
# P99 latency
histogram_quantile(0.99,
rate(gateway_http_request_duration_seconds_bucket[5m]))
# ML inference latency
histogram_quantile(0.95,
rate(gateway_backend_request_duration_seconds_bucket[5m]))
# Memory usage (bytes)
container_memory_usage_bytes{pod=~"api-gateway.*"}
container_memory_usage_bytes{pod=~"sentiment-api.*"}
# Memory usage (percentage)
(container_memory_usage_bytes / container_spec_memory_limit_bytes) * 100
# CPU usage
rate(container_cpu_usage_seconds_total{pod=~"api-gateway.*"}[5m])
# HPA replicas
kube_horizontalpodautoscaler_status_current_replicas{horizontalpodautoscaler="sentiment-api-hpa"}
kube_horizontalpodautoscaler_status_desired_replicas{horizontalpodautoscaler="sentiment-api-hpa"}
# Pod status
kube_pod_status_phase{pod=~"api-gateway.*"}
kube_pod_status_phase{pod=~"sentiment-api.*"}
Traffic generation commands:
# Simple single request
kubectl port-forward svc/api-gateway-service 8080:80 &
curl -X POST http://localhost:8080/predict \
-H "Content-Type: application/json" \
-d '{"request": {"text": "Go is amazing!","request_id": null}}'
# Generate continuous traffic (light)
for i in {1..100}; do
curl -X POST http://localhost:8080/predict \
-H "Content-Type: application/json" \
-d '{"request": {"text": "Go is amazing!","request_id": "'$i'"}}' &
sleep 0.1
done
# Generate sustained load (heavy)
while true; do
for i in {1..10}; do
curl -X POST http://localhost:8080/predict \
-H "Content-Type: application/json" \
-d '{"request": {"text": "Go is amazing!","request_id": "'$i'"}}' &
done
sleep 1
done
# Generate mixed traffic (success + errors)
for i in {1..50}; do
curl -X POST http://localhost:8080/predict \
-H "Content-Type: application/json" \
-d '{"request": {"text": "Go is amazing!","request_id": null}}' &
curl -X POST http://localhost:8080/predict \
-d 'invalid json' &
done
# Stop background port-forward
pkill -f "port-forward.*8080:80"
If you get stuck, reference implementations are in solution/:
Note: Try to complete exercises on your own first!
The Go API Gateway from Module 4 exposes Prometheus metrics automatically:
Gateway metrics exposed:
// modules/module-4/main.go
var (
httpRequestsTotal = promauto.NewCounterVec(
prometheus.CounterOpts{
Name: "gateway_http_requests_total",
Help: "Total HTTP requests",
},
[]string{"method", "endpoint", "status"},
)
httpRequestDuration = promauto.NewHistogramVec(
prometheus.HistogramOpts{
Name: "gateway_http_request_duration_seconds",
Help: "HTTP request duration",
Buckets: prometheus.DefBuckets,
},
[]string{"method", "endpoint"},
)
)
Prometheus scrapes these automatically via annotations:
# modules/module-4/deployment.yaml
metadata:
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "8080"
    prometheus.io/path: "/metrics"
Query gateway metrics in Grafana:
# Request rate by endpoint
sum(rate(gateway_http_requests_total[5m])) by (endpoint)
# Error rate
sum(rate(gateway_http_requests_total{status=~"5.."}[5m]))
/ sum(rate(gateway_http_requests_total[5m]))
# P95 latency
histogram_quantile(0.95,
rate(gateway_http_request_duration_seconds_bucket[5m]))
BentoML services from Module 3 expose metrics automatically:
BentoML default metrics:
bentoml_service_request_total{endpoint, http_response_code, service_name, service_version}
bentoml_service_request_duration_seconds{endpoint, service_name, service_version}
bentoml_service_request_in_progress{endpoint, service_name, service_version}
Kubernetes resource metrics:
# Memory usage of ML service
container_memory_usage_bytes{pod=~"sentiment-api.*"}
# CPU usage
rate(container_cpu_usage_seconds_total{pod=~"sentiment-api.*"}[5m])
# HPA status
kube_horizontalpodautoscaler_status_current_replicas{horizontalpodautoscaler="sentiment-api-hpa"}
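The module's alert set also covers resource exhaustion (see the deliverables list at the top). A hedged sketch of such a rule, built from the resource queries above (the threshold, duration, and pod regex are assumptions):

```yaml
# Hypothetical rule; threshold and label matchers are assumptions
- alert: MLServiceMemoryHigh
  expr: |
    (container_memory_usage_bytes{pod=~"sentiment-api.*"}
      / container_spec_memory_limit_bytes{pod=~"sentiment-api.*"}) > 0.9
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "ML service memory usage above 90% of its limit"
```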
Alert on ML service issues:
# prometheus-alerts.yaml
- alert: MLServiceDown
expr: absent(up{job="ml-service"} == 1)
for: 1m
labels:
severity: critical
annotations:
summary: "ML Service is down"
description: "ML service has been unavailable for 1+ minutes"
- alert: MLInferenceLatencyHigh
expr: |
histogram_quantile(0.95,
rate(gateway_backend_request_duration_seconds_bucket[5m])) > 1
for: 5m
labels:
severity: warning
annotations:
    summary: "ML inference latency high: {{ $value }}s"
Monitor Kubeflow pipeline runs and model training metrics:
Pipeline execution metrics:
# Pipeline runs by status
count(argo_workflows_status) by (status)
# Pipeline duration
histogram_quantile(0.95, argo_workflow_duration_seconds_bucket)
# Failed pipelines
count(argo_workflows_status{status="Failed"})
Model training metrics (custom):
# modules/module-1/train.py
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway
registry = CollectorRegistry()
training_accuracy = Gauge('model_training_accuracy',
'Model training accuracy',
registry=registry)
training_loss = Gauge('model_training_loss',
'Model training loss',
registry=registry)
# After training
training_accuracy.set(accuracy)
training_loss.set(loss)
push_to_gateway('prometheus-pushgateway:9091',
job='model-training',
    registry=registry)
Dashboard for ML lifecycle:
# Training jobs completed today
count(model_training_accuracy{job="model-training"})
# Latest model accuracy
model_training_accuracy{job="model-training"}
# Model deployment count
count(kube_deployment_labels{deployment=~"sentiment-api.*"})
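The push_to_gateway call in the training snippet above assumes a Prometheus Pushgateway is deployed and scraped. A hedged scrape-config sketch for it (the target matches the address used in the Python code; honor_labels preserves the job label set by the pushing client):

```yaml
# Sketch; assumes a Pushgateway Service reachable at prometheus-pushgateway:9091
scrape_configs:
  - job_name: pushgateway
    honor_labels: true
    static_configs:
      - targets: ["prometheus-pushgateway:9091"]
```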
| Component | Workshop | Production |
|---|---|---|
| Deployment | Raw manifests | Helm (kube-prometheus-stack) |
| Storage | emptyDir (ephemeral) | PersistentVolumeClaim (50Gi+) |
| Retention | 7 days | 30+ days |
| Replicas | 1 (single pod) | 2+ with HA |
| Auth | Anonymous enabled | RBAC + OAuth |
| Alerting | No AlertManager | AlertManager + PagerDuty/Slack |
| TLS | HTTP only | HTTPS with cert-manager |
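For the Production column, the kube-prometheus-stack Helm chart exposes most of these settings as chart values. A hedged sketch of the relevant values (key paths follow the chart's documented values; verify against the chart version you install):

```yaml
# values.yaml sketch for kube-prometheus-stack; verify keys against your chart version
prometheus:
  prometheusSpec:
    retention: 30d                    # longer retention than the workshop's 7 days
    replicas: 2                       # HA pair instead of a single pod
    storageSpec:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi           # persistent storage instead of emptyDir
```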
Once you've completed all exercises:
Extend monitoring:
- Add more alert rules (CPU throttling, disk space)
- Create custom Grafana dashboards
- Integrate with AlertManager
- Add Loki for log aggregation
Production deployment:
- Use Helm for easier management
- Configure persistent storage
- Enable authentication and TLS
- Set up alert routing (PagerDuty, Slack); see the Alertmanager sketch below
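For the AlertManager and alert-routing items above, a minimal (assumed) Alertmanager configuration that routes critical alerts to Slack could look like this; the webhook URL and channel are placeholders:

```yaml
# alertmanager.yml sketch; webhook URL and channel are placeholders
route:
  receiver: slack-critical
  routes:
    - match:
        severity: critical
      receiver: slack-critical
receivers:
  - name: slack-critical
    slack_configs:
      - api_url: https://hooks.slack.com/services/XXX/YYY/ZZZ
        channel: "#mlops-alerts"
        send_resolved: true
```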
✅ Workshop Complete! You've mastered the entire MLOps stack! 🎉
✅ Metrics Collection - Automatic service discovery with Prometheus
✅ Alerting - PromQL-based alerts for ML services
✅ Visualization - Production dashboards with Grafana
✅ MLOps Observability - Specific patterns for ML systems
✅ Production Ready - Scalable monitoring architecture
Congratulations! You've completed the MLOps workshop and built a full production ML platform! 🎉
From model training (Module 1) to monitoring (Module 6), you now have hands-on experience with the entire MLOps lifecycle.
| Previous | Home | Next |
|---|---|---|
| ← Module 5: Kubeflow Pipelines & Model Serving | 🏠 Home | Module 7: CI/CD with GitHub Actions → |
MLOps Workshop | GitHub Repository