Module 6: Monitoring with Prometheus & Grafana
By completing this module, you will deliver:
Monitoring Infrastructure:
- ✅ Prometheus Server: Time-series database scraping metrics every 15s with 7-day retention
- ✅ 8 Production Alert Rules: High error rate, latency, service downtime, resource exhaustion
- ✅ Grafana Dashboard: Real-time visualization of request rates, errors, latency percentiles, resource usage
- ✅ Kubernetes Service Discovery: Automatic detection and monitoring of ML services
Real-World Impact:
- Incident Detection: Alert fires within 2 minutes of error rate exceeding 5%
- Debugging Speed: Reduce troubleshooting time from hours to minutes with correlated metrics
- Capacity Planning: Visualize CPU/memory trends to predict when to scale infrastructure
- SLA Monitoring: Track P95/P99 latency to ensure performance SLAs are met
By the end of this module, you will:
- ✅ Configure Prometheus for metrics collection
- ✅ Set up Kubernetes service discovery
- ✅ Create alerting rules with PromQL
- ✅ Build Grafana dashboards for ML monitoring
- ✅ Understand MLOps-specific observability patterns
This module teaches you to build production monitoring for ML services using Prometheus and Grafana. Complete three progressive exercises that cover metrics collection, alerting, and visualization for your MLOps stack.
| Challenge | Without Monitoring | With Monitoring |
|---|---|---|
| ML Latency | "Why is inference slow?" | P95/P99 latency tracked |
| Error Rate | "Are predictions failing?" | 5xx errors alerted |
| Resource Usage | "Pod OOM killed" | Memory usage trends visible |
| Scaling Issues | "HPA not working?" | CPU/memory vs replicas correlated |
| Incident Response | Hours to debug | Minutes with correlated metrics |
This module uses a scaffolded learning approach with three progressive exercises:
Exercise 1: Alerting Rules
├── Alert rule structure (see the structure sketch below)
├── PromQL expressions for alerts
├── Severity levels and thresholds
└── Time-based alert conditions
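For orientation before you open the file: Prometheus groups alerting rules into named rule groups. Below is a minimal structure sketch (the group name is an assumption; the rule mirrors the GatewayHighErrorRate example shown later in this module, and the workshop's prometheus-alerts.yaml may organize things differently):

```yaml
# Hypothetical sketch of an alerting-rules file; not the workshop's actual prometheus-alerts.yaml
groups:
  - name: gateway-alerts                  # assumed group name
    rules:
      - alert: GatewayHighErrorRate       # name taken from the example later in this module
        expr: |
          rate(gateway_http_requests_total{status=~"5.."}[5m])
            / rate(gateway_http_requests_total[5m]) > 0.05
        for: 2m                           # expression must stay true this long before firing
        labels:
          severity: warning               # severity drives routing and paging decisions
        annotations:
          summary: "Gateway 5xx error rate above 5%"
```

Since the rules are applied with kubectl in this module, they most likely live inside a ConfigMap that Prometheus loads through its rule_files setting.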
Exercise 2: Grafana Dashboard
├── Datasource configuration (see the provisioning sketch below)
├── Dashboard panel creation
├── PromQL queries for visualizations
└── Panel types and formats
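Likewise, Grafana datasources are normally provisioned declaratively. The sketch below shows what a Prometheus datasource definition typically looks like (the service name and port follow the kubectl port-forward commands used in this module; the workshop's grafana-dashboard.yaml presumably wraps something similar in a ConfigMap):

```yaml
# Hypothetical datasource provisioning sketch; adjust to the workshop's grafana-dashboard.yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy                    # Grafana's backend proxies queries to Prometheus
    url: http://prometheus:9090      # assumes a Service named "prometheus" on port 9090
    isDefault: true
```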
What does "scaffolded" mean?
- 80-90% of YAML is provided for you
- You fill in ~10-20% (critical configurations and queries)
- Focus on learning Prometheus/Grafana concepts
- Each TODO has inline hints showing exactly what to use
Prerequisites:
- Completed Module 4 (API Gateway deployment)
- Completed Module 3 (ML Service deployment)
- kubectl configured
- kind cluster running
Exercise 1: Alerting Rules
Goal: Create alerting rules for high error rates, latency, and service downtime.
# Open the file
open prometheus-alerts.yaml
# Key concepts: PromQL expressions, severity levels, time thresholds
Test alerts:
# Deploy alerts
kubectl apply -f prometheus-alerts.yaml
# Restart Prometheus to load rules
kubectl rollout restart deployment/prometheus
# View in UI
kubectl port-forward svc/prometheus 9090:9090
# Navigate to: Alerts tab
Exercise 2: Grafana Dashboard
Goal: Build a Grafana dashboard with panels for request rate, errors, latency, and resource usage.
# Open the file
open grafana-dashboard.yaml
# Key concepts: Datasource config, PromQL queries, panel configuration
Test dashboard:
# Deploy Grafana
kubectl apply -f grafana-dashboard.yaml
# Wait for ready
kubectl wait --for=condition=ready pod -l app=grafana --timeout=120s
# Access UI
kubectl port-forward svc/grafana 3000:3000
open http://localhost:3000
# Login: admin / admin
# Navigate to: Dashboards → MLOps Workshop → MLOps Overview
Generate traffic to populate the dashboard:
# Port-forward gateway
kubectl port-forward svc/api-gateway-service 8080:80
# Generate traffic
for i in {1..100}; do
curl -X POST http://localhost:8080/predict \
-H "Content-Type: application/json" \
-d '{"request": {"text": "Go is amazing!","request_id": null}}' &
done
# Watch metrics in Grafana
# Request rate, latency, and resource usage should update
How Prometheus works:
- Scrape Model: Pull metrics from targets every 15s (see the config sketch after this list)
- Service Discovery: Automatically find pods to monitor
- Relabeling: Filter and transform discovered targets
- TSDB: Time-series database for efficient storage
- PromQL: Query language for metrics
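These concepts map directly onto the Prometheus configuration. Here is a minimal sketch of the global settings behind the "15s scrape, 7-day retention" behaviour described earlier (exact values live in the workshop's prometheus-config.yaml; retention is set as a server flag rather than a key in prometheus.yml):

```yaml
# prometheus.yml sketch: scrape and rule-evaluation cadence
global:
  scrape_interval: 15s        # pull metrics from every target every 15 seconds
  evaluation_interval: 15s    # evaluate alerting rules on the same cadence

# Retention is a flag on the Prometheus container rather than a config key, e.g.:
#   --storage.tsdb.retention.time=7d
```

The kubernetes_sd_configs block below sits under a scrape job in the same file's scrape_configs section.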
kubernetes_sd_configs:
- role: pod
namespaces:
names:
- default
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
    regex: true
Pods opt-in with annotations:
metadata:
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "8080"
    prometheus.io/path: "/metrics"
Example alert rule:
- alert: GatewayHighErrorRate
expr: |
rate(gateway_http_requests_total{status=~"5.."}[5m])
/ rate(gateway_http_requests_total[5m]) > 0.05
for: 2m
labels:
severity: warning
annotations:
    summary: "Error rate is {{ $value }}"
Common PromQL patterns:
# Request rate (req/sec)
rate(metric[5m])
# Error rate (percentage)
rate(errors[5m]) / rate(requests[5m])
# Latency percentiles
histogram_quantile(0.95, rate(metric_bucket[5m]))
# Service down
absent(up{job="service"} == 1)
# Resource usage
(usage / limit) > 0.9
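In Exercise 2 the dashboard is applied with kubectl, so the dashboard JSON most likely ships inside a ConfigMap that Grafana picks up through a mounted dashboard provider or sidecar. A hedged sketch of that wrapper (resource name, label, and file name are assumptions):

```yaml
# Hypothetical wrapper; the workshop's grafana-dashboard.yaml may be organized differently
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-dashboard-mlops       # assumed name
  labels:
    grafana_dashboard: "1"            # label commonly watched by dashboard sidecars
data:
  mlops-overview.json: |
    { "title": "MLOps Overview", "panels": [] }   # the full dashboard JSON goes here
```

Inside that JSON, each panel pairs a PromQL query with display settings, for example: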
{
"panels": [
{
"title": "Request Rate",
"targets": [
{
"expr": "sum(rate(gateway_http_requests_total[5m]))",
"legendFormat": "Requests/sec"
}
]
}
]
}
Gateway Metrics (from Module 4):
gateway_http_requests_total{method,endpoint,status}
gateway_http_request_duration_seconds_bucket{le}
gateway_backend_requests_total{endpoint,status}
gateway_backend_request_duration_seconds_bucket{le}
ML Service Metrics (from BentoML):
bentoml_service_request_total
bentoml_service_request_duration_seconds
Kubernetes Metrics:
container_memory_usage_bytes
container_cpu_usage_seconds_total
kube_pod_status_phase
kube_horizontalpodautoscaler_status_current_replicas
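Note that the kube_* series above are exposed by kube-state-metrics and the container_* series by the kubelet's cAdvisor endpoint; the workshop's Prometheus config presumably scrapes both already. For reference, a minimal (assumed) static scrape job for kube-state-metrics looks like this:

```yaml
# Sketch only; the Service name and port assume a default kube-state-metrics install
scrape_configs:
  - job_name: kube-state-metrics
    static_configs:
      - targets: ["kube-state-metrics.kube-system.svc:8080"]
```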
# Prometheus
kubectl port-forward svc/prometheus 9090:9090
open http://localhost:9090
# Grafana
kubectl port-forward svc/grafana 3000:3000
open http://localhost:3000 # Login: admin/admin
# View targets
# Prometheus UI → Status → Targets
# View alerts
# Prometheus UI → Alerts
# Check Prometheus logs
kubectl logs -l app=prometheus
# Check Grafana logs
kubectl logs -l app=grafana
# Test PromQL query
# Prometheus UI → Graph → Enter query
Troubleshooting issue: Prometheus targets not being scraped
Symptoms:
- Prometheus UI → Status → Targets shows "0/0 up"
- Service discovery finds pods but doesn't scrape them
- Metrics not appearing in Prometheus
Root Cause: Missing pod annotations or incorrect relabel configuration
Step-by-step solution:
# 1. Check service discovery is finding pods
kubectl port-forward svc/prometheus 9090:9090
# Visit: http://localhost:9090/service-discovery
# Should see pods listed under "kubernetes-pods"
# 2. Verify pod annotations exist
kubectl get pods -l app=api-gateway -o yaml | grep -A 3 "prometheus.io"
# Should show:
# prometheus.io/scrape: "true"
# prometheus.io/port: "8080"
# prometheus.io/path: "/metrics"
# 3. If annotations missing, add them to deployment
kubectl patch deployment api-gateway -p '
spec:
template:
metadata:
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "8080"
prometheus.io/path: "/metrics"
'
# 4. Check relabel configs in prometheus-config.yaml
# Look for action: keep with source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
# 5. Check Prometheus logs for scrape errors
kubectl logs -l app=prometheus | grep -i error
kubectl logs -l app=prometheus | grep "scrape"
Troubleshooting issue: Grafana panels show "No Data"
Symptoms:
- Dashboard panels show "No Data" message
- Prometheus datasource shows green checkmark
- Time range is set correctly
Root Cause: No metrics exist yet, or wrong PromQL query
Step-by-step solution:
# 1. Test Prometheus datasource connection
# Grafana UI → Configuration → Data Sources → Prometheus → Save & Test
# Should show: "Data source is working"
# 2. Verify metrics exist in Prometheus
kubectl port-forward svc/prometheus 9090:9090
# Navigate to: http://localhost:9090/graph
# Query: gateway_http_requests_total
# Should return results
# 3. If no metrics, generate traffic
kubectl port-forward svc/api-gateway-service 8080:80 &
for i in {1..20}; do
curl -X POST http://localhost:8080/predict \
-H "Content-Type: application/json" \
-d '{"request": {"text": "Go is amazing!","request_id": null}}' || true
sleep 1
done
# 4. Wait 15-30 seconds for Prometheus to scrape
# 5. Check time range in Grafana
# Dashboard → Top right → Time picker → Last 15 minutes
# 6. Verify PromQL query syntax
# Test query directly in Prometheus UI first
# Example: rate(gateway_http_requests_total[5m])
# 7. Check panel datasource is set to Prometheus
# Panel → Edit → Query → Data source: Prometheus
Prometheus command reference:
# Deploy Prometheus
kubectl apply -f prometheus-config.yaml
# Check Prometheus deployment
kubectl get deployment prometheus
kubectl get pods -l app=prometheus
kubectl describe pod -l app=prometheus
# View Prometheus logs
kubectl logs -l app=prometheus
kubectl logs -l app=prometheus -f # Follow logs
kubectl logs -l app=prometheus --previous # Previous container
# Access Prometheus UI
kubectl port-forward svc/prometheus 9090:9090
open http://localhost:9090
# Restart Prometheus
kubectl rollout restart deployment/prometheus
kubectl wait --for=condition=ready pod -l app=prometheus --timeout=120s
# Check Prometheus configuration
kubectl get configmap prometheus-config -o yaml
# Update configuration
kubectl apply -f prometheus-config.yaml
kubectl rollout restart deployment/prometheus
# Check Prometheus metrics about itself
curl http://localhost:9090/metrics
# Verify scrape targets
# Prometheus UI β Status β Targets
# Or via API:
curl http://localhost:9090/api/v1/targets
# Access Prometheus UI for queries
kubectl port-forward svc/prometheus 9090:9090
open http://localhost:9090/graph
# Common queries for ML services:
# Request rate (requests per second)
rate(gateway_http_requests_total[5m])
sum(rate(gateway_http_requests_total[5m]))
# Error rate (percentage)
sum(rate(gateway_http_requests_total{status=~"5.."}[5m]))
/ sum(rate(gateway_http_requests_total[5m])) * 100
# Request breakdown by endpoint
sum(rate(gateway_http_requests_total[5m])) by (endpoint)
# Request breakdown by status code
sum(rate(gateway_http_requests_total[5m])) by (status)
# P95 latency
histogram_quantile(0.95,
rate(gateway_http_request_duration_seconds_bucket[5m]))
# P99 latency
histogram_quantile(0.99,
rate(gateway_http_request_duration_seconds_bucket[5m]))
# ML inference latency
histogram_quantile(0.95,
rate(gateway_backend_request_duration_seconds_bucket[5m]))
# Memory usage (bytes)
container_memory_usage_bytes{pod=~"api-gateway.*"}
container_memory_usage_bytes{pod=~"sentiment-api.*"}
# Memory usage (percentage)
(container_memory_usage_bytes / container_spec_memory_limit_bytes) * 100
# CPU usage
rate(container_cpu_usage_seconds_total{pod=~"api-gateway.*"}[5m])
# HPA replicas
kube_horizontalpodautoscaler_status_current_replicas{horizontalpodautoscaler="sentiment-api-hpa"}
kube_horizontalpodautoscaler_status_desired_replicas{horizontalpodautoscaler="sentiment-api-hpa"}
# Pod status
kube_pod_status_phase{pod=~"api-gateway.*"}
kube_pod_status_phase{pod=~"sentiment-api.*"}
Traffic generation commands:
# Simple single request
kubectl port-forward svc/api-gateway-service 8080:80 &
curl -X POST http://localhost:8080/predict \
-H "Content-Type: application/json" \
-d '{"request": {"text": "Go is amazing!","request_id": null}}'
# Generate continuous traffic (light)
for i in {1..100}; do
curl -X POST http://localhost:8080/predict \
-H "Content-Type: application/json" \
-d '{"request": {"text": "Go is amazing!","request_id": "'$i'"}}' &
sleep 0.1
done
# Generate sustained load (heavy)
while true; do
for i in {1..10}; do
curl -X POST http://localhost:8080/predict \
-H "Content-Type: application/json" \
-d '{"request": {"text": "Go is amazing!","request_id": "'$i'"}}' &
done
sleep 1
done
# Generate mixed traffic (success + errors)
for i in {1..50}; do
curl -X POST http://localhost:8080/predict \
-H "Content-Type: application/json" \
-d '{"request": {"text": "Go is amazing!","request_id": null}}' &
curl -X POST http://localhost:8080/predict \
-d 'invalid json' &
done
# Stop background port-forward
pkill -f "port-forward.*8080:80"
If you get stuck, reference implementations are in solution/:
Note: Try to complete exercises on your own first!
The Go API Gateway from Module 4 exposes Prometheus metrics automatically:
Gateway metrics exposed:
// modules/module-4/main.go
var (
httpRequestsTotal = promauto.NewCounterVec(
prometheus.CounterOpts{
Name: "gateway_http_requests_total",
Help: "Total HTTP requests",
},
[]string{"method", "endpoint", "status"},
)
httpRequestDuration = promauto.NewHistogramVec(
prometheus.HistogramOpts{
Name: "gateway_http_request_duration_seconds",
Help: "HTTP request duration",
Buckets: prometheus.DefBuckets,
},
[]string{"method", "endpoint"},
)
)
Prometheus scrapes these automatically via annotations:
# modules/module-4/deployment.yaml
metadata:
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "8080"
    prometheus.io/path: "/metrics"
Query gateway metrics in Grafana:
# Request rate by endpoint
sum(rate(gateway_http_requests_total[5m])) by (endpoint)
# Error rate
sum(rate(gateway_http_requests_total{status=~"5.."}[5m]))
/ sum(rate(gateway_http_requests_total[5m]))
# P95 latency
histogram_quantile(0.95,
rate(gateway_http_request_duration_seconds_bucket[5m]))
BentoML services from Module 3 expose metrics automatically:
BentoML default metrics:
bentoml_service_request_total{endpoint, http_response_code, service_name, service_version}
bentoml_service_request_duration_seconds{endpoint, service_name, service_version}
bentoml_service_request_in_progress{endpoint, service_name, service_version}
Kubernetes resource metrics:
# Memory usage of ML service
container_memory_usage_bytes{pod=~"sentiment-api.*"}
# CPU usage
rate(container_cpu_usage_seconds_total{pod=~"sentiment-api.*"}[5m])
# HPA status
kube_horizontalpodautoscaler_status_current_replicas{horizontalpodautoscaler="sentiment-api-hpa"}
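The module's alert set also covers resource exhaustion (see the deliverables list at the top). A hedged sketch of such a rule, built from the resource queries above (the threshold, duration, and pod regex are assumptions):

```yaml
# Hypothetical rule; threshold and label matchers are assumptions
- alert: MLServiceMemoryHigh
  expr: |
    (container_memory_usage_bytes{pod=~"sentiment-api.*"}
      / container_spec_memory_limit_bytes{pod=~"sentiment-api.*"}) > 0.9
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "ML service memory usage above 90% of its limit"
```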
Alert on ML service issues:
# prometheus-alerts.yaml
- alert: MLServiceDown
expr: absent(up{job="ml-service"} == 1)
for: 1m
labels:
severity: critical
annotations:
summary: "ML Service is down"
description: "ML service has been unavailable for 1+ minutes"
- alert: MLInferenceLatencyHigh
expr: |
histogram_quantile(0.95,
rate(gateway_backend_request_duration_seconds_bucket[5m])) > 1
for: 5m
labels:
severity: warning
annotations:
    summary: "ML inference latency high: {{ $value }}s"
Monitor Kubeflow pipeline runs and model training metrics:
Pipeline execution metrics:
# Pipeline runs by status
count(argo_workflows_status) by (status)
# Pipeline duration
histogram_quantile(0.95, argo_workflow_duration_seconds_bucket)
# Failed pipelines
count(argo_workflows_status{status="Failed"})
Model training metrics (custom):
# modules/module-1/train.py
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway
registry = CollectorRegistry()
training_accuracy = Gauge('model_training_accuracy',
'Model training accuracy',
registry=registry)
training_loss = Gauge('model_training_loss',
'Model training loss',
registry=registry)
# After training
training_accuracy.set(accuracy)
training_loss.set(loss)
push_to_gateway('prometheus-pushgateway:9091',
job='model-training',
    registry=registry)
Dashboard for ML lifecycle:
# Training jobs completed today
count(model_training_accuracy{job="model-training"})
# Latest model accuracy
model_training_accuracy{job="model-training"}
# Model deployment count
count(kube_deployment_labels{deployment=~"sentiment-api.*"})
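The push_to_gateway call in the training snippet above assumes a Prometheus Pushgateway is deployed and scraped. A hedged scrape-config sketch for it (the target matches the address used in the Python code; honor_labels preserves the job label set by the pushing client):

```yaml
# Sketch; assumes a Pushgateway Service reachable at prometheus-pushgateway:9091
scrape_configs:
  - job_name: pushgateway
    honor_labels: true
    static_configs:
      - targets: ["prometheus-pushgateway:9091"]
```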
| Component | Workshop | Production |
|---|---|---|
| Deployment | Raw manifests | Helm (kube-prometheus-stack) |
| Storage | emptyDir (ephemeral) | PersistentVolumeClaim (50Gi+) |
| Retention | 7 days | 30+ days |
| Replicas | 1 (single pod) | 2+ with HA |
| Auth | Anonymous enabled | RBAC + OAuth |
| Alerting | No AlertManager | AlertManager + PagerDuty/Slack |
| TLS | HTTP only | HTTPS with cert-manager |
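For the Production column, the kube-prometheus-stack Helm chart exposes most of these settings as chart values. A hedged sketch of the relevant values (key paths follow the chart's documented values; verify against the chart version you install):

```yaml
# values.yaml sketch for kube-prometheus-stack; verify keys against your chart version
prometheus:
  prometheusSpec:
    retention: 30d                    # longer retention than the workshop's 7 days
    replicas: 2                       # HA pair instead of a single pod
    storageSpec:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi           # persistent storage instead of emptyDir
```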
Once you've completed all exercises:
Extend monitoring:
- Add more alert rules (CPU throttling, disk space)
- Create custom Grafana dashboards
- Integrate with AlertManager
- Add Loki for log aggregation
Production deployment:
- Use Helm for easier management
- Configure persistent storage
- Enable authentication and TLS
- Set up alert routing (PagerDuty, Slack); see the Alertmanager sketch below
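For the AlertManager and alert-routing items above, a minimal (assumed) Alertmanager configuration that routes critical alerts to Slack could look like this; the webhook URL and channel are placeholders:

```yaml
# alertmanager.yml sketch; webhook URL and channel are placeholders
route:
  receiver: slack-critical
  routes:
    - match:
        severity: critical
      receiver: slack-critical
receivers:
  - name: slack-critical
    slack_configs:
      - api_url: https://hooks.slack.com/services/XXX/YYY/ZZZ
        channel: "#mlops-alerts"
        send_resolved: true
```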
✅ Workshop Complete! You've mastered the entire MLOps stack! 🎉
✅ Metrics Collection - Automatic service discovery with Prometheus
✅ Alerting - PromQL-based alerts for ML services
✅ Visualization - Production dashboards with Grafana
✅ MLOps Observability - Specific patterns for ML systems
✅ Production Ready - Scalable monitoring architecture
Congratulations! You've completed the MLOps workshop and built a full production ML platform! 🎉
From model training (Module 1) to monitoring (Module 6), you now have hands-on experience with the entire MLOps lifecycle.
| Previous | Home | Next |
|---|---|---|
| ← Module 5: Kubeflow Pipelines & Model Serving | 🏠 Home | Module 7: CI/CD with GitHub Actions → |
MLOps Workshop | GitHub Repository