
Module 3: Kubernetes Deployment

What You'll Build

By the end of this module, you'll have:

  • ✅ Production-ready Kubernetes deployment for your ML service
  • ✅ Auto-scaling infrastructure that responds to traffic patterns
  • ✅ High-availability setup surviving node failures
  • ✅ Security-hardened containers running as non-root
  • ✅ Complete health monitoring with all probe types
  • ✅ Zero-downtime deployment capabilities
  • ✅ Resource-optimized configuration preventing waste

Real-World Impact:

  • Handle 10x traffic spikes automatically with HPA
  • Maintain 99.9%+ uptime with proper health checks
  • Pass security audits with hardened container configuration
  • Deploy updates without service interruption
  • Optimize costs by scaling down during low traffic

Learning Objectives

By the end of this module, you will:

  • ✅ Deploy ML models to Kubernetes with production-ready configuration
  • ✅ Configure resource limits and requests
  • ✅ Implement all three health probe types
  • ✅ Set up auto-scaling with Horizontal Pod Autoscaler
  • ✅ Ensure high availability with Pod Disruption Budget
  • ✅ Apply security best practices (non-root, read-only filesystem)
  • ✅ Use ConfigMap for externalized configuration
  • ✅ Implement pod anti-affinity for fault tolerance

Part 1: Setup & Prerequisites

Prerequisites

  • Completed Module 2 (BentoML service containerized)
  • kind installed (Kubernetes in Docker)
  • kubectl installed and configured
  • Docker image from Module 2: sentiment-api:v1

Workshop Format

Single Exercise: Production-Ready Deployment
├─ ConfigMap for configuration
├─ Deployment with proper resource management
├─ Service for network access
├─ Health probes (startup, liveness, readiness)
├─ Horizontal Pod Autoscaler (HPA)
├─ Pod Disruption Budget (PDB)
├─ Security hardening
└─ High availability (anti-affinity)

What does "scaffolded" mean?

  • 80-90% of YAML is provided for you
  • You fill in ~10-20% (20 specific configuration values)
  • Each TODO has inline hints showing exactly what to use

Part 2: Quick Start

1. Setup kind Cluster

# Create cluster
kind create cluster --config modules/module-0/kind.yaml

# Verify
kubectl cluster-info
kubectl get nodes

2. Build and Load Docker Image

# Build image (from Module 2)
cd ../module-2
bentoml build
bentoml containerize sentiment_service:latest -t sentiment-api:v1

# Load into kind
cd ../module-3
kind load docker-image sentiment-api:v1 --name mlops-workshop

# Verify image is loaded
docker exec -it mlops-workshop-control-plane crictl images | grep sentiment

3. Install Metrics Server (Required for HPA)

# Install
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

# Patch for kind (disable TLS verification)
kubectl patch deployment metrics-server -n kube-system --type='json' \
  -p='[{"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--kubelet-insecure-tls"}]'

# Verify
kubectl get deployment metrics-server -n kube-system

4. Complete the Exercise

Production-Ready Deployment

Goal: Deploy your ML service to Kubernetes with complete production configuration.

cd starter

# Open the file
open deployment.yaml

# Find and fill in 20 TODOs
# Look for: # YOUR CODE HERE

# Apply the manifest
kubectl apply -f deployment.yaml

# Verify all resources
kubectl get deployments,pods,svc,hpa,pdb -l app=sentiment-api

Test the API:

# Port forward
kubectl port-forward svc/sentiment-api-service 8080:80

# In another terminal, test prediction
curl -X POST http://localhost:8080/predict \
     -H "Content-Type: application/json" \
     -d '{"request": {"text": "Kubernetes is awesome!","request_id": null}}'

# Expected response:
# {"sentiment": "POSITIVE", "score": 0.9998}

# Test health endpoint
curl http://localhost:8080/health

Key TODOs to Complete

Configuration (TODOs 1-2):

  • Set BentoML port and worker count in ConfigMap
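
The finished ConfigMap might look roughly like the sketch below; the key names BENTOML_PORT and BENTOML_WORKERS are illustrative, so use the exact keys from the starter file:

# Hypothetical key names/values -- check starter/deployment.yaml for the real ones
apiVersion: v1
kind: ConfigMap
metadata:
  name: sentiment-api-config
  labels:
    app: sentiment-api
data:
  BENTOML_PORT: "3000"    # BentoML's default HTTP port
  BENTOML_WORKERS: "1"    # one worker per pod; add capacity via replicas instead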

Deployment Basics (TODOs 3-6):

  • Set deployment name, replicas, selector, and pod labels

Security (TODOs 7-8, 15-16):

  • Configure pod and container security contexts
  • Run as non-root user (UID 1000)
  • Read-only root filesystem

High Availability (TODOs 9-10):

  • Configure pod anti-affinity to spread across nodes
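
A typical preferred anti-affinity stanza is sketched below; topologyKey: kubernetes.io/hostname tells the scheduler to avoid placing two sentiment-api pods on the same node when possible:

affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchLabels:
            app: sentiment-api
        topologyKey: kubernetes.io/hostname   # spread across nodes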

Container Configuration (TODOs 11-14):

  • Set container name, image, pull policy, and port

Resources (TODOs 17-18):

  • Configure CPU and memory requests (critical for HPA)

Health Probes (TODOs 19-20):

  • Configure startup and readiness probes

Part 3: Key Concepts

Key Concepts Covered

Kubernetes Fundamentals

  • Deployments: Manage replica Pods, rolling updates
  • Services: Stable network endpoint, load balancing
  • ConfigMaps: Externalized configuration
  • Labels & Selectors: Resource organization
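
As a sketch of how these pieces fit together, the Service below selects pods by label and maps the service port used in this module's port-forward commands to the container port (targetPort 3000 assumes BentoML's default port):

apiVersion: v1
kind: Service
metadata:
  name: sentiment-api-service
spec:
  type: ClusterIP
  selector:
    app: sentiment-api    # traffic is routed to pods carrying this label
  ports:
  - port: 80              # service port (kubectl port-forward 8080:80)
    targetPort: 3000      # container port where BentoML listens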

Resource Management

  • Requests: Guaranteed minimum for scheduling
  • Limits: Maximum allowed (prevents exhaustion)
  • QoS Classes: Guaranteed, Burstable, BestEffort
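
For example, a request/limit pair like this sketch yields the Burstable QoS class (requests set, limits higher); the numbers are illustrative and depend on your model's footprint:

resources:
  requests:
    cpu: 500m       # scheduling guarantee; HPA utilization is computed against this
    memory: 1Gi
  limits:
    cpu: "1"        # CPU above the limit is throttled
    memory: 2Gi     # memory above the limit gets the container OOM-killed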

Health and Reliability

  • Startup Probes: Handle slow ML model loading (30-60s)
  • Liveness Probes: Detect and restart dead containers
  • Readiness Probes: Control traffic routing
  • Self-healing: Automatic recovery
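
A hedged sketch of all three probes, assuming a /health endpoint on BentoML's default port 3000 (tune the timings to your model's actual load time):

startupProbe:
  httpGet:
    path: /health
    port: 3000
  periodSeconds: 5
  failureThreshold: 24    # up to 120s for model loading before the other probes take over
livenessProbe:
  httpGet:
    path: /health
    port: 3000
  periodSeconds: 10
  failureThreshold: 3     # three consecutive failures trigger a container restart
readinessProbe:
  httpGet:
    path: /health
    port: 3000
  periodSeconds: 5        # failing pods are removed from the Service, not restarted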

Auto-scaling

  • HPA (Horizontal Pod Autoscaler): Scale based on CPU/memory
  • Metrics: CPU 70%, Memory 80% targets
  • Scaling Policies: Stabilization windows, rate limits
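
An HPA matching those targets might look like this sketch (autoscaling/v2 API; the min/max replicas and scale-down window are illustrative):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: sentiment-api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: sentiment-api
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300   # wait 5 minutes of low load before scaling down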

High Availability

  • Pod Disruption Budget: Maintain availability during updates
  • Pod Anti-affinity: Spread pods across nodes
  • Rolling Updates: Zero-downtime deployments
  • Fault Tolerance: Survive node failures
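
A minimal PDB sketch: with minAvailable: 1, voluntary disruptions such as node drains can never take down the last pod:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: sentiment-api-pdb
spec:
  minAvailable: 1           # at least one pod must survive voluntary disruptions
  selector:
    matchLabels:
      app: sentiment-api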

Security

  • Non-root Containers: Reduce attack surface
  • Read-only Filesystem: Prevent file modifications
  • Dropped Capabilities: Minimal privileges
  • Security Contexts: Pod and container hardening
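
Combined, that hardening might look like this sketch inside the pod spec (UID 1000 matches the workshop requirement; see Issue 3 in Troubleshooting if the image conflicts with these settings):

securityContext:              # pod-level
  runAsNonRoot: true
  runAsUser: 1000
  fsGroup: 1000
containers:
- name: sentiment-api
  securityContext:            # container-level
    allowPrivilegeEscalation: false
    readOnlyRootFilesystem: true
    capabilities:
      drop: ["ALL"]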

Part 4: Testing & Validation

Common Commands

# View all resources
kubectl get all -l app=sentiment-api

# Describe resources
kubectl describe deployment sentiment-api
kubectl describe pod <pod-name>
kubectl describe hpa sentiment-api-hpa

# Monitor resources
kubectl top pod -l app=sentiment-api
kubectl get hpa -w

# View logs
kubectl logs -l app=sentiment-api -f

# Port forwarding
kubectl port-forward svc/sentiment-api-service 8080:80

# Manual scaling (HPA will override)
kubectl scale deployment sentiment-api --replicas=5

# Delete all resources
kubectl delete -f deployment.yaml

Part 5: Troubleshooting

Issue 1: Pods stuck in "Pending"

Symptoms:

kubectl get pods
# Shows: STATUS=Pending for extended period

Root Causes:

  1. Image not loaded into kind cluster
  2. Insufficient cluster resources
  3. Node selector/affinity constraints not met

Solutions:

Check 1: Verify image is loaded

# Check if image exists in kind
docker exec -it mlops-workshop-control-plane crictl images | grep sentiment

# If not found, load it
kind load docker-image sentiment-api:v1 --name mlops-workshop

# Verify again
docker exec -it mlops-workshop-control-plane crictl images | grep sentiment

Check 2: Inspect pod events

# Get detailed information
kubectl describe pod <pod-name>

# Look for events like:
# - "FailedScheduling: 0/1 nodes available"
# - "ImagePullBackOff"
# - "Insufficient cpu/memory"

Check 3: Verify cluster resources

# Check node resources
kubectl describe nodes

# Check resource requests
kubectl describe deployment sentiment-api

# If insufficient, reduce requests in deployment.yaml

Issue 2: Pods stuck in "ImagePullBackOff"

Symptoms:

kubectl get pods
# Shows: STATUS=ImagePullBackOff or ErrImagePull

Root Cause: Wrong imagePullPolicy for local kind cluster

Solution:

# Check current imagePullPolicy
kubectl get deployment sentiment-api -o yaml | grep imagePullPolicy

# Should be: imagePullPolicy: Never (for kind)
# Fix in deployment.yaml TODO 13

# If set to "Always" or "IfNotPresent", change to "Never"
# Then reapply:
kubectl apply -f deployment.yaml

# Force restart pods
kubectl rollout restart deployment sentiment-api

Alternative: Rebuild and reload image

cd ../module-2
bentoml build
bentoml containerize sentiment_service:latest -t sentiment-api:v1
cd ../module-3
kind load docker-image sentiment-api:v1 --name mlops-workshop
kubectl delete pods -l app=sentiment-api

Issue 3: Security context errors

Symptoms:

Error: container has runAsNonRoot and image will run as root
Error: container has runAsNonRoot and image has non-numeric user

Root Cause: The default BentoML image runs as root, which conflicts with runAsNonRoot: true

Solutions:

Option 1: Adjust security context (for workshop)

# In deployment.yaml, use specific UID
securityContext:
  runAsUser: 1000
  runAsNonRoot: true
  # Remove runAsGroup if causing issues

Option 2: Build custom non-root image (production)

# In your BentoML project
# Create custom Dockerfile
FROM bentoml/bento-server:latest

# Create non-root user
RUN useradd -m -u 1000 bentouser && \
    chown -R bentouser:bentouser /home/bentouser

USER bentouser

# Rest of your build...

Option 3: Relax constraints temporarily

# For local testing only
securityContext:
  # Comment out runAsNonRoot temporarily
  # runAsNonRoot: true
  readOnlyRootFilesystem: false

Still stuck? Check the solution file


Part 6: Reference

Commands Cheat Sheet

Quick Start

# Create kind cluster (same config as Part 2)
kind create cluster --config modules/module-0/kind.yaml

# Load image
kind load docker-image sentiment-api:v1 --name mlops-workshop

# Apply deployment
kubectl apply -f deployment.yaml

# Check status
kubectl get all -l app=sentiment-api

Logs and Debugging

# View pod logs
kubectl logs <pod-name>

# View logs from previous crashed container
kubectl logs <pod-name> --previous

# Follow logs in real-time
kubectl logs -f <pod-name>

# Logs from all pods with label
kubectl logs -l app=sentiment-api --all-containers=true

# Tail last 50 lines
kubectl logs <pod-name> --tail=50

# Logs since last 1 hour
kubectl logs <pod-name> --since=1h

# Exec into pod
kubectl exec -it <pod-name> -- /bin/bash
kubectl exec -it <pod-name> -- sh  # if bash not available

# Run command in pod
kubectl exec <pod-name> -- curl localhost:3000/health

Port Forwarding and Access

# Port forward service
kubectl port-forward svc/sentiment-api-service 8080:80

# Port forward in background
kubectl port-forward svc/sentiment-api-service 8080:80 &

# Test API
curl -X POST http://localhost:8080/predict \
     -H "Content-Type: application/json" \
     -d '{"request": {"text": "Test", "request_id": null}}'

Scaling Operations

# Manual scale (HPA will override)
kubectl scale deployment sentiment-api --replicas=5

# Check HPA status
kubectl get hpa sentiment-api-hpa

# Watch HPA in real-time
kubectl get hpa sentiment-api-hpa -w
watch kubectl get hpa sentiment-api-hpa

# Disable HPA temporarily
kubectl delete hpa sentiment-api-hpa

# Re-enable HPA
kubectl apply -f deployment.yaml

Load Testing

# Install hey
go install github.com/rakyll/hey@latest

# Run load test
hey -z 2m -c 20 -m POST \
    -H "Content-Type: application/json" \
    -d '{"request": {"text": "Load test", "request_id": null}}' \
    http://localhost:8080/predict

# Using Apache Bench (request.json should contain the same JSON payload as above)
ab -n 1000 -c 10 -p request.json -T application/json \
   http://localhost:8080/predict

# Watch HPA respond
watch kubectl get hpa sentiment-api-hpa

Solution File

If you get stuck, a complete reference implementation is available:

  • solution/deployment.yaml - All TODOs completed with detailed comments

Note: Try to complete the exercise on your own first! The solution is heavily commented to explain every configuration.

Next Steps

Once you've completed the exercise and verified your deployment:

Module 4: API Gateway with Go

In Module 4, you'll build a high-performance API gateway in Go to sit in front of your ML service!

Key Takeaways

What We Learned

  • Kubernetes Deployments: Manage containerized ML workloads
  • Resource Management: Prevent resource starvation and overcommit
  • Health Probes: Enable self-healing and zero-downtime updates
  • Auto-scaling: Automatically adjust capacity based on load
  • High Availability: Survive node failures and maintenance
  • Security: Run with least privilege, harden containers
  • Production Patterns: Real-world best practices for ML deployments

Best Practices

  • Always set resource requests (required for HPA)
  • Use all three probe types (startup, liveness, readiness)
  • Configure PDB to maintain availability during updates
  • Run containers as non-root
  • Use read-only root filesystem
  • Externalize configuration with ConfigMap
  • Spread pods across nodes with anti-affinity
  • Set appropriate HPA targets for ML workloads (70% CPU, 80% memory)
  • Use conservative scale-down policies (ML models take time to load)

Having issues? Check the Troubleshooting section or review the solution file!


Navigation

Previous: Module 2: Model Packaging & Serving
Next: Module 4: API Gateway & Polyglot Architecture
