
Module 3: Kubernetes Deployment

What You'll Build

By the end of this module, you'll have:

  • ✅ Production-ready Kubernetes deployment for your ML service
  • ✅ Auto-scaling infrastructure that responds to traffic patterns
  • ✅ High-availability setup surviving node failures
  • ✅ Security-hardened containers running as non-root
  • ✅ Complete health monitoring with all probe types
  • ✅ Zero-downtime deployment capabilities
  • ✅ Resource-optimized configuration preventing waste

Real-World Impact:

  • Handle 10x traffic spikes automatically with HPA
  • Maintain 99.9%+ uptime with proper health checks
  • Pass security audits with hardened container configuration
  • Deploy updates without service interruption
  • Optimize costs by scaling down during low traffic

Learning Objectives

By the end of this module, you will:

  • ✅ Deploy ML models to Kubernetes with production-ready configuration
  • ✅ Configure resource limits and requests
  • ✅ Implement all three health probe types
  • ✅ Set up auto-scaling with Horizontal Pod Autoscaler
  • ✅ Ensure high availability with Pod Disruption Budget
  • ✅ Apply security best practices (non-root, read-only filesystem)
  • ✅ Use ConfigMap for externalized configuration
  • ✅ Implement pod anti-affinity for fault tolerance

Part 1: Setup & Prerequisites

Prerequisites

  • Completed Module 2 (BentoML service containerized)
  • kind installed (Kubernetes in Docker)
  • kubectl installed and configured
  • Docker image from Module 2: sentiment-api:v1

Workshop Format

Single Exercise: Production-Ready Deployment
├─ ConfigMap for configuration
├─ Deployment with proper resource management
├─ Service for network access
├─ Health probes (startup, liveness, readiness)
├─ Horizontal Pod Autoscaler (HPA)
├─ Pod Disruption Budget (PDB)
├─ Security hardening
└─ High availability (anti-affinity)

What does "scaffolded" mean?

  • 80-90% of YAML is provided for you
  • You fill in ~10-20% (20 specific configuration values)
  • Each TODO has inline hints showing exactly what to use

Part 2: Quick Start

1. Setup kind Cluster

# Create cluster
kind create cluster --config modules/module-0/kind.yaml

# Verify
kubectl cluster-info
kubectl get nodes

2. Build and Load Docker Image

# Build image (from Module 2)
cd ../module-2
bentoml build
bentoml containerize sentiment_service:latest -t sentiment-api:v1

# Load into kind
cd ../module-3
kind load docker-image sentiment-api:v1 --name mlops-workshop

# Verify image is loaded
docker exec -it mlops-workshop-control-plane crictl images | grep sentiment

3. Install Metrics Server (Required for HPA)

# Install
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

# Patch for kind (disable TLS verification)
kubectl patch deployment metrics-server -n kube-system --type='json' \
  -p='[{"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--kubelet-insecure-tls"}]'

# Verify
kubectl get deployment metrics-server -n kube-system

4. Complete the Exercise

Production-Ready Deployment

Goal: Deploy your ML service to Kubernetes with complete production configuration.

cd starter

# Open the file
open deployment.yaml

# Find and fill in 20 TODOs
# Look for: # YOUR CODE HERE

# Apply the manifest
kubectl apply -f deployment.yaml

# Verify all resources
kubectl get deployments,pods,svc,hpa,pdb -l app=sentiment-api

Test the API:

# Port forward
kubectl port-forward svc/sentiment-api-service 8080:80

# In another terminal, test prediction
curl -X POST http://localhost:8080/predict \
     -H "Content-Type: application/json" \
     -d '{"request": {"text": "Kubernetes is awesome!","request_id": null}}'

# Expected response:
# {"sentiment": "POSITIVE", "score": 0.9998}

# Test health endpoint
curl http://localhost:8080/health

Key TODOs to Complete

Configuration (TODOs 1-2):

  • Set BentoML port and worker count in ConfigMap
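
The finished ConfigMap might look roughly like the sketch below; the key names BENTOML_PORT and BENTOML_WORKERS are illustrative, so use the exact keys from the starter file:

# Hypothetical key names/values -- check starter/deployment.yaml for the real ones
apiVersion: v1
kind: ConfigMap
metadata:
  name: sentiment-api-config
  labels:
    app: sentiment-api
data:
  BENTOML_PORT: "3000"    # BentoML's default HTTP port
  BENTOML_WORKERS: "1"    # one worker per pod; add capacity via replicas instead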

Deployment Basics (TODOs 3-6):

  • Set deployment name, replicas, selector, and pod labels

Security (TODOs 7-8, 15-16):

  • Configure pod and container security contexts
  • Run as non-root user (UID 1000)
  • Read-only root filesystem

High Availability (TODOs 9-10):

  • Configure pod anti-affinity to spread across nodes
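
A typical preferred anti-affinity stanza is sketched below; topologyKey: kubernetes.io/hostname tells the scheduler to avoid placing two sentiment-api pods on the same node when possible:

affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchLabels:
            app: sentiment-api
        topologyKey: kubernetes.io/hostname   # spread across nodes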

Container Configuration (TODOs 11-14):

  • Set container name, image, pull policy, and port

Resources (TODOs 17-18):

  • Configure CPU and memory requests (critical for HPA)

Health Probes (TODOs 19-20):

  • Configure startup and readiness probes

Part 3: Key Concepts

Key Concepts Covered

Kubernetes Fundamentals

  • Deployments: Manage replica Pods, rolling updates
  • Services: Stable network endpoint, load balancing
  • ConfigMaps: Externalized configuration
  • Labels & Selectors: Resource organization
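
As a sketch of how these pieces fit together, the Service below selects pods by label and maps the service port used in this module's port-forward commands to the container port (targetPort 3000 assumes BentoML's default port):

apiVersion: v1
kind: Service
metadata:
  name: sentiment-api-service
spec:
  type: ClusterIP
  selector:
    app: sentiment-api    # traffic is routed to pods carrying this label
  ports:
  - port: 80              # service port (kubectl port-forward 8080:80)
    targetPort: 3000      # container port where BentoML listens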

Resource Management

  • Requests: Guaranteed minimum for scheduling
  • Limits: Maximum allowed (prevents exhaustion)
  • QoS Classes: Guaranteed, Burstable, BestEffort
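
For example, a request/limit pair like this sketch yields the Burstable QoS class (requests set, limits higher); the numbers are illustrative and depend on your model's footprint:

resources:
  requests:
    cpu: 500m       # scheduling guarantee; HPA utilization is computed against this
    memory: 1Gi
  limits:
    cpu: "1"        # CPU above the limit is throttled
    memory: 2Gi     # memory above the limit gets the container OOM-killed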

Health and Reliability

  • Startup Probes: Handle slow ML model loading (30-60s)
  • Liveness Probes: Detect and restart dead containers
  • Readiness Probes: Control traffic routing
  • Self-healing: Automatic recovery
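
A hedged sketch of all three probes, assuming a /health endpoint on BentoML's default port 3000 (tune the timings to your model's actual load time):

startupProbe:
  httpGet:
    path: /health
    port: 3000
  periodSeconds: 5
  failureThreshold: 24    # up to 120s for model loading before the other probes take over
livenessProbe:
  httpGet:
    path: /health
    port: 3000
  periodSeconds: 10
  failureThreshold: 3     # three consecutive failures trigger a container restart
readinessProbe:
  httpGet:
    path: /health
    port: 3000
  periodSeconds: 5        # failing pods are removed from the Service, not restarted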

Auto-scaling

  • HPA (Horizontal Pod Autoscaler): Scale based on CPU/memory
  • Metrics: CPU 70%, Memory 80% targets
  • Scaling Policies: Stabilization windows, rate limits
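
An HPA matching those targets might look like this sketch (autoscaling/v2 API; the min/max replicas and scale-down window are illustrative):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: sentiment-api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: sentiment-api
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300   # wait 5 minutes of low load before scaling down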

High Availability

  • Pod Disruption Budget: Maintain availability during updates
  • Pod Anti-affinity: Spread pods across nodes
  • Rolling Updates: Zero-downtime deployments
  • Fault Tolerance: Survive node failures
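
A minimal PDB sketch: with minAvailable: 1, voluntary disruptions such as node drains can never take down the last pod:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: sentiment-api-pdb
spec:
  minAvailable: 1           # at least one pod must survive voluntary disruptions
  selector:
    matchLabels:
      app: sentiment-api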

Security

  • Non-root Containers: Reduce attack surface
  • Read-only Filesystem: Prevent file modifications
  • Dropped Capabilities: Minimal privileges
  • Security Contexts: Pod and container hardening
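
Combined, that hardening might look like this sketch inside the pod spec (UID 1000 matches the workshop requirement; see Issue 3 in Troubleshooting if the image conflicts with these settings):

securityContext:              # pod-level
  runAsNonRoot: true
  runAsUser: 1000
  fsGroup: 1000
containers:
- name: sentiment-api
  securityContext:            # container-level
    allowPrivilegeEscalation: false
    readOnlyRootFilesystem: true
    capabilities:
      drop: ["ALL"]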

Part 4: Testing & Validation

Common Commands

# View all resources
kubectl get all -l app=sentiment-api

# Describe resources
kubectl describe deployment sentiment-api
kubectl describe pod <pod-name>
kubectl describe hpa sentiment-api-hpa

# Monitor resources
kubectl top pod -l app=sentiment-api
kubectl get hpa -w

# View logs
kubectl logs -l app=sentiment-api -f

# Port forwarding
kubectl port-forward svc/sentiment-api-service 8080:80

# Manual scaling (HPA will override)
kubectl scale deployment sentiment-api --replicas=5

# Delete all resources
kubectl delete -f deployment.yaml

Part 5: Troubleshooting

Issue 1: Pods stuck in "Pending"

Symptoms:

kubectl get pods
# Shows: STATUS=Pending for extended period

Root Causes:

  1. Image not loaded into kind cluster
  2. Insufficient cluster resources
  3. Node selector/affinity constraints not met

Solutions:

Check 1: Verify image is loaded

# Check if image exists in kind
docker exec -it mlops-workshop-control-plane crictl images | grep sentiment

# If not found, load it
kind load docker-image sentiment-api:v1 --name mlops-workshop

# Verify again
docker exec -it mlops-workshop-control-plane crictl images | grep sentiment

Check 2: Inspect pod events

# Get detailed information
kubectl describe pod <pod-name>

# Look for events like:
# - "FailedScheduling: 0/1 nodes available"
# - "ImagePullBackOff"
# - "Insufficient cpu/memory"

Check 3: Verify cluster resources

# Check node resources
kubectl describe nodes

# Check resource requests
kubectl describe deployment sentiment-api

# If insufficient, reduce requests in deployment.yaml

Issue 2: Pods stuck in "ImagePullBackOff"

Symptoms:

kubectl get pods
# Shows: STATUS=ImagePullBackOff or ErrImagePull

Root Cause: Wrong imagePullPolicy for local kind cluster

Solution:

# Check current imagePullPolicy
kubectl get deployment sentiment-api -o yaml | grep imagePullPolicy

# Should be: imagePullPolicy: Never (for kind)
# Fix in deployment.yaml TODO 13

# If set to "Always" or "IfNotPresent", change to "Never"
# Then reapply:
kubectl apply -f deployment.yaml

# Force restart pods
kubectl rollout restart deployment sentiment-api

Alternative: Rebuild and reload image

cd ../module-2
bentoml build
bentoml containerize sentiment_service:latest -t sentiment-api:v1
cd ../module-3
kind load docker-image sentiment-api:v1 --name mlops-workshop
kubectl delete pods -l app=sentiment-api

Issue 3: Security context errors

Symptoms:

Error: container has runAsNonRoot and image will run as root
Error: container has runAsNonRoot and image has non-numeric user

Root Cause: The default BentoML image runs as root, which conflicts with runAsNonRoot: true

Solutions:

Option 1: Adjust security context (for workshop)

# In deployment.yaml, use specific UID
securityContext:
  runAsUser: 1000
  runAsNonRoot: true
  # Remove runAsGroup if causing issues

Option 2: Build custom non-root image (production)

# In your BentoML project
# Create custom Dockerfile
FROM bentoml/bento-server:latest

# Create non-root user
RUN useradd -m -u 1000 bentouser && \
    chown -R bentouser:bentouser /home/bentouser

USER bentouser

# Rest of your build...

Option 3: Relax constraints temporarily

# For local testing only
securityContext:
  # Comment out runAsNonRoot temporarily
  # runAsNonRoot: true
  readOnlyRootFilesystem: false

Still stuck? Check the solution file


Part 6: Reference

Commands Cheat Sheet

Quick Start

# Create kind cluster (same config as Part 2)
kind create cluster --config modules/module-0/kind.yaml

# Load image
kind load docker-image sentiment-api:v1 --name mlops-workshop

# Apply deployment
kubectl apply -f deployment.yaml

# Check status
kubectl get all -l app=sentiment-api

Logs and Debugging

# View pod logs
kubectl logs <pod-name>

# View logs from previous crashed container
kubectl logs <pod-name> --previous

# Follow logs in real-time
kubectl logs -f <pod-name>

# Logs from all pods with label
kubectl logs -l app=sentiment-api --all-containers=true

# Tail last 50 lines
kubectl logs <pod-name> --tail=50

# Logs since last 1 hour
kubectl logs <pod-name> --since=1h

# Exec into pod
kubectl exec -it <pod-name> -- /bin/bash
kubectl exec -it <pod-name> -- sh  # if bash not available

# Run command in pod
kubectl exec <pod-name> -- curl localhost:3000/health

Port Forwarding and Access

# Port forward service
kubectl port-forward svc/sentiment-api-service 8080:80

# Port forward in background
kubectl port-forward svc/sentiment-api-service 8080:80 &

# Test API
curl -X POST http://localhost:8080/predict \
     -H "Content-Type: application/json" \
     -d '{"request": {"text": "Test", "request_id": null}}'

Scaling Operations

# Manual scale (HPA will override)
kubectl scale deployment sentiment-api --replicas=5

# Check HPA status
kubectl get hpa sentiment-api-hpa

# Watch HPA in real-time
kubectl get hpa sentiment-api-hpa -w
watch kubectl get hpa sentiment-api-hpa

# Disable HPA temporarily
kubectl delete hpa sentiment-api-hpa

# Re-enable HPA
kubectl apply -f deployment.yaml

Load Testing

# Install hey
go install github.com/rakyll/hey@latest

# Run load test
hey -z 2m -c 20 -m POST \
    -H "Content-Type: application/json" \
    -d '{"request": {"text": "Load test", "request_id": null}}' \
    http://localhost:8080/predict

# Using Apache Bench (request.json should contain the same JSON payload as above)
ab -n 1000 -c 10 -p request.json -T application/json \
   http://localhost:8080/predict

# Watch HPA respond
watch kubectl get hpa sentiment-api-hpa

Solution File

If you get stuck, a complete reference implementation is available:

  • solution/deployment.yaml - All TODOs completed with detailed comments

Note: Try to complete the exercise on your own first! The solution is heavily commented to explain every configuration.

Next Steps

Once you've completed the exercise and verified your deployment:

Module 4: API Gateway with Go

In Module 4, you'll build a high-performance API gateway in Go to sit in front of your ML service!

Key Takeaways

What We Learned

  • Kubernetes Deployments: Manage containerized ML workloads
  • Resource Management: Prevent resource starvation and overcommit
  • Health Probes: Enable self-healing and zero-downtime updates
  • Auto-scaling: Automatically adjust capacity based on load
  • High Availability: Survive node failures and maintenance
  • Security: Run with least privilege, harden containers
  • Production Patterns: Real-world best practices for ML deployments

Best Practices

  • Always set resource requests (required for HPA)
  • Use all three probe types (startup, liveness, readiness)
  • Configure PDB to maintain availability during updates
  • Run containers as non-root
  • Use read-only root filesystem
  • Externalize configuration with ConfigMap
  • Spread pods across nodes with anti-affinity
  • Set appropriate HPA targets for ML workloads (70% CPU, 80% memory)
  • Use conservative scale-down policies (ML models take time to load)

Having issues? Check the Troubleshooting section or review the solution file!


Navigation

Previous: Module 2: Model Packaging & Serving
Next: Module 4: API Gateway & Polyglot Architecture
