
Rabieh Fashwall edited this page Nov 27, 2025 · 1 revision

Module 1: Model Training & Experiment Tracking

What You'll Build

By the end of this module, you'll have:

  • ✅ A production-ready sentiment analysis model trained on the IMDB dataset
  • ✅ Complete experiment tracking pipeline with MLflow
  • ✅ Registered model versions in MLflow Model Registry (optional advanced section)
  • ✅ Model lifecycle management with aliases and automated promotion (optional)
  • ✅ Full test suite validating your implementation

Overview

Learn ML experiment tracking by training a sentiment analysis model with progressively more production features. This module uses a scaffolded approach where you fill in specific code blanks rather than writing everything from scratch.

Learning Objectives

By the end of this module, you will:

  • ✅ Train NLP models using Hugging Face transformers
  • ✅ Track experiments with MLflow
  • ✅ Log parameters, metrics, and models
  • ✅ Register models in MLflow model registry
  • ✅ Manage model lifecycle with aliases and automated promotion

Part 1: Setup & Prerequisites

Prerequisites

  • Python 3.9+ installed
  • Virtual environment activated
  • Basic understanding of ML concepts

Setup

1. Navigate to Module Directory

cd modules/module-1

2. Install Dependencies

pip install -r requirements.txt

This installs:

  • transformers - Hugging Face transformers library
  • datasets - Hugging Face datasets library
  • mlflow - Experiment tracking
  • scikit-learn - Metrics computation

3. Start MLflow UI (Optional but Recommended)

mlflow ui

Open browser to http://localhost:5000 to view experiments in real-time.

Workshop Structure

Exercise 1: Basic Training
         ↓
Exercise 2: MLflow Tracking & Registry
         ├─ Part 1: Basic Tracking
         └─ Part 2: Model Registry

Each exercise builds on the previous one, adding more capabilities.


Part 2: Core Exercises

Exercise 1: Basic Model Training

Goal

Train a sentiment analysis model using Hugging Face transformers.

What You'll Implement

  • Load pre-trained DistilBERT model and tokenizer
  • Load IMDB sentiment dataset
  • Tokenize text data
  • Configure training
  • Train and evaluate model
  • Save trained model

Instructions

  1. Open the starter file:

  2. Find and fill in 10 TODOs:

    • Look for comments like # YOUR CODE HERE
    • Each TODO has hints showing exactly what function to call
    • Most are 1-3 lines of code
  3. Run your implementation:

    python starter/train_basic.py

Key TODOs to Complete

TODO 1-2: Load model and tokenizer

# Hint: Use AutoModelForSequenceClassification.from_pretrained()
# Hint: Use AutoTokenizer.from_pretrained()

TODO 3: Load IMDB dataset

# Hint: Use load_dataset("imdb")

TODO 4: Tokenize text

# Hint: Call tokenizer() with padding="max_length", truncation=True

TODO 5-7: Set up training

# Hint: Create TrainingArguments, Trainer, call trainer.train()

TODO 8: Evaluate model

# Hint: Call trainer.evaluate()

TODO 9-10: Save model

# Hint: trainer.save_model(), tokenizer.save_pretrained()
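
The evaluation in TODO 8 reports metrics such as accuracy, precision, and F1, computed with scikit-learn in the solution. For intuition, here is a plain-Python sketch of how those metrics fall out of binary predictions and labels; the `binary_metrics` name is illustrative, not the starter file's exact code.

```python
# Illustrative only: how accuracy, precision, recall, and F1 are derived
# from binary predictions and labels (scikit-learn computes these for you).
def binary_metrics(preds, labels):
    tp = sum(1 for p, l in zip(preds, labels) if p == 1 and l == 1)  # true positives
    fp = sum(1 for p, l in zip(preds, labels) if p == 1 and l == 0)  # false positives
    fn = sum(1 for p, l in zip(preds, labels) if p == 0 and l == 1)  # false negatives
    correct = sum(1 for p, l in zip(preds, labels) if p == l)

    accuracy = correct / len(labels)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```
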

Stuck?

  • Check the hints in TODO comments
  • Review solution/train_basic.py for reference

Exercise 2: MLflow Tracking & Model Registry

Goal

Learn experiment tracking and model lifecycle management with MLflow. This exercise has two parts: basic tracking and advanced registry workflow (optional).

What You'll Implement

Part 1 (Required):

  • Import MLflow and transformers integration
  • Set up MLflow experiments
  • Log training hyperparameters
  • Log evaluation metrics
  • Log trained models as artifacts

Part 2 (Advanced/Optional):

  • Register multiple model versions
  • Manage the model lifecycle with MLflow 2.9+ aliases (the replacement for deprecated stages)
  • Load models by alias
  • Implement automated model promotion logic

Part 1: Basic MLflow Tracking

Instructions

  1. Open the starter file:

  2. Find and fill in 8 TODOs (Part 1):

    • TODO 1: Import MLflow
    • TODO 2-4: Log parameters
    • TODO 5: Log training loss
    • TODO 6: Log evaluation metrics
    • TODO 7: Log model
    • TODO 8: Set experiment name
  3. Run your implementation:

    python starter/train_with_mlflow.py
  4. View your run in the MLflow UI at http://localhost:5000

Key TODOs to Complete

TODO 1: Import MLflow

# FILL IN: Import mlflow and mlflow.transformers

TODO 2-4: Log parameters

# FILL IN: Use mlflow.log_param() to log model_name, epochs, batch_size, etc.

TODO 5: Log training loss

# FILL IN: Use mlflow.log_metric("train_loss", value)

TODO 6: Log evaluation metrics

# FILL IN: Use mlflow.log_metric() for eval_loss, accuracy, precision, f1

TODO 7: Log model

# FILL IN: Use mlflow.transformers.log_model()

TODO 8: Set experiment name

# FILL IN: Use mlflow.set_experiment()

What's New?

  • Hyperparameters are recorded for every run
  • Metrics are stored for comparison
  • Models are versioned as artifacts
  • You can compare multiple runs in the UI

Part 2: Model Registry Workflow

When to Complete This Part

  • ✅ You've completed Part 1 successfully
  • ✅ You have extra time
  • ✅ You want to learn model lifecycle management
  • ✅ You need model governance workflows

Instructions

  1. The file is already open (train_with_mlflow.py)

  2. Find and fill in 4 Advanced TODOs (Part 2):

    • Advanced TODO 1: Train and register models
    • Advanced TODO 2: Assign an alias (e.g. champion) to a model version
    • Advanced TODO 3: Load model by alias
    • Advanced TODO 4: Implement automated promotion logic
  3. Run the advanced workflow:

    python starter/train_with_mlflow.py --advanced
  4. View in the MLflow UI at http://localhost:5000 — check the Models tab for registered versions and aliases

Key Concepts

Model Lifecycle with Aliases (MLflow 2.9+):

Register → Set Alias → Load by Alias → Promote
  • champion: Production model serving live traffic
  • challenger: Model being A/B tested
  • staging: Model undergoing validation
  • archived: Old version, no longer used

Loading Models by Alias:

# Load champion model
model_uri = "models:/sentiment-classifier@champion"
model = mlflow.transformers.load_model(model_uri)

# Deployment code doesn't need to know version number!
# When you promote a new model, it's automatically used

Automated Promotion:

# Compare staging vs champion
if staging_accuracy > champion_accuracy:
    # Set new champion
    client.set_registered_model_alias(
        name=model_name,
        alias="champion",
        version=staging_version
    )
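
The comparison above can be factored into a small helper so the promotion rule is testable on its own. This is a sketch, not the starter file's code: `should_promote` and the `min_improvement` threshold are hypothetical names, and the alias reassignment mirrors the `set_registered_model_alias` snippet above.

```python
# Hypothetical helper: decide whether a staging model should replace the champion.
# A small improvement threshold guards against promoting on evaluation noise.
def should_promote(staging_accuracy, champion_accuracy, min_improvement=0.0):
    """Return True if staging beats the champion by more than min_improvement."""
    return staging_accuracy - champion_accuracy > min_improvement

# If it returns True, reassign the alias (mirrors the snippet above):
#   client.set_registered_model_alias(name=model_name, alias="champion",
#                                     version=staging_version)
```

Separating the decision from the MLflow call keeps the promotion policy easy to unit-test and tweak without touching the registry.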

Integration with Module 2

In Module 2, you'll load models from the Registry in BentoML:

import mlflow
import bentoml

@bentoml.service
class SentimentService:
    def __init__(self):
        # Load latest champion model from Registry
        model_uri = "models:/sentiment-classifier@champion"
        self.model = mlflow.transformers.load_model(model_uri)

    @bentoml.api
    def predict(self, text: str) -> dict:
        result = self.model(text)
        return {"sentiment": result[0]["label"]}

Now when you promote a new model to champion in MLflow, your service automatically uses it on restart!


Solutions

Complete reference implementations are available in the solution/ folder:

  • solution/train_basic.py - Exercise 1 solution
  • solution/train_with_mlflow.py - Exercise 2 solution (both Part 1 and Part 2)

Use these if you get stuck or want to compare approaches!


Part 5: Troubleshooting

Issue 1: "ModuleNotFoundError: No module named 'transformers'"

Symptoms: Import errors when running training scripts

Solution:

# Install all required dependencies
pip install -r requirements.txt

# Verify installation
python -c "import transformers; print(transformers.__version__)"

Prevention: Always activate your virtual environment before running scripts.


Issue 2: Training is very slow

Symptoms: Each epoch takes 3-5+ minutes on CPU

Root Cause: Transformer models are computationally expensive. CPU training is significantly slower than GPU.

Solutions:

# Option 1: Reduce training samples
python train_production.py --train_samples 500 --test_samples 100

# Option 2: Reduce epochs
python train_production.py --epochs 1

# Option 3: Use smaller model
python train_production.py --model_name distilbert-base-uncased  # Already default
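
The flags in the commands above are plain argparse options. As a sketch of how such a CLI could be wired (the actual script and its defaults may differ; the defaults shown here are assumptions):

```python
import argparse

# Sketch of the CLI flags used in the commands above; defaults are assumed,
# not taken from the real script.
def build_parser():
    parser = argparse.ArgumentParser(description="Train a sentiment model")
    parser.add_argument("--train_samples", type=int, default=2000,
                        help="Number of training examples to use")
    parser.add_argument("--test_samples", type=int, default=500,
                        help="Number of evaluation examples to use")
    parser.add_argument("--epochs", type=int, default=3,
                        help="Number of training epochs")
    parser.add_argument("--model_name", default="distilbert-base-uncased",
                        help="Hugging Face model checkpoint to fine-tune")
    return parser
```
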

Issue 3: MLflow UI shows no experiments

Symptoms: Browser shows "No experiments" at http://localhost:5000

Solutions:

  1. Run training first: MLflow UI only shows data after runs are created

    python starter/train_with_mlflow.py
  2. Check MLflow directory:

    ls mlruns/
  3. Verify experiment name:

    # Check if experiment exists
    mlflow experiments search
  4. Restart MLflow UI:

    # Kill existing UI
    pkill -f "mlflow ui"
    
    # Restart
    mlflow ui

Issue 4: "Dataset download fails or times out"

Symptoms:

ConnectionError: Couldn't reach the Hugging Face Hub

Solutions:

Option 1: Increase the Hub download timeout

# load_dataset() has no timeout argument; raise the Hub timeout via environment variable
import os
os.environ["HF_HUB_DOWNLOAD_TIMEOUT"] = "120"

from datasets import load_dataset
dataset = load_dataset("imdb")

Option 2: Download once and cache

# Pre-download dataset
python -c "from datasets import load_dataset; load_dataset('imdb')"

# Check cache location
ls ~/.cache/huggingface/datasets/

Option 3: Use manual download

# If network issues persist, download manually:
# https://huggingface.co/datasets/imdb

Still stuck? Check the solution/ folder


Part 6: Reference

Commands Cheat Sheet

Quick Start

# Navigate to module
cd modules/module-1

# Install dependencies
pip install -r requirements.txt

# Start MLflow UI (optional but recommended)
mlflow ui --host 0.0.0.0 --port 5000

# Run basic training
python starter/train_basic.py

# Run with MLflow tracking (Part 1 - basic tracking)
python starter/train_with_mlflow.py

# Run advanced registry workflow (Part 2 - optional)
python starter/train_with_mlflow.py --advanced

Training Commands

# Exercise 1: Basic training
python starter/train_basic.py

# Exercise 2 Part 1: MLflow tracking (default mode)
python starter/train_with_mlflow.py

# Exercise 2 Part 2: Advanced registry workflow
python starter/train_with_mlflow.py --advanced

# Show available options
python starter/train_with_mlflow.py --help

MLflow Commands

# Start MLflow UI
mlflow ui

# Start on specific host/port
mlflow ui --host 0.0.0.0 --port 5001

# List all experiments
mlflow experiments search

# Search runs in experiment
mlflow runs list --experiment-id 1

# Create new experiment
mlflow experiments create --experiment-name my-experiment

# Delete experiment
mlflow experiments delete --experiment-id 2

# View run details
mlflow runs describe --run-id <run-id>

Environment Management

# Create virtual environment
python -m venv venv

# Activate virtual environment
source venv/bin/activate  # macOS/Linux
venv\Scripts\activate     # Windows

# Install requirements
pip install -r requirements.txt

# List installed packages
pip list

# Deactivate virtual environment
deactivate

What You'll Build in Module 2

Building on Module 1's trained model, Module 2 adds:

  • ✅ REST API endpoints for predictions
  • ✅ Request validation and error handling
  • ✅ Docker containerization
  • ✅ Kubernetes deployment
  • ✅ Load testing and performance monitoring

Key Takeaways

What We Learned

  • HuggingFace Transformers: Load and fine-tune pre-trained models
  • MLflow Tracking: Track experiments, parameters, and metrics
  • Model Registry: Version and manage trained models
  • Production Patterns: CLI arguments, error handling, logging

Next Steps

  • Module 2: Package models with BentoML for serving
  • Module 3: Deploy to Kubernetes clusters
  • Module 4: Build Go API gateways

Navigation

| Previous | Home | Next |
| --- | --- | --- |
| Module 0: Environment Setup | 🏠 Home | Module 2: Model Packaging & Serving |

Quick Links


MLOps Workshop | GitHub Repository
