
Rabieh Fashwall edited this page Nov 27, 2025 · 1 revision

Module 1: Model Training & Experiment Tracking

What You'll Build

By the end of this module, you'll have:

  • ✅ A production-ready sentiment analysis model trained on the IMDB dataset
  • ✅ Complete experiment tracking pipeline with MLflow
  • ✅ Registered model versions in MLflow Model Registry (optional advanced section)
  • ✅ Model lifecycle management with aliases and automated promotion (optional)
  • ✅ Full test suite validating your implementation

Overview

Learn ML experiment tracking by training a sentiment analysis model with progressively more production features. This module uses a scaffolded approach where you fill in specific code blanks rather than writing everything from scratch.

Learning Objectives

By the end of this module, you will:

  • ✅ Train NLP models using Hugging Face transformers
  • ✅ Track experiments with MLflow
  • ✅ Log parameters, metrics, and models
  • ✅ Register models in MLflow model registry
  • ✅ Manage model lifecycle with aliases and automated promotion

Part 1: Setup & Prerequisites

Prerequisites

  • Python 3.9+ installed
  • Virtual environment activated
  • Basic understanding of ML concepts

Setup

1. Navigate to Module Directory

cd modules/module-1

2. Install Dependencies

pip install -r requirements.txt

This installs:

  • transformers - Hugging Face transformers library
  • datasets - Hugging Face datasets library
  • mlflow - Experiment tracking
  • scikit-learn - Metrics computation

3. Start MLflow UI (Optional but Recommended)

mlflow ui

Open browser to http://localhost:5000 to view experiments in real-time.

Workshop Structure

Exercise 1: Basic Training
         ↓
Exercise 2: MLflow Tracking & Registry
         ├─ Part 1: Basic Tracking
         └─ Part 2: Model Registry

Each exercise builds on the previous one, adding more capabilities.


Part 2: Core Exercises

Exercise 1: Basic Model Training

Goal

Train a sentiment analysis model using Hugging Face transformers.

What You'll Implement

  • Load pre-trained DistilBERT model and tokenizer
  • Load IMDB sentiment dataset
  • Tokenize text data
  • Configure training
  • Train and evaluate model
  • Save trained model

Instructions

  1. Open the starter file:

  2. Find and fill in 10 TODOs:

    • Look for comments like # YOUR CODE HERE
    • Each TODO has hints showing exactly what function to call
    • Most are 1-3 lines of code
  3. Run your implementation:

    python starter/train_basic.py

Key TODOs to Complete

TODO 1-2: Load model and tokenizer

# Hint: Use AutoModelForSequenceClassification.from_pretrained()
# Hint: Use AutoTokenizer.from_pretrained()

TODO 3: Load IMDB dataset

# Hint: Use load_dataset("imdb")

TODO 4: Tokenize text

# Hint: Call tokenizer() with padding="max_length", truncation=True

TODO 5-7: Set up training

# Hint: Create TrainingArguments, Trainer, call trainer.train()

TODO 8: Evaluate model

# Hint: Call trainer.evaluate()

TODO 9-10: Save model

# Hint: trainer.save_model(), tokenizer.save_pretrained()
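
The evaluation in TODO 8 reports metrics such as accuracy, precision, and F1, computed with scikit-learn in the solution. For intuition, here is a plain-Python sketch of how those metrics fall out of binary predictions and labels; the `binary_metrics` name is illustrative, not the starter file's exact code.

```python
# Illustrative only: how accuracy, precision, recall, and F1 are derived
# from binary predictions and labels (scikit-learn computes these for you).
def binary_metrics(preds, labels):
    tp = sum(1 for p, l in zip(preds, labels) if p == 1 and l == 1)  # true positives
    fp = sum(1 for p, l in zip(preds, labels) if p == 1 and l == 0)  # false positives
    fn = sum(1 for p, l in zip(preds, labels) if p == 0 and l == 1)  # false negatives
    correct = sum(1 for p, l in zip(preds, labels) if p == l)

    accuracy = correct / len(labels)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```
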

Stuck?

  • Check the hints in TODO comments
  • Review solution/train_basic.py for reference

Exercise 2: MLflow Tracking & Model Registry

Goal

Learn experiment tracking and model lifecycle management with MLflow. This exercise has two parts: basic tracking and advanced registry workflow (optional).

What You'll Implement

Part 1 (Required):

  • Import MLflow and transformers integration
  • Set up MLflow experiments
  • Log training hyperparameters
  • Log evaluation metrics
  • Log trained models as artifacts

Part 2 (Advanced/Optional):

  • Register multiple model versions
  • Manage the model lifecycle with MLflow 2.9+ aliases (the replacement for deprecated stages)
  • Load models by alias
  • Implement automated model promotion logic

Part 1: Basic MLflow Tracking

Instructions

  1. Open the starter file:

  2. Find and fill in 8 TODOs (Part 1):

    • TODO 1: Import MLflow
    • TODO 2-4: Log parameters
    • TODO 5: Log training loss
    • TODO 6: Log evaluation metrics
    • TODO 7: Log model
    • TODO 8: Set experiment name
  3. Run your implementation:

    python starter/train_with_mlflow.py
  4. View your run in the MLflow UI at http://localhost:5000

Key TODOs to Complete

TODO 1: Import MLflow

# FILL IN: Import mlflow and mlflow.transformers

TODO 2-4: Log parameters

# FILL IN: Use mlflow.log_param() to log model_name, epochs, batch_size, etc.

TODO 5: Log training loss

# FILL IN: Use mlflow.log_metric("train_loss", value)

TODO 6: Log evaluation metrics

# FILL IN: Use mlflow.log_metric() for eval_loss, accuracy, precision, f1

TODO 7: Log model

# FILL IN: Use mlflow.transformers.log_model()

TODO 8: Set experiment name

# FILL IN: Use mlflow.set_experiment()

What's New?

  • Hyperparameters are recorded for every run
  • Metrics are stored for comparison
  • Models are versioned as artifacts
  • You can compare multiple runs in the UI

Part 2: Model Registry Workflow

When to Complete This Part

  • ✅ You've completed Part 1 successfully
  • ✅ You have extra time
  • ✅ You want to learn model lifecycle management
  • ✅ You need model governance workflows

Instructions

  1. The file is already open (train_with_mlflow.py)

  2. Find and fill in 4 Advanced TODOs (Part 2):

    • Advanced TODO 1: Train and register models
    • Advanced TODO 2: Assign an alias (e.g. champion) to a model version
    • Advanced TODO 3: Load model by alias
    • Advanced TODO 4: Implement automated promotion logic
  3. Run the advanced workflow:

    python starter/train_with_mlflow.py --advanced
  4. View in the MLflow UI at http://localhost:5000 — check the Models tab for registered versions and aliases

Key Concepts

Model Lifecycle with Aliases (MLflow 2.9+):

Register → Set Alias → Load by Alias → Promote
  • champion: Production model serving live traffic
  • challenger: Model being A/B tested
  • staging: Model undergoing validation
  • archived: Old version, no longer used

Loading Models by Alias:

# Load champion model
model_uri = "models:/sentiment-classifier@champion"
model = mlflow.transformers.load_model(model_uri)

# Deployment code doesn't need to know version number!
# When you promote a new model, it's automatically used

Automated Promotion:

# Compare staging vs champion
if staging_accuracy > champion_accuracy:
    # Set new champion
    client.set_registered_model_alias(
        name=model_name,
        alias="champion",
        version=staging_version
    )
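
The comparison above can be factored into a small helper so the promotion rule is testable on its own. This is a sketch, not the starter file's code: `should_promote` and the `min_improvement` threshold are hypothetical names, and the alias reassignment mirrors the `set_registered_model_alias` snippet above.

```python
# Hypothetical helper: decide whether a staging model should replace the champion.
# A small improvement threshold guards against promoting on evaluation noise.
def should_promote(staging_accuracy, champion_accuracy, min_improvement=0.0):
    """Return True if staging beats the champion by more than min_improvement."""
    return staging_accuracy - champion_accuracy > min_improvement

# If it returns True, reassign the alias (mirrors the snippet above):
#   client.set_registered_model_alias(name=model_name, alias="champion",
#                                     version=staging_version)
```

Separating the decision from the MLflow call keeps the promotion policy easy to unit-test and tweak without touching the registry.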

Integration with Module 2

In Module 2, you'll load models from the Registry in BentoML:

import mlflow
import bentoml

@bentoml.service
class SentimentService:
    def __init__(self):
        # Load latest champion model from Registry
        model_uri = "models:/sentiment-classifier@champion"
        self.model = mlflow.transformers.load_model(model_uri)

    @bentoml.api
    def predict(self, text: str) -> dict:
        result = self.model(text)
        return {"sentiment": result[0]["label"]}

Now when you promote a new model to champion in MLflow, your service automatically uses it on restart!


Solutions

Complete reference implementations are available in the solution/ folder:

  • solution/train_basic.py - Exercise 1 solution
  • solution/train_with_mlflow.py - Exercise 2 solution (both Part 1 and Part 2)

Use these if you get stuck or want to compare approaches!


Part 5: Troubleshooting

Issue 1: "ModuleNotFoundError: No module named 'transformers'"

Symptoms: Import errors when running training scripts

Solution:

# Install all required dependencies
pip install -r requirements.txt

# Verify installation
python -c "import transformers; print(transformers.__version__)"

Prevention: Always activate your virtual environment before running scripts.


Issue 2: Training is very slow

Symptoms: Each epoch takes 3-5+ minutes on CPU

Root Cause: Transformer models are computationally expensive. CPU training is significantly slower than GPU.

Solutions:

# Option 1: Reduce training samples
python train_production.py --train_samples 500 --test_samples 100

# Option 2: Reduce epochs
python train_production.py --epochs 1

# Option 3: Use smaller model
python train_production.py --model_name distilbert-base-uncased  # Already default
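
The flags in the commands above are plain argparse options. As a sketch of how such a CLI could be wired (the actual script and its defaults may differ; the defaults shown here are assumptions):

```python
import argparse

# Sketch of the CLI flags used in the commands above; defaults are assumed,
# not taken from the real script.
def build_parser():
    parser = argparse.ArgumentParser(description="Train a sentiment model")
    parser.add_argument("--train_samples", type=int, default=2000,
                        help="Number of training examples to use")
    parser.add_argument("--test_samples", type=int, default=500,
                        help="Number of evaluation examples to use")
    parser.add_argument("--epochs", type=int, default=3,
                        help="Number of training epochs")
    parser.add_argument("--model_name", default="distilbert-base-uncased",
                        help="Hugging Face model checkpoint to fine-tune")
    return parser
```
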

Issue 3: MLflow UI shows no experiments

Symptoms: Browser shows "No experiments" at http://localhost:5000

Solutions:

  1. Run training first: MLflow UI only shows data after runs are created

    python starter/train_with_mlflow.py
  2. Check MLflow directory:

    ls mlruns/
  3. Verify experiment name:

    # Check if experiment exists
    mlflow experiments search
  4. Restart MLflow UI:

    # Kill existing UI
    pkill -f "mlflow ui"
    
    # Restart
    mlflow ui

Issue 4: "Dataset download fails or times out"

Symptoms:

ConnectionError: Couldn't reach the Hugging Face Hub

Solutions:

Option 1: Increase the Hub download timeout

# load_dataset() has no timeout argument; raise the Hub timeout via environment variable
import os
os.environ["HF_HUB_DOWNLOAD_TIMEOUT"] = "120"

from datasets import load_dataset
dataset = load_dataset("imdb")

Option 2: Download once and cache

# Pre-download dataset
python -c "from datasets import load_dataset; load_dataset('imdb')"

# Check cache location
ls ~/.cache/huggingface/datasets/

Option 3: Use manual download

# If network issues persist, download manually:
# https://huggingface.co/datasets/imdb

Still stuck? Check the solution/ folder


Part 6: Reference

Commands Cheat Sheet

Quick Start

# Navigate to module
cd modules/module-1

# Install dependencies
pip install -r requirements.txt

# Start MLflow UI (optional but recommended)
mlflow ui --host 0.0.0.0 --port 5000

# Run basic training
python starter/train_basic.py

# Run with MLflow tracking (Part 1 - basic tracking)
python starter/train_with_mlflow.py

# Run advanced registry workflow (Part 2 - optional)
python starter/train_with_mlflow.py --advanced

Training Commands

# Exercise 1: Basic training
python starter/train_basic.py

# Exercise 2 Part 1: MLflow tracking (default mode)
python starter/train_with_mlflow.py

# Exercise 2 Part 2: Advanced registry workflow
python starter/train_with_mlflow.py --advanced

# Show available options
python starter/train_with_mlflow.py --help

MLflow Commands

# Start MLflow UI
mlflow ui

# Start on specific host/port
mlflow ui --host 0.0.0.0 --port 5001

# List all experiments
mlflow experiments search

# Search runs in experiment
mlflow runs list --experiment-id 1

# Create new experiment
mlflow experiments create --experiment-name my-experiment

# Delete experiment
mlflow experiments delete --experiment-id 2

# View run details
mlflow runs describe --run-id <run-id>

Environment Management

# Create virtual environment
python -m venv venv

# Activate virtual environment
source venv/bin/activate  # macOS/Linux
venv\Scripts\activate     # Windows

# Install requirements
pip install -r requirements.txt

# List installed packages
pip list

# Deactivate virtual environment
deactivate

What You'll Build in Module 2

Building on Module 1's trained model, Module 2 adds:

  • ✅ REST API endpoints for predictions
  • ✅ Request validation and error handling
  • ✅ Docker containerization
  • ✅ Kubernetes deployment
  • ✅ Load testing and performance monitoring

Key Takeaways

What We Learned

  • HuggingFace Transformers: Load and fine-tune pre-trained models
  • MLflow Tracking: Track experiments, parameters, and metrics
  • Model Registry: Version and manage trained models
  • Production Patterns: CLI arguments, error handling, logging

Next Steps

  • Module 2: Package models with BentoML for serving
  • Module 3: Deploy to Kubernetes clusters
  • Module 4: Build Go API gateways

Navigation

| Previous | Home | Next |
| --- | --- | --- |
| Module 0: Environment Setup | 🏠 Home | Module 2: Model Packaging & Serving |

Quick Links


MLOps Workshop | GitHub Repository
