This project demonstrates how to build, train, tune, and deploy a deep learning model for handwritten digit recognition using the MNIST dataset, leveraging Azure Machine Learning. The workflow includes data exploration, model training, hyperparameter tuning, and deploying a real-time inference API.
The goal is to train a neural network to classify handwritten digits (0-9) from the MNIST dataset using Azure Machine Learning. The project covers the full ML lifecycle from data exploration to deployment of a REST API for inference.
- An Azure account and access to Azure Machine Learning Studio.
- Python 3.8+ environment, preferably via Anaconda.
- VS Code and Jupyter Notebook.
- Azure ML Python SDK v2
- Postman for API testing.
- Dataset: Uses the MNIST dataset (available in TensorFlow or via Azure Storage).
- Explore in Python: Visualize samples, reshape, normalize, and convert images to JSON for model input. (exploreMNISTDataset.py)
import tensorflow as tf
import matplotlib.pyplot as plt
import json
# Load the hand written Dataset
(X_train, Y_train), (X_test,Y_test) = mnist.load_data()
# Pick a sample to plot
sample = 1200
# Reshape image into a single row
image = X_train[sample]
image_array = image.reshape((1,784))
# Normalize the dataset
image_array_decimal = image_array/255
# Convert to JSON
lists = image_array_decimal.tolist()
json_str = json.dumps(lists)
print(json_str)
# Display the image
plt.imshow(image, cmap='gray')
plt.title(f"Label: {Y_train[sample]}")
plt.axis('off')
plt.show()- Authenticate and connect using
MLClientandDefaultAzureCredential. - Configure your subscription, resource group, and workspace:
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential
ml_client = MLClient(
DefaultAzureCredential(), "SUBSCRIPTION_ID", "RESOURCE_GROUP", "WORKSPACE_NAME"
)Note: Make sute to login to azure using terminal with
az login- Create or use an existing GPU compute cluster for training.
- Set up environment using curated AzureML TensorFlow environment.
from azure.ai.ml.entities import AmlCompute
gpu_compute_target = "YOUR_COMPUTE_CLUSTER"
curated_env_name = "AzureML-tensorflow-2.12-cuda11@latest"- Prepare the MNIST data path (Azure Storage).
webpath = "wasbs://..."- Build command job to run the training script using specified parameters.
from azure.ai.ml import command, Input
from azure.ai.ml.entities import UserIdentityConfiguration
job = command(
inputs=dict(
data_folder=Input(type="uri_folder", path="web_path"),
batch_size=64,
first_layer_neurons=256,
second_layer_neurons=128,
learning_rate=0.01,
),
compute=gpu_compute_target,
environment=curated_env_name,
code="./",
command="python deep_learning_training_script.py --data-folder ${{inputs.data_folder}} --batch-size ${{inputs.batch_size}} --first-layer-neurons ${{inputs.first_layer_neurons}} --second-layer-neurons ${{inputs.second_layer_neurons}} --learning-rate ${{inputs.learning_rate}}",
experiment_name="tf-dnn-image-classify",
display_name="tensorflow-classify-mnist-digit-images-with-dnn",
)
ml_client.jobs.create_or_update(job)- Monitor training in Azure ML Studio.
- Configure sweep to search for best batch size, neuron counts, and learning rate.
- Use BanditPolicy for early stopping.
- Run sweep job and select the best performing configuration.
from azure.ai.ml.sweep import Choice, LogUniform
job_for_sweep = job(
batch_size=Choice(values=[32, 64, 128]),
first_layer_neurons=Choice(values=[16, 64, 128, 256, 512]),
second_layer_neurons=Choice(values=[16, 64, 256, 512]),
learning_rate=LogUniform(min_value=-6, max_value=-1),
)
sweep_job = job_for_sweep.sweep(
compute=gpu_compute_target,
sampling_algorithm="random",
primary_metric="validation_acc",
goal="Maximize",
max_total_trials=8,
max_concurrent_trials=4,
early_termination_policy=BanditPolicy(slack_factor=0.1, evaluation_interval=2),
)
ml_client.create_or_update(sweep_job)- Register the best model from the sweep job for future inference.
from azure.ai.ml.entities import Model
model = Model(
path="azureml://jobs/{best_run}/outputs/artifacts/paths/outputs/model/",
name="mnist-handwriting-model",
description="Model created from hyperparameter sweep.",
type="custom_model",
)
registered_model = ml_client.models.create_or_update(model=model)- Deploy model as a real-time endpoint using Azure Container Instance or Kubernetes.
- Configure web service input/output for prediction.
-
Test using Postman:
- Send JSON formatted input to the endpoint.
- View prediction output.
-
Test using Python requests:
- Use the generated code snippet to send requests and parse responses.
The file deep_learning_training_script.py is the core training script executed by Azure ML during the command job submission. In your notebook, you specify this script to run on the remote compute target with all required arguments:
command="python deep_learning_training_script.py --data-folder ${{inputs.data_folder}} --batch-size ${{inputs.batch_size}} --first-layer-neurons ${{inputs.first_layer_neurons}} --second-layer-neurons ${{inputs.second_layer_neurons}} --learning-rate ${{inputs.learning_rate}}"Note: Make sure it is in the same folder as the notebook in your local computer.
Azure ML takes care of provisioning resources, passing parameters, running the script, tracking metrics, and uploading outputs.
- Imports dependencies: TensorFlow, NumPy, MLflow, and utilities for data loading and management.
- Loads and preprocesses the MNIST dataset: Reads compressed
.gzfiles, normalizes images, and prepares labels. - Builds a configurable neural network: Two hidden layers with ReLU activation, output layer for digit classification.
- Defines training logic:
- One-hot encoding for labels.
- Loss and accuracy metrics.
- Training loop: shuffles data, splits into batches, updates weights using gradient descent.
- Tracks metrics and parameters with MLflow.
- Periodically saves checkpoints and finally saves the trained model.
- Supports hyperparameter tuning: Accepts batch size, neuron counts, learning rate as command-line arguments.
- Handles resume-from-checkpoint: Optionally resumes training from a previous checkpoint if specified.
Summary of Workflow:
- Model Definition: A neural network with two hidden layers.
- Data Loading: Preprocess and load the MNIST dataset.
- Training Loop: Shuffle and split data into mini-batches, update weights, log metrics, checkpoint.
- Model Saving: Save final trained model to disk.
- Experiment Logging: Use MLflow to track all parameters, metrics, and artifacts.
- Neural Networks: Structure with input, multiple hidden layers, and output neurons.
- Activation Functions: Threshold, sigmoid, rectifier (ReLU), hyperbolic tangent.
- Gradient Descent: Optimization algorithm to minimize loss and update weights.
- Hyperparameter Tuning: Automated search for the best configuration.
- Deployment States: Transitioning, unhealthy, unschedulable, failed, healthy.
- Azure Machine Learning Documentation
- TensorFlow Documentation
- MNIST Dataset
- Postman API Tool
- Anaconda Distribution
- VS Code
- YouTube Tutorial
Tutorial and walkthrough based on Deeptech data science and machine learning YouTube session and provided Jupyter Notebook.