
Databricks MLOps Lifecycle — v0.1 Foundation



Overview

v0.1-foundation establishes the core MLOps platform scaffold for an end‑to‑end machine learning lifecycle implemented using Databricks Asset Bundles.

This release focuses on reproducible infrastructure, training orchestration, and model lifecycle readiness, forming the base for future inference and monitoring capabilities.


Architecture Goals

  • Infrastructure-as-Code
  • Reproducible ML pipelines
  • Serverless-compatible execution
  • MLflow experiment tracking
  • Registry-independent model lifecycle
  • Bundle-driven deployment

Design Principles

This project follows a set of architectural principles commonly used in production ML platforms:

  • Infrastructure as Code
    All pipeline resources are defined declaratively using Databricks Asset Bundles to ensure reproducibility and environment portability.

  • Decoupled ML Lifecycle
    Data preparation, training, model selection, inference, and monitoring are implemented as independent pipeline stages.

  • Registry Independence
    Model selection logic does not assume availability of a model registry, enabling execution in minimal or restricted environments.

  • Reproducibility First
    Dataset preparation, feature engineering, and model training are designed to produce deterministic outputs from a clean workspace.

  • Incremental Platform Evolution
    The system is intentionally built in staged releases (v0.1 → v0.3) to mirror real-world ML platform development cycles.


Pipeline Architecture

```mermaid
graph LR
    A[Prepare Dataset] --> B[Train Model]
    B --> C[Select Production Model]
    C --> D[Batch Inference]
    D --> E[Monitoring & Drift Detection]

    subgraph v0.1 Foundation
        A
        B
        C
    end

    subgraph Future Releases
        D
        E
    end
```

⚠️ In v0.1, foundation stages are production-ready up to model selection.
Batch inference and monitoring are planned for upcoming releases.


Quick Start

Deploy and run the pipeline using the Databricks CLI.

1️⃣ Deploy the bundle

```shell
databricks bundle deploy
```

2️⃣ Run the pipeline

```shell
databricks bundle run mlops_lifecycle_pipeline
```

3️⃣ Inspect results

Navigate in Databricks:

Workspace → Jobs → mlops_lifecycle_pipeline

Then open MLflow Experiments to view training metrics.


Implemented Components

Issue #1 — Bundle Infrastructure

  • Databricks Asset Bundle configuration
  • Environment isolation
  • Job orchestration as code
  • Serverless execution compatibility
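
The job-as-code idea above can be illustrated with a minimal `databricks.yml` sketch. This is illustrative only: the target name, task keys, and notebook paths are assumptions; only the job name `mlops_lifecycle_pipeline` comes from the Quick Start commands.

```yaml
# databricks.yml — illustrative Asset Bundle sketch (paths and targets are assumptions)
bundle:
  name: mlops_lifecycle

targets:
  dev:
    mode: development
    default: true

resources:
  jobs:
    mlops_lifecycle_pipeline:
      name: mlops_lifecycle_pipeline
      tasks:
        - task_key: prepare_dataset
          notebook_task:
            notebook_path: ./notebooks/prepare_dataset
        - task_key: train_model
          depends_on:
            - task_key: prepare_dataset
          notebook_task:
            notebook_path: ./notebooks/train_model
        - task_key: select_production_model
          depends_on:
            - task_key: train_model
          notebook_task:
            notebook_path: ./notebooks/select_production_model
```

Declaring the task dependency chain here is what lets `databricks bundle deploy` recreate the same job DAG in any workspace.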

Issue #2 — Dataset Preparation

NYC Taxi dataset ingestion pipeline:

  • Public dataset ingestion
  • Data quality filtering
  • Feature engineering
  • Deterministic dataset splits
  • Managed Delta tables

Created tables:

```
main.mlops_lifecycle.train_set
main.mlops_lifecycle.test_set
main.mlops_lifecycle.extra_set
```
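
Deterministic splits can be produced without random state by hashing a stable row key. A minimal pure-Python sketch of the idea (the actual pipeline runs on PySpark; the key and split percentages here are assumptions):

```python
import hashlib

def split_bucket(row_key: str, train_pct: int = 80, test_pct: int = 10) -> str:
    """Assign a row to train/test/extra deterministically from a stable key."""
    # Hash the key to an integer in [0, 100); the same key always lands in the same bucket.
    digest = hashlib.sha256(row_key.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    if bucket < train_pct:
        return "train"
    if bucket < train_pct + test_pct:
        return "test"
    return "extra"

# Re-running the split yields identical assignments for identical keys.
assert split_bucket("trip-0001") == split_bucket("trip-0001")
```

Because the bucket depends only on the row key, rebuilding the tables from a clean workspace reproduces the exact same train/test/extra membership.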

Issue #3 — Model Training + MLflow

Distributed training workflow:

  • Feature vectorization
  • Regression model training
  • MLflow experiment tracking
  • Metric logging:
    • RMSE
    • MAE
  • Reproducible bundle execution
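
The two logged metrics reduce to simple aggregates over prediction errors. A minimal pure-Python sketch (the pipeline itself computes these over Spark DataFrames before logging them to MLflow):

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error over paired observations."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error over paired observations."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# A constant +1 prediction error gives MAE = 1.0 and RMSE = 1.0.
y_true = [10.0, 12.0, 14.0]
y_pred = [11.0, 13.0, 15.0]
print(rmse(y_true, y_pred), mae(y_true, y_pred))  # 1.0 1.0
```

RMSE penalizes large errors more heavily than MAE, which is why both are logged per run: together they hint at whether errors are uniform or dominated by outliers.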

Issue #4 — Model Lifecycle Strategy

Registry-independent model loading:

  • Stage tagging via MLflow run tags
  • Production-stage identification
  • Fallback loading logic
  • Unity Catalog optionality
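
The selection-with-fallback logic above can be sketched without a registry: prefer the run tagged as production, otherwise fall back to the best-scoring candidate. A hedged pure-Python sketch, assuming runs are plain dicts of tags and metrics (the real implementation would query runs via the MLflow tracking API):

```python
def select_production_run(runs):
    """Pick the run tagged stage=production; fall back to the lowest-RMSE run."""
    production = [r for r in runs if r.get("tags", {}).get("stage") == "production"]
    if production:
        # If several runs carry the tag, prefer the most recent one.
        return max(production, key=lambda r: r["start_time"])
    if not runs:
        return None
    # Fallback: no run is tagged yet, so choose the best-scoring candidate.
    return min(runs, key=lambda r: r["metrics"]["rmse"])

runs = [
    {"run_id": "a", "start_time": 1, "tags": {}, "metrics": {"rmse": 4.2}},
    {"run_id": "b", "start_time": 2, "tags": {"stage": "production"}, "metrics": {"rmse": 5.0}},
]
print(select_production_run(runs)["run_id"])  # b
```

Because the logic only reads run tags and metrics, it works identically whether or not Unity Catalog or a model registry is available in the workspace.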

MLflow Tracking

Each training run logs:

  • Parameters
  • Metrics
  • Artifacts
  • Stage tags
  • Execution metadata

Accessible via:

Databricks Workspace → Experiments

Execution Evidence

The following artifacts demonstrate successful execution of the MLOps pipeline in the Databricks environment.

Pipeline Job Run

Example run of the Databricks job orchestrating the pipeline stages.

(Screenshot: pipeline job run)


MLflow Experiment Tracking

Training runs logged with parameters, metrics, and artifacts.

(Screenshot: MLflow experiment)


Delta Tables Created

Managed Delta tables generated during the pipeline execution.

(Screenshot: Delta tables)


Tech Stack

  • Databricks Asset Bundles
  • PySpark
  • Delta Lake
  • MLflow
  • Databricks Serverless Compute
  • Python

Release Scope

| Stage | Status |
| --- | --- |
| Data Preparation | ✅ Complete |
| Training | ✅ Complete |
| Model Selection | ✅ Complete |
| Batch Inference | ⏳ v0.2 |
| Monitoring & Drift | ⏳ v0.3 |

Release

Tag: v0.1-foundation

This release establishes the production-ready MLOps scaffold.


🔜 Roadmap

v0.2

  • Batch inference pipelines
  • Scheduled scoring jobs
  • Prediction Delta outputs

v0.3

  • Monitoring & drift detection
  • Data quality metrics
  • Model performance tracking

👤 Author

Sangam Kumar Singh
Senior Applied AI / MLOps Architect
GenAI • Distributed ML • Decision Intelligence

About

Production-grade Databricks MLOps lifecycle demonstrating Spark ML pipelines, MLflow experiment tracking, Delta Lake feature storage, orchestration, and (planned) model drift monitoring.
