Merged
1 change: 1 addition & 0 deletions .github/workflows/tests.yml
Expand Up @@ -36,6 +36,7 @@ jobs:
conda run -n hotel_management pip install -r requirements.txt
conda run -n hotel_management pip install pytest
conda run -n hotel_management pip install coverage
conda run -n hotel_management pip install httpx

- name: Show environment
run: conda run -n hotel_management conda list
Expand Down
9 changes: 8 additions & 1 deletion .gitignore
Expand Up @@ -46,6 +46,10 @@ htmlcov/
nosetests.xml
coverage.xml
coverage.json
coverage_ml_service.xml
coverage_e2e.xml
coverage_unit.xml
coverage_integration.xml
*.cover
*.py.cover
.hypothesis/
Expand Down Expand Up @@ -230,4 +234,7 @@ __marimo__/
/predictions/
/monitoring/
/orchestration_logs/
/scripts_logs/
/scripts_logs/

# tools
/tools/
4 changes: 2 additions & 2 deletions .pre-commit-config.yaml
Expand Up @@ -10,11 +10,11 @@ repos:
pass_filenames: false

- id: mypy
name: mypy (ml + pipelines + ml_service)
name: mypy type checking
entry: mypy
language: system
pass_filenames: false
args: ["ml", "pipelines", "ml_service"]
args: ["ml", "pipelines", "scripts", "ml_service", "tests"]

- id: import-layers
name: import layer guardrails
Expand Down
151 changes: 116 additions & 35 deletions README.md
Expand Up @@ -2,53 +2,132 @@

## Overview

### A reproducible ML experimentation and model lifecycle system.
### An end-to-end ML platform that guarantees reproducibility across datasets, features, and models — with full lineage tracking and validation.

- Currently supports the modeling of regression and classification tasks using the CatBoost algorithm.
- Was initially formed based on a hotel_bookings dataset:
- located in `data/raw/hotel_bookings/v1/2026-02-25T22-43-23_732dfdb7/data.csv`
- originally from https://www.kaggle.com/datasets/mojtaba142/hotel-booking
- Current architecture expanded to support many datasets.
- The ml workflow covers everything from the registration of a raw data snapshot to model monitoring.
- From: https://www.kaggle.com/datasets/mojtaba142/hotel-booking
- The architecture has since been expanded to support many datasets with minimal code changes.
- The ML workflow covers everything from the registration of a raw data snapshot to model monitoring.
- Designed with **production ML system constraints in mind**: reproducibility, traceability, validation, and modularity.
> Note: The repo was previously named `hotel_management`, so that name still appears in places; it was renamed to make
> the project's purpose clearer.

> Another note: A few artifacts are intentionally included, along with their respective logs.
> This enables quick inspection of expected outputs of each pipeline, without having to run anything.

## Why?

1. Many ML platforms are either overengineered for small teams, or lack essential safeguards:
- For small teams, overengineering can be an issue:
- Most small teams (1-5 developers) do not need to worry about run conditions, highly scalable storage, and so on
- They need a simple, but strong and reliable platform
- Some teams fail in the other direction:
- They fully rely on notebooks
- They forget about validation and lineage tracking
- They avoid elementary checks in order to "keep it simple"

This project keeps the workflow simple, while still providing the most important sanity checks
across the entire ML workflow. With minor modifications (dataset specifics, different algorithms),
this tool can be used by an individual or a small team of data scientists.

2. Most learning courses are too specific:
- There are many courses on how to do regression or classification, or how to write Python code
- There are many tutorials on how to use specific algorithms, and how they work under the hood
- There are very few courses/tutorials explaining the ML workflow in a simple manner
- It is very hard to find a platform for quick experimentation to understand how ML workflows work

This project can also serve as a learning tool for understanding ML workflows beyond notebook-based experimentation.
It is easy to set up, and comes with a friendly UI, as well as some pre-saved artifacts for quick inspection.
Users can quickly experiment and learn on their own; the only assumption is that they know how
to set up either Docker or Python and conda.

## Inspiration

This project started as part of my master's thesis, where the initial goal was to train several models on a hotel booking
dataset and expose them as tools for an LLM.

While working on that, I quickly ran into practical issues that are common in real-world ML work but rarely addressed in tutorials:
- Repetitive boilerplate for training and evaluation
- Difficulty reusing pipelines across slightly different setups
- Fragile experiment tracking (risk of losing artifacts or overwriting results)
- Inability to reliably pause and resume long-running experiments
- Lack of structure when working beyond notebooks

To address these problems, I started building small utilities to make experimentation more reliable and less error-prone. Over time,
this evolved into a broader system focused on reproducibility, modularity, and traceability across the entire ML lifecycle.

At some point, it became clear that building a proper ML workflow system was a more meaningful direction than the original project idea,
so I leaned into it and expanded the architecture into what it is today.

## Key Achievements

- **~17,500** lines of production code
- **~29,000** lines of tests (auto-generated + custom)
- **Fully reproducible pipelines** via artifact hashing
- **End-to-end ML lifecycle support**
- **4,000+** lines of pre-included configurations
- Easy-to-use **ML service** (as a local web app)
- Comprehensive documentation (**3,000+** lines of Markdown)

## Features

Pipelines for every part of the ml workflow:
- Data preprocessing
- Register raw data snapshots
- Build interim and processed datasets
### End-to-End ML Pipelines:
- Data registration and preprocessing
- Feature (set) freezing
- Hyperparameter search
- Model training
- Model evaluation
- Model explainability
- Model promotion
- Includes model registry for staging and production
- Archives past production models
- Model inference
- Model monitoring

Maximum **decoupling** of datasets, feature sets, and modeling
- Datasets merge at runtime, using predefined configs and DAG for ordering
- Feature sets merge at runtime using a predefined entity key
- Models can use any snapshots of datasets and feature sets via snapshot bindings registry
- Validation ensures consistency and predefined minimum row presence

Full **reproducibility**
- Hashing and downstream validation of relevant `artifacts` and `configs`
- Runtime info validation (hardware, git commit, environment...)

Code **quality** ensured by CI, which includes:
- `ruff` checks
- `mypy` checks (moderate strictness)
- import layer checks
- naming conventions checks
- **1235 tests** -> fails if coverage drops below 90%
- Model training, evaluation and explainability
- Model promotion and archiving
- Model inference and monitoring

### Reproducibility & Validation
- Artifact hashing across pipelines
- Environment & runtime validation
- Heavy versioning:
- All configurations
- Interim and processed data configurations
- Feature registry
- Global and algorithm defaults
- Model specifications + search and training configurations
- Pipeline configurations
- Environment overlay
- Promotion thresholds
- Snapshot bindings
- Target creation
- Splitting and target creation performed at runtime, based on model specifications
- Inference predictions schema
- Heavily snapshot-based:
- datasets
- feature sets
- training, evaluation, and explainability runs
- promotion and post-promotion runs
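
The artifact hashing listed above can be illustrated with a minimal sketch. This is not the project's actual implementation: the function names `hash_file`, `hash_config`, and `validate_artifact` are hypothetical, and SHA-256 is an assumed choice of hash function.

```python
import hashlib
import json
from pathlib import Path


def hash_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_config(config: dict) -> str:
    """Hash a config dict via canonical JSON so that key order does not matter."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def validate_artifact(path: Path, expected_hash: str) -> None:
    """Raise if the artifact on disk no longer matches the hash recorded upstream."""
    actual = hash_file(path)
    if actual != expected_hash:
        raise ValueError(f"Hash mismatch for {path}: expected {expected_hash}, got {actual}")
```

A downstream pipeline would call something like `validate_artifact` before consuming a snapshot, so that a silently modified file fails fast instead of producing irreproducible results.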

### Modular Architecture
- Decoupled datasets, features, and models
- Runtime merging of datasets (DAG + configurations) and feature sets (entity key + configurations)
- Flexible snapshot bindings
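
As a rough illustration of DAG-ordered dataset merging and entity-key-based feature set merging, here is a sketch using only the standard library. The dataset names, the `dependencies` mapping, and both helper functions are hypothetical and do not reflect the project's real configs or code.

```python
from graphlib import TopologicalSorter

# Hypothetical merge config: each dataset maps to the datasets it depends on.
dependencies: dict[str, set[str]] = {
    "bookings": set(),
    "guests": {"bookings"},
    "payments": {"bookings", "guests"},
}


def merge_order(deps: dict[str, set[str]]) -> list[str]:
    """Return an order in which every dataset comes after its dependencies."""
    return list(TopologicalSorter(deps).static_order())


def merge_on_key(left: list[dict], right: list[dict], key: str) -> list[dict]:
    """Inner-join two lists of row dicts on a shared entity key."""
    index = {row[key]: row for row in right}
    return [{**row, **index[row[key]]} for row in left if row[key] in index]
```

`TopologicalSorter` raises `graphlib.CycleError` on cyclic dependencies, which gives a natural validation point before any merging starts.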

### Reliability
- Atomic file writing
- Runtime saving of best hyperparameters from each search phase (broad + narrow)
- Runtime saving of model snapshots during training (e.g. every 30 seconds)
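
Atomic file writing is commonly done by writing to a temporary file in the same directory and then renaming it over the target. The sketch below shows that general pattern; `atomic_write_text` is a hypothetical helper, not necessarily how this repo implements it.

```python
import contextlib
import os
import tempfile
from pathlib import Path


def atomic_write_text(path: Path, content: str) -> None:
    """Write text so readers see either the old file or the complete new one."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # The temp file must live in the same directory so the rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=f".{path.name}.", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(content)
            handle.flush()
            os.fsync(handle.fileno())  # push the data to disk before the rename
        os.replace(tmp_name, path)  # replaces the target in a single step
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        with contextlib.suppress(FileNotFoundError):
            os.unlink(tmp_name)
        raise
```

With this pattern, an interrupted hyperparameter search or training run leaves the previously saved file intact rather than a half-written one.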

### Code Quality
- CI with linting (ruff), typing (mypy), and structure checks
- **90%+** coverage enforced by CI across **1,500+** tests

## Example Use Case

A data scientist can:
1. Register a new dataset snapshot
2. Optimize its memory in one or more ways
3. Process the dataset in one or more ways
4. Define and freeze many feature sets, each based on one or more related datasets
5. Perform one or more hyperparameter searches
6. Train models based on the hyperparameter search results (many training runs allowed per search)
7. Evaluate and explain the trained models, however many times
8. Stage, promote, and archive models
9. Run inference and monitoring on incoming data

## Installation

Expand All @@ -68,7 +147,9 @@ Two options:

See the [usage guide](docs/usage.md) for instructions on running the workflow.

### Usage examples (via `ml_service`):
### Usage examples:

The system includes a browser-based interface (`ml_service`) for interacting with pipelines and configurations:

#### Configs Writing, Validation, Saving, and Viewing - Interim Data Configs Example

Expand Down
2 changes: 1 addition & 1 deletion docs/architecture/boundaries.md
Expand Up @@ -30,7 +30,7 @@
## New shared code goes into domain package first

- avoid placing shared code in `ml.utils`
- instead, try placing it where it logically belongs, e.g. in `ml.runners`, `ml.modeling`, `ml.promotion`, etc.
- place it where it logically belongs, e.g. in `ml.runners`, `ml.modeling`, `ml.promotion`, etc.
- `ml.utils` should only contain code that is genuinely reusable across multiple different domains
- for instance, loading json and yaml files, getting the current git commit, and setting up a pipeline runner belong to `ml.utils`
- `get_trainer.py` is only used by trainer, so it does not belong in `ml.utils`; instead it belongs to `ml.runners.training.utils`
Expand Down
12 changes: 12 additions & 0 deletions docs/architecture/decisions.md
Expand Up @@ -2,6 +2,18 @@

This file records key architectural decisions, their rationale, and alternatives considered.

## Key Architectural Decisions (Summary)

The system is built around a few core principles:

- **Immutability of artifacts** (datasets, features, experiments)
- **Full reproducibility via configs + snapshot IDs**
- **Decoupling of datasets, features, and models**
- **Snapshot-based versioning instead of mutable state**
- **Filesystem-based storage with strict validation**

These decisions shape the entire architecture. Detailed breakdowns are provided below.

## Decision Classification

Each decision is classified as one of:
Expand Down
2 changes: 1 addition & 1 deletion docs/architecture/system_invariants.md
Expand Up @@ -110,4 +110,4 @@

- If any of the above differ, reproducibility is not fully guaranteed.
- Config hash match is very important for reproducibility; python version, conda environment hash and git commit matches are moderately important; os and hardware matches are the least important.
- It is technically possible to get the same results with config hash match alone, but the user assummes responsibility for any unexpected results in that case.
- It is technically possible to get the same results with config hash match alone, but the user assumes responsibility for any unexpected results in that case.
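
The importance tiers described above could be checked with a comparison like the following sketch. The field names, the `SEVERITY` mapping, and `compare_runtime_info` are hypothetical; only the three tiers themselves come from the invariant.

```python
# Importance tiers from the invariant above; the field names are assumptions.
SEVERITY: dict[str, str] = {
    "config_hash": "critical",
    "python_version": "moderate",
    "conda_env_hash": "moderate",
    "git_commit": "moderate",
    "os": "low",
    "hardware": "low",
}


def compare_runtime_info(
    recorded: dict[str, str], current: dict[str, str]
) -> dict[str, list[str]]:
    """Group mismatching runtime fields by how much they threaten reproducibility."""
    mismatches: dict[str, list[str]] = {"critical": [], "moderate": [], "low": []}
    for field, severity in SEVERITY.items():
        if recorded.get(field) != current.get(field):
            mismatches[severity].append(field)
    return mismatches
```

A caller might fail on any `critical` mismatch, warn on `moderate` ones, and only log `low` ones.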
2 changes: 1 addition & 1 deletion docs/architecture/validation_guarantees.md
Expand Up @@ -53,7 +53,7 @@
- Metrics suitability for business objective
- Absence of data leakage during CV
- Param distribution quality
- Compatibility between scoring function and specific algorithm beyong supported enum check
- Compatibility between scoring function and specific algorithm beyond supported enum check

## Promotion Validation

Expand Down
2 changes: 1 addition & 1 deletion docs/testing.md
Expand Up @@ -2,7 +2,7 @@

This document describes the testing strategy, conventions, and instructions for this ML project.
> Note: Most of the tests currently found in the repo were AI-generated (with careful prompting)
> Note: only the folders that constitute the main focus of the repo are tested (`ml/`, `pipelines/`, `scripts/` (excluding fake data generator))
> Note: the following directories are tested: `ml/`, `pipelines/`, `ml_service/`, `scripts/` (excluding fake data generator)

## Environment Setup

Expand Down
6 changes: 3 additions & 3 deletions pyproject.toml
Expand Up @@ -2,13 +2,13 @@
profile = "black"
line_length = 100
py_version = 311
src_paths = ["ml", "pipelines", "scripts", "ml_service"]
src_paths = ["ml", "pipelines", "scripts", "ml_service", "tests"]
skip = [".git", ".venv", "env", "__pycache__", ".pytest_cache"]

[tool.ruff]
line-length = 100
target-version = "py311"
src = ["ml", "pipelines", "scripts", "ml_service"]
src = ["ml", "pipelines", "scripts", "ml_service", "tests"]
exclude = [
".git",
".venv",
Expand Down Expand Up @@ -39,7 +39,7 @@ exclude = "(^notebooks/|^feature_store/|^data/|^experiments/)"

[tool.coverage.run]
branch = true
source = ["ml", "pipelines", "scripts"]
source = ["ml", "pipelines", "scripts", "ml_service"]
omit = [
"tests/*",
"notebooks/*",
Expand Down
1 change: 0 additions & 1 deletion pytest.ini
Expand Up @@ -8,6 +8,5 @@ addopts =
--strict-markers
markers =
unit: fast isolated unit tests
slow: tests that are slow or involve real training
integration: integration tests that may involve multiple components
e2e: end-to-end tests that exercise CLI or multi-layer flows
3 changes: 2 additions & 1 deletion scripts/generators/generate_fake_data.py
Expand Up @@ -484,7 +484,8 @@ def main() -> int:
metadata.detect_table_from_dataframe(
data=df_model,
table_name="synthetic_data",
infer_keys=None # type: ignore -> None is actually a valid value for infer_keys
# None is a valid value for `infer_keys`, but mypy rejects it
infer_keys=None # type: ignore
)

metadata.set_primary_key(
Expand Down
2 changes: 1 addition & 1 deletion scripts/generators/generate_snapshot_binding.py
Expand Up @@ -34,7 +34,7 @@ def scan_latest_snapshots(base_dir: Path) -> dict[str, dict[str, str]]:
Returns:
dict[name][version] = snapshot_name (str)
"""
result = {}
result: dict[str, dict[str, str]] = {}
if not base_dir.exists():
logger.warning(f"Base directory {base_dir} does not exist. Skipping.")
return result
Expand Down
2 changes: 1 addition & 1 deletion scripts/quality/check_naming_conventions.py
Expand Up @@ -105,7 +105,7 @@ def check_ast(file: Path):
)


def main():
def main() -> int:
"""Main function to check naming conventions across the codebase.

This script checks that:
Expand Down
4 changes: 4 additions & 0 deletions tests/__init__.py
@@ -0,0 +1,4 @@
"""Make the `tests` directory a package so mypy maps test modules unambiguously.

This file intentionally contains no code.
"""
33 changes: 33 additions & 0 deletions tests/conftest.py
Expand Up @@ -4,11 +4,22 @@
and ensure the project root is on `sys.path` for tests.
"""

import contextlib
import sys
import types
from pathlib import Path
from typing import Any

import pytest

# Lightweight TestClient fixture for FastAPI integration-style tests
try:
    import ml_service.backend.main as _backend_main
    from fastapi.testclient import TestClient
except Exception:  # pragma: no cover - defensive import for environments without FastAPI
    TestClient = None  # type: ignore
    _backend_main = None  # type: ignore

# Global test stub for the optional `catboost` dependency. Many modules import
# `catboost` at import-time; providing a minimal stub prevents import errors
# when running unit tests in environments without the real package installed.
Expand Down Expand Up @@ -65,3 +76,25 @@ def __init__(self, *args, **kwargs):

if str(PROJECT_ROOT) not in sys.path:
sys.path.insert(0, str(PROJECT_ROOT))


@pytest.fixture
def fastapi_client():
    """Provide a `TestClient` for the ml_service FastAPI app.

    Tests that exercise ml_service backend routers can use this fixture.
    If FastAPI or the ml_service backend cannot be imported, the fixture
    raises a `RuntimeError` when a test requests it.
    """
    if TestClient is None or _backend_main is None:
        raise RuntimeError(
            "FastAPI TestClient or ml_service backend not importable in test environment"
        )

    client = TestClient(_backend_main.app)
    # During tests, disable slowapi rate limiting if present to avoid
    # accidental 429 failures caused by shared TestClient remote address.
    with contextlib.suppress(Exception):
        _backend_main.app.state.limiter.enabled = False
    try:
        yield client
    finally:
        client.close()