Sebastijan-Dominis · Mar 31, 2026 · Mar 29, 2026
diff --git a/.github/workflows/tests.yml b/.github/workflows/tests.yml
@@ -36,6 +36,7 @@ jobs:
           conda run -n hotel_management pip install -r requirements.txt
           conda run -n hotel_management pip install pytest
           conda run -n hotel_management pip install coverage
+          conda run -n hotel_management pip install httpx
 
       - name: Show environment
         run: conda run -n hotel_management conda list

diff --git a/.gitignore b/.gitignore
@@ -46,6 +46,10 @@ htmlcov/
 nosetests.xml
 coverage.xml
 coverage.json
+coverage_ml_service.xml
+coverage_e2e.xml
+coverage_unit.xml
+coverage_integration.xml
 *.cover
 *.py.cover
 .hypothesis/
@@ -230,4 +234,7 @@ __marimo__/
 /predictions/
 /monitoring/
 /orchestration_logs/
-/scripts_logs/
+/scripts_logs/
+
+# tools
+/tools/
diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
@@ -10,11 +10,11 @@ repos:
         pass_filenames: false
 
       - id: mypy
-        name: mypy (ml + pipelines + ml_service)
+        name: mypy type checking
         entry: mypy
         language: system
         pass_filenames: false
-        args: ["ml", "pipelines", "ml_service"]
+        args: ["ml", "pipelines", "scripts", "ml_service", "tests"]
 
       - id: import-layers
         name: import layer guardrails

diff --git a/README.md b/README.md
@@ -2,53 +2,132 @@
 
 ## Overview
 
-### A reproducible ML experimentation and model lifecycle system.
+### An end-to-end ML platform that guarantees reproducibility across datasets, features, and models — with full lineage tracking and validation.
 
 - Currently supports the modeling of regression and classification tasks using the CatBoost algorithm.
 - Was initially formed based on a hotel_bookings dataset:
-    - located in `data/raw/hotel_bookings/v1/2026-02-25T22-43-23_732dfdb7/data.csv`
-    - originally from https://www.kaggle.com/datasets/mojtaba142/hotel-booking 
-- Current architecture expanded to support many datasets.
-- The ml workflow covers everything from the registration of a raw data snapshot to model monitoring.
+    - From: https://www.kaggle.com/datasets/mojtaba142/hotel-booking 
+    - The architecture has since been expanded to support many datasets with minimal code changes.
+- The ML workflow covers everything from the registration of a raw data snapshot to model monitoring.
+- Designed with **production ML system constraints in mind**: reproducibility, traceability, validation, and modularity.
 > Note: The repo was previously named `hotel_management`, so you will see that name around the repo; renamed for clarity 
 > on what the project does.
 
 > Another note: A few artifacts are intentionally included, along with their respective logs.
 > This enables quick inspection of expected outputs of each pipeline, without having to run anything.
 
+## Why?
+
+1. Many ML platforms are either overengineered for small teams, or lack essential safeguards:
+- For small teams, overengineering can be an issue:
+  - Most small teams (1-5 developers) do not need to worry about run conditions, highly scalable storage, and so on
+  - They need a simple, but strong and reliable platform
+- Some teams fail in the other direction:
+  - They fully rely on notebooks
+  - They forget about validation and lineage tracking
+  - They avoid elementary checks in order to "keep it simple"
+
+This project keeps the workflow simple, while still providing the most important sanity checks
+across the entire ML workflow. With minor modifications (dataset specifics, different algorithms), 
+this tool can be used by an individual, or a small team of data scientists.
+
+2. Most learning courses are too specific:
+- There are many courses on how to do regression or classification, or how to write Python code
+- There are many tutorials on how to use specific algorithms, and how they work under the hood
+- There are very few courses/tutorials explaining the ML workflow in a simple manner
+- It is very hard to find a platform for quick experimentation to understand how ML workflows work
+
+This project can also serve as a learning tool for understanding ML workflows beyond notebook-based experimentation.
+It is easy to set up, and comes with a friendly UI, as well as some pre-saved artifacts for quick inspection.
+Users can quickly experiment and learn on their own, and the only assumption is that they know how
+to either set up Docker, or Python and conda.
+
+## Inspiration
+
+This project started as part of my master's thesis, where the initial goal was to train several models on a hotel booking 
+dataset and expose them as tools for an LLM.
+
+While working on that, I quickly ran into practical issues that are common in real-world ML work but rarely addressed in tutorials:
+- Repetitive boilerplate for training and evaluation
+- Difficulty reusing pipelines across slightly different setups
+- Fragile experiment tracking (risk of losing artifacts or overwriting results)
+- Inability to reliably pause and resume long-running experiments
+- Lack of structure when working beyond notebooks
+
+To address these problems, I started building small utilities to make experimentation more reliable and less error-prone. Over time, 
+this evolved into a broader system focused on reproducibility, modularity, and traceability across the entire ML lifecycle.
+
+At some point, it became clear that building a proper ML workflow system was a more meaningful direction than the original project idea, 
+so I leaned into it and expanded the architecture into what it is today.
+
+## Key Achievements
+
+- **~17,500** lines of production code
+- **~29,000** lines of tests (auto-generated + custom)
+- **Fully reproducible pipelines** via artifact hashing
+- **End-to-end ML lifecycle support**
+- **4,000+** lines of pre-included configurations
+- Easy-to-use **ML service** (as a local web app)
+- Comprehensive documentation (**3,000+** lines of Markdown)
+
 ## Features
 
-Pipelines for every part of the ml workflow:
-- Data preprocessing
-  - Register raw data snapshots
-  - Build interim and processed datasets
+### End-to-End ML Pipelines
+- Data registration and preprocessing
 - Feature (set) freezing
 - Hyperparameter search
-- Model training
-- Model evaluation
-- Model explainability
-- Model promotion
-  - Includes model registry for staging and production
-  - Archives past production models
-- Model inference
-- Model monitoring
-
-Maximum **decoupling** of datasets, feature sets, and modeling
-- Datasets merge at runtime, using predefined configs and DAG for ordering
-- Feature sets merge at runtime using a predefined entity key
-- Models can use any snapshots of datasets and feature sets via snapshot bindings registry
-- Validation ensures consistency and predefined minimum row presence
-
-Full **reproducibility**
-- Hashing and downstream validation of relevant `artifacts` and `configs`
-- Runtime info validation (hardware, git commit, environment...)
-
-Code **quality** ensured by CI, which includes:
-- `ruff` checks
-- `mypy` checks (moderate strictness)
-- import layer checks
-- naming conventions checks
-- **1235 tests** -> fails if coverage drops below 90%
+- Model training, evaluation and explainability
+- Model promotion and archiving
+- Model inference and monitoring
+
+### Reproducibility & Validation
+- Artifact hashing across pipelines
+- Environment & runtime validation
+- Heavy versioning:
+  - All configurations
+    - Interim and processed data configurations
+    - Feature registry
+    - Global and algorithm defaults
+    - Model specifications + search and training configurations
+    - Pipeline configurations
+    - Environment overlay
+    - Promotion thresholds
+    - Snapshot bindings
+  - Target creation
+    - Splitting and target creation performed at runtime, based on model specifications
+  - Inference predictions schema
+- Heavily snapshot-based:
+  - datasets
+  - feature sets
+  - training, evaluation, and explainability runs
+  - promotion and post-promotion runs
+
+### Modular Architecture
+- Decoupled datasets, features, and models
+- Runtime merging of datasets (DAG + configurations) and feature sets (entity key + configurations)
+- Flexible snapshot bindings
+
+### Reliability
+- Atomic file writing
+- Runtime saving of best hyperparameters from each search phase (broad + narrow)
+- Runtime saving of model snapshots during training (e.g. every 30 seconds)
+
+### Code Quality
+- CI with linting (ruff), typing (mypy), and structure checks
+- **90%+** coverage enforced by CI across **1,500+** tests
+
+## Example Use Case
+
+A data scientist can:
+1. Register a new dataset snapshot
+2. Optimize its memory in one or more ways
+3. Process the dataset in one or more ways
+4. Define and freeze many feature sets, each based on one or more related datasets
+5. Perform one or more hyperparameter searches
+6. Train models based on the hyperparameter search results (multiple training runs are allowed per search)
+7. Evaluate and explain the trained models, however many times
+8. Stage, promote, and archive models
+9. Run inference and monitoring on incoming data
 
 ## Installation
 
@@ -68,7 +147,9 @@ Two options:
 
 See the [usage guide](docs/usage.md) for instructions on running the workflow.
 
-### Usage examples (via `ml_service`):
+### Usage examples:
+
+The system includes a browser-based interface (`ml_service`) for interacting with pipelines and configurations:
 
 #### Configs Writing, Validation, Saving, and Viewing - Interim Data Configs Example
 

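The README lists "Artifact hashing across pipelines" and "Atomic file writing" without showing what they look like. A minimal sketch of the two ideas, using only the standard library, could be the following. The function names `write_json_atomic` and `sha256_of_file` are hypothetical and are not taken from the repository; the real implementation may differ.

```python
import contextlib
import hashlib
import json
import os
import tempfile
from pathlib import Path


def write_json_atomic(path: Path, payload: dict) -> None:
    """Write JSON to a temp file next to `path`, then rename it into place.

    Hypothetical helper: readers see either the old file or the complete
    new file, never a half-written one.
    """
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            # sort_keys makes the serialized bytes independent of dict order
            json.dump(payload, handle, sort_keys=True)
            handle.flush()
            os.fsync(handle.fileno())
        # os.replace is atomic when source and target share a filesystem
        os.replace(tmp_name, path)
    except BaseException:
        # Clean up the temp file if anything failed before the rename
        with contextlib.suppress(FileNotFoundError):
            os.unlink(tmp_name)
        raise


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

A downstream step could store the digest alongside the artifact and recompute it before use, refusing to continue if the two values differ.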
diff --git a/docs/architecture/boundaries.md b/docs/architecture/boundaries.md
@@ -30,7 +30,7 @@
 ## New shared code goes into domain package first
 
 - avoid placing shared code in `ml.utils`
-- instead, try placing it where it logically belongs, e.g. in `ml.runners`, `ml.modeling`, `ml.promotion`, etc.
+- place it where it logically belongs, e.g. in `ml.runners`, `ml.modeling`, `ml.promotion`, etc.
 - `ml.utils` should only contain code that is genuinely reusable across multiple different domains
 - for instance, loading json and yaml files, getting the current git commit, and setting up a pipeline runner belong to `ml.utils`
 - `get_trainer.py` is only used by trainer, so it does not belong in `ml.utils`; instead it belongs to `ml.runners.training.utils`

diff --git a/docs/architecture/decisions.md b/docs/architecture/decisions.md
@@ -2,6 +2,18 @@
 
 This file records key architectural decisions, their rationale, and alternatives considered.
 
+## Key Architectural Decisions (Summary)
+
+The system is built around a few core principles:
+
+- **Immutability of artifacts** (datasets, features, experiments)
+- **Full reproducibility via configs + snapshot IDs**
+- **Decoupling of datasets, features, and models**
+- **Snapshot-based versioning instead of mutable state**
+- **Filesystem-based storage with strict validation**
+
+These decisions shape the entire architecture. Detailed breakdowns are provided below.
+
 ## Decision Classification
 
 Each decision is classified as one of:

diff --git a/docs/architecture/system_invariants.md b/docs/architecture/system_invariants.md
@@ -110,4 +110,4 @@
 
 - If any of the above differ, reproducibility is not fully guaranteed.
 - Config hash match is very important for reproducibility; python version, conda environment hash and git commit matches are moderately important; os and hardware matches are the least important.
-- It is technically possible to get the same results with config hash match alone, but the user assummes responsibility for any unexpected results in that case.
+- It is technically possible to get the same results with config hash match alone, but the user assumes responsibility for any unexpected results in that case.
diff --git a/docs/architecture/validation_guarantees.md b/docs/architecture/validation_guarantees.md
@@ -53,7 +53,7 @@
 - Metrics suitability for business objective
 - Absence of data leakage during CV
 - Param distribution quality
-- Compatibility between scoring function and specific algorithm beyong supported enum check
+- Compatibility between scoring function and specific algorithm beyond supported enum check
 
 ## Promotion Validation
 

diff --git a/docs/testing.md b/docs/testing.md
@@ -2,7 +2,7 @@
 
 This document describes the testing strategy, conventions, and instructions for this ML project.
 > Note: Most of the tests currently found in the repo were AI-generated (with careful prompting)
-> Note: only the folders that constitute the main focus of the repo are tested (`ml/`, `pipelines/`, `scripts/` (excluding fake data generator))
+> Note: the following directories are tested: `ml/`, `pipelines/`, `ml_service/`, `scripts/` (excluding fake data generator)
 
 ## Environment Setup
 

diff --git a/pyproject.toml b/pyproject.toml
@@ -2,13 +2,13 @@
 profile = "black"
 line_length = 100
 py_version = 311
-src_paths = ["ml", "pipelines", "scripts", "ml_service"]
+src_paths = ["ml", "pipelines", "scripts", "ml_service", "tests"]
 skip = [".git", ".venv", "env", "__pycache__", ".pytest_cache"]
 
 [tool.ruff]
 line-length = 100
 target-version = "py311"
-src = ["ml", "pipelines", "scripts", "ml_service"]
+src = ["ml", "pipelines", "scripts", "ml_service", "tests"]
 exclude = [
   ".git",
   ".venv",
@@ -39,7 +39,7 @@ exclude = "(^notebooks/|^feature_store/|^data/|^experiments/)"
 
 [tool.coverage.run]
 branch = true
-source = ["ml", "pipelines", "scripts"]
+source = ["ml", "pipelines", "scripts", "ml_service"]
 omit = [
   "tests/*",
   "notebooks/*",

diff --git a/pytest.ini b/pytest.ini
@@ -8,6 +8,5 @@ addopts =
     --strict-markers
 markers =
     unit: fast isolated unit tests
-    slow: tests that are slow or involve real training
     integration: integration tests that may involve multiple components
     e2e: end-to-end tests that exercise CLI or multi-layer flows
diff --git a/scripts/generators/generate_fake_data.py b/scripts/generators/generate_fake_data.py
@@ -484,7 +484,8 @@ def main() -> int:
         metadata.detect_table_from_dataframe(
             data=df_model,
             table_name="synthetic_data",
-            infer_keys=None # type: ignore -> None is actually a valid value for infer_keys
+            # None is a valid value for `infer_keys`, but mypy rejects it
+            infer_keys=None # type: ignore
         )
 
         metadata.set_primary_key(

diff --git a/scripts/generators/generate_snapshot_binding.py b/scripts/generators/generate_snapshot_binding.py
@@ -34,7 +34,7 @@ def scan_latest_snapshots(base_dir: Path) -> dict[str, dict[str, str]]:
     Returns:
         dict[name][version] = snapshot_name (str)
     """
-    result = {}
+    result: dict[str, dict[str, str]] = {}
     if not base_dir.exists():
         logger.warning(f"Base directory {base_dir} does not exist. Skipping.")
         return result

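The annotation added above documents the return shape `dict[name][version] = snapshot_name`. As an illustration of that shape only, here is a sketch that assumes a `base_dir/<name>/<version>/<snapshot>/` directory layout and assumes that snapshot directory names start with a sortable timestamp, so the lexicographically largest name is the latest. Both assumptions, and the name `scan_latest_snapshots_sketch`, are hypothetical; the actual function body is not shown in this diff.

```python
from pathlib import Path


def scan_latest_snapshots_sketch(base_dir: Path) -> dict[str, dict[str, str]]:
    """Map each name and version to its latest snapshot directory name.

    Assumes base_dir/<name>/<version>/<snapshot>/ and timestamp-prefixed
    snapshot names, so a plain string sort orders them chronologically.
    """
    result: dict[str, dict[str, str]] = {}
    if not base_dir.exists():
        return result

    for name_dir in sorted(p for p in base_dir.iterdir() if p.is_dir()):
        for version_dir in sorted(p for p in name_dir.iterdir() if p.is_dir()):
            snapshots = sorted(p.name for p in version_dir.iterdir() if p.is_dir())
            if snapshots:
                # Last element after sorting is treated as the latest snapshot
                result.setdefault(name_dir.name, {})[version_dir.name] = snapshots[-1]
    return result
```

Versions without any snapshot directory are left out of the result rather than mapped to an empty value, which keeps the declared `dict[str, dict[str, str]]` type accurate.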
diff --git a/scripts/quality/check_naming_conventions.py b/scripts/quality/check_naming_conventions.py
@@ -105,7 +105,7 @@ def check_ast(file: Path):
                 )
 
 
-def main():
+def main() -> int:
     """Main function to check naming conventions across the codebase.
 
     This script checks that:

diff --git a/tests/__init__.py b/tests/__init__.py
@@ -0,0 +1,4 @@
+"""Make the `tests` directory a package so mypy maps test modules unambiguously.
+
+This file intentionally contains no code.
+"""
diff --git a/tests/conftest.py b/tests/conftest.py
@@ -4,11 +4,22 @@
 and ensure the project root is on `sys.path` for tests.
 """
 
+import contextlib
 import sys
 import types
 from pathlib import Path
 from typing import Any
 
+import pytest
+
+# Lightweight TestClient fixture for FastAPI integration-style tests
+try:
+    import ml_service.backend.main as _backend_main
+    from fastapi.testclient import TestClient
+except Exception:  # pragma: no cover - defensive import for environments without FastAPI
+    TestClient = None  # type: ignore
+    _backend_main = None  # type: ignore
+
 # Global test stub for the optional `catboost` dependency. Many modules import
 # `catboost` at import-time; providing a minimal stub prevents import errors
 # when running unit tests in environments without the real package installed.
@@ -65,3 +76,25 @@ def __init__(self, *args, **kwargs):
 
 if str(PROJECT_ROOT) not in sys.path:
     sys.path.insert(0, str(PROJECT_ROOT))
+
+
+@pytest.fixture
+def fastapi_client():
+    """Provide a `TestClient` for the ml_service FastAPI app.
+
+    Tests that exercise ml_service backend routers can use this fixture.
+    If FastAPI or the ml_service backend cannot be imported, the fixture raises
+    `RuntimeError` during setup of any test that requests it.
+    """
+    if TestClient is None or _backend_main is None:
+        raise RuntimeError("FastAPI TestClient or ml_service backend not importable in test environment")
+
+    client = TestClient(_backend_main.app)
+    # During tests, disable slowapi rate limiting if present to avoid
+    # accidental 429 failures caused by shared TestClient remote address.
+    with contextlib.suppress(Exception):
+        _backend_main.app.state.limiter.enabled = False
+    try:
+        yield client
+    finally:
+        client.close()
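
The fixture above disables the limiter inside `contextlib.suppress(Exception)`, which also hides unrelated failures. One possible narrower variant, shown as a sketch rather than a required change, suppresses only `AttributeError` and reports whether a limiter was found. The helper name `disable_rate_limiter` is hypothetical, and the stand-in `app` objects below are plain namespaces, not real FastAPI apps.

```python
import contextlib
from types import SimpleNamespace
from typing import Any


def disable_rate_limiter(app: Any) -> bool:
    """Set `app.state.limiter.enabled = False` if the limiter exists.

    Returns True when the limiter was found and disabled, False when
    `state` or `limiter` is missing on the given app object.
    """
    with contextlib.suppress(AttributeError):
        app.state.limiter.enabled = False
        return True
    return False


# Stand-in objects that mimic the attribute layout used by the fixture
app_with_limiter = SimpleNamespace(
    state=SimpleNamespace(limiter=SimpleNamespace(enabled=True))
)
app_without_limiter = SimpleNamespace(state=SimpleNamespace())
```

Returning a boolean lets a test assert that rate limiting was actually turned off instead of assuming it silently succeeded.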