
Vatsal1208/ml-pipeline-api


ML Pipeline API

A FastAPI-based machine learning pipeline that accepts any CSV, automatically cleans data, trains a model of your choice, and serves predictions — all through REST endpoints.


📦 Installation

pip install -r requirements.txt

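Run the API

uvicorn main:app --reload

Run this from the directory that contains main.py. With the layout shown under Project Structure (main.py inside app/), starting from the project root would likely need the module path app.main:app instead.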

Then open http://localhost:8000/docs for interactive Swagger UI.


📁 Project Structure

ml_pipeline_api/
│
├── app/
│   ├── functions/
│   │   ├── cleaning_functions.py   # Missing values, duplicates, summary
│   │   ├── outlier_functions.py    # IQR-based outlier detection & removal
│   │   ├── feature_functions.py    # Label encoding, one-hot, binning, log transform
│   │   ├── scaling_functions.py    # StandardScaler, MinMaxScaler, RobustScaler
│   │   └── model_functions.py      # 20+ models, train, predict, evaluate
│   │
│   ├── routers/
│   │   └── predict.py              # All API endpoints
│   │
│   ├── models/                  
│   ├── main.py                      # FastAPI app entry point
│   └── storage.py                   # In-memory model storage
│  
│
├── requirements.txt
└── README.md

🔁 How It Works

Upload CSV → Auto Clean → Encode → Scale → Train → Evaluate → Predict
  1. Upload any CSV with a target column
  2. Pipeline auto-cleans: removes duplicates, fills missing values, removes outliers
  3. Categorical columns are label-encoded and numeric columns are scaled
  4. Model is trained and stored in memory under a model_id
  5. Send new rows to /predict/predict to get predictions

🌐 API Endpoints

| Method | Endpoint | Description |
|--------|----------|-------------|
| GET | / | Health check |
| GET | /predict/available-models | List all available models |
| POST | /predict/train | Upload CSV and train a model |
| GET | /predict/evaluate | Get evaluation metrics |
| POST | /predict/predict | Send a row, get prediction |

GET /predict/available-models

Returns all available classification and regression models with descriptions.


POST /predict/train

Upload a CSV and train a model.

Parameters:

| Parameter | Type | Description |
|-----------|------|-------------|
| model_name | string | e.g. "random forest" |
| target_column | string | Column to predict |
| model_id | string | Name to save model under (default: "default") |
| parameters | JSON string | Optional hyperparameters, e.g. {"n_estimators": 200} |
| file | CSV file | Your dataset |

Example Response:

{
  "status": "trained",
  "model_used": "random forest",
  "model_id": "my_model",
  "train_rows": 800,
  "test_rows": 200,
  "summary": {
    "duplicates_removed": 5,
    "outliers_removed": {"age": 3},
    "rows_remaining": 1000
  }
}
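As a rough client-side sketch of the parameters above: the optional hyperparameters travel as a JSON string, so a client has to serialize them before sending. The helper name `build_train_fields` and the target column `"churn"` are hypothetical, and whether the API reads these values as form fields or as query parameters is not stated here, so treat the transport as an assumption.

```python
import json


def build_train_fields(model_name, target_column, model_id="default", parameters=None):
    """Collect the text fields for POST /predict/train (hypothetical helper)."""
    fields = {
        "model_name": model_name,
        "target_column": target_column,
        "model_id": model_id,
    }
    if parameters is not None:
        # `parameters` is documented as a JSON string, not a nested object.
        fields["parameters"] = json.dumps(parameters)
    return fields


fields = build_train_fields(
    "random forest", "churn", model_id="my_model", parameters={"n_estimators": 200}
)
print(fields["parameters"])  # {"n_estimators": 200}
```

With an HTTP client such as requests, these fields could be sent next to the CSV, for example `requests.post(url, data=fields, files={"file": open("data.csv", "rb")})`, assuming the endpoint accepts multipart form data.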

GET /predict/evaluate

Get evaluation metrics for a trained model.

Classification returns: accuracy, classification report, confusion matrix

Regression returns: R² score, MAE, MSE

Example Response:

{
  "model_id": "my_model",
  "model_name": "random forest",
  "results": {
    "accuracy": 0.96,
    "confusion_matrix": [[80, 5], [3, 112]]
  }
}
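To make the returned metrics easier to read, here is a minimal pure-Python sketch of how accuracy relates to a confusion matrix and how MAE, MSE, and R² are defined. The function names and numbers are illustrative only; the API presumably relies on scikit-learn for these values, and this is not its code.

```python
def accuracy_from_confusion(matrix):
    """Accuracy = correctly classified samples (the diagonal) / all samples."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def regression_metrics(y_true, y_pred):
    """Return MAE, MSE and R^2 for two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot  # undefined when y_true is constant
    return {"mae": mae, "mse": mse, "r2": r2}


# Hypothetical numbers, not output from this API.
print(accuracy_from_confusion([[50, 10], [5, 35]]))  # 0.85
print(regression_metrics([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 5.0]))
```

In the second call only one prediction is off by 1, which gives MAE 0.25, MSE 0.25, and R² 0.8.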

POST /predict/predict

Send a new row and get a prediction.

Example Request:

{
  "data": {"age": 35, "salary": 60000, "department": "Sales"},
  "model_id": "my_model",
  "return_proba": false
}

Example Response:

{
  "model_id": "my_model",
  "prediction": [1]
}

Set return_proba: true to get probability scores instead of class labels.
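A small standard-library sketch of how a client might build this request. The helper name `build_predict_request` is hypothetical, and the base URL assumes the local development server from the Run the API section.

```python
import json
import urllib.request

# Assumes the local dev server; adjust the host and port for your deployment.
PREDICT_URL = "http://localhost:8000/predict/predict"


def build_predict_request(row, model_id, return_proba=False):
    """Build (but do not send) a JSON POST request for a single row."""
    payload = {"data": row, "model_id": model_id, "return_proba": return_proba}
    return urllib.request.Request(
        PREDICT_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_predict_request(
    {"age": 35, "salary": 60000, "department": "Sales"}, model_id="my_model"
)

# With the server running, the request could be sent like this:
#   with urllib.request.urlopen(request) as response:
#       result = json.load(response)
```

Building the request separately from sending it keeps the payload easy to inspect or log before any network call is made.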


🤖 Available Models

Classification

| Model | Best For |
|-------|----------|
| random forest | Best all-rounder |
| gradient boosting | Highest accuracy, slower |
| logistic regression | When relationship is linear |
| svc | Small/medium high-dimension data |
| knn | Small datasets |
| naive bayes | Very fast baseline |
| decision tree | Simple, interpretable |
| adaboost | Clean data |
| extra trees | Faster than random forest |

Regression

| Model | Best For |
|-------|----------|
| random forest regressor | Best all-rounder |
| gradient boosting regressor | Highest accuracy, slower |
| linear regression | Linear relationships |
| ridge | Linear + reduces overfitting |
| lasso | Linear + removes useless features |
| svr | Small/medium datasets |
| knn regressor | Small datasets |
| decision tree regressor | Non-linear data |
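Because model_name is a free-text string, a client may want to check it before uploading a large CSV. The sketch below copies the names from the two tables above. The function `task_for` is hypothetical, the lower-casing and whitespace normalization is an assumption about what the API tolerates, and GET /predict/available-models remains the authoritative list.

```python
# Names copied from the tables above; the API may accept others.
CLASSIFIERS = {
    "random forest", "gradient boosting", "logistic regression", "svc", "knn",
    "naive bayes", "decision tree", "adaboost", "extra trees",
}
REGRESSORS = {
    "random forest regressor", "gradient boosting regressor", "linear regression",
    "ridge", "lasso", "svr", "knn regressor", "decision tree regressor",
}


def task_for(model_name):
    """Return 'classification' or 'regression' for a known model name."""
    name = " ".join(model_name.lower().split())  # normalize case and spacing
    if name in CLASSIFIERS:
        return "classification"
    if name in REGRESSORS:
        return "regression"
    raise ValueError(f"unknown model name: {model_name!r}")


print(task_for("Random Forest"))  # classification
print(task_for("ridge"))          # regression
```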

⚙️ Auto Preprocessing Pipeline

Every uploaded CSV goes through this automatically before training:

| Step | What Happens |
|------|--------------|
| Duplicates | Removed automatically |
| Missing values | Filled with median (numeric columns) |
| Outliers | Removed using IQR method (1.5× rule) |
| Categorical columns | Label encoded |
| Numeric columns | Scaled with StandardScaler |
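The cleaning steps in the table can be sketched for a single numeric column using only the standard library. This is an illustration of the techniques named above, not the code in cleaning_functions.py or outlier_functions.py. The function names are hypothetical, and the quartile interpolation method is an assumption.

```python
import statistics


def drop_duplicates(rows):
    """Remove repeated rows (given as tuples) while keeping first-seen order."""
    return list(dict.fromkeys(rows))


def fill_missing_with_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]


def iqr_bounds(values, k=1.5):
    """Return the fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def remove_outliers(values, k=1.5):
    """Keep only the values that fall inside the IQR fences."""
    lower, upper = iqr_bounds(values, k)
    return [v for v in values if lower <= v <= upper]


ages = [10, None, 11, 13, 12, 11, 100]  # hypothetical column
filled = fill_missing_with_median(ages)
print(filled)                   # [10, 11.5, 11, 13, 12, 11, 100]
print(iqr_bounds(filled))       # (8.75, 14.75)
print(remove_outliers(filled))  # [10, 11.5, 11, 13, 12, 11]
```

The missing value is filled first so that the quartiles are computed on a complete column; the value 100 then falls outside the upper fence and is dropped.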

The same transformations are applied to new rows at predict time, so prediction inputs are encoded and scaled exactly as the training data was.
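One way to picture this reuse is a preprocessor that learns its encodings and scaling statistics once, during training, and then applies them unchanged to each new row. The class below is a hypothetical standard-library sketch, not the project's implementation. It assumes string columns are label-encoded in sorted order and numeric columns are standardized with the population standard deviation, which is how scikit-learn's LabelEncoder and StandardScaler behave.

```python
import statistics


class FittedPreprocessor:
    """Learns encodings and scaling statistics once, then reuses them."""

    def fit(self, rows):
        self.label_maps = {}  # column -> {category: integer code}
        self.scaling = {}     # column -> (mean, std)
        for col in rows[0].keys():
            values = [row[col] for row in rows]
            if isinstance(values[0], str):
                categories = sorted(set(values))
                self.label_maps[col] = {c: i for i, c in enumerate(categories)}
            else:
                std = statistics.pstdev(values)
                # A constant column has std 0; fall back to 1.0 to avoid dividing by zero.
                self.scaling[col] = (statistics.fmean(values), std or 1.0)
        return self

    def transform(self, row):
        out = {}
        for col, value in row.items():
            if col in self.label_maps:
                out[col] = self.label_maps[col][value]  # KeyError on an unseen category
            else:
                mean, std = self.scaling[col]
                out[col] = (value - mean) / std
        return out


train_rows = [  # hypothetical training data
    {"age": 30, "department": "Sales"},
    {"age": 40, "department": "HR"},
    {"age": 50, "department": "Sales"},
]
pre = FittedPreprocessor().fit(train_rows)
print(pre.transform({"age": 40, "department": "HR"}))  # {'age': 0.0, 'department': 0}
```

Raising on an unseen category is a deliberate choice in this sketch: silently inventing a new code at predict time would feed the model a value it never saw during training.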


🛠️ Technologies Used

Tool Purpose
FastAPI REST API framework
Uvicorn ASGI server
Scikit-learn ML models, preprocessing, evaluation
Pandas Data manipulation
Pydantic Request validation
Joblib Scaler persistence

📝 Notes

  • Models are stored in memory — they reset when the server restarts
  • Multiple models can be trained and stored using different model_id values
  • return_proba: true only works for models that support probability scores
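The notes above can be illustrated with a minimal in-memory registry. This is a sketch under stated assumptions, not the contents of storage.py: the names `MODEL_STORE`, `save_model`, and `predict_with` are hypothetical, and `AlwaysOne` is a stand-in for a trained estimator. The probability check relies on the scikit-learn convention that estimators supporting probabilities expose a `predict_proba` method.

```python
MODEL_STORE = {}  # model_id -> trained model; lives only as long as the process


def save_model(model_id, model):
    """Store a model under model_id, replacing any earlier model with that id."""
    MODEL_STORE[model_id] = model


def predict_with(model_id, rows, return_proba=False):
    """Look up a stored model and return predictions or probabilities."""
    if model_id not in MODEL_STORE:
        raise KeyError(f"no model stored under {model_id!r}")
    model = MODEL_STORE[model_id]
    if return_proba:
        if not hasattr(model, "predict_proba"):
            raise ValueError("this model does not expose probability scores")
        return model.predict_proba(rows)
    return model.predict(rows)


class AlwaysOne:
    """Stand-in for a trained estimator that has no predict_proba."""

    def predict(self, rows):
        return [1 for _ in rows]


save_model("my_model", AlwaysOne())
print(predict_with("my_model", [{"age": 35}]))  # [1]
```

Because the dictionary lives in process memory, restarting the server empties it, which matches the first note above.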

👤 Author

Vatsal
ML Pipeline API
FastAPI | Scikit-learn | Python
