ML Pipeline API

A FastAPI-based machine learning pipeline that accepts a CSV file, automatically cleans the data, trains a model of your choice, and serves predictions, all through REST endpoints.

📦 Installation

pip install -r requirements.txt

Run the API

uvicorn main:app --reload

Then open http://localhost:8000/docs for the interactive Swagger UI.

📁 Project Structure

ml_pipeline_api/
│
├── app/
│   ├── functions/
│   │   ├── cleaning_functions.py   # Missing values, duplicates, summary
│   │   ├── outlier_functions.py    # IQR-based outlier detection & removal
│   │   ├── feature_functions.py    # Label encoding, one-hot, binning, log transform
│   │   ├── scaling_functions.py    # StandardScaler, MinMaxScaler, RobustScaler
│   │   └── model_functions.py      # 20+ models, train, predict, evaluate
│   │
│   ├── routers/
│   │   └── predict.py              # All API endpoints
│   │
│   ├── models/                  
│   ├── main.py                      # FastAPI app entry point
│   └── storage.py                   # In-memory model storage
│  
│
├── requirements.txt
└── README.md

🔁 How It Works

Upload CSV → Auto Clean → Encode → Scale → Train → Evaluate → Predict

1. Upload a CSV that includes a target column.
2. The pipeline auto-cleans the data: it removes duplicates, fills missing values, and removes outliers.
3. Categorical columns are label-encoded and numeric columns are scaled.
4. The model is trained and stored in memory under a `model_id`.
5. Send new rows to `/predict/predict` to get predictions.
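The cleaning step can be illustrated with a minimal, standard-library sketch. This is not the project's own code (which lives in `cleaning_functions.py` and `outlier_functions.py`); the helper names and the sample rows below are hypothetical, and rows are modelled as plain dictionaries rather than a pandas DataFrame.

```python
import statistics


def drop_duplicates(rows):
    """Keep the first occurrence of each identical row."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def fill_missing_with_median(rows, column):
    """Replace None in a numeric column with the median of the present values."""
    present = [r[column] for r in rows if r[column] is not None]
    median = statistics.median(present)
    return [{**r, column: median if r[column] is None else r[column]} for r in rows]


def remove_iqr_outliers(rows, column, k=1.5):
    """Drop rows whose value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = [r[column] for r in rows]
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [r for r in rows if low <= r[column] <= high]


# Hypothetical sample data: one duplicate, one missing value, one outlier.
rows = [
    {"age": 25, "salary": 45000},
    {"age": 25, "salary": 45000},
    {"age": 30, "salary": None},
    {"age": 35, "salary": 50000},
    {"age": 40, "salary": 60000},
    {"age": 45, "salary": 55000},
    {"age": 50, "salary": 1000000},
]

# Same order as the steps above: duplicates, missing values, outliers.
cleaned = drop_duplicates(rows)
cleaned = fill_missing_with_median(cleaned, "salary")
cleaned = remove_iqr_outliers(cleaned, "salary")
```

With pandas, the same three steps would typically map onto `drop_duplicates`, `fillna` with a column median, and a boolean mask built from `quantile(0.25)` and `quantile(0.75)`.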

🌐 API Endpoints

| Method | Endpoint | Description |
| --- | --- | --- |
| GET | `/` | Health check |
| GET | `/predict/available-models` | List all available models |
| POST | `/predict/train` | Upload CSV and train a model |
| GET | `/predict/evaluate` | Get evaluation metrics |
| POST | `/predict/predict` | Send a row, get prediction |

GET `/predict/available-models`

Returns all available classification and regression models with descriptions.

POST `/predict/train`

Upload a CSV and train a model.

Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| `model_name` | string | e.g. `"random forest"` |
| `target_column` | string | Column to predict |
| `model_id` | string | Name to save model under (default: `"default"`) |
| `parameters` | JSON string | Optional hyperparameters e.g. `{"n_estimators": 200}` |
| `file` | CSV file | Your dataset |

Example Response:

{
  "status": "trained",
  "model_used": "random forest",
  "model_id": "my_model",
  "train_rows": 800,
  "test_rows": 200,
  "summary": {
    "duplicates_removed": 5,
    "outliers_removed": {"age": 3},
    "rows_remaining": 1000
  }
}
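A client-side sketch of a training request, using the parameters listed above. Two things here are assumptions rather than facts from this README: that the text parameters are sent as form fields next to the file (if the endpoint declares them as query parameters, pass them via `params=` instead of `data=`), and the helper names `build_train_fields` and `train`. The third-party `requests` library is imported lazily so that the field builder can be used on its own.

```python
import json


def build_train_fields(model_name, target_column, model_id="default", parameters=None):
    """Assemble the text fields for POST /predict/train.

    `parameters` is a dict of hyperparameters; the API expects it as a JSON string.
    """
    fields = {
        "model_name": model_name,
        "target_column": target_column,
        "model_id": model_id,
    }
    if parameters is not None:
        fields["parameters"] = json.dumps(parameters)
    return fields


def train(base_url, csv_path, **kwargs):
    """Upload a CSV and train a model (assumes form fields plus a file upload)."""
    import requests  # third-party: pip install requests

    with open(csv_path, "rb") as f:
        response = requests.post(
            f"{base_url}/predict/train",
            data=build_train_fields(**kwargs),
            files={"file": (csv_path, f, "text/csv")},
        )
    response.raise_for_status()
    return response.json()


# Hypothetical values: "churn" is an illustrative target column name.
fields = build_train_fields(
    "random forest", "churn", model_id="my_model", parameters={"n_estimators": 200}
)
```

A call such as `train("http://localhost:8000", "data.csv", model_name="random forest", target_column="churn")` would then return the parsed JSON response, provided the server is running and the assumptions above hold.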

GET `/predict/evaluate`

Get evaluation metrics for a trained model.

Classification returns: accuracy, classification report, confusion matrix

Regression returns: R² score, MAE, MSE

Example Response:

{
  "model_id": "my_model",
  "model_name": "random forest",
  "results": {
    "accuracy": 0.96,
    "confusion_matrix": [[80, 5], [3, 112]]
  }
}
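For reference, the metrics named in this section follow standard definitions, sketched here in plain Python. The project itself relies on Scikit-learn for evaluation, so this is an illustration only; the function names and sample numbers are hypothetical. The confusion matrix is assumed to follow the Scikit-learn layout, with rows as true classes and columns as predicted classes.

```python
def accuracy_from_confusion(matrix):
    """Correct predictions (the diagonal) divided by all predictions."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def regression_metrics(y_true, y_pred):
    """Return R², MAE, and MSE for paired true and predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {"r2": 1 - ss_res / ss_tot, "mae": mae, "mse": mse}


# Hypothetical numbers, chosen only to make the arithmetic easy to follow.
acc = accuracy_from_confusion([[50, 10], [5, 35]])          # 85 correct out of 100
metrics = regression_metrics([3, 5, 7, 9], [2.5, 5.5, 7, 8])
```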

POST `/predict/predict`

Send a new row and get a prediction.

Example Request:

{
  "data": {"age": 35, "salary": 60000, "department": "Sales"},
  "model_id": "my_model",
  "return_proba": false
}

Example Response:

{
  "model_id": "my_model",
  "prediction": [1]
}

Set return_proba: true to get probability scores instead of class labels.
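A minimal client sketch for this endpoint using only the standard library. The request body mirrors the example request above; the helper names `build_predict_payload` and `post_predict` are hypothetical, and the base URL is assumed to be the local development server.

```python
import json
import urllib.request


def build_predict_payload(row, model_id, return_proba=False):
    """Build the JSON body for POST /predict/predict."""
    return {"data": row, "model_id": model_id, "return_proba": return_proba}


def post_predict(base_url, payload):
    """Send the payload as JSON and return the decoded response."""
    request = urllib.request.Request(
        f"{base_url}/predict/predict",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


payload = build_predict_payload(
    {"age": 35, "salary": 60000, "department": "Sales"}, model_id="my_model"
)
```

With the server running, `post_predict("http://localhost:8000", payload)` sends the request and returns the parsed response.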

🤖 Available Models

Classification

| Model | Best For |
| --- | --- |
| random forest | Best all-rounder |
| gradient boosting | Highest accuracy, slower |
| logistic regression | When relationship is linear |
| svc | Small/medium high-dimension data |
| knn | Small datasets |
| naive bayes | Very fast baseline |
| decision tree | Simple, interpretable |
| adaboost | Clean data |
| extra trees | Faster than random forest |

Regression

| Model | Best For |
| --- | --- |
| random forest regressor | Best all-rounder |
| gradient boosting regressor | Highest accuracy, slower |
| linear regression | Linear relationships |
| ridge | Linear + reduces overfitting |
| lasso | Linear + removes useless features |
| svr | Small/medium datasets |
| knn regressor | Small datasets |
| decision tree regressor | Non-linear data |

⚙️ Auto Preprocessing Pipeline

Every uploaded CSV goes through this automatically before training:

| Step | What Happens |
| --- | --- |
| Duplicates | Removed automatically |
| Missing values | Filled with median (numeric columns) |
| Outliers | Removed using IQR method (1.5× rule) |
| Categorical columns | Label encoded |
| Numeric columns | Scaled with StandardScaler |

The same fitted transformations (label encoders and scaler) are applied to new rows at predict time, so new inputs are processed exactly as the training data was.
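The idea of fitting transformations once and reusing them for new rows can be sketched without any dependencies. This is an illustration, not the project's code: the function names and sample values are hypothetical. It mirrors two documented Scikit-learn behaviours: `LabelEncoder` assigns integers to categories in sorted order, and `StandardScaler` uses the population standard deviation.

```python
import statistics


def fit_label_encoder(values):
    """Map each distinct category to an integer, in sorted order."""
    return {category: index for index, category in enumerate(sorted(set(values)))}


def fit_standard_scaler(values):
    """Return (mean, population standard deviation) of a numeric column."""
    return statistics.fmean(values), statistics.pstdev(values)


def transform_row(row, encoders, scalers):
    """Apply already-fitted encoders and scalers to a single row."""
    transformed = {}
    for column, value in row.items():
        if column in encoders:
            # Raises KeyError for a category that was not seen during training.
            transformed[column] = encoders[column][value]
        else:
            mean, std = scalers[column]
            transformed[column] = (value - mean) / std if std else 0.0
    return transformed


# Fit on (hypothetical) training columns once...
encoders = {"department": fit_label_encoder(["Sales", "HR", "Sales"])}
scalers = {"age": fit_standard_scaler([20, 30, 40])}

# ...then reuse the fitted state for every new row at predict time.
new_row = transform_row({"age": 30, "department": "Sales"}, encoders, scalers)
```

Keeping the fitted state (`encoders`, `scalers`) next to the stored model is what lets a row sent to `/predict/predict` be encoded and scaled the same way as the training data.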

🛠️ Technologies Used

| Tool | Purpose |
| --- | --- |
| FastAPI | REST API framework |
| Uvicorn | ASGI server |
| Scikit-learn | ML models, preprocessing, evaluation |
| Pandas | Data manipulation |
| Pydantic | Request validation |
| Joblib | Scaler persistence |

📝 Notes

Models are stored in memory — they reset when the server restarts
Multiple models can be trained and stored using different model_id values
return_proba: true only works for models that support probability scores

👤 Author

Vatsal
ML Pipeline API
FastAPI | Scikit-learn | Python
