A FastAPI-based machine learning pipeline that accepts any CSV, automatically cleans data, trains a model of your choice, and serves predictions — all through REST endpoints.
Install the dependencies and start the server:

    pip install -r requirements.txt
    uvicorn main:app --reload

Then open http://localhost:8000/docs for the interactive Swagger UI.
ml_pipeline_api/
│
├── app/
│ ├── functions/
│ │ ├── cleaning_functions.py # Missing values, duplicates, summary
│ │ ├── outlier_functions.py # IQR-based outlier detection & removal
│ │ ├── feature_functions.py # Label encoding, one-hot, binning, log transform
│ │ ├── scaling_functions.py # StandardScaler, MinMaxScaler, RobustScaler
│ │ └── model_functions.py # 20+ models, train, predict, evaluate
│ │
│ ├── routers/
│ │ └── predict.py # All API endpoints
│ │
│ ├── models/
│ ├── main.py # FastAPI app entry point
│ └── storage.py # In-memory model storage
│
│
├── requirements.txt
└── README.md
Upload CSV → Auto Clean → Encode → Scale → Train → Evaluate → Predict
- Upload any CSV with a target column
- Pipeline auto-cleans: removes duplicates, fills missing values, removes outliers
- Categorical columns are label-encoded, numeric columns are scaled
- Model is trained and stored in memory under a `model_id`
- Send new rows to `/predict/predict` to get predictions
| Method | Endpoint | Description |
|---|---|---|
| GET | / | Health check |
| GET | /predict/available-models | List all available models |
| POST | /predict/train | Upload CSV and train a model |
| GET | /predict/evaluate | Get evaluation metrics |
| POST | /predict/predict | Send a row, get prediction |
`GET /predict/available-models`: returns all available classification and regression models with descriptions.

`POST /predict/train`: upload a CSV and train a model.
Parameters:
| Parameter | Type | Description |
|---|---|---|
| `model_name` | string | e.g. "random forest" |
| `target_column` | string | Column to predict |
| `model_id` | string | Name to save model under (default: "default") |
| `parameters` | JSON string | Optional hyperparameters e.g. {"n_estimators": 200} |
| `file` | CSV file | Your dataset |
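Since `parameters` is a JSON string rather than a nested object, a client has to serialize the hyperparameters before sending them. The helper below is a minimal sketch of that step; `build_train_fields` and the `churned` target column are hypothetical names used only for illustration, and whether the API reads these values as form fields or query parameters is not stated here.

```python
from __future__ import annotations

import json
from typing import Any, Optional


def build_train_fields(
    model_name: str,
    target_column: str,
    model_id: str = "default",
    parameters: Optional[dict[str, Any]] = None,
) -> dict[str, str]:
    """Assemble the text fields for a /predict/train request (hypothetical helper)."""
    fields = {
        "model_name": model_name,
        "target_column": target_column,
        "model_id": model_id,
    }
    if parameters:
        # Hyperparameters travel as a JSON string, not as a nested object.
        fields["parameters"] = json.dumps(parameters)
    return fields


fields = build_train_fields(
    "random forest",
    "churned",  # hypothetical target column
    model_id="my_model",
    parameters={"n_estimators": 200},
)
print(fields)
```

The resulting dictionary can then be sent alongside the CSV file with any HTTP client that supports file uploads.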
Example Response:
{
  "status": "trained",
  "model_used": "random forest",
  "model_id": "my_model",
  "train_rows": 800,
  "test_rows": 200,
  "summary": {
    "duplicates_removed": 5,
    "outliers_removed": {"age": 3},
    "rows_remaining": 992
  }
}

`GET /predict/evaluate`: get evaluation metrics for a trained model.
- Classification returns: accuracy, classification report, confusion matrix
- Regression returns: R² score, MAE, MSE
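For reference, the three regression metrics can be computed by hand as in the sketch below. This is a pure-Python illustration of the definitions, not the project's code (which relies on Scikit-learn); the function name and the dictionary keys are assumptions.

```python
from statistics import fmean


def regression_metrics(y_true, y_pred):
    """Return R², MAE and MSE for paired true and predicted values."""
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = fmean(abs(e) for e in errors)   # mean absolute error
    mse = fmean(e * e for e in errors)    # mean squared error
    mean_true = fmean(y_true)
    ss_res = sum(e * e for e in errors)                 # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    # R² is undefined when every true value is identical (ss_tot == 0).
    r2 = 1.0 - ss_res / ss_tot
    return {"r2": r2, "mae": mae, "mse": mse}


metrics = regression_metrics([3.0, -0.5, 2.0, 7.0], [2.5, 0.0, 2.0, 8.0])
print(metrics)  # mae is 0.5 and mse is 0.375 for this input
```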
Example Response:
{
  "model_id": "my_model",
  "model_name": "random forest",
  "results": {
    "accuracy": 0.9123,
    "confusion_matrix": [[80, 5], [3, 112]]
  }
}

`POST /predict/predict`: send a new row and get a prediction.
Example Request:
{
  "data": {"age": 35, "salary": 60000, "department": "Sales"},
  "model_id": "my_model",
  "return_proba": false
}

Example Response:

{
  "model_id": "my_model",
  "prediction": [1]
}

Set `return_proba: true` to get probability scores instead of class labels.
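A request like the one above can be built with the Python standard library alone. The sketch below assumes the server runs at `http://localhost:8000` and that the body is sent as JSON; `build_predict_request` and `send` are hypothetical helper names, and the `"default"` fallback for `model_id` is assumed to mirror the training endpoint.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumed local development address


def build_predict_request(data, model_id="default", return_proba=False):
    """Build (but do not send) a POST request for /predict/predict."""
    payload = {"data": data, "model_id": model_id, "return_proba": return_proba}
    return urllib.request.Request(
        f"{BASE_URL}/predict/predict",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send the request and parse the JSON response (needs a running server)."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_predict_request(
    {"age": 35, "salary": 60000, "department": "Sales"},
    model_id="my_model",
)
# With the server running, send(request) returns the parsed response body.
```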
Classification models:

| Model | Best For |
|---|---|
| random forest | Strong general-purpose default |
| gradient boosting | Often the most accurate, slower to train |
| logistic regression | When the relationship is roughly linear |
| svc | Small to medium, high-dimensional data |
| knn | Small datasets |
| naive bayes | Very fast baseline |
| decision tree | Simple, interpretable |
| adaboost | Clean data with little noise |
| extra trees | Usually faster to train than random forest |

Regression models:

| Model | Best For |
|---|---|
| random forest regressor | Strong general-purpose default |
| gradient boosting regressor | Often the most accurate, slower to train |
| linear regression | Linear relationships |
| ridge | Linear, with regularization to reduce overfitting |
| lasso | Linear, can shrink coefficients of uninformative features to zero |
| svr | Small/medium datasets |
| knn regressor | Small datasets |
| decision tree regressor | Non-linear data |
Every uploaded CSV goes through this automatically before training:
| Step | What Happens |
|---|---|
| Duplicates | Removed automatically |
| Missing values | Filled with median (numeric columns) |
| Outliers | Removed using IQR method (1.5× rule) |
| Categorical columns | Label encoded |
| Numeric columns | Scaled with StandardScaler |
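The median fill and the 1.5× IQR rule from the table can be sketched for a single numeric column as follows. This is a simplified pure-Python illustration, not the project's implementation: the real pipeline works on whole rows of a Pandas DataFrame, and the quartile interpolation method used here (`method="inclusive"`) is an assumption.

```python
from statistics import median, quantiles


def fill_missing_with_median(values):
    """Replace None entries with the median of the present values."""
    present = [v for v in values if v is not None]
    fill = median(present)
    return [fill if v is None else v for v in values]


def iqr_bounds(values, k=1.5):
    """Return the (low, high) fence: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def remove_outliers(values, k=1.5):
    """Keep only the values that fall inside the IQR fence."""
    low, high = iqr_bounds(values, k)
    return [v for v in values if low <= v <= high]


ages = [25, 31, None, 29, 35, 33, 27, 120]
filled = fill_missing_with_median(ages)  # None becomes 31
cleaned = remove_outliers(filled)        # 120 lies above the upper fence of 41.0
print(cleaned)  # [25, 31, 31, 29, 35, 33, 27]
```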
The same fitted transformations (label encodings and scaler) are applied to new rows at predict time, so nothing is re-fit on incoming data.
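One way to picture this is a preprocessor that learns its encodings and scaling statistics once, at training time, and only reuses them afterwards. The class below is a hypothetical pure-Python stand-in for the project's label encoding and StandardScaler steps; it uses sorted category order and the population standard deviation, which matches how Scikit-learn's `LabelEncoder` and `StandardScaler` behave.

```python
from statistics import fmean, pstdev


class FittedPreprocessor:
    """Learns encodings and scaling statistics once, then reuses them."""

    def fit(self, rows, categorical, numeric):
        # Map each category to an integer, in sorted order.
        self.categories = {
            col: {v: i for i, v in enumerate(sorted({r[col] for r in rows}))}
            for col in categorical
        }
        # Store mean and population standard deviation per numeric column.
        self.stats = {}
        for col in numeric:
            vals = [r[col] for r in rows]
            self.stats[col] = (fmean(vals), pstdev(vals) or 1.0)
        return self

    def transform(self, row):
        # Uses only what fit() stored; an unseen category raises KeyError.
        out = {}
        for col, mapping in self.categories.items():
            out[col] = mapping[row[col]]
        for col, (mean, std) in self.stats.items():
            out[col] = (row[col] - mean) / std
        return out


train_rows = [
    {"age": 30, "department": "Sales"},
    {"age": 40, "department": "HR"},
    {"age": 50, "department": "Sales"},
]
pre = FittedPreprocessor().fit(train_rows, categorical=["department"], numeric=["age"])
new_row = pre.transform({"age": 40, "department": "Sales"})
print(new_row)  # {'department': 1, 'age': 0.0}
```

Because `transform` never recomputes the mean, standard deviation, or category mapping, a single incoming row is scaled against the training distribution rather than against itself.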
| Tool | Purpose |
|---|---|
| FastAPI | REST API framework |
| Uvicorn | ASGI server |
| Scikit-learn | ML models, preprocessing, evaluation |
| Pandas | Data manipulation |
| Pydantic | Request validation |
| Joblib | Scaler persistence |
- Models are stored in memory, so they reset when the server restarts
- Multiple models can be trained and stored using different `model_id` values
- `return_proba: true` only works for models that support probability scores
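The notes above suggest a storage layer shaped roughly like the sketch below: a module-level dictionary keyed by `model_id`, plus a check for `predict_proba` before returning probabilities. All names here (`_MODELS`, `save_model`, `get_model`, `predict`, `ConstantModel`) are hypothetical and not taken from `storage.py`.

```python
from typing import Any, Dict

# Lives only in this process, so a server restart empties it.
_MODELS: Dict[str, Any] = {}


def save_model(model_id: str, model: Any) -> None:
    """Store a trained model under the given id, replacing any previous one."""
    _MODELS[model_id] = model


def get_model(model_id: str) -> Any:
    """Look up a stored model, with a clearer error for unknown ids."""
    try:
        return _MODELS[model_id]
    except KeyError:
        raise KeyError(f"No trained model stored under {model_id!r}") from None


def predict(model_id: str, rows, return_proba: bool = False):
    """Predict with a stored model, optionally returning probabilities."""
    model = get_model(model_id)
    if return_proba:
        # Estimators that support probabilities expose predict_proba.
        if not hasattr(model, "predict_proba"):
            raise ValueError(f"Model {model_id!r} does not support probabilities")
        return model.predict_proba(rows)
    return model.predict(rows)


class ConstantModel:
    """Tiny stand-in for a trained estimator without predict_proba."""

    def __init__(self, label):
        self.label = label

    def predict(self, rows):
        return [self.label for _ in rows]


save_model("my_model", ConstantModel(1))
save_model("baseline", ConstantModel(0))
print(predict("my_model", [[35, 60000]]))  # [1]
```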
Vatsal ML Pipeline API FastAPI | Scikit-learn | Python