Machine-Learning-Project

Setting up the project

Download the dataset
Download the data from KaggleHub and save the dataset into a folder called dataset in the root directory of this project. The dataset file should be named cab_rides.csv.
Install dependencies
Run the following command to install the required Python packages:
```
pip install -r requirements.txt
```
Run the project
Execute the main.py file to pre-process the data (have not started on models yet), for example with python main.py from the project root.
Error Handling
If dataset/cab_rides.csv is missing, the program will notify you to download the file manually. The dataset is not committed to the repository because the files are too large for GitHub to upload correctly.
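The missing-file check described above could look roughly like the sketch below. The function name find_dataset and the wording of the message are hypothetical; only the dataset/cab_rides.csv path comes from this README, and the actual check in main.py may differ.

```python
from pathlib import Path

# Path taken from the setup instructions; everything else here is illustrative.
DATASET_PATH = Path("dataset") / "cab_rides.csv"


def find_dataset(path=DATASET_PATH):
    """Return the dataset path, or raise with a download hint if it is missing."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(
            f"{path} not found. Download cab_rides.csv from KaggleHub "
            "and place it in the dataset folder."
        )
    return path
```

Raising an exception instead of printing keeps the check reusable: a caller such as main.py can catch FileNotFoundError and show the message before exiting.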

Project Overview

This project aims to predict the price of a cab ride using a machine learning pipeline trained on ride data from Lyft and Uber. This task falls under supervised regression, where the model learns patterns between input features (like distance, cab type, and surge multiplier) and the continuous target variable (price).

By implementing, training, tuning, and evaluating multiple models, we aim to identify the most accurate and generalizable approach for price prediction.

Preprocessing
- Removes all rows where the target variable price is missing.
- Scales the numerical fields distance and surge_multiplier.
- Uses KNN imputation to fill in missing values for the input features distance and surge_multiplier.
- Fills in missing cab_type values with the most frequent value.
KNN imputation is used for the fields that matter most in determining the price of a cab trip, while the simpler approach of filling in the mode is used for cab_type to keep the program from becoming cumbersome.
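As a rough illustration of the steps above, here is a minimal sketch using pandas and scikit-learn. The function name preprocess, the choice of StandardScaler, and the default n_neighbors=5 are assumptions; the project's own code in import_data.py or main.py may be organised differently.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

NUMERIC_COLS = ["distance", "surge_multiplier"]


def preprocess(df, n_neighbors=5):
    """Hypothetical helper mirroring the four preprocessing steps."""
    # 1. Drop rows whose target (price) is missing.
    df = df.dropna(subset=["price"]).reset_index(drop=True).copy()

    # 2. Scale the numeric inputs. StandardScaler ignores NaN when fitting
    #    and leaves NaN in place when transforming.
    scaled = StandardScaler().fit_transform(df[NUMERIC_COLS])

    # 3. Fill the remaining gaps from the nearest rows (KNN imputation).
    df[NUMERIC_COLS] = KNNImputer(n_neighbors=n_neighbors).fit_transform(scaled)

    # 4. Fill missing cab_type values with the most frequent value (the mode).
    df["cab_type"] = df["cab_type"].fillna(df["cab_type"].mode().iloc[0])
    return df
```

Scaling before imputing means the neighbor distances are not dominated by whichever column has the larger numeric range.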
Models Implemented
Each model is trained on the processed dataset and validated using GridSearchCV where applicable.

Linear Regression
- Implements a simple baseline model.
- Outputs Mean Squared Error (MSE) and R² score for training and test data.
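A baseline of this kind might be sketched as follows. The function name fit_linear_baseline, the 80/20 split, and the fixed seed are assumptions made for illustration, not details taken from linear_regression.py.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def fit_linear_baseline(X, y, test_size=0.2, seed=0):
    """Fit a linear model and report MSE and R² on the train and test splits."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed
    )
    model = LinearRegression().fit(X_train, y_train)

    scores = {}
    for name, X_part, y_part in [("train", X_train, y_train), ("test", X_test, y_test)]:
        pred = model.predict(X_part)
        scores[name] = {
            "mse": mean_squared_error(y_part, pred),
            "r2": r2_score(y_part, pred),
        }
    return model, scores
```

Comparing the train and test scores side by side gives a quick read on overfitting before the more flexible models are tried.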

K-Nearest Neighbors (KNN)
- Predicts using feature similarity.
- Hyperparameters tuned:
  - n_neighbors (5–20)
  - weights (uniform, distance)
- GridSearchCV used for optimal configuration.
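The KNN search could be set up along these lines. This is a sketch, not the contents of knn_regression.py: the function name tune_knn, trying every integer from 5 to 20, the neg_mean_squared_error scoring, and the 5-fold default are all assumptions.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor


def tune_knn(X_train, y_train, cv=5):
    """Grid-search n_neighbors and weights for a KNN regressor."""
    param_grid = {
        "n_neighbors": list(range(5, 21)),    # 5 to 20 inclusive (step assumed)
        "weights": ["uniform", "distance"],
    }
    search = GridSearchCV(
        KNeighborsRegressor(),
        param_grid,
        scoring="neg_mean_squared_error",     # higher (closer to 0) is better
        cv=cv,
    )
    search.fit(X_train, y_train)
    return search
```

After fitting, search.best_params_ holds the selected configuration and search.best_estimator_ is refit on the full training data.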

Random Forest Regressor
- An ensemble method using decision trees.
- Hyperparameters tuned:
  - n_estimators (50–200)
  - max_depth (None, 10, 20)
  - min_samples_split (2, 5)
- Outputs training MSE, R², and best parameters

Gradient Boosting Regressor
- Boosted ensemble of shallow trees.
- Hyperparameters tuned:
  - n_estimators (50–250)
  - learning_rate (0.01–0.3)
- Outputs negative MSE and best parameters
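The negative MSE mentioned above is most likely a side effect of scikit-learn's scoring convention: GridSearchCV always maximises its score, so the error metric is exposed as neg_mean_squared_error. The sketch below shows one way to run the search and flip the sign back. The function name, the specific grid values inside the stated ranges, and the 3-fold default are assumptions.

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Illustrative grid: the README gives only the ranges, not the exact values.
DEFAULT_GRID = {
    "n_estimators": [50, 150, 250],
    "learning_rate": [0.01, 0.1, 0.3],
}


def tune_gradient_boosting(X_train, y_train, param_grid=None, cv=3):
    """Grid-search a gradient boosting regressor; return best params and MSE."""
    search = GridSearchCV(
        GradientBoostingRegressor(random_state=0),
        param_grid or DEFAULT_GRID,
        scoring="neg_mean_squared_error",
        cv=cv,
    )
    search.fit(X_train, y_train)
    # best_score_ is the negated MSE, so negate it again for a positive error.
    return search.best_params_, -search.best_score_
```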
Performance Evaluation
- All models are evaluated using:
  - Mean Squared Error (MSE)
  - R² Score
  - Cross-validation on validation set
- Evaluation occurs after each model is trained and hyperparameters are tuned.
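For reference, the two metrics can be computed by hand as in the sketch below. The function names mse and r2 are hypothetical; evaluation.py presumably relies on library implementations instead.

```python
def mse(y_true, y_pred):
    """Mean of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """1 minus (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


prices = [10.0, 12.0, 14.0, 16.0]
predicted = [11.0, 12.0, 13.0, 17.0]
print(mse(prices, predicted))  # 0.75
print(r2(prices, predicted))   # 0.85
```

MSE is in squared price units, so lower is better, while R² is unitless and reaches 1.0 for a perfect fit.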

