PySpark ML on Big Data: NYC Taxi Fare Prediction

This project demonstrates basic machine learning with PySpark on a large dataset: the NYC Taxi Fare Prediction training set (about 55 million rows). The script loads the data, preprocesses it, and fits a linear regression model to predict fare amounts.

Setup

Download train.csv from https://www.kaggle.com/competitions/new-york-city-taxi-fare-prediction/data and place it in data/.
Install dependencies: pip install -r requirements.txt
Run: spark-submit src/main.py

Why PySpark?

PySpark (Apache Spark's Python API) suits ML on big data because it distributes processing across a cluster and can handle datasets larger than a single machine's memory. Unlike single-node libraries such as pandas or scikit-learn, it scales horizontally, tolerates worker failures by recomputing lost partitions, and uses lazy evaluation so that Spark can optimize a whole chain of transformations before running any of it.

Results

The script trains a linear regression model and reports the root mean squared error (RMSE) on a held-out test split.
