leif-erickson/big_data

PySpark ML on Big Data: NYC Taxi Fare Prediction

This project demonstrates basic machine learning with PySpark on a large dataset: the NYC taxi fare data (about 55 million rows). It loads the data, preprocesses it, and fits a linear regression model to predict fare amounts.
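The three stages named above (loading, preprocessing, linear regression) could be sketched roughly as follows. This is a minimal illustration, not the contents of src/main.py: the column names are assumed from the Kaggle dataset description, and the helper name `train_and_evaluate`, the 80/20 split, the seed, and the "drop nulls and non-positive fares" cleaning rule are all hypothetical choices made for the example.

```python
# Assumed column names (from the Kaggle dataset description, not from src/main.py).
FEATURE_COLS = [
    "pickup_longitude",
    "pickup_latitude",
    "dropoff_longitude",
    "dropoff_latitude",
    "passenger_count",
]
LABEL_COL = "fare_amount"


def train_and_evaluate(spark, csv_path="data/train.csv", seed=42):
    """Load the CSV, clean it, fit a linear regression, and return the test RMSE.

    `spark` is an existing SparkSession. Hypothetical helper for illustration.
    """
    # Imported inside the function so the constants above can be reused
    # on a machine without a Spark installation.
    from pyspark.ml import Pipeline
    from pyspark.ml.evaluation import RegressionEvaluator
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.regression import LinearRegression
    from pyspark.sql import functions as F

    # Data loading. inferSchema makes Spark scan the file once more to guess types.
    df = spark.read.csv(csv_path, header=True, inferSchema=True)

    # Preprocessing (assumed rule): keep the needed columns, drop rows with
    # nulls, and drop non-positive fares.
    df = (
        df.select(*FEATURE_COLS, LABEL_COL)
        .dropna()
        .filter(F.col(LABEL_COL) > 0)
    )

    # Assumed 80/20 train/test split.
    train, test = df.randomSplit([0.8, 0.2], seed=seed)

    # Pack the feature columns into one vector column, then fit the regression.
    pipeline = Pipeline(stages=[
        VectorAssembler(inputCols=FEATURE_COLS, outputCol="features"),
        LinearRegression(featuresCol="features", labelCol=LABEL_COL),
    ])
    model = pipeline.fit(train)

    # Score the held-out rows and compute RMSE.
    evaluator = RegressionEvaluator(
        labelCol=LABEL_COL, predictionCol="prediction", metricName="rmse"
    )
    return evaluator.evaluate(model.transform(test))
```

Given an existing session, for example one created with `SparkSession.builder.getOrCreate()`, calling `train_and_evaluate(spark)` would return the test RMSE as a float.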

Setup

  1. Download train.csv from https://www.kaggle.com/competitions/new-york-city-taxi-fare-prediction/data and place it in data/.
  2. Install dependencies: pip install -r requirements.txt
  3. Run: spark-submit src/main.py

Why PySpark?

PySpark (Apache Spark's Python API) suits ML on big data because it distributes processing across a cluster, so it can handle datasets larger than a single machine's memory. Unlike single-node libraries such as pandas or scikit-learn, it scales horizontally, tolerates worker failures, and uses lazy evaluation to optimize a chain of operations before running it.

Results

The script trains a linear regression model and reports root mean squared error (RMSE) on a held-out test split.
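For readers unfamiliar with the metric, RMSE is the square root of the mean squared difference between predicted and actual fares, so it is expressed in the same unit as the fare. A plain-Python sketch of the calculation follows; the function name `rmse` is a hypothetical helper, and it only illustrates the definition rather than how the script computes the metric at scale.

```python
import math


def rmse(labels, predictions):
    """Root mean squared error between two equal-length sequences of numbers."""
    if len(labels) != len(predictions) or len(labels) == 0:
        raise ValueError("labels and predictions must be non-empty and equal in length")
    # Square each error, average the squares, then take the square root.
    squared_errors = [(y - y_hat) ** 2 for y, y_hat in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))
```

Because the errors are squared before averaging, a few badly mispredicted trips raise RMSE more than many small misses do.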

About

Work with larger datasets; FP, pyspark, arrow....