This project demonstrates simple machine learning techniques using PySpark on a large dataset (NYC Taxi fares, ~55M rows). It performs data loading, preprocessing, and linear regression to predict fare amounts.
- Download `train.csv` from https://www.kaggle.com/competitions/new-york-city-taxi-fare-prediction/data and place it in `data/`.
- Install dependencies: `pip install -r requirements.txt`
- Run: `spark-submit src/main.py`
PySpark (Apache Spark's Python API) suits ML on large datasets because it distributes processing across a cluster, so it can handle data larger than a single machine's memory. Unlike single-node libraries such as pandas or scikit-learn, it scales horizontally, recovers from worker failures, and optimizes execution plans through lazy evaluation.
The script trains a model and evaluates RMSE on a test split.