This project demonstrates a distributed machine learning pipeline for financial forecasting using Apache Spark (PySpark). It simulates a real-world banking environment to analyze customer data, predict term deposit subscriptions, and process real-time transactions.
The project is modularized into five Python scripts, each focusing on a specific aspect of distributed computing:
- `01_data_management.py`: Handles data loading from CSV and efficient storage using Parquet (simulating Hive/Hadoop).
- `02_eda.py`: Performs Exploratory Data Analysis (EDA) and generates visualizations (saved in `plots/`).
- `03_predictive_modeling.py`: Trains a Random Forest Classifier to predict customer subscription behavior (`y`).
- `04_streaming.py`: Simulates real-time banking transactions and processes them using Spark Structured Streaming.
- `05_parallelism.py`: Demonstrates data parallelism techniques like repartitioning and parallel aggregation.
- Python 3.10+
- Java 8/11/17 (Required for Apache Spark)
- uv (Fast Python package installer)
1. Clone the repository (if applicable) or navigate to the project directory.
2. Install dependencies using `uv`:

   ```bash
   uv init
   uv add pyspark pandas matplotlib numpy
   ```
Run the scripts in the following order to simulate the full pipeline:
1. Loads `bank.csv` and saves it as `bank_data.parquet`.

   ```bash
   uv run 01_data_management.py
   ```

2. Generates statistical summaries and plots in the `plots/` directory.

   ```bash
   uv run 02_eda.py
   ```

3. Trains a Random Forest model and saves it to `ml_model_trained/`.

   ```bash
   uv run 03_predictive_modeling.py
   ```

4. Simulates a stream of transactions and aggregates them in real time (runs for 30 seconds).

   ```bash
   uv run 04_streaming.py
   ```

5. Demonstrates parallel processing capabilities.

   ```bash
   uv run 05_parallelism.py
   ```

## Key Concepts

- Distributed Storage: Uses Parquet format to allow Spark to read data in parallel, mimicking a distributed file system like HDFS.
- In-Memory Processing: Spark caches data in memory for fast iterative processing during EDA and Model Training.
- Structured Streaming: Treats real-time data as an unbounded table, allowing the same DataFrame API to be used for both batch and streaming data.
- ML Pipelines: Encapsulates preprocessing (indexing, vector assembly) and modeling into a single portable workflow.

## Submission Materials

This repository contains the complete source code for the 5-part project.

- Part 1: Data Management (Parquet & Storage)
- Part 2: Exploratory Data Analysis (Spark SQL & Matplotlib)
- Part 3: Predictive Modeling (Random Forest & PySpark ML)
- Part 4: Real-Time Streaming (Structured Streaming)
- Part 5: Data Parallelism (Repartitioning & Distributed Aggregation)