SheikhSarvar/financial-forecasting

Financial Forecasting Frontier Distributed ML

This project demonstrates a distributed machine learning pipeline for financial forecasting using Apache Spark (PySpark). It simulates a real-world banking environment to analyze customer data, predict term deposit subscriptions, and process real-time transactions.

Project Structure

The project is modularized into five Python scripts, each focusing on a specific aspect of distributed computing:

  1. 01_data_management.py: Handles data loading from CSV and efficient storage using Parquet (simulating Hive/HDFS-style distributed storage).
  2. 02_eda.py: Performs Exploratory Data Analysis (EDA) and generates visualizations (saved in plots/).
  3. 03_predictive_modeling.py: Trains a Random Forest Classifier to predict customer subscription behavior (the target column y).
  4. 04_streaming.py: Simulates real-time banking transactions and processes them using Spark Structured Streaming.
  5. 05_parallelism.py: Demonstrates data parallelism techniques like repartitioning and parallel aggregation.

Prerequisites

  • Python 3.10+
  • Java 8/11/17 (Required for Apache Spark)
  • uv (Fast Python package installer)

Installation

  1. Clone the repository (if applicable) or navigate to the project directory.

  2. Install dependencies using uv:

    uv init
    uv add pyspark pandas matplotlib numpy

How to Run

Run the scripts in the following order to simulate the full pipeline:

1. Data Management

Loads bank.csv and saves it as bank_data.parquet.

uv run 01_data_management.py

2. Exploratory Data Analysis (EDA)

Generates statistical summaries and plots in the plots/ directory.

uv run 02_eda.py

3. Predictive Modeling

Trains a Random Forest model and saves it to ml_model_trained/.

uv run 03_predictive_modeling.py

4. Real-Time Streaming

Simulates a stream of transactions and aggregates them in real time (runs for 30 seconds).

uv run 04_streaming.py

5. Data Parallelism

Demonstrates parallel processing capabilities.

uv run 05_parallelism.py

How It Works

  • Distributed Storage: Uses Parquet format to allow Spark to read data in parallel, mimicking a distributed file system like HDFS.
  • In-Memory Processing: Spark caches data in memory for fast iterative processing during EDA and Model Training.
  • Structured Streaming: Treats real-time data as an unbounded table, allowing the same DataFrame API to be used for both batch and streaming data.
  • ML Pipelines: Encapsulates preprocessing (indexing, vector assembly) and modeling into a single portable workflow.

Submission Materials

This repository contains the complete source code for the 5-part project:

  • Part 1: Data Management (Parquet & Storage)
  • Part 2: Exploratory Data Analysis (Spark SQL & Matplotlib)
  • Part 3: Predictive Modeling (Random Forest & PySpark ML)
  • Part 4: Real-Time Streaming (Structured Streaming)
  • Part 5: Data Parallelism (Repartitioning & Distributed Aggregation)
