This project demonstrates a distributed machine learning pipeline for financial forecasting using Apache Spark (PySpark). It simulates a real-world banking environment to analyze customer data, predict term deposit subscriptions, and process real-time transactions.
The project is modularized into five Python scripts, each focusing on a specific aspect of distributed computing:
- `01_data_management.py`: Handles data loading from CSV and efficient storage using Parquet (simulating Hive/Hadoop).
- `02_eda.py`: Performs Exploratory Data Analysis (EDA) and generates visualizations (saved in `plots/`).
- `03_predictive_modeling.py`: Trains a Random Forest Classifier to predict customer subscription behavior (`y`).
- `04_streaming.py`: Simulates real-time banking transactions and processes them using Spark Structured Streaming.
- `05_parallelism.py`: Demonstrates data parallelism techniques like repartitioning and parallel aggregation.
- Python 3.10+
- Java 8/11/17 (Required for Apache Spark)
- uv (Fast Python package installer)
1. Clone the repository (if applicable) or navigate to the project directory.
2. Install dependencies using `uv`:

   ```bash
   uv init
   uv add pyspark pandas matplotlib numpy
   ```
Run the scripts in the following order to simulate the full pipeline:
1. Loads `bank.csv` and saves it as `bank_data.parquet`.

   ```bash
   uv run 01_data_management.py
   ```

2. Generates statistical summaries and plots in the `plots/` directory.

   ```bash
   uv run 02_eda.py
   ```

3. Trains a Random Forest model and saves it to `ml_model_trained/`.

   ```bash
   uv run 03_predictive_modeling.py
   ```

4. Simulates a stream of transactions and aggregates them in real time (runs for 30 seconds).

   ```bash
   uv run 04_streaming.py
   ```

5. Demonstrates parallel processing capabilities.

   ```bash
   uv run 05_parallelism.py
   ```

## Key Concepts

- Distributed Storage: Uses Parquet format to allow Spark to read data in parallel, mimicking a distributed file system like HDFS.
- In-Memory Processing: Spark caches data in memory for fast iterative processing during EDA and Model Training.
- Structured Streaming: Treats real-time data as an unbounded table, allowing the same DataFrame API to be used for both batch and streaming data.
- ML Pipelines: Encapsulates preprocessing (indexing, vector assembly) and modeling into a single portable workflow.

## Submission Materials

This repository contains the complete source code for the 5-part project.

- Part 1: Data Management (Parquet & Storage)
- Part 2: Exploratory Data Analysis (Spark SQL & Matplotlib)
- Part 3: Predictive Modeling (Random Forest & PySpark ML)
- Part 4: Real-Time Streaming (Structured Streaming)
- Part 5: Data Parallelism (Repartitioning & Distributed Aggregation)