Term project for the course 'Information Systems Analysis and Design', taken during the 9th semester at the School of Electrical and Computer Engineering, NTUA, under the supervision of Professor Dimitrios Tsoumakos.
This project involves comparing two prominent Python scaling frameworks, Ray and Dask, for big data analysis and machine learning tasks. Ray and Dask are open-source frameworks designed to facilitate the parallel execution of Python code for distributed computing tasks. This comparison aims to evaluate their performance, scalability, and ease of use in handling large-scale data analytics and ML workloads.
Python has emerged as a popular programming language for data analysis and machine learning due to its simplicity, versatility, and extensive ecosystem of libraries and frameworks. However, as datasets continue to grow in size and complexity, traditional single-machine solutions become inadequate for processing and analyzing such data efficiently. This has led to the development of distributed computing frameworks like Ray and Dask, which enable parallel execution across multiple nodes or cores, thereby allowing users to scale their data analysis and ML workflows to handle large-scale datasets.
- Installation and Setup: Successfully install and configure Ray and Dask frameworks on local or cloud-based resources.
- Data Loading and Preprocessing: Generate or acquire real-world datasets and load them into Ray and Dask for analysis. Perform necessary preprocessing steps to prepare the data for analysis.
- Performance Measurement: Develop a suite of Python scripts to benchmark the performance of Ray and Dask across various data processing and ML tasks. Evaluate their scalability and efficiency under different workload scenarios.
- Comparison Analysis: Analyze and compare the performance, scalability, and ease of use of Ray and Dask based on the results obtained from the performance measurement phase. Identify the strengths and weaknesses of each framework for different types of tasks and workloads.
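The benchmarking scripts themselves live in this repository; as a sketch of the measurement approach (the function and parameter names below are illustrative, not the project's actual API), a simple wall-clock timing harness could look like:

```python
import statistics
import time

def benchmark(fn, *args, repeats=5, **kwargs):
    """Run `fn` several times and report wall-clock statistics in seconds.

    Returns (mean, stdev) over `repeats` runs; the return value of `fn`
    itself is discarded since only the timing is of interest here.
    """
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args, **kwargs)
        timings.append(time.perf_counter() - start)
    return statistics.mean(timings), statistics.stdev(timings)

# Example: time a toy CPU-bound workload.
mean_s, stdev_s = benchmark(lambda: sum(i * i for i in range(100_000)))
```

Running the same harness against equivalent Ray and Dask implementations of a task makes the frameworks directly comparable under identical workloads.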
To run the Python scripts, follow these steps:
- Installation and Setup of Virtual Machines:
  - Using the Okeanos documentation, create and set up your virtual machines. In this project we used a total of 3 VMs (1 master, 2 workers), each with 8 GB of RAM, 30 GB of disk storage, and 4 CPUs.
  - Create a private network connecting the three VMs.

- Framework Installation:
  - Install the Ray and Dask frameworks.
  - Install all necessary libraries.
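The exact package set depends on the scripts being run; a typical installation (package names are the standard PyPI distributions — the pinned versions used in the project are not reproduced here) might look like:

```shell
# Ray with its dashboard/cluster extras, Dask with all optional components
# (distributed scheduler, dask.dataframe, etc.), plus common data/ML libraries.
pip install "ray[default]" "dask[complete]" scikit-learn pandas numpy
```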
- Data Loading and Preprocessing:
  - Generate test data using the `datagen.py` script: `python3 datagen.py --num_samples <num_samples> --num_features <features>`
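The repository's `datagen.py` is not reproduced here; purely as an illustration of the idea, a minimal generator that writes a random binary-classification dataset in LIBSVM format (each line is a label followed by 1-based `index:value` pairs) could look like the following — all names and defaults are assumptions:

```python
import random

def generate_libsvm(path, num_samples, num_features, seed=0):
    """Write a random binary-classification dataset in LIBSVM format:
    each line is '<label> 1:<v1> 2:<v2> ...' with 1-based feature indices."""
    rng = random.Random(seed)
    with open(path, "w") as f:
        for _ in range(num_samples):
            label = rng.choice([0, 1])
            feats = " ".join(
                f"{j + 1}:{rng.random():.6f}" for j in range(num_features)
            )
            f.write(f"{label} {feats}\n")

# In the real script this would be wired to the documented CLI flags
# --num_samples and --num_features via argparse.
```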
- Classification:
  - For Dask:
    - Convert the `data.libsvm` file to a CSV file using the `convert_libsvm_to_csv` function.
    - Run `dask scheduler` to initiate the Dask cluster.
    - Connect the worker VMs by running `dask worker tcp://<ip_address:port>` on each of them.
    - Run the `.py` file.
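The `convert_libsvm_to_csv` helper belongs to the project's code and is not reproduced here; a minimal stand-in with the same purpose (the signature and dense output layout are assumptions) might be:

```python
import csv

def convert_libsvm_to_csv(libsvm_path, csv_path, num_features):
    """Convert a LIBSVM file ('<label> i:v i:v ...', 1-based indices)
    into a dense CSV with columns f1..fN and a trailing 'label' column.
    Features absent from a line are written as 0.0 (LIBSVM is sparse)."""
    with open(libsvm_path) as src, open(csv_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow([f"f{i}" for i in range(1, num_features + 1)] + ["label"])
        for line in src:
            parts = line.split()
            if not parts:
                continue
            label = parts[0]
            row = [0.0] * num_features
            for pair in parts[1:]:
                idx, val = pair.split(":")
                row[int(idx) - 1] = float(val)
            writer.writerow(row + [label])
```

A dense CSV is convenient here because `dask.dataframe.read_csv` can then partition the file across workers without a custom parser.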
  - For Ray:
    - Move the `data.libsvm` file to the `ray` folder.
    - Initiate the cluster with `ray start --head --dashboard-host "0.0.0.0"`.
    - Connect a worker node to the cluster with `ray start --address='<ip_address:port>'`.
    - Run the `.py` file.
- Clustering:
  - Follow the same steps as in the Classification section, using the scripts in the clustering folder.
- Athanasios Varis
- Georgios Vlachopoulos
- Ioannis Nikiforidis