Semester project for the "Advanced Databases" course. Extracting valuable analytics from a large dataset in a distributed cloud environment using Apache Hadoop and Apache Spark.
This project exists because:
a) It was mandatory in order to pass the course 😆 and
b) To showcase the capabilities of modern big data frameworks. Using Hadoop for storage and Spark for fast analytics, all in a cloud-friendly setup, this project helps users understand how to build scalable data pipelines, process massive datasets efficiently, and extract meaningful insights that traditional tools can't handle.
1. Install Apache Spark and Hadoop on your (virtual) machines.
To fully leverage their distributed processing capabilities, it is recommended to have at least two nodes: one configured as the master/worker node and the other as a worker node. You can find a sample guide here for setting up your virtual machines. A quick way to verify the cluster is sketched after the setup steps below.
If you choose not to follow the guide and would rather use a cloud environment of your choice, follow these steps:
- Create 2 virtual machines with the following characteristics:
  - Ubuntu Server 22.04 LTS
  - 4 CPUs
  - 8GB RAM
  - 30GB disk capacity
- Create a private network through which the virtual machines can communicate with one another.
- Install a Java version supported by both frameworks on both virtual machines (OpenJDK 11 works with current Hadoop 3.x and Spark 3.x releases).
- Download and install the latest stable releases of Apache Hadoop and Apache Spark from their official sites.
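Once both machines are configured, a quick way to confirm that the worker is actually reachable from the master is to run a trivial distributed job. The following is a minimal sketch, assuming PySpark is available and the standalone master was started with `sbin/start-master.sh` on a host named `master` at the default port 7077 (with the worker started via `sbin/start-worker.sh`); adjust the URL to your own hostname or IP.

```python
# verify_cluster.py -- minimal sanity check for the two-node Spark standalone cluster.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("cluster-sanity-check")
    .master("spark://master:7077")  # assumed master URL; replace with your own host/IP
    .getOrCreate()
)

# Run a trivial distributed job: sum the integers 1..1,000,000 across the workers.
total = spark.sparkContext.parallelize(range(1, 1_000_001)).sum()
print("Sum computed on the cluster:", total)  # expected: 500000500000
print("Default parallelism:", spark.sparkContext.defaultParallelism)

spark.stop()
```

If the job completes and the application shows up in the master's web UI (port 8080 by default), the cluster is ready for the next steps.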
2. Download the datasets (.csv files) that we will be working with onto both machines:
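If you prefer to script the download instead of fetching the files by hand, a small Python sketch could look like the following; the filename and URL are placeholders, since the actual dataset links are project-specific.

```python
# download_datasets.py -- fetch the project CSVs onto the local disk of a VM.
import urllib.request

# Placeholder entries: substitute the real dataset names and download links.
DATASETS = {
    "dataset.csv": "https://example.com/dataset.csv",
}

for filename, url in DATASETS.items():
    print(f"Downloading {url} -> {filename}")
    urllib.request.urlretrieve(url, filename)
```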
3. Use `hdfs dfs -put <local .csv file> <HDFS destination path>` to import the data into HDFS (Hadoop's distributed filesystem).
4. Run each query and view the results.
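As an illustration of steps 3 and 4, the sketch below reads one of the CSVs back out of HDFS and runs a simple aggregation, first with the DataFrame API and then as a Spark SQL statement. The NameNode address, HDFS path, and column name (`some_column`) are assumptions; substitute the values that match your cluster and datasets.

```python
# run_query.py -- sketch of step 4: read a dataset from HDFS and run one sample query.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("advanced-databases-query")
    .master("spark://master:7077")  # assumed standalone master URL
    .getOrCreate()
)

# Load a CSV that was previously imported with `hdfs dfs -put`.
# The NameNode host/port and path below are placeholders.
df = (
    spark.read
    .option("header", True)
    .option("inferSchema", True)
    .csv("hdfs://master:9000/data/dataset.csv")
)

# Example query with the DataFrame API: row counts per value of a column.
df.groupBy("some_column").agg(F.count("*").alias("rows")).show()

# The same query expressed in Spark SQL.
df.createOrReplaceTempView("dataset")
spark.sql("SELECT some_column, COUNT(*) AS rows FROM dataset GROUP BY some_column").show()

spark.stop()
```

Submitting it from the master node with `spark-submit run_query.py` distributes the work across both machines.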
If you have questions or run into problems, submit an issue here on GitHub.