Semester project for the "Advanced Databases" course. Extracting valuable analytics from a large dataset in a distributed cloud environment using Apache Hadoop and Apache Spark.
This project exists because:
a) It was mandatory in order to pass the course 😆 and
b) To showcase the capabilities of modern big data frameworks. Using Hadoop for storage and Spark for fast analytics, all in a cloud-friendly setup, this project helps users understand how to build scalable data pipelines, process massive datasets efficiently, and extract meaningful insights that traditional tools can't handle.
1. Install Apache Spark and Hadoop on your (virtual) machines.
To fully leverage their distributed processing capabilities, it is recommended to have at least two nodes: one configured as the master/worker node and the other as a worker node. You can find a sample guide here for setting up your virtual machines. A quick way to verify the cluster is sketched after the setup steps below.
If you choose not to follow the guide and would rather use a cloud environment of your choice, follow these steps:
- Create 2 virtual machines with the following characteristics:
  - Ubuntu Server 22.04 LTS
  - 4 CPUs
  - 8GB RAM
  - 30GB disk capacity
- Create a private network through which the virtual machines can communicate with one another.
- Install a Java version supported by both frameworks on both virtual machines (OpenJDK 11 works with current Hadoop 3.x and Spark 3.x releases).
- Download and install the latest stable releases of Apache Hadoop and Apache Spark from their official sites.
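Once both machines are configured, a quick way to confirm that the worker is actually reachable from the master is to run a trivial distributed job. The following is a minimal sketch, assuming PySpark is available and the standalone master was started with `sbin/start-master.sh` on a host named `master` at the default port 7077 (with the worker started via `sbin/start-worker.sh`); adjust the URL to your own hostname or IP.

```python
# verify_cluster.py -- minimal sanity check for the two-node Spark standalone cluster.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("cluster-sanity-check")
    .master("spark://master:7077")  # assumed master URL; replace with your own host/IP
    .getOrCreate()
)

# Run a trivial distributed job: sum the integers 1..1,000,000 across the workers.
total = spark.sparkContext.parallelize(range(1, 1_000_001)).sum()
print("Sum computed on the cluster:", total)  # expected: 500000500000
print("Default parallelism:", spark.sparkContext.defaultParallelism)

spark.stop()
```

If the job completes and the application shows up in the master's web UI (port 8080 by default), the cluster is ready for the next steps.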
2. Download the datasets (.csv files) that we will be working with onto both machines:
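If you prefer to script the download instead of fetching the files by hand, a small Python sketch could look like the following; the filename and URL are placeholders, since the actual dataset links are project-specific.

```python
# download_datasets.py -- fetch the project CSVs onto the local disk of a VM.
import urllib.request

# Placeholder entries: substitute the real dataset names and download links.
DATASETS = {
    "dataset.csv": "https://example.com/dataset.csv",
}

for filename, url in DATASETS.items():
    print(f"Downloading {url} -> {filename}")
    urllib.request.urlretrieve(url, filename)
```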
3. Use `hdfs dfs -put <local .csv file> <HDFS destination path>` to import the data into HDFS (Hadoop's distributed filesystem).
4. Run each query and view the results.
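As an illustration of steps 3 and 4, the sketch below reads one of the CSVs back out of HDFS and runs a simple aggregation, first with the DataFrame API and then as a Spark SQL statement. The NameNode address, HDFS path, and column name (`some_column`) are assumptions; substitute the values that match your cluster and datasets.

```python
# run_query.py -- sketch of step 4: read a dataset from HDFS and run one sample query.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("advanced-databases-query")
    .master("spark://master:7077")  # assumed standalone master URL
    .getOrCreate()
)

# Load a CSV that was previously imported with `hdfs dfs -put`.
# The NameNode host/port and path below are placeholders.
df = (
    spark.read
    .option("header", True)
    .option("inferSchema", True)
    .csv("hdfs://master:9000/data/dataset.csv")
)

# Example query with the DataFrame API: row counts per value of a column.
df.groupBy("some_column").agg(F.count("*").alias("rows")).show()

# The same query expressed in Spark SQL.
df.createOrReplaceTempView("dataset")
spark.sql("SELECT some_column, COUNT(*) AS rows FROM dataset GROUP BY some_column").show()

spark.stop()
```

Submitting it from the master node with `spark-submit run_query.py` distributes the work across both machines.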
If you have questions or run into problems, submit an issue here on GitHub.