This repository contains Databricks exercises that ingest Global Temperature and Global Temperature By Country data from Kaggle and CO2 Emissions data from OWID, and then transform them. The goal is to teach the basics of data wrangling and Spark by working through real-world questions:
- Which countries are worst hit (highest temperature anomalies)?
- Which countries are the biggest emitters?
- What are some sensible ways to rank the “biggest polluters”?
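To make the last question concrete: a ranking by total emissions and a ranking by emissions per person can disagree. The sketch below uses plain Python and invented figures for three unnamed countries. `rank_by` is a hypothetical helper, and per-capita ranking is only one possible reading of "sensible", not necessarily the approach the exercises take.

```python
# Invented figures for illustration only; they are not taken from the OWID data.
emissions = [
    {"country": "Country A", "co2_mt": 1000.0, "population_m": 500.0},
    {"country": "Country B", "co2_mt": 300.0, "population_m": 30.0},
    {"country": "Country C", "co2_mt": 50.0, "population_m": 2.0},
]


def rank_by(records, key_fn):
    """Return country names ordered from the highest to the lowest key_fn value."""
    ordered = sorted(records, key=key_fn, reverse=True)
    return [record["country"] for record in ordered]


# Ranking by total emissions (megatonnes of CO2).
by_total = rank_by(emissions, lambda r: r["co2_mt"])

# Ranking by emissions per person (megatonnes per million people).
by_per_capita = rank_by(emissions, lambda r: r["co2_mt"] / r["population_m"])

print(by_total)
print(by_per_capita)
```

With these made-up numbers the two orderings are exact opposites, which is why the choice of metric matters.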
To answer some of these questions, we picked open data from Our World in Data (OWID) and Kaggle.
The specific datasets:
- CO2 Emissions (2020).csv (OWID)
- GlobalLandTemperaturesByCountry (Kaggle)
- GlobalTemperatures.csv (Kaggle)
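As a rough illustration of what ingesting such a file involves, the sketch below parses a few rows shaped like a country-level temperature CSV. It uses Python's standard `csv` module rather than Spark so that it runs anywhere. The column names (`dt`, `AverageTemperature`, `Country`) and the sample values are assumptions made for illustration, and `parse_temperature_rows` is a hypothetical helper, not part of the exercises.

```python
import csv
import io
from datetime import date

# Made-up sample rows; the column names are an assumption about the file layout.
SAMPLE_CSV = """dt,AverageTemperature,Country
1900-01-01,-3.5,Norway
1900-02-01,,Norway
1900-01-01,24.1,Brazil
"""


def parse_temperature_rows(text):
    """Parse CSV text into dicts, turning empty temperature fields into None."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        raw_temp = record["AverageTemperature"].strip()
        rows.append({
            "date": date.fromisoformat(record["dt"]),
            "country": record["Country"].strip(),
            "avg_temp": float(raw_temp) if raw_temp else None,
        })
    return rows


rows = parse_temperature_rows(SAMPLE_CSV)
for row in rows:
    print(row)
```

Typing the columns explicitly and deciding how to represent missing values are the same choices the Spark-based ingestion has to make, only at a much larger scale.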
Because the point of this exercise is to learn how to work with messy data, and the datasets from OWID and Kaggle are both clean and curated, deliberately dirtied versions of the datasets are provided.
They can be found at:
- Basic knowledge of Python
- Basic knowledge of Spark
- Databricks Community Edition (free) account
- Navigate to the Databricks Community Edition login page
- Click Signup
- Fill in your details, then click "Get Started For Free"
- SCROLL TO THE BOTTOM to create a Community account
- Clone this repo if you haven't already done so
- Open Data Ingestion CO2 vs Temperature.py in Databricks Community Edition

- Follow the instructions in the notebook, and move on to the next exercise once all tests pass.
- Solutions can be found here.
- Clone this repo if you haven't already done so
- Open Data Transformation CO2 vs Temperature.py in Databricks Community Edition

- Follow the instructions in the notebook, and move on to the next exercise once all tests pass.
- Solutions can be found here.
