Background

Sparkify

Sparkify services music streaming app. Sparkify team is interested in which songs their users listening to, with their streaming app.

Analytical Goals

Goal of analysis is to understand what songs users are listening to
Currently, usage log data and song data are stored in a JSON format, which is hard to query on.
Data engineers are to design and build Redshift DB using JSON data, so that Sparkify team can easily query their data.

Database Schema Design

Since this database is for analytical purpose, I chose star schema for analytical query performance. It is not normalized to 3NF form, so this schema may lacks in terms of handling redundancy and consistency.

Fact tables
- song plays log table
Dimension tables
- songs table
- artist table
- user table
- time table

ETL Pipeline

ETL Pipeline Overview

Songs and log JSON files (in S3 bucket) are loaded into two separate staging tables (staging_songs, staging_events) using COPY command.
Once JSON files are loaded into staging tables, we use INSERT INTO commands to extract, transform and load data into the target fact & dimension tables as below.

Fact & Dimension Tables

Extracting songs and artist data
- Data for songs and artists relations are extracted from song_data (JSON format)
- From song_data,
  - song_id, title, artist_id, year, duration are loaded onto songs relation
  - artist_id, artist_name, artist_location, artist_latitude, artist_longitude are loaded onto artists relation
Extracting user, time, log data
- Data for user, time and song plays relations are extracted from log_data(JSON format)
- From log_data,
  - ts (timestamp) data is loaded onto time relation. Hour, day, week of year, month and weekday values are extracted from ts before loaded.
  - user_id, first_name, last_name, gender, level are loaded onto users relation
  - songplays table, which is a fact table, has ts, user_id, level, song_id, artist_id, session_id, location, user_agent as its columns

How to Run Pipeline

For creating tables,

python create_tables.py

For running ETL pipeline,

python etl.py

Example Queries

Query below fetches the most played song in songplays table

Query

SELECT songplays.song_id, title, COUNT(*) FROM songplays
JOIN songs ON songplays.song_id = songs.song_id
GROUP BY songplays.song_id, title
ORDER BY COUNT(*) DESC
LIMIT 1

Result

song_id	title	count
SOBONKR12A58A7A7E0	You're The One	37

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
.ipynb_checkpoints		.ipynb_checkpoints
README.md		README.md
create_tables.py		create_tables.py
dwh.cfg		dwh.cfg
etl.py		etl.py
sql_queries.py		sql_queries.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Background

Sparkify

Analytical Goals

Database Schema Design

ETL Pipeline

ETL Pipeline Overview

Fact & Dimension Tables

How to Run Pipeline

Example Queries

Query

Result

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Background

Sparkify

Analytical Goals

Database Schema Design

ETL Pipeline

ETL Pipeline Overview

Fact & Dimension Tables

How to Run Pipeline

Example Queries

Query

Result

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages