This project demonstrates how to model a NoSQL database using Apache Cassandra to support specific analytical queries for a fictional music streaming app called Sparkify. The focus is on applying Cassandra's query-driven data modeling principles, designing denormalized tables, and running queries using real-world event data.
```
Sparkify/
├── docker-compose.yml
├── Dockerfile
├── requirements.txt
├── sparkify_etl.py
├── event_datafile_new.csv
└── app/
    └── sparkify_etl.py   (same as above)
```
⚠️ Make sure you have Docker Desktop installed and running on your machine.
```
cd path/to/Sparkify
docker compose up --build
```

This command:
- Spins up a Cassandra database container.
- Runs the `sparkify_etl.py` script inside a Python container.
- Automatically populates the Cassandra database and executes 3 analytical queries.
Once complete, the container logs will show the output for:
- ✅ Data ingestion
- ✅ Query results
- ✅ Table cleanup
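The project's actual `docker-compose.yml` lives in the repo root; as a rough sketch of what a two-service setup like this typically looks like (the service names, image tag, and healthcheck here are assumptions, not the project's real file):

```yaml
# Hypothetical compose sketch: a Cassandra service plus an ETL service
# that waits for the database to become healthy before running.
services:
  cassandra:
    image: cassandra:4.1
    ports:
      - "9042:9042"
    healthcheck:
      test: ["CMD-SHELL", "cqlsh -e 'DESCRIBE KEYSPACES'"]
      interval: 15s
      retries: 10
  etl:
    build: .
    depends_on:
      cassandra:
        condition: service_healthy
```

The `service_healthy` condition matters because Cassandra takes a while to accept connections; without it, the ETL container may start first and fail to connect.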
- Connects to Cassandra (container hostname: `cassandra`).
- Creates a keyspace named `sparkify`.
- Defines and populates 3 tables, one per analytical query:
  - Question: Give me the artist, song title, and song's length during `sessionId = 338` and `itemInSession = 4`
    - Primary Key: `(sessionId, itemInSession)`
  - Question: Give me artist, song, and user name (first & last) for `userId = 10` and `sessionId = 182`, sorted by `itemInSession`
    - Primary Key: `((userId, sessionId), itemInSession)`
  - Question: Give me every user's name (first & last) who listened to the song `'All Hands Against His Own'`
    - Primary Key: `(song, userId)`
- Performs `SELECT` queries and prints results.
- Drops all tables and shuts down the Cassandra session.
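The flow above can be sketched with the DataStax `cassandra-driver` for the first query. The table name `session_songs` and its column names are illustrative assumptions; the actual DDL in `sparkify_etl.py` may differ.

```python
# Sketch of the ETL's Cassandra setup for query 1 (sessionId = 338,
# itemInSession = 4). Table/column names are assumptions for illustration.

CREATE_KEYSPACE = """
CREATE KEYSPACE IF NOT EXISTS sparkify
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
"""

CREATE_SESSION_TABLE = """
CREATE TABLE IF NOT EXISTS session_songs (
    sessionId int,
    itemInSession int,
    artist text,
    song text,
    length float,
    PRIMARY KEY (sessionId, itemInSession)
)
"""

QUERY_1 = """
SELECT artist, song, length
FROM session_songs
WHERE sessionId = 338 AND itemInSession = 4
"""

def run():
    # Third-party dependency: cassandra-driver (listed in requirements.txt)
    from cassandra.cluster import Cluster

    # 'cassandra' is the container hostname on the Compose network
    cluster = Cluster(["cassandra"])
    session = cluster.connect()
    session.execute(CREATE_KEYSPACE)
    session.set_keyspace("sparkify")
    session.execute(CREATE_SESSION_TABLE)
    for row in session.execute(QUERY_1):
        print(row.artist, row.song, row.length)
    cluster.shutdown()

if __name__ == "__main__":
    run()
```

Note the key shape: `sessionId` is the partition key and `itemInSession` the clustering column, so the `WHERE` clause of query 1 hits exactly one partition.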
The Docker container installs the following Python packages:
- `cassandra-driver`
- `pandas`

All dependencies are listed in `requirements.txt`.
- Efficient use of Cassandra primary keys.
- Queries return exact expected output.
- Full automation inside Docker.
- Tables are cleaned up after run.
This project simulates a real-world scenario where:
- Data is queried in a specific way (query-first design).
- Denormalization is not optional—it's necessary.
- You optimize for read efficiency and partition design.
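The query-first principle above can be made concrete with a small sketch: a single listening event is written once per query table, duplicating the data so that each read hits exactly one partition. The table names and sample event here are illustrative, not taken from the script.

```python
# Sketch of query-first denormalization: one event fans out into one
# INSERT per read path, instead of being normalized into a single table.
# Table names and the sample event are hypothetical.

EVENT = {
    "userId": 10, "sessionId": 182, "itemInSession": 0,
    "artist": "Some Artist", "song": "Some Song",
}

def fanout_inserts(event):
    """Build one parameterized INSERT per query table for a single event."""
    stmts = []
    # Read path 1: lookup by (sessionId, itemInSession)
    stmts.append((
        "INSERT INTO session_songs (sessionId, itemInSession, artist, song) "
        "VALUES (%(sessionId)s, %(itemInSession)s, %(artist)s, %(song)s)",
        event,
    ))
    # Read path 2: lookup by ((userId, sessionId)), ordered by itemInSession
    stmts.append((
        "INSERT INTO user_session_songs (userId, sessionId, itemInSession, artist, song) "
        "VALUES (%(userId)s, %(sessionId)s, %(itemInSession)s, %(artist)s, %(song)s)",
        event,
    ))
    return stmts

# The same event produces one write per query table.
print(len(fanout_inserts(EVENT)))  # prints 2
```

Writes are cheap in Cassandra, so duplicating each event across tables trades disk space for single-partition reads.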
It’s a great demonstration of applied NoSQL design thinking and distributed data engineering in action.
🖥️ Sample Terminal Output

After running the ETL process, your terminal output should show successful keyspace creation, query results, and table cleanup.
```
# View running containers
docker ps

# Stop containers
docker compose down

# Rebuild and rerun
docker compose up --build

# Clean up dangling resources
docker system prune -a
```

Built as part of Udacity's Data Engineering Nanodegree.
Created by Kareem Rizk
AWS Certified | Cloud & Data Engineer | DevOps Enthusiast
