Spotify KDD Analysis & Recommender System

Overview

This project implements an end-to-end Knowledge Discovery from Data (KDD) pipeline for Spotify streaming data. It processes raw ZIP archives into structured matrices, applies collaborative filtering and clustering algorithms, and generates personalized song and artist recommendations.

The pipeline supports:

Data ingestion & preprocessing (ZIP -> JSON -> CSV -> sparse matrices)
Exploratory analysis and query testing
Collaborative filtering (Pearson correlation, KNN)
Clustering (DBSCAN, k-means, hierarchical)
Recommendation evaluation with confusion matrices & accuracy scores
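
The evaluation step in the last item above can be illustrated with a small sketch. It assumes that recommendations for one user are compared against a held-out set of items the user actually listened to, over a fixed candidate set. The function names and the toy artist labels are hypothetical and are not taken from the project's code.

```python
def confusion_counts(recommended, relevant, candidates):
    """Return (tp, fp, fn, tn) for one user over a fixed candidate set."""
    recommended, relevant = set(recommended), set(relevant)
    tp = fp = fn = tn = 0
    for item in candidates:
        predicted = item in recommended
        actual = item in relevant
        if predicted and actual:
            tp += 1
        elif predicted:
            fp += 1
        elif actual:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def accuracy(tp, fp, fn, tn):
    """Fraction of candidates classified correctly (0.0 for an empty set)."""
    total = tp + fp + fn + tn
    return (tp + tn) / total if total else 0.0


# Toy data (hypothetical): five candidate artists, two held out, two recommended.
candidates = ["artist_a", "artist_b", "artist_c", "artist_d", "artist_e"]
held_out = ["artist_a", "artist_b"]
recommended = ["artist_a", "artist_c"]

counts = confusion_counts(recommended, held_out, candidates)
print(counts, accuracy(*counts))
```

Because most candidates are neither recommended nor relevant, true negatives tend to dominate accuracy on sparse listening data, so precision or recall computed from the same counts may be worth reporting alongside it.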

Repository Structure

1. Data Cleaning & Preprocessing

zip_to_folder.py — Extracts multiple ZIP files into a folder.
zip_to_json.py — Extracts a single ZIP file and converts to JSON.
mass_json_to_csv.py — Aggregates multiple JSON files into a single CSV.
csv_to_artist_matrix.py — Creates a user–artist sparse COO matrix from CSV (keeps user/song/artist mappings).
csv_to_matrix.py — Creates a user–song sparse COO matrix from CSV (keeps user/song/artist mappings).
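
The chain described by the scripts above (ZIP to JSON to CSV to sparse matrix) can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the project's implementation: it assumes one ZIP archive per user, that the user ID is the archive's file name, and that each JSON file holds a list of plays with hypothetical "artist" and "track" fields. All function names are made up for the example.

```python
import csv
import json
import tempfile
import zipfile
from pathlib import Path


def zip_to_records(zip_path):
    """Yield (user_id, artist, track) from every JSON file inside one ZIP."""
    user_id = Path(zip_path).stem  # assumption: one archive per user
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if not name.endswith(".json"):
                continue
            for play in json.loads(archive.read(name)):
                # "artist" and "track" are hypothetical field names
                yield user_id, play["artist"], play["track"]


def records_to_csv(records, csv_path):
    """Write all records to a single CSV with a header row."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["user_id", "artist", "track"])
        writer.writerows(records)


def csv_to_coo(csv_path, column):
    """Build COO triplets (rows, cols, play counts) plus ID-to-index mappings."""
    user_to_row, item_to_col, counts = {}, {}, {}
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for record in csv.DictReader(handle):
            row = user_to_row.setdefault(record["user_id"], len(user_to_row))
            col = item_to_col.setdefault(record[column], len(item_to_col))
            counts[(row, col)] = counts.get((row, col), 0) + 1
    rows = [r for r, _ in counts]
    cols = [c for _, c in counts]
    data = list(counts.values())
    return rows, cols, data, user_to_row, item_to_col


# Demo on a throwaway archive with made-up plays.
with tempfile.TemporaryDirectory() as workdir:
    zip_path = Path(workdir) / "user_1.zip"
    plays = [
        {"artist": "Artist A", "track": "Song 1"},
        {"artist": "Artist A", "track": "Song 1"},
        {"artist": "Artist B", "track": "Song 2"},
    ]
    with zipfile.ZipFile(zip_path, "w") as archive:
        archive.writestr("history.json", json.dumps(plays))
    csv_path = Path(workdir) / "all_users_data.csv"
    records_to_csv(zip_to_records(zip_path), csv_path)
    rows, cols, data, user_to_row, artist_to_col = csv_to_coo(csv_path, "artist")

print(rows, cols, data, user_to_row, artist_to_col)
```

If SciPy is available, the triplets can be passed to `scipy.sparse.coo_matrix((data, (rows, cols)), shape=(len(user_to_row), len(artist_to_col)))` and written with `scipy.sparse.save_npz`. Passing `"track"` instead of `"artist"` as the column would give the user-song variant.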

Generated Files:

CSVs:
- all_users_data.csv
- user_to_artists_matrix.csv
- top_artists_user_matrix.csv
- song_user_matrix.csv

NPZ Matrices:
- user_song_matrix.npz
- user_song_matrix_not_normalized.npz
- user_artist_matrix.npz

JSON Mappings:
- artist_id_to_info.json
- artist_to_song_ids.json
- user_id_to_row.json
- top_artists.json
- all_users_top_artists.json
- all_users_top_songs.json
- user_data.json (anonymized)

2. Initial Query Testing / Modeling

queryTests.py — Exploratory queries (top artists, top songs, top users, etc.).
solo_analysis/ — Early data cleaning and brief k-means modeling on a single user's streaming history.
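
Exploratory queries such as top artists, top songs, and top users amount to frequency counts over the play records. The following sketch shows one way to express them. The record layout (user, artist, song) and the `top_n` helper are assumptions for illustration and may not match queryTests.py.

```python
from collections import Counter

# Made-up play records: (user, artist, song)
plays = [
    ("user_1", "Artist A", "Song 1"),
    ("user_1", "Artist A", "Song 2"),
    ("user_2", "Artist A", "Song 1"),
    ("user_2", "Artist B", "Song 3"),
    ("user_2", "Artist B", "Song 3"),
    ("user_1", "Artist C", "Song 4"),
    ("user_1", "Artist A", "Song 1"),
]

FIELDS = {"user": 0, "artist": 1, "song": 2}


def top_n(records, field, n):
    """Return the n most frequent values of one field with their play counts."""
    index = FIELDS[field]
    return Counter(record[index] for record in records).most_common(n)


print(top_n(plays, "artist", 2))
print(top_n(plays, "song", 1))
print(top_n(plays, "user", 1))
```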

3. Collaborative Filtering

collab_filtering.py — Core collaborative filtering functions.
collab_filtering_eval.py — Evaluates collaborative filtering methods.

run_collab_filtering.py — Runs evaluation for all users.
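
As a rough illustration of the technique these scripts are built around, the sketch below combines Pearson correlation with a k-nearest-neighbour step on a small dense play-count matrix. It is a simplified stand-in rather than the code in collab_filtering.py: the function names are hypothetical, the correlation is computed over full rows (zeros included), and unheard items are scored by the similarity-weighted plays of positively correlated neighbours.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation between two equal-length play-count vectors."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a constant vector has no defined correlation
    return cov / sqrt(var_a * var_b)


def recommend(matrix, user, k, n):
    """Recommend up to n unheard item indices using the k most similar users."""
    sims = sorted(
        (
            (pearson(matrix[user], row), other)
            for other, row in enumerate(matrix)
            if other != user
        ),
        reverse=True,
    )
    # Keep only the k nearest users that correlate positively.
    neighbours = [(sim, other) for sim, other in sims[:k] if sim > 0]
    scores = {}
    for sim, other in neighbours:
        for item, plays in enumerate(matrix[other]):
            if matrix[user][item] == 0 and plays > 0:
                scores[item] = scores.get(item, 0.0) + sim * plays
    return sorted(scores, key=scores.get, reverse=True)[:n]


# Rows are users, columns are items (songs or artists); values are play counts.
matrix = [
    [5, 3, 0, 0],
    [4, 2, 1, 0],
    [0, 0, 4, 5],
    [5, 4, 0, 2],
]

print(recommend(matrix, user=0, k=2, n=2))
```

Here users 1 and 3 are the two nearest neighbours of user 0, and item 3 outranks item 2 because user 3 played it twice.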

To evaluate a single user, run the evaluation script directly:

python3 collab_filtering_eval.py <method> <user_id> <size> <repeats>
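
A command with four positional arguments like the one above could be parsed with `argparse` as sketched here. This is only an assumption about how the script might read its arguments: the integer types for `size` and `repeats`, the help texts, and the example value "pearson" are guesses, and the README does not specify them.

```python
import argparse


def build_parser():
    """Parser for: collab_filtering_eval.py <method> <user_id> <size> <repeats>."""
    parser = argparse.ArgumentParser(prog="collab_filtering_eval.py")
    parser.add_argument("method", help="collaborative filtering method to evaluate")
    parser.add_argument("user_id", help="user to evaluate")
    parser.add_argument("size", type=int, help="size parameter (assumed integer)")
    parser.add_argument("repeats", type=int, help="number of repeats (assumed integer)")
    return parser


# Example invocation with hypothetical values; a real script would call
# build_parser().parse_args() with no arguments to read from sys.argv.
args = build_parser().parse_args(["pearson", "user_1", "10", "5"])
print(args.method, args.user_id, args.size, args.repeats)
```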

Contributors

Zach Mattes — zmattes@calpoly.edu
Sharon Liang — sliang19@calpoly.edu
Jason Jelincic — jjjelinsic@calpoly.edu
Sofija Dimitrijevic — dimitrij@calpoly.edu
