This project implements an end-to-end Knowledge Discovery from Data (KDD) pipeline for Spotify streaming data. It processes raw ZIP archives into structured matrices, applies collaborative filtering and clustering algorithms, and generates personalized song and artist recommendations.
The pipeline supports:
- Data ingestion & preprocessing (ZIP -> JSON -> CSV -> sparse matrices)
- Exploratory analysis and query testing
- Collaborative filtering (Pearson correlation, KNN)
- Clustering (DBSCAN, k-means, hierarchical)
- Recommendation evaluation with confusion matrices & accuracy scores
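
As a rough illustration of the evaluation step, the sketch below builds confusion-matrix counts and an accuracy score for one user. It assumes a hold-out protocol in which some of the user's songs are hidden and then compared against the recommendations. The function names, the hold-out setup, and the sample data are hypothetical and are not taken from the project's scripts.

```python
def confusion_counts(recommended, held_out, catalog):
    """Return (TP, FP, FN, TN) for one user over a fixed item catalog.

    Assumes every recommended and held-out item is part of the catalog.
    """
    recommended, held_out = set(recommended), set(held_out)
    tp = len(recommended & held_out)                  # recommended and actually played
    fp = len(recommended - held_out)                  # recommended but not played
    fn = len(held_out - recommended)                  # played but not recommended
    tn = len(set(catalog) - recommended - held_out)   # neither recommended nor played
    return tp, fp, fn, tn


def accuracy(tp, fp, fn, tn):
    """Share of catalog items that were classified correctly."""
    total = tp + fp + fn + tn
    return (tp + tn) / total if total else 0.0


# Hypothetical toy data: six songs, three hidden from the model.
catalog = ["s1", "s2", "s3", "s4", "s5", "s6"]
held_out = ["s1", "s2", "s3"]
recommended = ["s1", "s2", "s4"]

counts = confusion_counts(recommended, held_out, catalog)
score = accuracy(*counts)
print(counts, round(score, 3))
```

With a large catalog, true negatives dominate the total, so accuracy alone can look high even for weak recommendations. Reading it alongside the full confusion matrix gives a fairer picture.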
- zip_to_folder.py: Extracts multiple ZIP files into a folder.
- zip_to_json.py: Extracts a single ZIP file and converts to JSON.
- mass_json_to_csv.py: Aggregates multiple JSON files into a single CSV.
- csv_to_artist_matrix.py: Creates a user–artist sparse COO matrix from CSV (keeps user/song/artist mappings).
- csv_to_matrix.py: Creates a user–song sparse COO matrix from CSV (keeps user/song/artist mappings).
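
The CSV-to-matrix step can be sketched roughly as follows. This is a minimal standard-library illustration of the COO idea (parallel row, column, and value lists plus ID mappings), not the code in csv_to_matrix.py. The column names `user_id` and `song_id`, the function name, and the use of play counts as matrix values are all assumptions.

```python
import csv
import io


def csv_to_coo(csv_file, user_col="user_id", item_col="song_id"):
    """Build COO triplets of play counts, plus the ID-to-index mappings."""
    user_to_row, item_to_col = {}, {}
    counts = {}  # (row, col) -> number of plays
    for record in csv.DictReader(csv_file):
        # Assign the next free index the first time an ID is seen.
        row = user_to_row.setdefault(record[user_col], len(user_to_row))
        col = item_to_col.setdefault(record[item_col], len(item_to_col))
        counts[(row, col)] = counts.get((row, col), 0) + 1
    rows = [r for r, _ in counts]
    cols = [c for _, c in counts]
    values = list(counts.values())
    return rows, cols, values, user_to_row, item_to_col


# Hypothetical sample: one CSV row per stream.
sample = io.StringIO("user_id,song_id\nu1,a\nu1,a\nu1,b\nu2,b\n")
rows, cols, values, user_to_row, item_to_col = csv_to_coo(sample)
print(rows, cols, values)
```

If SciPy is available, the three lists can be passed to `scipy.sparse.coo_matrix((values, (rows, cols)), shape=(len(user_to_row), len(item_to_col)))` and stored with `scipy.sparse.save_npz`. Keeping the mappings is what lets matrix indices be translated back to users and songs later.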
Generated Files:
- CSVs:
  - all_users_data.csv
  - user_to_artists_matrix.csv
  - top_artists_user_matrix.csv
  - song_user_matrix.csv
- NPZ Matrices:
  - user_song_matrix.npz
  - user_song_matrix_not_normalized.npz
  - user_artist_matrix.npz
- JSON Mappings:
  - artist_id_to_info.json
  - artist_to_song_ids.json
  - user_id_to_row.json
  - top_artists.json
  - all_users_top_artists.json
  - all_users_top_songs.json
  - user_data.json (anonymized)
- queryTests.py: Exploratory queries (top artists, top songs, top users, etc.).
- solo_analysis/: Early data cleaning and brief k-means modeling on a single user's streaming history.
- collab_filtering.py: Core collaborative filtering functions.
- collab_filtering_eval.py: Evaluates collaborative filtering methods.
- run_collab_filtering.py: Runs evaluation for all users.

Usage:

    python3 collab_filtering_eval.py <method> {user_id} <size> <repeats>
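
To illustrate the general idea behind Pearson-based, nearest-neighbor collaborative filtering, here is a minimal standard-library sketch. It is not the implementation in collab_filtering.py: the function names, the `k` and `top_n` parameters, the similarity-weighted scoring rule, and the toy play counts are all assumptions made for this example.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation over the items both users have played."""
    shared = [item for item in a if item in b]
    n = len(shared)
    if n < 2:
        return 0.0
    mean_a = sum(a[i] for i in shared) / n
    mean_b = sum(b[i] for i in shared) / n
    cov = sum((a[i] - mean_a) * (b[i] - mean_b) for i in shared)
    var_a = sum((a[i] - mean_a) ** 2 for i in shared)
    var_b = sum((b[i] - mean_b) ** 2 for i in shared)
    if var_a == 0 or var_b == 0:
        return 0.0  # correlation is undefined for constant vectors
    return cov / sqrt(var_a * var_b)


def recommend(target, plays, k=2, top_n=3):
    """Rank unseen items by similarity-weighted plays of the k nearest users."""
    neighbors = sorted(
        ((pearson(plays[target], other), user)
         for user, other in plays.items() if user != target),
        reverse=True,
    )[:k]
    scores = {}
    for sim, user in neighbors:
        if sim <= 0:
            continue  # ignore dissimilar or uncorrelated users
        for item, count in plays[user].items():
            if item not in plays[target]:
                scores[item] = scores.get(item, 0.0) + sim * count
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


# Hypothetical play counts: user -> {song: plays}.
plays = {
    "u1": {"a": 5, "b": 1, "c": 4},
    "u2": {"a": 4, "b": 1, "c": 5, "d": 3},
    "u3": {"a": 1, "b": 5, "c": 2, "e": 4},
}
print(recommend("u1", plays))
```

In this toy data, u2 listens much like u1 while u3 is negatively correlated, so only u2's unseen song contributes to the ranking. A real run would draw play counts from the sparse matrices rather than nested dictionaries.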
- Zach Mattes — zmattes@calpoly.edu
- Sharon Liang — sliang19@calpoly.edu
- Jason Jelincic — jjjelinsic@calpoly.edu
- Sofija Dimitrijevic — dimitrij@calpoly.edu