shaaronl/spotify-kdd-recommender


Spotify KDD Analysis & Recommender System


Overview

This project implements an end-to-end Knowledge Discovery from Data (KDD) pipeline for Spotify streaming data. It processes raw ZIP archives into structured matrices, applies collaborative filtering and clustering algorithms, and generates personalized song and artist recommendations.

The pipeline supports:

  • Data ingestion & preprocessing (ZIP -> JSON -> CSV -> sparse matrices)
  • Exploratory analysis and query testing
  • Collaborative filtering (Pearson correlation, KNN)
  • Clustering (DBSCAN, k-means, hierarchical)
  • Recommendation evaluation with confusion matrices & accuracy scores
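The evaluation step in the list above can be illustrated with a small, self-contained sketch. It assumes a hold-out setup in which some of a user's items are hidden, recommendations are generated, and each candidate item is then counted as a true/false positive/negative. The function names (`confusion_counts`, `accuracy`) and the hold-out framing are assumptions made for illustration, not the repository's actual code.

```python
def confusion_counts(recommended, relevant, candidates):
    """Count TP/FP/FN/TN over a candidate item universe.

    recommended: items the recommender returned
    relevant:    held-out items the user actually listened to
    candidates:  every item that could have been recommended
    """
    recommended, relevant = set(recommended), set(relevant)
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for item in candidates:
        if item in recommended and item in relevant:
            counts["tp"] += 1
        elif item in recommended:
            counts["fp"] += 1
        elif item in relevant:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts


def accuracy(counts):
    """Fraction of candidates classified correctly (TP + TN over all)."""
    total = sum(counts.values())
    return (counts["tp"] + counts["tn"]) / total if total else 0.0


# Hypothetical example data
candidates = ["a", "b", "c", "d", "e"]
result = confusion_counts(recommended=["a", "b"], relevant=["a", "c"],
                          candidates=candidates)
print(result, accuracy(result))
```

Because most candidates are usually true negatives, accuracy alone tends to look high for any recommender; the individual confusion-matrix cells are often more informative.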

Repository Structure

1. Data Cleaning & Preprocessing

  • zip_to_folder.py — Extracts multiple ZIP files into a folder.
  • zip_to_json.py — Extracts a single ZIP file and converts to JSON.
  • mass_json_to_csv.py — Aggregates multiple JSON files into a single CSV.
  • csv_to_artist_matrix.py — Creates a user–artist sparse COO matrix from CSV (keeps user/song/artist mappings).
  • csv_to_matrix.py — Creates a user–song sparse COO matrix from CSV (keeps user/song/artist mappings).
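As a rough illustration of the CSV-to-matrix step, the sketch below builds the coordinate (COO) triplets for a user–song play-count matrix while keeping the ID-to-index mappings. The column names `user_id` and `song_id` and the function name `build_user_song_coo` are assumptions; the actual scripts may use different columns and may weight or normalize the counts.

```python
import csv
import io
from collections import Counter


def build_user_song_coo(csv_file, user_col="user_id", song_col="song_id"):
    """Return COO triplets (rows, cols, data), the shape, and both mappings."""
    user_to_row = {}   # user ID -> matrix row index
    song_to_col = {}   # song ID -> matrix column index
    counts = Counter()  # (row, col) -> number of plays
    for record in csv.DictReader(csv_file):
        # Assign the next free index the first time an ID is seen.
        r = user_to_row.setdefault(record[user_col], len(user_to_row))
        c = song_to_col.setdefault(record[song_col], len(song_to_col))
        counts[(r, c)] += 1
    rows = [rc[0] for rc in counts]
    cols = [rc[1] for rc in counts]
    data = [counts[rc] for rc in counts]
    shape = (len(user_to_row), len(song_to_col))
    return rows, cols, data, shape, user_to_row, song_to_col


# Hypothetical sample standing in for a real CSV file
sample = io.StringIO("user_id,song_id\nu1,s1\nu1,s1\nu1,s2\nu2,s2\n")
rows, cols, data, shape, user_to_row, song_to_col = build_user_song_coo(sample)
print(rows, cols, data, shape)
```

With SciPy installed, the triplets can be passed to `scipy.sparse.coo_matrix((data, (rows, cols)), shape=shape)` and written out with `scipy.sparse.save_npz`.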

Generated Files:

  • CSVs:
    • all_users_data.csv
    • user_to_artists_matrix.csv
    • top_artists_user_matrix.csv
    • song_user_matrix.csv
  • NPZ Matrices:
    • user_song_matrix.npz
    • user_song_matrix_not_normalized.npz
    • user_artist_matrix.npz
  • JSON Mappings:
    • artist_id_to_info.json
    • artist_to_song_ids.json
    • user_id_to_row.json
    • top_artists.json
    • all_users_top_artists.json
    • all_users_top_songs.json
    • user_data.json (anonymized)

2. Initial Query Testing / Modeling

  • queryTests.py — Exploratory queries (top artists, top songs, top users, etc.).
  • solo_analysis/ — Early data cleaning and brief k-means modeling on a single user's streaming history.
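An exploratory query such as "top artists" can be sketched with a counter over streaming records. The field names `artistName` and `msPlayed` follow the layout commonly seen in Spotify's exported streaming history and are treated here as an assumption, as is the function name `top_artists`; this is not taken from queryTests.py.

```python
from collections import Counter


def top_artists(records, n=3, artist_key="artistName", weight_key="msPlayed"):
    """Return the n artists with the most total listening time."""
    totals = Counter()
    for rec in records:
        # Records without a play duration contribute zero.
        totals[rec[artist_key]] += rec.get(weight_key, 0)
    return totals.most_common(n)


# Hypothetical records
records = [
    {"artistName": "A", "msPlayed": 1000},
    {"artistName": "B", "msPlayed": 5000},
    {"artistName": "A", "msPlayed": 3000},
    {"artistName": "C", "msPlayed": 500},
]
print(top_artists(records, n=2))
```

Passing `artist_key="trackName"` would give a comparable "top songs" query, again assuming that field exists in the data.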

3. Collaborative Filtering

  • collab_filtering.py — Core collaborative filtering functions.
  • collab_filtering_eval.py — Evaluates collaborative filtering methods.
  • run_collab_filtering.py — Runs evaluation for all users.
    python3 collab_filtering_eval.py <method> <user_id> <size> <repeats>
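To make the idea behind Pearson-based KNN collaborative filtering concrete, here is a minimal sketch on a dense list-of-lists play-count matrix. It is not the implementation in collab_filtering.py: the function names, the dense representation, and the scoring rule (similarity-weighted play counts from positively correlated neighbours) are all assumptions chosen for brevity.

```python
import math


def pearson(u, v):
    """Pearson correlation between two equal-length play-count vectors."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    if var_u == 0 or var_v == 0:
        return 0.0  # correlation is undefined for a constant vector
    return cov / math.sqrt(var_u * var_v)


def recommend(matrix, target, k=2, top_n=2):
    """Recommend item indices the target user has not played yet."""
    sims = [(pearson(matrix[target], row), i)
            for i, row in enumerate(matrix) if i != target]
    neighbours = sorted(sims, reverse=True)[:k]  # k most similar users
    scores = {}
    for sim, i in neighbours:
        if sim <= 0:
            continue  # ignore uncorrelated or negatively correlated users
        for item, plays in enumerate(matrix[i]):
            if matrix[target][item] == 0 and plays > 0:
                scores[item] = scores.get(item, 0.0) + sim * plays
    # Highest score first; ties broken by item index
    return sorted(scores, key=lambda it: (-scores[it], it))[:top_n]


# Hypothetical matrix: rows are users, columns are songs
matrix = [
    [5, 3, 0, 0],  # target user
    [4, 2, 1, 0],  # similar taste, has also played song 2
    [0, 0, 4, 5],  # opposite taste
]
print(recommend(matrix, target=0))
```

On real data the matrix is sparse and much larger, so a sparse representation and vectorized similarity computation would be the more practical choice.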
    

Contributors

About

A Spotify listening-data analysis and recommendation system that processes raw streaming history into structured user-song and user-artist matrices, applies KNN-based collaborative filtering, and performs clustering to uncover listening patterns. Includes exploratory analysis, personalized recommendations, and visualizations.
