A deep learning project for classifying music genres using audio features and neural networks.
This repository is part of my learning journey in Deep Learning for audio processing using Python and TensorFlow.
The project focuses on extracting MFCC features from audio files and training a CNN model to recognize different music genres.
While building this project, I followed and learned from this YouTube playlist:
📺 Valerio Velardo - The Sound of AI
The dataset used in this project is the GTZAN Music Genre Dataset, available on Kaggle and commonly used in music genre classification research.
Through this project, I explore how neural networks can understand sound patterns and classify music into different genres. My main goal is to gain hands-on experience with audio feature extraction, model training, and evaluation.
**Tech stack:** Python · TensorFlow · Keras · Librosa · NumPy · Matplotlib · Scikit-learn
- Extracts MFCC features from audio files
- Preprocesses dataset into JSON format
- Trains a CNN-based deep learning model
- Evaluates model performance
- Predicts music genres from audio input
- Visualizes training history (accuracy & loss)
Here are the main files in this repository and their purposes:
| File Name | Description |
|---|---|
| `prepare_dataset.py` | Extracts MFCC features from audio files and saves them into a JSON file |
| `data.json` | Contains extracted MFCC features and labels |
| `genre_classification.py` | Trains the deep learning model |
| `cnn_genre_classification.py` | CNN-based model implementation |
Note: File names may vary depending on project updates.
This project is divided into several learning stages, starting from basic audio processing to building deep learning models for classification.
I started by learning how audio data can be transformed into numerical features that can be understood by machine learning models. One of the most important features I explored was MFCC (Mel-Frequency Cepstral Coefficients).
In this step, WAV audio files were converted into MFCC features and stored in data.json for further training and evaluation.
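To make this step concrete, here is a minimal sketch of the extraction using librosa. The file path, parameter values, and JSON layout are illustrative assumptions and may differ from the actual `prepare_dataset.py`:

```python
import json

import librosa

SAMPLE_RATE = 22050  # GTZAN clips are commonly resampled to 22050 Hz
NUM_MFCC = 13        # number of MFCC coefficients per frame

# Hypothetical path to one GTZAN clip, for illustration only
signal, sr = librosa.load("genres/blues/blues.00000.wav", sr=SAMPLE_RATE)

# The MFCC matrix has shape (num_mfcc, num_frames); transpose it so each
# row is one time step, which is the layout the models below expect
mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=NUM_MFCC,
                            n_fft=2048, hop_length=512)

data = {"mfcc": [mfcc.T.tolist()], "labels": [0]}  # 0 = "blues", for example
with open("data.json", "w") as fp:
    json.dump(data, fp, indent=2)
```

In the full script this would loop over every genre folder and append one entry per clip (or per segment of a clip) to the lists in `data`.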
After preparing the dataset, I built my first genre classification model using a Multi-Layer Perceptron (MLP) based on the extracted audio features. This model served as my initial baseline for genre classification.
MLP Model Architecture
During training, the model showed signs of overfitting, with training accuracy reaching around 0.9 while testing accuracy stayed low at about 0.4. To address this issue, I introduced dropout layers and L2 regularization to improve generalization. As a result, the model became more stable and achieved a final accuracy of approximately 0.6.
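As a rough sketch of the fix, the dropout layers and L2 penalty can be added directly in Keras. The layer sizes, dropout rate, and L2 coefficient below are assumptions for illustration, not the exact values from my script:

```python
from tensorflow import keras

# Regularized MLP over MFCC input of shape (num_frames, num_mfcc);
# (130, 13) is an assumed shape for illustration
model = keras.Sequential([
    keras.Input(shape=(130, 13)),
    keras.layers.Flatten(),
    keras.layers.Dense(512, activation="relu",
                       kernel_regularizer=keras.regularizers.l2(0.001)),
    keras.layers.Dropout(0.3),
    keras.layers.Dense(256, activation="relu",
                       kernel_regularizer=keras.regularizers.l2(0.001)),
    keras.layers.Dropout(0.3),
    keras.layers.Dense(64, activation="relu",
                       kernel_regularizer=keras.regularizers.l2(0.001)),
    keras.layers.Dropout(0.3),
    keras.layers.Dense(10, activation="softmax"),  # GTZAN has 10 genres
])
model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-4),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```

The L2 penalty discourages large weights, while dropout randomly silences units during training; together they narrow the gap between training and test accuracy.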
📸 Model Performance Comparison
(Left) Overfitted model without regularization. (Right) Regularized model with L2 and Dropout.
Next, I explored Convolutional Neural Networks (CNNs) to build a more powerful model for audio classification. Compared to an MLP, a CNN can better capture local time-frequency patterns in MFCC features, making it more effective at learning audio representations.
At this stage, I learned how convolution and pooling layers extract meaningful features and improve model performance. I also learned that in TensorFlow, 2D convolutional layers expect 4D input of shape (batch, height, width, channels), so it is important to properly prepare and reshape the data before feeding it into the network.
For example, the training data needs to be converted into a 4D tensor as follows:
```python
import numpy as np

X_train = X_train[..., np.newaxis]  # 4D array -> (num_samples, num_frames, num_mfcc, depth)
```

This reshaping step ensures that each sample includes a channel dimension, making it compatible with CNN layers in TensorFlow.
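Putting the pieces together, a minimal CNN over this 4D input might look like the sketch below. The filter counts, pooling sizes, and dense layer width are assumptions and may differ from `cnn_genre_classification.py`:

```python
from tensorflow import keras

# CNN over MFCC "images" of shape (num_frames, num_mfcc, depth);
# (130, 13, 1) is an assumed shape for illustration
model = keras.Sequential([
    keras.Input(shape=(130, 13, 1)),
    keras.layers.Conv2D(32, (3, 3), activation="relu"),
    keras.layers.MaxPooling2D((3, 3), strides=(2, 2), padding="same"),
    keras.layers.BatchNormalization(),
    keras.layers.Conv2D(32, (3, 3), activation="relu"),
    keras.layers.MaxPooling2D((3, 3), strides=(2, 2), padding="same"),
    keras.layers.BatchNormalization(),
    keras.layers.Flatten(),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dropout(0.3),
    keras.layers.Dense(10, activation="softmax"),  # 10 GTZAN genres
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```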
Overall, this step helped me gain a deeper understanding of how CNNs work for audio classification tasks.
After learning about Recurrent Neural Networks (RNNs), I understood that standard RNNs struggle to remember long-term context, especially when dealing with long sequences: as gradients are propagated back through many time steps they tend to vanish, so important information from earlier time steps gradually fades.
To solve this problem, I studied and implemented Long Short-Term Memory (LSTM), an advanced form of RNN designed to handle long-term dependencies. LSTM uses memory cells and gating mechanisms (input, forget, and output gates) to control what information should be remembered or forgotten over time.
As a result, the LSTM-based model demonstrated better capability in learning long-term dependencies and produced more reliable classification results.
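For reference, a minimal LSTM classifier over the same MFCC sequences might look like this in Keras; the layer sizes here are assumptions rather than my exact configuration:

```python
from tensorflow import keras

# Each sample is a sequence of MFCC vectors: (num_frames, num_mfcc);
# (130, 13) is an assumed shape for illustration
model = keras.Sequential([
    keras.Input(shape=(130, 13)),
    keras.layers.LSTM(64, return_sequences=True),  # pass the full sequence on
    keras.layers.LSTM(64),                         # keep only the final state
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dropout(0.3),
    keras.layers.Dense(10, activation="softmax"),  # 10 GTZAN genres
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```

A plain RNN version would simply swap the `LSTM` layers for `SimpleRNN`; in my experiments the LSTM variant generalized noticeably better.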
📸 Model Performance Comparison
(Left) RNN Accuracy | (Right) LSTM Accuracy