This project implements a voice command recognition system using Convolutional Neural Networks (CNNs) for single-word audio classification.
The model is trained on a synthetic speech dataset and evaluated on both clean and noisy speech samples, as well as real human voice recordings.
The goal of this project was to build a classifier capable of recognizing simple spoken commands such as “up”, “down”, “left”, “right”, “on”, “off”, etc.
It was developed as part of the Pattern Recognition and Machine Learning course at the Faculty of Electrical Engineering, Computer Science and Information Technology Osijek (FERIT Osijek).
The solution includes both:
- A CNN-based classification model trained in Python using Keras and TensorFlow
- A Streamlit web application that enables users to upload or record audio for real-time prediction
The dataset was sourced from Kaggle – Synthetic Speech Commands.
It contains 30 different English words synthesized with the espeak tool, with variations in:
- Speaker tone, pitch, and pronunciation
- Background noise (airport, street, train station, ocean waves, white noise)
For this project, 10 words were selected to represent commands: down, go, left, no, off, on, right, stop, up, yes
Each audio sample:
- Duration: 1 second
- Format: 16-bit, mono, 16 kHz sampling rate, `.wav` (a quick loading check is sketched below)
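As a sanity check, a sample can be loaded and its format verified with librosa. This is only a sketch: the file path is hypothetical and should point to wherever the Kaggle dataset is extracted.

```python
import librosa

# A 1-second, 16 kHz mono clip should contain exactly 16,000 samples.
path = "data/synthetic_speech_commands/yes/sample_0.wav"  # hypothetical path

signal, sr = librosa.load(path, sr=None, mono=True)  # keep the native sampling rate
print(f"sampling rate: {sr} Hz, samples: {len(signal)}")
assert sr == 16000 and len(signal) == 16000, "expected a 1-second 16 kHz clip"
```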
The dataset was divided as follows:
| Split | Samples |
|---|---|
| Training | 11,403 |
| Validation | 1,426 |
| Test | 1,425 |
The classification model is a Convolutional Neural Network (CNN) designed to process spectrograms of audio signals.
Key features:
- 3 convolutional layers (ReLU activations)
- Batch Normalization and MaxPooling after each convolution
- Dropout layers (0.25) to prevent overfitting
- Fully connected layers ending with a Softmax output
- Categorical Cross-Entropy loss
- Adadelta optimizer with learning rate 0.01 and weight decay 0.00001
The model was trained for 20 epochs with a batch size of 32, using early stopping to prevent overfitting; a minimal Keras sketch of the architecture and training setup follows.
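The sketch below illustrates the architecture described above. The filter counts, kernel sizes, dense-layer width, and input shape are assumptions (they depend on the spectrogram settings); the loss, optimizer, and training hyperparameters follow the values listed above.

```python
from tensorflow import keras
from tensorflow.keras import layers

NUM_CLASSES = 10
INPUT_SHAPE = (128, 63, 1)  # (mel bands, time frames, channels) -- assumed

def build_model():
    model = keras.Sequential([keras.Input(shape=INPUT_SHAPE)])
    for filters in (32, 64, 128):  # 3 convolutional blocks
        model.add(layers.Conv2D(filters, (3, 3), padding="same", activation="relu"))
        model.add(layers.BatchNormalization())
        model.add(layers.MaxPooling2D((2, 2)))
        model.add(layers.Dropout(0.25))
    model.add(layers.Flatten())
    model.add(layers.Dense(128, activation="relu"))
    model.add(layers.Dropout(0.25))
    model.add(layers.Dense(NUM_CLASSES, activation="softmax"))

    model.compile(
        # weight_decay on optimizers requires a recent TensorFlow/Keras version
        optimizer=keras.optimizers.Adadelta(learning_rate=0.01, weight_decay=1e-5),
        loss="categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model

model = build_model()
early_stop = keras.callbacks.EarlyStopping(patience=3, restore_best_weights=True)
# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=20, batch_size=32, callbacks=[early_stop])
```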
| Dataset | Description | Accuracy |
|---|---|---|
| Synthetic clean data | Speech without noise | 97% |
| Synthetic noisy data | Added environmental noise | 84% |
| Real human speech | Recorded from 5 speakers | 25% |
*(Figure: confusion matrix on the synthetic test data)*
*(Figure: confusion matrix on the real human voice recordings)*
As expected, the accuracy on real data is lower due to the difference between synthetic and real human voice patterns.
- Waveform analysis and FFT for frequency domain insights
- Spectrogram and Mel-frequency cepstral coefficients (MFCC) extraction using librosa
- Conversion of spectrograms into image-like tensors for CNN input (see the extraction sketch below)
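A minimal sketch of this extraction step with librosa, assuming a 1-second, 16 kHz waveform has already been loaded. The FFT size, hop length, and number of mel bands / MFCC coefficients are illustrative choices, not necessarily the exact values used in the project.

```python
import librosa
import numpy as np

def extract_features(signal, sr=16000):
    # Log-mel spectrogram, treated as a single-channel "image" for the CNN
    mel = librosa.feature.melspectrogram(y=signal, sr=sr, n_fft=512,
                                         hop_length=256, n_mels=128)
    log_mel = librosa.power_to_db(mel, ref=np.max)

    # MFCCs as a complementary, more compact representation
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13)

    # Add a channel axis so the spectrogram matches the CNN's expected input
    return log_mel[..., np.newaxis], mfcc
```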
- Python
- TensorFlow / Keras
- NumPy, SciPy, Pandas, Matplotlib
- Librosa for audio feature extraction
- Streamlit for interactive web apps
- SoundDevice for microphone recording
Two Streamlit applications are included:
- `app_for_recording.py` — records 1-second audio clips using a microphone
- `streamlit_app.py` — loads the trained model and allows (see the sketch after this list):
  - Uploading `.wav` files
  - Recording new samples
  - Classifying speech commands in real time
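A rough sketch of what the apps do under the hood. The model filename, the label ordering, and the `extract_features` helper (from the sketch above) are assumptions for illustration, not the exact identifiers used in the repository.

```python
import numpy as np
import sounddevice as sd
from tensorflow import keras

SAMPLE_RATE = 16000
COMMANDS = ["down", "go", "left", "no", "off", "on", "right", "stop", "up", "yes"]

def record_clip(duration=1.0):
    """Record a 1-second mono clip from the default microphone."""
    audio = sd.rec(int(duration * SAMPLE_RATE), samplerate=SAMPLE_RATE,
                   channels=1, dtype="float32")
    sd.wait()  # block until the recording is finished
    return audio.squeeze()

def predict_command(signal, model):
    """Turn a waveform into a spectrogram and return the predicted command."""
    spectrogram, _ = extract_features(signal)             # see the sketch above
    probs = model.predict(spectrogram[np.newaxis, ...])   # add a batch axis
    return COMMANDS[int(np.argmax(probs))]

# model = keras.models.load_model("model.h5")  # file name is an assumption
# print(predict_command(record_clip(), model))
```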
Run the app with:
`streamlit run streamlit_app.py`

✅ CNN-based voice command classification
✅ Real-time prediction via the Streamlit interface
✅ Handles both uploaded and live-recorded audio
✅ Visualization of training metrics and confusion matrices
- Train with real human speech datasets to improve generalization
- Explore transfer learning using pretrained audio models
- Add noise-reduction preprocessing
- Expand the command set and improve multilingual support