This project implements a voice command recognition system using Convolutional Neural Networks (CNNs) for single-word audio classification.
The model is trained on a synthetic speech dataset and evaluated on both clean and noisy speech samples, as well as real human voice recordings.
The goal of this project was to build a classifier capable of recognizing simple spoken commands such as “up”, “down”, “left”, “right”, “on”, “off”, etc.
It was developed as part of the Pattern Recognition and Machine Learning course at the Faculty of Electrical Engineering, Computer Science and Information Technology Osijek (FERIT Osijek).
The solution includes both:
- A CNN-based classification model trained in Python using Keras and TensorFlow
- A Streamlit web application that enables users to upload or record audio for real-time prediction
The dataset was sourced from Kaggle – Synthetic Speech Commands.
It contains 30 different English words synthesized with the espeak tool, with variations in:
- Speaker tone, pitch, and pronunciation
- Background noise (airport, street, train station, ocean waves, white noise)
For this project, 10 words were selected to represent commands: down, go, left, no, off, on, right, stop, up, yes
Each audio sample:
- Duration: 1 second
- Format: 16-bit, mono, 16 kHz sampling rate, `.wav` (a quick loading check is sketched below)
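As a sanity check, a sample can be loaded and its format verified with librosa. This is only a sketch: the file path is hypothetical and should point to wherever the Kaggle dataset is extracted.

```python
import librosa

# A 1-second, 16 kHz mono clip should contain exactly 16,000 samples.
path = "data/synthetic_speech_commands/yes/sample_0.wav"  # hypothetical path

signal, sr = librosa.load(path, sr=None, mono=True)  # keep the native sampling rate
print(f"sampling rate: {sr} Hz, samples: {len(signal)}")
assert sr == 16000 and len(signal) == 16000, "expected a 1-second 16 kHz clip"
```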
The dataset was divided as follows:
| Split | Samples |
|---|---|
| Training | 11,403 |
| Validation | 1,426 |
| Test | 1,425 |
The classification model is a Convolutional Neural Network (CNN) designed to process spectrograms of audio signals.
Key features:
- 3 convolutional layers (ReLU activations)
- Batch Normalization and MaxPooling after each convolution
- Dropout layers (0.25) to prevent overfitting
- Fully connected layers ending with a Softmax output
- Categorical Cross-Entropy loss
- Adadelta optimizer with learning rate 0.01 and weight decay 0.00001
The model was trained for 20 epochs with a batch size of 32, using early stopping to prevent overfitting; a minimal Keras sketch of the architecture and training setup follows.
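The sketch below illustrates the architecture described above. The filter counts, kernel sizes, dense-layer width, and input shape are assumptions (they depend on the spectrogram settings); the loss, optimizer, and training hyperparameters follow the values listed above.

```python
from tensorflow import keras
from tensorflow.keras import layers

NUM_CLASSES = 10
INPUT_SHAPE = (128, 63, 1)  # (mel bands, time frames, channels) -- assumed

def build_model():
    model = keras.Sequential([keras.Input(shape=INPUT_SHAPE)])
    for filters in (32, 64, 128):  # 3 convolutional blocks
        model.add(layers.Conv2D(filters, (3, 3), padding="same", activation="relu"))
        model.add(layers.BatchNormalization())
        model.add(layers.MaxPooling2D((2, 2)))
        model.add(layers.Dropout(0.25))
    model.add(layers.Flatten())
    model.add(layers.Dense(128, activation="relu"))
    model.add(layers.Dropout(0.25))
    model.add(layers.Dense(NUM_CLASSES, activation="softmax"))

    model.compile(
        # weight_decay on optimizers requires a recent TensorFlow/Keras version
        optimizer=keras.optimizers.Adadelta(learning_rate=0.01, weight_decay=1e-5),
        loss="categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model

model = build_model()
early_stop = keras.callbacks.EarlyStopping(patience=3, restore_best_weights=True)
# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=20, batch_size=32, callbacks=[early_stop])
```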
| Dataset | Description | Accuracy |
|---|---|---|
| Synthetic clean data | Speech without noise | 97% |
| Synthetic noisy data | Added environmental noise | 84% |
| Real human speech | Recorded from 5 speakers | 25% |
*(Figure: confusion matrix on the synthetic test data)*
*(Figure: confusion matrix on the real human voice recordings)*
As expected, the accuracy on real data is lower due to the difference between synthetic and real human voice patterns.
- Waveform analysis and FFT for frequency domain insights
- Spectrogram and Mel-frequency cepstral coefficients (MFCC) extraction using librosa
- Conversion of spectrograms into image-like tensors for CNN input (see the extraction sketch below)
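A minimal sketch of this extraction step with librosa, assuming a 1-second, 16 kHz waveform has already been loaded. The FFT size, hop length, and number of mel bands / MFCC coefficients are illustrative choices, not necessarily the exact values used in the project.

```python
import librosa
import numpy as np

def extract_features(signal, sr=16000):
    # Log-mel spectrogram, treated as a single-channel "image" for the CNN
    mel = librosa.feature.melspectrogram(y=signal, sr=sr, n_fft=512,
                                         hop_length=256, n_mels=128)
    log_mel = librosa.power_to_db(mel, ref=np.max)

    # MFCCs as a complementary, more compact representation
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13)

    # Add a channel axis so the spectrogram matches the CNN's expected input
    return log_mel[..., np.newaxis], mfcc
```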
- Python
- TensorFlow / Keras
- NumPy, SciPy, Pandas, Matplotlib
- Librosa for audio feature extraction
- Streamlit for interactive web apps
- SoundDevice for microphone recording
Two Streamlit applications are included:
- `app_for_recording.py` — records 1-second audio clips using a microphone
- `streamlit_app.py` — loads the trained model and allows (see the sketch after this list):
  - Uploading `.wav` files
  - Recording new samples
  - Classifying speech commands in real time
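A rough sketch of what the apps do under the hood. The model filename, the label ordering, and the `extract_features` helper (from the sketch above) are assumptions for illustration, not the exact identifiers used in the repository.

```python
import numpy as np
import sounddevice as sd
from tensorflow import keras

SAMPLE_RATE = 16000
COMMANDS = ["down", "go", "left", "no", "off", "on", "right", "stop", "up", "yes"]

def record_clip(duration=1.0):
    """Record a 1-second mono clip from the default microphone."""
    audio = sd.rec(int(duration * SAMPLE_RATE), samplerate=SAMPLE_RATE,
                   channels=1, dtype="float32")
    sd.wait()  # block until the recording is finished
    return audio.squeeze()

def predict_command(signal, model):
    """Turn a waveform into a spectrogram and return the predicted command."""
    spectrogram, _ = extract_features(signal)             # see the sketch above
    probs = model.predict(spectrogram[np.newaxis, ...])   # add a batch axis
    return COMMANDS[int(np.argmax(probs))]

# model = keras.models.load_model("model.h5")  # file name is an assumption
# print(predict_command(record_clip(), model))
```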
Run the app with:
`streamlit run streamlit_app.py`

✅ CNN-based voice command classification
✅ Real-time prediction via the Streamlit interface
✅ Handles both uploaded and live-recorded audio
✅ Visualization of training metrics and confusion matrices
- Train with real human speech datasets to improve generalization
- Explore transfer learning using pretrained audio models
- Add noise-reduction preprocessing
- Expand the command set and improve multilingual support