ijazvic/Voice-Command-Classification

Voice Command Classification

This project implements a voice command recognition system using Convolutional Neural Networks (CNNs) for single-word audio classification.
The model is trained on a synthetic speech dataset and evaluated on both clean and noisy speech samples, as well as real human voice recordings.

Overview

The goal of this project was to build a classifier capable of recognizing simple spoken commands such as “up”, “down”, “left”, “right”, “on”, “off”, etc.
It was developed as part of the Pattern Recognition and Machine Learning course at the Faculty of Electrical Engineering, Computer Science and Information Technology Osijek (FERIT Osijek).

The solution includes both:

  • A CNN-based classification model trained in Python using Keras and TensorFlow
  • A Streamlit web application that enables users to upload or record audio for real-time prediction

Dataset

The dataset was sourced from Kaggle – Synthetic Speech Commands.
It contains 30 different English words synthesized with the espeak tool, with variations in:

  • Speaker tone, pitch, and pronunciation
  • Background noise (airport, street, train station, ocean waves, white noise)

For this project, 10 words were selected to represent commands: down, go, left, no, off, on, right, stop, up, and yes.
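
Since the model is trained with categorical cross-entropy, the ten words have to be turned into one-hot vectors at some point. The README does not show how this is done, so the following is only a minimal sketch; the names `COMMANDS`, `LABEL_TO_INDEX`, and `one_hot`, and the alphabetical class order, are assumptions made for illustration.

```python
import numpy as np

# The ten selected commands, in the order listed above (assumed class order).
COMMANDS = ["down", "go", "left", "no", "off", "on", "right", "stop", "up", "yes"]
LABEL_TO_INDEX = {word: index for index, word in enumerate(COMMANDS)}


def one_hot(words):
    """Map a list of command words to a (len(words), 10) one-hot matrix."""
    indices = np.array([LABEL_TO_INDEX[word] for word in words])
    return np.eye(len(COMMANDS), dtype=np.float32)[indices]


encoded = one_hot(["go", "stop", "yes"])
print(encoded.shape)           # one row per word, one column per class
print(encoded.argmax(axis=1))  # class index of each word
```

Each row contains a single 1 at the position of its class, which is the target format a Softmax output with ten units expects.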

Each audio sample:

  • Duration: 1 second
  • Format: 16-bit, mono, 16 kHz sampling rate, .wav
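
A quick way to confirm that a recording matches this format (useful for your own recordings) is to inspect the WAV header. The sketch below uses only the Python standard library; the helper names and the generated test tone are hypothetical stand-ins, not part of the repository.

```python
import math
import struct
import tempfile
import wave
from pathlib import Path

EXPECTED_RATE = 16_000   # 16 kHz sampling rate
EXPECTED_WIDTH = 2       # 16-bit samples are 2 bytes wide
EXPECTED_CHANNELS = 1    # mono


def describe_wav(path):
    """Return (channels, sample_width_bytes, sample_rate, duration_seconds)."""
    with wave.open(str(path), "rb") as wav:
        rate = wav.getframerate()
        return wav.getnchannels(), wav.getsampwidth(), rate, wav.getnframes() / rate


def matches_dataset_format(path):
    """True if the file is 1 second of 16-bit mono audio at 16 kHz."""
    channels, width, rate, duration = describe_wav(path)
    return (channels == EXPECTED_CHANNELS and width == EXPECTED_WIDTH
            and rate == EXPECTED_RATE and abs(duration - 1.0) < 1e-6)


def write_test_tone(path, freq=440.0):
    """Write a 1-second sine tone as a stand-in for a real dataset sample."""
    samples = (int(12000 * math.sin(2 * math.pi * freq * n / EXPECTED_RATE))
               for n in range(EXPECTED_RATE))
    with wave.open(str(path), "wb") as wav:
        wav.setnchannels(EXPECTED_CHANNELS)
        wav.setsampwidth(EXPECTED_WIDTH)
        wav.setframerate(EXPECTED_RATE)
        wav.writeframes(b"".join(struct.pack("<h", s) for s in samples))


with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.wav"
    write_test_tone(demo_path)
    demo_info = describe_wav(demo_path)
    demo_ok = matches_dataset_format(demo_path)

print(demo_info, demo_ok)
```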

The dataset was divided as follows:

| Split      | Samples |
|------------|---------|
| Training   | 11,403  |
| Validation | 1,426   |
| Test       | 1,425   |
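
These counts are consistent with an 80/10/10 split of 14,254 samples, although the README does not state how the split was made. The sketch below shows one way such a split could be produced with a shuffled index; the function name, the seed, and the rounding rule are assumptions.

```python
import numpy as np


def split_indices(n_samples, train_frac=0.8, seed=42):
    """Shuffle sample indices and cut them into train/validation/test parts."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_train = int(n_samples * train_frac)
    # Split the remainder in half; validation gets the extra sample if it is odd.
    n_val = (n_samples - n_train + 1) // 2
    train = order[:n_train]
    val = order[n_train:n_train + n_val]
    test = order[n_train + n_val:]
    return train, val, test


total = 11_403 + 1_426 + 1_425
train_idx, val_idx, test_idx = split_indices(total)
print(len(train_idx), len(val_idx), len(test_idx))
```

In practice a stratified split (equal class proportions in each part) is often preferable for classification; whether the project used one is not stated.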

Model Architecture

The classification model is a Convolutional Neural Network (CNN) designed to process spectrograms of audio signals.

Key features:

  • 3 convolutional layers (ReLU activations)
  • Batch Normalization and MaxPooling after each convolution
  • Dropout layers (0.25) to prevent overfitting
  • Fully connected layers ending with a Softmax output
  • Categorical Cross-Entropy loss
  • Adadelta optimizer with learning rate 0.01 and weight decay 0.00001
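
The list above can be sketched in Keras roughly as follows. This is not the repository's actual model definition: the filter counts, kernel size, dense layer width, and input shape are illustrative assumptions, and the `weight_decay` argument assumes a recent TensorFlow/Keras version (2.11 or later).

```python
from tensorflow import keras
from tensorflow.keras import layers


def build_model(input_shape=(128, 32, 1), num_classes=10):
    """CNN sketch: 3 conv blocks, then dense layers with a softmax output.

    input_shape is a hypothetical (frequency, time, channel) spectrogram shape.
    """
    model = keras.Sequential([keras.Input(shape=input_shape)])
    for filters in (32, 64, 128):  # assumed filter counts
        model.add(layers.Conv2D(filters, (3, 3), padding="same", activation="relu"))
        model.add(layers.BatchNormalization())
        model.add(layers.MaxPooling2D((2, 2)))
        model.add(layers.Dropout(0.25))
    model.add(layers.Flatten())
    model.add(layers.Dense(128, activation="relu"))  # assumed width
    model.add(layers.Dropout(0.25))
    model.add(layers.Dense(num_classes, activation="softmax"))
    model.compile(
        optimizer=keras.optimizers.Adadelta(learning_rate=0.01, weight_decay=1e-5),
        loss="categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model


model = build_model()
model.summary()
```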

The model was trained for up to 20 epochs with a batch size of 32, using Early Stopping to prevent overfitting.
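
Early Stopping halts training once the validation loss has stopped improving for a set number of epochs (the "patience"). The README does not give the patience used, so the value below, like the loss history, is made up purely to illustrate the rule.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (stop_epoch, best_epoch), both 1-based.

    Training stops at the first epoch where the validation loss has not
    improved by more than min_delta for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss - min_delta:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return len(val_losses), best_epoch


# Hypothetical validation losses, one per epoch.
history = [0.90, 0.60, 0.45, 0.40, 0.42, 0.41, 0.43, 0.44]
stopped_at, best_at = early_stopping_epoch(history, patience=3)
print(stopped_at, best_at)  # 7 4
```

With these numbers the best loss occurs at epoch 4, and training stops at epoch 7 after three epochs without improvement. Keras provides the same behaviour through its `EarlyStopping` callback.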

Performance

| Dataset              | Description               | Accuracy |
|----------------------|---------------------------|----------|
| Synthetic clean data | Speech without noise      | 97%      |
| Synthetic noisy data | Added environmental noise | 84%      |
| Real human speech    | Recorded from 5 speakers  | 25%      |

[Figure: confusion matrix on test data]

[Figure: confusion matrix on our data]

As expected, accuracy on real recordings (25%) is far lower than on the synthetic data (97% clean, 84% noisy), because the espeak-synthesized training voices differ from real human voice patterns.
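
For reference, accuracy and a confusion matrix can be computed from true and predicted class indices as in the sketch below. The labels are toy values with three classes, not the project's results, and the function names are hypothetical.

```python
import numpy as np


def confusion_matrix(y_true, y_pred, num_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(matrix, (np.asarray(y_true), np.asarray(y_pred)), 1)
    return matrix


def accuracy_from_confusion(matrix):
    """Fraction of samples on the diagonal (correct predictions)."""
    return np.trace(matrix) / matrix.sum()


# Toy example with 3 classes and 8 samples.
y_true = np.array([0, 0, 1, 1, 2, 2, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 2, 0, 2])
cm = confusion_matrix(y_true, y_pred, num_classes=3)
acc = accuracy_from_confusion(cm)
print(cm)
print(acc)  # 0.75
```

Off-diagonal entries show which commands are mistaken for which, which is more informative than the single accuracy figure when diagnosing the gap between synthetic and real speech.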

Audio Preprocessing

  • Waveform analysis and FFT for frequency domain insights
  • Spectrogram and Mel-frequency cepstral coefficients (MFCC) extraction using librosa
  • Conversion of spectrograms into image-like tensors for CNN input
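
The spectrogram-to-tensor step can be sketched as follows. The project uses librosa for feature extraction; this example deliberately uses plain NumPy instead so that it has no extra dependencies, and the frame length, hop length, and scaling are assumptions rather than the project's settings.

```python
import numpy as np


def log_spectrogram(signal, frame_length=512, hop_length=128):
    """Short-time Fourier transform magnitude on a log scale.

    Returns an array of shape (frequency_bins, time_frames).
    """
    signal = np.asarray(signal, dtype=np.float32)
    n_frames = 1 + (len(signal) - frame_length) // hop_length
    window = np.hanning(frame_length)
    frames = np.stack([
        signal[i * hop_length: i * hop_length + frame_length] * window
        for i in range(n_frames)
    ])
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    return np.log(magnitude + 1e-6).T


def to_cnn_input(spec):
    """Scale to [0, 1] and add batch and channel axes: (1, freq, time, 1)."""
    scaled = (spec - spec.min()) / (spec.max() - spec.min() + 1e-9)
    return scaled[np.newaxis, ..., np.newaxis].astype(np.float32)


# A 1-second, 1 kHz test tone at 16 kHz stands in for a real recording.
sample_rate = 16_000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 1000 * t)

spec = log_spectrogram(tone)
batch = to_cnn_input(spec)
print(spec.shape, batch.shape)
```

The trailing channel axis is what lets a 2D convolutional layer treat the spectrogram like a single-channel image.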

Tools & Libraries

  • Python
  • TensorFlow / Keras
  • NumPy, SciPy, Pandas, Matplotlib
  • Librosa for audio feature extraction
  • Streamlit for interactive web apps
  • SoundDevice for microphone recording

Streamlit Application

Two Streamlit applications are included:

  1. app_for_recording.py: records 1-second audio clips using a microphone
  2. streamlit_app.py: loads the trained model and allows:
    • Uploading .wav files
    • Recording new samples
    • Classifying speech commands in real time

Run the app with:

streamlit run streamlit_app.py
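
The code of streamlit_app.py is not reproduced here. As a rough idea of the final classification step, the sketch below turns a Softmax output vector into ranked command labels; the function name, the class order, and the probability values are all assumptions for illustration.

```python
import numpy as np

# Assumed class order: the ten commands as listed in the Dataset section.
COMMANDS = ["down", "go", "left", "no", "off", "on", "right", "stop", "up", "yes"]


def top_predictions(probabilities, k=3):
    """Return the k most likely (command, probability) pairs, best first."""
    probabilities = np.asarray(probabilities, dtype=np.float64)
    order = np.argsort(probabilities)[::-1][:k]
    return [(COMMANDS[i], float(probabilities[i])) for i in order]


# Made-up model output for one audio clip.
fake_probs = np.array([0.01, 0.02, 0.05, 0.02, 0.10, 0.03, 0.70, 0.03, 0.02, 0.02])
ranked = top_predictions(fake_probs)
for word, prob in ranked:
    print(f"{word}: {prob:.2f}")
```

Showing the top few candidates with their probabilities, rather than only the winning label, makes it easier to see when the model is uncertain about a recording.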

Features

✅ CNN-based voice command classification
✅ Real-time prediction via Streamlit interface
✅ Handles both uploaded and live-recorded audio
✅ Visualization of training metrics and confusion matrices

Future Improvements

  • Train with real human speech datasets to improve generalization
  • Explore transfer learning using pretrained audio models
  • Add noise reduction preprocessing
  • Expand command set and improve multilingual support
