
This repository contains the speech and audio processing module of the perception layer for robotic systems.


Eurecat/eut_speech_audio_processing


EutSpeechAudioProcessing: Audio Stream Management, VAD, Speaker Diarization, Wake Word Detection & Speech Recognition

🚀 Production-ready ROS2 (Jazzy; Humble support is work in progress) audio perception stack with VAD, speaker diarization 🗣️, and Whisper ASR 📝. It integrates MongoDB 💾 for persistent speaker embedding storage with automatic re-identification across sessions, so speaker identities survive Docker restarts. The architecture is fully containerized, with hardware-isolated audio management and a modular speech processing pipeline for human-robot interaction. It is based on the ros4hri 🤖 standard, with an optional ROS4HRI-compatible publication mode. The default configuration uses a scalability-oriented architecture built on open-source AI models.

๐Ÿ—๏ธ Architecture Overview

Audio Processing Architecture
Audio Perception Stack Architecture

EutSpeechAudioProcessing provides end-to-end audio perception for robotics, from hardware audio capture to speech understanding.

Key Features

  • 🎤 Hardware-Isolated Audio Capture: Robust audio stream management with automatic device detection and error recovery
  • 🗣️ Voice Activity Detection (VAD): Real-time speech segment detection with configurable sensitivity
  • 👥 Speaker Diarization with Persistence: Multi-speaker identification using deep learning embeddings stored in MongoDB; speaker identities persist across Docker restarts and robot sessions
  • 📝 State-of-the-Art ASR: High-accuracy speech transcription powered by OpenAI Whisper models
  • 🔊 Wake Word Detection: Configurable keyword spotting for hands-free voice activation
  • 🗄️ MongoDB Database: Automatic speaker embedding storage and re-identification with persistent identity management
  • 🐳 Decoupled Architecture: Hardware management and speech processing run in separate containers for maximum reliability
  • ⚙️ Modular Pipeline: Enable/disable VAD, diarization, wake word, and ASR independently based on your needs

[Figure: Expected Pipeline Logs During Operation]

Overview

This repository contains the speech and audio processing module for the perception layer of robotic systems, enabling comprehensive audio understanding and natural human-robot interaction through voice.

Architecture

The system features a decoupled two-component architecture for robust operation and reliability:

๐ŸŽ™๏ธ Audio Stream Manager

Hardware-isolated audio capture that interfaces directly with audio devices, preventing hardware issues from affecting the speech processing pipeline.

🧠 Speech Recognition Pipeline

A modular processing chain that transforms raw audio into actionable insights:

  • Voice Activity Detection (VAD): Detects when speech is present in the audio stream
  • Speaker Diarization: Identifies and segments different speakers with persistent identity storage in MongoDB; speaker embeddings survive container restarts and system reboots
  • Wake Word Detection: Keyword spotting for voice activation
  • Speech Transcription: Converts spoken language into text using automatic speech recognition (ASR)

🔑 Unique Feature: Unlike traditional solutions, speaker identities are automatically saved to MongoDB and reloaded on startup, enabling seamless speaker re-identification across sessions without manual re-enrollment.
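
How a new voice is matched against the stored embeddings is not described here. As an illustration only, the sketch below assumes cosine similarity against each stored embedding with a fixed acceptance threshold. The function names, the 0.7 threshold, and the dictionary layout are hypothetical and are not taken from this repository.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def identify_speaker(embedding, known_speakers, threshold=0.7):
    """Return the ID of the most similar stored speaker, or None below the threshold.

    known_speakers maps a speaker ID to its stored embedding (hypothetical layout).
    """
    best_id, best_score = None, threshold
    for speaker_id, stored in known_speakers.items():
        score = cosine_similarity(embedding, stored)
        if score >= best_score:
            best_id, best_score = speaker_id, score
    return best_id


# Toy 3-dimensional embeddings; real speaker embeddings are much larger.
known = {"speaker_0": [1.0, 0.0, 0.0], "speaker_1": [0.0, 1.0, 0.0]}
print(identify_speaker([0.9, 0.1, 0.0], known))  # closest to speaker_0
print(identify_speaker([0.0, 0.0, 1.0], known))  # nothing clears the threshold
```

With persistence, the `known` dictionary would be filled from the database at startup instead of being rebuilt through enrollment.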


🚀 Quick Start

Installation & Setup

Step 0: Build Base Image

First, build the required base Docker image from EutRobAIDockers.

git clone git@github.com:Eurecat/EutRobAIDockers.git
cd EutRobAIDockers
./build_container.sh
# Defaults to ROS2 Jazzy with GPU support.
# Optional flags include --clean-rebuild (force a complete rebuild without cached layers) and --cpu (build a CPU-only image).

Step 1: Clone Repository

git clone git@github.com:Eurecat/eut_speech_audio_processing.git
cd eut_speech_audio_processing

Step 2: Build Application Image

For Vulcanexus-based installations:

cd Docker && ./build_container.sh --vulcanexus

For standard installations:

cd Docker && ./build_container.sh

Build Options:

  • Use --clean-rebuild flag to force a complete rebuild without cached layers

Configuration Parameters

Hugging Face Token Setup:
Configure your Hugging Face token in the .env file (see .env.example for template) to access state-of-the-art models:

  • openai/whisper - Advanced speech recognition
  • pyannote/embedding - Speaker voice embeddings
  • pyannote/segmentation - Speaker diarization

Ensure your token has appropriate permissions for these model repositories.
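
Before launching the containers, it can help to confirm that the token is actually set. The sketch below is a minimal check that parses a dotenv-style file. The variable name HF_TOKEN and the placeholder string are assumptions borrowed from the secrets example later in this README; check .env.example for the names the project really uses.

```python
from pathlib import Path

# Placeholder value assumed from the secrets example in this README.
PLACEHOLDER = "your_huggingface_token_here"


def read_env_file(path):
    """Parse KEY=VALUE lines from a dotenv-style file, skipping blanks and comments."""
    values = {}
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def has_token(env, key="HF_TOKEN"):
    """True if `key` has a non-empty value that is not the placeholder."""
    value = env.get(key, "")
    return bool(value) and value != PLACEHOLDER


if __name__ == "__main__":
    env_path = Path(".env")
    if not env_path.exists():
        print("No .env file found; copy .env.example and fill it in")
    elif has_token(read_env_file(env_path)):
        print("HF_TOKEN looks set")
    else:
        print("HF_TOKEN is missing or still a placeholder")
```

This only checks that a value is present; it does not verify that the token has been granted access to the model repositories.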

Usage

Docker Compose (Recommended)

Navigate to the Docker directory and launch both services simultaneously:

cd Docker
docker compose up

This command will initialize both the Audio Stream Manager and the Speech Recognition Pipeline services automatically.

Microphone Selection:

  1. Check detected audio devices:

    docker logs audio_device_manager

    Example output shows available devices with their hardware IDs.

  2. Set device_name in audio_params.yaml to the desired device

  3. Restart only the audio service:

    docker restart audio_device_manager
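
If the device has to be switched often, step 2 can be scripted. The sketch below is a hedged example that rewrites the first device_name entry in a YAML text while keeping its indentation. The sample YAML layout (audio_node, ros__parameters, sample_rate) and the device string are hypothetical; only the device_name key comes from this README.

```python
import re


def set_device_name(yaml_text, new_name):
    """Replace the value of the first `device_name:` entry, keeping its indentation.

    Anything after the key on that line (including a trailing comment) is replaced.
    """
    pattern = re.compile(r"^([ \t]*device_name:[ \t]*).*$", re.MULTILINE)
    updated, count = pattern.subn(
        lambda m: f'{m.group(1)}"{new_name}"', yaml_text, count=1
    )
    if count == 0:
        raise ValueError("no device_name entry found")
    return updated


# Hypothetical parameter file contents, for illustration only.
sample = (
    "audio_node:\n"
    "  ros__parameters:\n"
    '    device_name: "default"\n'
    "    sample_rate: 16000\n"
)
print(set_device_name(sample, "hw:1,0"))
```

After writing the result back to audio_params.yaml, restart the audio service as in step 3.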

Service Configuration

The Docker Compose setup includes two main services:

  1. Audio Device Manager Service: Handles audio input device selection and stream management
  2. Speech Recognition Service: Provides VAD, diarization, wake word and ASR capabilities

Enabling/Disabling Components

You can selectively enable or disable speech recognition components by editing the command section in the dev-docker-compose.yaml file. Modify the speech recognition service command as follows:

# Example: Disable diarization and ASR, keep only VAD
command: bash -c "source /workspace/install/setup.bash && ros2 launch speech_recognition speech_recognition.launch.py enable_diarization:=false enable_asr:=false"

Available options:

  • enable_vad:=true/false - Voice Activity Detection
  • enable_diarization:=true/false - Speaker Diarization
  • enable_wake_word:=true/false - Wake Word Detection
  • enable_asr:=true/false - Automatic Speech Recognition

Important Dependencies:

  • Diarization requires VAD to work properly
  • ASR requires both VAD and Diarization for optimal performance
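
The options and dependencies above can be checked before editing the compose file. The sketch below builds the launch-argument string and collects warnings for combinations that conflict with the stated dependencies. The helper name and the assumption that every component defaults to enabled are illustrative and not taken from the launch file.

```python
def build_launch_args(vad=True, diarization=True, wake_word=True, asr=True):
    """Return (launch_args, warnings) for the speech recognition launch file.

    Defaults of True for every component are an assumption made for this sketch.
    """
    flags = {
        "enable_vad": vad,
        "enable_diarization": diarization,
        "enable_wake_word": wake_word,
        "enable_asr": asr,
    }
    warnings = []
    if diarization and not vad:
        warnings.append("Diarization requires VAD to work properly")
    if asr and not (vad and diarization):
        warnings.append("ASR works best with both VAD and diarization enabled")
    # ROS2 launch arguments use the name:=value form with lowercase booleans.
    args = " ".join(f"{name}:={str(value).lower()}" for name, value in flags.items())
    return args, warnings


args, warnings = build_launch_args(diarization=False, asr=False)
print("ros2 launch speech_recognition speech_recognition.launch.py " + args)
for message in warnings:
    print("warning: " + message)
```

Passing every flag explicitly, as this helper does, keeps the command independent of whatever defaults the launch file defines.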

Managing the Speaker Recognition Database

The speaker diarization system uses MongoDB to persistently store speaker voice embeddings, enabling automatic re-identification across Docker container restarts and robot sessions. Once a speaker is enrolled, their voice profile remains in the database indefinitely.

Query the database:

mongosh
use speaker_recognition
db.speakers.find()

Access the web interface:
http://0.0.0.0:8081/db/speaker_recognition/speakers

Delete the database:
Remove the associated Docker volume to clear all speaker embeddings and start fresh.

This persistence means your robot can recognize previously encountered speakers without re-enrollment, making interactions more natural and continuous across sessions.
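
The mongosh query above can also be run from Python. The sketch below is a minimal example that counts the stored documents and lists their field names. It assumes the third-party pymongo package and a MongoDB instance reachable at localhost:27017; the document schema is not specified in this README, so the helper only reports whatever fields it finds.

```python
def summarize_speakers(collection):
    """Return (document count, sorted field names) for a speakers collection.

    `collection` is any object with a pymongo-style find() method.
    """
    count = 0
    fields = set()
    for doc in collection.find():
        count += 1
        fields.update(doc.keys())
    return count, sorted(fields)


def main():
    try:
        # Third-party dependency: pip install pymongo
        from pymongo import MongoClient
        from pymongo.errors import PyMongoError
    except ImportError:
        print("pymongo is not installed; run: pip install pymongo")
        return
    # Host and port are assumptions; adjust them to your compose setup.
    client = MongoClient("mongodb://localhost:27017", serverSelectionTimeoutMS=2000)
    try:
        count, fields = summarize_speakers(client["speaker_recognition"]["speakers"])
    except PyMongoError as exc:
        print(f"Could not query MongoDB: {exc}")
        return
    print(f"{count} stored speaker document(s); fields: {fields}")


if __name__ == "__main__":
    main()
```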

Formatting code - Pre-commit Hooks (Optional but Recommended)

This repository uses Ruff for automatic Python code formatting via pre-commit hooks.

Quick Setup:

# Install pre-commit
pip install pre-commit

# Install the git hooks 
pre-commit install  # By default, hooks run only on changed files at commit time

# (Optional) Run on all existing files
pre-commit run --all-files

# If you need to commit urgently, skip the pre-commit checks
git commit -m "urgent fix" --no-verify

Now Ruff will automatically format your code before each commit. If formatting changes are made, review them with git diff, then stage and commit again.

Follow PRECOMMIT.md for detailed instructions and troubleshooting tips related to pre-commit hooks.


Troubleshooting

Port 27017 Already in Use

If you encounter the error failed to bind host port for 0.0.0.0:27017:172.21.0.2:27017/tcp: address already in use, another process is already listening on port 27017, so the Docker Compose MongoDB service cannot start. Identify the conflicting process with sudo lsof -i :27017, stop it if it is safe to do so, and then restart Docker Compose. The one-liner below force-kills whatever holds the port:

sudo lsof -ti:27017 | xargs -r sudo kill -9
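
To check the port without lsof or root privileges, a bind test can be used. The sketch below is a small example that tries to bind the port and reports whether it is free; the helper name is hypothetical.

```python
import socket


def port_is_free(port, host="0.0.0.0"):
    """Return True if `port` can be bound on `host`, i.e. no other process holds it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
    return True


if __name__ == "__main__":
    state = "free" if port_is_free(27017) else "already in use"
    print(f"Port 27017 is {state}")
```

The check only reports whether the port is taken; finding and stopping the owning process still requires a tool such as lsof.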

Failed to Load Identity Database from MongoDB

If you encounter the error

 [ERROR] Failed to load identity database from MongoDB: localhost:27017: [Errno 111] Connection refused (configured timeouts: socketTimeoutMS: 20000.0ms, connectTimeoutMS: 20000.0ms), Timeout: 5.0s, Topology Description: <TopologyDescription id: 699c509d3119785fb03732f5, topology_type: Unknown, servers: [<ServerDescription ('localhost', 27017) server_type: Unknown, rtt: None, error=AutoReconnect('localhost:27017: [Errno 111] Connection refused (configured timeouts: socketTimeoutMS: 20000.0ms, connectTimeoutMS: 20000.0ms)')>]>

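In that case, the MongoDB volume most likely holds a stale configuration from a previous compose run. Run docker compose down -v to remove the volumes and start again; note that this also deletes the stored speaker embeddings if they live in one of those volumes. Do the same after any change to the compose file: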

docker compose down -v
docker compose up
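
A connection-refused error can also appear when a client starts before MongoDB is ready to accept connections. The sketch below is a hedged example that polls the port until a TCP connection succeeds or a timeout expires. The helper name, host, and timing values are assumptions, and a successful TCP connection does not prove that the database itself is healthy.

```python
import socket
import time


def wait_for_port(host, port, timeout=30.0, interval=1.0):
    """Poll until a TCP connection to host:port succeeds; return False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


if __name__ == "__main__":
    if wait_for_port("localhost", 27017, timeout=3.0, interval=0.5):
        print("MongoDB port is accepting connections")
    else:
        print("MongoDB port did not open in time")
```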

Container Name Conflicts

If you switch between dev-docker-compose.yaml and docker-compose.yaml, you may encounter errors like Conflict. The container name "/mongodb_faces" is already in use. This happens because containers from the previous compose file still exist. To resolve this, stop them and restart:

docker stop $(docker ps -q)  # stops ALL running containers; if other important containers are running, stop or remove (docker rm) only the conflicting ones

Then run docker compose up again. If the conflict persists, the stopped container still holds the name: remove it with docker rm and retry.

Setup for Local Testing

  1. Configure secrets (if needed for your workflow):

    # Create a secrets file
    touch .secrets
    
    # Add your secrets (example):
    echo "HF_TOKEN=your_huggingface_token_here" >> .secrets

    โš ๏ธ Important: Don't commit the .secrets file to GitHub! Add it to .gitignore:

    echo ".secrets" >> .gitignore

Running CI/CD Locally

Follow CI_CD_SETUP.md for detailed instructions on how to run GitHub Actions workflows locally.


License

Apache-2.0

Maintainers
