EutSpeechAudioProcessing: Audio Stream Management, VAD, Speaker Diarization, Wake Word Detection & Speech Recognition
Production-ready ROS2 (Jazzy, Humble-WIP) audio perception stack with advanced VAD, speaker diarization, and state-of-the-art Whisper ASR. It integrates MongoDB for persistent speaker embedding storage with automatic re-identification across sessions: speaker identities survive Docker restarts! Fully containerized architecture with hardware-isolated audio management and a modular speech processing pipeline for enterprise-grade human-robot interaction. Based on the ros4hri standard, with an optional ROS4HRI-compatible publication mode. The default configuration uses a scalability-oriented architecture built on state-of-the-art open-source AI models.
Audio Perception Stack Architecture
EutSpeechAudioProcessing provides end-to-end audio perception for robotics, from hardware audio capture to speech understanding.
- Hardware-Isolated Audio Capture: Robust audio stream management with automatic device detection and error recovery
- Voice Activity Detection (VAD): Real-time speech segment detection with configurable sensitivity
- Speaker Diarization with Persistence: Multi-speaker identification using deep learning embeddings stored in MongoDB; speaker identities persist across Docker restarts and robot sessions
- State-of-the-Art ASR: High-accuracy speech transcription powered by OpenAI Whisper models
- Wake Word Detection: Configurable keyword spotting for hands-free voice activation
- MongoDB Database: Automatic speaker embedding storage and re-identification with persistent identity management
- Decoupled Architecture: Hardware management and speech processing run in separate containers for maximum reliability
- Modular Pipeline: Enable/disable VAD, diarization, wake word, and ASR independently based on your needs
Expected Pipeline Logs During Operation
This repository contains the speech and audio processing module for the perception layer of robotic systems, enabling comprehensive audio understanding and natural human-robot interaction through voice.
The system features a decoupled two-component architecture for robust operation and reliability:
Audio Stream Manager: Hardware-isolated audio capture that interfaces directly with audio devices, preventing hardware issues from affecting the speech processing pipeline.
Speech Recognition Pipeline: A modular processing chain that transforms raw audio into actionable insights:
- Voice Activity Detection (VAD): Detects when speech is present in the audio stream
- Speaker Diarization: Identifies and segments different speakers with persistent identity storage in MongoDB; speaker embeddings survive container restarts and system reboots
- Wake Word Detection: Keyword spotting for voice activation
- Speech Transcription: Converts spoken language into text using automatic speech recognition (ASR)
Unique Feature: Unlike traditional solutions, speaker identities are automatically saved to MongoDB and reloaded on startup, enabling seamless speaker re-identification across sessions without manual re-enrollment.
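The README does not say how a new voice embedding is matched against the stored ones. As a rough illustration of what re-identification can look like, the sketch below compares an incoming embedding with stored embeddings by cosine similarity. The function names, the dictionary layout, and the 0.7 threshold are all hypothetical and are not taken from this repository.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def identify_speaker(embedding, known_speakers, threshold=0.7):
    """Return the best-matching stored speaker ID, or None if nothing reaches the threshold.

    known_speakers: dict of speaker ID -> stored embedding (hypothetical layout).
    threshold: hypothetical value; a real system would tune it on its own data.
    """
    best_id, best_score = None, threshold
    for speaker_id, stored in known_speakers.items():
        score = cosine_similarity(embedding, stored)
        if score >= best_score:
            best_id, best_score = speaker_id, score
    return best_id


# Toy 3-dimensional embeddings; real speaker embeddings have far more dimensions.
known = {"speaker_0": [1.0, 0.0, 0.0], "speaker_1": [0.0, 1.0, 0.0]}
match = identify_speaker([0.9, 0.1, 0.0], known)    # close to speaker_0
unknown = identify_speaker([0.0, 0.0, 1.0], known)  # orthogonal to both stored speakers
```

When no stored embedding is similar enough, a system built along these lines would typically enroll the voice as a new speaker instead of forcing a match.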
First, build the required base Docker image from EutRobAIDockers.
git clone git@github.com:Eurecat/EutRobAIDockers.git
cd EutRobAIDockers
./build_container.sh
# Defaults to ROS2 Jazzy and GPU
# Optionally, use --clean-rebuild to force a complete rebuild without cached layers, or --cpu to build a CPU-only image, etc.
Then clone this repository:
git clone git@github.com:Eurecat/eut_speech_audio_processing.git
cd eut_speech_audio_processing
For Vulcanexus-based installations:
cd Docker && ./build_container.sh --vulcanexus
For standard installations:
cd Docker && ./build_container.sh
Build Options:
- Use the --clean-rebuild flag to force a complete rebuild without cached layers
Hugging Face Token Setup:
Configure your Hugging Face token in the .env file (see .env.example for template) to access state-of-the-art models:
- openai/whisper - Advanced speech recognition
- pyannote/embedding - Speaker voice embeddings
- pyannote/segmentation - Speaker diarization
Ensure your token has appropriate permissions for these model repositories.
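Before building, it can help to confirm that the token is actually set and not still a template placeholder. The sketch below parses a KEY=VALUE file and checks one key. The variable name HF_TOKEN and the "your_" placeholder prefix are assumptions; check .env.example for the names this repository really expects.

```python
import tempfile
from pathlib import Path


def read_env_value(env_path, key):
    """Return the value stored for `key` in a KEY=VALUE file, or None if absent."""
    for raw_line in Path(env_path).read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and malformed lines
        name, _, value = line.partition("=")
        if name.strip() == key:
            return value.strip().strip('"').strip("'")
    return None


def token_is_configured(env_path, key="HF_TOKEN"):
    """True if `key` has a non-empty value that is not a 'your_...' placeholder (assumed convention)."""
    value = read_env_value(env_path, key)
    return bool(value) and not value.startswith("your_")


# Demo against a throwaway file so no real .env is read or modified.
with tempfile.TemporaryDirectory() as tmp:
    env_file = Path(tmp) / ".env"
    env_file.write_text("# Hugging Face access\nHF_TOKEN=your_huggingface_token_here\n", encoding="utf-8")
    placeholder_ok = token_is_configured(env_file)  # placeholder still in place
    env_file.write_text('HF_TOKEN="hf_example"\n', encoding="utf-8")
    real_ok = token_is_configured(env_file)  # a value has been filled in
```

This only checks that a value is present; it does not verify that the token has access to the model repositories listed above.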
Navigate to the Docker directory and launch both services simultaneously:
cd Docker
docker compose up
This command will initialize both the Audio Stream Manager and the Speech Recognition Pipeline services automatically.
Microphone Selection:
- Check detected audio devices:
  docker logs audio_device_manager
  Example output shows available devices with their hardware IDs.
- Set device_name in audio_params.yaml to the desired device.
- Restart only the audio service:
  docker restart audio_device_manager
The Docker Compose setup includes two main services:
- Audio Device Manager Service: Handles audio input device selection and stream management
- Speech Recognition Service: Provides VAD, diarization, wake word and ASR capabilities
You can selectively enable or disable speech recognition components by editing the command section in the dev-docker-compose.yaml file. Modify the speech recognition service command as follows:
# Example: Disable diarization and ASR, keep only VAD
command: bash -c "source /workspace/install/setup.bash && ros2 launch speech_recognition speech_recognition.launch.py enable_diarization:=false enable_asr:=false"
Available options:
- enable_vad:=true/false - Voice Activity Detection
- enable_diarization:=true/false - Speaker Diarization
- enable_wake_word:=true/false - Wake Word
- enable_asr:=true/false - Automatic Speech Recognition
Important Dependencies:
- Diarization requires VAD to work properly
- ASR requires both VAD and Diarization for optimal performance
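The dependency rules above can be checked before editing the compose file. The following is a minimal sketch, not part of this repository: the helper names are hypothetical, and only the four launch arguments and the two dependency rules come from this README.

```python
def check_component_flags(*, enable_vad, enable_diarization, enable_wake_word, enable_asr):
    """Return a list of human-readable problems for a flag combination.

    enable_wake_word is accepted for a uniform signature; the README lists no
    dependency for it, so it is not checked here.
    """
    problems = []
    if enable_diarization and not enable_vad:
        problems.append("diarization requires VAD")
    if enable_asr and not (enable_vad and enable_diarization):
        problems.append("ASR works best with both VAD and diarization enabled")
    return problems


def launch_arguments(**flags):
    """Format boolean flags as ROS2 launch arguments, e.g. enable_vad:=true."""
    return " ".join(f"{name}:={str(value).lower()}" for name, value in flags.items())


# VAD only, matching the example in the compose command above.
flags = dict(enable_vad=True, enable_diarization=False, enable_wake_word=False, enable_asr=False)
problems = check_component_flags(**flags)
args = launch_arguments(**flags)

# ASR without diarization triggers the "optimal performance" warning.
asr_only = check_component_flags(
    enable_vad=True, enable_diarization=False, enable_wake_word=False, enable_asr=True
)
```

The resulting string in `args` can be appended to the ros2 launch line in the compose command. The sketch passes every flag explicitly because this README does not state the default values.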
The speaker diarization system uses MongoDB to persistently store speaker voice embeddings, enabling automatic re-identification across Docker container restarts and robot sessions. Once a speaker is enrolled, their voice profile remains in the database indefinitely.
Query the database:
mongosh
use speaker_recognition
db.speakers.find()
Access the web interface:
http://0.0.0.0:8081/db/speaker_recognition/speakers
Delete the database:
Remove the associated Docker volume to clear all speaker embeddings and start fresh.
This persistence means your robot can recognize previously encountered speakers without re-enrollment, making interactions more natural and continuous across sessions.
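The same collection can be inspected from Python. The sketch below assumes the third-party pymongo package is installed and that MongoDB is reachable at localhost:27017. The database and collection names come from the mongosh commands above; the document schema is not described in this README, so the code only counts documents and collects field names, and the sample documents are invented for the demo.

```python
def summarize_speakers(documents):
    """Count documents and collect their field names without assuming a schema."""
    count = 0
    fields = set()
    for doc in documents:
        count += 1
        fields.update(doc.keys())
    return {"count": count, "fields": sorted(fields)}


def fetch_summary(uri="mongodb://localhost:27017"):
    """Summarize the speakers collection of a live MongoDB instance (URI is an assumption)."""
    from pymongo import MongoClient  # third-party; imported lazily so the demo below runs without it

    client = MongoClient(uri, serverSelectionTimeoutMS=5000)
    try:
        return summarize_speakers(client["speaker_recognition"]["speakers"].find())
    finally:
        client.close()


# Offline demo with invented documents; the real field names may differ.
sample = [
    {"_id": 1, "embedding": [0.1, 0.2]},
    {"_id": 2, "embedding": [0.3, 0.4], "name": "guest"},
]
summary = summarize_speakers(sample)
```

Calling `fetch_summary()` while the compose stack is running gives a quick view of how many speakers are stored and which fields each record carries.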
This repository uses Ruff for automatic Python code formatting via pre-commit hooks.
Quick Setup:
# Install pre-commit
pip install pre-commit
# Install the git hooks
pre-commit install  # Hooks then run only on staged files at each git commit
# (Optional) Run on all existing files
pre-commit run --all-files
# If you need to commit urgently and skip the pre-commit checks
git commit -m "urgent fix" --no-verify
Now Ruff will automatically format your code before each commit. If formatting changes are made, review them with git diff, then stage and commit again.
Follow PRECOMMIT.md for detailed instructions and troubleshooting tips related to pre-commit hooks.
If you encounter the error failed to bind host port for 0.0.0.0:27017:172.21.0.2:27017/tcp: address already in use, this means another service is already occupying port 27017. The docker-compose MongoDB service cannot start because the port is blocked. To resolve this, identify and stop the conflicting service with sudo lsof -i :27017 and kill the process if needed, then restart docker-compose.
sudo lsof -ti:27017 | xargs -r sudo kill -9
If you encounter the error
[ERROR] Failed to load identity database from MongoDB: localhost:27017: [Errno 111] Connection refused (configured timeouts: socketTimeoutMS: 20000.0ms, connectTimeoutMS: 20000.0ms), Timeout: 5.0s, Topology Description: <TopologyDescription id: 699c509d3119785fb03732f5, topology_type: Unknown, servers: [<ServerDescription ('localhost', 27017) server_type: Unknown, rtt: None, error=AutoReconnect('localhost:27017: [Errno 111] Connection refused (configured timeouts: socketTimeoutMS: 20000.0ms, connectTimeoutMS: 20000.0ms)')>]>
then the MongoDB volume probably holds a bad configuration left over from a previous compose run. Run compose down with the -v flag to remove the volumes and start again; note that this also clears the stored speaker embeddings. Do the same after any change to compose.yaml:
docker compose down -v
docker compose up
If you switch between dev-docker-compose.yaml and docker-compose.yaml, you may encounter errors like Conflict. The container name "/mongodb_faces" is already in use. This happens because containers from the previous compose file still exist. To resolve this, stop them:
docker stop $(docker ps -q)  # stops every running container; if other containers matter to you, stop or remove only the conflicting ones by name
Then run docker compose up again. If the name conflict persists, remove the stopped container (for example, docker rm mongodb_faces) so the new composition can start fresh.
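For the port conflict on 27017 described above, a quick check from Python shows whether the port is free before starting the stack. This is a minimal sketch using only the standard library; the helper name is hypothetical, and the demo occupies a throwaway port on 127.0.0.1 instead of touching 27017.

```python
import socket


def port_is_free(port, host="0.0.0.0"):
    """Return True if a TCP socket can bind to (host, port) right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            return False  # typically "address already in use"
    return True


# Demo: occupy an ephemeral port, then confirm the check reports it as busy.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
listener.listen(1)
busy_port = listener.getsockname()[1]
busy_result = port_is_free(busy_port, host="127.0.0.1")
listener.close()
```

Running `port_is_free(27017)` before docker compose up tells you whether another service already holds the MongoDB port; it does not identify which process holds it, which is what the lsof command is for.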
- Configure secrets (if needed for your workflow):
  # Create a secrets file
  touch .secrets
  # Add your secrets (example):
  echo "HF_TOKEN=your_huggingface_token_here" >> .secrets
  Important: Don't commit the .secrets file to GitHub! Add it to .gitignore:
  echo ".secrets" >> .gitignore
Follow CI_CD_SETUP.md for detailed instructions on how to run GitHub Actions workflows locally.
Apache-2.0