This project implements a near real-time audio transcription system with a GPU-backed transcription server and a client for continuous audio streaming and transcription.
- Server (`server-transcribe.py`): runs on a CUDA GPU-enabled machine, providing transcription services via HTTP.
- Client (`send-streaming-voice.py`): captures audio and sends it to the server for near real-time transcription.
## Server

### Requirements

- Python 3.10+
- FastAPI
- uvicorn
- NVIDIA GPU with CUDA support
- NeMo ASR model (Canary-1B)
### Installation

Install the required packages:

```
pip install fastapi uvicorn "nemo_toolkit[all]" python-dotenv
```

Ensure the necessary CUDA libraries are installed for your GPU.
### Running the Server

```
python server-transcribe.py
```
## Client

### Requirements

- Python 3.7+
- pyaudio
- requests
- python-dotenv
### Installation

Install the required packages:

```
pip install pyaudio requests python-dotenv
```
### Configuration

Create a `.env` file in the same directory as `send-streaming-voice.py`:

```
TRANSCRIBE_ENDPOINT=http://your-server-ip:8726/transcribe
```

Replace `your-server-ip` with the IP address or hostname of your GPU server.
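At startup the client reads this value with python-dotenv. As an illustration of what that lookup amounts to, here is a stdlib-only sketch (the parser below is a simplified stand-in, not the real python-dotenv):

```python
import os


def load_env_file(path=".env"):
    """Parse simple KEY=VALUE lines, mimicking what python-dotenv does."""
    values = {}
    try:
        with open(path) as f:
            for line in f:
                line = line.strip()
                # Skip blanks and comments; split on the first "=" only.
                if line and not line.startswith("#") and "=" in line:
                    key, _, value = line.partition("=")
                    values[key.strip()] = value.strip()
    except FileNotFoundError:
        pass
    return values


# Real environment variables take precedence over the .env file,
# matching python-dotenv's default non-override behavior.
env = load_env_file()
endpoint = os.environ.get("TRANSCRIBE_ENDPOINT") or env.get("TRANSCRIBE_ENDPOINT")
```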
### Running the Client

```
python send-streaming-voice.py
```
To compare a transcription against the source text:

```
python text-diff.py test/data/text-source.txt test/data/text-transcribed(-silero).txt
```
## Usage

- Start the server on your GPU-enabled machine.
- Run the client on the machine where you want to capture audio.
- Speak into the microphone connected to the client machine.
- The client continuously sends audio chunks to the server.
- The server processes each chunk and returns a transcription.
- The client prints each transcription as it is received.
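Each chunk travels as a complete WAV file built in memory from raw PCM samples. A sketch of that packaging step, assuming a 16 kHz mono 16-bit capture format (match whatever the client actually records):

```python
import io
import wave


def pcm_to_wav(pcm_bytes, rate=16000, channels=1, sample_width=2):
    """Wrap raw PCM samples in a WAV container, entirely in memory."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(channels)
        wf.setsampwidth(sample_width)  # 2 bytes = 16-bit samples
        wf.setframerate(rate)
        wf.writeframes(pcm_bytes)
    return buf.getvalue()


# A 5-second chunk of 16 kHz mono 16-bit audio is
# 16000 frames/s * 2 bytes/frame * 5 s = 160000 bytes of raw PCM.
chunk = pcm_to_wav(b"\x00\x00" * 16000 * 5)
```

The resulting bytes can then be sent as the POST body (or a multipart file field), depending on what `server-transcribe.py` expects.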
## Architecture

```
[Client Machine]                     [GPU Server]
+------------------+               +------------------+
|                  |               |                  |
|   Microphone     |               |   NVIDIA GPU     |
|       |          |               |        |         |
|       v          |               |        v         |
| send-streaming-  |   HTTP POST   |     server-      |
|   voice.py       | ------------> |  transcribe.py   |
|       |          |  Audio Data   |        |         |
|       |          |               |        |         |
|       |          |     HTTP      |        |         |
|       |          |   Response    |        |         |
|       v          | <------------ |        |         |
|    Display       | Transcription |        |         |
|  Transcription   |               |        |         |
+------------------+               +------------------+
```
## How It Works

### Server (`server-transcribe.py`)

- Uses FastAPI to create an HTTP server.
- Loads a pre-trained NeMo ASR model (Canary-1B) for transcription.
- Receives audio chunks via POST requests.
- Processes audio on the GPU for fast transcription.
- Returns the transcribed text as a JSON response.
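The endpoint's contract can be sketched as a pure function, with the NeMo Canary-1B call left as a placeholder (the route name, response keys, and model API shown in the comments are assumptions, not the actual server code):

```python
import io
import wave


def transcribe_wav(wav_bytes, model=None):
    """Validate an incoming WAV chunk and return a JSON-serializable result."""
    # Parse the WAV header; malformed uploads raise here instead of
    # reaching the model.
    with wave.open(io.BytesIO(wav_bytes), "rb") as wf:
        duration = wf.getnframes() / wf.getframerate()
    if model is not None:
        # Real server: run the NeMo Canary-1B model on the audio.
        text = model.transcribe(wav_bytes)
    else:
        text = ""  # placeholder when no model is loaded
    return {"transcription": text, "duration_seconds": duration}


# FastAPI wiring would look roughly like (sketch only):
#   app = FastAPI()
#   @app.post("/transcribe")
#   async def transcribe(file: UploadFile):
#       return transcribe_wav(await file.read(), model=asr_model)
```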
### Client (`send-streaming-voice.py`)

- Uses PyAudio to capture audio from the microphone.
- Streams audio in chunks (5 seconds by default).
- Sends each audio chunk to the server as a WAV file in a POST request.
- Receives and displays the transcription for each chunk.
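The client's core loop can be sketched with the capture and network calls injected as plain callables, so the chunk-by-chunk logic is visible without PyAudio or a live server (the names here are illustrative, not the actual `send-streaming-voice.py` internals):

```python
def stream_chunks(read_chunk, send, max_chunks=None):
    """Core client loop: read a PCM chunk, ship it, print the transcription.

    `read_chunk()` stands in for PyAudio capture and `send(chunk)` for the
    requests.post(...) call; both are injected so the loop is testable.
    """
    results = []
    sent = 0
    while max_chunks is None or sent < max_chunks:
        chunk = read_chunk()
        if not chunk:  # empty chunk signals end of stream
            break
        reply = send(chunk)  # real client: POST to TRANSCRIBE_ENDPOINT
        text = reply.get("transcription", "")
        results.append(text)
        print(text)
        sent += 1
    return results
```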
## Notes

- The server's firewall must allow incoming connections on the specified port.
- Audio is captured in 5-second chunks by default; adjust this in the `AudioStreamer` class if needed.
- The transcription model is NVIDIA's Canary-1B. You may need to adjust paths or model loading for your specific setup.
## Troubleshooting

- For audio capture issues, ensure your microphone is properly configured and recognized by your system.
- For server connection issues, verify that `TRANSCRIBE_ENDPOINT` in the client's `.env` file is correct and that the server is running and reachable.
- GPU-related issues on the server side may require checking your CUDA installation and its compatibility with the NeMo toolkit.
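For the connection case, a quick first check is whether a plain TCP connection to the server's host and port succeeds at all (a hypothetical helper, not part of the actual client):

```python
import socket
from urllib.parse import urlparse


def endpoint_reachable(endpoint, timeout=3.0):
    """Return True if a TCP connection to the endpoint's host:port succeeds."""
    parts = urlparse(endpoint)
    host = parts.hostname
    port = parts.port or (443 if parts.scheme == "https" else 80)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, or name resolution failed
        return False
```

If this returns False for your configured endpoint, the problem is network-level (server not running, wrong address, or firewall) rather than anything in the transcription pipeline.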
## License

MIT License