A voice agent application that integrates SIP/RTP with LLMs (Large Language Models), TTS (Text-to-Speech), and STT (Speech-to-Text).
- SIP/RTP Integration: Handles VoIP calls using `sipgo` and `pion/sdp`.
- Speech-to-Text (STT): Uses Whisper for high-accuracy speech recognition.
- Text-to-Speech (TTS): Uses Piper for fast, neural text-to-speech.
- LLM Integration: (Planned) Connects to LLMs for conversational intelligence.
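The components above form a pipeline: caller audio goes to STT, the transcript goes to the LLM, and the reply goes to TTS. A minimal sketch of that flow, with stub components standing in for Whisper, the LLM, and Piper (all type and function names here are illustrative, not the project's actual API):

```go
package main

import (
	"fmt"
	"strings"
)

// Hypothetical pipeline stages; the real project wires these to
// whisper.cpp, an LLM backend, and Piper respectively.
type (
	STT func(audio []byte) (string, error)  // speech -> text
	LLM func(prompt string) (string, error) // text -> reply
	TTS func(text string) ([]byte, error)   // text -> speech
)

// handleUtterance runs one caller utterance through the pipeline.
func handleUtterance(audio []byte, stt STT, llm LLM, tts TTS) ([]byte, error) {
	text, err := stt(audio)
	if err != nil {
		return nil, fmt.Errorf("stt: %w", err)
	}
	reply, err := llm(text)
	if err != nil {
		return nil, fmt.Errorf("llm: %w", err)
	}
	return tts(reply)
}

func main() {
	// Stubs stand in for the real engines.
	stt := func(audio []byte) (string, error) { return string(audio), nil }
	llm := func(prompt string) (string, error) { return "echo: " + prompt, nil }
	tts := func(text string) ([]byte, error) { return []byte(strings.ToUpper(text)), nil }

	out, _ := handleUtterance([]byte("hello"), stt, llm, tts)
	fmt.Println(string(out)) // ECHO: HELLO
}
```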
The application relies on shared libraries from the whisper.cpp project.
The source code is included as a git submodule in third_party/whisper.cpp.
```sh
git submodule update --init --recursive
cd third_party/whisper.cpp
cmake -B build -DGGML_CUDA=1
cmake --build build -j $(nproc) --config Release
```

The application is configured to look for libraries in `third_party/whisper.cpp/build` when run via `run.sh`.
Download a Whisper model (e.g., `base.en`) to `models/ggml-base.en.bin`.
```sh
mkdir -p models
wget -O models/ggml-base.en.bin https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-base.en.bin
```

Download and install the Piper binary and a voice model.
- Download the Piper binary release from Piper GitHub Releases.
- Extract it to a location (e.g., `/opt/piper` or locally).
- Download a voice model (ONNX + JSON config) from Piper Voices.
  - Example: `en_US-lessac-medium`
- Update config.json with the paths to the Piper binary and model.
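Since Piper runs as an external binary, the agent can invoke it with `os/exec`, piping the text to speak on stdin. A sketch of building that command (the binary and model paths are examples, and `piperCmd` is an illustrative helper, not the project's actual code):

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// piperCmd builds a command that synthesizes text with the external
// piper binary; piper reads the text to speak from stdin.
func piperCmd(binary, model, text, outWav string) *exec.Cmd {
	cmd := exec.Command(binary, "--model", model, "--output_file", outWav)
	cmd.Stdin = strings.NewReader(text)
	return cmd
}

func main() {
	cmd := piperCmd("/opt/piper/piper",
		"/opt/piper/models/en_US-lessac-medium.onnx",
		"Hello, world!", "/tmp/hello.wav")
	fmt.Println(strings.Join(cmd.Args, " "))
	// Actually synthesizing would be: err := cmd.Run()
}
```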
Copy config.json.example to config.json and update the values:
```json
{
  "sip_port": 5060,
  "rtp_start_port": 40000,
  "rtp_end_port": 50000,
  "whisper_model_path": "models/ggml-base.en.bin",
  "piper_binary_path": "/path/to/piper/piper",
  "piper_model_path": "/path/to/piper/models/en_US-lessac-medium.onnx",
  "http_port": 3000
}
```

The easiest way to run the application with all dependencies correctly linked is to use the run.sh script:
```sh
./run.sh
```

This script sets up `CGO_LDFLAGS` and `LD_LIBRARY_PATH` to point to the whisper.cpp build directory and the CUDA libraries.
All arguments passed to run.sh are forwarded to the agent.
Example for verbose logging:

```sh
./run.sh -verbose
```

The vendor-repos directory contains research checkpoints and references. It is not used for the build process.
You can trigger the TTS engine to speak a phrase using the /api/speak endpoint:
```sh
curl -X POST http://localhost:3000/api/speak \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello, world!"}'
```