Self-hosted voice-enabled AI assistant on hardware you already own. Open-source stack, no cloud APIs, no recurring costs. The personality is yours — the plumbing is here.
A blueprint for building a voice-enabled local AI assistant — the kind that runs on your own hardware, gives you a custom voice character, and doesn't ship a single token to OpenAI or Anthropic. The interesting part isn't any single component (they all exist as open-source projects). The interesting part is the architecture: how the pieces fit together into a coherent voice assistant you can talk to.
This repo provides:
- A generalized voice proxy (
proxy.py) that adds pitch/tempo transformation to any OpenAI-compatible TTS service — give your assistant any voice character without retraining models - A Modelfile template showing the structure of a well-tuned conversational AI persona for Ollama
- A voice config template mapping OpenAI-style voice names to specific Piper TTS speakers
- Architecture documentation explaining how the components connect
The actual personality of your assistant is yours to define. This repo gives you the engineering substrate; you bring the soul.
graph LR
Client[Voice client<br/>Open WebUI / app / phone]
Proxy[voice-proxy :8002<br/>Pitch + tempo shift]
TTS[OpenedAI Speech :8001<br/>Piper TTS engine]
LLM[Ollama :11434<br/>Your custom Modelfile]
Voices[(Piper voice<br/>model files)]
Brain[(Your Modelfile<br/>= the personality)]
Client -->|TTS request| Proxy
Proxy -->|forward| TTS
TTS -->|generated audio| Proxy
Proxy -->|pitch-shifted audio| Client
Client -->|text completion| LLM
TTS -.reads.-> Voices
LLM -.reads.-> Brain
- Ollama runs the language model with your custom personality defined via a Modelfile. This is the assistant's "brain" — its conversational behavior, voice (textual), personality, hard rules.
- OpenedAI Speech provides an OpenAI-compatible TTS API that the client can call. Under the hood it uses Piper TTS models — small, fast, CPU-friendly neural voices.
- The voice proxy (this repo's
proxy.py) sits between the client and OpenedAI Speech. It forwards TTS requests, then post-processes the audio with ffmpeg to apply pitch/tempo transformation — giving the voice a custom character without needing different voice models. - The client can be anything OpenAI-API-compatible: Open WebUI (most common), a custom app, voice-input integrations, etc.
The whole stack runs locally. No tokens leave your machine.
Most local TTS solutions give you a small set of pre-trained voice models. If none of them sound the way you want your assistant to sound, your options have traditionally been:
- Train your own voice model (hours of recording + GPU training time)
- Pay for a cloud service with more voices ($$$ + privacy tradeoff)
- Accept whatever's available
The proxy approach is a cheap third option: take any existing voice and transform its character with classical signal processing. ffmpeg's asetrate + atempo filter chain lets you:
- Raise pitch (
PITCH_MULTIPLIER > 1.0) for a brighter, more youthful voice character - Lower pitch (
PITCH_MULTIPLIER < 1.0) for a deeper, more authoritative voice character - Compensate speech rate (
TEMPO_COMPENSATION = 1/PITCH_MULTIPLIERapproximately) so the result sounds natural at normal speaking speed
It's not as flexible as a custom-trained voice, but it's a one-line config change to give your assistant a distinctive sound that's neither the upstream's default nor a recognizable cloud voice.
You need three services running, plus optionally a client:
| Service | What it does | Repo |
|---|---|---|
| Ollama | Local LLM serving | https://ollama.com |
| OpenedAI Speech | TTS API (Piper-backed) | https://github.com/matatonic/openedai-speech |
| voice-proxy (this repo) | Pitch/tempo transformation | this repo |
| Open WebUI (optional) | Web UI with voice support | https://github.com/open-webui/open-webui |
# 1. Install Ollama + pull a base model
curl -fsSL https://ollama.com/install.sh | sh
ollama pull qwen2.5:7b
# 2. Build your custom assistant from the Modelfile template
cp Modelfile.example Modelfile
nano Modelfile # edit the SYSTEM block with your assistant's persona
ollama create my-assistant -f Modelfile
# 3. Install OpenedAI Speech (Docker is simplest)
docker run -d --name openedai-speech -p 8001:8000 \
-v ./voices:/app/voices \
-v ./voice_to_speaker.yaml:/app/voice_to_speaker.yaml \
ghcr.io/matatonic/openedai-speech
# 4. Copy + edit the voice config
cp voice_to_speaker.example.yaml voice_to_speaker.yaml
nano voice_to_speaker.yaml # adjust speaker IDs as desired
# 5. Run the voice proxy with your chosen voice character
PITCH_MULTIPLIER=1.25 TEMPO_COMPENSATION=0.8 \
DEFAULT_VOICE=nova \
UPSTREAM=http://localhost:8001 \
python proxy.py
# 6. Point your client (Open WebUI, etc.) at the proxy:
# TTS endpoint: http://localhost:8002/v1/audio/speechAll proxy settings are env-vars. See proxy.py docstring for the full list.
| Variable | Default | Purpose |
|---|---|---|
UPSTREAM |
http://localhost:8001 |
URL of the underlying TTS service |
DEFAULT_VOICE |
None |
If set, override client's voice choice (useful for consistency) |
PITCH_MULTIPLIER |
1.0 |
Frequency multiplier (>1 brighter, <1 deeper) |
TEMPO_COMPENSATION |
1.0 |
Speed adjustment to keep speech rate natural after pitch shift |
HOST / PORT |
0.0.0.0 / 8002 |
Where the proxy listens |
If both PITCH_MULTIPLIER and TEMPO_COMPENSATION are 1.0, the proxy is a pass-through with no ffmpeg overhead.
This stack is designed to run on modest consumer hardware. Tested working on:
- Linux Mint laptops (x86_64, 7-8GB RAM)
- Termux + Debian proot on Android (aarch64, 5GB+ RAM)
For the LLM: pick your Ollama model based on available RAM:
- ~2GB:
qwen2.5:1.5b,llama3.2:1b - ~4GB:
qwen2.5:3b,llama3.2:3b - ~8GB:
qwen2.5:7b,llama3.1:8b(recommended for conversational quality) - 10GB+:
qwen2.5:14b,llama3.1:70b(if you have the RAM)
TTS is much lighter — Piper models are 60-100MB each and run comfortably on CPU.
This stack composes well with hydra-cluster — my heterogeneous AI inference cluster — to distribute the workload:
- Heavy node runs Ollama with the bigger model
- Lighter node runs OpenedAI Speech + the voice proxy
- Main node runs Open WebUI as the front-end and routes traffic over a Tailscale mesh
This way a phone or low-spec device can be a usable interface to an assistant that's actually computing on more capable hardware elsewhere in your network. The proxy + TTS layer is light enough to run on a phone or tablet, while the LLM lives on whatever has the most RAM.
- Python 3.10+ with FastAPI + httpx for the proxy
- ffmpeg for pitch/tempo audio transformation
- Ollama for local LLM serving with Modelfile-based persona definition
- OpenedAI Speech as the TTS engine (OpenAI-API-compatible)
- Piper TTS voice models (downloaded separately from the openedai-speech repo)
- Open WebUI (optional, as the client front-end)
- The proxy pattern is more powerful than it looks. A small FastAPI app + 5 lines of ffmpeg gives you voice character customization that would otherwise require model retraining. Cheap engineering wins.
asetrate+atempois the magic combo. Pitch-only filters (likerubberband) sound better but are slow and not always available. Theasetrate/atempochain is in every ffmpeg build and runs in real time on a Raspberry Pi.- Modelfile design matters more than model size. A well-tuned 3B-parameter model with a thoughtful Modelfile often feels more pleasant to talk to than a poorly-prompted 14B model. Spend time on the persona, not just the compute.
- OpenAI-compatible APIs are an underrated standardization. OpenedAI Speech, Ollama, Open WebUI all speak the same dialect — you can swap any piece without breaking the others. Makes experimentation easy.
- ✅ Voice proxy functional with configurable pitch/tempo
- ✅ Tested with OpenedAI Speech + Piper voices
- ✅ Tested with Ollama as the LLM backend
- ✅ Verified working in cluster context (hydra-cluster)
- 🔲 STT (speech-to-text) integration — currently relies on the client (e.g., Open WebUI) to do this
- 🔲 Streaming responses for lower latency
- 🔲 Docker compose file for one-shot deployment
Being clear about scope:
- Not a complete deploy-and-use product. This is a stack/architecture you assemble. The proxy is small (~150 LOC); the value is in the pattern + the documentation of how the pieces connect.
- Not bundled with voice models. The Piper voice files are 60-100MB each and have their own licenses (mostly free-for-personal-use, check each model's source). Fetch them from openedai-speech's repo or huggingface.
- Not bundled with a specific assistant personality. That's intentional — your assistant should be yours, not mine. Use
Modelfile.exampleas the structural template.
MIT — see LICENSE. Use, adapt, and learn from this freely. If you build something cool with it, I'd love to hear about it.
Joshua Jen Robiano Pujante — BSAIS student at Saint Paul School of Professional Studies, Tacloban City, Philippines.
LinkedIn · Companion repos: hydra-cluster · openutau-headless · sariling-analyst