voice-ai-stack

Self-hosted voice-enabled AI assistant on hardware you already own. Open-source stack, no cloud APIs, no recurring costs. The personality is yours — the plumbing is here.

What this is

A blueprint for building a voice-enabled local AI assistant — the kind that runs on your own hardware, gives you a custom voice character, and doesn't ship a single token to OpenAI or Anthropic. The interesting part isn't any single component (they all exist as open-source projects). The interesting part is the architecture: how the pieces fit together into a coherent voice assistant you can talk to.

This repo provides:

A generalized voice proxy (proxy.py) that adds pitch/tempo transformation to any OpenAI-compatible TTS service — give your assistant any voice character without retraining models
A Modelfile template showing the structure of a well-tuned conversational AI persona for Ollama
A voice config template mapping OpenAI-style voice names to specific Piper TTS speakers
Architecture documentation explaining how the components connect

The actual personality of your assistant is yours to define. This repo gives you the engineering substrate; you bring the soul.

Architecture

graph LR
    Client[Voice client<br/>Open WebUI / app / phone]
    Proxy[voice-proxy :8002<br/>Pitch + tempo shift]
    TTS[OpenedAI Speech :8001<br/>Piper TTS engine]
    LLM[Ollama :11434<br/>Your custom Modelfile]
    Voices[(Piper voice<br/>model files)]
    Brain[(Your Modelfile<br/>= the personality)]

    Client -->|TTS request| Proxy
    Proxy -->|forward| TTS
    TTS -->|generated audio| Proxy
    Proxy -->|pitch-shifted audio| Client
    Client -->|text completion| LLM
    TTS -.reads.-> Voices
    LLM -.reads.-> Brain

How the pieces fit

Ollama runs the language model with your custom personality defined via a Modelfile. This is the assistant's "brain" — its conversational behavior, voice (textual), personality, hard rules.
OpenedAI Speech provides an OpenAI-compatible TTS API that the client can call. Under the hood it uses Piper TTS models — small, fast, CPU-friendly neural voices.
The voice proxy (this repo's proxy.py) sits between the client and OpenedAI Speech. It forwards TTS requests, then post-processes the audio with ffmpeg to apply pitch/tempo transformation — giving the voice a custom character without needing different voice models.
The client can be anything OpenAI-API-compatible: Open WebUI (most common), a custom app, voice-input integrations, etc.

The whole stack runs locally. No tokens leave your machine.

The voice proxy pattern

Most local TTS solutions give you a small set of pre-trained voice models. If none of them sound the way you want your assistant to sound, your options have traditionally been:

Train your own voice model (hours of recording + GPU training time)
Pay for a cloud service with more voices ($$$ + privacy tradeoff)
Accept whatever's available

The proxy approach is a cheap third option: take any existing voice and transform its character with classical signal processing. ffmpeg's asetrate + atempo filter chain lets you:

Raise pitch (PITCH_MULTIPLIER > 1.0) for a brighter, more youthful voice character
Lower pitch (PITCH_MULTIPLIER < 1.0) for a deeper, more authoritative voice character
Compensate speech rate (TEMPO_COMPENSATION = 1/PITCH_MULTIPLIER approximately) so the result sounds natural at normal speaking speed

It's not as flexible as a custom-trained voice, but it's a one-line config change to give your assistant a distinctive sound that's neither the upstream's default nor a recognizable cloud voice.

Setup

Prerequisites

You need three services running, plus optionally a client:

Service	What it does	Repo
Ollama	Local LLM serving	https://ollama.com
OpenedAI Speech	TTS API (Piper-backed)	https://github.com/matatonic/openedai-speech
voice-proxy (this repo)	Pitch/tempo transformation	this repo
Open WebUI (optional)	Web UI with voice support	https://github.com/open-webui/open-webui

Installation

# 1. Install Ollama + pull a base model
curl -fsSL https://ollama.com/install.sh | sh
ollama pull qwen2.5:7b

# 2. Build your custom assistant from the Modelfile template
cp Modelfile.example Modelfile
nano Modelfile             # edit the SYSTEM block with your assistant's persona
ollama create my-assistant -f Modelfile

# 3. Install OpenedAI Speech (Docker is simplest)
docker run -d --name openedai-speech -p 8001:8000 \
    -v ./voices:/app/voices \
    -v ./voice_to_speaker.yaml:/app/voice_to_speaker.yaml \
    ghcr.io/matatonic/openedai-speech

# 4. Copy + edit the voice config
cp voice_to_speaker.example.yaml voice_to_speaker.yaml
nano voice_to_speaker.yaml      # adjust speaker IDs as desired

# 5. Run the voice proxy with your chosen voice character
PITCH_MULTIPLIER=1.25 TEMPO_COMPENSATION=0.8 \
DEFAULT_VOICE=nova \
UPSTREAM=http://localhost:8001 \
python proxy.py

# 6. Point your client (Open WebUI, etc.) at the proxy:
#    TTS endpoint: http://localhost:8002/v1/audio/speech

Configuration

All proxy settings are env-vars. See proxy.py docstring for the full list.

Variable	Default	Purpose
`UPSTREAM`	`http://localhost:8001`	URL of the underlying TTS service
`DEFAULT_VOICE`	`None`	If set, override client's voice choice (useful for consistency)
`PITCH_MULTIPLIER`	`1.0`	Frequency multiplier (>1 brighter, <1 deeper)
`TEMPO_COMPENSATION`	`1.0`	Speed adjustment to keep speech rate natural after pitch shift
`HOST` / `PORT`	`0.0.0.0` / `8002`	Where the proxy listens

If both PITCH_MULTIPLIER and TEMPO_COMPENSATION are 1.0, the proxy is a pass-through with no ffmpeg overhead.

Hardware notes

This stack is designed to run on modest consumer hardware. Tested working on:

Linux Mint laptops (x86_64, 7-8GB RAM)
Termux + Debian proot on Android (aarch64, 5GB+ RAM)

For the LLM: pick your Ollama model based on available RAM:

~2GB: qwen2.5:1.5b, llama3.2:1b
~4GB: qwen2.5:3b, llama3.2:3b
~8GB: qwen2.5:7b, llama3.1:8b (recommended for conversational quality)
10GB+: qwen2.5:14b, llama3.1:70b (if you have the RAM)

TTS is much lighter — Piper models are 60-100MB each and run comfortably on CPU.

Integration with a cluster

This stack composes well with hydra-cluster — my heterogeneous AI inference cluster — to distribute the workload:

Heavy node runs Ollama with the bigger model
Lighter node runs OpenedAI Speech + the voice proxy
Main node runs Open WebUI as the front-end and routes traffic over a Tailscale mesh

This way a phone or low-spec device can be a usable interface to an assistant that's actually computing on more capable hardware elsewhere in your network. The proxy + TTS layer is light enough to run on a phone or tablet, while the LLM lives on whatever has the most RAM.

Tech stack

Python 3.10+ with FastAPI + httpx for the proxy
ffmpeg for pitch/tempo audio transformation
Ollama for local LLM serving with Modelfile-based persona definition
OpenedAI Speech as the TTS engine (OpenAI-API-compatible)
Piper TTS voice models (downloaded separately from the openedai-speech repo)
Open WebUI (optional, as the client front-end)

Lessons learned

The proxy pattern is more powerful than it looks. A small FastAPI app + 5 lines of ffmpeg gives you voice character customization that would otherwise require model retraining. Cheap engineering wins.
asetrate + atempo is the magic combo. Pitch-only filters (like rubberband) sound better but are slow and not always available. The asetrate/atempo chain is in every ffmpeg build and runs in real time on a Raspberry Pi.
Modelfile design matters more than model size. A well-tuned 3B-parameter model with a thoughtful Modelfile often feels more pleasant to talk to than a poorly-prompted 14B model. Spend time on the persona, not just the compute.
OpenAI-compatible APIs are an underrated standardization. OpenedAI Speech, Ollama, Open WebUI all speak the same dialect — you can swap any piece without breaking the others. Makes experimentation easy.

Status

✅ Voice proxy functional with configurable pitch/tempo
✅ Tested with OpenedAI Speech + Piper voices
✅ Tested with Ollama as the LLM backend
✅ Verified working in cluster context (hydra-cluster)
🔲 STT (speech-to-text) integration — currently relies on the client (e.g., Open WebUI) to do this
🔲 Streaming responses for lower latency
🔲 Docker compose file for one-shot deployment

What this is NOT

Being clear about scope:

Not a complete deploy-and-use product. This is a stack/architecture you assemble. The proxy is small (~150 LOC); the value is in the pattern + the documentation of how the pieces connect.
Not bundled with voice models. The Piper voice files are 60-100MB each and have their own licenses (mostly free-for-personal-use, check each model's source). Fetch them from openedai-speech's repo or huggingface.
Not bundled with a specific assistant personality. That's intentional — your assistant should be yours, not mine. Use Modelfile.example as the structural template.

License

MIT — see LICENSE. Use, adapt, and learn from this freely. If you build something cool with it, I'd love to hear about it.

Author

Joshua Jen Robiano Pujante — BSAIS student at Saint Paul School of Professional Studies, Tacloban City, Philippines.

LinkedIn · Companion repos: hydra-cluster · openutau-headless · sariling-analyst

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
.gitignore		.gitignore
LICENSE		LICENSE
Modelfile.example		Modelfile.example
README.md		README.md
proxy.py		proxy.py
voice_to_speaker.example.yaml		voice_to_speaker.example.yaml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

voice-ai-stack

What this is

Architecture

How the pieces fit

The voice proxy pattern

Setup

Prerequisites

Installation

Configuration

Hardware notes

Integration with a cluster

Tech stack

Lessons learned

Status

What this is NOT

License

Author

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

voice-ai-stack

What this is

Architecture

How the pieces fit

The voice proxy pattern

Setup

Prerequisites

Installation

Configuration

Hardware notes

Integration with a cluster

Tech stack

Lessons learned

Status

What this is NOT

License

Author

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages