Skip to content

Jenkikan01/voice-ai-stack

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1 Commit
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

voice-ai-stack

Self-hosted voice-enabled AI assistant on hardware you already own. Open-source stack, no cloud APIs, no recurring costs. The personality is yours — the plumbing is here.

License Python Stack

What this is

A blueprint for building a voice-enabled local AI assistant — the kind that runs on your own hardware, gives you a custom voice character, and doesn't ship a single token to OpenAI or Anthropic. The interesting part isn't any single component (they all exist as open-source projects). The interesting part is the architecture: how the pieces fit together into a coherent voice assistant you can talk to.

This repo provides:

  • A generalized voice proxy (proxy.py) that adds pitch/tempo transformation to any OpenAI-compatible TTS service — give your assistant any voice character without retraining models
  • A Modelfile template showing the structure of a well-tuned conversational AI persona for Ollama
  • A voice config template mapping OpenAI-style voice names to specific Piper TTS speakers
  • Architecture documentation explaining how the components connect

The actual personality of your assistant is yours to define. This repo gives you the engineering substrate; you bring the soul.

Architecture

graph LR
    Client[Voice client<br/>Open WebUI / app / phone]
    Proxy[voice-proxy :8002<br/>Pitch + tempo shift]
    TTS[OpenedAI Speech :8001<br/>Piper TTS engine]
    LLM[Ollama :11434<br/>Your custom Modelfile]
    Voices[(Piper voice<br/>model files)]
    Brain[(Your Modelfile<br/>= the personality)]

    Client -->|TTS request| Proxy
    Proxy -->|forward| TTS
    TTS -->|generated audio| Proxy
    Proxy -->|pitch-shifted audio| Client
    Client -->|text completion| LLM
    TTS -.reads.-> Voices
    LLM -.reads.-> Brain
Loading

How the pieces fit

  1. Ollama runs the language model with your custom personality defined via a Modelfile. This is the assistant's "brain" — its conversational behavior, voice (textual), personality, hard rules.
  2. OpenedAI Speech provides an OpenAI-compatible TTS API that the client can call. Under the hood it uses Piper TTS models — small, fast, CPU-friendly neural voices.
  3. The voice proxy (this repo's proxy.py) sits between the client and OpenedAI Speech. It forwards TTS requests, then post-processes the audio with ffmpeg to apply pitch/tempo transformation — giving the voice a custom character without needing different voice models.
  4. The client can be anything OpenAI-API-compatible: Open WebUI (most common), a custom app, voice-input integrations, etc.

The whole stack runs locally. No tokens leave your machine.

The voice proxy pattern

Most local TTS solutions give you a small set of pre-trained voice models. If none of them sound the way you want your assistant to sound, your options have traditionally been:

  • Train your own voice model (hours of recording + GPU training time)
  • Pay for a cloud service with more voices ($$$ + privacy tradeoff)
  • Accept whatever's available

The proxy approach is a cheap third option: take any existing voice and transform its character with classical signal processing. ffmpeg's asetrate + atempo filter chain lets you:

  • Raise pitch (PITCH_MULTIPLIER > 1.0) for a brighter, more youthful voice character
  • Lower pitch (PITCH_MULTIPLIER < 1.0) for a deeper, more authoritative voice character
  • Compensate speech rate (TEMPO_COMPENSATION = 1/PITCH_MULTIPLIER approximately) so the result sounds natural at normal speaking speed

It's not as flexible as a custom-trained voice, but it's a one-line config change to give your assistant a distinctive sound that's neither the upstream's default nor a recognizable cloud voice.

Setup

Prerequisites

You need three services running, plus optionally a client:

Service What it does Repo
Ollama Local LLM serving https://ollama.com
OpenedAI Speech TTS API (Piper-backed) https://github.com/matatonic/openedai-speech
voice-proxy (this repo) Pitch/tempo transformation this repo
Open WebUI (optional) Web UI with voice support https://github.com/open-webui/open-webui

Installation

# 1. Install Ollama + pull a base model
curl -fsSL https://ollama.com/install.sh | sh
ollama pull qwen2.5:7b

# 2. Build your custom assistant from the Modelfile template
cp Modelfile.example Modelfile
nano Modelfile             # edit the SYSTEM block with your assistant's persona
ollama create my-assistant -f Modelfile

# 3. Install OpenedAI Speech (Docker is simplest)
docker run -d --name openedai-speech -p 8001:8000 \
    -v ./voices:/app/voices \
    -v ./voice_to_speaker.yaml:/app/voice_to_speaker.yaml \
    ghcr.io/matatonic/openedai-speech

# 4. Copy + edit the voice config
cp voice_to_speaker.example.yaml voice_to_speaker.yaml
nano voice_to_speaker.yaml      # adjust speaker IDs as desired

# 5. Run the voice proxy with your chosen voice character
PITCH_MULTIPLIER=1.25 TEMPO_COMPENSATION=0.8 \
DEFAULT_VOICE=nova \
UPSTREAM=http://localhost:8001 \
python proxy.py

# 6. Point your client (Open WebUI, etc.) at the proxy:
#    TTS endpoint: http://localhost:8002/v1/audio/speech

Configuration

All proxy settings are env-vars. See proxy.py docstring for the full list.

Variable Default Purpose
UPSTREAM http://localhost:8001 URL of the underlying TTS service
DEFAULT_VOICE None If set, override client's voice choice (useful for consistency)
PITCH_MULTIPLIER 1.0 Frequency multiplier (>1 brighter, <1 deeper)
TEMPO_COMPENSATION 1.0 Speed adjustment to keep speech rate natural after pitch shift
HOST / PORT 0.0.0.0 / 8002 Where the proxy listens

If both PITCH_MULTIPLIER and TEMPO_COMPENSATION are 1.0, the proxy is a pass-through with no ffmpeg overhead.

Hardware notes

This stack is designed to run on modest consumer hardware. Tested working on:

  • Linux Mint laptops (x86_64, 7-8GB RAM)
  • Termux + Debian proot on Android (aarch64, 5GB+ RAM)

For the LLM: pick your Ollama model based on available RAM:

  • ~2GB: qwen2.5:1.5b, llama3.2:1b
  • ~4GB: qwen2.5:3b, llama3.2:3b
  • ~8GB: qwen2.5:7b, llama3.1:8b (recommended for conversational quality)
  • 10GB+: qwen2.5:14b, llama3.1:70b (if you have the RAM)

TTS is much lighter — Piper models are 60-100MB each and run comfortably on CPU.

Integration with a cluster

This stack composes well with hydra-cluster — my heterogeneous AI inference cluster — to distribute the workload:

  • Heavy node runs Ollama with the bigger model
  • Lighter node runs OpenedAI Speech + the voice proxy
  • Main node runs Open WebUI as the front-end and routes traffic over a Tailscale mesh

This way a phone or low-spec device can be a usable interface to an assistant that's actually computing on more capable hardware elsewhere in your network. The proxy + TTS layer is light enough to run on a phone or tablet, while the LLM lives on whatever has the most RAM.

Tech stack

  • Python 3.10+ with FastAPI + httpx for the proxy
  • ffmpeg for pitch/tempo audio transformation
  • Ollama for local LLM serving with Modelfile-based persona definition
  • OpenedAI Speech as the TTS engine (OpenAI-API-compatible)
  • Piper TTS voice models (downloaded separately from the openedai-speech repo)
  • Open WebUI (optional, as the client front-end)

Lessons learned

  • The proxy pattern is more powerful than it looks. A small FastAPI app + 5 lines of ffmpeg gives you voice character customization that would otherwise require model retraining. Cheap engineering wins.
  • asetrate + atempo is the magic combo. Pitch-only filters (like rubberband) sound better but are slow and not always available. The asetrate/atempo chain is in every ffmpeg build and runs in real time on a Raspberry Pi.
  • Modelfile design matters more than model size. A well-tuned 3B-parameter model with a thoughtful Modelfile often feels more pleasant to talk to than a poorly-prompted 14B model. Spend time on the persona, not just the compute.
  • OpenAI-compatible APIs are an underrated standardization. OpenedAI Speech, Ollama, Open WebUI all speak the same dialect — you can swap any piece without breaking the others. Makes experimentation easy.

Status

  • ✅ Voice proxy functional with configurable pitch/tempo
  • ✅ Tested with OpenedAI Speech + Piper voices
  • ✅ Tested with Ollama as the LLM backend
  • ✅ Verified working in cluster context (hydra-cluster)
  • 🔲 STT (speech-to-text) integration — currently relies on the client (e.g., Open WebUI) to do this
  • 🔲 Streaming responses for lower latency
  • 🔲 Docker compose file for one-shot deployment

What this is NOT

Being clear about scope:

  • Not a complete deploy-and-use product. This is a stack/architecture you assemble. The proxy is small (~150 LOC); the value is in the pattern + the documentation of how the pieces connect.
  • Not bundled with voice models. The Piper voice files are 60-100MB each and have their own licenses (mostly free-for-personal-use, check each model's source). Fetch them from openedai-speech's repo or huggingface.
  • Not bundled with a specific assistant personality. That's intentional — your assistant should be yours, not mine. Use Modelfile.example as the structural template.

License

MIT — see LICENSE. Use, adapt, and learn from this freely. If you build something cool with it, I'd love to hear about it.

Author

Joshua Jen Robiano Pujante — BSAIS student at Saint Paul School of Professional Studies, Tacloban City, Philippines.

LinkedIn · Companion repos: hydra-cluster · openutau-headless · sariling-analyst

About

Self-hosted voice-enabled AI assistant blueprint — Ollama + OpenedAI Speech + Piper + a pitch-shifting voice character proxy. No cloud, runs on consumer hardware.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages