voice-input

Local voice input with screen-aware context. Push-to-talk on Mac, transcribed by Whisper, refined by a local LLM.

No cloud services. Your voice and screen data never leave your machine.

Features

  • Push-to-talk — Hold left Option key to record, release to transcribe and paste
  • Real-time streaming — Partial transcription shown during recording
  • Screen-aware context — Vision model reads your screen to improve accuracy
  • LLM text refinement — Removes filler words, adds punctuation, fixes recognition errors
  • Multi-language — Japanese, English, Chinese, Korean (auto-detected)
  • Auto-paste + Enter — Result pasted and submitted automatically (hold Ctrl to paste without Enter)

Quick start (Mac, 16 GB)

Works on any Apple Silicon Mac with 16 GB unified memory. No separate server needed.

1. Install prerequisites

  • Apple Silicon Mac (M1/M2/M3/M4) with 16 GB+ unified memory
  • Python 3.11+
  • Ollama

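If you use Homebrew, both tools can be installed from the command line (illustrative; any Python 3.11+ and Ollama installation works, including the Ollama desktop app):

brew install ollama python@3.11
ollama serve &   # or let the Ollama desktop app keep the server running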
2. Setup

git clone https://github.com/xuiltul/voice-input
cd voice-input
python3 -m venv .venv
.venv/bin/pip install -r requirements.txt
ollama pull gemma3:4b

3. Start

Terminal 1 (server):

WHISPER_MODEL=small LLM_MODEL=gemma3:4b .venv/bin/python ws_server.py

Terminal 2 (client):

.venv/bin/python mac_client.py --server ws://localhost:8991 --model gemma3:4b

4. Grant permissions

System Settings > Privacy & Security:

  • Microphone → Terminal
  • Accessibility → Terminal
  • Screen Recording → Terminal

5. Use it

Hold left Option → speak → release → text is pasted.

Memory usage (~9 GB total):

| Component | Memory |
| --- | --- |
| macOS | ~5 GB |
| Whisper small (CPU, int8) | ~1 GB |
| gemma3:4b (Ollama) | ~3 GB |

32 GB+ Mac: Use WHISPER_MODEL=large-v3-turbo LLM_MODEL=qwen2.5:7b for better accuracy. Add ollama pull qwen3-vl:8b-instruct and set VISION_MODEL=qwen3-vl:8b-instruct for screen-aware context.
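Concretely, the 32 GB setup uses the same commands as above with the larger models swapped in, for example:

ollama pull qwen2.5:7b
ollama pull qwen3-vl:8b-instruct
WHISPER_MODEL=large-v3-turbo LLM_MODEL=qwen2.5:7b VISION_MODEL=qwen3-vl:8b-instruct \
  .venv/bin/python ws_server.py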

Auto-start (optional)

Wrap the client in an Automator app so macOS can grant permissions to it:

  1. Open Automator.app → Application → Run Shell Script:
cd ~/voice-input && /usr/bin/python3 mac_client.py --server ws://localhost:8991 --model gemma3:4b
  2. Save as ~/Applications/VoiceInput.app
  3. Grant permissions (Accessibility, Input Monitoring, Microphone, Screen Recording) to VoiceInput.app
  4. Add to Login Items for auto-start

Quick start (Linux + NVIDIA GPU)

For faster inference with a dedicated GPU.

git clone https://github.com/xuiltul/voice-input
cd voice-input
python3 -m venv .venv
.venv/bin/pip install -r requirements.txt

ollama pull gpt-oss:20b
ollama pull qwen3-vl:8b-instruct  # optional: screen context

# LD_LIBRARY_PATH required for pip-installed CUDA libs
PYVER=$(.venv/bin/python -c 'import sys;print(f"python{sys.version_info.major}.{sys.version_info.minor}")')
LD_LIBRARY_PATH=".venv/lib/$PYVER/site-packages/nvidia/cublas/lib:.venv/lib/$PYVER/site-packages/nvidia/cudnn/lib" \
  .venv/bin/python ws_server.py

The Mac client then connects to this server:

pip3 install sounddevice numpy websockets pynput
python3 mac_client.py --server ws://YOUR_SERVER_IP:8991
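The client talks to the server over TCP port 8991, so that port must be reachable from the Mac (open it in the server's firewall if needed). A quick connectivity check:

nc -vz YOUR_SERVER_IP 8991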

Usage

Push-to-talk

| Action | Result |
| --- | --- |
| Hold left Option → speak → release | Transcribe → refine → paste + Enter |
| Hold left Option + Ctrl → speak → release | Transcribe → refine → paste only (no Enter) |

Screen context

When you start recording, a screenshot is captured and analyzed by a vision model. If the analysis completes before you stop recording, the HUD turns green — the LLM will use your screen content to improve accuracy (e.g., technical terms visible on screen).
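The flow is roughly as follows. This is only an illustration of the idea using Ollama's HTTP API directly; the prompt text and file paths are made up, and the real logic lives in ws_server.py / mac_client.py:

# Capture the screen, base64-encode it, and ask the vision model what is on it
screencapture -x /tmp/screen.png
B64=$(base64 -i /tmp/screen.png | tr -d '\n')
cat > /tmp/vision_req.json <<EOF
{"model": "qwen3-vl:8b-instruct",
 "prompt": "List the technical terms visible in this screenshot.",
 "images": ["$B64"],
 "stream": false}
EOF
curl -s http://localhost:11434/api/generate --data-binary @/tmp/vision_req.json

The vision model's answer is then handed to the refinement LLM as extra context, which is what helps it fix mis-recognized technical terms.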

Client options

python3 mac_client.py [options]

  -s, --server URL      WebSocket server (default: ws://localhost:8991)
  -l, --language CODE   Language hint (default: ja)
  -m, --model NAME      Ollama model for refinement
  --raw                 Skip LLM refinement, Whisper output only
  --no-screenshot       Disable screen context
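For example, English dictation with refinement and screen context disabled (raw Whisper output only):

python3 mac_client.py --server ws://localhost:8991 -l en --raw --no-screenshot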

Configuration

Environment variables

| Variable | Default | Description |
| --- | --- | --- |
| LLM_MODEL | gpt-oss:20b | Ollama model for text refinement |
| WHISPER_MODEL | large-v3-turbo | Whisper model (small, medium, large-v3-turbo) |
| WHISPER_DEVICE | auto | auto (CUDA if available, else CPU), cuda, cpu |
| WHISPER_COMPUTE_TYPE | default | float16 (CUDA) or int8 (CPU) |
| VISION_MODEL | qwen3-vl:8b-instruct | Vision model for screen context |
| VISION_SERVERS | (local Ollama) | Remote Ollama URLs for vision (comma-separated) |
| DEFAULT_LANGUAGE | ja | Default language |
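For example, to keep text refinement on the local Ollama but send screenshots to a separate Ollama host for vision analysis (gpu-box is a placeholder hostname):

VISION_SERVERS=http://gpu-box:11434 VISION_MODEL=qwen3-vl:8b-instruct \
  .venv/bin/python ws_server.py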

Recommended models by hardware

| Hardware | Whisper | LLM | Vision | Total memory |
| --- | --- | --- | --- | --- |
| Mac 16 GB | small | gemma3:4b | (skip) | ~9 GB |
| Mac 32 GB | large-v3-turbo | qwen2.5:7b | qwen3-vl:8b-instruct | ~20 GB |
| Linux GPU 24 GB | large-v3-turbo | gpt-oss:20b | (remote) | ~15 GB VRAM |

Multi-language prompts

Refinement prompts are in prompts/ as JSON:

prompts/
├── ja.json    # Japanese (default)
├── en.json    # English
├── zh.json    # Chinese
└── ko.json    # Korean

To add a new language, create prompts/{code}.json:

{
  "system_prompt": "...",
  "user_template": "Format this:\n```\n{text}\n```",
  "few_shot": [
    { "user": "...", "assistant": "..." }
  ]
}

Prompt design note: Keep the system prompt concise (~400 chars). Small models (7B-20B) with think: "low" degrade when the prompt is too long — they drop content instead of formatting it. Use max 2 few-shot examples.

Advanced

Docker

docker build -t voice-input .
docker run --gpus all -p 8991:8991 \
  -e LLM_MODEL=gpt-oss:20b \
  -v ollama-data:/root/.ollama \
  voice-input
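The Mac client then connects exactly as in the Linux quick start, pointing at whichever host runs the container (DOCKER_HOST_IP is a placeholder):

python3 mac_client.py --server ws://DOCKER_HOST_IP:8991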

HTTP API

# Transcribe + refine
curl -X POST http://localhost:8990/transcribe \
  -H "Content-Type: audio/wav" \
  --data-binary @recording.wav

# Whisper only
curl -X POST "http://localhost:8990/transcribe?raw=true" \
  -H "Content-Type: audio/wav" \
  --data-binary @recording.wav

Response:

{
  "text": "Refined text",
  "raw_text": "Raw Whisper output",
  "language": "ja",
  "duration": 5.2,
  "processing_time": { "transcribe": 0.3, "refine": 4.1 }
}
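
In scripts you usually only need the refined text; one way to extract it (assuming python3 is available):

curl -s -X POST http://localhost:8990/transcribe \
  -H "Content-Type: audio/wav" --data-binary @recording.wav \
  | python3 -c 'import json, sys; print(json.load(sys.stdin)["text"])'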

Architecture

| Component | Role |
| --- | --- |
| voice_input.py | Core pipeline: Whisper + Ollama LLM + Vision |
| ws_server.py | WebSocket server (port 8991) |
| mac_client.py | Push-to-talk client with HUD overlay |
| prompts/ | Language-specific refinement prompts |

Whisper auto-detects CUDA/CPU. On CUDA, uses float16; on CPU, uses int8.
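To override the auto-detection, set the corresponding environment variables when starting the server, e.g. to force CPU inference:

WHISPER_DEVICE=cpu WHISPER_COMPUTE_TYPE=int8 .venv/bin/python ws_server.py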

Voice slash commands (Claude Code)

Say "コマンド" ("command" in Japanese) followed by a command name to input a slash command:

  • 「コマンド ヘルプ」 ("command help") → /help
  • 「コマンド コミット」 ("command commit") → /commit
  • "command compact" → /compact

Commands auto-loaded from ~/.claude/skills/*/SKILL.md at startup.

Why?

Cloud voice input sends your audio and screen to external servers. This tool runs everything locally — your data never leaves your machine.


Japanese guide (日本語ガイド)

What is this?

A push-to-talk voice input tool for Mac: speech is transcribed by Whisper and cleaned up by a local LLM. No cloud services are involved, and neither your audio nor your screen data leaves your machine.

Setup on a MacBook (16 GB)

Requirements:

  • Apple Silicon Mac (M1/M2/M3/M4, 16 GB or more)
  • Python 3.11+
  • Ollama

Steps:

# Clone & set up
git clone https://github.com/xuiltul/voice-input
cd voice-input
python3 -m venv .venv
.venv/bin/pip install -r requirements.txt
ollama pull gemma3:4b

# Start the server (Terminal 1)
WHISPER_MODEL=small LLM_MODEL=gemma3:4b .venv/bin/python ws_server.py

# Start the client (Terminal 2)
.venv/bin/python mac_client.py --server ws://localhost:8991 --model gemma3:4b

macOS permissions (System Settings > Privacy & Security):

  • Microphone → Terminal
  • Accessibility → Terminal
  • Screen Recording → Terminal

How to use

  • Hold the left Option key and speak → release to transcribe → refine → auto-paste + Enter
  • Hold left Option + Ctrl and speak → paste only (no Enter)

Memory usage

| Component | Memory |
| --- | --- |
| macOS | ~5 GB |
| Whisper small (CPU) | ~1 GB |
| gemma3:4b (Ollama) | ~3 GB |
| Total | ~9 GB |

On a Mac with 32 GB or more, WHISPER_MODEL=large-v3-turbo LLM_MODEL=qwen2.5:7b gives better accuracy.

License

MIT
