Releases: jamiepine/voicebox
v0.5.0
The Capture release.
Voicebox stops being just a voice-cloning studio and becomes a full AI voice studio. Hold a key anywhere on your machine, speak, release — the transcript lands in the focused text field. Flip the primitive around and any MCP-aware agent — Claude Code, Cursor, Spacebot — speaks back through an on-screen pill in one of your cloned voices. A local LLM sits between the two, so transcripts come out clean and voice profiles can carry a personality that reshapes what the agent says before it gets spoken.
Dictation — speak anywhere, paste anywhere
- Global hotkey capture. Hold a customizable chord anywhere on your machine (defaults: right-Cmd + right-Option on macOS, right-Ctrl + right-Shift on Windows), speak, release. A floating on-screen pill walks through recording → transcribing → refining → done with a live elapsed timer. The transcript lands as clean text.
- Push-to-talk and toggle modes, each with its own chord. The default toggle chord adds Space to the push-to-talk chord. Holding PTT and tapping Space mid-hold upgrades a hold into a hands-free session without a gap in the recording.
- Auto-paste into the focused app. Once transcription finishes, Voicebox synthesizes a paste into whatever text field had focus when you started the chord — not wherever focus drifted while you were talking. Works across Dvorak / AZERTY layouts. Your clipboard is saved before and restored after.
- Chord picker UI. Customize either chord from Settings → Captures by holding the keys you want. Left/right modifier badges show whether a key is the left or right variant.
- Defaults stay out of your way. macOS defaults avoid left-hand Cmd+Option chords so the system shortcuts they collide with stay yours. Windows defaults route around AltGr collisions on German / French / Spanish layouts.
- Accessibility permission is scoped. If macOS Accessibility isn't granted, dictation still runs and transcripts still land in the Captures tab — only synthetic paste is disabled. The permission prompt lives inline next to the auto-paste toggle, not as a global banner.
Personality — voice profiles that speak for themselves
Voice profiles now carry an optional personality — a free-form description of who this voice is, up to 2000 characters. When set, two new controls appear next to the generate button, each powered by a new Qwen3 LLM running entirely locally:
- Compose — the shuffle button drops a fresh in-character line into the textarea. Click again for variety, edit before speaking.
- Speak in character — the wand toggle runs your input through the personality LLM before TTS, preserving every idea but delivering it in the character's voice.
The same LLM doubles as the refinement model, so there's one local LLM in the app, not two.
API surface. POST /generate, POST /speak, and the MCP voicebox.speak tool accept personality: bool. POST /profiles/{id}/compose powers the shuffle button. MCP client bindings carry a default_personality: bool that applies when personality isn't passed explicitly.
Agents — any MCP-aware agent gets a voice
Voicebox ships a built-in Model Context Protocol server at http://127.0.0.1:17493/mcp so Claude Code, Cursor, Windsurf, Cline, VS Code MCP extensions — any MCP-aware agent — can call into your local Voicebox install. Four tools ship with dotted names:
-
voicebox.speak— speak text in any voice profile, with optionalpersonality: trueto run through the profile's personality LLM first -
voicebox.transcribe— Whisper transcription of a base64 blob or an absolute local path. Path mode is restricted to loopback callers so a Voicebox bound on0.0.0.0doesn't double as an unauthenticated arbitrary-local-file read primitive. -
voicebox.list_captures— recent captures with their transcripts -
voicebox.list_profiles— available voice profiles (cloned + preset) -
Streamable HTTP as primary transport. Cursor / Windsurf / VS Code / Claude Code all support it out of the box — drop a
mcpServersblock with the URL and anX-Voicebox-Client-Idheader. -
Stdio shim for clients that don't speak HTTP MCP. A
voicebox-mcpbinary ships inside the app bundle as a Tauri sidecar. The Settings page renders the install snippet with the right absolute path pre-filled. -
Per-client voice binding. Pin Claude Code to Morgan, Cursor to Scarlett, Cline to its own voice — the
X-Voicebox-Client-Idheader resolves to a bound voice wheneverspeakis called without an explicitprofile. Managed in Settings → MCP. -
Profile resolution precedence. Explicit
profilearg (name or id, case-insensitive) → per-client binding → global default fromcapture_settings.default_playback_voice_id→ error with a pointer to Settings. -
Speaking pill. Agent-initiated speech surfaces the same on-screen pill as dictation, in a
speakingstate with the profile name and an elapsed timer. Silent background TTS is a trust hazard — the pill always shows what's coming out of your machine. -
POST /speakREST wrapper. Same code path and voice resolution for shell scripts, ACP, A2A, GitHub Actions, or anything else that isn't MCP-native.
Claude Code one-liner:
claude mcp add voicebox --transport http --url http://127.0.0.1:17493/mcp --header "X-Voicebox-Client-Id: claude-code"
Refinement
A clean transcript needs more than Whisper. Each capture flows through a small Qwen3 LLM that strips fillers, fixes punctuation, and optionally rewrites self-corrections — all on-device.
- Loop-stripping before the LLM sees the transcript. Whisper's "thanks for watching thanks for watching thanks for watching…" hallucination loops are collapsed at a six-identical-tokens threshold (case-insensitive) so a small refinement model can't echo them back. Coverage spans single-word runs, multi-word phrases, CJK character runs, and Japanese emphasis patterns; legitimate repetition ("no, no, no, no, no") doesn't cross the threshold.
- Per-capture flag snapshot.
smart_cleanup,self_correction, andpreserve_technicalare stored on each capture, so refinement can be re-run later with different flags without losing the raw transcript. - Model picker — Qwen3 0.6B (400 MB, very fast), 1.7B (1.1 GB, fast), 4B (2.5 GB, full quality). 0.6B is the default; 1.7B is the sweet spot for transcripts with code identifiers.
Captures tab + settings
Settings → Captures is now the home for the whole dictation flow:
- Dictation: global shortcut toggle, push-to-talk chord picker, toggle chord picker, live pill preview, auto-paste into focused field (with inline accessibility prompt).
- Transcription: model picker (Whisper Base / Small / Medium / Large / Turbo), language lock.
- Refinement: auto-refine toggle, model picker, smart cleanup, remove self-corrections, preserve technical terms.
- Playback: default voice for the Captures tab's "Play as" action — picking a voice from the split-button persists the choice across tab switches and restarts.
- Storage: captures folder quick-open.
Stories — timeline editor
The Stories tab graduates from a TTS sequencer into a real timeline editor. Same generation-row backing, but clips now compose with imported audio, per-clip levels, and a flexible track stack.
- Import external audio. Drag a music file onto the story content area or pick one from the new "Import audio" entry in the add-clip popover. Accepted formats: wav / mp3 / flac / ogg / m4a / aac / webm, capped at 200 MB. Imported clips show their filename instead of a profile name and skip the regenerate / version-picker controls — there's nothing to regenerate.
- Per-clip volume. A
Volume2icon in the clip-edit toolbar opens a 0–200% slider. Adjustments apply live and to exports. Split and duplicate carry the volume forward into the new clips. - Regenerate from both the clip's chat-list dropdown and the track-editor toolbar. Re-runs the underlying generation through the same path the History tab uses, with completion tracked in the global pending set.
- Add empty tracks above or below the timeline via tiny
+strips at the top of the topmost label cell and the bottom of the bottommost. Sticky in the label column so they follow horizontal scroll. - Zoom bar tracks the project. Min scope is 10 seconds visible (zoomed in cap), max is the entire project (zoomed out cap), default lands on 60 s. Both the +/− buttons and the scrollbar edge-drag handles clamp to those dynamic bounds.
Interface
- Theme selector. Light / dark / system in Settings → General, persisted across sessions. System mode listens for OS-level appearance changes and flips live without a restart.
- Scrubbable waveform player on captures. The capture detail card now embeds a WaveSurfer waveform with click-to-seek and a current / total timestamp pair, replacing the static duration label.
- Capture pill light mode. The on-screen pill gets a dedicated light palette so it stays legible against bright windows.
- Readiness checklist in the Captures settings sidebar. The same six-gate checklist the Captures empty state uses mirrors into Settings → Captures so a red gate can't hide behind a green toggle. Hidden once every gate is green. macOS-only rows (Input Monitoring, Accessibility) hide entirely on Windows and Linux.
Windows parity
Same dictation flow on Windows. Right-hand default chord (Ctrl+Shift) avoids AltGr collisions on layouts where Ctrl+Alt is the compose key. Focus is captured at chord-start so paste lands in the original field even if focus drifts during transcribe/refine.
v0.4.5
Second hotfix for the "offline mode is enabled" crash on model load. 0.4.4 reverted the inference-path offline guards but kept the same trap on the load path, so users who updated to 0.4.4 kept hitting the exact error the release was supposed to fix (#526). This release removes the load-path guards and patches the transformers tokenizer load to be robust to HuggingFace metadata failures at the source, so the class of bug can't recur.
Reliability
- Load no longer fails with "offline mode is enabled" (#530, fixes #526). transformers 4.57.x added an unconditional
huggingface_hub.model_info()call insideAutoTokenizer.from_pretrained(via_patch_mistral_regex) that runs for every non-local repo load, regardless of cache state or whether the target model is actually a Mistral variant. The load-timeHF_HUB_OFFLINEguard from 0.4.2 turned that into a hard crash for cached online users the moment 0.4.4 removed the inference-path guard that had been masking the problem. Fix wraps_patch_mistral_regexso any exception from the HF metadata check is caught and the tokenizer is returned unchanged — matching the success-path behavior for non-Mistral repos. The wrapper installs atbackend.backendsimport time so it covers Qwen Base, Qwen CustomVoice, TADA, and every other transformers-backed engine on Windows, Linux, and CUDA alike. The load-timeforce_offline_if_cachedguards were removed — with the wrapper in place they provide zero value and only risk re-introducing the same failure mode. - No more 30s pause when generating without a network. The HuggingFace metadata timeout called out as a known caveat in 0.4.4 is covered by the same patch; offline users no longer wait for the check to time out before load completes.
v0.4.4
Hotfix for a regression in 0.4.3 where generation and transcription could fail outright with "offline mode is enabled" even when the user was online.
Reliability
- Inference no longer fails with "offline mode is enabled" while online (#524, reverts the inference-path guards from #503). 0.4.3 wrapped every inference body (
generate,transcribe,create_voice_clone_prompt) with a process-wideHF_HUB_OFFLINEflip to stop lazy HuggingFace lookups from hanging when the network drops mid-inference (#462). That flag also blocks legitimate metadata calls (e.g.HfApi().model_infofor revision resolution) so online users started seeing generation fail outright. Inference now runs with the process's default HF state. Load-time offline guards — which weren't the source of the regression — stay in place.
Known caveat: users generating without an internet connection may see brief pauses during inference while HuggingFace metadata lookups time out (typically ~30s, after which the library recovers). A proper offline-mode toggle is planned for 0.4.5.
voicebox v0.4.3
A patch focused on two user-impacting reliability fixes: macOS DMG notarization (unblocks brew install voicebox on macOS 15 Sequoia and fixes spurious "app isn't signed" Gatekeeper dialogs on older Intel Macs) and Kokoro Japanese voice initialization on fresh installs.
macOS
- DMGs are now notarized and stapled (#523). Tauri's bundler notarizes the
.appinside the DMG but ships the DMG wrapper itself unnotarized. Gatekeeper rejects that on macOS 15 Sequoia (confirmed by Homebrew Cask CI failing on both arm and intel Sequoia runners) and causes the "the app is not signed" dialog on older Intel Macs when Apple's notarization servers are slow or unreachable (#509). The release workflow now submits each DMG tonotarytool, staples the ticket, verifies withspctl, and overwrites the draft-release assettauri-actionuploaded. Adds ~5-10 min per macOS job.
Backend
- Kokoro Japanese voices no longer crash on fresh installs (#521, fixes #514).
misaki[ja]pulls infugashi, which needs a MeCab dictionary on disk. Theunidicpackage that was being installed ships no data and expects a ~526MB runtime download thatjust setupdoesn't run (and which wouldn't survive PyInstaller anyway). Swapped tounidic-lite, which bundles a MeCab-compatible dict inside the wheel (~50MB). Collected inbuild_binary.pyso frozen builds pick upunidic_lite/dicdir/.
voicebox v0.4.2
This release localizes the entire app. English, Simplified Chinese (zh-CN), Traditional Chinese (zh-TW), and Japanese (ja) are wired up end-to-end across every tab, modal, dialog, and toast — 559 translation keys per locale, parity verified. Plus a batch of reliability fixes: offline-mode now actually stays offline, Chatterbox accepts reference samples it used to reject, MLX Qwen 0.6B points at the right repo, and macOS system audio survives backgrounding.
Internationalization (#508)
- i18next foundation with an in-app language switcher that re-renders the tree on change — lazy-loaded components were holding stale strings without an explicit key-bump on the React root.
- Four locales at full coverage: English, Simplified Chinese, Traditional Chinese, Japanese. No partial/English-fallback surfaces.
- Every user-visible surface translated: Stories (list, content editor, dialogs, toasts), Effects (list, detail, chain editor, built-in preset names), Voices (table, search, inspector, Create/Edit modal, audio sample panels), Audio Channels (list, dialogs, device picker), history + story dropdown menus, ProfileCard / ProfileList / HistoryTable, and the unsupported-model note.
- Relative dates localize via
date-fnslocale objects (3 days ago→3 天前/3 日前) —Intl.RelativeTimeFormatdoesn't produce the phrasing we use in the history table. - Dev-build version suffix (
v0.4.2 (dev)/(开发版)/(開發版)/(開発版)) is now locale-aware. - 559 translation keys across all four locales.
Reliability
HF_HUB_OFFLINEnow guards every inference path (#503) — some engines were still attempting a HuggingFace metadata roundtrip on first load when offline mode was enabled, causing hangs on airgapped or flaky networks.- Chatterbox reference samples are preprocessed instead of rejected (#502) — samples outside the expected sample rate or channel layout are resampled to match, rather than failing with an opaque error.
- MLX Qwen 0.6B repo path fixed (#501) — now points at the published
mlx-communityrepo so the model actually downloads on Apple Silicon. - macOS system audio survives backgrounding (#486, closes #41) — WKWebView was tearing down the audio session when the app lost focus, silently killing system-audio capture.
- MLX backend
miniaudiodependency pinned (#506) —mlx_audio.sttneeds it at runtime and nothing else transitively pulled it in, so--no-depsinstalls were breaking on first use.
Landing / Docs
- New
/downloadpage (#487) — no more dumping first-time visitors onto the GitHub releases list. The API example snippet on the landing page also got an accuracy pass. - Download redirects work behind reverse proxies (#498) — uses the public origin instead of
localhostwhen resolving platform-specific installer URLs. - MDX docs audited against the multi-engine backend (#484) — stale single-engine assumptions removed.
- Three more tutorials + mobile navbar / hero CTA fixes (#483).
Linux
- Still not shipping. The re-enable attempt (#488) landed on
mainbut CI still hangs in thetauri-actionbundler step onubuntu-22.04— no output for 25+ minutes afterrpmbundling, even withcreateUpdaterArtifacts: falseand--bundles deb,rpm. The matrix entry is disabled again for 0.4.2; the ubuntu-specific setup steps stay in the workflow so re-enabling is a one-line change once we identify the hang. Next release will take another pass.
New Contributors
- @shekharyv — download redirects behind reverse proxies (#498)
v0.4.1
A fast follow-up to 0.4.0 focused on making the new engines actually load in the production binary — plus generation cancellation, Linux system-audio capture, and the repo's first PR-time type check. Five first-time contributors shipped in this release.
0.4.0 introduced three new TTS engines, but the frozen PyInstaller binary tripped over several Python-ecosystem quirks that don't show up in the dev venv: transformers opening .py sources at runtime, scipy.stats._distn_infrastructure hitting a frozen-importer NameError, and chatterbox-multilingual failing to find its Chinese segmenter dictionary. This release patches all of those in one sweep.
Frozen-Binary Reliability (#438)
- Kokoro now bundles
.pysources alongside.pycvia--collect-all kokorosotransformers'_can_set_attn_implementationregex scan can read them — previouslyFileNotFoundError: kokoro/modules.pykilled Kokoro loading in production builds - Chatterbox Multilingual now bundles
spacy_pkuseg/dicts/default.pkland the package's native.soextensions via--collect-all spacy_pkuseg— previously the Chinese word segmenter crashed withFileNotFoundErroron first load - scipy.stats._distn_infrastructure — new runtime hook source-patches the trailing
del obj(which raisesNameErrorunder PyInstaller's frozen importer because the preceding list comprehension evaluates empty) toglobals().pop('obj', None), unblockinglibrosa→scipy.signal→scipy.statsfor every TTS engine that depends on librosa - transformers.masking_utils — same runtime hook forces
_is_torch_greater_or_equal_than_2_6 = Falseso the oldersdpa_mask_older_torchpath is selected; the 2.6+ path usesTransformGetItemToIndex(), a realtorch._dynamograph transform our permissive stub can't reproduce - torch._dynamo — no-op stub replaces the real module before
transformersimports it, preventing thetorch._numpy._ufuncsimport crash (NameError: name 'name' is not defined) that blocked Kokoro and every engine pulling inflex_attention .specpaths are now repo-relative instead of absolute, so the generated spec is portable across machines and CI
Generation
- Cancel queued or running generations (#444) — new
/generate/{id}/cancelendpoint and a Stop button on the history row while generating. The serial queue now tracks per-ID state (queued / running / cancelled) so queued jobs are skipped before the worker picks them up and running jobs are.cancel()-ed mid-flight;run_generationcatchesCancelledErrorand marks the rowfailedwith a "cancelled" error. - Legacy
data/path prefix resolution (#440) — generations stored with the olddata/prefix under pre-0.4 installs now resolve correctly after the storage root moved, fixing 404s for historical audio.
Model Migration
- Migration dialog no longer hangs when the cache is empty (#439) — the backend now emits a completion SSE event even when zero models are moved.
- Storage-change flow surfaces a toast when there's nothing to migrate (#433) instead of proceeding with a no-op move and restarting the server.
- Deleting all generations from a voice profile now deletes the associated version files and DB rows too (#447) — previously orphaned versions accumulated in storage.
Platform
- Linux system audio capture (#457) —
cpal's ALSA backend doesn't expose PulseAudio/PipeWire monitor sources by name, so the previous device-name search never matched and silently fell back to the microphone. Detection now usespactl get-default-sink+pactl list short sourcesand routes viaPULSE_SOURCE, with the name-based search retained as a fallback whenpactlis absent.
Frontend CI
- First PR-time quality gate (#418) — new
.github/workflows/ci.ymlrunsbun run typecheck+bun run build:webon every PR. Fixed pre-existing type issues that were being suppressed with@ts-expect-error, cleaned up a dep-array typo ([platform.metadata.isTauricheckOnMountcheckForUpdates]) inuseAutoUpdater, and removed 100+ lines of deadModelItemcode fromModelManagement.tsx. - Follow-up: widened
apiClient.migrateModels()return type to includemovedanderrorsso the storage-change handler typechecks against the real backend response (#470).
Docs
- Clarified in the Quick Start + README that paralinguistic tags (
[laugh],[sigh]) only work with Chatterbox Turbo; other engines read them as literal text (#450).
New Contributors
- @Bortlesboat — generation cancellation (#444)
- @gaojulong — migration dialog hang fix (#439)
- @fuleinist — migration no-op toast (#433)
- @erionjuniordeandrade-a11y — frontend CI + type hardening (#418)
- @estefrac — Linux pactl system-audio capture (#457)
v0.4.0
The biggest Voicebox release yet. Three new TTS engines bring the lineup to seven — HumeAI TADA, Kokoro 82M, and Qwen CustomVoice join Qwen3-TTS, LuxTTS, Chatterbox Multilingual, and Chatterbox Turbo. GPU support broadens to Intel Arc (XPU) and NVIDIA Blackwell (RTX 50-series), with runtime diagnostics that warn when your PyTorch build doesn't match your GPU. The CUDA backend is now split into independently versioned server and library archives, so upgrading no longer redownloads 4 GB of PyTorch/CUDA DLLs.
This release also marks a big community moment: 13 new contributors shipped fixes and features in 0.4.0. Thirty-plus bug fixes target the most-reported issues in the tracker — numpy 2.x TTS crashes, Windows background-server reliability, macOS 11 launch failures, audio playback silence, Stories clip-splitting races, history status staleness, and more.
New TTS Engines
HumeAI TADA — Expressive English & Multilingual (#296)
- Added
tada-1b(English) andtada-3b-ml(multilingual) backends - Replaced
descript-audio-codecwith a lightweight DAC shim to cut dependencies - Switched audio decoding to
soundfileto sidesteptorchcodecbundling issues - Redirected gated Llama tokenizer lookups to an ungated mirror so model loading works out of the box
- Fixed tokenizer patch that was corrupting
AutoTokenizerfor other engines - Fixed TorchScript error in frozen builds
Kokoro 82M — Fast Lightweight TTS (#325)
- Added Kokoro 82M engine with a new voice profile type system that distinguishes preset voices from cloned profiles
- Profile grid now handles engine compatibility directly — removed redundant dropdown filtering
- Tightened Kokoro profile handling so preset voices can't be edited like cloned profiles
Qwen CustomVoice (#328)
- Added
qwen-custom-voicepreset engine backed by Qwen3-TTS - Enforced preset/profile engine compatibility across the generation flow
- Floating generator now shows all engines instead of silently filtering
Voice Profile UX
Until 0.4, every engine in Voicebox was a cloning model, so every voice profile was usable with every engine and the profile grid just showed them all. Introducing Kokoro and Qwen CustomVoice — which work from preset voices rather than cloned samples — broke that assumption for the first time. An early cut on main filtered the grid by the selected engine, which left users running pre-release builds thinking their cloned voices had vanished whenever they switched to a preset-only engine.
This release ships the resolution before it ever reaches a tagged version:
- Grey-out instead of filter — all profiles are always visible; unsupported ones render dimmed with a compatibility hint at the bottom of the grid
- Auto-switch on selection — clicking a greyed-out profile selects it AND switches the engine to a compatible one, instead of silently doing nothing
- Instruct toggle restored for Qwen CustomVoice — the floating generate box now reveals a delivery-instructions input (tone, emotion, pace) when CustomVoice is selected. Hidden across the board while the new multi-engine lineup was stabilizing because most engines don't honor the kwarg; now conditionally exposed only for the one engine that was actually trained for instruction-based style control
- Supported profiles sort first; the grid scrolls the selected profile into view after engine/sort changes
- Fixed engine desync on tab navigation — the form now initializes its engine from the store
- Fixed the disabled-and-selected card click edge case by bouncing selection to re-trigger the auto-switch
- Cleaned up scroll effect timers (requestAnimationFrame + setTimeout) to prevent stale DOM writes on unmount or rapid selection changes
GPU & Platform
Intel Arc (XPU) Support (#320)
- First-class Intel Arc support across all PyTorch-based backends
- Device-aware seeding, XPU detection in the GPU status panel, and setup flow detection
- Reports correct device name and VRAM in settings
Blackwell / RTX 50-series Support (#316, #401)
- Upgraded the CUDA backend from cu126 → cu128 for RTX 50-series support
- Added
sm_120+PTXto the CUDA build viaTORCH_CUDA_ARCH_LISTfor forward-compatibility with Blackwell architectures (closes 5 open reports: #386, #395, #396, #399, #400) - GPU settings UI fixes around install/uninstall state
GPU Compatibility Diagnostics (#367, adapted)
- New
check_cuda_compatibility()compares the current device's compute capability against the bundled PyTorch's architecture list - Health endpoint exposes a
gpu_compatibility_warningfield so the UI can surface mismatches - Startup logs a
WARNwhen the installed PyTorch build doesn't support the detected GPU - GPU status label shows
[UNSUPPORTED - see logs]— no more silent "no kernel image" failures
Split CUDA Backend (#298)
- CUDA backend now ships as two independently versioned archives: a small server binary and a large libs archive (the ~4 GB of PyTorch/CUDA DLLs)
- Upgrading Voicebox no longer redownloads the libs archive when only the server binary changed
- Added
asyncio.Lockarounddownload_cuda_binary()so auto-update and manual download can't race on the same temp file (#428) - Updated
package_cuda.pyfor PyInstaller 6.18 onedir layout - Temp archives are always cleaned up on failure, even when the install aborts mid-extract
Bug Fixes
Critical: TTS Generation
- numpy 2.x
torch.from_numpycrash (#361) — torch compiled against numpy 1.x ABI fails silently when paired with numpy 2.x, causingRuntimeError: Numpy is not available/Unable to create tensoron every TTS request in bundled macOS Intel / Rosetta builds. Pinnednumpy<2.0in requirements and added a PyInstaller runtime hook with actypes.memmovefallback as belt-and-suspenders. Hardened afterward to raise on unknown dtypes instead of silently reinterpreting bytes as float32.
Platform Reliability
- Windows background server (#402) — "keep server running after close" now actually keeps the server running. The HTTP
/watchdog/disablerequest could lose the race against process exit on Windows; added a.keep-runningsentinel file as a synchronous fallback, with stale-sentinel cleanup on startup to avoid orphan server processes - macOS 11 launch crash (#424) — weak-linked ScreenCaptureKit so the app can launch on macOS < 12.3 instead of crashing at dyld resolution. Gated system audio capture behind a real
sw_versversion check so unsupported systems cleanly advertise "not available" rather than crashing at runtime - macOS Intel (x86_64) setup (#416) — relaxed
torch>=2.7.0→torch>=2.2.0. PyTorch dropped pre-built x86_64 wheels after 2.2.2, so Intel Mac devs could no longerpip install. Now resolves to the latest compatible torch per platform - Offline model loading (#318) — Qwen TTS and Whisper force offline mode when loading cached models, so startup works without network access
- GUI startup with external server (#319) — fixed GUI launch when pointed at a remote/external server, and added data refresh on server switch; hardened health validation and error handling
- Qwen3-TTS cache split on Windows (adapted from #218) — route
Qwen3TTSModel.from_pretrainedthroughhf_constants.HF_HUB_CACHEso the speech tokenizer andpreprocessor_config.jsonresolve from a single cache root - Qwen3-TTS bundling (#305) — bundle
qwen_ttssource files in the PyInstaller build to fixinspect.getsourceerrors in frozen builds - Backend import paths (#345) — moved lazy imports to top-level with absolute paths to resolve the "Failed to Save" preset error caused by
ModuleNotFoundErrorin production builds - Effects service import (#384) — fixed
ModuleNotFoundErroron preset create/update by switching to relative imports (#349)
Audio & Playback
- cpal stream silent playback (#405) —
cpal::Streamwas dropped on function return immediately afterplay(), causing every playback to fall silent. Now holds the stream until either the buffer drains or the stop flag fires (#404)
Stories & History
- Clip-splitting race (#403) — rapid double-clicks on split could race through
split_story_itemwith inconsistent state. Addedwith_for_update()row locking on the backend and anisPendingguard on the frontend (#366) - History
statusstaleness (#394) —GET /history/{id}was hardcodingstatus="completed"regardless of the DB row, breaking any client polling for job completion. Now returnsstatus,error,engine,model_size, andis_favoritedfrom the actual row - "Clear failed" bulk button (#412) — new
DELETE /history/failedendpoint and a header strip showing `"N failed generati...
v0.3.0
This release rewrites the backend into a modular architecture, overhauls the settings UI into routed sub-pages, fixes audio player freezing, migrates documentation to Fumadocs, and ships a batch of bug fixes targeting the most-reported issues from the tracker.
The backend's 3,000-line monolith main.py has been decomposed into domain routers, a services layer, and a proper database package. A style guide and ruff configuration now enforce consistency. On the frontend, settings have been split into dedicated routed pages with server logs, a changelog viewer, and an about page. The audio player no longer freezes mid-playback, and model loading status is now visible in the UI. Seven user-reported bugs have been fixed, including server crashes during sample uploads, generation list staleness, cryptic error messages, and CUDA support for RTX 50-series GPUs.
Settings Overhaul (#294)
- Split settings into routed sub-tabs: General, Generation, GPU, Logs, Changelog, About
- Added live server log viewer with auto-scroll
- Added in-app changelog page that parses
CHANGELOG.mdat build time - Added About page with version info, license, and generation folder quick-open
- Extracted reusable
SettingRowcomponent for consistent setting layouts
Audio Player Fix (#293)
- Fixed audio player freezing during playback
- Improved playback UX with better state management and listener cleanup
- Fixed restart race condition during regeneration
- Added stable keys for audio element re-rendering
- Improved accessibility across player controls
Backend Refactor (#285)
- Extracted all routes from `main.py` into 13 domain routers under `backend/routes/` — `main.py` dropped from ~3,100 lines to ~10
- Moved CRUD and service modules into `backend/services/`, platform detection into `backend/utils/`
- Split monolithic `database.py` into a `database/` package with separate `models`, `session`, `migrations`, and `seed` modules
- Added `backend/STYLE_GUIDE.md` and `pyproject.toml` with ruff linting config
- Removed dead code: unused `_get_cuda_dll_excludes`, stale `studio.py`, `example_usage.py`, old `Makefile`
- Deduplicated shared logic across TTS backends into `backends/base.py`
- Improved startup logging with version, platform, data directory, and database stats
- Fixed startup database session leak — sessions now rollback and close in a `finally` block
- Isolated shutdown unload calls so one backend failure doesn't block the others
- Handled null duration in `story_items` migration
- Reject model migration when target is a subdirectory of source cache
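The session-leak fix follows the standard rollback-then-close shape: errors roll back partial writes, and the `finally` block releases the connection no matter what. A minimal sketch of the pattern — the `Session` class here is a stand-in for SQLAlchemy's, and the names are illustrative, not Voicebox's actual code:

```python
# Illustrative stand-in for a database session; only the lifecycle
# methods relevant to the leak fix are modeled.
class Session:
    def __init__(self):
        self.closed = False
        self.rolled_back = False

    def execute(self, ok=True):
        if not ok:
            raise RuntimeError("seed failed")

    def rollback(self):
        self.rolled_back = True

    def close(self):
        self.closed = True


def run_startup_tasks(session, ok=True):
    try:
        session.execute(ok)
    except Exception:
        session.rollback()   # undo any partial writes
        raise
    finally:
        session.close()      # always release the connection
```

Without the `finally`, an exception mid-startup would skip `close()` and leak the connection for the life of the process.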
Documentation Rewrite (#288)
- Migrated docs site from Mintlify to Fumadocs (Next.js-based)
- Rewrote introduction and root page with content from README
- Added "Edit on GitHub" links and last-updated timestamps on all pages
- Generated OpenAPI spec and auto-generated API reference pages
- Removed stale planning docs (`CUDA_BACKEND_SWAP`, `EXTERNAL_PROVIDERS`, `MLX_AUDIO`, `TTS_PROVIDER_ARCHITECTURE`, etc.)
- Sidebar groups now expand by default; root redirects to `/docs`
- Added OG image metadata and `/og` preview page
UI & Frontend
- Added model loading status indicator and effects preset dropdown (3187344)
- Fixed take-label race condition during regeneration
- Added accessible focus styling to select component
- Softened select focus indicator opacity
- Addressed 4 critical and 12 major issues from CodeRabbit review
Bug Fixes (#295)
- Fixed sample uploads crashing the server — audio decoding now runs in a thread pool instead of blocking the async event loop (#278)
- Fixed generation list not updating when a generation completes — switched to `refetchQueries` for reliable cache busting, added SSE error fallback, and page reset on completion (#231)
- Fixed error toasts showing `[object Object]` instead of the actual error message (#290)
- Added Whisper model selection (`base`, `small`, `medium`, `large`, `turbo`) and expanded language support to the `/transcribe` endpoint (#233)
- Upgraded CUDA backend build from cu121 to cu126 for RTX 50-series (Blackwell) GPU support (#289)
- Handled client disconnects in SSE and streaming endpoints to suppress `[Errno 32] Broken pipe` errors (#248)
- Fixed Docker build failure from pip hash mismatch on Qwen3-TTS dependencies (#286)
- Added 50 MB upload size limit with chunked reads to prevent unbounded memory allocation on sample uploads
- Eliminated redundant double audio decode in sample processing pipeline
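Two of the upload fixes above share a common shape: read the body in bounded chunks so a large upload can't balloon memory, then hand the CPU-bound decode to a worker thread so the event loop stays free. A minimal asyncio sketch — the function names and the decode stand-in are illustrative, not Voicebox's actual endpoint:

```python
import asyncio
import io

MAX_UPLOAD_BYTES = 50 * 1024 * 1024  # the 50 MB cap from the fix above
CHUNK_SIZE = 64 * 1024


def decode_audio(data: bytes) -> int:
    # Stand-in for a blocking decode (e.g. soundfile/librosa); here it
    # just reports the payload size.
    return len(data)


async def read_limited(stream, limit=MAX_UPLOAD_BYTES) -> bytes:
    # Accumulate fixed-size chunks and bail out early instead of
    # buffering an arbitrarily large body into memory.
    buf = bytearray()
    while chunk := stream.read(CHUNK_SIZE):
        buf.extend(chunk)
        if len(buf) > limit:
            raise ValueError("upload exceeds size limit")
    return bytes(buf)


async def handle_upload(stream) -> int:
    data = await read_limited(stream)
    # Run the blocking decode in a thread so the event loop keeps
    # serving other requests.
    return await asyncio.to_thread(decode_audio, data)
```

For example, `asyncio.run(handle_upload(io.BytesIO(b"abc")))` returns 3, while a body over the cap raises `ValueError` before any decode runs.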
Platform Fixes
- Replaced `netstat` with `TcpStream` + PowerShell for Windows port detection (#277)
- Fixed Docker frontend build and cleaned up Docker docs
- Fixed macOS download links to use `.dmg` instead of `.app.tar.gz`
- Added dynamic download redirect routes to landing site
Release Tooling
- Added `draft-release-notes` and `release-bump` agent skills
- Wired CI release workflow to extract notes from `CHANGELOG.md` for GitHub Releases
- Backfilled changelog with all historical releases
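Pulling one version's notes out of a changelog reduces to finding the span between that version's heading and the next one. A hedged sketch of the idea, assuming Keep-a-Changelog-style `## [x.y.z]` headings — this is not the project's actual workflow script:

```python
import re


def extract_notes(changelog: str, version: str) -> str:
    # Capture everything between this version's "## [x.y.z]" heading
    # and the next "## " heading (or end of file).
    pattern = rf"^## \[{re.escape(version)}\].*?\n(.*?)(?=^## |\Z)"
    match = re.search(pattern, changelog, re.DOTALL | re.MULTILINE)
    if not match:
        raise KeyError(f"version {version} not found in changelog")
    return match.group(1).strip()
```

CI can then feed the extracted block straight into the GitHub Release body.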
v0.2.3
The "it works in dev but not in prod" release. This version fixes a series of PyInstaller bundling issues that prevented model downloading, loading, generation, and progress tracking from working in production builds.
Model Downloads Now Actually Work
The v0.2.1/v0.2.2 builds could not download or load models that weren't already cached from a dev install. This release fixes the entire chain:
- Chatterbox, Chatterbox Turbo, and LuxTTS all download, load, and generate correctly in bundled builds
- Real-time download progress — byte-level progress bars now work in production. The root cause: `huggingface_hub` silently disables tqdm progress bars based on logger level, which prevented our progress tracker from receiving byte updates. We now force-enable the internal counter regardless.
- Fixed Python 3.12.0 `code.replace()` bug — the macOS build was on Python 3.12.0, which has a known CPython bug that corrupts bytecode when PyInstaller rewrites code objects. This caused `NameError: name 'obj' is not defined` crashes during scipy/torch imports. Upgraded to Python 3.12.13.
PyInstaller Fixes
- Collect all `inflect` files — `typeguard`'s `@typechecked` decorator calls `inspect.getsource()` at import time, which needs `.py` source files, not just bytecode. Fixes LuxTTS "could not get source code" error.
- Collect all `perth` files — bundles the pretrained watermark model (`hparams.yaml`, `.pth.tar`) needed by Chatterbox at runtime
- Collect all `piper_phonemize` files — bundles `espeak-ng-data/` (phoneme tables, language dicts) needed by LuxTTS for text-to-phoneme conversion
- Set `ESPEAK_DATA_PATH` in frozen builds so the espeak-ng C library finds the bundled data instead of looking at `/usr/share/espeak-ng-data/`
- Collect all `linacodec` files — fixes `inspect.getsource` error in Vocos codec
- Collect all `zipvoice` files — fixes source code lookup in LuxTTS voice cloning
- Copy metadata for `requests`, `transformers`, `huggingface-hub`, `tokenizers`, `safetensors`, `tqdm` — fixes `importlib.metadata` lookups in frozen binary
- Add hidden imports for `chatterbox`, `chatterbox_turbo`, `luxtts`, `zipvoice` backends
- Add `multiprocessing.freeze_support()` to fix resource_tracker subprocess crash in frozen binary
- `--noconsole` now only applied on Windows — macOS/Linux need stdout/stderr for Tauri sidecar log capture
- Hardened `sys.stdout`/`sys.stderr` devnull redirect to test writability, not just `None` check
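The collection fixes above all reduce to two PyInstaller hooks: `collect_all` for packages that need real `.py` source or data files at runtime, and `copy_metadata` for packages whose `importlib.metadata` lookups fail in a frozen binary. A `.spec`-file fragment sketching that pattern — package names follow the list above, but treat the layout as illustrative, not Voicebox's actual spec:

```python
# Spec-file fragment (consumed by PyInstaller, not run standalone).
from PyInstaller.utils.hooks import collect_all, copy_metadata

datas, binaries, hiddenimports = [], [], []

# Ship source + data files for packages that do inspect.getsource()
# or load bundled assets at import time.
for pkg in ("inflect", "perth", "piper_phonemize", "linacodec", "zipvoice"):
    d, b, h = collect_all(pkg)
    datas += d
    binaries += b
    hiddenimports += h

# Ship dist-info so importlib.metadata lookups succeed when frozen.
for dist in ("requests", "transformers", "huggingface-hub",
             "tokenizers", "safetensors", "tqdm"):
    datas += copy_metadata(dist)

# Backends imported dynamically, invisible to static analysis.
hiddenimports += ["chatterbox", "chatterbox_turbo", "luxtts", "zipvoice"]

# In the bundled entry point itself, multiprocessing.freeze_support()
# must run first so resource_tracker doesn't re-launch the whole app.
```

`datas`, `binaries`, and `hiddenimports` then feed into the spec's `Analysis(...)` call as usual.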
Updater
- Fixed updater artifact generation with `v1Compatible` for `tauri-action` signature files
- Updated `tauri-action` to v0.6 to fix updater JSON and `.sig` generation
Other Fixes
- Full traceback logging on all backend model loading errors (was just `str(e)` before)
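The difference between `str(e)` and a full traceback is one stdlib call; a minimal sketch of the pattern (logger name and wrapper function are illustrative):

```python
import logging
import traceback

logger = logging.getLogger("backend.models")


def load_model(loader):
    try:
        return loader()
    except Exception as e:
        # Before: logger.error(str(e)) — loses the stack entirely.
        # After: include the full traceback so the failing frame,
        # file, and line number survive into the logs.
        logger.error("model load failed: %s\n%s", e, traceback.format_exc())
        raise
```

Re-raising after logging keeps the caller's error handling intact; only the log output changes.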
v0.2.2
UPDATE: I'm working on a rewrite of model downloading. It's absolute hell and takes a while to test, since it always works in dev and never in prod builds. A solution will be up ASAP. If you're eager to test 0.2.x, please compile from source. The next update will solve model downloading and the updater issue for good.
- Fix Chatterbox model support in bundled builds [SIKE fixed in 0.2.3]
- Fix LuxTTS/ZipVoice support in bundled builds [SIKE fixed in 0.2.3]
- Auto-update CUDA binary when app version changes
- CUDA download progress bar
- Fix server process staying alive on macOS (SIGHUP handling, watchdog grace period)
- Hide console window when running CUDA binary on Windows
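The lingering-server fix above hinges on reacting to SIGHUP, which the sidecar receives when its parent Tauri process dies. A minimal POSIX sketch — the shutdown hook and function name are illustrative, not the actual sidecar code:

```python
import signal
import sys


def install_sighup_handler(shutdown):
    # Without a handler, the Python server can outlive the Tauri shell
    # that spawned it. On SIGHUP, run cleanup and exit promptly.
    def _handle(signum, frame):
        shutdown()
        sys.exit(0)

    signal.signal(signal.SIGHUP, _handle)
    return _handle
```

A watchdog grace period (as in the fix above) complements this: if the handler never fires, a timer that notices the missing parent still reaps the process.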