
ET Mac Voxtral Realtime Desktop App #219

Open
seyeong-han wants to merge 18 commits into meta-pytorch:main from seyeong-han:et-voxtral-realtime

Conversation

@seyeong-han
Contributor

No description provided.

seyeong-han and others added 15 commits March 3, 2026 13:32
Native SwiftUI macOS app that wraps ExecuTorch's voxtral_realtime_runner
for on-device speech transcription using Voxtral-Mini-4B (Metal int4).

Features:
- Live transcription with real-time token streaming
- Model preloading with loading progress indicators
- Pause/resume within the same session
- Session history with search, rename, and persistence
- Audio level waveform visualization
- Bundled runner binary, libomp, and model artifacts via build phase

Uses XcodeGen (project.yml) to generate the Xcode project.

Co-authored-by: Claude <noreply@anthropic.com>
Made-with: Cursor
- Introduced DictationManager to handle dictation state and hotkey registration.
- Implemented startDictation and stopDictation methods in TranscriptStore for managing dictation sessions.
- Added DictationOverlayView and DictationPanel for user interface during dictation.
- Updated VoxtralRealtimeApp to integrate dictation features, including accessibility checks and hotkey registration.
- Enhanced user experience with real-time dictation text display and command menu options for starting/stopping dictation.
- Raise silence threshold from 0.005 to 0.02 so background noise
  doesn't prevent auto-stop
- Save frontmost app reference before showing panel and re-activate
  it before pasting
- Use nil CGEventSource and .cgSessionEventTap for reliable paste
- Add 300ms delay after panel dismiss for focus to settle
- Remove AXIsProcessTrusted guard (unreliable with Debug builds)

Co-authored-by: Claude <noreply@anthropic.com>
Made-with: Cursor
Overlay text area grows smoothly from 40pt to 200pt as transcribed
text exceeds two lines, with animated height transition.

Co-authored-by: Claude <noreply@anthropic.com>
Made-with: Cursor
- Introduced new preferences for silence detection: silence threshold and silence timeout, allowing users to customize sensitivity and auto-stop delay.
- Updated SettingsView to include sliders for adjusting silence detection parameters.
- Added a script to create a DMG for easy application distribution with a drag-to-Applications UI.
- Included new app icon assets for better visual representation in the app.
… model repo

- Rename project directory from apps/macos/speech-studio to apps/macos/VoxtralRealtimeApp
- Rename branch from et-speech-studio to et-voxtral-realtime
- Rewrite README as HF showcase app with end-user and developer sections
- Update DMG volume name and all text references from "Speech Studio" to "Voxtral Realtime"
- Update SetupGuideView with context-aware instructions (bundled vs developer build)
- Update context.md to reflect model bundling in DMG and new paths

Made-with: Cursor
- Add scripts/build.sh: one-command pipeline (check prereqs → download
  models → xcodegen → xcodebuild → create DMG), supports --download-models
- Update create_dmg.sh: validates all 5 required files (runner, libomp,
  model, preprocessor, tokenizer) exist in .app bundle before creating DMG
- Update README: add Download section pointing to GitHub Releases for end
  users, add quick-build section for developers, clarify that models are
  not in git and must be downloaded before building
- Update context.md with distribution model and build pipeline decisions

Made-with: Cursor
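A minimal sketch of the pre-DMG validation described above, assuming the five files sit in the bundle's `Contents/Resources` directory. Only `voxtral_realtime_runner` and `libomp.dylib` are named in this PR; the other three file names and the `check_bundle` helper are hypothetical placeholders, not the contents of create_dmg.sh.

```shell
# Files the DMG must contain. The runner and libomp names come from the PR text;
# the other three names are hypothetical placeholders.
REQUIRED_FILES="voxtral_realtime_runner libomp.dylib model.pte preprocessor.pte tokenizer.json"

# check_bundle APP_PATH
# Reports every missing file on stderr and returns non-zero if any is absent.
# Assumes the files live in Contents/Resources, which may differ in the real app.
check_bundle() {
  local resources="$1/Contents/Resources"
  local missing=0
  local name
  for name in $REQUIRED_FILES; do
    if [ ! -f "$resources/$name" ]; then
      echo "missing from bundle: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

Reporting every missing file before failing, rather than stopping at the first one, means a single run shows everything that still has to be built or downloaded.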
- build.sh now checks CONDA_DEFAULT_ENV is set before proceeding, with
  full setup instructions if no env is active
- README restructured: conda env creation is step 1, all subsequent steps
  (ExecuTorch install, runner build, model download) run inside the env
- Consolidated pip installs (huggingface_hub, sounddevice) into one step
- Added DYLD_LIBRARY_PATH to CLI test section
- Updated context.md constraints with conda env requirement

Made-with: Cursor
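The environment guard could look roughly like the sketch below. `CONDA_DEFAULT_ENV` is the variable conda sets on activation; the function name and the wording of the messages are assumptions, not taken from build.sh.

```shell
# require_conda_env
# Returns 0 when a conda environment is active, otherwise prints setup
# guidance on stderr and returns 1. The messages are illustrative only.
require_conda_env() {
  if [ -z "${CONDA_DEFAULT_ENV:-}" ]; then
    echo "error: no conda environment is active" >&2
    echo "  create and activate one first, for example:" >&2
    echo "    conda create -n <env-name> python" >&2
    echo "    conda activate <env-name>" >&2
    return 1
  fi
  echo "using conda env: $CONDA_DEFAULT_ENV"
}
```

A build script would call `require_conda_env || exit 1` before any install or download step, so a missing environment fails early instead of partway through the pipeline.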
Tested: full pipeline runs end-to-end producing a 3.5 GB DMG with all
5 required files bundled (runner, libomp, model, preprocessor, tokenizer).

- build.sh: default EXECUTORCH_PATH changed to ~/executorch, enforces
  non-base conda env with full setup guide for et-metal, --help shows
  complete one-time setup sequence
- project.yml: post-compile script reads EXECUTORCH_PATH and MODEL_DIR
  env vars (defaults to ~/executorch and ~/voxtral_realtime_quant_metal)
- Preferences.swift: fallback runner path updated to ~/executorch
- create_dmg.sh: osascript layout step is now non-fatal (skipped in
  non-interactive shells), hdiutil detach tolerates errors
- context.md: paths and constraints updated for et-metal + ~/executorch

Made-with: Cursor
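The path defaults, the non-base check, and the non-fatal layout step from this commit can be sketched as follows. The default paths come from the commit message; the helper names `resolve_paths`, `require_non_base_env`, and `try_step` are hypothetical.

```shell
# resolve_paths
# Keeps caller-provided values and falls back to the defaults named in the PR.
resolve_paths() {
  EXECUTORCH_PATH="${EXECUTORCH_PATH:-$HOME/executorch}"
  MODEL_DIR="${MODEL_DIR:-$HOME/voxtral_realtime_quant_metal}"
  export EXECUTORCH_PATH MODEL_DIR
}

# require_non_base_env
# Rejects both "no env" and the base env, since the build expects a dedicated one.
require_non_base_env() {
  case "${CONDA_DEFAULT_ENV:-}" in
    ""|base)
      echo "error: activate a non-base conda env (e.g. et-metal) first" >&2
      return 1
      ;;
  esac
}

# try_step COMMAND [ARGS...]
# Runs a cosmetic step and downgrades a failure to a warning.
try_step() {
  "$@" || echo "warning: step failed, continuing: $*" >&2
}
```

With this shape, a step such as the Finder layout or the final detach would be wrapped in `try_step`, so a non-interactive shell still produces a DMG.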
Microphone:
- Check AVCaptureDevice.authorizationStatus live before every
  startTranscription, resumeTranscription, and startDictation instead
  of relying on cached healthResult
- Add HealthCheck.liveMicPermission() for direct, non-cached checks
- Validate AudioEngine input format after start — throw
  microphoneNotAvailable if hardware returns zero sample rate
- Re-run health check when app returns to foreground so UI reflects
  permission changes made in System Settings
- Error messages now tell user to "quit and relaunch the app" since
  macOS caches permission grants per process lifetime

Accessibility (auto-paste):
- Re-check AXIsProcessTrustedWithOptions right before paste, not
  just at startup — catches trust invalidated by debug rebuilds
- Handle nil CGEvents explicitly: log clear error instead of silently
  failing via optional chaining
- Copy text to clipboard before attempting paste so it's always
  available even if CGEvent fails
- Remove startup Accessibility prompt — defer to first paste attempt
  to avoid confusing users who don't use dictation

Made-with: Cursor
Explains how to clear stale permission entries when mic or Accessibility
prompts stop appearing after multiple builds/installs.

Made-with: Cursor
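One common way to clear stale entries is macOS's `tccutil reset`. The sketch below assumes that is the approach meant here; it only prints the commands unless `RUN=1` is set. The `reset_tcc` helper is hypothetical, and the caller must supply the app's actual bundle identifier.

```shell
# reset_tcc BUNDLE_ID
# Prints the tccutil commands for the two permissions this app uses.
# Set RUN=1 to execute them (macOS only); by default this is a dry run.
reset_tcc() {
  local bundle_id="$1"
  local service
  for service in Microphone Accessibility; do
    if [ "${RUN:-0}" = 1 ]; then
      tccutil reset "$service" "$bundle_id"
    else
      echo "tccutil reset $service $bundle_id"
    fi
  done
}
```

After a reset, the next launch should show the permission prompt again.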
The entitlements file was empty while Hardened Runtime was enabled,
which caused macOS to silently deny microphone access without showing
the permission prompt.

- com.apple.security.device.audio-input: required for mic access
  under Hardened Runtime
- com.apple.security.cs.disable-library-validation: required to load
  the bundled unsigned voxtral_realtime_runner and libomp.dylib

Made-with: Cursor
Root cause: xcodegen's `entitlements:` block without `properties:` was
overwriting VoxtralRealtime.entitlements to an empty dict on every
`xcodegen generate`. The built app under Hardened Runtime had no
audio-input entitlement, so macOS silently denied mic access without
showing the permission prompt.

Fix:
- Add `properties:` to the entitlements block in project.yml so xcodegen
  generates the correct keys every time
- Export EXECUTORCH_PATH and MODEL_DIR in build.sh so xcodebuild's
  post-compile script inherits them
- Remove CODE_SIGN_ALLOW_ENTITLEMENTS_MODIFICATION (no longer needed)

Verified: codesign -d --entitlements shows both
com.apple.security.device.audio-input and
com.apple.security.cs.disable-library-validation in the built app.
Mic permission prompt appears on first launch after TCC reset.

Made-with: Cursor
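The verification step can be scripted along these lines. This is a sketch, not part of the PR: `has_required_entitlements` is a hypothetical helper that reads an entitlements dump on stdin, so it works with whatever output format the installed `codesign` emits.

```shell
# has_required_entitlements
# Reads an entitlements dump on stdin and succeeds only if both keys
# required under Hardened Runtime are present.
has_required_entitlements() {
  local dump key
  dump=$(cat)
  for key in \
    com.apple.security.device.audio-input \
    com.apple.security.cs.disable-library-validation
  do
    if ! printf '%s\n' "$dump" | grep -qF -- "$key"; then
      echo "missing entitlement: $key" >&2
      return 1
    fi
  done
}

# Usage on macOS (APP is a hypothetical path to the built bundle):
#   codesign -d --entitlements - "$APP" 2>/dev/null | has_required_entitlements
```

Running this after every `xcodegen generate` and build would catch a regression to an empty entitlements file before the app silently loses microphone access.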
- Add BSD license headers to all 20 Swift source files
- Add BSD license headers to shell scripts (build.sh, create_dmg.sh)
- Update bundle identifier from com.younghan to org.pytorch.executorch
- Update GitHub release URL from personal fork to official pytorch repo
- Update .gitignore to exclude DMG files (binary artifacts)
- Update LICENSE file with proper BSD license text
Made-with: Cursor
@meta-cla meta-cla bot added the CLA Signed label Mar 5, 2026
- Compressed the video from 17 MB (3456x2234, 240fps .mov) to 563 KB (1728p, 30fps .mp4)
- Uploaded to v1.0.0 release assets for GitHub README rendering
- Removed large .mov from git history

Made-with: Cursor

Labels

CLA Signed


1 participant