Add Moonshine ASR backend (CPU-friendly, with Arabic-tuned default)#19

Open
Ahmed-Ezzat20 wants to merge 1 commit into bakrianoo:master from
Ahmed-Ezzat20:feat/moonshine-asr-backend

Conversation

@Ahmed-Ezzat20
Contributor

Summary

Adds a sixth transcription backend, Moonshine (Useful Sensors), via the
Hugging Face Transformers integration. Moonshine is a small (27–61 M params),
CPU-friendly ASR model, and the Arabic-specialised checkpoint
moonshine-tiny-ar matches whisper-medium quality on Arabic FLEURS/Common Voice
with 28× fewer parameters. That makes it a good fit for users who want a free,
local, Arabic-strong alternative to Deepgram or full Whisper.

pip install "mazinger[transcribe-moonshine]"
mazinger transcribe audio.mp3 --method moonshine --language ar -o subs.srt

The default model is picked from --language: moonshine-tiny-ar for Arabic
and moonshine-base otherwise. Users can override with --model.
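
The selection rule above can be sketched as follows. This is a minimal illustration, not the code in the PR: the function name pick_default_model is hypothetical, and treating regional codes such as ar-EG as Arabic is an assumption the description does not confirm.

```python
from typing import Optional


def pick_default_model(language: Optional[str], model: Optional[str] = None) -> str:
    """Return the Moonshine checkpoint name to load (hypothetical helper).

    An explicit --model always wins; otherwise the language decides.
    """
    if model:
        return model
    lang = (language or "").strip().lower()
    # Assumption: regional variants ("ar-EG", "ar_SA") count as Arabic.
    if lang == "ar" or lang.startswith(("ar-", "ar_")):
        return "moonshine-tiny-ar"
    # English and unknown/unspecified languages fall back to the base model.
    return "moonshine-base"
```

With this sketch, pick_default_model("ar") selects the Arabic checkpoint, while pick_default_model("ar", model="moonshine-base") respects the override.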

Implementation notes

  • _transcribe_moonshine() in mazinger/transcribe.py: loads
    MoonshineForConditionalGeneration + AutoProcessor, runs Silero VAD
    ourselves (Moonshine has no native long-audio chunker and was trained on
    ≤30 s clips), then transcribes each VAD chunk and stamps it with the chunk's
    start/end times.
  • Hallucination guard: applies the model card's 13 tokens/sec cap as
    max_length per chunk to suppress runaway outputs on near-silent regions.
  • Model caching: reuses the existing _whisper_cache, so
    transcribe.clear_cache() correctly frees both Moonshine and Whisper models
    between runs.
  • Audio preprocessing: reuses _preprocess_audio() so a project that
    switches between faster-whisper and moonshine doesn't re-encode the audio.
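
The chunk-and-cap flow in the first two bullets might look roughly like the sketch below. It is not the PR's implementation: the names max_length_for and transcribe_chunks are hypothetical, the model call is abstracted behind a transcribe_fn callable, and the timestamps are assumed to be dicts with start/end offsets in samples (the shape Silero VAD's get_speech_timestamps returns by default).

```python
from typing import Callable, Dict, List, Sequence

SAMPLE_RATE = 16_000      # 16 kHz mono, as produced by the preprocessing step
TOKENS_PER_SECOND = 13    # hallucination cap quoted from the model card


def max_length_for(num_samples: int, sample_rate: int = SAMPLE_RATE) -> int:
    """Token budget for one chunk; never below 1 so generation can emit something."""
    return max(1, int(num_samples / sample_rate * TOKENS_PER_SECOND))


def transcribe_chunks(
    audio: Sequence[float],
    speech_timestamps: List[Dict[str, int]],
    transcribe_fn: Callable[[Sequence[float], int], str],
    sample_rate: int = SAMPLE_RATE,
) -> List[dict]:
    """Transcribe each VAD chunk and stamp it with the chunk's start/end in seconds."""
    segments = []
    for ts in speech_timestamps:
        start, end = ts["start"], ts["end"]
        # transcribe_fn stands in for the processor + model.generate(max_length=...) call.
        text = transcribe_fn(audio[start:end], max_length_for(end - start, sample_rate)).strip()
        if text:  # skip chunks that produced no text
            segments.append(
                {"start": start / sample_rate, "end": end / sample_rate, "text": text}
            )
    return segments
```

Capping by chunk duration means a 2-second chunk can emit at most 26 tokens, which bounds how far a stuck loop can run before the cleanup regexes see it.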

Cross-cutting fix triggered by Moonshine's hallucination behaviour

While testing on a real Arabic clip with a music outro, Moonshine's third VAD
chunk got stuck in a "كما قلت، كما قلت، كما قلت..." loop. The existing
_REPEATED_WORD_RE only collapses whitespace-separated repeats, so
punctuation-separated stuck-token loops slipped through.

Added _REPEATED_PHRASE_RE that collapses 3+ phrase repeats joined by Latin
or Arabic punctuation. This benefits every backend that can produce
stuck-token loops on near-silent or out-of-distribution (OOD) audio: Whisper,
faster-whisper, and WhisperX all do this occasionally too. Verified with 7
positive cases (Arabic multi-word, Latin multi-word, single-word + punct,
mixed real text + loop) and a negative case ensuring normal sentences with
shared words like "and" are not over-collapsed.
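
For readers who want a concrete picture of the phrase-collapsing rule described above, here is one way such a pattern could be written. It is an illustrative sketch only: the actual _REPEATED_PHRASE_RE in the PR may differ, and the names _REPEATED_PHRASE_SKETCH and collapse_repeated_phrases, the 1-4 word phrase limit, and the exact punctuation set are all assumptions.

```python
import re

# A phrase of 1-4 words, followed by 2+ further copies of the same phrase,
# each joined by Latin or Arabic punctuation (3+ occurrences in total).
_REPEATED_PHRASE_SKETCH = re.compile(
    r"(?P<phrase>\b\w+(?:\s+\w+){0,3})"            # the phrase itself
    r"(?:\s*[,.;:!?،؛؟…]+\s*(?P=phrase)\b){2,}"    # punctuation + same phrase, 2+ times
)


def collapse_repeated_phrases(text: str) -> str:
    """Replace a punctuation-separated run of 3+ identical phrases with one copy."""
    return _REPEATED_PHRASE_SKETCH.sub(lambda m: m.group("phrase"), text)
```

Because each repeat must be preceded by punctuation and match the whole phrase, a sentence such as "salt and pepper, and garlic, and onions" is left untouched: "and" is never immediately followed by punctuation, so it cannot anchor a run.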

Tests

tests/test_moonshine.py — 9 unit tests covering default-model selection,
the new phrase-cleanup regex (positive + negative), method-literal
membership, and the dispatch error message. They run without downloading
the model, so they're cheap to add to CI. The full test suite (14 tests
including the existing MLX ones) passes.

End-to-end test result

Tested on the same 58-second Arabic clip used to validate Deepgram earlier:

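| Run                    | Wall-clock | Real-time factor | Notes                                                                |
|------------------------|------------|------------------|----------------------------------------------------------------------|
| Cold (downloads model) | 122 s      | ~0.5×            | Surfaced the punctuation-loop bug above                              |
| Warm (cached model)    | 24 s       | ~2.4×            | Loop collapsed correctly, two real-content chunks transcribe cleanly |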

Quality is comparable to Deepgram on the speech portions; running locally,
at zero cost and with no network access, is the main win.
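
The real-time factors in the table above are audio duration divided by wall-clock time, so values above 1 mean faster than real time. A quick check of the arithmetic (the helper name speed_factor is only for illustration):

```python
def speed_factor(audio_seconds: float, wall_seconds: float) -> float:
    """Seconds of audio processed per second of wall-clock time."""
    return audio_seconds / wall_seconds


CLIP_SECONDS = 58  # length of the Arabic test clip

print(round(speed_factor(CLIP_SECONDS, 122), 2))  # cold run -> 0.48
print(round(speed_factor(CLIP_SECONDS, 24), 2))   # warm run -> 2.42
```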

Files

| File                        | Change                                                                                                      |
|-----------------------------|-------------------------------------------------------------------------------------------------------------|
| mazinger/transcribe.py      | New _transcribe_moonshine(), default-model picker, dispatch case, docstring updates, new _REPEATED_PHRASE_RE |
| mazinger/cli/_transcribe.py | --method moonshine choice + help text                                                                       |
| mazinger/cli/_groups.py     | --transcribe-method moonshine choice + help text                                                            |
| pyproject.toml              | New transcribe-moonshine extra (transformers>=4.49, torch>=2.4, silero-vad>=5.0)                            |
| tests/test_moonshine.py     | 9 new unit tests                                                                                            |
| README.md                   | Feature list, install extras, dedicated Quick Start section                                                 |
| docs/installation.md        | Local transcription extras + task matrix row                                                                |
| docs/cli-reference.md       | Choice lists + transcribe example                                                                           |
| docs/quick-start.md         | Moonshine usage example                                                                                     |

Adds a sixth transcription backend (--method moonshine / --transcribe-method
moonshine) using Useful Sensors' Moonshine ASR via Hugging Face Transformers.
Moonshine is small (27-61M params), CPU-friendly, and the Arabic-tuned
moonshine-tiny-ar variant matches whisper-medium quality at 28x fewer params.

Implementation:
- _transcribe_moonshine() in transcribe.py: loads MoonshineForConditionalGeneration,
  runs Silero VAD ourselves (Moonshine has no native long-audio chunker and was
  trained on <=30s clips), and applies the model card's 13 tokens/sec hallucination
  cap on each chunk.
- Default model auto-picked from --language: moonshine-tiny-ar for Arabic,
  moonshine-base for English / unknown. Users can always pass --model to override.
- Reuses _whisper_cache so transcribe.clear_cache() correctly frees both the
  model and the processor between runs.
- Reuses _preprocess_audio (16 kHz mono WAV) so subsequent backend calls on the
  same project don't reconvert.

Cross-cutting fix triggered by Moonshine's hallucination behaviour:
- Add _REPEATED_PHRASE_RE that collapses 3+ punctuation-separated phrase repeats
  (Latin or Arabic punctuation). Catches stuck-token loops like "كما قلت، كما
  قلت، كما قلت" that the existing whitespace-only regex missed. This benefits
  ALL backends (Whisper, faster-whisper, Moonshine) on near-silent / OOD audio.

Tests:
- tests/test_moonshine.py: 9 unit tests covering default-model selection,
  the new phrase-cleanup regex (positive + negative cases), method-literal
  membership, and the dispatch error message. All 14 tests pass (5 existing
  MLX + 9 new Moonshine).

Docs: README, docs/installation.md, docs/cli-reference.md, docs/quick-start.md
all updated with install extras, choice lists, and a Quick Start example
highlighting the Arabic checkpoint.