
feat: add Zipformer2 transducer support for Vosk/sherpa-onnx models #443

Draft
JarbasAl wants to merge 7 commits into FluidInference:main from TigreGotico:feat/sherpa-zipformer-coreml

Conversation

Contributor

@JarbasAl commented Mar 27, 2026

Summary

  • Add AsrModelVersion.zipformer2 for icefall Zipformer2 transducer CoreML models
    (Vosk/sherpa-onnx)
  • Greedy RNNT decoder with stateless context-window decoder (no LSTM), no duration prediction
  • End-to-end inference pipeline: raw audio → 80-bin kaldi mel spectrogram → encoder → RNNT decode
  • AsrModels.loadZipformer2(from:) for loading local model directories

Architecture differences from Parakeet TDT

| | Parakeet TDT | Zipformer2 RNNT |
| --- | --- | --- |
| Decoder | Stateful LSTM (1-2 layers) | Stateless (context window of 2 tokens) |
| Duration | 5-bin prediction (skip 0-4 frames) | Standard RNNT (1 frame per step) |
| blank_id | 1024 / 8192 | 0 |
| Encoder input | Raw audio waveform | 80-dim mel spectrogram |
| Encoder output | [1, D, T] | [1, T, D] |
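The "stateless decoder" row above can be made concrete with a small sketch: the decoder state is just the last two non-blank token IDs rather than LSTM hidden/cell tensors. The helper below is illustrative, not code from this PR.

```swift
let blankId = 0
let contextSize = 2

/// Append a newly emitted token to the decoder context, dropping the
/// oldest entry; blank tokens do not advance the context.
func updateContext(_ context: [Int], with token: Int) -> [Int] {
    guard token != blankId else { return context }
    return Array((context + [token]).suffix(contextSize))
}

var context = [Int](repeating: blankId, count: contextSize)  // [0, 0]
context = updateContext(context, with: 17)  // [0, 17]
context = updateContext(context, with: 0)   // blank: unchanged
context = updateContext(context, with: 42)  // [17, 42]
```

Because the state is a plain token window, there is nothing to initialize or carry across CoreML calls beyond these two integers.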

Files changed

  • ZipformerRnntDecoder.swift — greedy RNNT decode with vDSP argmax
  • AsrModels.swift — AsrModelVersion.zipformer2 properties, loadZipformer2(from:) loader
  • AsrManager.swift — mel spectrogram init, cleanup, availability check
  • AsrTranscription.swift — executeZipformerInference pipeline (mel → encoder → decode)
  • ModelNames.swift — Repo.zipformer2, ModelNames.Zipformer2 enum
  • AsrBenchmark.swift, TranscribeCommand.swift — exhaustive switch cases

Usage

```swift
let models = try AsrModels.loadZipformer2(from: modelDirectory)
let manager = AsrManager()
try await manager.initialize(models: models)
let result = try await manager.transcribe(audioBuffer)
```

Companion PR

Depends on FluidInference/mobius#35 for the CoreML model conversion scripts
(PyTorch checkpoint → encoder/decoder/joiner.mlpackage).

Add AsrModelVersion.zipformer2 for icefall Zipformer2 transducer
CoreML models. Key differences from Parakeet TDT:

- Stateless decoder (context window of token IDs, no LSTM states)
- Standard RNNT greedy decode (no duration prediction)
- blank_id=0, 80 mel bins, encoder output shape [1,T,D] not [1,D,T]
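The [1,T,D] vs [1,D,T] difference changes how a single encoder frame is sliced out of the flattened output buffer. A minimal sketch (the function name and signature are illustrative, not this PR's API):

```swift
/// Extract encoder frame `t` from a flattened output buffer.
/// Zipformer2 emits [1, T, D]: frame t is a contiguous run of D values.
/// Parakeet TDT emits [1, D, T]: frame t is strided, one value per
/// feature row.
func frame(at t: Int, from flat: [Float], T: Int, D: Int,
           timeMajor: Bool) -> [Float] {
    if timeMajor {
        return Array(flat[t * D ..< (t + 1) * D])   // [1, T, D]
    } else {
        return (0..<D).map { flat[$0 * T + t] }     // [1, D, T]
    }
}
```

Reading with the wrong layout silently scrambles features rather than crashing, which is why the shape difference is worth calling out.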

New files:
- ZipformerRnntDecoder.swift: greedy RNNT decode with vDSP argmax

Modified:
- AsrModelVersion: add .zipformer2 with properties (blankId,
  contextSize, requiresMelInput, hasStatelessDecoder, melBins)
- ModelNames: add Zipformer2 enum and Repo.zipformer2
- AsrManager: route .zipformer2 to ZipformerRnntDecoder
- CLI switches: handle new enum case

Complete the Zipformer2 integration so it works end-to-end:

- AsrManager: initialize AudioMelSpectrogram (80-bin kaldi fbank, no
  preemphasis, periodic window) when model version requires mel input
- AsrTranscription: add executeZipformerInference path that computes
  mel spectrogram from raw audio, feeds it to the encoder, then runs
  greedy RNNT decode
- AsrModels: add loadZipformer2(from:) for loading models directly
  from a local directory (encoder/decoder/joiner.mlpackage + vocab.json)
- Cleanup melSpectrogram on AsrManager.cleanup()
@Alex-Wengg
Member

@JarbasAl what exactly are the advantages of zipformer over other parakeet models?

- Skip LSTM state initialization for stateless Zipformer2 decoder
  (initializeDecoderState was sending Parakeet-style h_in/c_in inputs
  to the Zipformer2 model which expects only 'y')
- Auto-compile .mlpackage to .mlmodelc on first load
- Use .all compute units for Zipformer2 (avoids slow ANE compilation)
- Pad/truncate mel frames to encoder's fixed input size (1495 frames)
- Add --model-version zipformer2 CLI option with --model-dir support

Tested: swift run fluidaudiocli transcribe audio.wav --model-version
zipformer2 --model-dir /path/to/vosk-0.62-atc-int8
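The pad/truncate step mentioned above (fitting mel frames to the encoder's fixed 1495-frame input) can be sketched as follows; the helper name and defaults are illustrative, not the PR's actual API:

```swift
/// Pad with zero frames or truncate so the mel sequence matches the
/// encoder's fixed input length (1495 frames per the commit above).
func fitToEncoderLength(_ frames: [[Float]], target: Int = 1495,
                        melBins: Int = 80) -> [[Float]] {
    if frames.count >= target {
        return Array(frames.prefix(target))  // truncate long inputs
    }
    let pad = [Float](repeating: 0, count: melBins)
    return frames + Array(repeating: pad,
                          count: target - frames.count)  // zero-pad
}
```

Padding to a fixed shape trades some wasted compute for a CoreML model with static input dimensions, which avoids per-call shape specialization.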

When Zipformer2 models are exported with --fuse-mel, the Preprocessor
takes raw audio (1, 239120) like Parakeet — no external mel needed.

- loadZipformer2: auto-detect fused (Preprocessor.mlpackage) vs
  separate encoder, set hasFusedMel flag
- AsrModels.hasFusedMel: controls whether mel extraction is external
- usesSplitFrontend: returns false for fused Zipformer2
- Dynamic encoder output key resolution (encoder/encoder_out)
- Version-specific maxAudioSamples (239120 for Zipformer2)

Tested with real audio via CLI:
  swift run fluidaudiocli transcribe audio.wav \
    --model-version zipformer2 \
    --model-dir /path/to/vosk-0.62-atc-fused
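The "dynamic encoder output key resolution" bullet above amounts to trying the known export names in order. A minimal sketch over a plain dictionary (the real code would read a CoreML prediction's feature names; this helper is illustrative):

```swift
/// Resolve the encoder output from a prediction dictionary: different
/// exports name it "encoder_out" or "encoder", so try both in order.
func resolveEncoderOutput(in outputs: [String: [Float]]) -> [Float]? {
    for key in ["encoder_out", "encoder"] {
        if let value = outputs[key] { return value }
    }
    return nil
}
```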

Remove non-fused mel path — Zipformer2 now requires fused
Preprocessor.mlpackage (audio_signal → encoder features), same
interface as Parakeet tdtCtc110m.

Removed:
- requiresMelInput, hasFusedMel, melBins properties
- melSpectrogram instance and init/cleanup on AsrManager
- executeZipformerInference method (80+ lines)
- Complex isAvailable branching for mel mode

Zipformer2 is now just another fused encoder variant:
hasFusedEncoder=true, same preprocessor flow as Parakeet.
Collaborator

SGD2718 commented Mar 27, 2026

Is there any way you can convert these ONNX models to CoreML?

JarbasAl and others added 2 commits March 28, 2026 02:48
- Compute valid encoder frames from audio length for traced models
  (encoder_out_lens is a traced constant, not actual frame count)
- Switch to single-token-per-frame greedy decode matching Python
  reference (prevents token repetition loops)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
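The single-token-per-frame greedy decode described in this commit can be sketched as below: one argmax per encoder frame, always advancing to the next frame, which is what prevents the repetition loops. The `joiner` closure stands in for the CoreML joiner model (assumed signature); the real decoder uses vDSP for the argmax.

```swift
/// Single-token-per-frame greedy RNNT decode: take one argmax per
/// encoder frame, emit it only if non-blank, and always advance to the
/// next frame (preventing token repetition loops).
func greedyDecode(encoderFrames: [[Float]],
                  joiner: ([Float], [Int]) -> [Float],
                  blankId: Int = 0, contextSize: Int = 2) -> [Int] {
    var context = [Int](repeating: blankId, count: contextSize)
    var tokens: [Int] = []
    for frame in encoderFrames {
        let logits = joiner(frame, context)
        // Plain argmax here for clarity.
        let best = logits.indices.max { logits[$0] < logits[$1] }!
        if best != blankId {
            tokens.append(best)
            context = Array((context + [best]).suffix(contextSize))
        }
    }
    return tokens
}
```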
- ZipformerRnntDecoder: add beamDecode() with configurable beam width,
  LM weight, and top-K token candidates per frame
- Word-level LM scoring at SentencePiece boundaries
- DecodingMethod enum (.greedy, .beamSearch) in ASRConfig
- AsrManager: route zipformer2 to beam/greedy based on config,
  add arpaLanguageModel property and setLanguageModel()

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
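The beam search added here (configurable beam width and top-K candidates per frame) can be sketched as a single expand-and-prune step; the word-level LM scoring at SentencePiece boundaries is omitted for brevity, and the types and names below are illustrative, not the PR's API.

```swift
struct Hypothesis { var tokens: [Int]; var score: Float }

/// One beam-search step: expand each hypothesis with the top-K tokens
/// of the current frame's log-probs, then keep the best `beamWidth`
/// hypotheses overall by cumulative score.
func beamStep(_ beams: [Hypothesis], logProbs: [Float],
              beamWidth: Int = 4, topK: Int = 5) -> [Hypothesis] {
    let candidates = logProbs.indices
        .sorted { logProbs[$0] > logProbs[$1] }
        .prefix(topK)
    var expanded: [Hypothesis] = []
    for beam in beams {
        for token in candidates {
            expanded.append(Hypothesis(tokens: beam.tokens + [token],
                                       score: beam.score + logProbs[token]))
        }
    }
    return Array(expanded.sorted { $0.score > $1.score }.prefix(beamWidth))
}
```

In the full decoder an LM-weighted term would be added to `score` whenever a hypothesis crosses a word boundary.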

3 participants