feat: add Zipformer2 transducer support for Vosk/sherpa-onnx models #443
Draft
JarbasAl wants to merge 7 commits into FluidInference:main from

Conversation
Add AsrModelVersion.zipformer2 for icefall Zipformer2 transducer CoreML models.

Key differences from Parakeet TDT:
- Stateless decoder (context window of token IDs, no LSTM states)
- Standard RNNT greedy decode (no duration prediction)
- blank_id=0, 80 mel bins, encoder output shape [1, T, D] not [1, D, T]

New files:
- ZipformerRnntDecoder.swift: greedy RNNT decode with vDSP argmax

Modified:
- AsrModelVersion: add .zipformer2 with properties (blankId, contextSize, requiresMelInput, hasStatelessDecoder, melBins)
- ModelNames: add Zipformer2 enum and Repo.zipformer2
- AsrManager: route .zipformer2 to ZipformerRnntDecoder
- CLI switches: handle new enum case
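The stateless-decoder greedy loop described above can be sketched as follows. This is an illustrative model of the algorithm, not the actual ZipformerRnntDecoder API: `runDecoder` and `runJoiner` stand in for the CoreML decoder/joiner models, and plain `max` replaces the vDSP argmax.

```swift
// Greedy RNNT decode with a stateless decoder: the decoder sees only a
// fixed-size window of previous token IDs, so there is no LSTM state to carry.
func greedyRnntDecode(
    encoderFrames: [[Float]],                 // [T][D] encoder output
    contextSize: Int,                         // stateless decoder context window
    blankId: Int,                             // 0 for Zipformer2
    runDecoder: ([Int]) -> [Float],           // token context -> decoder output
    runJoiner: ([Float], [Float]) -> [Float]  // (enc, dec) -> vocab logits
) -> [Int] {
    var tokens: [Int] = []
    // Initial context is blank-padded.
    var context = [Int](repeating: blankId, count: contextSize)
    var decoderOut = runDecoder(context)
    for frame in encoderFrames {
        let logits = runJoiner(frame, decoderOut)
        let best = logits.indices.max { logits[$0] < logits[$1] }!
        if best != blankId {
            tokens.append(best)
            // Slide the context window and re-run the decoder only on emission.
            context = Array((context + [best]).suffix(contextSize))
            decoderOut = runDecoder(context)
        }
    }
    return tokens
}
```

The contrast with Parakeet TDT is that there is no duration prediction here: every frame emits at most one non-blank token and the loop advances one frame at a time.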
Complete the Zipformer2 integration so it works end-to-end:
- AsrManager: initialize AudioMelSpectrogram (80-bin kaldi fbank, no preemphasis, periodic window) when the model version requires mel input
- AsrTranscription: add an executeZipformerInference path that computes the mel spectrogram from raw audio, feeds it to the encoder, then runs greedy RNNT decode
- AsrModels: add loadZipformer2(from:) for loading models directly from a local directory (encoder/decoder/joiner.mlpackage + vocab.json)
- Clean up melSpectrogram on AsrManager.cleanup()
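The fbank settings named above can be summarized in a small config struct. The field names are assumptions for illustration (not the actual AudioMelSpectrogram API), and the 25 ms / 10 ms frame geometry is the standard kaldi default, assumed rather than stated in the commit:

```swift
// Kaldi-style fbank parameters for Zipformer2, per the commit:
// 80 mel bins, preemphasis disabled, periodic window.
struct KaldiFbankConfig {
    var sampleRate: Double = 16_000
    var numMelBins: Int = 80           // Zipformer2 expects 80 mel bins
    var preemphasis: Double = 0.0      // disabled
    var windowIsPeriodic: Bool = true  // periodic (FFT-style) window
    var frameLengthMs: Double = 25     // assumed kaldi default
    var frameShiftMs: Double = 10      // assumed kaldi default

    var frameLengthSamples: Int { Int(sampleRate * frameLengthMs / 1000) }
    var frameShiftSamples: Int { Int(sampleRate * frameShiftMs / 1000) }
}
```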
Member

@JarbasAl what exactly are the advantages of Zipformer over the other Parakeet models?
- Skip LSTM state initialization for the stateless Zipformer2 decoder (initializeDecoderState was sending Parakeet-style h_in/c_in inputs to the Zipformer2 model, which expects only 'y')
- Auto-compile .mlpackage to .mlmodelc on first load
- Use .all compute units for Zipformer2 (avoids slow ANE compilation)
- Pad/truncate mel frames to the encoder's fixed input size (1495 frames)
- Add --model-version zipformer2 CLI option with --model-dir support

Tested:
swift run fluidaudiocli transcribe audio.wav --model-version zipformer2 --model-dir /path/to/vosk-0.62-atc-int8
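The pad/truncate step above is a simple shape fix: the traced encoder takes a fixed number of mel frames, so shorter inputs are zero-padded and longer ones are cut. A minimal sketch, assuming a [frames][bins] layout and zero padding (both assumptions, not the actual implementation):

```swift
// Fit a mel spectrogram to the encoder's fixed input length (1495 frames
// per the commit). Truncates long inputs, zero-pads short ones.
func fitMelFrames(_ mel: [[Float]], to target: Int = 1495,
                  padValue: Float = 0) -> [[Float]] {
    if mel.count >= target {
        return Array(mel.prefix(target))
    }
    let bins = mel.first?.count ?? 80
    let padding = [[Float]](
        repeating: [Float](repeating: padValue, count: bins),
        count: target - mel.count
    )
    return mel + padding
}
```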
When Zipformer2 models are exported with --fuse-mel, the Preprocessor
takes raw audio (1, 239120) like Parakeet — no external mel needed.
- loadZipformer2: auto-detect fused (Preprocessor.mlpackage) vs
separate encoder, set hasFusedMel flag
- AsrModels.hasFusedMel: controls whether mel extraction is external
- usesSplitFrontend: returns false for fused Zipformer2
- Dynamic encoder output key resolution (encoder/encoder_out)
- Version-specific maxAudioSamples (239120 for Zipformer2)
Tested with real audio via CLI:
swift run fluidaudiocli transcribe audio.wav \
--model-version zipformer2 \
--model-dir /path/to/vosk-0.62-atc-fused
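The fused-vs-separate auto-detection above amounts to a file-existence check in the model directory. A hedged sketch (the file name Preprocessor.mlpackage comes from the commit; the helper itself is illustrative, not the actual loadZipformer2 code):

```swift
import Foundation

// If the export was fused (--fuse-mel), the directory contains a
// Preprocessor.mlpackage that takes raw audio directly, so no external
// mel extraction is needed.
func hasFusedPreprocessor(in modelDir: URL) -> Bool {
    let fused = modelDir.appendingPathComponent("Preprocessor.mlpackage")
    return FileManager.default.fileExists(atPath: fused.path)
}
```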
Remove the non-fused mel path — Zipformer2 now requires a fused Preprocessor.mlpackage (audio_signal → encoder features), the same interface as Parakeet tdtCtc110m.

Removed:
- requiresMelInput, hasFusedMel, melBins properties
- melSpectrogram instance and its init/cleanup on AsrManager
- executeZipformerInference method (80+ lines)
- Complex isAvailable branching for mel mode

Zipformer2 is now just another fused encoder variant: hasFusedEncoder=true, same preprocessor flow as Parakeet.
Collaborator

Is there any way you can convert these ONNX models to CoreML?
- Compute valid encoder frames from the audio length for traced models (encoder_out_lens is a traced constant, not the actual frame count)
- Switch to single-token-per-frame greedy decode matching the Python reference (prevents token repetition loops)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
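Because encoder_out_lens is baked into the trace, the valid frame count has to be recomputed from the raw sample count. A rough sketch of that arithmetic; the 10 ms hop and the overall 4x subsampling factor are assumptions about the export, not values confirmed by this PR:

```swift
// Derive the number of valid encoder output frames from the audio length,
// so decoding can stop before the zero-padded tail of the fixed-size input.
func validEncoderFrames(sampleCount: Int,
                        sampleRate: Int = 16_000,
                        hopMs: Int = 10,        // assumed mel hop
                        subsampling: Int = 4)   // assumed encoder subsampling
-> Int {
    let melFrames = sampleCount / (sampleRate * hopMs / 1000)
    return melFrames / subsampling
}
```

Greedy decode then iterates only over the first `validEncoderFrames` frames instead of trusting the traced length.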
- ZipformerRnntDecoder: add beamDecode() with configurable beam width, LM weight, and top-K token candidates per frame
- Word-level LM scoring at SentencePiece boundaries
- DecodingMethod enum (.greedy, .beamSearch) in ASRConfig
- AsrManager: route zipformer2 to beam/greedy based on config; add arpaLanguageModel property and setLanguageModel()

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
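One step of the beam search described above can be sketched as: expand each live hypothesis with its top-K tokens for the current frame, fold in an LM contribution weighted by `lmWeight` on non-blank emissions, then keep the best `beamWidth` hypotheses. This is an illustrative model, not the actual beamDecode() signature, and it scores the LM per token rather than at SentencePiece word boundaries as the real implementation does:

```swift
struct Hypothesis {
    var tokens: [Int]
    var score: Float
}

// One frame of beam expansion.
func beamStep(
    beams: [Hypothesis],
    logProbs: (Hypothesis) -> [Float],  // joiner log-probs for this frame
    lmScore: (Int) -> Float,            // external LM contribution per token
    blankId: Int, beamWidth: Int, topK: Int, lmWeight: Float
) -> [Hypothesis] {
    var expanded: [Hypothesis] = []
    for hyp in beams {
        let lp = logProbs(hyp)
        // Only the top-K token candidates per frame are expanded.
        let candidates = lp.indices.sorted { lp[$0] > lp[$1] }.prefix(topK)
        for tok in candidates {
            var next = hyp
            next.score += lp[tok]
            if tok != blankId {
                next.tokens.append(tok)
                next.score += lmWeight * lmScore(tok)
            }
            expanded.append(next)
        }
    }
    // Prune back to the beam width.
    return Array(expanded.sorted { $0.score > $1.score }.prefix(beamWidth))
}
```

With beamWidth=1 and topK=1 this degenerates to the greedy decode, which is why routing between .greedy and .beamSearch can share most of the decoder.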
Summary
- AsrModelVersion.zipformer2 for icefall Zipformer2 transducer CoreML models (Vosk/sherpa-onnx)
- AsrModels.loadZipformer2(from:) for loading local model directories

Architecture differences from Parakeet TDT
- Encoder output shape is [1, T, D], not [1, D, T]

Files changed
- ZipformerRnntDecoder.swift — greedy RNNT decode with vDSP argmax
- AsrModels.swift — AsrModelVersion.zipformer2 properties, loadZipformer2(from:) loader
- AsrManager.swift — mel spectrogram init, cleanup, availability check
- AsrTranscription.swift — executeZipformerInference pipeline (mel → encoder → decode)
- ModelNames.swift — Repo.zipformer2, ModelNames.Zipformer2 enum
- AsrBenchmark.swift, TranscribeCommand.swift — exhaustive switch cases

Usage
Companion PR
Depends on FluidInference/mobius#35 for the CoreML model conversion scripts
(PyTorch checkpoint → encoder/decoder/joiner.mlpackage).