feat: add CoreML export for Sherpa-ONNX Zipformer2 transducer #35
JarbasAl wants to merge 6 commits into FluidInference:main
Conversation
Convert icefall Zipformer2 transducer checkpoints (used by Vosk and sherpa-onnx) to CoreML .mlpackage format from original PyTorch .pt checkpoints. Exports encoder, stateless decoder, and joiner as three separate models.

Key challenges solved:
- Patched coremltools _cast to handle numpy ndarray constants from aten::Int ops in traced graphs
- Replaced dynamic reshapes in Zipformer2 attention with unflatten/flatten to avoid traced shape variables
- Froze CompactRelPositionalEncoding outputs via forward hooks to eliminate pe.size() indexing during tracing
- Patched Conv2dSubsampling to use permute+flatten instead of reshape(b, t, c*f)
- Replaced convert_num_channels zeros+cat with constant_pad_nd
- SimpleUpsample uses repeat_interleave instead of expand+reshape

Includes a comparison script (PyTorch vs CoreML encoder cosine similarity and greedy RNNT transcription), int8 quantization (3.5x compression), mel spectrogram extraction, and a greedy RNNT decoder.

Validated: FP32 cosine=1.000000, INT8 cosine=0.999861, FP16 cosine=0.991. Encoder 258MB FP32 → 75MB INT8.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
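Two of the rewrites listed above can be illustrated with a minimal sketch. The tensor sizes below are hypothetical, chosen only for illustration; the real model uses its own dimensions.

```python
import torch
import torch.nn.functional as F

# Hypothetical attention tensor: (batch, time, heads * head_dim).
x = torch.randn(2, 10, 512)
heads, head_dim = 8, 64

# reshape() on traced size variables bakes aten::Int ops into the graph;
# unflatten/flatten take static dim arguments and convert cleanly.
y = x.unflatten(-1, (heads, head_dim))   # (2, 10, 8, 64)
assert torch.equal(y.flatten(-2), x)     # round-trips back to (2, 10, 512)

# zeros+cat channel padding needs a zeros() sized from a traced shape;
# F.pad with constant pad amounts lowers to constant_pad_nd instead.
c = torch.randn(2, 5, 192)
target = 256
padded_cat = torch.cat([c, torch.zeros(2, 5, target - 192)], dim=-1)
padded_pad = F.pad(c, (0, target - 192))  # pad last dim on the right
assert torch.equal(padded_cat, padded_pad)
```

Both replacements are numerically identical to the originals; only the traced graph changes.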
```python
dec_path = output_dir / "decoder.mlpackage"
dec_ml.short_description = f"Zipformer2 Stateless Decoder (context_size={context_size})"
dec_ml.author = AUTHOR
dec_ml.save(str(dec_path))
```
🔴 Missing shutil.rmtree before saving decoder and joiner mlpackages causes failure on re-runs
The encoder save at convert-coreml.py:458-459 correctly checks for an existing .mlpackage directory and removes it before calling save(), but the decoder save at line 487 and the joiner save at line 520 lack this protection. CoreML's save() for .mlpackage format (which creates a directory structure) will fail if the target path already exists. This means re-running the conversion with the same --output-dir will succeed for the encoder (existing package is cleaned up) but fail for the decoder or joiner. The quantize-coreml.py at lines 89-91 follows the correct pattern of rmtree before save, confirming this is an oversight.
```diff
  dec_path = output_dir / "decoder.mlpackage"
  dec_ml.short_description = f"Zipformer2 Stateless Decoder (context_size={context_size})"
  dec_ml.author = AUTHOR
+ if dec_path.exists():
+     shutil.rmtree(dec_path)
  dec_ml.save(str(dec_path))
```
```python
join_path = output_dir / "joiner.mlpackage"
join_ml.short_description = "Zipformer2 Joiner (encoder_out + decoder_out -> logits)"
join_ml.author = AUTHOR
join_ml.save(str(join_path))
```
🔴 Missing shutil.rmtree before saving joiner mlpackage causes failure on re-runs
Same issue as the decoder save — the joiner save at line 520 is missing the if exists: shutil.rmtree() guard that the encoder save has at convert-coreml.py:458-459. This will cause the conversion to fail when the output directory already contains joiner.mlpackage from a previous run.
```diff
  join_path = output_dir / "joiner.mlpackage"
  join_ml.short_description = "Zipformer2 Joiner (encoder_out + decoder_out -> logits)"
  join_ml.author = AUTHOR
+ if join_path.exists():
+     shutil.rmtree(join_path)
  join_ml.save(str(join_path))
```
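Since the same guard is needed in three places, one option is a small helper capturing the pattern. The helper name and the fake model in the usage note are illustrative, not from the PR:

```python
import shutil
from pathlib import Path

def save_mlpackage(model, path: Path) -> None:
    """Remove any stale .mlpackage directory before saving.

    CoreML writes .mlpackage as a directory, and save() fails if the
    target already exists, so re-runs need this guard.
    """
    if path.exists():
        shutil.rmtree(path)
    model.save(str(path))
```

With this helper, encoder, decoder, and joiner saves all become one-liners and re-running with the same --output-dir is safe for all three.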
```python
# Patch Conv2dSubsampling to avoid aten::Int ops that coremltools rejects
# ---------------------------------------------------------------------------


def _freeze_rel_pos_encoding(module: nn.Module) -> None:
```
🟡 _freeze_rel_pos_encoding return type annotated as None but returns a tuple
The function _freeze_rel_pos_encoding at line 145 is annotated -> None but actually returns (captured, hooks) at line 178. The caller at convert-coreml.py:426 correctly unpacks the return value as captured, hooks = _freeze_rel_pos_encoding(encoder). While this works at runtime (Python doesn't enforce return type annotations), it's incorrect and would be flagged by any type checker as trying to unpack None.
```diff
- def _freeze_rel_pos_encoding(module: nn.Module) -> None:
+ def _freeze_rel_pos_encoding(module: nn.Module) -> Tuple[dict, list]:
```
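The hook-based freezing that this function implements can be sketched roughly as follows. This is a simplified stand-in, not the PR's actual code: it captures each matching submodule's output on the first call and replays it as a constant afterwards, so tracing never walks the module's shape logic.

```python
from typing import Dict, List, Tuple
import torch
import torch.nn as nn

def freeze_outputs(root: nn.Module, cls: type) -> Tuple[Dict[str, torch.Tensor], List]:
    """Capture each matching submodule's first output and replay it
    on later calls (a non-None hook return replaces the output)."""
    captured: Dict[str, torch.Tensor] = {}
    hooks = []
    for name, sub in root.named_modules():
        if isinstance(sub, cls):
            def hook(mod, inputs, output, _name=name):
                if _name not in captured:
                    captured[_name] = output.detach()
                return captured[_name]
            hooks.append(sub.register_forward_hook(hook))
    return captured, hooks
```

The caller unpacks `captured, hooks = ...` and later calls `h.remove()` on each hook to restore normal behavior, which is why the annotated return type must be a tuple rather than None.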
Add FusedPreprocessorForExport that bakes kaldi fbank mel extraction into the CoreML encoder. When --fuse-mel is passed, the output is Preprocessor.mlpackage taking raw audio (1, 239120) like Parakeet, instead of encoder.mlpackage taking mel frames (1, 1495, 80).

New file: fused_fbank.py — a traceable kaldi-compatible fbank extractor using index-gather framing (no as_strided/unfold, which coremltools rejects). Includes preemphasis (0.97), DC offset removal, the povey window, and an HTK mel filterbank via torchaudio.

Both export modes are preserved: --fuse-mel for FluidAudio integration, and the default mel-frames mode for standalone encoder validation.
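Index-gather framing, the trick that replaces as_strided/unfold here, can be sketched as below. The function name is illustrative, and the window/hop defaults follow the usual 16 kHz kaldi values (25 ms / 10 ms):

```python
import torch

def frame_signal(wave: torch.Tensor, win: int = 400, hop: int = 160) -> torch.Tensor:
    """Slice a (1, T) waveform into overlapping (num_frames, win) frames
    with a plain index gather, which coremltools can convert, instead of
    as_strided/unfold, which it rejects."""
    num_frames = (wave.shape[1] - win) // hop + 1
    starts = torch.arange(num_frames).unsqueeze(1) * hop  # (F, 1) frame offsets
    idx = starts + torch.arange(win).unsqueeze(0)         # (F, win) sample indices
    return wave[0, idx]
```

Advanced indexing with a precomputed index matrix traces to a gather op with static shapes, so the converter never sees a stride trick.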
Fused preprocessor (audio → encoder features) is now the default. Use --no-fuse-mel for standalone encoder with mel frame input.
Three fixes to fused_fbank.py:
- Use high_freq=0.0 (Nyquist=8000) for the mel filterbank, matching the torchaudio.kaldi.fbank default. Was using 7600 (the kaldi CLI default).
- Fix the padding offset: left_pad = win//2 - hop//2 = 120, not win//2 = 200
- Apply preemphasis per-frame after extraction (kaldi order), not on the full waveform before framing
- Use reflection padding (torch.flip + cat) instead of zero padding
- Use machine epsilon for the log floor instead of energy_floor=1.0

Verified with debug-fbank.py: all 10 processing steps pass with cosine=1.000000 and max_diff=0.000000 against the torchaudio reference.

New file: debug-fbank.py — a step-by-step comparison script for validating fbank parity at each processing stage.
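Two of these fixes can be sketched as small helpers (the names are illustrative, not the PR's actual functions):

```python
import torch

def reflect_pad(wave: torch.Tensor, left: int, right: int) -> torch.Tensor:
    """Reflection padding built from flip+cat: mirrors the signal
    without repeating the edge sample, avoiding pad modes the
    converter may not lower for 1-D input."""
    lpad = torch.flip(wave[:, 1:left + 1], dims=[1])
    rpad = torch.flip(wave[:, -right - 1:-1], dims=[1])
    return torch.cat([lpad, wave, rpad], dim=1)

def preemphasize(frames: torch.Tensor, coeff: float = 0.97) -> torch.Tensor:
    """Kaldi-order preemphasis: applied per frame after framing, with
    the first sample of each frame subtracting itself (x[-1] := x[0])."""
    prev = torch.cat([frames[:, :1], frames[:, :-1]], dim=1)
    return frames - coeff * prev
```

Applying preemphasis per frame (rather than once over the whole waveform) matters because frames overlap: the per-frame version differences each frame's own first sample against itself, exactly as kaldi does.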
- model_avg has substantially different weights (cosine=0.82 vs ONNX), causing ~26% WER instead of ~9%
- Add compare-pipeline.py for stage-by-stage PyTorch vs CoreML validation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
Converts icefall Zipformer2 transducer .pt files to CoreML .mlpackage files.

Key technical challenges solved

- Patched coremltools _cast to handle numpy ndarray constants from aten::Int ops
- Replaced dynamic reshapes in attention with unflatten/flatten
- Froze CompactRelPositionalEncoding outputs via forward hooks to eliminate pe.size() indexing during tracing
- Patched Conv2dSubsampling to use permute+flatten instead of reshape(b,t,c*f)
- convert_num_channels uses constant_pad_nd instead of zeros+cat
- SimpleUpsample uses repeat_interleave instead of expand+reshape

Validated accuracy

FP32 cosine=1.000000, INT8 cosine=0.999861, FP16 cosine=0.991. Encoder 258MB FP32 → 75MB INT8.
Test plan
- uv sync && uv run python convert-coreml.py --checkpoint <path> --tokens <path> --output-dir ./build/test
- uv run python compare-models.py --checkpoint <path> --tokens <path> --coreml-dir ./build/test --audio-file <wav>
- uv run python quantize-coreml.py --input-dir ./build/test --output-dir ./build/test-int8

🤖 Generated with Claude Code