
feat: add CoreML export for Sherpa-ONNX Zipformer2 transducer #35

Draft
JarbasAl wants to merge 6 commits into FluidInference:main from TigreGotico:feat/sherpa-onnx-zipformer-coreml

Conversation

@JarbasAl (Contributor) commented Mar 27, 2026

Summary

  • Convert icefall Zipformer2 transducer checkpoints (Vosk/sherpa-onnx) to CoreML from original PyTorch .pt files
  • Exports encoder, stateless decoder, and joiner as three .mlpackage files
  • Includes comparison script, int8 quantization (3.5x compression), mel extraction, and greedy RNNT decoder

Key technical challenges solved

  • Patched coremltools _cast to handle numpy ndarray constants from aten::Int ops
  • Replaced dynamic reshapes in Zipformer2 attention with unflatten/flatten
  • Froze CompactRelPositionalEncoding outputs via forward hooks to eliminate pe.size() indexing during tracing
  • Patched Conv2dSubsampling to use permute+flatten instead of reshape(b,t,c*f)
  • convert_num_channels uses constant_pad_nd instead of zeros+cat
  • SimpleUpsample uses repeat_interleave instead of expand+reshape
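To illustrate the reshape point above: reading `x.shape` inside a traced model produces `aten::Int` ops that coremltools may reject, while `unflatten` keeps the split symbolic. A minimal sketch (function names are hypothetical, not the PR's code):

```python
import torch

def split_heads_reshape(x: torch.Tensor, num_heads: int) -> torch.Tensor:
    # Reads dynamic shapes; under torch.jit.trace these become tensor ops
    # (aten::Int) that the CoreML converter can choke on.
    b, t, d = x.shape
    return x.reshape(b, t, num_heads, d // num_heads)

def split_heads_unflatten(x: torch.Tensor, num_heads: int) -> torch.Tensor:
    # No shape reads: unflatten infers the per-head dimension with -1.
    return x.unflatten(-1, (num_heads, -1))

x = torch.randn(2, 10, 64)
assert torch.equal(split_heads_reshape(x, 8), split_heads_unflatten(x, 8))
```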

Validated accuracy

| Variant | Encoder Size | Cosine Similarity | Transcription     |
|---------|--------------|-------------------|-------------------|
| FP32    | 258 MB       | 1.000000          | Exact match       |
| INT8    | 75 MB        | 0.999861          | Near-match        |
| FP16    | 129 MB       | 0.990837          | Diverges at tail  |
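The cosine-similarity column is presumably computed along these lines (a minimal sketch of the metric; the actual compare-models.py may differ):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Flatten both encoder outputs and compare direction, ignoring scale.
    a = a.ravel().astype(np.float64)
    b = b.ravel().astype(np.float64)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

ref = np.random.randn(1, 100, 512)
assert abs(cosine_similarity(ref, ref) - 1.0) < 1e-9  # identical outputs
```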

Test plan

  • uv sync && uv run python convert-coreml.py --checkpoint <path> --tokens <path> --output-dir ./build/test
  • uv run python compare-models.py --checkpoint <path> --tokens <path> --coreml-dir ./build/test --audio-file <wav>
  • uv run python quantize-coreml.py --input-dir ./build/test --output-dir ./build/test-int8

🤖 Generated with Claude Code



Convert icefall Zipformer2 transducer checkpoints (used by Vosk and
sherpa-onnx) to CoreML .mlpackage format from original PyTorch .pt
checkpoints. Exports encoder, stateless decoder, and joiner as three
separate models.

Key challenges solved:
- Patched coremltools _cast to handle numpy ndarray constants from
  aten::Int ops in traced graphs
- Replaced dynamic reshapes in Zipformer2 attention with
  unflatten/flatten to avoid traced shape variables
- Froze CompactRelPositionalEncoding outputs via forward hooks to
  eliminate pe.size() indexing during tracing
- Patched Conv2dSubsampling to use permute+flatten instead of
  reshape(b,t,c*f)
- Replaced convert_num_channels zeros+cat with constant_pad_nd
- SimpleUpsample uses repeat_interleave instead of expand+reshape
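The forward-hook freezing technique mentioned above can be sketched generically: capture a submodule's output on the first pass, then replay the captured tensor on later (traced) passes so no shape indexing runs. This is an assumed shape of the idea, not the PR's `_freeze_rel_pos_encoding` itself:

```python
import torch
from torch import nn

def freeze_outputs(module: nn.Module, target_type) -> tuple[dict, list]:
    """Capture outputs of submodules of target_type, then replay them."""
    captured, hooks = {}, []
    for name, sub in module.named_modules():
        if isinstance(sub, target_type):
            def hook(mod, inputs, output, key=name):
                if key in captured:
                    return captured[key]          # replace with frozen tensor
                captured[key] = output.detach()   # first pass: capture
            hooks.append(sub.register_forward_hook(hook))
    return captured, hooks

m = nn.Sequential(nn.ReLU())
captured, hooks = freeze_outputs(m, nn.ReLU)
m(torch.tensor([-1.0, 2.0]))                      # warm-up pass captures [0., 2.]
out = m(torch.tensor([5.0, 5.0]))                 # later passes replay it
assert torch.equal(out, torch.tensor([0.0, 2.0]))
```

Returning a non-None value from a forward hook replaces the module's output, which is what lets the frozen constant stand in during tracing.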

Includes comparison script (PyTorch vs CoreML encoder cosine similarity
and greedy RNNT transcription), int8 quantization (3.5x compression),
mel spectrogram extraction, and greedy RNNT decoder.
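The greedy RNNT decode over the three exported parts roughly follows this loop (a hedged sketch: `decoder`/`joiner` stand in for the CoreML model calls, the blank id of 0 and at-most-one-symbol-per-frame are simplifying assumptions):

```python
import numpy as np

def greedy_rnnt(encoder_out, decoder, joiner, context_size=2, blank_id=0):
    # Stateless decoder: its only "state" is the last context_size tokens.
    hyp = [blank_id] * context_size
    dec_out = decoder(np.array([hyp[-context_size:]]))
    for t in range(encoder_out.shape[1]):          # one step per encoder frame
        logits = joiner(encoder_out[:, t], dec_out)
        tok = int(np.argmax(logits))
        if tok != blank_id:                        # emit and refresh decoder
            hyp.append(tok)
            dec_out = decoder(np.array([hyp[-context_size:]]))
    return hyp[context_size:]                      # drop the initial padding
```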

Validated: FP32 cosine=1.000000, INT8 cosine=0.999861, FP16
cosine=0.991. Encoder 258MB FP32 → 75MB INT8.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@devin-ai-integration bot left a comment


Devin Review found 3 potential issues.

View 6 additional findings in Devin Review.


Comment on lines +484 to +487
```python
dec_path = output_dir / "decoder.mlpackage"
dec_ml.short_description = f"Zipformer2 Stateless Decoder (context_size={context_size})"
dec_ml.author = AUTHOR
dec_ml.save(str(dec_path))
```

🔴 Missing shutil.rmtree before saving decoder and joiner mlpackages causes failure on re-runs

The encoder save at convert-coreml.py:458-459 correctly checks for an existing .mlpackage directory and removes it before calling save(), but the decoder save at line 487 and the joiner save at line 520 lack this protection. CoreML's save() for .mlpackage format (which creates a directory structure) will fail if the target path already exists. This means re-running the conversion with the same --output-dir will succeed for the encoder (existing package is cleaned up) but fail for the decoder or joiner. The quantize-coreml.py at lines 89-91 follows the correct pattern of rmtree before save, confirming this is an oversight.

Suggested change

```diff
 dec_path = output_dir / "decoder.mlpackage"
 dec_ml.short_description = f"Zipformer2 Stateless Decoder (context_size={context_size})"
 dec_ml.author = AUTHOR
+if dec_path.exists():
+    shutil.rmtree(dec_path)
 dec_ml.save(str(dec_path))
```

Comment on lines +517 to +520
```python
join_path = output_dir / "joiner.mlpackage"
join_ml.short_description = "Zipformer2 Joiner (encoder_out + decoder_out -> logits)"
join_ml.author = AUTHOR
join_ml.save(str(join_path))
```

🔴 Missing shutil.rmtree before saving joiner mlpackage causes failure on re-runs

Same issue as the decoder save — the joiner save at line 520 is missing the if exists: shutil.rmtree() guard that the encoder save has at convert-coreml.py:458-459. This will cause the conversion to fail when the output directory already contains joiner.mlpackage from a previous run.

Suggested change

```diff
 join_path = output_dir / "joiner.mlpackage"
 join_ml.short_description = "Zipformer2 Joiner (encoder_out + decoder_out -> logits)"
 join_ml.author = AUTHOR
+if join_path.exists():
+    shutil.rmtree(join_path)
 join_ml.save(str(join_path))
```

```python
# Patch Conv2dSubsampling to avoid aten::Int ops that coremltools rejects
# ---------------------------------------------------------------------------

def _freeze_rel_pos_encoding(module: nn.Module) -> None:
```

🟡 _freeze_rel_pos_encoding return type annotated as None but returns a tuple

The function _freeze_rel_pos_encoding at line 145 is annotated -> None but actually returns (captured, hooks) at line 178. The caller at convert-coreml.py:426 correctly unpacks the return value as captured, hooks = _freeze_rel_pos_encoding(encoder). While this works at runtime (Python doesn't enforce return type annotations), it's incorrect and would be flagged by any type checker as trying to unpack None.

Suggested change

```diff
-def _freeze_rel_pos_encoding(module: nn.Module) -> None:
+def _freeze_rel_pos_encoding(module: nn.Module) -> Tuple[dict, list]:
```

Note: this requires `from typing import Tuple` (or the builtin `tuple[dict, list]` on Python 3.9+).

JarbasAl and others added 5 commits March 27, 2026 18:32
Add FusedPreprocessorForExport that bakes kaldi fbank mel extraction
into the CoreML encoder. When --fuse-mel is passed, the output is
Preprocessor.mlpackage taking raw audio (1, 239120) like Parakeet,
instead of encoder.mlpackage taking mel frames (1, 1495, 80).

New file: fused_fbank.py — traceable kaldi-compatible fbank extractor
using index-gather framing (no as_strided/unfold which coremltools
rejects). Includes preemphasis (0.97), DC offset removal, povey
window, and HTK mel filterbank via torchaudio.

Both export modes preserved: --fuse-mel for FluidAudio integration,
default mel-frames mode for standalone encoder validation.
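The "index-gather framing" mentioned above presumably looks something like this: build the frame start/offset index grid explicitly and gather, instead of `as_strided`/`unfold`, which the converter rejects. A sketch under assumed kaldi-style win=400, hop=160:

```python
import torch

def frame_signal(wave: torch.Tensor, win: int = 400, hop: int = 160) -> torch.Tensor:
    # Explicit index grid: (num_frames, win) of sample positions.
    num_frames = 1 + (wave.shape[-1] - win) // hop
    starts = torch.arange(num_frames).unsqueeze(1) * hop   # (T, 1)
    idx = starts + torch.arange(win).unsqueeze(0)          # (T, win)
    return wave[..., idx]                                  # gather, no striding

x = torch.arange(1000.0)
frames = frame_signal(x)
assert frames.shape == (4, 400)          # 1 + (1000 - 400) // 160 = 4
assert torch.equal(frames[1], x[160:560])
```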

Fused preprocessor (audio → encoder features) is now the default.
Use --no-fuse-mel for standalone encoder with mel frame input.

Fixes to fused_fbank.py:
- Use high_freq=0.0 (Nyquist=8000) for mel filterbank, matching
  torchaudio.kaldi.fbank default. Was using 7600 (kaldi CLI default).
- Fix padding offset: left_pad = win//2 - hop//2 = 120, not win//2 = 200
- Apply preemphasis per-frame after extraction (kaldi order), not on
  the full waveform before framing
- Use reflection padding (torch.flip + cat) instead of zero padding
- Use machine epsilon for log floor instead of energy_floor=1.0

Verified with debug-fbank.py: all 10 processing steps pass with
cosine=1.000000 and max_diff=0.000000 against torchaudio reference.

New file: debug-fbank.py — step-by-step comparison script for
validating fbank parity at each processing stage.
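The per-frame preemphasis ordering fix above can be sketched as follows (an assumed minimal form, not the fused_fbank.py code): each frame is filtered against its own shifted copy, with the first sample replicated kaldi-style, so neighboring samples never leak across frame boundaries.

```python
import torch

def preemphasize_frames(frames: torch.Tensor, coeff: float = 0.97) -> torch.Tensor:
    # kaldi replicates each frame's first sample as its "previous" sample:
    # x[0] -= coeff * x[0]; x[i] -= coeff * x[i-1] for i > 0.
    prev = torch.cat([frames[..., :1], frames[..., :-1]], dim=-1)
    return frames - coeff * prev

f = torch.ones(2, 5)
out = preemphasize_frames(f)
# constant frames collapse to 1 - 0.97 everywhere
assert torch.allclose(out, torch.full((2, 5), 0.03))
```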

- model_avg has substantially different weights (cosine=0.82 vs ONNX)
  causing ~26% WER instead of ~9%
- Add compare-pipeline.py for stage-by-stage PyTorch vs CoreML validation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>