# Parakeet‑TDT v2 (0.6B) — CoreML Export, Parity, and Quantization

Tools to export NVIDIA Parakeet‑TDT v2 (0.6B) RNNT ASR to CoreML, validate numerical parity with the NeMo reference, measure latency, and explore quantization trade‑offs. All CoreML components use a fixed 15‑second audio window for export and validation.

## Environment

1. Create or reuse the local environment with `uv venv`.
2. Activate the repo `.venv` and install deps via `uv pip sync`.
3. Run everything through `uv run` to keep resolutions reproducible.

## Test Environment

All tests and measurements referenced here were run on an Apple M4 Pro with 48 GB of RAM.

## Export CoreML packages

The converter exports the preprocessor, encoder, decoder, joint, and two fused variants (mel+encoder and joint+decision). Shapes and I/O match the fixed 15‑second window contract.

```
uv run python convert-parakeet.py convert \
  --nemo-path /path/to/parakeet-tdt-0.6b-v2.nemo \
  --output-dir parakeet_coreml
```

Notes
- Minimum deployment target: iOS 17. Export uses CPU_ONLY by default; runtime compute units can be set when loading the model from Python or Swift (see the sketch below).
- Audio is 16 kHz, single‑channel. The 15 s window is enforced during export and validation.
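
A minimal sketch of overriding compute units at load time with coremltools; the package filename is illustrative, so substitute whatever `convert-parakeet.py` actually wrote:

```python
import coremltools as ct

# Request CPU + Neural Engine at load time (package path is hypothetical).
encoder = ct.models.MLModel(
    "parakeet_coreml/Encoder.mlpackage",
    compute_units=ct.ComputeUnit.CPU_AND_NE,  # or ALL / CPU_ONLY / CPU_AND_GPU
)
```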

## Validate parity and speed (Torch vs CoreML)

The comparison tool runs Torch and CoreML side by side on the same 15 s input, records diffs and latency, and saves plots under `plots/compare-components/`. It also updates `parakeet_coreml/metadata.json` with all measurements.

```
uv run python compare-components.py compare \
  --output-dir parakeet_coreml \
  --model-id nvidia/parakeet-tdt-0.6b-v2 \
  --runs 10 --warmup 3
```
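
The schema of `metadata.json` isn't spelled out here, so a minimal sketch for inspecting whatever the compare step recorded (key names are not guaranteed):

```python
import json

# Peek at the measurements recorded by compare-components.py.
with open("parakeet_coreml/metadata.json") as f:
    metadata = json.load(f)

print(json.dumps(metadata, indent=2))
```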

Output comparison: see the quality plots under `plots/compare-components/` (listed below).

Latency: see the latency plots under `plots/compare-components/` (listed below).

Quantization: see the plots under `plots/quantize/all/` (listed below).
### Key results (quality first)

Numerical parity is strong across components on the fixed window:
- Preprocessor mel: match=true; max_abs≈0.484, max_rel≈2.00 (near‑zero bins inflate relative error).
- Encoder: match=true; max_abs≈0.0054, strong agreement over time (see plot).
- Decoder h/c state: match=true; value deltas within tolerance.
- Joint logits: match=true; max_abs≈0.099, distributions align (see top‑k plot).
- Joint+Decision: the fused CoreML head exactly matches decisions computed on CoreML logits (token_id/prob/duration). PyTorch logits produce slightly different argmax paths, as expected from small logit differences.

### Speed (latency and RTF)

Component latency on a 15 s clip, Torch CPU vs CoreML (CPU+NE), from `parakeet_coreml/metadata.json`:
- Encoder: Torch 1030.48 ms → CoreML 25.44 ms (≈40.5× faster, RTF 0.00170)
- Preprocessor: 1.99 ms → 1.19 ms (≈1.68×)
- Joint: 28.34 ms → 22.66 ms (≈1.25×)
- Decoder (U=1): 7.51 ms → 4.32 ms (≈1.73×)

Fused paths:
- Mel+Encoder (Torch separate vs CoreML fused): 1032.48 ms → 27.10 ms (≈38.1× faster)
- Joint+Decision (CoreML joint + CPU post vs fused CoreML head): 50.05 ms → 64.09 ms. The fused head is slower here; prefer the CoreML joint plus a lightweight CPU decision step on the host (sketched below).
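
A minimal sketch of that lightweight CPU decision step, assuming the TDT joint output concatenates token logits (including blank) with a small block of duration logits; the split point and names are assumptions, not the tool's actual API:

```python
import numpy as np

def tdt_greedy_decision(logits: np.ndarray, num_durations: int = 5):
    """Greedy TDT decision for one (t, u) step over 1-D joint logits.

    Assumes the last `num_durations` entries are duration logits and the
    rest are token logits; adjust to the real output layout.
    """
    token_logits = logits[:-num_durations]
    duration_logits = logits[-num_durations:]

    token_id = int(np.argmax(token_logits))
    # Softmax probability of the chosen token (numerically stable).
    exps = np.exp(token_logits - token_logits.max())
    token_prob = float(exps[token_id] / exps.sum())
    duration = int(np.argmax(duration_logits))
    return token_id, token_prob, duration
```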

Plots
- Latency bars and speedups: `plots/compare-components/latency_summary.png`, `plots/compare-components/latency_speedup.png`
- Fused vs separate: `plots/compare-components/latency_fused_vs_separate.png`, `plots/compare-components/latency_fused_speedup.png`
- Quality visuals: mel composite (`mel_composite.png`), encoder L2 over time (`encoder_time_l2.png`), decoder step L2 (`decoder_steps_l2.png`), joint top‑k/time L2 (`joint_top50.png`, `joint_time_l2.png`), joint‑decision agreement (`joint_decision_token_agree.png`, `joint_decision_prob_u0.png`).

## Quantization (size • quality • speed)

`uv run python quantize_coreml.py` evaluates several variants and writes a roll‑up to `parakeet_coreml_quantized/quantization_summary.json`. Plots are mirrored to `plots/quantize/<compute_units>/` (we include `plots/quantize/all/`). Quality here is reported as 1 − normalized L2 error, so 1.0 means identical outputs (a sketch of the metric follows). For JointDecision we report token‑id match rate, duration match, and token‑prob MAE.
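
A minimal sketch of that quality metric; normalizing by the reference norm is an assumption about how the script computes it:

```python
import numpy as np

def quality_score(reference: np.ndarray, candidate: np.ndarray) -> float:
    """1 - normalized L2 error; 1.0 means identical outputs."""
    err = np.linalg.norm(candidate - reference)
    return float(1.0 - err / np.linalg.norm(reference))
```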

Quick highlights (ComputeUnits=ALL):
- int8 linear (per‑channel): ~2.0× smaller across components with minimal quality loss (see the quantization sketch after this list)
  - MelEncoder quality≈0.963; latency≈31.13 ms (baseline≈29.34 ms)
  - JointDecision acc≈0.995; latency≈1.96 ms (baseline≈2.15 ms)
- int8 linear (per‑tensor symmetric): large encoder quality drop (≈0.50) — not recommended
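
A minimal sketch of how a per‑channel int8 variant can be produced with coremltools (this shows the underlying API, not necessarily what `quantize_coreml.py` does internally; the `granularity` argument assumes coremltools ≥ 8, and the paths are illustrative):

```python
import coremltools as ct
import coremltools.optimize.coreml as cto

# Load a baseline package produced by the export step.
model = ct.models.MLModel("parakeet_coreml/Encoder.mlpackage")

# Per-channel symmetric int8 weight quantization.
op_config = cto.OpLinearQuantizerConfig(
    mode="linear_symmetric",
    dtype="int8",
    granularity="per_channel",
)
config = cto.OptimizationConfig(global_config=op_config)
quantized = cto.linear_quantize_weights(model, config)
quantized.save("parakeet_coreml_quantized/Encoder_int8.mlpackage")
```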

Quantization plots (ALL)
- Fused: `plots/quantize/all/fused_quality.png`, `fused_latency.png`, `fused_compression.png`, `fused_size.png`
- Component breakdown: `plots/quantize/all/all_components_quality.png`, `all_components_latency.png`, `all_components_compression.png`, `all_components_size.png`, `all_components_compile.png`

## Reproduce the figures

1) Export baseline CoreML packages
```
uv run python convert-parakeet.py convert --model-id nvidia/parakeet-tdt-0.6b-v2 --output-dir parakeet_coreml
```

2) Compare Torch vs CoreML and generate parity/latency plots
```
uv run python compare-components.py compare --output-dir parakeet_coreml --runs 10 --warmup 3
```

3) Run quantization sweeps (mirrors plots into `plots/quantize/<compute_units>/`)
```
uv run python quantize_coreml.py \
  --input-dir parakeet_coreml \
  --output-root parakeet_coreml_quantized \
  --compute-units ALL --runs 10
```

Examples
- Encoder 6‑bit palette only:
  `uv run python quantize_coreml.py -c encoder-palettize`
- MelEncoder 6‑bit palette only:
  `uv run python quantize_coreml.py -c mel-palettize`

By default, the script derives the component whitelist from the selected variants; use `-m/--component` to restrict it explicitly, or `-m all` to force all components. A palettization sketch follows.
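
A minimal sketch of 6‑bit k‑means weight palettization with coremltools, which is presumably what the palette variants apply under the hood (paths illustrative, not the script's exact flow):

```python
import coremltools as ct
import coremltools.optimize.coreml as cto

# Palettize encoder weights to a 64-entry (6-bit) k-means lookup table.
model = ct.models.MLModel("parakeet_coreml/Encoder.mlpackage")
op_config = cto.OpPalettizerConfig(mode="kmeans", nbits=6)
config = cto.OptimizationConfig(global_config=op_config)
palettized = cto.palettize_weights(model, config)
palettized.save("parakeet_coreml_quantized/Encoder_palette6.mlpackage")
```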

## Notes & limits

- Fixed 15‑second window shapes are required for all CoreML exports and validations.
- Latency measurements are host‑side CoreML predictions (CPU+NE or ALL); on‑device results can differ by chip and OS.
- For streaming decode, the exported decoder uses U=1 inputs with explicit LSTM state I/O (see the sketch below).
- Minimum deployment target is iOS 17; models are saved as MLProgram and eligible for ANE when loaded with `ComputeUnits=ALL`.
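
A minimal sketch of one streaming decoder step; the I/O names (`targets`, `h_in`, `c_in`, `h_out`, `c_out`) and state shapes are hypothetical, so inspect the exported package's spec for the real ones:

```python
import numpy as np
import coremltools as ct

decoder = ct.models.MLModel("parakeet_coreml/Decoder.mlpackage")  # path illustrative

# Hypothetical state shapes; check decoder.get_spec() for the actual I/O.
h = np.zeros((2, 1, 640), dtype=np.float32)   # LSTM hidden state
c = np.zeros((2, 1, 640), dtype=np.float32)   # LSTM cell state
prev_token = np.array([[0]], dtype=np.int32)  # last emitted token, U=1

out = decoder.predict({"targets": prev_token, "h_in": h, "c_in": c})
h, c = out["h_out"], out["c_out"]             # carry state into the next step
```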

## Acknowledgements

- Parakeet‑TDT v2 model from NVIDIA NeMo (`nvidia/parakeet-tdt-0.6b-v2`).
- This directory provides export/validation utilities and plots to help the community reproduce quality and performance on Apple devices.