Commit 6bf7833

/nvidia/parakeet-tdt-0.6b-v2 (#4)
1 parent f69655f commit 6bf7833

45 files changed

Lines changed: 9057 additions & 0 deletions

Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
mlpackages/
parakeet_coreml/
parakeet_coreml_quantized/
compiled/
Lines changed: 136 additions & 0 deletions
@@ -0,0 +1,136 @@
# Parakeet‑TDT v2 (0.6B): CoreML Export, Parity, and Quantization

Tools to export NVIDIA Parakeet‑TDT v2 (0.6B) RNNT ASR to CoreML, validate numerical parity with the NeMo reference, measure latency, and explore quantization trade‑offs. All CoreML components use a fixed 15‑second audio window for export and validation.

## Environment

1. Create or reuse the local environment with `uv venv`.
2. Activate the repo `.venv` and install deps via `uv pip sync`.
3. Run everything through `uv run` to keep resolutions reproducible.

## Test Environment

All tests and measurements referenced here were run on an Apple M4 Pro with 48 GB of RAM.

## Export CoreML packages

Exports preprocessor, encoder, decoder, joint, and two fused variants (mel+encoder, joint+decision). Shapes and I/O match the fixed 15‑second window contract.

```
uv run python convert-parakeet.py convert \
  --nemo-path /path/to/parakeet-tdt-0.6b-v2.nemo \
  --output-dir parakeet_coreml
```

Notes
- Minimum deployment target: iOS 17. Export uses CPU_ONLY by default; runtime compute units can be set when loading the model (Python or Swift).
- Audio is 16 kHz, single‑channel. The 15 s window is enforced during export and validation.
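
The fixed-window contract in the notes above means every clip has to arrive as exactly 15 s of 16 kHz mono audio, i.e. 240,000 samples. A minimal sketch of fitting a clip to that window follows. `fit_to_window` is a hypothetical helper, not part of this repo, and zero-padding short clips is an assumption about how shorter audio would be handled.

```python
SAMPLE_RATE = 16_000   # Hz, from the notes above
WINDOW_SECONDS = 15    # fixed export/validation window
WINDOW_SAMPLES = SAMPLE_RATE * WINDOW_SECONDS  # 240,000 samples


def fit_to_window(samples, window_samples=WINDOW_SAMPLES):
    """Trim long clips and zero-pad short ones (padding is an assumption).

    Returns (fixed, valid_length), where valid_length counts the real
    (non-padding) samples kept in the window.
    """
    valid_length = min(len(samples), window_samples)
    fixed = list(samples[:window_samples])
    fixed.extend([0.0] * (window_samples - valid_length))
    return fixed, valid_length


# A 2-second clip of silence-adjacent dummy audio, padded up to the window.
clip = [0.1] * (2 * SAMPLE_RATE)
fixed, valid_length = fit_to_window(clip)
print(len(fixed), valid_length)
```

Returning the valid length alongside the padded buffer lets downstream code ignore the padded tail if it needs to.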

## Validate parity and speed (Torch vs CoreML)

Runs Torch and CoreML side‑by‑side on the same 15 s input, records diffs and latency, and saves plots under `plots/compare-components/`. The tool updates `parakeet_coreml/metadata.json` with all measurements.

```
uv run python compare-components.py compare \
  --output-dir parakeet_coreml \
  --model-id nvidia/parakeet-tdt-0.6b-v2 \
  --runs 10 --warmup 3
```
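
The `--runs` and `--warmup` flags suggest the usual benchmarking pattern: a few untimed calls to absorb one-time costs, then several timed calls. A minimal sketch of that pattern is below. `benchmark` is a hypothetical helper, and the choice of mean, median, and minimum as summary statistics is an assumption; the README does not state how `compare-components.py` aggregates its timings.

```python
import statistics
import time


def benchmark(fn, runs=10, warmup=3):
    """Call fn() `warmup` times untimed, then `runs` times timed (ms)."""
    for _ in range(warmup):
        fn()
    timings_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    return {
        "runs": runs,
        "mean_ms": statistics.fmean(timings_ms),
        "median_ms": statistics.median(timings_ms),
        "min_ms": min(timings_ms),
    }


# Stand-in workload; in practice fn would wrap a model prediction call.
stats = benchmark(lambda: sum(i * i for i in range(100_000)))
print(stats)
```

Warmup matters most for the first prediction of a freshly loaded model, which often includes compilation or cache population that would otherwise skew the average.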

Output comparison:

![./plots/compare-components/mel_encoder_time_l2.png](./plots/compare-components/mel_encoder_time_l2.png)
![./plots/compare-components/joint_decision_prob_u0.png](./plots/compare-components/joint_decision_prob_u0.png)
![./plots/compare-components/decoder_steps_l2.png](./plots/compare-components/decoder_steps_l2.png)

Quantization sweeps (compile time, compression, quality, latency):

![./plots/quantize/all/all_components_compile.png](./plots/quantize/all/all_components_compile.png)
![./plots/quantize/all/all_components_compression.png](./plots/quantize/all/all_components_compression.png)
![./plots/quantize/all/all_components_quality.png](./plots/quantize/all/all_components_quality.png)
![./plots/quantize/all/all_components_latency.png](./plots/quantize/all/all_components_latency.png)

### Key results (quality first)

Numerical parity is strong across components on the fixed window:
- Preprocessor mel: match=true; max_abs≈0.484, max_rel≈2.00 (near‑zero bins inflate relative error).
- Encoder: match=true; max_abs≈0.0054, strong agreement over time (see plot).
- Decoder h/c state: match=true; value deltas within tolerance.
- Joint logits: match=true; max_abs≈0.099, distributions align (see top‑k plot).
- Joint+Decision: Fused CoreML head exactly matches decisions computed on CoreML logits (token_id/prob/duration). PyTorch logits produce slightly different argmax paths (expected from small logit differences).
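
To see why near-zero bins inflate relative error while absolute error stays small, the sketch below computes both on made-up values. `max_abs_and_rel` is a hypothetical helper, and dividing by the reference magnitude is an assumed definition of relative error; the README does not say how `max_rel` is computed in the comparison tool.

```python
def max_abs_and_rel(reference, candidate, eps=1e-12):
    """Return (max_abs, max_rel) over two equal-length float sequences."""
    if len(reference) != len(candidate):
        raise ValueError("sequences must have the same length")
    max_abs = 0.0
    max_rel = 0.0
    for ref, cand in zip(reference, candidate):
        abs_err = abs(ref - cand)
        # Assumed definition: error relative to the reference magnitude,
        # with eps guarding against division by zero.
        rel_err = abs_err / max(abs(ref), eps)
        max_abs = max(max_abs, abs_err)
        max_rel = max(max_rel, rel_err)
    return max_abs, max_rel


# Made-up values: one large bin and one near-zero bin.
reference = [10.0, 0.001]
candidate = [10.05, 0.003]
max_abs, max_rel = max_abs_and_rel(reference, candidate)
print(f"max_abs={max_abs:.3f} max_rel={max_rel:.2f}")
```

Here the large bin sets `max_abs`, while the near-zero bin, whose absolute error is far smaller, sets `max_rel`. That is the reason a large `max_rel` on mel features does not by itself indicate a parity problem.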

### Speed (latency and RTF)

Component latency on a 15 s clip, Torch CPU vs CoreML (CPU+NE), taken from `parakeet_coreml/metadata.json`:
- Encoder: Torch 1030.48 ms → CoreML 25.44 ms (≈40.5× faster, RTF 0.00170)
- Preprocessor: Torch 1.99 ms → CoreML 1.19 ms (≈1.68× faster)
- Joint: Torch 28.34 ms → CoreML 22.66 ms (≈1.25× faster)
- Decoder (U=1): Torch 7.51 ms → CoreML 4.32 ms (≈1.73× faster)
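
The speedup and RTF figures in the list above follow from two simple ratios. The sketch below reproduces the encoder row. The function names are hypothetical, and RTF is taken in its usual sense of processing time divided by audio duration, which is consistent with the reported 0.00170.

```python
def speedup(torch_ms, coreml_ms):
    """How many times faster CoreML is than Torch for one component."""
    return torch_ms / coreml_ms


def real_time_factor(latency_ms, audio_seconds=15.0):
    """Processing time divided by audio duration (lower is faster)."""
    return (latency_ms / 1000.0) / audio_seconds


# Encoder row: Torch 1030.48 ms, CoreML 25.44 ms on a 15 s clip.
encoder_speedup = speedup(1030.48, 25.44)   # about 40.5
encoder_rtf = real_time_factor(25.44)       # about 0.0017
print(f"speedup={encoder_speedup:.1f}x rtf={encoder_rtf:.5f}")
```

An RTF well below 1.0 means the component processes audio much faster than real time on the test machine.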

Fused paths:
- Mel+Encoder (Torch separate vs CoreML fused): 1032.48 ms → 27.10 ms (≈38.1× faster)
- Joint+Decision (CoreML joint + CPU post‑processing vs fused CoreML head): 50.05 ms → 64.09 ms. The fused head is slower here, so prefer the CoreML joint with a lightweight CPU decision step on the host.

Plots
- Latency bars and speedups: `plots/compare-components/latency_summary.png`, `plots/compare-components/latency_speedup.png`
- Fused vs separate: `plots/compare-components/latency_fused_vs_separate.png`, `plots/compare-components/latency_fused_speedup.png`
- Quality visuals: mel composite (`mel_composite.png`), encoder L2 over time (`encoder_time_l2.png`), decoder step L2 (`decoder_steps_l2.png`), joint top‑k/time L2 (`joint_top50.png`, `joint_time_l2.png`), joint‑decision agreement (`joint_decision_token_agree.png`, `joint_decision_prob_u0.png`).

## Quantization (size • quality • speed)

`uv run python quantize_coreml.py` evaluates several variants and writes a roll‑up to `parakeet_coreml_quantized/quantization_summary.json`. Plots are mirrored to `plots/quantize/<compute_units>/` (we include `plots/quantize/all/`). Quality here is reported as 1 − normalized L2 error (1.0 = identical). For JointDecision we report token‑id match rate, duration match, and token‑prob MAE.
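
The quality metrics named above can be sketched in a few lines. All three functions are hypothetical illustrations, not code from `quantize_coreml.py`. In particular, normalizing the L2 error by the norm of the baseline output is an assumption, since the README does not specify the normalization.

```python
import math


def quality_score(reference, candidate):
    """1 - ||ref - cand||_2 / ||ref||_2, so 1.0 means identical outputs.

    Normalizing by the reference norm is an assumption.
    """
    diff_norm = math.sqrt(sum((r - c) ** 2 for r, c in zip(reference, candidate)))
    ref_norm = math.sqrt(sum(r * r for r in reference))
    if ref_norm == 0.0:
        return 1.0 if diff_norm == 0.0 else 0.0
    return 1.0 - diff_norm / ref_norm


def match_rate(reference_ids, candidate_ids):
    """Fraction of positions where the two id sequences agree."""
    matches = sum(1 for r, c in zip(reference_ids, candidate_ids) if r == c)
    return matches / len(reference_ids)


def mean_abs_error(reference, candidate):
    """Mean absolute error between two equal-length float sequences."""
    return sum(abs(r - c) for r, c in zip(reference, candidate)) / len(reference)


# Made-up baseline vs quantized outputs.
print(quality_score([3.0, 4.0], [3.0, 4.0]))
print(match_rate([1, 2, 3, 4], [1, 2, 0, 4]))
print(mean_abs_error([0.5, 0.5], [0.25, 0.75]))
```

The same `match_rate` shape would apply to both token ids and durations, since each is a sequence of discrete decisions compared position by position.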

Quick highlights (ComputeUnits=ALL):
- int8 linear (per‑channel): ~2.0× smaller across components with minimal quality loss
  - MelEncoder quality≈0.963; latency≈31.13 ms (baseline≈29.34 ms)
  - JointDecision acc≈0.995; latency≈1.96 ms (baseline≈2.15 ms)
- int8 linear (per‑tensor symmetric): large encoder quality drop (≈0.50); not recommended

Quantization plots (ALL)
- Fused: `plots/quantize/all/fused_quality.png`, `fused_latency.png`, `fused_compression.png`, `fused_size.png`
- Component breakdown: `plots/quantize/all/all_components_quality.png`, `all_components_latency.png`, `all_components_compression.png`, `all_components_size.png`, `all_components_compile.png`

## Reproduce the figures

1) Export baseline CoreML packages
```
uv run python convert-parakeet.py convert --model-id nvidia/parakeet-tdt-0.6b-v2 --output-dir parakeet_coreml
```

2) Compare Torch vs CoreML and generate parity/latency plots
```
uv run python compare-components.py compare --output-dir parakeet_coreml --runs 10 --warmup 3
```

3) Run quantization sweeps (mirrors plots into `plots/quantize/<compute_units>/`)
```
uv run python quantize_coreml.py \
  --input-dir parakeet_coreml \
  --output-root parakeet_coreml_quantized \
  --compute-units ALL --runs 10
```

Examples
- Encoder 6‑bit palette only:
  `uv run python quantize_coreml.py -c encoder-palettize`
- MelEncoder 6‑bit palette only:
  `uv run python quantize_coreml.py -c mel-palettize`

(By default, the script derives the component whitelist from the selected variants. Use `-m/--component` to restrict it explicitly, or `-m all` to force all components.)

## Notes & limits

- Fixed 15‑second window shapes are required for all CoreML exports and validations.
- Latency measurements are host‑side CoreML predictions (CPU+NE or ALL); on‑device results can differ by chip/OS.
- For streaming decode, the exported decoder uses U=1 inputs with explicit LSTM state I/O.
- Minimum deployment target is iOS 17; models are saved as MLProgram and eligible for ANE when loaded with `ComputeUnits=ALL`.

## Acknowledgements

- Parakeet‑TDT v2 model from NVIDIA NeMo (`nvidia/parakeet-tdt-0.6b-v2`).
- This directory provides export/validation utilities and plots to help the community reproduce quality and performance on Apple devices.
Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
# Agent Notes

- Preferred Python workflow uses `uv` (https://github.com/astral-sh/uv).
- Create and manage environments with `uv venv`.
- Install dependencies with `uv pip install` or `uv pip sync` as needed.
- When working in this repo, activate the local `.venv` and run tooling through `uv run` to keep resolutions reproducible.
- Keep CoreML conversions constrained to the fixed 15-second audio window when exporting or validating Parakeet components.