Documentation/ASR/CustomVocabulary.md: 75 additions & 8 deletions
@@ -17,6 +17,58 @@ The paper introduces a dynamic programming algorithm for CTC-based keyword spott
 ## Architecture Overview

+FluidAudio supports two approaches for CTC-based custom vocabulary boosting:
+
+### Approach 1: Standalone CTC Head (Beta, Recommended for TDT-CTC-110M)
+
+```
+      ┌─────────────────────────────────────────┐
+      │               Audio Input               │
+      │              (16kHz, mono)              │
+      └─────────────────┬───────────────────────┘
+                        │
+                        ▼
+               ┌─────────────────┐
+               │  TDT-CTC-110M   │
+               │  Preprocessor   │
+               │ (fused encoder) │
+               └────────┬────────┘
+                        │
+           encoder output [1, 512, T]
+                        │
+         ┌──────────────┴──────────────┐
+         │                             │
+         ▼                             ▼
+┌─────────────────┐           ┌─────────────────┐
+│   TDT Decoder   │           │    CTC Head     │
+│ + Joint Network │           │   (1MB, beta)   │
+└────────┬────────┘           └────────┬────────┘
+         │                             │
+         ▼                  ctc_logits [1, T, 1025]
+┌─────────────────┐                    │
+│  Raw Transcript │                    ▼
+│ "in video corp" │           ┌─────────────────┐
+└────────┬────────┘   Custom  │ Keyword Spotter │
+         │         Vocabulary►│  (DP Algorithm) │
+         │                    └────────┬────────┘
+         └──────────────┬──────────────┘
+                        ▼
+               ┌─────────────────┐
+               │   Vocabulary    │
+               │    Rescorer     │
+               └────────┬────────┘
+                        │
+                        ▼
+               ┌─────────────────┐
+               │ Final Transcript│
+               │  "NVIDIA Corp"  │
+               └─────────────────┘
+```
+
+The standalone CTC head is a single linear projection (512 → 1025) extracted from the hybrid TDT-CTC-110M model. It reuses the TDT encoder output, requiring only ~1MB of additional model weight and no second encoder pass.
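
As a rough illustration of what "a single linear projection" means for one encoder frame, the sketch below computes `W·x + b` and then a log-softmax. It is not the exported CoreML model: the function name, the toy sizes (2 features and 3 tokens instead of 512 and 1025), and the explicit log-softmax step are assumptions made for this example, since the diff does not say whether normalization happens inside the head.

```swift
import Foundation

/// Illustrative CTC head for a single encoder frame: logits = W·x + b,
/// followed by an explicit log-softmax (an assumption of this sketch).
func ctcHeadLogProbs(frame: [Double], weights: [[Double]], bias: [Double]) -> [Double] {
    precondition(weights.count == bias.count, "one bias per output token")
    var logits = [Double]()
    logits.reserveCapacity(bias.count)
    for (row, b) in zip(weights, bias) {
        precondition(row.count == frame.count, "weight row must match feature size")
        var sum = b
        for (w, x) in zip(row, frame) { sum += w * x }
        logits.append(sum)
    }
    // Numerically stable log-softmax: subtract the max before exponentiating.
    let maxLogit = logits.max() ?? 0
    let logSumExp = maxLogit + log(logits.reduce(0) { $0 + exp($1 - maxLogit) })
    return logits.map { $0 - logSumExp }
}

// Toy sizes (2 features → 3 tokens) stand in for the real 512 → 1025.
let toyLogProbs = ctcHeadLogProbs(
    frame: [1.0, 2.0],
    weights: [[1, 0], [0, 1], [0, 0]],
    bias: [0, 0, 0]
)
```

Because the projection only reads features the TDT encoder has already produced, running it once per frame adds no second encoder pass, which is the point the paragraph above makes.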
+
+### Approach 2: Separate CTC Encoder (Original)
+
 ```
 ┌─────────────────────────────────────────┐
 │               Audio Input               │
@@ -58,24 +110,37 @@ The paper introduces a dynamic programming algorithm for CTC-based keyword spott
 └─────────────────┘
 ```

-## Dual Encoder Alignment
+### Approach Comparison
+
+| | Standalone CTC Head (beta) | Separate CTC Encoder |
+|---|---|---|
+| **Additional model size** | 1 MB | 97.5 MB |
+| **Second encoder pass** | No | Yes |
+| **RTFx (earnings benchmark)** | 70.29x | 25.98x |
+| **Dict Recall** | 99.4% | 99.4% |
+| **TDT model requirement** | TDT-CTC-110M only | Any TDT model |
+| **Status** | Beta | Stable |
+
+The standalone CTC head is available only with the TDT-CTC-110M model because both the TDT and CTC heads share the same encoder in the hybrid architecture. For Parakeet TDT v2/v3 (0.6B), the separate CTC encoder approach is required.
+
+## Encoder Alignment
+
+### Separate CTC Encoder (Approach 2)

 The system uses two separate neural network encoders that process the same audio:

-### 1. TDT Encoder (Primary Transcription)
+#### TDT Encoder (Primary Transcription)
 - **Model**: Parakeet TDT 0.6B (600M parameters)
 - **Architecture**: Token Duration Transducer with FastConformer
 - **Output**: High-quality transcription with word timestamps
 - **Frame Rate**: ~40ms per frame

-### 2. CTC Encoder (Keyword Spotting)
+#### CTC Encoder (Keyword Spotting)
 - **Model**: Parakeet CTC 110M (110M parameters)
 - **Architecture**: FastConformer with CTC head
 - **Output**: Per-frame log-probabilities over 1024 tokens
 - **Frame Rate**: ~40ms per frame (aligned with TDT)
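
Since both encoders are listed at roughly 40 ms per frame, a frame index from either one maps to the same time axis. The following is a minimal sketch of that conversion, assuming the 40 ms figure from the bullets above; `timeRange(forFrames:frameDuration:)` and the sample values are hypothetical and not part of FluidAudio.

```swift
/// Hypothetical helper: converts a half-open range of frame indices to seconds.
/// `frameDuration` defaults to the ~40 ms per frame quoted in the bullets above.
func timeRange(forFrames frames: Range<Int>, frameDuration: Double = 0.04) -> ClosedRange<Double> {
    let start = Double(frames.lowerBound) * frameDuration
    let end = Double(frames.upperBound) * frameDuration
    return start...end
}

// A CTC detection spanning frames 25..<50 covers roughly 1.0 s to 2.0 s.
let detection = timeRange(forFrames: 25..<50)
// Hypothetical word timestamps from the TDT transcript, in seconds.
let tdtWord = 1.5...1.9
// With a shared frame rate the two intervals can be compared directly.
let sameRegion = detection.overlaps(tdtWord)
```

If the two encoders ran at different frame rates, each side would need its own `frameDuration` before the intervals could be compared.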

-### Frame Alignment
-
 Both encoders use the same audio preprocessing (mel spectrogram with identical parameters), producing frames at the same rate. This enables direct timestamp comparison between:
+| TDT + CTC head | ~67 MB | With vocabulary boosting (standalone head, beta) |

 *Measured on iPhone 17 Pro. Memory settles after initial model loading.*

-The additional ~64 MB overhead comes from the CTC encoder (Parakeet 110M) being loaded alongside the primary TDT encoder. For memory-constrained scenarios, consider:
+The standalone CTC head adds negligible memory (~1MB) since it reuses the existing encoder output. The separate CTC encoder adds ~64MB overhead. For memory-constrained scenarios, consider:
+- Using the standalone CTC head with TDT-CTC-110M (beta)
 - Loading the CTC encoder on-demand rather than at startup
 - Unloading the CTC encoder after transcription completes
 - Using vocabulary boosting only for files where domain terms are expected
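
The on-demand loading and unloading options in the list above could be wrapped in a small holder like the one sketched here. This is a generic pattern only: `OnDemand`, its methods, and the placeholder string are hypothetical, and the real loader would be whatever call creates the CTC encoder in the host app.

```swift
/// Hypothetical wrapper: loads a heavy resource on first use and can release it.
/// Not thread-safe; a real integration would need synchronization (e.g. an actor).
final class OnDemand<Resource> {
    private let loader: () throws -> Resource
    private var cached: Resource?

    init(loader: @escaping () throws -> Resource) {
        self.loader = loader
    }

    var isLoaded: Bool { cached != nil }

    /// Returns the cached resource, loading it on the first call.
    func get() throws -> Resource {
        if let existing = cached { return existing }
        let value = try loader()
        cached = value
        return value
    }

    /// Drops the reference so the memory can be reclaimed.
    func unload() { cached = nil }
}

// Stand-in for an expensive model load; counts how often it runs.
var loadCount = 0
let ctcEncoder = OnDemand<String> {
    loadCount += 1
    return "ctc-encoder-placeholder"
}
```

Calling `get()` only for files where domain terms are expected, then `unload()` once transcription completes, combines the last three bullets.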
Documentation/ASR/TDT-CTC-110M.md: 69 additions & 0 deletions
@@ -465,9 +465,78 @@ Tested on iPhone (iOS 17+):
 - Highest accuracy required
 - Extra model size acceptable

+## Standalone CTC Head for Custom Vocabulary (Beta)
+
+The TDT-CTC-110M hybrid model shares one FastConformer encoder between its TDT and CTC decoder heads. FluidAudio exploits this by exporting the CTC decoder head as a standalone 1MB CoreML model (`CtcHead.mlmodelc`) that runs on the existing TDT encoder output, enabling custom vocabulary keyword spotting without a second encoder pass.
+
+### How It Works
+
+```
+TDT Preprocessor (fused encoder)
+               │
+               ▼
+   encoder output [1, 512, T]
+               │
+          ┌────┴────┐
+          │         │
+          ▼         ▼
+     TDT Decoder  CtcHead (1MB, beta)
+          │         │
+          ▼         ▼
+     transcript   ctc_logits [1, T, 1025]
+                    │
+                    ▼
+  Keyword Spotter / VocabularyRescorer
+```
+
+The CTC head is a single linear projection (512 → 1025) that maps the 512-dimensional encoder features to log-probabilities over 1024 BPE tokens + 1 blank token.
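
How the Keyword Spotter consumes these log-probabilities is not shown in this diff. As a rough illustration only, a textbook best-path (Viterbi) CTC score for a token sequence over a window of frames can be sketched as below. This is not FluidAudio's keyword-spotting algorithm: the function name, the toy three-entry vocabulary, and the choice to align against the whole window are assumptions of the example.

```swift
/// Best-path (Viterbi) CTC log-score of `tokens` aligned to all of `logProbs`.
/// logProbs[t][v] is the log-probability of vocabulary entry v at frame t.
func ctcViterbiScore(logProbs: [[Double]], tokens: [Int], blank: Int) -> Double {
    guard !logProbs.isEmpty, !tokens.isEmpty else { return -Double.infinity }
    // Interleave blanks: [blank, t1, blank, t2, ..., blank]
    var states = [blank]
    for token in tokens { states += [token, blank] }

    // A path may start on the leading blank or on the first token.
    var best = [Double](repeating: -Double.infinity, count: states.count)
    best[0] = logProbs[0][states[0]]
    best[1] = logProbs[0][states[1]]

    for frame in logProbs.dropFirst() {
        var next = [Double](repeating: -Double.infinity, count: states.count)
        for s in states.indices {
            var incoming = best[s]                               // stay in the same state
            if s >= 1 { incoming = max(incoming, best[s - 1]) }  // advance by one state
            // Skip over a blank only between two different tokens.
            if s >= 2, states[s] != blank, states[s] != states[s - 2] {
                incoming = max(incoming, best[s - 2])
            }
            next[s] = incoming + frame[states[s]]
        }
        best = next
    }
    // A valid path ends on the last token or on the trailing blank.
    return max(best[states.count - 1], best[states.count - 2])
}

// Toy vocabulary: tokens 0 and 1, blank at index 2 (in the real head the
// vocabulary has 1025 entries; which index is blank is not stated here).
let toyFrames: [[Double]] = [[-1, -5, -3], [-5, -2, -3]]
let keywordScore = ctcViterbiScore(logProbs: toyFrames, tokens: [0, 1], blank: 2)
// -3.0: token 0 on frame 0, then token 1 on frame 1
```

A repeated token such as `[0, 0]` needs a blank frame between the two copies, so with only two frames the same function returns negative infinity.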
+
+### Performance
+
+Benchmarked on 772 earnings call files (Earnings22-KWS):
+The standalone CTC head achieves identical keyword detection quality at 2.7x the speed, using 97x less model weight.
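
The 2.7x and 97x figures can be checked against the Approach Comparison table in `CustomVocabulary.md` from this same change. The snippet below simply redoes that arithmetic, assuming the table's RTFx and model-size rows come from the same benchmark as the claim above.

```swift
// Figures copied from the Approach Comparison table in CustomVocabulary.md.
let standaloneRTFx = 70.29
let separateRTFx = 25.98
let separateEncoderMB = 97.5
let standaloneHeadMB = 1.0

// Ratio of real-time factors: about 2.71, quoted as 2.7x.
let speedup = standaloneRTFx / separateRTFx
// Ratio of additional model weight: 97.5, quoted as 97x.
let sizeRatio = separateEncoderMB / standaloneHeadMB
```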
+
+### Loading
+
+The CTC head model auto-downloads from [FluidInference/parakeet-ctc-110m-coreml](https://huggingface.co/FluidInference/parakeet-ctc-110m-coreml) when loading the TDT-CTC-110M model. It also supports manual placement in the TDT model directory.
+
+Two loading paths are supported:
+1. **Local (v1):** Place `CtcHead.mlmodelc` in the TDT model directory (`parakeet-tdt-ctc-110m/`)
+2. **Auto-download (v2):** Automatically downloaded from the `parakeet-ctc-110m-coreml` HuggingFace repo
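
The two paths above could be chosen between with a check like the following sketch. It assumes that a local `CtcHead.mlmodelc` takes precedence over downloading, which this diff does not state; `CtcHeadSource` and `resolveCtcHead(inModelDirectory:fileManager:)` are hypothetical names, and only the `FileManager` calls are real Foundation API.

```swift
import Foundation

/// Hypothetical result type for this sketch.
enum CtcHeadSource: Equatable {
    case local(URL)
    case needsDownload
}

/// Assumes a local copy in the TDT model directory wins over downloading.
func resolveCtcHead(inModelDirectory directory: URL,
                    fileManager: FileManager = .default) -> CtcHeadSource {
    // A compiled Core ML model (.mlmodelc) is a directory bundle.
    let candidate = directory.appendingPathComponent("CtcHead.mlmodelc", isDirectory: true)
    return fileManager.fileExists(atPath: candidate.path) ? .local(candidate) : .needsDownload
}
```

A caller would fall back to the HuggingFace download only when the function returns `.needsDownload`.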
+
+```swift
+// CTC head loads automatically with TDT-CTC-110M models
+let models = try await AsrModels.downloadAndLoad(version: .tdtCtc110m)
+// models.ctcHead is non-nil when CtcHead.mlmodelc is available
+```
+
+### Conversion
+
+The CTC head is exported using the conversion script in the mobius repo:
+
+```bash
+cd mobius/models/stt/parakeet-tdt-ctc-110m/coreml/
+uv run python export-ctc-head.py --output-dir ./ctc-head-build
+```
+See [mobius PR #36](https://github.com/FluidInference/mobius/pull/36) for the conversion script.
+
+### Status
+
+This feature is **beta**. The CTC head produces identical keyword detection results to the separate CTC encoder, but the auto-download pathway and integration are new. See [#435](https://github.com/FluidInference/FluidAudio/issues/435) and [PR #450](https://github.com/FluidInference/FluidAudio/pull/450) for details.
Documentation/ASR/benchmarks100.md: 12 additions & 0 deletions
@@ -41,3 +41,15 @@ Benchmark comparison between `main` and PR #440 (`standardize-asr-directory-stru
 ## Verdict

 **No regressions.** WER is identical across all 6 benchmarks. RTFx differences are within normal system noise (M2 thermals, background processes). The directory restructuring is a pure file move with no behavioral changes.
+
+## Issue #435: Standalone CTC Head for Custom Vocabulary (Beta)
+
+Benchmark comparing separate CTC encoder vs standalone CTC head extracted from the TDT-CTC-110M hybrid model.
+See [#435](https://github.com/FluidInference/FluidAudio/issues/435) and [PR #450](https://github.com/FluidInference/FluidAudio/pull/450).
+
+| Metric | Separate CTC (v2 TDT) | Separate CTC (110m TDT) | Standalone CTC Head (110m TDT) |