
Commit 9516d95

Add standalone CTC head for custom vocabulary (#435) (#450)
## Summary

- Export the CTC decoder head (512→1025 linear projection) as a standalone 1MB CoreML model, replacing the need for the full 97.5MB CTC encoder for custom vocabulary keyword spotting
- Load optional `CtcHead.mlmodelc` from model directory and run it on existing TDT encoder output
- Add `spotKeywordsFromLogProbs()` and `applyLogSoftmax()` APIs for pre-computed CTC log-probabilities

## Benchmark (772 earnings call files)

| Approach | Model Size | Dict Recall | RTFx |
|----------|-----------|-------------|------|
| Separate CTC encoder | 97.5 MB | 99.4% | 25.98x |
| **Standalone CTC head** | **1 MB** | **99.4%** | **70.29x** |

## Test plan

- [x] `swift build -c release` passes
- [x] 10-file quick test: Dict Recall 100%, RTFx 67.36x
- [x] Full 772-file benchmark: Dict Recall 99.4%, RTFx 70.29x
- [ ] Conversion script: [mobius PR #36](FluidInference/mobius#36)
- [ ] HF model upload: `CtcHead.mlmodelc` to `parakeet-tdt-ctc-110m` repo
1 parent 7feaec8 commit 9516d95

9 files changed

Lines changed: 508 additions & 40 deletions

File tree

Documentation/ASR/CustomVocabulary.md

Lines changed: 75 additions & 8 deletions
@@ -17,6 +17,58 @@ The paper introduces a dynamic programming algorithm for CTC-based keyword spott
 
 ## Architecture Overview
 
+FluidAudio supports two approaches for CTC-based custom vocabulary boosting:
+
+### Approach 1: Standalone CTC Head (Beta, Recommended for TDT-CTC-110M)
+
+```
+┌─────────────────────────────────────────┐
+│               Audio Input               │
+│              (16kHz, mono)              │
+└─────────────────┬───────────────────────┘
+                  │
+         ┌─────────────────┐
+         │  TDT-CTC-110M   │
+         │  Preprocessor   │
+         │ (fused encoder) │
+         └────────┬────────┘
+                  │
+       encoder output [1, 512, T]
+                  │
+   ┌──────────────┴──────────────┐
+   │                             │
+   ▼                             ▼
+┌─────────────────┐     ┌─────────────────┐
+│   TDT Decoder   │     │    CTC Head     │
+│ + Joint Network │     │   (1MB, beta)   │
+└────────┬────────┘     └────────┬────────┘
+         │                       │
+         ▼        ctc_logits [1, T, 1025]
+┌─────────────────┐              │
+│  Raw Transcript │              ▼
+│ "in video corp" │     ┌─────────────────┐
+└────────┬────────┘     │ Keyword Spotter │
+         │   Custom ───►│ (DP Algorithm)  │
+         │ Vocabulary   └────────┬────────┘
+         └───────────┬──────────┘
+                     │
+           ┌─────────────────┐
+           │   Vocabulary    │
+           │    Rescorer     │
+           └────────┬────────┘
+                    │
+           ┌─────────────────┐
+           │ Final Transcript│
+           │  "NVIDIA Corp"  │
+           └─────────────────┘
+```
+
+The standalone CTC head is a single linear projection (512 → 1025) extracted from the hybrid TDT-CTC-110M model. It reuses the TDT encoder output, requiring only ~1MB of additional model weight and no second encoder pass.
+
+### Approach 2: Separate CTC Encoder (Original)
+
 ```
 ┌─────────────────────────────────────────┐
 │              Audio Input                │
@@ -58,24 +110,37 @@ The paper introduces a dynamic programming algorithm for CTC-based keyword spott
 └─────────────────┘
 ```
 
-## Dual Encoder Alignment
+### Approach Comparison
+
+| | Standalone CTC Head (beta) | Separate CTC Encoder |
+|---|---|---|
+| **Additional model size** | 1 MB | 97.5 MB |
+| **Second encoder pass** | No | Yes |
+| **RTFx (earnings benchmark)** | 70.29x | 25.98x |
+| **Dict Recall** | 99.4% | 99.4% |
+| **TDT model requirement** | TDT-CTC-110M only | Any TDT model |
+| **Status** | Beta | Stable |
+
+The standalone CTC head is available only with the TDT-CTC-110M model because both the TDT and CTC heads share the same encoder in the hybrid architecture. For Parakeet TDT v2/v3 (0.6B), the separate CTC encoder approach is required.
+
+## Encoder Alignment
+
+### Separate CTC Encoder (Approach 2)
 
 The system uses two separate neural network encoders that process the same audio:
 
-### 1. TDT Encoder (Primary Transcription)
+#### TDT Encoder (Primary Transcription)
 - **Model**: Parakeet TDT 0.6B (600M parameters)
 - **Architecture**: Token Duration Transducer with FastConformer
 - **Output**: High-quality transcription with word timestamps
 - **Frame Rate**: ~40ms per frame
 
-### 2. CTC Encoder (Keyword Spotting)
+#### CTC Encoder (Keyword Spotting)
 - **Model**: Parakeet CTC 110M (110M parameters)
 - **Architecture**: FastConformer with CTC head
 - **Output**: Per-frame log-probabilities over 1024 tokens
 - **Frame Rate**: ~40ms per frame (aligned with TDT)
 
-### Frame Alignment
-
 Both encoders use the same audio preprocessing (mel spectrogram with identical parameters), producing frames at the same rate. This enables direct timestamp comparison between:
 - TDT decoder word timestamps
 - CTC keyword detection timestamps
@@ -88,18 +153,20 @@ CTC Frames: [0] [1] [2] ... [374] (375 frames @ 40ms)
 Aligned timestamps
 ```
 
-### Memory Usage
+#### Memory Usage
 
 Running two encoders in parallel increases peak memory consumption:
 
 | Configuration | Peak RAM | Notes |
 |---------------|----------|-------|
 | TDT encoder only | ~66 MB | Standard transcription |
-| TDT + CTC encoders | ~130 MB | With vocabulary boosting |
+| TDT + CTC encoders | ~130 MB | With vocabulary boosting (separate encoder) |
+| TDT + CTC head | ~67 MB | With vocabulary boosting (standalone head, beta) |
 
 *Measured on iPhone 17 Pro. Memory settles after initial model loading.*
 
-The additional ~64 MB overhead comes from the CTC encoder (Parakeet 110M) being loaded alongside the primary TDT encoder. For memory-constrained scenarios, consider:
+The standalone CTC head adds negligible memory (~1MB) since it reuses the existing encoder output. The separate CTC encoder adds ~64MB overhead. For memory-constrained scenarios, consider:
+- Using the standalone CTC head with TDT-CTC-110M (beta)
 - Loading the CTC encoder on-demand rather than at startup
 - Unloading the CTC encoder after transcription completes
 - Using vocabulary boosting only for files where domain terms are expected
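The documentation above routes per-frame CTC log-probabilities plus the custom vocabulary into a "Keyword Spotter (DP Algorithm)" box, but the dynamic program itself is not part of this diff. As a rough illustration only of the kind of scoring involved, here is a standard CTC best-path (Viterbi) alignment score for a single keyword; the blank index, token IDs, and toy probabilities are all made up and do not reflect FluidAudio's actual algorithm:

```python
import math

def ctc_viterbi_score(log_probs, tokens, blank=0):
    """Best-path (Viterbi) CTC alignment score of `tokens` over a window of
    per-frame log-probabilities, using the standard blank-expanded label
    sequence [blank, t1, blank, t2, ..., blank]."""
    ext = [blank]
    for tok in tokens:
        ext += [tok, blank]
    num_states, num_frames = len(ext), len(log_probs)
    neg_inf = float("-inf")
    dp = [neg_inf] * num_states
    dp[0] = log_probs[0][ext[0]]   # start on the leading blank...
    dp[1] = log_probs[0][ext[1]]   # ...or directly on the first token
    for t in range(1, num_frames):
        ndp = [neg_inf] * num_states
        for s in range(num_states):
            best = dp[s]                               # repeat the same emission
            if s >= 1:
                best = max(best, dp[s - 1])            # advance one state
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                best = max(best, dp[s - 2])            # skip the blank between differing tokens
            if best > neg_inf:
                ndp[s] = best + log_probs[t][ext[s]]
        dp = ndp
    return max(dp[-1], dp[-2])  # path may end on the last token or a trailing blank

# Toy setup: 0 = blank, 1 and 2 = hypothetical BPE token IDs;
# the four frames strongly favor the emission sequence "1 1 2 blank".
frames = [
    [math.log(0.1), math.log(0.8), math.log(0.1)],
    [math.log(0.1), math.log(0.8), math.log(0.1)],
    [math.log(0.1), math.log(0.1), math.log(0.8)],
    [math.log(0.8), math.log(0.1), math.log(0.1)],
]
```

A keyword whose tokens match the audio (`[1, 2]`) scores far higher than one that does not (`[2, 1]`), which is the signal the rescorer thresholds on.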

Documentation/ASR/TDT-CTC-110M.md

Lines changed: 69 additions & 0 deletions
@@ -465,9 +465,78 @@ Tested on iPhone (iOS 17+):
 - Highest accuracy required
 - Extra model size acceptable
 
+## Standalone CTC Head for Custom Vocabulary (Beta)
+
+The TDT-CTC-110M hybrid model shares one FastConformer encoder between its TDT and CTC decoder heads. FluidAudio exploits this by exporting the CTC decoder head as a standalone 1MB CoreML model (`CtcHead.mlmodelc`) that runs on the existing TDT encoder output, enabling custom vocabulary keyword spotting without a second encoder pass.
+
+### How It Works
+
+```
+TDT Preprocessor (fused encoder)
+              │
+    encoder output [1, 512, T]
+              │
+         ┌────┴────┐
+         │         │
+         ▼         ▼
+  TDT Decoder    CtcHead (1MB, beta)
+         │         │
+         ▼         ▼
+  transcript    ctc_logits [1, T, 1025]
+                   │
+                   ▼
+  Keyword Spotter / VocabularyRescorer
+```
+
+The CTC head is a single linear projection (512 → 1025) that maps the 512-dimensional encoder features to logits over 1024 BPE tokens + 1 blank token.
+
+### Performance
+
+Benchmarked on 772 earnings call files (Earnings22-KWS):
+
+| Approach | Model Size | Dict Recall | RTFx |
+|----------|-----------|-------------|------|
+| Separate CTC encoder | 97.5 MB | 99.4% | 25.98x |
+| **Standalone CTC head** | **1 MB** | **99.4%** | **70.29x** |
+
+The standalone CTC head achieves identical keyword detection quality at 2.7x the speed, using 97x less model weight.
+
+### Loading
+
+The CTC head model auto-downloads from [FluidInference/parakeet-ctc-110m-coreml](https://huggingface.co/FluidInference/parakeet-ctc-110m-coreml) when loading the TDT-CTC-110M model. It also supports manual placement in the TDT model directory.
+
+Two loading paths are supported:
+1. **Local (v1):** Place `CtcHead.mlmodelc` in the TDT model directory (`parakeet-tdt-ctc-110m/`)
+2. **Auto-download (v2):** Automatically downloaded from the `parakeet-ctc-110m-coreml` HuggingFace repo
+
+```swift
+// CTC head loads automatically with TDT-CTC-110M models
+let models = try await AsrModels.downloadAndLoad(version: .tdtCtc110m)
+// models.ctcHead is non-nil when CtcHead.mlmodelc is available
+```
+
+### Conversion
+
+The CTC head is exported using the conversion script in the mobius repo:
+
+```bash
+cd mobius/models/stt/parakeet-tdt-ctc-110m/coreml/
+uv run python export-ctc-head.py --output-dir ./ctc-head-build
+xcrun coremlcompiler compile ctc-head-build/CtcHead.mlpackage ctc-head-build/
+```
+
+See [mobius PR #36](https://github.com/FluidInference/mobius/pull/36) for the conversion script.
+
+### Status
+
+This feature is **beta**. The CTC head produces identical keyword detection results to the separate CTC encoder, but the auto-download pathway and integration are new. See [#435](https://github.com/FluidInference/FluidAudio/issues/435) and [PR #450](https://github.com/FluidInference/FluidAudio/pull/450) for details.
+
 ## Resources
 
 - **Model:** [FluidInference/parakeet-tdt-ctc-110m-coreml](https://huggingface.co/FluidInference/parakeet-tdt-ctc-110m-coreml)
+- **CTC Head model:** [FluidInference/parakeet-ctc-110m-coreml](https://huggingface.co/FluidInference/parakeet-ctc-110m-coreml) (includes CtcHead.mlmodelc)
 - **Benchmark results:** See `benchmarks.md`
 - **PR:** [#433 - Add TDT-CTC-110M support](https://github.com/FluidInference/FluidAudio/pull/433)
+- **CTC Head PR:** [#450 - Add standalone CTC head for custom vocabulary](https://github.com/FluidInference/FluidAudio/pull/450)
 - **Original NVIDIA model:** [nvidia/parakeet-tdt-1.1b](https://huggingface.co/nvidia/parakeet-tdt-1.1b)
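The docs above describe the CTC head as one linear projection from 512-dimensional encoder features to 1025 outputs (1024 BPE tokens + blank); at fp16 that weight matrix alone is 1025 × 512 × 2 bytes ≈ 1 MB, consistent with the quoted model size. A NumPy sketch of just the shapes involved — random weights stand in for the trained head, and the log-softmax mirrors what `applyLogSoftmax()` is documented to do:

```python
import numpy as np

# Shapes from the diff: encoder output [1, 512, T] in, ctc_logits [1, T, 1025] out.
rng = np.random.default_rng(0)
T, D, V = 6, 512, 1025  # frames, encoder dim, 1024 BPE tokens + 1 blank

encoder_out = rng.standard_normal((1, D, T)).astype(np.float32)
W = (rng.standard_normal((V, D)) * 0.02).astype(np.float32)  # stand-in projection weight
b = np.zeros(V, dtype=np.float32)

# The head itself: one linear projection per frame -> raw logits [1, T, 1025]
ctc_logits = np.einsum("bdt,vd->btv", encoder_out, W) + b

# Keyword spotting consumes log-probabilities, so apply a stabilized
# log-softmax over the vocabulary axis
m = ctc_logits.max(axis=-1, keepdims=True)
log_probs = ctc_logits - m - np.log(np.exp(ctc_logits - m).sum(axis=-1, keepdims=True))
```

After the log-softmax, each frame's probabilities sum to 1, which is the invariant a downstream CTC decoder relies on.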

Documentation/ASR/benchmarks100.md

Lines changed: 12 additions & 0 deletions
@@ -41,3 +41,15 @@ Benchmark comparison between `main` and PR #440 (`standardize-asr-directory-stru
 ## Verdict
 
 **No regressions.** WER is identical across all 6 benchmarks. RTFx differences are within normal system noise (M2 thermals, background processes). The directory restructuring is a pure file move with no behavioral changes.
+
+## Issue #435: Standalone CTC Head for Custom Vocabulary (Beta)
+
+Benchmark comparing separate CTC encoder vs standalone CTC head extracted from the TDT-CTC-110M hybrid model.
+See [#435](https://github.com/FluidInference/FluidAudio/issues/435) and [PR #450](https://github.com/FluidInference/FluidAudio/pull/450).
+
+| Metric | Separate CTC (v2 TDT) | Separate CTC (110m TDT) | Standalone CTC Head (110m TDT) |
+|---|---|---|---|
+| Dict Recall | 99.3% | 99.4% | 99.4% |
+| RTFx | 43.94x | 25.98x | 70.29x |
+| Additional model size | 97.5 MB | 97.5 MB | 1 MB |
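RTFx figures like 70.29x read as the usual real-time factor: seconds of audio divided by seconds of wall-clock compute (an assumption here; the benchmark harness may aggregate per-file ratios differently). The arithmetic is just:

```python
def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Real-time factor: seconds of audio transcribed per second of compute.
    Higher is faster; 1.0 means exactly real time."""
    return audio_seconds / processing_seconds

# At 70.29x, an hour of audio takes roughly 51 s of compute;
# at 25.98x the same hour takes about 139 s.
hour = 3600.0
```

So the head-vs-encoder gap in the table above translates to roughly 51 s vs 139 s per hour of earnings-call audio.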
55+

Sources/FluidAudio/ASR/Parakeet/AsrManager.swift

Lines changed: 38 additions & 1 deletion
@@ -53,6 +53,37 @@ public actor AsrManager {
     internal var vocabSizeConfig: ContextBiasingConstants.VocabSizeConfig?
     internal var vocabBoostingEnabled: Bool { customVocabulary != nil && vocabularyRescorer != nil }
 
+    // Cached CTC logits from fused Preprocessor (unified custom vocabulary)
+    internal var cachedCtcLogits: MLMultiArray?
+    internal var cachedCtcFrameDuration: Double?
+    internal var cachedCtcValidFrames: Int?
+
+    /// Whether the Preprocessor outputs CTC logits (unified custom vocabulary model).
+    public var hasCachedCtcLogits: Bool { cachedCtcLogits != nil }
+
+    /// Get cached CTC raw logits as [[Float]] for external use (e.g. benchmarks).
+    /// These are raw logits — callers must apply `CtcKeywordSpotter.applyLogSoftmax()`
+    /// to convert to log-probabilities before use in keyword detection.
+    /// Returns nil if the CTC head model is not available or audio was multi-chunk.
+    public func getCachedCtcRawLogits() -> (rawLogits: [[Float]], frameDuration: Double)? {
+        guard let logits = cachedCtcLogits, let duration = cachedCtcFrameDuration else { return nil }
+        let shape = logits.shape
+        guard shape.count == 3 else { return nil }
+        let numFrames = min(shape[1].intValue, cachedCtcValidFrames ?? shape[1].intValue)
+        let vocabSize = shape[2].intValue
+        var result: [[Float]] = []
+        result.reserveCapacity(numFrames)
+        for t in 0..<numFrames {
+            var frame: [Float] = []
+            frame.reserveCapacity(vocabSize)
+            for v in 0..<vocabSize {
+                frame.append(logits[[0, t, v] as [NSNumber]].floatValue)
+            }
+            result.append(frame)
+        }
+        return (rawLogits: result, frameDuration: duration)
+    }
+
     // Cached prediction options for reuse
     internal lazy var predictionOptions: MLPredictionOptions = {
         AsrModels.optimizedPredictionOptions()
@@ -308,6 +339,9 @@ public actor AsrManager {
         let layers = asrModels?.version.decoderLayers ?? 2
         microphoneDecoderState = TdtDecoderState.make(decoderLayers: layers)
         systemDecoderState = TdtDecoderState.make(decoderLayers: layers)
+        cachedCtcLogits = nil
+        cachedCtcFrameDuration = nil
+        cachedCtcValidFrames = nil
         Task { await sharedMLArrayCache.clear() }
     }
 
@@ -322,7 +356,10 @@ public actor AsrManager {
         // Reset decoder states using fresh allocations for deterministic behavior
         microphoneDecoderState = TdtDecoderState.make(decoderLayers: layers)
         systemDecoderState = TdtDecoderState.make(decoderLayers: layers)
-        // Release vocabulary boosting resources
+        // Release vocabulary boosting resources and cached CTC data
+        cachedCtcLogits = nil
+        cachedCtcFrameDuration = nil
+        cachedCtcValidFrames = nil
        disableVocabularyBoosting()
         Task { await sharedMLArrayCache.clear() }
         logger.info("AsrManager resources cleaned up")
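`getCachedCtcRawLogits()` above flattens a `[1, T, 1025]` `MLMultiArray` into per-frame rows, truncating to the valid-frame count. A small Python model of just that shape handling, with nested lists standing in for `MLMultiArray` (illustrative only, not the FluidAudio API):

```python
def extract_frames(logits, valid_frames=None):
    """Flatten a [1, T, V] tensor (nested lists here) into per-frame rows,
    truncated to valid_frames when padding extends past the real audio —
    mirroring the guard-and-loop structure of getCachedCtcRawLogits()."""
    if len(logits) != 1:
        return None                     # expect a batch dimension of exactly 1
    batch = logits[0]
    n = len(batch) if valid_frames is None else min(len(batch), valid_frames)
    return [list(batch[t]) for t in range(n)]

# Three padded frames, only the first two carry real audio
rows = extract_frames([[[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]], valid_frames=2)
```

As in the Swift version, the output rows are still raw logits; a log-softmax must be applied before keyword detection.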

Sources/FluidAudio/ASR/Parakeet/AsrModels.swift

Lines changed: 45 additions & 0 deletions
@@ -60,6 +60,8 @@ public struct AsrModels: Sendable {
     public let preprocessor: MLModel
     public let decoder: MLModel
     public let joint: MLModel
+    /// Optional CTC decoder head for custom vocabulary (encoder features → CTC logits)
+    public let ctcHead: MLModel?
     public let configuration: MLModelConfiguration
     public let vocabulary: [Int: String]
     public let version: AsrModelVersion
@@ -71,6 +73,7 @@ public struct AsrModels: Sendable {
         preprocessor: MLModel,
         decoder: MLModel,
         joint: MLModel,
+        ctcHead: MLModel? = nil,
         configuration: MLModelConfiguration,
         vocabulary: [Int: String],
         version: AsrModelVersion
@@ -79,6 +82,7 @@ public struct AsrModels: Sendable {
         self.preprocessor = preprocessor
         self.decoder = decoder
         self.joint = joint
+        self.ctcHead = ctcHead
         self.configuration = configuration
         self.vocabulary = vocabulary
         self.version = version
@@ -207,11 +211,52 @@ extension AsrModels {
             throw AsrModelsError.loadingFailed("Failed to load decoder or joint model")
         }
 
+        // [Beta] Optionally load CTC head model for custom vocabulary.
+        // Supports two paths:
+        //   v1: CtcHead.mlmodelc placed manually in the TDT model directory
+        //   v2: Auto-download from FluidInference/parakeet-ctc-110m-coreml HF repo
+        var ctcHeadModel: MLModel?
+        if version == .tdtCtc110m {
+            // v1: Check local TDT model directory first
+            let repoDir = repoPath(from: directory, version: version)
+            let ctcHeadPath = repoDir.appendingPathComponent(Names.ctcHeadFile)
+            if FileManager.default.fileExists(atPath: ctcHeadPath.path) {
+                let ctcConfig = MLModelConfiguration()
+                ctcConfig.computeUnits = config.computeUnits
+                ctcHeadModel = try? MLModel(contentsOf: ctcHeadPath, configuration: ctcConfig)
+                if ctcHeadModel != nil {
+                    logger.info("[Beta] Loaded CTC head model from local directory")
+                } else {
+                    logger.warning("CTC head model found but failed to load: \(ctcHeadPath.path)")
+                }
+            }
+
+            // v2: Fall back to downloading from parakeet-ctc-110m HF repo
+            if ctcHeadModel == nil {
+                do {
+                    let ctcModels = try await DownloadUtils.loadModels(
+                        .parakeetCtc110m,
+                        modelNames: [Names.ctcHeadFile],
+                        directory: parentDirectory,
+                        computeUnits: config.computeUnits,
+                        progressHandler: progressHandler
+                    )
+                    ctcHeadModel = ctcModels[Names.ctcHeadFile]
+                    if ctcHeadModel != nil {
+                        logger.info("[Beta] Loaded CTC head model from HF repo")
+                    }
+                } catch {
+                    logger.warning("CTC head model not available: \(error.localizedDescription)")
+                }
+            }
+        }
+
         let asrModels = AsrModels(
             encoder: encoderModel,
             preprocessor: preprocessorModel,
             decoder: decoderModel,
             joint: jointModel,
+            ctcHead: ctcHeadModel,
             configuration: config,
             vocabulary: try loadVocabulary(from: directory, version: version),
             version: version
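The loading logic in this diff tries a locally placed `CtcHead.mlmodelc` first (v1) and only then falls back to the HuggingFace download (v2), treating the head as strictly optional on failure. A Python sketch of that fallback order, with a stand-in `download` callback in place of `DownloadUtils.loadModels` (names are illustrative, not FluidAudio API):

```python
from pathlib import Path

def load_ctc_head(model_dir, download):
    """v1: a manually placed CtcHead.mlmodelc in the model directory wins.
    v2: otherwise delegate to a downloader callback (stand-in for the HF fetch).
    The head is optional, so any failure yields (None, None) rather than an
    error — base transcription still works without keyword spotting."""
    local = Path(model_dir) / "CtcHead.mlmodelc"
    if local.exists():
        return ("local", str(local))
    try:
        return ("downloaded",
                download("FluidInference/parakeet-ctc-110m-coreml", "CtcHead.mlmodelc"))
    except Exception:
        return (None, None)
```

Checking the local path before downloading keeps manual installs authoritative and makes offline use possible once the file is in place.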
