Merged
83 changes: 75 additions & 8 deletions Documentation/ASR/CustomVocabulary.md
@@ -17,6 +17,58 @@ The paper introduces a dynamic programming algorithm for CTC-based keyword spott

## Architecture Overview

FluidAudio supports two approaches for CTC-based custom vocabulary boosting:

### Approach 1: Standalone CTC Head (Beta, Recommended for TDT-CTC-110M)

```
┌─────────────────────────────────────────┐
│               Audio Input               │
│              (16kHz, mono)              │
└─────────────────┬───────────────────────┘
                  ▼
        ┌─────────────────┐
        │  TDT-CTC-110M   │
        │  Preprocessor   │
        │ (fused encoder) │
        └────────┬────────┘
                 │ encoder output [1, 512, T]
  ┌──────────────┴──────────────┐
  │                             │
  ▼                             ▼
┌─────────────────┐    ┌─────────────────┐
│   TDT Decoder   │    │    CTC Head     │
│ + Joint Network │    │   (1MB, beta)   │
└────────┬────────┘    └────────┬────────┘
         │                      │ ctc_logits [1, T, 1025]
         ▼                      ▼
┌─────────────────┐    ┌─────────────────┐
│  Raw Transcript │    │ Keyword Spotter │◄── Custom Vocabulary
│ "in video corp" │    │ (DP Algorithm)  │
└────────┬────────┘    └────────┬────────┘
         │                      │
         └──────────┬───────────┘
                    ▼
          ┌─────────────────┐
          │   Vocabulary    │
          │    Rescorer     │
          └────────┬────────┘
                   ▼
          ┌─────────────────┐
          │ Final Transcript│
          │  "NVIDIA Corp"  │
          └─────────────────┘
```

The standalone CTC head is a single linear projection (512 → 1025) extracted from the hybrid TDT-CTC-110M model. It reuses the TDT encoder output, requiring only ~1 MB of additional model weights and no second encoder pass.

### Approach 2: Separate CTC Encoder (Original)

```
┌─────────────────────────────────────────┐
│ Audio Input │
@@ -58,24 +110,37 @@ The paper introduces a dynamic programming algorithm for CTC-based keyword spott
└─────────────────┘
```

## Dual Encoder Alignment
### Approach Comparison

| | Standalone CTC Head (beta) | Separate CTC Encoder |
|---|---|---|
| **Additional model size** | 1 MB | 97.5 MB |
| **Second encoder pass** | No | Yes |
| **RTFx (earnings benchmark)** | 70.29x | 25.98x |
| **Dict Recall** | 99.4% | 99.4% |
| **TDT model requirement** | TDT-CTC-110M only | Any TDT model |
| **Status** | Beta | Stable |

The standalone CTC head is available only with the TDT-CTC-110M model because both the TDT and CTC heads share the same encoder in the hybrid architecture. For Parakeet TDT v2/v3 (0.6B), the separate CTC encoder approach is required.

## Encoder Alignment

### Separate CTC Encoder (Approach 2)

The system uses two separate neural network encoders that process the same audio:

### 1. TDT Encoder (Primary Transcription)
#### TDT Encoder (Primary Transcription)
- **Model**: Parakeet TDT 0.6B (600M parameters)
- **Architecture**: Token Duration Transducer with FastConformer
- **Output**: High-quality transcription with word timestamps
- **Frame Rate**: ~40ms per frame

### 2. CTC Encoder (Keyword Spotting)
#### CTC Encoder (Keyword Spotting)
- **Model**: Parakeet CTC 110M (110M parameters)
- **Architecture**: FastConformer with CTC head
- **Output**: Per-frame log-probabilities over 1024 tokens
- **Frame Rate**: ~40ms per frame (aligned with TDT)

### Frame Alignment

Both encoders use the same audio preprocessing (mel spectrogram with identical parameters), producing frames at the same rate. This enables direct timestamp comparison between:
- TDT decoder word timestamps
- CTC keyword detection timestamps
@@ -88,18 +153,20 @@
CTC Frames: [0] [1] [2] ... [374] (375 frames @ 40ms)
Aligned timestamps
```
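Because both encoders emit frames at the same ~40 ms rate, a CTC frame index converts to an audio timestamp with a single multiplication. A minimal Swift sketch (the helper name is illustrative, not FluidAudio API; the 0.04 s frame duration comes from the text above):

```swift
import Foundation

/// Frame duration shared by the TDT and CTC encoders (~40 ms, per the alignment above).
let frameDuration: Double = 0.04

/// Convert a CTC frame index to the start time of that frame, in seconds.
/// Illustrative helper, not part of the FluidAudio API.
func timestamp(forFrame index: Int) -> Double {
    Double(index) * frameDuration
}

// A 375-frame output covers 375 * 0.04 = 15 s of audio;
// the last frame (index 374) starts at 14.96 s.
let lastFrameStart = timestamp(forFrame: 374)
```

This shared 40 ms grid is what makes TDT word timestamps and CTC keyword detections directly comparable.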

### Memory Usage
#### Memory Usage

Running two encoders in parallel increases peak memory consumption:

| Configuration | Peak RAM | Notes |
|---------------|----------|-------|
| TDT encoder only | ~66 MB | Standard transcription |
| TDT + CTC encoders | ~130 MB | With vocabulary boosting |
| TDT + CTC encoders | ~130 MB | With vocabulary boosting (separate encoder) |
| TDT + CTC head | ~67 MB | With vocabulary boosting (standalone head, beta) |

*Measured on iPhone 17 Pro. Memory settles after initial model loading.*

The additional ~64 MB overhead comes from the CTC encoder (Parakeet 110M) being loaded alongside the primary TDT encoder. For memory-constrained scenarios, consider:
The standalone CTC head adds negligible memory (~1 MB) since it reuses the existing encoder output. The separate CTC encoder adds ~64 MB of overhead. For memory-constrained scenarios, consider:
- Using the standalone CTC head with TDT-CTC-110M (beta)
- Loading the CTC encoder on-demand rather than at startup
- Unloading the CTC encoder after transcription completes
- Using vocabulary boosting only for files where domain terms are expected
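For the on-demand option, one possible shape is to hold the separate CTC encoder behind an optional and release it once boosting is done. This is a sketch only; the type and method names are hypothetical and not part of the FluidAudio API:

```swift
import CoreML

/// Hypothetical wrapper sketching an on-demand lifecycle for the separate
/// CTC encoder. None of these names exist in FluidAudio.
final class OnDemandCtcEncoder {
    private var loaded: MLModel?
    private let modelURL: URL

    init(modelURL: URL) {
        self.modelURL = modelURL
    }

    /// Load lazily, only when vocabulary boosting is actually requested.
    func model(computeUnits: MLComputeUnits = .all) throws -> MLModel {
        if let loaded { return loaded }
        let config = MLModelConfiguration()
        config.computeUnits = computeUnits
        let model = try MLModel(contentsOf: modelURL, configuration: config)
        loaded = model
        return model
    }

    /// Release the ~64 MB encoder after transcription completes.
    func unload() {
        loaded = nil
    }
}
```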
69 changes: 69 additions & 0 deletions Documentation/ASR/TDT-CTC-110M.md
@@ -465,9 +465,78 @@ Tested on iPhone (iOS 17+):
- Highest accuracy required
- Extra model size acceptable

## Standalone CTC Head for Custom Vocabulary (Beta)

The TDT-CTC-110M hybrid model shares one FastConformer encoder between its TDT and CTC decoder heads. FluidAudio exploits this by exporting the CTC decoder head as a standalone 1MB CoreML model (`CtcHead.mlmodelc`) that runs on the existing TDT encoder output, enabling custom vocabulary keyword spotting without a second encoder pass.

### How It Works

```
   TDT Preprocessor (fused encoder)
                │
     encoder output [1, 512, T]
           ┌────┴────┐
           │         │
           ▼         ▼
     TDT Decoder   CtcHead (1MB, beta)
           │         │
           ▼         ▼
      transcript   ctc_logits [1, T, 1025]
           │         │
           └────┬────┘
                ▼
   Keyword Spotter / VocabularyRescorer
```

The CTC head is a single linear projection (512 → 1025) that maps the 512-dimensional encoder features to log-probabilities over 1024 BPE tokens + 1 blank token.
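Conceptually, each 512-dimensional encoder frame `h[t]` is mapped to `logits[t] = W·h[t] + b` (1025 values), with a log-softmax applied when log-probabilities are needed. A plain-Swift sketch of that math; the real head is a compiled CoreML layer, and the weights here are placeholders:

```swift
import Foundation

/// Sketch of the CTC head math: a 512 -> 1025 linear projection plus
/// log-softmax. Placeholder weights; the real head is CtcHead.mlmodelc.
struct CtcHeadSketch {
    let weights: [[Float]]  // [1025][512]
    let bias: [Float]       // [1025]: 1024 BPE tokens + 1 blank

    /// Project one 512-dim encoder frame to 1025 raw logits.
    func logits(for frame: [Float]) -> [Float] {
        (0..<bias.count).map { v in
            bias[v] + zip(weights[v], frame).reduce(0) { $0 + $1.0 * $1.1 }
        }
    }

    /// Numerically stable log-softmax, turning raw logits into log-probabilities.
    func logProbs(for frame: [Float]) -> [Float] {
        let raw = logits(for: frame)
        let maxLogit = raw.max() ?? 0
        let logSumExp = maxLogit + log(raw.reduce(Float(0)) { $0 + exp($1 - maxLogit) })
        return raw.map { $0 - logSumExp }
    }
}
```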

### Performance

Benchmarked on 772 earnings call files (Earnings22-KWS):

| Approach | Model Size | Dict Recall | RTFx |
|----------|-----------|-------------|------|
| Separate CTC encoder | 97.5 MB | 99.4% | 25.98x |
| **Standalone CTC head** | **1 MB** | **99.4%** | **70.29x** |

The standalone CTC head matches the separate encoder's keyword detection quality (99.4% dict recall) at roughly 2.7x the throughput, with about 1/97th of the additional model weight.

### Loading

The CTC head model auto-downloads from [FluidInference/parakeet-ctc-110m-coreml](https://huggingface.co/FluidInference/parakeet-ctc-110m-coreml) when loading the TDT-CTC-110M model. It also supports manual placement in the TDT model directory.

Two loading paths are supported:
1. **Local (v1):** Place `CtcHead.mlmodelc` in the TDT model directory (`parakeet-tdt-ctc-110m/`)
2. **Auto-download (v2):** Automatically downloaded from the `parakeet-ctc-110m-coreml` HuggingFace repo

```swift
// CTC head loads automatically with TDT-CTC-110M models
let models = try await AsrModels.downloadAndLoad(version: .tdtCtc110m)
// models.ctcHead is non-nil when CtcHead.mlmodelc is available
```
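Callers can treat the head as strictly optional. Building on the snippet above (the comment describes one reasonable fallback, not prescribed behavior):

```swift
let models = try await AsrModels.downloadAndLoad(version: .tdtCtc110m)
if models.ctcHead == nil {
    // CtcHead.mlmodelc was not available locally or via download;
    // custom vocabulary can still use the separate CTC encoder path.
}
```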

### Conversion

The CTC head is exported using the conversion script in the mobius repo:

```bash
cd mobius/models/stt/parakeet-tdt-ctc-110m/coreml/
uv run python export-ctc-head.py --output-dir ./ctc-head-build
xcrun coremlcompiler compile ctc-head-build/CtcHead.mlpackage ctc-head-build/
```

See [mobius PR #36](https://github.com/FluidInference/mobius/pull/36) for the conversion script.

### Status

This feature is **beta**. The CTC head produces identical keyword detection results to the separate CTC encoder, but the auto-download pathway and integration are new. See [#435](https://github.com/FluidInference/FluidAudio/issues/435) and [PR #450](https://github.com/FluidInference/FluidAudio/pull/450) for details.

## Resources

- **Model:** [FluidInference/parakeet-tdt-ctc-110m-coreml](https://huggingface.co/FluidInference/parakeet-tdt-ctc-110m-coreml)
- **CTC Head model:** [FluidInference/parakeet-ctc-110m-coreml](https://huggingface.co/FluidInference/parakeet-ctc-110m-coreml) (includes CtcHead.mlmodelc)
- **Benchmark results:** See `benchmarks.md`
- **PR:** [#433 - Add TDT-CTC-110M support](https://github.com/FluidInference/FluidAudio/pull/433)
- **CTC Head PR:** [#450 - Add standalone CTC head for custom vocabulary](https://github.com/FluidInference/FluidAudio/pull/450)
- **Original NVIDIA model:** [nvidia/parakeet-tdt-1.1b](https://huggingface.co/nvidia/parakeet-tdt-1.1b)
12 changes: 12 additions & 0 deletions Documentation/ASR/benchmarks100.md
@@ -41,3 +41,15 @@ Benchmark comparison between `main` and PR #440 (`standardize-asr-directory-stru
## Verdict

**No regressions.** WER is identical across all 6 benchmarks. RTFx differences are within normal system noise (M2 thermals, background processes). The directory restructuring is a pure file move with no behavioral changes.

## Issue #435: Standalone CTC Head for Custom Vocabulary (Beta)

Benchmark comparing separate CTC encoder vs standalone CTC head extracted from the TDT-CTC-110M hybrid model.
See [#435](https://github.com/FluidInference/FluidAudio/issues/435) and [PR #450](https://github.com/FluidInference/FluidAudio/pull/450).

| Metric | Separate CTC (v2 TDT) | Separate CTC (110m TDT) | Standalone CTC Head (110m TDT) |
|---|---|---|---|
| Dict Recall | 99.3% | 99.4% | 99.4% |
| RTFx | 43.94x | 25.98x | 70.29x |
| Additional model size | 97.5 MB | 97.5 MB | 1 MB |

39 changes: 38 additions & 1 deletion Sources/FluidAudio/ASR/Parakeet/AsrManager.swift
@@ -53,6 +53,37 @@ public actor AsrManager {
    internal var vocabSizeConfig: ContextBiasingConstants.VocabSizeConfig?
    internal var vocabBoostingEnabled: Bool { customVocabulary != nil && vocabularyRescorer != nil }

    // Cached CTC logits from fused Preprocessor (unified custom vocabulary)
    internal var cachedCtcLogits: MLMultiArray?
    internal var cachedCtcFrameDuration: Double?
    internal var cachedCtcValidFrames: Int?

    /// Whether the Preprocessor outputs CTC logits (unified custom vocabulary model).
    public var hasCachedCtcLogits: Bool { cachedCtcLogits != nil }

    /// Get cached CTC raw logits as [[Float]] for external use (e.g. benchmarks).
    /// These are raw logits — callers must apply `CtcKeywordSpotter.applyLogSoftmax()`
    /// to convert to log-probabilities before use in keyword detection.
    /// Returns nil if the CTC head model is not available or audio was multi-chunk.
    public func getCachedCtcRawLogits() -> (rawLogits: [[Float]], frameDuration: Double)? {
        guard let logits = cachedCtcLogits, let duration = cachedCtcFrameDuration else { return nil }
        let shape = logits.shape
        guard shape.count == 3 else { return nil }
        let numFrames = min(shape[1].intValue, cachedCtcValidFrames ?? shape[1].intValue)
        let vocabSize = shape[2].intValue
        var result: [[Float]] = []
        result.reserveCapacity(numFrames)
        for t in 0..<numFrames {
            var frame: [Float] = []
            frame.reserveCapacity(vocabSize)
            for v in 0..<vocabSize {
                frame.append(logits[[0, t, v] as [NSNumber]].floatValue)
            }
            result.append(frame)
        }
        return (rawLogits: result, frameDuration: duration)
    }
Comment on lines +56 to +85

🔴 No unit tests added for new CTC head functionality (AGENTS.md violation)

AGENTS.md states: "Add unit tests when writing new code." This PR adds significant new functionality — CTC head model loading (AsrModels.swift:219-252), CTC logit caching (AsrManager.swift:57-84), applyLogSoftmax() static method (CtcKeywordSpotter.swift:268-306), spotKeywordsFromLogProbs() (CtcKeywordSpotter.swift:191-254), convertCtcLogitsToArray() (AsrTranscription.swift:654-686), and the cached-logits integration in applyVocabularyRescoring — but no test files were added or modified in the PR.

Prompt for agents
Add unit tests for the new CTC head functionality. At minimum, create tests in Tests/FluidAudioTests/ for:
1. CtcKeywordSpotter.applyLogSoftmax() - verify it produces valid log-probabilities (sum to ~1 after exp), applies temperature scaling correctly, and applies blank bias to the correct index.
2. CtcKeywordSpotter.spotKeywordsFromLogProbs() - verify it produces the same detections as spotKeywordsWithLogProbs when given the same logProbs.
3. AsrManager cached CTC logit lifecycle - verify cachedCtcLogits is nil after resetState(), nil after cleanup(), and that getCachedCtcRawLogits() returns nil when no CTC head is loaded.
4. convertCtcLogitsToArray() - verify correct conversion from MLMultiArray shape [1, T, V] to [[Float]] with proper log-softmax application.

    // Cached prediction options for reuse
    internal lazy var predictionOptions: MLPredictionOptions = {
        AsrModels.optimizedPredictionOptions()
@@ -308,6 +339,9 @@ public actor AsrManager {
        let layers = asrModels?.version.decoderLayers ?? 2
        microphoneDecoderState = TdtDecoderState.make(decoderLayers: layers)
        systemDecoderState = TdtDecoderState.make(decoderLayers: layers)
        cachedCtcLogits = nil
        cachedCtcFrameDuration = nil
        cachedCtcValidFrames = nil
        Task { await sharedMLArrayCache.clear() }


@@ -322,7 +356,10 @@ public actor AsrManager {
        // Reset decoder states using fresh allocations for deterministic behavior
        microphoneDecoderState = TdtDecoderState.make(decoderLayers: layers)
        systemDecoderState = TdtDecoderState.make(decoderLayers: layers)
        // Release vocabulary boosting resources
        // Release vocabulary boosting resources and cached CTC data
        cachedCtcLogits = nil
        cachedCtcFrameDuration = nil
        cachedCtcValidFrames = nil
        disableVocabularyBoosting()
        Task { await sharedMLArrayCache.clear() }
        logger.info("AsrManager resources cleaned up")
45 changes: 45 additions & 0 deletions Sources/FluidAudio/ASR/Parakeet/AsrModels.swift
@@ -60,6 +60,8 @@ public struct AsrModels: Sendable {
    public let preprocessor: MLModel
    public let decoder: MLModel
    public let joint: MLModel
    /// Optional CTC decoder head for custom vocabulary (encoder features → CTC logits)
    public let ctcHead: MLModel?
    public let configuration: MLModelConfiguration
    public let vocabulary: [Int: String]
    public let version: AsrModelVersion
@@ -71,6 +73,7 @@ public struct AsrModels: Sendable {
        preprocessor: MLModel,
        decoder: MLModel,
        joint: MLModel,
        ctcHead: MLModel? = nil,
        configuration: MLModelConfiguration,
        vocabulary: [Int: String],
        version: AsrModelVersion
@@ -79,6 +82,7 @@ public struct AsrModels: Sendable {
self.preprocessor = preprocessor
self.decoder = decoder
self.joint = joint
self.ctcHead = ctcHead
self.configuration = configuration
self.vocabulary = vocabulary
self.version = version
@@ -207,11 +211,52 @@ extension AsrModels {
throw AsrModelsError.loadingFailed("Failed to load decoder or joint model")
}

        // [Beta] Optionally load CTC head model for custom vocabulary.
        // Supports two paths:
        //   v1: CtcHead.mlmodelc placed manually in the TDT model directory
        //   v2: Auto-download from FluidInference/parakeet-ctc-110m-coreml HF repo
        var ctcHeadModel: MLModel?
        if version == .tdtCtc110m {
            // v1: Check local TDT model directory first
            let repoDir = repoPath(from: directory, version: version)
            let ctcHeadPath = repoDir.appendingPathComponent(Names.ctcHeadFile)
            if FileManager.default.fileExists(atPath: ctcHeadPath.path) {
                let ctcConfig = MLModelConfiguration()
                ctcConfig.computeUnits = config.computeUnits
                ctcHeadModel = try? MLModel(contentsOf: ctcHeadPath, configuration: ctcConfig)
                if ctcHeadModel != nil {
                    logger.info("[Beta] Loaded CTC head model from local directory")
                } else {
                    logger.warning("CTC head model found but failed to load: \(ctcHeadPath.path)")
                }
            }

            // v2: Fall back to downloading from parakeet-ctc-110m HF repo
            if ctcHeadModel == nil {
                do {
                    let ctcModels = try await DownloadUtils.loadModels(
                        .parakeetCtc110m,
                        modelNames: [Names.ctcHeadFile],
                        directory: parentDirectory,
                        computeUnits: config.computeUnits,
                        progressHandler: progressHandler
                    )
                    ctcHeadModel = ctcModels[Names.ctcHeadFile]
                    if ctcHeadModel != nil {
                        logger.info("[Beta] Loaded CTC head model from HF repo")
                    }
                } catch {
                    logger.warning("CTC head model not available: \(error.localizedDescription)")
                }
            }
        }
Comment on lines +219 to +252

🔴 Nested if statements in CTC head loading code violate AGENTS.md control flow rule

AGENTS.md states: "Nested if statements should be absolutely avoided. Use guard statements and inverted conditions to exit early." The new CTC head loading block has 3 levels of nesting: `if version == .tdtCtc110m` → `if FileManager.default.fileExists` → `if ctcHeadModel != nil`. This could be restructured by extracting a helper method or using guard-based early exits.

Prompt for agents
In Sources/FluidAudio/ASR/Parakeet/AsrModels.swift lines 219-252, extract the CTC head loading logic into a separate private static method like `loadCtcHead(from directory: URL, parentDirectory: URL, config: MLModelConfiguration, progressHandler: ...)` that uses guard statements and early returns instead of nested ifs. The outer call site would become: `let ctcHeadModel = version == .tdtCtc110m ? try? await loadCtcHead(...) : nil`. Inside the helper, use guard for the file existence check and return early on failure, avoiding the 3-level nesting.


        let asrModels = AsrModels(
            encoder: encoderModel,
            preprocessor: preprocessorModel,
            decoder: decoderModel,
            joint: jointModel,
            ctcHead: ctcHeadModel,
            configuration: config,
            vocabulary: try loadVocabulary(from: directory, version: version),
            version: version