Merged
83 changes: 75 additions & 8 deletions Documentation/ASR/CustomVocabulary.md
@@ -17,6 +17,58 @@ The paper introduces a dynamic programming algorithm for CTC-based keyword spott

## Architecture Overview

FluidAudio supports two approaches for CTC-based custom vocabulary boosting:

### Approach 1: Standalone CTC Head (Beta, Recommended for TDT-CTC-110M)

```
┌─────────────────────────────────────────┐
│               Audio Input               │
│              (16kHz, mono)              │
└─────────────────┬───────────────────────┘
                  ▼
        ┌─────────────────┐
        │  TDT-CTC-110M   │
        │  Preprocessor   │
        │ (fused encoder) │
        └────────┬────────┘
                 │ encoder output [1, 512, T]
  ┌──────────────┴──────────────┐
  │                             │
  ▼                             ▼
┌─────────────────┐    ┌─────────────────┐
│   TDT Decoder   │    │    CTC Head     │
│ + Joint Network │    │   (1MB, beta)   │
└────────┬────────┘    └────────┬────────┘
         │                      │ ctc_logits [1, T, 1025]
         ▼                      ▼
┌─────────────────┐    ┌─────────────────┐
│  Raw Transcript │    │ Keyword Spotter │◄── Custom Vocabulary
│ "in video corp" │    │ (DP Algorithm)  │
└────────┬────────┘    └────────┬────────┘
         │                      │
         └──────────┬───────────┘
                    ▼
          ┌─────────────────┐
          │   Vocabulary    │
          │    Rescorer     │
          └────────┬────────┘
                   ▼
          ┌─────────────────┐
          │ Final Transcript│
          │  "NVIDIA Corp"  │
          └─────────────────┘
```

The standalone CTC head is a single linear projection (512 → 1025) extracted from the hybrid TDT-CTC-110M model. It reuses the TDT encoder output, requiring only ~1 MB of additional model weights and no second encoder pass.

### Approach 2: Separate CTC Encoder (Original)

```
┌─────────────────────────────────────────┐
│ Audio Input │
@@ -58,24 +110,37 @@ The paper introduces a dynamic programming algorithm for CTC-based keyword spott
└─────────────────┘
```

## Dual Encoder Alignment
### Approach Comparison

| | Standalone CTC Head (beta) | Separate CTC Encoder |
|---|---|---|
| **Additional model size** | 1 MB | 97.5 MB |
| **Second encoder pass** | No | Yes |
| **RTFx (earnings benchmark)** | 70.29x | 25.98x |
| **Dict Recall** | 99.4% | 99.4% |
| **TDT model requirement** | TDT-CTC-110M only | Any TDT model |
| **Status** | Beta | Stable |

The standalone CTC head is available only with the TDT-CTC-110M model because both the TDT and CTC heads share the same encoder in the hybrid architecture. For Parakeet TDT v2/v3 (0.6B), the separate CTC encoder approach is required.

## Encoder Alignment

### Separate CTC Encoder (Approach 2)

The system uses two separate neural network encoders that process the same audio:

### 1. TDT Encoder (Primary Transcription)
#### TDT Encoder (Primary Transcription)
- **Model**: Parakeet TDT 0.6B (600M parameters)
- **Architecture**: Token Duration Transducer with FastConformer
- **Output**: High-quality transcription with word timestamps
- **Frame Rate**: ~40ms per frame

### 2. CTC Encoder (Keyword Spotting)
#### CTC Encoder (Keyword Spotting)
- **Model**: Parakeet CTC 110M (110M parameters)
- **Architecture**: FastConformer with CTC head
- **Output**: Per-frame log-probabilities over 1024 tokens
- **Frame Rate**: ~40ms per frame (aligned with TDT)

### Frame Alignment

Both encoders use the same audio preprocessing (mel spectrogram with identical parameters), producing frames at the same rate. This enables direct timestamp comparison between:
- TDT decoder word timestamps
- CTC keyword detection timestamps
@@ -88,18 +153,20 @@
CTC Frames: [0] [1] [2] ... [374] (375 frames @ 40ms)
Aligned timestamps
```
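Because both encoders emit frames at the same ~40 ms rate, a CTC frame index converts to an audio timestamp with a single multiplication. A minimal Swift sketch (the helper name is illustrative, not FluidAudio API; the 0.04 s frame duration comes from the text above):

```swift
import Foundation

/// Frame duration shared by the TDT and CTC encoders (~40 ms, per the alignment above).
let frameDuration: Double = 0.04

/// Convert a CTC frame index to the start time of that frame, in seconds.
/// Illustrative helper, not part of the FluidAudio API.
func timestamp(forFrame index: Int) -> Double {
    Double(index) * frameDuration
}

// A 375-frame output covers 375 * 0.04 = 15 s of audio;
// the last frame (index 374) starts at 14.96 s.
let lastFrameStart = timestamp(forFrame: 374)
```

This shared 40 ms grid is what makes TDT word timestamps and CTC keyword detections directly comparable.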

### Memory Usage
#### Memory Usage

Running two encoders in parallel increases peak memory consumption:

| Configuration | Peak RAM | Notes |
|---------------|----------|-------|
| TDT encoder only | ~66 MB | Standard transcription |
| TDT + CTC encoders | ~130 MB | With vocabulary boosting |
| TDT + CTC encoders | ~130 MB | With vocabulary boosting (separate encoder) |
| TDT + CTC head | ~67 MB | With vocabulary boosting (standalone head, beta) |

*Measured on iPhone 17 Pro. Memory settles after initial model loading.*

The additional ~64 MB overhead comes from the CTC encoder (Parakeet 110M) being loaded alongside the primary TDT encoder. For memory-constrained scenarios, consider:
The standalone CTC head adds negligible memory (~1 MB) since it reuses the existing encoder output. The separate CTC encoder adds ~64 MB of overhead. For memory-constrained scenarios, consider:
- Using the standalone CTC head with TDT-CTC-110M (beta)
- Loading the CTC encoder on-demand rather than at startup
- Unloading the CTC encoder after transcription completes
- Using vocabulary boosting only for files where domain terms are expected
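For the on-demand option, one possible shape is to hold the separate CTC encoder behind an optional and release it once boosting is done. This is a sketch only; the type and method names are hypothetical and not part of the FluidAudio API:

```swift
import CoreML

/// Hypothetical wrapper sketching an on-demand lifecycle for the separate
/// CTC encoder. None of these names exist in FluidAudio.
final class OnDemandCtcEncoder {
    private var loaded: MLModel?
    private let modelURL: URL

    init(modelURL: URL) {
        self.modelURL = modelURL
    }

    /// Load lazily, only when vocabulary boosting is actually requested.
    func model(computeUnits: MLComputeUnits = .all) throws -> MLModel {
        if let loaded { return loaded }
        let config = MLModelConfiguration()
        config.computeUnits = computeUnits
        let model = try MLModel(contentsOf: modelURL, configuration: config)
        loaded = model
        return model
    }

    /// Release the ~64 MB encoder after transcription completes.
    func unload() {
        loaded = nil
    }
}
```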
69 changes: 69 additions & 0 deletions Documentation/ASR/TDT-CTC-110M.md
@@ -465,9 +465,78 @@ Tested on iPhone (iOS 17+):
- Highest accuracy required
- Extra model size acceptable

## Standalone CTC Head for Custom Vocabulary (Beta)

The TDT-CTC-110M hybrid model shares one FastConformer encoder between its TDT and CTC decoder heads. FluidAudio exploits this by exporting the CTC decoder head as a standalone 1MB CoreML model (`CtcHead.mlmodelc`) that runs on the existing TDT encoder output, enabling custom vocabulary keyword spotting without a second encoder pass.

### How It Works

```
   TDT Preprocessor (fused encoder)
                │
     encoder output [1, 512, T]
           ┌────┴────┐
           │         │
           ▼         ▼
     TDT Decoder   CtcHead (1MB, beta)
           │         │
           ▼         ▼
      transcript   ctc_logits [1, T, 1025]
           │         │
           └────┬────┘
                ▼
   Keyword Spotter / VocabularyRescorer
```

The CTC head is a single linear projection (512 → 1025) that maps the 512-dimensional encoder features to log-probabilities over 1024 BPE tokens + 1 blank token.
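Conceptually, each 512-dimensional encoder frame `h[t]` is mapped to `logits[t] = W·h[t] + b` (1025 values), with a log-softmax applied when log-probabilities are needed. A plain-Swift sketch of that math; the real head is a compiled CoreML layer, and the weights here are placeholders:

```swift
import Foundation

/// Sketch of the CTC head math: a 512 -> 1025 linear projection plus
/// log-softmax. Placeholder weights; the real head is CtcHead.mlmodelc.
struct CtcHeadSketch {
    let weights: [[Float]]  // [1025][512]
    let bias: [Float]       // [1025]: 1024 BPE tokens + 1 blank

    /// Project one 512-dim encoder frame to 1025 raw logits.
    func logits(for frame: [Float]) -> [Float] {
        (0..<bias.count).map { v in
            bias[v] + zip(weights[v], frame).reduce(0) { $0 + $1.0 * $1.1 }
        }
    }

    /// Numerically stable log-softmax, turning raw logits into log-probabilities.
    func logProbs(for frame: [Float]) -> [Float] {
        let raw = logits(for: frame)
        let maxLogit = raw.max() ?? 0
        let logSumExp = maxLogit + log(raw.reduce(Float(0)) { $0 + exp($1 - maxLogit) })
        return raw.map { $0 - logSumExp }
    }
}
```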

### Performance

Benchmarked on 772 earnings call files (Earnings22-KWS):

| Approach | Model Size | Dict Recall | RTFx |
|----------|-----------|-------------|------|
| Separate CTC encoder | 97.5 MB | 99.4% | 25.98x |
| **Standalone CTC head** | **1 MB** | **99.4%** | **70.29x** |

The standalone CTC head matches the separate encoder's keyword detection quality (99.4% dict recall) at roughly 2.7x the throughput, with about 1/97th of the additional model weight.

### Loading

The CTC head model auto-downloads from [FluidInference/parakeet-ctc-110m-coreml](https://huggingface.co/FluidInference/parakeet-ctc-110m-coreml) when loading the TDT-CTC-110M model. It also supports manual placement in the TDT model directory.

Two loading paths are supported:
1. **Local (v1):** Place `CtcHead.mlmodelc` in the TDT model directory (`parakeet-tdt-ctc-110m/`)
2. **Auto-download (v2):** Automatically downloaded from the `parakeet-ctc-110m-coreml` HuggingFace repo

```swift
// CTC head loads automatically with TDT-CTC-110M models
let models = try await AsrModels.downloadAndLoad(version: .tdtCtc110m)
// models.ctcHead is non-nil when CtcHead.mlmodelc is available
```
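Callers can treat the head as strictly optional. Building on the snippet above (the comment describes one reasonable fallback, not prescribed behavior):

```swift
let models = try await AsrModels.downloadAndLoad(version: .tdtCtc110m)
if models.ctcHead == nil {
    // CtcHead.mlmodelc was not available locally or via download;
    // custom vocabulary can still use the separate CTC encoder path.
}
```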

### Conversion

The CTC head is exported using the conversion script in the mobius repo:

```bash
cd mobius/models/stt/parakeet-tdt-ctc-110m/coreml/
uv run python export-ctc-head.py --output-dir ./ctc-head-build
xcrun coremlcompiler compile ctc-head-build/CtcHead.mlpackage ctc-head-build/
```

See [mobius PR #36](https://github.com/FluidInference/mobius/pull/36) for the conversion script.

### Status

This feature is **beta**. The CTC head produces identical keyword detection results to the separate CTC encoder, but the auto-download pathway and integration are new. See [#435](https://github.com/FluidInference/FluidAudio/issues/435) and [PR #450](https://github.com/FluidInference/FluidAudio/pull/450) for details.

## Resources

- **Model:** [FluidInference/parakeet-tdt-ctc-110m-coreml](https://huggingface.co/FluidInference/parakeet-tdt-ctc-110m-coreml)
- **CTC Head model:** [FluidInference/parakeet-ctc-110m-coreml](https://huggingface.co/FluidInference/parakeet-ctc-110m-coreml) (includes CtcHead.mlmodelc)
- **Benchmark results:** See `benchmarks.md`
- **PR:** [#433 - Add TDT-CTC-110M support](https://github.com/FluidInference/FluidAudio/pull/433)
- **CTC Head PR:** [#450 - Add standalone CTC head for custom vocabulary](https://github.com/FluidInference/FluidAudio/pull/450)
- **Original NVIDIA model:** [nvidia/parakeet-tdt-1.1b](https://huggingface.co/nvidia/parakeet-tdt-1.1b)
12 changes: 12 additions & 0 deletions Documentation/ASR/benchmarks100.md
@@ -41,3 +41,15 @@ Benchmark comparison between `main` and PR #440 (`standardize-asr-directory-stru
## Verdict

**No regressions.** WER is identical across all 6 benchmarks. RTFx differences are within normal system noise (M2 thermals, background processes). The directory restructuring is a pure file move with no behavioral changes.

## Issue #435: Standalone CTC Head for Custom Vocabulary (Beta)

Benchmark comparing separate CTC encoder vs standalone CTC head extracted from the TDT-CTC-110M hybrid model.
See [#435](https://github.com/FluidInference/FluidAudio/issues/435) and [PR #450](https://github.com/FluidInference/FluidAudio/pull/450).

| Metric | Separate CTC (v2 TDT) | Separate CTC (110m TDT) | Standalone CTC Head (110m TDT) |
|---|---|---|---|
| Dict Recall | 99.3% | 99.4% | 99.4% |
| RTFx | 43.94x | 25.98x | 70.29x |
| Additional model size | 97.5 MB | 97.5 MB | 1 MB |

39 changes: 38 additions & 1 deletion Sources/FluidAudio/ASR/Parakeet/AsrManager.swift
@@ -53,6 +53,37 @@ public actor AsrManager {
    internal var vocabSizeConfig: ContextBiasingConstants.VocabSizeConfig?
    internal var vocabBoostingEnabled: Bool { customVocabulary != nil && vocabularyRescorer != nil }

    // Cached CTC logits from fused Preprocessor (unified custom vocabulary)
    internal var cachedCtcLogits: MLMultiArray?
    internal var cachedCtcFrameDuration: Double?
    internal var cachedCtcValidFrames: Int?

    /// Whether the Preprocessor outputs CTC logits (unified custom vocabulary model).
    public var hasCachedCtcLogits: Bool { cachedCtcLogits != nil }

    /// Get cached CTC raw logits as [[Float]] for external use (e.g. benchmarks).
    /// These are raw logits — callers must apply `CtcKeywordSpotter.applyLogSoftmax()`
    /// to convert to log-probabilities before use in keyword detection.
    /// Returns nil if the CTC head model is not available or audio was multi-chunk.
    public func getCachedCtcRawLogits() -> (rawLogits: [[Float]], frameDuration: Double)? {
        guard let logits = cachedCtcLogits, let duration = cachedCtcFrameDuration else { return nil }
        let shape = logits.shape
        guard shape.count == 3 else { return nil }
        let numFrames = min(shape[1].intValue, cachedCtcValidFrames ?? shape[1].intValue)
        let vocabSize = shape[2].intValue
        var result: [[Float]] = []
        result.reserveCapacity(numFrames)
        for t in 0..<numFrames {
            var frame: [Float] = []
            frame.reserveCapacity(vocabSize)
            for v in 0..<vocabSize {
                frame.append(logits[[0, t, v] as [NSNumber]].floatValue)
            }
            result.append(frame)
        }
        return (rawLogits: result, frameDuration: duration)
    }
Comment on lines +56 to +85

🔴 No unit tests added for new CTC head functionality (AGENTS.md violation)

AGENTS.md states: "Add unit tests when writing new code." This PR adds significant new functionality — CTC head model loading (AsrModels.swift:219-252), CTC logit caching (AsrManager.swift:57-84), applyLogSoftmax() static method (CtcKeywordSpotter.swift:268-306), spotKeywordsFromLogProbs() (CtcKeywordSpotter.swift:191-254), convertCtcLogitsToArray() (AsrTranscription.swift:654-686), and the cached-logits integration in applyVocabularyRescoring — but no test files were added or modified in the PR.

Prompt for agents
Add unit tests for the new CTC head functionality. At minimum, create tests in Tests/FluidAudioTests/ for:
1. CtcKeywordSpotter.applyLogSoftmax() - verify it produces valid log-probabilities (sum to ~1 after exp), applies temperature scaling correctly, and applies blank bias to the correct index.
2. CtcKeywordSpotter.spotKeywordsFromLogProbs() - verify it produces the same detections as spotKeywordsWithLogProbs when given the same logProbs.
3. AsrManager cached CTC logit lifecycle - verify cachedCtcLogits is nil after resetState(), nil after cleanup(), and that getCachedCtcRawLogits() returns nil when no CTC head is loaded.
4. convertCtcLogitsToArray() - verify correct conversion from MLMultiArray shape [1, T, V] to [[Float]] with proper log-softmax application.

    // Cached prediction options for reuse
    internal lazy var predictionOptions: MLPredictionOptions = {
        AsrModels.optimizedPredictionOptions()
@@ -308,6 +339,9 @@ public actor AsrManager {
        let layers = asrModels?.version.decoderLayers ?? 2
        microphoneDecoderState = TdtDecoderState.make(decoderLayers: layers)
        systemDecoderState = TdtDecoderState.make(decoderLayers: layers)
        cachedCtcLogits = nil
        cachedCtcFrameDuration = nil
        cachedCtcValidFrames = nil
        Task { await sharedMLArrayCache.clear() }


@@ -322,7 +356,10 @@ public actor AsrManager {
        // Reset decoder states using fresh allocations for deterministic behavior
        microphoneDecoderState = TdtDecoderState.make(decoderLayers: layers)
        systemDecoderState = TdtDecoderState.make(decoderLayers: layers)
        // Release vocabulary boosting resources
        // Release vocabulary boosting resources and cached CTC data
        cachedCtcLogits = nil
        cachedCtcFrameDuration = nil
        cachedCtcValidFrames = nil
        disableVocabularyBoosting()
        Task { await sharedMLArrayCache.clear() }
        logger.info("AsrManager resources cleaned up")
45 changes: 45 additions & 0 deletions Sources/FluidAudio/ASR/Parakeet/AsrModels.swift
@@ -60,6 +60,8 @@ public struct AsrModels: Sendable {
    public let preprocessor: MLModel
    public let decoder: MLModel
    public let joint: MLModel
    /// Optional CTC decoder head for custom vocabulary (encoder features → CTC logits)
    public let ctcHead: MLModel?
    public let configuration: MLModelConfiguration
    public let vocabulary: [Int: String]
    public let version: AsrModelVersion
@@ -71,6 +73,7 @@ public struct AsrModels: Sendable {
        preprocessor: MLModel,
        decoder: MLModel,
        joint: MLModel,
        ctcHead: MLModel? = nil,
        configuration: MLModelConfiguration,
        vocabulary: [Int: String],
        version: AsrModelVersion
@@ -79,6 +82,7 @@ public struct AsrModels: Sendable {
self.preprocessor = preprocessor
self.decoder = decoder
self.joint = joint
self.ctcHead = ctcHead
self.configuration = configuration
self.vocabulary = vocabulary
self.version = version
@@ -207,11 +211,52 @@ extension AsrModels {
throw AsrModelsError.loadingFailed("Failed to load decoder or joint model")
}

        // [Beta] Optionally load CTC head model for custom vocabulary.
        // Supports two paths:
        //   v1: CtcHead.mlmodelc placed manually in the TDT model directory
        //   v2: Auto-download from FluidInference/parakeet-ctc-110m-coreml HF repo
        var ctcHeadModel: MLModel?
        if version == .tdtCtc110m {
            // v1: Check local TDT model directory first
            let repoDir = repoPath(from: directory, version: version)
            let ctcHeadPath = repoDir.appendingPathComponent(Names.ctcHeadFile)
            if FileManager.default.fileExists(atPath: ctcHeadPath.path) {
                let ctcConfig = MLModelConfiguration()
                ctcConfig.computeUnits = config.computeUnits
                ctcHeadModel = try? MLModel(contentsOf: ctcHeadPath, configuration: ctcConfig)
                if ctcHeadModel != nil {
                    logger.info("[Beta] Loaded CTC head model from local directory")
                } else {
                    logger.warning("CTC head model found but failed to load: \(ctcHeadPath.path)")
                }
            }

            // v2: Fall back to downloading from parakeet-ctc-110m HF repo
            if ctcHeadModel == nil {
                do {
                    let ctcModels = try await DownloadUtils.loadModels(
                        .parakeetCtc110m,
                        modelNames: [Names.ctcHeadFile],
                        directory: parentDirectory,
                        computeUnits: config.computeUnits,
                        progressHandler: progressHandler
                    )
                    ctcHeadModel = ctcModels[Names.ctcHeadFile]
                    if ctcHeadModel != nil {
                        logger.info("[Beta] Loaded CTC head model from HF repo")
                    }
                } catch {
                    logger.warning("CTC head model not available: \(error.localizedDescription)")
                }
            }
        }
Comment on lines +219 to +252

🔴 Nested if statements in CTC head loading code violate AGENTS.md control flow rule

AGENTS.md states: "Nested if statements should be absolutely avoided. Use guard statements and inverted conditions to exit early." The new CTC head loading block has 3 levels of nesting: `if version == .tdtCtc110m` → `if FileManager.default.fileExists` → `if ctcHeadModel != nil`. This could be restructured by extracting a helper method or using guard-based early exits.

Prompt for agents
In Sources/FluidAudio/ASR/Parakeet/AsrModels.swift lines 219-252, extract the CTC head loading logic into a separate private static method like `loadCtcHead(from directory: URL, parentDirectory: URL, config: MLModelConfiguration, progressHandler: ...)` that uses guard statements and early returns instead of nested ifs. The outer call site would become: `let ctcHeadModel = version == .tdtCtc110m ? try? await loadCtcHead(...) : nil`. Inside the helper, use guard for the file existence check and return early on failure, avoiding the 3-level nesting.


        let asrModels = AsrModels(
            encoder: encoderModel,
            preprocessor: preprocessorModel,
            decoder: decoderModel,
            joint: jointModel,
            ctcHead: ctcHeadModel,
            configuration: config,
            vocabulary: try loadVocabulary(from: directory, version: version),
            version: version