
ASR tech debt cleanup: remove dead code, fix bugs, add benchmark script #460

Open
Alex-Wengg wants to merge 12 commits into main from
refactor/457-asr-manager-vocab-cleanup

Conversation


@Alex-Wengg Alex-Wengg commented Mar 28, 2026

Summary

Systematic cleanup of the ASR module addressing tech debt items from #457. Net reduction of ~430 lines while fixing real bugs and improving maintainability.

Bug fixes

  • enableFP16 silently ignored: optimizedConfiguration(enableFP16:) delegated to a shared factory that hardcoded allowLowPrecisionAccumulationOnGPU = true, ignoring the caller's parameter
  • MLArrayCache.returnArray only reset float32 data — cached arrays of other types (float16, int32) retained stale data from previous use
  • CTC model auto-detection broken: Repo.parakeetCtc110m.folderName returned "parakeet-ctc-110m" instead of "parakeet-ctc-110m-coreml" because the folderName switch fell through to a default case that stripped the -coreml suffix. Same for parakeetCtc06b.
  • Duplicate tokens at chunk merge boundary: mergeByMidpoint used <=/>= so tokens exactly at the cutoff appeared in both left and right chunks
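The cutoff fix in the last item can be sketched in a few lines. Token and the function signature here are hypothetical simplifications of the real mergeByMidpoint, which operates on decoded token/timestamp pairs; the point is that one side of the boundary must use a strict comparison:

```swift
struct Token { let id: Int; let time: Double }

// Left chunk keeps tokens strictly before the cutoff; right chunk keeps
// tokens at or after it, so a token exactly at the midpoint appears once.
func mergeByMidpoint(left: [Token], right: [Token], midpoint: Double) -> [Token] {
    left.filter { $0.time < midpoint }       // was `<=` (the bug)
        + right.filter { $0.time >= midpoint }
}

let merged = mergeByMidpoint(
    left: [Token(id: 1, time: 1.5), Token(id: 2, time: 2.0)],
    right: [Token(id: 2, time: 2.0), Token(id: 3, time: 2.5)],
    midpoint: 2.0)
print(merged.map(\.id))  // [1, 2, 3]
```

With the old `<=` on the left filter, token 2 at exactly t = 2.0 would have been emitted by both chunks.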

Dead code removal

  • Deleted ANEOptimizer indirection layer (166 lines) — was a pass-through wrapping MLModel with no optimization
  • Deleted PerformanceMonitor actor and AggregatedMetrics — never instantiated, component times hardcoded to 0
  • Deleted getFloat16Array from MLArrayCache — never called
  • Deleted sliceEncoderOutput from AsrTranscription — never called (30 lines)
  • Deleted loadWithANEOptimization from AsrModels — never called
  • Removed unused tokenTimings parameter chain through processTranscriptionResult
  • Removed unused import OSLog / import CoreML across 5 files
  • Removed nonisolated(unsafe) from SlidingWindowAsrManager (types already Sendable)

Duplication elimination

  • Extracted clearCachedCtcData() helper (replaced 3× triple-nil assignments)
  • Extracted decoderState(for:) / setDecoderState(_:for:) (replaced 4× switch blocks)
  • Extracted frameAlignedAudio() (replaced 2× duplicated frame-alignment blocks)
  • Added ASRConstants.secondsPerEncoderFrame (replaced 5× magic 0.08)
  • Replaced hardcoded 16_000 with config.sampleRate / ASRConstants.sampleRate
  • Extracted MLModelConfigurationUtils.defaultConfiguration() (replaced 5× copy-pasted config methods)
  • Extracted MLModelConfigurationUtils.defaultModelsDirectory() (replaced 3× copy-pasted directory methods)
  • Consolidated duplicate vocabularyFile / vocabularyFileArray constants
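The constants extraction above can be illustrated with a minimal sketch; the helper functions are illustrative, not the project's API, and assume one encoder frame spans 80 ms at a 16 kHz sample rate as stated:

```swift
enum ASRConstants {
    static let sampleRate = 16_000
    static let secondsPerEncoderFrame = 0.08  // one encoder frame ≈ 80 ms
}

// Frame index → timestamp, instead of scattering the magic 0.08.
func timestamp(forEncoderFrame frame: Int) -> Double {
    Double(frame) * ASRConstants.secondsPerEncoderFrame
}

// Seconds → sample count, instead of hardcoding 16_000.
func sampleCount(forSeconds seconds: Double) -> Int {
    Int(seconds * Double(ASRConstants.sampleRate))
}

print(timestamp(forEncoderFrame: 25))  // ≈ 2.0 seconds
print(sampleCount(forSeconds: 0.5))    // 8000 samples
```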

File organization

  • Moved PerformanceMetrics.swift, ProgressEmitter.swift, MLArrayCache.swift from ASR/Parakeet/ to Shared/ (used by multiple modules)
  • Renamed StreamingAudioSourceFactory → AudioSourceFactory, StreamingAudioSampleSource → AudioSampleSource (types used by both ASR and Diarizer)
  • Renamed files to match type names: SortformerDiarizerPipeline.swift → SortformerDiarizer.swift, LSEENDDiarizerAPI.swift → LSEENDDiarizer.swift, NemotronPipeline.swift → NemotronStreamingAsrManager+Pipeline.swift
  • Replaced force unwraps in RnntDecoder.swift with guard let + descriptive errors
  • Removed stale TODO about decoder state in AsrManager
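The force-unwrap replacement follows the usual guard let pattern. This sketch uses a simplified dictionary output rather than the real MLMultiArray-based decoder outputs, and the error type is hypothetical:

```swift
enum DecoderError: Error, CustomStringConvertible {
    case missingOutput(String)

    var description: String {
        switch self {
        case .missingOutput(let name): return "decoder output '\(name)' missing"
        }
    }
}

// Before: let values = outputs["logits"]!   // traps on an unexpected model
// After: surface a descriptive error the caller can handle.
func logits(from outputs: [String: [Float]]) throws -> [Float] {
    guard let values = outputs["logits"] else {
        throw DecoderError.missingOutput("logits")
    }
    return values
}

do {
    _ = try logits(from: [:])
} catch {
    print(error)  // decoder output 'logits' missing
}
```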

Benchmark script

  • Added Scripts/run_parakeet_benchmarks.sh — runs all 6 benchmarks (v3, v2, TDT-CTC-110M, CTC earnings, EOU 320ms, Nemotron 1120ms) with WER comparison against benchmarks100.md baselines and regression detection
  • Referenced from Documentation/ASR/benchmarks100.md

Verified — no regressions

| Model | Baseline | Current | Delta |
|---|---|---|---|
| Parakeet TDT v3 (0.6B) | 2.6% | 2.64% | +0.04% |
| Parakeet TDT v2 (0.6B) | 3.8% | 3.79% | -0.01% |
| CTC-TDT 110M | 3.6% | 3.56% | -0.04% |
| CTC Earnings | 16.54% | 16.51% | -0.03% |
| EOU 320ms (120M) | 7.11% | 7.11% | +0.00% |
| Nemotron 1120ms (0.6B) | 1.99% | 1.99% | +0.00% |
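The benchmark script's regression detection amounts to comparing each current WER against its baseline with a tolerance. A hypothetical sketch in Swift (the actual script is shell; the 0.1-point tolerance is an assumed value for illustration):

```swift
// Flag any model whose WER rose more than `tolerance` points over baseline.
func regressions(baseline: [String: Double],
                 current: [String: Double],
                 tolerance: Double = 0.1) -> [String] {
    current.compactMap { name, wer in
        guard let base = baseline[name] else { return nil }
        return (wer - base) > tolerance ? name : nil
    }.sorted()
}

let baseline = ["Parakeet TDT v3": 2.60, "Parakeet TDT v2": 3.80]
let current  = ["Parakeet TDT v3": 2.64, "Parakeet TDT v2": 3.79]
print(regressions(baseline: baseline, current: current))  // []
```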

Test plan

  • swift build passes
  • swift test passes (all existing tests, updated for removed dead code)
  • All 6 ASR benchmarks match baselines (100 files each)
  • swift format lint passes

…mingChunk

AsrManager is the low-level inference engine and shouldn't own vocabulary
boosting state or carry streaming semantics. This removes the duplicated
vocabulary config/rescoring from AsrManager, leaving SlidingWindowAsrManager
as the single owner. CTC head inference and cached logit storage remain in
AsrManager since they need encoder_output MLMultiArray access.

Closes #457

claude bot commented Mar 28, 2026

Claude finished @Alex-Wengg's task


PR Review: AsrManager vocabulary boosting refactor ✅

I've completed a comprehensive review of this PR focusing on code quality, architecture, and potential issues.

Review Tasks:

  • Create pending review
  • Get PR diff to understand changes
  • Analyze code changes for quality and architecture
  • Add inline comments for specific issues
  • Submit comprehensive review

Review Summary

This is a well-executed refactoring that significantly improves the codebase architecture:

✅ Key Strengths

  • Clean separation of concerns: AsrManager → low-level inference, SlidingWindowAsrManager → vocabulary boosting
  • Eliminates code duplication: 203 lines removed, single source of truth for vocabulary features
  • Semantically correct rename: transcribeStreamingChunk → transcribeChunk removes misleading streaming semantics
  • Performance preservation: Smart retention of getCachedCtcRawLogits() for benchmarks
  • Zero API breakage: Public interface unchanged for end users

📊 Code Quality Rating: ★★★★★

  • Architecture: Excellent separation of responsibilities
  • API Design: Clean, maintains compatibility
  • Memory Management: Proper cleanup, no leaks
  • Maintainability: Single ownership model is much cleaner

🧪 Testing Status

  • ✅ Build, tests, formatting all pass
  • Recommendation: Complete ASR + CTC benchmark validation before merge

Overall: APPROVE - This refactoring improves maintainability while preserving functionality and performance.


@claude claude bot left a comment


This is a well-executed refactoring that improves the codebase architecture. Here's my comprehensive review:

Excellent Changes

1. Clear Separation of Concerns

  • AsrManager: Now focused purely on low-level inference (appropriate)
  • SlidingWindowAsrManager: Single owner of vocabulary boosting (appropriate)
  • This follows the Single Responsibility Principle and eliminates code duplication

2. Method Rename is Semantically Correct

  • transcribeStreamingChunk → transcribeChunk removes misleading "streaming" semantics
  • AsrManager is the stateless inference engine, not a streaming component
  • The method preserves decoder state between calls, which is what matters

3. Smart Preservation of Performance Features

  • Correctly keeps getCachedCtcRawLogits() in AsrManager
  • CTC logit caching requires direct access to encoder_output MLMultiArray
  • This ensures benchmarks can still access raw logits efficiently

🔍 Code Quality Analysis

Architecture Quality: ★★★★★

  • Clean elimination of 59 lines from AsrManager (vocabulary state + methods)
  • 143 lines removed from AsrTranscription.swift (duplicate rescoring logic)
  • Single method call change in SlidingWindowAsrManager maintains functionality

API Design: ★★★★★

  • Public API remains unchanged for end users
  • Internal refactoring doesn't break existing consumers
  • Method renaming improves API clarity without functional changes

Memory Management: ★★★★★

  • Proper cleanup in resetState() and cleanup()
  • Removes vocabulary boosting cleanup call since state is no longer held

🧪 Testing & Validation

The test plan looks comprehensive:

  • ✅ Build compiles cleanly
  • ✅ All tests pass
  • ✅ No swift-format violations
  • ✅ No dangling references to old methods
  • ⏳ Pending: ASR benchmark WER validation
  • ⏳ Pending: CTC earnings benchmark validation

📝 Minor Observations

Variable Declaration Style

  • Good change from var result = ... to let result = ... since vocabulary rescoring is removed
  • Shows the code is now more functional (immutable results)

Documentation Impact

  • The removed methods had good documentation - ensure equivalent docs exist in SlidingWindowAsrManager

🚀 Recommendation: APPROVE

This refactoring:

  1. Improves maintainability by eliminating duplicate code paths
  2. Clarifies responsibilities between inference engine vs. high-level features
  3. Maintains performance by keeping critical optimizations in place
  4. Preserves functionality while improving code organization

The only remaining items are benchmark validations, which are appropriate to verify no regressions were introduced.

Nit: Consider running the benchmarks mentioned in the test plan to complete the validation before merging.


@devin-ai-integration devin-ai-integration bot left a comment


✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 3 additional findings.



github-actions bot commented Mar 28, 2026

PocketTTS Smoke Test ✅

| Check | Result |
|---|---|
| Build | |
| Model download | |
| Model load | |
| Synthesis pipeline | |
| Output WAV | ✅ (183.8 KB) |

Runtime: 0m31s

Note: PocketTTS uses CoreML MLState (macOS 15) KV cache + Mimi streaming state. CI VM lacks physical GPU — audio quality may differ from Apple Silicon.


github-actions bot commented Mar 28, 2026

Qwen3-ASR int8 Smoke Test ✅

| Check | Result |
|---|---|
| Build | |
| Model download | |
| Model load | |
| Transcription pipeline | |
| Decoder size | 571 MB (vs 1.1 GB f32) |

Performance Metrics

| Metric | CI Value | Expected on Apple Silicon |
|---|---|---|
| Median RTFx | 0.05x | ~2.5x |
| Overall RTFx | 0.05x | ~2.5x |

Runtime: 4m15s

Note: CI VM lacks physical GPU — CoreML MLState (macOS 15) KV cache produces degraded results on virtualized runners. On Apple Silicon: ~1.3% WER / 2.5x RTFx.


github-actions bot commented Mar 28, 2026

Sortformer High-Latency Benchmark Results

ES2004a Performance (30.4s latency config)

| Metric | Value | Target | Status |
|---|---|---|---|
| DER | 33.4% | <35% | |
| Miss Rate | 24.4% | - | - |
| False Alarm | 0.2% | - | - |
| Speaker Error | 8.8% | - | - |
| RTFx | 11.6x | >1.0x | |
| Speakers | 4/4 | - | - |

Sortformer High-Latency • ES2004a • Runtime: 2m 23s • 2026-03-29T01:38:01.340Z


github-actions bot commented Mar 28, 2026

ASR Benchmark Results ✅

Status: All benchmarks passed

Parakeet v3 (multilingual)

| Dataset | WER Avg | WER Med | RTFx | Status |
|---|---|---|---|---|
| test-clean | 0.57% | 0.00% | 5.89x | |
| test-other | 1.19% | 0.00% | 3.78x | |

Parakeet v2 (English-optimized)

| Dataset | WER Avg | WER Med | RTFx | Status |
|---|---|---|---|---|
| test-clean | 0.80% | 0.00% | 4.99x | |
| test-other | 1.00% | 0.00% | 3.17x | |

Streaming (v3)

| Metric | Value | Description |
|---|---|---|
| WER | 0.00% | Word Error Rate in streaming mode |
| RTFx | 0.67x | Streaming real-time factor |
| Avg Chunk Time | 1.312s | Average time to process each chunk |
| Max Chunk Time | 1.495s | Maximum chunk processing time |
| First Token | 1.587s | Latency to first transcription token |
| Total Chunks | 31 | Number of chunks processed |

Streaming (v2)

| Metric | Value | Description |
|---|---|---|
| WER | 0.00% | Word Error Rate in streaming mode |
| RTFx | 0.53x | Streaming real-time factor |
| Avg Chunk Time | 1.721s | Average time to process each chunk |
| Max Chunk Time | 2.345s | Maximum chunk processing time |
| First Token | 1.655s | Latency to first transcription token |
| Total Chunks | 31 | Number of chunks processed |

Streaming tests use 5 files with 0.5s chunks to simulate real-time audio streaming

25 files per dataset • Test runtime: 6m4s • 03/28/2026, 09:36 PM EST

RTFx = Real-Time Factor (higher is better) • Calculated as: Total audio duration ÷ Total processing time
Processing time includes: Model inference on Apple Neural Engine, audio preprocessing, state resets between files, token-to-text conversion, and file I/O
Example: RTFx of 2.0x means 10 seconds of audio processed in 5 seconds (2x faster than real-time)

Expected RTFx Performance on Physical M1 Hardware:

• M1 Mac: ~28x (clean), ~25x (other)
• CI shows ~0.5-3x due to virtualization limitations

Testing methodology follows HuggingFace Open ASR Leaderboard
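The RTFx definition above reduces to a one-line formula; as a trivial sketch (not project code):

```swift
// RTFx = total audio duration ÷ total processing time (higher is better).
func realTimeFactor(audioSeconds: Double, processingSeconds: Double) -> Double {
    audioSeconds / processingSeconds
}

// 10 s of audio processed in 5 s → 2.0x faster than real time.
print(realTimeFactor(audioSeconds: 10, processingSeconds: 5))  // 2.0
```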


github-actions bot commented Mar 28, 2026

Speaker Diarization Benchmark Results

Speaker Diarization Performance

Evaluating "who spoke when" detection accuracy

| Metric | Value | Target | Status | Description |
|---|---|---|---|---|
| DER | 15.1% | <30% | | Diarization Error Rate (lower is better) |
| JER | 24.9% | <25% | | Jaccard Error Rate |
| RTFx | 16.14x | >1.0x | | Real-Time Factor (higher is faster) |

Diarization Pipeline Timing Breakdown

Time spent in each stage of speaker diarization

| Stage | Time (s) | % | Description |
|---|---|---|---|
| Model Download | 12.934 | 19.9 | Fetching diarization models |
| Model Compile | 5.543 | 8.5 | CoreML compilation |
| Audio Load | 0.171 | 0.3 | Loading audio file |
| Segmentation | 19.494 | 30.0 | Detecting speech regions |
| Embedding | 32.491 | 50.0 | Extracting speaker voices |
| Clustering | 12.996 | 20.0 | Grouping same speakers |
| Total | 65.032 | 100 | Full pipeline |

Speaker Diarization Research Comparison

Research baselines typically achieve 18-30% DER on standard datasets

| Method | DER | Notes |
|---|---|---|
| FluidAudio | 15.1% | On-device CoreML |
| Research baseline | 18-30% | Standard dataset performance |

Note: RTFx shown above is from GitHub Actions runner. On Apple Silicon with ANE:

  • M2 MacBook Air (2022): Runs at 150 RTFx real-time
  • Performance scales with Apple Neural Engine capabilities

🎯 Speaker Diarization Test • AMI Corpus ES2004a • 1049.0s meeting audio • 65.0s diarization time • Test runtime: 3m 14s • 03/28/2026, 09:42 PM EST


github-actions bot commented Mar 28, 2026

VAD Benchmark Results

Performance Comparison

| Dataset | Accuracy | Precision | Recall | F1-Score | RTFx | Files |
|---|---|---|---|---|---|---|
| MUSAN | 92.0% | 86.2% | 100.0% | 92.6% | 723.6x faster | 50 |
| VOiCES | 92.0% | 86.2% | 100.0% | 92.6% | 695.9x faster | 50 |

Dataset Details

  • MUSAN: Music, Speech, and Noise dataset - standard VAD evaluation
  • VOiCES: Voices Obscured in Complex Environmental Settings - tests robustness in real-world conditions

✅: Average F1-Score above 70%


github-actions bot commented Mar 28, 2026

Parakeet EOU Benchmark Results ✅

Status: Benchmark passed
Chunk Size: 320ms
Files Tested: 100/100

Performance Metrics

| Metric | Value | Description |
|---|---|---|
| WER (Avg) | 7.03% | Average Word Error Rate |
| WER (Med) | 4.17% | Median Word Error Rate |
| RTFx | 6.79x | Real-time factor (higher = faster) |
| Total Audio | 470.6s | Total audio duration processed |
| Total Time | 73.0s | Total processing time |

Streaming Metrics

| Metric | Value | Description |
|---|---|---|
| Avg Chunk Time | 0.073s | Average chunk processing time |
| Max Chunk Time | 0.146s | Maximum chunk processing time |
| EOU Detections | 0 | Total End-of-Utterance detections |

Test runtime: 1m43s • 03/28/2026, 09:34 PM EST

RTFx = Real-Time Factor (higher is better) • Processing includes: Model inference, audio preprocessing, state management, and file I/O


github-actions bot commented Mar 28, 2026

Offline VBx Pipeline Results

Speaker Diarization Performance (VBx Batch Mode)

Optimal clustering with Hungarian algorithm for maximum accuracy

| Metric | Value | Target | Status | Description |
|---|---|---|---|---|
| DER | 14.5% | <20% | | Diarization Error Rate (lower is better) |
| RTFx | 3.86x | >1.0x | | Real-Time Factor (higher is faster) |

Offline VBx Pipeline Timing Breakdown

Time spent in each stage of batch diarization

| Stage | Time (s) | % | Description |
|---|---|---|---|
| Model Download | 14.305 | 5.3 | Fetching diarization models |
| Model Compile | 6.131 | 2.3 | CoreML compilation |
| Audio Load | 0.106 | 0.0 | Loading audio file |
| Segmentation | 28.320 | 10.4 | VAD + speech detection |
| Embedding | 270.530 | 99.6 | Speaker embedding extraction |
| Clustering (VBx) | 0.797 | 0.3 | Hungarian algorithm + VBx clustering |
| Total | 271.523 | 100 | Full VBx pipeline |

Speaker Diarization Research Comparison

Offline VBx achieves competitive accuracy with batch processing

| Method | DER | Mode | Description |
|---|---|---|---|
| FluidAudio (Offline) | 14.5% | VBx Batch | On-device CoreML with optimal clustering |
| FluidAudio (Streaming) | 17.7% | Chunk-based | First-occurrence speaker mapping |
| Research baseline | 18-30% | Various | Standard dataset performance |

Pipeline Details:

  • Mode: Offline VBx with Hungarian algorithm for optimal speaker-to-cluster assignment
  • Segmentation: VAD-based voice activity detection
  • Embeddings: WeSpeaker-compatible speaker embeddings
  • Clustering: PowerSet with VBx refinement
  • Accuracy: Higher than streaming due to optimal post-hoc mapping

🎯 Offline VBx Test • AMI Corpus ES2004a • 1049.0s meeting audio • 299.6s processing • Test runtime: 5m 0s • 03/28/2026, 09:46 PM EST

- Remove deprecated calculateStartFrameOffset and its tests
- Add explicit parakeetTdtCtc110m case to Repo.folderName
- Extract duplicated defaultConfiguration() and defaultModelsDirectory()
  into shared MLModelConfigurationUtils, replacing 5+3 copy-pasted methods
- Rename StreamingAudioSourceFactory/SampleSource/Error to drop misleading
  "Streaming" prefix (types are used by both ASR and Diarizer)
- Rename files to match their type names (SortformerDiarizer, LSEENDDiarizer,
  NemotronStreamingAsrManager+Pipeline)
- Remove stale TODO and duplicate vocabularyFileArray constant
- Remove outdated nonisolated(unsafe) from SlidingWindowAsrManager
- Replace force unwraps in RnntDecoder with guard let + throw
- Delete dead loadModel() and getDefaultModelsDirectory() (legacy Models/Parakeet path)
- Remove dangling doc comment from deleted property
- Rename transcribeStreaming → transcribeDiskBacked (avoids confusion with
  the real streaming API in SlidingWindowAsrManager)
- Convert getDecoderLayers() to decoderLayerCount computed property
- Move AudioSource enum from AsrManager.swift to Shared/AudioSource.swift
- Mark pure utility methods as nonisolated: normalizedTimingToken,
  calculateConfidence, sliceEncoderOutput, removeDuplicateTokenSequence
ANEOptimizer was a thin wrapper over ANEMemoryUtils in the wrong location
(ASR/Parakeet/ instead of Shared/). All callers now use ANEMemoryUtils
directly.

- Replace ANEOptimizer.createANEAlignedArray → ANEMemoryUtils.createAlignedArray
- Replace ANEOptimizer.prefetchToNeuralEngine(x) → x.prefetchToNeuralEngine()
  (MLMultiArray extension already in ANEMemoryOptimizer)
- Move convertToFloat16 to ANEMemoryUtils (throws proper ANEMemoryError)
- Move ZeroCopyFeatureProvider to Shared/ZeroCopyFeatureProvider.swift
- Inline optimalComputeUnits (always returned .cpuAndNeuralEngine),
  delete ModelType enum
- Simplify AsrModels.optimizedConfiguration to use shared utility
- Delete ANEOptimizer.swift

…te duplication

- Remove unused import OSLog and dead sliceEncoderOutput method
- Add clearCachedCtcData() helper to eliminate repeated nil assignments
- Add decoderState(for:)/setDecoderState(_:for:) to eliminate switch duplication
- Extract frameAlignedAudio() helper for duplicated frame-alignment logic
- Add ASRConstants.secondsPerEncoderFrame to replace magic number 0.08
- Replace hardcoded 16_000 with config.sampleRate
- Remove unused duration parameter from calculateConfidence
- Simplify processTranscriptionResult by removing dead tokenTimings parameter
- Replace convertTokensWithExistingTimings with simpler convertTokensToText
…ad code

- Apply enableFP16 to allowLowPrecisionAccumulationOnGPU in optimizedConfiguration
  (fixes review feedback on PR #460)
- Remove dead loadWithANEOptimization method (no callers)
- Remove unused import OSLog
- Delete PerformanceMonitor actor (never instantiated, component times
  hardcoded to 0)
- Delete AggregatedMetrics struct (only used by dead monitor)
- Remove unused imports (os, MachTaskSelfWrapper)
- Move ASRPerformanceMetrics to Shared/ (not Parakeet-specific)
- Remove dead PerformanceMonitor tests, keep ASRPerformanceMetrics tests
- Remove redundant currentStream() wrapper (callers use ensureSession)
- Fix finishSession: return early when inactive instead of creating
  an orphan stream
- Remove auto-create in resetAndPrepareNextSession (renamed to reset);
  next ensureSession() creates on demand
- Remove onTermination closure with unnecessary weak self
- Move from ASR/Parakeet/ to Shared/ (generic async stream utility)
…hared

- Remove unused `import os` and logger
- Delete dead `getFloat16Array` method and its 2 tests
- Fix `returnArray` to reset data for all types, not just float32
- Remove debug logging from hot path
- Move from ASR/Parakeet/ to Shared/ (used by ASR and shared cache)
…ff bug

- Remove unused `import CoreML` and `import OSLog`
- Replace hardcoded `sampleRate = 16000` with `ASRConstants.sampleRate`
- Replace manual frameDuration calculation with `ASRConstants.secondsPerEncoderFrame`
- Fix duplicate token at cutoff boundary in mergeByMidpoint (`<=` to `<`)
Track the benchmark orchestration script that runs all 4 Parakeet
model benchmarks (v3, v2, TDT-CTC-110M, CTC earnings) with asset
verification and sleep prevention. Link it from the benchmark results
doc for reproducibility.

Whitelist the script in .gitignore (scripts/ was ignored).
- Add explicit folderName cases for parakeetCtc110m and parakeetCtc06b
  in ModelNames.swift. The default case strips "-coreml" which broke
  auto-detection of the CTC model directory.
- Add EOU 320ms and Nemotron 1120ms streaming benchmarks to the script
- Add WER comparison table against benchmarks100.md baselines
- Fix CTC earnings to use v2 TDT (matching baseline config)
- Fix WER extraction for fields stored as percentages vs decimals

Verified: all 6 benchmarks match baselines (v3 2.6%, v2 3.8%,
TDT-CTC 3.6%, earnings 16.5%, EOU 7.11%, Nemotron 1.99%).
@Alex-Wengg Alex-Wengg changed the title Move vocabulary boosting out of AsrManager and rename transcribeStreamingChunk ASR tech debt cleanup: remove dead code, fix bugs, add benchmark script Mar 29, 2026