
Add RTFx tracking to qwen3-asr-benchmark and validate RTFx in all benchmarks #454

Closed
Alex-Wengg wants to merge 5 commits into main from feature/add-rtfx-tracking-to-qwen3-benchmark

Conversation


Alex-Wengg (Member) commented Mar 28, 2026

Summary

  • Add RTFx metric extraction to qwen3-asr-benchmark.yml
  • Add RTFx validation to ALL benchmark workflows to fail if RTFx is 0
  • Display medianRTFx and overallRTFx in qwen3 PR comments
  • Align with other benchmark workflows (asr, diarizer, parakeet-eou, sortformer, vad)

Changes

RTFx Tracking (qwen3-asr-benchmark.yml)

Extract and display performance metrics that were already being computed by the underlying qwen3-benchmark command but not shown in CI:

  • medianRTFx - Median real-time factor across test files
  • overallRTFx - Overall real-time factor (total audio / total inference time)
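A minimal sketch of this extraction step, assuming the benchmark writes a JSON results file with top-level `medianRTFx` and `overallRTFx` keys; the file name `results.json` is a placeholder, not the workflow's actual path:

```shell
# Placeholder results file for illustration; the real workflow reads the
# JSON emitted by the qwen3-benchmark command.
cat > results.json <<'EOF'
{"medianRTFx": 12.3456, "overallRTFx": 10.5}
EOF

# Extract each metric, falling back to "N/A" if the key is missing or null.
MEDIAN_RTFX=$(jq -r '.medianRTFx // "N/A"' results.json)
OVERALL_RTFX=$(jq -r '.overallRTFx // "N/A"' results.json)

# Format numbers to two decimals (matching the printf "%.2f" style used
# by the other benchmark workflows); leave "N/A" untouched.
[ "$MEDIAN_RTFX" != "N/A" ] && MEDIAN_RTFX=$(printf "%.2f" "$MEDIAN_RTFX")
[ "$OVERALL_RTFX" != "N/A" ] && OVERALL_RTFX=$(printf "%.2f" "$OVERALL_RTFX")

echo "medianRTFx: $MEDIAN_RTFX"
echo "overallRTFx: $OVERALL_RTFX"
```

In the workflow itself the two values would then be written to `$GITHUB_OUTPUT` for the PR-comment step.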

RTFx Validation (all benchmark workflows)

Add validation that fails workflows with exit 1 if RTFx is 0 or N/A, indicating a silent benchmark failure:

  • qwen3-asr-benchmark.yml: Validate medianRTFx and overallRTFx
  • asr-benchmark.yml: Validate all 6 RTFx metrics (v2/v3, clean/other, streaming)
  • diarizer-benchmark.yml: Validate RTFx
  • parakeet-eou-benchmark.yml: Validate RTFx
  • sortformer-benchmark.yml: Validate RTFx
  • vad-benchmark.yml: Validate MUSAN and VOiCES RTFx

Why Fail on RTFx = 0?

If RTFx is 0 after benchmarking, it means:

  1. Benchmark didn't run properly
  2. Audio duration was 0
  3. Processing failed silently
  4. Metric extraction failed

Better to fail fast with clear error messages than report misleading zero metrics.
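The fail-fast check can be sketched as a small shell guard; the variable name `RTFX` and the zero/N-A string comparisons are illustrative, while the real workflows compare against values parsed from their own results files:

```shell
# Illustrative value; in the workflow this comes from the parsed results.
RTFX="0.00"

validate_rtfx() {
  # $1 = metric label, $2 = value parsed from the benchmark output
  case "$2" in
    0|0.00|N/A)
      echo "❌ CRITICAL: $1 is 0 or N/A - benchmark failed"
      return 1
      ;;
  esac
  echo "✅ $1 = $2"
}

if ! validate_rtfx "RTFx" "$RTFX"; then
  # The real workflow does `exit 1` here to fail fast; recorded in a
  # variable so this sketch keeps running.
  VALIDATION_FAILED=1
fi
echo "validation failed: ${VALIDATION_FAILED:-0}"
```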

Context

All other benchmark workflows track RTFx. The qwen3-asr-benchmark was the only benchmark workflow that wasn't displaying these metrics despite the CLI command outputting them. Additionally, none of the workflows were validating RTFx values - they would silently report 0x if benchmarks failed.

🤖 Generated with Claude Code

Alex-Wengg and others added 4 commits March 28, 2026 14:19
Remove redundant special cases in folderName property. Now only keeps special
cases for nested directory structures (EOU and Nemotron variants) and uses
a simple default rule: strip "-coreml" suffix from the name.

This eliminates the inconsistency raised in #442 by applying a consistent
pattern across all models.

Before:
- Had 10+ special cases explicitly returning shortened names
- parakeetTdtCtc110m was inconsistent with other Parakeet models

After:
- Only 5 special cases for nested directories (parakeet-eou-streaming/*, nemotron-streaming/*)
- Default strips -coreml suffix for all other models
- All Parakeet models now follow the same pattern

Fixes #442

Add back special cases for kokoro and sortformer to preserve existing
folder names and avoid forcing users to re-download models. Still removes
redundant special cases (lseend, pocketTts, multilingualG2p, parakeetTdtCtc110m)
that can safely use the default -coreml stripping logic.

Result: 7 special cases total (kokoro, sortformer, + 5 nested directories)
vs 11 special cases before. Still achieves consistency for Parakeet models
without breaking existing cached model locations.

Extract and display medianRTFx and overallRTFx metrics in the PR comment.
Previously the workflow was running benchmarks but not displaying the
performance metrics.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

claude bot commented Mar 28, 2026

Claude finished @Alex-Wengg's task.


PR Review: Add RTFx tracking to qwen3-asr-benchmark workflow

Todo List:

  • Create pending GitHub review
  • Analyze PR diff and changed files
  • Add inline review comments for specific feedback
  • Submit non-blocking review with overall summary

Review Complete ✅

I've completed a thorough review of this PR. The changes successfully add RTFx tracking to the qwen3-asr-benchmark workflow, bringing it in line with other benchmark workflows.

Key findings:

  • Solid implementation with proper error handling and null checks
  • Good consistency with existing benchmark workflow patterns
  • Clean code simplification in ModelNames.swift reduces duplication
  • 📝 Minor suggestion to extract jq parsing logic to reduce duplication (non-blocking)

Recommendation: Ready to merge. This achieves the stated goal of adding RTFx metrics display and includes a nice code cleanup bonus.



claude bot left a comment


Review Summary

This PR successfully adds RTFx tracking to the qwen3-asr-benchmark workflow, aligning it with other benchmark workflows. The changes are well-implemented overall.

✅ Strengths

  • Consistency: Aligns qwen3 benchmark with other workflows (asr, diarizer, etc.)
  • Robust error handling: Proper null checks and fallbacks to "N/A"
  • Good formatting: Uses printf "%.2f" for consistent numeric display
  • Clear presentation: Well-structured table format in PR comments

🔍 Areas for improvement

Workflow file (.github/workflows/qwen3-asr-benchmark.yml):

  • Nit: The jq parsing logic is duplicated for both metrics. Consider extracting to a helper function to improve maintainability.
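One way to address this nit, sketched with a hypothetical helper; the function name, file name, and sample JSON are assumptions for illustration:

```shell
# Hypothetical helper consolidating the duplicated jq + printf logic.
extract_rtfx() {
  # $1 = JSON key, $2 = results file
  value=$(jq -r ".$1 // \"N/A\"" "$2")
  if [ "$value" = "N/A" ]; then
    echo "N/A"
  else
    printf "%.2f" "$value"
  fi
}

# Sample input standing in for the benchmark's real output file.
cat > results.json <<'EOF'
{"medianRTFx": 87.654, "overallRTFx": 91.2}
EOF

MEDIAN=$(extract_rtfx medianRTFx results.json)
OVERALL=$(extract_rtfx overallRTFx results.json)
echo "medianRTFx=$MEDIAN overallRTFx=$OVERALL"
```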

ModelNames.swift changes:

  • Good simplification: The switch to a default case with .replacingOccurrences(of: "-coreml", with: "") reduces code duplication and is more maintainable.
  • Consistent: The test update properly reflects the new simplified logic.

🎯 Overall Assessment

The changes fulfill the stated goal of adding RTFx metrics to match other benchmark workflows. The implementation is solid with proper error handling and consistent formatting. The ModelNames.swift simplification is a nice bonus that improves code maintainability.

Recommendation: Approve - this is ready to merge. The nit about extracting the jq logic is minor and can be addressed in a future refactor if desired.

Alex-Wengg marked this pull request as draft March 28, 2026 18:40

devin-ai-integration bot left a comment


Devin Review found 2 potential issues.

View 3 additional findings in Devin Review.



🟡 CI cache path references old folder name qwen3-asr-0.6b-coreml that no longer matches code

The CI workflow caches ~/Library/Application Support/FluidAudio/Models/qwen3-asr-0.6b-coreml (line 30), but after the folderName default-case change in Sources/FluidAudio/ModelNames.swift:133, Repo.qwen3AsrInt8.folderName now returns qwen3-asr-0.6b/int8 instead of qwen3-asr-0.6b-coreml/int8. The models will be downloaded to a path under qwen3-asr-0.6b/ which is not covered by the CI cache configuration, so the cache will never hit and models will be re-downloaded on every CI run.



        return "parakeet-tdt-ctc-110m"
    default:
-       return name
+       return name.replacingOccurrences(of: "-coreml", with: "")

devin-ai-integration bot commented Mar 28, 2026


🔴 Changing folderName default from name to name.replacingOccurrences(of: "-coreml", with: "") silently renames cache directories for 8 repos

The refactoring intended to simplify the 4 explicitly removed cases (.lseend, .pocketTts, .multilingualG2p, .parakeetTdtCtc110m) into the default, but the default itself was changed from return name to return name.replacingOccurrences(of: "-coreml", with: ""). This changes the folderName for every repo that was previously falling through to the old default:

  • .vad: "silero-vad-coreml" → "silero-vad"
  • .parakeet: "parakeet-tdt-0.6b-v3-coreml" → "parakeet-tdt-0.6b-v3"
  • .parakeetV2: "parakeet-tdt-0.6b-v2-coreml" → "parakeet-tdt-0.6b-v2"
  • .parakeetCtc110m/.parakeetCtc06b: similarly stripped
  • .diarizer: "speaker-diarization-coreml" → "speaker-diarization"
  • .qwen3Asr/.qwen3AsrInt8: "qwen3-asr-0.6b-coreml/..." → "qwen3-asr-0.6b/..."

folderName is used pervasively to construct local model cache paths (DownloadUtils.swift:135, DownloadUtils.swift:190, AsrModels.swift:501, DiarizerModels.swift:106, etc.). This means (1) all existing cached models at old paths become orphaned and trigger unnecessary re-downloads, and (2) CI workflow cache paths still reference the old names (e.g. asr-benchmark.yml:28-29 caches parakeet-tdt-0.6b-v3-coreml but code now expects parakeet-tdt-0.6b-v3), rendering CI caches completely useless.
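The mismatch can be demonstrated with a quick shell analogue of the Swift change; suffix stripping via `${name%-coreml}` only approximates `replacingOccurrences`, which would also replace non-suffix occurrences:

```shell
# Old default returned the repo name unchanged; the new default strips
# "-coreml", moving the on-disk model directory.
name="qwen3-asr-0.6b-coreml"
old_folder="$name"
new_folder="${name%-coreml}"

base="$HOME/Library/Application Support/FluidAudio/Models"
echo "CI cache path (old): $base/$old_folder"   # what the workflow still caches
echo "download path (new): $base/$new_folder"   # where models now land
```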


Add validation to all benchmark workflows to fail with exit 1 if RTFx
metrics are 0 or N/A, indicating a silent benchmark failure.

Changes:
- qwen3-asr-benchmark.yml: Validate medianRTFx and overallRTFx
- asr-benchmark.yml: Validate all 6 RTFx metrics (v2/v3, clean/other, streaming)
- diarizer-benchmark.yml: Validate RTFx
- parakeet-eou-benchmark.yml: Validate RTFx
- sortformer-benchmark.yml: Validate RTFx
- vad-benchmark.yml: Validate MUSAN and VOiCES RTFx

If RTFx is 0, it means either:
1. Benchmark didn't run properly
2. Audio duration was 0
3. Processing failed silently

Better to fail fast than report misleading metrics.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

github-actions bot commented Mar 28, 2026

Speaker Diarization Benchmark Results

Speaker Diarization Performance

Evaluating "who spoke when" detection accuracy

| Metric | Value | Target | Status | Description |
|---|---|---|---|---|
| DER | NaN% | <30% | ⚠️ | Diarization Error Rate (lower is better) |
| JER | NaN% | <25% | ⚠️ | Jaccard Error Rate |
| RTFx | NaNx | >1.0x | ⚠️ | Real-Time Factor (higher is faster) |

Diarization Pipeline Timing Breakdown

Time spent in each stage of speaker diarization

| Stage | Time (s) | % | Description |
|---|---|---|---|
| Model Download | NaN | NaN | Fetching diarization models |
| Model Compile | NaN | NaN | CoreML compilation |
| Audio Load | NaN | NaN | Loading audio file |
| Segmentation | NaN | NaN | Detecting speech regions |
| Embedding | NaN | NaN | Extracting speaker voices |
| Clustering | NaN | NaN | Grouping same speakers |
| Total | NaN | 100 | Full pipeline |

Speaker Diarization Research Comparison

Research baselines typically achieve 18-30% DER on standard datasets

| Method | DER | Notes |
|---|---|---|
| FluidAudio | NaN% | On-device CoreML |
| Research baseline | 18-30% | Standard dataset performance |

Note: RTFx shown above is from GitHub Actions runner. On Apple Silicon with ANE:

  • M2 MacBook Air (2022): runs at 150x real-time
  • Performance scales with Apple Neural Engine capabilities

🎯 Speaker Diarization Test • AMI Corpus ES2004a • NaNs meeting audio • NaNs diarization time • Test runtime: N/A • 03/28/2026, 03:39 PM EST


github-actions bot commented Mar 28, 2026

Sortformer High-Latency Benchmark Results

ES2004a Performance (30.4s latency config)

| Metric | Value | Target | Status |
|---|---|---|---|
| DER | 0.0% | <35% | |
| Miss Rate | 0.0% | - | - |
| False Alarm | 0.0% | - | - |
| Speaker Error | 0.0% | - | - |
| RTFx | 0.0x | >1.0x | ⚠️ |
| Speakers | 0/0 | - | - |

Sortformer High-Latency • ES2004a • Runtime: N/A • 2026-03-28T18:58:14.301Z

Alex-Wengg changed the title from "Add RTFx tracking to qwen3-asr-benchmark workflow" to "Add RTFx tracking to qwen3-asr-benchmark and validate RTFx in all benchmarks" Mar 28, 2026

github-actions bot commented Mar 28, 2026

VAD Benchmark Results

Performance Comparison

| Dataset | Accuracy | Precision | Recall | F1-Score | RTFx | Files |
|---|---|---|---|---|---|---|
| MUSAN | 92.0% | 86.2% | 100.0% | 92.6% | 713.2x faster | 50 |
| VOiCES | 92.0% | 86.2% | 100.0% | 92.6% | 724.6x faster | 50 |

Dataset Details

  • MUSAN: Music, Speech, and Noise dataset - standard VAD evaluation
  • VOiCES: Voices Obscured in Complex Environmental Settings - tests robustness in real-world conditions

✅: Average F1-Score above 70%

@github-actions

Offline VBx Pipeline Results

Speaker Diarization Performance (VBx Batch Mode)

Optimal clustering with Hungarian algorithm for maximum accuracy

| Metric | Value | Target | Status | Description |
|---|---|---|---|---|
| DER | NaN% | <20% | ⚠️ | Diarization Error Rate (lower is better) |
| RTFx | NaNx | >1.0x | ⚠️ | Real-Time Factor (higher is faster) |

Offline VBx Pipeline Timing Breakdown

Time spent in each stage of batch diarization

| Stage | Time (s) | % | Description |
|---|---|---|---|
| Model Download | NaN | NaN | Fetching diarization models |
| Model Compile | NaN | NaN | CoreML compilation |
| Audio Load | NaN | NaN | Loading audio file |
| Segmentation | NaN | NaN | VAD + speech detection |
| Embedding | NaN | NaN | Speaker embedding extraction |
| Clustering (VBx) | NaN | NaN | Hungarian algorithm + VBx clustering |
| Total | NaN | 100 | Full VBx pipeline |

Speaker Diarization Research Comparison

Offline VBx achieves competitive accuracy with batch processing

| Method | DER | Mode | Description |
|---|---|---|---|
| FluidAudio (Offline) | NaN% | VBx Batch | On-device CoreML with optimal clustering |
| FluidAudio (Streaming) | 17.7% | Chunk-based | First-occurrence speaker mapping |
| Research baseline | 18-30% | Various | Standard dataset performance |

Pipeline Details:

  • Mode: Offline VBx with Hungarian algorithm for optimal speaker-to-cluster assignment
  • Segmentation: VAD-based voice activity detection
  • Embeddings: WeSpeaker-compatible speaker embeddings
  • Clustering: PowerSet with VBx refinement
  • Accuracy: Higher than streaming due to optimal post-hoc mapping

🎯 Offline VBx Test • AMI Corpus ES2004a • NaNs meeting audio • NaNs processing • Test runtime: 0m 33s • 03/28/2026, 03:10 PM EST

Alex-Wengg marked this pull request as ready for review March 28, 2026 19:28

devin-ai-integration bot left a comment


Devin Review found 3 new potential issues.

View 7 additional findings in Devin Review.


[ ! -z "$OTHER_V2_RTFX_FAILED" ] && echo " - test-other (v2) RTFx is 0"
[ ! -z "$STREAMING_RTFX_FAILED" ] && echo " - streaming RTFx is 0"
[ ! -z "$STREAMING_V2_RTFX_FAILED" ] && echo " - streaming (v2) RTFx is 0"
exit 1


🔴 exit 1 prevents PR comment from being posted because Comment PR step lacks always()

The old code at this location had a deliberate comment: # Don't exit with error to allow PR comment to be posted. This PR replaces that with exit 1. The subsequent "Comment PR" step at asr-benchmark.yml:254 uses if: github.event_name == 'pull_request' which, per GitHub Actions docs, implicitly becomes if: success() && github.event_name == 'pull_request'. When exit 1 fires, success() evaluates to false and the PR comment step is skipped entirely. Benchmark results will not be visible on the PR. The other workflows that already had exit 1 and work correctly (diarizer, sortformer) use if: always() on their comment steps.

Prompt for agents
In .github/workflows/asr-benchmark.yml, the `exit 1` at line 247 causes the Comment PR step to be skipped. Fix by changing the Comment PR step's condition at line 254 from:
  if: github.event_name == 'pull_request'
to:
  if: always() && github.event_name == 'pull_request'

This matches the pattern used in other benchmarks (diarizer-benchmark.yml, vad-benchmark.yml) that already handle step failures correctly.

if [ "$RTFx" = "0.00" ] || [ "$RTFx" = "N/A" ]; then
echo "❌ CRITICAL: RTFx is 0 or N/A - benchmark failed"
echo "RTFx value: $RTFx"
exit 1


🔴 exit 1 prevents PR comment because Comment PR step lacks always() in parakeet-eou-benchmark

Same issue as in asr-benchmark.yml: the new exit 1 at line 111 will cause the "Comment PR" step at parakeet-eou-benchmark.yml:115 to be skipped. That step uses if: github.event_name == 'pull_request' without always(), so when the benchmark step fails, GitHub Actions' implicit success() AND prevents the comment from being posted.

Prompt for agents
In .github/workflows/parakeet-eou-benchmark.yml, the exit 1 at line 111 causes the Comment PR step to be skipped. Fix by changing the Comment PR step's condition at line 115 from:
  if: github.event_name == 'pull_request'
to:
  if: always() && github.event_name == 'pull_request'

This matches the pattern used in diarizer-benchmark.yml and vad-benchmark.yml.

Comment on lines +88 to +94
exit 1
fi
else
echo "❌ CRITICAL: Results file not found - benchmark failed"
echo "MEDIAN_RTFx=N/A" >> $GITHUB_OUTPUT
echo "OVERALL_RTFx=N/A" >> $GITHUB_OUTPUT
exit 1


🔴 exit 1 prevents PR comment because Comment PR step lacks always() in qwen3-asr-benchmark

Same issue: the new exit 1 at lines 88 and 94 in the smoketest step will cause the "Comment PR" step at qwen3-asr-benchmark.yml:101 to be skipped. That step uses if: github.event_name == 'pull_request' without always(), so the implicit success() check fails and no PR comment is posted when the RTFx validation fails.

Prompt for agents
In .github/workflows/qwen3-asr-benchmark.yml, the exit 1 at lines 88 and 94 causes the Comment PR step to be skipped. Fix by changing the Comment PR step's condition at line 101 from:
  if: github.event_name == 'pull_request'
to:
  if: always() && github.event_name == 'pull_request'

This matches the pattern used in diarizer-benchmark.yml and vad-benchmark.yml.

@Alex-Wengg (Member, Author)

Closing in favor of #458 which fixes the issues identified by Devin:

  • Removed ModelNames.swift changes that broke cache paths
  • Added if: always() to Comment PR steps so comments post even when validation fails
  • Clean branch from main with no unrelated commits

Alex-Wengg closed this Mar 28, 2026
Alex-Wengg added a commit that referenced this pull request Mar 28, 2026
## Summary
- Add RTFx metric extraction to qwen3-asr-benchmark.yml
- Add RTFx validation to ALL 6 benchmark workflows to fail if RTFx is 0
- Fix PR comment posting with `if: always()` so comments post even when
validation fails

## Changes

### 1. RTFx Tracking (qwen3-asr-benchmark.yml)
Extract and display performance metrics:
- `medianRTFx` - Median real-time factor across test files
- `overallRTFx` - Overall real-time factor (total audio / total
inference time)

### 2. RTFx Validation (all 6 benchmark workflows)
Add validation to fail workflows with `exit 1` if RTFx is 0 or N/A,
indicating silent benchmark failure:
- **qwen3-asr-benchmark.yml**: Validate medianRTFx and overallRTFx
- **asr-benchmark.yml**: Validate all 6 RTFx metrics (v2/v3 ×
clean/other/streaming)
- **diarizer-benchmark.yml**: Validate RTFx
- **parakeet-eou-benchmark.yml**: Validate RTFx
- **sortformer-benchmark.yml**: Validate RTFx
- **vad-benchmark.yml**: Validate MUSAN and VOiCES RTFx

### 3. Fix PR Comment Posting
- Add `if: always()` to Comment PR steps in workflows that didn't have
it
- Without this, PR comments don't post when validation fails
- Users need to see what went wrong even if the workflow fails

## Why Fail on RTFx = 0?

If RTFx is 0 after benchmarking, it means:
1. Benchmark didn't run properly
2. Audio duration was 0
3. Processing failed silently
4. Metric extraction failed

Better to fail fast with clear error messages than report misleading
zero metrics.

## Fixes from Previous PR #454

This PR fixes the issues identified by Devin in #454:
- ✅ No ModelNames.swift changes (avoiding cache path breakage)
- ✅ Added `if: always()` to Comment PR steps
- ✅ Clean branch from main (no unrelated commits)

Closes #454

🤖 Generated with [Claude Code](https://claude.com/claude-code)
