
Add RTFx tracking to qwen3-asr-benchmark and validate RTFx in all benchmarks #454

Closed
Alex-Wengg wants to merge 5 commits into main from feature/add-rtfx-tracking-to-qwen3-benchmark

Conversation


Alex-Wengg (Member) commented Mar 28, 2026

Summary

  • Add RTFx metric extraction to qwen3-asr-benchmark.yml
  • Add RTFx validation to ALL benchmark workflows to fail if RTFx is 0
  • Display medianRTFx and overallRTFx in qwen3 PR comments
  • Align with other benchmark workflows (asr, diarizer, parakeet-eou, sortformer, vad)

Changes

RTFx Tracking (qwen3-asr-benchmark.yml)

Extract and display performance metrics that were already being computed by the underlying qwen3-benchmark command but not shown in CI:

  • medianRTFx - Median real-time factor across test files
  • overallRTFx - Overall real-time factor (total audio / total inference time)
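A minimal sketch of this extraction step, assuming the benchmark writes a JSON results file with top-level `medianRTFx` and `overallRTFx` keys; the file name `results.json` is a placeholder, not the workflow's actual path:

```shell
# Placeholder results file for illustration; the real workflow reads the
# JSON emitted by the qwen3-benchmark command.
cat > results.json <<'EOF'
{"medianRTFx": 12.3456, "overallRTFx": 10.5}
EOF

# Extract each metric, falling back to "N/A" if the key is missing or null.
MEDIAN_RTFX=$(jq -r '.medianRTFx // "N/A"' results.json)
OVERALL_RTFX=$(jq -r '.overallRTFx // "N/A"' results.json)

# Format numbers to two decimals (matching the printf "%.2f" style used
# by the other benchmark workflows); leave "N/A" untouched.
[ "$MEDIAN_RTFX" != "N/A" ] && MEDIAN_RTFX=$(printf "%.2f" "$MEDIAN_RTFX")
[ "$OVERALL_RTFX" != "N/A" ] && OVERALL_RTFX=$(printf "%.2f" "$OVERALL_RTFX")

echo "medianRTFx: $MEDIAN_RTFX"
echo "overallRTFx: $OVERALL_RTFX"
```

In the workflow itself the two values would then be written to `$GITHUB_OUTPUT` for the PR-comment step.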

RTFx Validation (all benchmark workflows)

Add validation that fails workflows with exit 1 if RTFx is 0 or N/A, indicating a silent benchmark failure:

  • qwen3-asr-benchmark.yml: Validate medianRTFx and overallRTFx
  • asr-benchmark.yml: Validate all 6 RTFx metrics (v2/v3, clean/other, streaming)
  • diarizer-benchmark.yml: Validate RTFx
  • parakeet-eou-benchmark.yml: Validate RTFx
  • sortformer-benchmark.yml: Validate RTFx
  • vad-benchmark.yml: Validate MUSAN and VOiCES RTFx

Why Fail on RTFx = 0?

If RTFx is 0 after benchmarking, it means:

  1. Benchmark didn't run properly
  2. Audio duration was 0
  3. Processing failed silently
  4. Metric extraction failed

Better to fail fast with clear error messages than report misleading zero metrics.
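The fail-fast check can be sketched as a small shell guard; the variable name `RTFX` and the zero/N-A string comparisons are illustrative, while the real workflows compare against values parsed from their own results files:

```shell
# Illustrative value; in the workflow this comes from the parsed results.
RTFX="0.00"

validate_rtfx() {
  # $1 = metric label, $2 = value parsed from the benchmark output
  case "$2" in
    0|0.00|N/A)
      echo "❌ CRITICAL: $1 is 0 or N/A - benchmark failed"
      return 1
      ;;
  esac
  echo "✅ $1 = $2"
}

if ! validate_rtfx "RTFx" "$RTFX"; then
  # The real workflow does `exit 1` here to fail fast; recorded in a
  # variable so this sketch keeps running.
  VALIDATION_FAILED=1
fi
echo "validation failed: ${VALIDATION_FAILED:-0}"
```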

Context

All other benchmark workflows track RTFx. The qwen3-asr-benchmark was the only benchmark workflow that wasn't displaying these metrics despite the CLI command outputting them. Additionally, none of the workflows were validating RTFx values - they would silently report 0x if benchmarks failed.

🤖 Generated with Claude Code

Alex-Wengg and others added 4 commits March 28, 2026 14:19
Remove redundant special cases in folderName property. Now only keeps special
cases for nested directory structures (EOU and Nemotron variants) and uses
a simple default rule: strip "-coreml" suffix from the name.

This eliminates the inconsistency raised in #442 by applying a consistent
pattern across all models.

Before:
- Had 10+ special cases explicitly returning shortened names
- parakeetTdtCtc110m was inconsistent with other Parakeet models

After:
- Only 5 special cases for nested directories (parakeet-eou-streaming/*, nemotron-streaming/*)
- Default strips -coreml suffix for all other models
- All Parakeet models now follow the same pattern

Fixes #442

Add back special cases for kokoro and sortformer to preserve existing
folder names and avoid forcing users to re-download models. Still removes
redundant special cases (lseend, pocketTts, multilingualG2p, parakeetTdtCtc110m)
that can safely use the default -coreml stripping logic.

Result: 7 special cases total (kokoro, sortformer, + 5 nested directories)
vs 11 special cases before. Still achieves consistency for Parakeet models
without breaking existing cached model locations.

Extract and display medianRTFx and overallRTFx metrics in the PR comment.
Previously the workflow was running benchmarks but not displaying the
performance metrics.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

claude bot commented Mar 28, 2026

Claude finished @Alex-Wengg's task.


PR Review: Add RTFx tracking to qwen3-asr-benchmark workflow

Todo List:

  • Create pending GitHub review
  • Analyze PR diff and changed files
  • Add inline review comments for specific feedback
  • Submit non-blocking review with overall summary

Review Complete ✅

I've completed a thorough review of this PR. The changes successfully add RTFx tracking to the qwen3-asr-benchmark workflow, bringing it in line with other benchmark workflows.

Key findings:

  • Solid implementation with proper error handling and null checks
  • Good consistency with existing benchmark workflow patterns
  • Clean code simplification in ModelNames.swift reduces duplication
  • 📝 Minor suggestion to extract jq parsing logic to reduce duplication (non-blocking)

Recommendation: Ready to merge. This achieves the stated goal of adding RTFx metrics display and includes a nice code cleanup bonus.



claude bot left a comment


Review Summary

This PR successfully adds RTFx tracking to the qwen3-asr-benchmark workflow, aligning it with other benchmark workflows. The changes are well-implemented overall.

✅ Strengths

  • Consistency: Aligns qwen3 benchmark with other workflows (asr, diarizer, etc.)
  • Robust error handling: Proper null checks and fallbacks to "N/A"
  • Good formatting: Uses printf "%.2f" for consistent numeric display
  • Clear presentation: Well-structured table format in PR comments

🔍 Areas for improvement

Workflow file (.github/workflows/qwen3-asr-benchmark.yml):

  • Nit: The jq parsing logic is duplicated for both metrics. Consider extracting to a helper function to improve maintainability.
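One way to address this nit, sketched with a hypothetical helper; the function name, file name, and sample JSON are assumptions for illustration:

```shell
# Hypothetical helper consolidating the duplicated jq + printf logic.
extract_rtfx() {
  # $1 = JSON key, $2 = results file
  value=$(jq -r ".$1 // \"N/A\"" "$2")
  if [ "$value" = "N/A" ]; then
    echo "N/A"
  else
    printf "%.2f" "$value"
  fi
}

# Sample input standing in for the benchmark's real output file.
cat > results.json <<'EOF'
{"medianRTFx": 87.654, "overallRTFx": 91.2}
EOF

MEDIAN=$(extract_rtfx medianRTFx results.json)
OVERALL=$(extract_rtfx overallRTFx results.json)
echo "medianRTFx=$MEDIAN overallRTFx=$OVERALL"
```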

ModelNames.swift changes:

  • Good simplification: The switch to a default case with .replacingOccurrences(of: "-coreml", with: "") reduces code duplication and is more maintainable.
  • Consistent: The test update properly reflects the new simplified logic.

🎯 Overall Assessment

The changes fulfill the stated goal of adding RTFx metrics to match other benchmark workflows. The implementation is solid with proper error handling and consistent formatting. The ModelNames.swift simplification is a nice bonus that improves code maintainability.

Recommendation: Approve - this is ready to merge. The nit about extracting the jq logic is minor and can be addressed in a future refactor if desired.

Alex-Wengg marked this pull request as draft March 28, 2026 18:40

devin-ai-integration bot left a comment


Devin Review found 2 potential issues.

View 3 additional findings in Devin Review.



🟡 CI cache path references old folder name qwen3-asr-0.6b-coreml that no longer matches code

The CI workflow caches ~/Library/Application Support/FluidAudio/Models/qwen3-asr-0.6b-coreml (line 30), but after the folderName default-case change in Sources/FluidAudio/ModelNames.swift:133, Repo.qwen3AsrInt8.folderName now returns qwen3-asr-0.6b/int8 instead of qwen3-asr-0.6b-coreml/int8. The models will be downloaded to a path under qwen3-asr-0.6b/ which is not covered by the CI cache configuration, so the cache will never hit and models will be re-downloaded on every CI run.



        return "parakeet-tdt-ctc-110m"
    default:
-       return name
+       return name.replacingOccurrences(of: "-coreml", with: "")

devin-ai-integration bot commented Mar 28, 2026


🔴 Changing folderName default from name to name.replacingOccurrences(of: "-coreml", with: "") silently renames cache directories for 8 repos

The refactoring intended to simplify the 4 explicitly removed cases (.lseend, .pocketTts, .multilingualG2p, .parakeetTdtCtc110m) into the default, but the default itself was changed from return name to return name.replacingOccurrences(of: "-coreml", with: ""). This changes the folderName for every repo that was previously falling through to the old default:

  • .vad: "silero-vad-coreml" → "silero-vad"
  • .parakeet: "parakeet-tdt-0.6b-v3-coreml" → "parakeet-tdt-0.6b-v3"
  • .parakeetV2: "parakeet-tdt-0.6b-v2-coreml" → "parakeet-tdt-0.6b-v2"
  • .parakeetCtc110m/.parakeetCtc06b: similarly stripped
  • .diarizer: "speaker-diarization-coreml" → "speaker-diarization"
  • .qwen3Asr/.qwen3AsrInt8: "qwen3-asr-0.6b-coreml/..." → "qwen3-asr-0.6b/..."

folderName is used pervasively to construct local model cache paths (DownloadUtils.swift:135, DownloadUtils.swift:190, AsrModels.swift:501, DiarizerModels.swift:106, etc.). This means (1) all existing cached models at old paths become orphaned and trigger unnecessary re-downloads, and (2) CI workflow cache paths still reference the old names (e.g. asr-benchmark.yml:28-29 caches parakeet-tdt-0.6b-v3-coreml but code now expects parakeet-tdt-0.6b-v3), rendering CI caches completely useless.
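The mismatch can be demonstrated with a quick shell analogue of the Swift change; suffix stripping via `${name%-coreml}` only approximates `replacingOccurrences`, which would also replace non-suffix occurrences:

```shell
# Old default returned the repo name unchanged; the new default strips
# "-coreml", moving the on-disk model directory.
name="qwen3-asr-0.6b-coreml"
old_folder="$name"
new_folder="${name%-coreml}"

base="$HOME/Library/Application Support/FluidAudio/Models"
echo "CI cache path (old): $base/$old_folder"   # what the workflow still caches
echo "download path (new): $base/$new_folder"   # where models now land
```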


Add validation to all benchmark workflows to fail with exit 1 if RTFx
metrics are 0 or N/A, indicating a silent benchmark failure.

Changes:
- qwen3-asr-benchmark.yml: Validate medianRTFx and overallRTFx
- asr-benchmark.yml: Validate all 6 RTFx metrics (v2/v3, clean/other, streaming)
- diarizer-benchmark.yml: Validate RTFx
- parakeet-eou-benchmark.yml: Validate RTFx
- sortformer-benchmark.yml: Validate RTFx
- vad-benchmark.yml: Validate MUSAN and VOiCES RTFx

If RTFx is 0, it means either:
1. Benchmark didn't run properly
2. Audio duration was 0
3. Processing failed silently

Better to fail fast than report misleading metrics.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

github-actions bot commented Mar 28, 2026

Speaker Diarization Benchmark Results

Speaker Diarization Performance

Evaluating "who spoke when" detection accuracy

| Metric | Value | Target | Status | Description |
|---|---|---|---|---|
| DER | NaN% | <30% | ⚠️ | Diarization Error Rate (lower is better) |
| JER | NaN% | <25% | ⚠️ | Jaccard Error Rate |
| RTFx | NaNx | >1.0x | ⚠️ | Real-Time Factor (higher is faster) |

Diarization Pipeline Timing Breakdown

Time spent in each stage of speaker diarization

| Stage | Time (s) | % | Description |
|---|---|---|---|
| Model Download | NaN | NaN | Fetching diarization models |
| Model Compile | NaN | NaN | CoreML compilation |
| Audio Load | NaN | NaN | Loading audio file |
| Segmentation | NaN | NaN | Detecting speech regions |
| Embedding | NaN | NaN | Extracting speaker voices |
| Clustering | NaN | NaN | Grouping same speakers |
| Total | NaN | 100 | Full pipeline |

Speaker Diarization Research Comparison

Research baselines typically achieve 18-30% DER on standard datasets

| Method | DER | Notes |
|---|---|---|
| FluidAudio | NaN% | On-device CoreML |
| Research baseline | 18-30% | Standard dataset performance |

Note: RTFx shown above is from GitHub Actions runner. On Apple Silicon with ANE:

  • M2 MacBook Air (2022): runs at 150x real-time
  • Performance scales with Apple Neural Engine capabilities

🎯 Speaker Diarization Test • AMI Corpus ES2004a • NaNs meeting audio • NaNs diarization time • Test runtime: N/A • 03/28/2026, 03:39 PM EST


github-actions bot commented Mar 28, 2026

Sortformer High-Latency Benchmark Results

ES2004a Performance (30.4s latency config)

| Metric | Value | Target | Status |
|---|---|---|---|
| DER | 0.0% | <35% | |
| Miss Rate | 0.0% | - | - |
| False Alarm | 0.0% | - | - |
| Speaker Error | 0.0% | - | - |
| RTFx | 0.0x | >1.0x | ⚠️ |
| Speakers | 0/0 | - | - |

Sortformer High-Latency • ES2004a • Runtime: N/A • 2026-03-28T18:58:14.301Z

Alex-Wengg changed the title from "Add RTFx tracking to qwen3-asr-benchmark workflow" to "Add RTFx tracking to qwen3-asr-benchmark and validate RTFx in all benchmarks" Mar 28, 2026

github-actions bot commented Mar 28, 2026

VAD Benchmark Results

Performance Comparison

| Dataset | Accuracy | Precision | Recall | F1-Score | RTFx | Files |
|---|---|---|---|---|---|---|
| MUSAN | 92.0% | 86.2% | 100.0% | 92.6% | 713.2x faster | 50 |
| VOiCES | 92.0% | 86.2% | 100.0% | 92.6% | 724.6x faster | 50 |

Dataset Details

  • MUSAN: Music, Speech, and Noise dataset - standard VAD evaluation
  • VOiCES: Voices Obscured in Complex Environmental Settings - tests robustness in real-world conditions

✅: Average F1-Score above 70%

@github-actions

Offline VBx Pipeline Results

Speaker Diarization Performance (VBx Batch Mode)

Optimal clustering with Hungarian algorithm for maximum accuracy

| Metric | Value | Target | Status | Description |
|---|---|---|---|---|
| DER | NaN% | <20% | ⚠️ | Diarization Error Rate (lower is better) |
| RTFx | NaNx | >1.0x | ⚠️ | Real-Time Factor (higher is faster) |

Offline VBx Pipeline Timing Breakdown

Time spent in each stage of batch diarization

| Stage | Time (s) | % | Description |
|---|---|---|---|
| Model Download | NaN | NaN | Fetching diarization models |
| Model Compile | NaN | NaN | CoreML compilation |
| Audio Load | NaN | NaN | Loading audio file |
| Segmentation | NaN | NaN | VAD + speech detection |
| Embedding | NaN | NaN | Speaker embedding extraction |
| Clustering (VBx) | NaN | NaN | Hungarian algorithm + VBx clustering |
| Total | NaN | 100 | Full VBx pipeline |

Speaker Diarization Research Comparison

Offline VBx achieves competitive accuracy with batch processing

| Method | DER | Mode | Description |
|---|---|---|---|
| FluidAudio (Offline) | NaN% | VBx Batch | On-device CoreML with optimal clustering |
| FluidAudio (Streaming) | 17.7% | Chunk-based | First-occurrence speaker mapping |
| Research baseline | 18-30% | Various | Standard dataset performance |

Pipeline Details:

  • Mode: Offline VBx with Hungarian algorithm for optimal speaker-to-cluster assignment
  • Segmentation: VAD-based voice activity detection
  • Embeddings: WeSpeaker-compatible speaker embeddings
  • Clustering: PowerSet with VBx refinement
  • Accuracy: Higher than streaming due to optimal post-hoc mapping

🎯 Offline VBx Test • AMI Corpus ES2004a • NaNs meeting audio • NaNs processing • Test runtime: 0m 33s • 03/28/2026, 03:10 PM EST

Alex-Wengg marked this pull request as ready for review March 28, 2026 19:28

devin-ai-integration bot left a comment


Devin Review found 3 new potential issues.

View 7 additional findings in Devin Review.


[ ! -z "$OTHER_V2_RTFX_FAILED" ] && echo " - test-other (v2) RTFx is 0"
[ ! -z "$STREAMING_RTFX_FAILED" ] && echo " - streaming RTFx is 0"
[ ! -z "$STREAMING_V2_RTFX_FAILED" ] && echo " - streaming (v2) RTFx is 0"
exit 1


🔴 exit 1 prevents PR comment from being posted because Comment PR step lacks always()

The old code at this location had a deliberate comment: # Don't exit with error to allow PR comment to be posted. This PR replaces that with exit 1. The subsequent "Comment PR" step at asr-benchmark.yml:254 uses if: github.event_name == 'pull_request' which, per GitHub Actions docs, implicitly becomes if: success() && github.event_name == 'pull_request'. When exit 1 fires, success() evaluates to false and the PR comment step is skipped entirely. Benchmark results will not be visible on the PR. The other workflows that already had exit 1 and work correctly (diarizer, sortformer) use if: always() on their comment steps.

Prompt for agents
In .github/workflows/asr-benchmark.yml, the `exit 1` at line 247 causes the Comment PR step to be skipped. Fix by changing the Comment PR step's condition at line 254 from:
  if: github.event_name == 'pull_request'
to:
  if: always() && github.event_name == 'pull_request'

This matches the pattern used in other benchmarks (diarizer-benchmark.yml, vad-benchmark.yml) that already handle step failures correctly.

if [ "$RTFx" = "0.00" ] || [ "$RTFx" = "N/A" ]; then
echo "❌ CRITICAL: RTFx is 0 or N/A - benchmark failed"
echo "RTFx value: $RTFx"
exit 1


🔴 exit 1 prevents PR comment because Comment PR step lacks always() in parakeet-eou-benchmark

Same issue as in asr-benchmark.yml: the new exit 1 at line 111 will cause the "Comment PR" step at parakeet-eou-benchmark.yml:115 to be skipped. That step uses if: github.event_name == 'pull_request' without always(), so when the benchmark step fails, GitHub Actions' implicit success() AND prevents the comment from being posted.

Prompt for agents
In .github/workflows/parakeet-eou-benchmark.yml, the exit 1 at line 111 causes the Comment PR step to be skipped. Fix by changing the Comment PR step's condition at line 115 from:
  if: github.event_name == 'pull_request'
to:
  if: always() && github.event_name == 'pull_request'

This matches the pattern used in diarizer-benchmark.yml and vad-benchmark.yml.

Comment on lines +88 to +94
exit 1
fi
else
echo "❌ CRITICAL: Results file not found - benchmark failed"
echo "MEDIAN_RTFx=N/A" >> $GITHUB_OUTPUT
echo "OVERALL_RTFx=N/A" >> $GITHUB_OUTPUT
exit 1


🔴 exit 1 prevents PR comment because Comment PR step lacks always() in qwen3-asr-benchmark

Same issue: the new exit 1 at lines 88 and 94 in the smoketest step will cause the "Comment PR" step at qwen3-asr-benchmark.yml:101 to be skipped. That step uses if: github.event_name == 'pull_request' without always(), so the implicit success() check fails and no PR comment is posted when the RTFx validation fails.

Prompt for agents
In .github/workflows/qwen3-asr-benchmark.yml, the exit 1 at lines 88 and 94 causes the Comment PR step to be skipped. Fix by changing the Comment PR step's condition at line 101 from:
  if: github.event_name == 'pull_request'
to:
  if: always() && github.event_name == 'pull_request'

This matches the pattern used in diarizer-benchmark.yml and vad-benchmark.yml.

@Alex-Wengg (Member, Author)

Closing in favor of #458 which fixes the issues identified by Devin:

  • Removed ModelNames.swift changes that broke cache paths
  • Added if: always() to Comment PR steps so comments post even when validation fails
  • Clean branch from main with no unrelated commits

Alex-Wengg closed this Mar 28, 2026
Alex-Wengg added a commit that referenced this pull request Mar 28, 2026
## Summary
- Add RTFx metric extraction to qwen3-asr-benchmark.yml
- Add RTFx validation to ALL 6 benchmark workflows to fail if RTFx is 0
- Fix PR comment posting with `if: always()` so comments post even when
validation fails

## Changes

### 1. RTFx Tracking (qwen3-asr-benchmark.yml)
Extract and display performance metrics:
- `medianRTFx` - Median real-time factor across test files
- `overallRTFx` - Overall real-time factor (total audio / total
inference time)

### 2. RTFx Validation (all 6 benchmark workflows)
Add validation to fail workflows with `exit 1` if RTFx is 0 or N/A,
indicating silent benchmark failure:
- **qwen3-asr-benchmark.yml**: Validate medianRTFx and overallRTFx
- **asr-benchmark.yml**: Validate all 6 RTFx metrics (v2/v3 ×
clean/other/streaming)
- **diarizer-benchmark.yml**: Validate RTFx
- **parakeet-eou-benchmark.yml**: Validate RTFx
- **sortformer-benchmark.yml**: Validate RTFx
- **vad-benchmark.yml**: Validate MUSAN and VOiCES RTFx

### 3. Fix PR Comment Posting
- Add `if: always()` to Comment PR steps in workflows that didn't have
it
- Without this, PR comments don't post when validation fails
- Users need to see what went wrong even if the workflow fails

## Why Fail on RTFx = 0?

If RTFx is 0 after benchmarking, it means:
1. Benchmark didn't run properly
2. Audio duration was 0
3. Processing failed silently
4. Metric extraction failed

Better to fail fast with clear error messages than report misleading
zero metrics.

## Fixes from Previous PR #454

This PR fixes the issues identified by Devin in #454:
- ✅ No ModelNames.swift changes (avoiding cache path breakage)
- ✅ Added `if: always()` to Comment PR steps
- ✅ Clean branch from main (no unrelated commits)

Closes #454

🤖 Generated with [Claude Code](https://claude.com/claude-code)
