Add RTFx tracking and validation to all benchmark workflows #458

Alex-Wengg merged 2 commits into main
Conversation
Changes:
1. Add RTFx metric extraction to qwen3-asr-benchmark.yml
   - Extract medianRTFx and overallRTFx from benchmark results
   - Display metrics in PR comments
2. Add RTFx validation to all 6 benchmark workflows
   - Fail with exit 1 if RTFx is 0 or N/A
   - Prevents silent benchmark failures from being reported as success
3. Fix PR comment posting with if: always()
   - Add "if: always()" to Comment PR steps
   - Ensures PR comments post even when validation fails
   - Allows users to see what went wrong

Workflows updated:
- qwen3-asr-benchmark.yml: Validate 2 RTFx metrics + add tracking
- asr-benchmark.yml: Validate 6 RTFx metrics (v2/v3 × clean/other/streaming)
- diarizer-benchmark.yml: Validate 1 RTFx metric
- parakeet-eou-benchmark.yml: Validate 1 RTFx metric
- sortformer-benchmark.yml: Validate 1 RTFx metric
- vad-benchmark.yml: Validate 2 RTFx metrics (MUSAN + VOiCES)

If RTFx is 0, it means:
- Benchmark didn't run properly
- Audio duration was 0
- Processing failed silently
- Metric extraction failed

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Claude finished @Alex-Wengg's task

**PR Review: RTFx Tracking and Validation** ✅

Successfully reviewed the RTFx tracking and validation changes across all benchmark workflows.
**Review Summary**

Overall Assessment: Good to merge, with minor suggestions for improvement.

Key Strengths:
Minor Issues Identified:
The core functionality is solid and effectively addresses the silent failure issue. The PR successfully fixes the problems from #454 and implements the RTFx validation as specified.
Code Review Summary
This PR successfully adds RTFx tracking and validation across all 6 benchmark workflows. The changes are well-implemented with appropriate failure handling and PR commenting fixes.
Key Strengths:
- ✅ Comprehensive coverage: All benchmark workflows now validate RTFx metrics
- ✅ Proper error handling: Uses `exit 1` to fail workflows when RTFx is invalid
- ✅ Fixed PR commenting: Added `if: always()` to ensure comments post even on failure
- ✅ Clear validation logic: RTFx = 0 or N/A correctly indicates benchmark failure
- ✅ Good performance reporting: qwen3 workflow now shows median/overall RTFx metrics
Minor Issues to Consider:
1. Inconsistent validation patterns (.github/workflows/asr-benchmark.yml:202):
   - Some workflows check `[ -z "$RTF" ]` for empty variables, others don't
   - Consider standardizing to: `if [ "$RTFx" = "0.00" ] || [ "$RTFx" = "N/A" ] || [ -z "$RTFx" ]; then`
2. String comparison robustness (multiple workflows):
   - Hardcoded `"0.00"` might miss edge cases like `"0"` or `"0.0"`
   - Consider using numeric comparison: `if (( $(echo "$RTFx < 0.01" | bc -l) ))`
3. Error message consistency:
   - Some workflows use a `❌ CRITICAL:` prefix, others use `⚠️`
   - Consider standardizing critical validation failures to use `❌ CRITICAL:`
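Combining these suggestions, a standardized validation step might look like the sketch below. This is illustrative only: the function name and messages are assumptions, and `awk` is used here as a portable stand-in for the suggested `bc` one-liner.

```shell
# Hedged sketch of a standardized RTFx validation step; the variable
# name and messages are illustrative, not the exact workflow text.
check_rtfx() {
  rtfx="$1"
  # Empty or N/A means the metric was never produced.
  if [ -z "$rtfx" ] || [ "$rtfx" = "N/A" ]; then
    echo "❌ CRITICAL: RTFx missing — benchmark did not produce a metric"
    return 1
  fi
  # Numeric comparison catches "0", "0.0", "0.00", and tiny values alike
  # (awk used as a portable stand-in for the suggested bc comparison).
  if awk -v v="$rtfx" 'BEGIN { exit !(v + 0 < 0.01) }'; then
    echo "❌ CRITICAL: RTFx=$rtfx — benchmark likely failed silently"
    return 1
  fi
  echo "✅ RTFx=$rtfx"
}

check_rtfx "24.7"            # passes
check_rtfx "0.00" || true    # a workflow step would exit 1 here
```

In a workflow step the failing branches would call `exit 1` directly so the job fails fast.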
Architecture Assessment:
The overall approach is sound - failing fast on RTFx = 0 prevents misleading zero metrics from being reported. The validation logic correctly identifies when benchmarks haven't run properly.
Recommendation: Approve with minor suggestions for consistency improvements. The core functionality is correct and addresses the silent failure issue effectively.
**Speaker Diarization Benchmark Results**

*Speaker Diarization Performance* — Evaluating "who spoke when" detection accuracy

*Diarization Pipeline Timing Breakdown* — Time spent in each stage of speaker diarization

*Speaker Diarization Research Comparison* — Research baselines typically achieve 18-30% DER on standard datasets
Note: RTFx shown above is from GitHub Actions runner. On Apple Silicon with ANE:
🎯 Speaker Diarization Test • AMI Corpus ES2004a • 1049.0s meeting audio • 34.4s diarization time • Test runtime: 1m 41s • 03/28/2026, 04:15 PM EST
**VAD Benchmark Results**

*Performance Comparison*
Dataset Details
✅: Average F1-Score above 70%
**Parakeet EOU Benchmark Results** ✅

Status: Benchmark passed

*Performance Metrics*
Streaming Metrics
Test runtime: 0m55s • 03/28/2026, 04:10 PM EST

RTFx = Real-Time Factor (higher is better) • Processing includes: Model inference, audio preprocessing, state management, and file I/O
Qwen3-ASR int8 Smoke Test ✅
Performance Metrics
Runtime: 3m3s

Note: CI VM lacks physical GPU — CoreML MLState (macOS 15) KV cache produces degraded results on virtualized runners. On Apple Silicon: ~1.3% WER / 2.5x RTFx.
Fixes issues identified in review:
1. Move EXECUTION_TIME calculation before RTFx validation (qwen3)
   - Previously exit 1 prevented EXECUTION_TIME from being set
   - Now PR comments show proper runtime even when validation fails
2. Standardize error messages to "❌ CRITICAL:" across all workflows
   - Changed "⚠️" to "❌ CRITICAL:" for consistency
   - All validation failures now use the same format
3. Add more zero format checks (0, 0.0, 0.00)
   - Catches edge cases like "0" or "0.0" in addition to "0.00"
   - More robust string comparison for RTFx validation

Workflows updated:
- qwen3-asr-benchmark.yml: Move EXECUTION_TIME before validation
- asr-benchmark.yml: Standardize error messages, add zero variants
- parakeet-eou-benchmark.yml: Add zero variants and empty check

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
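The zero-variant check from item 3 can be sketched as a single `case` pattern. This is a hedged sketch, not the exact workflow text; the function name and message wording are illustrative.

```shell
# Illustrative zero-variant RTFx guard; real workflow steps may word
# this differently. Matches empty, N/A, and the common zero formats.
check_zero_rtfx() {
  case "$1" in
    ""|"N/A"|"0"|"0.0"|"0.00")
      echo "❌ CRITICAL: RTFx invalid ('$1') — benchmark likely failed"
      return 1
      ;;
    *)
      echo "RTFx OK: $1"
      ;;
  esac
}

check_zero_rtfx "24.7"
check_zero_rtfx "0.0" || true   # in CI this branch would exit 1
```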
**Offline VBx Pipeline Results**

*Speaker Diarization Performance (VBx Batch Mode)* — Optimal clustering with Hungarian algorithm for maximum accuracy

*Offline VBx Pipeline Timing Breakdown* — Time spent in each stage of batch diarization

*Speaker Diarization Research Comparison* — Offline VBx achieves competitive accuracy with batch processing
Pipeline Details:
🎯 Offline VBx Test • AMI Corpus ES2004a • 1049.0s meeting audio • 343.2s processing • Test runtime: 5m 43s • 03/28/2026, 04:15 PM EST
**Sortformer High-Latency Benchmark Results**

*ES2004a Performance (30.4s latency config)*
Sortformer High-Latency • ES2004a • Runtime: 2m 42s • 2026-03-28T20:05:27.244Z
PocketTTS Smoke Test ✅
Runtime: 0m34s

Note: PocketTTS uses CoreML MLState (macOS 15) KV cache + Mimi streaming state. CI VM lacks physical GPU — audio quality may differ from Apple Silicon.
**ASR Benchmark Results** ✅

Status: All benchmarks passed

*Parakeet v3 (multilingual)*
Parakeet v2 (English-optimized)
Streaming (v3)
Streaming (v2)
Streaming tests use 5 files with 0.5s chunks to simulate real-time audio streaming

25 files per dataset • Test runtime: 6m21s • 03/28/2026, 04:21 PM EST

RTFx = Real-Time Factor (higher is better) • Calculated as: Total audio duration ÷ Total processing time

Expected RTFx Performance on Physical M1 Hardware:
- M1 Mac: ~28x (clean), ~25x (other)

Testing methodology follows the HuggingFace Open ASR Leaderboard
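As a worked instance of that formula, using figures borrowed from the diarization comment above (1049.0s of audio processed in 34.4s):

```shell
# RTFx = total audio duration / total processing time.
total_audio=1049.0   # seconds of input audio
proc_time=34.4       # seconds spent processing it
RTFx=$(awk -v a="$total_audio" -v t="$proc_time" 'BEGIN { printf "%.1f", a / t }')
echo "RTFx=${RTFx}x"   # ~30x realtime: 34.4s to process ~17.5 min of audio
```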
**Summary**

- `if: always()` so comments post even when validation fails

**Changes**
**1. RTFx Tracking (qwen3-asr-benchmark.yml)**

Extract and display performance metrics:
- `medianRTFx` - Median real-time factor across test files
- `overallRTFx` - Overall real-time factor (total audio / total inference time)

**2. RTFx Validation (all 6 benchmark workflows)**
Add validation to fail workflows with `exit 1` if RTFx is 0 or N/A, indicating silent benchmark failure.

**3. Fix PR Comment Posting**

Added `if: always()` to Comment PR steps in workflows that didn't have it.

**Why Fail on RTFx = 0?**
If RTFx is 0 after benchmarking, it means one of the following:
- The benchmark didn't run properly
- Audio duration was 0
- Processing failed silently
- Metric extraction failed

Better to fail fast with clear error messages than report misleading zero metrics.
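For context, the metric extraction that can fail here might look like the following sketch. The `results.json` file name and JSON keys are assumptions for illustration (a workflow could equally use `jq`; `python3` is used here for portability).

```shell
# Stand-in results file for demonstration; a real benchmark run would
# produce this, and the schema shown is hypothetical.
cat > results.json <<'EOF'
{"medianRTFx": 24.7, "overallRTFx": 22.1}
EOF

# Extract each metric, defaulting to N/A so a later validation step
# can catch a missing value and fail the job.
MEDIAN_RTFX=$(python3 -c "import json; print(json.load(open('results.json')).get('medianRTFx', 'N/A'))")
OVERALL_RTFX=$(python3 -c "import json; print(json.load(open('results.json')).get('overallRTFx', 'N/A'))")
echo "Median RTFx: ${MEDIAN_RTFX}x • Overall RTFx: ${OVERALL_RTFX}x"
```

If extraction silently produces nothing, the N/A default is exactly what the validation step then rejects.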
Fixes from Previous PR #454
This PR fixes the issues identified by Devin in #454:
- Added `if: always()` to Comment PR steps

Closes #454
🤖 Generated with Claude Code