
Your faithful, impartial partner for audio evaluation: know yourself, know your rivals.

OpenBMB/UltraEval-Audio


A Unified Framework for Comprehensive Evaluation of Audio Foundation Models

中文 | English | 💬Discord | UltraEval-Audio Paper

v1.1.0 Highlights

  • Popular model replication: Added replication support for popular models, including replication result showcases and one-click replication commands (see replication/).
  • Isolated Runtime: Introduced an isolated inference mechanism. Model-specific dependencies are installed/managed automatically; inference runs in the isolated environment and communicates with the main evaluation process via IPC, eliminating dependency conflicts.
  • Specialized model evaluation support: Added specialized audio models for TTS, ASR, and Audio Codec, further expanding evaluation coverage.

Overview

🚀Exceptional Experience with UltraEval-Audio🚀

UltraEval-Audio is the world's first open-source framework supporting both speech understanding and speech generation evaluation, designed specifically for large audio models. It aggregates 34 authoritative benchmarks covering four major domains (speech, sound, medicine, and music), 10 languages, and 12 task categories. With UltraEval-Audio, you will experience unprecedented convenience and efficiency:

  • Direct Replication of Popular Models 🔬: Provides detailed replication documentation and commands, ensuring you can easily reproduce evaluation results of open-source models with complete transparency and reproducibility.
  • One-Click Benchmark Management 📥: Say goodbye to tedious manual downloading and data processing. UltraEval-Audio automates it all, letting you easily acquire well-known benchmark datasets (e.g., Librispeech, TED-LIUM, Seed-TTS-Eval).
  • Built-in Evaluation Tools ⚙️: No need to hunt for evaluation tools. UltraEval-Audio binds datasets with commonly used official evaluation methods (e.g., WER, WER-ZH, BLEU, G-Eval) to ensure alignment between datasets and metrics.
  • Powerful and Flexible 🛠️: Supports preview testing, random sampling, error retries, and resume-from-breakpoint (see the sketch after this list), ensuring a flexible and controllable evaluation process while boosting efficiency and accuracy.
  • Seamless Integration of Custom Datasets 💼: Supports not only public benchmarks but also powerful custom dataset integration, allowing rapid application in various engineering scenarios.
  • Easy Integration with Existing Systems 🔗: With excellent extensibility and standardized design, UltraEval-Audio seamlessly connects with your existing evaluation pipelines, simplifying project management and unifying output results.
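
These controls compose on the standard evaluation command. A minimal sketch (the flags shown are the ones documented in the Changelog and Usage sections of this README; the dataset and model names are just examples taken from Run Examples below):

# Run with 4 parallel workers; --resume with no file picks up the latest breakpoint automatically
python audio_evals/main.py --dataset sample --model qwen2-audio-chat --workers 4 --resume

# Score an existing inference file directly, without regenerating model outputs
python audio_evals/main.py --dataset sample --model qwen2-audio-chat --infer-file $your_infer_file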

Figure: UltraEval-Audio (UEA) architecture.

Changelog🔥

  • [2025/12/31]
    • release v1.1.0 🎉🎉🎉
  • [2025/12/04]
  • [2025/12/02]
  • [2025/10/30]
    • Support VoxCPM TTS models: --model voxcpm-tts, --model voxcpm-vc
    • Use uv to accelerate model dependency installation 🚀
  • [2025/10/17]
  • [2025/05/22]
  • [2025/05/12]
    • Support Qwen2.5-Omni (qwen2.5-omni-audio, qwen2.5-omni-speech) and Kimi-Audio-7B-Instruct (kimiaudio, kimiaudio-speech) models, and update the Audio Understanding Leaderboard
  • [2025/05/08]
    • Faster resume evaluation: the -r/--resume parameter automatically searches for the latest breakpoint result if no file is specified
    • Support evaluation starting from an inference file: the --infer-file parameter allows direct evaluation from an inference file without regenerating model outputs
  • [2025/03/23]
  • [2025/03/04]
    • Support resume evaluation (see docs/Procedures for Restarting an Incomplete Evaluation.md): command-line parameter --resume $checkpoint_res_file
    • glm-4-voice service deployment supports UltraEval-Audio evaluation; see details at GLM-4-Voice
    • Parallel evaluation support: command-line parameter --workers $num_workers
  • [2025/01/13] release v1.0.0

Leaderboard

Audio Understanding Leaderboard

Audio Understanding: Audio Foundation Models, Speech + Text → Text

WER/CER ($\downarrow$) for ASR, BLEU ($\uparrow$) for AST, and ACC ($\uparrow$) for EMO. Best results are in bold.

Scoring:

  • Avg. Score ($\uparrow$): mean of all available normalized metric scores. For WER/CER-based metrics we use $100-\text{WER/CER}$; for other metrics (e.g., BLEU/Acc.) we keep the original value.
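
Equivalently, as a formula (a restatement of the rule above, with $s_i$ the normalized score of metric $i$ and $N$ the number of available metrics):

$$\text{Avg. Score} = \frac{1}{N}\sum_{i=1}^{N} s_i, \qquad s_i = \begin{cases} 100 - \text{WER}_i\ (\text{or } 100 - \text{CER}_i) & \text{for WER/CER-based metrics} \\ \text{BLEU}_i \text{ or } \text{Acc}_i & \text{for the other metrics} \end{cases}$$
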
| Model | ASR Librispeech dev-clean\|dev-other | ASR Librispeech test-clean\|test-other | ASR TED-LIUM | ASR CV-15 en\|zh | ASR Aishell-1 | ASR FLEURS | ASR Wenet-test-net | AST covost2-en2zh | AST covost2-zh2en | EMO MELD | Avg. Score ($\uparrow$) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-4o-Realtime | 2.30\|5.60 | 2.60\|5.50 | 4.80 | 27.44\|37.44 | 7.30 | 5.40 | 28.90 | 37.10 | 15.70 | 33.20 | 73.75 |
| Qwen3-Omni-30B-A3B-Instruct | 1.25\|**2.27** | 1.36\|2.57 | 2.82 | **6.00**\|**4.32** | 0.87 | 2.61 | **4.82** | 46.58 | **29.40** | 56.81 | **84.92** |
| Qwen2.5-Omni | 2.10\|4.20 | 2.40\|4.20 | 4.70 | 8.70\|5.20 | 1.10 | 4.60 | 6.00 | 42.50 | 11.50 | 53.60 | 81.88 |
| MiniCPM-o 2.6 | 1.60\|3.40 | 1.70\|4.40 | 3.00 | 10.30\|9.60 | 1.60 | 4.40 | 6.90 | **48.20** | 27.20 | 52.40 | 83.15 |
| Kimi-Audio-7B-Instruct | **1.18**\|2.34 | **1.28**\|**2.44** | 2.96 | 7.09\|5.72 | **0.60** | **2.53** | 5.55 | 36.61 | 18.30 | **59.23** | 83.27 |
| Gemini-1.5-Flash | 5.90\|7.20 | 21.90\|16.30 | 6.90 | 208.00\|84.37 | 9.00 | 85.90 | 279.90 | 33.40 | 8.20 | 45.20 | 27.80 |
| Gemini-1.5-Pro | 2.60\|4.40 | 2.90\|4.90 | 3.00 | 8.36\|13.26 | 4.50 | 5.90 | 14.30 | 47.30 | 22.60 | 48.40 | 81.09 |
| Gemini-2.5-Flash | 3.73\|6.71 | 3.28\|12.03 | 3.53 | 46.76\|36.15 | 6.40 | 6.45 | 126.07 | 3.67 | 10.61 | 51.53 | 62.67 |
| Gemini-2.5-Pro | 5.30\|4.51 | 2.84\|6.74 | **2.52** | 9.42\|11.04 | 3.36 | 4.25 | 16.83 | 41.75 | 27.84 | 46.59 | 80.72 |
| Qwen2-Audio-7B | 1.57\|3.50 | 1.60\|3.88 | 3.43 | 8.67\|7.03 | 1.52 | 5.89 | 8.09 | 45.30 | 24.84 | 42.87 | 82.14 |
| Qwen2-Audio-7B-Instruct | 2.90\|5.50 | 3.10\|5.70 | 5.90 | 10.68\|8.39 | 2.60 | 6.90 | 10.30 | 39.50 | 22.90 | 17.40 | 78.29 |
| MiDaShengLM-7B | 2.20\|4.75 | 2.21\|5.16 | 146.53 | 13.66\|29.13 | 1.23 | 3.28 | 16.56 | 38.52 | 22.68 | 53.96 | 68.50 |

Audio Generation Leaderboard

Audio Generation: Audio Foundation Models, Speech → Speech. Table: audio generation performance ($\uparrow$). Acoustic metrics (UTMOS | DNSMOS P.835 | DNSMOS P.808; scores range from 0 to 5) are evaluated on the generated audio responses from the speech tasks. Best results are in bold.

Note: The average score is the mean of six scores: the five speech-task scores and the normalized acoustic score. For the acoustic metrics (UTMOS | DNSMOS P.835 | DNSMOS P.808), each value (0--5) is multiplied by 20 to map it to 0--100; the three mapped values are then averaged to obtain the normalized acoustic score.
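
As a worked example (using the GLM-4-Voice row from the table below):

$$\text{acoustic} = \frac{20 \times (4.21 + 3.46 + 4.07)}{3} \approx 78.27, \qquad \text{Avg} = \frac{32.00 + 36.40 + 51.00 + 52.61 + 71.06 + 78.27}{6} \approx 53.56$$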

| Model | Speech WebQuestions | Speech TriviaQA | Speech AlpacaEval | Speech CMMLU | Speech HSK | Acoustics (UTMOS\|DNSMOS P.835\|DNSMOS P.808) | Avg. Score ($\uparrow$) |
|---|---|---|---|---|---|---|---|
| GPT-4o-Realtime | **51.60** | **69.70** | **74.00** | 70.05 | **98.69** | 4.29\|3.44\|4.26 | **74.00** |
| Qwen3-Omni-30B-A3B-Instruct | 51.50 | 55.27 | 67.97 | 47.83 | 40.27 | **4.44**\|3.45\|4.12 | 57.15 |
| Qwen2.5-Omni | 38.89 | 39.94 | 54.00 | **73.72** | 95.65 | 4.23\|**3.48**\|**4.27** | 63.68 |
| MiniCPM-o 2.6 | 40.00 | 40.20 | 51.00 | 51.37 | 80.68 | 4.12\|3.39\|4.02 | 56.69 |
| Kimi-Audio-7B-Instruct | 33.69 | 38.20 | 34.40 | 71.25 | 97.42 | 2.94\|3.22\|3.62 | 56.69 |
| GLM-4-Voice | 32.00 | 36.40 | 51.00 | 52.61 | 71.06 | 4.21\|3.46\|4.07 | 53.56 |

Audio Codec Leaderboard

Audio Codec: Speech → Speech. Table: audio codec performance: ASR-WER ($\downarrow$), ASR-CER ($\downarrow$), SIM ($\uparrow$), and Acoustics (UTMOS\|DNSMOS P.835\|DNSMOS P.808, $\uparrow$). A hyphen (-) indicates that UTMOS is not applicable to Chinese speech (AISHELL-1). Best results are in bold.

Note: For acoustic scores we use the UTMOS, DNSMOS P.835, and DNSMOS P.808 metrics. To calculate the average score, ASR-WER and ASR-CER are converted to $100-\text{val}$; each available acoustic value (0 to 5) is mapped to 0--100 as $20\times\text{val}$, and the acoustic score of a dataset is the average of its available values (hyphens are ignored). The final score is the average of the 9 metric scores.
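
As a worked example (using the Spark row from the table below; the AISHELL-1 acoustic score averages only the two available values):

$$\text{Avg} = \frac{1}{9}\Big[(100-2.39) + 79.94 + \tfrac{20(4.18+3.85+3.24)}{3} + (100-2.53) + 79.53 + \tfrac{20(4.18+3.83+3.24)}{3} + (100-3.66) + 74.76 + \tfrac{20(3.63+2.85)}{2}\Big] \approx 82.29$$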

| Model | Librispeech dev-clean ASR-WER | Librispeech dev-clean SIM | Librispeech dev-clean Acoustics | Librispeech test-clean ASR-WER | Librispeech test-clean SIM | Librispeech test-clean Acoustics | AISHELL-1 ASR-CER | AISHELL-1 SIM | AISHELL-1 Acoustics | Avg. Score ($\uparrow$) |
|---|---|---|---|---|---|---|---|---|---|---|
| Encodec-24k | 4.56 | 59.40 | 1.58\|3.12\|2.36 | 4.32 | 59.40 | 1.57\|3.12\|2.36 | 13.95 | 47.48 | -\|2.93\|2.03 | 65.24 |
| Encodec-48k | 3.85 | 65.53 | 1.52\|2.88\|2.42 | 3.80 | 66.00 | 1.48\|2.87\|2.40 | 6.85 | 68.78 | -\|2.79\|2.21 | 69.59 |
| ChatTTS-DVAE | 7.49 | 34.83 | 1.30\|2.66\|2.11 | 6.75 | 36.21 | 1.29\|2.64\|2.12 | 32.36 | 32.36 | -\|2.24\|1.57 | 52.86 |
| Mimi (32bit) | **2.04** | **92.18** | 3.83\|2.87\|2.44 | **1.96** | **92.68** | 3.84\|2.92\|2.49 | **2.82** | **84.80** | -\|2.43\|1.89 | 80.96 |
| Mimi (8bit) | 2.76 | 72.15 | 3.52\|2.78\|2.37 | 2.83 | 73.13 | 3.53\|2.83\|2.43 | 6.82 | 60.63 | -\|2.42\|2.04 | 72.72 |
| Mimi-streaming (8bit) | 6.76 | 54.02 | 1.65\|2.78\|2.37 | 6.19 | 54.32 | 1.63\|2.83\|2.43 | 19.62 | 40.67 | -\|2.42\|2.04 | 61.37 |
| WavTokenizer-large-75 | 4.31 | 69.97 | 4.01\|3.64\|**3.26** | 4.05 | 68.15 | 4.00\|3.63\|**3.27** | 8.97 | 64.27 | -\|3.11\|**2.85** | 76.67 |
| WavTokenizer-large-40 | 8.13 | 60.26 | 3.78\|3.70\|3.13 | 7.73 | 56.63 | 3.77\|3.70\|3.16 | 25.52 | 49.21 | -\|3.13\|2.50 | 69.18 |
| Spark | 2.39 | 79.94 | **4.18**\|**3.85**\|3.24 | 2.53 | 79.53 | **4.18**\|**3.83**\|3.24 | 3.66 | 74.76 | -\|**3.63**\|**2.85** | 82.29 |

Quick Start

Environment Preparation

git clone https://github.com/OpenBMB/UltraEval-Audio.git
cd UltraEval-Audio
conda create -n env python=3.10 -y
conda activate env
pip install -e .

or use uv for faster installation:

uv venv env --python 3.10
source env/bin/activate
uv pip install -e .

Run Examples

# For some regions, you may need to set: export HF_ENDPOINT=https://hf-mirror.com
# Test MiniCPM-o 2.6 speech understanding capability
CUDA_VISIBLE_DEVICES=0 python audio_evals/main.py --dataset sample --prompt mini-cpm-omni-asr-zh --model MiniCPMo2_6-audio

# Test MiniCPM-o 2.6 speech generation capability
CUDA_VISIBLE_DEVICES=0 python audio_evals/main.py --dataset llama-questions-s2t --model MiniCPMo2_6-speech

# Test GPT-4o-Realtime speech understanding capability
export OPENAI_API_KEY=$your-key
python audio_evals/main.py --dataset sample --model gpt4o_audio

# Test GPT-4o-Realtime speech generation capability
export OPENAI_API_KEY=$your-key
python audio_evals/main.py --dataset llama-questions-s2t --model gpt4o_speech

# Test gemini-1.5-pro speech understanding capability
export GOOGLE_API_KEY=$your-key
python audio_evals/main.py --dataset sample --model gemini-pro


# Test qwen2-audio-offline speech understanding capability
CUDA_VISIBLE_DEVICES=0 python audio_evals/main.py --dataset sample --model qwen2-audio-chat

If you encounter errors or cannot reproduce MiniCPM-o 2.6 results, please check the FAQ.

Results

After the evaluation completes, results are organized as follows:

- res
    |-- $model-name
        |-- $dataset
            |-- $time.jsonl
            |-- $time-overview.jsonl
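
For instance, to print the summary of the most recent run for a given model/dataset pair (a minimal sketch assuming the layout above; substitute your own model and dataset directory names):

# Show the newest overview file for a model/dataset pair
ls -t res/$model_name/$dataset/*-overview.jsonl | head -n 1 | xargs cat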

Usage

Evaluation command:

python audio_evals/main.py --dataset <dataset_name> --model <model_name>

Dataset Selection

<dataset_name> specifies the dataset to evaluate. Supported datasets can be listed via python cli/list_availabel.py.

To construct your own dataset, see docs/how add a dataset.md.

Model Selection

<model_name> specifies the model to evaluate. Supported models can be listed via python cli/list_availabel.py.

To evaluate your own model, see docs/how eval your model.md.

Contact Us

If you have any suggestions or questions, please file an issue or join our Discord group: https://discord.com/invite/Qrsbft4e