AudioBench: A Universal Benchmark for Audio Large Language Models
TL;DR Summary
AudioBench is introduced as a universal benchmark for Audio Large Language Models, covering 8 tasks and 26 datasets, including 7 new ones. It evaluates speech understanding, audio scene understanding, and voice (paralinguistic) understanding, addressing gaps in existing benchmarks for instruction-following capabilities. Five popular models are evaluated, and no single model excels consistently across all tasks.
Abstract
We introduce AudioBench, a universal benchmark designed to evaluate Audio Large Language Models (AudioLLMs). It encompasses 8 distinct tasks and 26 datasets, among which, 7 are newly proposed datasets. The evaluation targets three main aspects: speech understanding, audio scene understanding, and voice understanding (paralinguistic). Despite recent advancements, there lacks a comprehensive benchmark for AudioLLMs on instruction following capabilities conditioned on audio signals. AudioBench addresses this gap by setting up datasets as well as desired evaluation metrics. Besides, we also evaluated the capabilities of five popular models and found that no single model excels consistently across all tasks. We outline the research outlook for AudioLLMs and anticipate that our open-sourced evaluation toolkit, data, and leaderboard will offer a robust testbed for future model developments.
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
AudioBench: A Universal Benchmark for Audio Large Language Models
1.2. Authors
Bin Wang, Xunlong Zou, Geyu Lin, Shuo Sun, Zhuohan Liu, Wenyu Zhang, Zhengyuan Liu, AiTi Aw, Nancy F. Chen, all affiliated with the Institute for Infocomm Research (I2R), A*STAR, Singapore, and some also with the Centre for Frontier AI Research (CFAR), A*STAR. The primary contact is wang_bin@i2r.a-star.edu.sg.
1.3. Journal/Conference
The paper is published at the Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL) in 2025. NAACL is a highly reputable and influential conference in the field of Natural Language Processing (NLP) and computational linguistics. Publication at NAACL signifies that the work has undergone rigorous peer review and is recognized by experts in the field.
1.4. Publication Year
2025
1.5. Abstract
This paper introduces AudioBench, a universal benchmark designed to evaluate Audio Large Language Models (AudioLLMs). It features 8 distinct tasks and 26 datasets, including 7 newly proposed ones. The evaluation focuses on three key areas: speech understanding, audio scene understanding, and voice understanding (paralinguistic). The authors highlight a gap in existing benchmarks regarding the instruction-following capabilities of AudioLLMs conditioned on audio signals. AudioBench addresses this by providing appropriate datasets and evaluation metrics. The study also evaluates five popular models, revealing that no single model consistently excels across all tasks. The paper outlines future research directions for AudioLLMs and provides an open-sourced evaluation toolkit, data, and a leaderboard to serve as a robust testbed for future model development.
1.6. Original Source Link
The paper is published at: https://aclanthology.org/2025.naacl-long.218/ The PDF link is: https://aclanthology.org/2025.naacl-long.218.pdf It is officially published at the NAACL 2025 conference.
2. Executive Summary
2.1. Background & Motivation
The core problem the paper aims to solve is the lack of a comprehensive and standardized benchmark for evaluating Audio Large Language Models (AudioLLMs). While Large Language Models (LLMs) and multimodal LLMs (vision-enhanced, video-enhanced) have seen a proliferation of benchmarks, AudioLLMs lag significantly. Existing evaluations for audio-language models are fragmented, using different datasets and limited task scopes, making systematic comparison and understanding of their capabilities intractable.
This problem is important because AudioLLMs are designed to interpret diverse audio inputs (speech, environmental sounds, paralinguistic features) and respond flexibly to user queries, necessitating a broad evaluation across various use cases. The existing evaluation regimes predominantly rely on conventional metrics and datasets that are not well-suited for assessing the instruction-following and open-ended generation capabilities expected of these advanced models.
The paper's entry point is to create a "universal benchmark" that covers the breadth of AudioLLMs' potential applications, going beyond traditional speech tasks to include audio scene understanding and paralinguistic feature interpretation. It also focuses on addressing the challenge of evaluating open-ended, instruction-following responses, which is a hallmark of modern LLMs.
2.2. Main Contributions / Findings
The paper makes several primary contributions:
- Introduction of AudioBench: Proposes AudioBench, the first comprehensive evaluation benchmark specifically for general instruction-following AudioLLMs. The benchmark covers 8 distinct tasks and 26 datasets, significantly broadening the scope of evaluation for audio-language models.
- Novel Datasets: Introduces 7 newly adapted or collected datasets (e.g., CN-College-Listen, DREAM-TTS, and Public-SG-SpeechQA for Speech Question Answering; OpenHermes-Audio and ALPACA-Audio for Speech Instruction; WavCaps-QA and AudioCaps-QA for Audio Question Answering) to fill gaps in existing resources and better assess instruction-following capabilities.
- Comprehensive Evaluation Scope: Defines three core aspects for evaluation: speech understanding, audio scene understanding, and voice understanding (paralinguistic features such as emotion, accent, and gender), providing a holistic view of AudioLLM capabilities.
- Robustness Queries: Integrates multiple prompt templates and varying input lengths to evaluate model robustness and generalizability, addressing the issue that models may overfit to seen instructions.
- Model-as-Judge Validation: Investigates and validates the effectiveness of Model-as-Judge evaluation for open-ended generation, showing that LLaMA-3-70B-Instruct correlates strongly with GPT-4 in judgment tasks and offers an accessible, transparent alternative.
- Extensive Model Evaluation: Evaluates four popular AudioLLMs (SALMONN, Qwen-Audio-Chat, WavLLM, Qwen2-Audio-Instruct) and one cascade model (Whisper+Llama3), revealing that no single model consistently excels across all tasks and highlighting areas for future improvement, especially long-form audio processing and non-verbal audio understanding for end-to-end models.
- Open-sourced Resources: Provides an open-sourced evaluation toolkit, data, and a leaderboard to foster future research and development in AudioLLMs.

These findings collectively address the existing evaluation gap for AudioLLMs, provide a standardized framework, and highlight current strengths and weaknesses of state-of-the-art models, thereby guiding future research directions.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand this paper, a beginner should be familiar with the following foundational concepts:
- Large Language Models (LLMs): Advanced artificial intelligence models trained on vast amounts of text data to understand, generate, and respond to human language. They can perform various tasks like translation, summarization, and question answering. Examples include GPT-3, LLaMA, etc. LLMs are the backbone for multimodal LLMs.
- Multimodal Large Language Models (Multimodal LLMs): Extensions of LLMs that can process and understand information from multiple modalities, such as text, images, video, and audio. For example, a vision-LLM can understand images and respond to text queries about them. This paper focuses on AudioLLMs, which are multimodal LLMs specialized in audio.
- Audio Large Language Models (AudioLLMs): A specific type of multimodal LLM that can interpret audio content (speech, environmental sounds, paralinguistic features) and generate text responses based on user instructions. They aim to combine the understanding capabilities of LLMs with the ability to process diverse audio inputs.
- Instruction Following: A key capability of modern LLMs where the model can understand and execute commands or instructions given in natural language, even if it has not been explicitly trained on that exact instruction. For AudioLLMs, this means understanding a query about an audio input and generating an appropriate text response.
- Benchmarks: In machine learning, a benchmark is a standardized set of tasks, datasets, and evaluation metrics used to compare the performance of different models objectively. Benchmarks are crucial for tracking progress and identifying limitations in a field.
- Automatic Speech Recognition (ASR): The process of converting spoken language into written text. This is a fundamental task in speech processing and a component of many AudioLLMs.
- Speech Question Answering (SQA): A task where a model answers questions based on the content of a spoken audio input. This requires both ASR capabilities and natural language understanding.
- Audio Scene Understanding: The ability of a model to identify and interpret non-speech sounds in an audio input, such as environmental sounds (e.g., car horns, birds chirping, music).
- Voice Understanding (Paralinguistic Features): The ability to extract information from speech beyond its semantic content, such as emotional state, accent, gender, or speaking style. These are often referred to as paralinguistic cues.
- Word Error Rate (WER): A common metric for evaluating ASR systems. It measures the number of errors (substitutions, deletions, insertions) required to transform the recognized text into the reference text, divided by the total number of words in the reference text. A lower WER indicates better performance. The formula for WER is:
  $ \mathrm{WER} = \frac{S + D + I}{N} $
  where $S$ is the number of substitutions (words replaced), $D$ is the number of deletions (words removed), $I$ is the number of insertions (words added), and $N$ is the total number of words in the reference (ground truth) transcription. (A minimal code sketch of this computation follows this list.)
- METEOR Score (Metric for Evaluation of Translation with Explicit Ordering): A metric used for evaluating machine translation and text generation tasks, including audio captioning. It measures the alignment between a machine-generated text and one or more reference texts, considering exact word matches, stemmed word matches, and synonym matches, as well as word order (n-gram-based penalties). A higher METEOR score indicates better quality. The METEOR score is calculated from a weighted harmonic mean of precision ($P$) and recall ($R$), with a penalty for fragmentation ($F_p$):
  $ \mathrm{METEOR} = \mathrm{HM}(\alpha P, (1-\alpha)R) \times (1 - F_p) $
  where $\mathrm{HM}$ is the harmonic mean; $P$ is precision, the number of matched unigrams divided by the number of unigrams in the candidate (generated) text; $R$ is recall, the number of matched unigrams divided by the number of unigrams in the reference text; $\alpha$ is a parameter that weights precision and recall (typically 0.9 for METEOR); and $F_p$ is a fragmentation penalty that accounts for the chunking of matched words and discourages fragmented alignments.
- Model-as-Judge: An evaluation paradigm where another LLM (often a more powerful one like GPT-4) is used to score the quality of responses generated by a target model, especially in open-ended generation tasks where traditional metrics are difficult to apply.
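To make the WER definition above concrete, here is a minimal, self-contained sketch of the computation. It uses a standard dynamic-programming word-level edit distance rather than any particular benchmark's tooling, so treat it as illustrative rather than as AudioBench's exact implementation.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Compute WER = (S + D + I) / N using word-level Levenshtein distance."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # dp[i][j] = minimum number of edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i          # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j          # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])  # substitution or match
            dele = dp[i - 1][j] + 1                              # deletion
            ins = dp[i][j - 1] + 1                               # insertion
            dp[i][j] = min(sub, dele, ins)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# Example: one deletion out of seven reference words -> WER ≈ 0.143
print(word_error_rate("no i was not thinking of that", "no i was not thinking that"))
```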
3.2. Previous Works
The paper frames its contribution within the broader context of multimodal LLM benchmarks. Here's a summary of previous works and key background information:
- General LLM Benchmarks:
  - Hendrycks et al., 2021: Introduced Measuring Massive Multitask Language Understanding (MMLU), a benchmark for evaluating LLMs across various subjects and tasks. This provided a holistic evaluation for text-based LLMs.
  - Cobbe et al., 2021a; Wang et al., 2024; Rein et al., 2023: Other examples of benchmarks for text-based LLMs, covering aspects like reasoning, subject knowledge, safety, and multilingual capabilities. These benchmarks primarily focus on text input and output.
- Vision-enhanced Multimodal LLM Benchmarks:
  - Marino et al., 2019: Introduced OK-VQA (Outside Knowledge Visual Question Answering), a benchmark requiring external knowledge to answer questions about images.
  - Yue et al., 2023: Proposed MMMU, a massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI, covering various academic and professional domains with visual input.
  - Padlewski et al., 2024: Introduced Vibe-Eval, a hard evaluation suite for measuring progress of multimodal LLMs.
  - Other works like Yu et al., 2023 and Liu et al., 2023 (e.g., MMBench) focus on perception tests for Vision-LLMs.
- Video-enhanced Multimodal LLM Benchmarks:
  - Xiao et al., 2021: Focused on video question answering, such as Next-QA.
  - Li et al., 2023; Ning et al., 2023; Liu et al., 2024; Fu et al., 2024: These benchmarks assess video understanding, which inherently includes both visual and audio elements. However, the paper notes that they predominantly rely on visual inputs, with audio often serving as a supplementary feature rather than the primary modality for understanding.
- Advancements in AudioLLMs (Models, not Benchmarks): The paper mentions several multimodal foundation models that enhance speech and audio understanding, typically through cascaded methods (separate components for audio processing and language processing) or integrated multitask optimization (joint training across various audio-text tasks).
  - Cascaded Models: Whisper (Huang and Tsai, 2023) for ASR followed by LLaMA for reasoning is a typical example.
  - Integrated Models: AudioGPT (Huang et al., 2023), one of the earlier models aiming for diverse audio tasks; SpeechGPT (Zhang et al., 2023a), focused on cross-modal conversational abilities; SALMONN (Tang et al., 2024), a model aiming for generic hearing abilities; Qwen-Audio (Chu et al., 2023) and Qwen2-Audio (Chu et al., 2024), models for universal audio understanding; AudioPALM (Rubenstein et al., 2023), a large language model that can speak and listen; and WavLLM (Hu et al., 2024a), aimed at a robust and adaptive speech large language model.
3.3. Technological Evolution
The technological evolution leading to AudioBench can be summarized as:
- Rise of Text-based LLMs: Initial foundation models focused solely on text, demonstrating strong capabilities in language understanding and generation. Benchmarks like MMLU helped quantify their progress.
- Expansion to Vision-LLMs: Researchers extended LLMs to incorporate visual information, leading to vision-language models capable of image understanding and visual question answering. Benchmarks like MMMU emerged to evaluate these.
- Inclusion of Video-LLMs: Further expansion incorporated video, which includes both visual and audio streams. While these benchmarks touched upon audio, the audio component was often secondary.
- Emergence of dedicated AudioLLMs: Recognizing the distinct challenges and opportunities of audio, specialized AudioLLMs began to appear, integrating speech, environmental sounds, and paralinguistic features. However, a unified benchmark for these models was missing, and models were often evaluated on disparate datasets (e.g., Qwen-Audio-Chat on 12 datasets, SALMONN on 15, with only two in common).
- AudioBench's Role: AudioBench steps in at this point to standardize the evaluation of AudioLLMs by providing a comprehensive, multi-task, instruction-following-oriented benchmark, including new datasets and robust evaluation methodologies like Model-as-Judge. It aims to address the shortcomings of previous audio evaluations (e.g., SUPERB, Dynamic-SUPERB), which were either not focused on AudioLLMs or lacked comprehensive coverage.
3.4. Differentiation Analysis
Compared to previous benchmarks and evaluation practices, AudioBench introduces several core differences and innovations:
- Comprehensive Scope for Audio: Unlike video-LLM benchmarks where audio is supplementary, AudioBench specifically focuses on the primary role of audio understanding across three critical aspects: speech understanding, audio scene understanding, and voice understanding (paralinguistic features). This holistic coverage is novel for a dedicated AudioLLM benchmark.
- Instruction-Following Focus: The benchmark is explicitly designed for general instruction-following AudioLLMs. This differentiates it from traditional audio benchmarks that often have constrained output spaces (e.g., classification tasks for emotion recognition). AudioBench emphasizes evaluating AudioLLMs in open-ended generation scenarios, mimicking real-world user interactions.
- Novel and Curated Datasets: AudioBench includes 7 newly proposed datasets (CN-College-Listen, DREAM-TTS, Public-SG-SpeechQA, OpenHermes-Audio, ALPACA-Audio, WavCaps-QA, AudioCaps-QA) specifically tailored to address gaps in existing resources for evaluating instruction following and reasoning in audio contexts. This goes beyond simply aggregating existing datasets.
- Robustness Evaluation: The benchmark incorporates multiple instruction templates and varying input lengths (from seconds to minutes, including long-form audio) to rigorously test the models' robustness and generalizability to diverse prompts and audio durations. This addresses the observed issue of models being sensitive to instruction variations.
- Validated Model-as-Judge Approach: For open-ended generation, AudioBench validates and adopts a Model-as-Judge approach, providing an affordable and transparent alternative (LLaMA-3-70B-Instruct) to GPT-4 based on strong correlation studies. This is crucial for evaluating unconstrained outputs where traditional metrics fall short.
- Contrast with SUPERB and Dynamic-SUPERB: The paper explicitly differentiates AudioBench from SUPERB (Yang et al., 2024b), which is designed for evaluating self-supervised speech encoders with a supervised fine-tuning step, and Dynamic-SUPERB (Huang et al., 2024a), which, despite allowing zero-shot instruction following, is a crowdsourced collection lacking a specific focus on AudioLLMs, as argued by Yang et al. (2024a). AudioBench is specifically designed for AudioLLMs and their unique challenges.
- Comparison with AIR-Bench: AudioBench acknowledges AIR-Bench (Yang et al., 2024a) as a concurrent work but highlights key differences: AudioBench has broader dataset coverage (6 new datasets), includes multiple ASR, SQA, and SI datasets not in AIR-Bench, explicitly handles prompt variants and model robustness, and provides a holistic study on evaluation metrics.
4. Methodology
4.1. Principles
The core principle behind AudioBench is to comprehensively evaluate Audio Large Language Models (AudioLLMs) by assessing their ability to interpret diverse audio content and flexibly respond to user queries. This involves moving beyond traditional, constrained evaluation metrics and tasks to embrace the instruction-following and open-ended generation capabilities expected of modern LLMs. The benchmark is structured around three key aspects of audio understanding: speech understanding, audio scene understanding, and voice understanding (paralinguistic features). For each aspect, it combines existing datasets with newly curated ones and employs evaluation metrics suitable for both constrained and open-ended responses, including a validated Model-as-Judge approach. A critical aspect is also to evaluate model robustness to varying instructions and audio lengths.
4.2. Core Methodology In-depth (Layer by Layer)
The methodology of AudioBench can be broken down into several layers: defining evaluation scope, dataset curation, evaluation setup, and metric selection.
4.2.1. Evaluation Scope Definition
The benchmark targets three main aspects of audio understanding, crucial for AudioLLMs to handle diverse inputs and user queries:
- Speech Understanding: Focuses on interpreting the semantic content of spoken language.
- Audio Scene Understanding: Concentrates on understanding non-human audio events and environmental sounds.
- Voice Understanding (Paralinguistic): Aims at recognizing human-related information beyond speech content, such as emotion, accent, and gender.
4.2.2. Dataset Curation
AudioBench comprises 8 distinct tasks and 26 datasets, with 7 datasets newly proposed or adapted to fill existing gaps. The total test suite includes over 400 hours of audio across its test samples. The tasks and datasets are organized as follows:
-
Speech Understanding Tasks:
-
Automatic Speech Recognition (ASR):
-
Purpose: To convert spoken content into text. It measures accuracy of speech-to-text conversion.
-
Datasets: 9 datasets are included, 3 of which feature
long-form audio(e.g.,Tedlium3-Longform,Earning-21,Earning-22). -
Long-form Audio Handling: For
AudioLLMsthat struggle with long audio files (exceeding 10 minutes), the long audio is segmented into smaller chunks, and then reassembled for assessment. -
Example (Table 4 from the original paper): The following are the results from Table 4 of the original paper:
| Dataset | Context | Instruction (Example) | Answer |
| --- | --- | --- | --- |
| LibriSpeech-Clean | (Audio: "No, I wasn't thinking of that.") | Turn the speech input into a text transcription. | No, I wasn't thinking of that. |
| LibriSpeech-Other | (Audio: "The history of the house is plain now.") | Decode the audio and give me the written transcription. | The history of the house is plain now. |
| CommonVoice | (Audio: "This is Jon Davis.") | Process the audio speech and provide the text output. | This is Jon Davis. |
| PeoplesSpeech | (Audio: "that's where you have a lot of windows in ...") | Convert the audio speech into a text transcript. | that's where you have a lot of windows in ... |
| GigaSpeech | (Audio: "I believe that fast fashion has made it possible for everyone to have access to aesthetically thoughtful clothing.") | Transform the speech into a text document. | I believe that fast fashion has made it possible for everyone to have access to aesthetically thoughtful clothing. |
| Earning-21 | (Audio: "Good morning, everyone, and welcome to the NextEra Energy Inc. and NextEra Energy Partners...") | Convert the audio speech into a text transcript. | Good morning, everyone, and welcome to the NextEra Energy Inc. and NextEra Energy Partners... |
| Earning-22 | (Audio: "Good day, ladies and gentlemen, and welcome to the CD Projekt Group financial results for H1 2021 conference call. Today's call...") | Transform the speech into a text document. | Good day, ladies and gentlemen, and welcome to the CD Projekt Group financial results for H1 2021 conference call. Today's call... |
| Tedlium3 | (Audio: "One day, Los Angeles Times columnist Steve Lopez was walking along the streets...") | Turn the speech input into a text transcription. | One day, Los Angeles Times columnist Steve Lopez was walking along the streets... |
| Tedlium3-Longform | (Audio: "I'd like to share with you a discovery I made a few months ago while writing an article for Italian Wired. I always keep my thesaurus handy whenever...") | Process the audio speech and provide the text output. | I'd like to share with you a discovery I made a few months ago while writing an article for Italian Wired. I always keep my thesaurus handy whenever... |
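The long-form handling described above (splitting audio that exceeds roughly ten minutes into chunks, transcribing each, and reassembling the text for scoring) can be sketched as follows. The chunk length and the `transcribe` callable are illustrative assumptions of this sketch, not the paper's exact segmentation procedure.

```python
from typing import Callable, List

import numpy as np

def transcribe_long_audio(
    waveform: np.ndarray,
    sample_rate: int,
    transcribe: Callable[[np.ndarray, int], str],
    chunk_seconds: float = 30.0,
) -> str:
    """Split a long waveform into fixed-length chunks, transcribe each chunk
    with the provided model callable, and reassemble the text for assessment."""
    chunk_size = int(chunk_seconds * sample_rate)
    pieces: List[str] = []
    for start in range(0, len(waveform), chunk_size):
        chunk = waveform[start:start + chunk_size]
        pieces.append(transcribe(chunk, sample_rate).strip())
    return " ".join(p for p in pieces if p)

# Usage (hypothetical model wrapper):
# full_text = transcribe_long_audio(wave, 16000, my_audiollm_transcribe)
# wer = word_error_rate(reference_transcript, full_text)
```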
-
-
Speech Question Answering (SQA):
-
Purpose: To assess models' ability to respond to questions based on speech-related audio, including both monologue and dialogue understanding.
-
New Datasets:
CN-College-Listen,DREAM-TTS, andPublic-SG-SpeechQA.CN-College-Listen: Collected from English listening comprehension sections of China's national college entrance exams. 271 manual questions combined with 2000 questions from Hu et al. (2024a). Questions are presented in a free-textQAformat, with correct multiple-choice options serving as reference answers.DREAM-TTS: Built upon the text-based dialogue comprehension datasetDREAM(Sun et al., 2019). ASOTA Text-to-Speech (TTS)engine (Casanova et al., 2024) is used to convert text dialogues into spoken format, preserving gender consistency.Public-SG-SpeechQA: Four public speaking videos from Singaporean politicians with clean transcriptions. Transcriptions were manually segmented, andLLMsgenerated five questions per segment. Questions and reference answers were manually reviewed, resulting in 688 speechQApairs.
-
Example (Table 5 from the original paper): The following are the results from Table 5 of the original paper:
| Dataset | Context | Instruction | Answer |
| --- | --- | --- | --- |
| CN-College-Exam-Listening | (Audio: "F: Excuse me, this is the address, where do I find it? M: All right, you need a street map, here is one, and I will show you where it is.") | Question: What does the woman want to do? Choices: (A) Find a place. (B) Buy a map. (C) Get an address. | (A) Find a place. |
| SLUE-P2-SQA5 | (Audio: "1 Climate Stephens City is located in the humid subtropical climate zone (Köppen climate classification: Cfa) ...") | Which regions have temperate climates? | mid-atlantic |
| DREAM-TTS | (Audio: "F: The movie next Tuesday has been canceled due to a lack of interest. M: What do you mean? F: Well, by last night only a few tickets had been sold.") | Question: What can we conclude about the movie? Choices: (A) They want to buy the tickets for the movie. (B) The tickets for the movie were sold. (C) The movie will not be shown. | (C) The movie will not be shown. |
| Public-SG-SpeechQA | (Audio: "Today, speaking to a roomful of economists, I am inclined to confine myself to talk about markets and price, dollars ...") | How can economics help solve complex healthcare challenges, as mentioned by the speaker? | Economics can help solve complex healthcare ... |
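The DREAM-TTS construction described above (converting text dialogues to speech while preserving speaker gender) could be approximated with an open-source TTS engine. The sketch below assumes the Coqui TTS package and hypothetical gender-matched reference voice files; the paper's actual synthesis setup (Casanova et al., 2024) may differ.

```python
# pip install TTS  (Coqui TTS; an assumption of this sketch, not necessarily the paper's exact tooling)
from TTS.api import TTS

# XTTS-style multilingual, multi-speaker model with voice cloning from short reference clips.
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

# Hypothetical gender-matched reference voices to keep speaker gender consistent across turns.
reference_voice = {"F": "voices/female_ref.wav", "M": "voices/male_ref.wav"}

dialogue = [
    ("F", "The movie next Tuesday has been canceled due to a lack of interest."),
    ("M", "What do you mean?"),
    ("F", "Well, by last night only a few tickets had been sold."),
]

for turn_id, (speaker, text) in enumerate(dialogue):
    tts.tts_to_file(
        text=text,
        speaker_wav=reference_voice[speaker],   # clone a gender-matched voice
        language="en",
        file_path=f"dream_tts_turn_{turn_id}.wav",
    )
```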
-
-
Speech Instruction (SI):
-
Purpose: To evaluate the model's ability to directly follow instructions provided through speech input. The question is audio, and the response is text.
-
New Datasets:
OpenHermes-AudioandALPACA-Audio.- Data Generation: Audio instructions are synthesized (Casanova et al., 2024) from existing instruction-following datasets
ALPACA(Taori et al., 2023) andOpenHermes(Teknium, 2023). - Human Filtering: Humans select instances to ensure accurate speech synthesis and suitability as spoken instructions. Only of samples are retained after filtering.
- Data Generation: Audio instructions are synthesized (Casanova et al., 2024) from existing instruction-following datasets
-
Example (Table 8 from the original paper): The following are the results from Table 8 of the original paper:
| Dataset | Context | Instruction (Sample) | Answer |
| --- | --- | --- | --- |
| OpenHermes-Audio | (Audio: "Pretend to be Ghost, expressing frustration to Soap that they're no closer to finding the elusive enemy leader after weeks of searching.") | Please follow the instruction in the speech. | It feels like we've been chasing ghosts, Soap. After all these weeks, we're still no closer to finding the bastard. It's bloody infuriating. |
| ALPACA-Audio | (Audio: "identify the type of the sentence: she can play guitar.") | Please follow the instruction in the speech. | The type of sentence is a declarative sentence. |
-
-
-
Audio Scene Understanding Tasks:
-
Audio Question Answering (AQA):
-
Purpose: Focuses on understanding environmental contexts and following instructions related to non-speech audio.
-
New Datasets:
WavCaps-QAandAudioCaps-QA.- Data Generation:
Clotho-AQAdataset is refined by retaining high-confidence samples. ForWavCaps-QAandAudioCaps-QA,LLaMA-3-8B-Instructionis used to generate questions and answers from existing captions, followed by human annotation and revision for validation.
- Data Generation:
-
Example (Table 6 from the original paper): The following are the results from Table 6 of the original paper:
| Dataset | Context | Instruction | Answer |
| --- | --- | --- | --- |
| Clotho-AQA | (Audio: Wave sound) | Are there waves? | yes |
| WavCaps-QA | (Audio: Electronic Music playing) | What type of sound is being played? | The sound being played is music. |
| AudioCaps-QA | (Audio: Mechanical vibration sound) | What type of object or equipment is likely to produce a constant rattling noise and sharp vibrations? | A loose or worn-out bolt or screw on a machine or equipment is likely to produce a constant rattling noise and sharp vibrations. |
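The WavCaps-QA and AudioCaps-QA construction described above (prompting an LLM to draft question-answer pairs from existing captions, followed by human review and revision) can be sketched as below. The prompt wording and the use of the Hugging Face transformers chat pipeline are illustrative assumptions; the paper reports using LLaMA-3-8B-Instruct but does not publish this exact prompt.

```python
from transformers import pipeline

# Any instruction-tuned chat model works for this sketch; the paper used LLaMA-3-8B-Instruct.
generator = pipeline("text-generation", model="meta-llama/Meta-Llama-3-8B-Instruct")

def draft_qa_from_caption(caption: str) -> str:
    """Ask the LLM to turn an audio caption into one QA pair for later human review."""
    messages = [
        {"role": "system", "content": "You write question-answer pairs about audio clips."},
        {"role": "user", "content": (
            "Audio caption: " + caption + "\n"
            "Write one question that can be answered from the audio alone, then its answer."
        )},
    ]
    output = generator(messages, max_new_tokens=128)
    # The chat pipeline returns the full conversation; the last message is the model's reply.
    return output[0]["generated_text"][-1]["content"]

print(draft_qa_from_caption("Constant rattling noise and sharp vibrations."))
```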
-
-
Audio Captioning (AC):
-
Purpose: To generate textual descriptions (captions) for an audio clip.
-
Datasets:
WavCapsandAudioCaps. -
Example (Table 9 from the original paper): The following are the results from Table 9 of the original paper:
| Dataset | Context | Instruction (Sample) | Answer |
| --- | --- | --- | --- |
| WavCaps | (Audio: Electronic Music playing) | Detail the ambient sounds included in the audio. | A sound is playing. |
| AudioCaps | (Audio: Mechanical vibration sound) | Describe the environmental sounds in the audio. | Constant rattling noise and sharp vibrations |
-
-
-
Voice Understanding Tasks:
-
Emotion Recognition (ER):
-
Purpose: To interpret emotions conveyed through human speech or non-speech content.
-
Datasets:
IEMOCAP-Emotion,MELD-Sentiment,MELD-Emotion. -
Example (Table 7 from the original paper): The following are the results from Table 7 of the original paper:
| Dataset | Context | Instruction (Sample) | Answer |
| --- | --- | --- | --- |
| IEMOCAP-Emotion | (Audio: "Thank you.") | Can you interpret the emotions in the speaker's speech (frustration, anger, excited, neutral, happiness, surprise, sad)? | From the speaker's speech, it seems they are in a sad state. |
| MELD-Sentiment | (Audio: "Yeah, I'm not in that.") | What sentiment signals can you hear in the speaker's speech (neutral, positive, negative)? | From the speaker's speech, it seems they are in a neutral sentiment state. |
| MELD-Emotion | (Audio: "Yeah, I'm not in that.") | How does the speaker's speech reflect their emotional state (neutral, joy, disgust, sadness, surprise, anger, fear)? | Based on the speaker's speech patterns, it seems like they are feeling neutral. |
-
-
Accent Recognition (AR):
-
Purpose: To predict a speaker's likely origin based on their accent.
-
Dataset:
VoxCeleb1-Accent(metadata fromVoxCeleb1). -
Example (Table 10 from the original paper): The following are the results from Table 10 of the original paper:
| Dataset | Context | Instruction (Sample) | Answer |
| --- | --- | --- | --- |
| VoxCeleb-Accent | (Audio: "because I was trying to get away from her, but where did you go? what did you do?") | Based on their accent, where is the speaker most likely from? | From the audio, I guess the speaker is from USA. |
-
-
Gender Recognition (GR):
-
Purpose: To recognize gender based on vocal characteristics.
-
Datasets:
VoxCeleb1-Gender,IEMOCAP-Gender. -
Example (Table 11 from the original paper): The following are the results from Table 11 of the original paper:
| Dataset | Context | Instruction (Sample) | Answer |
| --- | --- | --- | --- |
| VoxCeleb-Gender | (Audio: "because I was trying to get away from her, but where did you go? what did you do?") | Can you determine the gender of the speaker from the audio (Male or Female)? | The speaker is a female. |
| IEMOCAP-Gender | (Audio: "For God's sake, Augie. It's- Grow up, we're not going to see them.") | From the audio, can you determine the speaker's gender (Male or Female)? | Based on the auditory cues, it sounds like the speaker is a male. |
-
-
The overall structure of AudioBench datasets is visualized in Figure 4 from the original paper:
The figure is a schematic diagram showing the structure of the AudioBench datasets used to evaluate Audio Large Language Models (AudioLLMs). It groups the tasks and datasets into categories such as automatic speech recognition, emotion recognition, and speech instruction, clearly illustrating how the datasets are organized.
4.2.3. Evaluation Setup
- Prompt Templates: To assess model compatibility with different instructions and robustness, AudioBench incorporates a variety of prompt templates. This is crucial because some models might perform well on seen instructions but fail to generalize to unseen ones. For tasks without inherently diverse prompts (like ASR, ER, AR, GR, AC), at least 20 diverse prompt templates are used (a minimal sketch of this template sampling follows this list).
- Input Length Variation: The benchmark varies input audio lengths from seconds to minutes to better assess model performance on longer audio sequences, an area where current AudioLLMs often struggle.
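As referenced above, a simple way to realize the multi-template setup is to pair every test item with a randomly drawn instruction from a pool of paraphrases. The template pool below reuses instructions visible in Table 4; the sampling function itself is an illustrative sketch, not AudioBench's released implementation.

```python
import random

# Illustrative paraphrases of an ASR instruction (AudioBench uses at least 20 per task).
ASR_TEMPLATES = [
    "Turn the speech input into a text transcription.",
    "Decode the audio and give me the written transcription.",
    "Process the audio speech and provide the text output.",
    "Convert the audio speech into a text transcript.",
    "Transform the speech into a text document.",
]

def attach_prompts(samples, templates, seed=0):
    """Pair each audio sample with a randomly chosen instruction template."""
    rng = random.Random(seed)
    return [{"audio": s, "instruction": rng.choice(templates)} for s in samples]

test_set = attach_prompts(["clip_001.wav", "clip_002.wav"], ASR_TEMPLATES)
print(test_set)
```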
4.2.4. Metric Selection
Given the open-ended generation nature of AudioLLMs, traditional metrics are often insufficient.
- Model-as-Judge (M.J.): This approach is employed for most tasks to evaluate the quality of open-ended generated responses. The M.J. scores are rescaled to a 100-point scale for easier comparison.
- Word Error Rate (WER): Used as the primary metric for ASR tasks.
- METEOR Score: Utilized as an additional measure for Audio Captioning (AC) tasks, alongside M.J.
4.2.5. Model-as-Judge Validation
To ensure the reliability and accessibility of the Model-as-Judge approach:
- Candidate Models: Three open-source models (Llama-3-8B-Instruct, Llama-3-70B-Instruct, Prometheus-2) are compared against GPT-4 (a closed-source, high-performing judge model).
- Validation Process: Model outputs (from the SALMONN model) for three diverse datasets (CN-College-Listen, Clotho-AQA, VoxCeleb1-Accent) are fed into the candidate judge models and GPT-4.
- Correlation Study: Spearman's rank correlation is calculated between the scores from the open-source judges and GPT-4. The formula for Spearman's rank correlation coefficient ($\rho$) is:
  $ \rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)} $
  where $d_i$ is the difference between the ranks of corresponding observations (scores) for the $i$-th pair, and $n$ is the number of observations (pairs of scores). Spearman's rank correlation assesses how well the relationship between two variables can be described using a monotonic function: a value close to 1 indicates a strong positive monotonic relationship, and a value close to -1 indicates a strong negative monotonic relationship.
- Finding: LLaMA-3-70B-Instruct shows the highest correlation (exceeding 0.85 across all three datasets) with GPT-4, indicating a "very strong correlation." Consequently, LLaMA-3-70B-Instruct is chosen as the default judging model for AudioBench experiments due to its robustness and accessibility.

The template used for the Model-as-Judge (Llama-3-70B-Instruct, Llama-3-8B-Instruct, and GPT4) is shown in Table 3 from the original paper; Prometheus2 uses a slightly different template. The following are the results from Table 3 of the original paper:
- Llama-3-70B-Instruct, Llama-3-8B-Instruct, and GPT4 share the following judging template:

  > [Reference Answer] {reference}
  > [Model Answer] {prediction}
  > [Question] {question}
  > [Task] Rate the model's answer based on its alignment with the reference answer, focusing on accuracy and relevance to the reference provided. Please be critical on the details.
  > Criteria: Assess if the model's response mirrors the reference in terms of content, accuracy, and relevance.
  > Score0: The answer is completely misaligned, providing incorrect or irrelevant information compared to the reference.
  > Score1: The answer shows minimal alignment, often misunderstanding or providing irrelevant details unrelated to the reference.
  > Score2: The answer recognizes the topic but diverges significantly from the reference in accuracy or relevance.
  > Score3: The answer aligns with the reference generally but lacks detail or precise accuracy in some aspects.
  > Score4: The answer is mostly accurate and relevant, closely following the reference but could be clearer or more detailed.
  > Score5: The answer is highly accurate, detailed, and matches the reference answer perfectly, capturing its essence and detail.
  > Your response should be formatted as follows:
  > Explanation: (Provide a concise explanation of your rating, comparing the reference answer with the model's response. "The reference answer is [XXX], while the model's answer is [YYY]. I think ...")
  > Rating: (int)

- Prometheus2 uses a slightly different rubric-style template:

  > "criteria": "Does the model provide accurate, relevant, and contextually appropriate responses to user inquiries?"
  > "score1_description": "The model frequently fails to understand or address the core of the user's inquiries, providing inaccurate, irrelevant, or inappropriate responses."
  > "score2_description": "The model occasionally recognizes the topic of inquiry but often provides responses that are not sufficiently accurate, detailed, or contextually relevant."
  > "score3_description": "The model usually understands the question and attempts to provide a relevant answer, yet the responses may sometimes lack detail, accuracy, or context."
  > "score4_description": "The model consistently understands and appropriately addresses the questions, providing accurate and relevant responses. However, there may still be minor inaccuracies or instances where additional context could enhance clarity."
  > "score5_description": "The model excels in understanding user inquiries and consistently delivers accurate, detailed, and contextually appropriate responses that thoroughly address the user's needs."
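To show how such a judging template is used in practice, here is a minimal sketch that fills the shared Llama/GPT-4 template with a question, reference answer, and model prediction, then parses the returned rating. Calling the actual judge model (e.g., LLaMA-3-70B-Instruct served locally or via an API) is left as a placeholder, and the regex-based parsing and the abbreviated rubric are assumptions of this sketch.

```python
import re

# Abbreviated version of the Table 3 template; see the full rubric above.
JUDGE_TEMPLATE = """[Reference Answer]
{reference}

[Model Answer]
{prediction}

[Question]
{question}

[Task]
Rate the model's answer based on its alignment with the reference answer, focusing
on accuracy and relevance to the reference provided. Please be critical on the details.
Score0: completely misaligned ... Score5: highly accurate and detailed.

Your response should be formatted as follows:
Explanation: (...)
Rating: (int)"""

def build_judge_prompt(question: str, reference: str, prediction: str) -> str:
    return JUDGE_TEMPLATE.format(question=question, reference=reference, prediction=prediction)

def parse_rating(judge_reply: str) -> int:
    """Extract the integer rating from the judge model's reply (sketch only)."""
    match = re.search(r"Rating:\s*(\d)", judge_reply)
    return int(match.group(1)) if match else 0

prompt = build_judge_prompt(
    question="What does the woman want to do?",
    reference="(A) Find a place.",
    prediction="She wants to find a place.",
)
# reply = call_judge_model(prompt)   # e.g., LLaMA-3-70B-Instruct; hypothetical placeholder
# score = parse_rating(reply)
```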
5. Experimental Setup
5.1. Datasets
AudioBench uses a comprehensive set of 26 datasets across 8 tasks, totaling over 400 hours of audio across its test samples. Seven of these datasets are newly proposed or adapted for this benchmark. The datasets are categorized into Speech Understanding, Audio Scene Understanding, and Voice Understanding.
Speech Understanding Datasets:
- ASR (Automatic Speech Recognition):
  - LibriSpeech-Clean (Panayotov et al., 2015): A large corpus of read English speech, derived from audiobooks, containing clean audio.
  - LibriSpeech-Other (Panayotov et al., 2015): Similar to LibriSpeech-Clean but with more challenging, "other" speech conditions (e.g., lower quality).
  - CommonVoice-15 (Ardila et al., 2020): A massively-multilingual speech corpus from Mozilla, version 15 specifically.
  - PeoplesSpeech (Galvez et al., 2021): A large-scale, diverse English speech recognition dataset for commercial usage.
  - GigaSpeech (Chen et al., 2021): An evolving, multi-domain ASR corpus with 10,000 hours of transcribed audio.
  - Tedlium3 (Rousseau et al., 2012): A corpus derived from TED talks.
  - Tedlium3-Longform: A long-form version of Tedlium3.
  - Earning-21 (Del Rio et al., 2021): A benchmark for ASR in the wild, likely from earnings calls.
  - Earning-22 (Del Rio et al., 2022): Another earnings call benchmark, focusing on accents.
- SQA (Speech Question Answering):
  - CN-College-Listen: Newly curated. Questions from English listening comprehension sections of China's national college entrance examinations, compiled with additional questions from Hu et al. (2024a). Example: (Audio of a dialogue) "Question: What does the woman want to do?" Answer: "(A) Find a place."
  - SLUE-P2-SQA5 (Shon et al., 2022): From the SLUE Phase-2 benchmark suite for spoken language understanding tasks.
  - DREAM-TTS: Newly curated. Built upon the text-based dialogue comprehension dataset DREAM (Sun et al., 2019), converted to speech using TTS. Example: (Audio of a dialogue) "Question: What can we conclude about the movie?" Answer: "(C) The movie will not be shown."
  - Public-SG-SpeechQA: Newly curated. Public speaking videos from Singaporean politicians, manually segmented with LLM-generated and human-reviewed QA pairs. Example: (Audio of speech) "How can economics help solve complex healthcare challenges, as mentioned by the speaker?" Answer: "Economics can help solve complex healthcare..."
- SI (Speech Instruction):
  - OpenHermes-Audio: Newly curated. Synthesized audio instructions from the OpenHermes (Teknium, 2023) instruction-following dataset, human-filtered. Example: (Audio: "Pretend to be Ghost, expressing frustration to Soap that they're no closer to finding the elusive enemy leader after weeks of searching.") Expected answer: "It feels like we've been chasing ghosts, Soap. After all these weeks, we're still no closer to finding the bastard. It's bloody infuriating."
  - ALPACA-Audio: Newly curated. Synthesized audio instructions from the ALPACA (Taori et al., 2023) instruction-following dataset, human-filtered. Example: (Audio: "identify the type of the sentence: she can play guitar.") Expected answer: "The type of sentence is a declarative sentence."
Audio Scene Understanding Datasets:
- AQA (Audio Question Answering):
  - Clotho-AQA (Lipping et al., 2022): A crowdsourced dataset for audio question answering.
  - WavCaps-QA: Newly curated. Questions and answers generated from WavCaps (Mei et al., 2023) captions using LLMs, human-verified. Example: (Audio: "Electronic Music playing") "What type of sound is being played?" Answer: "The sound being played is music."
  - AudioCaps-QA: Newly curated. Questions and answers generated from AudioCaps (Kim et al., 2019) captions using LLMs, human-verified. Example: (Audio: "Mechanical vibration sound") "What type of object or equipment is likely to produce a constant rattling noise and sharp vibrations?" Answer: "A loose or worn-out bolt or screw on a machine or equipment is likely to produce a constant rattling noise and sharp vibrations."
- AC (Audio Captioning):
  - WavCaps (Mei et al., 2023): A ChatGPT-assisted, weakly-labeled audio captioning dataset.
  - AudioCaps (Kim et al., 2019): A dataset for generating captions for audios in the wild.
Voice Understanding Datasets:
- ER (Emotion Recognition):
  - IEMOCAP-Emotion (Busso et al., 2008): The Interactive Emotional Dyadic Motion Capture database. Example: (Audio: "Thank you.") "Can you interpret the emotions in the speaker's speech (frustration, anger, excited, neutral, happiness, surprise, sad)?" Answer: "From the speaker's speech, it seems they are in a sad state."
  - MELD-Sentiment (Poria et al., 2019): A multimodal multi-party dataset for emotion recognition in conversations.
  - MELD-Emotion (Poria et al., 2019): Similar to MELD-Sentiment, focusing on specific emotions.
- AR (Accent Recognition):
  - VoxCeleb1-Accent (Nagrani et al., 2017): Metadata from the VoxCeleb1 dataset used to predict a speaker's likely origin based on accent. Example: (Audio of speech) "Based on their accent, where is the speaker most likely from?" Answer: "From the audio, I guess the speaker is from USA."
- GR (Gender Recognition):
  - VoxCeleb1-Gender (Nagrani et al., 2017): Metadata from VoxCeleb1 used to determine speaker gender.
  - IEMOCAP-Gender (Busso et al., 2008): From the IEMOCAP dataset, used to determine speaker gender. Example: (Audio of speech) "From the audio, can you determine the speaker's gender (Male or Female)?" Answer: "Based on the auditory cues, it sounds like the speaker is a male."

The following table (Table 1 from the original paper) summarizes the statistics of the AudioBench dataset:
| Category | Dataset Name | #Samples | Hours | Avg.L/Min.L/Max.L(s) | Metrics |
| Speech Understanding | |||||
| ASR | LibriSpeech-Clean LibriSpeech-Other CommonVoice-15 PeoplesSpeech GigaSpeech Tedlium3 Tedlium3-Longform Earning-21 Earning-22 |
2.6k | 5.40 | 7.43 / 1.28 / 34.96 | WER(↓) |
| 2.9k | 5.34 | 6.55 / 1.47 / 34.51 5.93 / 1.34 / 105.67 6.54 / 1.00 / 99.91 6.77 / 1.00 / 22.03 8.24 / 1.07 / 32.55 0.8k / 0.3k / 1.6k 3.2k / 1k / 5.7k 3.4k / 0.87k / 7.4k |
|||
| 16k | 26.95 | ||||
| 32k | 59.20 | ||||
| 18k | 35.09 | ||||
| 1.1k 1144 |
2.61 2.61 39.26 |
||||
| 125 | 119.88 | ||||
| SQA | CN-College-Listen SLUE-P2-SQA5 DREAM-TTS Public-SG-SpeechQA |
2.2k | 13.3 | 21.09 / 5.76 / 137.82 39.85 / 13.00 / 40.0 34.14 / 3.14 / 261.85 39.86 / 15.78 / 95.71 |
M.J.(↑) |
| 408 1.9k 688 |
4.5 18.1 7.6 |
||||
| SI | OpenHermes-Audio ALPACA-Audio |
100 100 |
0.16 0.12 |
5.95 / 2.04 / 15.77 4.32 / 1.80 / 8.85 |
M.J.(↑) |
| Audio Scene Understanding | |||||
| AQA | Clotho-AQAWavCaps-QAAudioCaps-QA | 2.2k 304 313 |
14.1 0.87 0.86 |
22.59 / 15.03 / 29.97 10.28 / 1.0 / 30.62 9.86 / 3.27 / 10.00 |
M.J.(↑) |
| AC | WavCaps AudioCaps |
1.7k 4.4k |
4.9 12.1 |
10.22 / 1.00 / 30.97 9.86 / 1.74 / 10.0 |
M.J.(↑) & METEOR(↑) |
| Voice Understanding | |||||
| ER | IEMOCAP-Emotion MELD-Sentiment MELD-Emotion |
1k 2.6k 2.6k |
1.3 2.4 2.4 |
4.51 / 0.75 / 24.12 3.35 / 0.13 / 304.9 3.35 / 0.13 / 304.9 |
M.J.(↑) |
| AR | VoxCeleb1-Accent | 4.8k | 11.2 | 8.27 / 3.96 / 69.04 | M.J.(↑) |
| GR | VoxCeleb1-Gender IEMOCAP-Gender |
4.8k 1k |
11.2 1.26 |
8.27 / 3.96 / 69.04 4.55 / 0.69 / 26.77 |
M.J.(↑) |
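Statistics like those in Table 1 (sample counts, total hours, and average/min/max clip length in seconds) can be recomputed directly from the audio files. The sketch below assumes the `soundfile` package and a local directory of WAV files; it is a generic utility, not AudioBench's released tooling.

```python
import glob

import soundfile as sf

def dataset_stats(wav_dir: str) -> dict:
    """Compute #samples, total hours, and avg/min/max clip duration in seconds."""
    durations = []
    for path in glob.glob(f"{wav_dir}/*.wav"):
        info = sf.info(path)                      # reads only the file header
        durations.append(info.frames / info.samplerate)
    return {
        "samples": len(durations),
        "hours": sum(durations) / 3600,
        "avg_s": sum(durations) / len(durations),
        "min_s": min(durations),
        "max_s": max(durations),
    }

# Hypothetical directory of test clips:
print(dataset_stats("librispeech_clean_test"))
```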
5.2. Evaluation Metrics
The choice of evaluation metrics depends on the task, particularly distinguishing between tasks with constrained outputs and those with open-ended generation.
- Word Error Rate (WER) for ASR tasks:
  - Conceptual Definition: WER is a standard metric for Automatic Speech Recognition that quantifies the difference between a hypothesized (model-generated) transcription and a reference (ground-truth) transcription. It is based on the Levenshtein distance, which measures the minimum number of single-word edits (insertions, deletions, or substitutions) required to change one sequence of words into another. A lower WER indicates better ASR performance.
  - Mathematical Formula:
    $ \mathrm{WER} = \frac{S + D + I}{N} $
  - Symbol Explanation: $S$ is the number of substitutions (a word in the reference is replaced by a different word in the hypothesis); $D$ is the number of deletions (a word in the reference is missing from the hypothesis); $I$ is the number of insertions (a word in the hypothesis is not present in the reference); and $N$ is the total number of words in the reference (ground-truth) transcription.
- METEOR Score for Audio Captioning (AC) tasks:
  - Conceptual Definition: METEOR (Metric for Evaluation of Translation with Explicit Ordering) is a metric for evaluating the quality of text generation, such as machine translation or image/audio captioning. Unlike BLEU, which primarily focuses on precision, METEOR considers both precision and recall, as well as alignment features like exact word matches, stemmed matches, and synonym matches. It also includes a penalty for fragmentation to account for the fluency of the generated text, aiming for a higher correlation with human judgments. A higher METEOR score indicates better-quality captions.
  - Mathematical Formula:
    $ \mathrm{METEOR} = \mathrm{HM}(\alpha P, (1-\alpha)R) \times (1 - F_p) $
  - Symbol Explanation: $\mathrm{HM}$ is the harmonic mean function, which combines precision and recall while giving more weight to the lower value; $P$ is precision, the number of matched unigrams between the candidate and reference texts divided by the number of unigrams in the candidate (model-generated) text; $R$ is recall, the number of matched unigrams divided by the number of unigrams in the reference (ground-truth) text; $\alpha$ is a weighting parameter (typically 0.9 in METEOR) that balances precision and recall; and $F_p$ is the fragmentation penalty, which penalizes fragmented matches between the candidate and reference texts, encouraging more fluent and contiguous generated text.
- Model-as-Judge (M.J.) for open-ended generation tasks:
  - Conceptual Definition: For tasks involving open-ended generation (SQA, SI, AQA, ER, AR, GR, and AC in addition to METEOR), traditional word-overlap metrics are often inadequate. Model-as-Judge leverages a powerful LLM (in this case, LLaMA-3-70B-Instruct after validation against GPT-4) to evaluate the quality of a model's generated response against a reference answer and the original question. The judge model scores the response based on criteria such as accuracy, relevance, completeness, and adherence to instructions.
  - Mathematical Formula: There is no single closed-form formula for Model-as-Judge, as it relies on the internal reasoning and scoring mechanism of the judging LLM. The output is typically a numerical score (e.g., on a 0-5 scale) and often an accompanying explanation. The raw scores are then rescaled to a 100-point scale (a minimal sketch of this rescaling and aggregation follows this list):
    $ \mathrm{Rescaled\ Score} = \frac{\mathrm{Raw\ M.J.\ Score}}{\mathrm{Max\ M.J.\ Score}} \times 100 $
    where the raw M.J. score is the score assigned by the judge (e.g., 0-5 or 0-1) and the max M.J. score is the maximum possible score (e.g., 5 or 1). The (↑) symbol next to M.J. in Table 1 indicates that a higher score is better.
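As referenced above, the judge's raw ratings are rescaled to a 100-point scale and then averaged per dataset. A minimal sketch of that bookkeeping, using hypothetical per-sample ratings, follows.

```python
from statistics import mean

def rescale(raw_score: float, max_score: float = 5.0) -> float:
    """Rescaled Score = Raw M.J. Score / Max M.J. Score * 100."""
    return raw_score / max_score * 100.0

# Hypothetical per-sample judge ratings (0-5) for one dataset.
judge_ratings = [4, 3, 5, 2, 4, 4]

dataset_score = mean(rescale(r) for r in judge_ratings)
print(f"Dataset M.J. score: {dataset_score:.2f} / 100")   # 73.33 for these ratings
```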
5.3. Baselines
The paper evaluated five models, comprising four AudioLLMs and one cascade model, to provide a comprehensive review of various solutions:
- SALMONN (Tang et al., 2024): An AudioLLM designed for generic hearing abilities.
- Qwen-Audio-Chat (Chu et al., 2023): An AudioLLM from Alibaba, aiming for universal audio understanding.
- WavLLM (Hu et al., 2024a): An AudioLLM focused on robustness and adaptivity in speech.
- Qwen2-Audio-Instruct (Chu et al., 2024): An updated version of Qwen-Audio, further developed for instruction following.
- Whisper+Llama3 (Cascade Model): This model operates in a pipeline (a minimal sketch of the pipeline follows this list):
  - Step 1 (ASR): Transcriptions are extracted from the audio using Whisper-large-v3 (Huang and Tsai, 2023), a highly capable ASR model.
  - Step 2 (Reasoning): These transcriptions, along with user queries, are then fed into the Llama-3-8B-Instruct model to generate responses.

  This cascade model serves as a strong baseline, particularly for speech-intensive tasks, as it leverages the robust ASR capabilities of Whisper and the powerful reasoning abilities of Llama3. However, it inherently cannot comprehend rich non-speech audio content, relying solely on verbal context.
6. Results & Analysis
6.1. Core Results Analysis
The main experimental results, detailed in Table 2, show that no single model consistently outperforms others across all 26 datasets and 8 tasks. This highlights the diverse challenges within AudioLLMs and opportunities for future advancements.
The following are the main results of the four AudioLLMs and the one cascade model, reproduced from the original paper. The word error rate (WER) for ASR tasks is lower-is-better (↓):
| Dataset Name | AudioLLMs | Whisper+Llama3 | ||||||
| SALMONN | Qwen-Audio-Chat | WavLLM | Qwen2-Audio-Instruct | |||||
| Speech Understanding | ||||||||
| LibriSpeech-Clean(↓) | 55.58 | 2.25 | 2.10 | 3.20 | 1.83 | |||
| LibriSpeech-Other(↓) | 41.80 | 4.16 | 4.80 | 6.07 | 3.71 | |||
| CommonVoice-15(↓) | 33.75 | 11.65 | 14.53 | 11.44 | 9.89 | |||
| GigaSpeech(↓) | 34.33 | 30.72 | 37.92 | 22.32 | 14.54 | |||
| 14.22 | 13.32 | 15.49 | 11.89 | 9.51 | ||||
| Tedlium3(↓) | 8.56 | 4.00 | 6.62 | 6.39 | 3.81 | |||
| Tedlium3-Longform(↓) Earning-21(↓) Earning-22(↓) |
18.39 | 45.29 | 45.37 | 95.35 | ||||
| 26.87 | 38.46 | 64.47 | 98.65 | 11.77 | ||||
| 36.38 | 51.18 | 66.72 | 98.84 | 15.61 | ||||
| CN-College-Listen | 50.51 | 60.85 | 65.43 | 74.50 | 85.25 | |||
| SLUE-P2-SQA5 | 78.24 | 76.12 | 83.92 | 80.05 | 82.99 | |||
| DREAM-TTS | 55.93 | 57.76 | 64.56 | 66.70 | 86.09 | |||
| Public- | Public-SG-SpeechQA | 56.77 | 57.47 | 58.55 | 58.31 | 64.94 | ||
| Public-SG-SpeechQA OpenHermes-Audio ALPACA-Audio |
||||||||
| 19.20 12.40 |
11.00 9.60 |
22.40 21.60 |
44.80 52.60 |
63.0 70.8 |
||||
| Audio Scene Understanding | ||||||||
| Clotho-AQA | 51.18 | 58.20 38.68 |
43.01 | 50.92 44.47 |
29.47 17.38 |
|||
| WavCaps-QA | 46.25 | 26.25 | ||||||
| AudioCaps-QA | 47.03 | 47.99 | 29.84 | 45.75 | 16.71 | |||
| WavCaps(M.J.) AudioCaps(M.J.) |
21.16 | 29.25 | 6.40 | 33.78 | 3.45 | |||
| 34.37 | 47.99 | 4.17 | 40.78 | 2.47 | ||||
| WavCaps(METEOR) AudioCaps(METEOR) |
17.72 21.20 |
24.02 27.70 |
9.78 6.70 |
21.34 19.89 |
13.89 7.95 |
|||
| Voice Understanding | ||||||||
| IEMOCAP-Emotion | 21.56 | 27.34 | 45.91 | 49.30 40.54 |
34.43 | |||
| MELD-Emotion | 33.06 | 50.57 | 41.07 | 33.36 | ||||
| MELD-Sentiment | 41.87 | 43.87 | 50.08 | 53.49 | 43.87 | |||
| VoxCeleb1-Accent | 28.06 | 45.70 | 37.65 | 29.19 | 39.33 | |||
| VoxCeleb1-Gender | 88.90 | 70.56 | 70.51 | 99.12 | 53.41 | |||
| IEMOCAP-Gender | 51.60 | 51.13 | 45.29 | 49.30 | 51.50 | |||
Here's an analysis of the results:
- ASR Performance:
  - Qwen-Audio-Chat and WavLLM generally show robust ASR capabilities on standard datasets like LibriSpeech-Clean and LibriSpeech-Other, with WERs of 2.25/2.10 and 4.16/4.80, respectively.
  - The cascade model demonstrates superior ASR performance on most short-form datasets, notably achieving the lowest WERs on LibriSpeech-Clean (1.83), LibriSpeech-Other (3.71), and Tedlium3 (3.81). This is expected, as Whisper is a dedicated and highly optimized ASR model.
  - SALMONN shows significantly higher WERs on many ASR tasks (e.g., 55.58 on LibriSpeech-Clean), indicating sensitivity to varying instructions or limitations in its ASR component compared to other models.
  - All AudioLLMs struggle significantly with long-form ASR tasks (Tedlium3-Longform, Earning-21, Earning-22), exhibiting very high WERs (e.g., Qwen2-Audio-Instruct with 95.35 on Tedlium3-Longform and 98.65 on Earning-21). This suggests a clear limitation in handling extended audio contexts, potentially due to training data limitations or model architecture. Whisper+Llama3 also struggles with long-form audio, though generally less severely than the end-to-end AudioLLMs.
- Speech Question Answering (SQA) and Speech Instruction (SI):
  - The cascade model consistently exhibits superior performance in speech-intensive tasks like SQA and SI (e.g., 85.25 on CN-College-Listen, 86.09 on DREAM-TTS, and 63.0 on OpenHermes-Audio). This confirms that combining a strong ASR front-end (Whisper) with a powerful LLM (Llama3) is highly effective when the information is primarily verbal.
  - AudioLLMs perform significantly lower on SI tasks than on SQA tasks. For example, Qwen2-Audio-Instruct scores 74.50 on CN-College-Listen (SQA) but only 44.80 on OpenHermes-Audio (SI). This indicates that directly following instructions delivered via speech (SI) is more challenging than answering questions about speech content (SQA), possibly due to complexities in extracting instructional intent or integrating paralinguistic cues for instruction following.
  - The modality fusion process in end-to-end AudioLLMs might distort speech content, leading to their underperformance compared to the Whisper+Llama3 pipeline, which relies on clean transcriptions.
- Audio Scene Understanding (AQA & AC):
  - AudioLLMs generally outperform the cascade model on Audio Scene Understanding tasks. For example, Qwen-Audio-Chat scores 58.20 on Clotho-AQA compared to Whisper+Llama3's 29.47. This is logical, as the cascade model cannot "hear" non-verbal audio and relies entirely on transcriptions, which are absent or irrelevant for these tasks.
  - However, the AudioLLMs' performance on these tasks is still far from perfect, with M.J. scores often in the 40s-50s for AQA and the 20s-40s for AC, highlighting room for improvement in environmental sound comprehension and captioning.
  - WavLLM shows notably poor performance in Audio Captioning (e.g., 6.40 on WavCaps (M.J.) and 4.17 on AudioCaps (M.J.)), indicating its lack of exposure or capability in non-spoken scenarios.
- Voice Understanding (ER, AR, GR):
  - Similar to Audio Scene Understanding, AudioLLMs generally outperform the cascade model on tasks involving paralinguistic features like emotion, accent, and gender recognition. Whisper+Llama3 struggles here because these features are not explicitly captured in ASR transcripts (e.g., it scores 34.43 on IEMOCAP-Emotion while Qwen2-Audio-Instruct scores 49.30).
  - An exception is sentiment and emotion recognition, where some emotions can be inferred directly from speech semantics, which likely boosts the cascade model's performance on MELD-Sentiment.
  - Qwen2-Audio-Instruct shows strong performance in Gender Recognition (99.12 on VoxCeleb1-Gender), indicating a specific strength in this area.
6.2. Robustness Queries
The study specifically investigated model robustness to diverse instruction prompts, a crucial factor for real-world applicability.
- Experimental Setup: Two AudioLLMs (SALMONN and Qwen-Audio) were tested on three ASR datasets (LibriSpeech-Clean, CommonVoice, Tedlium3) using three different prompt templates:
  1. "Transcribe the given audio."
  2. "Read and give me the transcription."
  3. "Decode the audio and give me the written transcriptions."
- Results (Figure 2 from the original paper): The figure is a bar chart showing the performance of SALMONN and Qwen-Audio on the LibriSpeech-Clean, CommonVoice, and Tedlium3 datasets under the three prompts (Prompt-1, Prompt-2, and Prompt-3); the bar height for each dataset represents the model's score under a given prompt.
  - SALMONN showed significant variability:
    - On LibriSpeech and Tedlium3, Prompt 3 ("Decode the audio and give me the written transcriptions") caused SALMONN to perform phoneme recognition instead of ASR, leading to a drastic increase in WER (worse performance).
    - On CommonVoice, SALMONN tended to perform speech translation for a substantial number of samples, negatively impacting its WER.
    - This suggests SALMONN might be "overly tuned to speech features (tokens) and not sufficiently responsive to the prompts," indicating a lack of robust instruction following.
  - Qwen-Audio demonstrated stable performance across all three prompt templates, indicating better robustness to varying instructions.
- Implication: This analysis underscores the importance of evaluating models with diverse prompts. AudioBench addresses this by incorporating at least 20 diverse prompt templates for tasks without inherent prompt variety.
6.3. Model-as-Judge Comparison
The paper validated the choice of LLaMA-3-70B-Instruct as the open-source Model-as-Judge for open-ended generation tasks.
- Methodology:
- Outputs from the SALMONN model on CN-College-Listen (Speech Understanding), Clotho-AQA (Audio Scene Understanding), and VoxCeleb1-Accent (Voice Understanding) were used.
- These outputs, along with the golden answers and questions, were scored by GPT-4 (as the gold-standard judge) and three open-source candidates: Llama-3-8B-Instruct, Llama-3-70B-Instruct, and Prometheus-2. Spearman's rank correlation was then calculated to assess the agreement between each open-source judge and GPT-4 (a minimal sketch of this correlation check appears at the end of this subsection).
- Results (Figure 3 from the original paper): The figure is a heatmap showing the agreement between judge models across three datasets (CN-College-Listen, Clotho-AQA, VoxCeleb1-Accent); each cell gives the correlation between two judges' scores, with darker colors indicating stronger correlation.
- LLaMA-3-70B-Instruct exhibited the highest correlation with GPT-4-as-a-judge (specifically, gpt-4-turbo-0409), with correlation scores exceeding 0.85 across all three diverse datasets, indicating a "very strong correlation."
- Llama-3-8B-Instruct also showed strong correlations, though slightly lower than its 70B counterpart.
- Prometheus-2, despite being fine-tuned for evaluation, showed lower correlation scores than the LLaMA-3 models. This suggests that Prometheus-2's fine-tuning may not fully compensate for the limitations of its base model (Mistral).
- Conclusion: LLaMA-3-70B-Instruct was chosen as the Model-as-Judge for AudioBench due to its high correlation with GPT-4, robust generalizability, transparency, and ease of adaptability, providing an accessible and reliable solution for evaluating open-ended responses.
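As referenced in the methodology above, here is a minimal sketch of the judge-agreement check, assuming each judge has already produced a per-sample score list for the same set of model outputs. The judge names are real, but the score values are illustrative placeholders; scipy is used for Spearman's rank correlation.

```python
# Sketch: rank-correlation agreement between candidate judges and GPT-4 (illustrative scores).
from scipy.stats import spearmanr

# Hypothetical per-sample scores assigned by each judge to the same SALMONN outputs.
gpt4_scores = [5, 3, 4, 1, 2, 5, 4, 3]
candidate_judges = {
    "Llama-3-70B-Instruct": [5, 3, 4, 2, 2, 5, 4, 3],
    "Llama-3-8B-Instruct":  [4, 3, 5, 1, 2, 4, 4, 2],
    "Prometheus-2":         [3, 4, 4, 2, 1, 5, 3, 3],
}

for name, scores in candidate_judges.items():
    rho, p_value = spearmanr(gpt4_scores, scores)  # Spearman's rank correlation
    print(f"{name}: rho={rho:.3f} (p={p_value:.3f})")
```

A judge whose correlation with GPT-4 exceeds roughly 0.85, as reported for LLaMA-3-70B-Instruct, can reasonably be treated as an open-source stand-in for the gold-standard judge.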
7. Conclusion & Reflections
7.1. Conclusion Summary
This work introduces AudioBench, the first comprehensive evaluation benchmark for Audio Large Language Models (AudioLLMs). It addresses a critical gap in the field by providing a standardized and holistic evaluation framework for models capable of instruction-following based on audio signals. AudioBench encompasses 8 distinct tasks and 26 datasets, including 7 newly proposed ones, covering three crucial aspects: speech understanding, audio scene understanding, and voice understanding (paralinguistic features). The benchmark employs diverse prompt templates and input lengths to assess model robustness and validates LLaMA-3-70B-Instruct as a reliable and accessible Model-as-Judge for open-ended generation tasks. Through evaluation of five popular models, the study reveals that no single model currently excels across all criteria, indicating significant opportunities for future development. The authors anticipate that their open-sourced toolkit, data, and leaderboard will serve as a robust testbed to stimulate further progress in multimodal foundation models.
7.2. Limitations & Future Work
The authors acknowledge several limitations and propose future research directions:
- English-only Datasets: The current AudioBench is exclusively in English. Future work will expand to include multilingual capabilities and code-switching for comprehensive speech understanding and generation across diverse linguistic and cultural contexts.
- Challenges in Evaluating Free-style Generation: Evaluating open-ended, free-style generation remains a challenge. While Model-as-Judge is adopted, the issue of accurately grading such outputs, especially in zero-shot scenarios, is not fully resolved. Further development of robust and suitable evaluation metrics, particularly for audio inputs, is crucial.
- Efficiency vs. Accuracy Focus: The benchmark primarily focuses on accuracy, not efficiency. AudioLLMs often have large model sizes, leading to longer inference times. Future evaluations should also consider inference speed and the deployment environment for a comprehensive assessment.
In addition to addressing these limitations, the authors outline a broader research outlook for AudioLLMs:
- Long Audio Processing and Understanding: Extending AudioLLMs to process and understand long audio (beyond minutes) is crucial for applications like meeting summarization and sequential event understanding. Embedding speech content as tokens could facilitate this, leveraging advancements in long-sequence processing in text-based LLMs.
- Multi-round Query Handling: Enhancing the ability of open-source models to manage multi-round queries would enable more dynamic interactions, where each query could involve different modalities (e.g., images or audio).
- Multilingual Capabilities, Code-Switching, and Dialects: Expanding linguistic capabilities to handle multiple languages, code-switching, and various dialects is essential for global applicability.
- Speech Generation: Developing more sophisticated speech generation capabilities within AudioLLMs would lead to more natural and engaging human-computer interactions, including the ability to mimic human-like intonations and rhythms.
7.3. Personal Insights & Critique
This paper makes a highly valuable contribution to the rapidly evolving field of multimodal LLMs. The establishment of AudioBench fills a critical gap, providing a much-needed standardized framework for evaluating AudioLLMs that aims to match the rigor seen in text-based and vision-based LLM benchmarks.
One of the paper's strengths is its comprehensive approach, covering not just ASR but also audio scene understanding and paralinguistic features. This holistic view acknowledges the complexity of real-world audio, which goes far beyond just transcribing speech. The inclusion of newly curated datasets, especially for SQA and SI, is particularly innovative, as these tasks directly assess the instruction-following capabilities central to LLM design. The validation of LLaMA-3-70B-Instruct as a Model-as-Judge is also a practical contribution, offering an accessible and transparent evaluation method for open-ended generation, which is a persistent challenge in NLP.
However, some potential issues or areas for improvement could be considered:
- Subjectivity of Model-as-Judge: While LLaMA-3-70B-Instruct shows high correlation with GPT-4, Model-as-Judge evaluation still carries inherent subjectivity, being dependent on the judge model's biases and capabilities. Further research into hybrid metrics combining automated linguistic features with Model-as-Judge scores could offer more robust evaluations.
Generalizability of "Long-form Audio" Definition: The paper notes that long-form audio may exceed 10 minutes. While this is a practical definition, the challenges of long-form understanding can vary greatly (e.g., a 10-minute speech vs. a 2-hour meeting recording). A more nuanced categorization of long-form audio could reveal specific bottlenecks (e.g., memory limitations, long-range dependencies).
- Dataset Size for New Tasks: While 7 new datasets are introduced, some, like OpenHermes-Audio and ALPACA-Audio, have only 100 samples each after human filtering. While this is a starting point, it is a relatively small sample size for robustly evaluating instruction-following capabilities, especially for LLMs, which are notoriously data-hungry. Scaling these datasets up could provide more reliable insights.
- Beyond English Bias: While acknowledged as a limitation and future work, the current English-only focus leaves a significant portion of global linguistic diversity unaddressed. The challenges of multilingual AudioLLMs are substantial, including handling code-switching, diverse accents, and low-resource languages. AudioBench provides a strong foundation, but its "universality" will be truly tested by expansion into multilingual contexts.
- Efficiency Metrics: The focus on accuracy over efficiency is understandable for a first-of-its-kind benchmark. However, for practical deployment, inference speed, computational cost, and model size are crucial. Future iterations of AudioBench could integrate these metrics to guide the development of practical and efficient AudioLLMs.
Despite these points, AudioBench offers a much-needed, rigorous, and forward-looking framework. Its methodologies and datasets are highly transferable, potentially inspiring similar benchmarks for other multimodal LLMs (e.g., olfactory or tactile LLMs, should such modalities become prominent). The emphasis on open-sourcing the toolkit and data is commendable and vital for fostering collaborative research. The findings already provide clear directions for AudioLLM development, especially concerning long-form audio processing and enhancing robustness to diverse instructions. This paper sets a new standard for evaluating the "ears" of intelligent agents.