
AudioBench: A Universal Benchmark for Audio Large Language Models

Published: 04/01/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

AudioBench is introduced as a universal benchmark for Audio Large Language Models, covering 8 tasks and 26 datasets, including 7 new ones. It evaluates speech understanding, audio scene understanding, and voice (paralinguistic) understanding, addressing gaps in existing benchmarks for instruction-following capabilities. Five models are evaluated, and no single model excels consistently across all tasks.

Abstract

We introduce AudioBench, a universal benchmark designed to evaluate Audio Large Language Models (AudioLLMs). It encompasses 8 distinct tasks and 26 datasets, among which, 7 are newly proposed datasets. The evaluation targets three main aspects: speech understanding, audio scene understanding, and voice understanding (paralinguistic). Despite recent advancements, there lacks a comprehensive benchmark for AudioLLMs on instruction following capabilities conditioned on audio signals. AudioBench addresses this gap by setting up datasets as well as desired evaluation metrics. Besides, we also evaluated the capabilities of five popular models and found that no single model excels consistently across all tasks. We outline the research outlook for AudioLLMs and anticipate that our open-sourced evaluation toolkit, data, and leaderboard will offer a robust testbed for future model developments.

Mind Map

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

AudioBench: A Universal Benchmark for Audio Large Language Models

1.2. Authors

Bin Wang, Xunlong Zou, Geyu Lin, Shuo Sun, Zhuohan Liu, Wenyu Zhang, Zhengyuan Liu, AiTi Aw, Nancy F. Chen, all affiliated with the Institute for Infocomm Research (I²R), A*STAR, Singapore, and some also with the Centre for Frontier AI Research (CFAR), A*STAR. The primary contact is wang_bin@i2r.a-star.edu.sg.

1.3. Journal/Conference

The paper is published at the Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL) in 2025. NAACL is a highly reputable and influential conference in the field of Natural Language Processing (NLP) and computational linguistics. Publication at NAACL signifies that the work has undergone rigorous peer review and is recognized by experts in the field.

1.4. Publication Year

2025

1.5. Abstract

This paper introduces AudioBench, a universal benchmark designed to evaluate Audio Large Language Models (AudioLLMs). It features 8 distinct tasks and 26 datasets, including 7 newly proposed ones. The evaluation focuses on three key areas: speech understanding, audio scene understanding, and voice understanding (paralinguistic). The authors highlight a gap in existing benchmarks regarding the instruction-following capabilities of AudioLLMs conditioned on audio signals. AudioBench addresses this by providing appropriate datasets and evaluation metrics. The study also evaluates five popular models, revealing that no single model consistently excels across all tasks. The paper outlines future research directions for AudioLLMs and provides an open-sourced evaluation toolkit, data, and a leaderboard to serve as a robust testbed for future model development.

The paper is published at https://aclanthology.org/2025.naacl-long.218/ (PDF: https://aclanthology.org/2025.naacl-long.218.pdf). It is officially published at the NAACL 2025 conference.

2. Executive Summary

2.1. Background & Motivation

The core problem the paper aims to solve is the lack of a comprehensive and standardized benchmark for evaluating Audio Large Language Models (AudioLLMs). While Large Language Models (LLMs) and multimodal LLMs (vision-enhanced, video-enhanced) have seen a proliferation of benchmarks, AudioLLMs lag significantly. Existing evaluations for audio-language models are fragmented, using different datasets and limited task scopes, making systematic comparison and understanding of their capabilities intractable.

This problem is important because AudioLLMs are designed to interpret diverse audio inputs (speech, environmental sounds, paralinguistic features) and respond flexibly to user queries, necessitating a broad evaluation across various use cases. The existing evaluation regimes predominantly rely on conventional metrics and datasets that are not well-suited for assessing the instruction-following and open-ended generation capabilities expected of these advanced models.

The paper's entry point is to create a "universal benchmark" that covers the breadth of AudioLLMs' potential applications, going beyond traditional speech tasks to include audio scene understanding and paralinguistic feature interpretation. It also focuses on addressing the challenge of evaluating open-ended, instruction-following responses, which is a hallmark of modern LLMs.

2.2. Main Contributions / Findings

The paper makes several primary contributions:

  • Introduction of AudioBench: Proposes AudioBench, the first comprehensive evaluation benchmark specifically for general instruction-following AudioLLMs. This benchmark covers 8 distinct tasks and 26 datasets, significantly broadening the scope of evaluation for audio-language models.

  • Novel Datasets: Introduces 7 newly adapted or collected datasets (e.g., CN-College-Listen, DREAM-TTS, Public-SG-SpeechQA for Speech Question Answering; OpenHermes-Audio, ALPACA-Audio for Speech Instruction; WavCaps-QA, AudioCaps-QA for Audio Question Answering) to fill gaps in existing resources and better assess instruction-following capabilities.

  • Comprehensive Evaluation Scope: Defines three core aspects for evaluation: speech understanding, audio scene understanding, and voice understanding (paralinguistic features like emotion, accent, and gender), providing a holistic view of AudioLLM capabilities.

  • Robustness Queries: Integrates multiple prompt templates and varying input lengths to evaluate model robustness and generalizability, addressing the issue that models might overfit to seen instructions.

  • Model-as-Judge Validation: Investigates and validates the effectiveness of Model-as-Judge evaluation for open-ended generation, showing that LLaMA-3-70B-Instruct has a high correlation with GPT-4 in judgment tasks, offering an accessible and transparent alternative.

  • Extensive Model Evaluation: Evaluates four popular AudioLLMs (SALMONN, Qwen-Audio-Chat, WavLLM, Qwen2-Audio-Instruct) and one cascade model (Whisper+Llama3), revealing that no single model consistently excels across all tasks and highlighting areas for future improvement, especially in long-form audio processing and non-verbal audio understanding for end-to-end models.

  • Open-sourced Resources: Anticipates offering an open-sourced evaluation toolkit, data, and a leaderboard to foster future research and development in AudioLLMs.

    These findings collectively address the existing evaluation gap for AudioLLMs, provide a standardized framework, and highlight current strengths and weaknesses of state-of-the-art models, thereby guiding future research directions.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand this paper, a beginner should be familiar with the following foundational concepts:

  • Large Language Models (LLMs): These are advanced artificial intelligence models trained on vast amounts of text data to understand, generate, and respond to human language. They can perform various tasks like translation, summarization, and question answering. Examples include GPT-3, LLaMA, etc. LLMs are the backbone for multimodal LLMs.

  • Multimodal Large Language Models (Multimodal LLMs): These are extensions of LLMs that can process and understand information from multiple modalities, such as text, images, video, and audio. For example, a vision-LLM can understand images and respond to text queries about them. This paper focuses on AudioLLMs, which are multimodal LLMs specialized in audio.

  • Audio Large Language Models (AudioLLMs): These are a specific type of multimodal LLM that can interpret audio content (speech, environmental sounds, paralinguistic features) and generate text responses based on user instructions. They aim to combine the understanding capabilities of LLMs with the ability to process diverse audio inputs.

  • Instruction Following: A key capability of modern LLMs where the model can understand and execute commands or instructions given in natural language, even if it hasn't been explicitly trained on that exact instruction. For AudioLLMs, this means understanding a query about an audio input and generating an appropriate text response.

  • Benchmarks: In machine learning, a benchmark is a standardized set of tasks, datasets, and evaluation metrics used to compare the performance of different models objectively. Benchmarks are crucial for tracking progress and identifying limitations in a field.

  • Automatic Speech Recognition (ASR): The process of converting spoken language into written text. This is a fundamental task in speech processing and a component of many AudioLLMs.

  • Speech Question Answering (SQA): A task where a model answers questions based on the content of a spoken audio input. This requires both ASR capabilities and natural language understanding.

  • Audio Scene Understanding: The ability of a model to identify and interpret non-speech sounds in an audio input, such as environmental sounds (e.g., car horns, birds chirping, music).

  • Voice Understanding (Paralinguistic Features): The ability to extract information from speech beyond its semantic content, such as emotional state, accent, gender, or speaking style. These are often referred to as paralinguistic cues.

  • Word Error Rate (WER): A common metric for evaluating ASR systems. It measures the number of errors (substitutions, deletions, insertions) required to transform the recognized text into the reference text, divided by the total number of words in the reference text. A lower WER indicates better performance (a minimal computation sketch is given after this list). The formula for WER is: $ \mathrm{WER} = \frac{S + D + I}{N} $ Where:

    • $S$ is the number of substitutions (words replaced).
    • $D$ is the number of deletions (words removed).
    • $I$ is the number of insertions (words added).
    • $N$ is the total number of words in the reference (ground truth) transcription.
  • METEOR Score (Metric for Evaluation of Translation with Explicit Ordering): A metric used for evaluating machine translation and text generation tasks, including audio captioning. It measures the alignment between a machine-generated text and one or more reference texts. It considers exact word matches, stemmed word matches, and synonym matches, as well as word order (n-gram-based penalties). A higher METEOR score indicates better quality. The METEOR score is calculated based on a weighted harmonic mean of precision ($P$) and recall ($R$), with a penalty for fragmentation ($F_p$): $ \mathrm{METEOR} = \mathrm{HM}(\alpha P, (1-\alpha)R) \times (1 - F_p) $ Where:

    • $\mathrm{HM}$ is the harmonic mean.
    • $P$ is precision, calculated as the number of matched unigrams divided by the number of unigrams in the candidate (generated) text.
    • $R$ is recall, calculated as the number of matched unigrams divided by the number of unigrams in the reference text.
    • $\alpha$ is a parameter that weights precision and recall (typically 0.9 for METEOR).
    • $F_p$ is a fragmentation penalty, which accounts for the chunking of matched words and discourages fragmented alignments.
  • Model-as-Judge: An evaluation paradigm where another LLM (often a more powerful one like GPT-4) is used to score the quality of responses generated by a target model, especially in open-ended generation tasks where traditional metrics are difficult to apply.
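As a concrete illustration of the WER definition above, here is a minimal Python sketch that computes WER with a standard word-level edit-distance dynamic program. It is an illustrative implementation, not the exact scoring code used by AudioBench; any text normalization applied before scoring is omitted.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Compute WER = (S + D + I) / N via word-level edit distance."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # dp[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i          # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j          # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# Example: one substitution ("jon" vs. "john") over four reference words -> WER = 0.25
print(word_error_rate("this is john davis", "this is jon davis"))
```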

3.2. Previous Works

The paper frames its contribution within the broader context of multimodal LLM benchmarks. Here's a summary of previous works and key background information:

  • General LLM Benchmarks:

    • Hendrycks et al., 2021: Introduced Measuring Massive Multitask Language Understanding (MMLU), a benchmark for evaluating LLMs across various subjects and tasks. This provided a holistic evaluation for text-based LLMs.
    • Cobbe et al., 2021a, Wang et al., 2024, Rein et al., 2023: Other examples of benchmarks for text-based LLMs, covering aspects like reasoning, subject knowledge, safety, and multilingual capabilities. These benchmarks primarily focus on text input and output.
  • Vision-enhanced Multimodal LLM Benchmarks:

    • Marino et al., 2019: Introduced OK-VQA (Outside Knowledge Visual Question Answering), a benchmark requiring external knowledge to answer questions about images.
    • Yue et al., 2023: Proposed MMMU, a massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI, covering various academic and professional domains with visual input.
    • Padlewski et al., 2024: Introduced Vibe-Eval, a hard evaluation suite for measuring progress of multimodal LLMs.
    • Other works like Yu et al., 2023 and Liu et al., 2023 (e.g., MMBench) focus on perception tests for Vision-LLMs.
  • Video-enhanced Multimodal LLM Benchmarks:

    • Xiao et al., 2021: Focused on video question answering, such as Next-QA.
    • Li et al., 2023, Ning et al., 2023, Liu et al., 2024, Fu et al., 2024: These benchmarks assess video understanding, which inherently includes both visual and audio elements. However, the paper notes that these benchmarks predominantly rely on visual inputs, with audio often serving as a supplementary feature, not the primary modality for understanding.
  • Advancements in AudioLLMs (Models, not Benchmarks): The paper mentions several multimodal foundation models that enhance speech and audio understanding, typically through cascaded methods (separate components for audio processing and language processing) or integrated multitask optimization (joint training across various audio-text tasks).

    • Cascaded Models: Whisper (Huang and Tsai, 2023) for ASR followed by LLaMA for reasoning is a typical example.
    • Integrated Models:
      • AudioGPT (Huang et al., 2023): One of the earlier models aiming for diverse audio tasks.
      • SpeechGPT (Zhang et al., 2023a): Focuses on cross-modal conversational abilities.
      • SALMONN (Tang et al., 2024): A model aiming for generic hearing abilities.
      • Qwen-Audio (Chu et al., 2023) and Qwen2-Audio (Chu et al., 2024): Models for universal audio understanding.
      • AudioPALM (Rubenstein et al., 2023): A large language model that can speak and listen.
      • WavLLM (Hu et al., 2024a): Towards robust and adaptive speech large language model.

3.3. Technological Evolution

The technological evolution leading to AudioBench can be summarized as:

  1. Rise of Text-based LLMs: Initial foundation models focused solely on text, demonstrating strong capabilities in language understanding and generation. Benchmarks like MMLU helped quantify their progress.
  2. Expansion to Vision-LLMs: Researchers extended LLMs to incorporate visual information, leading to vision-language models capable of image understanding and visual question answering. Benchmarks like MMMU emerged to evaluate these.
  3. Inclusion of Video-LLMs: Further expansion incorporated video, which includes both visual and audio streams. While these benchmarks touched upon audio, the audio component was often secondary.
  4. Emergence of dedicated AudioLLMs: Recognizing the distinct challenges and opportunities of audio, specialized AudioLLMs began to appear, integrating speech, environmental sounds, and paralinguistic features. However, a unified benchmark for these models was missing, with models often evaluated on disparate datasets (e.g., Qwen-Audio-Chat on 12 datasets, SALMONN on 15, with only two in common).
  5. AudioBench's Role: AudioBench steps in at this point to standardize the evaluation of AudioLLMs by providing a comprehensive, multi-task, and instruction-following oriented benchmark, including new datasets and robust evaluation methodologies like Model-as-Judge. It aims to address the shortcomings of previous audio evaluations (e.g., SUPERB, Dynamic-SUPERB) which were either not focused on AudioLLMs or lacked comprehensive coverage.

3.4. Differentiation Analysis

Compared to previous benchmarks and evaluation practices, AudioBench introduces several core differences and innovations:

  • Comprehensive Scope for Audio: Unlike video-LLM benchmarks where audio is supplementary, AudioBench specifically focuses on the primary role of audio understanding across three critical aspects: speech understanding, audio scene understanding, and voice understanding (paralinguistic features). This holistic coverage is novel for a dedicated AudioLLM benchmark.
  • Instruction-Following Focus: The benchmark is explicitly designed for "general instruction-following Audio Large Language Models." This differentiates it from traditional audio benchmarks that often have constrained output spaces (e.g., classification tasks for emotion recognition). AudioBench emphasizes evaluating AudioLLMs in open-ended generation scenarios, mimicking real-world user interactions.
  • Novel and Curated Datasets: AudioBench includes 7 newly proposed datasets (e.g., CN-College-Listen, DREAM-TTS, Public-SG-SpeechQA, OpenHermes-Audio, ALPACA-Audio, WavCaps-QA, AudioCaps-QA) specifically tailored to address gaps in existing resources for evaluating instruction following and reasoning in audio contexts. This goes beyond simply aggregating existing datasets.
  • Robustness Evaluation: The benchmark incorporates multiple instruction templates and varying input lengths (from seconds to minutes, including long-form audio) to rigorously test the models' robustness and generalizability to diverse prompts and audio durations. This addresses the observed issue of models being sensitive to instruction variations.
  • Validated Model-as-Judge Approach: For open-ended generation, AudioBench validates and adopts a Model-as-Judge approach, providing an affordable and transparent alternative (LLaMA-3-70B-Instruct) to GPT-4 based on strong correlation studies. This is crucial for evaluating unconstrained outputs where traditional metrics fall short.
  • Contrast with SUPERB and Dynamic-SUPERB: The paper explicitly differentiates from SUPERB (Yang et al., 2024b), which is designed for evaluating self-supervised speech encoders with a supervised fine-tuning step, and Dynamic-SUPERB (Huang et al., 2024a), which, despite allowing zero-shot instruction following, is a crowdsourced collection lacking a specific focus on AudioLLMs as argued by Yang et al. (2024a). AudioBench is specifically designed for AudioLLMs and their unique challenges.
  • Comparison with AIR-Bench: AudioBench acknowledges AIR-Bench (Yang et al., 2024a) as a concurrent work but highlights key differences: AudioBench has broader dataset coverage (6 new datasets), includes multiple ASR, SQA, and SI datasets not in AIR-Bench, explicitly handles prompt variants and model robustness, and provides a holistic study on evaluation metrics.

4. Methodology

4.1. Principles

The core principle behind AudioBench is to comprehensively evaluate Audio Large Language Models (AudioLLMs) by assessing their ability to interpret diverse audio content and flexibly respond to user queries. This involves moving beyond traditional, constrained evaluation metrics and tasks to embrace the instruction-following and open-ended generation capabilities expected of modern LLMs. The benchmark is structured around three key aspects of audio understanding: speech understanding, audio scene understanding, and voice understanding (paralinguistic features). For each aspect, it combines existing datasets with newly curated ones and employs evaluation metrics suitable for both constrained and open-ended responses, including a validated Model-as-Judge approach. A critical aspect is also to evaluate model robustness to varying instructions and audio lengths.

4.2. Core Methodology In-depth (Layer by Layer)

The methodology of AudioBench can be broken down into several layers: defining evaluation scope, dataset curation, evaluation setup, and metric selection.

4.2.1. Evaluation Scope Definition

The benchmark targets three main aspects of audio understanding, crucial for AudioLLMs to handle diverse inputs and user queries:

  1. Speech Understanding: Focuses on interpreting the semantic content of spoken language.
  2. Audio Scene Understanding: Concentrates on understanding non-human audio events and environmental sounds.
  3. Voice Understanding (Paralinguistic): Aims at recognizing human-related information beyond speech content, such as emotion, accent, and gender.

4.2.2. Dataset Curation

AudioBench comprises 8 distinct tasks and 26 datasets, with 7 datasets newly proposed or adapted to fill existing gaps. The total test suite includes over 400 hours of audio across 100k+ samples. The tasks and datasets are organized as follows:

  • Speech Understanding Tasks:

    • Automatic Speech Recognition (ASR):

      • Purpose: To convert spoken content into text. It measures accuracy of speech-to-text conversion.

      • Datasets: 9 datasets are included, 3 of which feature long-form audio (e.g., Tedlium3-Longform, Earning-21, Earning-22).

      • Long-form Audio Handling: For AudioLLMs that struggle with long audio files (exceeding 10 minutes), the long audio is segmented into smaller chunks, and then reassembled for assessment.

      • Example (Table 4 from the original paper): The following are the results from Table 4 of the original paper:

        Dataset Context Instruction (Example) Answer
        LibriSpeech-Clean (Audio: "No, I wasn't thinking of that.") Turn the speech input into a text transcription. No, I wasn't thinking of that.
        LibriSpeech-Other (Audio: "The history of the house is plain now.") Decode the audio and give me the written transcription. The history of the house is plain now.
        CommonVoice (Audio: "This is Jon Davis.") Process the audio speech and provide the text output. This is Jon Davis.
        PeoplesSpeech (Audio: "that's where you have a lot of windowsin ...") Convert the audio speech into a text transcript. that's where you have a lot of windows in …
        GigaSpeech (Audio: "I believe that fast fashion has made itpossible for everyone to have access to aesthet-ically thoughtful clothing.") Transform the speech into a text document. I believe that fast fashion has made it possiblefor everyone to have access to aestheticallythoughtful clothing.
        Earning-21 (Audio: "Good morning, everyone, and welcometo the NextEra Energy Inc. and NextEra En-ergy Partners...") Convert the audio speech into a text transcript. Good morning, everyone, and welcome to theNextEra Energy Inc. and NextEra Energy Part-ners...
        Earning-22 (Audio: "Good day, ladies and gentlemen, andwelcome to the CD Projekt Group financialresults for H1 2021 conference call. Today'scall...") Transform the speech into a text document. Good day, ladies and gentlemen, and welcometo the CD Projekt Group financial results forH1 2021 conference call. Today's cal..
        Tedlium3 (Audio: "One day, Los Angeles Times columnistSteve Lopez was walking along the streets.…") Turn the speech input into a text transcription. One day, Los Angeles Times columnist SteveLopez was walking along the strets….
        Tedlium3-Longform (Audio: "I'd like to share with you a discoveryI made a few months ago while writing anarticle for Italian Wired. I always keep mythesaurus handy whenever..") Process the audio speech and provide the text output. I'd like to share with you a discovery I madea few months ago while writing an article forItalian Wired. I always keep my thesaurushandy whenever...
    • Speech Question Answering (SQA):

      • Purpose: To assess models' ability to respond to questions based on speech-related audio, including both monologue and dialogue understanding.

      • New Datasets: CN-College-Listen, DREAM-TTS, and Public-SG-SpeechQA.

        • CN-College-Listen: Collected from English listening comprehension sections of China's national college entrance exams. 271 manually curated questions are combined with 2,000 questions from Hu et al. (2024a). Questions are presented in a free-text QA format, with correct multiple-choice options serving as reference answers.
        • DREAM-TTS: Built upon the text-based dialogue comprehension dataset DREAM (Sun et al., 2019). A SOTA Text-to-Speech (TTS) engine (Casanova et al., 2024) is used to convert text dialogues into spoken format, preserving gender consistency.
        • Public-SG-SpeechQA: Four public speaking videos from Singaporean politicians with clean transcriptions. Transcriptions were manually segmented, and LLMs generated five questions per segment. Questions and reference answers were manually reviewed, resulting in 688 speech QA pairs.
      • Example (Table 5 from the original paper): The following are the results from Table 5 of the original paper:

        Dataset Context Instruction Answer
        CN-College-Exam-Listening (Audio: "F: Excuse me, this is the address, where do I find it? M: All right, you need a street map, here is one, and I will show you where it is.") Question: What does the woman want to do? Choices: (A) Find a place. (B) Buy a map. (C) Get an address. (A) Find a place.
        SLUE-P2-SQA5 (Audio: "1 Climate Stephens City is located in the humid subtropical climate zone ( Köppen climate classification : Cfa ) ...") Which regions have temperate climates? mid-atlantic
        DREAM-TTS (Audio: "F: The movie next Tuesday has been canceled due to a lack of interest. M: What do you mean? F: Well, by last night only a few tickets been sold.") Question: What can we conclude about the movie? Choices: (A) They want to buy the tickets for the movie. (B) The tickets for the movie were sold. (C) The movie will not be shown. (C) The movie will not be shown.
        Public-SG-SpeechQA (Audio: "Today, speaking to a roomful of economists, I am inclined to confine myself to talk about markets and price, dollars ...") How can economics help solve complex healthcare challenges, as mentioned by the speaker? Economics can help solve complex healthcare ...
    • Speech Instruction (SI):

      • Purpose: To evaluate the model's ability to directly follow instructions provided through speech input. The question is audio, and the response is text.

      • New Datasets: OpenHermes-Audio and ALPACA-Audio.

        • Data Generation: Audio instructions are synthesized (Casanova et al., 2024) from existing instruction-following datasets ALPACA (Taori et al., 2023) and OpenHermes (Teknium, 2023).
        • Human Filtering: Humans select instances to ensure accurate speech synthesis and suitability as spoken instructions. Only 10.5% of samples are retained after filtering.
      • Example (Table 8 from the original paper): The following are the results from Table 8 of the original paper:

        Dataset Context Instruction (Sample) Answer
        OpenHermes-Audio (Audio: "Pretend to be Ghost, expressingfrustration to Soap that they're no closerto finding the elusive enemy leader afterweeks of searching.") Please follow the instruction in the speech It feels like we've been chasing ghosts, Soap.After all these weeks, we're still no closer tofinding the bastard. It's bloody infuriating.
        ALPACA-Audio (Audio: "identify the type of the sentence:she can play guitar.") Please follow the instruction in the speech The type of sentence is a declarative sentence.
  • Audio Scene Understanding Tasks:

    • Audio Question Answering (AQA):

      • Purpose: Focuses on understanding environmental contexts and following instructions related to non-speech audio.

      • New Datasets: WavCaps-QA and AudioCaps-QA.

        • Data Generation: The Clotho-AQA dataset is refined by retaining high-confidence samples. For WavCaps-QA and AudioCaps-QA, LLaMA-3-8B-Instruct is used to generate questions and answers from existing captions, followed by human annotation and revision for validation.
      • Example (Table 6 from the original paper): The following are the results from Table 6 of the original paper:

        Dataset Context Instruction Answer
        Clotho-AQA (Audio: Wave sound) Are there waves? yes
        WavCaps-QA (Audio: Electronic Music playing) What type of sound is being played? The sound being played is music.
        AudioCaps-QA (Audio: Mechanical vibration sound) What type of object or equipment is likely to produce a constant rattling noise and sharp vibrations? A loose or worn-out bolt or screw on a machine or equipment is likely to produce a constant rattling noise and sharp vibrations.
    • Audio Captioning (AC):

      • Purpose: To generate textual descriptions (captions) for an audio clip.

      • Datasets: WavCaps and AudioCaps.

      • Example (Table 9 from the original paper): The following are the results from Table 9 of the original paper:

        Dataset Context Instruction (Sample) Answer
        WavCaps (Audio: Electronic Music playing) Detail the ambient sounds included in the audio. A sound is playing.
        AudioCaps (Audio: Mechanical vibration sound) Describe the environmental sounds in the audio. Constant rattling noise and sharp vibrations
  • Voice Understanding Tasks:

    • Emotion Recognition (ER):

      • Purpose: To interpret emotions conveyed through human speech or non-speech content.

      • Datasets: IEMOCAP-Emotion, MELD-Sentiment, MELD-Emotion.

      • Example (Table 7 from the original paper): The following are the results from Table 7 of the original paper:

        Dataset Context Instruction (Sample) Answer
        IEMOCAP-Emotion (Audio: "Thank you.") Can you interpret the emotions in the speaker'sspee (frustration, anger, excited, neutral, happi-ness, surprise, sad)? From the speaker's speech, it seemsthey are in a sad state.
        MELD-Sentiment (Audio: "Yeah, I'm not in that.") What sentiment signals can you hear in the speaker's-speech (neutral, positive, negative)? From the speaker's speech, it seemsthey are in a neutral sentiment state.
        MELD-Emotion (Audio: "Yeah, I'm not in that.") How does the speaker's speech reflect their emotionalstate (neutral, joy, disgust, sadness, surprise, anger,fear)? Based on the speaker's speech patterns,it seems like they are feeling neutral.
    • Accent Recognition (AR):

      • Purpose: To predict a speaker's likely origin based on their accent.

      • Dataset: VoxCeleb1-Accent (metadata from VoxCeleb1).

      • Example (Table 10 from the original paper): The following are the results from Table 10 of the original paper:

        Dataset Context Instruction (Sample) Answer
        VoxCeleb-Accent (Audio: "because I was trying to get away from her, but where did you go? what did you do?") Based on their accent, where is the speaker most likely from? From the audio, I guess the speaker is from USA.
    • Gender Recognition (GR):

      • Purpose: To recognize gender based on vocal characteristics.

      • Datasets: VoxCeleb1-Gender, IEMOCAP-Gender.

      • Example (Table 11 from the original paper): The following are the results from Table 11 of the original paper:

        Dataset Context Instruction (Sample) Answer
        VoxCeleb-Gender (Audio: "because I was trying to get away from her, but where did you go? what did you do?") Can you determine the gender of the speaker from the audio (Male or Female)? The speaker is a female.
        IEMOCAP-Gender (Audio: "For God's sake, Augie. It's- Grow up, we're not going to see them.") From the audio, can you determine the speaker's gender (Male or Female)? Based on the auditory cues, it sounds like the speaker is a male.

The overall structure of AudioBench datasets is visualized in Figure 4 from the original paper:

Figure 4: Structure of AudioBench datasets. The figure is a schematic of the AudioBench dataset organization used to evaluate AudioLLMs, showing how the tasks and datasets (e.g., automatic speech recognition, emotion recognition, speech instruction) are grouped.
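To make the caption-to-QA construction behind WavCaps-QA and AudioCaps-QA (Section 4.2.2) more concrete, the sketch below shows how question-answer pairs could be generated from audio captions with an instruction-tuned LLM via Hugging Face transformers. The prompt wording, generation settings, and output parsing are illustrative assumptions rather than the authors' exact pipeline, which additionally relies on human annotation and revision.

```python
from transformers import pipeline

# Assumption: any local instruction-tuned chat model can stand in here;
# the paper reports using LLaMA-3-8B-Instruct for this step.
generator = pipeline("text-generation", model="meta-llama/Meta-Llama-3-8B-Instruct")

def caption_to_qa(caption: str) -> str:
    """Ask the LLM to write one question/answer pair grounded in an audio caption."""
    messages = [
        {"role": "system", "content": "You write question-answer pairs about audio clips."},
        {"role": "user", "content": f"Audio caption: '{caption}'.\n"
                                    "Write one question answerable from the caption, then the answer.\n"
                                    "Format: Question: ... Answer: ..."},
    ]
    out = generator(messages, max_new_tokens=128, do_sample=False)
    # With chat-format input, recent transformers versions return the full message list;
    # the last entry is the newly generated assistant turn.
    return out[0]["generated_text"][-1]["content"]

print(caption_to_qa("Electronic music is playing with a steady beat."))
```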

4.2.3. Evaluation Setup

  • Prompt Templates: To assess model compatibility with different instructions and robustness, AudioBench incorporates a variety of prompt templates. This is crucial because some models might perform well on seen instructions but fail to generalize to unseen ones. For tasks without inherently diverse prompts (like ASR, ER, AR, GR, AC), at least 20 diverse prompt templates are used.
  • Input Length Variation: The benchmark varies input audio lengths from seconds to minutes to better assess model performance on longer audio sequences, an area where current AudioLLMs often struggle.
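As a small illustration of the prompt-template strategy described above, the sketch below rotates a pool of ASR instruction templates across test samples. The template wordings are taken from the examples in this analysis; the random assignment scheme is an assumption, not the benchmark's exact implementation.

```python
import random

# A few of the >=20 ASR instruction templates (wordings from the examples above).
ASR_PROMPTS = [
    "Turn the speech input into a text transcription.",
    "Decode the audio and give me the written transcription.",
    "Process the audio speech and provide the text output.",
    "Convert the audio speech into a text transcript.",
    "Transform the speech into a text document.",
]

def attach_prompts(samples, seed=0):
    """Pair every audio sample with a randomly chosen instruction template."""
    rng = random.Random(seed)
    return [{"audio": s, "instruction": rng.choice(ASR_PROMPTS)} for s in samples]

print(attach_prompts(["clip_001.wav", "clip_002.wav"]))
```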

4.2.4. Metric Selection

Given the open-ended generation nature of AudioLLMs, traditional metrics are often insufficient.

  • Model-as-Judge (M.J.): This approach is employed for most tasks to evaluate the quality of open-ended generated responses. The M.J. scores are rescaled to a 100-point scale for easier comparison.
  • Word Error Rate (WER): Used as the primary metric for ASR tasks.
  • METEOR Score: Utilized as an additional measure for Audio Captioning (AC) tasks, alongside M.J.

4.2.5. Model-as-Judge Validation

To ensure the reliability and accessibility of the Model-as-Judge approach:

  1. Candidate Models: Three open-source models (Llama-3-8B-Instruct, Llama-3-70B-Instruct, Prometheus-2) are compared against GPT-4 (a closed-source, high-performing judge model).

  2. Validation Process: Model outputs (from SALMONN model) for three diverse datasets (CN-College-Listen, Clotho-AQA, VoxCeleb1-Accent) are fed into the candidate judge models and GPT-4.

  3. Correlation Study: Spearman's rank correlation is calculated between the scores from the open-source judges and GPT-4. The formula for Spearman's rank correlation coefficient ($\rho$) is: $ \rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)} $ Where:

    • $d_i$ is the difference between the ranks of corresponding observations (scores) for the $i$-th pair.
    • $n$ is the number of observations (pairs of scores). Spearman's rank correlation assesses how well the relationship between two variables can be described using a monotonic function. A value close to 1 indicates a strong positive monotonic relationship, and a value close to -1 indicates a strong negative monotonic relationship.
  4. Finding: LLaMA-3-70B-Instruct shows the highest correlation (exceeding 0.85 across all three datasets) with GPT-4, indicating a "very strong correlation." Consequently, LLaMA-3-70B-Instruct is chosen as the default judging model for AudioBench experiments due to its robustness and accessibility.

    The template used for the Model-as-Judge (Llama-3-70B-Instruct, Llama-3-8B-Instruct, and GPT-4) is shown in Table 3 of the original paper; Prometheus2 uses a slightly different template. The following is the content of Table 3 from the original paper:

Judgement Model: Llama-3-70B-Instruct, Llama-3-8B-Instruct, and GPT-4

Template:
[Reference Answer] {reference}
[Model Answer] {prediction}
[Question] {question}
[Task] Rate the model's answer based on its alignment with the reference answer, focusing on accuracy and relevance to the reference provided. Please be critical on the details.
Criteria: Assess if the model's response mirrors the reference in terms of content, accuracy, and relevance.
Score 0: The answer is completely misaligned, providing incorrect or irrelevant information compared to the reference.
Score 1: The answer shows minimal alignment, often misunderstanding or providing irrelevant details unrelated to the reference.
Score 2: The answer recognizes the topic but diverges significantly from the reference in accuracy or relevance.
Score 3: The answer aligns with the reference generally but lacks detail or precise accuracy in some aspects.
Score 4: The answer is mostly accurate and relevant, closely following the reference but could be clearer or more detailed.
Score 5: The answer is highly accurate, detailed, and matches the reference answer perfectly, capturing its essence and detail.

Judgement Model: Prometheus2

Template:
Your response should be formatted as follows: Explanation: (Provide a concise explanation of your rating, comparing the reference answer with the model's response. "The reference answer is [XXX], while the model's answer is [YYY]. I think ...") Rating: (int)
"criteria": "Does the model provide accurate, relevant, and contextually appropriate responses to user inquiries?"
"score1_description": "The model frequently fails to understand or address the core of the user's inquiries, providing inaccurate, irrelevant, or inappropriate responses."
"score2_description": "The model occasionally recognizes the topic of inquiry but often provides responses that are not sufficiently accurate, detailed, or contextually relevant."
"score3_description": "The model usually understands the question and attempts to provide a relevant answer, yet the responses may sometimes lack detail, accuracy, or context."
"score4_description": "The model consistently understands and appropriately addresses the questions, providing accurate and relevant responses. However, there may still be minor inaccuracies or instances where additional context could enhance clarity."
"score5_description": "The model excels in understanding user inquiries and consistently delivers accurate, detailed, and contextually appropriate responses that thoroughly address the user's needs."

5. Experimental Setup

5.1. Datasets

AudioBench uses a comprehensive set of 26 datasets across 8 tasks, totaling over 400 hours of audio and 100k+ samples. Seven of these datasets are newly proposed or adapted for this benchmark. The datasets are categorized into Speech Understanding, Audio Scene Understanding, and Voice Understanding.

Speech Understanding Datasets:

  • ASR (Automatic Speech Recognition):
    • LibriSpeech-Clean (Panayotov et al., 2015): A large corpus of read English speech, derived from audiobooks, containing clean audio.
    • LibriSpeech-Other (Panayotov et al., 2015): Similar to LibriSpeech-Clean but with more challenging, "other" speech conditions (e.g., lower quality).
    • CommonVoice-15 (Ardila et al., 2020): A massively-multilingual speech corpus from Mozilla, version 15 specifically.
    • PeoplesSpeech (Galvez et al., 2021): A large-scale, diverse English speech recognition dataset for commercial usage.
    • GigaSpeech (Chen et al., 2021): An evolving, multi-domain ASR corpus with 10,000 hours of transcribed audio.
    • Tedlium3 (Rousseau et al., 2012): A corpus derived from TED talks.
    • Tedlium3-Longform: A long-form version of Tedlium3.
    • Earning-21 (Del Rio et al., 2021): A benchmark for ASR in the wild, likely from earnings calls.
    • Earning-22 (Del Rio et al., 2022): Another earnings call benchmark, focusing on accents.
  • SQA (Speech Question Answering):
    • CN-College-Listen: Newly curated. Questions from English listening comprehension sections of China's national college entrance examinations, compiled with additional questions from Hu et al. (2024a). Example: (Audio of a dialogue) "Question: What does the woman want to do?" Answer: "(A) Find a place."
    • SLUE-P2-SQA5 (Shon et al., 2022): From the SLUE Phase-2 benchmark suite for spoken language understanding tasks.
    • DREAM-TTS: Newly curated. Built upon the text-based dialogue comprehension dataset DREAM (Sun et al., 2019), converted to speech using TTS. Example: (Audio of a dialogue) "Question: What can we conclude about the movie?" Answer: "(C) The movie will not be shown."
    • Public-SG-SpeechQA: Newly curated. Public speaking videos from Singaporean politicians, manually segmented with LLM-generated and human-reviewed QA pairs. Example: (Audio of speech) "How can economics help solve complex healthcare challenges, as mentioned by the speaker?" Answer: "Economics can help solve complex healthcare..."
  • SI (Speech Instruction):
    • OpenHermes-Audio: Newly curated. Synthesized audio instructions from the OpenHermes (Teknium, 2023) instruction-following dataset, human-filtered. Example: (Audio: "Pretend to be Ghost, expressing frustration to Soap that they're no closer to finding the elusive enemy leader after weeks of searching.") Expected answer: "It feels like we've been chasing ghosts, Soap. After all these weeks, we're still no closer to finding the bastard. It's bloody infuriating."
    • ALPACA-Audio: Newly curated. Synthesized audio instructions from the ALPACA (Taori et al., 2023) instruction-following dataset, human-filtered. Example: (Audio: "identify the type of the sentence: she can play guitar.") Expected answer: "The type of sentence is a declarative sentence."

Audio Scene Understanding Datasets:

  • AQA (Audio Question Answering):
    • Clotho-AQA (Lipping et al., 2022): A crowdsourced dataset for audio question answering.
    • WavCaps-QA: Newly curated. Questions and answers generated from WavCaps (Mei et al., 2023) captions using LLMs, human-verified. Example: (Audio: "Electronic Music playing") "What type of sound is being played?" Answer: "The sound being played is music."
    • AudioCaps-QA: Newly curated. Questions and answers generated from AudioCaps (Kim et al., 2019) captions using LLMs, human-verified. Example: (Audio: "Mechanical vibration sound") "What type of object or equipment is likely to produce a constant rattling noise and sharp vibrations?" Answer: "A loose or worn-out bolt or screw on a machine or equipment is likely to produce a constant rattling noise and sharp vibrations."
  • AC (Audio Captioning):
    • WavCaps (Mei et al., 2023): A ChatGPT-assisted weakly-labeled audio captioning dataset.
    • AudioCaps (Kim et al., 2019): A dataset for generating captions for audios in the wild.

Voice Understanding Datasets:

  • ER (Emotion Recognition):
    • IEMOCAP-Emotion (Busso et al., 2008): The Interactive Emotional Dyadic Motion Capture database. Example: (Audio: "Thank you.") "Can you interpret the emotions in the speaker's speech (frustration, anger, excited, neutral, happiness, surprise, sad)?" Answer: "From the speaker's speech, it seems they are in a sad state."
    • MELD-Sentiment (Poria et al., 2019): A multimodal multi-party dataset for emotion recognition in conversations.
    • MELD-Emotion (Poria et al., 2019): Similar to MELD-Sentiment, focusing on specific emotions.
  • AR (Accent Recognition):
    • VoxCeleb1-Accent (Nagrani et al., 2017): Metadata from the VoxCeleb1 dataset used to predict speaker's likely origin based on accent. Example: (Audio of speech) "Based on their accent, where is the speaker most likely from?" Answer: "From the audio, I guess the speaker is from USA."
  • GR (Gender Recognition):
    • VoxCeleb1-Gender (Nagrani et al., 2017): Metadata from VoxCeleb1 used to determine speaker gender.

    • IEMOCAP-Gender (Busso et al., 2008): From the IEMOCAP dataset, used to determine speaker gender. Example: (Audio of speech) "From the audio, can you determine the speaker's gender (Male or Female)?" Answer: "Based on the auditory cues, it sounds like the speaker is a male."

      The following table (Table 1 from the original paper) summarizes the statistics of the AudioBench datasets:

| Category | Dataset Name | #Samples | Hours | Avg.L / Min.L / Max.L (s) | Metrics |
| --- | --- | --- | --- | --- | --- |
| Speech Understanding | | | | | |
| ASR | LibriSpeech-Clean | 2.6k | 5.40 | 7.43 / 1.28 / 34.96 | WER (↓) |
| ASR | LibriSpeech-Other | 2.9k | 5.34 | 6.55 / 1.47 / 34.51 | WER (↓) |
| ASR | CommonVoice-15 | 16k | 26.95 | 5.93 / 1.34 / 105.67 | WER (↓) |
| ASR | PeoplesSpeech | 32k | 59.20 | 6.54 / 1.00 / 99.91 | WER (↓) |
| ASR | GigaSpeech | 18k | 35.09 | 6.77 / 1.00 / 22.03 | WER (↓) |
| ASR | Tedlium3 | 1.1k | 2.61 | 8.24 / 1.07 / 32.55 | WER (↓) |
| ASR | Tedlium3-Longform | 11 | 2.61 | 0.8k / 0.3k / 1.6k | WER (↓) |
| ASR | Earning-21 | 44 | 39.26 | 3.2k / 1k / 5.7k | WER (↓) |
| ASR | Earning-22 | 125 | 119.88 | 3.4k / 0.87k / 7.4k | WER (↓) |
| SQA | CN-College-Listen | 2.2k | 13.3 | 21.09 / 5.76 / 137.82 | M.J. (↑) |
| SQA | SLUE-P2-SQA5 | 408 | 4.5 | 39.85 / 13.00 / 40.0 | M.J. (↑) |
| SQA | DREAM-TTS | 1.9k | 18.1 | 34.14 / 3.14 / 261.85 | M.J. (↑) |
| SQA | Public-SG-SpeechQA | 688 | 7.6 | 39.86 / 15.78 / 95.71 | M.J. (↑) |
| SI | OpenHermes-Audio | 100 | 0.16 | 5.95 / 2.04 / 15.77 | M.J. (↑) |
| SI | ALPACA-Audio | 100 | 0.12 | 4.32 / 1.80 / 8.85 | M.J. (↑) |
| Audio Scene Understanding | | | | | |
| AQA | Clotho-AQA | 2.2k | 14.1 | 22.59 / 15.03 / 29.97 | M.J. (↑) |
| AQA | WavCaps-QA | 304 | 0.87 | 10.28 / 1.0 / 30.62 | M.J. (↑) |
| AQA | AudioCaps-QA | 313 | 0.86 | 9.86 / 3.27 / 10.00 | M.J. (↑) |
| AC | WavCaps | 1.7k | 4.9 | 10.22 / 1.00 / 30.97 | M.J. (↑) & METEOR (↑) |
| AC | AudioCaps | 4.4k | 12.1 | 9.86 / 1.74 / 10.0 | M.J. (↑) & METEOR (↑) |
| Voice Understanding | | | | | |
| ER | IEMOCAP-Emotion | 1k | 1.3 | 4.51 / 0.75 / 24.12 | M.J. (↑) |
| ER | MELD-Sentiment | 2.6k | 2.4 | 3.35 / 0.13 / 304.9 | M.J. (↑) |
| ER | MELD-Emotion | 2.6k | 2.4 | 3.35 / 0.13 / 304.9 | M.J. (↑) |
| AR | VoxCeleb1-Accent | 4.8k | 11.2 | 8.27 / 3.96 / 69.04 | M.J. (↑) |
| GR | VoxCeleb1-Gender | 4.8k | 11.2 | 8.27 / 3.96 / 69.04 | M.J. (↑) |
| GR | IEMOCAP-Gender | 1k | 1.26 | 4.55 / 0.69 / 26.77 | M.J. (↑) |
5.2. Evaluation Metrics

The choice of evaluation metrics depends on the task, particularly distinguishing between tasks with constrained outputs and those with open-ended generation.

  • Word Error Rate (WER) for ASR Tasks:

    1. Conceptual Definition: WER is a standard metric for Automatic Speech Recognition that quantifies the difference between a hypothesized (model-generated) transcription and a reference (ground truth) transcription. It is based on the Levenshtein distance, which measures the minimum number of single-word edits (insertions, deletions, or substitutions) required to change one sequence of words into another. A lower WER indicates better ASR performance.
    2. Mathematical Formula: $ \mathrm{WER} = \frac{S + D + I}{N} $
    3. Symbol Explanation:
      • $S$: The number of substitutions, where a word in the reference is replaced by a different word in the hypothesis.
      • $D$: The number of deletions, where a word in the reference is missing from the hypothesis.
      • $I$: The number of insertions, where a word in the hypothesis is not present in the reference.
      • $N$: The total number of words in the reference (ground truth) transcription.
  • METEOR Score for Audio Captioning (AC) Tasks:

    1. Conceptual Definition: METEOR (Metric for Evaluation of Translation with Explicit Ordering) is a metric used for evaluating the quality of text generation, such as machine translation or image/audio captioning. Unlike BLEU which primarily focuses on precision, METEOR considers both precision and recall, as well as alignment features like exact word matches, stemmed matches, and synonym matches. It also includes a penalty for fragmentation to account for the fluency and grammatical correctness of the generated text, aiming for a higher correlation with human judgments. A higher METEOR score indicates better quality captions.
    2. Mathematical Formula: $ \mathrm{METEOR} = \mathrm{HM}(\alpha P, (1-\alpha)R) \times (1 - F_p) $
    3. Symbol Explanation:
      • $\mathrm{HM}$: The harmonic mean function. The harmonic mean is used here to combine precision and recall, giving more weight to lower values.
      • $P$: Precision, calculated as the number of matched unigrams between the candidate and reference texts, divided by the number of unigrams in the candidate (model-generated) text.
      • $R$: Recall, calculated as the number of matched unigrams between the candidate and reference texts, divided by the number of unigrams in the reference (ground truth) text.
      • $\alpha$: A weighting parameter (typically set to 0.9 in METEOR) that determines the relative importance of precision and recall in the harmonic mean calculation.
      • $F_p$: The fragmentation penalty, which penalizes fragmented matches between the candidate and reference texts, encouraging more fluent and contiguous generated text. It is calculated as $0.5 \times (\text{number of chunks} / \text{number of unigram matches})$.
  • Model-as-Judge (M.J.) for Open-ended Generation Tasks:

    1. Conceptual Definition: For tasks involving open-ended generation (e.g., SQA, SI, AQA, ER, AR, GR, and AC in addition to METEOR), traditional word-overlap metrics are often inadequate. Model-as-Judge leverages a powerful LLM (in this case, LLaMA-3-70B-Instruct after validation against GPT-4) to evaluate the quality of a model's generated response against a reference answer and the original question. The judge model scores the response based on criteria such as accuracy, relevance, completeness, and adherence to instructions.
    2. Mathematical Formula: There is no single mathematical formula for Model-as-Judge as it relies on the internal reasoning and scoring mechanism of the judging LLM. The output is typically a numerical score (e.g., 0-5 scale, then rescaled to 100 points) and often an accompanying explanation.
    3. Symbol Explanation: Not applicable, as it is a qualitative assessment performed by another LLM rather than a fixed formula. The scores are then rescaled as follows: $ \mathrm{Rescaled\ Score} = \frac{\mathrm{Raw\ M.J.\ Score}}{\mathrm{Max\ M.J.\ Score}} \times 100 $ Where:
      • $\mathrm{Raw\ M.J.\ Score}$: The score assigned by the Model-as-Judge (e.g., 0-5 or 0-1).
      • $\mathrm{Max\ M.J.\ Score}$: The maximum possible score from the Model-as-Judge (e.g., 5 or 1). The (↑) symbol next to M.J. in Table 1 indicates that a higher score is better.
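A minimal sketch of the Model-as-Judge rescaling described above: per-sample judge scores on a 0-5 scale are mapped onto a 100-point scale and then aggregated per dataset. The rescaling mirrors the stated formula; aggregating by a simple mean is an assumption.

```python
def rescale_mj_scores(raw_scores, max_score=5):
    """Map raw Model-as-Judge scores (e.g., 0-5) onto a 100-point scale."""
    return [100.0 * s / max_score for s in raw_scores]

def dataset_mj_score(raw_scores, max_score=5):
    """Dataset-level M.J. score: mean of rescaled per-sample scores (assumed aggregation)."""
    rescaled = rescale_mj_scores(raw_scores, max_score)
    return sum(rescaled) / len(rescaled)

# Example: judge scores of 3, 4, 2, 3 for four responses -> 60.0 on the 100-point scale.
print(dataset_mj_score([3, 4, 2, 3]))
```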

5.3. Baselines

The paper evaluated five models, comprising four AudioLLMs and one cascade model, to provide a comprehensive review of various solutions:

  1. SALMONN (Tang et al., 2024): An AudioLLM designed for generic hearing abilities.
  2. Qwen-Audio-Chat (Chu et al., 2023): An AudioLLM from Alibaba, aiming for universal audio understanding.
  3. WavLLM (Hu et al., 2024a): An AudioLLM focused on robustness and adaptivity in speech.
  4. Qwen2-Audio-Instruct (Chu et al., 2024): An updated version of Qwen-Audio, further developed for instruction following.
  5. Whisper+Llama3 (Cascade Model): This model operates in a pipeline:
    • Step 1 (ASR): Transcriptions are extracted from the audio using Whisper-large-v3 (Huang and Tsai, 2023), a highly capable ASR model.
    • Step 2 (Reasoning): These transcriptions, along with user queries, are then fed into the Llama-3-8B-Instruct model to generate responses. This cascade model serves as a strong baseline, particularly for speech-intensive tasks, as it leverages the robust ASR capabilities of Whisper and the powerful reasoning abilities of Llama3. However, it inherently cannot comprehend rich non-speech audio content, relying solely on verbal context.
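A hedged sketch of the cascade baseline follows: Whisper transcribes the audio, and the transcript plus the user query is passed to a text-only LLM. The model identifiers are the public Hugging Face checkpoints for Whisper-large-v3 and Llama-3-8B-Instruct; the prompt wording and generation settings are illustrative assumptions rather than the authors' exact configuration.

```python
from transformers import pipeline

# Step 1: ASR front-end (Whisper-large-v3).
asr = pipeline("automatic-speech-recognition", model="openai/whisper-large-v3")
# Step 2: text-only reasoning back-end (Llama-3-8B-Instruct).
llm = pipeline("text-generation", model="meta-llama/Meta-Llama-3-8B-Instruct")

def cascade_answer(audio_path: str, query: str) -> str:
    """Transcribe the audio, then answer the query from the transcript alone."""
    transcript = asr(audio_path)["text"]
    messages = [{"role": "user",
                 "content": f"Transcript of an audio clip:\n{transcript}\n\nQuestion: {query}"}]
    out = llm(messages, max_new_tokens=256, do_sample=False)
    # With chat-format input, recent transformers versions return the message list;
    # the last entry is the generated assistant turn.
    return out[0]["generated_text"][-1]["content"]

# Usage (hypothetical file path):
# print(cascade_answer("public_speech.wav", "What challenge does the speaker mention?"))
```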

6. Results & Analysis

6.1. Core Results Analysis

The main experimental results, detailed in Table 2, show that no single model consistently outperforms others across all 26 datasets and 8 tasks. This highlights the diverse challenges within AudioLLMs and opportunities for future advancements.

The following are the results from Table 2 of the original paper (main results of four AudioLLMs and one cascade model; for ASR tasks, the word error rate (WER) is lower-is-better (↓)):

| Dataset Name | SALMONN | Qwen-Audio-Chat | WavLLM | Qwen2-Audio-Instruct | Whisper+Llama3 |
| --- | --- | --- | --- | --- | --- |
| Speech Understanding | | | | | |
| LibriSpeech-Clean (↓) | 55.58 | 2.25 | 2.10 | 3.20 | 1.83 |
| LibriSpeech-Other (↓) | 41.80 | 4.16 | 4.80 | 6.07 | 3.71 |
| CommonVoice-15 (↓) | 33.75 | 11.65 | 14.53 | 11.44 | 9.89 |
| GigaSpeech (↓) | 34.33 | 30.72 | 37.92 | 22.32 | 14.54 |
| PeoplesSpeech (↓) | 14.22 | 13.32 | 15.49 | 11.89 | 9.51 |
| Tedlium3 (↓) | 8.56 | 4.00 | 6.62 | 6.39 | 3.81 |
| Tedlium3-Longform (↓) | 18.39 | 45.29 | 45.37 | 95.35 | — |
| Earning-21 (↓) | 26.87 | 38.46 | 64.47 | 98.65 | 11.77 |
| Earning-22 (↓) | 36.38 | 51.18 | 66.72 | 98.84 | 15.61 |
| CN-College-Listen | 50.51 | 60.85 | 65.43 | 74.50 | 85.25 |
| SLUE-P2-SQA5 | 78.24 | 76.12 | 83.92 | 80.05 | 82.99 |
| DREAM-TTS | 55.93 | 57.76 | 64.56 | 66.70 | 86.09 |
| Public-SG-SpeechQA | 56.77 | 57.47 | 58.55 | 58.31 | 64.94 |
| OpenHermes-Audio | 19.20 | 11.00 | 22.40 | 44.80 | 63.0 |
| ALPACA-Audio | 12.40 | 9.60 | 21.60 | 52.60 | 70.8 |
| Audio Scene Understanding | | | | | |
| Clotho-AQA | 51.18 | 58.20 | 43.01 | 50.92 | 29.47 |
| WavCaps-QA | 46.25 | 38.68 | 26.25 | 44.47 | 17.38 |
| AudioCaps-QA | 47.03 | 47.99 | 29.84 | 45.75 | 16.71 |
| WavCaps (M.J.) | 21.16 | 29.25 | 6.40 | 33.78 | 3.45 |
| AudioCaps (M.J.) | 34.37 | 47.99 | 4.17 | 40.78 | 2.47 |
| WavCaps (METEOR) | 17.72 | 24.02 | 9.78 | 21.34 | 13.89 |
| AudioCaps (METEOR) | 21.20 | 27.70 | 6.70 | 19.89 | 7.95 |
| Voice Understanding | | | | | |
| IEMOCAP-Emotion | 21.56 | 27.34 | 45.91 | 49.30 | 34.43 |
| MELD-Emotion | 33.06 | 50.57 | 41.07 | 33.36 | 40.54 |
| MELD-Sentiment | 41.87 | 43.87 | 50.08 | 53.49 | 43.87 |
| VoxCeleb1-Accent | 28.06 | 45.70 | 37.65 | 29.19 | 39.33 |
| VoxCeleb1-Gender | 88.90 | 70.56 | 70.51 | 99.12 | 53.41 |
| IEMOCAP-Gender | 51.60 | 51.13 | 45.29 | 49.30 | 51.50 |

Here's an analysis of the results:

  • ASR Performance:

    • Qwen-Audio-Chat and WavLLM generally show robust ASR capabilities on standard datasets, with WERs of 2.25 and 2.10 on LibriSpeech-Clean and 4.16 and 4.80 on LibriSpeech-Other, respectively.
    • The cascade model Whisper+Llama3 demonstrates superior ASR performance on most short-form datasets, notably achieving the lowest WERs on LibriSpeech-Clean (1.83), LibriSpeech-Other (3.71), and Tedlium3 (3.81). This is expected, as Whisper is a dedicated and highly optimized ASR model.
    • SALMONN shows significantly higher WERs on many ASR tasks (e.g., 55.58 on LibriSpeech-Clean), indicating sensitivity to varying instructions or limitations in its ASR component compared to other models.
    • All AudioLLMs struggle significantly with long-form ASR tasks (Tedlium3-Longform, Earning-21, Earning-22), exhibiting very high WERs (e.g., Qwen2-Audio-Instruct with 95.35 on Tedlium3-Longform, 98.65 on Earning-21). This suggests a clear limitation in handling extended audio contexts, potentially due to training data limitations or model architecture. Whisper+Llama3 also struggles with long-form audio, though generally less than the end-to-end AudioLLMs.
  • Speech Question Answering (SQA) and Speech Instruction (SI):

    • The cascade model Whisper+Llama3 consistently exhibits superior performance in speech-intensive tasks like SQA and SI (e.g., 85.25 on CN-College-Listen, 86.09 on DREAM-TTS, 63.0 on OpenHermes-Audio). This confirms that combining a strong ASR front-end (Whisper) with a powerful LLM (Llama3) is highly effective when the information is primarily verbal.
    • AudioLLMs perform significantly lower in SI tasks compared to SQA tasks. For example, Qwen2-Audio-Instruct scores 74.50 on CN-College-Listen (SQA) but only 44.80 on OpenHermes-Audio (SI). This indicates that directly following instructions delivered via speech (SI) is more challenging than answering questions about speech content (SQA), possibly due to complexities in extracting instructional intent or integrating paralinguistic cues for instruction following.
    • The "modality fusion process" in end-to-end AudioLLMs might distort speech content, leading to their underperformance compared to the Whisper+Llama3Whisper+Llama3 pipeline, which relies on clean transcriptions.
  • Audio Scene Understanding (AQA & AC):

    • AudioLLMs generally outperform the cascade model in Audio Scene Understanding tasks. For example, Qwen-Audio-Chat scores 58.20 on Clotho-AQA compared to Whisper+Llama3's 29.47. This is logical, as the cascade model cannot "hear" non-verbal audio and relies entirely on transcriptions, which are absent or irrelevant for these tasks.
    • However, the AudioLLMs' performance in these tasks is still far from perfect, with M.J. scores often in the 40s-50s for AQA and 20s-40s for AC, highlighting room for improvement in environmental sound comprehension and captioning.
    • WavLLM shows notably poor performance in Audio Captioning (e.g., WavCaps(M.J.) 6.40, AudioCaps(M.J.) 4.17), indicating its lack of exposure or capability in non-spoken scenarios.
  • Voice Understanding (ER, AR, GR):

    • Similar to Audio Scene Understanding, AudioLLMs generally outperform the cascade model in tasks involving paralinguistic features like emotion, accent, and gender recognition. Whisper+Llama3 struggles here because these features are not explicitly captured in ASR transcripts (e.g., Whisper+Llama3 scores 34.43 on IEMOCAP-Emotion while Qwen2-Audio-Instruct scores 49.30).
    • An exception is observed in sentiment and emotion recognition where some emotions can be directly inferred from speech semantics, possibly boosting the cascade model's performance on MELD-Sentiment.
    • Qwen2-Audio-Instruct shows strong performance in Gender Recognition (99.12 on VoxCeleb1-Gender), indicating a specific strength in this area.
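As noted in the long-form ASR discussion above (and in Section 4.2.2), long recordings are segmented into chunks the models can ingest and the chunk transcripts are reassembled before computing WER. Below is a minimal sketch of that segmentation, assuming 16 kHz mono audio as a NumPy array and a fixed 30-second window; the actual chunk length and any overlap handling used by the authors are not specified here.

```python
import numpy as np

def chunk_audio(waveform: np.ndarray, sample_rate: int = 16000, chunk_seconds: float = 30.0):
    """Split a long mono waveform into fixed-length chunks (the last chunk may be shorter)."""
    chunk_len = int(sample_rate * chunk_seconds)
    return [waveform[i:i + chunk_len] for i in range(0, len(waveform), chunk_len)]

def transcribe_longform(waveform: np.ndarray, transcribe_fn, sample_rate: int = 16000) -> str:
    """Run a (model-provided) transcription function on each chunk and join the results."""
    chunks = chunk_audio(waveform, sample_rate)
    return " ".join(transcribe_fn(c) for c in chunks).strip()

# Example with a dummy transcriber: a 95-second clip yields 4 chunks.
dummy_audio = np.zeros(16000 * 95, dtype=np.float32)
print(len(chunk_audio(dummy_audio)))                     # -> 4
print(transcribe_longform(dummy_audio, lambda c: "..."))
```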

6.2. Robustness Queries

The study specifically investigated model robustness to diverse instruction prompts, a crucial factor for real-world applicability.

  • Experimental Setup: Two AudioLLMs (SALMONN and Qwen-Audio) were tested on three ASR datasets (LibriSpeech-Clean, CommonVoice, Tedlium3) using three different prompt templates:

    1. "Transcribe the given audio."
    2. "Read and give me the transcription."
    3. "Decode the audio and give me the written transcriptions."
  • Results (Figure 2 from the original paper):

    [Figure 2: Bar chart comparing SALMONN and Qwen-Audio on LibriSpeech-Clean, CommonVoice, and Tedlium3 under three prompt templates (Prompt-1, Prompt-2, Prompt-3); bar height indicates each model's score under the given prompt.]

    • SALMONN showed significant variability:
      • On LibriSpeech and Tedlium3, Prompt 3 ("Decode the audio and give me the written transcriptions") caused SALMONN to perform phoneme recognition instead of ASR, leading to a drastic increase in WER (worse performance).
      • On CommonVoice, SALMONN tended to perform speech translation for a substantial number of samples, negatively impacting its WER.
      • This suggests SALMONN might be "overly tuned to speech features (tokens) and not sufficiently responsive to the prompts," indicating a lack of robust instruction following.
    • Qwen-Audio demonstrated stable performance across all three prompt templates, indicating better robustness to varying instructions.
  • Implication: This analysis underscores the importance of evaluating models with diverse prompts. AudioBench addresses this by incorporating at least 20 diverse prompt templates for tasks without inherent prompt variety.
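
A minimal sketch of such a multi-prompt evaluation is shown below. The `model.generate(audio=..., prompt=...)` call and the sample format are hypothetical placeholders rather than the actual AudioBench toolkit API; `wer_fn` can be any standard WER implementation, such as the one sketched earlier.

```python
from statistics import mean

# The three instruction templates from the robustness experiment.
PROMPTS = [
    "Transcribe the given audio.",
    "Read and give me the transcription.",
    "Decode the audio and give me the written transcriptions.",
]

def evaluate_prompt_robustness(model, samples, wer_fn):
    """Return the average WER per prompt template.

    `model` is assumed to expose a hypothetical generate(audio=..., prompt=...)
    method returning a transcript; `samples` is an iterable of
    (audio, reference_transcript) pairs; `wer_fn` scores a (reference, hypothesis) pair.
    """
    results = {}
    for prompt in PROMPTS:
        wers = [wer_fn(reference, model.generate(audio=audio, prompt=prompt))
                for audio, reference in samples]
        results[prompt] = mean(wers)
    return results

# A prompt-robust model (Qwen-Audio in this experiment) should yield similar
# averages across all three prompts; large gaps, as observed for SALMONN,
# indicate sensitivity to instruction wording.
```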

6.3. Model-as-Judge Comparison

The paper validated the choice of LLaMA-3-70B-Instruct as the open-source Model-as-Judge for open-ended generation tasks.

  • Methodology:

    • Outputs from the SALMONN model on CN-College-Listen (Speech Understanding), Clotho-AQA (Audio Scene Understanding), and VoxCeleb1-Accent (Voice Understanding) were used.
    • These outputs, along with golden answers and questions, were scored by GPT-4 (as the gold standard judge) and three open-source candidates: Llama-3-8B-Instruct, Llama-3-70B-Instruct, and Prometheus-2.
    • Spearman's rank correlation was calculated to assess the agreement between the open-source judges and GPT-4 (a minimal computation sketch appears at the end of this subsection).
  • Results (Figure 3 from the original paper):

    [Figure 3: Heatmaps over three datasets (CN-College-Listen, Clotho-AQA, VoxCeleb1-Accent) showing pairwise correlations between judge models' scores; darker cells indicate stronger agreement.]

    • LLaMA-3-70B-Instruct exhibited the highest correlation with GPT-4-as-a-judge (specifically, gpt-4-turbo-0409), with correlation scores exceeding 0.85 across all three diverse datasets. This indicates a "very strong correlation."
    • Llama-3-8B-Instruct also showed strong correlations, though slightly lower than its 70B counterpart.
    • Prometheus-2, despite being fine-tuned for evaluation, showed lower correlation scores compared to the LLaMA-3 models. This suggests that Prometheus-2's fine-tuning may not fully compensate for the limitations of its base model (Mistral).
  • Conclusion: LLaMA-3-70B-Instruct was chosen as the Model-as-Judge for AudioBench due to its high correlation with GPT-4, robust generalizability, transparency, and ease of adaptability, providing an accessible and reliable solution for evaluating open-ended responses.
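
The agreement check behind this choice is simple to reproduce: collect the scores two judges assign to the same set of model outputs and rank-correlate them. The sketch below uses `scipy.stats.spearmanr` with toy score lists; the values are illustrative only, not the paper's actual judge outputs.

```python
from scipy.stats import spearmanr

# Toy per-sample scores from two judges over the same ten model outputs
# (illustrative values only).
gpt4_scores     = [5, 3, 4, 1, 2, 5, 4, 2, 3, 1]   # reference judge (GPT-4)
llama70b_scores = [5, 3, 5, 1, 2, 4, 4, 2, 3, 1]   # candidate open-source judge

rho, p_value = spearmanr(gpt4_scores, llama70b_scores)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3g})")

# In the paper, Llama-3-70B-Instruct exceeds rho = 0.85 against GPT-4 on all
# three evaluation datasets, which motivates its adoption as the judge.
```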

7. Conclusion & Reflections

7.1. Conclusion Summary

This work introduces AudioBench, the first comprehensive evaluation benchmark for Audio Large Language Models (AudioLLMs). It addresses a critical gap in the field by providing a standardized and holistic evaluation framework for models capable of instruction-following based on audio signals. AudioBench encompasses 8 distinct tasks and 26 datasets, including 7 newly proposed ones, covering three crucial aspects: speech understanding, audio scene understanding, and voice understanding (paralinguistic features). The benchmark employs diverse prompt templates and input lengths to assess model robustness and validates LLaMA-3-70B-Instruct as a reliable and accessible Model-as-Judge for open-ended generation tasks. Through evaluation of five popular models, the study reveals that no single model currently excels across all criteria, indicating significant opportunities for future development. The authors anticipate that their open-sourced toolkit, data, and leaderboard will serve as a robust testbed to stimulate further progress in multimodal foundation models.

7.2. Limitations & Future Work

The authors acknowledge several limitations and propose future research directions:

  • English-only Datasets: The current AudioBench is exclusively in English. Future work will expand to include multilingual capabilities and code-switching for comprehensive speech understanding and generation across diverse linguistic and cultural contexts.

  • Challenges in Evaluating Free-style Generation: Evaluating open-ended, free-style generation remains a challenge. While Model-as-Judge is adopted, the issue of accurately grading such outputs, especially in zero-shot scenarios, is not fully resolved. Further development of robust and suitable evaluation metrics, particularly for audio inputs, is crucial.

  • Efficiency vs. Accuracy Focus: The benchmark primarily focuses on accuracy rather than efficiency. AudioLLMs often have large model sizes, leading to longer inference times. Future evaluations should also consider inference speed and the deployment environment to give a more complete picture of practical deployment.

    In addition to addressing these limitations, the authors outline a broader research outlook for AudioLLMs:

  • Long Audio Processing and Understanding: Extending AudioLLMs to process and understand long audio (beyond minutes) is crucial for applications like meeting summarization and sequential event understanding. Embedding speech content as tokens could facilitate this, leveraging advancements in long-sequence processing in text-based LLMs.

  • Multi-round Query Handling: Enhancing the ability of open-source models to manage multi-round queries would enable more dynamic interactions, where each query could involve different modalities (e.g., images or audio).

  • Multilingual Capabilities, Code-Switching, and Dialects: Expanding linguistic capabilities to handle multiple languages, code-switching, and various dialects is essential for global applicability.

  • Speech Generation: Developing more sophisticated speech generation capabilities within AudioLLMs would lead to more natural and engaging human-computer interactions, including the ability to mimic human-like intonations and rhythms.

7.3. Personal Insights & Critique

This paper makes a highly valuable contribution to the rapidly evolving field of multimodal LLMs. The establishment of AudioBench fills a critical gap, providing a much-needed standardized framework for evaluating AudioLLMs that aims to match the rigor seen in text-based and vision-based LLM benchmarks.

One of the paper's strengths is its comprehensive approach, covering not just ASR but also audio scene understanding and paralinguistic features. This holistic view acknowledges the complexity of real-world audio, which goes far beyond just transcribing speech. The inclusion of newly curated datasets, especially for SQA and SI, is particularly innovative, as these tasks directly assess the instruction-following capabilities central to LLM design. The validation of LLaMA-3-70B-Instruct as a Model-as-Judge is also a practical contribution, offering an accessible and transparent evaluation method for open-ended generation, which is a persistent challenge in NLP.

However, some potential issues or areas for improvement could be considered:

  • Subjectivity of Model-as-Judge: While LLaMA-3-70B-Instruct shows high correlation with GPT-4, Model-as-Judge still carries an inherent subjectivity, being dependent on the judge model's biases and capabilities. Further research into hybrid metrics combining automated linguistic features with Model-as-Judge scores could offer more robust evaluations.

  • Generalizability of "Long-form Audio" Definition: The paper notes that long-form audio may exceed 10 minutes. While this is a practical definition, the challenges of long-form understanding can vary greatly (e.g., a 10-minute speech vs. a 2-hour meeting recording). A more nuanced categorization of long-form audio could reveal specific bottlenecks (e.g., memory limitations, long-range dependencies).

  • Dataset Size for New Tasks: While 7 new datasets are introduced, some, like OpenHermes-Audio and ALPACA-Audio, have only 100 samples each after human filtering. While this is a starting point, it might be a relatively small sample size to robustly evaluate instruction-following capabilities, especially for LLMs that are notoriously data-hungry. Scaling these datasets up could provide more reliable insights.

  • Beyond English Bias: While acknowledged as a limitation and future work, the current English-only focus leaves a significant portion of global linguistic diversity unaddressed. The challenges of multilingual AudioLLMs are substantial, including handling code-switching, diverse accents, and low-resource languages. AudioBench provides a strong foundation, but its "universality" will be truly tested with expansion into multilingual contexts.

  • Efficiency Metrics: The focus on accuracy over efficiency is understandable for a first-of-its-kind benchmark. However, for practical deployment, inference speed, computational cost, and model size are crucial. Future iterations of AudioBench could integrate these metrics to guide the development of practical and efficient AudioLLMs.

    Despite these points, AudioBench offers a much-needed, rigorous, and forward-looking framework. Its methodologies and datasets are highly transferable, potentially inspiring similar benchmarks for other multimodal LLMs (e.g., olfactory-LLMs or tactile-LLMs if such modalities become prominent). The emphasis on open-sourcing the toolkit and data is commendable and vital for fostering collaborative research. The findings already provide clear directions for AudioLLM development, especially concerning long-form audio processing and enhancing robustness to diverse instructions. This paper sets a new standard for evaluating the "ears" of intelligent agents.
