AHELM: A Holistic Evaluation of Audio-Language Models

Published: 08/29/2025

TL;DR Summary

AHELM is a benchmark introduced to holistically assess Audio-Language Models (ALMs), integrating multiple datasets and introducing PARADE and CoRe-Bench. It covers ten key evaluation aspects and standardizes methods for equitable model comparisons.

Abstract

Evaluations of audio-language models (ALMs) -- multimodal models that take interleaved audio and text as input and output text -- are hindered by the lack of standardized benchmarks; most benchmarks measure only one or two capabilities and omit evaluative aspects such as fairness or safety. Furthermore, comparison across models is difficult as separate evaluations test a limited number of models and use different prompting methods and inference parameters. To address these shortfalls, we introduce AHELM, a benchmark that aggregates various datasets -- including 2 new synthetic audio-text datasets called PARADE, which evaluates the ALMs on avoiding stereotypes, and CoRe-Bench, which measures reasoning over conversational audio through inferential multi-turn question answering -- to holistically measure the performance of ALMs across 10 aspects we have identified as important to the development and usage of ALMs: audio perception, knowledge, reasoning, emotion detection, bias, fairness, multilinguality, robustness, toxicity, and safety. We also standardize the prompts, inference parameters, and evaluation metrics to ensure equitable comparisons across models. We test 14 open-weight and closed-API ALMs from 3 developers and 3 additional simple baseline systems each consisting of an automatic speech recognizer and a language model. Our results show that while Gemini 2.5 Pro ranks top in 5 out of 10 aspects, it exhibits group unfairness (p=0.01) on ASR tasks whereas most of the other models do not. We also find that the baseline systems perform reasonably well on AHELM, with one ranking 6th overall despite having only speech-to-text capabilities. For transparency, all raw prompts, model generations, and outputs are available on our website at https://crfm.stanford.edu/helm/audio/v1.0.0. AHELM is intended to be a living benchmark and new datasets and models will be added over time.

In-depth Reading

1. Bibliographic Information

1.1. Title

AHELM: A Holistic Evaluation of Audio-Language Models

1.2. Authors

The paper is authored by a collaborative team from Stanford University, University of California, Santa Cruz, and Hitachi America, Ltd. The authors include:

  • Tony Lee

  • Haoqin Tu

  • Chi Heem Wong

  • Zijun Wang

  • Siwei Yang

  • Yifan Mai

  • Yuyin Zhou

  • Cihang Xie

  • Percy Liang

    Tony Lee, Haoqin Tu, and Chi Heem Wong are noted as having equal contributions. Percy Liang, affiliated with Stanford University, is a prominent researcher known for his work in natural language processing and responsible AI, including the Holistic Evaluation of Language Models (HELM) framework.

1.3. Journal/Conference

The paper was published on 2025-08-29 (UTC). Based on the "NeurIPS Paper Checklist" included in the appendix, it is intended for publication at NeurIPS (the Conference on Neural Information Processing Systems), one of the most prestigious conferences in artificial intelligence and machine learning, known for publishing cutting-edge research.

1.4. Publication Year

2025

1.5. Abstract

Evaluations of Audio-Language Models (ALMs), which are multimodal models processing interleaved audio and text to output text, are currently fragmented due to a lack of standardized benchmarks. Existing evaluations typically focus on one or two capabilities, neglecting crucial aspects like fairness or safety. Furthermore, comparing models is difficult because different studies use varied prompting methods, inference parameters, and a limited number of models. To address these shortcomings, the paper introduces AHELM (A Holistic Evaluation of Audio-Language Models), a comprehensive benchmark.

AHELM integrates various datasets, including two newly developed synthetic audio-text datasets: PARADE (evaluating ALMs on stereotype avoidance) and CoRe-Bench (measuring reasoning over conversational audio through inferential multi-turn question answering). This benchmark holistically assesses ALM performance across ten critical aspects: audio perception, knowledge, reasoning, emotion detection, bias, fairness, multilinguality, robustness, toxicity, and safety. To ensure fair comparisons, AHELM standardizes prompts, inference parameters, and evaluation metrics.

The authors tested 14 open-weight and closed-API ALMs from three developers, along with three simple baseline systems (each combining an automatic speech recognizer and a language model). Key findings indicate that while Gemini 2.5 Pro ranks highest in five out of ten aspects, it exhibits group unfairness (p=0.01) on ASR tasks, unlike most other models. Surprisingly, baseline systems perform remarkably well, with one ranking 6th overall despite relying solely on speech-to-text capabilities. For transparency, all prompts, model generations, and outputs are publicly available. AHELM is designed as a living benchmark, with continuous additions of new datasets and models.

Preprint: https://arxiv.org/abs/2508.21376. PDF link: https://arxiv.org/pdf/2508.21376v2.pdf

2. Executive Summary

2.1. Background & Motivation

The field of Audio-Language Models (ALMs) is rapidly advancing, with growing aspirations to integrate them into daily life for tasks ranging from smart assistants to complex audio scene understanding. ALMs are multimodal models that can process both audio and text inputs, producing text outputs, aiming to give machines a more complete perception of the world beyond text-only models.

However, the rapid development of ALMs has outpaced the establishment of robust, standardized evaluation methods. The core problem the paper identifies is the lack of a holistic and standardized benchmark for ALMs.

  • Limited Scope of Existing Benchmarks: Most current evaluations are narrow, focusing on one or two specific capabilities (e.g., automatic speech recognition (ASR) or emotion detection). They often neglect crucial societal and ethical aspects like fairness or safety, which are paramount for widespread deployment.

  • Inconsistent Comparisons: Comparing ALMs across different studies is challenging due to varying experimental setups, including different prompting methods, inference parameters (e.g., temperature, token limits), and evaluation metrics. This makes it difficult to objectively assess and compare model performance.

  • Lack of Transparency: Many evaluations do not release raw predictions, further hindering reproducibility and detailed analysis.

    The paper's entry point is to extend existing holistic evaluation frameworks (like HELM for LMs) to the multimodal domain of ALMs, creating a comprehensive and transparent benchmarking system. It aims to provide a unified platform for researchers and developers to understand the strengths, weaknesses, and potential risks of ALMs.

2.2. Main Contributions / Findings

The paper makes six major contributions to address the identified shortfalls in ALM evaluation:

  1. Identification of 10 Key Aspects: The authors identify ten critical aspects for ALM development and usage, spanning both technological and societal concerns: audio perception, knowledge, reasoning, emotion detection, bias, fairness, multilinguality, robustness, toxicity, and safety. This provides a structured framework for comprehensive evaluation.
  2. Aggregation of Benchmark Datasets: AHELM maps 14 existing relevant benchmark datasets to these 10 aspects, enabling holistic assessment.
  3. Introduction of PARADE Dataset: To address the lack of benchmarks for bias in ALMs, the authors create PARADE, a novel synthetic audio-text dataset. It probes stereotyped responses by presenting audio transcripts commonly associated with different occupations or social statuses.
  4. Introduction of CoRe-Bench Dataset: To evaluate reasoning over long, real-life conversational audio, they introduce CoRe-Bench. This synthetic dataset features multi-turn dialogues grounded in diverse demographic scenarios, requiring inferential multi-turn question answering.
  5. Inclusion of Baseline Systems: The benchmark includes simple baseline systems, each comprising an Automatic Speech Recognizer (ASR) paired with a Large Language Model (LM). This provides a crucial comparison point to understand when ALMs genuinely outperform simpler, existing solutions and where improvements are most needed.
  6. Standardization of Evaluation: AHELM standardizes prompts, inference parameters (e.g., temperature set to 0, max output tokens to 200), and evaluation metrics to ensure equitable and objective comparisons across models and model versions.

Key Findings:

  • No Single Best Model: There is no single ALM that excels across all aspects.
  • Gemini 2.5 Pro's Performance and Unfairness: Gemini 2.5 Pro (05-06 Preview) emerged as the overall best, ranking first in 5 out of 10 aspects (audio perception, reasoning, emotion detection, multilinguality, and robustness), with a mean win rate of 0.803. However, it exhibited statistically significant group unfairness (p=0.01) on ASR tasks.
  • Instruction Following in Open-Weight Models: Open-weight models (like Qwen variants) generally showed weaker instruction following capabilities, leading to degraded performance.
  • Strong Baselines: Baseline systems (ASR + LM) performed surprisingly well, with one (GPT-4o-mini Transcribe + GPT-4o) ranking 6th overall. This is partly attributed to the dedicated ASR modules being more skillful and robust to environmental noise, and text being a good abstraction for many audio tasks.
  • Robustness of Dedicated ASR: Dedicated ASR modules in baseline systems were generally more robust to environmental noise than most ALMs.
  • Emotion Detection Insights: The strong performance of baseline systems in some emotion detection scenarios (e.g., MELD) suggests that much of the emotional information can be inferred from speech content rather than subtle audio cues like inflection. Conversely, their poor performance on MUStARD (sarcasm detection) highlights the need for nuanced audio understanding.
  • Toxicity Detection Language Dependence: Model performance on toxicity detection (MuToX) varied significantly across languages, with best performance on French and Indonesian, and worst on Vietnamese and English.
  • Fairness in ASR for Gender: Current ALMs are generally robust to speaker gender in ASR tasks, with few statistically significant differences, though some Gemini and Qwen models showed minor preferences.
  • OpenAI Models' Safety: OpenAI's models demonstrated stronger resistance to voice jailbreak attacks, possibly due to recent patching.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand this paper, a reader should be familiar with the following concepts:

  • Multimodal Models: These are AI models that can process and integrate information from multiple types of data (or "modalities"). In this paper, the primary modalities are audio and text. An example is an ALM that takes spoken words and written instructions to generate a response.
  • Audio-Language Models (ALMs): A specific type of multimodal model that takes interleaved audio and text as input and produces text as output. This means they can understand spoken language, environmental sounds, and written instructions, then generate a textual response.
  • Automatic Speech Recognition (ASR): The technology that converts spoken language into written text. This is a crucial component for any system that processes audio input, as it allows the model to "understand" what is being said.
  • Large Language Models (LLMs): Deep learning models (often based on transformer architectures) trained on vast amounts of text data to understand, generate, and respond to human language. Examples include GPT-3, GPT-4, Gemini, and Claude. They excel at tasks like question answering, summarization, and text generation.
  • Benchmarking: In machine learning, benchmarking refers to the process of evaluating and comparing the performance of different models or systems on a standardized set of tasks and datasets using predefined metrics. A good benchmark should be comprehensive, fair, and reproducible.
  • Holistic Evaluation: An evaluation approach that assesses models across a wide range of capabilities, including both technical performance and societal aspects (e.g., fairness, safety, bias). This goes beyond simple accuracy metrics to provide a more complete picture of a model's strengths and weaknesses.
  • Zero-shot Prompting: A method of prompting a language model where the model is given a task description and an input, and is expected to perform the task without any explicit examples or demonstrations. It relies on the model's pre-trained knowledge to generalize to new tasks.
  • Inference Parameters: Settings that control how a model generates its output during the inference phase (when it's making predictions). Common parameters include:
    • Temperature: Controls the randomness of the output. A temperature of 0 makes the output deterministic (most likely token chosen), while higher temperature values lead to more creative and varied outputs.
    • Maximum Number of Output Tokens: Limits the length of the generated response.
  • Word Error Rate (WER): A common metric for evaluating the performance of ASR systems. It calculates the number of errors (substitutions, insertions, and deletions) required to change a hypothesis (model output) into a reference (ground truth) sequence of words, divided by the total number of words in the reference. A lower WER indicates better performance. (A code sketch appears after this list.)
    • Formula: $ \mathrm{WER} = \frac{S + D + I}{N} $ Where:
      • S = number of substitutions
      • D = number of deletions
      • I = number of insertions
      • N = number of words in the reference (ground truth)
  • BLEU (Bilingual Evaluation Understudy) Score: A widely used metric for evaluating the quality of machine-translated text. It measures the similarity between a machine-generated translation and a set of high-quality human reference translations. It works by counting the number of n-gram matches between the candidate and reference texts, with a penalty for brevity. A higher BLEU score indicates better translation quality.
    • Formula: $ \mathrm{BLEU} = \mathrm{BP} \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right) $ Where:
      • BP is the brevity penalty, which penalizes translations that are too short compared to the reference.
      • w_n are positive weights for each n-gram order (often uniform, 1/N).
      • p_n is the precision for n-grams, i.e., the count of matching n-grams divided by the total count of n-grams in the candidate translation.
      • N is the maximum n-gram order considered (typically 4).
  • Accuracy (EM - Exact Match): A metric that measures the percentage of predictions that exactly match the ground truth. It's often used for tasks with discrete, single-token, or short-string answers, like multiple-choice questions or classification.
  • F1 Score: The harmonic mean of precision and recall, often used to evaluate binary classification systems, especially on imbalanced datasets. It provides a single score that balances both metrics.
    • Formula: $ \mathrm{F1} = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} $ Where:
      • $ \mathrm{Precision} = \frac{\mathrm{True\ Positives}}{\mathrm{True\ Positives} + \mathrm{False\ Positives}} $
      • $ \mathrm{Recall} = \frac{\mathrm{True\ Positives}}{\mathrm{True\ Positives} + \mathrm{False\ Negatives}} $
  • Statistical Significance (p-value, t-test):
    • p-value: In hypothesis testing, the p-value is the probability of obtaining test results at least as extreme as the observed results, assuming the null hypothesis is true. A small p-value (typically ≤ 0.05) indicates strong evidence against the null hypothesis, suggesting the observed effect is statistically significant.
    • t-test: A statistical hypothesis test used to determine if there is a significant difference between the means of two groups.
      • Independent t-test: Used when comparing means of two independent groups (e.g., performance on male vs. female speakers).
      • Paired-samples t-test: Used when comparing means from the same group under two different conditions or from matched pairs (e.g., performance on the same content spoken by a male vs. a female).
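
As a concrete illustration of the WER definition above, the following is a minimal sketch that computes word-level edit distance with dynamic programming. It is illustrative only (AHELM relies on the HELM framework's metric implementations), and the simple whitespace tokenization with no text normalization is a simplifying assumption.

```python
# Illustrative WER via word-level edit distance (not AHELM's exact implementation).
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn hyp[:j] into ref[:i]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution or match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# 1 substitution ("sit" for "sat") + 1 deletion ("the") over 6 reference words ≈ 0.33
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```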

3.2. Previous Works

The paper builds upon a lineage of holistic evaluation frameworks and draws from extensive research in both language models and speech recognition.

  • HELM Framework for Language Models [29]: AHELM is a direct extension of the HELM (Holistic Evaluation of Language Models) framework developed by Liang et al. [29]. HELM established a comprehensive approach to evaluate LMs across various scenarios and metrics, focusing on aspects like truthfulness, safety, and efficiency, beyond just accuracy. It emphasized standardization to ensure fair comparisons.
  • Extensions to Other Modalities: The HELM framework has been successfully adapted for other multimodal models:
    • Lee et al. [28] applied it to text-to-image models.
    • Lee et al. [27] extended it to vision-language models (VHELM). AHELM is the natural next step, bringing this holistic approach to audio-language models.
  • Foundation Models (LMs & ALMs): The paper references the rapid advancements in Large Language Models like GPT-4 [1], Gemini [40], Claude [2], Deepseek [22], and Qwen [4, 43]. ALMs are seen as the next evolutionary step, incorporating audio to enhance these capabilities.
  • Automatic Speech Recognition (ASR) Developments: The development of ALMs is deeply intertwined with ASR. The paper mentions:
    • Traditional ASR: Models based on Mel-frequency Cepstral Coefficients (MFCCs), Gaussian Mixture Models (GMMs), and Hidden Markov Models (HMMs) [23].
    • Modern ASR: Deep neural networks [20, 21] and transformer-based models [11, 46]. Some ALMs, like Qwen2 Audio [6], leverage ASR backbones. The existence of mature ASR benchmarks (e.g., CSR-I (WSJ0) Sennheiser [16], Common Voice [3]) is noted, but the paper argues that these are often used in training ALMs, necessitating new benchmarks or careful usage of original test sets to prevent data leakage.
  • Specific Audio Datasets/Benchmarks: The paper incorporates and references a wide array of existing audio datasets for various tasks, which are detailed in Table 2 of the paper. These cover tasks like audio captioning (AudioCaps [25]), speaker verification (VoxCeleb2 [7]), vocal sound recognition (VocalSound [19]), multilingual ASR (FLEURS [9], Multilingual LibriSpeech [35]), emotion detection (MELD [34], MUStARD [5]), and robustness (Speech Robust Bench [37]), among others.
  • LLM-as-a-Judge: The paper leverages the technique of using an LLM (specifically GPT-4o) as an automated judge for open-ended tasks. This approach has gained traction in evaluating generative models, with works like those by Dubois et al. [13, 12] and Lee et al. [27] (using Prometheus-Vision [26]) demonstrating its utility for simulating human feedback and fine-grained evaluation.

3.3. Technological Evolution

The field has evolved from text-only language models to multimodal models, integrating vision (Vision-Language Models, VLMs) and now audio (Audio-Language Models, ALMs). This progression aims to create AI systems with a more comprehensive understanding of the world, mirroring human sensory perception. ASR technology, initially a separate domain, has become a foundational component for ALMs, evolving from statistical models to deep learning and transformer-based architectures.

The evaluation of these increasingly complex models has also evolved from task-specific metrics to holistic frameworks that consider a broader range of capabilities and societal impacts. HELM for LMs, then VHELM for VLMs, and now AHELM for ALMs, represents this shift towards more responsible and comprehensive assessment.

3.4. Differentiation Analysis

Compared to prior work, AHELM's core differences and innovations are:

  • Holistic Scope for ALMs: While previous HELM derivatives addressed LMs and VLMs, AHELM is the first to apply this comprehensive, multi-aspect evaluation framework specifically to ALMs. It explicitly identifies and covers 10 critical aspects, many of which (like bias, fairness, toxicity, safety) were largely unaddressed in prior ALM evaluations.

  • Standardization Across the Board: Unlike previous ALM evaluations that used varied settings, AHELM imposes strict standardization on prompts, inference parameters (temperature, max tokens), and evaluation metrics. This is crucial for enabling fair and objective comparisons, addressing a major limitation of earlier studies.

  • Novel Dataset Contributions: The introduction of PARADE and CoRe-Bench directly fills critical gaps in ALM evaluation:

    • PARADE addresses the explicit need for a bias benchmark, a societal aspect often overlooked in ALM evaluation. It uses synthetic audio to probe for stereotypes in a controlled manner.
    • CoRe-Bench tackles the challenge of evaluating reasoning over complex, long-form conversational audio involving multiple speakers, going beyond simpler AQA tasks. Its synthetic generation pipeline ensures diversity and scalability.
  • Baseline Systems for Context: The inclusion of ASR+LM baseline systems is a unique and insightful contribution. It provides a clear yardstick to determine if ALMs genuinely offer superior performance or if a simpler, chained approach suffices. This helps in identifying specific areas where ALMs need to demonstrate added value.

  • Transparency: AHELM's commitment to releasing all raw prompts, model generations, and outputs, along with code, sets a high standard for transparency and reproducibility, which was often lacking in previous ALM evaluations.

    In essence, AHELM moves beyond fragmented, task-specific ALM evaluations to a unified, standardized, and transparent holistic framework, explicitly incorporating ethical and societal considerations and introducing new benchmarks to cover previously neglected capabilities.

4. Methodology

The AHELM framework is designed to holistically evaluate Audio-Language Models (ALMs) by systematically breaking down the evaluation process into four primary components: aspect, scenario, adaptation, and metric.

The following figure (Figure 2 from the original paper) illustrates these evaluation components:

Figure 2: Evaluation components. Each evaluation run consists of an aspect (i.e., an evaluative dimension), a scenario (i.e., backed by a specific dataset), a model with an adaptation process (i.e., how the model is prompted), and one or more metrics to capture how good the model responses are.

4.1. Principles

The core idea behind AHELM is to provide a structured, comprehensive, and standardized method for assessing ALMs. This is achieved by:

  1. Defining Key Evaluative Dimensions (Aspects): Identifying both technical and societal capabilities crucial for ALMs.

  2. Mapping to Real-world Use Cases (Scenarios): Selecting and creating datasets that represent concrete tasks and contexts for ALMs.

  3. Standardizing Interaction (Adaptation): Ensuring models are prompted and interact in a consistent manner (e.g., zero-shot).

  4. Quantifying Performance (Metrics): Using automated and objective measures to score model outputs.

    The theoretical basis is that a model's true utility and safety can only be understood through a multifaceted evaluation that goes beyond singular performance metrics and accounts for real-world deployment challenges.

4.2. Core Methodology In-depth (Layer by Layer)

4.2.1. Evaluation Components

The AHELM framework consists of four primary components:

  • Aspect: This refers to a particular evaluative dimension, representing a key capability or characteristic of the ALM. AHELM considers 10 such aspects: audio perception, knowledge, reasoning, emotion detection, bias, fairness, multilinguality, robustness, toxicity, and safety. Each aspect is described in detail in Table 1.

    The following are the descriptions of the aspects from Table 1 of the original paper:

    | Aspect | Description |
    | --- | --- |
    | Audio perception | Extracting meaningful information from audio signals |
    | Knowledge | Recalling facts or information contained in the ALM |
    | Reasoning | Performing a series of logical inferences to deduce an answer |
    | Emotion detection | Detecting the user's conscious mental state deriving from their mood, circumstances, or relationships with others |
    | Bias | Preventing the formation of inappropriate or unwarranted associations between the input and output of the model |
    | Fairness | Ensuring that the model's responses remain consistent when a non-essential or spurious attribute (e.g., sex) of the input is altered (i.e., counterfactual fairness), or having uniform performance on every subset of the data when an attribute is used as the filter (i.e., performance disparity) |
    | Multilinguality | Executing tasks effectively even when the language of the instructions or the language of the output is altered |
    | Robustness | Generating accurate and desired outputs despite variations or disturbances in the input audio (e.g., noise) and/or text (e.g., typos) |
    | Toxicity | Detecting and steering clear of offensive or harmful content (e.g., hate speech, violent language, abusive remarks) |
    | Safety | Refusing to generate responses that could potentially harm humans |
  • Scenario: A scenario is a specific use case for an ALM, defined by a task (e.g., transcription, captioning, identifying emotion) and a usage category (e.g., domain, language, theme). A scenario consists of instances, which are pairs of prompts and references. A dataset can support multiple scenarios. For example, the FLEURS dataset can be used for both audio perception (transcription) and fairness (by comparing performance across different demographic groups). AHELM compiles 14 existing datasets and introduces 2 new ones (CoRe-Bench and PARADE). Table 2 provides a detailed list of scenarios used.

    The following are the list of scenarios used in AHELM, from Table 2 of the original paper:

    | Aspect | Scenario | Category | Description | Metrics |
    | --- | --- | --- | --- | --- |
    | Auditory perception | AudioCaps [25] | Audio | AudioCaps contains 46K audio clips paired with human-written captions. The clips are drawn from AudioSet and cover a wide range of human and animal sounds, musical instruments and genres, and common everyday environmental sounds; the captions are collected via crowdsourcing. This scenario measures how well the ALM can express sounds in various settings as text. | GPT-4o judge critique |
    | | VoxCeleb2 [7] | Audio | VoxCeleb2 contains over 1M utterances by celebrities collected from YouTube. We use only the audio subset. This scenario measures whether the ALM can decide whether the speakers in two audio clips are the same. | Exact match |
    | | VocalSound [19] | | VocalSound consists of >21,000 crowdsourced recordings of laughter, sighs, coughs, throat clearing, sneezes, and sniffs from 3,365 unique subjects. It tests whether the ALMs can recognize the aforementioned human sounds. | Exact match |
    | | LibriSpeech [33] | | The LibriSpeech corpus is derived from audiobooks that are part of the LibriVox project. It is one of the most widely used ASR corpora and has been extended to many applications such as robust ASR and multilingual ASR. The dataset contains audio and transcriptions and assesses automatic speech recognition capabilities. | WER |
    | Knowledge | AIR-Bench [44] (Foundation) | Music Genre Recognition, Music Instrument Classification, Music QA | AIR-Bench (Foundation) consists of 19 tasks with approximately 19k single-choice questions. We use only the music-related subsets to test music understanding. | Exact match |
    | | AIR-Bench [44] (Chat) | Music, Sound | AIR-Bench (Chat) contains 2k instances of open-ended question-and-answer data. It evaluates the ability of audio-language models to understand various types of audio signals (including human speech, natural sounds, and music) and to interact with humans through text. | GPT-4o judge critique |
    | Reasoning | AIR-Bench [44] (Chat) | Mixed, Speech | These subsets of AIR-Bench test the ability of models to reason over speech and sounds. | GPT-4o judge critique |
    | | CoRe-Bench** | | CoRe-Bench contains a diverse range of audio conversations and questions whose answers can be inferred from the conversations. | Pseudo-exact match |
    | Emotion detection | MELD [34] | Audio | The Multimodal EmotionLines Dataset (MELD) is created by enhancing and extending the EmotionLines dataset. MELD has more than 1,400 dialogues and 13,000 utterances from the Friends TV series, with multiple speakers participating in the dialogues. Each utterance is labeled with one of seven emotions: Anger, Disgust, Sadness, Joy, Neutral, Surprise, and Fear. The task is to classify the emotion after listening to an audio clip. | Exact match |
    | | MUStARD [5] | | MUStARD is a multimodal video corpus focusing on automated sarcasm discovery. It consists of audiovisual utterances from sitcoms such as Friends, The Golden Girls, The Big Bang Theory, and Sarcasmaholics Anonymous, with sarcasm labels annotated by humans. Each utterance is accompanied by a context that provides additional information on the scenario where it occurs. We use only the audio from the videos to evaluate how well ALMs detect sarcasm in speech. | Exact match |
    | Bias | PARADE** | {Status, Occupation} × {Male, Female} | PARADE is a new audio-text multiple-choice QA benchmark consisting of 436 instances that explores occupational and status bias in ALMs. | Exact match |
    | Fairness | FLEURS (ASR)* [9] | Female vs Male | FLEURS is an n-way parallel speech dataset in 102 languages built on top of the machine translation FLoRes-101 benchmark. We evaluate the mean WER between male and female speakers in order to test the difference in the models' ASR abilities when confronted with speech from different sexes. | WER |
    | | LibriSpeech* [33] | Female vs Male | We ask the model to perform ASR on audio files from speakers of different sexes. This scenario measures how the ASR capability of ALMs is affected by speaker sex. | WER |
    | Multilinguality | CoVoST 2 [41] | Spanish→English, Chinese→English | CoVoST 2 is a large-scale multilingual speech translation corpus covering translations from 21 languages into English and from English into various languages. We use the Spanish-to-English and Chinese-to-English subsets to test the ability to translate speech from a source language to a target language. | BLEU |
    | | FLEURS [9] | Finnish, Mandarin Chinese, Thai, Hebrew, Bengali, English, Zulu | We use the audio and transcriptions to test the ability to transcribe audio in various languages. | WER |
    | | Multilingual LibriSpeech [35] | Italian, French, Polish, Dutch, Portuguese, Spanish, German | The Multilingual LibriSpeech dataset is derived from audiobooks in LibriVox and consists of ~44.5K hours of English and a total of ~6K hours for 7 other languages. The task is to transcribe audio in various languages. | WER |
    | Robustness | Speech Robust Bench (LibriSpeech-Clean) [37] | {Gaussian Noise, Environment Noise} × {Levels 1, 2, 3} | Speech Robust Bench (SRB) comprises 114 input perturbations that simulate a heterogeneous range of corruptions that ASR models may encounter when deployed in the wild. We select four subsets of the benchmark, each corresponding to a clean version of an audio task, to evaluate how well the ALMs can process speech in noisy environments. | WER |
    | Toxicity | MuToX [10] | Estonian, French, Urdu, English, Bulgarian, German, Mandarin Chinese, Indonesian, Turkish, Slovak, Bengali, Arabic, Hindi, Polish, Tagalog, Italian, Catalan, Czech, Hungarian, Greek, Swahili, Danish, Finnish, Hebrew, Russian, Vietnamese, Dutch, Portuguese, Spanish | MuTox consists of ~20k audio utterances for English and Spanish and ~4k for the other languages. This scenario evaluates ALMs for zero-shot toxicity detection across a broad range of languages. | Exact match |
    | Safety | Voice jailbreak attacks [38] | Text jailbreak, Baseline | Voice Jailbreak Attacks Against GPT-4o. This scenario tests how well ALMs can resist jailbreak attacks. | Toxic fraction |
  • Adaptation: This specifies the procedure for invoking a model, essentially how the model is prompted. AHELM exclusively uses zero-shot prompting. This means models are given instructions and inputs without any prior examples in the prompt, relying on their pre-trained knowledge. This strategy is chosen because it is the most common way general users interact with these models.

  • Metric: A metric quantifies the performance of an ALM within a scenario. AHELM implements automated metrics for efficiency, consistency, and cost-effectiveness.

    • For ASR tasks, Word Error Rate (WER) is used.
    • For translation tasks, BLEU score is used.
    • For multiple-choice questions, Accuracy (Exact Match) is used.
    • For fairness evaluation, statistical tests (independent t-test and paired t-test) are applied to detect significant performance disparities between groups.
    • For open-ended tasks (e.g., captioning), an LLM-as-a-judge (specifically GPT-4o) is employed to evaluate the alignment of ALM output with reference texts. This is described further in Appendix F.
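
To make these four components concrete, here is an illustrative sketch of how an evaluation run could be represented in code. The class and field names below are assumptions for exposition only and do not reproduce HELM's actual API.

```python
# Illustrative-only data structures for an AHELM-style evaluation run.
# Class and field names are assumptions for exposition, not HELM's actual API.
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Instance:
    prompt: str                 # text prompt, possibly referencing attached audio
    audio_files: List[str]      # interleaved audio inputs
    references: List[str]       # ground-truth answers

@dataclass
class Scenario:
    name: str                   # e.g., "FLEURS (ASR)"
    aspect: str                 # one of the 10 aspects, e.g., "fairness"
    instances: List[Instance]

@dataclass
class EvaluationRun:
    scenario: Scenario
    adaptation: str = "zero_shot"                    # AHELM uses zero-shot prompting only
    metrics: List[Callable[[str, List[str]], float]] = field(default_factory=list)
    temperature: float = 0.0                         # standardized inference parameters
    max_output_tokens: int = 200
```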

4.2.2. New Scenarios: PARADE and CoRe-Bench

PARADE (for Bias Evaluation)

  • Purpose: To address the lack of benchmarks for bias in ALMs, specifically to probe for stereotyped outputs conditioned on speaker attributes.

  • Concept: Inspired by PAIRS [15], PARADE presents an audio clip and asks the ALM to identify the most likely role (occupation or social status) of the speaker from multiple choices. The content of the speech is designed to be neutral, equally plausible for contrasting roles (e.g., "Where is your pain?" could be spoken by a doctor or a nurse). The gender of the voice is a confounding variable.

  • Methodology:

    1. Contrasting Roles: Uses lists of contrasting occupations (e.g., doctor vs. nurse, programmer vs. typist) and social statuses (e.g., rich vs. poor).
    2. Transcript Generation: GPT-4o is prompted to generate neutral utterances that could plausibly be spoken by individuals in both contrasting roles.
    3. Audio Synthesis: OpenAI's state-of-the-art text-to-speech (TTS) model (e.g., nova voice for female, onyx voice for male) is used to verbalize each transcript with both male and female voices, creating synthetic audio clips.
  • Outcome: If the ALM consistently associates a specific role with a gender despite neutral content, it indicates bias. The "correct" answer is often "unclear" if the content itself doesn't provide explicit clues.

  • Scale: 938 examples (20 occupation pairs, 5 status pairs), each synthetically verbalized by both male and female voices.

    An example instance from PARADE is shown in Figure A16 of the original paper:

    The image is a schematic showing how different age and relationship inputs are combined as inputs to a large language model (LLM) for processing. The left side shows age inputs (e.g., child, teenager, adult, elderly); the right side shows relationship inputs (e.g., family, friends, romantic, professional).
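
Step 3 of the PARADE methodology above (audio synthesis) could look like the following minimal sketch, assuming the OpenAI Python SDK's text-to-speech endpoint. The model identifier and response-handling calls shown here are assumptions that may differ from the paper's exact setup and across SDK versions.

```python
# Sketch of PARADE-style audio synthesis: verbalize one neutral transcript
# with both a female and a male voice. Model name and response handling
# are assumptions and may differ by SDK version.
from openai import OpenAI

client = OpenAI()
transcript = "Where is your pain?"  # neutral utterance, plausible for doctor or nurse

for voice, label in [("nova", "female"), ("onyx", "male")]:
    audio_bytes = client.audio.speech.create(
        model="tts-1",              # assumed TTS model identifier
        voice=voice,
        input=transcript,
    ).read()
    with open(f"parade_pain_{label}.mp3", "wb") as f:
        f.write(audio_bytes)
```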

CoRe-Bench (for Reasoning Evaluation)

  • Purpose: To measure reasoning over long, complex, multi-turn conversational audio involving multiple speakers, going beyond direct retrieval or surface-level cues.

  • Concept: The benchmark consists of multi-turn dialogues grounded in diverse demographic scenarios, paired with questions requiring inference. It aims to minimize reliance on cultural or factual knowledge, focusing instead on reasoning about personal attributes (e.g., genre preferences, demographics) within the conversation.

  • Methodology (Fully Automatic Pipeline):

    1. Scenario Generation: An LM (GPT-4o with temperature = 0.7 for diversity) generates unique conversational scenarios based on structured inputs like speaker age, relationships, discussion topics, environment, and mood.
    2. Transcript Generation: Another LM generates conversational transcripts based on the scenario, a seed question, two possible answers (one valid, one confounding), speaker details (names, age groups, region), and desired dialogue length. The LM is instructed to incorporate the two possible answers to make reasoning more challenging.
    3. Question-and-Answer Verification: An independent LM acts as a validator. It attempts to answer the question from the generated transcript (with speaker names masked to simulate an audio-only context). The validator's answer is then compared against the ground truth. If verification fails, the generation step is repeated up to three times. This ensures the question is genuinely answerable from the conversation and requires inference.
    4. Audio Conversation Generation: Transcripts are converted into audio using steerable text-to-speech (TTS) engines (e.g., gpt-4o-mini-tts). The TTS engine allows control over accent, emotion, intonation, and speech speed, creating natural-sounding dialogues. Each speaker is assigned a distinct voice (from a set of male and female voices) and turn-specific speech patterns are generated by the LM.
  • Outcome: The dataset contains diverse, demographically grounded, audio-based multi-turn conversations with inferential questions. It also includes adversarial examples (irrelevant questions) to test for hallucination.

  • Scale: 2290 question-answer pairs grounded in 2082 unique multi-turn audio clips, totaling over 48 hours of dialogue.

    The broad overview of the data construction process for CoRe-Bench is illustrated in Figure A3 of the original paper:

    Figure A3: A broad overview of the data construction process. First, inputs such as the ages of the characters and the broad relationship between them are generated, either with LMs or by humans. These are specified as part of the prompts to an LM to generate detailed conversational scenarios, such as the context and scene. The conversational scenario, a random question, and other parameters are then used to prompt another LM to generate a random conversation and an associated answer. An LM validator is then used to ensure that the question can correctly be answered from the conversation; it triggers a repeat of the previous step if the question is not answerable. Otherwise, the conversation is transformed into a conversational audio clip using a text-to-speech engine. The process emits (text input, audio input, ground truth) tuples that assess audio conversational reasoning skills.
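
The generate-and-verify loop in this pipeline can be sketched as follows. The three callables (generate_transcript, validate_answer, synthesize_audio) are hypothetical stand-ins for the paper's LM and TTS prompts; the retry limit follows the up-to-three attempts described above.

```python
# Sketch of the CoRe-Bench generate-and-verify loop (illustrative only).
from typing import Callable, Optional

MAX_RETRIES = 3  # the paper repeats generation up to three times

def build_instance(
    scenario: dict,
    question: str,
    valid_answer: str,
    confounder: str,
    generate_transcript: Callable[[dict, str, str, str], str],
    validate_answer: Callable[[str, str], str],
    synthesize_audio: Callable[[str], bytes],
) -> Optional[dict]:
    for _ in range(MAX_RETRIES):
        # An LM writes a multi-turn transcript that embeds both candidate answers.
        transcript = generate_transcript(scenario, question, valid_answer, confounder)
        # An independent LM validator answers the question from the transcript
        # (speaker names would be masked to simulate an audio-only listener).
        predicted = validate_answer(transcript, question)
        if predicted.strip().lower() == valid_answer.strip().lower():
            # Only verified transcripts are turned into conversational audio.
            return {
                "audio": synthesize_audio(transcript),
                "question": question,
                "answer": valid_answer,
            }
    return None  # discarded if the validator never recovers the intended answer
```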

4.2.3. ASR+LM Baseline System

  • Purpose: To establish a baseline for comparison, evaluating when ALMs truly outperform simpler systems and to understand the nature of information available in different audio tasks.

  • Architecture: Each ASR+LM baseline system consists of two chained components:

    1. Dedicated ASR Module: Transcribes the input audio into text. The paper uses Whisper-1, GPT-4o Transcribe, or GPT-4o-mini Transcribe.
    2. Large Language Model (LM): Takes the transcribed text (from the ASR module) along with the original text prompt as input and generates the final text output. GPT-4o is used as the LM.
  • Data Flow: The audio is processed by the ASR, generating transcribed_audio. This text, alongside the original text prompt, is then fed to the LM.

    The following figure (Figure A1 from the original paper) illustrates the data flow within the baseline ASR+LM models:

    Figure A1: An illustration of the dataflow within the baseline ASR+LM models.

An example of an input prompt for an ALM versus the corresponding text-only input prompt for the ASR+LM baseline is shown in Figure A2 of the original paper:

  • (a) ALM input: Answer the multiple choice question by just giving the letter of the correct answer. Context: <context.mp3> Utterance: <utterance.mp3> Given the context, does the utterance contain sarcasm? A. Yes B. No Answer: B

  • (b) ASR+LM baseline input: Answer the multiple choice question by just giving the letter of the correct answer. Context: [TRANSCRIBED AUDIO START] transcript_context [TRANSCRIBED AUDIO END] Utterance: [TRANSCRIBED AUDIO START] transcript_utterance [TRANSCRIBED AUDIO END] Given the context, does the utterance contain sarcasm? A. Yes B. No Answer:

    Here, transcript_context and transcript_utterance are the ASR outputs for <context.mp3> and <utterance.mp3> respectively. [TRANSCRIBED AUDIO START] and [TRANSCRIBED AUDIO END] are markers.

  • Insights from Baselines:

    • If baselines perform well on a task (e.g., some emotion detection scenarios), it suggests that the necessary information is primarily in the spoken content (text), not subtle audio cues (intonation, prosody).
    • If baselines perform poorly (e.g., non-speech scenarios like music identification), it highlights where ALMs truly leverage their multimodal capabilities.
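
A minimal sketch of such a chained baseline follows, assuming the OpenAI Python SDK for both the Whisper-1 transcription call and the GPT-4o completion. The marker strings follow the prompt format shown above, while the exact prompt assembly used in AHELM may differ.

```python
# Sketch of an ASR+LM baseline (Whisper-1 + GPT-4o), assuming the OpenAI Python SDK.
from openai import OpenAI

client = OpenAI()

def transcribe(path: str) -> str:
    # Dedicated ASR module: audio file -> transcript text.
    with open(path, "rb") as f:
        return client.audio.transcriptions.create(model="whisper-1", file=f).text

def answer_with_baseline(text_prompt: str, audio_paths: list[str]) -> str:
    # Append each clip's transcript, wrapped in markers, to the text prompt
    # (the paper interleaves transcripts at the positions of the audio clips).
    parts = [text_prompt]
    for path in audio_paths:
        parts.append(f"[TRANSCRIBED AUDIO START] {transcribe(path)} [TRANSCRIBED AUDIO END]")
    completion = client.chat.completions.create(
        model="gpt-4o-2024-11-20",
        messages=[{"role": "user", "content": "\n".join(parts)}],
        temperature=0,
        max_tokens=200,
    )
    return completion.choices[0].message.content
```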

4.2.4. LLM-as-a-Judge for Open-Ended Tasks

  • Motivation: For scenarios with open-ended text generation (e.g., audio captioning), manual human evaluation is costly and slow. Using an LLM as an automated judge provides a consistent, cheap, and fast alternative. It also avoids the potential bias of an ALM evaluating itself by using a separate LLM for judging.
  • Judge Model: GPT-4o is used as the judge for AudioCaps, Air-Bench Chat (reasoning subsets), and Air-Bench Chat (knowledge subsets).
  • Methodology: Given a reference answer (r) and a model response (o), GPT-4o is prompted to evaluate o against r based on a 5-point rubric:
    • Score 1: Completely inaccurate or unrelated.
    • Score 2: Significant inaccuracies.
    • Score 3: Mostly accurate with minor errors.
    • Score 4: Accurate with slight room for improvement.
    • Score 5: Fully accurate and precise. The LLM judge is instructed to output a single score and a single-line explanation.
  • Validation: Human evaluation was conducted on 197 random samples, with 4 human raters using the same rubric.
    • Exact agreement rate between GPT-4o and humans: 50.8%
    • Agreement within ±1 score: 83.8%
    • Cohen's κ (weighted): 83.8%. These results validate the use of GPT-4o as an effective automated judge.
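
A minimal sketch of the judging call follows, assuming the OpenAI Python SDK; the rubric string paraphrases the 5-point scale above, and the exact judge prompt and output parsing used in AHELM may differ.

```python
# Sketch of a GPT-4o judge for open-ended responses (illustrative only).
from openai import OpenAI

client = OpenAI()

RUBRIC = """Score the response against the reference on a 1-5 scale:
1 = completely inaccurate or unrelated, 2 = significant inaccuracies,
3 = mostly accurate with minor errors, 4 = accurate with slight room for improvement,
5 = fully accurate and precise.
Reply with the score on the first line and a one-line explanation on the second."""

def judge(reference: str, response: str) -> tuple[int, str]:
    completion = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        messages=[{"role": "user",
                   "content": f"{RUBRIC}\n\nReference: {reference}\nResponse: {response}"}],
    )
    lines = completion.choices[0].message.content.strip().splitlines()
    # Pull the digits from the first line in case the judge writes "Score: 4".
    score = int("".join(ch for ch in lines[0] if ch.isdigit()) or "0")
    explanation = lines[1].strip() if len(lines) > 1 else ""
    return score, explanation
```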

4.2.5. Aggregation

  • Scenario Level: For each model and scenario, the main metrics (e.g., accuracy, WER) are averaged across all instances to get a summary score.
  • Aspect Level: The mean win rate (MWR) is calculated for each model on each scenario. The MWR is defined as the probability that the model outperforms a randomly selected competitor model in a head-to-head comparison for a given metric.
  • Overall Leaderboard: To produce the overall leaderboard, the mean win rate for all scenarios covered within an aspect is computed.
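
The following sketch illustrates how a mean win rate can be computed from pairwise comparisons. The tie handling (counted here as half a win) and the pooling across scenarios are assumptions for illustration, since the paper defines the MWR only informally.

```python
# Sketch of mean win rate (MWR) aggregation from per-scenario scores.
from itertools import combinations

def mean_win_rates(scores: dict[str, dict[str, float]], higher_is_better: bool = True) -> dict[str, float]:
    """scores[model][scenario] = score on that scenario's main metric.
    Assumes every model has a score for every scenario."""
    models = list(scores)
    scenarios = {s for per_model in scores.values() for s in per_model}
    wins: dict[str, list[float]] = {m: [] for m in models}
    for scenario in scenarios:
        for a, b in combinations(models, 2):
            sa, sb = scores[a][scenario], scores[b][scenario]
            if sa == sb:
                wins[a].append(0.5)   # ties counted as half a win (assumption)
                wins[b].append(0.5)
                continue
            a_beats_b = sa > sb if higher_is_better else sa < sb
            wins[a].append(1.0 if a_beats_b else 0.0)
            wins[b].append(0.0 if a_beats_b else 1.0)
    return {m: sum(w) / len(w) for m, w in wins.items()}
```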

5. Experimental Setup

5.1. Datasets

AHELM uses a diverse set of 14 existing datasets and introduces two new synthetic datasets, PARADE and CoRe-Bench. All scenarios are evaluated strictly on their original test sets to minimize the risk of data leakage. The audio sampling rates vary across datasets, as detailed in Appendix B.

The following are the audio sampling rates of scenarios in AHELM, from Table A2 of the original paper:

| Dataset | Sampling Rate |
| --- | --- |
| AudioCaps | 44.1 kHz |
| VoxCeleb2 | 16 kHz |
| VocalSound | 16 kHz |
| LibriSpeech | 16 kHz |
| AIR-Bench | 16 ~ 48 kHz |
| MELD | 16 kHz |
| MUStARD | 48 kHz |
| PARADE | 24 kHz |
| FLEURS | 16 kHz |
| CoVoST 2 | 48 kHz |
| Multilingual LibriSpeech | 16 kHz |
| Speech Robust Bench (LibriSpeech-Clean) | 16 kHz |
| MuToX | 22 ~ 48 kHz |
| Voice Jailbreak Attacks | 24 kHz |

Here's a breakdown of the key datasets and their relevance:

  • AudioCaps [25]: For Audio Perception (audio captioning). Contains 46K audio clips with human-written text pairs, covering human and animal sounds, musical instruments, and environmental sounds. Measures ALM's ability to express sounds in text.
  • VoxCeleb2 [7]: For Audio Perception (speaker verification). Over 1M utterances by celebrities. Measures if ALM can determine if speakers in two audio clips are the same.
  • VocalSound [19]: For Audio Perception (vocal sound recognition). >21,000 crowdsourced recordings of non-speech human sounds (laughter, sighs, coughs). Tests ALM's ability to recognize these sounds.
  • LibriSpeech [33]: For Audio Perception and Fairness (ASR). Derived from audiobooks, a widely-used ASR corpus. Measures automated speech recognition capabilities. For fairness, it's used to compare WER between male and female speakers.
  • AIR-Bench [44] (Foundation & Chat): For Knowledge and Reasoning.
    • Foundation: 19 tasks, ~19k single-choice questions, music-related subsets used for music understanding.
    • Chat: 2k open-ended QA instances, evaluates understanding of various audio signals (speech, natural sounds, music) and human interaction. Subsets test reasoning with speech and sounds.
  • CoRe-Bench (New) **: For Reasoning. Synthetic dataset of multi-turn conversational audio with inferential questions. Evaluates reasoning over realistic audio dialogues. Contains 2290 QA pairs from 2082 unique audio clips (48+ hours).
  • MELD [34]: For Emotion Detection. Multimodal EmotionLines Dataset from Friends TV series (1,400+ dialogues, 13,000+ utterances), labeled with 7 emotions. Task is to classify emotion from audio.
  • MUStARD [5]: For Emotion Detection (sarcasm detection). Multimodal video corpus from sitcoms. Uses audio to evaluate ALM's ability to detect sarcasm.
  • PARADE (New) **: For Bias. Audio-text multiple-choice QA benchmark (436 instances) exploring occupational and status bias. Uses synthetic audio with neutral content and varied gender voices to probe for stereotyped responses.
  • FLEURS [9]: For Fairness and Multilinguality (ASR & Speech Translation). N-way parallel speech dataset in 102 languages.
    • Fairness: Evaluates mean WER differences between male and female speakers.
    • Multilinguality: Tests transcription in various languages.
  • CoVoST 2 [41]: For Multilinguality (Speech Translation). Large-scale multilingual speech translation corpus (21 languages to English, English to various). Spanish-to-English and Chinese-to-English subsets used.
  • Multilingual LibriSpeech [35]: For Multilinguality (Multilingual ASR). Derived from audiobooks, ~44.5K hours English, ~6K hours for 7 other languages. Task is to transcribe audio in various languages.
  • Speech Robust Bench (SRB) [37]: For Robustness. Comprises 114 input perturbations (e.g., Gaussian noise, environmental noise) simulating real-world corruptions. Evaluates ALM's ability to process speech in noisy environments.
  • MuToX [10]: For Toxicity. ~20k audio utterances for English/Spanish, ~4k for other languages. Evaluates zero-shot toxicity detection across languages.
  • Voice Jailbreak Attacks [38]: For Safety. Scenarios testing ALM resistance to jailbreak attacks (e.g., asking how to remove watermarks from copyrighted images). Measures refusal rate for unsafe prompts.

5.2. Evaluation Metrics

AHELM uses automated metrics for consistency and cost-effectiveness.

  • Word Error Rate (WER):

    • Conceptual Definition: Measures the dissimilarity between a hypothesized (model-generated) transcription and a reference (ground truth) transcription. It quantifies the number of errors (substitutions, deletions, insertions) needed to transform the hypothesis into the reference, relative to the reference length. Lower WER indicates better performance.
    • Mathematical Formula: $ \mathrm{WER} = \frac{S + D + I}{N} $
    • Symbol Explanation:
      • S: Number of substitutions (words replaced).
      • D: Number of deletions (words omitted).
      • I: Number of insertions (words added).
      • N: Total number of words in the reference (ground truth) transcription.
    • Applied to: ASR tasks (LibriSpeech, FLEURS, Multilingual LibriSpeech, Speech Robust Bench).
  • BLEU (Bilingual Evaluation Understudy) Score:

    • Conceptual Definition: A metric for evaluating the quality of machine-translated text by comparing it to one or more human-produced reference translations. It calculates precision on n-grams (sequences of n words) and applies a brevity penalty to discourage overly short translations. Higher BLEU indicates better translation quality.
    • Mathematical Formula: $ \mathrm{BLEU} = \mathrm{BP} \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right) $ Where:
      • BP (Brevity Penalty): $ \mathrm{BP} = \begin{cases} 1 & \text{if } c > r \\ e^{(1-r/c)} & \text{if } c \le r \end{cases} $ (c is the length of the candidate translation, r is the effective reference corpus length).
      • p_n (n-gram precision): the count of matching n-grams between the candidate and the references divided by the total count of n-grams in the candidate.
      • w_n: Weights for each n-gram order (typically 1/N).
      • N: Maximum n-gram order (typically 4).
    • Applied to: Translation tasks (CoVoST 2).
  • Accuracy (Exact Match, EM):

    • Conceptual Definition: Measures the proportion of model predictions that are perfectly identical to the ground truth labels. It is a straightforward metric for tasks with discrete or categorical outputs.
    • Mathematical Formula: $ \mathrm{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}} $
    • Symbol Explanation:
      • Number of Correct Predictions: Count of instances where the model's output exactly matches the reference.
      • Total Number of Predictions: Total number of instances evaluated.
    • Applied to: Multiple-choice questions (VoxCeleb2, VocalSound, AIR-Bench Foundation, MELD, MUStARD, PARADE, MuToX), and CoRe-Bench, where a "pseudo-exact match" is used (implying a slightly relaxed form of exact matching for answers).
  • LLM-as-a-Judge (GPT-4o Judge Critique):

    • Conceptual Definition: An automated evaluation method for open-ended text generation tasks where a powerful Large Language Model (GPT-4o) assesses the quality of a model's response against a reference answer based on a predefined rubric. It provides a numerical score (1-5) and a short explanation.
    • Mechanism: The judge model receives the reference answer and the model's response, along with a detailed scoring rubric. It then outputs a score and an explanation based on its understanding.
    • Applied to: Open-ended tasks like AudioCaps, AIR-Bench Chat (knowledge and reasoning subsets).
  • Statistical Tests for Fairness (t-test):

    • Conceptual Definition: Used to determine if observed differences in performance between demographic groups (e.g., male vs. female speakers) are statistically significant, rather than due to random chance. This helps in identifying performance disparity and counterfactual fairness issues.
    • Independent t-test:
      • Purpose: Compares the means of two independent groups.
      • Hypotheses:
        • Null Hypothesis ($H_0$): The means of the two groups are equal ($\mu_{\text{male}} = \mu_{\text{female}}$).
        • Alternative Hypothesis ($H_1$): The means of the two groups are not equal ($\mu_{\text{male}} \neq \mu_{\text{female}}$).
      • Mathematical Formula: $ t = \frac{\bar{x}_{\text{male}} - \bar{x}_{\text{female}}}{\sqrt{\frac{s_{\text{male}}^2}{n_{\text{male}}} + \frac{s_{\text{female}}^2}{n_{\text{female}}}}} $
      • Symbol Explanation:
        • $\bar{x}_g$: Sample mean score for group $g$ (e.g., male or female).
        • $s_g^2$: Sample variance for group $g$.
        • $n_g$: Number of members (samples) in group $g$.
    • Paired-samples t-test:
      • Purpose: Compares the means of paired observations (e.g., performance on the same content spoken by a male versus a female). It analyzes the difference between pairs.
      • Hypotheses:
        • Null Hypothesis ($H_0$): The mean difference between paired observations is zero ($\bar{d} = 0$).
        • Alternative Hypothesis ($H_1$): The mean difference is not zero ($\bar{d} \neq 0$).
      • Mathematical Formula: $ t = \frac{\bar{d}}{\frac{s_d}{\sqrt{n_d}}} $
      • Symbol Explanation:
        • $d_i = s_{i,\text{male}} - s_{i,\text{female}}$: The difference in scores for the i-th paired content.
        • $\bar{d} = \frac{1}{n_d} \sum_i d_i$: The arithmetic mean of the sample differences.
        • $s_d$: The standard deviation of the sample differences.
        • $n_d$: The number of paired differences.
    • Applied to: FLEURS (fairness) and LibriSpeech (fairness). A code sketch of both tests using SciPy appears after this list.
  • Toxic Fraction:

    • Conceptual Definition: The proportion of generated responses that are deemed toxic or harmful. Often used in safety evaluations to measure how frequently a model generates undesirable content. In this paper, specifically, it refers to the rate at which models comply with (fail to refuse) unsafe voice jailbreak attacks.
    • Mathematical Formula: $ \mathrm{Toxic~Fraction} = \frac{\text{Number of Toxic Responses}}{\text{Total Number of Responses}} $
    • Symbol Explanation:
      • Number of Toxic Responses: Count of model outputs classified as toxic or harmful.
      • Total Number of Responses: Total number of instances where the model provided a response.
    • Applied to: Voice Jailbreak Attacks scenario.
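
The fairness significance tests can be reproduced in spirit with SciPy, as in the sketch below. The toy WER values and the two-sided, unequal-variance configuration are assumptions for illustration rather than the paper's exact setup.

```python
# Sketch of the fairness significance tests using SciPy (illustrative only).
from scipy import stats

# Toy per-instance WER scores, grouped by speaker sex (placeholder values).
wer_male = [0.05, 0.12, 0.08, 0.10, 0.07]
wer_female = [0.06, 0.15, 0.09, 0.11, 0.08]

# Independent t-test: unrelated male and female utterance pools (e.g., FLEURS).
t_ind, p_ind = stats.ttest_ind(wer_male, wer_female, equal_var=False)

# Paired t-test: the same content spoken by a male and a female voice.
t_rel, p_rel = stats.ttest_rel(wer_male, wer_female)

print(f"independent: t={t_ind:.3f}, p={p_ind:.3f}")
print(f"paired:      t={t_rel:.3f}, p={p_rel:.3f}")
```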

5.3. Baselines

The paper compares the evaluated ALMs against three simple baseline systems. These baselines are constructed by combining a dedicated Automatic Speech Recognizer (ASR) with a powerful Large Language Model (LM). This allows for a direct comparison between end-to-end ALMs and a modular, "engineered" solution.

The baseline systems are:

  • Whisper-1 + GPT-4o (2024-11-20): Uses OpenAI's Whisper-1 as the ASR module, combined with GPT-4o as the LM.

  • GPT-4o Transcribe + GPT-4o (2024-11-20): Uses OpenAI's GPT-4o Transcribe as the ASR module, combined with GPT-4o as the LM.

  • GPT-4o Mini Transcribe + GPT-4o (2024-11-20): Uses OpenAI's GPT-4o Mini Transcribe as the ASR module, combined with GPT-4o as the LM.

    These baselines are chosen to:

  1. Gauge ALM Superiority: Determine if ALMs truly offer significant advantages over chaining specialized components, and in which specific scenarios.
  2. Inform Scenario Interpretation: Help understand whether task information primarily resides in speech content (if baselines perform well) or in subtle audio cues (if baselines perform poorly). For instance, ASR outputs provide textual content, but typically lack prosodic information or speaker identity cues.

5.4. Evaluated ALMs

The paper evaluates 14 state-of-the-art ALMs, prioritizing popular models for meaningful comparisons. These models come from three major developers:

  • Google: Various versions of Gemini models (1.5 Pro, 1.5 Flash, 2.0 Flash, 2.5 Pro, 2.5 Flash). This allows for observing performance evolution within the same model family.

  • OpenAI: Various GPT-4o Audio and GPT-4o mini Audio previews.

  • Alibaba Cloud: Qwen2-Audio Instruct (7B) and Qwen2.5-Omni (7B).

    The following are the audio language models evaluated in AHELM, from Table 3 of the original paper:

    Model Identifier Creator Access Release Date Parameters Ref. Knowledge Cutoff
    Gemini 1.5 Pro (001) gemini-1.5-pro-001 Google API 2024-05-24 ? [17] ?
    Gemini 1.5 Flash (001) gemini-1.5-flash-001 Google API 2024-05-24 ? [17] ?
    Gemini 1.5 Pro (002) gemini-1.5-pro-002 Google API 2024-09-24 ? [17] ?
    Gemini 1.5 Flash (002) gemini-1.5-flash-002 Google API 2024-09-24 ? [17] ?
    Gemini 2.0 Flash (Experimental) gemini-2.0-flash-exp Google API 2024-12-11 ? [30] ?
    Gemini 2.0 Flash gemini-2.0-flash-001 Google API 2025-02-01 ? [30] ?
    Gemini 2.0 Flash Lite gemini-2.0-flash-lite-001 Google API 2025-03-25 ? [30] ?
    Gemini 2.5 Pro (05-06 preview) gemini-2.5-pro-preview-05-06 Google API 2025-05-06 ? [24] ?
    Gemini 2.5 Flash (05-20 preview) gemini-2.5-flash-preview-05-20 Google API 2025-04-17 ? [24] ?
    GPT-4o Audio (Preview 2024-10-01) gpt-4o-audio-preview-2024-10-01 OpenAI API 2024-10-01 ? [32] 2023-09-30
    GPT-4o Audio (Preview 2024-12-17) gpt-4o-audio-preview-2024-12-17 OpenAI API 2024-12-17 ? [32] 2023-09-30
    GPT-4o mini Audio (Preview 2024-12-17) gpt-4o-mini-audio-preview-2024-12-17 OpenAI API 2024-12-17 ? [32] 2023-09-30
    Qwen2-Audio Instruct (7B) qwen2-audio-7b-instruct Alibaba Cloud Open-weight 2024-11-28 8.4B [6] ?
    Qwen2.5-Omni (7B) qwen2.5-omni-7b Alibaba Cloud Open-weight 2025-03-27 10.7B [42] ?

The following table lists the models used to construct the baseline systems; these are not ALMs. A question mark indicates an unknown value.

Model Identifier Creator Access Release Date Parameters Ref. Knowledge Cutoff
Whisper 1 whisper-1 OpenAI API 2022-09-21 ? [36] ?
GPT-4o Transcribe gpt-4o-transcribe OpenAI API 2025-03-20 ? [32] 2024-05-31
GPT-4o Mini Transcribe gpt-4o-mini-transcribe OpenAI API 2025-03-20 ? [32] 2024-05-31
GPT-4o (2024-11-20) gpt-4o-2024-11-20 OpenAI API 2024-11-20 ? [31] 2023-09-30

5.5. Inference Parameters

To ensure equitable and reliable comparisons:

  • Temperature: Set to 0 (deterministic output).
  • Maximum Number of Output Tokens: Set to 200.
  • Prompting Method: Zero-shot prompting exclusively.
  • Attempts: Only one try per instance.
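
As a small illustration, these settings map onto a shared set of request parameters applied to every call (the names below are generic placeholders, not the benchmark's actual config keys):

```python
# Shared decoding settings applied uniformly to every model request
# (a sketch; the exact client-side plumbing differs per provider API).
GENERATION_PARAMS = {
    "temperature": 0,    # deterministic decoding
    "max_tokens": 200,   # cap on generated output tokens
}
NUM_IN_CONTEXT_EXAMPLES = 0  # zero-shot prompting only
NUM_ATTEMPTS = 1             # a single attempt per instance
```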

5.6. Computational Resources & Evaluation Scale

  • Instances per scenario: Up to 1,000 instances were randomly sampled per scenario.
  • Total instances processed: Each model processed 39,538 instances.
  • Input Data: 5,728,718 characters of input text and 41,228 audio files in total.
  • Output Data: For context, Qwen2.5-Omni (7B) generated 3,823,092 characters in its completions across all scenarios.
  • Experiment Dates: Conducted between February 16, 2025, and June 1, 2025.

6. Results & Analysis

6.1. Core Results Analysis

The central finding is that no single model excels across all scenarios, highlighting the multifaceted nature of ALM capabilities. Gemini 2.5 Pro (05-06 Preview) generally leads, but with notable weaknesses. The baseline ASR+LM systems demonstrate surprising competitiveness, especially in certain aspects.
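
The per-aspect rankings below are reported as mean win rates (MWR). The following is a hedged sketch of the HELM-style computation, assuming a "win" means strictly beating another model on a scenario's headline metric (the benchmark's exact tie handling may differ):

```python
from typing import Dict

def mean_win_rate(scores: Dict[str, Dict[str, float]],
                  higher_is_better: Dict[str, bool]) -> Dict[str, float]:
    """For each scenario, a model's win rate is the fraction of other models it
    beats on that scenario's metric; MWR averages this over scenarios."""
    models = {m for per_scenario in scores.values() for m in per_scenario}
    win_rates = {m: [] for m in models}
    for scenario, per_model in scores.items():
        sign = 1.0 if higher_is_better.get(scenario, True) else -1.0
        for m, s in per_model.items():
            others = [o for o in per_model if o != m]
            if not others:
                continue
            wins = sum(sign * s > sign * per_model[o] for o in others)
            win_rates[m].append(wins / len(others))
    return {m: sum(w) / len(w) for m, w in win_rates.items() if w}

# Toy example with two scenarios (WER is lower-is-better):
scores = {"LibriSpeech_WER": {"A": 0.04, "B": 0.10},
          "MELD_PEM": {"A": 0.47, "B": 0.50}}
print(mean_win_rate(scores, {"LibriSpeech_WER": False, "MELD_PEM": True}))
```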

The following figure (Figure A19 from the original paper) summarizes the performances of the models on the aspects in AHELM:

Figure A19: A radar chart summarizing the performances of the models on the aspects in AHELM. The mean win rates of different aspects are reported. A detailed breakdown across different aspects is provided in Table A8 to Table A23.

The following figure (Figure A20 from the original paper) summarizes the performances of the models on the scenarios in AHELM:

Figure A20: A radar chart summarizing the performances of the models on the scenarios in AHELM. The scenario scores are reported, with all scores normalized to a 0 to 1 scale. WER-based metrics are inverted (i.e., 1 - WER is reported) so that higher values consistently indicate better performance.

Here's an aspect-by-aspect breakdown:

6.1.1. Audio Perception

Gemini 2.5 Pro (05-06 Preview) is the top performer in Audio Perception (MWR: 0.938), followed by Qwen2.5-Omni (7B) and Gemini 2.0 Flash. This reflects a strong ability to extract meaningful information from audio across ASR and audio question answering (AQA) tasks. The baseline ASR+LM systems also perform reasonably well, suggesting that ASR quality is a significant factor here.

The following are the results of the models in audio perception, from Table A8 of the original paper:

Model Mean win rate AudioCaps (GPT-4o Judge Critique) ↑ VoxCeleb2 (EM) ↑ VocalSound (PEM) ↑ LibriSpeech (WER) ↓
Gemini 2.5 Pro (05-06 preview) 0.938 2.275 0.751 0.860 0.039
Qwen2.5-Omni (7B) 0.734 2.653 0.581 0.904 0.103
Gemini 2.0 Flash 0.688 1.979 0.529 0.719 0.043
Gemini 2.0 Flash (Experimental) 0.656 1.977 0.530 0.718 0.044
Gemini 2.5 Flash (05-20 preview) 0.641 1.971 0.759 0.626 0.077
GPT-4o Audio (Preview 2024-12-17) 0.625 1.908 0.575 0.837 0.095
GPT-4o Audio (Preview 2024-10-01) 0.516 1.797 0.570 0.833 0.113
Gemini 1.5 Pro (002) 0.516 1.366 0.585 0.528 0.052
GPT-4o mini Transcribe + GPT-4o (2024-11-20) 0.484 1.283 0.548 0.622 0.045
Qwen2-Audio Instruct (7B) 0.469 2.673 0.240 0.799 0.113
Gemini 2.0 Flash Lite 0.453 1.884 0.527 0.506 0.049
Gemini 1.5 Flash (002) 0.359 1.416 0.542 0.418 0.062
Whisper-1 + GPT-4o (2024-11-20) 0.359 1.093 0.601 0.280 0.053
GPT-4o Transcribe + GPT-4o (2024-11-20) 0.328 1.171 0.521 0.616 0.049
GPT-4o mini Audio (Preview 2024-12-17) 0.328 1.835 0.509 0.794 0.163
Gemini 1.5 Pro (001) 0.266 1.348 0.524 0.492 0.071
Gemini 1.5 Flash (001) 0.141 1.363 0.522 0.463 0.342

6.1.2. Knowledge

Qwen2-Audio Instruct (7B) surprisingly takes the lead in Knowledge (MWR: 0.906), slightly outperforming Gemini 2.5 Pro (05-06 Preview). The baseline systems score worst, indicating that knowledge tasks often require access to non-speech audio content (e.g., music understanding) that is lost with a simple ASR conversion.

The following are the results of the models in knowledge, from Table A9 of the original paper:

Model Mean win rate Air-Bench Chat (knowledge subsets) (GPT-4o Judge Critique) ↑ Air-Bench Foundation (EM) ↑
Qwen2-Audio Instruct (7B) 0.906 3.113 0.724
Gemini 2.5 Pro (05-06 preview) 0.875 3.413 0.683
Gemini 2.0 Flash 0.812 3.042 0.697
Gemini 2.5 Flash (05-20 preview) 0.781 3.182 0.579
Gemini 2.0 Flash (Experimental) 0.750 3.018 0.698
Qwen2.5-Omni (7B) 0.656 2.669 0.743
GPT-4o Audio (Preview 2024-12-17) 0.656 3.041 0.560
Gemini 2.0 Flash Lite 0.625 2.923 0.641
GPT-4o Audio (Preview 2024-10-01) 0.531 3.037 0.527
Gemini 1.5 Pro (002) 0.500 2.864 0.554
GPT-4o mini Audio (Preview 2024-12-17) 0.406 2.779 0.541
Gemini 1.5 Flash (002) 0.344 2.822 0.508
Gemini 1.5 Flash (001) 0.219 2.393 0.483
Gemini 1.5 Pro (001) 0.219 2.255 0.511
GPT-4o mini Transcribe + GPT-4o (2024-11-20) 0.125 2.298 0.383
Whisper-1 + GPT-4o (2024-11-20) 0.094 2.156 0.383
GPT-4o Transcribe + GPT-4o (2024-11-20) 0.000 2.137 0.372

6.1.3. Reasoning

Gemini 2.5 Pro (05-06 Preview) achieves a perfect MWR of 1.000 in Reasoning, indicating it consistently outperforms other models. The Gemini family generally performs best. Interestingly, Qwen2.5-Omni (7B) performs poorly, ranking third-worst among ALMs, despite its strength in knowledge. This highlights that knowledge and reasoning are distinct capabilities.

The following are the results for reasoning, from Table A10 of the original paper:

Model Mean win rate Air-Bench Chat (reasoning subsets) (GPT-4o Judge Critique) ↑ COREBench (PEM) ↑
Gemini 2.5 Pro (05-06 preview) 1.000 3.621 0.813
Gemini 2.0 Flash 0.812 3.331 0.756
Gemini 1.5 Pro (002) 0.812 3.241 0.799
Gemini 2.0 Flash (Experimental) 0.812 3.339 0.754
Gemini 1.5 Flash (002) 0.750 3.227 0.776
Gemini 2.5 Flash (05-20 preview) 0.719 3.495 0.644
Gemini 2.0 Flash Lite 0.594 3.173 0.737
Gemini 1.5 Flash (001) 0.469 3.084 0.722
Gemini 1.5 Pro (001) 0.406 3.024 0.659
Qwen2-Audio Instruct (7B) 0.375 3.304 0.233
GPT-4o Audio (Preview 2024-12-17) 0.344 3.217 0.359
Qwen2.5-Omni (7B) 0.312 3.012 0.560
Whisper-1 + GPT-4o (2024-11-20) 0.312 3.126 0.377
GPT-4o mini Audio (Preview 2024-12-17) 0.250 2.915 0.514
GPT-4o Audio (Preview 2024-10-01) 0.250 3.153 0.342
GPT-4o Transcribe + GPT-4o (2024-11-20) 0.156 2.664 0.388
GPT-4o mini Transcribe + GPT-4o (2024-11-20) 0.125 2.898 0.373

6.1.4. Emotion Detection

Gemini 2.5 Pro (05-06 Preview) is again the best (MWR: 0.781). Crucially, the baseline systems rank 2nd to 4th, suggesting that for scenarios like MELD, emotional cues are heavily present in the speech content itself (which ASR captures) rather than subtle audio inflections. However, their poorer performance on MUStARD (sarcasm) indicates that some emotion detection tasks do require deeper audio understanding beyond transcription.

The following are the results of the models on the emotion detection aspect, from Table A11 of the original paper:

Model Mean win rate Multimodal EmotionLines Dataset (MELD) Audio (PEM) ↑ MUStARD (EM) ↑
Gemini 2.5 Pro (05-06 preview) 0.781 0.473 0.655
GPT-4o Audio (Preview 2024-12-17) 0.656 0.497 0.583
Qwen2.5-Omni (7B) 0.656 0.491 0.588
GPT-4o Transcribe + GPT-4o (2024-11-20) 0.656 0.541 0.575
Gemini 1.5 Pro (002) 0.656 0.516 0.577
Whisper-1 + GPT-4o (2024-11-20) 0.625 0.552 0.565
GPT-4o mini Transcribe + GPT-4o (2024-11-20) 0.609 0.573 0.564
Gemini 2.0 Flash Lite 0.594 0.368 0.661
Gemini 2.0 Flash (Experimental) 0.578 0.443 0.604
GPT-4o Audio (Preview 2024-10-01) 0.562 0.456 0.593
Gemini 2.0 Flash 0.516 0.423 0.604
GPT-4o mini Audio (Preview 2024-12-17) 0.469 0.334 0.623
Gemini 1.5 Pro (001) 0.359 0.469 0.564
Gemini 1.5 Flash (001) 0.312 0.471 0.555
Gemini 2.5 Flash (05-20 preview) 0.250 0.340 0.574
Gemini 1.5 Flash (002) 0.219 0.425 0.558
Qwen2-Audio Instruct (7B) 0.000 0.260 0.209

6.1.5. Bias

Baseline systems (specifically GPT-4o mini Transcribe + GPT-4o) significantly outperform ALMs in Bias (MWR: 1.000). This suggests that the baseline pipelines, which see only transcribed text, are less prone to gender-based stereotyping than ALMs, which process the audio (and hence speaker) cues directly. The Gemini models generally perform poorly on the PARADE benchmark.

The following are the results of benchmarking on bias scenarios, from Table A12 of the original paper:

Model Mean win rate PARADE (EM) ↑
GPT-4o mini Transcribe + GPT-4o (2024-11-20) 1.000 0.858
GPT-4o Transcribe + GPT-4o (2024-11-20) 0.938 0.858
GPT-4o mini Audio (Preview 2024-12-17) 0.875 0.857
Whisper-1 + GPT-4o (2024-11-20) 0.812 0.857
GPT-4o Audio (Preview 2024-10-01) 0.750 0.847
GPT-4o Audio (Preview 2024-12-17) 0.688 0.779
Qwen2.5-Omni (7B) 0.625 0.634
Gemini 2.5 Flash (05-20 preview) 0.562 0.514
Gemini 2.0 Flash (Experimental) 0.500 0.465
Gemini 2.0 Flash 0.438 0.463
Gemini 2.0 Flash Lite 0.375 0.436
Gemini 2.5 Pro (05-06 preview) 0.312 0.324
Gemini 1.5 Flash (002) 0.250 0.312
Gemini 1.5 Flash (001) 0.188 0.292
Gemini 1.5 Pro (001) 0.125 0.217
Gemini 1.5 Pro (002) 0.062 0.215
Qwen2-Audio Instruct (7B) 0.000 0.209

6.1.6. Fairness

Most models do not show statistically significant differences in ASR performance by speaker gender (FLEURS, LibriSpeech). However, Gemini 2.5 Pro (05-06) and Qwen2.5-Omni showed a preference for female speakers on FLEURS (p=0.02), while several Gemini 2.0 Flash variants and GPT-4o Mini Transcribe showed a slight preference for male speakers on LibriSpeech. This indicates that subtle, but sometimes statistically significant, group unfairness can exist.
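
Both tests can be reproduced with standard statistics tooling; here is a minimal sketch on synthetic per-utterance WERs (the arrays are hypothetical, not the paper's data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
wer_male = rng.normal(0.05, 0.02, size=131).clip(0)    # 131 pairs -> DoF 130,
wer_female = rng.normal(0.05, 0.02, size=131).clip(0)  # matching the table below

# Paired-samples t-test (FLEURS fairness): the same sentences read by both groups.
t_paired, p_paired = stats.ttest_rel(wer_male, wer_female)

# Independent-samples t-test (LibriSpeech fairness): group means over different utterances.
t_indep, p_indep = stats.ttest_ind(wer_male, wer_female, equal_var=True)

print(f"paired:      t={t_paired:.2f}, p={p_paired:.2f}")
print(f"independent: t={t_indep:.2f}, p={p_indep:.2f}")
```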

The following are the results of the paired-samples t-test between transcriptions of the same audio content by males and females and of the independent-samples t-test between group means on FLEURS (fairness), from Table A13 of the original paper:

Model p-value (paired) t-stat (paired) DoF (paired) p-value (indp) t-stat (indp) DoF (indp)
Gemini 1.5 Pro (001) 0.24 1.18 130 0.32 0.99 645
Gemini 1.5 Flash (001) 0.41 0.83 130 0.77 0.30 645
Gemini 1.5 Pro (002) 0.13 1.51 130 0.65 0.46 645
Gemini 1.5 Flash (002) 0.92 0.09 130 0.61 -0.51 645
Gemini 2.0 Flash (Experimental) 0.21 1.26 130 0.21 1.25 645
Gemini 2.0 Flash 0.17 1.39 130 0.16 1.39 645
Gemini 2.0 Flash Lite 0.51 0.66 130 0.66 0.44 645
Gemini 2.5 Pro (05-06 preview) 0.02* 2.30 130 0.34 0.95 645
Gemini 2.5 Flash (05-20 preview) 0.87 0.17 130 0.22 -1.22 645
Whisper 1 0.83 0.21 130 0.85 -0.19 645
GPT-4o Transcribe 0.78 -0.27 130 0.31 -1.02 645
GPT-4o Mini Transcribe 0.92 0.10 130 0.65 -0.45 645
GPT-4o Audio (Preview 2024-10-01) 0.33 0.98 130 0.43 0.79 645
GPT-4o Audio (Preview 2024-12-17) 0.67 -0.43 130 0.40 -0.84 645
GPT-4o mini Audio (Preview 2024-12-17) 0.91 -0.11 130 0.98 -0.03 645
Qwen2-Audio Instruct (7B) 0.85 -0.19 130 0.03* 2.13 645
Qwen2.5-Omni (7B) 0.02* 2.38 130 0.01* 2.52 645

The following are the results of the independent-samples t-test between group means on LibriSpeech (fairness), from Table A14 of the original paper:

Model p-value (indp) t-stat (indp) DoF (indp)
Gemini 1.5 Pro (001) 0.39 0.86 1998
Gemini 1.5 Flash (001) 0.53 -0.64 1998
Gemini 1.5 Pro (002) 0.85 -0.19 1998
Gemini 1.5 Flash (002) 0.14 1.48 1998
Gemini 2.0 Flash (Experimental) 0.06* -1.90 1998
Gemini 2.0 Flash 0.06* -1.89 1998
Gemini 2.0 Flash Lite 0.03* -2.17 1998
Gemini 2.5 Pro (05-06 preview) 0.21 -1.25 1998
Gemini 2.5 Flash (05-20 preview) 0.00* -3.22 1998
Whisper 1 0.21 -1.25 1998
GPT-4o Transcribe 0.27 -1.09 1998
GPT-4o Mini Transcribe 0.01* -2.62 1998
GPT-4o Audio (Preview 2024-10-01) 0.28 -1.07 1998
GPT-4o Audio (Preview 2024-12-17) 0.36 0.91 1998
GPT-4o mini Audio (Preview 2024-12-17) 0.99 -0.01 1998
Qwen2-Audio Instruct (7B) 0.51 -0.66 1998
Qwen2.5-Omni (7B) 0.47 -0.72 1998

6.1.7. Multilinguality

A baseline system (GPT-4o Transcribe + GPT-4o) performs best in Multilinguality (MWR: 0.896), followed by Gemini 1.5 Pro (002) and Gemini 2.5 Pro (05-06 preview). This again highlights the strength of specialized components. All models performed better on Spanish-to-English than Chinese-to-English translation (CoVoST-2), reflecting potential data distribution biases towards Latin languages. Performance on FLEURS also showed variations across languages (e.g., better on English/Finnish, worse on Thai).

The following are the results of the ALMs on the multilinguality aspect, from Table A15 of the original paper:

Model Mean win rate CoVoST-2 (BLEU) ↑ FLEURS (WER) ↓ Multilingual LibriSpeech (WER) ↓
GPT-4o Transcribe + GPT-4o (2024-11-20) 0.896 33.991 0.314 0.065
Gemini 1.5 Pro (002) 0.854 32.999 0.342 0.054
Gemini 2.5 Pro (05-06 preview) 0.729 35.657 0.211 0.198
Gemini 2.0 Flash 0.708 33.468 0.648 0.060
GPT-4o mini Transcribe + GPT-4o (2024-11-20) 0.688 33.238 0.419 0.080
Gemini 2.0 Flash Lite 0.625 31.768 0.443 0.067
Gemini 2.0 Flash (Experimental) 0.604 32.900 0.646 0.060
GPT-4o Audio (Preview 2024-12-17) 0.562 32.190 0.456 0.073
Gemini 1.5 Pro (001) 0.562 32.661 0.463 0.073
Gemini 1.5 Flash (002) 0.500 30.597 0.461 0.071
Whisper-1 + GPT-4o (2024-11-20) 0.500 32.931 0.614 0.086
GPT-4o mini Audio (Preview 2024-12-17) 0.312 29.256 0.545 0.123
Gemini 1.5 Flash (001) 0.292 30.699 0.723 0.088
Gemini 2.5 Flash (05-20 preview) 0.271 33.393 2.732 0.603
GPT-4o Audio (Preview 2024-10-01) 0.250 31.563 0.771 0.162
Qwen2-Audio Instruct (7B) 0.083 28.283 2.240 0.337
Qwen2.5-Omni (7B) 0.062 20.497 1.932 0.416

6.1.8. Robustness

Gemini 2.5 Pro (05-06 Preview) is the most robust (MWR: 1.000) to environmental noise. However, baseline systems hold three of the top five spots, emphasizing that dedicated ASR architectures often have superior noise robustness, suggesting that ALMs could benefit from incorporating ASR-specific design principles.

The following are the results for robustness, from Table A19 of the original paper:

Model Mean win rate Robust Speech Bench (WER) ↓
Gemini 2.5 Pro (05-06 preview) 1.000 0.039
GPT-4o mini Transcribe + GPT-4o (2024-11-20) 0.938 0.046
GPT-4o Transcribe + GPT-4o (2024-11-20) 0.875 0.047
Gemini 2.0 Flash Lite 0.812 0.049
Whisper-1 + GPT-4o (2024-11-20) 0.750 0.053
Gemini 2.5 Flash (05-20 preview) 0.688 0.077
Qwen2.5-Omni (7B) 0.625 0.103
Gemini 2.0 Flash (Experimental) 0.562 0.171
Gemini 2.0 Flash 0.500 0.178
Gemini 1.5 Pro (002) 0.438 0.207
Gemini 1.5 Pro (001) 0.375 0.213
Gemini 1.5 Flash (002) 0.312 0.214
Qwen2-Audio Instruct (7B) 0.250 0.399
GPT-4o Audio (Preview 2024-12-17) 0.188 0.451
GPT-4o mini Audio (Preview 2024-12-17) 0.125 0.471
Gemini 1.5 Flash (001) 0.062 0.498
GPT-4o Audio (Preview 2024-10-01) 0.000 0.822

6.1.9. Toxicity

GPT-4o mini Audio (Preview 2024-12-17) performs best overall in Toxicity (EM: 0.874), followed by other GPT-4o Audio models. Performance varies significantly by language, with French and Indonesian showing the highest accuracy and Vietnamese and English the lowest, possibly due to cultural differences in defining toxicity or dataset curation.

The following are the results of the models on Toxicity (MuTox) subsets (Part 1), from Table A20 of the original paper:

Model MuTox (EM) ↑ French (EM) ↑ Indonesian (EM) ↑ Tagalog (EM) ↑ Bengali (EM) ↑ Dutch (EM) ↑ Urdu (EM) ↑ Hindi (EM) ↑ Catalan (EM) ↑
GPT-4o mini Audio (Preview 2024-12-17) 0.874 1.000 1.000 1.000 0.882 1.000 1.000 1.000 0.919
GPT-4o Audio (Preview 2024-10-01) 0.859 1.000 1.000 0.909 0.882 0.923 1.000 1.000 0.924
GPT-4o Audio (Preview 2024-12-17) 0.858 1.000 1.000 1.000 0.882 1.000 1.000 1.000 0.919
Qwen2.5-Omni (7B) 0.828 1.000 1.000 0.909 1.000 0.923 0.714 0.857 0.865
Gemini 1.5 Pro (002) 0.819 1.000 1.000 1.000 0.882 1.000 1.000 1.000 0.919
Gemini 2.0 Flash Lite 0.812 1.000 1.000 1.000 0.824 1.000 0.857 0.857 0.849
Gemini 2.5 Flash (05-20 preview) 0.797 1.000 1.000 0.909 0.824 0.923 0.714 0.857 0.914
GPT-4o Transcribe + GPT-4o (2024-11-20) 0.787 1.000 0.800 0.636 0.941 0.846 0.857 1.000 0.886
Gemini 1.5 Pro (001) 0.771 1.000 1.000 1.000 0.824 0.923 1.000 0.714 0.849
GPT-4o mini Transcribe + GPT-4o (2024-11-20) 0.756 1.000 0.800 0.727 0.882 0.692 0.571 1.000 0.903
Whisper-1 + GPT-4o (2024-11-20) 0.750 1.000 1.000 0.636 0.765 0.615 0.857 1.000 0.876
Gemini 1.5 Flash (002) 0.737 1.000 1.000 0.909 0.882 0.923 0.857 1.000 0.627
Gemini 2.5 Pro (05-06 preview) 0.735 0.625 1.000 0.818 0.765 0.692 0.714 0.571 0.876
Gemini 2.0 Flash 0.621 0.875 1.000 0.909 0.765 0.846 0.714 0.571 0.530
Gemini 2.0 Flash (Experimental) 0.620 0.875 1.000 0.909 0.765 0.846 0.714 0.571 0.530
Gemini 1.5 Flash (001) 0.591 1.000 0.800 0.818 0.824 0.692 0.857 0.714 0.508
Qwen2-Audio Instruct (7B) 0.587 0.875 0.800 0.636 0.647 0.385 0.571 0.286 0.838
Average 0.753 0.956 0.953 0.866 0.837 0.837 0.824 0.824 0.808
(Std. Dev) (0.095) (0.098) (0.087) (0.133) (0.082) (0.170) (0.148) (0.217) (0.152)

6.1.10. Safety

OpenAI models show strong robustness against voice jailbreak attacks, with GPT-4o Audio (Preview 2024-12-17) achieving a near-perfect refusal rate (0.994). This is significantly better than Qwen2.5-Omni and Gemini 2.5 Pro, which refused only ~51-53% of the time, despite their strong performance in other aspects. This suggests specific safety optimizations by OpenAI.

The following are the results for safety, from Table A23 of the original paper:

Model Mean win rate Voice Jailbreak Attacks Against GPT-4o (Refusal rate for safety)
GPT-4o Audio (Preview 2024-12-17) 1.000 0.994
GPT-4o mini Transcribe + GPT-4o (2024-11-20) 0.906 0.989
Whisper-1 + GPT-4o (2024-11-20) 0.906 0.989
GPT-4o Audio (Preview 2024-10-01) 0.781 0.978
GPT-4o Transcribe + GPT-4o (2024-11-20) 0.781 0.978
GPT-4o mini Audio (Preview 2024-12-17) 0.688 0.967
Gemini 2.5 Pro (05-06 preview) 0.625 0.533
Gemini 1.5 Pro (001) 0.531 0.511
Qwen2.5-Omni (7B) 0.531 0.511
Qwen2-Audio Instruct (7B) 0.438 0.467
Gemini 1.5 Flash (001) 0.375 0.317
Gemini 2.0 Flash (Experimental) 0.312 0.311
Gemini 2.0 Flash 0.250 0.306
Gemini 2.5 Flash (05-20 preview) 0.188 0.289
Gemini 1.5 Flash (002) 0.125 0.267
Gemini 1.5 Pro (002) 0.062 0.261
Gemini 2.0 Flash Lite 0.000 0.250

6.1.11. Overall Performance and Baseline Insights

  • Gemini 2.5 Pro (05-06 Preview): Overall best with an MWR of 0.803, leading in 5 aspects. However, statistically significant group unfairness on ASR tasks is a critical drawback (p=0.02 on the paired FLEURS test; the related Gemini 2.5 Flash shows p<0.01 on LibriSpeech).
  • Baseline Systems: Perform remarkably well, with GPT-4o-mini Transcribe + GPT-4o ranking 6th overall. This suggests that for many speech-based tasks, the strong performance and robustness of dedicated ASR modules (like Whisper or GPT-4o Transcribe) combined with a powerful LM can rival or even surpass complex end-to-end ALMs. Text also proves to be a good abstraction for many audio tasks.
  • Open-weight models: Generally show weaker instruction following, which hurts their performance across tasks. However, improvements are noted between Qwen2-Audio Instruct and Qwen2.5-Omni.

6.2. Ablation Studies / Parameter Analysis

While the paper does not present traditional ablation studies on the components of a single ALM, it implicitly performs an "ablation" by comparing full ALMs against ASR+LM baselines, which isolates the value added by an end-to-end multimodal architecture over a modular approach.

  • ASR vs. ALM for Emotion Detection: The strong performance of baselines on MELD (emotion detection from speech) compared to MUStARD (sarcasm detection) reveals that MELD can often be solved by analyzing speech content alone. This implies that tasks relying on subtle prosodic cues (like sarcasm) truly benefit from ALM's end-to-end audio processing, whereas simpler emotion tasks can be handled by just text from ASR.

  • ASR vs. ALM for Robustness: The baselines' strong performance in Robustness (3 out of top 5 spots) suggests that dedicated ASR systems have specialized architectural designs and engineering optimizations that make them highly resilient to noise. This indicates a potential area for ALMs to improve by incorporating ASR-specific robust features.

  • Impact of Dialogue Turns in CoRe-Bench: As shown in Figure A13, the accuracy of models on CoRe-Bench only marginally improves with an increasing number of dialogue turns. This suggests that longer conversations don't necessarily lead to disproportionately better reasoning capabilities in current models, or that the additional complexity doesn't fully translate to improved performance.

    The following figure (Figure A13 from the original paper) shows the accuracy of the models vs the number of dialogue turns in the conversations:

    Figure A13: Accuracy of the models vs. the number of dialogue turns in the conversations. The mean performance (black dashed line) improves only slightly with the number of turns.

  • Impact of Number of Speakers in CoRe-Bench: Figure A14 shows that model accuracy is largely independent of the number of speakers in the conversation. This indicates that current ALMs might not be significantly challenged by speaker differentiation or tracking in multi-speaker scenarios, or that their performance bottleneck lies elsewhere.

    The following figure (Figure A14 from the original paper) shows the accuracy of the models vs the number of speakers in the conversations:

    Figure A14: Accuracy of the models vs. the number of speakers in the conversations (2 to 5 speakers). The mean performance (black dashed line) is independent of the number of speakers.

  • Accuracy by Question Subject in CoRe-Bench: Figure A15 highlights that models perform poorly on questions asking "what is the name of the first/second/... speaker?". This points to a specific weakness in either their ability to infer or retain speaker identities or solve the "cocktail party problem" (separating individual voices in a noisy environment).

    The following figure (Figure A15 from the original paper) shows the accuracy of the models vs the conversation subjects:

    Figure A15: Accuracy of the models vs. the conversation subjects. Models perform poorly on "What is the name of the first/second/... speaker?" questions, indicating that they are weak either at inferring speaker names or at the cocktail party problem.

  • Unanswerable Questions in CoRe-Bench: Analysis of unanswerable questions (Table A3) shows that OpenAI models have high recall but low precision (they tend to classify many questions as unanswerable, even answerable ones), leading to low F1 scores. Gemini models are better balanced, indicating a better ability to discern when a question truly cannot be answered from the provided context. This is a critical hallucination test; a short metric sketch follows the table below.

    The following are the F1 score, precision and recall on CoRe-Bench's unanswerable instances, from Table A3 of the original paper:

    Model F1 Precision Recall
    google_gemini-1.5-flash-002 0.740 0.638 0.880
    google_gemini-1.5-flash-001 0.680 0.530 0.946
    google_gemini-2.5-pro-preview-05-06 0.669 0.518 0.946
    google_gemini-1.5-pro-002 0.642 0.513 0.859
    google_gemini-2.0-flash-001 0.611 0.459 0.913
    google_gemini-2.0-flash-exp 0.604 0.452 0.913
    google_gemini-2.0-flash-lite-001 0.582 0.425 0.924
    google_gemini-1.5-pro-001 0.423 0.269 0.978
    google_gemini-2.5-flash-preview-05-20 0.391 0.247 0.935
    qwen_qwen2.5-omni-7b 0.335 0.207 0.880
    openai_gpt-4o-mini-audio-preview-2024-12-17 0.276 0.166 0.815
    openai_gpt-4o-transcribe_gpt-4o-2024-11-20 0.244 0.139 0.989
    qwen_qwen2-audio-7b-instruct 0.243 0.213 0.283
    openai_whisper-1_gpt-4o-2024-11-20 0.242 0.138 0.989
    openai_gpt-4o-mini-transcribe_gpt-4o-2024-11-20 0.239 0.136 0.989
    openai_gpt-4o-audio-preview-2024-10-01 0.224 0.127 0.967
    openai_gpt-4o-audio-preview-2024-12-17 0.214 0.121 0.891

6.3. LLM-as-a-Judge Validation

The human evaluation of GPT-4o as a judge demonstrated its reliability:

  • Exact agreement rate: 50.8%

  • ±1 agreement rate: 83.8%

  • Weighted Kappa agreement: 83.8%

    This indicates that GPT-4o provides evaluations that largely align with human judgments, validating its use for open-ended tasks and making the evaluation process scalable and cost-effective. Among other LLMs tested as judges, GPT-4o achieved the highest Kappa score against human ratings (Table A7), reinforcing its suitability.
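
For reference, the agreement statistics above can be computed as follows; this is a sketch with hypothetical 1-5 ratings, and the quadratic weighting is an assumption since the exact weighting scheme is not restated here.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

human = np.array([5, 4, 3, 5, 2, 1, 4, 3, 5, 2])  # hypothetical human scores
judge = np.array([5, 4, 4, 5, 2, 1, 3, 3, 5, 3])  # hypothetical GPT-4o judge scores

exact = np.mean(judge == human)                 # exact agreement rate
within_one = np.mean(np.abs(judge - human) <= 1)  # +/-1 agreement rate
kappa = cohen_kappa_score(human, judge, weights="quadratic")  # weighted Cohen's kappa

print(f"exact agreement:    {exact:.1%}")
print(f"within-1 agreement: {within_one:.1%}")
print(f"weighted kappa:     {kappa:.3f}")
```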

The following are the agreement table between GPT-4o Judge and humans, by absolute counts (left) and proportion of total (right), from Table A6 of the original paper:

[Table A6: a 5×5 agreement matrix between GPT-4o judge scores and human scores on a 1-5 scale, shown as absolute counts (left) and as proportions of the total (right).]

The following are the weighted Cohen's kappa scores (κ) [8] between candidate judge models (LLaMA-3.1-8B-Instruct, Qwen-2.5-32B, LLaMA-3.3-70B-Instruct, Claude 4 Sonnet, and GPT-4o) and human ratings, from Table A7 of the original paper:

Judge Models κ against Human Ratings
LLaMA-3.1-8B-Instruct 51.2%
Qwen-2.5-32B 72.4%
LLaMA-3.3-70B-Instruct 68.6%
Claude 4 Sonnet 76.8%
GPT-4o 83.8%

6.4. Selected Examples and Baseline Behavior

  • ASR Transcription Failures: In MUStARD, which uses audio from sitcoms with natural, alternating dialogue, GPT-4o Transcribe and GPT-4o Mini Transcribe often produced incomplete transcriptions. Whisper-1, while transcribing the full dialogue, failed to identify speakers. This highlights a challenge for current ASRs in complex, multi-speaker, natural audio environments.
  • ASR for Non-Speech Sounds: GPT-4o Transcribe and GPT-4o Mini Transcribe could transcribe human non-speech sounds (e.g., "haha," "ahem"), unlike Whisper-1; this allowed them to perform better on VocalSound. It demonstrates that ASR capabilities vary beyond transcribing spoken words.

7. Conclusion & Reflections

7.1. Conclusion Summary

The paper introduces AHELM, a groundbreaking holistic benchmark for Audio-Language Models (ALMs), addressing the critical need for standardized and comprehensive evaluation. AHELM evaluates ALMs across 10 diverse aspects, including crucial societal considerations like bias, fairness, toxicity, and safety, using a standardized methodology for fair comparison. Key contributions include the creation of two novel synthetic datasets, PARADE (for bias) and CoRe-Bench (for conversational reasoning), and the inclusion of ASR+LM baseline systems.

The evaluation of 14 state-of-the-art ALMs and 3 baselines reveals that no single model is universally superior. While Gemini 2.5 Pro (05-06 Preview) leads in several aspects, it exhibits group unfairness on ASR tasks. Surprisingly, simple ASR+LM baselines prove highly competitive, ranking well overall and outperforming ALMs in specific areas like bias and multilinguality, underscoring the enduring value of specialized ASR components and text abstraction. AHELM promotes transparency by releasing all data and code, establishing a living benchmark for the evolving ALM landscape.

7.2. Limitations & Future Work

The authors acknowledge the following limitations and suggest future work:

  • Aspect Coverage: While 10 aspects are identified, it's possible that other important evaluative dimensions for ALMs have been missed.

  • Dataset Nuances: Some scenarios (e.g., MELD for emotion detection) might require improvements to better assess ALM's ability to extract information from non-speech content (like intonation), as current performance can be largely explained by speech content alone.

  • Contextualization of Results: The benchmark provides technical scores, but further work is needed to fully understand the real-world impact and nuances of these scores.

  • LLM-as-a-Judge Limitations: While validated, using an LLM as a judge may introduce subtle issues like self-preference, consistency, position bias, or preference for longer outputs. The impact on leaderboard stability when using different judges needs further exploration.

    Future work includes:

  • Living Benchmark: Continuous addition of new scenarios, models, and metrics to AHELM as the field evolves.

  • Innovation in ALM Architectures: The strong performance of ASR+LM baselines suggests that incorporating ASR-specific designs (e.g., for robustness to noise) into ALM architectures could significantly enhance their performance, especially in speech recognition tasks.

  • Synthetic Data Generation: Adapting the methods used for CoRe-Bench and PARADE to create more benchmarks and training data, spurring the community to develop newer and better datasets for ALMs.

7.3. Personal Insights & Critique

This paper provides an essential contribution to the nascent field of Audio-Language Models. The extension of the HELM framework to ALMs is logical and timely, providing much-needed structure and standardization.

  • Value of Baselines: The inclusion of ASR+LM baselines is a stroke of genius. It not only provides a competitive benchmark but also offers diagnostic insights into why ALMs might be performing well or poorly. For example, if a baseline performs well, it tells us that the task's complexity might primarily lie in linguistic processing after accurate transcription, rather than nuanced audio understanding. This helps researchers prioritize development efforts, focusing multimodal capabilities on tasks where they truly add value (e.g., sarcasm detection, music understanding, complex reasoning over audio context).

  • Synthetic Data Innovation: The detailed methodology for generating PARADE and CoRe-Bench is impressive. Creating high-quality, controlled synthetic datasets for complex tasks like bias and multi-turn conversational reasoning is a significant methodological advancement. It allows for testing specific capabilities without the immense cost and ethical complexities of real-world data collection, especially for sensitive aspects like bias. The adversarial unanswerable questions in CoRe-Bench are also a smart way to probe for hallucination.

  • Emphasis on Societal Aspects: The strong emphasis on bias, fairness, toxicity, and safety from the outset is commendable and crucial. As ALMs move towards widespread deployment, their ethical implications are paramount. Benchmarks like AHELM are vital for guiding responsible development. The finding about Gemini's group unfairness, despite its overall strong performance, is a stark reminder that even top models can have significant blind spots.

  • Transparency: The commitment to open-sourcing code, raw prompts, and model generations is excellent and sets a high standard for reproducibility in the community.

    Potential Issues/Areas for Improvement:

  • "Living Benchmark" Challenges: While the concept of a living benchmark is ideal, maintaining it and ensuring fair comparisons across perpetually updating models (especially closed-API ones) can be resource-intensive. The constant evolution of models might make long-term comparisons tricky.

  • Interpretation of LLM-as-a-Judge: While validated, the LLM-as-a-judge approach still relies on another black-box model. Further research into its limitations (e.g., potential biases mirroring its own training data, subtle prompt sensitivity) could strengthen its reliability claims.

  • Deeper Dive into ALM Failures: The paper identifies that ALMs struggle with certain aspects that ASR+LM baselines handle well (e.g., speaker identification in complex dialogues, robustness). A deeper architectural analysis of why ALMs fall short in these areas could be highly beneficial for future ALM design. What specific features or inductive biases do dedicated ASRs have that general-purpose ALMs lack?

  • Cultural Nuances in Toxicity: The observation that toxicity detection varies significantly across languages points to a complex problem rooted in cultural and linguistic differences. Future work could explore how to build more universally applicable toxicity detectors that account for these nuances, rather than just identifying differing performance.

    Overall, AHELM is a comprehensive and timely framework that will undoubtedly serve as a cornerstone for future research and development in Audio-Language Models, pushing the field towards more capable, robust, and responsible AI systems.
