
MiMo-Audio: Audio Language Models are Few-Shot Learners


TL;DR Summary

MiMo-Audio demonstrates strong few-shot learning abilities in audio tasks, leveraging over 100 million hours of pretraining data. It achieved state-of-the-art performance in speech intelligence and audio understanding benchmarks while effectively generalizing to new tasks.

Abstract

Existing audio language models typically rely on task-specific fine-tuning to accomplish particular audio tasks. In contrast, humans are able to generalize to new audio tasks with only a few examples or simple instructions. GPT-3 has shown that scaling next-token prediction pretraining enables strong generalization capabilities in text, and we believe this paradigm is equally applicable to the audio domain. By scaling MiMo-Audio’s pretraining data to over one hundred million hours, we observe the emergence of few-shot learning capabilities across a diverse set of audio tasks. We develop a systematic evaluation of these capabilities and find that MiMo-Audio-7B-Base achieves SOTA performance on both speech intelligence and audio understanding benchmarks among open-source models. Beyond standard metrics, MiMo-Audio-7B-Base generalizes to tasks absent from its training data, such as voice conversion, style transfer, and speech editing. MiMo-Audio-7B-Base also demonstrates powerful speech continuation capabilities, capable of generating highly realistic talk shows, recitations, livestreaming and debates. At the post-training stage, we curate a diverse instruction-tuning corpus and introduce thinking mechanisms into both audio understanding and generation. MiMo-Audio-7B-Instruct achieves open-source SOTA on audio understanding benchmarks (MMSU, MMAU, MMAR, MMAU-Pro), spoken dialogue benchmarks (Big Bench Audio, MultiChallenge Audio) and instruct-TTS evaluations, approaching or surpassing closed-source models. Model checkpoints and the full evaluation suite are available at https://github.com/XiaomiMiMo/MiMo-Audio.


In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

MiMo-Audio: Audio Language Models are Few-Shot Learners

The title directly states the paper's central argument: by creating a specific type of audio language model, called MiMo-Audio, the authors have achieved few-shot learning capabilities in the audio domain. This draws a parallel to the breakthrough of models like GPT-3 in natural language processing, which demonstrated that large-scale pre-training enables models to perform new tasks with only a few examples provided in the context, without needing to be retrained.

1.2. Authors

The primary author is listed as LLM-Core Xiaomi, indicating this is a large-scale research and engineering effort from a corporate entity, Xiaomi. The acknowledgments section lists a large number of individual contributors, categorized into Core Contributors, Deployment & Evaluation, and Additional Contributors, reflecting the significant resources required for a project of this magnitude.

1.3. Journal/Conference

The paper is presented as a technical report, with a link to a PDF hosted on GitHub. It is available as a preprint on arXiv. This format is common for large-scale model releases from industry labs, allowing for rapid dissemination of results before or alongside formal peer-reviewed publication. The paper cites several other works with a "2025" publication year, which are also recent preprints, highlighting the fast-paced nature of research in this area. As a preprint, it has not yet undergone formal peer review.

1.4. Publication Year

The paper cites works published in 2025 and benchmarks against systems released in late 2024 and 2025, indicating that the technical report itself was released in 2025.

1.5. Abstract

The abstract outlines that existing audio language models typically require task-specific fine-tuning. In contrast, this paper proposes that the paradigm of scaling next-token prediction pre-training, which was successful for text models like GPT-3, can be applied to audio. By pre-training their model, MiMo-Audio, on an unprecedented scale of over 100 million hours of audio data, the authors observed the emergence of few-shot learning capabilities across a diverse range of audio tasks. The base model, MiMo-Audio-7B-Base, achieves state-of-the-art (SOTA) performance among open-source models on speech and audio benchmarks and can generalize to new tasks like voice conversion and style transfer. A post-training, instruction-tuned version, MiMo-Audio-7B-Instruct, further pushes the state of the art on audio understanding, spoken dialogue, and instruction-following text-to-speech (TTS) benchmarks, approaching or surpassing closed-source models. The authors announce the release of their models and evaluation suite.

  • Official Link: https://raw.githubusercontent.com/XiaomiMiMo/MiMo-Audio/refs/heads/main/MiMo-Audio-Technical-Report.pdf
  • Publication Status: This is a technical report and preprint. It is not yet published in a peer-reviewed journal or conference.

2. Executive Summary

2.1. Background & Motivation

The core problem this paper addresses is the lack of generalization in audio AI models. While humans can effortlessly adapt their understanding and generation of speech to new contexts, speakers, and tasks with minimal examples, most AI models require extensive, task-specific fine-tuning. For instance, a model trained for speech recognition cannot perform voice conversion without being retrained on a new dataset for that specific task.

This limitation stands in stark contrast to the progress seen in natural language processing (NLP). The seminal GPT-3 paper demonstrated that by massively scaling a model's size and its training data using a simple "next-token prediction" objective, the model spontaneously develops in-context learning or few-shot learning abilities. It can perform tasks it was never explicitly trained for, just by being shown a few examples in its input prompt.

The authors of MiMo-Audio hypothesize that this "scaling law" paradigm is not unique to text and can be successfully applied to the audio domain. The key challenge and research gap has been that prior attempts at creating generalist audio models have not successfully unlocked this "GPT-3 moment"—the emergent, non-linear jump in generalization capabilities. The authors believe this is due to two critical factors that previous works failed to address simultaneously:

  1. Lossless Information Flow: Using an architecture that preserves all information in the speech signal (including fine-grained acoustic details like timbre, prosody, and emotion), which is often lost in conventional audio representations.
  2. Massive Scale: Scaling the pre-training data to a volume orders of magnitude larger than any previous open-source effort.

2.2. Main Contributions / Findings

The paper makes several key contributions to the field of audio AI:

  1. First Empirical Proof of Emergent Few-Shot Learning in Audio via Scaling: The authors provide the first strong evidence that scaling pre-training on "lossless" audio representations to over 100 million hours unlocks emergent task generalization. They term this a "GPT-3 moment" for speech, where the model spontaneously acquires the ability to perform complex, unseen tasks like voice conversion and style transfer from a few examples.

  2. A Comprehensive Blueprint for Generative Speech Pre-training: The paper lays out a detailed and replicable methodology for building such a model, which includes:

    • A novel, high-fidelity audio tokenizer (MiMo-Audio-Tokenizer) designed to balance semantic meaning and acoustic reconstruction quality.
    • A scalable model architecture (Patch Encoder and Patch Decoder) to efficiently handle the high data rate of audio tokens.
    • A two-stage training strategy to progressively build understanding and generation capabilities.
    • A new, holistic evaluation suite (SpeechMMLU) to systematically measure these new capabilities.
  3. Integration of "Thinking" into Audio Models: The authors pioneer the use of chain-of-thought (CoT) principles for both audio understanding and generation. By creating a specialized instruction-tuning corpus that includes explicit reasoning steps, they enhance the model's ability to tackle complex cognitive tasks involving audio.

    The primary finding is that their approach works. The base model, MiMo-Audio-7B-Base, exhibits strong few-shot abilities. The instruction-tuned model, MiMo-Audio-7B-Instruct, achieves state-of-the-art results on a wide array of public benchmarks for audio understanding, spoken dialogue, and controllable speech synthesis, outperforming all open-source competitors and challenging top closed-source models.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To fully grasp this paper, a beginner should be familiar with the following concepts:

  • Large Language Models (LLMs): These are massive neural networks, typically based on the Transformer architecture, trained on vast amounts of text data. Their primary training objective is usually next-token prediction, where the model learns to predict the next word (or sub-word "token") in a sequence. Scaling laws refer to the observation that as you increase the model size, dataset size, and computation used for training, the model's performance improves in a predictable way, often leading to the emergence of new capabilities.

  • Few-Shot Learning (or In-Context Learning): This is the ability of a model to perform a new task by being given only a few examples (shots) within its input prompt. For example, to perform English-to-French translation, you would provide the model with a prompt like: "sea otter -> loutre de mer\ncheese -> fromage\nwhat is the French for 'car' ->". The model then completes the sequence, having learned the task from the context. This is fundamentally different from fine-tuning, which requires updating the model's weights on a large dataset of translation pairs.

  • Audio Tokenization: Computers process data in discrete numbers. To apply the "next-token prediction" paradigm to continuous audio waveforms, the audio must be converted into a sequence of discrete integers, or tokens. This process is called tokenization. There is a fundamental trade-off:

    • Semantic Tokens: Derived from speech recognition or self-supervised models, these tokens capture the linguistic content (what is being said) but often discard paralinguistic information (who is speaking, how they are speaking, background noise). They are good for understanding, but poor for reconstructing the original audio.
    • Acoustic Tokens: Generated by neural audio codecs, these tokens are designed to reconstruct the original waveform with high fidelity. They capture all acoustic details but may not be well-aligned with the semantic meaning of language, making them harder for an LLM to model. The paper's MiMo-Audio-Tokenizer aims to create a unified token that captures both aspects well.
  • Residual Vector Quantization (RVQ): A technique for compressing a continuous vector (like an audio representation) into a set of discrete codes. Instead of having one massive codebook, RVQ uses a sequence of smaller codebooks. The first quantizer finds the closest code for the original vector. The second quantizer then finds a code for the residual (the error between the original vector and the first code). This process repeats, allowing for a fine-grained representation to be built up layer by layer. MiMo-Audio uses 8 layers ($R' = 8$) of RVQ codes to represent each audio frame (a minimal code sketch of this idea appears at the end of this list).

  • Transformer Architecture: A neural network architecture that has become dominant in NLP and is used extensively in this paper. Its key mechanism is self-attention, which allows the model to weigh the importance of different tokens in the input sequence when processing a given token. It consists of an encoder (which processes the input) and a decoder (which generates the output). This paper uses Transformer blocks in its tokenizer, patch encoder, patch decoder, and the central LLM itself.

  • Chain-of-Thought (CoT) Prompting: A technique to improve the reasoning ability of LLMs. Instead of asking for just the final answer, the model is prompted to generate a step-by-step reasoning process before giving the answer. For example, for a math problem, it would first write out the steps to solve it. This paper extends this idea by creating training data that includes such "thinking" steps for audio-related tasks.
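
As a concrete illustration of the RVQ idea described above, here is a minimal NumPy sketch. The codebook sizes, vector dimension, and random (untrained) codebooks are assumptions for demonstration only; this is not the paper's tokenizer.

```python
# Minimal residual vector quantization (RVQ) sketch: each layer quantizes the
# residual left over by the previous layers, so the reconstruction is refined
# layer by layer. All sizes here are illustrative.
import numpy as np

rng = np.random.default_rng(0)

def rvq_encode(x, codebooks):
    """Return one code index per layer plus the final residual."""
    residual = x.copy()
    codes = []
    for cb in codebooks:                          # cb has shape (codebook_size, dim)
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        codes.append(idx)
        residual = residual - cb[idx]             # next layer models what is left over
    return codes, residual

def rvq_decode(codes, codebooks):
    """Reconstruct by summing the selected entry from every layer."""
    return sum(cb[i] for cb, i in zip(codebooks, codes))

dim = 16
codebooks = [rng.normal(size=(64, dim)) * 0.5 ** level for level in range(8)]
x = rng.normal(size=dim)
codes, _ = rvq_encode(x, codebooks)
print(codes, np.linalg.norm(x - rvq_decode(codes, codebooks)))  # remaining reconstruction error
```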

3.2. Previous Works

The paper builds upon and differentiates itself from several lines of prior research:

  • Generative Pre-trained Models: The core inspiration is GPT-3 (Brown et al., 2020a), which proved that scaling next-token prediction leads to few-shot learning in text. MiMo-Audio aims to replicate this success for the audio domain.

  • Generalist Audio Models: Prior models like AudioLM (Borsos et al., 2023) and SPEAR (Défossez et al., 2024) also explored next-token prediction for audio. However, the authors argue these models did not scale their training data sufficiently and/or used lossy representations, preventing them from achieving broad, general-purpose generalization and the "emergent" abilities seen in MiMo-Audio.

  • Audio Tokenizers: The paper situates its MiMo-Audio-Tokenizer in contrast to previous approaches:

    • Semantic-only: Models using tokens from ASR systems like GLM-4-Voice-Tokenizer (Zeng et al., 2024) lose acoustic detail.
    • Acoustic-only: Neural codecs like EnCodec (Défossez et al., 2022) are hard to align with text semantics.
    • Hybrid Approaches: Models like SpeechTokenizer (Zhang et al., 2023b) and Mimi (Défossez et al., 2024) tried to distill semantic information into a codec but were limited by small encoder sizes. Dual-stream models like X-Codec (Ye et al., 2025a) and XY-Tokenizer (Gong et al., 2025) use separate encoders for semantics and acoustics, but this creates a disconnect as the information comes from different representation spaces. MiMo-Audio-Tokenizer's innovation is using a single, large, unified encoder trained from scratch on a massive dataset to resolve this conflict internally.

3.3. Technological Evolution

The field of audio AI has evolved from:

  1. Task-Specific Models: Separate models for ASR, TTS, speaker identification, etc., each trained on a specialized dataset.
  2. Self-Supervised Representation Learning: Models like HuBERT (Hsu et al., 2021) were pre-trained on large amounts of unlabeled audio to learn general-purpose audio representations, which could then be fine-tuned for various downstream tasks, improving efficiency.
  3. Generative Audio Language Models: More recent models like AudioLM began treating audio as a "language," using tokenization and next-token prediction. This opened the door for zero-shot or few-shot capabilities.
  4. Massively-Scaled Generative Models: MiMo-Audio represents the next step in this evolution, arguing that the key to unlocking true generalization is not just the generative paradigm itself, but applying it at an unprecedented scale (100M+ hours of data), similar to the leap from GPT-2 to GPT-3 in the text domain.

3.4. Differentiation Analysis

Compared to related work, MiMo-Audio's core innovations are:

  • Unprecedented Scale: Its pre-training corpus of 100M+ hours is an order of magnitude larger than any other open-source speech model, which is the primary driver of its claimed emergent abilities.
  • Lossless & Unified Tokenization: The MiMo-Audio-Tokenizer is a large 1.2B parameter model itself, trained from scratch to produce tokens that are both high-fidelity for reconstruction and semantically meaningful, avoiding the compromises of previous systems.
  • Efficient High-Rate Architecture: The patch encoder/decoder system is specifically designed to manage the very high token rate (200 tokens/second) produced by their high-fidelity tokenizer, making it feasible to train an LLM on these dense audio representations.
  • Systematic Focus on Generalization: The entire project, from data curation to evaluation (with the creation of SpeechMMLU), is designed around the central goal of achieving and measuring task generalization, rather than just optimizing for specific benchmark scores.
  • Explicit Reasoning for Audio: It is one of the first models to explicitly integrate chain-of-thought (thinking) mechanisms into both audio understanding and generation tasks during the post-training phase.

4. Methodology

The methodology of MiMo-Audio is a comprehensive system comprising a novel audio tokenizer, a specialized model architecture, and a multi-stage training strategy.

4.1. Principles

The guiding principle is that a sufficiently powerful, lossless compression of a massive and diverse audio dataset will force a model to learn a deep, generalizable representation of the world as perceived through sound. By framing the pre-training task as next-token prediction on high-fidelity audio tokens, the model is compelled to learn everything from phonetics and language to speaker identity, emotion, prosody, and background environment in order to minimize its prediction error. This learned representation then enables generalization to new tasks.

4.2. Core Methodology In-depth (Layer by Layer)

4.2.1. MiMo-Audio-Tokenizer

This component is the foundation of the entire system, responsible for converting continuous audio waveforms into a sequence of discrete tokens that the LLM can process.

Architecture: The tokenizer has four main parts, as illustrated in Figure 2.

Figure 2: Illustration of the MiMo-Audio-Tokenizer framework. The figure shows the audio encoder, the decoder, and an auxiliary language model, along with the multi-scale reconstruction loss, the downsampling/upsampling path, the next-token prediction loss, and the discriminators.

  • Audio Encoder: A 32-layer Transformer that takes a mel-spectrogram (sampled at 100Hz) and downsamples it to produce continuous representations at a 25Hz frame rate. To balance semantic and acoustic information, it combines hidden states from an intermediate layer (layer 3) with the final layer's output.
  • Discretization Module: A 20-layer Residual Vector Quantizer (RVQ). It takes the 25Hz continuous representations from the encoder and converts each frame into 20 discrete tokens (one for each RVQ layer). The first two layers have larger codebooks (1024 entries) for capturing coarse information, while the remaining 18 layers have smaller codebooks (128 entries) for finer details.
  • Audio Decoder: A Transformer with a structure mirroring the encoder, but using causal self-attention to allow for autoregressive generation. It takes the quantized representations and reconstructs them.
  • Vocoder: A Transformer-based vocoder inspired by Vocos that converts the reconstructed representations from the decoder back into a raw audio waveform (sampled at 24kHz).
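
To make the frame-rate bookkeeping above concrete, here is a small illustrative helper. The rates (100 Hz mel input, 25 Hz latent frames, 20 RVQ codes per frame, 24 kHz output) come from the description above; the function itself is just a sanity check, not part of the actual system.

```python
# Track how one second of audio moves through the tokenizer pipeline.
def tokenizer_rates(seconds: float):
    mel_frames    = int(100 * seconds)     # mel-spectrogram frames fed to the encoder
    latent_frames = int(25 * seconds)      # encoder output after 4x temporal downsampling
    rvq_tokens    = latent_frames * 20     # 20 RVQ layers -> 20 discrete codes per frame
    samples_out   = int(24_000 * seconds)  # vocoder reconstructs a 24 kHz waveform
    return mel_frames, latent_frames, rvq_tokens, samples_out

print(tokenizer_rates(1.0))  # (100, 25, 500, 24000)
```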

Training: The tokenizer is trained in a two-stage process:

Stage 1: Unified Representation Learning. The goal is to learn a representation that is good for both semantic understanding and acoustic reconstruction. This is achieved through multi-task training.

  1. Audio-to-Text (A2T) Task: To instill semantic understanding, the quantized audio representation $\tilde{\mathbf{Q}}$ is fed to a small, jointly-trained LLM, which is tasked with predicting the corresponding text transcript $T$. The loss is a standard next-token prediction loss on the text: $\mathcal{L}_{\mathrm{A2T}} = - \sum_{i=1}^{N} \log p(t_i \mid \tilde{\mathbf{Q}}, t_1, \dots, t_{i-1})$

    • $t_i$: The $i$-th token in the target text sequence $T$.
    • $\tilde{\mathbf{Q}}$: The quantized audio representation from the RVQ module.
    • $N$: The length of the text sequence.
  2. Audio Reconstruction Task: To ensure high-fidelity audio, the model is trained to reconstruct the original audio's mel-spectrogram. The loss is the L1 distance between the original and reconstructed mel-spectrograms at multiple scales: $\mathcal{L}_{\mathrm{recon}} = \sum_{i \in e} \Vert S_i(X) - S_i(\hat{X}) \Vert_1$

    • $X$: The original audio waveform.
    • $\hat{X}$: The reconstructed audio waveform.
    • $S_i(\cdot)$: A function that computes the mel-spectrogram at scale $i$.
    • $e = \{5, 6, 7\}$: The set of scales used.
  3. Commitment Loss: A standard loss term from the VQ-VAE framework, $\mathcal{L}_{\mathrm{commit}}$, which encourages the encoder's output to stay close to the chosen codebook vectors.

    The total loss for Stage 1 is a weighted sum of these three components: $\mathcal{L}_{\mathrm{stage1}} = \lambda_{\mathrm{A2T}} \mathcal{L}_{\mathrm{A2T}} + \lambda_{\mathrm{recon}} \mathcal{L}_{\mathrm{recon}} + \lambda_{\mathrm{commit}} \mathcal{L}_{\mathrm{commit}}$

  • $\lambda_{\mathrm{A2T}} = 10.0$, $\lambda_{\mathrm{recon}} = 1.0$, $\lambda_{\mathrm{commit}} = 1.0$: Loss weights. The high weight on the A2T loss emphasizes learning strong semantic representations.
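
A hedged PyTorch-style sketch of how these three terms could be combined with the weights above (10.0 / 1.0 / 1.0). The tensor shapes and helper arguments are assumptions for illustration; the report does not specify the implementation at this level.

```python
import torch.nn.functional as F

def stage1_loss(text_logits, text_targets, mels_ref, mels_hat, encoder_out, quantized):
    # A2T: next-token prediction on the transcript, conditioned on quantized audio
    # text_logits: (B, N, vocab), text_targets: (B, N)
    l_a2t = F.cross_entropy(text_logits.transpose(1, 2), text_targets)
    # Reconstruction: L1 between original and reconstructed mel-spectrograms at several scales
    l_recon = sum(F.l1_loss(ref, hat) for ref, hat in zip(mels_ref, mels_hat))
    # Commitment: keep encoder outputs close to the selected (detached) codebook vectors
    l_commit = F.mse_loss(encoder_out, quantized.detach())
    return 10.0 * l_a2t + 1.0 * l_recon + 1.0 * l_commit
```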

Stage 2: Adversarial Fine-tuning. After Stage 1, the encoder and discretization module are frozen. This stage focuses solely on improving the reconstruction quality of the decoder and vocoder using a Generative Adversarial Network (GAN) setup.

  • Discriminators: Two types of discriminators are used to distinguish between real and generated audio: a Multi-Period Discriminator (MPD) and a Multi-Scale STFT discriminator (MS-STFT).
  • Losses: The generator (decoder + vocoder) is trained with a combination of three losses:
    1. Adversarial Loss ($\tilde{\mathcal{L}}_{\mathrm{adv}}$): Encourages the generator to produce audio that the discriminators classify as real.

    2. Feature Matching Loss ($\mathcal{L}_{\mathrm{fm}}$): Minimizes the L1 distance between the feature maps of real and generated audio in the intermediate layers of the discriminators. This stabilizes training.

    3. Reconstruction Loss ($\mathcal{L}_{\mathrm{recon}}$): The same mel-spectrogram reconstruction loss from Stage 1.

      The final generator loss is: $\mathcal{L}_{G} = \lambda_{\mathrm{recon}} \mathcal{L}_{\mathrm{recon}} + \lambda_{\mathrm{adv}} \tilde{\mathcal{L}}_{\mathrm{adv}} + \lambda_{\mathrm{fm}} \mathcal{L}_{\mathrm{fm}}$

  • $\lambda_{\mathrm{recon}} = 1.0$, $\lambda_{\mathrm{adv}} = 1.0$, $\lambda_{\mathrm{fm}} = 2.0$: Weights for the different loss components in Stage 2.
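
The Stage-2 generator objective can be sketched the same way with the weights above (1.0 / 1.0 / 2.0). The least-squares adversarial form and the discriminator interfaces are assumptions; the report does not spell out these details here.

```python
import torch
import torch.nn.functional as F

def generator_loss(mel_ref, mel_hat, disc_fake_scores, feats_real, feats_fake):
    l_recon = F.l1_loss(mel_ref, mel_hat)
    # adversarial: push each discriminator's score on generated audio toward "real"
    l_adv = sum(F.mse_loss(s, torch.ones_like(s)) for s in disc_fake_scores)
    # feature matching: L1 between intermediate discriminator features (real vs. generated)
    l_fm = sum(F.l1_loss(fr.detach(), ff) for fr, ff in zip(feats_real, feats_fake))
    return 1.0 * l_recon + 1.0 * l_adv + 2.0 * l_fm
```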

4.2.2. MiMo-Audio Main Model

This is the core audio-language model that performs understanding and generation tasks.

Architecture: The model (Figure 3) consists of three main components that process an interleaved sequence of text tokens and audio "patches".

Figure 3: Model architecture of MiMo-Audio. The figure shows the structure of the patch encoder and patch decoder and how they interact, with the mixed input representations flowing through the language model.

To bridge the granularity mismatch between text tokens and the high-rate audio tokens (25 frames/second × 8 RVQ layers = 200 tokens/second), the audio frames are grouped into patches. Four consecutive audio frames ($G = 4$) are grouped into one patch, effectively downsampling the audio sequence to a 6.25 Hz patch rate for the LLM.

  • Patch Encoder: This module takes an audio patch (containing 4 frames of 8 RVQ tokens each) and converts it into a single vector that has the same dimension as the LLM's text embeddings.

    1. For each audio frame $i$ in the patch, the embeddings of its 8 RVQ tokens are summed: $\mathbf{e}_i = \sum_{r=1}^{R'} \mathbf{e}_{i,r}$, where $R' = 8$.
    2. The sequence of 4 resulting vectors is processed by a 6-layer Transformer encoder.
    3. The outputs are concatenated and projected to match the LLM's hidden dimension.
  • Large Language Model (LLM): The backbone is MiMo-7B-Base, a pre-trained 7B parameter LLM. It takes a sequence of mixed representations—either text token embeddings or audio patch embeddings from the patch encoder—and performs next-token prediction.

  • Patch Decoder: When the model needs to generate audio, the LLM produces a hidden state $\mathbf{h}$ that represents the next audio patch to be generated. The patch decoder takes this $\mathbf{h}$ and autoregressively generates the sequence of RVQ tokens for that patch. A key design choice here is the delay mechanism. Generating all 8 RVQ tokens for a single time step simultaneously is difficult. Instead, the generation is staggered: the token for RVQ layer $r$ is generated with a delay of $d_r$ time steps. In this paper, the delays are set to $D = [0, 1, 2, 3, 4, 5, 6, 7]$, so a frame's first RVQ token is emitted immediately, its second token one step later, and so on. This simplifies the prediction task at each step, as the model can condition on previously generated tokens from lower RVQ layers.
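
The two mechanisms above, patchifying RVQ tokens into LLM-rate embeddings and staggering RVQ layers with a delay pattern, can be sketched as follows. The embedding sizes, the omission of the 6-layer Transformer inside the patch encoder, and the scheduling helper are simplifications for illustration, not the actual implementation.

```python
import torch
import torch.nn as nn

R, G, D_LLM = 8, 4, 4096                 # RVQ layers per frame, frames per patch, LLM width
emb = nn.ModuleList(nn.Embedding(1024, 512) for _ in range(R))
proj = nn.Linear(G * 512, D_LLM)

def encode_patch(patch_tokens):          # patch_tokens: (G, R) integer RVQ codes
    # Sum the R RVQ embeddings of each frame (the 6-layer Transformer over the
    # G frames is omitted here), then flatten the frames and project to the LLM width.
    frames = torch.stack([sum(emb[r](patch_tokens[g, r]) for r in range(R))
                          for g in range(G)])           # (G, 512)
    return proj(frames.reshape(-1))                     # (D_LLM,) one patch embedding

# Delay pattern D = [0, 1, ..., 7]: the token of RVQ layer r for frame t is emitted at
# decoding step t + r, so each step conditions on the coarser layers already generated.
def delayed_schedule(num_frames, delays=tuple(range(R))):
    return sorted((t + d, t, r) for t in range(num_frames) for r, d in enumerate(delays))

print(encode_patch(torch.randint(0, 1024, (G, R))).shape)   # torch.Size([4096])
print(delayed_schedule(2)[:6])                               # (step, frame, rvq_layer) triples
```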

4.2.3. Training Strategy

The main MiMo-Audio model is trained in two progressive stages, starting from the weights of the pre-trained MiMo-7B-Base LLM.

Stage 1: Understanding Training

  • Goal: To teach the model speech understanding and align the audio modality with the LLM's existing text capabilities.
  • Components Trained: Only the patch encoder and the LLM are trained. The patch decoder is not used.
  • Data: A mix of speech-text interleaved data, ASR data, audio captioning data, and text-only data.
  • Loss: The loss is calculated only on text tokens. The model learns to predict text based on preceding text and audio inputs.
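
A common way to implement "loss only on text tokens" is to mask non-text positions out of the cross-entropy target; the sketch below assumes this standard mechanism, since the report does not describe the exact implementation.

```python
import torch
import torch.nn.functional as F

def text_only_loss(logits, targets, is_text):
    # logits: (B, T, vocab), targets: (B, T), is_text: (B, T) boolean mask
    masked = targets.masked_fill(~is_text, -100)   # -100 is ignored by cross_entropy
    return F.cross_entropy(logits.transpose(1, 2), masked, ignore_index=-100)
```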

Stage 2: Understanding-Generation Joint Training

  • Goal: To endow the model with integrated speech understanding and generation capabilities.
  • Components Trained: All components are trained: patch encoder, LLM, and patch decoder.
  • Data: A richer mix of tasks is used, including speech continuation, TTS, instruction-following TTS, in addition to the data from Stage 1.
  • Loss: The loss is calculated on both text and audio tokens.
    • The loss weight for text tokens is 100.
    • The loss weights for the 8 RVQ audio tokens are progressively smaller: 12, 8, 6, 4, 2, 2, 1, 1. This prioritizes getting the first, most important RVQ layers correct.
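
A minimal sketch of this weighting scheme, assuming one cross-entropy term per output stream (the text stream plus the 8 RVQ streams); the exact loss plumbing is an assumption for illustration.

```python
import torch.nn.functional as F

TEXT_WEIGHT = 100
RVQ_WEIGHTS = [12, 8, 6, 4, 2, 2, 1, 1]   # earlier (coarser) RVQ layers are weighted more

def joint_loss(text_logits, text_targets, rvq_logits, rvq_targets):
    loss = TEXT_WEIGHT * F.cross_entropy(text_logits.transpose(1, 2), text_targets)
    for w, logits, tgt in zip(RVQ_WEIGHTS, rvq_logits, rvq_targets):
        loss = loss + w * F.cross_entropy(logits.transpose(1, 2), tgt)
    return loss
```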

4.2.4. Post-Training

After pre-training, the MiMo-Audio-7B-Base model is instruction-tuned to create MiMo-Audio-7B-Instruct, which is better at following user commands.

  • Data: A large and diverse corpus of instruction-following data is curated for:
    • Audio Understanding: Question-answering over speech, sounds, and music.
    • Speech Generation: Controllable TTS based on natural language instructions.
    • Spoken Dialogue: Multi-turn conversational data, with spoken responses generated by an in-house MiMo-TTS system to ensure high quality and expressiveness.
    • Thinking Data: To enhance reasoning, high-quality chain-of-thought data is created for both audio understanding and generation tasks.
  • Training: All model parameters are fine-tuned on this instruction dataset using a similar loss weighting scheme as in Stage 2 of pre-training.

5. Experimental Setup

5.1. Datasets

The paper uses a wide range of datasets to evaluate the model's capabilities at different stages.

For Tokenizer Evaluation:

  • Seed-TTS-Eval (Anastassiou et al., 2024): A benchmark for evaluating the quality of text-to-speech systems, used here to measure the reconstruction fidelity of the audio tokenizer. It has both Chinese (ZH) and English (EN) splits.

For Pre-training (Few-shot) Evaluation:

  • SpeechMMLU: A novel dataset created by the authors. They took the questions and multiple-choice options from the popular MMLU (Hendrycks et al., 2021) benchmark (which tests general knowledge and problem-solving across 57 subjects) and synthesized them into speech using a high-quality TTS system. This allows for controlled comparison of performance across text-in/text-out, speech-in/text-out, text-in/speech-out, and speech-in/speech-out modalities.
  • MMAU (Sakshi et al., 2024): A benchmark for Massive Multi-task Audio Understanding, covering question-answering and information extraction tasks across speech, environmental sounds, and music.
  • Custom Speech-to-Speech Tasks: A suite of tasks designed to test in-context generation, including voice conversion, emotion conversion, speech rate control, denoising, and speech translation (see Table 5 in the paper).

For Post-training (Instruction-tuned) Evaluation:

  • ASR Datasets:
    • LibriSpeech (Panayotov et al., 2015): A standard benchmark for English automatic speech recognition (ASR), based on audiobooks.
    • AISHELL-1 (Bu et al., 2017): A popular open-source corpus for Mandarin ASR.
  • TTS Datasets:
    • SeedTTS (Anastassiou et al., 2024): Used to evaluate standard TTS quality.
    • InstructTTSEval (Huang et al., 2025): A benchmark specifically designed to evaluate how well a model can follow complex, natural language instructions for controlling the style, emotion, and prosody of synthesized speech.
  • Audio Understanding & Reasoning Datasets:
    • MMSU (Wang et al., 2025): A benchmark for Massive Multi-task Spoken language Understanding and reasoning.
    • MMAU (Sakshi et al., 2025): The same benchmark used in pre-training evaluation.
    • MMAR (Ma et al., 2025): A benchmark for Mixed-audio deep Reasoning, challenging models with inputs containing a mix of speech, audio, and music.
    • MMAU-Pro (Kumar et al., 2025): A more challenging version of MMAU for holistic evaluation of audio general intelligence.
  • Spoken Dialogue Datasets:
    • Big Bench Audio: An audio version of selected tasks from the BIG-bench (Srivastava et al., 2022) benchmark, designed to probe the intelligence of language models in a conversational audio context.
    • MultiChallenge Audio: An audio version of the MultiChallenge (Sirdeshmukh et al., 2025) benchmark, which tests multi-turn conversational reasoning. The authors converted the text-based dialogues to speech.

5.2. Evaluation Metrics

  • PESQ (Perceptual Evaluation of Speech Quality):

    1. Conceptual Definition: An objective metric that predicts the subjective quality of speech as perceived by a human listener. It compares a degraded signal (e.g., reconstructed audio) to a clean reference signal. Scores typically range from -0.5 to 4.5, with higher being better. PESQ-NB refers to narrowband (telephone quality) and PESQ-WB to wideband (higher fidelity).
    2. Mathematical Formula: The calculation is complex and involves time alignment, perceptual transformation to a psychoacoustic domain, and comparison of distortion levels. There is no simple formula; it is defined by the ITU-T P.862 standard.
  • STOI (Short-Time Objective Intelligibility):

    1. Conceptual Definition: Measures the intelligibility of speech in noise. It calculates the correlation between the temporal envelopes of the clean and degraded speech signals in short time-frequency regions. Scores range from 0 to 1, with higher being better.
    2. Mathematical Formula: $d(x, y) = \frac{1}{JM} \sum_{j,m} \frac{(x_{j,m} - \mu_{x_{j,m}})^{\top} (y_{j,m} - \mu_{y_{j,m}})}{\| x_{j,m} - \mu_{x_{j,m}} \|_2 \, \| y_{j,m} - \mu_{y_{j,m}} \|_2}$
    3. Symbol Explanation:
      • $x$ and $y$: Clean and processed speech signals.
      • $j$, $m$: Indices for the frequency band and time frame.
      • $x_{j,m}$: A short-time temporal envelope vector of the clean speech.
      • $y_{j,m}$: A short-time temporal envelope vector of the processed speech.
      • $\mu$: The sample mean of the vector.
      • $J$, $M$: The total numbers of frequency bands and time frames.
  • SIM (Speaker Similarity):

    1. Conceptual Definition: Measures how similar the speaker's voice is between two audio clips. It is typically calculated as the cosine similarity between speaker embeddings extracted by a pre-trained speaker verification model. Scores range from -1 to 1, with 1 indicating identical speakers.
    2. Mathematical Formula: $\text{Cosine Similarity}(A, B) = \frac{A \cdot B}{\|A\| \, \|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \sqrt{\sum_{i=1}^{n} B_i^2}}$
    3. Symbol Explanation:
      • $A$, $B$: Speaker embedding vectors for the two audio clips.
      • $n$: The dimension of the embedding vectors.
  • WER (Word Error Rate):

    1. Conceptual Definition: The standard metric for evaluating Automatic Speech Recognition (ASR). It measures the number of errors (substitutions, deletions, and insertions) in the transcribed text compared to a ground-truth reference. Lower is better.
    2. Mathematical Formula: $\text{WER} = \frac{S + D + I}{N}$
    3. Symbol Explanation:
      • $S$: Number of substitutions (words replaced incorrectly).
      • $D$: Number of deletions (words missed).
      • $I$: Number of insertions (words added that weren't in the reference).
      • $N$: Number of words in the reference transcript.
  • GPT-based Evaluation:

    1. Conceptual Definition: For complex tasks like evaluating dialogue quality or instruction following, where objective metrics are insufficient, another powerful LLM (like GPT-4o-mini) is used as an evaluator. It is given the model's output, the reference answer/instructions, and a scoring rubric, and asked to provide a numerical score.
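
For the simpler metrics above, compact reference implementations are easy to write down. The sketch below covers speaker similarity (cosine similarity), WER (word-level edit distance), and the per-band correlation term from the STOI formula; a full STOI implementation additionally needs the short-time one-third-octave band analysis, and PESQ is defined by the ITU-T P.862 standard rather than a closed-form expression.

```python
import numpy as np

def speaker_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two speaker embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def wer(reference: str, hypothesis: str) -> float:
    """(S + D + I) / N via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    d = np.zeros((len(ref) + 1, len(hyp) + 1), dtype=int)
    d[:, 0] = np.arange(len(ref) + 1)
    d[0, :] = np.arange(len(hyp) + 1)
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1, j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i, j] = min(sub, d[i - 1, j] + 1, d[i, j - 1] + 1)   # substitute / delete / insert
    return d[len(ref), len(hyp)] / max(len(ref), 1)

def stoi_correlation(x_env: np.ndarray, y_env: np.ndarray) -> float:
    """Mean normalized correlation over (band, frame) envelope vectors of shape (J, M, L)."""
    x = x_env - x_env.mean(axis=-1, keepdims=True)
    y = y_env - y_env.mean(axis=-1, keepdims=True)
    corr = (x * y).sum(-1) / (np.linalg.norm(x, axis=-1) * np.linalg.norm(y, axis=-1) + 1e-8)
    return float(corr.mean())

print(wer("the cat sat", "the cat sat down"))   # 0.33: one insertion over 3 reference words
```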

5.3. Baselines

The paper compares MiMo-Audio against a comprehensive set of recent and state-of-the-art audio language models, including:

  • Open-Source Models:
    • Baichuan-Audio (Li et al., 2025)
    • Kimi-Audio (KimiTeam et al., 2025)
    • Step-Audio2-mini (Wu et al., 2025)
    • GLM-4-Voice (Zeng et al., 2024)
    • Qwen2.5-Omni (Xu et al., 2025)
    • Audio Flamingo 3 (Goel et al., 2025)
  • Closed-Source Models:
    • Gemini series models (1.5 Pro, 2.5 Flash)

    • GPT-4o-Audio and GPT-4o-mini-tts

      These baselines are representative as they include the most powerful contemporary models from both the open-source community and major industry labs, providing a robust benchmark for MiMo-Audio's performance.

6. Results & Analysis

The paper presents a wealth of results to validate its claims, which can be analyzed by stage.

6.1. Core Results Analysis

6.1.1. Tokenizer Performance

The following are the results from Table 1 of the original paper:

| System | kbps | ZH PESQ-NB | ZH PESQ-WB | ZH SIM | ZH STOI | EN PESQ-NB | EN PESQ-WB | EN SIM | EN STOI |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MiMo-Audio-Tokenizer | 1.55 | 3.30 | 2.71 | 0.89 | 0.93 | 3.02 | 2.43 | 0.85 | 0.92 |
| GLM-4-Voice-Tokenizer | 0.175 | 1.11 | 1.06 | 0.33 | 0.61 | 1.11 | 1.05 | 0.12 | 0.60 |
| Baichuan-Audio-Tokenizer | 1.0 | 2.37 | 1.84 | 0.78 | 0.86 | 2.11 | 1.62 | 0.69 | 0.85 |
| XY-Tokenizer | 1.0 | 2.88 | 2.24 | 0.87 | 0.90 | 2.69 | 2.14 | 0.82 | 0.90 |
| Mimi | 1.1 | 2.57 | 2.05 | 0.73 | 0.88 | 2.60 | 2.07 | 0.74 | 0.89 |
| XCodec2.0 | 0.8 | 2.69 | 2.10 | 0.81 | 0.89 | 2.57 | 2.01 | 0.78 | 0.89 |
| BigCodec | 1.04 | 2.88 | 2.26 | 0.80 | 0.91 | 2.80 | 2.22 | 0.80 | 0.91 |

(ZH and EN columns correspond to the SEED-ZH and SEED-EN splits of Seed-TTS-Eval.)

Analysis: MiMo-Audio-Tokenizer substantially outperforms all other tokenizers across all reconstruction metrics (PESQ, SIM, STOI) in both Chinese and English, at a comparable bitrate (kBPS). This is a crucial result, as it validates the effectiveness of their tokenizer design and large-scale training. The high scores, especially in SIM (speaker similarity), demonstrate that the tokens fed to the downstream LLM are "lossless" in the sense that they retain the fine-grained acoustic information necessary for high-fidelity generation and speaker preservation, which is a core tenet of the paper's hypothesis.

6.1.2. Emergence of Few-Shot Abilities

The chart in Figure 1 shows the performance on several few-shot tasks as the amount of pre-training data increases.

Figure 1: Emergent behavior in pretraining and performance comparison with SOTA models. The upper panels plot accuracy and similarity on few-shot tasks against the amount of training data; the lower panels compare MiMo-Audio-7B-Instruct with other models on benchmarks including MMAU-Pro, MMAU, and Big Bench Audio.

Analysis: For tasks like 5-shot SpeechMMLU and 16-shot Voice Conversion, performance is near zero for a long stretch of training. Then, after crossing a critical threshold of data (around 0.7 trillion tokens), performance rises sharply in a non-linear "phase transition." This is the classic signature of emergent abilities observed in LLMs. This result is the cornerstone of the paper's "GPT-3 moment" claim. It strongly suggests that these complex generalization skills are not learned gradually but emerge once the model and data scale are sufficiently large.

6.1.3. Pre-training Evaluation (MiMo-Audio-7B-Base)

The following are the results from Table 6 of the original paper:

| Benchmark | Subset | Baichuan-Audio 7B-Base | Kimi-Audio 7B-Base | Step-Audio2-mini 7B-Base | MiMo-Audio 7B-Base |
| --- | --- | --- | --- | --- | --- |
| SpeechMMLU | S2S | 31.9 | 11.8 | 51.8 | 69.1 |
| SpeechMMLU | S2T | 29.9 | 67.9 | 67.8 | 69.5 |
| SpeechMMLU | T2S | 16.7 | 0.0 | 63.4 | 71.5 |
| SpeechMMLU | T2T | 71.1 | 70.7 | 74.1 | 72.5 |
| MMAU | Overall | 25.9 | 28.6 | 60.3 | 66.0 |
| MMAU | Speech | 14.4 | 29.4 | 55.0 | 67.6 |
| MMAU | Sound | 30.3 | 31.5 | 67.9 | 65.2 |
| MMAU | Music | 32.9 | 24.8 | 58.1 | 65.3 |

Analysis:

  • Speech Intelligence (SpeechMMLU): MiMo-Audio-7B-Base achieves the highest scores on all speech-related modalities (S2S, S2T, T2S). Crucially, its performance on S2S (speech-in, speech-out) at 69.1 is extremely close to its T2T (text-in, text-out) performance of 72.5. This results in a tiny "modality gap" of only 3.4 points. In contrast, other models suffer a massive performance drop when switching to speech, with modality gaps of 22.3 (Step-Audio2), 58.9 (Kimi-Audio), and 39.2 (Baichuan-Audio). This demonstrates MiMo-Audio's superior ability to perform reasoning and access knowledge seamlessly across modalities.
  • General Audio Understanding (MMAU): The model also achieves the highest overall score on MMAU (66.0) and shows balanced, strong performance across all sub-domains (Speech, Sound, Music). This indicates that its pre-training endowed it with a broad understanding of the acoustic world, not just human speech.
  • Speech Continuation & Generalization: Qualitative demos (not shown in tables) reportedly show the model's ability to perform coherent and context-aware speech continuation in diverse scenarios (debates, talk shows, singing) and generalize to zero-shot tasks like voice conversion and style transfer, further supporting the claim of emergent general-purpose abilities.

6.1.4. Post-training Evaluation (MiMo-Audio-7B-Instruct)

The following are the results from Tables 8 and 9 of the original paper.

Table 8: Audio Understanding and Spoken Dialogue

MMAU (Speech | Sound | Music | Overall)

| Model | Speech | Sound | Music | Overall |
| --- | --- | --- | --- | --- |
| MiMo-Audio-7B-Instruct | 68.47 | 82.58 | 73.65 | 74.90 |
| Gemini 2.5 Flash | 76.58 | 73.27 | 65.57 | 71.80 |
| Audio Flamingo 3 | 66.37 | 79.58 | 66.77 | 73.30 |
| Step-Audio2-mini | 68.16 | 79.30 | 68.44 | 72.73 |
| Kimi-Audio-Instruct | 62.16 | 75.68 | 66.77 | 68.20 |
| Qwen2.5-Omni | 70.60 | 78.10 | 65.90 | 71.50 |
| GLM-4-Voice | 35.44 | 27.63 | 27.84 | 30.30 |

MMAU-Pro (Overall)

| Model | Overall |
| --- | --- |
| MiMo-Audio-7B-Instruct | 53.35 |
| Gemini 2.5 Flash | 59.20 |
| Audio Flamingo 3 | 51.70 |
| Step-Audio2-mini | 47.91 |
| Kimi-Audio-Instruct | 46.60 |
| Qwen2.5-Omni | 52.20 |
| GLM-4-Voice | 38.25 |
| GPT-4o-Audio | 52.50 |

MMAR (Overall)

| Model | Overall |
| --- | --- |
| MiMo-Audio-7B-Instruct | 63.60 |
| Gemini 2.5 Flash | 65.60 |
| Audio Flamingo 3 | 58.50 |
| Step-Audio2-mini | 55.80 |
| Kimi-Audio-Instruct | 48.00 |
| Qwen2.5-Omni | 56.70 |
| GLM-4-Voice | 29.50 |
| GPT-4o-Audio | 63.50 |

MMSU (Perception | Reasoning | Overall)

| Model | Perception | Reasoning | Overall |
| --- | --- | --- | --- |
| MiMo-Audio-7B-Instruct | 46.86 | 76.98 | 61.70 |
| MiMo-Audio-7B-Instruct +Think | 51.71 | 74.79 | 62.88 |
| Gemini 1.5 Pro | - | 72.60 | 60.70 |
| Audio Flamingo 3 | 42.71 | 75.70 | 61.40 |
| Step-Audio2-mini | 44.84 | 77.64 | 57.18 |
| Kimi-Audio-Instruct | 42.67 | 16.16 | 59.78 |
| Qwen2.5-Omni | - | - | 58.10 |
| GLM-4-Voice | 11.04 | - | 13.30 |

Spoken Dialogue

Big Bench Audio (S2T | S2S)

| Model | S2T | S2S |
| --- | --- | --- |
| MiMo-Audio-7B-Instruct | 72.90 | 60.20 |
| gpt-4o-audio-preview-2024-12-17 | - | 67.20 |
| Step-Audio2-mini | 50.90 | 47.50 |
| Kimi-Audio-Instruct | 59.40 | 51.00 |
| Qwen2.5-Omni | 54.20 | 53.60 |
| GLM-4-Voice | 44.80 | 42.70 |

MultiChallenge Audio (S2T | S2S)

| Model | S2T | S2S |
| --- | --- | --- |
| MiMo-Audio-7B-Instruct | 15.15 | 10.10 |
| Step-Audio2-mini | 13.64 | 8.08 |
| Kimi-Audio-Instruct | 7.07 | 8.08 |
| Qwen2.5-Omni | 1.01 | 9.09 |
| GLM-4-Voice | 11.11 | 6.06 |

Table 9: ASR and TTS Benchmarks

TTS: Seed-TTS-Eval (ZH | EN | ZH-Hard)

| Model | ZH | EN | ZH-Hard |
| --- | --- | --- | --- |
| MiMo-Audio-7B-Instruct | 1.96 | 5.37 | 14.14 |
| Step-Audio2-mini | 2.13 | 3.18 | 16.31 |

Instruct-TTS: InstructTTSEval-EN (APS | DSD | RP | Overall)

| Model | APS | DSD | RP | Overall |
| --- | --- | --- | --- | --- |
| MiMo-Audio-7B-Instruct | 80.60 | 77.63 | 59.54 | 72.59 |
| GPT-4o-mini-tts | 76.40 | 74.30 | 54.80 | 68.50 |

Instruct-TTS: InstructTTSEval-ZH (APS | DSD | RP | Overall)

| Model | APS | DSD | RP | Overall |
| --- | --- | --- | --- | --- |
| MiMo-Audio-7B-Instruct | 75.74 | 74.30 | 61.54 | 70.52 |
| GPT-4o-mini-tts | 54.90 | 52.30 | 46.00 | 51.07 |

ASR (LibriSpeech test-clean | AISHELL-1)

| Model | LibriSpeech test-clean | AISHELL-1 |
| --- | --- | --- |
| MiMo-Audio-7B-Instruct | 3.76 | 1.78 |
| Step-Audio2-mini | 1.87 | 0.95 |
| Kimi-Audio-Instruct | 2.13 | 0.62 |

Analysis:

  • Audio Understanding: MiMo-Audio-7B-Instruct achieves SOTA performance among all open-source models on MMSU, MMAU, MMAR, and MMAU-Pro. It even surpasses some closed-source models like Gemini 1.5 Pro on MMSU and GPT-4o-Audio on MMAU-Pro. The +Think version shows a notable improvement on MMSU, validating the benefit of the chain-of-thought data.
  • Spoken Dialogue: The model significantly outperforms all other open-source models on both Big Bench Audio and the more difficult MultiChallenge Audio. Its scores are remarkably close to the proprietary gpt-4o-audio, demonstrating near SOTA conversational intelligence in an audio context.
  • Speech Generation (TTS): While its standard ASR and TTS performance is competitive but not unilaterally dominant, its performance on InstructTTSEval is exceptional. It clearly outperforms gpt-4o-mini-tts on both English and Chinese benchmarks for instruction-following TTS, highlighting its strength in controllable and expressive speech synthesis.

6.2. Ablation Studies / Parameter Analysis

While the paper does not contain a formal "Ablation Study" section, several results function as such:

  • Impact of Scale: The emergent abilities curve (Figure 1) is a powerful analysis of the "data scale" parameter, showing it is a critical ingredient.
  • Impact of Architecture: The tiny "modality gap" in Table 6 serves as an ablation for the architectural choices (tokenizer, patching). It shows that this specific architecture is highly effective at preserving cross-modal consistency compared to alternatives.
  • Impact of Post-Training: The performance lift from MiMo-Audio-7B-Base (Table 6) to MiMo-Audio-7B-Instruct (Table 8) on a benchmark like MMAU (66.0 -> 74.9) clearly demonstrates the effectiveness of the instruction-tuning phase.
  • Impact of "Thinking": The comparison between MiMo-Audio-7B-Instruct and MiMo-Audio-7B-Instruct +Think on the MMSU benchmark (61.70 -> 62.88) provides a direct ablation on the benefit of the chain-of-thought mechanism for speech understanding tasks.

7. Conclusion & Reflections

7.1. Conclusion Summary

The paper successfully demonstrates that the "scaling law" paradigm, which revolutionized natural language processing, is equally applicable to the audio domain. By pre-training a unified generative model on an unprecedented scale of over 100 million hours of high-fidelity audio, MiMo-Audio achieves a "GPT-3 moment" for speech, exhibiting emergent few-shot learning capabilities. This allows the model to generalize to a wide variety of unseen audio tasks without task-specific fine-tuning.

The authors present a complete blueprint for this achievement, including a novel tokenizer, a scalable architecture, and a multi-stage training strategy. Their final instruction-tuned model, MiMo-Audio-7B-Instruct, sets a new state of the art for open-source models across a comprehensive suite of benchmarks for audio understanding, reasoning, dialogue, and generation, significantly narrowing the gap with leading closed-source systems. This work provides a foundational methodology for building truly versatile and intelligent audio language models.

7.2. Limitations & Future Work

The authors candidly acknowledge several limitations:

  • Limited In-Context-Learning Performance: The base model still struggles with certain complex in-context learning scenarios, such as generating speech with background music or processing complex sound events.

  • Unstable Spoken Dialogue Performance: The instruction-tuned model can sometimes exhibit inconsistencies in dialogue, such as timbre changes, unstable audio quality, mispronunciations (especially of formulas), and unreliable adherence to system prompts.

  • Limited Thinking Performance: The chain-of-thought (thinking) mechanism improved performance on speech understanding but degraded it for general sound and music tasks. This was attributed to the model "hallucinating" incorrect reasoning steps for non-speech audio.

    For future work, the authors plan to leverage Reinforcement Learning (RL) to improve the stability and capability of the model in these challenging areas.

7.3. Personal Insights & Critique

This paper is a landmark contribution to the field of audio AI, with several key takeaways and points for reflection:

  • The Universality of Scaling: The most powerful insight is the validation that the principle of "intelligence through compression at scale" is not limited to text. This provides a strong signal that similar breakthroughs may be possible in other modalities like video, if the challenges of representation and scale can be overcome.
  • The Importance of the Full Stack: The success of MiMo-Audio is not just about data scale. It's a result of meticulous engineering across the entire stack: a better tokenizer, a more efficient architecture, and a well-designed training curriculum. This serves as a reminder that scaling alone is not a silver bullet; it must be paired with appropriate model design.
  • The "Thinking" Modality Barrier: The negative result regarding the thinking mechanism's performance on non-speech audio is fascinating. It suggests that reasoning structures developed for language (like CoT) may not transfer directly to non-linguistic domains. Future research could explore developing modality-specific reasoning paradigms (e.g., "acoustic chain-of-thought").
  • Critique and Considerations:
    • The "GPT-3 moment" is a bold claim. While the evidence for emergent abilities is strong, the full breadth and robustness of these capabilities compared to the text-based revolution sparked by GPT-3 will require further independent and extensive probing by the research community.

    • The immense computational and data resources required (100M+ hours of curated data, training a 7B model) place this research far beyond the reach of most academic labs, highlighting the growing resource gap in AI research.

    • The reliance on GPT-based evaluation for some subjective tasks is a practical but imperfect methodology, as it can introduce the evaluator model's own biases.

    • As a technical report from a single corporate entity, the results await verification and replication by independent parties, which is a crucial step in the scientific process.

      Overall, MiMo-Audio represents a significant leap forward, providing both a powerful open-source artifact and a clear, ambitious research direction for the future of general-purpose audio intelligence.
