Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming

Published: 08/30/2024
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

The paper presents Mini-Omni, an end-to-end open-source real-time speech interaction model that generates text and audio simultaneously using text-instructed speech generation and batch-parallel inference. It also introduces the VoiceAssistant-400K dataset to enhance voice assistant fine-tuning for speech output.

Abstract

Recent advances in language models have achieved significant progress. GPT-4o, as a new milestone, has enabled real-time conversations with humans, demonstrating near-human natural fluency. Such human-computer interaction necessitates models with the capability to perform reasoning directly with the audio modality and generate output in streaming. However, this remains beyond the reach of current academic models, as they typically depend on extra TTS systems for speech synthesis, resulting in undesirable latency. This paper introduces the Mini-Omni, an audio-based end-to-end conversational model, capable of real-time speech interaction. To achieve this capability, we propose a text-instructed speech generation method, along with batch-parallel strategies during inference to further boost the performance. Our method also helps to retain the original model's language capabilities with minimal degradation, enabling other works to establish real-time interaction capabilities. We call this training method "Any Model Can Talk". We also introduce the VoiceAssistant-400K dataset to fine-tune models optimized for speech output. To our best knowledge, Mini-Omni is the first fully end-to-end, open-source model for real-time speech interaction, offering valuable potential for future research.

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming

1.2. Authors

Zhifei Xie (Inspirai and Tsinghua University) and Changqiao Wu (Inspirai).

1.3. Journal/Conference

This paper is a preprint published on arXiv. While not yet peer-reviewed in a major journal or conference at the time of publication (August 2024), it represents a significant open-source contribution to the rapidly evolving field of multimodal speech-to-speech models, drawing inspiration from proprietary systems like OpenAI's GPT-4o.

1.4. Publication Year

2024 (First version: August 20, 2024; Third version: August 29, 2024).

1.5. Abstract

The paper introduces Mini-Omni, an open-source, end-to-end multimodal large language model (LLM) designed for real-time speech interaction. Unlike traditional systems that cascade separate speech-to-text, text-reasoning, and text-to-speech models (causing high latency), Mini-Omni generates text and audio tokens simultaneously in a streaming fashion. The authors propose a "text-instructed parallel generation" method and a "batch-parallel" strategy to preserve the model's reasoning capabilities while outputting audio. They also release the VoiceAssistant-400K dataset to help other researchers develop similar "Any Model Can Talk" capabilities.

2. Executive Summary

2.1. Background & Motivation

Current voice assistants typically follow a cascade approach:

  1. Automatic Speech Recognition (ASR): Converts user voice to text.

  2. Large Language Model (LLM): Processes text and generates a text response.

  3. Text-to-Speech (TTS): Converts the response back into audio.

    The core problem is latency. The user must wait for the LLM to finish generating text before the TTS even starts. While models like GPT-4o have achieved "near-human fluency" by using end-to-end audio reasoning, their architectures are proprietary (closed-source).

The authors aim to bridge this gap by creating an open-source model that can "hear" (process audio input) and "talk while thinking" (generate streaming audio and text tokens simultaneously). The challenge lies in the fact that direct audio reasoning is difficult for small models, often leading to incoherent outputs.

2.2. Main Contributions / Findings

  • Mini-Omni Model: The first fully end-to-end open-source model capable of real-time speech interaction.

  • Text-Instructed Parallel Generation: A decoding strategy where the model predicts text and audio tokens at the same time, using the text token to "guide" the audio generation.

  • Batch Parallel Decoding: A novel inference trick that uses a batch of two (one for text reasoning, one for audio synthesis) to borrow the superior logic of the text modality for the audio output.

  • "Any Model Can Talk" Framework: A three-stage training recipe that allows existing text-based LLMs to gain speech capabilities with minimal modification.

  • VoiceAssistant-400K: A specialized dataset containing 400,000 entries optimized for fine-tuning voice assistants, avoiding the "clunky" code or long-form text found in typical LLM datasets.


3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand this paper, a novice should be familiar with several key concepts in modern AI:

  • Large Language Model (LLM): An AI model (like GPT-4 or Qwen) trained on massive amounts of text to predict the next "token" in a sequence.
  • Tokenization: The process of breaking down data into discrete units (tokens) that a computer can process. While text tokens are common (e.g., words or sub-words), Audio Tokenization converts continuous sound waves into a discrete sequence of numbers (audio tokens) using a codec like SNAC; a short encoding sketch follows this list.
  • End-to-End (E2E): A system where a single neural network takes the raw input (audio) and produces the final output (audio) without intermediate, independent steps.
  • Streaming: A method of data transfer where the output is delivered as it is being generated, rather than waiting for the entire process to finish.
  • Adapter: A small, lightweight neural network layer added to a large pre-trained model to help it understand a new type of data (like audio) without retraining the whole model.
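
To make the audio-tokenization concept concrete, below is a minimal sketch using the open-source `snac` package; the checkpoint name and tensor shapes follow the package's public usage example and are assumptions rather than details taken from the paper.

```python
import torch
from snac import SNAC  # open-source multi-scale neural audio codec

# Load a pretrained SNAC model (the checkpoint name follows the package's
# public usage example and is an assumption, not a detail from the paper).
codec = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").eval()

# One second of dummy 24 kHz mono audio, shaped (batch, channels, samples).
waveform = torch.randn(1, 1, 24000)

with torch.inference_mode():
    # `codes` is a list of integer tensors, one per codebook level,
    # each at a different temporal resolution (the "multi-scale" part).
    # Mini-Omni works with 7 SNAC token layers per generation step.
    codes = codec.encode(waveform)
    # The discrete tokens can be decoded back into a waveform.
    reconstructed = codec.decode(codes)

for level, c in enumerate(codes):
    print(f"codebook level {level}: {c.shape[-1]} tokens")
```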

3.2. Previous Works

The authors build upon several prior innovations:

  • Whisper: A robust ASR model by OpenAI used here as an Encoder to help the model "hear."
  • MusicGen: A model that introduced Parallel Decoding, where multiple layers of audio tokens are predicted at once to speed up generation.
  • SNAC: A multi-scale neural audio codec. It captures audio details across different resolutions (layers). Mini-Omni uses 7 layers of SNAC tokens.
  • SpeechGPT: An earlier attempt at speech LLMs that used an "Audio-Text-Text-Audio" sequence, which still suffered from latency because text and audio were generated sequentially rather than simultaneously.

3.3. Technological Evolution

The field moved from Pipeline models (ASR + LLM + TTS) to Token-based Audio LLMs (SpeechGPT). However, these models were "slow" because they generated text first, then audio. Mini-Omni represents the next step: Simultaneous Parallel Generation, where text and audio flow out of the model together.

3.4. Differentiation Analysis

The core difference between Mini-Omni and previous open-source models (like VITA or Qwen2-Audio) is that the latter two typically output text and rely on external TTS. Mini-Omni generates the actual audio tokens internally, allowing for much lower latency and more natural "streaming" interaction.


4. Methodology

4.1. Principles

The core intuition is that text is logically dense while audio is acoustically rich. By forcing the model to generate both simultaneously, the authors use the text tokens as an "anchor" for the audio tokens. This ensures that the speech produced is logically consistent with the thought process.

4.2. Core Methodology In-depth (Layer by Layer)

4.2.1. Audio Language Modeling

The model treats both text and audio as sequences of discrete tokens. Let $Y = (y_i \in \mathcal{V}_{txt} \mid i = 1, \dots, t_{txt})$ be a text sequence drawn from a vocabulary $\mathcal{V}_{txt}$ (the set of all possible words/symbols). The model predicts the probability of the next token conditioned on all previous ones.

When incorporating speech, the model uses a tokenizer to convert audio into discrete speech tokens (dst), represented as $D = (d_i \in \mathcal{V}_{dst} \mid i = 1, \dots, t_{dst})$. To handle both, a unified vocabulary $\mathcal{V}_{voxt} = \mathcal{V}_{txt} \cup \mathcal{V}_{dst}$ is created.

The training objective is to minimize the negative log-likelihood (NLL). During training on a corpus $C$, the loss function $\mathcal{L}$ for text-audio output pairs $(T, A)$ given an input condition $X$ is defined as:

$ \mathcal{L}(T, A \mid C) = -\sum_{j=1}^{m} \sum_{i=1}^{n_{j}} \log P(T_{i,j}, A_{i,j} \mid T_{<i,j}, A_{<i,j}; X_{j}) $

Symbol Explanation:

  • $\mathcal{L}$: The loss (error) the model tries to minimize.

  • $m$: The number of training examples.

  • $n_j$: The maximum number of tokens in the $j$-th sample.

  • $T_{i,j}, A_{i,j}$: The $i$-th text token and audio token of the $j$-th sample.

  • $T_{<i,j}, A_{<i,j}$: All previous text and audio tokens generated before step $i$.

  • $X_j$: The input condition (e.g., the user's voice prompt).

    The model architecture, shown in the following figure (Figure 1 from the original paper), uses an audio adapter and a Whisper encoder to process input and a parallel head to generate output:

    Figure 1: The Mini-Omni model architecture. The diagram shows the processing flow of text and audio tokens: the input waveform is handled by a Whisper encoder and an audio adapter, and the model performs parallel generation with streaming audio decoding.
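
To make the objective above concrete, the following is a minimal PyTorch sketch (not the authors' code) of the autoregressive loss over a unified text-plus-audio vocabulary; the vocabulary sizes and tensor shapes are placeholder assumptions.

```python
import torch
import torch.nn.functional as F

# Placeholder vocabulary sizes: text tokens followed by discrete speech tokens.
V_TXT, V_DST = 32000, 4096
V_VOXT = V_TXT + V_DST          # unified vocabulary V_txt ∪ V_dst

batch, seq_len = 2, 16
# Next-token logits over the unified vocabulary, as produced by the LM head.
logits = torch.randn(batch, seq_len, V_VOXT)
# Interleaved text/audio target ids drawn from the unified vocabulary.
targets = torch.randint(0, V_VOXT, (batch, seq_len))

# Negative log-likelihood of each token given all previous tokens,
# i.e. the autoregressive objective L(T, A | C) from above.
nll = F.cross_entropy(logits.reshape(-1, V_VOXT), targets.reshape(-1))
print(nll.item())
```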

4.2.2. Text-instructed Parallel Decoding

Instead of predicting one token at a time, Mini-Omni predicts a bundle of tokens. Because they use the SNAC codec, which has 7 layers of audio tokens, and they include 1 layer of text, the model has 8 output heads.

The authors use a delay pattern. In each step, the model generates:

  1. Text Token: The primary logical unit.

  2. Audio Layer 1-7: The acoustic units.

    To ensure the text "guides" the audio, the text token is generated first, and the audio tokens follow with a one-step delay between layers. This is illustrated in Figure 2(b) from the paper:

    Figure 2: Mini-Omni incorporates text-instruct mechanisms alongside batch-parallel generation techniques. The diagram shows the parallel decoding schemes used for audio generation, from plain parallel decoding to text-instructed and batch-parallel decoding, and how text and audio tokens are handled in each.
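
The delay pattern can be made concrete with a small sketch that lays out one text stream and seven audio streams so that each audio layer lags the previous stream by one step; the padding id and sequence length are illustrative assumptions, not values from the paper.

```python
import torch

NUM_AUDIO_LAYERS = 7   # SNAC codebook layers
PAD = -1               # illustrative "not yet generated" placeholder id
steps = 12             # number of decoding steps shown in the sketch

# One text stream plus seven audio streams, each `steps` tokens long.
text = torch.arange(steps)                                        # t0, t1, ...
audio = [torch.arange(steps) + 100 * (k + 1) for k in range(NUM_AUDIO_LAYERS)]

# Apply the delay pattern: audio layer k starts k + 1 steps after the text
# stream, so the text token at each step can guide the audio layers below it.
layout = [text]
for k, stream in enumerate(audio):
    delayed = torch.full((steps,), PAD)
    delayed[k + 1:] = stream[: steps - (k + 1)]
    layout.append(delayed)

grid = torch.stack(layout)   # shape: (8 output heads, steps)
print(grid)
```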

4.2.3. Batch Parallel Decoding

A significant discovery in the paper is that audio-only reasoning is weaker than text reasoning. To fix this without increasing model size, they use Batch Parallel Decoding (Figure 2(c) above).

  1. They set the inference batch size to 2.
  2. Sample 1: Tasked with generating both text and audio.
  3. Sample 2: Tasked with generating only text (which is usually smarter/more logical).
  4. The Trick: They discard the text tokens from Sample 1 and replace them with the "smarter" text tokens from Sample 2 to guide the audio generation of Sample 1. This allows the model to "borrow" the full power of its text reasoning for its voice output.
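
The token swap can be sketched as follows, assuming a hypothetical `model_step` callable that returns text-head and audio-head logits for a batch of contexts; every name here is illustrative rather than the authors' implementation.

```python
import torch

def batch_parallel_step(model_step, ctx_audio_text, ctx_text_only):
    """One decoding step of batch-parallel inference (illustrative only).

    ctx_audio_text: running context of sample 1, which emits text + audio.
    ctx_text_only:  running context of sample 2, which emits text only.
    model_step:     hypothetical callable returning (text_logits, audio_logits)
                    for a batch of contexts, sharing one set of model weights.
    """
    batch = torch.stack([ctx_audio_text, ctx_text_only])
    text_logits, audio_logits = model_step(batch)

    # Sample 2's text prediction is typically stronger, so its text token
    # replaces sample 1's text token and guides sample 1's audio generation.
    guiding_text = text_logits[1].argmax(dim=-1)
    audio_tokens = audio_logits[0].argmax(dim=-1)  # 7 SNAC tokens for this step
    return guiding_text, audio_tokens
```

Because both samples share the same weights and decoding machinery, the extra cost is roughly that of running with a batch size of two rather than loading a second model.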

4.2.4. The "Any Model Can Talk" Training Recipe

The authors propose a three-stage training process to add speech to any LLM:

  1. Modality Alignment: Freeze the main LLM. Train only the adapters (the "connectors") using ASR and TTS data so the model learns what speech "looks" like.

  2. Adaptation Training: Train the model to respond to audio inputs with text outputs.

  3. Multi-modal Finetuning: Unfreeze the whole model and train on the full speech-to-speech task.

    The following figure (Figure 3 from the original paper) illustrates these stages:

    Figure 3: Mini-Omni's three-stage training phases: modality expansion, modality adaptation training, and holistic fine-tuning. In each stage, different adapters and alignment objectives are used to integrate audio and text for better interaction capability.
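
As a rough illustration of the staged freezing described above, here is a minimal PyTorch sketch; the module names (`llm`, `audio_adapter`, `tts_adapter`) and the exact freezing choice in stage 2 are assumptions, not the authors' code.

```python
import torch.nn as nn

def set_trainable(module: nn.Module, flag: bool) -> None:
    for p in module.parameters():
        p.requires_grad = flag

def configure_stage(stage: int, llm: nn.Module,
                    audio_adapter: nn.Module, tts_adapter: nn.Module) -> None:
    if stage in (1, 2):
        # Stages 1-2: keep the core LLM frozen and train the adapters
        # (treating stage 2 the same way is a simplifying assumption).
        set_trainable(llm, False)
        set_trainable(audio_adapter, True)
        set_trainable(tts_adapter, True)
    else:
        # Stage 3: multi-modal finetuning with the whole model unfrozen.
        for m in (llm, audio_adapter, tts_adapter):
            set_trainable(m, True)
```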

The input IDs are organized into 8 parallel sequences as shown in Figure 4:

Figure 4: Diagram of the input section of Mini-Omni parallel generation. The <answer> special token is placed at the end of the sequence to be generated, as determined by the task. The diagram shows the token sequences for the text-to-text, audio-to-text, text-to-audio, and audio-to-audio settings.


5. Experimental Setup

5.1. Datasets

The authors used a mix of speech and text datasets to ensure the model maintains its intelligence while learning to speak.

The following are the datasets used from Table 1 of the original paper:

Task      Stages   Dataset                   Modality          Items
ASR       1, 2, 3  Libritts                  A1 / T1           586 h
ASR       1, 2, 3  VCTK                      A1 / T1           44 h
ASR       1, 2, 3  Multilingual LibriSpeech  A1 / T1           8000 h
Text QA   2, 3     Open-Orca                 T1 / T2           2000K
Audio QA  3        Moss-002-sft-data         A1, T1 / A2, T2   1500K
Voice QA  final    Alpaca-GPT4               A1, T1 / A2, T2   55K
Voice QA  final    Identity finetune         A1, T1 / A2, T2   2K
Voice QA  final    QAassistant               A1, T1 / A2, T2   27K
Voice QA  final    Rlhf (Anthropic)          A1, T1 / A2, T2   367K
Voice QA  final    Trivia-singlechoice       A1, T1 / A2, T2   17K
Voice QA  final    Trivia-multichoice        A1, T1 / A2, T2   20K
Voice QA  final    OpenAssistant             A1, T1 / A2, T2   2K

Note on Modality: A stands for Audio, T stands for Text. Subscripts 1 and 2 denote input and output, respectively.

5.2. Evaluation Metrics

The primary metric used to evaluate speech understanding is the Word Error Rate (WER).

  1. Conceptual Definition: WER measures the accuracy of an ASR system by comparing the model's transcribed text to a reference "ground truth" transcript. It calculates the percentage of errors (substitutions, deletions, insertions) relative to the total words.
  2. Mathematical Formula: $ \mathrm{WER} = \frac{S + D + I}{N} $
  3. Symbol Explanation:
    • $S$: Number of substitutions (wrong words).
    • $D$: Number of deletions (missing words).
    • $I$: Number of insertions (extra words).
    • $N$: Total number of words in the reference transcript.
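
For reference, a minimal word-level edit-distance implementation of this formula (standard dynamic programming, not code from the paper) looks like this:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (S + D + I) / N, computed via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                      # deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                      # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# One substitution out of six reference words -> WER ≈ 0.167
print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```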

5.3. Baselines

The model is compared against:

  • wav2vec2-base: A standard self-supervised audio model.

  • VITA: A recent open-source multimodal LLM.

  • Whisper-small: The gold standard for ASR, which also serves as the encoder for Mini-Omni.


6. Results & Analysis

6.1. Core Results Analysis

Mini-Omni achieves impressive ASR results, coming close to the performance of its own backbone encoder (Whisper-small). While it slightly lags behind the specialized Whisper decoder, the fact that a 0.5B parameter model can perform ASR while maintaining conversational abilities is a key finding.

The authors also qualitatively demonstrate that Batch Parallel Decoding is essential. Without it, the audio responses were "simpler" and less intelligent. By using the batch trick, the model's voice output matches the complexity of its text output.

6.2. Data Presentation (Tables)

The following are the results from Table 2 of the original paper, comparing speech recognition performance:

Method          test-clean   test-other   dev-clean   dev-other
wav2vec2-base   6.0          13.4         -           -
VITA            8.14         18.41        7.57        16.57
whisper-small   3.4          7.6          -           -
Mini-Omni       4.5          9.7          4.6         9.2

(Lower scores are better for WER).

The model's real-time streaming capabilities are demonstrated in the following examples (Figure 5 from the paper):

Figure 5: Real streaming output examples of Mini-Omni. Different variants (Omni-AT, Omni-AA, and Omni-BATCH) respond to user questions, showing their respective answers together with audio playback.

6.3. Ablation Studies / Parameter Analysis

The authors highlight that the 0.5B parameter size (based on Qwen2-0.5B) was chosen to prove that even tiny models can handle end-to-end speech tasks if the decoding strategy is efficient. They also found that combining the parallel input sequences by summing or averaging them before feeding them into the model was an effective way to merge text and audio features.
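
A minimal sketch of this kind of fusion, in which the embeddings of the eight parallel input streams are averaged into a single sequence before entering the transformer, is shown below; the hidden size and vocabulary size are assumptions.

```python
import torch
import torch.nn as nn

n_streams, steps, hidden = 8, 32, 896     # 1 text + 7 audio streams; hidden size assumed
vocab = 40000                             # assumed unified vocabulary size
embed = nn.Embedding(vocab, hidden)       # shared embedding over the unified vocabulary

# Token ids for the 8 parallel input streams at each step: (n_streams, steps).
stream_ids = torch.randint(0, vocab, (n_streams, steps))

# Embed each stream and average across streams, so the transformer receives
# a single fused sequence of shape (steps, hidden).
fused = embed(stream_ids).mean(dim=0)
print(fused.shape)                        # torch.Size([32, 896])
```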


7. Conclusion & Reflections

7.1. Conclusion Summary

Mini-Omni is a pioneer in open-source real-time speech interaction. By introducing parallel text-audio generation and batch parallel decoding, the authors show that low-latency, "thinking-while-talking" behavior is possible without massive proprietary hardware. Their "Any Model Can Talk" method provides a clear roadmap for adding voice to other popular models like Llama or Mistral.

7.2. Limitations & Future Work

  • Model Size: At 0.5B parameters, the model's absolute reasoning power is limited compared to massive 70B+ models.
  • Audio Quality: While the SNAC codec is high-quality, the synthesis still relies on relatively small amounts of training data compared to commercial TTS systems.
  • Complex Reasoning: Even with the batch trick, very complex logical reasoning in audio remains a challenge.

7.3. Personal Insights & Critique

Inspiration: The "Batch Parallel Decoding" is a stroke of genius. It solves the "audio is dumber than text" problem by simply running two versions of the task in parallel and swapping the tokens. This is a highly efficient way to utilize the KV-cache of modern transformers.

Critique: The paper is somewhat brief on the specifics of the "TTS adapter" architecture (the 6 additional transformer blocks). More detail on why 6 blocks were chosen and how they were initialized would be beneficial for replication. Additionally, while the VoiceAssistant-400K dataset is a great contribution, more analysis of its diversity versus standard datasets like ShareGPT would strengthen the argument for its necessity.

Overall, Mini-Omni is a vital contribution to the open-source community, providing the building blocks for the next generation of "Omni" models.
