FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs
TL;DR Summary
This report presents FunAudioLLM, a model family enhancing natural voice interaction between humans and LLMs. It features SenseVoice for multilingual speech and emotion recognition, and CosyVoice for natural voice generation, supporting applications like speech translation and emotional voice chat.
Abstract
This report introduces FunAudioLLM, a model family designed to enhance natural voice interactions between humans and large language models (LLMs). At its core are two innovative models: SenseVoice, which handles multilingual speech recognition, emotion recognition, and audio event detection; and CosyVoice, which facilitates natural speech generation with control over multiple languages, timbre, speaking style, and speaker identity. SenseVoice-Small delivers exceptionally low-latency ASR for 5 languages, and SenseVoice-Large supports high-precision ASR for over 50 languages, while CosyVoice excels in multi-lingual voice generation, zero-shot in-context learning, cross-lingual voice cloning, and instruction-following capabilities. The models related to SenseVoice and CosyVoice have been open-sourced on Modelscope and Huggingface, along with the corresponding training, inference, and fine-tuning codes released on GitHub. By integrating these models with LLMs, FunAudioLLM enables applications such as speech-to-speech translation, emotional voice chat, interactive podcasts, and expressive audiobook narration, thereby pushing the boundaries of voice interaction technology. Demos are available at https://fun-audio-llm.github.io, and the code can be accessed at https://github.com/FunAudioLLM.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs
The title clearly states the paper's central topic: the creation of a family of foundational models, named FunAudioLLM, designed to enable more natural, voice-based interactions between humans and Large Language Models (LLMs). It highlights the two core functionalities: voice understanding (input) and voice generation (output).
1.2. Authors
The authors are credited as "Tongyi SpeechTeam" from "Alibaba Group". This indicates the research is a product of a corporate research team rather than individual academic researchers. The Tongyi brand is associated with Alibaba's suite of large models, suggesting this work is part of a broader AI initiative within the company. The large number of authors listed in Section 7 reflects the scale of the project, typical for foundational model development in major tech companies.
1.3. Journal/Conference
The paper was submitted to arXiv, an open-access repository for electronic preprints of scientific papers. This is a common practice for rapid dissemination of research findings, especially for fast-moving fields like AI and large models. As a preprint, it has not yet undergone formal peer review for publication in a specific conference or journal. However, the models and code have been open-sourced, which allows the community to verify and build upon the work.
1.4. Publication Year
The initial version was submitted to arXiv on July 4, 2024.
1.5. Abstract
The abstract introduces FunAudioLLM, a model family aimed at improving human-LLM voice interaction. It consists of two primary models:
- SenseVoice: For voice understanding. It handles multilingual automatic speech recognition (ASR), emotion recognition, and audio event detection. It comes in two versions: SenseVoice-Small for low-latency ASR in 5 languages and SenseVoice-Large for high-precision ASR in over 50 languages.
- CosyVoice: For voice generation. It facilitates natural text-to-speech (TTS) with control over language, timbre, style, and speaker identity. It excels in multilingual generation, zero-shot voice cloning, and following instructions.

The paper emphasizes that these models have been open-sourced on platforms like Modelscope and Huggingface, with code available on GitHub. By integrating these models with LLMs, FunAudioLLM enables advanced applications like speech-to-speech translation, emotional chatbots, and expressive audiobook narration.
1.6. Original Source Link
- Original Source Link: https://arxiv.org/abs/2407.04051
- PDF Link: https://arxiv.org/pdf/2407.04051v3
The paper is a preprint available on arXiv. This means it is publicly accessible but has not yet been formally peer-reviewed.
2. Executive Summary
2.1. Background & Motivation
The core problem this paper addresses is the gap between the advanced text-based capabilities of Large Language Models (LLMs) and the quality of voice-based interaction with them. While models like GPT-4o and Gemini have demonstrated multimodal capabilities, creating a truly natural, low-latency, and emotionally expressive voice interface remains a significant challenge.
This problem is important because voice is the most natural form of human communication. Current systems often suffer from:
- High Latency: The time delay between speaking and receiving a spoken response can feel unnatural and disruptive.
- Lack of Expressiveness: Synthesized voices are often monotonous and fail to convey emotion or appropriate prosody, making interactions feel robotic.
- Limited Understanding: Speech recognition systems may struggle with multiple languages, accents, or fail to capture paralinguistic cues like emotion or background audio events (e.g., laughter, music).
- Monolithic Systems: End-to-end multimodal models can be massive and difficult to adapt, whereas modular systems (separate speech understanding, language modeling, and speech generation) can be more flexible but risk error propagation.

The paper's entry point is to tackle this problem by developing a pair of specialized, high-performance, open-source foundation models: SenseVoice for comprehensive voice understanding and CosyVoice for controllable, high-quality voice generation. The innovative idea is to create a modular yet highly integrated framework that can be easily combined with any LLM to build powerful voice-first applications.
2.2. Main Contributions / Findings
The paper presents the following primary contributions:
- SenseVoice Model for Voice Understanding:
  - SenseVoice-Small: A non-autoregressive model designed for extremely low latency (over 15 times faster than Whisper-large) for real-time applications, supporting 5 languages.
  - SenseVoice-Large: An autoregressive model optimized for high accuracy across 50+ languages, showing superior performance, particularly in Chinese and Cantonese, compared to strong baselines like Whisper.
  - Rich Transcription: Beyond just text, SenseVoice can detect emotion, audio events (music, applause, laughter), and handle punctuation and inverse text normalization.
- CosyVoice Model for Voice Generation:
  - Supervised Semantic Speech Tokenizer (S^3): A novel speech tokenizer trained in a supervised manner, which captures semantic and paralinguistic information more effectively than unsupervised alternatives, improving synthesis quality and robustness to noisy data.
  - High-Fidelity and Controllable Synthesis: CosyVoice achieves human-level quality in speech generation, supporting zero-shot voice cloning from just 3 seconds of audio, cross-lingual voice cloning, and instruction-based control over emotion, speaking style, and speaker identity.
  - Multiple Open-Source Variants: The release includes CosyVoice-base (for cloning), CosyVoice-instruct (for controllable style), and CosyVoice-sft (a ready-to-use fine-tuned model).
- FunAudioLLM Framework and Applications: The paper demonstrates how combining SenseVoice, CosyVoice, and an LLM enables a suite of advanced applications, including speech-to-speech translation with voice cloning, emotional voice chats, interactive podcasts, and expressive audiobook narration.
- Open-Sourcing: A significant contribution is the public release of the models, training/inference code, and fine-tuning scripts, fostering further research and development in the community. This contrasts with many state-of-the-art models from large tech companies that remain closed-source.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
3.1.1. Large Language Models (LLMs)
LLMs, such as GPT-4, Llama, and Qwen (mentioned in the paper), are deep learning models with billions of parameters, trained on vast amounts of text data. Their core capability is to understand and generate human-like text. They function as the "brain" in the FunAudioLLM framework, responsible for tasks like translation, generating conversational responses, or analyzing text for emotional content.
3.1.2. Automatic Speech Recognition (ASR)
ASR is the technology that converts spoken language into written text. This is the primary function of the "voice understanding" component.
- Autoregressive (AR) Models: Generate text token by token, where each new token is predicted based on the previously generated tokens and the input audio. This is often more accurate but slower. SenseVoice-Large and Whisper are examples.
- Non-Autoregressive (NAR) Models: Generate all output tokens in parallel or in a single pass. This is significantly faster but can be less accurate. SenseVoice-Small is an example, making it suitable for real-time applications.
3.1.3. Text-to-Speech (TTS)
TTS is the technology that converts written text into spoken language. This is the function of the "voice generation" component, CosyVoice. Modern TTS systems aim for naturalness, expressiveness, and the ability to clone voices.
- Zero-Shot Voice Cloning: The ability of a TTS model to replicate a speaker's voice using a short audio sample (a "prompt") without needing to be retrained on that speaker's voice.
- Instruction-Following TTS: A more advanced form of TTS where the model can modify the generated speech (e.g., emotion, pitch, speed) based on natural language commands provided alongside the text.
3.1.4. Speech Tokenizers
To be processed by language models, continuous audio signals must be converted into a sequence of discrete units, or "tokens". This is the role of a speech tokenizer.
- Unsupervised Tokenizers: Models like Encodec or SoundStream learn to compress and discretize audio without any text labels. They focus on reconstructing the audio accurately but may not capture semantic content well.
- Supervised Tokenizers: The paper's proposed tokenizer is trained with text transcriptions, forcing the tokens to be more semantically meaningful. This helps the subsequent TTS model learn the mapping from text to speech more easily.
3.1.5. Flow Matching
Flow Matching is a modern generative modeling technique used in CosyVoice to generate a Mel spectrogram (a visual representation of audio frequencies) from speech tokens. It learns to transform a simple noise distribution into the complex data distribution of spectrograms by following a "vector field". It is known for being more efficient (requiring fewer steps) than traditional diffusion models while achieving high-quality results.
3.2. Previous Works
- Whisper (Radford et al., 2023): Developed by OpenAI, Whisper is a large-scale ASR model trained on 680,000 hours of weakly supervised multilingual data from the internet. It set a new standard for robustness and accuracy across many languages. The paper frequently compares SenseVoice to Whisper, positioning SenseVoice-Large as a more accurate alternative (especially for Chinese) and SenseVoice-Small as a much faster one.
- GPT-4o (OpenAI, 2023) and Gemini (Reid et al., 2024): These are state-of-the-art multimodal models that can process and generate audio, vision, and text end-to-end. While powerful, they are often closed-source, massive, and less flexible than the modular approach proposed by FunAudioLLM. FunAudioLLM offers an open-source, adaptable alternative.
- Paraformer (Gao et al., 2022): A non-autoregressive ASR model also from Alibaba, known for its high speed and accuracy in Mandarin speech recognition. SenseVoice-Small builds on similar NAR principles for low latency.
- VALL-E / Neural Codec Language Models (Wang et al., 2023a): This work pioneered the use of neural codec language models for zero-shot TTS. It treats TTS as a language modeling problem, predicting discrete audio codec codes from text. CosyVoice adopts this paradigm but improves upon it with its supervised tokenizer.
- ChatTTS, EmotiVoice, OpenVoice: These are other prominent open-source TTS models. The paper compares CosyVoice against them in Table 2, claiming CosyVoice offers a more comprehensive set of features, including zero-shot cloning, style control, fine-grained control, and a pre-fine-tuned model.
3.3. Technological Evolution
The field of voice interaction has evolved from separate, task-specific models to large, unified foundation models.
- Early Systems: Separate ASR, Natural Language Understanding (NLU), and TTS modules pipelined together. These were brittle and suffered from error propagation.
- End-to-End Models: Models like Listen, Attend and Spell began to unify ASR into a single neural network. Similarly, TTS evolved with models like Tacotron and WaveNet.
- Large-Scale Pre-training: Whisper showed the power of massive, weakly supervised pre-training for ASR, achieving unprecedented robustness.
- Generative TTS with Codecs: VALL-E framed TTS as predicting discrete audio codes, enabling powerful in-context learning capabilities like zero-shot voice cloning.
- Multimodal LLMs: GPT-4o and Gemini represent the move towards single, giant models that handle all modalities.

FunAudioLLM positions itself as a pragmatic and powerful middle ground: it leverages the power of foundation models but maintains a modular design (SenseVoice + LLM + CosyVoice). This allows for specialization, optimization (e.g., low-latency ASR), and flexibility, while being open-source.
3.4. Differentiation Analysis
Compared to previous work, FunAudioLLM's core innovations are:
- Specialized and Optimized Models: Instead of a single monolithic model, it provides two specialized models: SenseVoice for understanding and CosyVoice for generation. This allows SenseVoice-Small to be optimized for extreme speed and SenseVoice-Large for extreme accuracy, a trade-off that is harder to manage in a single model.
- Supervised Semantic Tokenizer (S^3): This is a key differentiator from models like VALL-E that use unsupervised audio codecs. By training the tokenizer with ASR supervision, the resulting tokens are more semantically meaningful, which the paper argues leads to better TTS quality, faster training, and robustness to noisy data.
- Comprehensive Feature Set in an Open-Source Package: While other models offer parts of the puzzle (e.g., OpenVoice for cloning, ChatTTS for prosody control), CosyVoice aims to provide a unified, open-source solution with zero-shot cloning, instruction-based style control, and fine-grained paralinguistic control.
- Practical Framework for Applications: The paper goes beyond just releasing models; it provides a clear framework and demos for building real-world applications, lowering the barrier to entry for developers.
4. Methodology
4.1. Principles
The core principle of FunAudioLLM is to create a powerful, flexible, and open-source toolkit for voice-based human-LLM interaction by decomposing the problem into two specialized sub-tasks: voice understanding and voice generation. It leverages large-scale pre-training for both tasks and introduces a novel supervised speech tokenizer to bridge the gap between the semantic world of text and the acoustic world of speech.
The following figure provides a high-level overview of the FunAudioLLM framework, showing the SenseVoice and CosyVoice models as the core components for voice understanding and generation, respectively.
Figure: Overview diagram of FunAudioLLM, showing the core capabilities of the SenseVoice and CosyVoice models, including multilingual speech recognition, high-fidelity voice cloning, and emotional speech synthesis, and highlighting their roles in voice interaction.
4.2. Core Methodology In-depth (Layer by Layer)
4.2.1. Voice Understanding Model: SenseVoice
SenseVoice is designed to be a comprehensive speech foundation model that performs multiple tasks, including ASR, Language ID (LID), Speech Emotion Recognition (SER), and Audio Event Detection (AED). It is offered in two variants to cater to different needs: speed vs. accuracy.
The architecture of both SenseVoice-Small and SenseVoice-Large is illustrated in the figure below.
Figure: Architectures of SenseVoice-Small and SenseVoice-Large. The upper part shows SenseVoice-Small with task embeddings, a feature extractor, and a SAN-M encoder trained with a multi-task loss; the lower part shows SenseVoice-Large, an encoder-decoder Transformer that decodes autoregressively from start-of-sequence prompt tokens.
SenseVoice-Small
This model is optimized for speed and low latency.
- Architecture: It is a non-autoregressive, encoder-only model. The audio waveform is first converted to 80-dimensional log-mel filterbank features. These features are then down-sampled and fed into an encoder, which is implemented as a memory-equipped self-attention network (SAN-M). The non-autoregressive nature allows it to process the entire audio and produce the output in parallel, leading to significant speed-ups.
- Multi-task Learning: To handle various tasks, SenseVoice-Small prepends special task-specific embeddings to the input speech features before they enter the encoder. The input sequence is constructed as follows:
  $ \mathbf{X} = \mathrm{concat}(\mathbf{e}_{\mathrm{LID}}, \mathbf{e}_{\mathrm{SER}}, \mathbf{e}_{\mathrm{AEC}}, \mathbf{e}_{\mathrm{ITN/NoITN}}, \mathbf{X}_{\mathrm{speech}}) \quad (1) $
  where:
  - $\mathbf{X}_{\mathrm{speech}} \in \mathbb{R}^{T \times D}$ is the down-sampled speech feature sequence of length $T$ and dimension $D$.
  - $\mathbf{e}_{\mathrm{LID}}$, $\mathbf{e}_{\mathrm{SER}}$, $\mathbf{e}_{\mathrm{AEC}}$ are embeddings for the special tokens ⟨LID⟩, ⟨SER⟩, and ⟨AEC⟩. When these tokens are present, the model is trained to predict the corresponding label (language, emotion, or audio event) at that position in the output. During training, the ⟨LID⟩ token is sometimes replaced with the ground-truth language token to allow for both prediction and conditioning.
  - $\mathbf{e}_{\mathrm{ITN/NoITN}}$ are embeddings for the ⟨ITN⟩ or ⟨NoITN⟩ tokens, which instruct the model whether to produce a transcription with Inverse Text Normalization and punctuation.
- Output and Loss Function: The encoder processes the concatenated sequence to produce output probabilities $\mathbf{P}$:
  $ \mathbf{P} = \operatorname{Softmax}(\operatorname{Linear}_{D \to |V'|}(\operatorname{Encoder}(\mathbf{X}))) \quad (2) $
  where:
  - $V'$ is the vocabulary containing tokens for both ASR and the other classification tasks.
  - The ASR task is optimized with the Connectionist Temporal Classification (CTC) loss, which is suitable for non-autoregressive sequence-to-sequence tasks where the alignment between input and output is unknown.
  - The LID, SER, and AEC tasks are optimized with a standard cross-entropy loss on the output corresponding to their special input tokens. A minimal training sketch of this multi-task setup is given below.
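To make the multi-task setup concrete, here is a minimal PyTorch-style sketch of Equations (1) and (2). It is not the released implementation: the module names, the vanilla Transformer encoder standing in for SAN-M, and the loss wiring are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SenseVoiceSmallSketch(nn.Module):
    """Minimal sketch of the non-autoregressive, multi-task encoder (Eqs. 1-2).
    A plain TransformerEncoder stands in for the SAN-M encoder of the paper."""

    def __init__(self, d_model=512, vocab_size=8000, n_task_tokens=4):
        super().__init__()
        self.task_emb = nn.Embedding(n_task_tokens, d_model)   # <LID>, <SER>, <AEC>, <ITN>/<NoITN>
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=6)
        self.out = nn.Linear(d_model, vocab_size)               # vocab V' covers ASR + task labels

    def forward(self, speech_feats, task_ids):
        # speech_feats: (B, T, d_model) down-sampled features after a front-end projection
        # task_ids:     (B, 4) indices of the four prepended task tokens
        x = torch.cat([self.task_emb(task_ids), speech_feats], dim=1)   # Eq. (1)
        return self.out(self.encoder(x))                                # Eq. (2), pre-softmax logits

def multitask_loss(logits, task_targets, ctc_targets, feat_lens, tgt_lens):
    # Positions 0-3 align with the task tokens -> cross-entropy (LID/SER/AEC/ITN);
    # the remaining positions form the CTC path for the ASR transcript.
    ce = F.cross_entropy(logits[:, :4].transpose(1, 2), task_targets)
    log_probs = logits[:, 4:].log_softmax(-1).transpose(0, 1)           # (T, B, |V'|) for CTC
    ctc = F.ctc_loss(log_probs, ctc_targets, feat_lens, tgt_lens)
    return ctc + ce
```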
SenseVoice-Large
This model is optimized for accuracy and broad language support.
- Architecture: It is an autoregressive encoder-decoder model, similar in principle to Whisper. The encoder processes the audio features, and the decoder generates the output text token by token.
- Task Specification: Unlike SenseVoice-Small, which uses special embeddings at the encoder input, SenseVoice-Large specifies tasks via a sequence of prompt tokens given to the decoder at the beginning of generation. For instance, including the task tokens for language, emotion, and timed audio events in the initial decoder prompt instructs the model to predict the language, emotion, or timed audio events as part of its output sequence. This approach is more flexible and allows for generating complex, structured text outputs, as sketched below.
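For comparison, decoder-side task specification can be pictured as assembling a prompt of special tokens before autoregressive decoding. The token strings below are hypothetical placeholders, not the exact vocabulary of the released model.

```python
def build_decoder_prompt(predict_language=True, predict_emotion=True,
                         predict_events=True, use_itn=True):
    """Assemble an illustrative decoder prompt; actual special-token names may differ."""
    prompt = ["<startoftranscript>"]
    if predict_language:
        prompt.append("<LID>")      # ask the decoder to emit a language token
    if predict_emotion:
        prompt.append("<SER>")      # ask for a speech-emotion label
    if predict_events:
        prompt.append("<AED>")      # ask for (timed) audio-event tags
    prompt.append("<withitn>" if use_itn else "<woitn>")
    return prompt

print(build_decoder_prompt())
# ['<startoftranscript>', '<LID>', '<SER>', '<AED>', '<withitn>']
```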
4.2.2. Supervised Semantic Speech Tokenizer (S^3)
A critical component for CosyVoice is the tokenizer that converts continuous speech into discrete tokens. The paper argues that popular unsupervised tokenizers (e.g., Encodec) are suboptimal because their tokens lack a strong connection to semantic content, leading to unstable synthesis and requiring very clean training data.
To solve this, the authors propose the Supervised Semantic Speech Tokenizer (S^3).
- Architecture: The S^3 tokenizer is built by modifying the pre-trained SenseVoice-Large model. Specifically, they take the first six layers of the SenseVoice-Large encoder and insert a vector quantizer (VQ) module after them. An additional positional embedding is added after quantization to retain temporal information. The architecture is shown below.

  Figure: Diagram of the S^3 tokenizer, showing the ASR decoder, the two encoder halves, and the speech-token pipeline; the input speech passes through vector quantization and positional encoding before the remaining encoder layers and the decoder produce the text output.
- Training: The entire model (Encoder part 1 + VQ + rest of Encoder + Decoder) is trained end-to-end on the ASR task. This means the VQ module is forced to learn a codebook of discrete tokens that are optimal for reconstructing the textual transcription, not just the audio waveform.
- Properties:
  - Semantic Richness: Because the training objective is recognition error minimization, the resulting tokens are strongly correlated with semantic units (like phonemes or words) and paralinguistic features (like emotion).
  - Noise Robustness: Supervised training makes the tokenizer less sensitive to noise in the audio compared to unsupervised reconstruction-based methods.
  - Efficiency: The tokenizer uses a single codebook with 4096 entries and produces tokens at a 50 Hz frequency, which is computationally manageable for the subsequent language model. A structural sketch of this design follows.
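The structural idea, inserting a vector quantizer after the first six encoder layers and training everything with the ASR objective, can be sketched as follows. This is a simplified illustration; the actual quantizer training (e.g., straight-through gradients or EMA codebook updates) and layer interfaces in the released code may differ.

```python
import torch
import torch.nn as nn

class S3TokenizerSketch(nn.Module):
    """Sketch of the supervised S^3 tokenizer: a VQ layer inserted mid-encoder of an ASR model.
    Codebook size (4096) and ~50 Hz token rate follow the paper; modules are placeholders."""

    def __init__(self, encoder_first6, encoder_rest, asr_decoder,
                 d_model=1024, codebook_size=4096, max_len=4096):
        super().__init__()
        self.enc1 = encoder_first6                    # first six encoder layers
        self.codebook = nn.Embedding(codebook_size, d_model)
        self.pos_emb = nn.Parameter(torch.zeros(1, max_len, d_model))  # positional emb. after VQ
        self.enc2 = encoder_rest                      # remaining encoder layers
        self.decoder = asr_decoder                    # ASR decoder, used only during training

    def quantize(self, h):
        # Nearest-codeword assignment; training would add a straight-through estimator.
        dist = (h.unsqueeze(-2) - self.codebook.weight).pow(2).sum(-1)   # (B, T, K)
        ids = dist.argmin(-1)                                            # discrete S^3 token ids
        return ids, self.codebook(ids)

    def tokenize(self, speech_feats):
        """Inference path used by CosyVoice: speech features -> discrete token ids."""
        ids, _ = self.quantize(self.enc1(speech_feats))
        return ids

    def asr_forward(self, speech_feats, text_tokens):
        """Training path: the ASR loss on the decoder output supervises the codebook."""
        ids, q = self.quantize(self.enc1(speech_feats))
        q = q + self.pos_emb[:, : q.size(1)]
        return self.decoder(self.enc2(q), text_tokens)
```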
4.2.3. Voice Generation Model: CosyVoice
CosyVoice is the family of TTS models that use the S^3 tokens to generate speech.
The figure below shows a semantic diagram of the CosyVoice architecture.
Figure: Schematic of the CosyVoice generation pipeline, including the reference speech, text tokenizer, autoregressive Transformer, flow-matching model, and HiFTNet vocoder; the stages are connected through tokens to form the complete speech-generation path.
System Overview
The CosyVoice generation process has three main stages:
- Text-to-Token LM: An autoregressive Transformer-based language model predicts the sequence of speech tokens from the input text.
- Token-to-Spectrogram Model: A diffusion-based model using flow matching reconstructs a detailed Mel spectrogram from the predicted tokens.
- Vocoder: A HiFTNet-based vocoder synthesizes the final audio waveform from the Mel spectrogram. A sketch of this three-stage pipeline is given below.
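Putting the three stages together, the inference flow can be sketched as pseudocode. The function and attribute names here (text_lm.generate, flow_model.sample, and so on) are assumptions for illustration, not the released CosyVoice API.

```python
def cosyvoice_synthesize_sketch(text, prompt_speech, components):
    """Illustrative three-stage CosyVoice flow; `components` bundles the assumed modules."""
    text_tokens = components.text_tokenizer(text)            # tokenize the input text
    spk_emb = components.speaker_encoder(prompt_speech)      # speaker embedding from the prompt

    # 1) Text-to-token LM: autoregressively predict S^3 speech tokens from text
    speech_tokens = components.text_lm.generate(text_tokens, spk_emb)

    # 2) Flow matching: S^3 tokens (+ speaker emb, reference mel) -> Mel spectrogram
    ref_mel = components.mel_extractor(prompt_speech)
    mel = components.flow_model.sample(speech_tokens, spk_emb, ref_mel, n_steps=10)

    # 3) HiFTNet-style vocoder: Mel spectrogram -> waveform
    return components.vocoder(mel)
```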
Model Training
- Language Model (LM): The autoregressive LM is trained using a teacher-forcing approach. It receives the tokenized text and the ground-truth speech tokens (shifted by one position) as input and is optimized to predict the next speech token at each step.
- Flow Matching Model: This model learns the conditional distribution of the target Mel spectrogram given the generated speech tokens, a speaker embedding, and a reference Mel spectrogram (from a prompt). It uses a convolutional Transformer U-Net to learn the vector field of an optimal transport ODE, which allows for very efficient inference (5-10 steps). To improve in-context learning, the model uses classifier-free guidance (CFG) and masks a portion of the conditioning features during training. A generic sketch of the training objective follows.
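In generic form, the objective described above corresponds to optimal-transport conditional flow matching (OT-CFM). The sketch below shows that generic loss with a classifier-free-guidance-style condition dropout; it is an assumption-level illustration, not the exact CosyVoice training code.

```python
import torch
import torch.nn.functional as F

def ot_cfm_loss(unet, mel, cond, drop_cond_prob=0.2, sigma_min=1e-4):
    """Generic OT-CFM loss for a token-to-spectrogram model.
    mel:  (B, T, n_mels) target Mel spectrogram
    cond: (B, T, C) conditioning features (upsampled S^3 tokens, speaker emb., masked ref. mel)"""
    x1 = mel
    x0 = torch.randn_like(x1)                                   # noise sample
    t = torch.rand(x1.size(0), 1, 1, device=x1.device)          # random time in [0, 1]
    x_t = (1 - (1 - sigma_min) * t) * x0 + t * x1                # optimal-transport path
    target_v = x1 - (1 - sigma_min) * x0                         # ground-truth vector field

    if torch.rand(()) < drop_cond_prob:                          # CFG-style condition dropout
        cond = torch.zeros_like(cond)

    pred_v = unet(x_t, t.squeeze(-1).squeeze(-1), cond)          # convolutional Transformer U-Net
    return F.mse_loss(pred_v, target_v)
```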
Zero-shot In-context Learning
CosyVoice can clone a voice from a short audio prompt. The process, shown in the figure below, depends on whether the prompt and target text are in the same language.
Figure: Schematic of (a) zero-shot in-context learning and (b) cross-lingual voice cloning in FunAudioLLM, annotating the speaker embedding (Spk Emb), input text, and generated speech tokens, and showing how data flows through the model in each task.
- Same Language (a): The text and tokens from the prompt speech are simply prepended to the target text. The LM then continues generating speech tokens autoregressively from where the prompt left off.
- Cross-Lingual (b): To prevent the prosody of the prompt's language from "leaking" into the target language, the text and speech tokens from the prompt are omitted from the LM's input. Only the speaker embedding and reference Mel spectrogram from the prompt are used to condition the flow-matching and vocoder stages, ensuring timbre transfer without prosodic interference. A sketch of the two prompt-construction modes follows.
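A small sketch of how the LM input differs between the two modes; token names such as `<sos>` and `<task>` are placeholders, not the model's actual vocabulary.

```python
def build_lm_input(prompt_text_tokens, prompt_speech_tokens, target_text_tokens,
                   same_language=True):
    """(a) same-language cloning vs. (b) cross-lingual cloning, as illustrative token lists."""
    if same_language:
        # (a) prepend the prompt's text and S^3 tokens; the LM continues the token stream
        text_part = prompt_text_tokens + target_text_tokens
        speech_prefix = prompt_speech_tokens
    else:
        # (b) drop the prompt's text/speech tokens so its prosody does not leak;
        # timbre is carried only by the speaker embedding / reference mel downstream
        text_part = target_text_tokens
        speech_prefix = []
    return ["<sos>"] + text_part + ["<task>"] + speech_prefix
```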
Instruction Fine-tuning
To create CosyVoice-instruct, the CosyVoice-base model is further fine-tuned on a dataset where inputs are paired with natural language instructions. This enables fine-grained control over:
- Speaker Identity: Descriptions like "A mysterious, elegant dancer" can create novel voice personas.
- Speaking Style: Instructions like "A happy girl with high tone and quick speech" can control emotion and prosody.
- Fine-grained Paralinguistics: Special tags can be inserted into the text to generate non-speech sounds like [laughter] and [breath], or to add emphasis to particular words with dedicated markup tags. Example instruction/text pairs are shown below.
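To illustrate the kind of instruction data involved, here are a few example pairs in the spirit of the paper's demos; the exact prompt format of CosyVoice-instruct may differ, and the field names are assumptions.

```python
# Illustrative instruction/text pairs for instruction fine-tuning.
instruct_examples = [
    {"instruction": "A happy girl with high tone and quick speech.",
     "text": "I just got my exam results back!"},
    {"instruction": "A mysterious, elegant dancer narrating slowly.",
     "text": "The curtain rose [breath] and the hall fell silent."},
    {"instruction": None,   # fine-grained tags can also be embedded directly in the text
     "text": "That was hilarious [laughter], I honestly did not expect it to work."},
]
```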
5. Experimental Setup
5.1. Datasets
5.1.1. SenseVoice Training Data
The models were trained on massive datasets.
- SenseVoice-Small: Trained on ~300,000 hours of audio covering 5 languages (Chinese, Cantonese, English, Japanese, Korean).
- SenseVoice-Large: Trained on the same data plus an additional 100,000 hours of multilingual data to support over 50 languages.
- Rich Transcription Data: To train the emotion and audio event detection capabilities, pseudo-labels were generated using existing open-source models. This resulted in a dataset with 150 million audio event entries and 30 million speech emotion entries.
The distribution of training data hours across languages is shown below (in log scale).
Figure: Bar chart (log scale) of ASR training-data volume per language; Chinese (zh) and English (en) have the largest amounts, peaking at about 76,800 hours, with the remaining languages decreasing from there.
5.1.2. CosyVoice Training Data
- Total Data: ~170,000 hours across 5 languages (Chinese, English, Cantonese, Japanese, Korean).
- Data Processing: The data was collected and cleaned using in-house tools for speech detection, SNR estimation, and speaker diarization. Pseudo-labels were generated using SenseVoice-Large and Paraformer, followed by refinement with forced alignment models.
- Instruction Data: The CosyVoice-instruct model was fine-tuned on a smaller dataset with specific instruction types: 101 hours for speaker identity, 407 hours for speaking style, and 48 hours for fine-grained paralinguistics.

The following are the results from Table 4 and Table 5 of the original paper:
Table 4: Hours of CosyVoice training data across languages.
| Language | Duration (hr) |
|---|---|
| ZH | 130,000 |
| EN | 30,000 |
| Yue | 5,000 |
| JP | 4,600 |
| KO | 2,200 |
Table 5: Duration statistics of instruction training data by type.
| Type | Duration (hr) |
|---|---|
| Speaker Identity | 101 |
| Speaking Style | 407 |
| Fine-grained Paralinguistics | 48 |
5.1.3. Evaluation Datasets
A wide range of standard benchmarks were used to evaluate the models:
- ASR: AISHELL-1/2, WenetSpeech (Chinese), LibriSpeech (English), and Common Voice (multilingual).
- SER: CREMA-D, MELD, IEMOCAP, MSP-Podcast, CASIA, MER2023, and ESD (covering English and Chinese).
- AED: ESC-50, Coswara, and custom in-house datasets.
- TTS Quality: LibriTTS (English) and AISHELL-3 (Chinese).
5.2. Evaluation Metrics
For every evaluation metric mentioned in the paper, the following provides a complete explanation.
5.2.1. Word Error Rate (WER) / Character Error Rate (CER)
- Conceptual Definition: WER and CER are the standard metrics for evaluating ASR accuracy. They measure the number of errors in the transcribed text compared to a ground-truth reference. WER is used for space-separated languages like English, while CER is used for character-based languages like Chinese. A lower value is better.
- Mathematical Formula: $ \text{WER} = \frac{S + D + I}{N} $ The formula for CER is identical, but operates on characters instead of words.
- Symbol Explanation:
  - $S$: The number of substitutions (words/characters that were incorrectly transcribed).
  - $D$: The number of deletions (words/characters that were missed in the transcription).
  - $I$: The number of insertions (words/characters that were added to the transcription but were not in the reference).
  - $N$: The total number of words/characters in the reference transcript.
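For concreteness, here is a small self-contained WER implementation using the standard edit-distance dynamic program; passing character lists instead of whitespace-split words gives CER.

```python
def word_error_rate(ref: str, hyp: str) -> float:
    """Compute WER = (S + D + I) / N via edit distance between reference and hypothesis."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                      # i deletions to reach an empty hypothesis
    for j in range(len(h) + 1):
        d[0][j] = j                      # j insertions from an empty reference
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])     # substitution (or match)
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / max(len(r), 1)

print(word_error_rate("the cat sat", "the cat sad"))  # 1 substitution / 3 words ≈ 0.33
```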
5.2.2. Real-Time Factor (RTF)
- Conceptual Definition: RTF measures the processing speed of a speech model. It is the ratio of the time taken to process an audio file to the duration of the audio file itself. An RTF less than 1 means the processing is faster than real-time. A lower value is better.
- Mathematical Formula: $ \text{RTF} = \frac{\text{Processing Time}}{\text{Audio Duration}} $
- Symbol Explanation:
- Processing Time: The wall-clock time taken by the model to generate the transcription.
- Audio Duration: The length of the input audio clip in the same time units.
5.2.3. Accuracy (UA / WA)
- Conceptual Definition: Used for classification tasks like SER.
- Weighted Average Accuracy (WA): The overall percentage of correctly classified samples. It can be misleading if class distribution is imbalanced.
- Unweighted Average Accuracy (UA): The average of the accuracies for each class. It gives equal weight to each class, regardless of how many samples it has, making it a better metric for imbalanced datasets.
- Mathematical Formula: $ \text{WA} = \frac{\sum_{i=1}^{C} \text{TP}_i}{\sum_{i=1}^{C} N_i} $ and $ \text{UA} = \frac{1}{C} \sum_{i=1}^{C} \frac{\text{TP}_i}{N_i} $
- Symbol Explanation:
  - $C$: The number of classes.
  - $\text{TP}_i$: The number of correctly classified samples (true positives) for class $i$.
  - $N_i$: The total number of samples in class $i$.
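A short sketch of how WA and UA differ on the same predictions:

```python
import numpy as np

def wa_ua(y_true, y_pred, n_classes):
    """Weighted (overall) accuracy and unweighted (per-class-averaged) accuracy."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    wa = (y_true == y_pred).mean()                        # overall accuracy
    per_class = [(y_pred[y_true == c] == c).mean()        # per-class recall
                 for c in range(n_classes) if (y_true == c).any()]
    ua = float(np.mean(per_class))
    return wa, ua

print(wa_ua([0, 0, 0, 1], [0, 0, 1, 1], 2))  # WA = 0.75, UA = (2/3 + 1)/2 ≈ 0.83
```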
5.2.4. F1 Score
- Conceptual Definition: The harmonic mean of precision and recall, providing a single score that balances both. It is particularly useful for imbalanced classification tasks. Macro F1 calculates the F1 for each class and takes the average, while Weighted F1 calculates a weighted average based on the number of samples in each class.
- Mathematical Formula: $ \text{F1} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} $ where $\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}$ and $\text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}}$.
- Symbol Explanation:
  - $\text{TP}$: True Positives.
  - $\text{FP}$: False Positives.
  - $\text{FN}$: False Negatives.
5.2.5. Speaker Similarity (SS)
- Conceptual Definition: Measures how closely the timbre of a synthesized voice matches a target speaker's voice. It is typically calculated as the cosine similarity between speaker embeddings. A higher value is better.
- Mathematical Formula: $ \text{SS} = \text{CosineSimilarity}(\mathbf{e}_{\text{gen}}, \mathbf{e}_{\text{prompt}}) = \frac{\mathbf{e}_{\text{gen}} \cdot \mathbf{e}_{\text{prompt}}}{\|\mathbf{e}_{\text{gen}}\| \, \|\mathbf{e}_{\text{prompt}}\|} $
- Symbol Explanation:
  - $\mathbf{e}_{\text{gen}}$: The speaker embedding vector extracted from the generated speech.
  - $\mathbf{e}_{\text{prompt}}$: The speaker embedding vector extracted from the prompt/reference speech.
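As a sketch, SS reduces to a cosine similarity between two embedding vectors; which speaker-verification model produced the embeddings is outside the scope of this snippet.

```python
import numpy as np

def speaker_similarity(e_gen: np.ndarray, e_prompt: np.ndarray) -> float:
    """Cosine similarity between speaker embeddings of generated and prompt speech."""
    return float(e_gen @ e_prompt /
                 (np.linalg.norm(e_gen) * np.linalg.norm(e_prompt)))
```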
5.3. Baselines
The paper compares its models against a strong set of representative baselines:
- For ASR (SenseVoice):
  - Whisper-Small and Whisper-Large-V3: The state-of-the-art open-source ASR models from OpenAI.
  - Paraformer-zh: A high-performance non-autoregressive model for Chinese ASR.
- For SER (SenseVoice):
  - EmoBox, Emo-Superb, MerBench: Recent SER benchmark results from the literature.
  - XLSR-SER, Qwen-Audio, SALMONN: Open-source SER models or Audio-LLMs with SER capabilities.
- For TTS (CosyVoice):
  - ChatTTS: A recent open-source TTS model focused on conversational speech.
  - ChatGPTS: Likely a typo for ChatTTS or referring to a generic GPT-based TTS model. Given the context, it is probably ChatTTS.
6. Results & Analysis
6.1. Core Results Analysis
6.1.1. Multilingual Speech Recognition (SenseVoice)
The following are the results from Table 6 of the original paper, comparing ASR performance on Chinese and English benchmarks.
| Test set | Whisper-S | Whisper-LV3 | SenseVoice-S | SenseVoice-L | Paraformer-zh |
|---|---|---|---|---|---|
| AISHELL-1 test | 10.04 | 5.14 | 2.96 | 2.09 | 1.95 |
| AISHELL-2 test_ios | 8.78 | 4.96 | 3.80 | 3.04 | 2.85 |
| WenetSpeech test_meeting | 25.62 | 18.87 | 7.44 | 6.73 | 6.97 |
| WenetSpeech test_net | 16.66 | 10.48 | 7.84 | 6.01 | 6.74 |
| LibriSpeech test_clean | 3.13 | 1.82 | 3.15 | 2.57 | - |
| LibriSpeech test_other | 7.37 | 3.50 | 7.18 | 4.28 | - |
| CommonVoice zh-CN | 19.60 | 12.55 | 10.78 | 7.68 | 10.30 |
| CommonVoice en | 14.85 | 9.39 | 14.71 | 9.00 | - |
| CommonVoice yue | 38.97 | 10.41 | 7.09 | 6.78 | - |
| CommonVoice ja | 19.51 | 10.34 | 11.96 | 9.19 | - |
| CommonVoice ko | 10.48 | 5.59 | 8.28 | 5.21 | - |
| CommonVoice 5 lang. Average | 20.68 | 9.66 | 10.56 | 7.57 | - |
Analysis:
- Dominance in Chinese: SenseVoice-S and SenseVoice-L significantly outperform their Whisper counterparts on all Chinese test sets (AISHELL, WenetSpeech, CommonVoice zh-CN). SenseVoice-L even achieves results competitive with or better than Paraformer, a model specialized for Chinese. This suggests the training data and architecture are highly optimized for these languages.
- Cantonese Performance: The most dramatic improvement is on Cantonese (yue), where SenseVoice-L (6.78% CER) and SenseVoice-S (7.09% CER) are vastly superior to Whisper-L-V3 (10.41% CER) and especially Whisper-S (38.97% CER).
- English Performance: On LibriSpeech, Whisper models still hold an edge. This is expected, as Whisper was extensively trained on English data. However, on CommonVoice en, SenseVoice-L is slightly better than Whisper-L-V3.
- Overall: SenseVoice-L achieves the best average performance across the 5 Common Voice languages.

The following are the results from Table 7, comparing inference efficiency.
| Model | Framework | Parameters | Supported Languages | RTF | 10s Audio Latency (ms) |
|---|---|---|---|---|---|
| Whisper-S | Autoregressive | 224M | 50+ | 0.042 | 518 |
| Whisper-L-V3 | Autoregressive | 1550M | 50+ | 0.111 | 1281 |
| Paraformer-zh | Non-autoregressive | 220M | zh | 0.009 | 100 |
| SenseVoice-S | Non-autoregressive | 234M | zh, yue, en, ja, ko | 0.007 | 70 |
| SenseVoice-L | Autoregressive | 1587M | 50+ | 0.110 | 1623 |
Analysis:
- Speed of SenseVoice-S: The non-autoregressive architecture of SenseVoice-S provides a massive speed advantage. With a latency of only 70 ms for a 10 s audio clip (RTF 0.007), it is 7.4 times faster than Whisper-S (518 ms) and 18.3 times faster than Whisper-L-V3 (1281 ms). This makes it highly suitable for real-time interactive applications.
- SenseVoice-L Efficiency: SenseVoice-L has comparable latency to Whisper-L-V3, which is expected as they are both large autoregressive models.
6.1.2. Speech Emotion Recognition (SenseVoice)
The following are the results from Table 8 of the original paper.
| Test set | EmoBox | Emo-Superb | MerBench | SenseVoice-L | SenseVoice-S | |||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| UA | WA | F1 | WF1 | UA | WA | UA | WA | F1 | WF1 | UA | WA | |
| CASIA | 59.6 | 59.6 | 56.3 | - | - | - | 96.0 | 96.0 | 95.5 | 95.5 | 70.0 | 70.0 |
| CREMA-D | 76.8 | 76.5 | 76.6 | 67.7 | - | - | 90.1 | 90.4 | 89.8 | 89.9 | 73.1 | 74.0 |
| ESD | 84.6 | 84.6 | 84.3 | - | - | - | 93.2 | 93.2 | 92.2 | 92.2 | 85.5 | 81.0 |
| IEMOCAP | 73.5 | 72.9 | 73.1 | - | - | - | 73.9 | 75.3 | 73.2 | 72.8 | 70.5 | 67.9 |
| MELD | 31.5 | 51.9 | 32.9 | - | - | - | 58.7 | 63.1 | 50.9 | 65.7 | 50.8 | 57.8 |
| MER2023 | 61.2 | 65.2 | 62.3 | - | - | - | 70.9 | 69.2 | 55.6 | 57.4 | 69.0 | 68.3 |
| MSPP | 21.4 | 43.4 | 21.5 | 38.4 | - | - | 46.0 | 61.7 | 45.0 | 58.9 | 49.4 | 64.1 |
Analysis:
- SenseVoice demonstrates strong zero-shot SER performance without any fine-tuning on the target datasets. SenseVoice-L is highly competitive with specialized SER benchmarks like MerBench on several datasets (e.g., CREMA-D, ESD). On MELD, MER2023, and MSPP, its performance is strong but slightly lower than the specialized models, which is expected. SenseVoice-S also shows respectable performance, often outperforming EmoBox, proving that its speed does not come at a complete loss of its rich understanding capabilities.
- The bar chart in Figure 8 further confirms that SenseVoice-Large generally achieves the best or second-best performance across all tested datasets compared to other open-source models, establishing it as a state-of-the-art SER model.
6.1.3. Preserving Semantic Information by the S^3 Tokenizer
The following are the results from Table 9, which evaluates the ASR performance of the model when using the discretized S^3 tokens, to check how much semantic information is preserved after quantization.
| Test set | Whisper-L-V3 w/o LID | Whisper-L-V3 w/ LID | SenseVoice-L w/o LID | SenseVoice-L w/ LID | S^3 tokens w/o LID | S^3 tokens w/ LID |
|---|---|---|---|---|---|---|
| common_voice_zh-CN | 12.82 | 12.55 | 8.76 | 8.68 | 12.24 | 12.06 |
| common_voice_en | 13.55 | 9.39 | 9.79 | 9.77 | 15.43 | 15.38 |
Analysis:
- Semantic Preservation: The key finding is that the performance drop from using the quantized S^3 tokens is relatively small compared to the original SenseVoice-L continuous features, especially for Chinese. On common_voice_zh-CN, the S^3 tokens even lead to better performance than Whisper-Large-V3. This empirically validates the paper's claim that the supervised tokenizer effectively preserves semantic information.
- Performance Degradation: There is a noticeable degradation on English compared to the original SenseVoice-L, but the performance is still in a reasonable range. This indicates that the single-codebook VQ with 4096 entries is a decent trade-off between compression and information preservation.
6.1.4. Generation Quality of CosyVoice
The following are the results from Table 10 (English) and Table 11 (Chinese), evaluating the generated speech quality.
Table 10: English (LibriTTS test-clean)
| Model | WER(%) | #Ins.&Del. | SS |
|---|---|---|---|
| Original | 2.66 | 92 | 69.67 |
| ChatTTS | 8.32 | 441 | - |
| CosyVoice | |||
| +5x re-ranking | 1.51 | 47 | 74.30 |
Table 11: Chinese (AISHELL-3 test)
| Model | CER (%) | #Ins.&Del. | SS |
|---|---|---|---|
| Original | 2.52 | 25 | 74.15 |
| CosyVoice | 3.82±0.24 | 24.4±2.24 | 81.58±0.16 |
| + 5× re-ranking | 1.84 | 11 | 81.58 |
| ChatTTS | 3.87 | 111 | - |
Analysis:
- Content Consistency: CosyVoice achieves WER/CER scores very close to the original human-uttered speech, indicating high content fidelity. It significantly outperforms ChatTTS, which has much higher error rates and insertion/deletion counts. The paper notes ChatTTS suffers from "speaker leaking," which CosyVoice avoids.
- Speaker Similarity: CosyVoice achieves a higher speaker similarity (SS) score than the original data itself. This might seem counterintuitive, but it suggests the model is very effective at capturing and reproducing a consistent speaker timbre, potentially "cleaning up" natural variations present in the original dataset.
- ASR Re-ranking: Using an ASR model to re-rank multiple generated samples is highly effective, reducing the error rate to a remarkably low 1.51% WER (English) and 1.84% CER (Chinese), demonstrating its potential for high-quality offline synthesis. A sketch of this re-ranking procedure follows.
6.1.5. Emotion Controllability of CosyVoice
The following are the results from Table 12, evaluating the accuracy of emotion control.
| Model | Happy | Sad | Angry | Surprised | Fearful | Disgusted |
|---|---|---|---|---|---|---|
| CosyVoice-base | 1.00±0.00 | 0.45±0.05 | 0.59±0.03 | 0.26±0.02 | 0.88±0.01 | 0.46±0.06 |
| CosyVoice-instruct | 1.00±0.00 | 0.98±0.02 | 0.83±0.04 | 0.64±0.03 | 0.87±0.03 | 0.93±0.02 |
| w/o instruction | 0.98±0.01 | 0.77±0.04 | 0.49±0.12 | 0.28±0.06 | 0.83±0.04 | 0.45±0.16 |
Analysis:
- The results clearly show the effectiveness of instruction fine-tuning. CosyVoice-instruct (with instructions) achieves significantly higher accuracy in generating the target emotion compared to CosyVoice-base or even CosyVoice-instruct without an explicit instruction.
- For emotions like 'Sad', 'Angry', 'Surprised', and 'Disgusted', the improvement is dramatic. For example, sadness accuracy jumps from 45% to 98%.
- This confirms that the model successfully learns to associate natural language commands with specific expressive speech styles.
7. Conclusion & Reflections
7.1. Conclusion Summary
The paper successfully introduces FunAudioLLM, a powerful and comprehensive open-source framework for building natural voice interfaces for LLMs. It achieves this through two highly capable and specialized foundation models:
- SenseVoice provides both high-speed (Small) and high-accuracy (Large) multilingual speech understanding, going beyond simple transcription to include emotion and audio event detection.
- CosyVoice delivers state-of-the-art, controllable speech generation, enabled by its novel supervised semantic speech tokenizer (S^3). It excels at zero-shot voice cloning and instruction-based style control.

By open-sourcing these models and their associated code, the Tongyi SpeechTeam has made a significant contribution to the community, democratizing access to technology that is often kept proprietary. The demonstrated applications highlight the practical value of this modular framework for creating rich, next-generation voice experiences.
7.2. Limitations & Future Work
The authors candidly acknowledge several limitations:
- Under-resourced Languages: ASR performance for languages with less data is still significantly lower than for high-resource languages.
- No Streaming Support: The current SenseVoice models are not designed for real-time streaming transcription, which is crucial for many conversational AI applications. Future work will focus on developing streamable versions.
- CosyVoice Limitations: The model supports a limited number of languages, cannot infer emotion from text content alone (it requires an explicit instruction), performs poorly at singing, and can still improve the trade-off between maintaining timbre and expressing strong emotions.
- Pipeline Approach: The framework is a pipeline (SenseVoice -> LLM -> CosyVoice), which can lead to error propagation (e.g., an ASR error affecting the LLM's response). The components are not trained end-to-end with the LLM.
7.3. Personal Insights & Critique
This paper is an excellent piece of engineering and a valuable contribution to the open-source community.
Strengths and Inspirations:
- Pragmatic and Modular Design: The decision to build specialized models for understanding and generation is a smart, pragmatic approach. It allows for optimization targeted at specific use cases (latency vs. accuracy) and provides greater flexibility for developers to mix and match components, compared to a monolithic end-to-end model.
- The S^3 Tokenizer: The idea of a supervised semantic tokenizer is a key insight. It elegantly solves the problem of bridging the text and audio domains. By tying tokenization to the ASR task, the model learns a representation that is inherently meaningful, which likely simplifies the subsequent TTS task significantly. This is a powerful concept that could be applied to other audio generation tasks.
- Commitment to Open Source: In an era where many foundational models are locked behind APIs, the comprehensive open-sourcing of models, code, and training details is highly commendable and will undoubtedly accelerate research and innovation.
Potential Issues and Areas for Improvement:
- End-to-End Training: As the authors note, the pipeline nature is a potential weakness. A future research direction could be to explore methods for joint fine-tuning of the entire SenseVoice + LLM + CosyVoice stack. This could allow the models to adapt to each other and reduce cascading errors, potentially leading to more seamless and robust interactions.
- Richer Paralinguistics: While the inclusion of laughter and breath is a good step, natural human speech is filled with a much wider array of fillers, pauses, and prosodic nuances. Expanding the instruction-following capabilities to control these subtle aspects would be a valuable next step toward truly human-like conversation.
- Evaluation of "Naturalness": The paper relies on objective metrics like WER and SS. While useful, they don't fully capture the subjective "naturalness" or "appropriateness" of a conversation. Human evaluation (e.g., Mean Opinion Score - MOS) would be necessary to provide a complete picture of the user experience, especially for applications like Emotional Voice Chat.
- Scalability to More Languages: The framework's utility is currently limited by the language coverage of CosyVoice (5 languages). Expanding this to the 50+ languages supported by SenseVoice-Large would be a major undertaking but would greatly increase the global impact of the project.