
FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs

Published: 07/05/2024

TL;DR Summary

This report presents FunAudioLLM, a model family enhancing natural voice interaction between humans and LLMs. It features SenseVoice for multilingual speech and emotion recognition, and CosyVoice for natural voice generation, supporting applications such as speech-to-speech translation and emotional voice chat.

Abstract

This report introduces FunAudioLLM, a model family designed to enhance natural voice interactions between humans and large language models (LLMs). At its core are two innovative models: SenseVoice, which handles multilingual speech recognition, emotion recognition, and audio event detection; and CosyVoice, which facilitates natural speech generation with control over multiple languages, timbre, speaking style, and speaker identity. SenseVoice-Small delivers exceptionally low-latency ASR for 5 languages, and SenseVoice-Large supports high-precision ASR for over 50 languages, while CosyVoice excels in multi-lingual voice generation, zero-shot in-context learning, cross-lingual voice cloning, and instruction-following capabilities. The models related to SenseVoice and CosyVoice have been open-sourced on Modelscope and Huggingface, along with the corresponding training, inference, and fine-tuning codes released on GitHub. By integrating these models with LLMs, FunAudioLLM enables applications such as speech-to-speech translation, emotional voice chat, interactive podcasts, and expressive audiobook narration, thereby pushing the boundaries of voice interaction technology. Demos are available at https://fun-audio-llm.github.io, and the code can be accessed at https://github.com/FunAudioLLM.


In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs

The title clearly states the paper's central topic: the creation of a family of foundational models, named FunAudioLLM, designed to enable more natural, voice-based interactions between humans and Large Language Models (LLMs). It highlights the two core functionalities: voice understanding (input) and voice generation (output).

1.2. Authors

The authors are credited as "Tongyi SpeechTeam" from "Alibaba Group". This indicates the research is a product of a corporate research team rather than individual academic researchers. The Tongyi brand is associated with Alibaba's suite of large models, suggesting this work is part of a broader AI initiative within the company. The large number of authors listed in Section 7 reflects the scale of the project, typical for foundational model development in major tech companies.

1.3. Journal/Conference

The paper was submitted to arXiv, an open-access repository for electronic preprints of scientific papers. This is a common practice for rapid dissemination of research findings, especially for fast-moving fields like AI and large models. As a preprint, it has not yet undergone formal peer review for publication in a specific conference or journal. However, the models and code have been open-sourced, which allows the community to verify and build upon the work.

1.4. Publication Year

The initial version was submitted to arXiv on July 4, 2024.

1.5. Abstract

The abstract introduces FunAudioLLM, a model family aimed at improving human-LLM voice interaction. It consists of two primary models:

  1. SenseVoice: For voice understanding. It handles multilingual automatic speech recognition (ASR), emotion recognition, and audio event detection. It comes in two versions: SenseVoice-Small for low-latency ASR in 5 languages and SenseVoice-Large for high-precision ASR in over 50 languages.

  2. CosyVoice: For voice generation. It facilitates natural text-to-speech (TTS) with control over language, timbre, style, and speaker identity. It excels in multilingual generation, zero-shot voice cloning, and following instructions.

    The paper emphasizes that these models have been open-sourced on platforms like Modelscope and Huggingface, with code available on GitHub. By integrating these models with LLMs, FunAudioLLM enables advanced applications like speech-to-speech translation, emotional chatbots, and expressive audiobook narration.

2. Executive Summary

2.1. Background & Motivation

The core problem this paper addresses is the gap between the advanced text-based capabilities of Large Language Models (LLMs) and the quality of voice-based interaction with them. While models like GPT-4o and Gemini have demonstrated multimodal capabilities, creating a truly natural, low-latency, and emotionally expressive voice interface remains a significant challenge.

This problem is important because voice is the most natural form of human communication. Current systems often suffer from:

  • High Latency: The time delay between speaking and receiving a spoken response can feel unnatural and disruptive.

  • Lack of Expressiveness: Synthesized voices are often monotonous and fail to convey emotion or appropriate prosody, making interactions feel robotic.

  • Limited Understanding: Speech recognition systems may struggle with multiple languages, accents, or fail to capture paralinguistic cues like emotion or background audio events (e.g., laughter, music).

  • Monolithic Systems: End-to-end multimodal models can be massive and difficult to adapt, whereas modular systems (separate speech understanding, language modeling, and speech generation) can be more flexible but risk error propagation.

    The paper's entry point is to tackle this problem by developing a pair of specialized, high-performance, open-source foundation models: SenseVoice for comprehensive voice understanding and CosyVoice for controllable, high-quality voice generation. The innovative idea is to create a modular yet highly integrated framework that can be easily combined with any LLM to build powerful voice-first applications.

2.2. Main Contributions / Findings

The paper presents the following primary contributions:

  1. SenseVoice Model for Voice Understanding:

    • SenseVoice-Small: A non-autoregressive model designed for extremely low latency (over 15 times faster than Whisper-large) for real-time applications, supporting 5 languages.
    • SenseVoice-Large: An autoregressive model optimized for high accuracy across 50+ languages, showing superior performance, particularly in Chinese and Cantonese, compared to strong baselines like Whisper.
    • Rich Transcription: Beyond just text, SenseVoice can detect emotion, audio events (music, applause, laughter), and handle punctuation and inverse text normalization.
  2. CosyVoice Model for Voice Generation:

    • Supervised Semantic Speech Tokenizer ($S^3$): A novel speech tokenizer trained in a supervised manner, which captures semantic and paralinguistic information more effectively than unsupervised alternatives, improving synthesis quality and robustness to noisy data.
    • High-Fidelity and Controllable Synthesis: CosyVoice achieves human-level quality in speech generation, supporting zero-shot voice cloning from just 3 seconds of audio, cross-lingual voice cloning, and instruction-based control over emotion, speaking style, and speaker identity.
    • Multiple Open-Source Variants: The release includes CosyVoice-base (for cloning), CosyVoice-instruct (for controllable style), and CosyVoice-sft (a ready-to-use fine-tuned model).
  3. FunAudioLLM Framework and Applications:

    • The paper demonstrates how combining SenseVoice, CosyVoice, and an LLM enables a suite of advanced applications, including speech-to-speech translation with voice cloning, emotional voice chats, interactive podcasts, and expressive audiobook narration.
  4. Open-Sourcing: A significant contribution is the public release of the models, training/inference code, and fine-tuning scripts, fostering further research and development in the community. This contrasts with many state-of-the-art models from large tech companies that remain closed-source.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

3.1.1. Large Language Models (LLMs)

LLMs, such as GPT-4, Llama, and Qwen (mentioned in the paper), are deep learning models with billions of parameters, trained on vast amounts of text data. Their core capability is to understand and generate human-like text. They function as the "brain" in the FunAudioLLM framework, responsible for tasks like translation, generating conversational responses, or analyzing text for emotional content.

3.1.2. Automatic Speech Recognition (ASR)

ASR is the technology that converts spoken language into written text. This is the primary function of the "voice understanding" component.

  • Autoregressive (AR) Models: Generate text token by token, where each new token is predicted based on the previously generated tokens and the input audio. This is often more accurate but slower. SenseVoice-Large and Whisper are examples.
  • Non-Autoregressive (NAR) Models: Generate all output tokens in parallel or in a single pass. This is significantly faster but can be less accurate. SenseVoice-Small is an example, making it suitable for real-time applications.
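The contrast between the two decoding regimes can be made concrete with a toy sketch. These are illustrative numpy stand-ins, not the actual SenseVoice models: the NAR path labels every frame in one parallel pass and collapses repeats CTC-style, while the AR path emits one token per step conditioned on the prefix.

```python
# Toy numpy stand-ins, not the actual SenseVoice models.
import numpy as np

rng = np.random.default_rng(0)
T, V = 20, 8                                    # audio frames, toy vocabulary (0 = blank/EOS)

# Non-autoregressive (CTC-style): a single pass over all frames.
frame_logits = rng.normal(size=(T, V))          # encoder output for every frame at once
frame_ids = frame_logits.argmax(axis=-1)        # greedy per-frame labels
nar_text = [int(i) for i, prev in zip(frame_ids, np.r_[-1, frame_ids[:-1]])
            if i != 0 and i != prev]            # collapse repeats, drop blanks

# Autoregressive: tokens are produced sequentially, each conditioned on the prefix.
def decoder_step(prefix):                       # stand-in for a decoder forward pass
    return rng.normal(size=V)

ar_text = []
for _ in range(10):                             # maximum decode length
    token = int(decoder_step(ar_text).argmax())
    if token == 0:                              # treat 0 as end-of-sequence here
        break
    ar_text.append(token)

print("NAR (single pass):", nar_text)
print("AR  (sequential): ", ar_text)
```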

3.1.3. Text-to-Speech (TTS)

TTS is the technology that converts written text into spoken language. This is the function of the "voice generation" component, CosyVoice. Modern TTS systems aim for naturalness, expressiveness, and the ability to clone voices.

  • Zero-Shot Voice Cloning: The ability of a TTS model to replicate a speaker's voice using a short audio sample (a "prompt") without needing to be retrained on that speaker's voice.
  • Instruction-Following TTS: A more advanced form of TTS where the model can modify the generated speech (e.g., emotion, pitch, speed) based on natural language commands provided alongside the text.

3.1.4. Speech Tokenizers

To be processed by language models, continuous audio signals must be converted into a sequence of discrete units, or "tokens". This is the role of a speech tokenizer.

  • Unsupervised Tokenizers: Models like Encodec or SoundStream learn to compress and discretize audio without any text labels. They focus on reconstructing the audio accurately but may not capture semantic content well.
  • Supervised Tokenizers: The paper's proposed $S^3$ tokenizer is trained with text transcriptions, forcing the tokens to be more semantically meaningful. This helps the subsequent TTS model learn the mapping from text to speech more easily.

3.1.5. Flow Matching

Flow Matching is a modern generative modeling technique used in CosyVoice to generate a Mel spectrogram (a visual representation of audio frequencies) from speech tokens. It learns to transform a simple noise distribution into the complex data distribution of spectrograms by following a "vector field". It is known for being more efficient (requiring fewer steps) than traditional diffusion models while achieving high-quality results.
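A minimal sketch of the optimal-transport flow-matching objective follows. A toy MLP stands in for the convolutional Transformer U-Net and all conditioning (speech tokens, speaker embedding, reference spectrogram) is omitted; only the straight-line interpolation path and the regression target used during training are the point here.

```python
# Minimal optimal-transport flow-matching objective; toy MLP, no conditioning.
import torch
import torch.nn as nn

dim = 80                                          # Mel bins
net = nn.Sequential(nn.Linear(dim + 1, 256), nn.SiLU(), nn.Linear(256, dim))

def flow_matching_loss(x1):
    """x1: a batch of target Mel frames, shape (B, dim)."""
    x0 = torch.randn_like(x1)                     # noise sample
    t = torch.rand(x1.size(0), 1)                 # random time in [0, 1]
    xt = (1 - t) * x0 + t * x1                    # point on the straight (OT) path
    target_v = x1 - x0                            # constant vector field along that path
    pred_v = net(torch.cat([xt, t], dim=-1))      # network predicts the vector field
    return ((pred_v - target_v) ** 2).mean()

loss = flow_matching_loss(torch.randn(16, dim))
loss.backward()                                   # an optimizer step would follow in training
```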

3.2. Previous Works

  • Whisper (Radford et al., 2023): Developed by OpenAI, Whisper is a large-scale ASR model trained on 680,000 hours of weakly supervised multilingual data from the internet. It set a new standard for robustness and accuracy across many languages. The paper frequently compares SenseVoice to Whisper, positioning SenseVoice-Large as a more accurate alternative (especially for Chinese) and SenseVoice-Small as a much faster one.
  • GPT-4o (OpenAI, 2023) and Gemini (Reid et al., 2024): These are state-of-the-art multimodal models that can process and generate audio, vision, and text end-to-end. While powerful, they are often closed-source, massive, and less flexible than the modular approach proposed by FunAudioLLM. FunAudioLLM offers an open-source, adaptable alternative.
  • Paraformer (Gao et al., 2022): A non-autoregressive ASR model also from Alibaba, known for its high speed and accuracy in Mandarin speech recognition. SenseVoice-Small builds on similar NAR principles for low latency.
  • VALL-E / Neural Codec Language Models (Wang et al., 2023a): This work pioneered the use of neural codec language models for zero-shot TTS. It treats TTS as a language modeling problem, predicting discrete audio codec codes from text. CosyVoice adopts this paradigm but improves upon it with its supervised $S^3$ tokenizer.
  • ChatTTS, EmotiVoice, OpenVoice: These are other prominent open-source TTS models. The paper compares CosyVoice against them in Table 2, claiming CosyVoice offers a more comprehensive set of features, including zero-shot cloning, style control, fine-grained control, and a pre-fine-tuned model.

3.3. Technological Evolution

The field of voice interaction has evolved from separate, task-specific models to large, unified foundation models.

  1. Early Systems: Separate ASR, Natural Language Understanding (NLU), and TTS modules pipelined together. These were brittle and suffered from error propagation.

  2. End-to-End Models: Models like Listen, Attend and Spell began to unify ASR into a single neural network. Similarly, TTS evolved with models like Tacotron and WaveNet.

  3. Large-Scale Pre-training: Whisper showed the power of massive, weakly supervised pre-training for ASR, achieving unprecedented robustness.

  4. Generative TTS with Codecs: VALL-E framed TTS as predicting discrete audio codes, enabling powerful in-context learning capabilities like zero-shot voice cloning.

  5. Multimodal LLMs: GPT-4o and Gemini represent the move towards single, giant models that handle all modalities.

    FunAudioLLM positions itself as a pragmatic and powerful middle ground: it leverages the power of foundation models but maintains a modular design (SenseVoice + LLM + CosyVoice). This allows for specialization, optimization (e.g., low-latency ASR), and flexibility, while being open-source.

3.4. Differentiation Analysis

Compared to previous work, FunAudioLLM's core innovations are:

  • Specialized and Optimized Models: Instead of a single monolithic model, it provides two specialized models: SenseVoice for understanding and CosyVoice for generation. This allows SenseVoice-Small to be optimized for extreme speed and SenseVoice-Large for extreme accuracy, a trade-off that is harder to manage in a single model.
  • Supervised Semantic Tokenizer ($S^3$): This is a key differentiator from models like VALL-E that use unsupervised audio codecs. By training the tokenizer with ASR supervision, the resulting tokens are more semantically meaningful, which the paper argues leads to better TTS quality, faster training, and robustness to noisy data.
  • Comprehensive Feature Set in an Open-Source Package: While other models offer parts of the puzzle (e.g., OpenVoice for cloning, ChatTTS for prosody control), CosyVoice aims to provide a unified, open-source solution with zero-shot cloning, instruction-based style control, and fine-grained paralinguistic control.
  • Practical Framework for Applications: The paper goes beyond just releasing models; it provides a clear framework and demos for building real-world applications, lowering the barrier to entry for developers.

4. Methodology

4.1. Principles

The core principle of FunAudioLLM is to create a powerful, flexible, and open-source toolkit for voice-based human-LLM interaction by decomposing the problem into two specialized sub-tasks: voice understanding and voice generation. It leverages large-scale pre-training for both tasks and introduces a novel supervised speech tokenizer to bridge the gap between the semantic world of text and the acoustic world of speech.

The following figure provides a high-level overview of the FunAudioLLM framework, showing the SenseVoice and CosyVoice models as the core components for voice understanding and generation, respectively.

fig 10 This schematic illustrates the core capabilities of the SenseVoice and CosyVoice models within FunAudioLLM, highlighting multilingual speech recognition, high-fidelity voice cloning, and emotional speech synthesis, and their importance for voice interaction.

4.2. Core Methodology In-depth (Layer by Layer)

4.2.1. Voice Understanding Model: SenseVoice

SenseVoice is designed to be a comprehensive speech foundation model that performs multiple tasks, including ASR, Language ID (LID), Speech Emotion Recognition (SER), and Audio Event Detection (AED). It is offered in two variants to cater to different needs: speed vs. accuracy.

The architecture of both SenseVoice-Small and SenseVoice-Large is illustrated in the figure below.

fig 11 This schematic shows the two model architectures. The upper part depicts SenseVoice-Small with its task embeddings, feature extractor, and SAN-M encoder trained with a multi-task loss; the lower part depicts SenseVoice-Large, which uses a Transformer encoder-decoder operating autoregressively from start-of-sequence prompts. The overall design targets efficient and accurate speech recognition and generation.

SenseVoice-Small

This model is optimized for speed and low latency.

  • Architecture: It is a non-autoregressive, encoder-only model. The audio waveform is first converted to 80-dimensional log-mel filterbank features. These features are then down-sampled and fed into an encoder, which is implemented as a memory-equipped self-attention network (SAN-M). The non-autoregressive nature allows it to process the entire audio and produce the output in parallel, leading to significant speed-ups.
  • Multi-task Learning: To handle various tasks, SenseVoice-Small prepends special task-specific embeddings to the input speech features before they enter the encoder. The input sequence $\mathbf{X}$ is constructed as follows (a minimal code sketch follows this list): $ \mathbf{X} = \mathrm{concat}(\mathbf{e}_{\mathrm{LID}}, \mathbf{e}_{\mathrm{SER}}, \mathbf{e}_{\mathrm{AEC}}, \mathbf{e}_{\mathrm{ITN/NoITN}}, \mathbf{X}_{\mathrm{speech}}) \quad (1) $ where:
    • $\mathbf{X}_{\mathrm{speech}} \in \mathbb{R}^{T \times D}$ is the down-sampled speech feature sequence of length $T$ and dimension $D$.
    • $\mathbf{e}_{\mathrm{LID}}$, $\mathbf{e}_{\mathrm{SER}}$, and $\mathbf{e}_{\mathrm{AEC}}$ are embeddings for the special tokens <LID>, <SER>, and <AEC>. When these tokens are present, the model is trained to predict the corresponding label (language, emotion, or audio event) at that position in the output. During training, the <LID> token is sometimes replaced with the ground-truth language token to allow for both prediction and conditioning.
    • $\mathbf{e}_{\mathrm{ITN/NoITN}}$ are embeddings for the <ITN> or <NoITN> tokens, which instruct the model whether to produce a transcription with Inverse Text Normalization and punctuation.
  • Output and Loss Function: The encoder processes the concatenated sequence $\mathbf{X}$ to produce output probabilities $\mathbf{P}$: $ \mathbf{P} = \operatorname{Softmax}(\operatorname{Linear}_{D \to |V'|}(\operatorname{Encoder}(\mathbf{X}))) \quad (2) $ where:
    • $V'$ is the vocabulary containing tokens for both ASR and the other classification tasks.
    • The ASR task is optimized with the Connectionist Temporal Classification (CTC) loss, which is suitable for non-autoregressive sequence-to-sequence tasks where the alignment between input and output is unknown.
    • The LID, SER, and AEC tasks are optimized with a standard cross-entropy loss on the output corresponding to their special input tokens.
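The following is a minimal sketch of Eq. (1)-(2), assuming a generic Transformer encoder in place of SAN-M and illustrative dimensions; it only shows how the task embeddings are prepended and which output positions feed the cross-entropy and CTC losses.

```python
# Minimal sketch of Eq. (1)-(2); generic Transformer stand-in, illustrative sizes.
import torch
import torch.nn as nn

D, V = 512, 6000                                 # feature dim, toy vocabulary |V'|
task_vocab = {"<LID>": 0, "<SER>": 1, "<AEC>": 2, "<ITN>": 3, "<NoITN>": 4}
task_emb = nn.Embedding(len(task_vocab), D)
encoder = nn.TransformerEncoder(nn.TransformerEncoderLayer(D, 8, batch_first=True), 2)
proj = nn.Linear(D, V)

def forward(x_speech, use_itn=True):
    """x_speech: (B, T, D) down-sampled log-mel features after the front-end."""
    ids = torch.tensor([task_vocab["<LID>"], task_vocab["<SER>"], task_vocab["<AEC>"],
                        task_vocab["<ITN>" if use_itn else "<NoITN>"]])
    prefix = task_emb(ids).expand(x_speech.size(0), -1, -1)   # (B, 4, D)
    x = torch.cat([prefix, x_speech], dim=1)                  # Eq. (1)
    logits = proj(encoder(x))                                 # Eq. (2) before the softmax
    # Positions 0-2 (LID/SER/AEC) are trained with cross-entropy;
    # the remaining frame positions are trained with the CTC loss against the transcript.
    return logits[:, :3], logits[:, 4:]

task_logits, ctc_logits = forward(torch.randn(2, 100, D))
```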

SenseVoice-Large

This model is optimized for accuracy and broad language support.

  • Architecture: It is an autoregressive encoder-decoder model, similar in principle to Whisper. The encoder processes the audio features, and the decoder generates the output text token by token.
  • Task Specification: Unlike SenseVoice-Small, which uses special embeddings at the encoder input, SenseVoice-Large specifies tasks via a sequence of prompt tokens given to the decoder at the beginning of generation. For instance, including tokens like <LID>, <SER>, and <AED> in the initial decoder prompt instructs the model to predict the language, emotion, or timed audio events as part of its output sequence. This approach is more flexible and allows for generating complex, structured text outputs.

4.2.2. Semantic Speech Tokenizer ($S^3$)

A critical component for CosyVoice is the tokenizer that converts continuous speech into discrete tokens. The paper argues that popular unsupervised tokenizers (e.g., Encodec) are suboptimal because their tokens lack a strong connection to semantic content, leading to unstable synthesis and requiring very clean training data.

To solve this, the authors propose the Supervised Semantic Speech Tokenizer ($S^3$).

  • Architecture: The $S^3$ tokenizer is built by modifying the pre-trained SenseVoice-Large model. Specifically, they take the first six layers of the SenseVoice-Large encoder and insert a vector quantizer (VQ) module after them. An additional positional embedding is added after quantization to retain temporal information. The architecture is shown below.

    fig 7 This schematic shows the key components of the tokenizer: the ASR decoder, the two encoder stages, and the speech-token processing flow involving vector quantization and positional encoding, illustrating how the input speech $X$ is processed in stages to produce the output $Y$.

  • Training: The entire model (Encoder part 1 + VQ + rest of Encoder + Decoder) is trained end-to-end on the ASR task. This means the VQ module is forced to learn a codebook of discrete tokens that are optimal for reconstructing the textual transcription, not just the audio waveform.

  • Properties:

    • Semantic Richness: Because the training objective is recognition error minimization, the resulting tokens are strongly correlated with semantic units (like phonemes or words) and paralinguistic features (like emotion).
    • Noise Robustness: Supervised training makes the tokenizer less sensitive to noise in the audio compared to unsupervised reconstruction-based methods.
    • Efficiency: The tokenizer uses a single codebook with 4096 entries and produces tokens at a 50 Hz frequency, which is computationally manageable for the subsequent language model.
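A minimal sketch of the $S^3$ tokenizer idea under stated assumptions: toy GRU blocks stand in for the two halves of the encoder, only the 4096-entry codebook and straight-through quantization reflect the description above, and the ASR decoder/loss that would follow is omitted.

```python
# Toy GRU encoder halves; 4096-entry codebook and straight-through VQ as described.
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    def __init__(self, codebook_size=4096, dim=512):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, h):                                    # h: (B, T, dim)
        codes = self.codebook.weight.expand(h.size(0), -1, -1)
        ids = torch.cdist(h, codes).argmin(dim=-1)           # nearest code = discrete token
        q = self.codebook(ids)
        q = h + (q - h).detach()                             # straight-through estimator
        return q, ids

enc_first = nn.GRU(80, 512, batch_first=True)                # stands in for encoder layers 1-6
vq = VectorQuantizer()
enc_rest = nn.GRU(512, 512, batch_first=True)                # remaining layers toward the ASR loss

feats = torch.randn(2, 250, 80)                              # ~5 s of 50 Hz frames
h, _ = enc_first(feats)
q, s3_tokens = vq(h)                                         # discrete tokens reused by CosyVoice
h2, _ = enc_rest(q)                                          # would continue into the ASR decoder
print(s3_tokens.shape)                                       # torch.Size([2, 250])
```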

4.2.3. Voice Generation Model: CosyVoice

CosyVoice is the family of TTS models that use the $S^3$ tokens to generate speech.

The figure below shows a schematic diagram of the CosyVoice architecture.

fig 1 This schematic shows the speech generation pipeline: reference speech, a text tokenizer, an autoregressive Transformer, flow matching, and a HiFTNet vocoder are connected via tokens to form the complete path from text to generated speech.

System Overview

The CosyVoice generation process has three main stages (a code sketch of this data flow follows the list):

  1. Text-to-Token LM: An autoregressive Transformer-based language model predicts the sequence of $S^3$ speech tokens from the input text.
  2. Token-to-Spectrogram Model: A diffusion-style model using flow matching reconstructs a detailed Mel spectrogram from the predicted $S^3$ tokens.
  3. Vocoder: A HiFTNet-based vocoder synthesizes the final audio waveform from the Mel spectrogram.
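The data flow of these three stages can be summarized with placeholder functions; the bodies are not the real models, only the interfaces (text to speech tokens to Mel spectrogram to waveform) mirror the description above, and the speaker-embedding size is an assumption.

```python
# Placeholder pipeline: only the data flow mirrors the description above.
from typing import List
import numpy as np

def text_to_tokens(text: str) -> List[int]:
    """Stage 1: autoregressive Transformer LM predicting speech tokens from text."""
    return [hash(c) % 4096 for c in text]                 # placeholder token ids

def tokens_to_mel(tokens: List[int], spk_emb: np.ndarray) -> np.ndarray:
    """Stage 2: flow-matching model reconstructing an 80-bin Mel spectrogram."""
    return np.zeros((len(tokens) * 2, 80))                # placeholder spectrogram

def mel_to_wave(mel: np.ndarray) -> np.ndarray:
    """Stage 3: HiFTNet-style vocoder synthesizing the waveform."""
    return np.zeros(mel.shape[0] * 256)                   # placeholder audio samples

spk = np.random.randn(192)                                # speaker embedding (size assumed)
wave = mel_to_wave(tokens_to_mel(text_to_tokens("Hello, world."), spk))
```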

Model Training

  • Language Model (LM): The autoregressive LM is trained using a teacher-forcing approach. It receives the tokenized text and the ground-truth speech tokens (shifted by one position) as input and is optimized to predict the next speech token at each step.
  • Flow Matching Model: This model learns the conditional distribution $P(S \mid X, v, S_{\mathrm{ref}})$, where $S$ is the target Mel spectrogram, $X$ is the sequence of $S^3$ tokens, $v$ is a speaker embedding, and $S_{\mathrm{ref}}$ is a reference Mel spectrogram (from a prompt). It uses a convolutional Transformer U-Net to learn the vector field of an optimal-transport ODE, which allows for very efficient inference (5-10 steps). To improve in-context learning, the model uses classifier-free guidance (CFG) and masks a portion of the conditioning features during training (a sketch of the inference-time guidance combination follows below).
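A minimal sketch of how classifier-free guidance is typically combined at inference time for such models; the guidance scale and function names are assumptions, since the paper only states that CFG is used and that conditions are masked during training (which is what makes the unconditional evaluation below possible).

```python
# Typical CFG combination at inference; scale and names are illustrative assumptions.
import torch

def guided_vector_field(net, xt, t, cond, guidance_scale=0.7):
    v_cond = net(xt, t, cond)                   # conditioned on speech tokens / speaker features
    v_uncond = net(xt, t, None)                 # same network with the conditions masked out
    return (1 + guidance_scale) * v_cond - guidance_scale * v_uncond

toy_net = lambda xt, t, cond: torch.zeros_like(xt)          # stand-in network
v = guided_vector_field(toy_net, torch.zeros(1, 80), torch.tensor([0.5]), cond=None)
```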

Zero-shot In-context Learning

CosyVoice can clone a voice from a short audio prompt. The process, shown in the figure below, depends on whether the prompt and target text are in the same language.

fig 2 This schematic shows zero-shot in-context learning (a) and cross-lingual voice cloning (b) in FunAudioLLM, annotating the speaker embedding (Spk Emb), input text, and generated speech tokens, and illustrating how data flows through the model in each task.

  • Same Language (a): The text and $S^3$ tokens from the prompt speech are simply prepended to the target text (see the sketch after this list). The LM then continues generating speech tokens autoregressively from where the prompt left off.
  • Cross-Lingual (b): To prevent the prosody of the prompt's language from "leaking" into the target language, the text and speech tokens from the prompt are omitted from the LM's input. Only the speaker embedding and reference Mel spectrogram from the prompt are used to condition the flow-matching and vocoder stages, ensuring timbre transfer without prosodic interference.
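A sketch of the two prompt-construction strategies with placeholder token lists; it only encodes the ordering described above.

```python
# Placeholder token lists; only the ordering described above is encoded here.
def build_lm_input(target_text_tokens, prompt_text_tokens, prompt_speech_tokens,
                   cross_lingual: bool):
    if cross_lingual:
        # Drop the prompt's text and speech tokens so its prosody does not leak;
        # the prompt still conditions the flow-matching stage via the speaker
        # embedding and reference Mel spectrogram.
        return target_text_tokens
    # Same language: the LM continues generating speech tokens after the prompt's.
    return prompt_text_tokens + target_text_tokens + prompt_speech_tokens

same_lang = build_lm_input([11, 12, 13], [1, 2], [501, 502], cross_lingual=False)
cross_lang = build_lm_input([11, 12, 13], [1, 2], [501, 502], cross_lingual=True)
```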

Instruction Fine-tuning

To create CosyVoice-instruct, the CosyVoice-base model is further fine-tuned on a dataset where inputs are paired with natural language instructions. This enables fine-grained control over:

  • Speaker Identity: Descriptions like "A mysterious, elegant dancer" can create novel voice personas.
  • Speaking Style: Instructions like "A happy girl with high tone and quick speech" can control emotion and prosody.
  • Fine-grained Paralinguistics: Special tags can be inserted into the text to generate non-speech sounds such as [laughter] and [breath], or to add emphasis using <strong>...</strong> (illustrative examples follow below).
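Illustrative instruction/text pairs for the three control types, drawn from the examples above; the synthesize() helper is a hypothetical stand-in, not the actual CosyVoice API.

```python
# Example persona/style descriptions and tags come from the paper's own examples;
# synthesize() is a hypothetical helper, not the actual CosyVoice API.
instruct_examples = [
    # Speaker identity via a natural-language persona description
    ("A mysterious, elegant dancer.", "Welcome to tonight's performance."),
    # Speaking style / emotion
    ("A happy girl with high tone and quick speech.", "I just got the best news ever!"),
    # Fine-grained paralinguistics embedded directly in the text
    (None, "That was <strong>hilarious</strong> [laughter] give me a second [breath] okay."),
]

def synthesize(text, instruction=None):           # hypothetical helper
    print(f"instruction={instruction!r}\ntext={text!r}\n")

for instruction, text in instruct_examples:
    synthesize(text, instruction)
```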

5. Experimental Setup

5.1. Datasets

5.1.1. SenseVoice Training Data

The models were trained on massive datasets.

  • SenseVoice-Small: Trained on ~300,000 hours of audio covering 5 languages (Chinese, Cantonese, English, Japanese, Korean).

  • SenseVoice-Large: Trained on the same data plus an additional 100,000 hours of multilingual data to support over 50 languages.

  • Rich Transcription Data: To train the emotion and audio event detection capabilities, pseudo-labels were generated using existing open-source models. This resulted in a dataset with 150 million audio event entries and 30 million speech emotion entries.

    The distribution of training data hours across languages is shown below (in log scale).

    fig 8 Bar chart of speech recognition training-data volume per language (log scale): Chinese (zh) and English (en) lead with a peak around 76,800 hours, with the amount decreasing across the remaining languages.

5.1.2. CosyVoice Training Data

  • Total Data: ~170,000 hours across 5 languages (Chinese, English, Cantonese, Japanese, Korean).

  • Data Processing: The data was collected and cleaned using in-house tools for speech detection, SNR estimation, and speaker diarization. Pseudo-labels were generated using SenseVoice-Large and Paraformer, followed by refinement with forced alignment models.

  • Instruction Data: The CosyVoice-instruct model was fine-tuned on a smaller dataset with specific instruction types: 101 hours for speaker identity, 407 hours for speaking style, and 48 hours for fine-grained paralinguistics.

    The following are the data statistics from Table 4 and Table 5 of the original paper:

Table 4: Hours of CosyVoice training data across languages.

Language Duration (hr)
ZH 130,000
EN 30,000
Yue 5,000
JP 4,600
KO 2,200

Table 5: Duration statistics of instruction training data by type.

Type Duration (hr)
Speaker Identity 101
Speaking Style 407
Fine-grained Paralinguistics 48

5.1.3. Evaluation Datasets

A wide range of standard benchmarks were used to evaluate the models:

  • ASR: AISHELL-1/2, WenetSpeech (Chinese), LibriSpeech (English), and Common Voice (multilingual).
  • SER: CREMA-D, MELD, IEMOCAP, MSP-Podcast, CASIA, MER2023, and ESD (covering English and Chinese).
  • AED: ESC-50, Coswara, and custom in-house datasets.
  • TTS Quality: LibriTTS (English) and AISHELL-3 (Chinese).

5.2. Evaluation Metrics

For every evaluation metric mentioned in the paper, the following provides a complete explanation.

5.2.1. Word Error Rate (WER) / Character Error Rate (CER)

  • Conceptual Definition: WER and CER are the standard metrics for evaluating ASR accuracy. They measure the number of errors in the transcribed text compared to a ground-truth reference. WER is used for space-separated languages like English, while CER is used for character-based languages like Chinese. A lower value is better.
  • Mathematical Formula: $ \text{WER} = \frac{S + D + I}{N} $ The formula for CER is identical, but operates on characters instead of words.
  • Symbol Explanation:
    • $S$: The number of substitutions (words/characters that were incorrectly transcribed).
    • $D$: The number of deletions (words/characters that were missed in the transcription).
    • $I$: The number of insertions (words/characters that were added to the transcription but were not in the reference).
    • $N$: The total number of words/characters in the reference transcript.
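A minimal WER implementation via edit distance, matching the formula above; CER is obtained by iterating over characters instead of words.

```python
# Minimal WER via edit distance: (S + D + I) / N over words.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                                    # all deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                                    # all insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[-1][-1] / len(ref)

print(wer("the cat sat on the mat", "the cat sit on mat"))   # 2 errors / 6 words ≈ 0.33
```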

5.2.2. Real-Time Factor (RTF)

  • Conceptual Definition: RTF measures the processing speed of a speech model. It is the ratio of the time taken to process an audio file to the duration of the audio file itself. An RTF less than 1 means the processing is faster than real-time. A lower value is better.
  • Mathematical Formula: $ \text{RTF} = \frac{\text{Processing Time}}{\text{Audio Duration}} $
  • Symbol Explanation:
    • Processing Time: The wall-clock time taken by the model to generate the transcription.
    • Audio Duration: The length of the input audio clip in the same time units.

5.2.3. Accuracy (UA / WA)

  • Conceptual Definition: Used for classification tasks like SER.
    • Weighted Average Accuracy (WA): The overall percentage of correctly classified samples. It can be misleading if class distribution is imbalanced.
    • Unweighted Average Accuracy (UA): The average of the accuracies for each class. It gives equal weight to each class, regardless of how many samples it has, making it a better metric for imbalanced datasets.
  • Mathematical Formula: $ \text{WA} = \frac{\sum_{i=1}^{C} \text{TP}_i}{\sum_{i=1}^{C} N_i} $ $ \text{UA} = \frac{1}{C} \sum_{i=1}^{C} \frac{\text{TP}_i}{N_i} $
  • Symbol Explanation:
    • $C$: The number of classes.
    • $\text{TP}_i$: The number of true positives for class $i$.
    • $N_i$: The total number of samples in class $i$.

5.2.4. F1 Score

  • Conceptual Definition: The harmonic mean of precision and recall, providing a single score that balances both. It is particularly useful for imbalanced classification tasks. Macro F1 calculates the F1 for each class and takes the average, while Weighted F1 calculates a weighted average based on the number of samples in each class.
  • Mathematical Formula: $ \text{F1} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} $ where Precision $= \frac{\text{TP}}{\text{TP} + \text{FP}}$ and Recall $= \frac{\text{TP}}{\text{TP} + \text{FN}}$.
  • Symbol Explanation:
    • $\text{TP}$: True Positives.
    • $\text{FP}$: False Positives.
    • $\text{FN}$: False Negatives.
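A minimal sketch computing WA, UA, and macro-F1 from reference and predicted labels, following the definitions in 5.2.3 and 5.2.4; no external metric library is assumed.

```python
# WA, UA, and macro-F1 computed directly from labels per the definitions above.
from collections import Counter

def classification_metrics(y_true, y_pred):
    classes = sorted(set(y_true))
    tp = Counter(t for t, p in zip(y_true, y_pred) if t == p)   # true positives per class
    n = Counter(y_true)                                         # reference samples per class
    pred_n = Counter(y_pred)                                    # predictions per class
    wa = sum(tp.values()) / len(y_true)                         # overall accuracy
    ua = sum(tp[c] / n[c] for c in classes) / len(classes)      # class-balanced accuracy
    f1s = []
    for c in classes:
        prec = tp[c] / pred_n[c] if pred_n[c] else 0.0
        rec = tp[c] / n[c]
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return wa, ua, sum(f1s) / len(f1s)                          # WA, UA, macro-F1

print(classification_metrics(["happy", "sad", "sad"], ["happy", "sad", "happy"]))
```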

5.2.5. Speaker Similarity (SS)

  • Conceptual Definition: Measures how closely the timbre of a synthesized voice matches a target speaker's voice. It is typically calculated as the cosine similarity between speaker embeddings. A higher value is better.
  • Mathematical Formula: $ \text{SS} = \text{CosineSimilarity}(\mathbf{e}_{\text{gen}}, \mathbf{e}_{\text{prompt}}) = \frac{\mathbf{e}_{\text{gen}} \cdot \mathbf{e}_{\text{prompt}}}{\|\mathbf{e}_{\text{gen}}\| \, \|\mathbf{e}_{\text{prompt}}\|} $
  • Symbol Explanation:
    • $\mathbf{e}_{\text{gen}}$: The speaker embedding vector extracted from the generated speech.
    • $\mathbf{e}_{\text{prompt}}$: The speaker embedding vector extracted from the prompt/reference speech.
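A one-line implementation of this metric; any speaker-verification model could supply the two embedding vectors.

```python
# Speaker similarity as cosine similarity between two embedding vectors.
import numpy as np

def speaker_similarity(e_gen: np.ndarray, e_prompt: np.ndarray) -> float:
    return float(e_gen @ e_prompt / (np.linalg.norm(e_gen) * np.linalg.norm(e_prompt)))

print(speaker_similarity(np.array([1.0, 0.0]), np.array([1.0, 1.0])))   # ≈ 0.707
```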

5.3. Baselines

The paper compares its models against a strong set of representative baselines:

  • For ASR (SenseVoice):
    • Whisper-Small and Whisper-Large-V3: The state-of-the-art open-source ASR models from OpenAI.
    • Paraformer-zh: A high-performance non-autoregressive model for Chinese ASR.
  • For SER (SenseVoice):
    • EmoBox, Emo-Superb, MerBench: Recent SER benchmark results from literature.
    • XLSR-SER, Qwen-Audio, SALMONN: Open-source SER models or Audio-LLMs with SER capabilities.
  • For TTS (CosyVoice):
    • ChatTTS: A recent open-source TTS model focused on conversational speech.
    • ChatGPTS: Likely a typo for ChatTTS or referring to a generic GPT-based TTS model. Given the context, it's probably ChatTTS.

6. Results & Analysis

6.1. Core Results Analysis

6.1.1. Multilingual Speech Recognition (SenseVoice)

The following are the results from Table 6 of the original paper, comparing ASR performance on Chinese and English benchmarks.

Test set Whisper-S Whisper-LV3 SenseVoice-S SenseVoice-L Paraformer-zh
AISHELL-1 test 10.04 5.14 2.96 2.09 1.95
AISHELL-2 test_ios 8.78 4.96 3.80 3.04 2.85
WenetSpeech test_meeting 25.62 18.87 7.44 6.73 6.97
WenetSpeech test_net 16.66 10.48 7.84 6.01 6.74
LibriSpeech test_clean 3.13 1.82 3.15 2.57 -
LibriSpeech test_other 7.37 3.50 7.18 4.28 -
CommonVoice zh-CN 19.60 12.55 10.78 7.68 10.30
CommonVoice en 14.85 9.39 14.71 9.00 -
CommonVoice yue 38.97 10.41 7.09 6.78 -
CommonVoice ja 19.51 10.34 11.96 9.19 -
CommonVoice ko 10.48 5.59 8.28 5.21 -
CommonVoice 5 lang. Average 20.68 9.66 10.56 7.57 -

Analysis:

  • Dominance in Chinese: SenseVoice-S and SenseVoice-L significantly outperform their Whisper counterparts on all Chinese test sets (AISHELL, WenetSpeech, CommonVoice zh-CN). SenseVoice-L even achieves results competitive with or better than Paraformer, a model specialized for Chinese. This suggests the training data and architecture are highly optimized for these languages.

  • Cantonese Performance: The most dramatic improvement is on Cantonese (yue), where SenseVoice-L (6.78% CER) and SenseVoice-S (7.09% CER) are vastly superior to Whisper-L-V3 (10.41% CER) and especially Whisper-S (38.97% CER).

  • English Performance: On LibriSpeech, Whisper models still hold an edge. This is expected, as Whisper was extensively trained on English data. However, on CommonVoice en, SenseVoice-L is slightly better than Whisper-L-V3.

  • Overall: SenseVoice-L achieves the best average performance across the 5 Common Voice languages.

    The following are the results from Table 7, comparing inference efficiency.

    Model Framework Parameters Supported Languages RTF Latency for 10s Audio (ms)
    Whisper-S Autoregressive 224M 50+ 0.042 518
    Whisper-L-V3 Autoregressive 1550M 50+ 0.111 1281
    Paraformer-zh Non-autoregressive 220M zh 0.009 100
    SenseVoice-S Non-autoregressive 234M zh, yue, en, ja, ko 0.007 70
    SenseVoice-L Autoregressive 1587M 50+ 0.110 1623

Analysis:

  • Speed of SenseVoice-S: The non-autoregressive architecture of SenseVoice-S provides a massive speed advantage. With a latency of only 70ms for a 10s audio clip (RTF 0.007), it is 7.4 times faster than Whisper-S (518ms) and 18.3 times faster than Whisper-L-V3 (1281ms). This makes it highly suitable for real-time interactive applications.
  • SenseVoice-L Efficiency: SenseVoice-L has comparable latency to Whisper-L-V3, which is expected as they are both large autoregressive models.

6.1.2. Speech Emotion Recognition (SenseVoice)

The following are the results from Table 8 of the original paper.

Test set EmoBox (UA WA F1) Emo-Superb (WF1) MerBench (UA WA) SenseVoice-L (UA WA F1 WF1) SenseVoice-S (UA WA)
CASIA 59.6 59.6 56.3 - - - 96.0 96.0 95.5 95.5 70.0 70.0
CREMA-D 76.8 76.5 76.6 67.7 - - 90.1 90.4 89.8 89.9 73.1 74.0
ESD 84.6 84.6 84.3 - - - 93.2 93.2 92.2 92.2 85.5 81.0
IEMOCAP 73.5 72.9 73.1 - - - 73.9 75.3 73.2 72.8 70.5 67.9
MELD 31.5 51.9 32.9 - - - 58.7 63.1 50.9 65.7 50.8 57.8
MER2023 61.2 65.2 62.3 - - - 70.9 69.2 55.6 57.4 69.0 68.3
MSPP 21.4 43.4 21.5 38.4 - - 46.0 61.7 45.0 58.9 49.4 64.1

Analysis:

  • SenseVoice demonstrates strong zero-shot SER performance without any fine-tuning on the target datasets.
  • SenseVoice-L is highly competitive with specialized SER benchmarks like MerBench on several datasets (e.g., CREMA-D, ESD). On MELD, MER2023, and MSPP, its performance is strong but slightly lower than the specialized models, which is expected.
  • SenseVoice-S also shows respectable performance, often outperforming EmoBox, proving that its speed does not come at a complete loss of its rich understanding capabilities.
  • The bar chart in Figure 8 further confirms that SenseVoice-Large generally achieves the best or second-best performance across all tested datasets compared to other open-source models, establishing it as a state-of-the-art SER model.

6.1.3. Preserving Semantic Information by the $S^3$ Tokenizer

The following are the results from Table 9, which evaluates the ASR performance of the model when using the discretized $S^3$ tokens, to check how much semantic information is preserved after quantization.

Test set Whisper-L-V3 SenseVoice-L $S^3$ tokens
w/o lid w/ lid w/o lid w/ lid w/o lid w/ lid
common_voice_zh-CN 12.82 12.55 8.76 8.68 12.24 12.06
common_voice_en 13.55 9.39 9.79 9.77 15.43 15.38

Analysis:

  • Semantic Preservation: The key finding is that the performance drop from using the quantized $S^3$ tokens is relatively small compared to the original SenseVoice-L continuous features, especially for Chinese. On common_voice_zh-CN, the $S^3$ tokens even lead to better performance than Whisper-Large-V3. This empirically validates the paper's claim that the supervised tokenizer effectively preserves semantic information.
  • Performance Degradation: There is a noticeable degradation on English compared to the original SenseVoice-L, but the performance is still in a reasonable range. This indicates that the single-codebook VQ with 4096 entries is a decent trade-off between compression and information preservation.

6.1.4. Generation Quality of CosyVoice

The following are the results from Table 10 (English) and Table 11 (Chinese), evaluating the generated speech quality.

Table 10: English (LibriTTS test-clean)

Model WER(%) #Ins.&Del. SS
Original 2.66 92 69.67
ChatTTS 8.32 441 -
CosyVoice 2.89±0.18 88.60±3.88 74.30±0.15
+5x re-ranking 1.51 47 74.30

Table 11: Chinese (AISHELL-3 test)

Model CER (%) #Ins.&Del. SS
Original 2.52 25 74.15
CosyVoice 3.82±0.24 24.4±2.24 81.58±0.16
+ 5× re-ranking 1.84 11 81.58
ChatTTS 3.87 111 -

Analysis:

  • Content Consistency: CosyVoice achieves WER/CER scores very close to the original human-uttered speech, indicating high content fidelity. It significantly outperforms ChatTTS, which has much higher error rates and insertion/deletion counts. The paper notes ChatTTS suffers from "speaker leaking," which CosyVoice avoids.
  • Speaker Similarity: CosyVoice achieves a higher speaker similarity (SS) score than the original data itself. This might seem counterintuitive, but it suggests the model is very effective at capturing and reproducing a consistent speaker timbre, potentially "cleaning up" natural variations present in the original dataset.
  • ASR Re-ranking: Using an ASR model to re-rank multiple generated samples is highly effective, reducing the error rate to a remarkably low 1.51% WER (English) and 1.84% CER (Chinese), demonstrating its potential for high-quality offline synthesis.
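A sketch of the re-ranking idea: synthesize several candidates for the same text, transcribe each, and keep the one whose transcript is closest to the input. The tts() and asr() callables are hypothetical stand-ins, and wer() is the metric from Section 5.2.1; only the selection logic reflects the procedure described above.

```python
# Hypothetical tts()/asr() callables; wer() is the metric from Section 5.2.1.
def rerank_synthesis(text, tts, asr, wer, n_candidates=5):
    candidates = [tts(text) for _ in range(n_candidates)]        # e.g. 5x sampling
    return min(candidates, key=lambda audio: wer(text, asr(audio)))
```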

6.1.5. Emotion Controllability of CosyVoice

The following are the results from Table 12, evaluating the accuracy of emotion control.

Model Happy Sad Angry Surprised Fearful Disgusted
CosyVoice-base 1.00±0.00 0.45±0.05 0.59±0.03 0.26±0.02 0.88±0.01 0.46±0.06
CosyVoice-instruct 1.00±0.00 0.98±0.02 0.83±0.04 0.64±0.03 0.87±0.03 0.93±0.02
w/o instruction 0.98±0.01 0.77±0.04 0.49±0.12 0.28±0.06 0.83±0.04 0.45±0.16

Analysis:

  • The results clearly show the effectiveness of instruction fine-tuning. CosyVoice-instruct (with instructions) achieves significantly higher accuracy in generating the target emotion compared to CosyVoice-base or even CosyVoice-instruct without an explicit instruction.
  • For emotions like 'Sad', 'Angry', 'Surprised', and 'Disgusted', the improvement is dramatic. For example, sadness accuracy jumps from 45% to 98%.
  • This confirms that the model successfully learns to associate natural language commands with specific expressive speech styles.

7. Conclusion & Reflections

7.1. Conclusion Summary

The paper successfully introduces FunAudioLLM, a powerful and comprehensive open-source framework for building natural voice interfaces for LLMs. It achieves this through two highly capable and specialized foundation models:

  • SenseVoice provides both high-speed (Small) and high-accuracy (Large) multilingual speech understanding, going beyond simple transcription to include emotion and audio event detection.

  • CosyVoice delivers state-of-the-art, controllable speech generation, enabled by its novel supervised semantic tokenizer ($S^3$). It excels at zero-shot voice cloning and instruction-based style control.

    By open-sourcing these models and their associated code, the Tongyi SpeechTeam has made a significant contribution to the community, democratizing access to technology that is often kept proprietary. The demonstrated applications highlight the practical value of this modular framework for creating rich, next-generation voice experiences.

7.2. Limitations & Future Work

The authors candidly acknowledge several limitations:

  • Under-resourced Languages: ASR performance for languages with less data is still significantly lower than for high-resource languages.
  • No Streaming Support: The current SenseVoice models are not designed for real-time streaming transcription, which is crucial for many conversational AI applications. Future work will focus on developing streamable versions.
  • CosyVoice Limitations: The model supports a limited number of languages, cannot infer emotion from text content alone (it requires an explicit instruction), performs poorly at singing, and can still improve the trade-off between maintaining timbre and expressing strong emotions.
  • Pipeline Approach: The framework is a pipeline (SenseVoice -> LLM -> CosyVoice), which can lead to error propagation (e.g., an ASR error affecting the LLM's response). The components are not trained end-to-end with the LLM.

7.3. Personal Insights & Critique

This paper is an excellent piece of engineering and a valuable contribution to the open-source community.

Strengths and Inspirations:

  • Pragmatic and Modular Design: The decision to build specialized models for understanding and generation is a smart, pragmatic approach. It allows for optimization targeted at specific use cases (latency vs. accuracy) and provides greater flexibility for developers to mix and match components, compared to a monolithic end-to-end model.
  • The $S^3$ Tokenizer: The idea of a supervised semantic tokenizer is a key insight. It elegantly solves the problem of bridging the text and audio domains. By tying tokenization to the ASR task, the model learns a representation that is inherently meaningful, which likely simplifies the subsequent TTS task significantly. This is a powerful concept that could be applied to other audio generation tasks.
  • Commitment to Open Source: In an era where many foundational models are locked behind APIs, the comprehensive open-sourcing of models, code, and training details is highly commendable and will undoubtedly accelerate research and innovation.

Potential Issues and Areas for Improvement:

  • End-to-End Training: As the authors note, the pipeline nature is a potential weakness. A future research direction could be to explore methods for joint fine-tuning of the entire SenseVoice + LLM + CosyVoice stack. This could allow the models to adapt to each other and reduce cascading errors, potentially leading to more seamless and robust interactions.
  • Richer Paralinguistics: While the inclusion of laughter and breath is a good step, natural human speech is filled with a much wider array of fillers, pauses, and prosodic nuances. Expanding the instruction-following capabilities to control these subtle aspects would be a valuable next step toward truly human-like conversation.
  • Evaluation of "Naturalness": The paper relies on objective metrics like WER and SS. While useful, they don't fully capture the subjective "naturalness" or "appropriateness" of a conversation. Human evaluation (e.g., Mean Opinion Score - MOS) would be necessary to provide a complete picture of the user experience, especially for applications like Emotional Voice Chat.
  • Scalability to More Languages: The framework's utility is currently limited by the language coverage of CosyVoice (5 languages). Expanding this to the 50+ languages supported by SenseVoice-Large would be a major undertaking but would greatly increase the global impact of the project.

No similar papers found yet.