Fun-ASR Technical Report

Published: 09/16/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

Fun-ASR combines large-scale data, large model capacity, and deep LLM integration, refined through reinforcement learning for real-world challenges. It achieves state-of-the-art performance on industrial datasets, demonstrating effectiveness in streaming, noise robustness, and code-switching.

Abstract

In recent years, automatic speech recognition (ASR) has witnessed transformative advancements driven by three complementary paradigms: data scaling, model size scaling, and deep integration with large language models (LLMs). However, LLMs are prone to hallucination, which can significantly degrade user experience in real-world ASR applications. In this paper, we present Fun-ASR, a large-scale, LLM-based ASR system that synergistically combines massive data, large model capacity, LLM integration, and reinforcement learning to achieve state-of-the-art performance across diverse and complex speech recognition scenarios. Moreover, Fun-ASR is specifically optimized for practical deployment, with enhancements in streaming capability, noise robustness, code-switching, hotword customization, and satisfying other real-world application requirements. Experimental results show that while most LLM-based ASR systems achieve strong performance on open-source benchmarks, they often underperform on real industry evaluation sets. Thanks to production-oriented optimizations, Fun-ASR achieves state-of-the-art performance on real application datasets, demonstrating its effectiveness and robustness in practical settings.

In-depth Reading

1. Bibliographic Information

1.1. Title

Fun-ASR Technical Report

The title clearly indicates that this paper is a technical report detailing a system named Fun-ASR. The focus is on the technical implementation and performance of this specific Automatic Speech Recognition (ASR) system.

1.2. Authors

The authors are listed as the "Tongyi Fun Team" from Alibaba Group, followed by a long list of individual contributors. This authorship model is common for large-scale industry projects where a team, rather than a few individuals, is responsible for the research and development. The affiliation with Alibaba Group, a major technology conglomerate, suggests that the research is well-resourced and oriented towards practical, commercial applications.

1.3. Journal/Conference

The paper is available as a preprint on arXiv, a repository for electronic preprints of scientific papers. It has not been subjected to formal peer review. Technical reports like this are common for major industry labs (like Google, Meta, and in this case, Alibaba) to announce and detail significant engineering achievements or new large-scale models, often before or in parallel with a peer-reviewed publication.

1.4. Publication Year

The metadata indicates a publication date of September 15, 2025. This is a future date and is likely a placeholder or an error in the provided information. However, the content and references (some pointing to 2025) suggest the work is very recent, likely conducted in late 2024 or early 2025.

1.5. Abstract

The abstract introduces Fun-ASR, a large-scale Automatic Speech Recognition (ASR) system based on Large Language Models (LLMs). The authors state that recent ASR advancements are driven by three paradigms: scaling up data, scaling up model size, and integrating with LLMs. However, they identify a key problem: LLM-based ASR systems are prone to hallucination (generating text not present in the audio), which degrades user experience. Fun-ASR is presented as a solution that combines these three paradigms with reinforcement learning to mitigate hallucination and achieve state-of-the-art (SOTA) performance. The system is specifically optimized for production environments, featuring enhancements for streaming, noise robustness, code-switching, and hotword customization. A key finding highlighted is that while many LLM-based ASR systems perform well on open-source benchmarks, they falter on real-world industry evaluation sets, whereas Fun-ASR excels in these practical settings.

2. Executive Summary

2.1. Background & Motivation

The core problem this paper addresses is the gap between the high performance of modern ASR systems on academic benchmarks and their often-subpar performance in real-world, practical applications. While the field of ASR has seen massive improvements by increasing training data, making models larger, and integrating them with powerful LLMs, these advancements have introduced new challenges.

The most significant challenge highlighted is LLM hallucination, where the ASR system generates text that is plausible but not actually spoken in the audio. This is especially common in noisy environments or during silent periods and severely undermines the reliability of ASR systems in production.

Existing state-of-the-art models, while achieving impressive scores on clean, well-curated open-source datasets (like Librispeech or AIShell), often fail to handle the complexities of real-world audio, which includes background noise, code-switching (mixing languages), and the need for domain-specific terminology (hotwords). The paper's motivation is to build an ASR system that not only achieves SOTA accuracy but is also robust, reliable, and optimized for these practical deployment scenarios, directly tackling the hallucination problem and bridging the research-to-production gap.

2.2. Main Contributions / Findings

The paper presents Fun-ASR, a comprehensive ASR system, and makes the following main contributions:

  1. A Large-Scale, LLM-Integrated ASR System: The paper details the architecture and training of Fun-ASR, which comes in two sizes: a 7.7-billion parameter model for maximum accuracy and a 0.8-billion parameter nano version for efficiency. The system synergistically combines a powerful audio encoder with a large language model decoder.

  2. A Sophisticated Multi-Stage Training Paradigm: Fun-ASR is not trained monolithically. It undergoes a meticulous training process involving:

    • Encoder Pre-training: A two-stage process using both self-supervised (Best-RQ) and supervised (AED) learning.
    • Supervised Fine-tuning (SFT): A five-stage process that carefully unfreezes and trains different components (adaptor, encoder, LLM) to align the audio and text modalities without catastrophic forgetting.
    • Reinforcement Learning (RL): A final fine-tuning stage using a custom RL framework (FunRL) and a tailored reward function to specifically address issues like hallucination, keyword errors, and language mismatches.
  3. Comprehensive Production-Oriented Optimizations: Beyond raw accuracy, Fun-ASR is engineered for real-world use cases. The paper details specific optimizations for:

    • Streaming ASR: Enabling real-time transcription with low latency.
    • Noise Robustness: Using extensive data augmentation to perform well in noisy environments.
    • Code-Switching: Handling mixed-language (Chinese-English) speech.
    • Hotword Customization: A RAG-based mechanism to improve recognition of user-defined terms.
    • Hallucination Mitigation: Specific data augmentation and RL strategies to reduce spurious text generation.
  4. Demonstrated Superiority on Real-World Data: The key finding is that Fun-ASR significantly outperforms other leading open-source and commercial ASR systems on challenging, in-house industry evaluation sets. This result substantiates the authors' claim that performance on standard benchmarks is not a reliable indicator of real-world capability and highlights the effectiveness of their production-focused optimizations.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand this paper, a novice reader should be familiar with the following concepts:

  • Automatic Speech Recognition (ASR): The task of converting spoken language into written text. Modern ASR systems use deep learning to learn the complex mapping from audio waveforms to sequences of characters or words.
  • Large Language Models (LLMs): These are massive neural networks (e.g., GPT-3, Llama) trained on vast amounts of text data. They excel at understanding context, grammar, and semantic meaning, making them powerful for generating coherent and contextually appropriate text. In ASR, they are used as decoders to improve the linguistic quality of the transcription.
  • Encoder-Decoder Architecture: A common framework for sequence-to-sequence tasks.
    • The Encoder processes the input sequence (e.g., audio) and compresses it into a high-level feature representation (a set of vectors).
    • The Decoder takes this representation and generates the output sequence (e.g., text) one element at a time, often using information about previously generated elements.
  • Attention Mechanism: A mechanism that allows the decoder to selectively focus on different parts of the encoder's output at each step of the generation process. This is crucial for handling long sequences. The standard Scaled Dot-Product Attention formula is: $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $ (a minimal NumPy sketch appears after this list).
    • $Q$ (Query): The representation of the current element being processed by the decoder.
    • $K$ (Key): The representations of all elements in the input sequence from the encoder.
    • $V$ (Value): The content representations of the input sequence elements.
    • $d_k$: The dimension of the key vectors. The scaling factor $\sqrt{d_k}$ prevents the dot products from becoming too large. The formula computes a weighted sum of the values ($V$), where the weights are determined by the similarity between the current query ($Q$) and all keys ($K$).
  • Connectionist Temporal Classification (CTC): A loss function used in ASR that solves the problem of not knowing the precise alignment between audio frames and output characters. It allows the model to predict a sequence of labels (including a special "blank" token) for each audio frame. It then sums the probabilities of all possible alignments that collapse to the correct target transcription, enabling end-to-end training without needing frame-level labels.
  • Reinforcement Learning (RL): A machine learning paradigm where an "agent" learns to make decisions by interacting with an "environment." The agent receives "rewards" or "penalties" for its actions and learns a "policy" (a strategy) to maximize its cumulative reward. In this paper, RL is used to fine-tune the ASR model, where the "actions" are generating transcriptions and the "rewards" are based on accuracy, keyword correctness, and absence of hallucination.
  • Low-Rank Adaptation (LoRA): A parameter-efficient fine-tuning (PEFT) method. Instead of fine-tuning all the weights of a large pre-trained model (which is computationally expensive and can lead to catastrophic forgetting), LoRA freezes the original weights and injects small, trainable "rank decomposition matrices" into the layers of the model. This drastically reduces the number of trainable parameters while still allowing the model to adapt to new tasks.
  • Retrieval-Augmented Generation (RAG): A technique that enhances generative models by allowing them to retrieve information from an external knowledge base before generating a response. In this paper, RAG is used for hotword customization, where the system retrieves relevant hotwords from a vocabulary to guide the final transcription.
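
As referenced in the attention item above, here is a minimal NumPy sketch of scaled dot-product attention; the shapes and random inputs are purely illustrative.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for a single attention head."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                        # similarity of each query to every key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)         # row-wise softmax
    return weights @ V                                     # weighted sum of the values

# Toy example: 2 decoder queries attending over 3 encoder frames, d_k = 4.
rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
print(scaled_dot_product_attention(Q, K, V).shape)         # (2, 4)
```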

3.2. Previous Works

The paper builds upon several key advancements in the ASR field:

  • Whisper (Radford et al., 2023): This work from OpenAI demonstrated the incredible effectiveness of data scaling. By training a large encoder-decoder model on 680,000 hours of weakly supervised multilingual audio data from the internet, Whisper achieved human-level robustness and accuracy across a wide range of languages and conditions. It set a new standard for ASR and proved that massive, diverse data can overcome the need for meticulously transcribed corpora.
  • LLM-Integrated ASR Models:
    • Seed-ASR (Bai et al., 2024): This model represents the trend of deeply integrating ASR with LLMs. It leverages the powerful contextual understanding and world knowledge of LLMs to improve transcription accuracy, especially in cases of semantic ambiguity. Fun-ASR is conceptually similar but focuses more on production readiness.
    • FireRedASR (Xu et al., 2025): Another model in the same paradigm, aiming to bridge the gap between speech and text understanding by using an LLM. These models serve as key baselines for Fun-ASR.
  • Self-Supervised Learning for Speech:
    • Best-RQ (Chiu et al., 2022): This is a framework for self-supervised pre-training on speech audio. It works by quantizing audio features into discrete units, masking some of these units, and training the model to predict the masked units. This allows the model to learn powerful acoustic representations from unlabeled audio data, a key step in the Fun-ASR training pipeline.
  • Non-Autoregressive ASR:
    • Paraformer-v2 (An et al., 2024b): This model, also from Alibaba, is a non-autoregressive transformer. Unlike traditional decoders that generate text token by token, non-autoregressive models predict the entire sequence in parallel, leading to much faster inference. It serves as a strong baseline representing an alternative architectural choice.

3.3. Technological Evolution

The field of ASR has evolved through several stages:

  1. Traditional Hybrid Models: Early systems combined Gaussian Mixture Models (GMMs) with Hidden Markov Models (HMMs), later replaced by Deep Neural Networks (DNN-HMMs). These required separate acoustic, pronunciation, and language models.

  2. End-to-End (E2E) Models: The rise of deep learning led to E2E systems like those based on CTC or attention-based encoder-decoder architectures. These models learn a direct mapping from audio to text in a single neural network.

  3. Large-Scale Pre-training: Inspired by NLP, models like Whisper showed that pre-training on massive, diverse, and weakly labeled datasets leads to unprecedented robustness and generalization.

  4. LLM Integration: The current paradigm, represented by models like Seed-ASR and now Fun-ASR, involves replacing or augmenting the ASR decoder with a pre-trained LLM. This leverages the LLM's vast linguistic knowledge to produce more accurate, coherent, and contextually aware transcriptions.

    Fun-ASR sits firmly in the fourth stage but distinguishes itself by adding a heavy focus on production-readiness and targeted refinement through RL, tackling practical issues that are often overlooked in purely academic research.

3.4. Differentiation Analysis

Compared to its predecessors and contemporaries, Fun-ASR's innovations are:

  • Holistic, Multi-Stage Training: While others also use SFT or pre-training, Fun-ASR employs a highly structured, multi-stage pipeline that systematically warms up and aligns different components before full joint training. The use of a text LLM to initialize the audio encoder is a particularly novel cross-modal strategy.
  • Targeted Reinforcement Learning: The FunRL framework is specifically designed for Large Audio-Language Models. More importantly, the reward function is not just about improving Word Error Rate (WER). It is a multi-objective function that explicitly penalizes hallucination and rewards keyword accuracy, directly targeting the main pain points of LLM-based ASR in production.
  • Engineering for Production: The primary differentiator is the explicit focus on a suite of practical features. While other models might have some of these, Fun-ASR presents them as a core part of its design, including streaming, noise robustness, code-switching, and RAG-based hotword customization.
  • Emphasis on Real-World Evaluation: The paper's critical stance on open-source benchmarks and its focus on performance on private, real-world industry datasets is a key part of its narrative, differentiating it from research that may be over-optimized for academic leaderboards.

4. Methodology

4.1. Principles

The core principle of Fun-ASR is to create a highly accurate and robust ASR system by synergistically combining three elements: a powerful audio encoder to understand acoustics, a massive Large Language Model (LLM) to understand language, and a targeted Reinforcement Learning (RL) stage to refine the model for real-world complexities. The methodology is designed to bridge the gap between the audio and text modalities effectively and to explicitly solve practical problems like hallucination, noise, and domain-specific language that plague existing systems.

4.2. Core Methodology In-depth

The overall architecture of Fun-ASR is depicted in Figure 2 from the paper.

fig 2 Schematic of the Fun-ASR system's components and data flow, covering the audio context, audio adaptor, audio encoder, and CTC decoder, and highlighting the integration of user hotwords and the model's predicted-context portion.

The system consists of four main components:

  1. Audio Encoder: A multi-layer transformer encoder that processes the input speech and converts it into a sequence of high-level acoustic representations.

  2. Audio Adaptor: A shallow (2-layer) transformer encoder that acts as a bridge, projecting the audio representations from the audio encoder's space into a space that the LLM can understand.

  3. CTC Decoder: A simple decoder built on top of the audio encoder. It provides a quick, initial transcription via greedy search. This hypothesis is used for the RAG-based hotword customization feature.

  4. LLM-based Decoder: This is the main decoder. It takes the adapted audio representations and, optionally, context from the CTC prediction (e.g., retrieved hotwords), to generate the final, high-quality transcription.

    The paper proposes two model sizes to cater to different computational budgets:

  • Fun-ASR: A 7.7B parameter model (0.7B audio encoder + 7B LLM) for SOTA accuracy.

  • Fun-ASR-nano: A 0.8B parameter model (0.2B audio encoder + 0.6B LLM) for efficiency.

    The training process is highly structured and divided into several major phases.
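
Before walking through the training phases, the sketch below shows one plausible way these four components could be wired together; the class, module, and argument names are hypothetical and not taken from the paper.

```python
import torch.nn as nn

class FunASRLikeModel(nn.Module):
    """Toy wiring of the four components described above; interfaces are illustrative."""
    def __init__(self, encoder, adaptor, llm_decoder, ctc_head):
        super().__init__()
        self.encoder = encoder          # multi-layer transformer audio encoder
        self.adaptor = adaptor          # shallow transformer bridging audio -> LLM space
        self.llm_decoder = llm_decoder  # main LLM-based decoder
        self.ctc_head = ctc_head        # fast first-pass decoder used for hotword RAG

    def forward(self, speech, hotword_context=None):
        acoustic = self.encoder(speech)        # frame-level acoustic representations
        ctc_logits = self.ctc_head(acoustic)   # greedy first-pass hypothesis source
        audio_embeds = self.adaptor(acoustic)  # project into the LLM embedding space
        # The LLM decodes conditioned on the audio embeddings plus optional hotword context.
        text = self.llm_decoder(audio_embeds, hotword_context)
        return ctc_logits, text
```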

4.2.1. Phase 1: Audio Encoder Pre-training

The goal is to create a powerful audio encoder before integrating it with the LLM. This is done in two stages, as shown in Figure 3 of the paper (note: the image file is mislabeled as fig 4 in the markdown).

fig 4 Schematic of the training pipeline in the Fun-ASR system: a pre-trained text LLM with its causal mask removed is used for initialization, followed by Best-RQ-based self-supervised training and then AED-based supervised training, yielding the audio encoder used for FunAudio-ASR.

  • Stage 1: Self-supervised pre-training. The Best-RQ framework is used for self-supervised learning on large amounts of unlabeled audio data. A key innovation here is the initialization: instead of random weights, the encoder is initialized with weights from a pre-trained text LLM (Qwen3). This cross-modal initialization provides a strong linguistic and semantic inductive bias, accelerating convergence and improving representation quality.
  • Stage 2: Supervised pre-training. The encoder is then trained within a standard attention-based encoder-decoder (AED) framework on large-scale labeled ASR data, following the methodology of SenseVoice-Large. This stage fine-tunes the encoder to learn rich acoustic and phonetic features from transcribed speech. The resulting trained encoder provides a strong starting point for the main Fun-ASR model.

4.2.2. Phase 2: Supervised Fine-tuning (SFT)

This phase integrates the pre-trained audio encoder and LLM. It consists of five sequential stages designed to gradually align the components and adapt them to the ASR task.

  • Stage 1: Adaptor Training. The pre-trained audio encoder and LLM are frozen. Only the audio adaptor module is trained. The goal is to learn a mapping from the audio encoder's output space to the LLM's input embedding space. This is a crucial alignment step.
  • Stage 2: Encoder and Adaptor Training. The LLM remains frozen. The audio encoder and adaptor are trained together. This allows the audio encoder to learn to produce representations that are more semantically meaningful and better aligned with the frozen LLM's understanding of language.
  • Stage 3: LLM Adaptation via LoRA. The encoder and adaptor are frozen. The LLM's parameters are updated using Low-Rank Adaptation (LoRA). This step adapts the LLM to the nuances of the ASR task (e.g., transcribed speech often has different characteristics than written text) while preserving its powerful pre-trained knowledge and preventing catastrophic forgetting.
  • Stage 4: Joint Fine-tuning. Full-parameter fine-tuning is applied to the audio encoder and adaptor, while the LLM continues to be fine-tuned with LoRA. This stage uses only high-quality data and trains all components jointly to achieve optimal performance.
  • Stage 5: CTC Decoder Training. The audio encoder is frozen, and only the separate CTC decoder head is trained. This CTC decoder is not used for the final output but serves as a fast, first-pass decoder to generate an initial hypothesis for use in the hotword customization RAG mechanism.
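
As a minimal sketch of how such a staged schedule could be expressed in code, the helper below toggles `requires_grad` per stage; the module names (`encoder`, `adaptor`, `ctc_head`, `llm`) are assumptions, and LoRA injection for stages 3-5 is only indicated in a comment.

```python
def configure_sft_stage(model, stage: int) -> None:
    """Freeze/unfreeze components to mirror the five SFT stages described above."""
    def set_trainable(module, flag: bool) -> None:
        for p in module.parameters():
            p.requires_grad = flag

    set_trainable(model.encoder, stage in (2, 4))     # audio encoder trains in stages 2 and 4
    set_trainable(model.adaptor, stage in (1, 2, 4))  # adaptor trains in stages 1, 2, and 4
    set_trainable(model.ctc_head, stage == 5)         # CTC decoder head trains only in stage 5
    # Stages 3-4: the LLM backbone stays frozen and only LoRA matrices are updated,
    # e.g. by wrapping model.llm with a parameter-efficient adapter -- not shown here.
    set_trainable(model.llm, False)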

4.2.3. Phase 3: Contextual Supervised Fine-tuning

After standard SFT, the model is further trained on long-duration audio with contextual prompts to improve its ability to handle long conversations and disambiguate words. Since high-quality contextual data is scarce, the authors synthesize it in three steps:

  1. Keyword Extraction: An LLM (Qwen3-32B) extracts key entities and terms from the transcript of an audio segment.
  2. Relevant Context Synthesis: The same LLM is prompted to generate diverse, conversational-style contextual sentences that contain the extracted keywords.
  3. Irrelevant Context Combination: To prevent the model from becoming over-reliant on the provided context, the synthesized relevant context is mixed with randomly sampled irrelevant contextual sentences, making the training more robust.
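
A minimal sketch of step 3, mixing the synthesized relevant context with random distractors before it is attached to a training sample; the function and parameter names are illustrative assumptions.

```python
import random

def build_context_prompt(relevant: list, irrelevant_pool: list,
                         n_irrelevant: int = 3, seed=None) -> str:
    """Mix relevant contextual sentences with randomly sampled irrelevant ones."""
    rng = random.Random(seed)
    distractors = rng.sample(irrelevant_pool, k=min(n_irrelevant, len(irrelevant_pool)))
    sentences = list(relevant) + distractors
    rng.shuffle(sentences)                 # prevent the model from keying on position
    return " ".join(sentences)
```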

4.2.4. Phase 4: Reinforcement Learning

This is the final and most innovative training phase, designed to address specific real-world failures like hallucination and keyword errors.

4.2.4.1. The FunRL Framework

The authors designed FunRL, a custom RL framework for Large Audio-Language Models (LALMs), as shown in Figure 4a of the paper (note: the image file is mislabeled as fig 3 in the markdown).

fig 3 Chart showing the Fun-ASR framework and a time-cost analysis of reinforcement learning. Panel (a) depicts the FunRL framework, including the interactions among the FSDP LLM, the SQLang LLM, and the reward model; panel (b) is a pie chart of the time consumed by each component of the RL process, indicating each component's share of the total time.

The framework efficiently coordinates GPU usage between three modules:

  1. Audio Encoder: Batches and processes audio clips on the GPU to extract embeddings.

  2. Rollout Module: Uses the audio embeddings to generate multiple candidate transcriptions (hypotheses) from the current policy on the GPU.

  3. Policy Module: Calculates rewards for the hypotheses, computes the RL loss, and updates the model's policy on the GPU. The updated policy is then synchronized back to the rollout module.

    This alternating use of the GPU minimizes device-switching overhead, making the RL process efficient.
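
The loop below is a highly simplified sketch of one such iteration, assuming hypothetical encoder, rollout, policy, and reward interfaces; it is not the FunRL implementation.

```python
def funrl_step(audio_batch, encoder, rollout_model, policy_model, reward_fn, G=8):
    """One illustrative FunRL-style iteration; all module interfaces are hypothetical."""
    embeddings = encoder(audio_batch)                      # 1. audio encoder on GPU
    groups = [rollout_model.sample(e, num_samples=G)       # 2. G candidate transcripts
              for e in embeddings]                         #    per utterance (rollout)
    rewards = [[reward_fn(hyp) for hyp in group]           # 3. score every hypothesis,
               for group in groups]                        #    then update the policy
    loss = policy_model.grpo_loss(embeddings, groups, rewards)
    loss.backward()
    policy_model.optimizer_step()
    rollout_model.sync_weights_from(policy_model)          # sync updated policy back
    return loss
```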

4.2.4.2. GRPO-based RL for ASR

The paper uses a policy-gradient RL algorithm called Group-based Reward Policy Optimization (GRPO). For a given input, the model generates a group of $G$ responses $\{o_i\}_{i=1}^G$. Each response $o_i$ receives a reward $R_i$. The advantage $\hat{A}_{i,t}$ for each token is calculated by normalizing the group-level rewards.

The formula for the advantage is given in Equation (1): $ \hat{A}_{i,t} = \frac{R_{i} - \mathrm{mean}\{R_{j}\}_{j=1}^{G}}{\mathrm{std}\left(\{R_{j}\}_{j=1}^{G}\right)} \quad (1) $

  • $R_i$: The reward for the $i$-th response in the group.
  • $\mathrm{mean}\{\cdot\}$ and $\mathrm{std}\{\cdot\}$: The mean and standard deviation of rewards across the group of $G$ responses. The advantage measures how much better or worse a given response is compared to the average response in the group.

The policy is then updated using a clipped objective function, similar to PPO, with an added KL-divergence penalty to prevent the policy from straying too far from a reference policy. The objective function is given in Equation (2): $ L_{\mathrm{GRPO}}(\theta) = \frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_{i}|}\sum_{t=1}^{|o_{i}|}\mathrm{min}\big(r_{i,t}(\theta)\hat{A}_{i,t}, \mathrm{clip}\big(r_{i,t}(\theta), 1-\epsilon, 1+\epsilon\big)\hat{A}_{i,t}\big) - \beta D_{\mathrm{KL}}(\pi_{\theta}\,\|\,\pi_{\mathrm{ref}}) \quad (2) $ where $r_{i,t}(\theta)$ is the probability ratio: $ r_{i,t}(\theta) = \frac{\pi_{\theta}(o_{i,t}\mid q, o_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(o_{i,t}\mid q, o_{i,<t})} \quad (3) $

  • $\theta$: The parameters of the current policy $\pi_\theta$.
  • $\theta_{\mathrm{old}}$: The parameters of the policy before the update (used for calculating the ratio).
  • $\pi_{\mathrm{ref}}$: A reference policy (often the model before RL training).
  • $r_{i,t}(\theta)$: The ratio of the probability of generating token $o_{i,t}$ under the new policy vs. the old policy.
  • $\epsilon$: A small hyperparameter for clipping the ratio, which stabilizes training.
  • $D_{\mathrm{KL}}$: The KL-divergence, measuring the difference between the current and reference policies.
  • $\beta$: A coefficient controlling the strength of the KL penalty.
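
A compact PyTorch sketch of Equations (1)-(3); the per-token KL estimator and the hyperparameter defaults are assumptions for illustration, not the paper's exact choices.

```python
import torch

def grpo_loss(logp_new, logp_old, logp_ref, rewards, eps=0.2, beta=0.04):
    """Minimal GRPO objective for one prompt with G sampled transcripts.

    logp_new / logp_old / logp_ref: lists of 1-D tensors of per-token log-probs
    under the current, rollout, and reference policies.
    rewards: tensor of shape (G,) with one scalar reward per transcript.
    """
    # Group-normalised advantage, Eq. (1): shared by every token of transcript i.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    total = 0.0
    for i, (lp_n, lp_o, lp_r) in enumerate(zip(logp_new, logp_old, logp_ref)):
        ratio = torch.exp(lp_n - lp_o)                       # r_{i,t}, Eq. (3)
        clipped = torch.clamp(ratio, 1 - eps, 1 + eps)
        policy_term = torch.minimum(ratio * adv[i], clipped * adv[i])
        kl = torch.exp(lp_r - lp_n) - (lp_r - lp_n) - 1      # per-token KL estimate
        total += (policy_term - beta * kl).mean()            # average over tokens
    return -(total / len(logp_new))                          # negate: optimizers minimise
```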

4.2.4.3. Custom Reward Function

A key contribution is the multi-faceted reward function, designed to improve user experience beyond simple WER reduction. The total reward $R_i$ is a sum of the following components:

  • $R_i^1$ (ASR Accuracy): $1 - \mathrm{WER}(y^*, y)$, where $y^*$ is the ground truth and $y$ is the generated transcript. This directly optimizes for recognition accuracy.
  • $R_i^2$ (Keyword Accuracy and Recall): A reward based on both the precision and recall of pre-identified important keywords. This encourages the model to correctly transcribe critical terms.
  • $R_i^3$ (Noise Robustness and Hallucination Suppression): A penalty proportional to the length of any hallucinated text, which is detected using regular expression matching. This directly discourages the model from generating spurious output.
  • $R_i^4$ (Language Match): A large penalty of $-1$ is applied if the output language does not match the input language (e.g., translating instead of transcribing).
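
The function below sketches how such a composite reward could be assembled, assuming the jiwer package for WER; the keyword term is simplified to recall only, and the hallucination heuristic is a stand-in for the paper's regex-based detector.

```python
import jiwer  # assumption: the jiwer package is available for WER computation

def asr_reward(ref: str, hyp: str, keywords: list,
               ref_lang: str, hyp_lang: str) -> float:
    """Illustrative multi-term reward in the spirit of Section 4.2.4.3."""
    if ref.strip():
        # R1: accuracy term, 1 - WER (floored at 0 so long garbage is not rewarded).
        r1 = max(0.0, 1.0 - jiwer.wer(ref, hyp))
        r3 = 0.0                              # no hallucination penalty when speech is present
    else:
        r1 = 0.0
        # R3: hallucination penalty on non-speech input, proportional to output length.
        r3 = -0.1 * len(hyp.split())
    # R2: keyword recall over the user-supplied keyword list (precision omitted here).
    r2 = (sum(1 for k in keywords if k in hyp) / len(keywords)) if keywords else 0.0
    # R4: language-match penalty.
    r4 = -1.0 if ref_lang != hyp_lang else 0.0
    return r1 + r2 + r3 + r4
```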

4.2.5. Production-oriented Optimizations

The paper details several specific techniques to make Fun-ASR practical for deployment.

  • Streaming Ability: The model is fine-tuned on data that is chunked to simulate a real-time streaming scenario, reducing the train-inference mismatch.
  • Noise Robust Training: A massive 110K hours of noisy speech is simulated by mixing clean speech with a large corpus of noise samples at various signal-to-noise ratios (SNRs).
  • Code-switching: To handle mixed Chinese-English speech, synthetic training data is generated by using an LLM (Qwen3) to create code-switched sentences, which are then converted to speech using a Text-to-Speech (TTS) model.
  • Hotword Customization: A RAG-based mechanism is implemented. During inference, the fast CTC decoder generates a hypothesis. This hypothesis is used to retrieve candidate hotwords from a user-defined vocabulary based on phoneme or word-piece edit distance. These retrieved hotwords are then fed as context to the main LLM decoder to produce the final, customized output (a retrieval sketch follows this list).
  • Hallucination Mitigation: During data augmentation, pure-noise segments are created by adding noise to zero-padded audio. Training on these samples forces the model to learn to output nothing for non-speech inputs, thus reducing hallucinations during silences.
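
As referenced in the hotword item above, here is a minimal retrieval sketch; it substitutes difflib character similarity for the paper's phoneme/word-piece edit distance, and the threshold, window scheme, and example strings are illustrative.

```python
from difflib import SequenceMatcher

def retrieve_hotwords(ctc_hypothesis: str, vocabulary: list,
                      threshold: float = 0.6, top_k: int = 5) -> list:
    """Rank user hotwords by fuzzy similarity to spans of the CTC first-pass output."""
    scored = []
    for word in vocabulary:
        n = len(word)
        # Slide a window of the hotword's length over the hypothesis and keep the best match.
        best = max(
            (SequenceMatcher(None, word, ctc_hypothesis[i:i + n]).ratio()
             for i in range(max(1, len(ctc_hypothesis) - n + 1))),
            default=0.0,
        )
        if best >= threshold:
            scored.append((best, word))
    return [w for _, w in sorted(scored, reverse=True)[:top_k]]

# The retrieved hotwords would then be placed in the LLM decoder's context.
hotwords = retrieve_hotwords("please call doctor smit tomorrow", ["Dr. Smith", "aspirin"])
```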

5. Experimental Setup

5.1. Datasets

The experiments use a combination of public benchmarks and private, industry-grade datasets.

  • Open-Source Datasets:
    • Chinese: AIShell-1, AIShell-2, WeNetSpeech, Fleurs-zh. These are standard Mandarin speech recognition corpora of varying scales and domains.
    • English: Librispeech (clean and other), Fleurs-en. Librispeech is a widely used benchmark derived from audiobooks.
    • Multilingual: Gigaspeech2 and Fleurs.
  • Industry Evaluation Sets: To counter potential data leakage in public benchmarks and to evaluate real-world performance, the authors created several private test sets:
    • Leakage-Free Set: Videos uploaded to YouTube and Bilibili after June 30th, 2025, which were then manually transcribed. This ensures the model has not seen the data during training.
    • Noise Robustness Set: Real-world recordings from 11 different noisy environments (canteen, subway, street, etc.) to test performance under challenging acoustic conditions.
    • Other in-house sets are mentioned, such as In-house, Fairfield, Home Scenario, Complex Background, and English General.

5.2. Evaluation Metrics

The primary metric used throughout the paper is Word Error Rate (WER), and for some languages like Chinese, Character Error Rate (CER) is also used.

  • Word Error Rate (WER)

    1. Conceptual Definition: WER measures the error rate of an ASR system at the word level. It compares the predicted transcription to a reference transcription and counts the number of substitutions (incorrect words), deletions (missed words), and insertions (extra words). A lower WER indicates better performance. (A worked Python sketch appears after this metric list.)
    2. Mathematical Formula: $ \mathrm{WER} = \frac{S + D + I}{N} $
    3. Symbol Explanation:
      • $S$: The number of substitutions.
      • $D$: The number of deletions.
      • $I$: The number of insertions.
      • $N$: The total number of words in the reference transcription.
  • Character Error Rate (CER)

    1. Conceptual Definition: CER is analogous to WER but operates at the character level. It is commonly used for languages like Chinese, where the concept of a "word" is not as clearly defined as in English.
    2. Mathematical Formula: $ \mathrm{CER} = \frac{S_{char} + D_{char} + I_{char}}{N_{char}} $
    3. Symbol Explanation:
      • $S_{char}$: The number of character substitutions.
      • $D_{char}$: The number of character deletions.
      • $I_{char}$: The number of character insertions.
      • $N_{char}$: The total number of characters in the reference transcription.
  • Accuracy (acc) and Recall (rec): These are used for hotword evaluation.

    1. Conceptual Definition:
      • Accuracy (Precision): Of all the hotwords the model predicted, what fraction were correct?
      • Recall: Of all the true hotwords in the audio, what fraction did the model correctly identify?
    2. Mathematical Formula: $ \mathrm{Accuracy} (\text{Precision}) = \frac{\text{TP}}{\text{TP} + \text{FP}} $ $ \mathrm{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}} $
    3. Symbol Explanation:
      • TP (True Positives): Hotwords correctly identified.
      • FP (False Positives): Words incorrectly identified as hotwords.
      • FN (False Negatives): Hotwords that were missed by the model.
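
As referenced in the WER item above, here is a small self-contained implementation of $\mathrm{WER} = (S + D + I)/N$ via word-level Levenshtein distance; splitting into characters instead of words yields CER.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (S + D + I) / N computed with an edit-distance dynamic program over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn the first i reference words into the first j hypothesis words.
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                                  # deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                                  # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])   # substitution (or match)
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[-1][-1] / max(1, len(ref))

print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))  # 1 deletion / 6 words ≈ 0.167
```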

5.3. Baselines

Fun-ASR is compared against a strong set of recent and SOTA ASR systems:

  • Whisper-large-v3: OpenAI's large-scale model, a benchmark for robustness.

  • Seed-ASR: A leading LLM-based ASR model, available as an open-source model and a commercial API.

  • Kimi-Audio: Another powerful multimodal audio-language model.

  • FireRed-ASR: An open-source industrial-grade Mandarin ASR model.

  • Paraformer v2: A strong non-autoregressive model from Alibaba.

  • Step-Audio2: Another recent ASR model.

  • For multilingual evaluation, baselines include dolphin-small and seamless-m4t-large-v2.

    These baselines represent the current state of the art in both academic and industrial ASR research.

6. Results & Analysis

6.1. Core Results Analysis

The experimental results strongly support the paper's central thesis: performance on real-world industry datasets is more meaningful than on potentially contaminated open-source benchmarks, and Fun-ASR's production-oriented design allows it to excel in these practical scenarios.

The following are the results from Table 1 of the original paper:

| Test set | Whisper-large-v3 | Seed-ASR | Seed-ASR* | Kimi-Audio | Step-Audio2 | FireRed-ASR | Fun-ASR-nano | Fun-ASR |
|---|---|---|---|---|---|---|---|---|
| AIShell1 | 4.72 | 0.68 | 1.63 | 0.71 | 0.63 | 0.54 | 1.76 | 1.22 |
| AIShell2 | 4.68 | 2.27 | 2.76 | 2.86 | 2.10 | 2.58 | 2.80 | 2.3 |
| Fleurs-zh | 5.18 | 3.43 | 3.23 | 3.11 | 2.68 | 4.81 | 3.47 | 2.64 |
| Fleurs-en | 6.23 | 9.39 | 9.39 | 6.99 | 3.03 | 10.79 | 7.95 | 5.84 |
| Librispeech-clean | 1.86 | 1.58 | 2.8 | 1.32 | 1.17 | 1.84 | 1.75 | 1.57 |
| Librispeech-other | 3.43 | 2.84 | 5.69 | 2.63 | 2.42 | 4.52 | 4.37 | 3.24 |
| WenetSpeech Meeting | 18.39 | 5.69 | 7.07 | 6.24 | 4.75 | 4.95 | 8.78 | 6.49 |
| WenetSpeech Net | 11.89 | 4.66 | 4.84 | 6.45 | 4.67 | 4.94 | 6.28 | 5.46 |

On these open-source benchmarks (Table 1), all modern ASR systems achieve very low WERs, and the performance differences are relatively small. Fun-ASR is competitive but doesn't universally dominate.

The narrative changes dramatically when looking at the industry datasets. The following are the results from Table 2 of the original paper:

| Test set | Seed-ASR | Whisper-large-v3 | FireRed-ASR | Kimi-Audio | Paraformer v2 | Fun-ASR-nano | Fun-ASR |
|---|---|---|---|---|---|---|---|
| In-house | 7.20 | 16.58 | 10.10 | 9.02 | 8.11 | 7.26 | 6.66 |
| Fairfield | 4.59 | 22.21 | 7.49 | 10.95 | 9.55 | 5.43 | 4.66 |
| Home Scenario | 8.08 | 18.17 | 9.67 | 23.79 | 6.87 | 6.02 | 5.17 |
| Complex Background | 12.90 | 32.57 | 15.56 | 15.56 | 15.19 | 17.07 | 11.29 |
| English General | 15.65 | 18.56 | 21.62 | 18.12 | 19.48 | 15.87 | 14.22 |
| Opensource | 3.83 | 7.05 | 5.31 | 3.20 | 6.23 | 4.65 | 3.60 |
| Average | 8.71 | 19.19 | 11.63 | 13.54 | 10.91 | 9.38 | 7.60 |

In Table 2, Fun-ASR achieves the lowest average WER (7.60%), outperforming all other models, including the strong Seed-ASR (8.71%). Notably, models like Whisper-large-v3 that are very robust on public data show a massive performance degradation on these sets (average WER of 19.19%), especially in noisy "Complex Background" scenarios (32.57% WER). This highlights the effectiveness of Fun-ASR's noise robustness and other production-focused training. Even the lightweight Fun-ASR-nano (9.38% average WER) is competitive with much larger models.

6.2. Ablation Studies / Parameter Analysis

6.2.1. Streaming Performance

The following are the results from Table 3 of the original paper:

| Test set | Seed-ASR | Fun-ASR | Fun-ASR-nano |
|---|---|---|---|
| In-house | 8.64 | 7.00 | 7.97 |
| Fairfield | 5.51 | 5.33 | 6.92 |
| Home Scenario | 9.7 | 5.33 | 6.51 |
| Complex Background | 15.48 | 12.50 | 14.83 |
| English General | 18.78 | 14.74 | 16.70 |
| Opensource Test Sets | 3.80 | 3.60 | 5.13 |

Table 3 shows that Fun-ASR maintains its superior performance even in streaming mode, consistently outperforming Seed-ASR across all test sets. This validates the effectiveness of the streaming simulation during fine-tuning.

6.2.2. Noise Robustness

The following are the results from Table 4 of the original paper:

| Environment | Fun-ASR w/o NRT | Fun-ASR w/ NRT | Fun-ASR NRT + RL |
|---|---|---|---|
| canteen | 20.67 | 20.34 | 19.88 |
| dinner | 14.02 | 9.88 | 9.55 |
| meeting | 6.45 | 6.27 | 6.24 |
| office | 15.02 | 11.58 | 11.42 |
| outdoor | 10.12 | 9.85 | 9.58 |
| park | 13.67 | 11.88 | 11.37 |
| shop | 12.22 | 11.48 | 11.24 |
| street | 12.05 | 10.58 | 10.86 |
| subway | 14.11 | 13.31 | 13.29 |
| supermarket | 14.27 | 8.81 | 8.75 |
| walkstreet | 13.89 | 13.87 | 13.94 |
| Average | 13.32 | 11.58 | 11.45 |

Table 4 clearly shows the impact of Noise Robust Training (NRT). NRT reduces the average WER from 13.32% to 11.58%, a relative improvement of 13%. The gains are most dramatic in challenging environments like "dinner" (14.02% -> 9.88%) and "supermarket" (14.27% -> 8.81%), where LLM-based systems are most likely to hallucinate. Reinforcement Learning (NRT + RL) provides a further, smaller improvement to 11.45%, suggesting it helps refine the model's behavior in these noisy conditions.

6.2.3. Code-Switching and Hotword Customization

The following are the results from Table 5 of the original paper:

| Test set | Offline w/o CS | Offline w/o RL | Offline w/ RL | Streaming w/o CS | Streaming w/o RL | Streaming w/ RL |
|---|---|---|---|---|---|---|
| A | 4.53 | 1.70 | 1.59 | 6.19 | 5.85 | 2.28 |
| B | 4.76 | 4.56 | 4.50 | 6.32 | 5.68 | 5.07 |

Table 5 demonstrates that the synthetic code-switching (CS) training data provides a massive performance boost (e.g., on test set A, offline WER drops from 4.53% to 1.70%). RL provides further gains, especially in the challenging streaming setting (on test set A, streaming WER drops from 5.85% to 2.28%).

The following are the results from Table 6 of the original paper:

| Topic | Offline w/o RL (WER / acc / rec) | Offline w/ RL (WER / acc / rec) | Streaming w/o RL (WER / acc / rec) |
|---|---|---|---|
| biology | 1.67 / 0.98 / 0.99 | 1.70 / 0.97 / 1.00 | 2.04 / 0.98 / 0.98 |
| math | 0.86 / 0.99 / 0.99 | 0.86 / 0.99 / 0.99 | 1.29 / 0.99 / 0.99 |
| religion | 3.20 / 0.98 / 0.98 | 2.87 / 0.99 / 0.99 | 3.71 / 0.99 / 0.98 |
| food | 1.90 / 0.98 / 0.99 | 1.55 / 0.99 / 1.00 | 2.01 / 0.99 / 0.99 |
| name | 0.53 / 1.00 / 0.95 | 0.35 / 1.00 / 1.00 | 1.29 / 1.00 / 0.95 |
| brand | 0.41 / 1.00 / 0.99 | 0.33 / 1.00 / 0.99 | 0.98 / 0.95 / 0.98 |
| astronomy | 2.11 / 1.00 / 0.97 | 1.97 / 0.99 / 0.97 | 2.28 / 0.98 / 0.95 |
| chemistry | 1.76 / 0.99 / 0.97 | 1.91 / 0.99 / 0.98 | 2.81 / 0.98 / 0.97 |
| philosophy | 3.03 / 0.99 / 0.96 | 2.84 / 0.99 / 0.97 | 3.31 / 0.99 / 0.96 |
| physics | 1.72 / 0.99 / 1.00 | 1.82 / 0.98 / 1.00 | 2.31 / 0.99 / 0.98 |

(Note: the original table's header also lists a "Streaming w/ RL" column group, but its values appear to be missing in the original paper.)

Table 6 shows that the hotword customization feature is effective, achieving high recall (>0.97) on most topics. RL helps improve both WER and recall in several cases, most notably for the "name" topic where recall improves from 0.95 to 1.00 and WER drops significantly.

6.2.4. Effect of Reinforcement Learning

The following are the results from Table 8 of the original paper:

| Test set | Offline w/o RL | Offline w/ RL | Streaming w/o RL | Streaming w/ RL |
|---|---|---|---|---|
| In-house | 6.55 | 6.66 | 7.24 | 7.00 |
| Fairfield | 5.14 | 4.66 | 6.96 | 5.33 |
| Home Scenario | 5.19 | 5.17 | 6.53 | 5.73 |
| Complex Background | 12.16 | 11.29 | 13.53 | 12.50 |
| Accented | 15.17 | 14.22 | 15.54 | 14.74 |
| Opensource | 3.98 | 3.96 | 5.17 | 4.37 |
| Average | 8.78 | 8.42 | 10.05 | 9.13 |

Table 8 provides a clear ablation for RL. On average, RL improves the WER from 8.78% to 8.42% (4.1% relative improvement) in offline mode and from 10.05% to 9.13% (9.2% relative improvement) in streaming mode. The gains are most pronounced on the more challenging datasets (Complex Background, Accented, Fairfield) and are larger for the streaming setting, suggesting that RL helps the model make better decisions under latency constraints.

7. Conclusion & Reflections

7.1. Conclusion Summary

The paper presents Fun-ASR, a large-scale, LLM-based ASR system designed and optimized for practical, real-world deployment. By synergistically combining data scaling, model scaling, deep LLM integration, and a targeted reinforcement learning phase, Fun-ASR achieves state-of-the-art performance. The authors emphasize that a key contribution is the system's robustness and superior accuracy on challenging industrial evaluation sets, in contrast to other models that perform well on open-source benchmarks but fail in practice. The paper successfully demonstrates that through production-oriented optimizations for streaming, noise, code-switching, and hotwords, it is possible to build a deployable ASR system that is both highly accurate and reliable.

7.2. Limitations & Future Work

The authors candidly acknowledge several limitations of the current Fun-ASR system:

  • Limited Language Support: The primary model is optimized for Chinese and English. While a multilingual version exists, production-grade features like streaming and hotword customization are not yet fully supported for other languages.

  • Constrained Context Window: The model's ability to handle very long audio recordings is limited. It struggles without an external Voice Activity Detection (VAD) module to segment the audio, indicating that its internal context management is not yet robust enough for indefinite-length inputs.

  • No Far-Field or Multi-Channel Support: The current version is not designed for far-field audio (e.g., from smart speakers across a room) or to leverage multi-channel microphone arrays for beamforming and noise reduction.

    The authors plan to address these limitations in future releases.

7.3. Personal Insights & Critique

This technical report is an excellent case study on bridging the gap between academic research and industrial-strength AI products.

Inspirations:

  • Problem-Driven Engineering: The entire project is driven by a clear, practical problem: existing SOTA models are not robust enough for real-world use. The focus on fixing specific failure modes (hallucination, keyword errors) through targeted methods like the custom RL reward function is a powerful approach.
  • Sophistication of Training: The multi-stage training pipeline is a testament to the complexity of building modern large-scale models. It shows that simply "training a big model on big data" is not enough; careful, staged alignment and adaptation of components are critical.
  • The Value of Real-World Data: The paper makes a compelling argument for the limited utility of aging open-source benchmarks. The emphasis on creating and evaluating on fresh, challenging, and representative datasets is a crucial lesson for the entire field.

Critique and Potential Issues:

  • Reproducibility and Accessibility: The primary critique is the lack of reproducibility. The models, massive datasets, and computational resources are proprietary to Alibaba. While the methodological description is detailed, it is impossible for the academic community to verify the results or build upon the work directly. This is a common issue with large-scale industrial research.

  • Lack of Peer Review: As a technical report, the paper has not undergone the scrutiny of peer review. Some claims, experimental details, or comparisons might be presented in a way that favors the authors' system.

  • Evaluation on Private Datasets: While the authors' rationale for using private datasets is sound (avoiding data leakage), it also means the SOTA claims cannot be independently verified by the broader community. The results on these sets must be taken with the understanding that the test conditions are controlled by the authors.

  • Minor Inconsistencies: The paper contains minor inconsistencies, such as the mislabeling of figures and a future publication date, which reflect its status as a rapidly developed technical report rather than a polished, peer-reviewed article.

    Overall, Fun-ASR represents a significant engineering achievement. It showcases a path forward for ASR systems where raw accuracy is balanced with practical robustness, reliability, and user-centric features, a direction that is essential for the real-world impact of AI technologies.
