
Recent Advances in Speech Language Models: A Survey

Published: 10/02/2024
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

This survey provides a comprehensive overview of recent methodologies for constructing Speech Language Models (SpeechLMs), emphasizing their advantages as end-to-end models that generate speech directly, overcoming challenges like information loss, latency, and error accumulation.

Abstract

Large Language Models (LLMs) have recently garnered significant attention, primarily for their capabilities in text-based interactions. However, natural human interaction often relies on speech, necessitating a shift towards voice-based models. A straightforward approach to achieve this involves a pipeline of ``Automatic Speech Recognition (ASR) + LLM + Text-to-Speech (TTS)", where input speech is transcribed to text, processed by an LLM, and then converted back to speech. Despite being straightforward, this method suffers from inherent limitations, such as information loss during modality conversion, significant latency due to the complex pipeline, and error accumulation across the three stages. To address these issues, Speech Language Models (SpeechLMs) -- end-to-end models that generate speech without converting from text -- have emerged as a promising alternative. This survey paper provides the first comprehensive overview of recent methodologies for constructing SpeechLMs, detailing the key components of their architecture and the various training recipes integral to their development. Additionally, we systematically survey the various capabilities of SpeechLMs, categorize their evaluation metrics, and discuss the challenges and future research directions in this rapidly evolving field. The GitHub repository is available at https://github.com/dreamtheater123/Awesome-SpeechLM-Survey

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

Recent Advances in Speech Language Models: A Survey

1.2. Authors

Wenqian Cui, Dianzhi Yu, Xiaoqi Jiao, Ziqiao Meng, Guangyan Zhang, Qichao Wang, Yiwen Guo, and Irwin King, Fellow, IEEE

1.3. Journal/Conference

This paper is a preprint, published on arXiv. The authors include several researchers, some affiliated with academic institutions, and one Fellow of IEEE (Irwin King), indicating expertise in the field of artificial intelligence and machine learning. As a survey paper, its goal is to provide a comprehensive overview of a rapidly evolving research area, rather than presenting novel research findings. Preprints on arXiv allow for rapid dissemination of research and contribute to ongoing discussions in the AI community.

1.4. Publication Year

2024

1.5. Abstract

The paper addresses the limitations of conventional text-based Large Language Models (LLMs) in natural human interaction, which often relies on speech. It highlights the shortcomings of the Automatic Speech Recognition (ASR) + LLM + Text-to-Speech (TTS) pipeline, specifically information loss during modality conversion, significant latency, and cumulative errors. To overcome these, Speech Language Models (SpeechLMs) are introduced as end-to-end models that generate speech directly from speech without intermediate text conversion. This survey provides the first comprehensive overview of recent methodologies for constructing SpeechLMs, detailing their architectural components (speech tokenizer, language model, vocoder) and training recipes. It also systematically surveys their capabilities, categorizes evaluation metrics, and discusses current challenges and future research directions in this rapidly evolving field.

Official Source: https://arxiv.org/abs/2410.03751v4
PDF Link: https://arxiv.org/pdf/2410.03751v4.pdf
Publication Status: Preprint on arXiv.

2. Executive Summary

2.1. Background & Motivation

The core problem the paper aims to solve is the inherent limitations of integrating speech capabilities into Large Language Models (LLMs) using a traditional Automatic Speech Recognition (ASR) + LLM + Text-to-Speech (TTS) pipeline. While LLMs have excelled in text-based interactions, natural human communication is fundamentally multimodal, with speech being a primary channel.

This problem is important because:

  1. Information Loss: The pipeline converts rich speech signals (containing semantic and paralinguistic information like pitch, timbre, tone) into text, losing crucial non-semantic cues. This limits the model's ability to interpret user intent accurately and generate expressive, nuanced responses.

  2. Significant Latency: Chaining three complex, sequential modules (ASR, LLM, TTS) inevitably introduces considerable delays, making real-time, natural conversation difficult. Each module often involves its own sub-components and decoding steps (e.g., text generators, tokenizers, beam search), increasing computational demands.

  3. Cumulative Errors: Errors can propagate and accumulate across stages. ASR transcription errors can negatively impact the LLM's understanding and response generation, and the TTS module might struggle to synthesize text that is grammatically correct but acoustically challenging.

    The paper's entry point and innovative idea revolve around Speech Language Models (SpeechLMs). These are proposed as an alternative end-to-end approach that directly processes and generates speech, bypassing the intermediate text conversion of the traditional pipeline. This promises to retain more information, reduce latency, and mitigate error accumulation.

2.2. Main Contributions / Findings

The paper's primary contributions are:

  1. First Comprehensive Survey: It presents the first comprehensive overview of recent methodologies for constructing SpeechLMs, filling a gap in the literature compared to existing surveys that focus on traditional speech/audio technologies or multimodal LLMs without specifically addressing end-to-end speech generation models.

  2. Novel Taxonomy for SpeechLMs: The paper proposes a new classification system for SpeechLMs based on their underlying components (speech tokenizer, language model, vocoder) and training recipes (pre-training, instruction-tuning, post-alignment, feature modeling, and interaction paradigms).

  3. Classification of Evaluation Methods: It introduces a novel classification system for evaluating SpeechLMs, categorizing them into automatic (objective) and human (subjective) assessments, covering aspects like representation, linguistic, paralinguistic, generation quality, real-time interaction, and downstream task performance.

  4. Identification of Challenges and Future Directions: The survey discusses the current challenges in the SpeechLM field, such as understanding component choices, achieving end-to-end training, real-time speech generation, addressing safety risks (toxicity, privacy), and improving performance on rare languages. It also outlines promising avenues for future research.

    These contributions collectively provide a structured understanding of the rapidly evolving SpeechLM landscape, guiding researchers and practitioners in developing more powerful and natural voice-based AI.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand Speech Language Models (SpeechLMs), several foundational concepts from natural language processing (NLP) and speech processing are crucial.

  • Large Language Models (LLMs): These are advanced neural network models, typically based on the Transformer architecture, trained on vast amounts of text data to understand, generate, and predict human-like text. They learn complex patterns of language, enabling tasks like translation, summarization, and question answering. Examples include GPT-3, LLaMA, and OPT.
  • Automatic Speech Recognition (ASR): ASR is the process by which spoken language is converted into written text. It involves several stages, including acoustic modeling (mapping audio to phonemes), pronunciation modeling (mapping phonemes to words), and language modeling (predicting sequences of words).
  • Text-to-Speech (TTS): TTS is the process of synthesizing human speech from written text. Modern TTS systems often involve text analysis (to determine pronunciation, prosody), acoustic modeling (to generate acoustic features like mel-spectrograms), and vocoding (to convert acoustic features into a waveform).
  • Autoregressive Models: A type of statistical model that predicts the next item in a sequence based on the preceding items. In the context of LLMs or SpeechLMs, this means predicting the next word, token, or speech segment based on all the previous ones, enabling generation of coherent sequences.
  • Transformer Architecture: Introduced in "Attention Is All You Need" [38], the Transformer is a neural network architecture that relies heavily on self-attention mechanisms to weigh the importance of different parts of the input sequence. It has become the backbone of most state-of-the-art LLMs and is widely used in SpeechLMs.
    • Self-Attention: A mechanism that allows the model to weigh the importance of different input elements (tokens) when processing each element. For an input query $Q$, keys $K$, and values $V$, the Attention mechanism is calculated as: $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $ Where:
      • $Q$ (Query), $K$ (Key), $V$ (Value) are matrices representing the input embeddings.
      • $QK^T$ calculates the dot-product similarity between queries and keys.
      • $\sqrt{d_k}$ is a scaling factor that prevents large dot products from pushing the softmax into regions with very small gradients, where $d_k$ is the dimension of the keys.
      • $\mathrm{softmax}$ normalizes the scores so they sum to 1, representing attention weights.
      • The attention weights are then multiplied by the $V$ matrix to obtain the weighted sum of values. A minimal code sketch of this computation appears after this list.
  • Tokenization: The process of breaking down a continuous stream of data (like text or speech) into discrete units called tokens. For text, tokens can be words, subwords, or characters. For speech, they can be speech units, acoustic tokens, or semantic tokens derived from audio features.
  • Vocoders: Devices or algorithms that convert acoustic features (e.g., mel-spectrograms, speech tokens) back into audible speech waveforms. They are crucial for the synthesis part of TTS and SpeechLMs.
  • Self-Supervised Learning (SSL): A paradigm where models learn representations from unlabeled data by solving a pretext task (e.g., predicting masked portions of the input). This is extensively used in speech processing (e.g., wav2vec 2.0, HuBERT) to learn powerful speech representations.
  • Generative Adversarial Networks (GANs): A class of deep learning models where two neural networks, a generator and a discriminator, compete against each other. The generator creates synthetic data (e.g., audio), and the discriminator tries to distinguish real data from generated data. This adversarial process drives the generator to produce highly realistic outputs, often used in vocoders.
  • Variational Autoencoders (VAEs): A type of generative model that learns a compressed latent representation of data and can generate new data points by sampling from this latent space. They are used for learning discrete representations (VQ-VAE) and for various generative tasks.
  • Diffusion Models: A class of generative models that learn to reverse a gradual diffusion process, where noise is progressively added to data until it becomes pure noise. By learning to denoise, they can generate high-quality data (e.g., images, audio) from random noise.
  • k-means Clustering: An unsupervised machine learning algorithm used to partition $n$ observations into $k$ clusters, where each observation belongs to the cluster with the nearest mean (centroid). Used in HuBERT to discretize speech features.
  • Vector Quantization (VQ): A technique for quantizing vectors into a finite set of codebook vectors. Used in audio codecs to compress continuous audio features into discrete tokens.
  • Residual Vector Quantization (RVQ): An extension of VQ where quantization is performed in multiple stages, with each stage quantizing the residual error from the previous stage. This allows for more precise reconstruction and higher fidelity.
  • Masked Language Modeling (MLM): A pre-training objective where a portion of the input tokens are masked, and the model is trained to predict the original masked tokens. Widely used in BERT and speech SSL models like W2v-BERT and HuBERT.
  • Contrastive Learning: A learning paradigm where the model is trained to pull similar samples closer together in an embedding space and push dissimilar samples further apart. Used in wav2vec 2.0 for learning speech representations.
  • Mel-spectrograms: A time-frequency representation of audio that approximates human hearing. It's commonly used as an intermediate representation in TTS and ASR systems.
  • Fundamental Frequency (F0): Also known as pitch, it is the lowest frequency of a vibrating object and is a key acoustic correlate of perceived pitch. It conveys prosodic and emotional information in speech.
  • Perplexity (PPL): A common intrinsic evaluation metric for language models. Lower perplexity indicates a better model, as it means the model is more confident and accurate in predicting the next token in a sequence.
  • Word Error Rate (WER): A common metric for evaluating the performance of ASR systems. It measures the number of errors (substitutions, deletions, insertions) required to transform the recognized word sequence into the reference word sequence, divided by the total number of words in the reference. $ \mathrm{WER} = \frac{S + D + I}{N} $ Where:
    • $S$ is the number of substitutions.
    • $D$ is the number of deletions.
    • $I$ is the number of insertions.
    • $N$ is the total number of words in the reference (ground truth).
  • Character Error Rate (CER): Similar to WER, but calculated at the character level.
  • Mean Opinion Score (MOS): A subjective metric used in telecommunications and speech quality assessment. Human listeners rate the quality of speech samples on a scale (e.g., 1 to 5). The MOS is the arithmetic mean of all individual scores. Variants include MMOS (overall quality), PMOS (prosody), and SMOS (speaker similarity).
  • Reinforcement Learning from Human Feedback (RLHF): A technique used to align LLMs with human preferences. It involves training a reward model on human comparisons of model outputs, and then using this reward model to fine-tune the LLM with reinforcement learning algorithms.
  • Proximal Policy Optimization (PPO): A popular RL algorithm used in RLHF to optimize policies (model behaviors) by making small, stable updates.
  • Direct Preference Optimization (DPO): A simpler and more stable alternative to PPO for RLHF, directly optimizing the policy based on human preference data without needing a separate reward model.
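
As referenced in the Self-Attention entry above, the attention computation reduces to a few lines of array code. The following NumPy sketch (not from the paper; shapes and random inputs are purely illustrative) implements the scaled dot-product attention formula.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(QK^T / sqrt(d_k)) V for a single sequence."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # (seq_q, seq_k) similarity matrix
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over the key dimension
    return weights @ V                                # weighted sum of values

# Toy example: 4 query positions, 6 key/value positions, d_k = d_v = 8.
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(4, 8)), rng.normal(size=(6, 8)), rng.normal(size=(6, 8))
print(scaled_dot_product_attention(Q, K, V).shape)    # (4, 8)
```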

3.2. Previous Works

The paper contextualizes SpeechLMs by contrasting them with existing approaches and related research areas:

  • Traditional Speech and Audio Technologies: Several surveys have focused on specific aspects:

    • Spoken Language Understanding (SLU) [16]: Deals with extracting semantic meaning and intent from spoken utterances.
    • Audio and speech Self-Supervised Learning (SSL) [17, 18]: Focuses on learning robust speech representations from large amounts of unlabeled audio data.
    • Integration of speech with other modalities [19, 20]: Explores how speech interacts with modalities like vision.
    • These works primarily focus on components or specific tasks within speech processing, whereas SpeechLMs aim for an end-to-end generative foundation model.
  • Single-modal and Multi-modal LLMs:

    • Surveys on single-modal LLMs [21, 22] cover advancements in text generation and understanding.
    • Surveys on multi-modal LLMs [23-25] explore models that combine text with other modalities like images or video.
    • While SpeechLMs are multi-modal (speech and potentially text), they have a unique focus on direct speech input and output, which differentiates them from broader multi-modal LLM surveys that might prioritize vision-language tasks.
  • Overlap between Audio Modality and LLMs:

    • Latif et al. [26] examine LLMs in audio processing.
    • Peng et al. [27] review SpeechLLMs within the SLU domain.
    • Ji et al. [28] focus on spoken dialogue systems encompassing speech, sound, and music.
    • The current survey distinguishes itself by focusing specifically on Speech Language Models as end-to-end generative models for speech, which is a more constrained and rapidly developing sub-field compared to general audio processing or SLU.
  • The "Naive" ASR+LLM+TTSASR + LLM + TTS Pipeline: This is the most direct prior work that SpeechLMs aim to supersede.

    • Mechanism: As depicted in Figure 1a, an ASR module converts spoken input to text, an LLM processes the text and generates a textual response, and a TTS module converts the textual response back to speech.
    • Limitations (as detailed in Background & Motivation):
      • Information Loss: ASR discards paralinguistic information (pitch, timbre, tonality), which is vital for expressive communication and accurate intent interpretation.
      • Significant Latency: Sequential operation of three complex modules introduces considerable delays, hindering natural real-time interaction.
      • Cumulative Error: Errors from ASR can propagate to LLM and TTS, leading to degraded overall performance and potentially incomprehensible outputs.

3.3. Technological Evolution

The technological landscape has evolved from specialized, pipeline-based systems to integrated, end-to-end foundation models.

  1. Early Speech Technologies: Focused on separate, modular components for ASR, TTS, and SLU. These systems were often rule-based or employed simpler statistical models.
  2. Deep Learning Revolution: The advent of deep learning significantly improved ASR (e.g., RNNs, LSTMs, Transformers) and TTS (e.g., WaveNet, Tacotron). SSL for speech (e.g., wav2vec 2.0, HuBERT) allowed learning powerful speech representations from vast unlabeled data.
  3. Rise of Text-based LLMs: Transformers and massive datasets led to the emergence of powerful LLMs (e.g., GPT, LLaMA) that excel at text generation and understanding.
  4. Initial Speech-LLM Integration (Pipeline Approach): A natural first step was to combine existing ASR, LLM, and TTS modules. This demonstrated basic speech interaction but inherited all the limitations described above.
  5. Emergence of SpeechLMs: Recognizing the limitations of the pipeline, researchers began exploring end-to-end SpeechLMs. This involves designing models that can directly process raw audio or audio tokens and generate audio or audio tokens, often inspired by the Transformer architecture of LLMs. This approach aims to address the shortcomings of the ASR + LLM + TTS pipeline by integrating modalities more deeply.

3.4. Differentiation Analysis

Compared to the main methods in related work, SpeechLMs introduce several core differences and innovations:

  • End-to-End Modality Handling: The most significant differentiation is the direct, end-to-end processing and generation of speech. Unlike the ASR + LLM + TTS pipeline, SpeechLMs do not convert speech to text (and back) as an intermediate step within the core generation loop.
  • Retention of Paralinguistic Information: By working directly with speech waveforms or rich speech tokens, SpeechLMs are designed to preserve and leverage paralinguistic information (e.g., pitch, timbre, emotion, speaking style) that is lost when speech is reduced to text. This allows for more expressive, natural, and context-aware interactions.
  • Reduced Latency: By integrating the ASR-like and TTS-like functionalities into a single, unified model, SpeechLMs aim to significantly reduce the computational overhead and sequential delays inherent in multi-stage pipelines. This is crucial for real-time conversational AI.
  • Mitigation of Cumulative Errors: A unified architecture trained end-to-end is less susceptible to error propagation, as the model learns to handle variations and ambiguities across modalities more robustly, rather than relying on perfect performance from discrete, cascaded modules.
  • Broader Application Scope: SpeechLMs can natively handle multimodal inputs (interleaved speech and text) and outputs, enabling more fluid human-computer interaction, including scenarios like speech-in-text-out and vice-versa, or mixed dialogue. They can also implicitly learn relationships between different speech characteristics and semantic content.
  • Novelty as a Survey: This paper itself differentiates from existing surveys by focusing specifically on the architectures, training recipes, capabilities, and evaluations of end-to-end generative SpeechLMs, rather than broader audio processing, SLU, or general multimodal LLM topics. It provides a unique lens on this emerging field.

4. Methodology

4.1. Principles

The core idea behind Speech Language Models (SpeechLMs) is to enable end-to-end processing and generation of speech, bypassing the limitations of the conventional ASR + LLM + TTS pipeline. The theoretical basis or intuition is drawn from the success of TextLMs (like LLMs) in autoregressively generating coherent text sequences. SpeechLMs extend this paradigm to the audio domain, aiming to model speech (and potentially interleaved speech and text) autoregressively.

The fundamental design pattern for SpeechLMs involves three main components:

  1. Speech Tokenizer: Converts continuous audio waveforms into discrete or continuous tokens or representations that a language model can process. This step is crucial for abstracting the raw audio signal into a more manageable, information-rich format.

  2. Language Model (LM): Typically a Transformer-based architecture (often borrowed from TextLMs), this component takes the speech tokens (and potentially text tokens) as input and autoregressively predicts the next token in the sequence. It learns the statistical patterns and dependencies within and across speech and text modalities.

  3. Token-to-Speech Synthesizer (Vocoder): Transforms the tokens generated by the language model back into audible speech waveforms. This is essentially the inverse operation of the speech tokenizer.

    This three-stage design, while still having distinct components, is considered "end-to-end" in the sense that the language model directly operates on speech-derived representations and produces speech-synthesizable outputs, fostering a more integrated learning process compared to the cascaded ASR + LLM + TTS system.
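
To make the three-stage design concrete, here is a hedged Python sketch of how the components compose into a single speech-to-speech call. The class and method names (encode, generate, synthesize) are placeholders, not an API from the paper; concrete systems plug in components such as HuBERT, a decoder-only Transformer, and HiFi-GAN.

```python
import numpy as np

class SpeechLMPipeline:
    """Illustrative composition of the three SpeechLM components."""
    def __init__(self, tokenizer, language_model, vocoder):
        self.tokenizer = tokenizer            # waveform -> speech tokens
        self.language_model = language_model  # autoregressive token prediction
        self.vocoder = vocoder                # speech tokens -> waveform

    def respond(self, waveform: np.ndarray, max_new_tokens: int = 256) -> np.ndarray:
        tokens_in = self.tokenizer.encode(waveform)              # s = d(f_E(a))
        tokens_out = self.language_model.generate(tokens_in,     # autoregressive LM
                                                  max_new_tokens=max_new_tokens)
        return self.vocoder.synthesize(tokens_out)               # a = Vo(s)
```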

4.2. Core Methodology In-depth (Layer by Layer)

The paper formally defines a SpeechLM as an autoregressive foundation model that processes and generates speech end-to-end. It can also incorporate text for cross-modal functionalities.

Let's define the input and output:

  • A speech audio waveform is denoted as $\mathbf{a} = (a_1, a_2, \ldots, a_Q)$, where $a_i \in \mathbb{R}$ are audio samples and $Q$ is the waveform length.

  • A text span is denoted as $\mathbf{t} = (t_1, t_2, \ldots, t_K)$, where $t_j$ are text tokens (word, subword, or character) and $K$ is the length.

  • A multimodal sequence is $\mathbf{M} = (M_1, M_2, \ldots, M_N)$, where each element $M_i \in \{a_i, t_j\}$ can be an audio sample or a text token.

  • The input multimodal sequence is $\mathbf{M}^{\mathrm{in}} = (M_1^{\mathrm{in}}, M_2^{\mathrm{in}}, \ldots, M_{N_{\mathrm{in}}}^{\mathrm{in}})$, with $N_{\mathrm{in}} \ge 0$.

  • The output multimodal sequence is $\mathbf{M}^{\mathrm{out}} = (M_1^{\mathrm{out}}, M_2^{\mathrm{out}}, \ldots, M_{N_{\mathrm{out}}}^{\mathrm{out}})$, with $N_{\mathrm{out}} \ge 0$.

    A SpeechLM parameterized by $\theta$ is then represented as: $ \mathbf{M}^{\mathrm{out}} = \mathrm{SpeechLM}(\mathbf{M}^{\mathrm{in}}; \theta) $ This equation indicates that the SpeechLM takes an input multimodal sequence and generates an output multimodal sequence, both potentially containing speech and/or text.
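
The formal definition above can be read as an autoregressive loop over a mixed token sequence. The sketch below is a minimal, assumption-laden illustration: lm_step is a hypothetical callable returning a distribution over the joint speech/text vocabulary, and the modality tags are only for illustration.

```python
from typing import Dict, List, Tuple

Token = Tuple[str, int]  # ("speech", token_id) or ("text", token_id)

def generate(lm_step, m_in: List[Token], max_len: int, eos: Token) -> List[Token]:
    """Autoregressively extend a multimodal sequence M^in into M^out."""
    sequence = list(m_in)
    output: List[Token] = []
    for _ in range(max_len):
        probs: Dict[Token, float] = lm_step(sequence)  # P(M_i | M_<i; theta) over the joint vocab
        next_token = max(probs, key=probs.get)         # greedy decoding for simplicity
        if next_token == eos:
            break
        sequence.append(next_token)
        output.append(next_token)
    return output
```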

4.2.1. Components in SpeechLM

The three main components within a SpeechLM are: speech tokenizer, language model, and token-to-speech synthesizer (vocoder).

4.2.1.1. Speech Tokenizer

The speech tokenizer is the initial component that transforms continuous audio signals (waveforms) into discrete or continuous tokens or representations suitable for a language model. Its goal is to capture essential audio features while reducing dimensionality. These tokens can then be used for autoregressive modeling.

The paper categorizes speech tokenizers into three types based on their primary objective:

  1. Semantic Understanding Objective:

    • Goal: Convert speech waveforms into tokens that capture the content and meaning, primarily enhancing tasks like ASR. These tokens focus on semantic features.
    • Architecture: Typically comprises a speech encoder $f_E(\cdot)$ and a quantizer $d(\cdot)$.
    • Process: The encoder transforms the waveform $\mathbf{a}$ into continuous embeddings $\mathbf{v}$: $ \mathbf{v} = f_E(\mathbf{a}; \theta_{f_E}) $ Where $\mathbf{v} = (v_1, v_2, \ldots, v_P)$ are the encoded embeddings and $\theta_{f_E}$ are the encoder parameters. Since $\mathbf{v}$ is continuous, a quantizer $d(\cdot)$ discretizes these embeddings into speech tokens $\mathbf{s}$: $ \mathbf{s} = d(\mathbf{v}; \theta_d) \quad \text{or} \quad \mathbf{s} = d(\mathbf{a}; \theta_d) $ Where $\mathbf{s} = (s_1, s_2, \ldots, s_P)$ are the speech tokens and $\theta_d$ are the quantizer parameters. If continuous tokens are used, $\mathbf{s} = \mathbf{v}$.
    • Training: Tokens $\mathbf{s}$ can be used as target labels for pre-training objectives such as masking and reconstruction (e.g., masked language modeling (MLM) and contrastive loss).
    • Examples:
      • Wav2vec 2.0 [30]: Uses a convolutional encoder and a product quantization module.
      • W2v-BERT [32]: Builds on wav2vec 2.0 with MLM loss and contrastive loss.
      • HuBERT [33]: Uses k-means to cluster speech utterances into hidden units, then MLM to predict these units from masked speech.
      • Google USM [36]: Employs text-injection loss for text-speech alignment.
      • WavLM [37]: Adds a speech denoising objective.
  2. Acoustic Generation Objective:

    • Goal: Capture acoustic features essential for generating high-quality speech waveforms, prioritizing acoustic characteristics over semantic content, suitable for speech (re)synthesis.
    • Architecture: Typically includes an encoder $f_E(\cdot)$, a quantizer $d(\cdot)$, and a decoder $f_D(\cdot)$.
    • Process: The encoder and quantizer transform the waveform into tokens $\mathbf{s}$ (same as in semantic tokenizers). The decoder $f_D(\cdot)$ then reconstructs these tokens back into a speech waveform $\hat{\mathbf{a}}$: $ \hat{\mathbf{a}} = f_D(\mathbf{s}; \boldsymbol{\theta}_{f_D}) $ Where $\hat{\mathbf{a}}$ is the generated or reconstructed waveform and $\boldsymbol{\theta}_{f_D}$ are the decoder parameters.
    • Examples: Neural audio codecs [35, 49] are primary examples, using vector quantization (VQ) or residual vector quantization (RVQ) [77].
  3. Mixed Objective:

    • Goal: Balance both semantic understanding and acoustic generation by combining advantages of both.
    • Approach: Most existing mixed tokenizers adopt the architecture of acoustic generation tokenizers and distill information from semantic tokenizers.
    • Examples:
      • SpeechTokenizer [34]: Uses RVQ-GAN architecture and distills semantic information from HuBERT into the first layer of RVQ.

      • Mimi [9]: Employs a single VQ to extract information from WavLM and an RVQ module for acoustic information.

        The paper provides detailed notations for three representative speech tokenizers:

  • HuBERT (Semantic Objective):

    • Feature encoder $f_E$ maps raw audio $\mathbf{a}$ to continuous embeddings $\mathbf{v}$: $\mathbf{v} = f_E(\mathbf{a}; \boldsymbol{\theta}_{f_E})$.
    • Embeddings are quantized into discrete speech tokens $\mathbf{s}$ via k-means clustering of MFCC features: $\mathbf{s} = d(\mathrm{MFCC}(\mathbf{a}); \theta_d)$.
    • Trained with a masked prediction objective: $ \mathcal{L}(\boldsymbol{\theta}) = \mathbb{E}_{\mathbf{a} \sim \mathcal{D}} \left[ \sum_{i \in \mathcal{M}} - \log p (s_i \mid \mathbf{v}_{\backslash \mathcal{M}} ; \boldsymbol{\theta}) \right] $ Where:
      • $\mathcal{L}(\boldsymbol{\theta})$ is the loss function for model parameters $\boldsymbol{\theta}$.
      • $\mathbb{E}_{\mathbf{a} \sim \mathcal{D}}$ denotes the expectation over audio samples $\mathbf{a}$ drawn from the data distribution $\mathcal{D}$.
      • $\mathcal{M}$ represents the set of masked indices in the sequence.
      • $p(s_i \mid \mathbf{v}_{\backslash \mathcal{M}} ; \boldsymbol{\theta})$ is the probability of predicting the correct token $s_i$ at a masked position, given the unmasked embeddings $\mathbf{v}_{\backslash \mathcal{M}}$ and model parameters $\boldsymbol{\theta}$.
    • Iterative refinement of speech tokens: $ \mathbf{s}^{(n+1)} = d ( f_E ( \mathbf{a} ; \boldsymbol{\theta}_{f_E}^{(n)} ) ; \boldsymbol{\theta}_d^{(n)} ) $ Where $\mathbf{s}^{(n+1)}$ are the refined speech tokens at iteration $n+1$, and $\boldsymbol{\theta}_{f_E}^{(n)}$ and $\boldsymbol{\theta}_d^{(n)}$ are the encoder and discretizer parameters at iteration $n$.
  • Encodec (Acoustic Objective):

    • Encoder $f_E$ maps raw audio $\mathbf{a}$ to continuous embeddings $\mathbf{v}$: $\mathbf{v} = f_E(\mathbf{a}; \boldsymbol{\theta}_{f_E})$.
    • Embeddings are discretized using multi-stage RVQ: $ \mathbf{s} = d(\mathbf{v}; \boldsymbol{\theta}_d) = \big( d_1(\mathbf{v}; \boldsymbol{\theta}_{d_1}),\ d_2(\mathbf{v} - \hat{\mathbf{v}}_1; \boldsymbol{\theta}_{d_2}),\ \ldots,\ d_R(\mathbf{v} - \sum_{r=1}^{R-1} \hat{\mathbf{v}}_r; \boldsymbol{\theta}_{d_R}) \big) $ Where:
      • $\mathbf{s}$ are the discrete speech tokens.
      • $d_r(\cdot)$ is the quantizer for stage $r$.
      • $\boldsymbol{\theta}_{d_r}$ are the parameters for the quantizer at stage $r$.
      • $\hat{\mathbf{v}}_r$ denotes the quantized embedding at stage $r$.
      • $R$ is the total number of RVQ stages.
    • Decoder $f_D$ reconstructs the audio waveform $\hat{\mathbf{a}}$ from the quantized tokens: $\hat{\mathbf{a}} = f_D(\mathbf{s}; \theta_{f_D})$. A minimal code sketch of RVQ appears after this list.
  • SpeechTokenizer (Mixed Objective):

    • Encoder $f_E$ transforms the input audio $\mathbf{a}$ into continuous embeddings $\mathbf{v}$: $\mathbf{v} = f_E(\mathbf{a}; \theta_{f_E})$.
    • Discretization via multi-stage RVQ, similar to Encodec. The key difference is that the first RVQ stage distills tokens derived from HuBERT (semantic information), and subsequent stages quantize the residuals (acoustic information).
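
As referenced in the Encodec entry above, residual vector quantization can be sketched compactly: each stage quantizes the residual left over by the previous stages against its own codebook. The NumPy sketch below uses illustrative codebook sizes and random data.

```python
import numpy as np

def rvq_encode(v, codebooks):
    """Residual vector quantization. v: (T, D) embeddings; codebooks: list of (K, D) arrays.
    Returns per-stage code indices and the summed reconstruction."""
    residual = v.copy()
    codes, recon = [], np.zeros_like(v)
    for cb in codebooks:                                                 # stage r = 1..R
        dists = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(-1)   # (T, K) squared distances
        idx = dists.argmin(axis=1)                                       # nearest codeword per frame
        v_hat = cb[idx]                                                  # quantized embedding at this stage
        codes.append(idx)
        recon += v_hat
        residual -= v_hat                                                # pass the residual to the next stage
    return codes, recon

rng = np.random.default_rng(0)
v = rng.normal(size=(50, 16))                               # 50 frames of 16-dim embeddings (toy)
codebooks = [rng.normal(size=(256, 16)) for _ in range(4)]  # 4 RVQ stages, 256 codes each (toy)
codes, recon = rvq_encode(v, codebooks)
print(len(codes), np.abs(v - recon).mean())                 # residual error shrinks as stages are added
```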

4.2.1.2. Language Model

The language model component in SpeechLMs largely adapts architectures from TextLMs, primarily Transformers [38] or decoder-only architectures (e.g., OPT [3], LLaMA [39]). It generates speech in an autoregressive manner.

  • Text-based Decoder-Only Transformer LM:

    • Vocabulary size: $|V_t|$.
    • Hidden dimension: $h$.
    • Embedding matrix: $E_t \in \mathbb{R}^{|V_t| \times h}$.
    • Sequence of $L$ transformer decoder blocks: $\mathbf{De} = \{De_1, De_2, \ldots, De_L\}$.
    • Output embedding matrix: $E_t' \in \mathbb{R}^{h \times |V_t|}$.
    • Representation: $ \mathbf{t}^{\mathrm{out}} \sim \mathrm{LM}(\mathbf{t}^{\mathrm{in}}, (E_t, \mathbf{De}, E_t')) $ Where $\mathbf{t}^{\mathrm{in}}$ is the input text token sequence and $\mathbf{t}^{\mathrm{out}}$ is the generated text token sequence.
  • Adapting for Speech Generation:

    • The original text tokenizer is replaced by a speech tokenizer.
    • The text embedding matrix $E_t$ is replaced by a speech embedding matrix $E_s \in \mathbb{R}^{|V_s| \times h}$, where $|V_s|$ is the vocabulary size of the speech tokenizer.
    • The output embedding matrix $E_t'$ is replaced by $E_s' \in \mathbb{R}^{h \times |V_s|}$.
    • Representation for a speech-only LM: $ \mathbf{s}^{\mathrm{out}} \sim \mathrm{LM}(\mathbf{s}^{\mathrm{in}}, (E_s, \mathbf{De}, E_s')) $ Where $\mathbf{s}^{\mathrm{in}}$ is the input speech token sequence and $\mathbf{s}^{\mathrm{out}}$ is the generated speech token sequence.
  • Jointly Modeling Text and Speech (Multimodal LM):

    • A common approach is to expand the vocabulary of the original TextLM to include both text and speech tokens.
    • The speech embedding matrix is appended to the text embedding matrix, forming a larger embedding matrix $E_m \in \mathbb{R}^{(|V_t| + |V_s|) \times h}$. (Note: the paper writes $(|V_t| \cdot |V_s|)$, which appears to be a typo; concatenating the two vocabularies gives an additive size.) A code sketch of this vocabulary expansion appears after this list.
    • Let $\mathbf{m}$ be a token sequence containing both speech and text tokens.
    • Representation for the multimodal LM: $ \mathbf{m}^{\mathrm{out}} \sim \mathrm{LM}(\mathbf{m}^{\mathrm{in}}, (E_j, \mathbf{De}, E_j')) $ Where $E_j$ and $E_j'$ represent the combined embedding matrices, and $\mathbf{m}^{\mathrm{in}}$ and $\mathbf{m}^{\mathrm{out}}$ are the input and output multimodal token sequences.
    • When modeling with continuous tokens, the embeddings from the speech tokenizer are directly fed into the language model, and the LM architecture might not need significant changes.
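
As referenced above, expanding a TextLM's vocabulary to cover speech tokens amounts to appending rows to its input embedding matrix (and, analogously, columns to the output projection). A hedged PyTorch sketch with toy sizes and an illustrative initialization:

```python
import torch
import torch.nn as nn

def expand_embeddings(text_embedding: nn.Embedding, num_speech_tokens: int) -> nn.Embedding:
    """Append |V_s| speech rows to a |V_t| x h text embedding matrix."""
    vocab_t, hidden = text_embedding.weight.shape
    combined = nn.Embedding(vocab_t + num_speech_tokens, hidden)
    with torch.no_grad():
        combined.weight[:vocab_t] = text_embedding.weight  # keep the pretrained text rows
        combined.weight[vocab_t:].normal_(std=0.02)        # freshly initialized speech rows
    return combined

text_emb = nn.Embedding(1000, 64)                          # toy text vocabulary and hidden size
joint_emb = expand_embeddings(text_emb, num_speech_tokens=128)
print(joint_emb.weight.shape)                              # torch.Size([1128, 64])
```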

4.2.1.3. Token-to-Speech Synthesizer (Vocoder)

After the language model generates speech tokens autoregressively, the vocoder module converts these tokens back into audible speech waveforms. This is the reverse process of the speech tokenizer: $ \mathbf{a} = \mathrm{Vo}(\mathbf{s}; \theta_{Vo}) $ Where:

  • $\mathbf{a}$ is the synthesized speech audio waveform.

  • $\mathrm{Vo}$ is the vocoder model.

  • $\mathbf{s}$ is the input sequence of speech tokens.

  • $\theta_{Vo}$ are the parameters of the vocoder model.

    The paper outlines two main pipelines for SpeechLM vocoders:

  1. Direct Synthesis: The vocoder directly converts speech tokens into audio waveforms.

    • Example: Polyak et al. [48] adapted HiFi-GAN [47] to take speech tokens as input.
    • Suitable for tokens from acoustic generation tokenizers which contain sufficient acoustic information.
  2. Input-Enhanced Synthesis: An additional module transforms the speech tokens into a continuous latent representation (e.g., mel-spectrograms) before feeding them to the vocoder.

    • Reason: Vocoders often require intermediate audio representations.
    • Example: CosyVoice [88] uses a Conditional FlowMatching (CFM) model to convert speech tokens into mel-spectrograms, then uses HiFi-GAN for final synthesis.
    • Suitable for tokens from semantic understanding tokenizers which provide rich semantic information but lack fine acoustic details.

Vocoder Categories:

  • GAN-based Vocoder: Most commonly adopted due to fast and high-fidelity generation.

    • Architecture: Comprises a generator ($G$) and a discriminator ($D$). The generator produces audio, and the discriminator distinguishes real from synthetic audio.
    • Training Objectives (a code sketch combining the generator-side terms appears at the end of this subsection):
      • GAN Loss: The fundamental adversarial objective. For the generator ($G$) and discriminator ($D$), typically a least-squares loss:
        • Generator GAN Loss: $ \mathcal{L}_{\mathrm{GAN}}(G; D) = \mathbb{E}_{ms} \left[ \left( D(G(ms)) - 1 \right)^2 \right] $ Where:
          • $ms$ represents the mel-spectrogram input to the generator.
          • $G(ms)$ is the waveform generated by the generator.
          • $D(G(ms))$ is the discriminator's output for the generated waveform.
          • The generator aims to make $D(G(ms))$ close to 1 (fool the discriminator into thinking it is real).
        • Discriminator GAN Loss: $ \mathcal{L}_{\mathrm{GAN}}(D; G) = \mathbb{E}_{(x, ms)} \left[ \left( D(x) - 1 \right)^2 + \left( D(G(ms)) \right)^2 \right] $ Where:
          • $x$ is the ground-truth audio waveform.
          • $D(x)$ is the discriminator's output for the real waveform (aims for 1).
          • $D(G(ms))$ is the discriminator's output for the generated waveform (aims for 0).
      • Mel-spectrogram Loss: Improves fidelity by aligning mel-spectrograms of generated and ground-truth audio: $ \mathcal{L}_{\mathrm{Mel}}(G) = \mathbb{E}_{(x, ms)} \left[ \| \phi (x) - \phi (G (ms)) \|_1 \right] $ Where:
        • $\phi(\cdot)$ is the function that transforms a waveform into its corresponding mel-spectrogram.
        • $\| \cdot \|_1$ denotes the L1 distance.
      • Feature Matching Loss: Aligns discriminator-encoded features of real and generated samples: $ \mathcal{L}_{\mathrm{FM}}(G; D) = \mathbb{E}_{(x, ms)} \left[ \sum_{i=1}^{T} \frac{1}{N_i} \left\| D^i (x) - D^i (G (ms)) \right\|_1 \right] $ Where:
        • $D^i(\cdot)$ represents the features from the $i$-th layer of the discriminator.
        • $N_i$ is the number of features in the $i$-th layer.
        • $T$ is the total number of layers considered.
    • Architectural Choices:
      • MelGAN [123]: Uses residual blocks with dilations and a multi-scale discriminator.
      • HiFi-GAN [47]: Proposes a multi-period discriminator for diverse periodic patterns.
      • Fre-GAN [124]: Employs Discrete Wavelet Transform (DWT) for high-frequency content.
      • BigVGAN [80]: Introduces a periodic activation function (snake function) and anti-aliased representation.
  • GAN-based Neural Audio Codec: When neural audio codecs use GANs, their decoders can directly serve as vocoders (e.g., SoundStream decoder in Encodec [49]). Polyak et al. [48] used HiFi-GAN as a backbone, disentangling input features into semantic tokens, pitch tokens, and speaker embeddings.

  • Other Types of Vocoder:

    • Pure Signal Processing Vocoder: Traditional methods (e.g., WORLD [126]) relying on deterministic algorithms, but often introduce artifacts.

    • Autoregressive Vocoder: Generates audio samples one at a time, conditioned on previous samples (e.g., WaveNet [44]). High quality but computationally expensive and slow.

    • Flow-based Vocoder: Uses invertible transformations to map a simple distribution to complex audio samples (e.g., WaveGlow [46]). Efficient sampling and density evaluation, parallel generation, but high parameter/memory cost.

    • VAE-based Vocoders: Use VAEs to learn latent representations for reconstruction [77, 127], less explored for vocoders.

    • Diffusion-based Vocoder: Gradually adds noise to data and learns to reverse the process (e.g., DiffWave [128]). Produces high-fidelity speech.

      HiFi-GAN Notation (Most Used Vocoder): HiFi-GAN synthesizes audio waveforms from mel-spectrograms or speech tokens. The generator $G(\mathbf{s}; \theta_G)$ maps a sequence of speech tokens $\mathbf{s}$ to an output audio waveform $\mathbf{a}$: $ \mathbf{a} = \mathrm{Vo}(\mathbf{s}; \theta_{Vo}) = G(\mathbf{s}; \theta_G) $ Where $\theta_{Vo} = \theta_G$ are the vocoder parameters. It uses multi-period discriminators $D_{MPD}(\mathbf{a}; \theta_{MPD})$ and multi-scale discriminators $D_{MSD}(\mathbf{a}; \theta_{MSD})$ for adversarial training. At inference, only $G$ is used.
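
As referenced under Training Objectives, the generator-side terms are typically combined into one weighted loss. Below is a hedged PyTorch sketch of that combination; the discriminator interface and the weighting coefficients are illustrative assumptions rather than the exact HiFi-GAN recipe.

```python
import torch
import torch.nn.functional as F

def generator_loss(disc_fake_logits, feats_real, feats_fake, mel_real, mel_fake,
                   lambda_mel=45.0, lambda_fm=2.0):
    """Combine the adversarial, mel-spectrogram, and feature-matching objectives."""
    # Least-squares adversarial term: push D(G(ms)) toward 1.
    adv = torch.mean((disc_fake_logits - 1.0) ** 2)
    # Mel-spectrogram term: L1 distance between mel(x) and mel(G(ms)).
    mel = F.l1_loss(mel_fake, mel_real)
    # Feature-matching term: mean L1 distance between discriminator features, layer by layer.
    fm = sum(F.l1_loss(f_fake, f_real) for f_real, f_fake in zip(feats_real, feats_fake))
    return adv + lambda_mel * mel + lambda_fm * fm

def discriminator_loss(disc_real_logits, disc_fake_logits):
    """Least-squares discriminator objective: real toward 1, fake toward 0."""
    return torch.mean((disc_real_logits - 1.0) ** 2) + torch.mean(disc_fake_logits ** 2)
```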

4.2.2. Training Recipes

The training of SpeechLMs involves three main aspects: features modeled, training stages, and speech interaction paradigms.

4.2.2.1. Features Modeled

These refer to the types of features or tokens output by the speech tokenizer and processed by the language model.

  1. Discrete Features: Quantized representations of speech signals as distinct, countable units. These are the most common in SpeechLMs as they align with TextLM token processing.

    • Semantic Tokens: Derived from semantic understanding tokenizers (e.g., HuBERT, wav2vec 2.0). Excel in content generation but may lack acoustic details.
      • GSLM [50] compared CPC, wav2vec 2.0, and HuBERT, finding HuBERT best for speech resynthesis and generation.
      • AudioPaLM [52] found USM-v2 (USMv1 variant) best for ASR and Speech Translation (ST).
    • Paralinguistic Tokens: Integrated to capture expressive information (prosody, pitch, timbre).
      • pGSLM [56] used fundamental frequency (F0) and unit duration alongside HuBERT tokens.
      • SPIRIT-LM [5] added pitch and style tokens [148] to HuBERT tokens.
    • Acoustic Tokens: Aim to capture features for high-fidelity speech reconstruction, typically from neural audio codec models.
      • Codec Language Models (CodecLMs): Directly model codec tokens (e.g., Viola [57], NTPP [69] using VQ-VAE tokens).
    • Discussion: Semantic tokens lead to semantically coherent speech but lack acoustic details (requiring post-processing like diffusion). Acoustic tokens yield high-fidelity audio but may struggle with content accuracy. Solutions include hierarchical modeling (e.g., AudioLM [115] modeling semantic then acoustic tokens) or mixed tokens (e.g., Moshi [9], SpeechGPT-Gen [60] jointly modeling semantic and acoustic).
  2. Continuous Features: Unquantized, real-valued representations (e.g., mel-spectrograms, latent embeddings).

    • Spectron [61] predicts spectrograms frame-by-frame.
    • Mini-Omni [10] and SLAM-Omni [54] extract intermediate representations from a frozen Whisper encoder.
    • LauraGPT [63] uses an audio encoder trained with the language model to derive latent representations.
    • Challenges: Require modifying TextLM training pipelines and demand more storage.
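
The practical difference between the two feature types shows up in how they enter the language model: discrete tokens pass through an embedding lookup, while continuous features are linearly projected into the model's hidden size, leaving the Transformer body unchanged. A hedged PyTorch sketch with illustrative dimensions:

```python
import torch
import torch.nn as nn

hidden = 1024
speech_vocab = 1024   # codebook size of a discrete tokenizer (illustrative)
feature_dim = 80      # e.g. mel bins or encoder feature size (illustrative)

embed = nn.Embedding(speech_vocab, hidden)   # path for discrete speech tokens
project = nn.Linear(feature_dim, hidden)     # path for continuous speech features

discrete_tokens = torch.randint(0, speech_vocab, (1, 50))   # (batch, frames)
continuous_feats = torch.randn(1, 50, feature_dim)          # (batch, frames, feature_dim)

lm_input_discrete = embed(discrete_tokens)        # (1, 50, hidden) -- same shape either way,
lm_input_continuous = project(continuous_feats)   # so the Transformer body is unchanged
print(lm_input_discrete.shape, lm_input_continuous.shape)
```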

4.2.2.2. Training Stages

SpeechLMs follow a multi-stage training process, similar to TextLMs.

  1. Language Model Pre-Training: Learning statistical patterns and dependencies from large corpora.

    • Training Data: Large-scale open-source speech data, including ASR (e.g., LibriSpeech, LibriLight, Gigaspeech), TTS (LibriTTS), Speech Translation (ST) (CoVoST2, CVSS), podcasts (Spotify Podcasts), and dialogues (Fisher). Some datasets include text transcripts for multi-modal learning.
    • Cold Initialization: Training Transformer-based language models from scratch.
      • GSLM [50] pioneered this, showing HuBERT tokens superior.
      • SUTLM [64] compared four methods for modeling speech and text tokens:
        • Speech-only: [SPEECH] S12 S34 S33 ... S11 S59 (Only the speech sequence is provided.)
        • Text-only: [TEXT] A quick brown fox jumps over a lazy dog. (Only the text sequence is provided.)
        • Concatenated speech-text: [SPEECH] S12 S34 S33 ... S11 S59 [TEXT] A quick brown fox jumps over a lazy dog. (The speech sequence and text sequence are concatenated together.)
        • Alternating speech-text: [SPEECH] S12 S34 S33 [TEXT] brown fox jumps over a lazy [SPEECH] S11 S59 (The sequence is interleaved with speech and text tokens.) SUTLM found alternating speech-text best for cross-modal evaluations; a sequence-construction sketch of these four formats appears after this list.
      • Other architectures: pGSLM [56] (multi-stream transformer), dGSLM [4] (dialogue transformer), LSLM [65] (streaming SSL encoder + TTS model).
    • Continued Pre-Training: Initializing with pre-trained TextLM weights to leverage linguistic knowledge.
      • Hassid et al. [51] found OPT [3] and LLaMA [39] checkpoints beneficial, outperforming cold initialization.
      • AudioPaLM [52] showed benefits from larger PaLM and PaLM-2 checkpoints and larger datasets.
      • Text-Speech Representation Alignment:
        • Single sequence: SPIRIT-LM [5] showed interleaving text and speech tokens during continued pre-training significantly boosts performance. Spectron [61] and SpeechGPT [8] use joint supervision where input speech is transcribed to text, LLM predicts text response, and text is synthesized to speech.
        • Multi-sequence: Llama-Omni [11] uses LLM hidden states to decode text and generate discrete speech tokens simultaneously. Mini-Omni [10] generates one text sequence and seven acoustic token sequences in parallel. Moshi [9] generates text, semantic, and acoustic token sequences in parallel, aligned at the word level.
    • Discussion: Aligning text and speech representations is challenging because text encodes knowledge far more compactly than speech. Trade-offs exist: text primarily conveys semantics (it improves semantic capabilities but may compromise the capture of paralinguistic information), and inference can be text-present (higher latency, better reasoning, less hallucination) or text-independent (more efficient but less stable).
  2. Language Model Instruction-Tuning: Fine-tuning SpeechLMs to follow specific instructions for diverse tasks.

    • Focus: Creating effective instruction-following datasets.
    • Approaches:
      • SpeechGPT [8] and SpeechGPT-Gen [60]: Two-stage tuning (cross-modal and chain-of-modality). First, ASR datasets are augmented with instructions for speech-to-text; TTS data for text-to-speech. Second, text-based instruction datasets are converted to speech-in-speech-out using TTS.
      • Llama-Omni [11]: Synthesizes text-based datasets into speech-like prompts, uses an LLM to generate speech-patterned responses, and then synthesizes prompt/response pairs.
      • COSMIC [68]: Constructed speech QA data using GPT-3.5 to generate Q&A pairs from TED talk transcripts.
  3. Language Model Post-Alignment: Refining SpeechLM behavior to align with human preferences, ensuring safety and reliability.

    • Techniques: Reinforcement Learning from Human Feedback (RLHF), including Proximal Policy Optimization (PPO) [151] and Direct Preference Optimization (DPO) [152].
    • Challenges: Unique to speech interaction.
      • Align-SLM [153]: Addresses inconsistent semantic content by using a TextLM to select preferred responses (after ASR transcription) and aligning preferences using DPO.
      • SpeechAlign [154]: Focuses on acoustic quality, using optimization to align the language model's output with "golden" token distributions to mitigate subpar speech generation from generated tokens.
    • Importance: Critical for mitigating safety risks (e.g., toxicity, privacy) in generative models, which is currently under-explored for SpeechLMs.
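
As referenced in the SUTLM item above, the four token-mixing formats are simple sequence-construction rules. The sketch below builds each format from a toy pair of speech-token and text-token spans; the special markers and the alignment boundaries are illustrative.

```python
def speech_only(speech):
    return ["[SPEECH]"] + speech

def text_only(text):
    return ["[TEXT]"] + text

def concatenated(speech, text):
    return ["[SPEECH]"] + speech + ["[TEXT]"] + text

def alternating(aligned_spans):
    """aligned_spans: list of ("speech" | "text", tokens) pairs in utterance order."""
    seq = []
    for modality, tokens in aligned_spans:
        seq += ["[SPEECH]" if modality == "speech" else "[TEXT]"] + tokens
    return seq

speech = ["S12", "S34", "S33", "S11", "S59"]
text = "a quick brown fox jumps over a lazy dog".split()
print(concatenated(speech, text))
print(alternating([("speech", speech[:3]), ("text", text[2:7]), ("speech", speech[3:])]))
```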

4.2.2.3. Speech Interaction Paradigm

This refers to how SpeechLMs handle dynamic conversational flows beyond simple input-response.

  1. Real-time Interaction: Advanced handling of conversation data, supporting natural interruptions and simultaneous communication.

    • Streaming Tokenizers and Vocoders: Eliminate the need to wait for complete speech encoding, reducing latency.
    • Full-duplex Modeling: Enables simultaneous bidirectional communication, allowing user interruption and simultaneous responses.
      • User interruption: Model can be interrupted and respond.
      • Simultaneous response: Model processes input and generates output concurrently.
    • Approaches:
      • dGSLM [4]: Employs separate transformers for each speaker with cross-attention layers.
      • NTPP [69]: Uses a next-token-pair prediction approach with a decoder-only Transformer for dual channels.
      • Moshi [9]: Concatenates user input and model response channels, processed by an RQ-Transformer.
      • LSLM [65]: Models one speaker's speech with a decoder-only Transformer, integrating a streaming SSL encoder for listening and speaking channel embeddings.
  2. Interactive Period Recognition (IPR): The ability to discern when users are interacting with the model and when to remain silent.

    • Importance: Creates natural conversational flow, avoids unnecessary interruptions, and allows the model to disregard irrelevant speech.
    • Methods:
      • Voice Activity Detection (VAD) module: MiniCPM-o 2.6 [72] integrates a VAD to respond only when audio surpasses a threshold, ignoring noise.

      • Trained distinction: VITA [70] trains the SpeechLM to distinguish query speech from non-query audio, outputting an end-of-sequence token for non-query audio.
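
A minimal energy-threshold VAD of the kind described above can be sketched in a few lines. Production systems such as the VAD in MiniCPM-o 2.6 use learned models, so the frame length and threshold below are placeholder assumptions.

```python
import numpy as np

def energy_vad(waveform, frame_len=400, threshold=1e-3):
    """Return a boolean per frame: True when the frame's mean energy exceeds the threshold."""
    n_frames = len(waveform) // frame_len
    frames = waveform[: n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames ** 2).mean(axis=1)          # mean squared amplitude per frame
    return energy > threshold                    # only frames above threshold trigger a response

rng = np.random.default_rng(0)
silence = rng.normal(scale=0.005, size=8000)     # quiet background noise
speech = rng.normal(scale=0.2, size=8000)        # louder, speech-like segment
print(energy_vad(np.concatenate([silence, speech])).astype(int))
```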

        The following table summarizes the popular choices of the three components in various SpeechLM papers.

The following are the results from Table II of the original paper:

Approach Speech Tokenizer Language Model Vocoder
Kimi-Audio [78] Whisper Encoder [12] + Linear Projector Qwen2.5 [79] BigVGAN [80]
Qwen2.5-Omni [81] Whisper Qwen2.5 Talker + Codec Decoder [81]
MinMo [82] SenseVoice [83] Qwen2.5 CosyVoice 2 [84]
Lyra [85] Whisper [12] Qwen2-VL [86] HuBERT + HiFi-GAN
Flow-Omni [87] Whisper Encoder + Linear Projector Qwen2 [41] Flow Matching (Transformer + MLP) + HiFi-GAN
SLAM-Omni [54] Whisper Encoder + Linear Projector Qwen2 [41]
OmniFlatten [53] CosyVoice Encoder [88] Qwen2 CosyVoice Decoder [88]
SyncLLM [89] HuBERT [33] LLaMA-3 [2] HiFi-GAN [47], [48]
EMOVA [90] S SIL [91] LLaMA-3 VITS [92]
Freeze-Omni [67] IntrinsicVoice [94] Transformer [38] Qwen2 TiCodec [93]
Mini-Omni2 [66] HuBERT Qwen2 HiFi-GAN
SALMONN-omni [71] Whisper Qwen2 Mini-Omni [10]
Zeng et al. [97] Mamba Streaming Encoder [95] Whisper + VQ GLM [42] VoiceCraft [96] + Codec Decoder
NTPP [69] VQ-VAE LLaMA-3, Mistral, Gemma 2 CosyVoice HiFi-GAN
GPST [98] EnCodec [49] Transformer Codec Decoder
GLM-4-Voice [55] Whisper + VQ [9] GLM-4-9B-Base [42] CosyVoice
Moshi [9] Mimi [9] Transformer* Mimi
VITA [70] CNN + Transformer + MLP [70] Mixtral [43] Text-to-Speech Toolkit [70]
LSLM [65] vq-wav2vec [31] Decoder-Only Transformer
SPIRIT-LM [5] HuBERT, VQ-VAE [77], speechprop LLaMA-2 [40] UniVATS [99] HiFi-GAN
TWIST [51] HuBERT OPT [3], LLaMA [39]
PSLM [100] HuBERT HiFi-GAN
VOXTLM [102] HuBERT NekoMata [101] HiFi-GAN
Voicebox [103] EnCodec OPT [3] Transformer* [38] HiFi-GAN
Park et al. [104] AV-HuBERT [105] OPT HiFi-GAN
USDM [106] XLS-R [107] Mistral Voicebox [108]
VioLA [57] EnCodec Transformer* Codec Decoder [49]
FunAudioLLM [83] SAN-M [109] Transformer* HiFTNet [110]
SpeechGPT-Gen [60] SpeechTokenizer [34] LLaMA-2 SpeechTokenizer decoder [34]
COT [?] SpeechTokenizer LLaMA-2 SoundStorm
AnyGPT [111] SpeechTokenizer LLaMA-2 SoundStorm
LauraGPT [63] Conformer* Qwen [112] Transformer + Codec Decoder
Spectron [61] Conformer* PaLM 2* [113] WaveFit [114]
AudioLM [115] w2v-BERT [32] Decoder-Only Transformer* SoundStream* [35]
UniAudio [116] EnCodec, HiFi-Codec [117], RVQGAN [118] Transformer* Codec Decoder
Llama-Omni [11] Whisper LLaMA-3.1 HiFi-GAN
Mini-Omni [10] Whisper + ASR Adapter [10] Qwen2 TTS Adapter [10]
tGSLM [62] Segmentation + SSE [119] + Lexical embedder Transformer* Tacotron-2 + Waveglow [45], [46]
SpeechGPT [8] HuBERT LLaMA HiFi-GAN
dGSLM [4] HuBERT Dialogue Transformer [4] HiFi-GAN
SUTLM [64] HuBERT Transformer*
pGSLM [56] HuBERT MS-TLM [56] HiFi-GAN
GSLM [50] HuBERT, CPC [29], Wav2vec 2.0 [30] Transformer* Tacotron-2 + Waveglow

5. Experimental Setup

This section outlines the common practices for evaluating Speech Language Models (SpeechLMs), drawing from the methodologies described in the surveyed papers. Since this is a survey, it does not present a single experimental setup for a new model, but rather a meta-analysis of the experimental setups prevalent in the field.

5.1. Datasets

SpeechLMs require vast amounts of data for both pre-training and instruction-tuning. The datasets used can be speech-only, text-only, or paired speech-text.

  • Pre-Training Datasets: Primarily large-scale open-source speech data.

    • Automatic Speech Recognition (ASR) Datasets:
      • LibriSpeech [131]: 1k hours of English speech derived from audiobooks, paired with text.
      • Multilingual LibriSpeech [132]: 50.5k hours of multilingual speech.
      • LibriLight [133]: 60k hours, used for ASR with limited or no supervision.
      • People dataset [134]: 30k hours of diverse English speech.
      • VoxPopuli [135]: 1.6k hours, multilingual for representation learning, semi-supervised learning, and interpretation.
      • Gigaspeech [136]: 40k hours, multi-domain ASR corpus.
      • Common Voice [137]: 2.5k hours (as of 2019), massively multilingual speech corpus.
      • VCTK [138]: 0.3k hours, for TTS and multi-speaker speech.
      • WenetSpeech [139]: 22k hours of multi-domain Mandarin speech.
    • Text-to-Speech (TTS) Datasets:
      • LibriTTS [140]: 0.6k hours, derived from LibriSpeech for TTS.
    • Speech Translation (ST) Datasets:
      • CoVoST2 [141]: 2.8k hours, for massively multilingual speech-to-text translation.
      • CVSS [142]: 1.9k hours, for massively multilingual speech-to-speech translation.
    • Other Audio Datasets:
      • VoxCeleb [143] and VoxCeleb2 [144]: For speaker identification (0.4k and 2.4k hours, respectively).
      • Spotify Podcasts [145]: 47k hours of podcasts.
      • Fisher [146]: 2k hours of telephone conversations.
  • Instruction-Tuning Datasets: These are often constructed by augmenting existing datasets or synthesizing new ones.

    • SpeechInstruct* [8]: Synthesized from text-based instruction data.
    • InstructS2S-200K* [11]: Synthesized instruction-following data.
    • VoiceAssistant-400K* [10]: Synthesized instruction-following data. (Note: * indicates speech versions of text datasets synthesized using TTS.)

These datasets are chosen to provide a broad range of speech contexts, languages, and tasks, enabling the SpeechLMs to learn robust representations and generalize across diverse applications. The inclusion of text transcripts in some datasets helps in learning the relationship between spoken and written forms.

5.2. Evaluation Metrics

Evaluating SpeechLMs involves both automatic (objective) and human (subjective) assessments to cover their diverse capabilities.

5.2.1. Automatic (Objective) Evaluation

  1. Representation Evaluation: Assesses how well speech features are encoded into meaningful vectors.
    • ABX Score [156, 158]: Measures embedding similarity and quantifies the separation of phonetic categories. It compares three sound samples: two from the same category (A) and one from a different category (B). The score reflects how often the system correctly identifies that two sounds from A are more similar to each other than one sound from A is to a sound from B. No single formula is universally provided for the ABX score, as it is a test paradigm typically based on classification accuracy or distance metrics in embedding space. Conceptually, it measures $P(\mathrm{dist}(A_1, A_2) < \mathrm{dist}(A_1, B))$, where $A_1, A_2$ are from category A and $B$ is from category B.
    • Speech Resynthesis (WER/CER): Measures information loss due to discretization. An input speech is encoded into tokens, then synthesized back. Word Error Rate (WER) or Character Error Rate (CER) is computed between the original speech's ASR transcription and the resynthesized speech's ASR transcription.
      • Word Error Rate (WER): $ \mathrm{WER} = \frac{S + D + I}{N} $ Where:
        • $S$ is the number of substitutions (words replaced).
        • $D$ is the number of deletions (words omitted).
        • $I$ is the number of insertions (words added).
        • $N$ is the total number of words in the reference (ground truth) transcription.
      • Character Error Rate (CER): Calculated identically to WER, but at the character level. A word-level edit-distance sketch for WER appears after this list.
  2. Linguistic Evaluation: Assesses the model's ability to generate and understand lexical, syntactic, and semantic rules.
    • sWUGGY [158]: Evaluates at the lexical level if the model can distinguish a real word from a non-real word in a pair.
    • sBLIMP [158]: Evaluates at the syntactic level if the model can identify the grammatically correct sentence from a pair.
    • Spoken StoryCloze [51]: Evaluates semantic comprehension by assessing the model's ability to choose the genuine ending of a story from two options.
    • All these are typically measured by comparing the model's negative log-likelihood (NLL) for the correct choice versus the incorrect one.
  3. Paralinguistic Evaluation: Focuses on non-verbal aspects like prosody, emotion, and timbre.
    • For Prosodic Tokens (e.g., pGSLM [56]):
      • Correctness: Calculates minimal Mean Absolute Error (min-MAE) of prosodic tokens between generated and reference samples. $ \mathrm{min-MAE} = \min_{j} \frac{1}{L} \sum_{i=1}^{L} |P_{gen}^{(i)} - P_{ref,j}^{(i)}| $ Where:
        • $P_{gen}^{(i)}$ is the $i$-th prosodic token from a generated sample.
        • $P_{ref,j}^{(i)}$ is the $i$-th prosodic token from the $j$-th reference sample.
        • $L$ is the sequence length.
        • The minimum is taken over multiple generated samples.
      • Consistency: Pearson correlation between mean values of prompt prosodic tokens and generated continuation prosodic tokens. $ \rho_{X,Y} = \frac{\mathrm{cov}(X,Y)}{\sigma_X \sigma_Y} $ Where:
        • $X$ and $Y$ are the mean prosodic values of the prompt and continuation, respectively.
        • $\mathrm{cov}(X,Y)$ is their covariance.
        • $\sigma_X$ and $\sigma_Y$ are their standard deviations.
      • Expressiveness: Measured by the standard deviation of generated prosody token values. $ \sigma = \sqrt{\frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})^2} $ Where $x_i$ are the generated prosody token values and $\bar{x}$ is their mean.
    • Speech-Text Sentiment Preservation (STSP) [5]: Uses a sentiment classifier to assess if the sentiment of the generated speech or text matches the prompt's sentiment.
  4. Generation Quality and Diversity:
    • Area Under the Curve (AUC) of the perplexity and VERT values obtained by sweeping the sampling temperature, which jointly captures the quality–diversity trade-off.
      • Perplexity (PPL): Measures how well a probability model predicts a sample; lower PPL indicates better generation quality (see the sketch after this list). $ \mathrm{PPL}(W) = P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}} = \sqrt[N]{\frac{1}{P(w_1 w_2 \ldots w_N)}} $ For a sequence $W = (w_1, w_2, \ldots, w_N)$, the model estimates the probability of each token given the preceding ones: $ P(w_1 w_2 \ldots w_N) = \prod_{i=1}^{N} P(w_i | w_1 \ldots w_{i-1}) $
      • VERT: The geometric mean of the ratio of k-grams in the generated speech that repeat at least once, used as a measure of diversity. No standard formula is provided in the paper; it is a metric introduced in GSLM [50].
    • ChatGPT Score: ASR transcription of generated speech is sent to ChatGPT for quality assessment.
  5. Real-Time Interaction Evaluation: For streaming or full-duplex models.
    • Naturalness of Dialogues: Examines turn-taking statistics such as inter-pausal units (IPUs), pauses, gaps, and overlapping speech. Naturalness is considered high when these statistics resemble those of human dialogues [4].
    • Usefulness (e.g., NTPP [69]):
      • Reflective pauses: SpeechLM's ability to remain silent while the user speaks.
      • Interruptions: SpeechLM's ability to cease speaking when interrupted.
    • Benchmarks: Full Duplex Bench [168] evaluates turn-taking; Talking Turns [169] uses a neural network to predict turn-taking events.
  6. Downstream Evaluation: Evaluates SpeechLM's performance on specific tasks (e.g., ASR, TTS, Speaker Identification). Can be done via few-shot examples or instruction-tuning.
    • Benchmarks:
      • SUPERB [155]: Variety of speech understanding tasks.
      • SD-Eval [162]: Paralinguistic understanding (emotion, age, environment).
      • SALMON [165]: Speech generation with consistent paralinguistic and environmental characteristics.
      • VoiceBench [166]: General SpeechLM capabilities.
      • Dynamic-SUPERB [163], MMAU [159], AirBench [161], AudioBench [160]: Extend to sound and music-related tasks.
      • VoxEval [167]: Benchmarks knowledge understanding with speech I/O, various audio conditions, and spoken math reasoning.
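
To make the ABX paradigm referenced above concrete, here is a minimal Python sketch. It assumes each sound has already been pooled into a single embedding vector and uses Euclidean distance; real implementations typically operate on frame sequences with time alignment, so the function name `abx_score` and the setup are illustrative only, not the evaluation code of any surveyed paper.

```python
import numpy as np

def abx_score(triplets):
    """Fraction of (A1, A2, B) triplets with dist(A1, A2) < dist(A1, B).

    A1 and A2 are embeddings of sounds from the same phonetic category,
    B is an embedding from a different category. 1.0 means perfect
    separation, 0.5 is chance level.
    """
    correct = 0
    for a1, a2, b in triplets:
        d_within = np.linalg.norm(a1 - a2)  # distance inside category A
        d_across = np.linalg.norm(a1 - b)   # distance from A to category B
        correct += int(d_within < d_across)
    return correct / len(triplets)

# Toy usage: random vectors stand in for tokenizer embeddings, so the score
# should hover around chance (~0.5).
rng = np.random.default_rng(0)
triplets = [(rng.normal(size=16), rng.normal(size=16), rng.normal(size=16))
            for _ in range(1000)]
print(abx_score(triplets))
```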
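The WER formula above reduces to a Levenshtein (edit) distance between the reference and hypothesis word sequences. Below is a minimal, illustrative sketch; the function name and whitespace tokenization are assumptions, not the implementation used by any surveyed benchmark.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (S + D + I) / N, computed via edit distance over word lists."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimal edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                      # deletions only
    for j in range(len(hyp) + 1):
        dp[0][j] = j                      # insertions only
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])  # match or substitution
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# CER: apply the same routine to character lists instead of word lists.
print(word_error_rate("the cat sat on the mat", "the cat sat mat"))  # 2/6 ≈ 0.33
```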
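For the linguistic metrics (sWUGGY, sBLIMP, Spoken StoryCloze), the scoring rule mentioned above amounts to comparing sequence-level negative log-likelihoods. A minimal sketch follows, assuming a hypothetical `nll(tokens)` callable that returns the model's total NLL for a token sequence.

```python
def pair_accuracy(pairs, nll):
    """Fraction of pairs for which the model prefers the correct member.

    pairs: list of (correct_tokens, incorrect_tokens) tuples, e.g., a real
           word vs. a matched non-word (sWUGGY) or a grammatical vs.
           ungrammatical sentence (sBLIMP).
    nll:   callable returning the model's total negative log-likelihood
           for a token sequence (hypothetical; supplied by the SpeechLM).
    """
    hits = sum(nll(correct) < nll(incorrect) for correct, incorrect in pairs)
    return hits / len(pairs)

# Toy usage with a dummy scorer that favors shorter sequences.
dummy_nll = lambda tokens: float(len(tokens))
pairs = [([1, 2], [1, 2, 3]), ([4], [4, 5, 6])]
print(pair_accuracy(pairs, dummy_nll))  # 1.0
```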
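The three prosodic measures above (min-MAE, consistency, expressiveness) can be expressed in a few lines of NumPy. This is a simplified sketch under assumed array shapes, not the pGSLM evaluation code.

```python
import numpy as np

def min_mae(generated, reference):
    """min-MAE: best mean absolute error among several generated samples.

    generated: array of shape (num_samples, L), prosodic tokens of each
               continuation sampled from the same prompt.
    reference: array of shape (L,), prosodic tokens of the ground truth.
    """
    maes = np.mean(np.abs(generated - reference[None, :]), axis=1)
    return maes.min()

def consistency(prompt_means, continuation_means):
    """Pearson correlation between per-utterance mean prosody of prompts
    and of the generated continuations (computed across utterances)."""
    return np.corrcoef(prompt_means, continuation_means)[0, 1]

def expressiveness(generated_tokens):
    """Standard deviation of generated prosody token values (ddof=1 matches
    the N-1 denominator in the formula above)."""
    return np.std(generated_tokens, ddof=1)

# Toy usage with random values standing in for quantized prosodic tokens.
rng = np.random.default_rng(0)
gen = rng.normal(size=(20, 50))   # 20 sampled continuations of length 50
ref = rng.normal(size=50)
print(min_mae(gen, ref),
      consistency(rng.normal(size=30), rng.normal(size=30)),
      expressiveness(gen))
```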
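Given the autoregressive factorization above, perplexity is simply the exponentiated average negative log-likelihood per token. A minimal sketch is shown below; the helper name and toy log-probabilities are illustrative, and for SpeechLMs the tokens would be discrete speech units rather than words.

```python
import math

def perplexity(token_logprobs):
    """PPL = exp(-(1/N) * sum_i log P(w_i | w_<i)).

    token_logprobs: natural-log probabilities the model assigns to each
    token given its preceding context.
    """
    n = len(token_logprobs)
    return math.exp(-sum(token_logprobs) / n)

# Toy usage: a model that assigns probability 0.25 to each of 4 tokens has
# perplexity 4, i.e., it is as uncertain as a uniform 4-way choice.
print(perplexity([math.log(0.25)] * 4))  # 4.0
```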

5.2.2. Human (Subjective) Evaluation

  1. Mean Opinion Score (MOS): Quantifies perceived speech quality by human listeners.
    • Conceptual Definition: A group of evaluators rates audio samples on a predefined scale (e.g., 1-5, from poor to excellent). The MOS is the average of these scores.

    • Variants: MMOS (overall quality), PMOS (prosody), and SMOS (timbre similarity).

    • Alternatives: Neural models trained for naturalness prediction [170] or for speaker identification can serve as machine-based substitutes for human raters when assessing naturalness and timbre similarity.

      The following are the results from Table VI of the original paper:

      | Name | Eval Type | # Tasks | Audio Type | I/O |
      | --- | --- | --- | --- | --- |
      | ABX [156][158] | Representation | 1 | Speech | A → − |
      | sWUGGY [158] | Linguistic | 1 | Speech | A → − |
      | sBLIMP [158] | Linguistic | 1 | Speech | A → − |
      | sStoryCloze [51] | Linguistic | 1 | Speech | A/T → − |
      | STSP [5] | Paralinguistic | 1 | Speech | A/T → A/T |
      | MMAU [159] | Downstream | 27 | Speech, Sound, Music | A → T |
      | Audiobench [160] | Downstream | 8 | Speech, Sound | A → T |
      | AIR-Bench [161] | Downstream | 20 | Speech, Sound, Music | A → T |
      | SD-Eval [162] | Downstream | 4 | Speech | A → T |
      | SUPERB [163] | Downstream | 10 | Speech | A → T |
      | VoxDialogue [164] | Downstream | 12 | Speech, Sound, Music | A → T |
      | Dynamic-SUPERB [163] | Downstream | 180 | Speech, Sound, Music | A → T |
      | SALMON [165] | Downstream | 8 | Speech | A → − |
      | VoiceBench [166] | Downstream | 8 | Speech | A → T |
      | VoxEval [167] | Downstream | 56 | Speech | A → A |

I/O, A, and T represent input/output modality, audio, and text, respectively. A dash ("-") indicates that the modality is not explicitly specified as input or output, or the task might be representation learning without a direct output modality.

5.3. Baselines

As this is a survey paper, it does not propose a single new method or conduct direct comparative experiments against specific baselines. Instead, it reviews numerous SpeechLM approaches and implicitly compares them by discussing their design choices, reported performance in their respective original papers, and the challenges they aim to address. The "baseline" in the context of this survey is the traditional ASR + LLM + TTS pipeline, whose limitations serve as the primary motivation for the development of SpeechLMs. Individual SpeechLM papers would, of course, benchmark against other SpeechLMs or state-of-the-art ASR/TTS systems depending on their specific tasks.

6. Results & Analysis

Since this paper is a survey, it does not present novel experimental results from the authors' own model. Instead, it synthesizes the findings and trends observed across a multitude of published SpeechLM papers. This section, therefore, analyzes the general outcomes, commonalities, and implications derived from the surveyed literature regarding SpeechLM components, training, and capabilities.

6.1. Core Results Analysis

The survey highlights several key findings regarding the effectiveness and characteristics of SpeechLMs compared to the traditional ASR + LLM + TTS pipeline:

  • Superiority of End-to-End SpeechLMs: SpeechLMs demonstrate inherent advantages over the pipeline approach by mitigating information loss (especially paralinguistic cues), significantly reducing latency, and preventing cumulative errors. This validates the core hypothesis that direct speech processing is a more effective paradigm for natural human-computer interaction.

  • Impact of Speech Tokenizer Choice:

    • Semantic Tokenizers (e.g., HuBERT, wav2vec 2.0): Are highly effective for SpeechLMs focused on semantic understanding and coherent content generation. HuBERT particularly stands out as a strong performer across various tasks. However, speech generated solely from semantic tokens often lacks expressive acoustic details.
    • Acoustic Tokenizers (e.g., Encodec, SoundStream): Excel at producing high-fidelity audio but can struggle with semantic accuracy in content generation.
    • Mixed Objective Tokenizers (e.g., SpeechTokenizer, Mimi): Show promise in balancing semantic and acoustic needs by distilling semantic information into acoustic representations, suggesting a path towards models that are both semantically accurate and acoustically rich.
  • Language Model Training Strategies:

    • Continued Pre-Training from TextLMs: This strategy, where SpeechLMs initialize with weights from powerful TextLMs (like OPT, LLaMA, PaLM), consistently yields better performance and faster convergence than cold initialization. This indicates that the linguistic knowledge embedded in TextLMs is highly transferable and beneficial for speech-based tasks, even if speech is the primary modality.
    • Text-Speech Alignment: Methods that explicitly align text and speech representations (e.g., through interleaved token sequences or multi-sequence generation) generally lead to enhanced model performance, particularly in cross-modal understanding and generation tasks.
    • Autoregressive Generation: The Transformer's decoder-only autoregressive architecture is widely adopted and proven effective for sequence generation in the speech domain, mirroring its success in text.
  • Vocoder Dominance: GAN-based vocoders, particularly HiFi-GAN and its variants, are the most prevalent choice in SpeechLMs due to their ability to synthesize high-fidelity and efficient audio waveforms. The choice between direct synthesis and input-enhanced synthesis depends on the information richness of the generated speech tokens.

  • Emergence of Real-time Interaction: While traditional SpeechLMs often generate responses after full input, newer research focuses on real-time interaction paradigms. Approaches using streaming tokenizers/vocoders and full-duplex modeling are critical advancements for natural conversations, allowing for interruptions and simultaneous speech. Interactive Period Recognition (IPR) (e.g., via VAD or learned cues) is also gaining importance for more natural conversational flow.

  • Diverse Capabilities: SpeechLMs demonstrate a broad range of downstream applications across semantic, speaker-related, and paralinguistic domains. This versatility, spanning spoken dialogue, speech translation, ASR, TTS, speaker identification, emotion recognition, and paralinguistics-enhanced generation, underscores their potential as foundation models for spoken language.

  • Instruction-Tuning Effectiveness: Instruction-tuning, often relying on synthesized data, effectively adapts SpeechLMs to follow instructions for various tasks, demonstrating their adaptability and generalizability.

    In summary, the collective findings across the surveyed papers validate the SpeechLM paradigm as a powerful and promising direction for voice-based AI. The continuous development of specialized components, sophisticated training techniques, and advanced interaction paradigms is steadily closing the gap between human and machine speech communication.

6.2. Data Presentation (Tables)

The following are the results from Table III of the original paper, summarizing popular datasets used in the pre-training and instruction-tuning phases of SpeechLMs. * means it is the speech version of the text dataset synthesized using TTS. S2ST and S2TT represent Speech-to-Speech Translation and Speech-to-Text Translation, respectively.

| Dataset | Type | Phase | Hours | Year |
| --- | --- | --- | --- | --- |
| LibriSpeech [131] | ASR | Pre-Training | 1k | 2015 |
| Multilingual LibriSpeech [132] | ASR | Pre-Training | 50.5k | 2020 |
| LibriLight [133] | ASR | Pre-Training | 60k | 2019 |
| People dataset [134] | ASR | Pre-Training | 30k | 2021 |
| VoxPopuli [135] | ASR | Pre-Training | 1.6k | 2021 |
| Gigaspeech [136] | ASR | Pre-Training | 40k | 2021 |
| Common Voice [137] | ASR | Pre-Training | 2.5k | 2019 |
| VCTK [138] | ASR | Pre-Training | 0.3k | 2017 |
| WenetSpeech [139] | ASR | Pre-Training | 22k | 2022 |
| LibriTTS [140] | TTS | Pre-Training | 0.6k | 2019 |
| CoVoST2 [141] | S2TT | Pre-Training | 2.8k | 2020 |
| CVSS [142] | S2ST | Pre-Training | 1.9k | 2022 |
| VoxCeleb [143] | Speaker Identification | Pre-Training | 0.4k | 2017 |
| VoxCeleb2 [144] | Speaker Identification | Pre-Training | 2.4k | 2018 |
| Spotify Podcasts [145] | Podcast | Pre-Training | 47k | 2020 |
| Fisher [146] | Telephone conversation | Pre-Training | 2k | 2004 |
| SpeechInstruct* [8] | Instruction-following | Instruction-Tuning | − | 2023 |
| InstructS2S-200K* [11] | Instruction-following | Instruction-Tuning | − | 2024 |
| VoiceAssistant-400K* [10] | Instruction-following | Instruction-Tuning | − | 2024 |

6.3. Ablation Studies / Parameter Analysis

As a survey, the paper does not conduct its own ablation studies or parameter analyses. Instead, it synthesizes the implications of such studies from the surveyed literature, particularly regarding the choice of components and training methodologies.

  • Component Choice (Speech Tokenizer): The comparative analyses highlighted by the survey (e.g., GSLM [50] comparing CPC, wav2vec 2.0, HuBERT; AudioPaLM [52] experimenting with w2v-bert, USM-v1, USM-v2) serve as a form of "meta-ablation." They show that semantic tokenizers like HuBERT are crucial for content understanding, while acoustic tokenizers (e.g., Encodec) are vital for high-fidelity audio. The emergence of mixed objective tokenizers directly addresses the trade-offs found in these implicit "ablation studies," aiming to combine the strengths.

  • Training Strategies (Cold vs. Continued Pre-Training): The observation that continued pre-training on TextLM checkpoints significantly outperforms cold initialization [51] is a crucial "ablation-like" finding. It demonstrates the immense value of leveraging existing linguistic knowledge from TextLMs, effectively acting as an ablation of the TextLM pre-training stage. The performance differences also highlight that not all pre-trained checkpoints (e.g., image-pretrained) are equally beneficial, indicating the specificity of modality transfer.

  • Text-Speech Alignment: The effectiveness of different text-speech alignment methods (single-sequence interleaving vs. multi-sequence generation, text-present vs. text-independent inference) as discussed in SPIRIT-LM [5], SUTLM [64], Llama-Omni [11], Mini-Omni [10], and Moshi [9] represents an exploration of architectural choices akin to parameter analysis. These works implicitly ablate different ways of integrating text and speech within the language model, revealing trade-offs in reasoning abilities, latency, and hallucination.

  • Features Modeled (Discrete vs. Continuous, Semantic vs. Paralinguistic): The discussions around the trade-offs of discrete vs. continuous features and the integration of paralinguistic tokens (e.g., F0, unit duration, pitch, style tokens) are direct analyses of how different input representations impact SpeechLM capabilities. For instance, pGSLM [56] and SPIRIT-LM [5] essentially perform an ablation by adding paralinguistic tokens to a base semantic tokenizer, demonstrating their impact on expressiveness without significantly compromising semantic understanding.

    In essence, while the survey itself doesn't present new experimental data, it rigorously synthesizes the results of numerous individual research papers, acting as a high-level analysis of how different design choices and training parameters affect the overall performance and capabilities of SpeechLMs.

7. Conclusion & Reflections

7.1. Conclusion Summary

This survey provides a pioneering and comprehensive overview of recent advancements in Speech Language Models (SpeechLMs), addressing the inherent limitations of the traditional Automatic Speech Recognition (ASR) + Large Language Model (LLM) + Text-to-Speech (TTS) pipeline. The paper effectively highlights the key advantages of SpeechLMs, including their ability to retain rich paralinguistic information, significantly reduce latency, and mitigate error accumulation through end-to-end speech processing.

The authors meticulously detail the architectural components of SpeechLMs (speech tokenizers, language models, and vocoders), categorizing various approaches within each. They also systematically review the diverse training recipes, encompassing different feature modeling techniques (discrete vs. continuous, semantic vs. acoustic vs. mixed), multi-stage training processes (pre-training, instruction-tuning, post-alignment), and advanced speech interaction paradigms (real-time interaction, interactive period recognition). Furthermore, the survey comprehensively outlines the wide range of downstream applications, classifying them into semantic-, speaker-, and paralinguistic-related tasks, and categorizes existing evaluation metrics and benchmarks.

Overall, the paper concludes that SpeechLMs are a promising alternative to pipeline-based systems, offering a more natural, efficient, and expressive way for humans to interact with AI models through speech.

7.2. Limitations & Future Work

The survey identifies several crucial challenges and outlines promising future research directions:

  • Understanding Different Component Choices: Current comparisons of speech tokenizers, language models, and vocoders are often limited in scope. A comprehensive, systematic comparison of these diverse components is needed to guide efficient SpeechLM development.
  • End-to-End Training: While SpeechLMs are conceptually end-to-end, many implementations still train components separately. Investigating fully end-to-end training, allowing gradients to propagate from vocoder output to tokenizer input, could unlock more coherent and high-fidelity speech generation.
  • Real-Time Speech Generation: The latency in SpeechLMs remains a significant hurdle for truly natural, real-time human interaction. Future work should focus on streamable pipelines and SpeechLMs that can autonomously generate audio samples in waveform chunks to minimize delay.
  • Safety Risks in SpeechLMs: This area is largely under-investigated compared to TextLLMs. SpeechLMs introduce unique safety concerns related to toxicity (e.g., generating acoustically inappropriate content like erotic speech) and privacy (e.g., inferring sensitive speaker attributes like ethnicity or religious beliefs from acoustic features). Robust research is needed to identify and mitigate these vulnerabilities.
  • Performance on Rare Languages: SpeechLMs have the potential to address the low-resource language challenge more effectively than TextLMs because spoken data is often more abundant than written text in such languages. Future research should prioritize training SpeechLMs on these languages and dialects to expand global accessibility.

7.3. Personal Insights & Critique

This survey is a highly valuable resource for anyone entering or working in the field of Speech Language Models. Its comprehensive nature, detailed breakdown of components, and systematic classification of training and evaluation methods are exceptional. The paper's emphasis on the "beginner-friendly" aspect is well-executed through clear explanations and structured analysis.

Inspirations and Applications: The vision of SpeechLMs that can genuinely understand and generate speech with full paralinguistic nuance is incredibly inspiring. Such models could revolutionize human-computer interaction, making AI assistants feel truly natural and empathetic. Beyond general conversation, the ability to generate speech conditioned on specific voices, emotions, or styles has profound implications for accessibility (e.g., personalized voice assistants for individuals with speech impairments), content creation (e.g., expressive audiobook narration), and even mental health applications (e.g., AI companions that respond with appropriate emotional tone). The potential for SpeechLMs to bridge communication gaps for low-resource languages by leveraging spoken data directly is a powerful application that could empower vast communities.

Potential Issues and Areas for Improvement:

  1. Complexity of Component Interactions: While the survey dissects each component, the intricate interactions and co-optimization challenges across the tokenizer, LM, and vocoder could be further emphasized. The choice of one component (e.g., semantic vs. acoustic tokenizer) deeply impacts the others and the overall system performance. A deeper dive into how researchers have balanced these interdependencies in actual implementations, beyond simply listing chosen components, would be beneficial.

  2. Computational Cost: The sheer scale of SpeechLMs, especially those leveraging large TextLLM backbones and requiring high-fidelity vocoders, implies massive computational requirements for training and inference. The survey mentions latency as a challenge but could explicitly discuss the energy footprint and hardware demands, which are critical for sustainable AI development and wider deployment.

  3. Data Quality and Bias: While large datasets are crucial, their quality and potential biases are less discussed. Speech data often carries inherent biases (e.g., gender, accent, socioeconomic status) that can be amplified by SpeechLMs. A more explicit discussion on how these biases are being addressed (or need to be addressed) in SpeechLM training and evaluation would be valuable, especially given the identified safety risks.

  4. "Black Box" Nature: Like TextLLMs, SpeechLMs are complex neural networks. Their interpretability is a significant challenge. How do these models "learn" paralinguistic features? Can we disentangle semantic from emotional understanding? Future surveys might delve into emerging interpretability techniques for SpeechLMs.

  5. Multilinguality and Code-Switching: While multilingual datasets are mentioned, the specific challenges and advancements in SpeechLMs for code-switching (mixing languages within a single utterance) are a nuanced area that could be explored in more detail, as it reflects natural human communication patterns.

    Despite these minor points, this survey stands as an excellent foundational text for understanding the current landscape and future trajectory of Speech Language Models. Its rigorous approach and clear presentation will undoubtedly aid researchers in navigating this dynamic and impactful field.
