Recent Advances in Speech Language Models: A Survey
TL;DR Summary
This survey provides a comprehensive overview of recent methodologies for constructing Speech Language Models (SpeechLMs), emphasizing their advantages as end-to-end models that generate speech directly, overcoming challenges like information loss, latency, and error accumulation.
Abstract
Large Language Models (LLMs) have recently garnered significant attention, primarily for their capabilities in text-based interactions. However, natural human interaction often relies on speech, necessitating a shift towards voice-based models. A straightforward approach to achieve this involves a pipeline of ``Automatic Speech Recognition (ASR) + LLM + Text-to-Speech (TTS)", where input speech is transcribed to text, processed by an LLM, and then converted back to speech. Despite being straightforward, this method suffers from inherent limitations, such as information loss during modality conversion, significant latency due to the complex pipeline, and error accumulation across the three stages. To address these issues, Speech Language Models (SpeechLMs) -- end-to-end models that generate speech without converting from text -- have emerged as a promising alternative. This survey paper provides the first comprehensive overview of recent methodologies for constructing SpeechLMs, detailing the key components of their architecture and the various training recipes integral to their development. Additionally, we systematically survey the various capabilities of SpeechLMs, categorize their evaluation metrics, and discuss the challenges and future research directions in this rapidly evolving field. The GitHub repository is available at https://github.com/dreamtheater123/Awesome-SpeechLM-Survey
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
Recent Advances in Speech Language Models: A Survey
1.2. Authors
Wenqian Cui, Dianzhi Yu, Xiaoqi Jiao, Ziqiao Meng, Guangyan Zhang, Qichao Wang, Yiwen Guo, and Irwin King, Fellow, IEEE
1.3. Journal/Conference
This paper is a preprint, published on arXiv. The authors include several researchers, some affiliated with academic institutions, and one Fellow of IEEE (Irwin King), indicating expertise in the field of artificial intelligence and machine learning. As a survey paper, its goal is to provide a comprehensive overview of a rapidly evolving research area, rather than presenting novel research findings. Preprints on arXiv allow for rapid dissemination of research and contribute to ongoing discussions in the AI community.
1.4. Publication Year
2024
1.5. Abstract
The paper addresses the limitations of conventional text-based Large Language Models (LLMs) in natural human interaction, which often relies on speech. It highlights the shortcomings of the Automatic Speech Recognition (ASR) + LLM + Text-to-Speech (TTS) pipeline, specifically information loss during modality conversion, significant latency, and cumulative errors. To overcome these, Speech Language Models (SpeechLMs) are introduced as end-to-end models that generate speech directly from speech without intermediate text conversion. This survey provides the first comprehensive overview of recent methodologies for constructing SpeechLMs, detailing their architectural components (speech tokenizer, language model, vocoder) and training recipes. It also systematically surveys their capabilities, categorizes evaluation metrics, and discusses current challenges and future research directions in this rapidly evolving field.
1.6. Original Source Link
Official Source: https://arxiv.org/abs/2410.03751v4 PDF Link: https://arxiv.org/pdf/2410.03751v4.pdf Publication Status: Preprint on arXiv.
2. Executive Summary
2.1. Background & Motivation
The core problem the paper aims to solve is the inherent limitations of integrating speech capabilities into Large Language Models (LLMs) using a traditional Automatic Speech Recognition (ASR) + LLM + Text-to-Speech (TTS) pipeline. While LLMs have excelled in text-based interactions, natural human communication is fundamentally multimodal, with speech being a primary channel.
This problem is important because:
-
Information Loss: The pipeline converts rich speech signals (containing semantic and paralinguistic information like pitch, timbre, tone) into text, losing crucial non-semantic cues. This limits the model's ability to interpret user intent accurately and generate expressive, nuanced responses.
-
Significant Latency: Chaining three complex, sequential modules (ASR, LLM, TTS) inevitably introduces considerable delays, making real-time, natural conversation difficult. Each module often involves its own sub-components and decoding steps (e.g., text generators, tokenizers, beam search), increasing computational demands.
-
Cumulative Errors: Errors can propagate and accumulate across stages. ASR transcription errors can negatively impact the LLM's understanding and response generation, and the TTS module might struggle to synthesize text that is grammatically correct but acoustically challenging.
The paper's entry point and innovative idea revolve around
Speech Language Models (SpeechLMs). These are proposed as an alternative end-to-end approach that directly processes and generates speech, bypassing the intermediate text conversion of the traditional pipeline. This promises to retain more information, reduce latency, and mitigate error accumulation.
2.2. Main Contributions / Findings
The paper's primary contributions are:
-
First Comprehensive Survey: It presents the first comprehensive overview of recent methodologies for constructing SpeechLMs, filling a gap in the literature compared to existing surveys that focus on traditional speech/audio technologies or multimodal LLMs without specifically addressing end-to-end speech generation models.
-
Novel Taxonomy for SpeechLMs: The paper proposes a new classification system for SpeechLMs based on their underlying components (speech tokenizer, language model, vocoder) and training recipes (pre-training, instruction-tuning, post-alignment, feature modeling, and interaction paradigms).
-
Classification of Evaluation Methods: It introduces a novel classification system for evaluating SpeechLMs, categorizing them into automatic (objective) and human (subjective) assessments, covering aspects like representation, linguistic, paralinguistic, generation quality, real-time interaction, and downstream task performance.
-
Identification of Challenges and Future Directions: The survey discusses the current challenges in the SpeechLM field, such as understanding component choices, achieving end-to-end training, real-time speech generation, addressing safety risks (toxicity, privacy), and improving performance on rare languages. It also outlines promising avenues for future research.
These contributions collectively provide a structured understanding of the rapidly evolving SpeechLM landscape, guiding researchers and practitioners in developing more powerful and natural voice-based AI.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand Speech Language Models (SpeechLMs), several foundational concepts from natural language processing (NLP) and speech processing are crucial.
- Large Language Models (LLMs): These are advanced neural network models, typically based on the
Transformerarchitecture, trained on vast amounts of text data to understand, generate, and predict human-like text. They learn complex patterns of language, enabling tasks like translation, summarization, and question answering. Examples include GPT-3, LLaMA, and OPT. - Automatic Speech Recognition (ASR):
ASRis the process by which spoken language is converted into written text. It involves several stages, including acoustic modeling (mapping audio to phonemes), pronunciation modeling (mapping phonemes to words), and language modeling (predicting sequences of words). - Text-to-Speech (TTS):
TTSis the process of synthesizing human speech from written text. ModernTTSsystems often involve text analysis (to determine pronunciation, prosody), acoustic modeling (to generate acoustic features like mel-spectrograms), and vocoding (to convert acoustic features into a waveform). - Autoregressive Models: A type of statistical model that predicts the next item in a sequence based on the preceding items. In the context of
LLMsorSpeechLMs, this means predicting the next word, token, or speech segment based on all the previous ones, enabling generation of coherent sequences. - Transformer Architecture: Introduced in "Attention Is All You Need" [38], the
Transformer is a neural network architecture that relies heavily on self-attention mechanisms to weigh the importance of different parts of the input sequence. It has become the backbone of most state-of-the-art LLMs and is widely used in SpeechLMs.
- Self-Attention: A mechanism that allows the model to weigh the importance of different input elements (tokens) when processing each element. For an input query $Q$, keys $K$, and values $V$, the Attention mechanism is calculated as: $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $ Where:
- $Q$ (Query), $K$ (Key), $V$ (Value) are matrices representing the input embeddings.
- $QK^T$ calculates the dot product similarity between queries and keys.
- $\sqrt{d_k}$ is a scaling factor to prevent large dot products from pushing the softmax into regions with very small gradients, where $d_k$ is the dimension of the keys.
- $\mathrm{softmax}$ normalizes the scores so they sum to 1, representing attention weights.
- The attention weights are then multiplied by the $V$ matrix to get the weighted sum of values. (A minimal NumPy sketch of this computation is provided at the end of this list of concepts.)
- Tokenization: The process of breaking down a continuous stream of data (like text or speech) into discrete units called
tokens. For text, tokens can be words, subwords, or characters. For speech, they can bespeech units,acoustic tokens, orsemantic tokensderived from audio features. - Vocoders: Devices or algorithms that convert acoustic features (e.g., mel-spectrograms, speech tokens) back into audible speech waveforms. They are crucial for the synthesis part of
TTSandSpeechLMs. - Self-Supervised Learning (SSL): A paradigm where models learn representations from unlabeled data by solving a pretext task (e.g., predicting masked portions of the input). This is extensively used in speech processing (e.g.,
wav2vec 2.0,HuBERT) to learn powerful speech representations. - Generative Adversarial Networks (GANs): A class of deep learning models where two neural networks, a
generatorand adiscriminator, compete against each other. Thegeneratorcreates synthetic data (e.g., audio), and thediscriminatortries to distinguish real data from generated data. This adversarial process drives the generator to produce highly realistic outputs, often used in vocoders. - Variational Autoencoders (VAEs): A type of generative model that learns a compressed latent representation of data and can generate new data points by sampling from this latent space. They are used for learning discrete representations (
VQ-VAE) and for various generative tasks. - Diffusion Models: A class of generative models that learn to reverse a gradual diffusion process, where noise is progressively added to data until it becomes pure noise. By learning to denoise, they can generate high-quality data (e.g., images, audio) from random noise.
- k-means Clustering: An unsupervised machine learning algorithm used to partition observations into clusters, where each observation belongs to the cluster with the nearest mean (centroid). Used in
HuBERTto discretize speech features. - Vector Quantization (VQ): A technique for quantizing vectors into a finite set of codebook vectors. Used in audio codecs to compress continuous audio features into discrete tokens.
- Residual Vector Quantization (RVQ): An extension of
VQwhere quantization is performed in multiple stages, with each stage quantizing the residual error from the previous stage. This allows for more precise reconstruction and higher fidelity. - Masked Language Modeling (MLM): A pre-training objective where a portion of the input tokens are masked, and the model is trained to predict the original masked tokens. Widely used in
BERTand speechSSLmodels likeW2v-BERTandHuBERT. - Contrastive Learning: A learning paradigm where the model is trained to pull similar samples closer together in an embedding space and push dissimilar samples further apart. Used in
wav2vec 2.0for learning speech representations. - Mel-spectrograms: A time-frequency representation of audio that approximates human hearing. It's commonly used as an intermediate representation in
TTSandASRsystems. - Fundamental Frequency (F0): Also known as pitch, it is the lowest frequency of a vibrating object and is a key acoustic correlate of perceived pitch. It conveys prosodic and emotional information in speech.
- Perplexity (PPL): A common intrinsic evaluation metric for language models. Lower
perplexityindicates a better model, as it means the model is more confident and accurate in predicting the next token in a sequence. - Word Error Rate (WER): A common metric for evaluating the performance of
ASR systems. It measures the number of errors (substitutions, deletions, insertions) required to transform the recognized word sequence into the reference word sequence, divided by the total number of words in the reference. $ \mathrm{WER} = \frac{S + D + I}{N} $ Where:
- $S$ is the number of substitutions.
- $D$ is the number of deletions.
- $I$ is the number of insertions.
- $N$ is the total number of words in the reference (ground truth).
- Character Error Rate (CER): Similar to
WER, but calculated at the character level. - Mean Opinion Score (MOS): A subjective metric used in telecommunications and speech quality assessment. Human listeners rate the quality of speech samples on a scale (e.g., 1 to 5). The
MOSis the arithmetic mean of all individual scores. Variants includeMMOS(overall quality),PMOS(prosody), andSMOS(speaker similarity). - Reinforcement Learning from Human Feedback (RLHF): A technique used to align
LLMswith human preferences. It involves training a reward model on human comparisons of model outputs, and then using this reward model to fine-tune theLLMwith reinforcement learning algorithms. - Proximal Policy Optimization (PPO): A popular
RLalgorithm used inRLHFto optimize policies (model behaviors) by making small, stable updates. - Direct Preference Optimization (DPO): A simpler and more stable alternative to
PPO for RLHF, directly optimizing the policy based on human preference data without needing a separate reward model.
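To make the self-attention formula above concrete, here is a minimal NumPy sketch of scaled dot-product attention. The matrix shapes and random inputs are illustrative assumptions rather than anything prescribed by the survey.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # pairwise query-key similarities
    weights = softmax(scores, axis=-1)    # each row sums to 1 (attention weights)
    return weights @ V                    # weighted sum of the values

# Toy example: 4 query positions, 6 key/value positions, d_k = d_v = 8.
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(6, 8))
V = rng.normal(size=(6, 8))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8)
```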
3.2. Previous Works
The paper contextualizes SpeechLMs by contrasting them with existing approaches and related research areas:
-
Traditional Speech and Audio Technologies: Several surveys have focused on specific aspects:
Spoken Language Understanding (SLU)[16]: Deals with extracting semantic meaning and intent from spoken utterances.- Audio and speech
Self-Supervised Learning (SSL)[17, 18]: Focuses on learning robust speech representations from large amounts of unlabeled audio data. - Integration of speech with other modalities [19, 20]: Explores how speech interacts with modalities like vision.
- These works primarily focus on components or specific tasks within speech processing, whereas
SpeechLMsaim for an end-to-end generative foundation model.
-
Single-modal and Multi-modal LLMs:
- Surveys on single-modal
LLMs[21, 22] cover advancements in text generation and understanding. - Surveys on multi-modal
LLMs[23-25] explore models that combine text with other modalities like images or video. - While
SpeechLMsare multi-modal (speech and potentially text), they have a unique focus on direct speech input and output, which differentiates them from broader multi-modalLLMsurveys that might prioritize vision-language tasks.
- Surveys on single-modal
-
Overlap between Audio Modality and LLMs:
- Latif et al. [26] examine
LLMsin audio processing. - Peng et al. [27] review
SpeechLLMswithin theSLUdomain. - Ji et al. [28] focus on spoken dialogue systems encompassing speech, sound, and music.
- The current survey distinguishes itself by focusing specifically on
Speech Language Modelsas end-to-end generative models for speech, which is a more constrained and rapidly developing sub-field compared to general audio processing orSLU.
- Latif et al. [26] examine
-
The "Naive" Pipeline: This is the most direct prior work that
SpeechLMsaim to supersede.- Mechanism: As depicted in Figure 1a, an
ASRmodule converts spoken input to text, anLLMprocesses the text and generates a textual response, and aTTSmodule converts the textual response back to speech. - Limitations (as detailed in Background & Motivation):
- Information Loss:
ASRdiscards paralinguistic information (pitch, timbre, tonality), which is vital for expressive communication and accurate intent interpretation. - Significant Latency: Sequential operation of three complex modules introduces considerable delays, hindering natural real-time interaction.
- Cumulative Error: Errors from
ASRcan propagate toLLMandTTS, leading to degraded overall performance and potentially incomprehensible outputs.
- Information Loss:
- Mechanism: As depicted in Figure 1a, an
3.3. Technological Evolution
The technological landscape has evolved from specialized, pipeline-based systems to integrated, end-to-end foundation models.
- Early Speech Technologies: Focused on separate, modular components for
ASR,TTS, andSLU. These systems were often rule-based or employed simpler statistical models. - Deep Learning Revolution: The advent of deep learning significantly improved
ASR(e.g., RNNs, LSTMs, Transformers) andTTS(e.g., WaveNet, Tacotron).SSLfor speech (e.g.,wav2vec 2.0,HuBERT) allowed learning powerful speech representations from vast unlabeled data. - Rise of Text-based LLMs:
Transformersand massive datasets led to the emergence of powerfulLLMs(e.g., GPT, LLaMA) that excel at text generation and understanding. - Initial Speech-LLM Integration (Pipeline Approach): A natural first step was to combine existing
ASR,LLM, andTTSmodules. This demonstrated basic speech interaction but inherited all the limitations described above. - Emergence of SpeechLMs: Recognizing the limitations of the pipeline, researchers began exploring end-to-end
SpeechLMs. This involves designing models that can directly process raw audio or audio tokens and generate audio or audio tokens, often inspired by theTransformerarchitecture ofLLMs. This approach aims to address the shortcomings of the pipeline by integrating modalities more deeply.
3.4. Differentiation Analysis
Compared to the main methods in related work, SpeechLMs introduce several core differences and innovations:
- End-to-End Modality Handling: The most significant differentiation is the direct, end-to-end processing and generation of speech. Unlike the pipeline,
SpeechLMsdo not convert speech to text (and back) as an intermediate step within the core generation loop. - Retention of Paralinguistic Information: By working directly with speech waveforms or rich speech tokens,
SpeechLMsare designed to preserve and leverage paralinguistic information (e.g., pitch, timbre, emotion, speaking style) that is lost when speech is reduced to text. This allows for more expressive, natural, and context-aware interactions. - Reduced Latency: By integrating the ASR-like and TTS-like functionalities into a single, unified model,
SpeechLMsaim to significantly reduce the computational overhead and sequential delays inherent in multi-stage pipelines. This is crucial for real-time conversational AI. - Mitigation of Cumulative Errors: A unified architecture trained end-to-end is less susceptible to error propagation, as the model learns to handle variations and ambiguities across modalities more robustly, rather than relying on perfect performance from discrete, cascaded modules.
- Broader Application Scope:
SpeechLMscan natively handle multimodal inputs (interleaved speech and text) and outputs, enabling more fluid human-computer interaction, including scenarios like speech-in-text-out and vice-versa, or mixed dialogue. They can also implicitly learn relationships between different speech characteristics and semantic content. - Novelty as a Survey: This paper itself differentiates from existing surveys by focusing specifically on the architectures, training recipes, capabilities, and evaluations of end-to-end generative SpeechLMs, rather than broader audio processing,
SLU, or general multimodalLLMtopics. It provides a unique lens on this emerging field.
4. Methodology
4.1. Principles
The core idea behind Speech Language Models (SpeechLMs) is to enable end-to-end processing and generation of speech, bypassing the limitations of the conventional pipeline. The theoretical basis or intuition is drawn from the success of TextLMs (like LLMs) in autoregressively generating coherent text sequences. SpeechLMs extend this paradigm to the audio domain, aiming to model speech (and potentially interleaved speech and text) autoregressively.
The fundamental design pattern for SpeechLMs involves three main components:
-
Speech Tokenizer: Converts continuous audio waveforms into discrete or continuous
tokensor representations that a language model can process. This step is crucial for abstracting the raw audio signal into a more manageable, information-rich format. -
Language Model (LM): Typically a
Transformer-based architecture (often borrowed fromTextLMs), this component takes thespeech tokens(and potentiallytext tokens) as input and autoregressively predicts the next token in the sequence. It learns the statistical patterns and dependencies within and across speech and text modalities. -
Token-to-Speech Synthesizer (Vocoder): Transforms the
tokensgenerated by the language model back into audible speech waveforms. This is essentially the inverse operation of thespeech tokenizer.This three-stage design, while still having distinct components, is considered "end-to-end" in the sense that the language model directly operates on speech-derived representations and produces speech-synthesizable outputs, fostering a more integrated learning process compared to the cascaded system.
4.2. Core Methodology In-depth (Layer by Layer)
The paper formally defines a SpeechLM as an autoregressive foundation model that processes and generates speech end-to-end. It can also incorporate text for cross-modal functionalities.
Let's define the input and output:
-
A speech audio waveform is denoted as $\mathbf{a} = (a_1, a_2, \ldots, a_Q)$, where $a_i$ are audio samples and $Q$ is the waveform length.
-
A text span is denoted as $\mathbf{t} = (t_1, t_2, \ldots, t_K)$, where $t_j$ are text tokens (word, subword, character) and $K$ is the text length.
-
A multimodal sequence is $\mathbf{M} = (M_1, M_2, \ldots, M_N)$, where each element $M_i$ can be an audio sample or a text token.
-
The input multimodal sequence is $\mathbf{M}^{\mathrm{in}} = (M_1, \ldots, M_{N_{\mathrm{in}}})$, with $N_{\mathrm{in}} \geq 0$.
-
The output multimodal sequence is $\mathbf{M}^{\mathrm{out}} = (M_1, \ldots, M_{N_{\mathrm{out}}})$, with $N_{\mathrm{out}} \geq 0$.
A SpeechLM parameterized by $\theta$ is then represented as $\mathbf{M}^{\mathrm{out}} = \mathrm{SpeechLM}(\mathbf{M}^{\mathrm{in}}; \theta)$. This equation indicates that the SpeechLM takes an input multimodal sequence and generates an output multimodal sequence, both potentially containing speech and/or text.
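As a minimal illustration of this autoregressive formulation, the sketch below extends a multimodal token sequence one token at a time. The DummyLM stand-in and its predict_next interface are hypothetical placeholders for a real SpeechLM, used only so the example runs.

```python
class DummyLM:
    """Toy stand-in for a SpeechLM: cycles through a fixed list of tokens."""
    def __init__(self, tokens):
        self.tokens, self.i = tokens, 0

    def predict_next(self, seq):
        # A real SpeechLM would sample from p(M_t | M_<t); here we just cycle.
        tok = self.tokens[self.i % len(self.tokens)]
        self.i += 1
        return tok

def generate(model, prompt_tokens, max_new_tokens=100, eos_token=None):
    """Autoregressively extend a multimodal (speech and/or text) token sequence."""
    seq = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = model.predict_next(seq)
        seq.append(next_token)
        if next_token == eos_token:
            break
    return seq

# Speech units (e.g., "S12") and text tokens can share one sequence.
print(generate(DummyLM(["S12", "S34", "<eos>"]), ["S1"], eos_token="<eos>"))
# ['S1', 'S12', 'S34', '<eos>']
```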
4.2.1. Components in SpeechLM
The three main components within a SpeechLM are: speech tokenizer, language model, and token-to-speech synthesizer (vocoder).
4.2.1.1. Speech Tokenizer
The speech tokenizer is the initial component that transforms continuous audio signals (waveforms) into discrete or continuous tokens or representations suitable for a language model. Its goal is to capture essential audio features while reducing dimensionality. These tokens can then be used for autoregressive modeling.
The paper categorizes speech tokenizers into three types based on their primary objective:
-
Semantic Understanding Objective:
- Goal: Convert speech waveforms into tokens that capture the content and meaning, primarily enhancing tasks like
ASR. These tokens focus on semantic features. - Architecture: Typically comprises a
speech encoderand aquantizer. - Process: The encoder transforms the waveform into continuous embeddings :
Where are the encoded embeddings, and are the encoder parameters.
Since is continuous, a
quantizerdiscretizes these embeddings intospeech tokens: Where are the speech tokens, and are the quantizer parameters. If continuous tokens are used, . - Training: Tokens can be used as target labels for pre-training objectives like masking and reconstructing (e.g.,
masked language modeling - MLM,contrastive loss). - Examples:
Wav2vec 2.0[30]: Uses a convolutional encoder and aproduct quantizationmodule.W2v-BERT[32]: Builds onwav2vec 2.0withMLMloss andcontrastive loss.HuBERT[33]: Usesk-meansto cluster speech utterances into hidden units, thenMLMto predict these units from masked speech.Google USM[36]: Employstext-injection lossfor text-speech alignment.WavLM[37]: Adds aspeech denoising objective.
-
Acoustic Generation Objective:
- Goal: Capture acoustic features essential for generating high-quality speech waveforms, prioritizing acoustic characteristics over semantic content, suitable for speech (re)synthesis.
- Architecture: Typically includes an
encoder, aquantizer, and adecoder. - Process: The encoder and quantizer transform the waveform into tokens (same as semantic tokenizers). The decoder then reconstructs these tokens back into speech waveforms : Where is the generated or reconstructed waveform.
- Examples:
Neural audio codecs[35, 49] are primary examples, usingvector quantization (VQ)orresidual vector quantization (RVQ)[77].
-
Mixed Objective:
- Goal: Balance both
semantic understandingandacoustic generationby combining advantages of both. - Approach: Most existing mixed tokenizers adopt the architecture of acoustic generation tokenizers and distill information from semantic tokenizers.
- Examples:
-
SpeechTokenizer[34]: UsesRVQ-GANarchitecture and distills semantic information fromHuBERTinto the first layer ofRVQ. -
Mimi[9]: Employs a singleVQto extract information fromWavLMand anRVQmodule for acoustic information.The paper provides detailed notations for three representative speech tokenizers:
-
-
HuBERT (Semantic Objective):
- Feature encoder $f$ maps raw audio $\mathbf{a}$ to continuous embeddings $\mathbf{v}$: $\mathbf{v} = f(\mathbf{a}; \theta_f)$.
- Embeddings are quantized into discrete speech tokens $\mathbf{s}$ via k-means clustering of MFCC features: $\mathbf{s} = \mathrm{kmeans}(\mathrm{MFCC}(\mathbf{a}))$.
- Trained with a masked prediction objective: $ \mathcal{L}(\theta) = -\,\mathbb{E}_{\mathbf{a} \sim \mathcal{D}} \Big[ \sum_{i \in \mathcal{M}} \log p(s_i \mid \tilde{\mathbf{v}}; \theta) \Big] $ Where:
- $\mathcal{L}(\theta)$ is the loss function for model parameters $\theta$.
- $\mathbb{E}_{\mathbf{a} \sim \mathcal{D}}$ denotes the expectation over audio samples drawn from the data distribution $\mathcal{D}$.
- $\mathcal{M}$ represents the set of masked indices in the sequence.
- $p(s_i \mid \tilde{\mathbf{v}}; \theta)$ is the probability of predicting the correct token at a masked position $i$, given the unmasked embeddings $\tilde{\mathbf{v}}$ and model parameters $\theta$.
- Iterative refinement of speech tokens: $\mathbf{s}^{(k)} = \mathrm{Q}\big(f(\mathbf{a}; \theta_f^{(k)}); \theta_{\mathrm{Q}}^{(k)}\big)$, where $\mathbf{s}^{(k)}$ are the refined speech tokens at iteration $k$, and $\theta_f^{(k)}$ and $\theta_{\mathrm{Q}}^{(k)}$ are the encoder and discretizer parameters at iteration $k$.
-
Encodec (Acoustic Objective):
- Encoder maps raw audio $\mathbf{a}$ to continuous embeddings $\mathbf{v}$: $\mathbf{v} = \mathrm{Enc}(\mathbf{a}; \theta_{\mathrm{Enc}})$.
- Embeddings are discretized using multi-stage RVQ, where each stage quantizes the residual of the previous stage: $\mathbf{r}_0 = \mathbf{v}$, $\mathbf{r}_j = \mathbf{r}_{j-1} - \mathrm{Q}_j(\mathbf{r}_{j-1}; \theta_{\mathrm{Q}_j})$, and $\hat{\mathbf{v}} = \sum_{j=1}^{J} \mathrm{Q}_j(\mathbf{r}_{j-1}; \theta_{\mathrm{Q}_j})$. Where:
- $\mathbf{s}$ are the discrete speech tokens (the codebook indices selected at each stage).
- $\mathrm{Q}_j$ is the quantizer for stage $j$.
- $\theta_{\mathrm{Q}_j}$ are the parameters for the quantizer at stage $j$.
- $\mathrm{Q}_j(\mathbf{r}_{j-1}; \theta_{\mathrm{Q}_j})$ denotes the quantized embedding at stage $j$.
- $J$ is the total number of RVQ stages.
- Decoder reconstructs the audio waveform from the quantized embeddings: $\hat{\mathbf{a}} = \mathrm{Dec}(\hat{\mathbf{v}}; \theta_{\mathrm{Dec}})$. (A minimal sketch of this residual quantization appears after this list.)
-
SpeechTokenizer (Mixed Objective):
- Encoder transforms input audio $\mathbf{a}$ into continuous embeddings $\mathbf{v}$: $\mathbf{v} = \mathrm{Enc}(\mathbf{a}; \theta_{\mathrm{Enc}})$.
- Discretization via multi-stage RVQ, similar to Encodec. The key difference is that the first RVQ stage distills tokens derived from HuBERT (semantic information), and subsequent stages quantize the residuals (acoustic information).
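The residual vector quantization used by Encodec-style codecs can be illustrated with a short NumPy sketch. The random codebooks and shapes below are toy assumptions; real codecs learn the codebooks jointly with a neural encoder and decoder.

```python
import numpy as np

def rvq_encode(v, codebooks):
    """Multi-stage residual VQ: each stage quantizes the residual left by the previous one.

    v: (T, d) continuous embeddings from the encoder.
    codebooks: list of (K, d) arrays, one per RVQ stage.
    Returns the per-stage token indices and the summed quantized embedding.
    """
    residual = v.copy()
    tokens, quantized = [], np.zeros_like(v)
    for cb in codebooks:
        # Nearest codebook vector for every frame of the current residual.
        dists = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(-1)  # (T, K)
        idx = dists.argmin(axis=1)                                      # (T,)
        tokens.append(idx)
        quantized += cb[idx]
        residual = residual - cb[idx]   # next stage sees what is still unexplained
    return tokens, quantized

# Toy run: 50 frames of 16-dim embeddings, 3 stages with 256-entry codebooks.
rng = np.random.default_rng(0)
v = rng.normal(size=(50, 16))
codebooks = [rng.normal(size=(256, 16)) for _ in range(3)]
tokens, v_hat = rvq_encode(v, codebooks)
print(len(tokens), v_hat.shape)  # 3 (50, 16)
```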
4.2.1.2. Language Model
The language model component in SpeechLMs largely adapts architectures from TextLMs, primarily Transformers [38] or decoder-only architectures (e.g., OPT [3], LLaMA [39]). It generates speech in an autoregressive manner.
-
Text-based Decoder-Only Transformer LM:
- Vocabulary size: $|V_t|$.
- Hidden dimension: $h$.
- Embedding matrix: $E_t \in \mathbb{R}^{|V_t| \times h}$.
- Sequence of transformer decoder blocks: $\mathrm{De}$.
- Output embedding matrix: $W_t \in \mathbb{R}^{h \times |V_t|}$.
- Representation: $\mathbf{t}^{\mathrm{out}} = \mathrm{LM}(\mathbf{t}^{\mathrm{in}}; E_t, \mathrm{De}, W_t)$, where $\mathbf{t}^{\mathrm{in}}$ is the input text token sequence and $\mathbf{t}^{\mathrm{out}}$ is the generated text token sequence.
-
Adapting for Speech Generation:
- The original text tokenizer is replaced by a speech tokenizer.
- The text embedding matrix is replaced by a speech embedding matrix $E_s \in \mathbb{R}^{|V_s| \times h}$, where $|V_s|$ is the vocabulary size of the speech tokenizer.
- The output embedding matrix is replaced by $W_s \in \mathbb{R}^{h \times |V_s|}$.
- Representation for speech-only LM: $\mathbf{s}^{\mathrm{out}} = \mathrm{LM}(\mathbf{s}^{\mathrm{in}}; E_s, \mathrm{De}, W_s)$, where $\mathbf{s}^{\mathrm{in}}$ is the input speech token sequence and $\mathbf{s}^{\mathrm{out}}$ is the generated speech token sequence.
-
Jointly Modeling Text and Speech (Multimodal LM):
- A common approach is to expand the vocabulary of the original TextLM to include both text and speech tokens.
- The speech embedding matrix $E_s$ is appended to the text embedding matrix $E_t$, forming a larger embedding matrix $E_m \in \mathbb{R}^{(|V_t| + |V_s|) \times h}$ over the combined vocabulary (i.e., the additive union of the text and speech vocabularies).
- Let $\mathbf{m}$ be a token sequence containing both speech and text tokens.
- Representation for multimodal LM: $\mathbf{m}^{\mathrm{out}} = \mathrm{LM}(\mathbf{m}^{\mathrm{in}}; E_m, \mathrm{De}, W_m)$, where $E_m$ and $W_m$ represent the combined input and output embedding matrices, and $\mathbf{m}^{\mathrm{in}}$ and $\mathbf{m}^{\mathrm{out}}$ are input and output multimodal token sequences.
- When modeling with continuous tokens, the embeddings from the speech tokenizer are directly fed into the language model, and the LM architecture might not need significant changes. (A minimal sketch of the vocabulary expansion above follows this list.)
4.2.1.3. Token-to-Speech Synthesizer (Vocoder)
After the language model generates speech tokens autoregressively, the vocoder module converts these tokens back into audible speech waveforms. This is the reverse process of the speech tokenizer: $\hat{\mathbf{a}} = \mathrm{Vo}(\mathbf{s}; \theta_{\mathrm{Vo}})$
Where:
-
$\hat{\mathbf{a}}$ is the synthesized speech audio waveform.
-
$\mathrm{Vo}$ is the vocoder model.
is the input sequence of speech tokens.
-
are the parameters of the vocoder model.
The paper outlines two main pipelines for
SpeechLMvocoders:
-
Direct Synthesis: The vocoder directly converts
speech tokensinto audio waveforms.- Example: Polyak et al. [48] adapted
HiFi-GAN[47] to takespeech tokensas input. - Suitable for tokens from
acoustic generation tokenizerswhich contain sufficient acoustic information.
- Example: Polyak et al. [48] adapted
-
Input-Enhanced Synthesis: An additional module transforms the
speech tokensinto a continuous latent representation (e.g.,mel-spectrograms) before feeding them to the vocoder.- Reason: Vocoders often require intermediate audio representations.
- Example:
CosyVoice[88] uses aConditional FlowMatching (CFM)model to convertspeech tokensintomel-spectrograms, then usesHiFi-GANfor final synthesis. - Suitable for tokens from
semantic understanding tokenizerswhich provide rich semantic information but lack fine acoustic details.
Vocoder Categories:
-
GAN-based Vocoder: Most commonly adopted due to fast and high-fidelity generation.
- Architecture: Comprises a
generator() and adiscriminator(). Thegeneratorproduces audio, and thediscriminatordistinguishes real from synthetic audio. - Training Objectives:
- GAN Loss: The fundamental adversarial objective. For the
generator(G)anddiscriminator(D), typically uses least squares loss:- Generator GAN Loss:
Where:
msrepresents the mel-spectrogram input to the generator.G(ms)is the waveform generated by the generator.D(G(ms))is the discriminator's output for the generated waveform.- The generator aims to make
D(G(ms))close to 1 (fool the discriminator into thinking it's real).
- Discriminator GAN Loss:
Where:
- is the ground truth audio waveform.
D(x)is the discriminator's output for the real waveform (aims for 1).D(G(ms))is the discriminator's output for the generated waveform (aims for 0).
- Generator GAN Loss:
Where:
- Mel-spectrogram Loss: Improves fidelity by aligning mel-spectrograms of generated and ground-truth audio:
Where:
- is the function to transform a waveform into its corresponding mel-spectrogram.
- denotes the L1 distance.
- Feature Matching Loss: Aligns discriminator-encoded features of real and generated samples:
Where:
- represents the features from the -th layer of the discriminator.
- is the number of features in the -th layer.
- is the total number of layers considered.
- GAN Loss: The fundamental adversarial objective. For the
- Architectural Choices:
MelGAN[123]: Uses residual blocks with dilations and amulti-scale discriminator.HiFi-GAN[47]: Proposes amulti-period discriminatorfor diverse periodic patterns.Fre-GAN[124]: EmploysDiscrete Wavelet Transform (DWT)for high-frequency content.BigVGAN[80]: Introduces aperiodic activation function(snake function) andanti-aliased representation.
- Architecture: Comprises a
-
GAN-based Neural Audio Codec: When neural audio codecs use
GANs, their decoders can directly serve as vocoders (e.g.,SoundStreamdecoder inEncodec[49]). Polyak et al. [48] usedHiFi-GANas a backbone, disentangling input features into semantic tokens, pitch tokens, and speaker embeddings. -
Other Types of Vocoder:
-
Pure Signal Processing Vocoder: Traditional methods (e.g.,
WORLD[126]) relying on deterministic algorithms, but often introduce artifacts. -
Autoregressive Vocoder: Generates audio samples one at a time, conditioned on previous samples (e.g.,
WaveNet[44]). High quality but computationally expensive and slow. -
Flow-based Vocoder: Uses invertible transformations to map a simple distribution to complex audio samples (e.g.,
WaveGlow[46]). Efficient sampling and density evaluation, parallel generation, but high parameter/memory cost. -
VAE-based Vocoders: Use
VAEsto learn latent representations for reconstruction [77, 127], less explored for vocoders. -
Diffusion-based Vocoder: Gradually adds noise to data and learns to reverse the process (e.g.,
DiffWave[128]). Produces high-fidelity speech.HiFi-GAN Notation (Most Used Vocoder):
HiFi-GANsynthesizes audio waveforms frommel-spectrogramsorspeech tokens. Thegeneratormaps a sequence of speech tokens to an output audio waveform : Where are the vocoder parameters. It usesmulti-periodandmulti-scale discriminatorsfor adversarial training. At inference, only is used.
-
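To make the GAN-based vocoder objectives above concrete (least-squares adversarial terms, mel-spectrogram L1 loss, and feature matching), here is a hedged PyTorch sketch. The discriminator interface returning (score, feature list), the identity "mel" transform in the demo, and the loss weights are assumptions for illustration; systems like HiFi-GAN use multi-period and multi-scale discriminators and a real mel-spectrogram transform.

```python
import torch

def generator_losses(D, fake_wave, real_wave, mel_fn, lambda_mel=45.0, lambda_fm=2.0):
    """LS-GAN generator loss + mel-spectrogram L1 loss + feature-matching loss.

    D(wave) is assumed to return (score, list_of_intermediate_features).
    """
    fake_score, fake_feats = D(fake_wave)
    with torch.no_grad():
        _, real_feats = D(real_wave)
    adv = ((fake_score - 1.0) ** 2).mean()                  # push D(G(ms)) toward 1
    mel = torch.nn.functional.l1_loss(mel_fn(fake_wave), mel_fn(real_wave))
    fm = sum(torch.nn.functional.l1_loss(f, r) for f, r in zip(fake_feats, real_feats))
    return adv + lambda_mel * mel + lambda_fm * fm

def discriminator_loss(D, fake_wave, real_wave):
    """LS-GAN discriminator loss: real scores toward 1, fake scores toward 0."""
    real_score, _ = D(real_wave)
    fake_score, _ = D(fake_wave.detach())
    return ((real_score - 1.0) ** 2).mean() + (fake_score ** 2).mean()

if __name__ == "__main__":
    # Dummy single-layer "discriminator" and identity "mel" just to exercise the losses.
    lin = torch.nn.Linear(16000, 1)
    D = lambda w: (lin(w), [w])
    mel_fn = lambda w: w
    fake, real = torch.randn(2, 16000), torch.randn(2, 16000)
    print(generator_losses(D, fake, real, mel_fn).item(), discriminator_loss(D, fake, real).item())
```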
4.2.2. Training Recipes
The training of SpeechLMs involves three main aspects: features modeled, training stages, and speech interaction paradigms.
4.2.2.1. Features Modeled
These refer to the types of features or tokens output by the speech tokenizer and processed by the language model.
-
Discrete Features: Quantized representations of speech signals as distinct, countable units. These are the most common in
SpeechLMsas they align withTextLMtoken processing.- Semantic Tokens: Derived from
semantic understanding tokenizers(e.g.,HuBERT,wav2vec 2.0). Excel in content generation but may lack acoustic details.GSLM[50] comparedCPC,wav2vec 2.0, andHuBERT, findingHuBERTbest for speech resynthesis and generation.AudioPaLM[52] foundUSM-v2(USMv1variant) best forASRandSpeech Translation (ST).
- Paralinguistic Tokens: Integrated to capture expressive information (prosody, pitch, timbre).
pGSLM[56] usedfundamental frequency (F0)andunit durationalongsideHuBERTtokens.SPIRIT-LM[5] addedpitchandstyle tokens[148] toHuBERTtokens.
- Acoustic Tokens: Aim to capture features for high-fidelity speech reconstruction, typically from
neural audio codecmodels.Codec Language Models (CodecLMs): Directly modelcodec tokens(e.g.,Viola[57],NTPP[69] usingVQ-VAEtokens).
- Discussion: Semantic tokens lead to semantically coherent speech but lack acoustic details (requiring post-processing like diffusion). Acoustic tokens yield high-fidelity audio but may struggle with content accuracy. Solutions include hierarchical modeling (e.g.,
AudioLM[115] modeling semantic then acoustic tokens) or mixed tokens (e.g.,Moshi[9],SpeechGPT-Gen[60] jointly modeling semantic and acoustic).
- Semantic Tokens: Derived from
-
Continuous Features: Unquantized, real-valued representations (e.g.,
mel-spectrograms, latent embeddings).Spectron[61] predictsspectrogramsframe-by-frame.Mini-Omni[10] andSLAM-Omni[54] extract intermediate representations from a frozenWhisper encoder.LauraGPT[63] uses an audio encoder trained with the language model to derive latent representations.- Challenges: Require modifying
TextLMtraining pipelines and demand more storage.
4.2.2.2. Training Stages
SpeechLMs follow a multi-stage training process, similar to TextLMs.
-
Language Model Pre-Training: Learning statistical patterns and dependencies from large corpora.
- Training Data: Large-scale open-source speech data, including
ASR(e.g.,LibriSpeech,LibriLight,Gigaspeech),TTS(LibriTTS),Speech Translation (ST)(CoVoST2,CVSS), podcasts (Spotify Podcasts), and dialogues (Fisher). Some datasets include text transcripts for multi-modal learning. - Cold Initialization: Training
Transformer-based language models from scratch.GSLM[50] pioneered this, showingHuBERTtokens superior.SUTLM[64] compared four methods for modeling speech and text tokens:- Speech-only:
[SPEECH] S12 S34 S33 . S11 S59(Only the speech sequence is provided.) - Text-only:
[TEXT] A quick brown fox jumps over a lazy dog.(Only the text sequence is provided.) - Concatenated speech-text:
[SPEECH] S12 S34 S33 ... S11 S59 [TEXT] A quick brown fox jumps over a lazy dog.(The speech sequence and text sequence are concatenated together.) - Alternating speech-text:
[SPEECH] S12 S34 S33 [TEXT] brown fox jumps over a lazy [SPEECH] S11 S59(The sequence is interleaved with speech and text tokens.)SUTLMfoundalternating speech-textbest for cross-modal evaluations.
- Speech-only:
- Other architectures:
pGSLM[56] (multi-stream transformer),dGSLM[4] (dialogue transformer),LSLM[65] (streamingSSLencoder +TTSmodel).
- Continued Pre-Training: Initializing with pre-trained
TextLMweights to leverage linguistic knowledge.- Hassid et al. [51] found
OPT[3] andLLaMA[39] checkpoints beneficial, outperforming cold initialization. AudioPaLM[52] showed benefits from largerPaLMandPaLM-2checkpoints and larger datasets.- Text-Speech Representation Alignment:
- Single sequence:
SPIRIT-LM[5] showed interleaving text and speech tokens during continued pre-training significantly boosts performance.Spectron[61] andSpeechGPT[8] use joint supervision where input speech is transcribed to text,LLMpredicts text response, and text is synthesized to speech. - Multi-sequence:
Llama-Omni[11] usesLLMhidden states to decode text and generate discrete speech tokens simultaneously.Mini-Omni[10] generates one text sequence and seven acoustic token sequences in parallel.Moshi[9] generates text, semantic, and acoustic token sequences in parallel, aligned at the word level.
- Single sequence:
- Hassid et al. [51] found
- Discussion: Aligning text and speech representations is challenging, as text is concentrated knowledge. Trade-offs exist: text primarily conveys semantics (improves semantic capabilities but may compromise paralinguistic capture), and inference can be text-present (higher latency, better reasoning, less hallucination) or text-independent (more efficient but less stable).
- Training Data: Large-scale open-source speech data, including
-
Language Model Instruction-Tuning: Fine-tuning
SpeechLMsto follow specific instructions for diverse tasks.- Focus: Creating effective
instruction-following datasets. - Approaches:
SpeechGPT[8] andSpeechGPT-Gen[60]: Two-stage tuning (cross-modal and chain-of-modality). First,ASRdatasets are augmented with instructions for speech-to-text;TTSdata for text-to-speech. Second, text-based instruction datasets are converted to speech-in-speech-out usingTTS.Llama-Omni[11]: Synthesizes text-based datasets into speech-like prompts, uses anLLMto generate speech-patterned responses, and then synthesizes prompt/response pairs.COSMIC[68]: Constructed speech QA data using GPT-3.5 to generate Q&A pairs from TED talk transcripts.
- Focus: Creating effective
-
Language Model Post-Alignment: Refining
SpeechLMbehavior to align with human preferences, ensuring safety and reliability.- Techniques:
Reinforcement Learning from Human Feedback (RLHF), includingProximal Policy Optimization (PPO)[151] andDirect Preference Optimization (DPO)[152]. - Challenges: Unique to speech interaction.
Align-SLM[153]: Addresses inconsistent semantic content by using aTextLMto select preferred responses (after ASR transcription) and aligning preferences usingDPO.SpeechAlign[154]: Focuses on acoustic quality, using optimization to align the language model's output with "golden" token distributions to mitigate subpar speech generation from generated tokens.
- Importance: Critical for mitigating safety risks (e.g., toxicity, privacy) in generative models, which is currently under-explored for
SpeechLMs.
- Techniques:
4.2.2.3. Speech Interaction Paradigm
This refers to how SpeechLMs handle dynamic conversational flows beyond simple input-response.
-
Real-time Interaction: Advanced handling of conversation data, supporting natural interruptions and simultaneous communication.
- Streaming Tokenizers and Vocoders: Eliminate the need to wait for complete speech encoding, reducing latency.
- Full-duplex Modeling: Enables simultaneous bidirectional communication, allowing user interruption and simultaneous responses.
User interruption: Model can be interrupted and respond.Simultaneous response: Model processes input and generates output concurrently.
- Approaches:
dGSLM[4]: Employs separate transformers for each speaker withcross-attention layers.NTPP[69]: Uses anext-token-pair predictionapproach with adecoder-only Transformerfor dual channels.Moshi[9]: Concatenates user input and model response channels, processed by anRQ-Transformer.LSLM[65]: Models one speaker's speech with adecoder-only Transformer, integrating astreaming SSL encoderfor listening and speaking channel embeddings.
-
Interactive Period Recognition (IPR): The ability to discern when users are interacting with the model and when to remain silent.
- Importance: Creates natural conversational flow, avoids unnecessary interruptions, and allows the model to disregard irrelevant speech.
- Methods:
-
Voice Activity Detection (VAD)module:MiniCPM-o 2.6[72] integrates aVADto respond only when audio surpasses a threshold, ignoring noise. -
Trained distinction:
VITA[70] trains theSpeechLMto distinguishquery speechfromnon-query audio, outputting anend-of-sequence tokenfor non-query audio.The following table summarizes the popular choices of the three components in various SpeechLM papers.
-
The following are the results from Table II of the original paper:
| Approach | Speech Tokenizer | Language Model | Vocoder |
| Kimi-Audio [78] | Whisper Encoder [12] + Linear Projector | Qwen2.5 [79] | BigVGAN [80] |
| Qwen2.5-Omni [81] | Whisper | Qwen2.5 | Talker + Codec Decoder [81] |
| MinMo [82] | SenseVoice [83] | Qwen2.5 | CosyVoice 2 [84] |
| Lyra [85] | Whisper [12] | Qwen2-VL [86] | HuBERT + HiFi-GAN |
| Flow-Omni [87] | Whisper Encoder + Linear Projector | Qwen2 [41] | Flow Matching (Transformer + MLP) + HiFi-GAN |
| SLAM-Omni [54] | Whisper Encoder + Linear Projector | Qwen2 [41] | |
| OmniFlatten [53] | CosyVoice Encoder [88] | Qwen2 | CosyVoice Decoder [88] |
| SyncLLM [89] | HuBERT [33] | LLaMA-3 [2] | HiFi-GAN [47], [48] |
| EMOVA [90] | S SIL [91] | LLaMA-3 | VITS [92] |
| Freeze-Omni [67] IntrinsicVoice [94] | Transformer [38] | Qwen2 | TiCodec [93] |
| Mini-Omni2 [66] | HuBERT | Qwen2 | HiFi-GAN |
| SALMONN-omni [71] | Whisper | Qwen2 | Mini-Omni [10] |
| Zeng et al. [97] | Mamba Streaming Encoder [95] Whisper + VQ | GLM [42] | VoiceCraft [96] + Codec Decoder |
| NTPP [69] | VQ-VAE | LLaMA-3, Mistral, Gemma 2 | CosyVoice HiFi-GAN |
| GPST [98] | EnCodec [49] | Transformer | Codec Decoder |
| GLM-4-Voice [55] | Whisper + VQ [9] | GLM-4-9B-Base [42] | CosyVoice |
| Moshi [9] | Mimi [9] | Transformer* | Mimi |
| VITA [70] | CNN + Transformer + MLP [70] | Mixtral [43] | Text-to-Speech Toolkit [70] |
| LSLM [65] | vq-wav2vec [31] | Decoder-Only Transformer | |
| SPIRIT-LM [5] | HuBERT, VQ-VAE [77], speechprop | LLaMA-2 [40] | UniVATS [99] HiFi-GAN |
| TWIST [51] | HuBERT | OPT [3], LLaMA [39] | |
| PSLM [100] | HuBERT | HiFi-GAN | |
| VOXTLM [102] | HuBERT | NekoMata [101] | HiFi-GAN |
| Voicebox [103] | EnCodec | OPT [3] Transformer* [38] | HiFi-GAN |
| Park et al. [104] | AV-HuBERT [105] | OPT | HiFi-GAN |
| USDM [106] | XLS-R [107] | Mistral | Voicebox [108] |
| VioLA [57] | EnCodec | Transformer* | Codec Decoder [49] |
| FunAudioLLM [83] | SAN-M [109] | Transformer* | HiFTNet[110] |
| SpeechGPT-Gen [60] | SpeechTokenizer [34] | LLaMA-2 | SpeechTokenizer decoder [34] |
| COT [?] | SpeechTokenizer | LLaMA-2 | SoundStorm |
| AnyGPT [111] | SpeechTokenizer | LLaMA-2 | SoundStorm |
| LauraGPT [63] | Conformer* | Qwen [112] | Transformer + Codec Decoder |
| Spectron [61] | Conformer* | PaLM 2* [113] | WaveFit [114] |
| AudioLM [115] | w2v-BERT [32] | Decoder-Only Transformer* | SoundStream* [35] |
| UniAudio [116] | EnCodec, HiFi-Codec [117], RVQGAN [118] | Transformer* | Codec Decoder |
| Llama-Omni [11] | Whisper | LLaMA-3.1 | HiFi-GAN |
| Mini-Omni [10] | Whisper + ASR Adapter [10] | Qwen2 | TTS Adapter [10] |
| tGSLM [62] | Segmentation + SSE [119] + Lexical embedder | Transformer* | Tacotron-2 + Waveglow [45], [46] |
| SpeechGPT [8] | HuBERT | LLaMA | HiFi-GAN |
| dGSLM [4] | HuBERT | Dialogue Transformer [4] | HiFi-GAN |
| SUTLM [64] | HuBERT | Transformer* | |
| pGSLM [56] | HuBERT | MS-TLM [56] | HiFi-GAN |
| GSLM [50] | HuBERT, CPC [29], Wav2vec 2.0 [30] | Transformer* | Tacotron-2 + Waveglow |
5. Experimental Setup
This section outlines the common practices for evaluating Speech Language Models (SpeechLMs), drawing from the methodologies described in the surveyed papers. Since this is a survey, it does not present a single experimental setup for a new model, but rather a meta-analysis of the experimental setups prevalent in the field.
5.1. Datasets
SpeechLMs require vast amounts of data for both pre-training and instruction-tuning. The datasets used can be speech-only, text-only, or paired speech-text.
-
Pre-Training Datasets: Primarily large-scale open-source speech data.
- Automatic Speech Recognition (ASR) Datasets:
LibriSpeech[131]: 1k hours of English speech derived from audiobooks, paired with text.Multilingual LibriSpeech[132]: 50.5k hours of multilingual speech.LibriLight[133]: 60k hours, used forASRwith limited or no supervision.People dataset[134]: 30k hours of diverse English speech.VoxPopuli[135]: 1.6k hours, multilingual for representation learning, semi-supervised learning, and interpretation.Gigaspeech[136]: 40k hours, multi-domainASRcorpus.Common Voice[137]: 2.5k hours (as of 2019), massively multilingual speech corpus.VCTK[138]: 0.3k hours, forTTSand multi-speaker speech.WenetSpeech[139]: 22k hours of multi-domain Mandarin speech.
- Text-to-Speech (TTS) Datasets:
LibriTTS[140]: 0.6k hours, derived fromLibriSpeechforTTS.
- Speech Translation (ST) Datasets:
CoVoST2[141]: 2.8k hours, for massively multilingual speech-to-text translation.CVSS[142]: 1.9k hours, for massively multilingual speech-to-speech translation.
- Other Audio Datasets:
VoxCeleb[143] andVoxCeleb2[144]: For speaker identification (0.4k and 2.4k hours, respectively).Spotify Podcasts[145]: 47k hours of podcasts.Fisher[146]: 2k hours of telephone conversations.
- Automatic Speech Recognition (ASR) Datasets:
-
Instruction-Tuning Datasets: These are often constructed by augmenting existing datasets or synthesizing new ones.
SpeechInstruct* [8]: Synthesized from text-based instruction data.InstructS2S-200K* [11]: Synthesized instruction-following data.VoiceAssistant-400K* [10]: Synthesized instruction-following data. (Note: * indicates speech versions of text datasets synthesized usingTTS.)
These datasets are chosen to provide a broad range of speech contexts, languages, and tasks, enabling the SpeechLMs to learn robust representations and generalize across diverse applications. The inclusion of text transcripts in some datasets helps in learning the relationship between spoken and written forms.
5.2. Evaluation Metrics
Evaluating SpeechLMs involves both automatic (objective) and human (subjective) assessments to cover their diverse capabilities.
5.2.1. Automatic (Objective) Evaluation
- Representation Evaluation: Assesses how well speech features are encoded into meaningful vectors.
- ABX Score [156, 158]: Measures
embedding similarity and quantifies the separation of phonetic categories. It compares three sound samples: two from the same category (A) and one from a different category (B). The score reflects how often the system correctly identifies that two sounds from A are more similar to each other than one sound from A is to a sound from B. No single formula is universally provided for the ABX score, as it is a test paradigm often based on classification accuracy or distance metrics in embedding space. Conceptually, it measures $P\big(d(x, a) < d(x, b)\big)$, where $x$ and $a$ are from category A, $b$ is from category B, and $d(\cdot, \cdot)$ is a distance in the embedding space.
Word Error Rate (WER) or Character Error Rate (CER) is computed between the original speech's ASR transcription and the resynthesized speech's ASR transcription (a short Python sketch of the WER computation is given at the end of this subsection). - Word Error Rate (WER):
$
\mathrm{WER} = \frac{S + D + I}{N}
$
Where:
- $S$ is the number of substitutions (words replaced).
- $D$ is the number of deletions (words omitted).
- $I$ is the number of insertions (words added).
- $N$ is the total number of words in the reference (ground truth) transcription.
- Character Error Rate (CER): Calculated identically to
WER, but at the character level.
- Linguistic Evaluation: Assesses the model's ability to generate and understand lexical, syntactic, and semantic rules.
- sWUGGY [158]: Evaluates at the
lexical levelif the model can distinguish a real word from a non-real word in a pair. - sBLIMP [158]: Evaluates at the
syntactic levelif the model can identify the grammatically correct sentence from a pair. - Spoken StoryCloze [51]: Evaluates
semantic comprehensionby assessing the model's ability to choose the genuine ending of a story from two options. - All these are typically measured by comparing the model's
negative log-likelihood (NLL)for the correct choice versus the incorrect one.
- sWUGGY [158]: Evaluates at the
- Paralinguistic Evaluation: Focuses on non-verbal aspects like prosody, emotion, and timbre.
- For Prosodic Tokens (e.g.,
pGSLM[56]):- Correctness: Calculates
minimal Mean Absolute Error (min-MAE) of prosodic tokens between generated and reference samples: $ \mathrm{min\text{-}MAE} = \min_{j} \frac{1}{L} \sum_{i=1}^{L} |P_{gen}^{(i)} - P_{ref,j}^{(i)}| $ Where:
- $P_{gen}^{(i)}$ is the $i$-th prosodic token from a generated sample.
- $P_{ref,j}^{(i)}$ is the $i$-th prosodic token from the $j$-th reference sample.
- $L$ is the sequence length.
- The minimum is taken over multiple samples, indexed by $j$ in the formula.
- Consistency:
Pearson correlation between mean values of prompt prosodic tokens and generated continuation prosodic tokens: $ \rho_{X,Y} = \frac{\mathrm{cov}(X,Y)}{\sigma_X \sigma_Y} $ Where:
- $X$ and $Y$ are the mean prosodic values of the prompt and continuation, respectively.
- $\mathrm{cov}(X,Y)$ is their covariance.
- $\sigma_X$ and $\sigma_Y$ are their standard deviations.
- Expressiveness: Measured by the
standard deviation of generated prosody token values: $ \sigma = \sqrt{\frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})^2} $, where $x_i$ are the generated prosody token values and $\bar{x}$ is their mean.
- Correctness: Calculates
- Speech-Text Sentiment Preservation (STSP) [5]: Uses a
sentiment classifierto assess if the sentiment of the generated speech or text matches the prompt's sentiment.
- For Prosodic Tokens (e.g.,
- Generation Quality and Diversity:
- Area Under the Curve (AUC) with various temperature values on
perplexity and VERT.
- Perplexity (PPL): Measures how well a probability model predicts a sample. Lower PPL indicates better generation quality. $ \mathrm{PPL}(W) = P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}} = \sqrt[N]{\frac{1}{P(w_1 w_2 \ldots w_N)}} $ For a sequence $W = w_1 w_2 \ldots w_N$, the model estimates the probability of each word given the previous words: $ P(w_1 w_2 \ldots w_N) = \prod_{i=1}^{N} P(w_i \mid w_1 \ldots w_{i-1}) $
- VERT: Represents the geometric mean of the ratio of
k-gramsin generated speech that repeat at least once, measuring diversity. No standard formula provided in the paper; it's a specific metric introduced in GSLM [50].
- ChatGPT Score:
ASRtranscription of generated speech is sent toChatGPTfor quality assessment.
- Area Under the Curve (AUC) with various temperature values on
- Real-Time Interaction Evaluation: For streaming or full-duplex models.
- Naturalness of Dialogues: Examines
turn-taking eventslikeInter-Pausal Unit (IPU),pause,gaps, andoverlapping speech. Naturalness is high if these statistics resemble human dialogues [4]. - Usefulness (e.g.,
NTPP[69]):Reflective pauses:SpeechLM's ability to remain silent while the user speaks.Interruptions:SpeechLM's ability to cease speaking when interrupted.
- Benchmarks:
Full Duplex Bench[168] evaluates turn-taking;Talking Turns[169] uses a neural network to predict turn-taking events.
- Naturalness of Dialogues: Examines
- Downstream Evaluation: Evaluates
SpeechLM's performance on specific tasks (e.g.,ASR,TTS,Speaker Identification). Can be done viafew-shot examplesorinstruction-tuning.- Benchmarks:
SUPERB[155]: Variety of speech understanding tasks.SD-Eval[162]: Paralinguistic understanding (emotion, age, environment).SALMON[165]: Speech generation with consistent paralinguistic and environmental characteristics.VoiceBench[166]: GeneralSpeechLMcapabilities.Dynamic-SUPERB[163],MMAU[159],AirBench[161],AudioBench[160]: Extend to sound and music-related tasks.VoxEval[167]: Benchmarks knowledge understanding with speech I/O, various audio conditions, and spoken math reasoning.
- Benchmarks:
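As a concrete companion to the WER formula used throughout this section, here is a minimal Python sketch that computes WER via word-level edit distance; it is an illustrative implementation, not the exact scorer used by the surveyed benchmarks.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (S + D + I) / N via Levenshtein distance over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits turning the first i reference words into the first j hypothesis words.
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))  # 2 edits / 6 words ≈ 0.33
```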
5.2.2. Human (Subjective) Evaluation
- Mean Opinion Score (MOS): Quantifies perceived speech quality by human listeners.
-
Conceptual Definition: A group of evaluators rates audio samples on a predefined scale (e.g., 1-5, from poor to excellent). The
MOSis the average of these scores. -
Variants:
MMOS(overall quality),PMOS(prosody), andSMOS(timbre similarity). -
Alternatives: Machine-based evaluation using neural network models trained for naturalness prediction [170] or speaker identification can be used to assess naturalness and timbre similarity.
The following are the results from Table VI of the original paper:
| Name | Eval Type | # Tasks | Audio Type | I/O |
| ABX [156], [158] | Representation | 1 | Speech | A → − |
| sWUGGY [158] | Linguistic | 1 | Speech | A → − |
| sBLIMP [158] | Linguistic | 1 | Speech | A → − |
| sStoryCloze [51] | Linguistic | 1 | Speech | A/T → − |
| STSP [5] | Paralinguistic | 1 | Speech | A/T → A/T |
| MMAU [159] | Downstream | 27 | Speech, Sound, Music | A → T |
| AudioBench [160] | Downstream | 8 | Speech, Sound | A → T |
| AIR-Bench [161] | Downstream | 20 | Speech, Sound, Music | A → T |
| SD-Eval [162] | Downstream | 4 | Speech | A → T |
| SUPERB [163] | Downstream | 10 | Speech | A → T |
| VoxDialogue [164] | Downstream | 12 | Speech, Sound, Music | A → T |
| Dynamic-SUPERB [163] | Downstream | 180 | Speech, Sound, Music | A → T |
| SALMON [165] | Downstream | 8 | Speech | A → − |
| VoiceBench [166] | Downstream | 8 | Speech | A → T |
| VoxEval [167] | Downstream | 56 | Speech | A → A |
-
I/O, A, and T represent input/output modality, audio, and text, respectively. A dash ("-") indicates that the modality is not explicitly specified as input or output, or the task might be representation learning without a direct output modality.
5.3. Baselines
As this is a survey paper, it does not propose a single new method or conduct direct comparative experiments against specific baselines. Instead, it reviews numerous SpeechLM approaches and implicitly compares them by discussing their design choices, reported performance in their respective original papers, and the challenges they aim to address. The "baseline" in the context of this survey is the traditional pipeline, whose limitations serve as the primary motivation for the development of SpeechLMs. Individual SpeechLM papers would, of course, benchmark against other SpeechLMs or state-of-the-art ASR/TTS systems depending on their specific tasks.
6. Results & Analysis
Since this paper is a survey, it does not present novel experimental results from the authors' own model. Instead, it synthesizes the findings and trends observed across a multitude of published SpeechLM papers. This section, therefore, analyzes the general outcomes, commonalities, and implications derived from the surveyed literature regarding SpeechLM components, training, and capabilities.
6.1. Core Results Analysis
The survey highlights several key findings regarding the effectiveness and characteristics of SpeechLMs compared to the traditional pipeline:
- Superiority of End-to-End SpeechLMs: SpeechLMs demonstrate inherent advantages over the pipeline approach by mitigating information loss (especially of paralinguistic cues), significantly reducing latency, and preventing cumulative errors. This validates the core hypothesis that direct speech processing is a more effective paradigm for natural human-computer interaction.
- Impact of Speech Tokenizer Choice:
  - Semantic Tokenizers (e.g., HuBERT, wav2vec 2.0): Highly effective for SpeechLMs focused on semantic understanding and coherent content generation; HuBERT in particular stands out as a strong performer across tasks. However, speech generated solely from semantic tokens often lacks expressive acoustic detail.
  - Acoustic Tokenizers (e.g., Encodec, SoundStream): Excel at producing high-fidelity audio but can struggle with semantic accuracy in content generation.
  - Mixed Objective Tokenizers (e.g., SpeechTokenizer, Mimi): Show promise in balancing semantic and acoustic needs by distilling semantic information into acoustic representations, suggesting a path towards models that are both semantically accurate and acoustically rich.
- Language Model Training Strategies:
  - Continued Pre-Training from TextLMs: Initializing SpeechLMs with weights from powerful TextLMs (such as OPT, LLaMA, or PaLM) consistently yields better performance and faster convergence than cold initialization. This indicates that the linguistic knowledge embedded in TextLMs transfers well to speech-based tasks, even when speech is the primary modality (a minimal sketch of this setup appears after this list).
  - Text-Speech Alignment: Methods that explicitly align text and speech representations (e.g., through interleaved token sequences or multi-sequence generation) generally improve performance, particularly in cross-modal understanding and generation tasks.
  - Autoregressive Generation: The Transformer's decoder-only autoregressive architecture is widely adopted and proven effective for sequence generation in the speech domain, mirroring its success in text.
- Vocoder Dominance: GAN-based vocoders, particularly HiFi-GAN and its variants, are the most prevalent choice in SpeechLMs because they synthesize high-fidelity audio waveforms efficiently. The choice between direct synthesis and input-enhanced synthesis depends on how information-rich the generated speech tokens are.
- Emergence of Real-Time Interaction: While traditional SpeechLMs often generate a response only after receiving the full input, newer research focuses on real-time interaction paradigms. Streaming tokenizers/vocoders and full-duplex modeling are critical advancements for natural conversations, allowing interruptions and simultaneous speech. Interactive Period Recognition (IPR), e.g., via VAD or learned cues, is also gaining importance for a more natural conversational flow.
- Diverse Capabilities: SpeechLMs demonstrate a broad range of downstream applications across semantic, speaker-related, and paralinguistic domains. This versatility, spanning spoken dialogue, speech translation, ASR, TTS, speaker identification, emotion recognition, and paralinguistics-enhanced generation, underscores their potential as foundation models for spoken language.
- Instruction-Tuning Effectiveness: Instruction-tuning, often relying on synthesized data, effectively adapts SpeechLMs to follow instructions for various tasks, demonstrating their adaptability and generalizability.

In summary, the collective findings across the surveyed papers validate the SpeechLM paradigm as a powerful and promising direction for voice-based AI. The continuous development of specialized components, sophisticated training techniques, and advanced interaction paradigms is steadily closing the gap between human and machine speech communication.
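To illustrate the continued pre-training strategy, the following is a minimal sketch assuming a Hugging Face-style decoder-only checkpoint (facebook/opt-350m is used here only because OPT is openly available) and a hypothetical inventory of 500 discrete speech units named `<su_*>`. It shows the vocabulary-extension step and a next-token loss on an interleaved text-speech sequence, not the training recipe of any specific surveyed model.

```python
# Minimal sketch of "continued pre-training": a text LM checkpoint is extended
# with discrete speech-unit tokens so training can continue on speech (or
# interleaved text-speech) data. Model name, unit count, and token naming are
# illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "facebook/opt-350m"                # any decoder-only TextLM checkpoint
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

num_units = 500                            # e.g., a HuBERT k-means cluster count
speech_tokens = [f"<su_{i}>" for i in range(num_units)]
tokenizer.add_tokens(speech_tokens)        # new embedding rows are randomly initialized
model.resize_token_embeddings(len(tokenizer))

# Interleaved text-speech alignment: spans of either modality are concatenated
# into one sequence and trained with the usual next-token objective.
example = "the cat <su_12> <su_12> <su_87> sat on the <su_301> <su_44>"
batch = tokenizer(example, return_tensors="pt")
loss = model(**batch, labels=batch["input_ids"]).loss
loss.backward()                            # gradients reach both old and new embeddings
```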
6.2. Data Presentation (Tables)
The following are the results from Table III of the original paper, summarizing popular datasets used in the pre-training and instruction-tuning phases of SpeechLMs. * means it is the speech version of the text dataset synthesized using TTS. S2ST and S2TT represent Speech-to-Speech Translation and Speech-to-Text Translation, respectively.
| Dataset | Type | Phase | Hours | Year |
|---|---|---|---|---|
| LibriSpeech [131] | ASR | Pre-Training | 1k | 2015 |
| Multilingual LibriSpeech [132] | ASR | Pre-Training | 50.5k | 2020 |
| LibriLight [133] | ASR | Pre-Training | 60k | 2019 |
| People dataset [134] | ASR | Pre-Training | 30k | 2021 |
| VoxPopuli [135] | ASR | Pre-Training | 1.6k | 2021 |
| Gigaspeech [136] | ASR | Pre-Training | 40k | 2021 |
| Common Voice [137] | ASR | Pre-Training | 2.5k | 2019 |
| VCTK [138] | ASR | Pre-Training | 0.3k | 2017 |
| WenetSpeech [139] | ASR | Pre-Training | 22k | 2022 |
| LibriTTS [140] | TTS | Pre-Training | 0.6k | 2019 |
| CoVoST2 [141] | S2TT | Pre-Training | 2.8k | 2020 |
| CVSS [142] | S2ST | Pre-Training | 1.9k | 2022 |
| VoxCeleb [143] | Speaker Identification | Pre-Training | 0.4k | 2017 |
| VoxCeleb2 [144] | Speaker Identification | Pre-Training | 2.4k | 2018 |
| Spotify Podcasts [145] | Podcast | Pre-Training | 47k | 2020 |
| Fisher [146] | Telephone conversation | Pre-Training | 2k | 2004 |
| SpeechInstruct* [8] | Instruction-following | Instruction-Tuning | - | 2023 |
| InstructS2S-200K* [11] | Instruction-following | Instruction-Tuning | - | 2024 |
| VoiceAssistant-400K* [10] | Instruction-following | Instruction-Tuning | - | 2024 |
6.3. Ablation Studies / Parameter Analysis
As a survey, the paper does not conduct its own ablation studies or parameter analyses. Instead, it synthesizes the implications of such studies from the surveyed literature, particularly regarding the choice of components and training methodologies.
- Component Choice (Speech Tokenizer): The comparative analyses highlighted by the survey (e.g., GSLM [50] comparing CPC, wav2vec 2.0, and HuBERT; AudioPaLM [52] experimenting with w2v-bert, USM-v1, and USM-v2) serve as a form of "meta-ablation." They show that semantic tokenizers like HuBERT are crucial for content understanding, while acoustic tokenizers (e.g., Encodec) are vital for high-fidelity audio. The emergence of mixed objective tokenizers directly addresses the trade-offs found in these implicit "ablation studies," aiming to combine the strengths of both (a minimal sketch of a discrete semantic tokenizer follows this list).
- Training Strategies (Cold vs. Continued Pre-Training): The observation that continued pre-training on TextLM checkpoints significantly outperforms cold initialization [51] is a crucial "ablation-like" finding. It demonstrates the immense value of leveraging existing linguistic knowledge from TextLMs, effectively acting as an ablation of the TextLM pre-training stage. The performance differences also show that not all pre-trained checkpoints (e.g., image-pretrained ones) are equally beneficial, indicating the specificity of modality transfer.
- Text-Speech Alignment: The effectiveness of different text-speech alignment methods (single-sequence interleaving vs. multi-sequence generation, text-present vs. text-independent inference), as discussed in SPIRIT-LM [5], SUTLM [64], Llama-Omni [11], Mini-Omni [10], and Moshi [9], represents an exploration of architectural choices akin to parameter analysis. These works implicitly ablate different ways of integrating text and speech within the language model, revealing trade-offs in reasoning ability, latency, and hallucination.
- Features Modeled (Discrete vs. Continuous, Semantic vs. Paralinguistic): The discussions around the trade-offs of discrete vs. continuous features and the integration of paralinguistic tokens (e.g., F0, unit duration, pitch, style tokens) directly analyze how different input representations affect SpeechLM capabilities. For instance, pGSLM [56] and SPIRIT-LM [5] essentially perform an ablation by adding paralinguistic tokens to a base semantic tokenizer, demonstrating the impact on expressiveness without significantly compromising semantic understanding.

In essence, while the survey itself does not present new experimental data, it rigorously synthesizes the results of numerous individual research papers, acting as a high-level analysis of how different design choices and training parameters affect the overall performance and capabilities of SpeechLMs.
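As a concrete illustration of the semantic-tokenizer side of these comparisons, the sketch below quantizes HuBERT features into discrete units with k-means. The checkpoint name, the `utterance.wav` file, and the 100-cluster codebook are assumptions for illustration, and fitting k-means on a single utterance (rather than a large corpus, as real systems do) only demonstrates the quantization step.

```python
# Minimal, illustrative sketch of a discrete "semantic" tokenizer: continuous
# HuBERT features are quantized with k-means into unit IDs, the kind of
# discrete input that many SpeechLMs model.
import torch
import torchaudio
from sklearn.cluster import KMeans
from transformers import HubertModel, Wav2Vec2FeatureExtractor

ckpt = "facebook/hubert-base-ls960"
extractor = Wav2Vec2FeatureExtractor.from_pretrained(ckpt)
hubert = HubertModel.from_pretrained(ckpt).eval()

wav, sr = torchaudio.load("utterance.wav")                  # hypothetical input file
wav = torchaudio.functional.resample(wav, sr, 16_000).mean(0)  # mono, 16 kHz

with torch.no_grad():
    inputs = extractor(wav.numpy(), sampling_rate=16_000, return_tensors="pt")
    feats = hubert(**inputs).last_hidden_state[0]           # (frames, 768), ~50 Hz

# In practice the k-means codebook is fit offline on a large corpus; fitting it
# on one utterance here only illustrates the quantization step.
units = KMeans(n_clusters=100).fit_predict(feats.numpy())
print(units[:20])                                           # discrete speech units
```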
7. Conclusion & Reflections
7.1. Conclusion Summary
This survey provides a pioneering and comprehensive overview of recent advancements in Speech Language Models (SpeechLMs), addressing the inherent limitations of the traditional Automatic Speech Recognition (ASR) + Large Language Model (LLM) + Text-to-Speech (TTS) pipeline. The paper effectively highlights the key advantages of SpeechLMs, including their ability to retain rich paralinguistic information, significantly reduce latency, and mitigate error accumulation through end-to-end speech processing.
The authors meticulously detail the architectural components of SpeechLMs (speech tokenizers, language models, and vocoders), categorizing various approaches within each. They also systematically review the diverse training recipes, encompassing different feature modeling techniques (discrete vs. continuous, semantic vs. acoustic vs. mixed), multi-stage training processes (pre-training, instruction-tuning, post-alignment), and advanced speech interaction paradigms (real-time interaction, interactive period recognition). Furthermore, the survey comprehensively outlines the wide range of downstream applications, classifying them into semantic-, speaker-, and paralinguistic-related tasks, and categorizes existing evaluation metrics and benchmarks.
Overall, the paper concludes that SpeechLMs are a promising alternative to pipeline-based systems, offering a more natural, efficient, and expressive way for humans to interact with AI models through speech.
7.2. Limitations & Future Work
The survey identifies several crucial challenges and outlines promising future research directions:
- Understanding Different Component Choices: Current comparisons of speech tokenizers, language models, and vocoders are often limited in scope. A comprehensive, systematic comparison of these diverse components is needed to guide efficient SpeechLM development.
- End-to-End Training: While SpeechLMs are conceptually end-to-end, many implementations still train components separately. Investigating fully end-to-end training, in which gradients propagate from the vocoder output back to the tokenizer input, could unlock more coherent and higher-fidelity speech generation.
- Real-Time Speech Generation: Latency in SpeechLMs remains a significant hurdle for truly natural, real-time human interaction. Future work should focus on streamable pipelines and SpeechLMs that can autonomously generate audio in waveform chunks to minimize delay.
- Safety Risks in SpeechLMs: This area is largely under-investigated compared to TextLLMs. SpeechLMs introduce unique safety concerns related to toxicity (e.g., generating acoustically inappropriate content such as erotic speech) and privacy (e.g., inferring sensitive speaker attributes such as ethnicity or religious beliefs from acoustic features). Robust research is needed to identify and mitigate these vulnerabilities.
- Performance on Rare Languages: SpeechLMs have the potential to address the low-resource language challenge more effectively than TextLMs, because spoken data is often more abundant than written text for such languages. Future research should prioritize training SpeechLMs on these languages and dialects to expand global accessibility.
7.3. Personal Insights & Critique
This survey is a highly valuable resource for anyone entering or working in the field of Speech Language Models. Its comprehensive nature, detailed breakdown of components, and systematic classification of training and evaluation methods are exceptional. The paper's emphasis on the "beginner-friendly" aspect is well-executed through clear explanations and structured analysis.
Inspirations and Applications:
The vision of SpeechLMs that can genuinely understand and generate speech with full paralinguistic nuance is incredibly inspiring. Such models could revolutionize human-computer interaction, making AI assistants feel truly natural and empathetic. Beyond general conversation, the ability to generate speech conditioned on specific voices, emotions, or styles has profound implications for accessibility (e.g., personalized voice assistants for individuals with speech impairments), content creation (e.g., expressive audiobook narration), and even mental health applications (e.g., AI companions that respond with appropriate emotional tone). The potential for SpeechLMs to bridge communication gaps for low-resource languages by leveraging spoken data directly is a powerful application that could empower vast communities.
Potential Issues and Areas for Improvement:
- Complexity of Component Interactions: While the survey dissects each component, the intricate interactions and co-optimization challenges across the tokenizer, LM, and vocoder could be emphasized further. The choice of one component (e.g., a semantic vs. an acoustic tokenizer) deeply affects the others and overall system performance. A deeper dive into how researchers have balanced these interdependencies in actual implementations, beyond simply listing chosen components, would be beneficial.
- Computational Cost: The sheer scale of SpeechLMs, especially those leveraging large TextLLM backbones and requiring high-fidelity vocoders, implies massive computational requirements for training and inference. The survey mentions latency as a challenge but could explicitly discuss the energy footprint and hardware demands, which are critical for sustainable AI development and wider deployment.
- Data Quality and Bias: While large datasets are crucial, their quality and potential biases receive less discussion. Speech data often carries inherent biases (e.g., gender, accent, socioeconomic status) that SpeechLMs can amplify. A more explicit discussion of how these biases are being addressed (or need to be addressed) in SpeechLM training and evaluation would be valuable, especially given the identified safety risks.
- "Black Box" Nature: Like TextLLMs, SpeechLMs are complex neural networks whose interpretability is a significant challenge. How do these models "learn" paralinguistic features? Can semantic understanding be disentangled from emotional understanding? Future surveys might delve into emerging interpretability techniques for SpeechLMs.
- Multilinguality and Code-Switching: While multilingual datasets are mentioned, the specific challenges and advances in SpeechLMs for code-switching (mixing languages within a single utterance) form a nuanced area that could be explored in more detail, as it reflects natural human communication patterns.

Despite these minor points, this survey stands as an excellent foundational text for understanding the current landscape and future trajectory of Speech Language Models. Its rigorous approach and clear presentation will undoubtedly aid researchers in navigating this dynamic and impactful field.