Paper status: completed

Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers

Published: 01/05/2023
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

The paper presents VALL-E, a novel Text-to-Speech method using a neural codec language model. It reformulates TTS as conditional language modeling, achieving high-quality personalized speech synthesis with just 3 seconds of an unseen speaker's recording, and significantly improving speech naturalness and speaker similarity over the state-of-the-art zero-shot TTS system.

Abstract

We introduce a language modeling approach for text to speech synthesis (TTS). Specifically, we train a neural codec language model (called Vall-E) using discrete codes derived from an off-the-shelf neural audio codec model, and regard TTS as a conditional language modeling task rather than continuous signal regression as in previous work. During the pre-training stage, we scale up the TTS training data to 60K hours of English speech which is hundreds of times larger than existing systems. Vall-E emerges in-context learning capabilities and can be used to synthesize high-quality personalized speech with only a 3-second enrolled recording of an unseen speaker as an acoustic prompt. Experiment results show that Vall-E significantly outperforms the state-of-the-art zero-shot TTS system in terms of speech naturalness and speaker similarity. In addition, we find Vall-E could preserve the speaker's emotion and acoustic environment of the acoustic prompt in synthesis. See https://aka.ms/valle for demos of our work.

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

The title is "Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers". The central topic of the paper is a novel approach to Text-to-Speech (TTS) synthesis, specifically focusing on zero-shot generation of personalized speech using a neural codec language model.

1.2. Authors

The paper lists multiple authors from Microsoft, including Chengyi Wang, Sanyuan Chen, Yu Wu, Ziqiang Zhang, Long Zhou, Shujie Liu, Zhuo Chen, Yanqing Liu, Huaming Wang, Jinyu Li, Lei He, Sheng Zhao, and Furu Wei. Their research backgrounds appear to be in speech processing, natural language processing, and artificial intelligence, given their affiliation with Microsoft and the nature of the research.

1.3. Journal/Conference

The paper was published on arXiv, which is a preprint server for scientific articles. While arXiv is not a peer-reviewed journal or conference in itself, it is a highly influential platform for rapid dissemination of research in fields like AI, machine learning, and computer science. Papers often appear on arXiv before or concurrently with their submission to, or acceptance by, formal conferences or journals.

1.4. Publication Year

The paper was published on January 5, 2023.

1.5. Abstract

The paper introduces a language modeling approach for Text-to-Speech (TTS) synthesis named VALL-E. It trains a neural codec language model using discrete codes derived from an off-the-shelf neural audio codec model, reframing TTS as a conditional language modeling task rather than continuous signal regression. During pre-training, VALL-E is scaled up to use 60,000 hours of English speech, which is hundreds of times larger than existing systems. This extensive training enables VALL-E to develop in-context learning capabilities, allowing it to synthesize high-quality personalized speech from an unseen speaker with just a 3-second enrolled recording as an acoustic prompt. Experimental results demonstrate that VALL-E significantly surpasses the state-of-the-art zero-shot TTS systems in terms of speech naturalness and speaker similarity. Furthermore, the model is found to preserve the speaker's emotion and the acoustic environment of the prompt during synthesis.

The official source link is https://arxiv.org/abs/2301.02111, and the PDF link is https://arxiv.org/pdf/2301.02111v1.pdf. It is currently published as a preprint on arXiv.

2. Executive Summary

2.1. Background & Motivation

2.1.1. Core Problem and Importance

The core problem the paper aims to solve is the limitation of current Text-to-Speech (TTS) systems in generating high-quality, personalized speech for unseen speakers in a zero-shot scenario. Current TTS systems, whether cascaded (acoustic model + vocoder) or end-to-end, face several challenges:

  • Data Scarcity: They typically require high-quality, clean data from recording studios. Large-scale data crawled from the internet often leads to performance degradation due to noise and variations.

  • Poor Generalization: Due to relatively small training datasets (dozens to hundreds of hours), existing systems suffer from poor generalization. Speaker similarity and speech naturalness decline dramatically for speakers not present in the training data.

  • Complex Customization: To tackle zero-shot TTS, prior work often relies on speaker adaptation (requiring additional fine-tuning), speaker encoding (requiring complex pre-designed features or heavy structural engineering), or meta-learning, which adds complexity and still limits efficiency.

    This problem is crucial because the ability to customize a TTS system to an arbitrary voice using only a minimal enrolled recording is highly desirable for practical applications, enabling personalized voice assistants, content creation, and accessibility tools without extensive data collection or fine-tuning for each new speaker.

2.1.2. Paper's Entry Point and Innovative Idea

The paper's entry point is inspired by the success of large language models (LLMs) in text synthesis (e.g., GPT-3, PaLM), which demonstrated that scaling up models with large and diverse data can lead to emergent capabilities like in-context learning. The innovative idea is to transfer this success to speech synthesis by:

  1. Reframing TTS as a Conditional Language Modeling Task: Instead of the traditional approach of continuous signal regression (e.g., predicting mel spectrograms), VALL-E treats TTS as predicting discrete audio tokens.
  2. Utilizing Discrete Audio Codec Codes: It leverages an off-the-shelf neural audio codec model (specifically, EnCodec) to tokenize raw audio into discrete codes. These codes are rich enough to retain speaker identity, emotion, and acoustic environment information, unlike simpler self-supervised speech representations that discard speaker characteristics.
  3. Massive Scale-Up of Training Data: VALL-E is pre-trained on an unprecedented 60,000 hours of English speech, which is hundreds of times more than previous TTS systems. This large, diverse, and multi-speaker dataset is semi-supervised (using ASR-generated transcriptions for audio-only data), suggesting robustness to noise.
  4. Enabling In-Context Learning: By treating speech synthesis as a language modeling task with discrete tokens, VALL-E can leverage prompting-based large-model techniques, exhibiting in-context learning capabilities. This allows it to synthesize high-quality personalized speech from unseen speakers with just a 3-second acoustic prompt, without any fine-tuning or complex architectural modifications.

2.2. Main Contributions / Findings

The primary contributions and key findings of the paper are:

  • Introduction of VALL-E, a Novel TTS Framework: VALL-E is the first TTS framework that leverages a language model approach with audio codec codes as intermediate representations, replacing traditional mel spectrograms. This design enables strong in-context learning capabilities, similar to GPT-3, for zero-shot TTS without requiring additional structure engineering, pre-designed acoustic features, or fine-tuning.
  • Demonstration of Data Scaling for TTS: The paper builds a generalized TTS system by pre-training on an unprecedented 60,000 hours of semi-supervised speech data. This highlights that simply scaling up semi-supervised data has been underestimated in the TTS field and can lead to significant performance improvements and generalization across speakers.
  • Achieving State-of-the-Art Zero-Shot TTS: VALL-E significantly outperforms the state-of-the-art zero-shot TTS system on both LibriSpeech and VCTK datasets in terms of speech naturalness (CMOS) and speaker similarity (SMOS). On VCTK, it even achieves a CMOS score comparable to ground truth, suggesting synthesized speech is as natural as human recordings.
  • Emergence of Advanced Synthesis Capabilities: VALL-E demonstrates several emergent capabilities:
    • Diverse Outputs: It can generate diverse synthesized results (e.g., varying pace, accent, or prosody) for the same input text and target speaker by using different sampling strategies during inference.

    • Acoustic Environment Maintenance: The model can preserve the acoustic environment (e.g., reverberation) present in the acoustic prompt during synthesis.

    • Speaker's Emotion Maintenance: VALL-E can maintain the speaker's emotion from the acoustic prompt, even without explicit emotional TTS training.

      These findings solve the problem of poor generalization and complex customization in zero-shot TTS by offering a scalable, data-driven, and prompt-based solution that achieves high-quality, personalized speech synthesis for unseen speakers.

3. Prerequisite Knowledge & Related Work

This section provides the foundational knowledge and reviews the technological landscape relevant to understanding the VALL-E paper.

3.1. Foundational Concepts

3.1.1. Text-to-Speech (TTS) Synthesis

Text-to-Speech (TTS) synthesis is the artificial production of human speech from text. The goal is to convert written language into spoken language that sounds natural and intelligible. Early systems used concatenative synthesis (piecing together pre-recorded speech units), while modern systems primarily use neural networks.

3.1.2. Neural Networks

Neural networks are a class of machine learning models inspired by the structure and function of the human brain. They consist of interconnected nodes (neurons) organized in layers. They learn to recognize patterns in data through training, adjusting the strengths of connections (weights) between neurons. In TTS, they are used to learn the complex mapping from text to acoustic features or directly to waveforms.

3.1.3. Mel Spectrograms

A mel spectrogram is a visual representation of the spectrum of frequencies of a sound signal as it varies with time, but with the frequencies transformed onto the mel scale. The mel scale is a perceptual scale of pitches judged by listeners to be equal in distance from one another. Mel spectrograms are widely used as an intermediate representation in traditional TTS systems because they capture human auditory perception more accurately than linear frequency scales, providing a compact and perceptually relevant representation of speech.

3.1.4. Vocoder

A vocoder (voice encoder-decoder) is a system that analyzes and synthesizes the human voice. In TTS pipelines, after an acoustic model generates mel spectrograms (or other acoustic features) from text, a vocoder is responsible for converting these acoustic features back into a raw audio waveform. Examples include WaveNet, WaveGlow, and HifiGAN.

3.1.5. Zero-Shot Text-to-Speech (TTS)

Zero-shot TTS refers to the ability of a TTS system to synthesize speech in the voice of a speaker it has never encountered during its training phase, using only a very short (e.g., 3-second) audio sample of that speaker's voice as a prompt. This is a challenging task because the model must generalize speaker-specific characteristics from limited information.

3.1.6. Language Model (LM)

A Language Model (LM) is a statistical model that determines the probability of a sequence of words or tokens. In recent years, large neural language models like GPT-3 have demonstrated remarkable abilities in generating coherent and contextually relevant text by learning patterns from vast amounts of text data. The VALL-E paper re-frames TTS as a conditional language modeling task on discrete audio tokens.

3.1.7. Discrete Codes / Acoustic Tokens

Discrete codes (or acoustic tokens) are quantized, distinct numerical representations of raw audio. Instead of representing audio as a continuous signal or a continuous feature like a mel spectrogram, discrete codes assign specific, finite integer values to segments of audio. This process is similar to how words are discrete tokens in text. Using discrete codes allows TTS to be treated as a sequence-to-sequence problem, where the model predicts a sequence of discrete audio tokens rather than continuous values.

3.1.8. Neural Audio Codec

A neural audio codec is a type of neural network specifically designed to encode (compress) audio waveforms into a more compact, often discrete, representation and then decode (decompress) that representation back into a high-fidelity audio waveform. Unlike traditional audio codecs (e.g., MP3), neural codecs learn a more perceptually relevant compression, often retaining crucial information like speaker identity, emotion, and acoustic environment even at very low bitrates. EnCodec [Défossez et al., 2022] is an example used in VALL-E.
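As a concrete illustration, the open-source `encodec` package can turn a waveform into exactly the kind of discrete code matrix discussed here. The snippet below follows the package's documented usage; the file name `speech.wav` is a placeholder, and this is a usage sketch rather than the paper's pipeline.

```python
# Hedged sketch: tokenize audio with the open-source EnCodec package
# (usage per its public documentation); "speech.wav" is a placeholder path.
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

model = EncodecModel.encodec_model_24khz()   # 24 kHz codec, as used in the paper
model.set_target_bandwidth(6.0)              # 6 kbps -> eight residual codebooks

wav, sr = torchaudio.load("speech.wav")      # placeholder input file
wav = convert_audio(wav, sr, model.sample_rate, model.channels).unsqueeze(0)

with torch.no_grad():
    encoded_frames = model.encode(wav)
# Each frame holds a tensor of codes shaped [batch, n_codebooks, time].
codes = torch.cat([frame[0] for frame in encoded_frames], dim=-1)
print(codes.shape)                           # e.g. torch.Size([1, 8, 75 * seconds])
```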

3.1.9. In-Context Learning

In-context learning is an emergent capability observed in large language models. It refers to the model's ability to learn a new task or adapt to a new style (e.g., a new speaker's voice) by simply being provided with a few examples or a prompt within the input sequence, without requiring any updates to its internal parameters (i.e., no fine-tuning). The model implicitly "learns" from the context provided in the prompt.

3.1.10. Autoregressive (AR) and Non-Autoregressive (NAR) Models

  • Autoregressive (AR) Model: An autoregressive model predicts the next element in a sequence based on all previously predicted elements in that same sequence. Each output step depends on the outputs of the preceding steps. This makes them good for modeling dependencies but can be slow during inference as prediction must be sequential.
  • Non-Autoregressive (NAR) Model: A non-autoregressive model predicts all elements of a sequence simultaneously or in parallel, without relying on previously predicted elements within the same output sequence. This can significantly speed up inference but might struggle with capturing long-range dependencies or maintaining consistency compared to AR models.

3.1.11. Transformer

The Transformer is a neural network architecture introduced by Vaswani et al. (2017) that relies entirely on self-attention mechanisms to process input sequences. Unlike recurrent neural networks (RNNs) or convolutional neural networks (CNNs), Transformers can process all parts of an input sequence in parallel, making them highly efficient for long sequences and capable of capturing long-range dependencies. They consist of an encoder and a decoder stack, each with multi-head self-attention and feed-forward layers. VALL-E uses Transformer decoder architecture.

3.1.12. Residual Vector Quantization (RVQ)

Residual Vector Quantization (RVQ) is a technique used in vector quantization (VQ) where the quantization error (the difference between the original signal and its quantized version) is progressively quantized by subsequent quantizers. Instead of a single quantizer trying to capture all information, RVQ uses a cascade of quantizers. The first quantizer captures the most significant information, and each subsequent quantizer learns to encode the residual (the error) from the previous stage. This hierarchical approach allows for efficient compression and reconstruction, with earlier stages capturing coarser features (like speaker identity) and later stages refining finer details.
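To make the cascade concrete, here is a minimal, self-contained RVQ sketch in NumPy. The codebook sizes are toy values and the codebooks are random rather than learned, so this only illustrates the encode/decode mechanics (EnCodec uses eight learned codebooks of 1024 entries).

```python
# Minimal residual vector quantization (RVQ) sketch with random, toy codebooks.
import numpy as np

def rvq_encode(x, codebooks):
    """Quantize vector x with a cascade of codebooks; return one index per stage."""
    residual = x.copy()
    codes = []
    for cb in codebooks:                          # each cb has shape (entries, dim)
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        codes.append(idx)                         # nearest codeword for the residual
        residual = residual - cb[idx]             # pass the error to the next stage
    return codes

def rvq_decode(codes, codebooks):
    """Reconstruct by summing the selected codeword of every stage."""
    return sum(cb[idx] for idx, cb in zip(codes, codebooks))

rng = np.random.default_rng(0)
dim, stages, entries = 8, 4, 16                   # toy sizes for illustration
codebooks = [rng.normal(size=(entries, dim)) for _ in range(stages)]
x = rng.normal(size=dim)
codes = rvq_encode(x, codebooks)
x_hat = rvq_decode(codes, codebooks)
print(codes, np.linalg.norm(x - x_hat))           # later stages shrink the error
```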

3.2. Previous Works

The paper contextualizes VALL-E by discussing advancements in zero-shot TTS and spoken generative pre-trained models.

3.2.1. Zero-Shot TTS

  • Cascaded TTS Systems:
    • Traditional TTS systems often follow a cascaded pipeline, typically comprising an acoustic model (e.g., Tacotron2 [Shen et al., 2018], FastSpeech [Ren et al., 2019], Transformer TTS [Li et al., 2019]) that converts text to mel spectrograms, followed by a vocoder (e.g., WaveNet, WaveGlow, HifiGAN) that synthesizes the waveform from the mel spectrograms.
    • Acoustic Model Example (Tacotron2): Tacotron2 is an autoregressive model that predicts a mel spectrogram from a sequence of input characters. It consists of an encoder (character embeddings, convolutional layers, Bi-directional LSTM) and a decoder (LSTM with attention mechanism) which generates mel frames one by one. The attention mechanism is crucial for aligning input characters with output mel frames. $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $ Where $Q$ is the Query matrix (from the decoder), $K$ is the Key matrix (from the encoder), $V$ is the Value matrix (from the encoder), and $d_k$ is the dimension of the keys, used for scaling.
  • End-to-End TTS Models:
    • These models aim to optimize the acoustic model and vocoder jointly, synthesizing speech directly from text without an explicit intermediate mel spectrogram stage or by integrating the vocoder into the training (e.g., Conditional VAE with Adversarial Learning [Kim et al., 2021], DelightfulTTS 2 [Liu et al., 2022]).
  • Approaches for Zero-Shot TTS:
    • Speaker Adaptation: Pioneers like Arik et al. [2018] proposed methods to adapt a pre-trained TTS model to a new speaker with a small amount of target speaker data. Subsequent works [Chen et al., 2019, Wang et al., 2020, Chen et al., 2021] focused on improving adaptation efficiency. Meta-TTS [Huang et al., 2022] applied meta-learning to this, requiring only a few shots (examples) for adaptation.
    • Speaker Encoding-based Methods: These methods use a separate speaker encoder to extract a speaker embedding (a fixed-size vector representation of a speaker's voice) from an enrolled recording. This embedding is then fed to the TTS component. The speaker encoder can be pre-trained on speaker verification tasks (e.g., Jia et al. [2018]). Advanced speaker embedding models [Cai et al., 2018] or complex speaker encoder designs [Wu et al., 2022] are used to improve quality for unseen speakers, but Tan et al. [2021] note that quality for unseen speakers often remains undesirable.
    • Diffusion Models: More recently, diffusion model-based TTS [Popov et al., 2021, Kim et al., 2022] has been extended to zero-shot TTS [Kang et al., 2022].
  • Differentiation: VALL-E follows a cascaded approach (uses an external neural codec decoder) but differs fundamentally by using audio codec codes as intermediate representations instead of mel spectrograms, and by exhibiting in-context learning without fine-tuning or complex speaker encoder engineering.

3.2.2. Spoken Generative Pre-trained Models

  • Self-supervised Speech Models:
    • Self-supervised learning is widely used in speech understanding (e.g., wav2vec 2.0 [Baevski et al., 2020b], HuBERT [Hsu et al., 2021], WavLM [Chen et al., 2022]). These models learn representations from unlabeled speech data.
    • vq-wav2vec [Baevski et al., 2020a] and HuBERT [Hsu et al., 2021] extract discrete tokens from speech, which can reconstruct content but often discard speaker identity and have low reconstruction quality [Borsos et al., 2022].
  • Speech-to-Speech Generation:
    • GSLM [Lakhotia et al., 2021] synthesizes speech based on HuBERT codes.
    • Polyak et al. [2021] improved GSLM by combining HuBERT codes with VQVAE codes and a speaker encoder.
    • AudioLM [Borsos et al., 2022] trains speech-to-speech language models using k-means tokens from self-supervised models and acoustic tokens from neural codec models [Zeghidour et al., 2022] to generate high-quality speech. AudioLM can synthesize speech from audio codecs without an additional vocoder.
  • Pre-training for Neural TTS:
    • Chung et al. [2018] pre-trained a speech decoder in TTS using autoregressive mel-spectrogram prediction.
    • SpeechT5 [Ao et al., 2022] proposed a unified-modal encoder-decoder framework that leverages unlabeled speech and text for pre-training all TTS components.
    • Tjandra et al. [2019] quantized unlabeled speech into discrete tokens using VQVAE [van den Oord et al., 2017] and trained a token-to-speech sequence model, demonstrating efficiency with small fine-tuning data.
    • A3t [Bai et al., 2022] proposed mask and reconstruction on mel spectrograms for speech editing and synthesis.
  • Differentiation: Previous TTS pre-training work used less than 1K hours of data, whereas VALL-E uses 60K hours. Crucially, VALL-E is the first to use audio codec codes as intermediate representations and to exhibit in-context learning for zero-shot TTS. Unlike AudioLM which is speech-to-speech, VALL-E is text-to-speech, providing explicit content control.

3.3. Technological Evolution

The field of speech synthesis has evolved significantly:

  1. Rule-based/Concatenative Systems: Early TTS systems relied on linguistic rules and concatenating pre-recorded phonetic units.
  2. Statistical Parametric Synthesis (e.g., HMM-based): Hidden Markov Models (HMMs) were used to model speech features, offering more flexibility but often producing robotic-sounding speech.
  3. Deep Learning Era (Cascaded Systems): The last decade saw dramatic breakthroughs with neural networks. Cascaded systems like Tacotron2 + WaveNet improved naturalness by mapping text to mel spectrograms (acoustic model) and then to waveforms (vocoder).
  4. End-to-End Systems: Integrating the acoustic model and vocoder into a single neural network (e.g., Deep Voice, FastSpeech 2, Glow-TTS, Conditional VAEs) simplified pipelines and further enhanced quality.
  5. Multi-Speaker & Zero-Shot TTS: Efforts focused on generalizing to multiple speakers and unseen speakers, leading to speaker adaptation and speaker encoding techniques.
  6. Self-Supervised Learning & Generative Models: Inspired by advancements in NLP, self-supervised learning (e.g., wav2vec 2.0, HuBERT) enabled learning representations from large unlabeled speech data. Generative models like diffusion models and language models (e.g., AudioLM for speech-to-speech) started to emerge.
  7. Large Language Model Paradigm Shift (VALL-E): This paper marks a significant point by directly applying the language modeling paradigm, successful in text generation, to text-to-speech. By using discrete audio codec codes and massive semi-supervised data, VALL-E leverages the in-context learning capabilities of large language models to achieve unprecedented zero-shot TTS performance.

3.4. Differentiation Analysis

The core differences and innovations of VALL-E compared to previous zero-shot TTS and spoken generative pre-trained models are:

  • Intermediate Representation:
    • Previous TTS: Primarily uses mel spectrograms as an intermediate, continuous representation.
    • VALL-E: Uses discrete audio codec codes (specifically from EnCodec). This is a fundamental shift that allows TTS to be treated as a discrete token sequence generation problem, analogous to text language modeling.
  • Objective Function:
    • Previous TTS: Typically optimized as a continuous signal regression problem (predicting mel spectrogram values).
    • VALL-E: Optimized as a language model task, predicting the next discrete code in a sequence.
  • Training Data Scale:
    • Previous TTS/Pre-training: Generally trained on dozens to hundreds of hours (e.g., LibriTTS) or at most ~1K hours of data for pre-training.
    • VALL-E: Pre-trained on an unprecedented 60,000 hours of semi-supervised English speech, hundreds of times larger. This scale is crucial for its emergent capabilities.
  • In-Context Learning Capability:
    • Previous TTS: Lacks in-context learning. Zero-shot TTS methods typically require fine-tuning, speaker adaptation modules, or complex speaker encoders with pre-designed features.
    • VALL-E: Emerges in-context learning capabilities, similar to GPT-3. It can synthesize high-quality, personalized speech for unseen speakers with just a 3-second acoustic prompt, without any additional fine-tuning or structural engineering.
  • Generative Diversity and Acoustic Richness:
    • Previous TTS: Often produces deterministic outputs for a given text, lacking diversity. Typically focuses on clean speech.
    • VALL-E: Can generate diverse outputs (different prosody, pace) due to its sampling-based decoding. It also preserves rich acoustic information from the prompt, including speaker's emotion and acoustic environment (e.g., reverberation), which is a novel emergent property in zero-shot TTS.
  • Content Control:
    • Speech-to-Speech LMs (e.g., AudioLM): These models primarily take speech as input and generate speech, making explicit content control challenging or requiring additional components.

    • VALL-E: As a text-to-speech model, it offers direct and explicit control over the synthesized content via the input text.

      In essence, VALL-E shifts the paradigm of zero-shot TTS by treating it as a conditional language modeling problem over discrete audio tokens, enabled by massive data scaling and the inherent in-context learning abilities of large Transformer models.

4. Methodology

This section details the technical solution proposed by VALL-E, from its foundational principles to its specific architectural components and training procedures.

4.1. Principles

The core idea of VALL-E is to re-frame Text-to-Speech (TTS) synthesis as a conditional language modeling task. Instead of predicting continuous acoustic features like mel spectrograms, VALL-E learns to predict sequences of discrete audio tokens. This approach is motivated by the success of large language models (LMs) in text generation, which have shown powerful in-context learning capabilities when trained on vast amounts of data.

The theoretical basis and intuition behind this approach are:

  1. Discretization Enables Language Modeling: By converting continuous audio into discrete codes, speech can be treated like text tokens. This allows the application of powerful Transformer-based language models that excel at modeling sequences of discrete entities.
  2. Neural Codecs Preserve Rich Information: An off-the-shelf neural audio codec (like EnCodec) can compress audio into discrete tokens while preserving high fidelity and crucial acoustic properties such as speaker identity, emotion, and acoustic environment. This is vital for personalized and expressive TTS.
  3. Large-Scale Data for Generalization and Emergent Abilities: Training on a massive, diverse dataset allows the model to learn a wide range of speaking styles, prosodies, and acoustic conditions. This scale is hypothesized to enable in-context learning, where the model can adapt to unseen speakers from a short acoustic prompt without explicit fine-tuning.
  4. Hierarchical Generation: Leveraging the hierarchical nature of residual vector quantization (RVQ) in neural codecs, VALL-E employs a two-stage generation process: an autoregressive (AR) model for coarse acoustic tokens (which dictate rhythm and speaker identity) and a non-autoregressive (NAR) model for finer details (which refine the acoustic quality). This balances quality and inference speed.

4.2. Core Methodology In-depth

The VALL-E framework consists of three main stages: speech quantization, conditional codec language modeling (with AR and NAR components), and inference via prompting.

4.2.1. Speech Quantization

The first step in VALL-E is to convert raw audio waveforms into discrete acoustic codes. This is crucial because raw audio, typically stored as 16-bit integer values at high sampling rates (e.g., 24 kHz), results in extremely long sequences, making it intractable for language models. Speech quantization compresses both the value range and the sequence length.

VALL-E adopts a pre-trained neural audio codec model, EnCodec [Défossez et al., 2022], as its tokenizer. EnCodec is a convolutional encoder-decoder model designed for high-fidelity audio compression.

  • Encoding Process: EnCodec takes a 24 kHz audio waveform as input. Its encoder produces embeddings at a rate of 75 Hz (a 320-fold reduction in sampling rate). Each embedding is then subjected to Residual Vector Quantization (RVQ).

  • Residual Vector Quantization (RVQ): In RVQ, a series of quantizers are used. The first quantizer captures the most significant information, and subsequent quantizers learn to encode the residual error from the previous stage. This creates a hierarchical structure of discrete tokens.

    • VALL-E uses eight hierarchical quantizers, each with 1024 entries. This configuration corresponds to EnCodec operating at a 6 kbps bitrate for 24 kHz audio reconstruction.

    • For a 10-second waveform, this results in a discrete representation matrix of $750 \times 8$ entries, where 750 is the number of downsampled time steps ($24{,}000 \text{ Hz} \times 10 \text{ s} / 320$) and 8 is the number of quantizers (a short sanity-check calculation follows this list).

    • The paper notes that the first quantizer plays the most important role in reconstruction, with the impact of others gradually decreasing, as illustrated in the figure below. This hierarchy motivates the two-stage (AR for first, NAR for others) language modeling design.

      The following figure (Figure 2 from the original paper) shows the neural audio codec model:

      Figure 2: The neural audio codec model revisit. Because RVQ is employed, the first quantizer plays the most important role in reconstruction, and the impact from others gradually decreases.

  • Decoding Process: With the discrete codes from all quantizers, the convolutional decoder of EnCodec generates real-valued embeddings and reconstructs the waveform at 24 kHz. The advantage of using a neural codec is that it contains rich speaker information and acoustic information, allowing for speaker identity preservation and high-quality reconstruction without additional vocoder training.
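As a quick sanity check of the numbers above, the following lines reproduce the 750 × 8 code-matrix size and the 6 kbps bitrate directly from the stated parameters (24 kHz input, 320× downsampling, eight 1024-entry codebooks).

```python
# Back-of-the-envelope token count for the codec configuration described above.
sample_rate = 24_000
hop = 320                                   # encoder downsampling factor -> 75 Hz
num_quantizers = 8
duration_s = 10

frames = sample_rate * duration_s // hop    # 750 frames for a 10-second clip
print(frames, num_quantizers)               # -> 750 8, i.e. a 750 x 8 code matrix
bits = frames * num_quantizers * 10         # log2(1024) = 10 bits per code
print(bits / duration_s / 1000, "kbps")     # -> 6.0 kbps bitrate
```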

4.2.2. Problem Formulation: Regarding TTS as Conditional Codec Language Modeling

Given a dataset of audio samples $\mathbf{y}$ and their corresponding phoneme transcriptions $\mathbf{x} = \{x_0, x_1, \ldots, x_L\}$, the neural codec model encodes each audio sample into a discrete acoustic code matrix $\mathbf{C}^{T \times 8}$, where $T$ is the downsampled utterance length and 8 is the number of quantizers. Each row vector $\mathbf{c}_{t,:}$ represents the eight codes for frame $t$, and each column vector $\mathbf{c}_{:,j}$ represents the code sequence from the $j$-th codebook ($j \in \{1, \dots, 8\}$). The neural codec decoder can reconstruct the waveform from $\mathbf{C}$, denoted as $\mathrm{Decodec}(\mathbf{C}) \approx \hat{\mathbf{y}}$.

VALL-E frames zero-shot TTS as a conditional codec language modeling task. The goal is to train a neural language model to generate an acoustic code matrix $\mathbf{C}$ conditioned on:

  • A phoneme sequence $\mathbf{x}$ (for content).

  • An acoustic prompt matrix $\tilde{\mathbf{C}}^{T' \times 8}$ (for speaker's voice, derived from an enrolled recording).

    The optimization objective is to maximize the probability of the generated codes: $ \max p(\mathbf{C} \mid \mathbf{x}, \tilde{\mathbf{C}}) $ Here, $\tilde{\mathbf{C}}$ is obtained by applying the same neural codec to an enrolled recording (e.g., a 3-second speech segment of the target speaker). The language model is expected to learn to extract content from $\mathbf{x}$ and speaker information from $\tilde{\mathbf{C}}$. During inference, the trained language model first estimates $\mathbf{C}$, and then the neural codec decoder synthesizes the high-quality speech.

4.2.3. Training: Conditional Codec Language Modeling

Due to the hierarchical nature of residual quantization, where earlier quantizers capture coarser acoustic properties (like speaker identity) and later quantizers learn finer details, VALL-E designs two conditional language models in a hierarchical manner: one autoregressive (AR) for the first quantizer's codes and one non-autoregressive (NAR) for the subsequent quantizers' codes.

The following figure (Figure 3 from the original paper) shows the structure of the conditional codec language modeling:

Figure 3: The structure of the conditional codec language modeling, which is built in a hierarchical manner. In practice, the NAR decoder will be called seven times to generate codes in seven quantizers.

4.2.3.1. Autoregressive Codec Language Modeling (AR Model)

The autoregressive (AR) language model is responsible for generating the discrete tokens from the first quantizer (i.e., $\mathbf{c}_{:,1}$). This model is crucial for establishing the rhythm, prosody, and overall speaker identity of the synthesized speech, as the first quantizer's codes carry the most significant acoustic information.

  • Model Architecture: It comprises a phoneme embedding layer ($W_x$), an acoustic embedding layer ($W_a$), a Transformer decoder (causal Transformer), and a prediction layer.
  • Input and Conditioning: The AR model is conditioned on the phoneme sequence $\mathbf{x}$ (as the content prompt) and the acoustic prompt from the first quantizer's codes ($\tilde{\mathbf{C}}_{:,1}$). The model's input is the concatenation of the embedded $\mathbf{x}$ and $\mathbf{c}_{:,1}$. Special <EOS> tokens are appended after each sequence. Sinusoidal position embeddings are computed separately for the prompt and input tokens.
  • Causal Transformer: As a causal Transformer model, each token $c_{t,1}$ (at time step $t$ for the first quantizer) can only attend to prior phoneme tokens $\mathbf{x}$ and previously generated acoustic tokens $\mathbf{c}_{\le t,1}$.
  • Optimization Objective: The model is optimized to maximize the probability of predicting the next token in the first codebook (a toy sketch of this objective appears after this list). $ p(\mathbf{c}_{:,1} \mid \mathbf{x}, \tilde{\mathbf{C}}_{:,1}; \theta_{AR}) = \prod_{t=0}^{T} p(c_{t,1} \mid \mathbf{c}_{<t,1}, \tilde{\mathbf{c}}_{:,1}, \mathbf{x}; \theta_{AR}) $ Where:
    • $\mathbf{c}_{:,1}$ represents the sequence of discrete codes from the first quantizer for the target speech.
    • $\mathbf{x}$ is the input phoneme sequence for the target content.
    • $\tilde{\mathbf{C}}_{:,1}$ represents the sequence of discrete codes from the first quantizer of the acoustic prompt (enrolled recording).
    • $\theta_{AR}$ denotes the parameters of the autoregressive model.
    • $T$ is the total length of the generated acoustic code sequence.
    • $p(c_{t,1} \mid \mathbf{c}_{<t,1}, \tilde{\mathbf{c}}_{:,1}, \mathbf{x}; \theta_{AR})$ is the probability of predicting the code $c_{t,1}$ at time step $t$, conditioned on all previous codes $\mathbf{c}_{<t,1}$, the acoustic prompt $\tilde{\mathbf{c}}_{:,1}$, and the phoneme sequence $\mathbf{x}$.
  • Parameter Sharing: The output projection layer's parameters are shared with the acoustic embedding $W_a$.
  • Training vs. Inference: During training, the process is pure causal language model training, where any prefix sequence $\mathbf{c}_{<t,1}$ acts as a prompt for the subsequent part. During inference, the acoustic token sequence of the enrolled recording ($\tilde{\mathbf{c}}_{:,1}$) is explicitly used as a prefix for AR decoding, and the phoneme sequence of the enrolled recording is concatenated with the target phoneme sequence as the phoneme prompt. This allows flexible length prediction and adaptation to diverse speaking speeds.
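The toy PyTorch sketch below illustrates the AR objective described above: phoneme tokens and first-codebook acoustic tokens are embedded, concatenated, passed through a causal Transformer, and trained with next-token cross-entropy. All sizes, module names, and the two-layer depth are illustrative assumptions, not the paper's configuration.

```python
# Toy sketch of the AR stage's decoder-only objective (assumed shapes and sizes).
import torch
import torch.nn as nn

PHONE_VOCAB, CODE_VOCAB, D = 100, 1024 + 1, 256   # +1 reserves an <EOS> code id

class ToyARModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.phone_emb = nn.Embedding(PHONE_VOCAB, D)   # plays the role of W_x
        self.code_emb = nn.Embedding(CODE_VOCAB, D)     # plays the role of W_a
        layer = nn.TransformerEncoderLayer(D, nhead=4, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(D, CODE_VOCAB)            # prediction layer

    def forward(self, phones, codes):
        x = torch.cat([self.phone_emb(phones), self.code_emb(codes)], dim=1)
        n = x.size(1)
        causal = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
        h = self.decoder(x, mask=causal)                # causal self-attention
        return self.head(h)

model = ToyARModel()
phones = torch.randint(0, PHONE_VOCAB, (2, 12))         # phoneme (content) prompt
codes = torch.randint(0, CODE_VOCAB - 1, (2, 30))       # first-codebook targets
logits = model(phones, codes)
# Each acoustic token is predicted from everything before it, so the logits are
# shifted by one position against the target codes.
acoustic_logits = logits[:, phones.size(1) - 1:-1, :]
loss = nn.functional.cross_entropy(
    acoustic_logits.reshape(-1, CODE_VOCAB), codes.reshape(-1))
print(float(loss))
```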

4.2.3.2. Non-Autoregressive Codec Language Modeling (NAR Model)

Once the first quantizer codes ($\mathbf{c}_{:,1}$) are obtained from the AR model, a non-autoregressive (NAR) language model is employed to generate the codes for the remaining seven quantizers (i.e., $\mathbf{c}_{:,j}$ for $j \in [2, 8]$). The NAR model generates all codes for a given stage in parallel, which significantly speeds up inference.

  • Model Architecture: Similar to the AR model, but it includes eight separate acoustic embedding layers (one for each quantizer).
  • Training Process: In each training step, a random stage $i \in [2, 8]$ is sampled. The model is trained to maximize the probability of the acoustic tokens from the $i$-th quantizer codebook.
  • Input and Conditioning: The NAR model is conditioned on:
    1. The phoneme sequence $\mathbf{x}$ (content prompt).
    2. The full acoustic prompt matrix $\tilde{\mathbf{C}}$ (speaker identity and acoustic environment).
    3. The predicted acoustic tokens from all previous codebooks $\mathbf{C}_{:,<j}$ (for the current stage $j$).
    • The acoustic tokens from stage 1 to stage $i-1$ (i.e., $\mathbf{C}_{:,<i}$) are embedded and summed up as part of the model input (see the input-construction sketch after this list). $ e_{c_{t,j}} = W_a^j \odot c_{t,j} $ Where:
      • $e_{c_{t,j}}$ is the embedded representation of the code $c_{t,j}$ at time step $t$ from quantizer $j$.
      • $W_a^j$ is the acoustic embedding layer specific to quantizer $j$.
      • $\odot$ indicates index selection (retrieving the embedding for the specific code value).
      • The summed embedding for a frame $t$ from previous quantizers up to $i-1$ is: $ e_{\mathbf{c}_t} = \sum_{j=1}^{i-1} e_{c_{t,j}} $
    • The full acoustic prompt matrix $\tilde{\mathbf{C}}$ is also embedded by summing embeddings from all eight quantizers for each frame: $ e_{\tilde{\mathbf{c}}_t} = \sum_{j=1}^{8} e_{\tilde{c}_{t,j}} $
    • The final Transformer input is the concatenation of the embedded phoneme sequence ($\mathbf{e}_x$), the embedded acoustic prompt ($\mathbf{e}_{\tilde{c}}$), and the summed embeddings of previously generated acoustic codes ($\mathbf{e}_{c_{:,<i}}$).
  • Positional Embeddings: Positional embeddings are computed separately for prompts and the acoustic sequence.
  • Stage Injection: The current quantizer stage $i$ is injected into the network using Adaptive Layer Normalization (AdaLN) [Xu et al., 2019] to inform the model about which stage it is currently generating codes for. $ \mathrm{AdaLN}(h, i) = a_i \, \mathrm{LayerNorm}(h) + b_i $ Where $h$ represents the intermediate activations, and $a_i$ and $b_i$ are obtained from a linear projection of the stage embedding.
  • Non-Autoregressive Attention: Unlike AR models, the NAR model allows each token to attend to all input tokens in the self-attention layer.
  • Parameter Sharing: Weights of the $j$-th prediction layer are shared with the $(j+1)$-th acoustic embedding layer.
  • Optimization Objective: The NAR model's overall objective is to predict codes for quantizers 2 through 8: $ p(\mathbf{C}_{:,2:8} \mid \mathbf{x}, \tilde{\mathbf{C}}; \theta_{NAR}) = \prod_{j=2}^{8} p(\mathbf{c}_{:,j} \mid \mathbf{C}_{:,<j}, \mathbf{x}, \tilde{\mathbf{C}}; \theta_{NAR}) $ Where:
    • $\mathbf{C}_{:,2:8}$ refers to the codes from quantizer stages 2 through 8.
    • $\mathbf{c}_{:,j}$ is the sequence of discrete codes from the $j$-th quantizer.
    • $\mathbf{C}_{:,<j}$ represents the codes from all quantizers prior to stage $j$ that have already been generated.
    • $\theta_{NAR}$ denotes the parameters of the non-autoregressive model.
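The sketch below shows, under assumed toy shapes and module names, how the NAR input for stage $i$ can be assembled by summing the embeddings of the already-generated codebooks and conditioning on the stage index through AdaLN. It is not the released implementation, only an illustration of the two mechanisms described above.

```python
# Minimal sketch of NAR input construction and AdaLN stage conditioning.
import torch
import torch.nn as nn

NUM_Q, CODE_VOCAB, D = 8, 1024, 256

code_embs = nn.ModuleList(nn.Embedding(CODE_VOCAB, D) for _ in range(NUM_Q))
stage_emb = nn.Embedding(NUM_Q, 2 * D)        # linear map to the AdaLN scale and bias
layer_norm = nn.LayerNorm(D, elementwise_affine=False)

def adaln(h, stage):
    a, b = stage_emb(torch.tensor(stage)).chunk(2)    # a_i, b_i from the stage id
    return a * layer_norm(h) + b

codes = torch.randint(0, CODE_VOCAB, (1, 750, NUM_Q))  # a 10 s utterance: 750 x 8
stage = 4                                              # predicting the 5th codebook
# Sum the embeddings of the codebooks already generated (stages 1..i-1) per frame.
x = sum(code_embs[j](codes[:, :, j]) for j in range(stage))
h = adaln(x, stage)                                    # stage-conditioned activations
print(h.shape)                                         # torch.Size([1, 750, 256])
```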

4.2.3.3. Overall Prediction

The complete prediction of the acoustic code matrix $\mathbf{C}$ is modeled as the product of the AR model for the first quantizer and the NAR model for the subsequent quantizers: $ p(\mathbf{C} \mid \mathbf{x}, \tilde{\mathbf{C}}; \theta) = p(\mathbf{c}_{:,1} \mid \tilde{\mathbf{C}}_{:,1}, \mathbf{x}; \theta_{AR}) \prod_{j=2}^{8} p(\mathbf{c}_{:,j} \mid \mathbf{C}_{:,<j}, \mathbf{x}, \tilde{\mathbf{C}}; \theta_{NAR}) $ This combination leverages the strength of AR for flexible length prediction and initial speaker modeling and the efficiency of NAR for refining acoustic details.

4.2.4. Inference: In-Context Learning via Prompting

VALL-E exhibits in-context learning capabilities in zero-shot TTS, meaning it can synthesize high-quality speech for unseen speakers without any fine-tuning or parameter updates. This is achieved through carefully designed prompting.

The inference process involves:

  1. Prompt Preparation:

    • The input text is converted into a phoneme sequence (the phoneme prompt).
    • The enrolled recording (e.g., a 3-second sample of the target speaker's voice) is encoded by the neural codec model into an acoustic matrix (the acoustic prompt).
  2. Code Generation: Both prompts are fed into the AR and NAR models.

    • AR Model Decoding: For the AR model (generating $\mathbf{c}_{:,1}$), sampling-based decoding is used. This is preferred over beam search because beam search can sometimes lead the language model into infinite loops, and sampling-based methods significantly increase the diversity of the output speech (a short decoding sketch appears at the end of this subsection).
    • NAR Model Decoding: For the NAR model (generating $\mathbf{c}_{:,2:8}$), greedy decoding is used, where the token with the highest probability is chosen at each step.
  3. Waveform Synthesis: Finally, the full sequence of eight discrete code sequences generated by VALL-E is fed into the neural codec decoder (the EnCodec decoder) to synthesize the final waveform.

    The paper describes two main inference settings:

  • VALL-E (Main Zero-Shot TTS Setting):

    • Goal: To generate specific content (from a given text sentence) in the voice of an unseen speaker.
    • Phoneme Prompt: The phoneme transcription of the enrolled speech is prepended to the phoneme sequence of the target sentence.
    • Acoustic Prompt: The first-layer acoustic tokens ($\tilde{\mathbf{c}}_{:,1}$) of the enrolled speech are used as an acoustic prefix for the AR model. The full acoustic matrix $\tilde{\mathbf{C}}$ is used as an acoustic prompt for the NAR model.
    • Outcome: VALL-E generates acoustic tokens corresponding to the target text, cloning the speaker's voice from the prompt.
  • VALL-E-continual (Speech Continuation Setting):

    • Goal: To generate a continuation of an utterance, maintaining the speaker's voice and prosody.
    • Phoneme Prompt: The full transcription of the utterance is used.
    • Acoustic Prompt: The first 3 seconds of the ground-truth utterance are used as the acoustic prompt.
    • Outcome: The model generates the continuation of the speech, semantically and acoustically continuous with the enrolled speech.
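The two decoding rules can be summarized in a few lines: sampling for the AR stage (here nucleus-style sampling; the temperature and top-p values are assumptions, since the paper only states that sampling-based decoding is used) and greedy argmax for the NAR stages.

```python
# Toy decoding sketch: sampling for the AR stage, greedy argmax for the NAR stages.
import torch

def sample_next_code(logits, temperature=1.0, top_p=0.95):
    """Nucleus-style sampling over the first-codebook distribution (AR stage)."""
    probs = torch.softmax(logits / temperature, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    keep = torch.cumsum(sorted_probs, dim=-1) <= top_p
    keep[..., 0] = True                          # always keep the most likely code
    filtered = torch.where(keep, sorted_probs, torch.zeros_like(sorted_probs))
    choice = torch.multinomial(filtered / filtered.sum(-1, keepdim=True), 1)
    return sorted_idx.gather(-1, choice)

def greedy_code(logits):
    """Greedy decoding for codebooks 2-8 (NAR stages)."""
    return logits.argmax(dim=-1)

logits = torch.randn(1, 1024)                    # fake distribution over 1024 codes
print(sample_next_code(logits).item(), greedy_code(logits).item())
```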

5. Experimental Setup

This section details the datasets, evaluation metrics, and model configurations used to train and evaluate VALL-E.

5.1. Datasets

5.1.1. Training Data

  • LibriLight [Kahn et al., 2020]:
    • Source & Scale: Consists of 60,000 hours of unlabelled English speech, primarily from audiobooks. This is hundreds of times larger than typical TTS datasets.
    • Characteristics: Contains speech from over 7,000 unique speakers. The original data is audio-only.
    • Transcription Generation: To obtain transcriptions, a hybrid DNN-HMM ASR model was trained on 960 hours of labeled LibriSpeech data (following the Kaldi recipe [Povey et al., 2011]). This ASR model then decoded the unlabeled LibriLight speech data to generate phoneme-level alignments with a 30 ms frameshift.
    • Domain: Audiobook domain.
    • Rationale: Chosen for its massive scale, diversity in speakers, and prosodies, despite containing more noisy speech and inaccurate transcriptions compared to clean studio-recorded data. The authors believe the proposed approach is robust to this noise due to leveraging large data.
  • Neural Codec Data: The EnCodec model [Défossez et al., 2022] was used to generate the acoustic code matrix for all 60,000 hours of LibriLight data.

5.1.2. Evaluation Data

  • LibriSpeech [Panayotov et al., 2015]:
    • Source & Scale: The test-clean subset of LibriSpeech was used for zero-shot TTS evaluation.
    • Characteristics: Consists of samples with lengths between 4 and 10 seconds, totaling a 2.2-hour subset. Crucially, there is no speaker overlap between the LibriLight training data and the LibriSpeech test-clean speakers, ensuring a true zero-shot scenario.
    • Domain: Audiobook domain.
  • VCTK [Veaux et al., 2016]:
    • Source & Scale: A dataset containing speech from 108 speakers.
    • Characteristics: None of the 108 speakers were observed during VALL-E's training. This dataset is known for containing speakers with various accents, posing a greater challenge for generalization compared to LibriSpeech.
    • Domain: Varied speech.
  • EmoV-DB [Adigwe et al., 2018]:
    • Source & Scale: A dataset containing speech with five emotions.
    • Characteristics: Used for qualitative analysis of speaker's emotion maintenance in zero-shot settings.

5.2. Evaluation Metrics

The paper employs both automatic and human evaluation metrics to assess VALL-E's performance.

5.2.1. Automatic Metrics

5.2.1.1. Word Error Rate (WER)

  • Conceptual Definition: WER is a common metric for evaluating the performance of speech recognition systems, but in TTS, it's used to assess the robustness of the generated speech. It quantifies how accurately the synthesized speech matches the target transcription, indicating whether the TTS system introduces errors like deletions, insertions, or substitutions. A lower WER indicates higher synthesis robustness and better fidelity to the input text.
  • Mathematical Formula: $ \mathrm{WER} = \frac{S + D + I}{N} $
  • Symbol Explanation:
    • $S$: Number of substitutions (words in the reference that were replaced by a different word in the hypothesis).
    • $D$: Number of deletions (words in the reference that were omitted in the hypothesis).
    • $I$: Number of insertions (words in the hypothesis that were not present in the reference).
    • $N$: Total number of words in the reference transcription.
  • Implementation: An ASR model (HuBERT-Large [Hsu et al., 2021] fine-tuned on LibriSpeech 960h, a CTC-based model without language model fusion) is used to transcribe the generated audio, and the WER is calculated against the original text transcriptions.
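For reference, a minimal word-level WER computation following the formula above might look like this; it is illustrative only, since the paper's evaluation first transcribes the synthesized audio with a fine-tuned HuBERT-Large ASR model.

```python
# Minimal WER via word-level edit distance (substitutions + deletions + insertions).
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # delete all remaining reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # insert all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 deletion / 6 ≈ 0.167
```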

5.2.1.2. Speaker Similarity Score (SPK)

  • Conceptual Definition: This metric quantifies how similar the synthesized speech sounds to the voice of the enrolled speaker (provided in the acoustic prompt). It measures the success of speaker cloning in a zero-shot scenario. A higher SPK score indicates better preservation of the speaker identity.
  • Implementation: The WavLM-TDNN model [Chen et al., 2022], a state-of-the-art speaker verification model (top-ranked at the VoxSRC Challenge 2021 and 2022), is used. It predicts a similarity score in the range $[-1, 1]$, where a larger value signifies higher similarity between the prompt (decompressed enrolled speech) and the synthesized speech.
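Conceptually, this score is a cosine similarity between fixed-size speaker embeddings of the prompt and of the synthesized speech. The toy sketch below uses random vectors in place of WavLM-TDNN embeddings, purely to show the scoring step.

```python
# Illustrative speaker-similarity scoring: cosine similarity in [-1, 1].
import numpy as np

def speaker_similarity(emb_prompt: np.ndarray, emb_synth: np.ndarray) -> float:
    """Cosine similarity between two fixed-size speaker embeddings."""
    return float(np.dot(emb_prompt, emb_synth) /
                 (np.linalg.norm(emb_prompt) * np.linalg.norm(emb_synth)))

rng = np.random.default_rng(0)
e_prompt, e_synth = rng.normal(size=256), rng.normal(size=256)  # stand-in embeddings
print(round(speaker_similarity(e_prompt, e_synth), 3))          # value in [-1, 1]
```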

5.2.2. Human Evaluation

Human listeners provide subjective ratings, which are often considered the most reliable indicators of speech quality.

5.2.2.1. Comparative Mean Option Score (CMOS)

  • Conceptual Definition: CMOS is a mean opinion score (MOS) variant used to evaluate the naturalness or overall quality of speech by directly comparing two audio samples (e.g., the proposed system vs. a baseline or ground truth). Listeners are asked to rate one system relative to another.
  • Scale: Ranges from -3 to 3, with intervals of 1.
    • -3: The new system is much worse than the baseline.
    • 0: The new system is similar to the baseline.
    • +3: The new system is much better than the baseline.
  • Goal: Provides an indicator of speech naturalness and perceived quality.
  • Implementation: 12 native English speakers were invited as CMOS contributors via crowdsourcing.

5.2.2.2. Similarity Mean Option Score (SMOS)

  • Conceptual Definition: SMOS is an absolute mean option score specifically designed to measure how similar the synthesized speech is to the original speaker's voice from the acoustic prompt. Listeners rate the degree of speaker similarity.
  • Scale: Ranges from 1 to 5, with 0.5-point increments.
    • 1: Not similar at all.
    • 5: Extremely similar.
  • Goal: Directly quantifies speaker cloning effectiveness.
  • Implementation: 6 native English speakers were invited as SMOS contributors via crowdsourcing.

5.3. Baselines

VALL-E was primarily compared against the following state-of-the-art zero-shot TTS system:

  • YourTTS [Casanova et al., 2022b]:

    • Description: This is a state-of-the-art zero-shot multi-speaker TTS system. It's an end-to-end TTS model that leverages speaker embeddings for zero-shot voice conversion and TTS.

    • Training Data: YourTTS was trained on a combined dataset including VCTK [Veaux et al., 2016], LibriTTS [Zen et al., 2019], and TTS-Portuguese [Casanova et al., 2022a].

    • Representativeness: It is a strong baseline because it represents the speaker encoding-based approach for zero-shot TTS and has demonstrated good performance. The authors used their publicly released checkpoint for comparison.

      Additionally, for comparison of robustness (WER) in speech-to-speech LM-based generation, VALL-E also compared its WER results with:

  • GSLM [Lakhotia et al., 2021]:

    • Description: A speech-to-speech generative model that synthesizes speech based on HuBERT codes. It uses Tacotron2 and WaveGlow for waveform reconstruction.
  • AudioLM* [Borsos et al., 2022]:

    • Description: A speech-to-speech language modeling approach to audio generation that uses audio codecs and semantic codes.

5.4. Model and Training Details

5.4.1. Model Architecture

  • Both the Autoregressive (AR) model and the Non-Autoregressive (NAR) model in VALL-E share a similar Transformer architecture.
  • Layers: 12 layers.
  • Attention Heads: 16 attention heads.
  • Embedding Dimension: 1024.
  • Feed-Forward Layer Dimension: 4096.
  • Dropout: 0.1.

5.4.2. Training Configuration

  • Waveform Length: The average waveform length in LibriLight is 60 seconds. During training, waveforms were randomly cropped to lengths between 10 and 20 seconds.
  • Phoneme Prompt: The corresponding phoneme alignments of the cropped waveforms were used. Consecutive repetitions in the force-aligned phoneme sequence were removed.
  • NAR Acoustic Prompt: For the NAR model, a random 3-second segment of waveform from the same utterance as the target was selected to serve as the acoustic prompt.
  • Hardware: Trained using 16 NVIDIA TESLA V100 32GB GPUs.
  • Batch Size: A batch size of 6,000 acoustic tokens per GPU.
  • Training Steps: 800,000 steps.
  • Optimizer: AdamW optimizer.
  • Learning Rate Schedule: The learning rate was warmed up for the first 32,000 updates to a peak of $5 \times 10^{-4}$, and then linearly decayed.
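A minimal sketch of this schedule follows, assuming the decay runs linearly to zero at the final (800,000th) step; the decay endpoint is an assumption, as the paper only states "linearly decayed".

```python
# Sketch of the stated schedule: linear warm-up to 5e-4 over 32k updates,
# then linear decay (endpoint at the final training step is assumed).
PEAK_LR, WARMUP, TOTAL = 5e-4, 32_000, 800_000

def learning_rate(step: int) -> float:
    if step < WARMUP:
        return PEAK_LR * step / WARMUP                      # linear warm-up
    return PEAK_LR * (TOTAL - step) / (TOTAL - WARMUP)      # linear decay

print(learning_rate(16_000), learning_rate(32_000), learning_rate(800_000))
```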

6. Results & Analysis

This section presents and analyzes the experimental results obtained for VALL-E, comparing its performance against baselines and delving into qualitative observations.

6.1. Core Results Analysis

6.1.1. LibriSpeech Evaluation

The LibriSpeech test-clean dataset was used for zero-shot TTS evaluation, specifically a 2.2-hour subset with no speaker overlap with the LibriLight training data. For each synthesis, VALL-E used a randomly chosen 3-second speech segment from another utterance of the same speaker as the acoustic prompt. VALL-E-continual used the first 3 seconds of the ground-truth speech as the enrolled speech.

The following are the results from Table 2 of the original paper:

| Model | WER | SPK |
| --- | --- | --- |
| GroundTruth | 2.2 | 0.754 |
| Speech-to-Speech Systems | | |
| GSLM | 12.4 | 0.126 |
| AudioLM* | 6.0 | - |
| TTS Systems | | |
| YourTTS | 7.7 | 0.337 |
| VALL-E | 5.9 | 0.580 |
| VALL-E-continual | 3.8 | 0.508 |

Table 2: Evaluation results on audio generation. YourTTS and VALL-E are text-to-speech models using phonemes as inputs, while GSLM and AudioLM are speech-to-speech models using latent code as inputs. *The WER result of AudioLM is obtained by a Conformer Transducer model [Borsos et al., 2022]. Since AudioLM is not open-source, we cannot evaluate its speaker score with our tool.

Analysis of Automatic Metrics (Table 2):

  • Robustness (WER): VALL-E demonstrates significantly better robustness compared to the YourTTS baseline (5.9 WER vs. 7.7 WER). This indicates that VALL-E generates speech that is more faithful to the input text, with fewer deletion, insertion, or replacement errors. The VALL-E-continual setting further reduces WER to 3.8, suggesting that having ground-truth acoustic tokens for the initial segment improves robustness.

  • Speaker Similarity (SPK): VALL-E achieves a much higher speaker similarity score (0.580) compared to YourTTS (0.337). This highlights VALL-E's superior ability to clone the voice of unseen speakers from a short acoustic prompt.

  • Comparison with Speech-to-Speech LMs: VALL-E also outperforms GSLM (12.4 WER, 0.126 SPK) and AudioLM (6.0 WER) in robustness and speaker similarity. The paper attributes VALL-E's better WER to its training with pseudo-phonemes, which provide better alignment quality with input text compared to HuBERT/w2v-BERT codes used by GSLM and AudioLM. GSLM's low SPK score is expected, as HuBERT codes discard speaker identity.

    The following are the results from Table 3 of the original paper:

    | Model | SMOS | CMOS (vs. VALL-E) |
    | --- | --- | --- |
    | YourTTS | 3.45±0.09 | -0.12 |
    | VALL-E | 4.38±0.10 | 0.00 |
    | GroundTruth | 4.5±0.10 | +0.17 |

Table 3: Human evaluation with 40 speakers on LibriSpeech test-clean with 3-second enrolled recording for each.

Analysis of Human Evaluation (Table 3):

  • Speaker Similarity (SMOS): VALL-E achieves an SMOS of 4.38, which is very close to the ground truth (4.5) and significantly outperforms YourTTS (3.45). This strongly confirms its effectiveness in zero-shot speaker cloning.
  • Speech Naturalness (CMOS): VALL-E beats YourTTS by a CMOS margin of 0.12 (YourTTS scores -0.12 relative to VALL-E). This indicates that VALL-E synthesizes more natural and realistic speech than the baseline. Although GroundTruth still has a slightly higher CMOS (+0.17 relative to VALL-E), the performance gap is small, suggesting high perceived naturalness for VALL-E's outputs.

6.1.2. VCTK Evaluation

The VCTK dataset (108 speakers) was used for further zero-shot evaluation. Notably, VALL-E saw none of these speakers during training, while YourTTS had seen 97 of them. Comparisons were made using 3s, 5s, and 10s prompts.

The following are the results from Table 6 of the original paper:

| Model | 3s prompt | 5s prompt | 10s prompt |
| --- | --- | --- | --- |
| 108 full speakers | | | |
| YourTTS* | 0.357 | 0.377 | 0.394 |
| VALL-E | 0.382 | 0.423 | 0.484 |
| GroundTruth | 0.546 | 0.591 | 0.620 |
| 11 unseen speakers | | | |
| YourTTS | 0.331 | 0.337 | 0.344 |
| VALL-E | 0.389 | 0.380 | 0.414 |
| GroundTruth | 0.528 | - | 0.586 |

Table 6: Automatic evaluation of speaker similarity with 108 speakers on VCTK. *YourTTS has observed 97 speakers during training, while VALL-E observed none of them.

Analysis of Automatic Speaker Similarity (SPK) (Table 6):

  • VALL-E consistently outperforms YourTTS in speaker similarity, even when YourTTS had prior exposure to 97 of the 108 speakers (e.g., 0.382 vs. 0.357 for 3s prompt with all speakers). This highlights VALL-E's strong generalization to truly unseen speakers.

  • When compared on the strictly unseen 11 speakers (a fair zero-shot scenario), the performance gap between VALL-E and YourTTS widens, especially with shorter prompts (e.g., 0.389 vs. 0.331 for 3s prompt).

  • The SPK score for both models improves with longer prompt lengths (e.g., 3s to 10s), which is an intuitive finding as more information is available for speaker cloning.

    The following are the results from Table 7 of the original paper:

    | Model | SMOS | CMOS (vs. VALL-E) |
    | --- | --- | --- |
    | YourTTS* | 3.70±0.09 | -0.23 |
    | VALL-E | 3.81±0.09 | 0.00 |
    | GroundTruth | 4.29±0.09 | -0.04 |

Table 7: Human evaluation with 60 speakers on VCTK with 3-second enrolled recording for each.

Analysis of Human Evaluation (Table 7):

  • Speaker Similarity (SMOS): VALL-E achieves a SMOS of 3.81, which is better than YourTTS (3.70), even though YourTTS had seen some of these speakers. This further confirms VALL-E's superior speaker similarity for zero-shot scenarios.
  • Speech Naturalness (CMOS): VALL-E significantly outperforms YourTTS by a CMOS margin of 0.23 (YourTTS scores -0.23 relative to VALL-E). Remarkably, VALL-E scores +0.04 over GroundTruth, indicating no statistically significant difference from human recordings on this dataset. The paper notes that VCTK is more challenging (diverse accents, noisy environments) than LibriSpeech, and VALL-E's strong performance here is notable.

6.2. Ablation Studies / Parameter Analysis

6.2.1. Ablation Study of the NAR Model

The authors conducted an ablation study on the Non-Autoregressive (NAR) model to understand the contribution of different prompts. Three NAR models were trained:

  • NAR-no prompt: No phoneme or acoustic prompts.

  • NAR-phn prompt: Only phoneme sequence as a prompt.

  • NAR-2 prompts: Both phoneme prompt and acoustic token prompt as conditions.

    For this study, the ground-truth first-level acoustic tokens were used as input to the NAR model to isolate its performance; a toy sketch of the three conditioning settings follows.
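To make the three settings concrete, the sketch below shows one way the NAR conditioning sequence could be assembled under each configuration; the dimensions, embedding tables, and the simple summation over already-known code levels are illustrative assumptions rather than the paper's exact implementation.

```python
# Toy sketch of the three NAR conditioning settings (illustrative, not the
# paper's exact implementation). The NAR model predicts codec level `level`
# from the sum of embeddings of the known lower levels, optionally prefixed
# by a phoneme prompt and/or the acoustic-token prompt.
import torch
import torch.nn as nn

d_model, n_phonemes, codebook_size, n_levels = 256, 100, 1024, 8
phone_emb = nn.Embedding(n_phonemes, d_model)
code_emb = nn.ModuleList([nn.Embedding(codebook_size, d_model) for _ in range(n_levels)])


def nar_input(phonemes, prompt_codes, target_codes, level,
              use_phn_prompt, use_acoustic_prompt):
    """Assemble the NAR conditioning sequence for predicting codec level `level`.

    phonemes:     LongTensor [T_phn]            phoneme ids of the target text
    prompt_codes: LongTensor [n_levels, T_p]    all codec levels of the acoustic prompt
    target_codes: LongTensor [n_levels, T_tgt]  target codes; levels < `level` are known
    """
    # Sum embeddings of the already-known levels (< level) of the target segment.
    target_part = sum(code_emb[j](target_codes[j]) for j in range(level))  # [T_tgt, d]
    parts = []
    if use_phn_prompt:
        parts.append(phone_emb(phonemes))                                  # [T_phn, d]
    if use_acoustic_prompt:
        # The prompt contributes the sum of all of its code levels.
        parts.append(sum(code_emb[j](prompt_codes[j]) for j in range(n_levels)))
    parts.append(target_part)
    return torch.cat(parts, dim=0)                                         # [T_total, d]


# NAR-no prompt:  nar_input(..., use_phn_prompt=False, use_acoustic_prompt=False)
# NAR-phn prompt: nar_input(..., use_phn_prompt=True,  use_acoustic_prompt=False)
# NAR-2 prompts:  nar_input(..., use_phn_prompt=True,  use_acoustic_prompt=True)
```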

The following are the results from Table 4 of the original paper:

       NAR-no prompt   NAR-phn prompt   NAR-2 prompts
WER    19.6            3.0              2.8
SPK    0.518           0.541            0.732

Table 4: Ablation study of the NAR model. Ground-truth first-level acoustic tokens are used as the NAR input for this ablation.

Analysis of NAR Ablation (Table 4):

  • NAR-no prompt: This model performs poorly on both WER (19.6) and SPK (0.518), even with ground-truth first-level acoustic tokens. This shows that the NAR model requires conditioning to generate meaningful and speaker-consistent speech.
  • NAR-phn prompt: Adding the phoneme prompt dramatically reduces WER from 19.6 to 3.0. This demonstrates that the phoneme prompt is the primary contributor to ensuring the correct content of the generated speech. There is a slight improvement in SPK (0.541).
  • NAR-2 prompts: With both phoneme and acoustic token prompts, the WER improves slightly to 2.8, while the SPK score jumps markedly to 0.732. This confirms that the acoustic token prompt is what allows the NAR model to carry speaker information into the finer acoustic details it generates, and is therefore essential for speaker similarity.

6.2.2. Ablation Study of the AR Model

This study examines the importance of the acoustic prompt for the Autoregressive (AR) model, using the NAR-2 prompts setting as the NAR model.
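As a schematic of what "w/o acoustic prompt" means for the AR stage: the AR model normally generates first-level codes conditioned on the phoneme sequence plus the prompt's first-level codes as a prefix, and the ablation drops that prefix. The token layouts and ids below are illustrative, and `ar_decode` is a hypothetical stand-in for autoregressive sampling of first-level codec tokens.

```python
# Schematic of the AR model's conditioning with and without the acoustic
# prompt. In practice phoneme and acoustic tokens live in separate embedding
# spaces; the flat id lists here are only for illustration.
def ar_decode(prefix_tokens):
    """Placeholder: autoregressively sample first-level acoustic tokens
    conditioned on the given prefix."""
    raise NotImplementedError


phoneme_ids = [7, 2, 14, 9, 21]        # phonemized input text (toy ids)
prompt_level1 = [312, 87, 990, 45]     # first-level codec codes of the enrolled prompt

# with_prompt = ar_decode(phoneme_ids + prompt_level1)   # full VALL-E
# without_prompt = ar_decode(phoneme_ids)                # "w/o acoustic prompt" ablation
```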

The following are the results from Table 5 of the original paper:

                      WER   SPK
VALL-E                5.9   0.585
w/o acoustic prompt   5.9   0.236

Table 5: Ablation study of the AR model.

Analysis of AR Ablation (Table 5):

  • w/o acoustic prompt: When the acoustic prompt is removed from the AR model (which generates the crucial first-level codes), the speaker similarity score (SPK) plummets from 0.585 to 0.236. The WER remains unchanged.
  • Conclusion: This demonstrates that the acoustic prompt provided to the AR model is extremely crucial for establishing and maintaining speaker identity. Even though the NAR model later receives its own acoustic prompt, the initial speaker information captured by the AR model is paramount.

6.3. Qualitative Analysis

Beyond quantitative metrics, VALL-E exhibits several interesting qualitative properties.

6.3.1. Diversity

Traditional TTS systems often produce highly deterministic outputs, leading to a one-to-one mapping between input text and output waveform. This is because mel spectrogram generation typically relies on reconstruction without inherent randomness. VALL-E, by contrast, uses a sampling-based method to generate discrete tokens during inference, which introduces randomness and allows for diverse outputs from the same input text and target speaker.
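Since the paper describes sampling-based decoding for the AR model, a minimal sketch of one decoding step shows where the randomness enters; the temperature and top-p values below are illustrative assumptions, not the paper's settings.

```python
# Sketch of sampling-based decoding for one step of the AR model: nucleus
# (top-p) sampling with temperature. Different random seeds may pick different
# tokens, which accumulates into diverse waveforms for the same text and prompt.
import torch


def sample_next_token(logits, temperature=1.0, top_p=0.9, generator=None):
    probs = torch.softmax(logits / temperature, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    # Keep the smallest set of tokens whose cumulative probability reaches top_p.
    keep = torch.cumsum(sorted_probs, dim=-1) - sorted_probs < top_p
    sorted_probs = sorted_probs * keep
    sorted_probs = sorted_probs / sorted_probs.sum()
    choice = torch.multinomial(sorted_probs, 1, generator=generator)
    return sorted_idx[choice].item()


logits = torch.randn(1024)  # toy distribution over one codec codebook
for seed in (0, 1):
    g = torch.Generator().manual_seed(seed)
    print(seed, sample_next_token(logits, generator=g))  # seeds may disagree -> diversity
```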

The figures below (Figures 4 and 5 from the original paper) illustrate this diversity.

[Image: two waveform plots (amplitude vs. time in seconds) of the same sentence, containing phrases such as "after early nightfall"; discussed as Figure 4a below.]
Figure 4: Diversity analysis of VALL-E. Each utterance is synthesized two times with different random seeds. We can observe substantial diversity of the two outputs regarding the same input.

[Image (Figure 5): waveforms of two synthesized versions of the same sentence; the upper shows smaller amplitude variation while the lower shows richer detail and stronger emphasis; discussed as Figure 4b below.]

Analysis of Diversity (Figures 4a and 4b):

  • Figure 4a (LibriSpeech sample): Shows two syntheses of the same sentence. They exhibit different lengths and phrase durations, with the first sample having a faster speech rate.
  • Figure 4b (VCTK sample): Shows two syntheses of a different sentence. They display variations in accent and emphasis; for example, the second output emphasizes the word "must" with a larger amplitude, while the first does not. This diversity is valuable for downstream applications like speech recognition, where varied inputs (different speakers, acoustic environments, prosodies) are beneficial for training robust models, a need that previous TTS systems struggled to meet.

6.3.2. Acoustic Environment Maintenance

An interesting finding is VALL-E's ability to maintain acoustic environment consistency between the acoustic prompt and the generated speech. If the acoustic prompt contains reverberation (e.g., recorded in a large room), VALL-E can synthesize speech that also includes reverberation. In contrast, baseline systems typically output clean speech regardless of the prompt's environment. The authors attribute this to VALL-E being trained on a large-scale dataset (LibriLight) that likely contains a wider variety of acoustic conditions than the cleaner data used by baselines, enabling it to learn and reproduce such environmental characteristics.

6.3.3. Speaker's Emotion Maintenance

VALL-E also demonstrates the ability to preserve the speaker's emotion present in the acoustic prompt in a zero-shot setting. By selecting acoustic prompts from an emotional speech dataset like EmoV-DB [Adigwe et al., 2018], VALL-E can synthesize speech that retains the same emotion (e.g., anger), even though the model was not explicitly fine-tuned on an emotional TTS dataset. This emergent capability suggests that the discrete audio codec codes and the large-scale training capture high-level prosodic and emotional features effectively.

6.4. Data Presentation (Tables)

The following are the results from Table 1 of the original paper:

                              Current Systems                VALL-E
Intermediate representation   mel spectrogram                audio codec code
Objective function            continuous signal regression   language model
Training data                 ≤ 600 hours                    60K hours
In-context learning           ✗                              ✓

Table 1: A comparison between VALL-E and current cascaded TTS systems.

Analysis of Table 1: This table summarizes the fundamental innovations of VALL-E compared to traditional cascaded TTS systems. It highlights the shift from mel spectrograms to audio codec codes as intermediate representations, from continuous signal regression to language modeling as the objective, a massive increase in training data scale (60K hours vs. ≤ 600 hours), and the crucial emergence of in-context learning capabilities in VALL-E.

7. Conclusion & Reflections

7.1. Conclusion Summary

The paper successfully introduced VALL-E, a groundbreaking language model approach to Text-to-Speech (TTS) synthesis. By leveraging discrete audio codec codes as intermediate representations and scaling pre-training to an unprecedented 60,000 hours of speech data, VALL-E demonstrates powerful in-context learning capabilities for zero-shot TTS. This allows for high-quality personalized speech synthesis for unseen speakers using only a 3-second acoustic prompt, without requiring any fine-tuning or complex architectural modifications. VALL-E achieved new state-of-the-art zero-shot TTS results on both LibriSpeech and VCTK datasets, significantly outperforming existing systems in speech naturalness (CMOS) and speaker similarity (SMOS). Furthermore, the model exhibits emergent properties such as generating diverse outputs, preserving the acoustic environment of the prompt, and maintaining the speaker's emotion in synthesis.

7.2. Limitations & Future Work

The authors acknowledge several limitations and suggest future research directions:

7.2.1. Synthesis Robustness

  • Limitation: VALL-E sometimes produces unclear, missed, or duplicated words. This robustness issue is primarily attributed to the autoregressive (AR) nature of the phoneme-to-acoustic language model part, which can suffer from disordered attention alignments without explicit constraints. This phenomenon is also observed in vanilla Transformer-based TTS models.
  • Future Work: The authors plan to leverage techniques from other TTS research, such as applying non-autoregressive models or modifying the attention mechanism, to address these robustness issues.

7.2.2. Data Coverage

  • Limitation: Despite using 60,000 hours of training data, the dataset still lacks comprehensive coverage of all voices, particularly for accent speakers (evidenced by slightly worse results on VCTK compared to LibriSpeech). Additionally, the LibriLight dataset, being derived from audiobooks, predominantly features a reading style, limiting the diversity of speaking styles.
  • Future Work: The authors propose to further scale up the training data to improve model performance across prosody, speaking style, and speaker similarity dimensions, believing that further data scaling, along with model scaling, could almost entirely solve the zero-shot TTS task.

7.2.3. Model Structure

  • Limitation: The current VALL-E architecture uses two separate models (an AR model for the first quantizer and a NAR model for the subsequent ones).
  • Future Work:
    • Universal Model: A promising direction is to unify these into a single, large universal model that can predict codes across all quantizers.
    • Full NAR Models: Exploring a complete non-autoregressive framework could significantly speed up model inference.

7.2.4. Broader Impacts

  • Potential Risks: The ability of VALL-E to synthesize speech that accurately maintains speaker identity poses potential risks of misuse, such as spoofing voice identification systems or impersonating specific speakers.
  • Mitigation: The authors suggest building a detection model to discriminate between human and VALL-E-synthesized audio clips. They also commit to practicing Microsoft AI Principles in future model development.

7.3. Personal Insights & Critique

7.3.1. Personal Insights

  • Paradigm Shift: VALL-E represents a significant paradigm shift in TTS, effectively porting the success of large language models from text to speech. The idea of treating TTS as a conditional language modeling problem over discrete audio tokens is elegant and powerful, enabling in-context learning that was previously elusive in TTS.
  • Power of Discrete Representations: The choice of neural audio codec codes from EnCodec as the intermediate representation is a critical enabler. Unlike HuBERT or wav2vec codes, which focus on linguistic content and discard speaker identity, EnCodec codes retain rich acoustic information (speaker characteristics, emotion, and environmental nuances) that is crucial for high-fidelity, personalized synthesis; see the sketch after this list.
  • Data Scaling is Key: The paper strongly reinforces the idea, now prevalent in AI, that data scale (especially semi-supervised) is a powerful force for generalization and emergent capabilities. The leap to 60,000 hours, even with noisy transcriptions, appears to be a major factor in VALL-E's success.
  • Emergent Abilities: The observed abilities to maintain acoustic environment and speaker's emotion in zero-shot settings are truly remarkable. They suggest that the model has learned a deep, disentangled understanding of various acoustic attributes beyond just content and speaker identity.
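As referenced above, here is a small sketch of extracting the discrete codec codes that VALL-E models, using the open-source EnCodec package (`pip install encodec`). The 6 kbps setting (eight codebooks at 75 Hz) matches the codec configuration described in the paper; the file path is a placeholder, and this is only a sketch of the tokenization step, not of VALL-E itself.

```python
# Sketch: extracting the discrete codec codes that VALL-E models, using the
# open-source EnCodec package. The 6 kbps setting yields 8 residual codebooks
# at 75 Hz; "prompt.wav" is a placeholder path.
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)  # 8 residual codebooks at 75 Hz

wav, sr = torchaudio.load("prompt.wav")
wav = convert_audio(wav, sr, model.sample_rate, model.channels).unsqueeze(0)  # [B, C, T]

with torch.no_grad():
    frames = model.encode(wav)                        # list of (codes, scale) frames
codes = torch.cat([c for c, _ in frames], dim=-1)     # [1, 8, T]; roughly 75 frames/sec

print(codes.shape)  # a 3-second prompt yields roughly 225 frames of 8 codes each
```

Because these codes are a lossy but near-complete description of the waveform, a language model over them can in principle reproduce speaker timbre, emotion, and room acoustics, unlike content-only units such as HuBERT clusters.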

7.3.2. Critique

  • Dependence on ASR for Training Data: The use of ASR-generated transcriptions for 60K hours of LibriLight data is a practical solution but also a potential weak point. Inaccuracies in these pseudo-transcriptions could introduce noise and propagate errors. While the paper argues robustness, the extent to which this might limit the model's ultimate fidelity or introduce subtle artifacts is worth further investigation.
  • Robustness Issues (AR Model): The acknowledged synthesis robustness issues (missed/duplicated words) from the AR model indicate that while the language modeling approach is powerful, the specific challenges of aligning phonemes to variable-length acoustic sequences still persist. This suggests that directly applying Transformer decoders to speech generation might still require specialized attention mechanisms or non-autoregressive techniques to overcome alignment pitfalls, as done in FastSpeech for mel spectrograms.
  • Compute Resources: Training VALL-E on 16 V100 GPUs for 800,000 steps underscores the substantial computational resources required for such large-scale language models. This limits accessibility for smaller research groups or individual researchers.
  • Black Box Nature: Like many large neural models, VALL-E is largely a black box. Understanding precisely how it extracts and disentangles speaker identity, emotion, and environment from acoustic prompts and reconstructs them is complex. Further interpretability research could provide deeper insights.
  • Ethical Implications: The broader impacts section is appropriately included. The potential for deepfake audio generation and impersonation is a serious concern. While a detection model is proposed, the arms race between generation and detection is well-known. Robust ethical guidelines and safeguards are paramount.

7.3.3. Transferability and Future Directions

  • Cross-Domain Application: The VALL-E approach could be transferable to other audio generation tasks beyond TTS, such as music synthesis, sound event generation, or speech-to-speech translation, by adapting the input prompts and discrete code targets.
  • Multilingual/Multimodal Extensions: Extending VALL-E to multilingual TTS or multimodal TTS (e.g., conditioning on visual cues like speaker's face) could unlock new capabilities.
  • Improved Phonetic Representations: To mitigate reliance on ASR-generated phonemes, future work could explore self-supervised learning of discrete phonetic representations that are more robust to noise and capture linguistic information directly from raw audio.
  • Unified Generative Model: The idea of a single, universal language model for all quantizer stages, or a fully NAR architecture, promises further efficiency and potentially even higher quality by optimizing the entire generation process end-to-end within one model.
