Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers
TL;DR Summary
The paper presents VALL-E, a novel Text-to-Speech method using a neural codec language model. It reformulates TTS as conditional language modeling, achieving high-quality personalized speech synthesis with just 3 seconds of an unseen speaker's recording, and significantly improving speech naturalness and speaker similarity over the prior state-of-the-art zero-shot TTS system.
Abstract
We introduce a language modeling approach for text to speech synthesis (TTS). Specifically, we train a neural codec language model (called Vall-E) using discrete codes derived from an off-the-shelf neural audio codec model, and regard TTS as a conditional language modeling task rather than continuous signal regression as in previous work. During the pre-training stage, we scale up the TTS training data to 60K hours of English speech which is hundreds of times larger than existing systems. Vall-E emerges in-context learning capabilities and can be used to synthesize high-quality personalized speech with only a 3-second enrolled recording of an unseen speaker as an acoustic prompt. Experiment results show that Vall-E significantly outperforms the state-of-the-art zero-shot TTS system in terms of speech naturalness and speaker similarity. In addition, we find Vall-E could preserve the speaker's emotion and acoustic environment of the acoustic prompt in synthesis. See https://aka.ms/valle for demos of our work.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
The central topic of the paper is a novel approach to Text-to-Speech (TTS) synthesis, specifically focusing on zero-shot generation of personalized speech using a neural codec language model.
1.2. Authors
The paper lists multiple authors from Microsoft, including Chengyi Wang, Sanyuan Chen, Yu Wu, Ziqiang Zhang, Long Zhou, Shujie Liu, Zhuo Chen, Yanqing Liu, Huaming Wang, Jinyu Li, Lei He, Sheng Zhao, and Furu Wei. Their research backgrounds appear to be in speech processing, natural language processing, and artificial intelligence, given their affiliation with Microsoft and the nature of the research.
1.3. Journal/Conference
The paper was published on arXiv, which is a preprint server for scientific articles. While arXiv is not a peer-reviewed journal or conference in itself, it is a highly influential platform for rapid dissemination of research in fields like AI, machine learning, and computer science. Papers often appear on arXiv before or concurrently with their submission to, or acceptance by, formal conferences or journals.
1.4. Publication Year
The paper was published on January 5, 2023.
1.5. Abstract
The paper introduces a language modeling approach for Text-to-Speech (TTS) synthesis named VALL-E. It trains a neural codec language model using discrete codes derived from an off-the-shelf neural audio codec model, reframing TTS as a conditional language modeling task rather than continuous signal regression. During pre-training, VALL-E is scaled up to use 60,000 hours of English speech, which is hundreds of times larger than existing systems. This extensive training enables VALL-E to develop in-context learning capabilities, allowing it to synthesize high-quality personalized speech from an unseen speaker with just a 3-second enrolled recording as an acoustic prompt. Experimental results demonstrate that VALL-E significantly surpasses the state-of-the-art zero-shot TTS systems in terms of speech naturalness and speaker similarity. Furthermore, the model is found to preserve the speaker's emotion and the acoustic environment of the prompt during synthesis.
1.6. Original Source Link
The official source link is https://arxiv.org/abs/2301.02111, and the PDF link is https://arxiv.org/pdf/2301.02111v1.pdf. It is currently published as a preprint on arXiv.
2. Executive Summary
2.1. Background & Motivation
2.1.1. Core Problem and Importance
The core problem the paper aims to solve is the limitation of current Text-to-Speech (TTS) systems in generating high-quality, personalized speech for unseen speakers in a zero-shot scenario.
Current TTS systems, whether cascaded (acoustic model + vocoder) or end-to-end, face several challenges:
- Data Scarcity: They typically require high-quality, clean data from recording studios. Large-scale data crawled from the internet often leads to performance degradation due to noise and variations.
- Poor Generalization: Due to relatively small training datasets (dozens to hundreds of hours), existing systems suffer from poor generalization. Speaker similarity and speech naturalness decline dramatically for speakers not present in the training data.
- Complex Customization: To tackle zero-shot TTS, prior work often relies on speaker adaptation (requiring additional fine-tuning), speaker encoding (requiring complex pre-designed features or heavy structural engineering), or meta-learning, which adds complexity and still limits efficiency.

This problem is crucial because the ability to customize a TTS system to an arbitrary voice using only a minimal enrolled recording is highly desirable for practical applications, enabling personalized voice assistants, content creation, and accessibility tools without extensive data collection or fine-tuning for each new speaker.
2.1.2. Paper's Entry Point and Innovative Idea
The paper's entry point is inspired by the success of large language models (LLMs) in text synthesis (e.g., GPT-3, PaLM), which demonstrated that scaling up models with large and diverse data can lead to emergent capabilities like in-context learning. The innovative idea is to transfer this success to speech synthesis by:
- Reframing TTS as a Conditional Language Modeling Task: Instead of the traditional approach of continuous signal regression (e.g., predicting mel spectrograms), VALL-E treats TTS as predicting discrete audio tokens.
- Utilizing Discrete Audio Codec Codes: It leverages an off-the-shelf neural audio codec model (specifically, EnCodec) to tokenize raw audio into discrete codes. These codes are rich enough to retain speaker identity, emotion, and acoustic environment information, unlike simpler self-supervised speech representations that discard speaker characteristics.
- Massive Scale-Up of Training Data: VALL-E is pre-trained on an unprecedented 60,000 hours of English speech, which is hundreds of times more than previous TTS systems. This large, diverse, multi-speaker dataset is semi-supervised (using ASR-generated transcriptions for audio-only data), suggesting robustness to noise.
- Enabling In-Context Learning: By treating speech synthesis as a language modeling task over discrete tokens, VALL-E can leverage prompting-based large-model techniques, exhibiting in-context learning capabilities. This allows it to synthesize high-quality personalized speech from unseen speakers with just a 3-second acoustic prompt, without any fine-tuning or complex architectural modifications.
2.2. Main Contributions / Findings
The primary contributions and key findings of the paper are:
- Introduction of VALL-E, a Novel TTS Framework: VALL-E is the first TTS framework that leverages a language model approach with audio codec codes as intermediate representations, replacing traditional mel spectrograms. This design enables strong in-context learning capabilities, similar to GPT-3, for zero-shot TTS without requiring additional structure engineering, pre-designed acoustic features, or fine-tuning.
- Demonstration of Data Scaling for TTS: The paper builds a generalized TTS system by pre-training on an unprecedented 60,000 hours of semi-supervised speech data. This highlights that simply scaling up semi-supervised data has been underestimated in the TTS field and can lead to significant performance improvements and generalization across speakers.
- Achieving State-of-the-Art Zero-Shot TTS: VALL-E significantly outperforms the state-of-the-art zero-shot TTS system on both LibriSpeech and VCTK in terms of speech naturalness (CMOS) and speaker similarity (SMOS). On VCTK, it even achieves a CMOS score comparable to ground truth, suggesting the synthesized speech is as natural as human recordings.
- Emergence of Advanced Synthesis Capabilities: VALL-E demonstrates several emergent capabilities:
  - Diverse Outputs: It can generate diverse synthesized results (e.g., varying pace, accent, or prosody) for the same input text and target speaker by using different sampling strategies during inference.
  - Acoustic Environment Maintenance: The model can preserve the acoustic environment (e.g., reverberation) present in the acoustic prompt during synthesis.
  - Speaker's Emotion Maintenance: VALL-E can maintain the speaker's emotion from the acoustic prompt, even without explicit emotional TTS training.

These findings address the problems of poor generalization and complex customization in zero-shot TTS by offering a scalable, data-driven, and prompt-based solution that achieves high-quality, personalized speech synthesis for unseen speakers.
3. Prerequisite Knowledge & Related Work
This section provides the foundational knowledge and reviews the technological landscape relevant to understanding the VALL-E paper.
3.1. Foundational Concepts
3.1.1. Text-to-Speech (TTS) Synthesis
Text-to-Speech (TTS) synthesis is the artificial production of human speech from text. The goal is to convert written language into spoken language that sounds natural and intelligible. Early systems used concatenative synthesis (piecing together pre-recorded speech units), while modern systems primarily use neural networks.
3.1.2. Neural Networks
Neural networks are a class of machine learning models inspired by the structure and function of the human brain. They consist of interconnected nodes (neurons) organized in layers. They learn to recognize patterns in data through training, adjusting the strengths of connections (weights) between neurons. In TTS, they are used to learn the complex mapping from text to acoustic features or directly to waveforms.
3.1.3. Mel Spectrograms
A mel spectrogram is a visual representation of the spectrum of frequencies of a sound signal as it varies with time, but with the frequencies transformed onto the mel scale. The mel scale is a perceptual scale of pitches judged by listeners to be equal in distance from one another. Mel spectrograms are widely used as an intermediate representation in traditional TTS systems because they capture human auditory perception more accurately than linear frequency scales, providing a compact and perceptually relevant representation of speech.
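To make the representation concrete, the short sketch below computes an 80-band mel spectrogram with the librosa library; the file name and parameter values are illustrative choices, not taken from the paper.

```python
import librosa
import numpy as np

# Load an audio file (path is illustrative) and resample to 16 kHz.
waveform, sample_rate = librosa.load("example.wav", sr=16000)

# Compute an 80-band mel spectrogram: STFT -> power spectrum -> mel filterbank.
mel = librosa.feature.melspectrogram(
    y=waveform,
    sr=sample_rate,
    n_fft=1024,        # STFT window size
    hop_length=256,    # frame shift in samples
    n_mels=80,         # number of mel bands
)

# Convert power to decibels, the scale usually fed to acoustic models.
mel_db = librosa.power_to_db(mel, ref=np.max)
print(mel_db.shape)  # (80, number_of_frames)
```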
3.1.4. Vocoder
A vocoder (voice encoder-decoder) is a system that analyzes and synthesizes the human voice. In TTS pipelines, after an acoustic model generates mel spectrograms (or other acoustic features) from text, a vocoder is responsible for converting these acoustic features back into a raw audio waveform. Examples include WaveNet, WaveGlow, and HifiGAN.
3.1.5. Zero-Shot Text-to-Speech (TTS)
Zero-shot TTS refers to the ability of a TTS system to synthesize speech in the voice of a speaker it has never encountered during its training phase, using only a very short (e.g., 3-second) audio sample of that speaker's voice as a prompt. This is a challenging task because the model must generalize speaker-specific characteristics from limited information.
3.1.6. Language Model (LM)
A Language Model (LM) is a statistical model that determines the probability of a sequence of words or tokens. In recent years, large neural language models like GPT-3 have demonstrated remarkable abilities in generating coherent and contextually relevant text by learning patterns from vast amounts of text data. The VALL-E paper re-frames TTS as a conditional language modeling task on discrete audio tokens.
3.1.7. Discrete Codes / Acoustic Tokens
Discrete codes (or acoustic tokens) are quantized, distinct numerical representations of raw audio. Instead of representing audio as a continuous signal or a continuous feature like a mel spectrogram, discrete codes assign specific, finite integer values to segments of audio. This process is similar to how words are discrete tokens in text. Using discrete codes allows TTS to be treated as a sequence-to-sequence problem, where the model predicts a sequence of discrete audio tokens rather than continuous values.
3.1.8. Neural Audio Codec
A neural audio codec is a type of neural network specifically designed to encode (compress) audio waveforms into a more compact, often discrete, representation and then decode (decompress) that representation back into a high-fidelity audio waveform. Unlike traditional audio codecs (e.g., MP3), neural codecs learn a more perceptually relevant compression, often retaining crucial information like speaker identity, emotion, and acoustic environment even at very low bitrates. EnCodec [Défossez et al., 2022] is an example used in VALL-E.
3.1.9. In-Context Learning
In-context learning is an emergent capability observed in large language models. It refers to the model's ability to learn a new task or adapt to a new style (e.g., a new speaker's voice) by simply being provided with a few examples or a prompt within the input sequence, without requiring any updates to its internal parameters (i.e., no fine-tuning). The model implicitly "learns" from the context provided in the prompt.
3.1.10. Autoregressive (AR) and Non-Autoregressive (NAR) Models
- Autoregressive (AR) Model: An
autoregressive modelpredicts the next element in a sequence based on all previously predicted elements in that same sequence. Each output step depends on the outputs of the preceding steps. This makes them good for modeling dependencies but can be slow during inference as prediction must be sequential. - Non-Autoregressive (NAR) Model: A
non-autoregressive modelpredicts all elements of a sequence simultaneously or in parallel, without relying on previously predicted elements within the same output sequence. This can significantly speed up inference but might struggle with capturing long-range dependencies or maintaining consistency compared toARmodels.
3.1.11. Transformer
The Transformer is a neural network architecture introduced by Vaswani et al. (2017) that relies entirely on self-attention mechanisms to process input sequences. Unlike recurrent neural networks (RNNs) or convolutional neural networks (CNNs), Transformers can process all parts of an input sequence in parallel, making them highly efficient for long sequences and capable of capturing long-range dependencies. They consist of an encoder and a decoder stack, each with multi-head self-attention and feed-forward layers. VALL-E uses Transformer decoder architecture.
3.1.12. Residual Vector Quantization (RVQ)
Residual Vector Quantization (RVQ) is a technique used in vector quantization (VQ) where the quantization error (the difference between the original signal and its quantized version) is progressively quantized by subsequent quantizers. Instead of a single quantizer trying to capture all information, RVQ uses a cascade of quantizers. The first quantizer captures the most significant information, and each subsequent quantizer learns to encode the residual (the error) from the previous stage. This hierarchical approach allows for efficient compression and reconstruction, with earlier stages capturing coarser features (like speaker identity) and later stages refining finer details.
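As an illustration of the residual cascade, here is a minimal NumPy sketch of RVQ over a single frame, using random codebooks in place of the learned ones inside a real neural codec.

```python
import numpy as np

rng = np.random.default_rng(0)

def build_codebooks(n_quantizers=8, codebook_size=1024, dim=128):
    # Random codebooks stand in for the learned ones of a real codec.
    return [rng.normal(size=(codebook_size, dim)) for _ in range(n_quantizers)]

def rvq_encode(x, codebooks):
    """Quantize vector x with a cascade of codebooks; each stage encodes
    the residual left over by the previous stage."""
    residual = x
    codes = []
    for cb in codebooks:
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        codes.append(idx)
        residual = residual - cb[idx]
    return codes

def rvq_decode(codes, codebooks):
    # Reconstruction is the sum of the selected entries from every stage.
    return sum(cb[idx] for idx, cb in zip(codes, codebooks))

codebooks = build_codebooks()
x = rng.normal(size=128)
codes = rvq_encode(x, codebooks)
x_hat = rvq_decode(codes, codebooks)
print(codes)                                          # eight integer tokens for one frame
print(np.linalg.norm(x - x_hat) / np.linalg.norm(x))  # relative error shrinks with more stages
```

The design point the sketch illustrates is that the first codebook absorbs most of the signal's energy, while each later codebook only refines what is left over, which is exactly why VALL-E treats the first codebook differently from the rest.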
3.2. Previous Works
The paper contextualizes VALL-E by discussing advancements in zero-shot TTS and spoken generative pre-trained models.
3.2.1. Zero-Shot TTS
- Cascaded TTS Systems:
  - Traditional TTS systems often follow a cascaded pipeline, typically comprising an acoustic model (e.g., Tacotron2 [Shen et al., 2018], FastSpeech [Ren et al., 2019], Transformer TTS [Li et al., 2019]) that converts text to mel spectrograms, followed by a vocoder (e.g., WaveNet, WaveGlow, HifiGAN) that synthesizes the waveform from the mel spectrograms.
  - Acoustic Model Example (Tacotron2): Tacotron2 is an autoregressive model that predicts a mel spectrogram from a sequence of input characters. It consists of an encoder (character embeddings, convolutional layers, bi-directional LSTM) and a decoder (LSTM with an attention mechanism) which generates mel frames one by one. The attention mechanism is crucial for aligning input characters with output mel frames: $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $, where $Q$ is the Query matrix (from the decoder), $K$ is the Key matrix (from the encoder), $V$ is the Value matrix (from the encoder), and $d_k$ is the dimension of the keys, used for scaling.
- End-to-End TTS Models:
  - These models aim to optimize the acoustic model and vocoder jointly, synthesizing speech directly from text without an explicit intermediate mel spectrogram stage or by integrating the vocoder into the training (e.g., Conditional VAE with Adversarial Learning [Kim et al., 2021], DelightfulTTS 2 [Liu et al., 2022]).
- Approaches for Zero-Shot TTS:
  - Speaker Adaptation: Pioneers like Arik et al. [2018] proposed methods to adapt a pre-trained TTS model to a new speaker with a small amount of target speaker data. Subsequent works [Chen et al., 2019, Wang et al., 2020, Chen et al., 2021] focused on improving adaptation efficiency. Meta-TTS [Huang et al., 2022] applied meta-learning to this, requiring only a few shots (examples) for adaptation.
  - Speaker Encoding-based Methods: These methods use a separate speaker encoder to extract a speaker embedding (a fixed-size vector representation of a speaker's voice) from an enrolled recording. This embedding is then fed to the TTS component. The speaker encoder can be pre-trained on speaker verification tasks (e.g., Jia et al. [2018]). Advanced speaker embedding models [Cai et al., 2018] or complex speaker encoder designs [Wu et al., 2022] are used to improve quality for unseen speakers, but Tan et al. [2021] note that quality for unseen speakers often remains undesirable.
  - Diffusion Models: More recently, diffusion model-based TTS [Popov et al., 2021, Kim et al., 2022] has been extended to zero-shot TTS [Kang et al., 2022].
- Differentiation: VALL-E follows a cascaded approach (it uses an external neural codec decoder) but differs fundamentally by using audio codec codes as intermediate representations instead of mel spectrograms, and by exhibiting in-context learning without fine-tuning or complex speaker encoder engineering.
3.2.2. Spoken Generative Pre-trained Models
- Self-supervised Speech Models: Self-supervised learning is widely used in speech understanding (e.g., wav2vec 2.0 [Baevski et al., 2020b], HuBERT [Hsu et al., 2021], WavLM [Chen et al., 2022]). These models learn representations from unlabeled speech data. vq-wav2vec [Baevski et al., 2020a] and HuBERT [Hsu et al., 2021] extract discrete tokens from speech, which can reconstruct content but often discard speaker identity and have low reconstruction quality [Borsos et al., 2022].
- Speech-to-Speech Generation: GSLM [Lakhotia et al., 2021] synthesizes speech based on HuBERT codes. Polyak et al. [2021] improved GSLM by combining HuBERT codes with VQVAE codes and a speaker encoder. AudioLM [Borsos et al., 2022] trains speech-to-speech language models using k-means tokens from self-supervised models and acoustic tokens from neural codec models [Zeghidour et al., 2022] to generate high-quality speech. AudioLM can synthesize speech from audio codecs without an additional vocoder.
- Pre-training for Neural TTS: Chung et al. [2018] pre-trained a speech decoder in TTS using autoregressive mel-spectrogram prediction. SpeechT5 [Ao et al., 2022] proposed a unified-modal encoder-decoder framework that leverages unlabeled speech and text for pre-training all TTS components. Tjandra et al. [2019] quantized unlabeled speech into discrete tokens using VQVAE [van den Oord et al., 2017] and trained a token-to-speech sequence model, demonstrating efficiency with small fine-tuning data. A3T [Bai et al., 2022] proposed mask and reconstruction on mel spectrograms for speech editing and synthesis.
- Differentiation: Previous TTS pre-training work used less than 1K hours of data, whereas VALL-E uses 60K hours. Crucially, VALL-E is the first to use audio codec codes as intermediate representations and to exhibit in-context learning for zero-shot TTS. Unlike AudioLM, which is speech-to-speech, VALL-E is text-to-speech, providing explicit content control.
3.3. Technological Evolution
The field of speech synthesis has evolved significantly:
- Rule-based/Concatenative Systems: Early TTS systems relied on linguistic rules and concatenating pre-recorded phonetic units.
- Statistical Parametric Synthesis (e.g., HMM-based): Hidden Markov Models (HMMs) were used to model speech features, offering more flexibility but often producing robotic-sounding speech.
- Deep Learning Era (Cascaded Systems): The last decade saw dramatic breakthroughs with neural networks. Cascaded systems like Tacotron2 + WaveNet improved naturalness by mapping text to mel spectrograms (acoustic model) and then to waveforms (vocoder).
- End-to-End Systems: Integrating the acoustic model and vocoder into a single neural network (e.g., Deep Voice, FastSpeech 2, Glow-TTS, conditional VAEs) simplified pipelines and further enhanced quality.
- Multi-Speaker & Zero-Shot TTS: Efforts focused on generalizing to multiple speakers and unseen speakers, leading to speaker adaptation and speaker encoding techniques.
- Self-Supervised Learning & Generative Models: Inspired by advancements in NLP, self-supervised learning (e.g., wav2vec 2.0, HuBERT) enabled learning representations from large unlabeled speech data. Generative models like diffusion models and language models (e.g., AudioLM for speech-to-speech) started to emerge.
- Large Language Model Paradigm Shift (VALL-E): This paper marks a significant point by directly applying the language modeling paradigm, successful in text generation, to text-to-speech. By using discrete audio codec codes and massive semi-supervised data, VALL-E leverages the in-context learning capabilities of large language models to achieve unprecedented zero-shot TTS performance.
3.4. Differentiation Analysis
The core differences and innovations of VALL-E compared to previous zero-shot TTS and spoken generative pre-trained models are:
- Intermediate Representation:
  - Previous TTS: Primarily uses mel spectrograms as an intermediate, continuous representation.
  - VALL-E: Uses discrete audio codec codes (specifically from EnCodec). This is a fundamental shift that allows TTS to be treated as a discrete token sequence generation problem, analogous to text language modeling.
- Objective Function:
  - Previous TTS: Typically optimized as a continuous signal regression problem (predicting mel spectrogram values).
  - VALL-E: Optimized as a language modeling task, predicting the next discrete code in a sequence.
- Training Data Scale:
  - Previous TTS/Pre-training: Generally trained on dozens to hundreds of hours (e.g., LibriTTS) or at most ~1K hours of data for pre-training.
  - VALL-E: Pre-trained on an unprecedented 60,000 hours of semi-supervised English speech, hundreds of times larger. This scale is crucial for its emergent capabilities.
- In-Context Learning Capability:
  - Previous TTS: Lacks in-context learning. Zero-shot TTS methods typically require fine-tuning, speaker adaptation modules, or complex speaker encoders with pre-designed features.
  - VALL-E: Exhibits emergent in-context learning capabilities, similar to GPT-3. It can synthesize high-quality, personalized speech for unseen speakers with just a 3-second acoustic prompt, without any additional fine-tuning or structural engineering.
- Generative Diversity and Acoustic Richness:
  - Previous TTS: Often produces deterministic outputs for a given text, lacking diversity, and typically focuses on clean speech.
  - VALL-E: Can generate diverse outputs (different prosody, pace) due to its sampling-based decoding. It also preserves rich acoustic information from the prompt, including the speaker's emotion and acoustic environment (e.g., reverberation), which is a novel emergent property in zero-shot TTS.
- Content Control:
  - Speech-to-Speech LMs (e.g., AudioLM): These models primarily take speech as input and generate speech, making explicit content control challenging or requiring additional components.
  - VALL-E: As a text-to-speech model, it offers direct and explicit control over the synthesized content via the input text.

In essence, VALL-E shifts the paradigm of zero-shot TTS by treating it as a conditional language modeling problem over discrete audio tokens, enabled by massive data scaling and the inherent in-context learning abilities of large Transformer models.
4. Methodology
This section details the technical solution proposed by VALL-E, from its foundational principles to its specific architectural components and training procedures.
4.1. Principles
The core idea of VALL-E is to re-frame Text-to-Speech (TTS) synthesis as a conditional language modeling task. Instead of predicting continuous acoustic features like mel spectrograms, VALL-E learns to predict sequences of discrete audio tokens. This approach is motivated by the success of large language models (LMs) in text generation, which have shown powerful in-context learning capabilities when trained on vast amounts of data.
The theoretical basis and intuition behind this approach are:
- Discretization Enables Language Modeling: By converting continuous audio into discrete codes, speech can be treated like text tokens. This allows the application of powerful Transformer-based language models that excel at modeling sequences of discrete entities.
- Neural Codecs Preserve Rich Information: An off-the-shelf neural audio codec (like EnCodec) can compress audio into discrete tokens while preserving high fidelity and crucial acoustic properties such as speaker identity, emotion, and acoustic environment. This is vital for personalized and expressive TTS.
- Large-Scale Data for Generalization and Emergent Abilities: Training on a massive, diverse dataset allows the model to learn a wide range of speaking styles, prosodies, and acoustic conditions. This scale is hypothesized to enable in-context learning, where the model can adapt to unseen speakers from a short acoustic prompt without explicit fine-tuning.
- Hierarchical Generation: Leveraging the hierarchical nature of residual vector quantization (RVQ) in neural codecs, VALL-E employs a two-stage generation process: an autoregressive (AR) model for the coarse acoustic tokens (which dictate rhythm and speaker identity) and a non-autoregressive (NAR) model for the finer details (which refine the acoustic quality). This balances quality and inference speed.
4.2. Core Methodology In-depth
The VALL-E framework consists of three main stages: speech quantization, conditional codec language modeling (with AR and NAR components), and inference via prompting.
4.2.1. Speech Quantization
The first step in VALL-E is to convert raw audio waveforms into discrete acoustic codes. This is crucial because raw audio, typically stored as 16-bit integer values at high sampling rates (e.g., 24 kHz), results in extremely long sequences, making it intractable for language models. Speech quantization compresses both the value range and the sequence length.
VALL-E adopts a pre-trained neural audio codec model, EnCodec [Défossez et al., 2022], as its tokenizer. EnCodec is a convolutional encoder-decoder model designed for high-fidelity audio compression.
- Encoding Process: EnCodec takes a 24 kHz audio waveform as input. Its encoder produces embeddings at a rate of 75 Hz (a 320-fold reduction in sampling rate). Each embedding is then subjected to Residual Vector Quantization (RVQ).
- Residual Vector Quantization (RVQ): In RVQ, a series of quantizers are used. The first quantizer captures the most significant information, and subsequent quantizers learn to encode the residual error from the previous stage. This creates a hierarchical structure of discrete tokens.
  - VALL-E uses eight hierarchical quantizers, each with 1024 entries. This configuration corresponds to EnCodec operating at a 6K bitrate for 24 kHz audio reconstruction.
  - For a 10-second waveform, this results in a discrete representation matrix of 750 × 8 entries, where 750 is the number of downsampled time steps (75 Hz × 10 s) and 8 is the number of quantizers.
  - The paper notes that the first quantizer plays the most important role in reconstruction, with the impact of the others gradually decreasing, as illustrated in the figure below. This hierarchy motivates the two-stage (AR for the first quantizer, NAR for the others) language modeling design.

  The following figure (Figure 2 from the original paper) shows the neural audio codec model:

  Figure 2: The neural audio codec model revisit. Because RVQ is employed, the first quantizer plays the most important role in reconstruction, and the impact from others gradually decreases.
- Decoding Process: With the discrete codes from all quantizers, the convolutional decoder of EnCodec generates real-valued embeddings and reconstructs the waveform at 24 kHz. The advantage of using a neural codec is that it contains rich speaker and acoustic information, allowing for speaker identity preservation and high-quality reconstruction without additional vocoder training.
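The following sketch shows how such a tokenizer can be used in practice, assuming the publicly released `encodec` Python package; the file name is illustrative and the exact API may differ between versions.

```python
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

# Load the pre-trained 24 kHz codec and pick the 6 kbps setting,
# which corresponds to 8 residual quantizers with 1024 entries each.
model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)

# Read a waveform (path is illustrative) and match the codec's expected format.
wav, sr = torchaudio.load("enrolled_3s.wav")
wav = convert_audio(wav, sr, model.sample_rate, model.channels)
wav = wav.unsqueeze(0)  # shape: [batch, channels, samples]

with torch.no_grad():
    frames = model.encode(wav)                      # list of (codes, scale) frames
codes = torch.cat([c for c, _ in frames], dim=-1)   # shape: [batch, 8, T], 75 tokens per second

with torch.no_grad():
    reconstruction = model.decode(frames)           # back to a 24 kHz waveform
print(codes.shape, reconstruction.shape)
```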
4.2.2. Problem Formulation: Regarding TTS as Conditional Codec Language Modeling
Given a dataset $\mathcal{D} = \{\mathbf{x}_i, \mathbf{y}_i\}$ of audio samples $\mathbf{y}$ and their corresponding phoneme transcriptions $\mathbf{x}$, the neural codec model encodes each audio sample into a discrete acoustic code matrix $\mathrm{Encodec}(\mathbf{y}) = \mathbf{C}^{T \times 8}$, where $T$ is the downsampled utterance length and 8 is the number of quantizers. Each row vector $\mathbf{c}_{t,:}$ represents the eight codes for frame $t$, and each column vector $\mathbf{c}_{:,j}$ represents the code sequence from the $j$-th codebook ($j \in \{1, \dots, 8\}$). The neural codec decoder can reconstruct the waveform from $\mathbf{C}$, denoted as $\mathrm{Decodec}(\mathbf{C}) \approx \hat{\mathbf{y}}$.
VALL-E frames zero-shot TTS as a conditional codec language modeling task. The goal is to train a neural language model to generate an acoustic code matrix conditioned on:
- A phoneme sequence $\mathbf{x}$ (for content).
- An acoustic prompt matrix $\tilde{\mathbf{C}}$ (for the speaker's voice, derived from an enrolled recording).

The optimization objective is to maximize the probability of the generated codes:
$ \max p(\mathbf{C} \mid \mathbf{x}, \tilde{\mathbf{C}}) $
Here, $\tilde{\mathbf{C}}$ is obtained by applying the same neural codec to an enrolled recording (e.g., a 3-second speech segment of the target speaker). The language model is expected to learn to extract content information from $\mathbf{x}$ and speaker information from $\tilde{\mathbf{C}}$. During inference, the trained language model first estimates $\mathbf{C}$, and then the neural codec decoder synthesizes the high-quality speech.
4.2.3. Training: Conditional Codec Language Modeling
Due to the hierarchical nature of residual quantization, where earlier quantizers capture coarser acoustic properties (like speaker identity) and later quantizers learn finer details, VALL-E designs two conditional language models in a hierarchical manner: one autoregressive (AR) for the first quantizer's codes and one non-autoregressive (NAR) for the subsequent quantizers' codes.
The following figure (Figure 3 from the original paper) shows the structure of the conditional codec language modeling:

Figure 3: The structure of the conditional codec language modeling, which is built in a hierarchical manner. In practice, the NAR decoder will be called seven times to generate codes in seven quantizers.
4.2.3.1. Autoregressive Codec Language Modeling (AR Model)
The autoregressive (AR) language model is responsible for generating the discrete tokens from the first quantizer (i.e., $\mathbf{c}_{:,1}$). This model is crucial for establishing the rhythm, prosody, and overall speaker identity of the synthesized speech, as the first quantizer's codes carry the most significant acoustic information.
- Model Architecture: It comprises a phoneme embedding layer $W_x$, an acoustic embedding layer $W_a$, a Transformer decoder (causal Transformer), and a prediction layer.
- Input and Conditioning: The AR model is conditioned on the phoneme sequence $\mathbf{x}$ (as the content prompt) and the acoustic prompt from the first quantizer's codes ($\tilde{\mathbf{c}}_{:,1}$). The model's input is the concatenation of the embedded $\mathbf{x}$ and $\mathbf{c}_{:,1}$. Special <EOS> tokens are appended after each sequence, and sinusoidal position embeddings are computed separately for the prompt and input tokens.
- Causal Transformer: As a causal Transformer model, each token $\mathbf{c}_{t,1}$ can only attend to the phoneme tokens and the previously generated acoustic tokens.
- Optimization Objective: The model is optimized to maximize the probability of predicting the next token in the first codebook.
$
p ( \mathbf { c } _ { : , 1 } | \mathbf { x } , \tilde { \mathbf { C } } _ { : , 1 } ; \theta _ { A R } ) = \prod _ { t = 0 } ^ { T } p ( \mathbf { c } _ { t , 1 } | \mathbf { c } _ { < t , 1 } , \tilde { \mathbf { c } } _ { : , 1 } , \mathbf { x } ; \theta _ { A R } )
$
Where:
- $\mathbf{c}_{:,1}$ represents the sequence of discrete codes from the first quantizer for the target speech.
- $\mathbf{x}$ is the input phoneme sequence for the target content.
- $\tilde{\mathbf{c}}_{:,1}$ represents the sequence of discrete codes from the first quantizer of the acoustic prompt (enrolled recording).
- $\theta_{AR}$ denotes the parameters of the autoregressive model.
- $T$ is the total length of the generated acoustic code sequence.
- $p(\mathbf{c}_{t,1} \mid \mathbf{c}_{<t,1}, \tilde{\mathbf{c}}_{:,1}, \mathbf{x}; \theta_{AR})$ is the probability of predicting the code at time step $t$, conditioned on all previous codes, the acoustic prompt, and the phoneme sequence.
- Parameter Sharing: The output projection layer's parameters are shared with the acoustic embedding layer.
causal language model trainingwhere any prefix sequence acts as a prompt for the subsequent part. During inference, theacoustic token sequenceof the enrolled recording () is explicitly used as a prefix forAR decoding, and thephoneme sequenceof the enrolled recording is concatenated with the targetphoneme sequenceas thephoneme prompt. This allows flexible length prediction and adaptation to diverse speaking speeds.
4.2.3.2. Non-Autoregressive Codec Language Modeling (NAR Model)
Once the first-quantizer codes $\mathbf{c}_{:,1}$ are obtained from the AR model, a non-autoregressive (NAR) language model is employed to generate the codes for the remaining seven quantizers (i.e., $\mathbf{C}_{:,2:8}$). The NAR model generates all codes of a given stage in parallel, which significantly speeds up inference.
- Model Architecture: Similar to the AR model, but it includes eight separate acoustic embedding layers (one for each quantizer).
- Training Process: In each training step, a random stage $i \in [2, 8]$ is sampled. The model is trained to maximize the probability of the acoustic tokens from the $i$-th quantizer codebook.
- Input and Conditioning: The NAR model is conditioned on:
  - The phoneme sequence $\mathbf{x}$ (content prompt).
  - The full acoustic prompt matrix $\tilde{\mathbf{C}}$ (speaker identity and acoustic environment).
  - The predicted acoustic tokens from all previous codebooks, $\mathbf{C}_{:,<i}$, for the current stage $i$.
- Embedding of Previous Codebooks: The acoustic tokens from stage 1 to stage $i-1$ (i.e., $\mathbf{C}_{:,<i}$) are embedded and summed up as part of the model input: $ e_{c_{t,j}} = W_a^j \odot c_{t,j} $ Where:
  - $e_{c_{t,j}}$ is the embedded representation of the code at time step $t$ from quantizer $j$.
  - $W_a^j$ is the acoustic embedding layer specific to quantizer $j$.
  - $\odot$ indicates index selection (retrieving the embedding for the specific code value).
  - The summed embedding for frame $t$ from the previous quantizers up to $i-1$ is: $ e_{\mathbf{c}_t} = \sum_{j=1}^{i-1} e_{c_{t,j}} $
  - The full acoustic prompt matrix $\tilde{\mathbf{C}}$ is also embedded by summing the embeddings from all eight quantizers for each frame: $ e_{\tilde{\mathbf{c}}_t} = \sum_{j=1}^{8} e_{\tilde{c}_{t,j}} $
  - The final Transformer input is the concatenation of the embedded phoneme sequence, the embedded acoustic prompt, and the summed embeddings of the previously generated acoustic codes.
- Positional Embeddings: Positional embeddings are computed separately for the prompts and the acoustic sequence.
- Stage Injection: The current quantizer stage $i$ is injected into the network using Adaptive Layer Normalization (AdaLN) [Xu et al., 2019] to inform the model which stage it is currently generating codes for (a minimal sketch follows this list): $ \mathrm{AdaLN}(h, i) = a_i \, \mathrm{LayerNorm}(h) + b_i $, where $h$ represents the intermediate activations, and $a_i$ and $b_i$ are obtained from a linear projection of the stage embedding.
- Non-Autoregressive Attention: Unlike the AR model, the NAR model allows each token to attend to all input tokens in the self-attention layers.
- Parameter Sharing: The weights of the $j$-th prediction layer are shared with the $(j+1)$-th acoustic embedding layer.
- Optimization Objective: The NAR model's overall objective is to predict the codes for quantizers 2 through 8: $ p(\mathbf{C}_{:,2:8} \mid \mathbf{x}, \tilde{\mathbf{C}}; \theta_{NAR}) = \prod_{j=2}^{8} p(\mathbf{c}_{:,j} \mid \mathbf{C}_{:,<j}, \mathbf{x}, \tilde{\mathbf{C}}; \theta_{NAR}) $ Where:
  - $\mathbf{C}_{:,2:8}$ refers to the codes from quantizer stages 2 through 8.
  - $\mathbf{c}_{:,j}$ is the sequence of discrete codes from the $j$-th quantizer.
  - $\mathbf{C}_{:,<j}$ represents the codes from all quantizers prior to stage $j$ that have already been generated.
  - $\theta_{NAR}$ denotes the parameters of the non-autoregressive model.
4.2.3.3. Overall Prediction
The complete prediction of the acoustic code matrix is modeled as the product of the AR model for the first quantizer and the NAR model for the subsequent quantizers:
$
p(\mathbf{C} \mid \mathbf{x}, \tilde{\mathbf{C}}; \theta) = p(\mathbf{c}_{:,1} \mid \tilde{\mathbf{C}}_{:,1}, \mathbf{x}; \theta_{AR}) \prod_{j=2}^{8} p(\mathbf{c}_{:,j} \mid \mathbf{C}_{:,<j}, \mathbf{x}, \tilde{\mathbf{C}}; \theta_{NAR})
$
This combination leverages the strength of AR for flexible length prediction and initial speaker modeling and the efficiency of NAR for refining acoustic details.
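To make the two-stage factorization concrete, the sketch below wires an AR sampling loop for the first codebook to parallel greedy passes for codebooks 2 through 8, using randomly initialized stand-ins in place of the trained Transformers (all names and shapes here are illustrative assumptions, not the paper's code).

```python
import torch

NUM_CODEBOOKS, VOCAB = 8, 1024

# Stand-ins for the trained models: they only return random logits.
def ar_logits(phonemes, first_codes):
    return torch.randn(VOCAB)                       # next-token logits for codebook 1

def nar_logits(phonemes, prompt_codes, prev_codes, stage):
    return torch.randn(prev_codes.shape[1], VOCAB)  # per-frame logits for codebook `stage`

def synthesize(phonemes, prompt_codes, new_frames=150, temperature=1.0):
    # Stage 1: autoregressive sampling of the first codebook, with the enrolled
    # recording's first-codebook codes used as a prefix (no beam search).
    first = prompt_codes[0].tolist()
    for _ in range(new_frames):
        probs = torch.softmax(ar_logits(phonemes, first) / temperature, dim=-1)
        first.append(torch.multinomial(probs, 1).item())
    codes = torch.tensor(first).unsqueeze(0)                  # shape [1, T]

    # Stage 2: non-autoregressive greedy prediction of codebooks 2..8,
    # each stage conditioned on all previously generated codebooks.
    for stage in range(2, NUM_CODEBOOKS + 1):
        logits = nar_logits(phonemes, prompt_codes, codes, stage)
        codes = torch.cat([codes, logits.argmax(dim=-1).unsqueeze(0)], dim=0)
    return codes                                              # [8, T] -> neural codec decoder

phonemes = torch.randint(0, 100, (30,))
prompt_codes = torch.randint(0, VOCAB, (NUM_CODEBOOKS, 225))  # 3 s of enrolled codes at 75 Hz
print(synthesize(phonemes, prompt_codes).shape)               # torch.Size([8, 375])
```

A real implementation would additionally stop AR sampling on an end-of-sequence token rather than after a fixed number of frames; the fixed count is only used here to keep the sketch short.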
4.2.4. Inference: In-Context Learning via Prompting
VALL-E exhibits in-context learning capabilities in zero-shot TTS, meaning it can synthesize high-quality speech for unseen speakers without any fine-tuning or parameter updates. This is achieved through carefully designed prompting.
The inference process involves:
- Prompt Preparation:
  - The input text is converted into a phoneme sequence (the phoneme prompt).
  - The enrolled recording (e.g., a 3-second sample of the target speaker's voice) is encoded by the neural codec model into an acoustic matrix (the acoustic prompt).
- Code Generation: Both prompts are fed into the AR and NAR models.
  - AR Model Decoding: For the AR model (generating $\mathbf{c}_{:,1}$), sampling-based decoding is used. This is preferred over beam search because beam search can lead the language model into infinite loops, and sampling-based methods significantly increase the diversity of the output speech.
  - NAR Model Decoding: For the NAR model (generating $\mathbf{C}_{:,2:8}$), greedy decoding is used, where the token with the highest probability is chosen at each step.
- Waveform Synthesis: Finally, the full set of eight discrete code sequences generated by VALL-E is fed into the neural codec decoder (the EnCodec decoder) to synthesize the final waveform.

The paper describes two main inference settings:

- VALL-E (Main Zero-Shot TTS Setting):
  - Goal: To generate specific content (from a given text sentence) in the voice of an unseen speaker.
  - Phoneme Prompt: The phoneme transcription of the enrolled speech is prepended to the phoneme sequence of the target sentence.
  - Acoustic Prompt: The first-layer acoustic tokens of the enrolled speech ($\tilde{\mathbf{c}}_{:,1}$) are used as an acoustic prefix for the AR model. The full acoustic matrix $\tilde{\mathbf{C}}$ is used as an acoustic prompt for the NAR model.
  - Outcome: VALL-E generates acoustic tokens corresponding to the target text, cloning the speaker's voice from the prompt.
- VALL-E-continual (Speech Continuation Setting):
  - Goal: To generate a continuation of an utterance, maintaining the speaker's voice and prosody.
  - Phoneme Prompt: The full transcription of the utterance is used.
  - Acoustic Prompt: The first 3 seconds of the ground-truth utterance are used as the acoustic prompt.
  - Outcome: The model generates the continuation of the speech, semantically and acoustically continuous with the enrolled speech.
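A minimal sketch of how the two prompting modes differ in what they feed the models, using plain Python lists and invented helper names (`g2p`, `encode_codes`) purely for illustration.

```python
# Hypothetical helpers, named here only for illustration: a grapheme-to-phoneme
# converter and a codec tokenizer that returns 8 code sequences (one per quantizer).
def g2p(text):
    return [ord(c) % 50 for c in text.lower() if c.isalpha()]

def encode_codes(path):
    return [[(j * 7 + t) % 1024 for t in range(225)] for j in range(8)]  # fake 3 s at 75 Hz

enrolled_text = "hello there"
target_text = "please read this sentence"
enrolled_codes = encode_codes("enrolled_3s.wav")

# VALL-E (zero-shot): enrolled phonemes are prepended to the target phonemes,
# the enrolled first-layer codes are the AR prefix, and all 8 layers prompt the NAR model.
phoneme_prompt = g2p(enrolled_text) + g2p(target_text)
ar_prefix = enrolled_codes[0]
nar_acoustic_prompt = enrolled_codes

# VALL-E-continual: the utterance's full transcription is the phoneme prompt, and the
# first 3 seconds of the same (ground-truth) utterance serve as the acoustic prompt.
continual_phoneme_prompt = g2p(target_text)
continual_prefix = encode_codes("target_utterance_first_3s.wav")[0]

print(len(phoneme_prompt), len(ar_prefix), len(nar_acoustic_prompt))
```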
5. Experimental Setup
This section details the datasets, evaluation metrics, and model configurations used to train and evaluate VALL-E.
5.1. Datasets
5.1.1. Training Data
- LibriLight [Kahn et al., 2020]:
- Source & Scale: Consists of 60,000 hours of unlabelled English speech, primarily from audiobooks. This is hundreds of times larger than typical TTS datasets.
- Characteristics: Contains speech from over 7,000 unique speakers. The original data is audio-only.
- Transcription Generation: To obtain transcriptions, a hybrid DNN-HMM ASR model was trained on 960 hours of labeled LibriSpeech data (following the Kaldi recipe [Povey et al., 2011]). This ASR model then decoded the unlabeled LibriLight speech data to generate phoneme-level alignments with a 30 ms frameshift.
- Domain: Audiobook domain.
- Rationale: Chosen for its massive scale and its diversity in speakers and prosodies, despite containing more noisy speech and inaccurate transcriptions than clean studio-recorded data. The authors believe the proposed approach is robust to this noise because it leverages large data.
- Neural Codec Data: The EnCodec model [Défossez et al., 2022] was used to generate the acoustic code matrix for all 60,000 hours of LibriLight data.
5.1.2. Evaluation Data
- LibriSpeech [Panayotov et al., 2015]:
  - Source & Scale: The test-clean subset of LibriSpeech was used for zero-shot TTS evaluation.
  - Characteristics: Consists of samples with lengths between 4 and 10 seconds, totaling a 2.2-hour subset. Crucially, there is no speaker overlap between the LibriLight training data and the LibriSpeech test-clean speakers, ensuring a true zero-shot scenario.
  - Domain: Audiobook domain.
- VCTK [Veaux et al., 2016]:
  - Source & Scale: A dataset containing speech from 108 speakers.
  - Characteristics: None of the 108 speakers were observed during VALL-E's training. This dataset is known for containing speakers with various accents, posing a greater challenge for generalization than LibriSpeech.
  - Domain: Varied speech.
- EmoV-DB [Adigwe et al., 2018]:
  - Source & Scale: A dataset containing speech with five emotions.
  - Characteristics: Used for qualitative analysis of speaker emotion maintenance in zero-shot settings.
5.2. Evaluation Metrics
The paper employs both automatic and human evaluation metrics to assess VALL-E's performance.
5.2.1. Automatic Metrics
5.2.1.1. Word Error Rate (WER)
- Conceptual Definition: WER is a common metric for evaluating the performance of speech recognition systems, but in TTS it is used to assess the robustness of the generated speech. It quantifies how accurately the synthesized speech matches the target transcription, indicating whether the TTS system introduces errors such as deletions, insertions, or substitutions. A lower WER indicates higher synthesis robustness and better fidelity to the input text.
- Mathematical Formula: $ \mathrm{WER} = \frac{S + D + I}{N} $
- Symbol Explanation:
  - $S$: Number of substitutions (words in the reference that were replaced by a different word in the hypothesis).
  - $D$: Number of deletions (words in the reference that were omitted in the hypothesis).
  - $I$: Number of insertions (words in the hypothesis that were not present in the reference).
  - $N$: Total number of words in the reference transcription.
- Implementation: An ASR model (HuBERT-Large [Hsu et al., 2021] fine-tuned on LibriSpeech 960h, a CTC-based model without language model fusion) is used to transcribe the generated audio, and the WER is calculated against the original text transcriptions.
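For reference, a self-contained WER implementation looks like the sketch below (standard edit-distance dynamic programming; not the exact tooling used in the paper).

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Compute WER = (S + D + I) / N via Levenshtein distance over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits turning the first i reference words into the first j hypothesis words.
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = dp[i - 1][j] + 1
            insertion = dp[i][j - 1] + 1
            dp[i][j] = min(substitution, deletion, insertion)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))  # 2 errors / 6 words = 0.33
```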
5.2.1.2. Speaker Similarity Score (SPK)
- Conceptual Definition: This metric quantifies how similar the synthesized speech sounds to the voice of the enrolled speaker (provided in the acoustic prompt). It measures the success of speaker cloning in a zero-shot scenario. A higher SPK score indicates better preservation of the speaker identity.
- Implementation: The WavLM-TDNN model [Chen et al., 2022], a state-of-the-art speaker verification model (top-ranked at the VoxSRC Challenge 2021 and 2022), is used. It predicts a similarity score in the range of [-1, 1], where a larger value signifies higher similarity between the prompt (decompressed enrolled speech) and the synthesized speech.
5.2.2. Human Evaluation
Human listeners provide subjective ratings, which are often considered the most reliable indicators of speech quality.
5.2.2.1. Comparative Mean Option Score (CMOS)
- Conceptual Definition: CMOS is a mean opinion score (MOS) variant used to evaluate the naturalness or overall quality of speech by directly comparing two audio samples (e.g., the proposed system vs. a baseline or ground truth). Listeners are asked to rate one system relative to another.
- Scale: Ranges from -3 to 3, with intervals of 1.
  - -3: The new system is much worse than the baseline.
  - 0: The new system is similar to the baseline.
  - +3: The new system is much better than the baseline.
- Goal: Provides an indicator of speech naturalness and perceived quality.
- Implementation: 12 native English speakers were invited as CMOS contributors via crowdsourcing.
5.2.2.2. Similarity Mean Option Score (SMOS)
- Conceptual Definition: SMOS is an absolute mean opinion score specifically designed to measure how similar the synthesized speech is to the original speaker's voice from the acoustic prompt. Listeners rate the degree of speaker similarity.
- Scale: Ranges from 1 to 5, with 0.5-point increments.
  - 1: Not similar at all.
  - 5: Extremely similar.
- Goal: Directly quantifies speaker cloning effectiveness.
- Implementation: 6 native English speakers were invited as SMOS contributors via crowdsourcing.
5.3. Baselines
VALL-E was primarily compared against the following state-of-the-art zero-shot TTS system:
- YourTTS [Casanova et al., 2022b]:
  - Description: This is a state-of-the-art zero-shot multi-speaker TTS system. It is an end-to-end TTS model that leverages speaker embeddings for zero-shot voice conversion and TTS.
  - Training Data: YourTTS was trained on a combined dataset including VCTK [Veaux et al., 2016], LibriTTS [Zen et al., 2019], and TTS-Portuguese [Casanova et al., 2022a].
  - Representativeness: It is a strong baseline because it represents the speaker encoding-based approach for zero-shot TTS and has demonstrated good performance. The authors used its publicly released checkpoint for comparison.

Additionally, for the comparison of robustness (WER) with speech-to-speech LM-based generation, VALL-E's WER results were also compared with:

- GSLM [Lakhotia et al., 2021]:
  - Description: A speech-to-speech generative model that synthesizes speech based on HuBERT codes. It uses Tacotron2 and WaveGlow for waveform reconstruction.
- AudioLM* [Borsos et al., 2022]:
  - Description: A speech-to-speech language modeling approach to audio generation that uses audio codecs and semantic codes.
5.4. Model and Training Details
5.4.1. Model Architecture
- Both the Autoregressive (AR) model and the Non-Autoregressive (NAR) model in VALL-E share a similar Transformer architecture.
- Attention Heads: 16 attention heads.
- Embedding Dimension: 1024.
- Feed-Forward Layer Dimension: 4096.
- Dropout: 0.1.
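For intuition about the model size, a stand-in stack with those hyperparameters can be instantiated in PyTorch as below; this uses the generic `nn.TransformerEncoderLayer` with a causal mask as a rough proxy for the paper's decoder-only architecture, not the authors' actual implementation.

```python
import torch
import torch.nn as nn

d_model, n_heads, n_layers, d_ff, dropout = 1024, 16, 12, 4096, 0.1

layer = nn.TransformerEncoderLayer(
    d_model=d_model,
    nhead=n_heads,
    dim_feedforward=d_ff,
    dropout=dropout,
    batch_first=True,
)
stack = nn.TransformerEncoder(layer, num_layers=n_layers)

# A causal (lower-triangular) attention mask makes the stack behave like a decoder-only LM.
seq_len = 256
causal_mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)

tokens = torch.randn(2, seq_len, d_model)          # dummy embedded inputs
out = stack(tokens, mask=causal_mask)
print(out.shape, sum(p.numel() for p in stack.parameters()))  # roughly 150M parameters in this stack
```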
5.4.2. Training Configuration
- Waveform Length: The average waveform length in LibriLight is 60 seconds. During training, waveforms were randomly cropped to lengths between 10 and 20 seconds.
- Phoneme Prompt: The corresponding phoneme alignments of the cropped waveforms were used, with consecutive repetitions in the force-aligned phoneme sequence removed.
- NAR Acoustic Prompt: For the NAR model, a random 3-second segment of waveform from the same utterance as the target was selected to serve as the acoustic prompt.
- Hardware: Trained on 16 NVIDIA TESLA V100 32GB GPUs.
- Batch Size: 6,000 acoustic tokens per GPU.
- Training Steps: 800,000 steps.
- Optimizer: AdamW.
- Learning Rate Schedule: The learning rate was warmed up for the first 32,000 updates to a peak of $5 \times 10^{-4}$, and then linearly decayed.
6. Results & Analysis
This section presents and analyzes the experimental results obtained for VALL-E, comparing its performance against baselines and delving into qualitative observations.
6.1. Core Results Analysis
6.1.1. LibriSpeech Evaluation
The LibriSpeech test-clean dataset was used for zero-shot TTS evaluation, specifically a 2.2-hour subset with no speaker overlap with the LibriLight training data. For each synthesis, VALL-E used a randomly chosen 3-second speech segment from another utterance of the same speaker as the acoustic prompt. VALL-E-continual used the first 3 seconds of the ground-truth speech as the enrolled speech.
The following are the results from Table 2 of the original paper:
| model | WER | SPK |
|---|---|---|
| GroundTruth | 2.2 | 0.754 |
| Speech-to-Speech Systems | ||
| GSLM | 12.4 | 0.126 |
| AudioLM* | 6.0 | - |
| TTS Systems | ||
| YourTTS | 7.7 | 0.337 |
| VALL-E | 5.9 | 0.580 |
| VALL-E-continual | 3.8 | 0.508 |
Table 2: Evaluation results on audio generation. YourTTS and VALL-E are text-to-speech models using phonemes as inputs, while GSLM and AudioLM are speech-to-speech models using latent code as inputs. *The WER result of AudioLM is obtained by a Conformer Transducer model [Borsos et al., 2022]. Since AudioLM is not open-source, we cannot evaluate its speaker score with our tool.
Analysis of Automatic Metrics (Table 2):
- Robustness (WER): VALL-E demonstrates significantly better robustness than the YourTTS baseline (5.9 WER vs. 7.7 WER). This indicates that VALL-E generates speech that is more faithful to the input text, with fewer deletion, insertion, or replacement errors. The VALL-E-continual setting further reduces WER to 3.8, suggesting that having ground-truth acoustic tokens for the initial segment improves robustness.
- Speaker Similarity (SPK): VALL-E achieves a much higher speaker similarity score (0.580) than YourTTS (0.337). This highlights VALL-E's superior ability to clone the voice of unseen speakers from a short acoustic prompt.
- Comparison with Speech-to-Speech LMs: VALL-E also outperforms GSLM (12.4 WER, 0.126 SPK) and AudioLM (6.0 WER) in robustness and speaker similarity. The paper attributes VALL-E's better WER to its training with pseudo-phonemes, which provide better alignment quality with the input text than the HuBERT/w2v-BERT codes used by GSLM and AudioLM. GSLM's low SPK score is expected, as HuBERT codes discard speaker identity.

The following are the results from Table 3 of the original paper:
| | SMOS | CMOS (v.s. VALL-E) |
|---|---|---|
| YourTTS | 3.45±0.09 | -0.12 |
| VALL-E | 4.38±0.10 | 0.00 |
| GroundTruth | 4.5±0.10 | +0.17 |
Table 3: Human evaluation with 40 speakers on LibriSpeech test-clean with 3-second enrolled recording for each.
Analysis of Human Evaluation (Table 3):
- Speaker Similarity (SMOS): VALL-E achieves an SMOS of 4.38, which is very close to the ground truth (4.5) and significantly outperforms YourTTS (3.45). This strongly confirms its effectiveness in zero-shot speaker cloning.
- Speech Naturalness (CMOS): VALL-E beats YourTTS by a CMOS margin of 0.12 (YourTTS scores -0.12 relative to VALL-E). This indicates that VALL-E synthesizes more natural and realistic speech than the baseline. Although GroundTruth still has a slightly higher CMOS (+0.17 relative to VALL-E), the performance gap is small, suggesting high perceived naturalness for VALL-E's outputs.
6.1.2. VCTK Evaluation
The VCTK dataset (108 speakers) was used for further zero-shot evaluation. Notably, VALL-E saw none of these speakers during training, while YourTTS had seen 97 of them. Comparisons were made using 3s, 5s, and 10s prompts.
The following are the results from Table 6 of the original paper:
| 3s prompt | 5s prompt | 10s prompt | |
|---|---|---|---|
| 108 full speakers | |||
| YourTTS* | 0.357 | 0.377 | 0.394 |
| VALL-E | 0.382 | 0.423 | 0.484 |
| GroundTruth | 0.546 | 0.591 | 0.620 |
| 11 unseen speakers | |||
| YourTTS | 0.331 | 0.337 | 0.344 |
| VALL-E | 0.389 | 0.380 | 0.414 |
| GroundTruth | 0.528 | - | 0.586 |
Table 6: Automatic evaluation of speaker similarity with 108 speakers on VCTK. *YourTTS has observed 97 speakers during training, while VALL-E observed none of them.
Analysis of Automatic Speaker Similarity (SPK) (Table 6):
- VALL-E consistently outperforms YourTTS in speaker similarity, even though YourTTS had prior exposure to 97 of the 108 speakers (e.g., 0.382 vs. 0.357 for the 3s prompt with all speakers). This highlights VALL-E's strong generalization to truly unseen speakers.
- When compared on the strictly unseen 11 speakers (a fair zero-shot scenario), the performance gap between VALL-E and YourTTS widens, especially with shorter prompts (e.g., 0.389 vs. 0.331 for the 3s prompt).
- The SPK score of both models improves with longer prompts (from 3s to 10s), an intuitive finding since more information is available for speaker cloning.

The following are the results from Table 7 of the original paper:
| | SMOS | CMOS (v.s. VALL-E) |
|---|---|---|
| YourTTS* | 3.70±0.09 | -0.23 |
| VALL-E | 3.81±0.09 | 0.00 |
| GroundTruth | 4.29±0.09 | -0.04 |
Table 7: Human evaluation with 60 speakers on VCTK with 3-second enrolled recording for each.
Analysis of Human Evaluation (Table 7):
- Speaker Similarity (SMOS): VALL-E achieves an SMOS of 3.81, better than YourTTS (3.70), even though YourTTS had seen some of these speakers. This further confirms VALL-E's superior speaker similarity in zero-shot scenarios.
- Speech Naturalness (CMOS): VALL-E significantly outperforms YourTTS by a CMOS margin of 0.23. Remarkably, VALL-E achieves a CMOS of +0.04 over GroundTruth, indicating no statistically significant difference from human recordings on this dataset. The paper notes that VCTK is more challenging than LibriSpeech (diverse accents, noisier environments), which makes VALL-E's strong performance here notable.
6.2. Ablation Studies / Parameter Analysis
6.2.1. Ablation Study of the NAR Model
The authors conducted an ablation study on the Non-Autoregressive (NAR) model to understand the contribution of different prompts. Three NAR models were trained:
- NAR-no prompt: trained with no phoneme or acoustic prompts.
- NAR-phn prompt: trained with only the phoneme sequence as a prompt.
- NAR-2 prompts: trained with both the phoneme prompt and the acoustic token prompt as conditions.

For this study, the ground-truth first-level acoustic tokens were used as input to the NAR model to isolate its performance.
The following are the results from Table 4 of the original paper:
| NAR-no prompt | NAR-phn prompt | NAR-2 prompts | |
|---|---|---|---|
| WER | 19.6 | 3.0 | 2.8 |
| SPK | 0.518 | 0.541 | 0.732 |
Table 4: Ablation study of the NAR model. For this ablation, the inputs to the NAR models are the ground-truth first-level acoustic tokens.
Analysis of NAR Ablation (Table 4):
- NAR-no prompt: This model performs poorly on both WER (19.6) and SPK (0.518), even with ground-truth first-level acoustic tokens. This shows that the NAR model requires conditioning to generate meaningful and speaker-consistent speech.
- NAR-phn prompt: Adding the phoneme prompt dramatically reduces WER from 19.6 to 3.0. This demonstrates that the phoneme prompt is the primary contributor to ensuring the correct content of the generated speech. There is also a slight improvement in SPK (0.541).
- NAR-2 prompts: With both phoneme and acoustic token prompts, the WER slightly improves to 2.8, but critically, the SPK score jumps to 0.732. This confirms that the acoustic token prompt is essential for the NAR model to learn and maintain speaker information in the finer acoustic details it generates.
6.2.2. Ablation Study of the AR Model
This study examines the importance of the acoustic prompt for the Autoregressive (AR) model, using the NAR-2 prompts setting as the NAR model.
The following are the results from Table 5 of the original paper:
| WER | SPK | |
|---|---|---|
| VALL-E | 5.9 | 0.585 |
| w/o acoustic prompt | 5.9 | 0.236 |
Table 5: Ablation study of the AR model.
Analysis of AR Ablation (Table 5):
- w/o acoustic prompt: When the acoustic prompt is removed from the AR model (which generates the crucial first-level codes), the speaker similarity score (SPK) plummets from 0.585 to 0.236, while the WER remains unchanged.
- Conclusion: This demonstrates that the acoustic prompt provided to the AR model is crucial for establishing and maintaining speaker identity. Even though the NAR model later receives its own acoustic prompt, the initial speaker information captured by the AR model is paramount.
6.3. Qualitative Analysis
Beyond quantitative metrics, VALL-E exhibits several interesting qualitative properties.
6.3.1. Diversity
Traditional TTS systems often produce highly deterministic outputs, leading to a one-to-one mapping between input text and output waveform. This is because mel spectrogram generation typically relies on reconstruction without inherent randomness. VALL-E, by contrast, uses a sampling-based method to generate discrete tokens during inference, which introduces randomness and allows for diverse outputs from the same input text and target speaker.
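A minimal sketch of such sampling-based decoding is shown below; the paper states that sampling is used during inference, but the top-k truncation, temperature, and vocabulary size here are our own illustrative choices.

```python
import torch

def sample_token(logits, top_k=50, temperature=1.0, generator=None):
    """Draw one codec token from top-k truncated, temperature-scaled logits."""
    topk_vals, topk_idx = torch.topk(logits / temperature, k=top_k, dim=-1)
    probs = torch.softmax(topk_vals, dim=-1)
    choice = torch.multinomial(probs, num_samples=1, generator=generator)
    return topk_idx.gather(-1, choice).squeeze(-1)

# Identical logits, two different seeds: the sampled tokens generally differ,
# which is the source of the output diversity illustrated in Figure 4.
logits = torch.randn(1024)                       # one step's logits over the codebook
g1 = torch.Generator().manual_seed(1)
g2 = torch.Generator().manual_seed(2)
print(sample_token(logits, generator=g1), sample_token(logits, generator=g2))
```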
The figures below (Figures 4 and 5 from the original paper) illustrate this diversity.

Figure 4: Diversity analysis of VALL-E. Each utterance is synthesized two times with different random seeds. We can observe substantial diversity of the two outputs regarding the same input.

Figure 5: Waveforms of two speech segments. The upper waveform shows relatively small amplitude variation, while the lower one exhibits richer detail and amplitude changes, possibly reflecting different vocal characteristics or emotional expression.
Analysis of Diversity (Figure 4a and 4b):
- Figure 4a (LibriSpeech sample): Shows two syntheses of the same sentence. They exhibit different lengths and phrase durations, with the first sample having a faster speech rate.
- Figure 4b (VCTK sample): Shows two syntheses of a different sentence. They display variations in accent and emphasis; for example, the second output emphasizes the word "must" with a larger amplitude, while the first does not.
This diversity is valuable for downstream applications such as speech recognition, where varied inputs (different speakers, acoustic environments, prosodies) are beneficial for training robust models, a need that previous TTS systems struggled to meet.
6.3.2. Acoustic Environment Maintenance
An interesting finding is VALL-E's ability to maintain acoustic environment consistency between the acoustic prompt and the generated speech. If the acoustic prompt contains reverberation (e.g., recorded in a large room), VALL-E can synthesize speech that also includes reverberation. In contrast, baseline systems typically output clean speech regardless of the prompt's environment. The authors attribute this to VALL-E being trained on a large-scale dataset (LibriLight) that likely contains a wider variety of acoustic conditions than the cleaner data used by baselines, enabling it to learn and reproduce such environmental characteristics.
6.3.3. Speaker's Emotion Maintenance
VALL-E also demonstrates the ability to preserve the speaker's emotion present in the acoustic prompt in a zero-shot setting. By selecting acoustic prompts from an emotional speech dataset like EmoV-DB [Adigwe et al., 2018], VALL-E can synthesize speech that retains the same emotion (e.g., anger), even though the model was not explicitly fine-tuned on an emotional TTS dataset. This emergent capability suggests that the discrete audio codec codes and the large-scale training capture high-level prosodic and emotional features effectively.
6.4. Data Presentation (Tables)
The following are the results from Table 1 of the original paper:
| | Current Systems | VALL-E |
|---|---|---|
| Intermediate representation | mel spectrogram | audio codec code |
| Objective function | continuous signal regression | language model |
| Training data | ≤ 600 hours | 60K hours |
| In-context learning | ✗ | ✓ |
Table 1: A comparison between VALL-E and current cascaded TTS systems.
Analysis of Table 1: This table summarizes the fundamental innovations of VALL-E compared to traditional cascaded TTS systems. It highlights the shift from mel spectrograms to audio codec codes as intermediate representations, from continuous signal regression to language modeling as the objective, a massive increase in training data scale (60K hours vs. ≤ 600 hours), and the crucial emergence of in-context learning capabilities in VALL-E.
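To make the "objective function" row concrete, here is a schematic contrast, under our own illustrative shapes, between a continuous reconstruction loss on mel-spectrogram frames and a cross-entropy language-modeling loss over discrete codec codes.

```python
import torch
import torch.nn.functional as F

# Conventional pipeline: continuous regression onto mel-spectrogram frames.
pred_mel   = torch.randn(200, 80)          # [frames, mel bins], model output
target_mel = torch.randn(200, 80)
regression_loss = F.l1_loss(pred_mel, target_mel)   # e.g. an L1/L2 reconstruction loss

# VALL-E-style objective: next-token cross-entropy over discrete codec codes.
vocab = 1024                                # assumed codebook size per quantizer
logits       = torch.randn(600, vocab)      # [token positions, codebook entries]
target_codes = torch.randint(0, vocab, (600,))
lm_loss = F.cross_entropy(logits, target_codes)

print(regression_loss.item(), lm_loss.item())
```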
7. Conclusion & Reflections
7.1. Conclusion Summary
The paper successfully introduced VALL-E, a groundbreaking language model approach to Text-to-Speech (TTS) synthesis. By leveraging discrete audio codec codes as intermediate representations and scaling pre-training to an unprecedented 60,000 hours of speech data, VALL-E demonstrates powerful in-context learning capabilities for zero-shot TTS. This allows for high-quality personalized speech synthesis for unseen speakers using only a 3-second acoustic prompt, without requiring any fine-tuning or complex architectural modifications. VALL-E achieved new state-of-the-art zero-shot TTS results on both LibriSpeech and VCTK datasets, significantly outperforming existing systems in speech naturalness (CMOS) and speaker similarity (SMOS). Furthermore, the model exhibits emergent properties such as generating diverse outputs, preserving the acoustic environment of the prompt, and maintaining the speaker's emotion in synthesis.
7.2. Limitations & Future Work
The authors acknowledge several limitations and suggest future research directions:
7.2.1. Synthesis Robustness
- Limitation: VALL-E sometimes produces unclear, missed, or duplicated words. This robustness issue is primarily attributed to the autoregressive (AR) nature of the phoneme-to-acoustic language model, which can suffer from disordered attention alignments without explicit constraints, a phenomenon also observed in vanilla Transformer-based TTS models.
- Future Work: The authors plan to leverage techniques from other TTS research, such as applying non-autoregressive models or modifying the attention mechanism, to address these robustness issues.
7.2.2. Data Coverage
- Limitation: Despite using 60,000 hours of training data, the dataset still lacks comprehensive coverage of all voices, particularly accented speakers (evidenced by slightly worse results on VCTK compared to LibriSpeech). Additionally, the LibriLight dataset, being derived from audiobooks, predominantly features a reading style, limiting the diversity of speaking styles.
- Future Work: The authors propose to further scale up the training data to improve performance across the prosody, speaking style, and speaker similarity dimensions, believing that data scaling, together with model scaling, could almost entirely solve the zero-shot TTS task.
7.2.3. Model Structure
- Limitation: The current VALL-E architecture uses two separate models (an AR model for the first quantizer and a NAR model for the subsequent ones).
- Future Work:
  - Universal Model: A promising direction is to unify these into a single, large universal model that can predict codes across all quantizers.
  - Full NAR Models: Exploring a completely non-autoregressive framework could significantly speed up inference.
7.2.4. Broader Impacts
- Potential Risks: VALL-E's ability to synthesize speech that accurately maintains speaker identity poses potential risks of misuse, such as spoofing voice identification systems or impersonating specific speakers.
- Mitigation: The authors suggest building a detection model to discriminate between human and VALL-E-synthesized audio clips. They also commit to following the Microsoft AI Principles in future model development.
7.3. Personal Insights & Critique
7.3.1. Personal Insights
- Paradigm Shift: VALL-E represents a significant paradigm shift in TTS, effectively porting the success of large language models from text to speech. Treating TTS as a conditional language modeling problem over discrete audio tokens is elegant and powerful, enabling in-context learning that was previously elusive in TTS.
- Power of Discrete Representations: The choice of neural audio codec codes from EnCodec as the intermediate representation is a critical enabler. Unlike HuBERT or wav2vec codes, which focus on linguistic content and discard speaker identity, EnCodec codes retain rich acoustic information (speaker characteristics, emotion, and environmental nuances), which is crucial for high-fidelity, personalized synthesis.
- Data Scaling is Key: The paper strongly reinforces the idea, now prevalent in AI, that data scale (especially semi-supervised data) is a powerful force for generalization and emergent capabilities. The leap to 60,000 hours, even with noisy transcriptions, appears to be a major factor in VALL-E's success.
- Emergent Abilities: The observed abilities to maintain the acoustic environment and the speaker's emotion in zero-shot settings are remarkable. They suggest the model has learned a deep, disentangled understanding of various acoustic attributes beyond just content and speaker identity.
7.3.2. Critique
- Dependence on ASR for Training Data: The use of ASR-generated transcriptions for the 60K hours of LibriLight data is a practical solution but also a potential weak point. Inaccuracies in these pseudo-transcriptions could introduce noise and propagate errors. While the paper argues the model is robust to this, the extent to which it might limit ultimate fidelity or introduce subtle artifacts warrants further investigation.
- Robustness Issues (AR Model): The acknowledged synthesis robustness issues (missed or duplicated words) from the AR model indicate that, while the language modeling approach is powerful, the challenge of aligning phonemes to variable-length acoustic sequences persists. Directly applying Transformer decoders to speech generation may still require specialized attention mechanisms or non-autoregressive techniques to overcome alignment pitfalls, as FastSpeech did for mel spectrograms.
- Compute Resources: Training VALL-E on 16 V100 GPUs for 800,000 steps underscores the substantial computational resources required for such large-scale language models, limiting accessibility for smaller research groups and individual researchers.
- Black Box Nature: Like many large neural models, VALL-E is largely a black box. Understanding precisely how it extracts and disentangles speaker identity, emotion, and environment from acoustic prompts, and then reconstructs them, is complex. Further interpretability research could provide deeper insights.
- Ethical Implications: The broader impacts section is appropriately included. The potential for deepfake audio generation and impersonation is a serious concern. While a detection model is proposed, the arms race between generation and detection is well known. Robust ethical guidelines and safeguards are paramount.
7.3.3. Transferability and Future Directions
- Cross-Domain Application: The VALL-E approach could transfer to other audio generation tasks beyond TTS, such as music synthesis, sound event generation, or speech-to-speech translation, by adapting the input prompts and discrete code targets.
- Multilingual/Multimodal Extensions: Extending VALL-E to multilingual TTS or multimodal TTS (e.g., conditioning on visual cues such as the speaker's face) could unlock new capabilities.
- Improved Phonetic Representations: To reduce reliance on ASR-generated phonemes, future work could explore self-supervised learning of discrete phonetic representations that are more robust to noise and capture linguistic information directly from raw audio.
- Unified Generative Model: A single, universal language model for all quantizer stages, or a fully NAR architecture, promises further efficiency and potentially higher quality by optimizing the entire generation process end-to-end within one model.