
StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models

Published: 06/13/2023

TL;DR Summary

The paper presents StyleTTS 2, a TTS model that combines style diffusion with adversarial training against large speech language models, achieving human-level synthesis: it surpasses human recordings on the single-speaker LJSpeech dataset and matches them on the multi-speaker VCTK dataset.

Abstract

In this paper, we present StyleTTS 2, a text-to-speech (TTS) model that leverages style diffusion and adversarial training with large speech language models (SLMs) to achieve human-level TTS synthesis. StyleTTS 2 differs from its predecessor by modeling styles as a latent random variable through diffusion models to generate the most suitable style for the text without requiring reference speech, achieving efficient latent diffusion while benefiting from the diverse speech synthesis offered by diffusion models. Furthermore, we employ large pre-trained SLMs, such as WavLM, as discriminators with our novel differentiable duration modeling for end-to-end training, resulting in improved speech naturalness. StyleTTS 2 surpasses human recordings on the single-speaker LJSpeech dataset and matches it on the multispeaker VCTK dataset as judged by native English speakers. Moreover, when trained on the LibriTTS dataset, our model outperforms previous publicly available models for zero-shot speaker adaptation. This work achieves the first human-level TTS on both single and multispeaker datasets, showcasing the potential of style diffusion and adversarial training with large SLMs. The audio demos and source code are available at https://styletts2.github.io/.


In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models

1.2. Authors

The authors of this paper are Yinghao Aaron Li, Cong Han, Vinay S. Raghavan, Gavin Mischler, and Nima Mesgarani. They are all affiliated with Columbia University.

1.3. Journal/Conference

This paper was published as a preprint on arXiv (arXiv:2306.07691). As an arXiv preprint, it is publicly available but has not necessarily undergone formal peer review by a conference or journal at the time of its initial publication. However, arXiv is a highly influential platform for rapid dissemination of research in fields like artificial intelligence and machine learning.

1.4. Publication Year

2023

1.5. Abstract

The paper introduces StyleTTS 2, a text-to-speech (TTS) model designed to achieve human-level speech synthesis. It innovates upon its predecessor by modeling speech styles as a latent random variable, utilizing diffusion models to generate appropriate styles for text without requiring reference speech. This approach enables efficient latent diffusion while providing diverse speech synthesis capabilities. Furthermore, StyleTTS 2 incorporates large pre-trained speech language models (SLMs), specifically WavLM, as discriminators within a novel differentiable duration modeling framework for end-to-end (E2E) training, which significantly enhances speech naturalness. The model demonstrates superior performance, surpassing human recordings on the single-speaker LJSpeech dataset and matching them on the multi-speaker VCTK dataset, as judged by native English speakers. When trained on the LibriTTS dataset, StyleTTS 2 also outperforms previous publicly available models in zero-shot speaker adaptation. The authors claim this work achieves the first human-level TTS on both single and multi-speaker datasets, highlighting the efficacy of style diffusion and adversarial training with large SLMs.

Official Source: https://arxiv.org/abs/2306.07691
PDF Link: https://arxiv.org/pdf/2306.07691v2.pdf
Publication Status: Preprint on arXiv.

2. Executive Summary

2.1. Background & Motivation

The core problem the paper aims to solve is achieving robust, accessible, diverse, and expressive human-level text-to-speech (TTS) synthesis. While significant advancements have been made in recent years, traditional TTS systems still face several challenges:

  • Limited Expressiveness and Diversity: Many models struggle to generate speech with rich and varied prosody, emotions, and speaking styles, often producing monotonic or "robot-like" output.

  • Robustness for Out-of-Distribution (OOD) Texts: Models often degrade in quality when presented with texts different from their training data, limiting real-world applicability.

  • Data Requirements for Zero-Shot Adaptation: High-performing zero-shot TTS systems, which can synthesize speech in a new voice from a very short audio sample, typically require massive datasets for pre-training, making them computationally expensive and resource-intensive.

  • Efficiency of Diffusion Models: While diffusion models offer diverse speech sampling, their iterative nature can make them slower than non-iterative generative adversarial network (GAN)-based models.

  • Integration of SLMs: Speech language models (SLMs) like WavLM contain rich speech representations but are not directly optimized for speech synthesis. Integrating their knowledge effectively into generative models without complex two-stage training or latent space mapping remains a challenge.

    The paper's innovative entry point is to build upon the StyleTTS framework by introducing two key advancements:

  1. Style Diffusion: Modeling speech styles as a latent random variable through a diffusion model to generate suitable styles without requiring reference speech, addressing expressiveness and diversity challenges. This also makes the process more efficient by diffusing only a fixed-length style vector, not the entire speech.
  2. Adversarial Training with Large SLMs and Differentiable Duration Modeling: Leveraging the powerful representations of pre-trained SLMs as discriminators in an end-to-end (E2E) training setup, enabled by a novel differentiable duration modeling, to significantly enhance speech naturalness and robustness.

2.2. Main Contributions / Findings

The primary contributions of StyleTTS 2 are:

  • Novel Style Diffusion Mechanism: It introduces a method to model speech styles as a latent random variable, sampled efficiently via a diffusion model conditioned on the input text. This allows for diverse and expressive speech generation without needing a reference audio, overcoming a limitation of its predecessor, StyleTTS.

  • Integration of Large SLMs as Discriminators: StyleTTS 2 pioneers the use of large pre-trained speech language models (e.g., WavLM) directly as discriminators in an adversarial training framework. This transfers knowledge from powerful SLMs to speech generation tasks, enhancing speech naturalness without complex latent space mappings or two-stage training processes.

  • Differentiable Duration Modeling: The paper proposes a new non-parametric differentiable upsampling method for duration prediction. This innovation is crucial for enabling stable end-to-end adversarial training, especially when combined with SLM discriminators, and contributes to improved speech naturalness.

  • Achieving Human-Level TTS Performance:

    • StyleTTS 2 surpasses human recordings in naturalness on the single-speaker LJSpeech dataset (CMOS of +0.28, $p < 0.05$).
    • It matches human-level performance on the multi-speaker VCTK dataset (CMOS of −0.02, $p \gg 0.05$).
    • It outperforms previous publicly available models for zero-shot speaker adaptation on the LibriTTS dataset, achieving comparable naturalness to Vall-E with 250 times less training data.
  • Enhanced Robustness and Diversity: The model demonstrates strong generalization ability and robustness to out-of-distribution (OOD) texts, showing no degradation in quality for such inputs. It also generates highly diverse speech, as evidenced by higher coefficients of variation in duration and pitch.

    These findings solve the problems of limited expressiveness, robustness, and data inefficiency in human-level TTS synthesis, setting a new benchmark for the field.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand StyleTTS 2, a grasp of several foundational concepts in deep learning for speech processing is necessary:

  • Text-to-Speech (TTS): The technology that converts written text into spoken audio. The goal is to produce speech that is natural, intelligible, and expressive.
  • Mel-spectrogram: A time-frequency representation of audio that approximates the human auditory system's non-linear perception of pitch. It is commonly used as an intermediate representation in TTS models because it captures essential acoustic features while being more compact than raw waveforms.
  • Waveform: The raw audio signal, representing sound pressure variations over time. Ultimately, TTS models aim to generate high-fidelity waveforms.
  • Generative Adversarial Networks (GANs): A class of generative AI models consisting of two neural networks:
    • Generator ($G$): Tries to create realistic data (e.g., speech) from random noise or a conditioning input (e.g., text).
    • Discriminator ($D$): Tries to distinguish between real data and data produced by the generator. These two networks are trained adversarially: the generator tries to fool the discriminator, and the discriminator tries to correctly identify fake data. This competition drives the generator to produce increasingly realistic outputs.
  • Diffusion Models: A class of generative models that learn to reverse a gradual diffusion process.
    • Forward Diffusion Process: Gradually adds noise to data until it becomes pure noise.
    • Reverse Denoising Process: The model learns to iteratively denoise a noisy input back to a clean data sample. Diffusion models are known for their high-quality generative capabilities and ability to model complex data distributions, but can be slow due to their iterative nature.
  • Self-supervised Speech Language Models (SLMs): Neural networks pre-trained on vast amounts of unlabeled speech data to learn rich, context-aware representations of speech. Models like Wav2Vec 2.0, HuBERT, and WavLM learn to predict masked speech segments or distinguish true speech segments from distractors. These models capture various aspects of speech, from low-level acoustics to high-level semantics, which can be beneficial for downstream tasks.
  • End-to-End (E2E) Training: A training paradigm where all components of a complex system (e.g., a TTS pipeline from text to waveform) are jointly optimized as a single neural network. This often leads to better performance because intermediate representations are learned in a way that directly contributes to the final objective, avoiding compounding errors from separate stages.
  • Style-based Generative Models: Models that explicitly disentangle content and style information. A "style vector" captures various characteristics of speech like prosody, emotion, speaking rate, and speaker identity, allowing for flexible control and transfer of these attributes.
  • Adaptive Instance Normalization (AdaIN): A neural network layer that normalizes features based on statistics (mean and variance) derived from a style vector. This allows a model to dynamically adjust the style of its output based on the input style vector.
  • Transformers: A neural network architecture that relies heavily on the self-attention mechanism to weigh the importance of different parts of the input sequence. Transformers have become dominant in natural language processing (NLP) and are increasingly used in speech processing.
    • Self-Attention: A mechanism that allows a model to weigh the importance of different parts of the input sequence when processing each element. Given a query ($Q$), key ($K$), and value ($V$) derived from an input sequence, the attention mechanism calculates a weighted sum of the value vectors. The standard attention formula is: $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $ Where:
      • $Q$ (Query), $K$ (Key), and $V$ (Value) are matrices derived from the input embeddings.
      • $QK^T$ calculates similarity scores between queries and keys.
      • $\sqrt{d_k}$ is a scaling factor, where $d_k$ is the dimension of the key vectors, used to prevent very large dot products from pushing the softmax function into regions with extremely small gradients.
      • $\mathrm{softmax}$ normalizes the scores into weights.
      • The result is a weighted sum of the $V$ vectors, emphasizing relevant parts of the input (a minimal code sketch of this computation follows this list).
  • BERT (Bidirectional Encoder Representations from Transformers): A powerful pre-trained transformer-based language model that learns contextual representations of words by considering their bidirectional context. Phoneme-level BERT adapts this idea to sequences of phonemes.
  • Subjective Evaluation Metrics:
    • Mean Opinion Score (MOS): A widely used metric where human listeners rate speech quality on a scale (e.g., 1 to 5). Higher scores indicate better quality.
    • Comparative MOS (CMOS): Listeners compare two speech samples (A and B) and rate B relative to A on a scale (e.g., -6 to +6). This helps to identify more subtle differences than absolute MOS.
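
To make the attention formula above concrete, here is a minimal PyTorch sketch of single-head scaled dot-product self-attention (projections, masking, and multi-head splitting are omitted, and the function name is only illustrative):

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    """Standard attention: softmax(Q K^T / sqrt(d_k)) V.

    Q, K: (seq_len, d_k); V: (seq_len, d_v). A single-head sketch only."""
    d_k = K.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # (seq_len, seq_len) similarity scores
    weights = F.softmax(scores, dim=-1)             # attention weights per query
    return weights @ V                              # weighted sum of value vectors

# Example: self-attention over 4 tokens with 8-dimensional embeddings (Q = K = V = x).
x = torch.randn(4, 8)
out = scaled_dot_product_attention(x, x, x)
```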

3.2. Previous Works

The paper contextualizes StyleTTS 2 by referencing several key prior studies:

  • StyleTTS [6]: The direct predecessor to StyleTTS 2. It was a non-autoregressive TTS framework that used a style encoder to derive a style vector from reference audio, enabling natural and expressive speech. Its limitations included a two-stage training process, reliance on an external vocoder (which could degrade quality), deterministic generation leading to limited expressiveness, and the need for reference speech during inference. StyleTTS 2 directly addresses these drawbacks.
  • Diffusion Models for Speech Synthesis [15, 16, 17, 18, 19, 20]: Prior works applied diffusion models to various speech generation tasks, including mel-based TTS (GradTTS, Diff-TTS, ProDiff) and vocoders (WaveGrad, DiffWave). While offering diverse sampling, they were often criticized for efficiency limitations due to iterative sampling of mel-spectrograms or waveforms proportional to speech duration, making them slower than GAN-based methods. StyleTTS 2 tackles this by diffusing only a fixed-length style vector, not the entire speech, making it more efficient.
  • Text-to-Speech with Large Speech Language Models (SLMs) [8, 29, 34, 35, 36, 37, 38, 39]: Recent advancements demonstrated the effectiveness of large self-supervised SLMs (like Wav2Vec 2.0, HuBERT, WavLM) in enhancing TTS quality and speaker adaptation (Vall-E, NaturalSpeech 2). These approaches typically involve converting text into continuous or quantized representations derived from SLMs for speech reconstruction. However, SLM features are not directly optimized for speech synthesis, and methods like neural codecs (Vall-E, NaturalSpeech 2) often require complex two-stage training processes. StyleTTS 2 differentiates itself by leveraging SLM knowledge directly via adversarial training with SLM features, without latent space mapping or two-stage tuning, thereby learning a latent space optimized for speech synthesis in an E2E manner.
  • Human-Level Text-to-Speech [3, 4, 5]: Several models have aimed for human-level TTS. VITS [3] achieved MOS comparable to human recordings on LJSpeech and VCTK. PnG-BERT [4] reported human-level results on proprietary datasets. NaturalSpeech [5] achieved MOS and CMOS statistically indistinguishable from human recordings on LJSpeech. StyleTTS 2 builds on these efforts, claiming to surpass even these state-of-the-art models and setting a new standard. These models often utilized techniques like BERT pre-training [4] and E2E training with differentiable duration modeling [41, 42], which StyleTTS 2 also incorporates and improves upon.

3.3. Technological Evolution

The field of TTS has evolved significantly:

  1. Concatenative TTS: Early systems pieced together pre-recorded speech units (phonemes, diphones, words). Produced highly natural speech for common phrases but lacked flexibility and expressiveness.

  2. Parametric TTS: Models that synthesize speech from linguistic features using statistical models (e.g., HMM-based statistical parametric synthesis). More flexible than concatenative systems, but the output often sounded artificial.

  3. Neural TTS (Mel-spectrogram based): Deep learning models (e.g., Tacotron, Transformer TTS, FastSpeech) learned to map text to mel-spectrograms, which were then converted to waveforms by a separate vocoder. This significantly improved naturalness and expressiveness.

  4. Neural Vocoders: Advancements in waveform generation models (e.g., WaveNet, WaveGlow, HifiGAN) allowed for high-fidelity conversion from mel-spectrograms to raw audio, closing the quality gap.

  5. End-to-End Neural TTS: Models that directly map text to waveforms, integrating the acoustic model and vocoder into a single trainable system (e.g., VITS). This simplifies the pipeline and often leads to better overall quality.

  6. GANs and Diffusion Models for TTS: Applying powerful generative models like GANs (HifiGAN, VITS) and diffusion models (GradTTS, ProDiff) to either vocoding or end-to-end TTS, pushing the boundaries of speech quality and diversity.

  7. Self-supervised SLMs in TTS: Leveraging pre-trained large SLMs to extract rich speech representations, which can then guide or inform TTS generation, particularly for zero-shot speaker adaptation and improved naturalness (Vall-E, NaturalSpeech 2).

    StyleTTS 2 fits within this timeline by combining the latest advancements: it builds on an existing style-based GAN (StyleTTS), integrates efficient diffusion models for style generation, employs large pre-trained SLMs for adversarial learning, and operates in an end-to-end fashion with novel differentiable duration modeling, pushing towards human-level performance.

3.4. Differentiation Analysis

Compared to the main methods in related work, StyleTTS 2 introduces several core differences and innovations:

  • From StyleTTS (Predecessor):
    • Style Generation: Original StyleTTS required a reference audio to extract a style vector, making it less suitable for real-time applications without explicit style control. StyleTTS 2 removes this reliance by modeling styles as a latent random variable sampled through a diffusion model conditioned on text, allowing reference-free style generation and greater expressiveness.
    • Training Process: StyleTTS used a two-stage training process and a separate vocoder. StyleTTS 2 adopts a fully end-to-end (E2E) training process that jointly optimizes all components for direct waveform synthesis, eliminating potential quality degradation from a fixed vocoder and simplifying the pipeline.
    • Discriminators: StyleTTS used a mel-discriminator. StyleTTS 2 replaces this with multi-period and multi-resolution discriminators for waveform and incorporates large SLMs as discriminators for naturalness.
  • From other Diffusion TTS Models (e.g., GradTTS, ProDiff, FastDiff):
    • Efficiency: Most diffusion TTS models iteratively sample mel-spectrograms or waveforms, which is computationally expensive and slows inference. StyleTTS 2 employs "style diffusion", in which only a fixed-length style vector is sampled by the diffusion model. The actual speech synthesis is then handled by a fast GAN-based decoder, combining the diversity benefits of diffusion with the efficiency of GANs. This makes StyleTTS 2 faster than many diffusion-based TTS models.
  • From TTS with SLMs (e.g., Vall-E, NaturalSpeech 2):
    • SLM Integration: Many SLM-based TTS models use SLM features as latent representations or neural codecs, often requiring complex two-stage training processes. StyleTTS 2 uniquely leverages large pre-trained SLMs (like WavLM) directly as powerful discriminators in an adversarial training setup. This transfers knowledge from SLMs to guide the generative process towards more human-like speech, without the need for latent space mapping or fine-tuning the SLMs themselves, thus learning a latent space directly optimized for speech synthesis.
    • Data Efficiency for Zero-Shot: While Vall-E achieves impressive zero-shot adaptation, it requires an enormous amount of training data (60k hours). StyleTTS 2 achieves comparable naturalness and good similarity with significantly less data (245 hours), showcasing remarkable data efficiency.
  • From other Human-Level TTS Models (e.g., NaturalSpeech, VITS):
    • Superior Performance: StyleTTS 2 quantitatively surpasses NaturalSpeech on LJSpeech and VITS on VCTK in subjective evaluations (CMOS), setting a new benchmark for human-level quality.

    • Robustness to OOD: StyleTTS 2 demonstrates superior robustness to out-of-distribution texts compared to previous models, maintaining high quality where others degrade.

    • Differentiable Duration Modeling: It introduces a novel non-parametric differentiable upsampler that improves stability during adversarial training, addressing limitations of prior differentiable duration methods in E2E settings.

      In essence, StyleTTS 2 uniquely combines the strengths of style-based generation, efficient diffusion models, powerful SLM representations, and stable end-to-end adversarial training to achieve a new level of performance in TTS synthesis.

4. Methodology

4.1. Principles

The core idea behind StyleTTS 2 is to achieve human-level text-to-speech synthesis by enhancing the original StyleTTS framework with three main innovations:

  1. Style Diffusion: Instead of requiring a reference audio to determine the speech style, StyleTTS 2 models speech styles as a latent random variable. This variable is sampled by a probabilistic diffusion model (specifically, EDM), conditioned only on the input text. This allows the model to generate diverse and expressive speech without external style references, while the diffusion process is applied to a compact style vector, ensuring efficiency compared to diffusing entire mel-spectrograms or waveforms.

  2. Adversarial Training with Large Speech Language Models (SLMs): The model leverages the rich representations learned by large pre-trained SLMs (like WavLM) by employing them as powerful discriminators within a Generative Adversarial Network (GAN) setup. This allows the generator to learn to produce speech that is not only perceptually high-quality but also semantically and acoustically consistent with human speech, as judged by the SLM. The SLM discriminator helps transfer deep linguistic and acoustic knowledge from the pre-trained SLM to the speech generation task.

  3. Differentiable Duration Modeling for End-to-End Training: To enable stable end-to-end (E2E) adversarial training—where all components from text input to waveform output are jointly optimized—StyleTTS 2 introduces a novel, non-parametric differentiable duration upsampler. This mechanism allows gradients to flow through the duration prediction module, which is crucial for optimizing the entire system, especially with the SLM discriminator, without introducing instabilities or high computational costs seen in previous differentiable upsampling methods.

    By integrating these principles, StyleTTS 2 aims to synthesize highly natural, expressive, and robust speech efficiently, establishing a new benchmark for human-level TTS.

4.2. Core Methodology In-depth (Layer by Layer)

StyleTTS 2 builds upon the original StyleTTS architecture. First, we will review the original StyleTTS and its limitations, then detail the enhancements introduced in StyleTTS 2.

4.2.1. StyleTTS Overview

The original StyleTTS [6] is a non-autoregressive TTS framework. It uses a style encoder to derive a style vector from reference audio, enabling the generation of natural and expressive speech. This style vector is then integrated into the decoder, duration predictor, and prosody predictor using adaptive instance normalization (AdaIN) [44], allowing control over speech duration, prosody, and emotion.

The model consists of eight modules organized into three categories:

  1. Speech Generation System (Acoustic Modules): Text Encoder, Style Encoder, and Speech Decoder.

  2. TTS Prediction System: Duration Predictor and Prosody Predictor.

  3. Utility System for Training: Discriminator, Text Aligner, and Pitch Extractor.

    StyleTTS employs a two-stage training process:

  • Stage 1: Acoustic Modules Training

    • The text encoder ($T$) encodes input phonemes $\pmb{t}$ into phoneme representations $h_{\mathrm{text}} = T(\pmb{t})$.
    • The text aligner ($A$) extracts the speech–phoneme alignment $\pmb{a}_{\mathrm{align}} = A(\pmb{x}, \pmb{t})$ from input speech $\pmb{x}$ and phonemes $\pmb{t}$. This alignment is used to produce aligned phoneme representations $h_{\mathrm{align}} = h_{\mathrm{text}} \cdot \pmb{a}_{\mathrm{align}}$ via a dot product.
    • The style encoder ($E$) obtains the style vector $\pmb{s} = E(\pmb{x})$.
    • The pitch extractor ($F$) extracts the pitch curve $p_{\pmb{x}} = F(\pmb{x})$, and the energy curve is $n_{\pmb{x}} = \| \pmb{x} \|$.
    • Finally, the speech decoder ($G$) reconstructs the mel-spectrogram $\hat{\pmb{x}} = G(h_{\mathrm{align}}, \pmb{s}, p_{\pmb{x}}, n_{\pmb{x}})$.
    • The decoder is trained to match the input mel-spectrogram $\pmb{x}$ using an $L_1$ reconstruction loss $\mathcal{L}_{\mathrm{mel}}$ and adversarial objectives $\mathcal{L}_{\mathrm{adv}}, \mathcal{L}_{\mathrm{fm}}$ with a discriminator $D$. Transferable monotonic aligner (TMA) objectives are also applied to learn optimal alignments.
    • The mel-spectrogram reconstruction loss $\mathcal{L}_{\mathrm{mel}}$ is defined as: $ \mathcal{L}_{\mathrm{mel}} = \mathbb{E}_{\pmb{x}, \pmb{t}} \left[ \| \pmb{x} - M(G(h_{\mathrm{text}} \cdot \pmb{a}_{\mathrm{align}}, \pmb{s}_a, p_{\pmb{x}}, n_{\pmb{x}})) \|_1 \right] $ Where:
      • $\mathbb{E}$ denotes the expectation over training samples.
      • $\pmb{x}$ is the ground truth mel-spectrogram.
      • $\pmb{t}$ is the input phoneme sequence.
      • $M(\cdot)$ represents the mel-spectrogram transformation.
      • $G(\cdot)$ is the speech decoder.
      • $h_{\mathrm{text}} = T(\pmb{t})$ are the encoded phoneme representations from the text encoder $T$.
      • $\pmb{a}_{\mathrm{align}} = A(\pmb{x}, \pmb{t})$ is the attention alignment from the text aligner $A$.
      • $\pmb{s}_a = E_a(\pmb{x})$ is the acoustic style vector from the acoustic style encoder $E_a$.
      • $p_{\pmb{x}}$ is the pitch (F0) curve extracted by the pitch extractor $F$.
      • $n_{\pmb{x}} = \| \pmb{x} \|$ is the energy curve extracted by $F$.
      • $\| \cdot \|_1$ denotes the $L_1$ norm (mean absolute error).
    • TMA objectives include a sequence-to-sequence (ASR) loss $\mathcal{L}_{\mathrm{s2s}}$ and a monotonic loss $\mathcal{L}_{\mathrm{mono}}$.
      • $\mathcal{L}_{\mathrm{s2s}}$ for fine-tuning the pre-trained text aligner: $ \mathcal{L}_{\mathrm{s2s}} = \mathbb{E}_{\pmb{x}, \pmb{t}} \left[ \sum_{i=1}^N \mathrm{CE}(t_i, \hat{t}_i) \right] $ Where:
        • $N$ is the number of phonemes in $\pmb{t}$.
        • $t_i$ is the $i$-th phoneme token of $\pmb{t}$.
        • $\hat{t}_i$ is the $i$-th predicted phoneme token (from the ASR head of the aligner).
        • $\mathrm{CE}(\cdot)$ denotes the cross-entropy loss function.
      • $\mathcal{L}_{\mathrm{mono}}$ to ensure the soft attention approximates a monotonic alignment: $ \mathcal{L}_{\mathrm{mono}} = \mathbb{E}_{\pmb{x}, \pmb{t}} \left[ \| \pmb{a}_{\mathrm{align}} - \pmb{a}_{\mathrm{hard}} \|_1 \right] $ Where $\pmb{a}_{\mathrm{hard}}$ is the monotonic version of $\pmb{a}_{\mathrm{align}}$ obtained via dynamic programming.
    • Adversarial objectives (LSGAN loss and feature-matching loss) are used to enhance reconstructed speech quality; in StyleTTS 2 they are generalized from mel-spectrograms to waveforms (a minimal code sketch of these objectives follows this list). For the generator: $ \mathcal{L}_{\mathrm{adv}}(G; D) = \mathbb{E}_{\pmb{t}, \pmb{x}} \left[ (D(G(h_{\mathrm{text}} \cdot \pmb{a}_{\mathrm{align}}, \pmb{s}_a, p_{\pmb{x}}, n_{\pmb{x}})) - 1)^2 \right] $ For the discriminator: $ \mathcal{L}_{\mathrm{adv}}(D; G) = \mathbb{E}_{\pmb{t}, \pmb{x}} \left[ (D(G(h_{\mathrm{text}} \cdot \pmb{a}_{\mathrm{align}}, \pmb{s}_a, p_{\pmb{x}}, n_{\pmb{x}})))^2 \right] + \mathbb{E}_{\pmb{y}} \left[ (D(\pmb{y}) - 1)^2 \right] $ Where:
      • $D(\cdot)$ is the discriminator's output.
      • $\pmb{y}$ denotes real speech samples. The feature-matching loss $\mathcal{L}_{\mathrm{fm}}$ is: $ \mathcal{L}_{\mathrm{fm}} = \mathbb{E}_{\pmb{y}, \pmb{t}, \pmb{x}} \left[ \sum_{l=1}^\Lambda \frac{1}{N_l} \left\| D^l(\pmb{y}) - D^l(G(h_{\mathrm{text}} \cdot \pmb{a}_{\mathrm{align}}, \pmb{s}_a, p_{\pmb{x}}, n_{\pmb{x}})) \right\|_1 \right] $ Where:
      • $\Lambda$ is the total number of layers in discriminator $D$.
      • $D^l$ denotes the output feature map of the $l$-th layer.
      • $N_l$ is the number of features in layer $l$.
  • Stage 2: TTS Prediction Modules Training

    • All components except the discriminator $D$ are fixed. Only the duration predictor ($S$) and prosody predictor ($P$) are trained.
    • The duration predictor $S$ predicts phoneme durations as $d_{\mathrm{pred}} = S(h_{\mathrm{text}}, \pmb{s})$. This is trained to match the ground truth duration $d_{\mathrm{gt}}$ (derived from $\pmb{a}_{\mathrm{align}}$) using an $L_1$ loss $\mathcal{L}_{\mathrm{dur}}$: $ \mathcal{L}_{\mathrm{dur}} = \mathbb{E}_{d_{\mathrm{gt}}} \left[ \| d_{\mathrm{gt}} - d_{\mathrm{pred}} \|_1 \right] $
    • The prosody predictor $P$ predicts pitch and energy as $\hat{p}_{\pmb{x}}, \hat{n}_{\pmb{x}} = P(h_{\mathrm{text}}, \pmb{s})$. These are trained to match the ground truth pitch $p_{\pmb{x}}$ and energy $n_{\pmb{x}}$ (from $F$) with $L_1$ losses $\mathcal{L}_{f0}$ and $\mathcal{L}_n$: $ \mathcal{L}_{f0} = \mathbb{E}_{\pmb{x}} \left[ \| p_{\pmb{x}} - \hat{p}_{\pmb{x}} \|_1 \right] $ $ \mathcal{L}_{n} = \mathbb{E}_{\pmb{x}} \left[ \| n_{\pmb{x}} - \hat{n}_{\pmb{x}} \|_1 \right] $
    • During inference, $d_{\mathrm{pred}}$ is used to upsample $h_{\mathrm{text}}$ into the predicted alignment $\pmb{a}_{\mathrm{pred}}$. The mel-spectrogram is synthesized as $\pmb{x}_{\mathrm{pred}} = G(h_{\mathrm{text}} \cdot \pmb{a}_{\mathrm{pred}}, E(\tilde{\pmb{x}}), \hat{p}_{\tilde{\pmb{x}}}, \hat{n}_{\tilde{\pmb{x}}})$, where $\tilde{\pmb{x}}$ is an arbitrary reference audio that provides the style. This mel-spectrogram is then converted to a waveform by a pre-trained vocoder.
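
As a concrete reference for the adversarial and feature-matching objectives above, here is a minimal PyTorch sketch. It assumes the training loop supplies discriminator scores and per-layer feature maps for real and generated speech; the function names are illustrative, not from the released code:

```python
import torch

def lsgan_generator_loss(d_fake):
    """LSGAN generator objective: push D(G(...)) toward 1."""
    return ((d_fake - 1) ** 2).mean()

def lsgan_discriminator_loss(d_fake, d_real):
    """LSGAN discriminator objective: D(real) -> 1, D(fake) -> 0."""
    return (d_fake ** 2).mean() + ((d_real - 1) ** 2).mean()

def feature_matching_loss(feats_real, feats_fake):
    """L1 distance between discriminator feature maps of real and generated speech,
    averaged layer by layer (the sum over l of ||D^l(y) - D^l(G(...))||_1 / N_l)."""
    loss = 0.0
    for fr, ff in zip(feats_real, feats_fake):
        loss = loss + torch.mean(torch.abs(fr.detach() - ff))
    return loss
```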

Drawbacks of StyleTTS:

  • Two-stage training and vocoding: An additional vocoding stage degrades sample quality.
  • Limited expressiveness: Deterministic generation limits speech diversity.
  • Reliance on reference speech: Requires reference audio for style, hindering real-time applications.

4.2.2. StyleTTS 2 Enhancements

StyleTTS 2 addresses the drawbacks of its predecessor by introducing:

  • An end-to-end (E2E) training process.

  • Direct waveform synthesis with adversarial training.

  • Large speech language models (SLMs) as discriminators.

  • Differentiable duration modeling.

  • Style diffusion for reference-free, diverse style generation.

    An overview of the training and inference scheme is provided in Figure 1.

The following figure (Figure 1 from the original paper) illustrates the training and inference scheme of StyleTTS 2:

Figure 1(a) schematic: data flow between the StyleTTS 2 modules, including the acoustic style encoder, text encoder, and style diffusion denoiser; the noise-level distribution $\ln\sigma \sim \mathcal{N}(P_{\mathrm{mean}}, P_{\mathrm{std}}^2)$ also appears in the diagram.

(a) Acoustic modules pre-training and joint training.

Figure 1(b) schematic: the StyleTTS 2 architecture with the acoustic text encoder, prosodic text encoder, duration predictor, and style diffusion sampler; WavLM serves as a discriminator that judges whether the synthesized waveform is real, illustrating the end-to-end combination of style diffusion and SLM adversarial training.

(b) SLM adversarial training and inference.

4.2.2.1. End-to-End Training

To achieve E2E training, StyleTTS 2 modifies the decoder GG to directly generate the waveform from the style vector, aligned phoneme representations, and pitch and energy curves.

  • The last projection layer for mel-spectrograms in the decoder is removed.
  • A waveform decoder is appended, which can be HifiGAN-based [30] or iSTFTNet-based [45]. The iSTFTNet-based decoder generates magnitude and phase, converted to waveforms via inverse short-time Fourier transform for faster training/inference.
  • The snake activation function [46], which is effective for waveform generation, is used (a minimal sketch follows this list).
  • An AdaIN module [44] is added after each activation function to model style dependence.
  • The mel-discriminator from [6] is replaced with Multi-Period Discriminator (MPD) [30] and Multi-Resolution Discriminator (MRD) [47], along with LSGAN loss functions [48] and a truncated pointwise relativistic loss function [49] to enhance sound quality.
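
For reference, below is a minimal PyTorch sketch of the snake activation, snake(x) = x + sin²(αx)/α, with a learnable per-channel α; the exact parameterization in the released StyleTTS 2 code may differ:

```python
import torch
import torch.nn as nn

class Snake(nn.Module):
    """Periodic snake activation with a learnable per-channel alpha (a sketch)."""
    def __init__(self, channels: int, alpha: float = 1.0):
        super().__init__()
        # One alpha per channel, broadcast over batch and time.
        self.alpha = nn.Parameter(alpha * torch.ones(1, channels, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time)
        return x + torch.sin(self.alpha * x) ** 2 / (self.alpha + 1e-9)
```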

Training Process (Modified):

  1. Acoustic Modules Pre-training: Similar to StyleTTS stage 1, but for waveform generation. This accelerates training but is not strictly necessary. It optimizes $\mathcal{L}_{\mathrm{mel}}$, $\mathcal{L}_{\mathrm{adv}}$, $\mathcal{L}_{\mathrm{fm}}$, and the TMA objectives for $N$ epochs.
  2. Joint Training: After pre-training, all components are jointly optimized using $\mathcal{L}_{\mathrm{mel}}$, $\mathcal{L}_{\mathrm{adv}}$, $\mathcal{L}_{\mathrm{fm}}$, $\mathcal{L}_{\mathrm{dur}}$, $\mathcal{L}_{f0}$, and $\mathcal{L}_{n}$.
    • $\mathcal{L}_{\mathrm{mel}}$ is modified to match the mel-spectrograms of waveforms reconstructed from the predicted pitch $\hat{p}_{\pmb{x}}$ and energy $\hat{n}_{\pmb{x}}$ (instead of the ground truth pitch and energy): $ \mathcal{L}_{\mathrm{mel}} = \mathbb{E}_{\pmb{x}, \pmb{t}} \left[ \| \pmb{x} - M(G(h_{\mathrm{text}} \cdot \pmb{a}_{\mathrm{align}}, \pmb{s}_a, \hat{p}_{\pmb{x}}, \hat{n}_{\pmb{x}})) \|_1 \right] $ Where $\hat{p}_{\pmb{x}}$ and $\hat{n}_{\pmb{x}}$ are the pitch and energy predicted by the prosody predictor. Similar changes are made to $\mathcal{L}_{\mathrm{adv}}$, $\mathcal{L}_{\mathrm{fm}}$, and $\mathcal{L}_{\mathrm{rel}}$.
    • Prosodic and Acoustic Style Encoders: To address stability issues from diverging gradients when a single style encoder encodes both acoustic and prosodic information, two separate style encoders are introduced:
      • Acoustic Style Encoder ($E_a$): The original style encoder, producing $\pmb{s}_a = E_a(\pmb{x})$.
      • Prosodic Style Encoder ($E_p$): A new encoder producing $\pmb{s}_p = E_p(\pmb{x})$. The style diffusion model then generates an augmented style vector $\pmb{s} = [\pmb{s}_p, \pmb{s}_a]$. The predictors ($S$ and $P$) take $\pmb{s}_p$ as input.
    • Prosodic Text Encoder: The phoneme representations $h_{\mathrm{text}}$ from the acoustic text encoder $T$ are replaced with $h_{\mathrm{bert}}$ from a prosodic text encoder ($B$) based on BERT transformers (a phoneme-level BERT [7]). This enhances naturalness.
    • SLM Adversarial Training: Differentiable upsampling and fast style diffusion enable fully differentiable speech sample generation during training, which is used to optimize $\mathcal{L}_{slm}$ (Eq. 5) and update the parameters of all components.

4.2.2.2. Style Diffusion

In StyleTTS 2, speech $\pmb{x}$ is modeled as a conditional distribution $p(\pmb{x}|\pmb{t}) = \int p(\pmb{x}|\pmb{t}, \pmb{s})\, p(\pmb{s}|\pmb{t})\, d\pmb{s}$ through a latent variable $\pmb{s}$. This $\pmb{s}$ is the generalized speech style, representing characteristics beyond the phonetic content.

  • Sampling: $\pmb{s}$ is sampled by EDM [50], following the combined probability flow [51] and time-varying Langevin dynamics [52]. The general equation for sampling $\pmb{s}$ is: $ \pmb{s} = \int -\sigma(\tau) [\beta(\tau)\sigma(\tau) + \dot{\sigma}(\tau)] \nabla_s \log p_\tau(\pmb{s}|\pmb{t}) d\tau + \int \sqrt{2\beta(\tau)} \sigma(\tau) d\tilde{W}_\tau $ Where:

    • $\sigma(\tau)$ is the noise level schedule, determining the amount of noise at time $\tau$.
    • $\dot{\sigma}(\tau)$ is its time derivative.
    • $\beta(\tau)$ is the stochasticity term, controlling the randomness.
    • $\tilde{W}_\tau$ is the backward Wiener process for $\tau \in [T, 0]$, representing the stochastic component.
    • $\nabla_s \log p_\tau(\pmb{s}|\pmb{t})$ is the score function at time $\tau$, which estimates the gradient of the log-probability density of the style vector.
  • EDM Formulation: The paper follows the EDM [50] formulation, with the denoiser $K(\pmb{s}; \pmb{t}, \sigma)$ preconditioned as: $ K(\pmb{s}; \pmb{t}, \sigma) := \left(\frac{\sigma_{\mathrm{data}}}{\sigma^*}\right)^2 \pmb{s} + \frac{\sigma \cdot \sigma_{\mathrm{data}}}{\sigma^*} \cdot V\left(\frac{\pmb{s}}{\sigma^*}; \pmb{t}, \frac{1}{4}\ln\sigma\right) $ Where:

    • $\sigma$ follows a log-normal distribution, $\ln\sigma \sim \mathcal{N}(P_{\mathrm{mean}}, P_{\mathrm{std}}^2)$ with $P_{\mathrm{mean}} = -1.2$ and $P_{\mathrm{std}} = 1.2$. This defines the noise levels used during training.
    • $\sigma^* := \sqrt{\sigma^2 + \sigma_{\mathrm{data}}^2}$ is a scaling term, where $\sigma_{\mathrm{data}} = 0.2$ is the standard deviation of the style vectors (the data).
    • $V$ is a 3-layer transformer [53] conditioned on the input text $\pmb{t}$ and noise level $\sigma$. This is the neural network that learns to denoise the style vector.
  • Training Objective: The denoiser $V$ is trained with the following objective: $ \mathcal{L}_{\mathrm{edm}} = \mathbb{E}_{\pmb{x}, \pmb{t}, \sigma, \pmb{\xi} \sim \mathcal{N}(0, I)} \left[ \lambda(\sigma) \left\| K(\mathcal{E}(\pmb{x}) + \sigma\pmb{\xi}; \pmb{t}, \sigma) - \mathcal{E}(\pmb{x}) \right\|_2^2 \right] $ Where:

    • $\mathcal{E}(\pmb{x}) := [E_a(\pmb{x}), E_p(\pmb{x})]$ is the concatenated style vector from the acoustic and prosodic style encoders.
    • $\pmb{\xi} \sim \mathcal{N}(0, I)$ is standard Gaussian noise.
    • $\lambda(\sigma) := (\sigma^*/(\sigma \cdot \sigma_{\mathrm{data}}))^2$ is the weighting factor, which scales the loss based on the noise level.
  • ODE for Sampling: Under this framework, the sampling equation (Eq. 1) becomes an ordinary differential equation (ODE) whose score function depends on $\sigma$: $ \frac{d\pmb{s}}{d\sigma} = -\sigma \nabla_s \log p_\sigma(\pmb{s}|\pmb{t}) = \frac{\pmb{s} - K(\pmb{s}; \pmb{t}, \sigma)}{\sigma}, \quad \pmb{s}(\sigma(T)) \sim \mathcal{N}(0, \sigma(T)^2 I) $ This ODE is solved to sample the style vector $\pmb{s}$ (a minimal sampling sketch follows this list).

    • Solver: Instead of the 2nd-order Heun solver used in EDM, StyleTTS 2 uses the ancestral DPM-2 solver [54] for faster and more diverse sampling.
    • Scheduler: The same scheduler as in EDM is used with $\sigma_{\mathrm{min}} = 0.0001$, $\sigma_{\mathrm{max}} = 3$, and $\rho = 9$. This combination allows high-quality speech synthesis with only three diffusion steps, minimally impacting inference speed.
  • Conditioning: The denoiser $V$ is conditioned on $\pmb{t}$ through $h_{\mathrm{bert}}$ (from the prosodic text encoder), which is concatenated with the noisy input $\mathcal{E}(\pmb{x}) + \sigma\pmb{\xi}$; $\sigma$ is conditioned via sinusoidal positional embeddings.

  • Multispeaker Setting: For multispeaker scenarios, $p(\pmb{s}|\pmb{t}, \pmb{c})$ is modeled by $K(\pmb{s}; \pmb{t}, \pmb{c}, \sigma)$, with an additional speaker embedding $\pmb{c} = E(\pmb{x}_{\mathrm{ref}})$ from a reference audio $\pmb{x}_{\mathrm{ref}}$. $\pmb{c}$ is injected into $V$ via adaptive layer normalization [6].

  • Detailed EDM Derivation (from Appendix B.1): EDM formulates diffusion sampling as an ODE (Eq. 1 in [50]): $ d\pmb{x} = - \dot{\sigma}(t) \sigma(t) \nabla_{\pmb{x}} \log p(\pmb{x}; \sigma(t)) dt $ Rewritten in terms of $d\sigma$: $ d\pmb{x} = - \sigma \nabla_{\pmb{x}} \log p(\pmb{x}; \sigma) d\sigma $ The denoising score matching objective (Eq. 108 in [50]) is: $ \mathbb{E}_{\pmb{y} \sim p_{\mathrm{data}},\, \pmb{\xi} \sim \mathcal{N}(0, \sigma^2 I),\, \ln\sigma \sim \mathcal{N}(P_{\mathrm{mean}}, P_{\mathrm{std}}^2)} \left[ \lambda(\sigma) \left\| D(\pmb{y} + \pmb{\xi}; \sigma) - \pmb{y} \right\|_2^2 \right] $ Where $D$ is the denoiser, defined as (Eq. 7 in [50]): $ D(\pmb{x}; \sigma) := c_{\mathrm{skip}}(\sigma) \pmb{x} + c_{\mathrm{out}}(\sigma) F_\theta(c_{\mathrm{in}}(\sigma) \pmb{x}; c_{\mathrm{noise}}(\sigma)) $

    • $F_\theta$ is the trainable neural network ($V$ in the paper's main text).
    • $c_{\mathrm{skip}}(\sigma)$, $c_{\mathrm{out}}(\sigma)$, $c_{\mathrm{in}}(\sigma)$, and $c_{\mathrm{noise}}(\sigma)$ are scaling factors. The score function is (Eq. 3 in [50]): $ \nabla_{\pmb{x}} \log p(\pmb{x}; \sigma(t)) = (D(\pmb{x}; \sigma) - \pmb{x}) / \sigma^2 $ The scaling factors are defined as (Table 1 in [50]): $ c_{\mathrm{skip}}(\sigma) := \sigma_{\mathrm{data}}^2 / (\sigma^2 + \sigma_{\mathrm{data}}^2) = (\sigma_{\mathrm{data}} / \sigma^*)^2 $ $ c_{\mathrm{out}}(\sigma) := \sigma \cdot \sigma_{\mathrm{data}} / \sqrt{\sigma_{\mathrm{data}}^2 + \sigma^2} = \sigma \cdot \sigma_{\mathrm{data}} / \sigma^* $ $ c_{\mathrm{in}}(\sigma) := 1 / \sqrt{\sigma_{\mathrm{data}}^2 + \sigma^2} = 1 / \sigma^* $ $ c_{\mathrm{noise}}(\sigma) := \frac{1}{4} \ln(\sigma) $ Where $\sigma^* := \sqrt{\sigma_{\mathrm{data}}^2 + \sigma^2}$. Substituting these definitions into the denoiser equation yields: $ D(\pmb{x}; \sigma) = \left(\frac{\sigma_{\mathrm{data}}}{\sigma^*}\right)^2 \pmb{x} + \frac{\sigma \cdot \sigma_{\mathrm{data}}}{\sigma^*} \cdot F_\theta\left(\frac{\pmb{x}}{\sigma^*}; \frac{1}{4}\ln\sigma\right) $ Combining the ODE and the score function: $ \frac{d\pmb{x}}{d\sigma} = -\sigma \nabla_{\pmb{x}} \log p(\pmb{x}; \sigma) = -\sigma (D(\pmb{x}; \sigma) - \pmb{x}) / \sigma^2 = \frac{\pmb{x} - D(\pmb{x}; \sigma)}{\sigma} $ Equations 2 and 4 from the main paper are recovered by replacing $\pmb{x}$ with $\pmb{s}$, $p(\pmb{x})$ with $p(\pmb{s}|\pmb{t})$, $F_\theta$ with $V$, and $D$ with $K$.
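
The following sketch illustrates how a style vector could be sampled by integrating the ODE above over the EDM/Karras σ schedule (σ_min = 0.0001, σ_max = 3, ρ = 9, three steps). It uses a plain Euler solver for brevity, whereas the paper uses the ancestral DPM-2 solver, and the `denoiser` callable stands in for the preconditioned denoiser K(s; t, σ); its signature is an assumption:

```python
import torch

@torch.no_grad()
def sample_style(denoiser, text_cond, dim=256, steps=3,
                 sigma_min=1e-4, sigma_max=3.0, rho=9.0):
    """Integrate ds/dsigma = (s - K(s; t, sigma)) / sigma from sigma_max down to 0."""
    # Karras-style noise schedule from sigma_max to sigma_min, then a final 0.
    i = torch.arange(steps, dtype=torch.float32)
    sigmas = (sigma_max ** (1 / rho)
              + i / max(steps - 1, 1) * (sigma_min ** (1 / rho) - sigma_max ** (1 / rho))) ** rho
    sigmas = torch.cat([sigmas, torch.zeros(1)])

    s = torch.randn(1, dim) * sigmas[0]                      # s(sigma_max) ~ N(0, sigma_max^2 I)
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        d = (s - denoiser(s, text_cond, sigma)) / sigma      # ODE derivative ds/dsigma
        s = s + (sigma_next - sigma) * d                     # Euler step toward lower noise
    return s
```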

4.2.2.3. SLM Discriminators

Speech language models (SLMs) encode rich information, from acoustic to semantic aspects [55], and their representations can mimic human perception for evaluating synthesized speech quality [45].

  • Mechanism: StyleTTS 2 uniquely transfers knowledge from SLM encoders to generative tasks via adversarial training. It uses a 12-layer WavLM [11] (pre-trained on 94k hours of data) as the discriminator.
  • Architecture: To prevent the discriminator from overpowering the generator (WavLM has more parameters than StyleTTS 2), the pre-trained WavLM model $W$ is fixed and a convolutional neural network (CNN) $C$ is appended as the discriminative head. The SLM discriminator is denoted as $D_{SLM} = C \circ W$.
  • Input: Input audios are downsampled to 16 kHz to match WavLM's input. $C$ pools features $h_{SLM} = W(\pmb{x})$ from all WavLM layers, linearly mapping the $13 \times 768$ channels to 256.
  • Training Objective: The generator components ($T$, $B$, $G$, $S$, $P$, $V$, jointly denoted as $\mathcal{G}$) and $D_{SLM}$ are trained to optimize a minimax game: $ \mathcal{L}_{slm} = \min_{\mathcal{G}} \max_{D_{SLM}} \left( \mathbb{E}_{\pmb{x}} [\log D_{SLM}(\pmb{x})] + \mathbb{E}_{\pmb{t}} [\log (1 - D_{SLM}(\mathcal{G}(\pmb{t})))] \right) $ Where:
    • $\mathcal{G}(\pmb{t})$ is the speech generated from text $\pmb{t}$.
    • $\pmb{x}$ is the human recording. The LSGAN loss [48] is used in practice for stability (a minimal sketch of these losses follows this list). For the generator: $ \mathcal{L}_{\mathrm{slm}}(G; D) = \mathbb{E}_{\pmb{t}, \pmb{x}} \left[ \left( D_{SLM}\left( G\left( h_{\mathrm{text}} \cdot \pmb{a}_{\mathrm{pred}}, \hat{\pmb{s}}_a, p_{\hat{\pmb{x}}}, n_{\hat{\pmb{x}}} \right) \right) - 1 \right)^2 \right] $ For the discriminator: $ \mathcal{L}_{\mathrm{slm}}(D; G) = \mathbb{E}_{\pmb{t}, \pmb{x}} \left[ \left( D_{SLM}\left( G\left( h_{\mathrm{text}} \cdot \pmb{a}_{\mathrm{pred}}, \hat{\pmb{s}}_a, p_{\hat{\pmb{x}}}, n_{\hat{\pmb{x}}} \right) \right) \right)^2 \right] + \mathbb{E}_{\pmb{y}} \left[ \left( D_{SLM}(\pmb{y}) - 1 \right)^2 \right] $ Where:
    • $\hat{\pmb{s}}_a$ is the acoustic style sampled from style diffusion.
    • $\pmb{a}_{\mathrm{pred}}$ is the predicted alignment.
    • $p_{\hat{\pmb{x}}}$ and $n_{\hat{\pmb{x}}}$ are the predicted pitch and energy.
    • $\pmb{y}$ denotes real speech samples.
  • Optimal Discriminator Property: As shown in [56], the optimal discriminator $D^*_{SLM}(\pmb{x})$ is: $ D_{SLM}^*(\pmb{x}) = \frac{\mathbb{P}_{W \circ \mathcal{T}}(\pmb{x})}{\mathbb{P}_{W \circ \mathcal{T}}(\pmb{x}) + \mathbb{P}_{W \circ \mathcal{G}}(\pmb{x})} $ Where $\mathcal{T}$ and $\mathcal{G}$ represent the true and generated data distributions, and $\mathbb{P}_{W \circ \mathcal{T}}$ and $\mathbb{P}_{W \circ \mathcal{G}}$ are their respective densities in the WavLM feature space. At convergence, the optimal generator $G^*$ matches the generated and true distributions in the SLM representation space, effectively mimicking human perception.
  • OOD Training: The generator loss in $\mathcal{L}_{slm}$ is independent of the ground truth $\pmb{x}$ and relies only on the input text $\pmb{t}$. This enables training on out-of-distribution (OOD) texts, improving robustness.
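
A minimal sketch of the LSGAN form of $\mathcal{L}_{slm}$ is shown below. The `wavlm_feats` callable stands in for the frozen WavLM feature extractor $W$ and `d_slm` for the CNN head $C$; both names and signatures are assumptions made for illustration:

```python
import torch

def slm_generator_loss(d_slm, wavlm_feats, fake_wav):
    """LSGAN generator term of L_slm: drive D_SLM(G(t)) toward 1."""
    score_fake = d_slm(wavlm_feats(fake_wav))
    return ((score_fake - 1) ** 2).mean()

def slm_discriminator_loss(d_slm, wavlm_feats, fake_wav, real_wav):
    """LSGAN discriminator term of L_slm: D_SLM(real) -> 1, D_SLM(fake) -> 0.
    The generated waveform is detached so only the head C receives gradients here."""
    score_fake = d_slm(wavlm_feats(fake_wav.detach()))
    score_real = d_slm(wavlm_feats(real_wav))
    return (score_fake ** 2).mean() + ((score_real - 1) ** 2).mean()
```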

4.2.2.4. Differentiable Duration Modeling

The duration predictor outputs phoneme durations $d_{\mathrm{pred}}$. The traditional upsampling method (repeating values) is non-differentiable, blocking gradient flow for E2E training.

  • Limitations of existing methods: Attention-based upsamplers [42] (used in NaturalSpeech [5]) can be unstable during adversarial training. Gaussian upsampling [41] uses a fixed width $\sigma$, limiting its ability to model varying alignment lengths. Making $\sigma_i$ trainable (as in Non-Attentive Tacotron [57]) can introduce instability.
  • Proposed Non-parametric Differentiable Upsampler:
    • For each phoneme $\pmb{t}_i$, the alignment is modeled as a random variable $a_i \in \mathbb{N}$, the index of the speech frame to which the phoneme is aligned.
    • The duration of the $i$-th phoneme is $d_i \in \{1, \ldots, L\}$, where $L = 50$ is the maximum phoneme duration (1.25 seconds).
    • The sum rule $a_i = \sum_{k=1}^i d_k$ is difficult to model due to dependencies among the durations. Instead, it is approximated as $a_i = d_i + \ell_{i-1}$, where $\ell_{i-1}$ is the end position of the $(i-1)$-th phoneme.
    • The approximated probability mass function (PMF) of $a_i$ is: $ f_{a_i}[n] = f_{d_i + \ell_{i-1}}[n] = f_{d_i}[n] * f_{\ell_{i-1}}[n] = \sum_k f_{d_i}[k] \cdot \delta_{\ell_{i-1}}[n-k] $ Where:
      • $n$ is the speech frame index.
      • $f_{d_i}[n]$ is the PMF of the $i$-th phoneme's duration.
      • $f_{\ell_{i-1}}[n]$ is the PMF of $\ell_{i-1}$, which, since $\ell_{i-1}$ is treated as a constant, is a point mass at $\ell_{i-1}$.
      • $\delta_{\ell_{i-1}}[n-k]$ is the Kronecker delta function, which is not differentiable.
    • Differentiable Approximation: The delta function $\delta_{\ell_{i-1}}$ is replaced with a Gaussian kernel $\mathcal{N}_{\ell_{i-1}}(n; \sigma)$ (defined in Eq. 7, similar to standard Gaussian upsampling) with $\sigma = 1.5$: $ \mathcal{N}_{c_i}(n; \sigma) := \exp\left(-\frac{(n-c_i)^2}{2\sigma^2}\right), \qquad \ell_i := \sum_{k=1}^i d_{\mathrm{pred}}[k] $ Where:
      • $c_i := \ell_i - \frac{1}{2} d_{\mathrm{pred}}[i]$ is the center position of the $i$-th phoneme.
      • $\ell_i$ is its end position.
    • Duration Predictor Output: The duration predictor is modified to output $q[k, i]$, the probability that the $i$-th phoneme has a duration of at least $k$ (for $k \in \{1, \ldots, L\}$). This probability is optimized to be 1 when $d_{\mathrm{gt}} \geq k$ using a cross-entropy loss.
      • The predicted duration $d_{\mathrm{pred}}[i]$ is approximated as $\sum_{k=1}^L q[k,i]$.
    • Normalized Alignment: The differentiable approximation $\tilde{f}_{a_i}[n]$ (from the convolution operation) is normalized across the phoneme axis using a softmax function to obtain $\pmb{a}_{\mathrm{pred}}$: $ a_{\mathrm{pred}}[n,i] := \frac{e^{\tilde{f}_{a_i}[n]}}{\displaystyle \sum_{j=1}^N e^{\tilde{f}_{a_j}[n]}}, \qquad \tilde{f}_{a_i}[n] := \sum_{k=0}^{\hat{M}} q[k,i] \cdot \mathcal{N}_{\ell_{i-1}}(n-k; \sigma) $ Where:
      • $\hat{M} := \lceil \ell_N \rceil$ is the predicted total duration of the speech.

      • $n \in \{1, \ldots, \hat{M}\}$ is the speech frame index.

        The following figure (Figure 4 from the original paper) illustrates the proposed differentiable duration upsampler:

        Figure 4: Illustration of our proposed differentiable duration upsampler. (a) Probability output from the duration predictor for 5 input tokens with $L = 5$. (b) Gaussian filter $\mathcal{N}_{\ell_{i-1}}$ centered at $\ell_{i-1}$. (c) Unnormalized predicted alignment $\tilde{f}_{a_i}[n]$ from the convolution of (a) and (b). (d) Normalized predicted alignment $\pmb{a}_{\mathrm{pred}}$ over the phoneme axis.

An example of the duration predictor output and the predicted alignment with and without the differentiable duration upsampler is shown in Figure 5 (from the original paper):

Figure 5: An example of duration predictor output and the predicted alignment with and without the differentiable duration upsampler. (a) The duration predictor output, shown as log probability for improved visualization. (b) Non-differentiable upsampling. (c) The proposed upsampler with $\sigma = 1.5$. Although (b) and (c) differ, the duration predictor is trained end-to-end with the SLM discriminator, making the difference perceptually indistinguishable in synthesized speech.
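
The following PyTorch sketch implements the differentiable upsampling described above under the stated definitions ($q[k, i]$, $\ell_i$, Gaussian kernel with $\sigma = 1.5$, softmax over the phoneme axis); the released implementation may differ in indexing and edge handling:

```python
import torch

def differentiable_upsample(q, sigma=1.5):
    """Non-parametric differentiable duration upsampler (a sketch).

    q: (L, N) tensor, q[k, i] = predicted probability that phoneme i lasts at
       least k + 1 frames (the duration predictor output).
    Returns a_pred of shape (M_hat, N): soft alignment between frames and phonemes.
    """
    L, N = q.shape
    d_pred = q.sum(dim=0)                              # d_pred[i] = sum_k q[k, i]
    ell = torch.cumsum(d_pred, dim=0)                  # end position of each phoneme
    ell_prev = torch.cat([torch.zeros(1), ell[:-1]])   # ell_{i-1}, with ell_0 = 0
    M_hat = int(torch.ceil(ell[-1]).item())            # predicted total number of frames

    n = torch.arange(M_hat).unsqueeze(1).float()       # (M_hat, 1) frame indices
    k = torch.arange(L).unsqueeze(0).float()           # (1, L) duration indices
    cols = []
    for i in range(N):
        # Gaussian kernel centered at ell_{i-1}, evaluated at (n - k), then
        # "convolved" with the duration probabilities of phoneme i.
        kernel = torch.exp(-((n - k - ell_prev[i]) ** 2) / (2 * sigma ** 2))  # (M_hat, L)
        cols.append(kernel @ q[:, i])                  # unnormalized f_tilde_{a_i}[n]
    f_tilde = torch.stack(cols, dim=1)                 # (M_hat, N)
    return torch.softmax(f_tilde, dim=1)               # normalize over the phoneme axis
```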

4.2.2.5. Detailed Training Objectives (from Appendix G)

G.1 Acoustic Module Pre-training: This stage trains the acoustic modules ($G, A, E_a, F, T$) and the discriminator ($D$) for mel-spectrogram reconstruction, alignment, and basic adversarial objectives.

  • Mel-spectrogram reconstruction loss $\mathcal{L}_{\mathrm{mel}}$ (Eq. 27).

  • TMA objectives:

    • ASR loss $\mathcal{L}_{\mathrm{s2s}}$ (Eq. 28).
    • Monotonic loss $\mathcal{L}_{\mathrm{mono}}$ (Eq. 29).
  • Adversarial objectives:

    • LSGAN generator loss $\mathcal{L}_{\mathrm{adv}}(G; D)$ (Eq. 30).
    • LSGAN discriminator loss $\mathcal{L}_{\mathrm{adv}}(D; G)$ (Eq. 31).
    • Feature-matching loss $\mathcal{L}_{\mathrm{fm}}$ (Eq. 32).
    • Truncated pointwise relativistic generator loss $\mathcal{L}_{\mathrm{rel}}(G; D)$ (a minimal sketch of this loss follows this list): $ \mathcal{L}_{\mathrm{rel}}(G; D) = \mathbb{E}_{\{\hat{\pmb{y}}, \pmb{y} \,|\, D(\hat{\pmb{y}}) \leq D(\pmb{y}) + m_{GD}\}} \left[ \tau - \mathrm{ReLU}\left( \tau - \left( D(\hat{\pmb{y}}) - D(\pmb{y}) - m_{GD} \right)^2 \right) \right] $ Where:
      • $\hat{\pmb{y}} = G(h_{\mathrm{text}} \cdot \pmb{a}_{\mathrm{align}}, \pmb{s}_a, p_{\pmb{y}}, n_{\pmb{y}})$ is the generated sample.
      • $\mathrm{ReLU}(\cdot)$ is the rectified linear unit function.
      • $\tau = 0.04$ is the truncation factor.
      • $m_{GD}$ is a margin parameter.
    • Truncated pointwise relativistic discriminator loss $\mathcal{L}_{\mathrm{rel}}(D; G)$: $ \mathcal{L}_{\mathrm{rel}}(D; G) = \mathbb{E}_{\{\hat{\pmb{y}}, \pmb{y} \,|\, D(\pmb{y}) \leq D(\hat{\pmb{y}}) + m_{DG}\}} \left[ \tau - \mathrm{ReLU}\left( \tau - \left( D(\pmb{y}) - D(\hat{\pmb{y}}) - m_{DG} \right)^2 \right) \right] $ Where $m_{DG}$ is a margin parameter.
    • The margin parameters $m_{GD}$ and $m_{DG}$ are defined by the median of the score difference: $ m_{GD} = \mathbb{M}_{\pmb{y}, \hat{\pmb{y}}} [D(\hat{\pmb{y}}) - D(\pmb{y})] $ $ m_{DG} = \mathbb{M}_{\pmb{y}, \hat{\pmb{y}}} [D(\pmb{y}) - D(\hat{\pmb{y}})] $ Where $\mathbb{M}[\cdot]$ is the median operation.
  • Full objective functions for acoustic module pre-training: Generator: $ \min_{G, A, E_a, F, T} \mathcal{L}_{\mathrm{mel}} + \lambda_{\mathrm{s2s}} \mathcal{L}_{\mathrm{s2s}} + \lambda_{\mathrm{mono}} \mathcal{L}_{\mathrm{mono}} + \mathcal{L}_{\mathrm{adv}}(G; D) + \mathcal{L}_{\mathrm{rel}}(G; D) + \mathcal{L}_{\mathrm{fm}} $ Discriminator: $ \min_{D} \mathcal{L}_{\mathrm{adv}}(D; G) $ Hyperparameters: $\lambda_{\mathrm{s2s}} = 0.2$, $\lambda_{\mathrm{mono}} = 5$.
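
Below is a minimal sketch of the truncated pointwise relativistic generator loss defined above, assuming per-sample discriminator scores are available; the discriminator-side loss is obtained by swapping the roles of the real and generated scores:

```python
import torch
import torch.nn.functional as F

def truncated_relativistic_loss(score_fake, score_real, tau=0.04):
    """Truncated pointwise relativistic loss (a sketch of L_rel(G; D)).

    score_fake, score_real: per-sample discriminator scores for generated
    and real speech. The margin m_GD is the median of the score difference."""
    diff = score_fake - score_real
    m_gd = diff.median().detach()                      # margin m_GD
    mask = score_fake <= score_real + m_gd             # only the conditioned pairs contribute
    penalty = tau - F.relu(tau - (diff - m_gd) ** 2)   # truncated squared difference
    if mask.any():
        return penalty[mask].mean()
    return torch.zeros((), device=score_fake.device)
```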

G.2 Joint Training: This stage jointly optimizes all components for end-to-end waveform generation, incorporating the new differentiable duration modeling, prosodic encoders, SLM discriminator, and style diffusion. The pitch extractor $F$ is fixed.

  • Duration prediction objectives:
    • Cross-entropy loss $\mathcal{L}_{\mathrm{ce}}$ for training the duration predictor's probability output $q[k, i]$ (a minimal sketch of these duration losses follows this list): $ \mathcal{L}_{\mathrm{ce}} = \mathbb{E}_{d_{\mathrm{gt}}} \left[ \sum_{i=1}^N \sum_{k=1}^L \mathrm{CE}\left( q[k,i], \mathbb{I}(d_{\mathrm{gt}}[i] \geq k) \right) \right] $ Where:
      • $\mathbb{I}(\cdot)$ is the indicator function.
      • $q = S(h_{\mathrm{bert}}, \pmb{s}_p)$ is the output of the duration predictor $S$ given the prosodic style vector $\pmb{s}_p = E_p(\pmb{x})$ and prosodic text embeddings $h_{\mathrm{bert}} = B(\pmb{t})$.
      • $N$ is the number of phonemes in $\pmb{t}$.
      • $d_{\mathrm{gt}}$ is the ground truth duration.
    • $L_1$ loss $\mathcal{L}_{\mathrm{dur}}$ for the approximated duration: $ \mathcal{L}_{\mathrm{dur}} = \mathbb{E}_{d_{\mathrm{gt}}} \left[ \| d_{\mathrm{gt}} - d_{\mathrm{pred}} \|_1 \right] $ Where $d_{\mathrm{pred}}[i] = \sum_{k=1}^L q[k,i]$ is the approximated predicted duration.
  • Prosody prediction objectives:
    • F0 reconstruction loss $\mathcal{L}_{f0}$ (Eq. 41).
    • Energy reconstruction loss $\mathcal{L}_n$ (Eq. 42).
  • Mel-spectrogram reconstruction loss $\mathcal{L}_{\mathrm{mel}}$ (Eq. 43) is modified to use the predicted pitch and energy. Similar changes apply to $\mathcal{L}_{\mathrm{adv}}$, $\mathcal{L}_{\mathrm{fm}}$, and $\mathcal{L}_{\mathrm{rel}}$.
  • SLM adversarial objective: Uses the LSGAN loss for the SLM discriminator $D_{SLM}$.
    • SLM generator loss $\mathcal{L}_{\mathrm{slm}}(G; D)$ (Eq. 44).
    • SLM discriminator loss $\mathcal{L}_{\mathrm{slm}}(D; G)$ (Eq. 45).
  • Full objective functions for joint training: Generator (all components: $G, A, E_a, E_p, T, B, V, S, P$): $ \min_{G, A, E_a, E_p, T, B, V, S, P} \mathcal{L}_{\mathrm{mel}} + \lambda_{\mathrm{ce}} \mathcal{L}_{\mathrm{ce}} + \lambda_{\mathrm{dur}} \mathcal{L}_{\mathrm{dur}} + \lambda_{f0} \mathcal{L}_{f0} + \lambda_n \mathcal{L}_n + \lambda_{\mathrm{s2s}} \mathcal{L}_{\mathrm{s2s}} + \lambda_{\mathrm{mono}} \mathcal{L}_{\mathrm{mono}} + \mathcal{L}_{\mathrm{adv}}(G; D) + \mathcal{L}_{\mathrm{rel}}(G; D) + \mathcal{L}_{\mathrm{fm}} + \mathcal{L}_{\mathrm{slm}}(G; D) + \mathcal{L}_{\mathrm{edm}} $ Discriminators ($D$, $C$): $ \min_{D, C} \mathcal{L}_{\mathrm{adv}}(D; G) + \mathcal{L}_{\mathrm{rel}}(D; G) + \mathcal{L}_{\mathrm{slm}}(D; G) $ Hyperparameters: $\lambda_{\mathrm{s2s}} = 0.2$, $\lambda_{\mathrm{mono}} = 5$, $\lambda_{\mathrm{dur}} = 1$, $\lambda_{f0} = 0.1$, $\lambda_n = 1$, and $\lambda_{\mathrm{ce}} = 1$.
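
The duration objectives $\mathcal{L}_{\mathrm{ce}}$ and $\mathcal{L}_{\mathrm{dur}}$ above can be sketched as follows, assuming the duration predictor output q has already been passed through a sigmoid so that each entry is a probability:

```python
import torch
import torch.nn.functional as F

def duration_losses(q, d_gt, L=50):
    """Cross-entropy duration loss L_ce plus the L1 term L_dur (a sketch).

    q:    (L, N) predicted probabilities that phoneme i lasts at least k frames.
    d_gt: (N,)   ground-truth durations in frames, derived from the alignment."""
    k = torch.arange(1, L + 1).unsqueeze(1)              # (L, 1) duration thresholds
    targets = (d_gt.unsqueeze(0) >= k).float()           # indicator I(d_gt[i] >= k), shape (L, N)
    l_ce = F.binary_cross_entropy(q, targets)            # CE averaged over all (k, i) entries
    d_pred = q.sum(dim=0)                                # d_pred[i] = sum_k q[k, i]
    l_dur = (d_gt.float() - d_pred).abs().mean()         # L1 duration loss
    return l_ce, l_dur
```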

4.2.2.6. Detailed Model Architectures (from Appendix F)

The architecture maintains the acoustic text encoder, text aligner, and pitch extractor from the original StyleTTS. The acoustic style encoder and prosodic style encoder follow the original StyleTTS style encoder. The prosodic text encoder is a pre-trained PL-BERT model [7]. The discriminators are MPD and MRD [31]. The decoder combines the original StyleTTS decoder with iSTFTNet [45] or HifiGAN [28], with AdaIN after each activation. The duration and prosody predictors are similar to [6], but the duration predictor's output changes from $1 \times N$ to $L \times N$ for probability modeling.

The following are the architectures for the Denoiser and Discriminative Head:

Denoiser Architecture (Table 8): The Denoiser processes the input style vector $\pmb{s}$, noise level $\sigma$, phoneme embeddings $h_{\mathrm{bert}}$, and speaker embedding $\pmb{c}$ (for multi-speaker). $\pmb{c}$ is injected via adaptive layer normalization (AdaLN): $ \mathrm{AdaLN}(x, s) = L_\sigma(s) \frac{x - \mu(x)}{\sigma(x)} + L_\mu(s) $ Where:

  • $x$ is the feature map from the previous layer.
  • $s$ is the style vector (or the speaker embedding in this context).
  • $\mu(x)$ and $\sigma(x)$ are the layer mean and standard deviation.
  • $L_\sigma$ and $L_\mu$ are learned linear projections for computing the adaptive gain and bias using $s$.
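
A minimal PyTorch sketch of this AdaLN operation, assuming channel-first feature maps of shape `(batch, channels, length)`; the module and variable names are illustrative, not the released code.

```python
import torch
import torch.nn as nn

class AdaLN(nn.Module):
    """Adaptive layer normalization: gain and bias are predicted from a style/speaker vector."""

    def __init__(self, channels: int, style_dim: int = 256):
        super().__init__()
        self.norm = nn.LayerNorm(channels, elementwise_affine=False)  # plain LN, no learned affine
        self.to_gain = nn.Linear(style_dim, channels)   # plays the role of L_sigma(s)
        self.to_bias = nn.Linear(style_dim, channels)   # plays the role of L_mu(s)

    def forward(self, x: torch.Tensor, s: torch.Tensor) -> torch.Tensor:
        # x: (B, C, N) feature map, s: (B, style_dim) style or speaker embedding
        h = self.norm(x.transpose(1, 2))                # normalize over the channel dimension
        gain = self.to_gain(s).unsqueeze(1)             # (B, 1, C)
        bias = self.to_bias(s).unsqueeze(1)             # (B, 1, C)
        return (gain * h + bias).transpose(1, 2)        # back to (B, C, N)
```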

    The following are the results from Table 8 of the original paper:

| Submodule | Input | Layer | Norm | Output Shape |
| --- | --- | --- | --- | --- |
| Embedding | σ | Sinusoidal Embedding | - | 256 × 1 |
|  | c | Addition | - | 256 × 1 |
|  |  | Repeat for N times | - | 256 × N |
|  |  | Linear 256 × 1024 | - | 1024 × N |
|  |  | Output: embedding k | - | 1024 × N |
| Input | s | Repeat for N times | - | 256 × N |
|  | h_bert | Concat | Group(32) | 1024 × N |
|  |  | Conv 1024 × 1024 | - | 1024 × N |
| Transformer Block (× 3) | k | Addition | - | 1024 × N |
|  | c | 8-Head Self-Attention (64 head features) | AdaLN(·, c) | 1024 × N |
|  |  | Linear 1024 × 2048 | - | 2048 × N |
|  |  | GELU | - | 2048 × N |
|  |  | Linear 2048 × 1024 | - | 1024 × N |
| Output |  | Adaptive Average Pool | - | 1024 × 1 |
|  |  | Linear 1024 × 256 | - | 256 × 1 |

Where:

  • $N$ represents the input phoneme length (i.e., the number of phonemes in the text).
  • $\sigma$ is the noise level.
  • $h_{\mathrm{bert}}$ represents the phoneme embeddings.
  • $\pmb{c}$ is the speaker embedding.
  • The size of $\pmb{s}$ and $\pmb{c}$ is $256 \times 1$.
  • The size of $h_{\mathrm{bert}}$ is $768 \times N$.
  • The size of $\sigma$ is $1 \times 1$.
  • $\mathrm{Group}(32)$ refers to group normalization with a group size of 32.

Discriminative Head Architecture (Table 9): The Discriminative Head is composed of a 3-layer convolutional neural network (CNN) followed by a linear projection. This head processes the features from WavLM to produce a discriminative score.

The following are the results from Table 9 of the original paper:

| Layer | Norm | Output Shape |
| --- | --- | --- |
| Input h_slm | - | 13 × 768 × T |
| Reshape | - | 9984 × T |
| Linear 9984 × 256 | - | 256 × T |
| Conv 256 × 256 | - | 256 × T |
| Leaky ReLU (0.2) | - | 256 × T |
| Conv 256 × 512 | - | 512 × T |
| Leaky ReLU (0.2) | - | 512 × T |
| Conv 512 × 512 | - | 512 × T |
| Leaky ReLU (0.2) | - | 512 × T |
| Conv 512 × 1 | - | 1 × T |

Where $T$ represents the number of frames (length) of the output feature $h_{\mathrm{slm}}$ from WavLM.
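
A rough PyTorch sketch of a discriminative head with this shape, together with LSGAN-style losses as used for the SLM adversarial objective, is given below. The layer widths follow Table 9, but kernel sizes, padding, and all names are assumptions for illustration, not the released implementation.

```python
import torch
import torch.nn as nn

class SLMDiscriminativeHead(nn.Module):
    """CNN head over stacked WavLM hidden states: (B, 13, 768, T) -> (B, 1, T)."""

    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(13 * 768, 256)   # 13 x 768 = 9984 input features per frame
        self.convs = nn.Sequential(
            nn.Conv1d(256, 256, kernel_size=3, padding=1), nn.LeakyReLU(0.2),
            nn.Conv1d(256, 512, kernel_size=3, padding=1), nn.LeakyReLU(0.2),
            nn.Conv1d(512, 512, kernel_size=3, padding=1), nn.LeakyReLU(0.2),
            nn.Conv1d(512, 1, kernel_size=3, padding=1),
        )

    def forward(self, h_slm: torch.Tensor) -> torch.Tensor:
        # h_slm: (B, 13, 768, T) -> (B, T, 13*768) for the input linear projection
        B, L, D, T = h_slm.shape
        x = h_slm.permute(0, 3, 1, 2).reshape(B, T, L * D)
        x = self.proj(x).transpose(1, 2)        # (B, 256, T)
        return self.convs(x)                    # (B, 1, T) per-frame scores

def lsgan_d_loss(d_real: torch.Tensor, d_fake: torch.Tensor) -> torch.Tensor:
    # Discriminator: push scores on real speech toward 1 and on generated speech toward 0.
    return ((d_real - 1) ** 2).mean() + (d_fake ** 2).mean()

def lsgan_g_loss(d_fake: torch.Tensor) -> torch.Tensor:
    # Generator: push scores on generated speech toward 1.
    return ((d_fake - 1) ** 2).mean()
```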

5. Experimental Setup

5.1. Datasets

Experiments were conducted on three benchmark datasets to evaluate StyleTTS 2's performance across different scenarios: single-speaker, multi-speaker, and zero-shot speaker adaptation. All datasets were resampled to 24kHz (to match LibriTTS's native rate, ensuring consistency across models) and texts were converted into phonemes using phonemizer [58].
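
For reference, a typical call to the phonemizer package looks like the snippet below; this is an illustrative example assuming an installed espeak backend, not the authors' exact preprocessing script.

```python
from phonemizer import phonemize

text = "The quick brown fox jumps over the lazy dog."
# IPA phoneme string for US English using the espeak backend, keeping punctuation and stress marks.
phonemes = phonemize(
    text,
    language="en-us",
    backend="espeak",
    strip=True,
    preserve_punctuation=True,
    with_stress=True,
)
print(phonemes)
```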

  • LJSpeech [12]:
    • Source: A public domain audio dataset of 13,100 short audio clips of a single female speaker reading passages from 7 audiobooks.
    • Scale: Roughly 24 hours of speech.
    • Characteristics: Single-speaker, English, professional audiobook narration. Contains in-distribution (ID) texts (from the original dataset) and out-of-distribution (OOD) texts (40 utterances from Librivox spoken by the same narrator but from audiobooks not in the original dataset). An example data sample would be an audio clip like "The quick brown fox jumps over the lazy dog." paired with its transcribed text.
    • Split: 12,500 samples for training, 100 for validation, 500 for testing (same as [3, 5, 6]).
    • Choice: Standard benchmark for single-speaker TTS, suitable for evaluating naturalness and robustness to OOD texts.
  • VCTK [13]:
    • Source: The CSTR VCTK Corpus, an English multi-speaker corpus.
    • Scale: Nearly 44,000 short clips from 109 native English speakers (various accents).
    • Characteristics: Multi-speaker, diverse accents, relatively short utterances, often reading standard paragraphs without deep narrative context.
    • Split: 43,470 samples for training, 100 for validation, 500 for testing (same as [3]).
    • Choice: Standard benchmark for multi-speaker TTS, evaluating naturalness and speaker similarity across different voices.
  • LibriTTS [14] (train-clean-460 subset):
    • Source: A large-scale corpus derived from the LibriSpeech ASR corpus, specifically designed for TTS. train-clean-460 is a clean subset.
    • Scale: Approximately 245 hours of audio from 1,151 speakers. Utterances longer than 30 seconds or shorter than one second were excluded.
    • Characteristics: Large number of speakers, diverse reading styles, varied acoustic environments and accents (derived from audiobooks).
    • Split: Training (98%), validation (1%), and testing (1%), consistent with [6]. The test-clean subset was used for zero-shot adaptation evaluation, with 3-second reference clips.
    • Choice: Ideal for evaluating zero-shot speaker adaptation capabilities due to its large number of speakers and diverse speaking styles. Texts from LibriTTS training split were also used as OOD texts for SLM adversarial training, as LJSpeech alone has limited OOD samples.

5.2. Evaluation Metrics

The paper uses both subjective and objective evaluation metrics.

5.2.1. Subjective Metrics

  • Mean Opinion Score of Naturalness (MOS-N):
    1. Conceptual Definition: Quantifies the perceived naturalness and human-likeness of synthesized speech. Listeners rate how natural the speech sounds, including whether it sounds like it was spoken by a native speaker.
    2. Mathematical Formula: There is no single mathematical formula for MOS, as it's a direct average of human ratings. It's calculated as the arithmetic mean of individual scores. $ \mathrm{MOS} = \frac{1}{N} \sum_{i=1}^N R_i $
    3. Symbol Explanation:
      • $N$: Total number of ratings collected.
      • $R_i$: Rating given by the $i$-th human evaluator (typically on a 1-5 scale).
  • Mean Opinion Score of Similarity (MOS-S):
    1. Conceptual Definition: Measures how similar the voice of the synthesized speech is to a provided reference speaker's voice. Listeners rate whether the two audio clips (synthetic and reference) could have been produced by the same speaker, focusing on accent and speaking habits.
    2. Mathematical Formula: Similar to MOS-N, MOS-S is the arithmetic mean of human ratings. $ \mathrm{MOS-S} = \frac{1}{N} \sum_{i=1}^N S_i $
    3. Symbol Explanation:
      • $N$: Total number of similarity ratings collected.
      • $S_i$: Similarity rating given by the $i$-th human evaluator (typically on a 1-5 scale).
  • Comparative Mean Opinion Score (CMOS):
    1. Conceptual Definition: A more sensitive subjective metric that asks listeners to compare two speech samples (e.g., model A vs. model B or model vs. ground truth) and rate the second sample relative to the first. It's designed to detect subtle differences that might be missed in absolute MOS ratings.
    2. Mathematical Formula: CMOS is the arithmetic mean of comparative ratings. $ \mathrm{CMOS} = \frac{1}{N} \sum_{i=1}^N C_i $
    3. Symbol Explanation:
      • $N$: Total number of comparative ratings collected.
      • $C_i$: Comparative rating given by the $i$-th human evaluator (typically on a -6 to +6 scale, where positive indicates the second sample is better, negative indicates worse, and 0 indicates no difference).
  • p-value from Wilcoxon Test: Used to determine the statistical significance of CMOS differences. A $p < 0.05$ typically indicates a statistically significant difference between models.
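
As a small illustrative sketch (not the paper's exact statistical procedure), MOS averages with confidence intervals and a Wilcoxon signed-rank p-value over comparative ratings can be computed with SciPy; the rating values below are placeholders.

```python
import numpy as np
from scipy import stats

def mos_with_ci(ratings, confidence=0.95):
    """Mean opinion score with a t-based confidence interval over individual ratings."""
    r = np.asarray(ratings, dtype=float)
    half_width = stats.sem(r) * stats.t.ppf((1 + confidence) / 2, len(r) - 1)
    return r.mean(), half_width

# CMOS-style significance: paired comparative ratings tested against a zero median.
cmos_ratings = [1, 0, 2, -1, 1, 1, 0, 2, 1, -1]   # hypothetical ratings on the -6..+6 scale
stat, p_value = stats.wilcoxon(cmos_ratings)       # Wilcoxon signed-rank test
print(mos_with_ci([4, 5, 4, 3, 5]), p_value)
```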

5.2.2. Objective Metrics

  • Mel-Cepstral Distortion (MCD):
    1. Conceptual Definition: A widely used objective metric to quantify the spectral distortion between two speech signals (e.g., synthesized and ground truth). Lower MCD values indicate better spectral fidelity, meaning the synthesized speech's timbre and acoustic characteristics are closer to the reference.
    2. Mathematical Formula: $ \mathrm{MCD} = \frac{10}{\ln 10} \sqrt{2 \sum_{k=1}^{K} (c_{k, ref} - c_{k, syn})^2} $
    3. Symbol Explanation:
      • $K$: The number of mel-cepstral coefficients (MCCs) used for comparison.
      • $c_{k,\mathrm{ref}}$: The $k$-th MCC of the reference (ground truth) speech.
      • $c_{k,\mathrm{syn}}$: The $k$-th MCC of the synthesized speech.
      • $\ln 10$: Natural logarithm of 10, used for conversion to decibels.
      • The MCCs are aligned using Dynamic Time Warping (DTW) before comparison.
  • MCD weighted by Speech Length (MCD-SL):
    1. Conceptual Definition: A variation of MCD that additionally considers the length of the speech segments and the quality of alignment between the synthesized and reference speech. It provides a more comprehensive evaluation of spectral fidelity, especially when there might be slight duration mismatches. Lower MCD-SL values are better.
    2. Mathematical Formula: The paper refers to a publicly available library for its computation, suggesting it's a specific implementation. A common approach to weighting by length or alignment quality involves averaging MCD over aligned frames and penalizing unaligned segments or incorporating a length-based normalization factor. Without the specific library's formula, the precise mathematical definition is not provided in the paper.
    3. Symbol Explanation: No specific symbols beyond those for MCD are defined for MCD-SL in the paper.
  • Root Mean Square Error of log F0 pitch (F0 RMSE):
    1. Conceptual Definition: Measures the accuracy of the fundamental frequency (F0, or pitch) contour of the synthesized speech compared to the reference. F0 is crucial for perceived intonation and prosody. Lower RMSE indicates a more accurate pitch contour.
    2. Mathematical Formula: $ \mathrm{RMSE_{F0}} = \sqrt{\frac{1}{N} \sum_{n=1}^{N} (\log F0_{ref}(n) - \log F0_{syn}(n))^2} $
    3. Symbol Explanation:
      • $N$: The number of speech frames over which F0 is compared (after alignment).
      • $F0_{\mathrm{ref}}(n)$: The F0 value of the reference speech at frame $n$.
      • $F0_{\mathrm{syn}}(n)$: The F0 value of the synthesized speech at frame $n$.
      • $\log$: Natural logarithm, applied to F0 values before computing the RMSE to account for the logarithmic perception of pitch.
  • Mean Absolute Deviation of phoneme duration (DUR MAD):
    1. Conceptual Definition: Quantifies the average absolute difference between the predicted phoneme durations and the ground truth phoneme durations. Accurate duration modeling is critical for natural-sounding speech rhythm. Lower DUR MAD values indicate better duration prediction.
    2. Mathematical Formula: $ \mathrm{DUR\ MAD} = \frac{1}{P} \sum_{p=1}^{P} |d_{p,\mathrm{ref}} - d_{p,\mathrm{syn}}| $
    3. Symbol Explanation:
      • $P$: The total number of phonemes in the utterance.
      • $d_{p,\mathrm{ref}}$: The ground truth duration of the $p$-th phoneme.
      • $d_{p,\mathrm{syn}}$: The predicted or synthesized duration of the $p$-th phoneme. (A short code sketch of this and several other objective metrics appears after this list.)
  • Word Error Rate (WER):
    1. Conceptual Definition: A common metric used to evaluate the accuracy of speech recognition systems, but also applied in TTS to assess the intelligibility of synthesized speech. It calculates the minimum number of edits (substitutions, deletions, insertions) required to change the synthesized text into the reference text, normalized by the length of the reference text. Lower WER indicates better intelligibility.
    2. Mathematical Formula: $ \mathrm{WER} = \frac{S + D + I}{N} \times 100\% $
    3. Symbol Explanation:
      • $S$: Number of substitutions (words in the reference replaced by different words in the synthesized output).
      • $D$: Number of deletions (words in the reference missing from the synthesized output).
      • $I$: Number of insertions (extra words in the synthesized output not present in the reference).
      • $N$: Total number of words in the reference transcript.
  • Coefficient of Variation of duration (CV_dur):
    1. Conceptual Definition: A standardized measure of dispersion of phoneme durations, used to assess the diversity or variability in the speaking rate of synthesized speech. A higher CV_dur indicates greater variability in duration for the same input text, which is desirable for expressive speech.
    2. Mathematical Formula: $ \mathrm{CV_{dur}} = \frac{\sigma_{dur}}{\mu_{dur}} $
    3. Symbol Explanation:
      • $\sigma_{\mathrm{dur}}$: The standard deviation of the phoneme duration values across multiple syntheses of the same text.
      • $\mu_{\mathrm{dur}}$: The mean of the phoneme duration values across multiple syntheses of the same text.
  • Coefficient of Variation of pitch (CV_f0):
    1. Conceptual Definition: Similar to CV_dur, this metric quantifies the dispersion of F0 (pitch) values, indicating the diversity or variability in prosody and intonation. A higher CV_f0 suggests more diverse and expressive pitch contours.
    2. Mathematical Formula: $ \mathrm{CV_{f0}} = \frac{\sigma_{f0}}{\mu_{f0}} $
    3. Symbol Explanation:
      • $\sigma_{f0}$: The standard deviation of the F0 values across multiple syntheses of the same text.
      • $\mu_{f0}$: The mean of the F0 values across multiple syntheses of the same text.
  • Real-time Factor (RTF):
    1. Conceptual Definition: Measures the computational speed of a TTS model. It is the ratio of the time taken to synthesize an audio segment to the actual duration of that audio segment. An RTF of 1.0 means it takes 1 second to synthesize 1 second of audio. Lower RTF values indicate faster inference speed.
    2. Mathematical Formula: $ \mathrm{RTF} = \frac{\text{Synthesis Time}}{\text{Audio Duration}} $
    3. Symbol Explanation:
      • Synthesis Time: The time taken by the model to generate a given audio segment.
      • Audio Duration: The actual length (in time) of the generated audio segment.
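
Several of the simpler objective metrics above reduce to a few lines of NumPy. The sketch below is illustrative and assumes frame-aligned F0 tracks and phoneme-aligned durations are already available; it is not the evaluation code used in the paper.

```python
import numpy as np

def log_f0_rmse(f0_ref, f0_syn, eps=1e-8):
    """RMSE between log-F0 tracks; assumes the two tracks are already time-aligned and voiced."""
    f0_ref, f0_syn = np.asarray(f0_ref, float), np.asarray(f0_syn, float)
    return float(np.sqrt(np.mean((np.log(f0_ref + eps) - np.log(f0_syn + eps)) ** 2)))

def duration_mad(d_ref, d_syn):
    """Mean absolute deviation between ground-truth and predicted phoneme durations."""
    return float(np.mean(np.abs(np.asarray(d_ref, float) - np.asarray(d_syn, float))))

def coefficient_of_variation(values):
    """CV = std / mean, used for CV_dur and CV_f0 over repeated syntheses of the same text."""
    v = np.asarray(values, float)
    return float(v.std() / v.mean())

def real_time_factor(synthesis_seconds, audio_seconds):
    """RTF = wall-clock synthesis time divided by the duration of the generated audio."""
    return synthesis_seconds / audio_seconds
```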

5.3. Baselines

StyleTTS 2 was compared against several state-of-the-art and widely recognized TTS models:

  • For LJSpeech (Single-Speaker):
    • VITS [3]: A conditional variational autoencoder with adversarial learning for end-to-end speech synthesis, known for high quality.
    • StyleTTS [6]: The predecessor model, a style-based generative model.
    • JETS [32]: Jointly training FastSpeech2 and HiFi-GAN for end-to-end text-to-speech.
    • NaturalSpeech [5]: An end-to-end TTS model that achieved human-level quality on LJSpeech. Samples were obtained directly from the authors for comparison.
  • For VCTK (Multi-Speaker):
    • VITS [3]: A strong multi-speaker baseline.
  • For LibriTTS (Zero-Shot Speaker Adaptation):
    • VITS (from ESPNet Toolkit [63]): A publicly available implementation, used as a baseline for large multi-speaker datasets.

    • YourTTS [60]: A model for zero-shot multi-speaker TTS and voice conversion.

    • StyleTTS + HifiGAN: The predecessor model combined with a high-fidelity vocoder.

    • Vall-E [8]: A neural codec language model for zero-shot TTS, known for its high performance but also for requiring massive training data. Samples were collected from the official demo page for comparison.

      All baseline models, except for the VITS model on LibriTTS (from ESPNet Toolkit), used official checkpoints released by their respective authors. All audio samples from baselines and StyleTTS 2 were resampled to match the baseline models' rates for fair comparison (22.5kHz for LJSpeech and VCTK, 16kHz for LibriTTS).

5.4. Model Training

  • Decoders: iSTFTNet decoder was used for LJSpeech (due to speed and sufficient performance), while HifiGAN decoder was used for VCTK and LibriTTS models.
  • Pre-training Epochs: Acoustic modules were pre-trained for:
    • 100 epochs on LJSpeech.
    • 50 epochs on VCTK.
    • 30 epochs on LibriTTS.
  • Joint Training Epochs: Followed by joint training for:
    • 60 epochs on LJSpeech.
    • 40 epochs on VCTK.
    • 25 epochs on LibriTTS.
  • Optimizer: AdamW [59] with $\beta_1 = 0$, $\beta_2 = 0.99$, and weight decay $\lambda = 10^{-4}$ (see the configuration sketch after this list).
  • Learning Rate: $\gamma = 10^{-4}$ for both pre-training and joint training.
  • Batch Size: 16 samples.
  • Loss Weights: Adopted from [6] to balance all loss terms ($\lambda_{\mathrm{s2s}} = 0.2$, $\lambda_{\mathrm{mono}} = 5$, $\lambda_{\mathrm{dur}} = 1$, $\lambda_{f0} = 0.1$, $\lambda_n = 1$, $\lambda_{\mathrm{ce}} = 1$).
  • Waveform Segmentation: Waveforms were randomly segmented with a maximum length of 3 seconds.
  • SLM Adversarial Training: Both ground truth and generated samples were ensured to be 3 to 6 seconds in duration, matching the fine-tuning requirements of WavLM models [11]. OOD texts (from the training split of LibriTTS) were used for SLM adversarial training, sampled with equal probability as in-distribution texts.
  • Style Diffusion Steps: Randomly sampled from 3 to 5 during training for speed, and set to 5 during inference for quality.
  • Hardware: Training was conducted on four NVIDIA A40 GPUs.
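
A minimal PyTorch sketch of the optimizer configuration listed above; the linear layer stands in as a placeholder for the actual StyleTTS 2 modules.

```python
import torch

model = torch.nn.Linear(10, 10)  # placeholder for the generator or discriminator parameters

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-4,            # learning rate gamma, used for both pre-training and joint training
    betas=(0.0, 0.99),  # beta_1 = 0, beta_2 = 0.99
    weight_decay=1e-4,  # weight decay lambda
)
```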

5.5. Subjective Evaluation Procedures (from Appendix E)

The evaluation strictly followed guidelines for human-like TTS research [43, 5]:

  1. Rater Selection: Native English speakers residing in the United States were selected via Amazon Mechanical Turk (MTurk) filters (HIT Approval Rate > 95%, Location: US, Number of HITs Approved > 50). Self-identified native speakers were verified by residential IP addresses.
  2. Attention Checks:
    • MOS: Participants whose average score for hidden ground truth audios did not rank in the top three among all five models were excluded.
    • CMOS: Raters were disqualified if the sign of their score (A better/worse than B) differed for over half of the sample set (10 samples). Six raters were eliminated across all experiments.
  3. Clear Objectives: Explicit definitions of "naturalness" and "similarity" were provided in the surveys (e.g., naturalness includes sounding like a native American English speaker from a human source; similarity focuses on identifying the voice beyond distortion, including accent and habits).
  4. Model Implementation: Official model checkpoints were used where available. For NaturalSpeech, 20 demo samples were obtained from the author. For Vall-E, 30 samples and 3-second prompts were collected from the official demo page.
  5. MUSHRA-based Approach: MOS evaluations used a MUSHRA-based approach, presenting paired samples from all models.
  6. Statistical Significance: For CMOS experiments, 20 raters evaluated each sample (after exclusions). For NaturalSpeech and Vall-E (where limited samples were available), the number of raters was doubled to compensate.

6. Results & Analysis

6.1. Core Results Analysis

6.1.1. Human-Level Performance and Comparison

The following are the results from Table 1 of the original paper:

Model Dataset CMOS-N (p-value) CMOS-S (p-value)
Ground Truth LJSpeech +0.28 (p = 0.021)
NaturalSpeech LJSpeech +1.07 (p < 10−6)
Ground Truth VCTK −0.02 (p = 0.628) +0.30 (p = 0.081)
VITS VCTK +0.45 (p = 0.009) +0.43 (p = 0.032)
Vall-E LibriSpeech (zero-shot) +0.67 (p < 10−3) −0.47 (p < 10−3)

Table 1 presents the Comparative Mean Opinion Scores (CMOS) for StyleTTS 2 relative to other models, along with pp-values from the Wilcoxon test. Positive scores indicate StyleTTS 2 performed better.

  • LJSpeech Dataset:
    • StyleTTS 2 significantly outperformed NaturalSpeech with a CMOS-N of $+1.07$ ($p < 10^{-6}$), setting a new state-of-the-art for naturalness on this dataset.
    • Surprisingly, StyleTTS 2 was preferred over human ground truth (GT) recordings with a CMOS-N of $+0.28$ ($p = 0.021$). The authors hypothesize this preference may be due to artifacts in the LJSpeech dataset, which consists of fragmented audiobook passages that can disrupt narrative continuity. This might make the original narration sound less "natural" when heard in isolation, compared to the potentially more consistent and optimized output of StyleTTS 2.
  • VCTK Dataset:
    • StyleTTS 2 achieved human-level performance, scoring comparably to the ground truth with a CMOS-N of $-0.02$ ($p = 0.628$), indicating no statistically significant difference. This supports the hypothesis that the LJSpeech preference might be dataset-specific, as VCTK lacks narrative context.
    • In terms of similarity (CMOS-S), StyleTTS 2 achieved $+0.30$ ($p = 0.081$) compared to ground truth, suggesting its samples were perceived as slightly more similar to the reference speaker, indicating effective use of reference audio for style diffusion.
    • StyleTTS 2 significantly outperformed VITS on VCTK in both naturalness (CMOS-N $= +0.45$, $p = 0.009$) and similarity (CMOS-S $= +0.43$, $p = 0.032$).
  • LibriTTS Dataset (Zero-Shot Adaptation):
    • StyleTTS 2 surpassed Vall-E in naturalness with a CMOS-N of $+0.67$ ($p < 10^{-3}$).
    • However, it fell slightly short in similarity, with a CMOS-S of $-0.47$ ($p < 10^{-3}$) compared to Vall-E.
    • A critical finding here is that StyleTTS 2 achieved these results using only 245 hours of training data, which is approximately 250 times less than Vall-E's 60,000 hours, making it a highly data-efficient alternative.

6.1.2. MOS Results and OOD Robustness

The following are the results from Table 2 of the original paper:

Model MOSID (CI) MOSOOD (CI)
Ground Truth 3.81 (± 0.09) 3.70 (± 0.11)
StyleTTS 2 3.83 (± 0.08) 3.87 (± 0.08)
JETS 3.57 (± 0.09) 3.21 (± 0.12)
VITS 3.34 (± 0.10) 3.21 (± 0.11)
StyleTTS + HifiGAN 3.35 (± 0.10) 3.32 (± 0.12)

Table 2 compares MOS-N for in-distribution (MOS_ID) and out-of-distribution (MOS_OOD) texts on LJSpeech. CI represents the 95% confidence interval.

  • StyleTTS 2 achieved the highest MOS_ID (3.83), surpassing all baseline models and even the ground truth (3.81).

  • Crucially, StyleTTS 2 showed no degradation in quality for OOD texts, scoring 3.87, which is even slightly higher than its MOS_ID. This demonstrates its strong generalization ability and robustness.

  • In contrast, all other models, including ground truth, showed some degree of quality degradation for OOD texts, corroborating previous findings on this challenge [7].

    The following are the results from Table 3 of the original paper:

    Model MOS-N (CI) MOS-S (CI)
    Ground Truth 4.60 (± 0.09) 4.35 (± 0.10)
    StyleTTS 2 4.15 (± 0.11) 4.03 (± 0.11)
    YourTTS 2.35 (± 0.07) 2.42 (± 0.09)
    VITS 3.69 (± 0.12) 3.54 (± 0.13)
    StyleTTS + HiFi-GAN 3.91 (± 0.11) 4.01 (± 0.10)

Table 3 presents MOS-N and MOS-S on the test-clean subset of LibriTTS for zero-shot speaker adaptation.

  • StyleTTS 2 achieved the highest MOS-N (4.15) and MOS-S (4.03) among all publicly available baseline models.
  • While StyleTTS 2's MOS scores are lower than ground truth, the large difference in training data compared to Vall-E (demonstrated in CMOS results) highlights its data efficiency.
  • The difference in MOS-S between StyleTTS + HiFi-GAN and StyleTTS 2 was not statistically significant, suggesting room for improvement in speaker similarity for the proposed model.

6.2. Style Diffusion

6.2.1. Style Disentanglement and Emotion Generation

The following figure (Figure 2 from the original paper) shows t-SNE visualizations of style vectors sampled via style diffusion:

Figure 2: t-SNE visualization of style vectors sampled via style diffusion from texts in five emotions, showing that emotions are properly separated for seen and unseen speakers. (a) Clusters of emotion from styles sampled by the LJSpeech model. (b) Distinct clusters of styles sampled from 5 unseen speakers by the LibriTTS model. (c) Loose clusters of emotions from Speaker 1 in (b).

Figure 2 illustrates the effectiveness of the style diffusion process in disentangling and generating diverse styles.

  • Emotion Separation (Figure 2a): t-SNE visualizations of style vectors sampled from the LJSpeech model, using texts generated by GPT-4 across five emotions (without explicit emotion labels during training), show distinct clusters for each emotion. This demonstrates StyleTTS 2's ability to synthesize expressive speech with varied emotions solely from text sentiment.

  • Speaker Disentanglement (Figure 2b): For the LibriTTS model, when sampling styles for five unseen speakers (each from a 3-second reference audio), distinct clusters formed for each speaker. This indicates StyleTTS 2 can capture a wide stylistic diversity from a minimal reference.

  • Emotion Manipulation for Unseen Speakers (Figure 2c): A closer look at the first speaker from Figure 2b reveals visible emotion-based clusters, despite some overlaps. This suggests the model can manipulate the emotional tone of an unseen speaker independent of the tone in the reference audio. However, these overlaps also hint at the challenge of fully disentangling texts from speakers in zero-shot settings, potentially explaining why the LibriTTS model performs slightly less well than the LJSpeech model.
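
The clustering analysis above can be reproduced in outline with scikit-learn. In the sketch below, `style_vectors` and `emotion_labels` are placeholders standing in for styles sampled via style diffusion and the emotions of the corresponding input texts; they are not the paper's actual data.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Placeholders: 256-d style vectors sampled via style diffusion and the emotion of each input text.
style_vectors = np.random.randn(500, 256)
emotion_labels = np.random.choice(["angry", "sad", "happy", "surprised", "neutral"], size=500)

# Project the 256-d style space to 2-D for visualization.
embedded = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(style_vectors)

for emotion in np.unique(emotion_labels):
    mask = emotion_labels == emotion
    plt.scatter(embedded[mask, 0], embedded[mask, 1], s=8, label=emotion)
plt.legend()
plt.title("t-SNE of sampled style vectors by text emotion")
plt.show()
```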

    The following figure (Figure 3 from the original paper) displays histograms and kernel density estimation of the mean F0 and energy values of speech:

Figure 3: Histograms and kernel density estimation of the mean F0 and energy values of speech, synthesized with texts in five different emotions. The blue color ("Ground Truth") denotes the distributions of the ground truth samples in the test set. StyleTTS 2 shows distinct distributions for different emotions and produces samples that cover the entire range of the ground truth distributions.

Figure 3 presents histograms and kernel density estimations of the mean F0 and energy values of speech synthesized with texts in five different emotions, compared to ground truth.

  • StyleTTS 2 shows distinct F0 and energy characteristics for different emotions, particularly for anger and surprise, which deviate from the ground truth average (shown in blue in the figure). This confirms its capacity to generate emotive speech.
  • In contrast, VITS (not shown in this specific figure, but discussed in text) exhibits insensitivity to emotional variance, and JETS (also not shown, but discussed) fails to cover the full distribution of F0 and energy, clustering around the mode. StyleTTS 2 spans across both F0 and energy distributions, approximating the ground truth range, indicating better expressiveness.
  • StyleTTS 2 also demonstrates slightly better mode coverage than VITS at the right tail of the energy mean distribution, potentially due to its diffusion-based style modeling.

6.2.2. Speech Diversity and Inference Speed

The following are the results from Table 4 of the original paper:

Model CVdur ↑ CVf0 ↑ RTF ↓
StyleTTS 2 0.0321 0.6962 0.0185
VITS 0.0214 0.5976 0.0599
FastDiff 0.0295 0.6490 0.0769
ProDiff 2e-16 0.5898 0.1454

Table 4 compares speech diversity metrics (CV_dur, CV_f0) and real-time factor (RTF). Higher CV_dur and CV_f0 are better (↑), while lower RTF is better (↓).

  • Diversity: StyleTTS 2 yields the highest CV_dur (0.0321) and CV_f0 (0.6962) among the compared models, indicating its superior potential for generating diverse speech. This is a direct benefit of the style diffusion mechanism.
  • Inference Speed: Despite being diffusion-based for style sampling, StyleTTS 2 achieves a remarkably fast RTF of 0.0185 with 5 diffusion iterations and the iSTFTNet decoder. This makes it faster than VITS, FastDiff, and ProDiff (two of the fastest diffusion-based TTS models), demonstrating the efficiency of diffusing a latent style vector rather than the entire speech.

6.2.3. Effects of Diffusion Steps (from Appendix B.2)

The following are the results from Table 7 of the original paper:

Step MCD ↓ MCD-SL ↓ F0 RMSE ↓ WER ↓ RTF ↓ CVdur ↑ CVf0 ↑
4 4.90 5.34 0.650 6.72% 0.0179 0.0207 0.5473
8 4.93 5.33 0.674 6.53% 0.0202 0.0466 0.7073
16 4.92 5.34 0.665 6.44% 0.0252 0.0505 0.7244
32 4.92 5.32 0.663 6.56% 0.0355 0.0463 0.7345
64 4.91 5.34 0.654 6.67% 0.0557 0.0447 0.7245
128 4.92 5.33 0.656 6.73% 0.0963 0.0447 0.7256

Table 7 shows the impact of varying diffusion steps on various objective metrics, RTF, and diversity measures.

  • Sample Quality: Negligible disparities in MCD, MCD-SL, F0 RMSE, and WER across diffusion steps from 4 to 128. Satisfactory quality samples are achievable with as few as three steps.
  • Diversity: An incremental rise in speech diversity (CV_dur, CV_f0) is observed up to around 16 steps, then plateauing. A slight decrease in diversity occurs at very high step counts (e.g., 64, 128), potentially because the ancestral solver converges to a fixed set of solutions despite noise.
  • Computational Speed (RTF): RTF increases with more diffusion steps. While 16 steps increase RTF by 30% compared to 4 steps, it still outperforms VITS by twice the speed, making it suitable for real-time applications.
  • Conclusion: Optimal balance of quality, diversity, and speed is achieved around 16 diffusion steps. For training efficiency, steps between 3 and 5 are randomly sampled.

6.3. Ablation Study

The following are the results from Table 5 of the original paper:

Model CMOS-N
w/o style diffusion −0.46
w/o differentiable upsampler -0.21
w/o SLM adversarial training −0.32
w/o prosodic style encoder -0.35
w/o OOD texts -0.15

Table 5 details the CMOS-N relative to the StyleTTS 2 baseline on OOD texts, highlighting the importance of each proposed component.

  • w/o style diffusion: Replacing style vectors from style diffusion with randomly encoded ones (as in original StyleTTS) leads to a CMOS-N of -0.46. This underscores the critical role of text-dependent style diffusion in achieving human-level TTS and diverse styles.
  • w/o differentiable upsampler: Training without the novel differentiable upsampler results in a CMOS-N of -0.21. This validates its key role in stable E2E training and natural speech synthesis.
  • w/o SLM adversarial training: Removing the SLM discriminator yields a CMOS-N of -0.32. This confirms the SLM discriminator's importance for enhancing naturalness, especially for OOD texts.
  • w/o prosodic style encoder: Excluding the prosodic style encoder leads to a CMOS-N of -0.35, demonstrating its effectiveness in decoupling and improving prosody.
  • w/o OOD texts: Removing OOD texts from adversarial training results in a CMOS-N of -0.15, proving its efficacy in improving OOD speech synthesis.

6.3.1. Objective Ablation Study Results (from Appendix A.3)

The following are the results from Table 6 of the original paper:

Model MCD MCD-SL F0 RMSE DUR MAD WER CMOS
Proposed model 4.93 5.34 0.651 0.521 6.50% 0
w/o style diffusion 8.30 9.33 0.899 0.634 8.77% −0.46
w/o SLM adversarial training 4.95 5.40 0.692 0.513 6.52% −0.32
w/o prosodic style encoder 5.04 5.42 0.663 0.543 6.92% −0.35
w/o differentiable upsampler 4.94 5.34 0.880 0.525 6.54% -0.21
w/o OOD texts 4.93 5.45 0.690 0.516 6.58% -0.15

Table 6 provides objective metrics (MCD, MCD-SL, F0 RMSE, DUR MAD, WER) for the ablation study.

  • w/o style diffusion: This variant severely degrades all objective metrics (MCD, MCD-SL, F0 RMSE, DUR MAD, WER), corroborating its paramount importance as indicated by the large negative CMOS.
  • w/o SLM adversarial training: Shows a slight decline in MCD-SL and F0 RMSE, but no impact on WER. Interestingly, it results in the lowest duration error (DUR MAD), implying SLM discriminators might cause minor underfitting on in-distribution texts. However, subjective CMOS for OOD texts clearly shows its benefit.
  • w/o prosodic style encoder: Affects all metric scores, reinforcing its contribution to the model's overall performance.
  • w/o differentiable upsampler: Primarily increases F0 RMSE, while other metrics remain relatively unaffected. This highlights its role in pitch contour accuracy and training stability.
  • w/o OOD texts: Only F0 RMSE is affected in objective evaluation, but the subjective CMOS difference (Table 5) is notable, demonstrating its role in OOD generalization.

6.3.2. SLM Discriminator Layer-Wise Analysis (from Appendix D.1)

The following figure (Figure 7 from the original paper) shows layer-wise input weight magnitude to the SLM discriminators across different datasets:

Figure 7: Layer-wise input weight magnitude to the SLM discriminators across different datasets. The layer importance shows a divergent pattern for the VCTK model relative to the LJSpeech and LibriTTS models, showcasing the impact of contextual absence on the SLM discriminators.

Figure 7 presents a layer-wise feature importance analysis for the SLM discriminator by normalizing the weights of the input linear projection layer into the convolutional discriminative head.

  • LJSpeech and LibriTTS Models: Initial layers (1 and 2) and middle layers (6 and 7) of WavLM showed the highest importance. Initial layers primarily encode acoustic information (energy, pitch, SNR), while middle layers encode semantic aspects (word identity, meaning) [55]. This suggests the SLM discriminator learns to fuse both acoustic and semantic information to discern paralinguistic attributes (prosody, pauses, intonations, emotions) crucial for distinguishing real from synthesized speech.
  • VCTK Model: The SLM discriminator for the VCTK dataset showed no distinct layer preference. This is attributed to VCTK's limited contextual information (standard paragraphs without specific context or emotions) compared to LJSpeech and LibriTTS (audiobook narrations). This contextual shortage could explain why StyleTTS 2's performance increase over VITS on VCTK is marginal compared to LJSpeech and LibriTTS, as the advantages of style diffusion and SLM adversarial training are less pronounced in datasets with restricted expressiveness.
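
This layer-importance analysis can be approximated by inspecting the head's input linear projection. The sketch below assumes the 13 × 768 → 256 projection from Table 9 and a layer-major flattening of the stacked WavLM states; the exact normalization used by the authors may differ.

```python
import torch

def layerwise_input_weight_magnitude(proj: torch.nn.Linear, n_layers: int = 13, dim: int = 768):
    """Normalized per-WavLM-layer magnitude of the head's input projection weights."""
    # proj.weight has shape (256, n_layers * dim); regroup its columns by WavLM layer.
    w = proj.weight.detach().reshape(proj.out_features, n_layers, dim)
    per_layer = w.abs().mean(dim=(0, 2))     # mean |weight| attributed to each layer
    return per_layer / per_layer.sum()       # normalize so the magnitudes sum to 1

proj = torch.nn.Linear(13 * 768, 256)        # placeholder for the trained input projection
print(layerwise_input_weight_magnitude(proj))
```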

6.3.3. Training Stability (from Appendix D.2 and C.2)

The following figure (Figure 6 from the original paper) illustrates the effects of $\sigma$ on MCD and max gradient norm:

Figure 6: Effects of $\sigma$ on MCD and max gradient norm. Our choice of $\sigma = 1.5$ is marked with a star symbol. (a) MCD between samples synthesized with differentiable and non-differentiable upsampling over different $\sigma$. (b) The maximum norm of gradients from the SLM discriminator to the duration predictor over an epoch of training with different $\sigma$.

Figure 6 examines the impact of the $\sigma$ hyperparameter in the differentiable upsampler's Gaussian kernel on training stability and sample quality.

  • Gradient Norm: The maximum gradient norm from the SLM discriminator to the duration predictor can be very high, potentially destabilizing training, especially for the prosodic text encoder (BERT model).
  • $\sigma$ Optimization: The chosen $\sigma = 1.5$ minimizes both MCD (between differentiable and non-differentiable upsampling) and max gradient norm across a wide range of $\sigma$ values, making it an optimal choice. This value aligns with typical phoneme durations.
  • Gradient Clipping and Scaling: To mitigate instability, gradient clipping (scaling the gradient norm by 0.2 when it surpasses 20) and gradient scaling (scaling gradients to the last projection and LSTM layers of the duration predictor by 0.01) are implemented. These measures ensure stable training across different datasets.
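
For intuition, the sketch below shows one way to build a Gaussian-kernel differentiable upsampler governed by the same $\sigma$ parameter. It is an illustrative approximation, not the paper's exact formulation (which operates on the duration probabilities $q[k,i]$); function and variable names are assumptions.

```python
import torch

def gaussian_upsample(h: torch.Tensor, durations: torch.Tensor, sigma: float = 1.5) -> torch.Tensor:
    """Expand phoneme-level features to frame level with a soft, differentiable alignment.

    h         : (N, C) phoneme-level features.
    durations : (N,)   predicted (possibly non-integer) durations in frames.
    Returns     (T, C) frame-level features, with T = round(sum of durations).
    """
    ends = torch.cumsum(durations, dim=0)        # end frame of each phoneme
    centers = ends - durations / 2               # center frame of each phoneme
    T = int(torch.round(ends[-1]).item())
    frames = torch.arange(T, dtype=h.dtype, device=h.device).unsqueeze(1)   # (T, 1)
    # Soft alignment: each frame attends to phonemes via a Gaussian kernel around their centers.
    logits = -((frames - centers.unsqueeze(0)) ** 2) / (2 * sigma ** 2)     # (T, N)
    weights = torch.softmax(logits, dim=1)                                  # normalize over phonemes
    return weights @ h                                                      # (T, C)
```

Because the alignment weights are smooth functions of the predicted durations, gradients from downstream losses (including the SLM discriminator) flow back into the duration predictor, which is why the gradient clipping and scaling measures above are needed.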

6.4. Feedback Analysis from Survey Participants (from Appendix A.1)

Feedback from CMOS experiment participants indicated that StyleTTS 2 achieved human-like quality, with many noting that "differences were negligible" or "sometimes hard to notice." Some comments suggested that subtle nuances in intonation and word emphasis, or "hyper expressiveness" which sometimes felt "too dramatic," were the main distinctions. The preference for StyleTTS 2 over ground truth on LJSpeech was further attributed to the ground truth's perceived unnaturalness due to fragmented narrative continuity in the dataset's isolated audiobook clips. This highlights the need for future research to incorporate context-aware long-form generation and improve evaluation fairness.

7. Conclusion & Reflections

7.1. Conclusion Summary

In this study, the authors presented StyleTTS 2, a novel text-to-speech (TTS) model that achieves human-level performance by integrating style diffusion and speech language model (SLM) discriminators. Key innovations include modeling speech styles as a latent random variable sampled efficiently through diffusion models, enabling diverse and reference-free style generation. Furthermore, StyleTTS 2 leverages large pre-trained WavLM models as discriminators within an end-to-end (E2E) adversarial training framework, facilitated by a novel differentiable duration modeling.

The model demonstrates significant advancements:

  • It surpasses human recordings on the single-speaker LJSpeech dataset and matches human-level performance on the multi-speaker VCTK dataset in naturalness, as judged by native English speakers.

  • It shows strong generalization and robustness to out-of-distribution (OOD) texts.

  • For zero-shot speaker adaptation on the LibriTTS dataset, StyleTTS 2 outperforms previous publicly available models and achieves comparable naturalness to Vall-E with significantly less training data (250 times less).

  • The innovative style diffusion method ensures expressive and diverse speech generation while maintaining fast inference time.

    StyleTTS 2 sets a new benchmark for human-level TTS synthesis across both single and multi-speaker datasets, showcasing the potential of its combined architectural and training methodologies.

7.2. Limitations & Future Work

The authors acknowledge several limitations and areas for future research:

  • Large-Scale Dataset Handling: While StyleTTS 2 performs well, its results indicate room for improvement when handling extremely large and diverse datasets like LibriTTS, which contain thousands of speakers, varied acoustic environments, accents, and speaking styles. Scaling to such complexity remains a challenge.

  • Speaker Similarity in Zero-Shot Adaptation: Although StyleTTS 2 excels in naturalness for zero-shot adaptation, the speaker similarity to the target voice could benefit from further enhancements, as indicated by its slightly lower CMOS-S compared to Vall-E.

  • Evaluation Methods for Context-Dependent Naturalness: The observation that human evaluators preferred StyleTTS 2 over ground truth on LJSpeech (potentially due to dataset artifacts like fragmented audiobook passages) highlights a limitation in current evaluation methodologies. Future research should aim to improve evaluation methods to address these context dependencies and develop models that generate more natural and human-like speech with longer context dependencies.

    The authors also address the ethical implications of advanced zero-shot speaker adaptation, acknowledging its potential for misuse and deception (e.g., theft, fraud, harassment, impersonation). To manage this, they commit to:

  • Requiring users to adhere to a code of conduct, informing listeners that speech is synthesized or obtaining informed consent.

  • Mandating consent from reference speakers for voice adaptation.

  • Making the source code publicly available to facilitate research in speaker fraud and impersonation detection.

7.3. Personal Insights & Critique

This paper marks a significant leap towards truly human-level TTS, and its ability to surpass human recordings on LJSpeech is a notable achievement, even with the caveats regarding dataset artifacts. The integration of style diffusion for reference-free, diverse style generation with efficient GAN-based synthesis is a clever solution to the long-standing trade-off between diversity (often from diffusion) and speed (often from GANs). This hybrid approach could inspire similar solutions in other generative tasks where both high quality and efficiency are paramount.

The novel use of large SLMs like WavLM directly as discriminators is particularly insightful. Instead of attempting to map SLM features to a generative space (which can be complex and lossy), letting the SLM judge the generated speech in an adversarial setting provides a powerful, human-perception-mimicking gradient signal that implicitly guides the generator towards highly natural outputs. This method of knowledge transfer is both elegant and effective and could be applied to other domains where robust, pre-trained discriminative models exist (e.g., image generation with large vision models).

The differentiable duration modeling is a crucial enabling technology, resolving a long-standing challenge in E2E TTS adversarial training. Its non-parametric, stable nature is a significant improvement over prior, often unstable, attention-based or fixed-width Gaussian approaches.

A key takeaway is the impressive data efficiency of StyleTTS 2 for zero-shot adaptation compared to Vall-E. Achieving comparable naturalness with 250 times less data indicates that intelligent model design and training strategies can sometimes mitigate the need for excessively large pre-training corpora, making high-quality TTS more accessible.

Potential Issues/Areas for Improvement:

  • Generalization to Emotionally Nuanced Long-Form Speech: While the t-SNE plots show good emotional clustering, the noted overlaps and the VCTK model's lack of distinct layer preference for the SLM discriminator suggest that fully disentangling emotion/prosody from speaker identity, especially in complex, long-form narratives with subtle emotional shifts, remains a challenge. Future work could focus on more granular control over emotional intensity and dynamic prosody over longer contexts.

  • Evaluation Protocol Nuance: The finding that StyleTTS 2 can be preferred over ground truth due to dataset artifacts on LJSpeech highlights a critical point: current evaluation metrics, even advanced ones like CMOS, might not fully capture "naturalness" in complex human communicative contexts. Future evaluations should ideally involve longer, contextually rich samples or tasks that go beyond simple naturalness ratings to assess coherence, emotional arc, and engagement over extended dialogues or narratives. This calls for new benchmarks and human evaluation paradigms.

  • Computational Cost of SLM Discriminator: While WavLM is fixed, its size and computational requirements as a discriminator (even with a CNN head) might still be substantial. Exploring more lightweight yet effective SLM-based discriminators or distillation techniques could be an avenue for more resource-constrained applications.

  • Mitigation Strategies for Misuse: The explicit mention of ethical concerns and mitigation strategies is commendable and vital. However, the effectiveness of codes of conduct and consent mechanisms in preventing malicious deepfake creation in a widely accessible open-source model remains an open question. Active research into robust deepfake detection and watermarking for synthesized speech will be crucial alongside such preventative measures.

    Overall, StyleTTS 2 is a landmark paper that not only pushes the boundaries of TTS quality to human parity (and beyond, in some specific contexts) but also offers elegant and effective solutions to long-standing challenges in style control, efficiency, and knowledge transfer from large foundation models. Its methodological innovations will likely influence future research in generative AI for speech and beyond.
