
Tacotron: Towards End-to-End Speech Synthesis

Published: 03/30/2017
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

Tacotron is an end-to-end text-to-speech model that synthesizes speech directly from characters, simplifying the complex multi-stage pipelines of traditional TTS systems. Trained from scratch, it achieves a mean opinion score of 3.82, outperforming a production parametric system in naturalness while generating speech substantially faster than sample-level autoregressive methods.

Abstract

A text-to-speech synthesis system typically consists of multiple stages, such as a text analysis frontend, an acoustic model and an audio synthesis module. Building these components often requires extensive domain expertise and may contain brittle design choices. In this paper, we present Tacotron, an end-to-end generative text-to-speech model that synthesizes speech directly from characters. Given <text, audio> pairs, the model can be trained completely from scratch with random initialization. We present several key techniques to make the sequence-to-sequence framework perform well for this challenging task. Tacotron achieves a 3.82 subjective 5-scale mean opinion score on US English, outperforming a production parametric system in terms of naturalness. In addition, since Tacotron generates speech at the frame level, it's substantially faster than sample-level autoregressive methods.

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

Tacotron: Towards End-to-End Speech Synthesis

1.2. Authors

The paper is authored by a large team from Google, Inc.: Yuxuan Wang, RJ Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J. Weiss, Navdeep Jaitly, Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy Bengio, Quoc Le, Yannis Agiomyrgiannakis, Rob Clark, and Rif A. Saurous. The collaboration of a large team from a major industrial research lab like Google indicates a project with significant computational resources and a focus on building robust, production-quality systems.

1.3. Journal/Conference

This paper was published as a preprint on arXiv. The provided version (v2) was submitted on March 29, 2017. While arXiv is not a peer-reviewed venue, it is a standard platform for rapidly disseminating research in fields like machine learning. This work was subsequently presented at Interspeech 2017, a top-tier international conference for speech processing.

1.4. Publication Year

2017

1.5. Abstract

The abstract outlines the limitations of traditional text-to-speech (TTS) systems, which are composed of multiple complex stages requiring significant domain expertise. As a solution, the paper introduces Tacotron, an end-to-end generative TTS model that synthesizes speech directly from character inputs. The model is trainable from scratch on <text, audio> pairs. The authors highlight key techniques that enable a sequence-to-sequence framework to function effectively for this task. Tacotron achieved a Mean Opinion Score (MOS) of 3.82 on US English, surpassing the naturalness of a production parametric system. Additionally, because Tacotron generates spectrogram frames rather than individual audio samples, it is significantly faster than sample-level autoregressive models.

2. Executive Summary

2.1. Background & Motivation

  • Core Problem: Traditional Text-to-Speech (TTS) systems are highly complex, multi-stage pipelines. A typical system involves a text analysis frontend (to convert text to linguistic features), a duration model (to predict the timing of each phoneme), an acoustic model (to predict acoustic features like spectrograms), and a vocoder (to synthesize audio from these features).
  • Importance & Challenges: This modular design is laborious, requires deep domain expertise in linguistics and signal processing, and is prone to compounding errors, where mistakes in an early stage degrade the quality of all subsequent stages. The design choices can be brittle, and adapting the system to new speakers, languages, or styles is a substantial engineering effort.
  • Innovative Idea: The paper's core idea is to replace this entire complex pipeline with a single, integrated neural network that learns the entire TTS process from end to end. This model, named Tacotron, takes raw text (as a sequence of characters) as input and directly generates a spectrogram, which can then be converted to audio. This approach eliminates the need for manual feature engineering, phoneme-level alignments, and multiple independently trained components.

2.2. Main Contributions / Findings

The primary contributions of this paper are:

  1. Proposal of Tacotron: The introduction of a novel end-to-end generative model for TTS based on a sequence-to-sequence (seq2seq) architecture with attention. It is one of the first successful models to generate speech directly from character inputs.
  2. Architectural Innovations for TTS: The paper presents several key techniques that were crucial for making the seq2seq framework effective for the challenging task of speech synthesis:
    • The CBHG module, a powerful feature extractor combining convolutions, highway networks, and a bidirectional GRU, used in both the encoder and a post-processing network.
    • A reduction factor r, which allows the decoder to predict r spectrogram frames at each decoder step. This significantly speeds up training and inference and, more importantly, helps the model learn a stable alignment between text and speech.
    • A post-processing network that refines the predicted spectrogram, improving audio quality by correcting errors using a global view of the generated sequence.
  3. State-of-the-Art Performance: Tacotron achieved a Mean Opinion Score (MOS) of 3.82, which was shown to be more natural than a well-established production parametric TTS system. This demonstrated that an end-to-end approach could achieve competitive quality.
  4. Efficiency: By generating speech at the frame level (e.g., one frame per 12.5 ms of audio) instead of the sample level (one sample per ~41 µs at 24kHz), Tacotron is substantially faster at inference time than contemporary high-quality models like WaveNet.
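
To put the efficiency claim in rough numbers (a back-of-the-envelope estimate from the figures above, not a benchmark from the paper): at a 12.5 ms frame shift, one second of audio corresponds to 1 / 0.0125 = 80 spectrogram frames, or 40 decoder steps with r = 2, whereas a 24 kHz sample-level autoregressive model must take 24,000 sequential steps for the same second of audio, i.e., on the order of 600 times more autoregressive steps.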

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

  • Text-to-Speech (TTS): The process of artificially producing human speech from text. Traditional methods include concatenative TTS, which stitches together pre-recorded speech units, and statistical parametric TTS (SPSS), which uses statistical models (like Hidden Markov Models or DNNs) to generate acoustic parameters that are then synthesized into audio by a vocoder. Tacotron represents a paradigm shift towards end-to-end neural TTS.
  • Sequence-to-Sequence (seq2seq) Models: An architecture, popularized by Sutskever et al. (2014) for machine translation, consisting of two main parts: an Encoder and a Decoder.
    • The Encoder, typically a Recurrent Neural Network (RNN), reads an input sequence (e.g., characters in a sentence) and compresses it into a fixed-size vector representation, often called the "context vector."
    • The Decoder, also an RNN, takes the context vector and generates an output sequence (e.g., spectrogram frames) one step at a time.
  • Attention Mechanism: A critical enhancement to the seq2seq model, proposed by Bahdanau et al. (2014). Instead of relying on a single fixed-size context vector, the attention mechanism allows the decoder to "look back" at the entire sequence of encoder outputs at each decoding step. It computes a set of attention weights that determine which parts of the input sequence are most relevant for generating the current output step, which is crucial for long sequences. The context vector $c_t$ for decoder step $t$ is calculated as a weighted sum of the encoder hidden states $h_i$: $ c_t = \sum_{i=1}^{T_x} \alpha_{ti} h_i $. The weight $\alpha_{ti}$ for each encoder state $h_i$ is computed by a softmax over alignment scores $e_{ti}$: $ \alpha_{ti} = \frac{\exp(e_{ti})}{\sum_{k=1}^{T_x} \exp(e_{tk})} $. The alignment score $e_{ti}$ measures how well the input around position $i$ and the output at position $t$ match; it is computed from the previous decoder hidden state $s_{t-1}$ and the encoder hidden state $h_i$: $ e_{ti} = v_a^T \tanh(W_a s_{t-1} + U_a h_i) $, where $v_a$, $W_a$, and $U_a$ are learnable weight matrices. (A minimal NumPy sketch of one attention step follows this list.)
  • Recurrent Neural Networks (RNNs): Neural networks designed to process sequential data. Variants like the Gated Recurrent Unit (GRU) and Long Short-Term Memory (LSTM) use gating mechanisms to control the flow of information, allowing them to learn long-range dependencies more effectively than simple RNNs. Tacotron makes extensive use of GRUs.
  • Spectrogram: A visual representation of sound. It plots the intensity (loudness) of different frequencies over time. A linear-scale spectrogram has a frequency axis that is linearly spaced, while a mel-scale spectrogram uses the mel scale, which is a perceptual scale of pitches that listeners judge to be equal in distance from one another. The mel scale is more aligned with human hearing, making mel-spectrograms a more compact and perceptually relevant representation.
  • Griffin-Lim Algorithm: An iterative algorithm for reconstructing a time-domain signal (waveform) from its short-time Fourier transform (STFT) magnitude, which is what a spectrogram represents. The STFT also has a phase component, which is lost in a magnitude spectrogram. Griffin-Lim works by repeatedly estimating the phase, transforming back and forth between the time and frequency domains, and enforcing consistency with the known magnitude at each step.
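
To make the attention equations above concrete, the following is a minimal NumPy sketch of a single additive-attention step (dimension names and sizes are illustrative assumptions, not values from the paper):

```python
import numpy as np

def attention_step(s_prev, H, W_a, U_a, v_a):
    """One step of additive (Bahdanau) attention.

    s_prev : (d_s,)      previous decoder state s_{t-1}
    H      : (T_x, d_h)  encoder hidden states h_1..h_{T_x}
    W_a    : (d_a, d_s)  learned projection of the decoder state
    U_a    : (d_a, d_h)  learned projection of the encoder states
    v_a    : (d_a,)      learned scoring vector
    Returns the context vector c_t and the attention weights alpha_t.
    """
    # Alignment scores e_{ti} = v_a^T tanh(W_a s_{t-1} + U_a h_i)
    e = np.tanh(W_a @ s_prev + H @ U_a.T) @ v_a          # (T_x,)
    # Softmax over encoder positions gives alpha_{ti}
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()
    # Context vector c_t = sum_i alpha_{ti} h_i
    c_t = alpha @ H                                       # (d_h,)
    return c_t, alpha

# Tiny usage example with random weights and hypothetical dimensions.
rng = np.random.default_rng(0)
T_x, d_h, d_s, d_a = 7, 16, 16, 8
c, a = attention_step(rng.normal(size=d_s),
                      rng.normal(size=(T_x, d_h)),
                      rng.normal(size=(d_a, d_s)),
                      rng.normal(size=(d_a, d_h)),
                      rng.normal(size=d_a))
print(c.shape, a.sum())  # (16,) 1.0
```

Stacking the attention weights over all decoder steps gives exactly the alignment matrices visualized later in this analysis (Figure 3 of the paper).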

3.2. Previous Works

  • WaveNet (van den Oord et al., 2016): A powerful generative model that produces raw audio waveforms one sample at a time. It can generate exceptionally high-quality, natural-sounding speech. However, its sample-level autoregressive nature makes inference extremely slow. Furthermore, the WaveNet described for TTS was not fully end-to-end; it required a separate TTS frontend to provide it with detailed linguistic and phonetic features to condition its generation. Tacotron replaces this frontend.
  • DeepVoice (Arik et al., 2017): A system that replaces each component of a traditional SPSS pipeline (e.g., phoneme conversion, duration prediction, frequency prediction) with a dedicated, independently trained neural network. While fully neural, it is not an integrated end-to-end system like Tacotron.
  • Wang et al. (2016): This work was an early attempt at end-to-end TTS using a seq2seq model. However, it had several limitations that Tacotron overcomes: it required a pre-trained Hidden Markov Model (HMM) to guide the attention alignment, it used phoneme inputs instead of characters, it predicted vocoder parameters (requiring a traditional vocoder), and the authors noted that certain tricks used to make it train hurt the final speech prosody.
  • Char2Wav (Sotelo et al., 2017): An independently developed end-to-end model that also uses character inputs. However, it differs from Tacotron in key ways: it predicts vocoder parameters to be used by a SampleRNN neural vocoder, whereas Tacotron predicts a raw spectrogram. Also, its components required separate pre-training, while Tacotron can be trained from scratch as a single model.

3.3. Technological Evolution

The field of TTS has evolved from complex, handcrafted systems to increasingly automated, data-driven approaches:

  1. Concatenative & Parametric TTS: Dominated for decades. Required significant manual effort and domain expertise. Quality was intelligible but often robotic (parametric) or disjointed (concatenative).
  2. Hybrid/Neural Component Systems: The rise of deep learning led to replacing individual components of the traditional pipeline with neural networks (e.g., DeepVoice). WaveNet provided a breakthrough neural vocoder that massively improved audio quality, but it still relied on a traditional frontend.
  3. End-to-End TTS: Tacotron represents the next major step: collapsing the entire frontend and acoustic model into a single neural network. This simplifies the creation of TTS systems and allows the model to learn complex relationships between text and speech directly from data, paving the way for more expressive and easily adaptable models.

3.4. Differentiation Analysis

Compared to previous works, Tacotron's main innovations are:

  • True End-to-End Simplicity: It directly maps characters to a spectrogram, bypassing the need for explicit phonetic conversion, duration modeling, and pre-trained aligners.
  • Novel Architecture for TTS: The CBHG module is a powerful architecture for sequence representation that proved highly effective for both text encoding and spectrogram refinement.
  • Training and Inference Efficiency: The introduction of the reduction factor r and frame-level prediction makes Tacotron much faster than sample-level autoregressive models like WaveNet, making it more practical for real-world applications.
  • Intermediate Representation: The choice to predict a spectrogram rather than vocoder parameters or raw audio is a key design decision. It is a good trade-off between representational richness and model complexity, and it decouples the main model from the final waveform synthesis step.

4. Methodology

4.1. Principles

The core principle of Tacotron is to treat text-to-speech synthesis as a sequence-to-sequence learning problem, analogous to neural machine translation. The model "translates" an input sequence of characters into an output sequence of audio spectrogram frames. It uses an encoder-decoder framework with an attention mechanism to learn the complex, non-linear mapping between text and its corresponding speech representation. The entire model is trained jointly to minimize the reconstruction error between the predicted spectrogram and the ground truth spectrogram.

4.2. Core Methodology In-depth

The Tacotron architecture is composed of an encoder, an attention-based decoder, and a post-processing network, as shown in the overall model diagram.

The model architecture is depicted in the figure below (Figure 1 from the original paper).

Figure 1 (schematic of the Tacotron model): the pipeline from character embeddings to spectrogram generation, including the CBHG modules, the attention mechanism, and Griffin-Lim reconstruction, showing how the modules work together for efficient speech synthesis.

4.2.1. CBHG Module

Before detailing the main components, it's essential to understand the CBHG module, a key building block used throughout the model. The name stands for Convolution bank, Highway network, and Bidirectional GRU. It is designed to extract robust contextual representations from a sequence.

The following figure (Figure 2 from the original paper) illustrates the CBHG architecture.

Figure 2 (the CBHG module): key components include the bidirectional RNN, highway layers, convolutional layers, and max pooling, combined through a residual connection.

The data flow through the CBHG module is as follows:

  1. 1-D Convolution Bank: The input sequence is passed through K sets of 1-D convolutional filters with widths ranging from k = 1 to K. This is akin to modeling n-grams (unigrams, bigrams, etc.) and captures local contextual information at different scales. The outputs of these filters are stacked together.
  2. Max Pooling: The stacked convolution outputs are max-pooled along the time dimension with a stride of 1. This helps to increase local invariance while preserving the time resolution of the sequence.
  3. 1-D Convolutional Projections: The pooled features are passed through further fixed-width 1-D convolutional layers to create higher-level representations.
  4. Residual Connection: The output from the projection layers is added to the original input sequence of the CBHG module. This residual connection helps with gradient flow in deep networks.
  5. Highway Network: The resulting sequence is fed into a multi-layer highway network. Highway networks use a gating mechanism to regulate information flow, allowing the model to learn which features to pass through and which to transform.
  6. Bidirectional GRU: Finally, a bidirectional GRU (Gated Recurrent Unit) is applied to the output of the highway network. This captures sequential dependencies from both the forward and backward directions, producing the final output representation.
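
The following is a compact PyTorch-flavoured sketch of this data flow (the paper's implementation is in TensorFlow; batch normalization and the exact filter counts from the paper's Table 1 are omitted, and the residual connection assumes the input dimension equals the projection output dimension, as in the encoder):

```python
import torch
import torch.nn as nn

class CBHG(nn.Module):
    """Schematic CBHG: conv bank -> max pool -> conv projections ->
    residual -> highway -> bidirectional GRU. Hyperparameters are
    illustrative, not the paper's exact values."""
    def __init__(self, in_dim, K=16, proj_dims=(128, 128), highway_layers=4):
        super().__init__()
        # 1. Bank of 1-D convolutions with widths 1..K ("n-gram" filters)
        self.bank = nn.ModuleList([
            nn.Conv1d(in_dim, in_dim, kernel_size=k, padding=k // 2)
            for k in range(1, K + 1)])
        # 2. Max pooling over time with stride 1 (keeps time resolution)
        self.pool = nn.MaxPool1d(kernel_size=2, stride=1, padding=1)
        # 3. Fixed-width 1-D convolution projections
        self.proj1 = nn.Conv1d(K * in_dim, proj_dims[0], 3, padding=1)
        self.proj2 = nn.Conv1d(proj_dims[0], proj_dims[1], 3, padding=1)
        # 5. Highway layers (gated information flow)
        self.highway = nn.ModuleList([
            nn.ModuleDict({"H": nn.Linear(proj_dims[1], proj_dims[1]),
                           "T": nn.Linear(proj_dims[1], proj_dims[1])})
            for _ in range(highway_layers)])
        # 6. Bidirectional GRU for forward/backward context
        self.gru = nn.GRU(proj_dims[1], proj_dims[1] // 2,
                          batch_first=True, bidirectional=True)

    def forward(self, x):               # x: (batch, time, in_dim)
        T = x.size(1)
        y = x.transpose(1, 2)           # (batch, in_dim, time)
        # 1-2. Conv bank, stacking, and max pool (crop back to length T)
        y = torch.cat([torch.relu(conv(y))[:, :, :T] for conv in self.bank], dim=1)
        y = self.pool(y)[:, :, :T]
        # 3-4. Projections plus residual connection to the CBHG input
        # (requires in_dim == proj_dims[1], as in the encoder)
        y = torch.relu(self.proj1(y))
        y = self.proj2(y).transpose(1, 2) + x    # (batch, time, proj_dims[1])
        # 5. Highway network
        for layer in self.highway:
            h = torch.relu(layer["H"](y))
            t = torch.sigmoid(layer["T"](y))
            y = h * t + y * (1.0 - t)
        # 6. Bidirectional GRU
        out, _ = self.gru(y)
        return out                       # (batch, time, proj_dims[1])

# Hypothetical usage on a batch of 2 sequences of length 50.
cbhg = CBHG(in_dim=128, K=16)
print(cbhg(torch.randn(2, 50, 128)).shape)       # torch.Size([2, 50, 128])
```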

4.2.2. Encoder

The encoder's goal is to convert the input character sequence into a robust sequential representation that the decoder can attend to.

  1. Input & Embedding: The input is a sequence of characters. Each character is represented as a one-hot vector and then embedded into a 256-dimensional continuous vector.
  2. Pre-net: Each embedding is passed through a "pre-net," which consists of two fully-connected layers (FC-256-ReLU and FC-128-ReLU) with dropout (rate 0.5). This non-linear transformation acts as an information bottleneck and helps the model converge better and generalize. Dropout is particularly important here as it introduces noise, which is beneficial for training generative models.
  3. CBHG Transformation: The output sequence from the pre-net is then fed into a CBHG module (with K=16 in the convolution bank). The CBHG module produces the final encoder representation, which is a sequence of hidden states that summarize the context of each character. The paper notes that this CBHG-based encoder reduces overfitting and leads to fewer mispronunciations compared to a standard multi-layer RNN encoder.
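
A matching sketch of the pre-net and encoder wiring (again PyTorch-style and schematic; it reuses the CBHG class from the sketch above, and the layer sizes follow the 256-dimensional embedding and FC-256/FC-128 figures quoted in this section):

```python
import torch
import torch.nn as nn

class PreNet(nn.Module):
    """FC-256-ReLU -> FC-128-ReLU bottleneck with dropout 0.5 (sketch)."""
    def __init__(self, in_dim, sizes=(256, 128), p=0.5):
        super().__init__()
        dims = [in_dim, *sizes]
        self.layers = nn.ModuleList([nn.Linear(a, b) for a, b in zip(dims, dims[1:])])
        self.dropout = nn.Dropout(p)

    def forward(self, x):
        for linear in self.layers:
            # Dropout here is the noise source the paper credits with
            # better convergence and generalization.
            x = self.dropout(torch.relu(linear(x)))
        return x

class Encoder(nn.Module):
    """Character embedding -> pre-net -> CBHG (K = 16), as described above."""
    def __init__(self, num_chars, embed_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(num_chars, embed_dim)
        self.prenet = PreNet(embed_dim)          # 256 -> 256 -> 128
        self.cbhg = CBHG(in_dim=128, K=16)       # CBHG sketch from above

    def forward(self, char_ids):                 # (batch, time) int64 ids
        return self.cbhg(self.prenet(self.embedding(char_ids)))

# Hypothetical usage: a batch of 2 "sentences" of 30 character ids each.
enc = Encoder(num_chars=70)
print(enc(torch.randint(0, 70, (2, 30))).shape)  # torch.Size([2, 30, 128])
```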

4.2.3. Decoder

The decoder is an autoregressive recurrent network that generates the spectrogram frame by frame (or r frames at a time).

  1. Attention Mechanism: The decoder uses a content-based tanh attention mechanism. At each decoding step, a stateful GRU (the "attention RNN") generates a query vector. This query is used along with the encoder outputs to compute attention weights, which produce a context vector summarizing the relevant part of the input text for the current step.
  2. Decoder RNN: The input to the main decoder RNN at each step is a concatenation of the context vector from the attention mechanism and the output from the attention RNN cell. The decoder itself is a stack of two GRU layers with 256 cells and residual connections between the layers to speed up convergence.
  3. Pre-net: Just as in the encoder, the decoder input frame (from the previous time step) is first passed through a pre-net with dropout. During training, the input is the corresponding ground-truth frame. During inference, it's the prediction from the previous step. The dropout in this pre-net is critical for generalization, as it injects noise that helps the model cope with the fact that many different audio signals can correspond to the same text (multimodality).
  4. Target Prediction and Reduction Factor r: A key innovation is that the decoder's output layer predicts r non-overlapping spectrogram frames at each decoder time step. The paper finds this trick crucial for faster convergence and more stable attention learning. This is because a single character often corresponds to multiple speech frames, and predicting them in a single step allows the attention to move forward more naturally. For the final results, r = 2.
  5. Decoder Target: The seq2seq decoder predicts an 80-band mel-scale spectrogram. This is a compressed, perceptually-relevant representation that is easier for the model to learn than a high-dimensional linear spectrogram.
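
A small NumPy illustration of how the reduction factor changes the decoder's target layout (shapes are illustrative; the paper's decoder predicts 80-band mel frames and the reported results use r = 2):

```python
import numpy as np

def group_frames(mel, r=2):
    """Reshape a (T, n_mels) mel spectrogram into (T // r, r * n_mels) so that
    each decoder step predicts r non-overlapping frames at once; any leftover
    frames are simply dropped here (padding is the other option)."""
    T, n_mels = mel.shape
    T_trim = (T // r) * r
    return mel[:T_trim].reshape(T_trim // r, r * n_mels)

mel = np.random.rand(101, 80)     # ~1.26 s of audio at a 12.5 ms frame shift
targets = group_frames(mel, r=2)
print(targets.shape)              # (50, 160): 50 decoder steps instead of 101
```

At inference time, the paper feeds the last of the r predicted frames back through the pre-net as the input to the next decoder step.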

4.2.4. Post-processing Net and Waveform Synthesis

  1. Post-processing Net: The sequence of mel-spectrogram frames predicted by the decoder is fed into a post-processing network. This network's task is to convert the mel-spectrogram into a linear-scale magnitude spectrogram, which is required by the Griffin-Lim algorithm. In Tacotron, this network is another CBHG module (with K=8). Since this network can see the entire generated sequence, it can use both past and future context to correct prediction errors made by the left-to-right decoder, resulting in better-defined harmonics and finer details.
  2. Waveform Synthesis: The final linear-scale spectrogram is converted into an audio waveform using the Griffin-Lim algorithm. The paper notes two important details:
    • They raise the predicted spectral magnitudes to a power of 1.2 before feeding them to Griffin-Lim, which they found reduces artifacts.
    • The algorithm converges in about 50 iterations and is implemented in TensorFlow, making it part of the model graph (though it is not trained).
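
A minimal sketch of this waveform-synthesis step using librosa's Griffin-Lim implementation (the paper implements Griffin-Lim in TensorFlow; the 1.2 exponent and ~50 iterations come from the details above, while the STFT parameters below are assumptions for illustration):

```python
import numpy as np
import librosa

def spectrogram_to_waveform(linear_mag, n_iter=50, power=1.2,
                            hop_length=300, win_length=1200):
    """linear_mag: (1 + n_fft // 2, frames) predicted linear-scale magnitudes.
    Raising the magnitudes to the 1.2 power before inversion was reported to
    reduce artifacts; 50 iterations follow the setting quoted above."""
    return librosa.griffinlim(linear_mag ** power,
                              n_iter=n_iter,
                              hop_length=hop_length,
                              win_length=win_length)

# Hypothetical usage with a random "spectrogram"; real input would come from
# the post-processing net. Hop/window of 300/1200 samples assume 24 kHz audio
# with a 12.5 ms frame shift and 50 ms frame length.
fake_mag = np.abs(np.random.randn(1025, 200)).astype(np.float32)
wav = spectrogram_to_waveform(fake_mag)
print(wav.shape)
```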

4.2.5. Training Details

  • Loss Function: A simple L1 loss (mean absolute error) is used. The total loss is the sum of the L1 loss on the mel-spectrogram predictions (from the decoder) and the L1 loss on the linear-scale spectrogram predictions (from the post-processing net), with equal weight.
  • Optimizer: The Adam optimizer is used with a scheduled learning rate decay.
  • Handling End-of-Sequence: A simple but effective trick is used to teach the model when to stop generating output. Instead of masking out the loss on padded frames at the end of sequences, the model is trained to reconstruct the zero-padding. This implicitly teaches the model to output silence/zeros after the speech content has finished.
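
A sketch of the combined training objective described above (PyTorch-style; the L1 criterion and equal weighting follow the text, everything else is schematic):

```python
import torch.nn.functional as F

def tacotron_loss(mel_pred, mel_target, lin_pred, lin_target):
    """Equal-weight sum of L1 losses on the decoder's mel-spectrogram output
    and the post-processing net's linear-spectrogram output. Targets are
    zero-padded rather than loss-masked, so the model also learns to emit
    (near-)zero frames once the utterance has ended."""
    return F.l1_loss(mel_pred, mel_target) + F.l1_loss(lin_pred, lin_target)
```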

5. Experimental Setup

5.1. Datasets

  • Dataset Used: The model was trained on an internal, proprietary North American English dataset.
  • Scale and Characteristics: The dataset contains approximately 24.6 hours of audio from a single professional female speaker.
  • Data Preprocessing: The text associated with the audio was normalized. For example, numbers were expanded into words. The paper gives a concrete example: "16" is converted to "sixteen".
  • Justification: Using a high-quality, single-speaker dataset is a standard practice for developing and evaluating TTS models, as it allows for a clean assessment of model quality without the confounding variable of speaker variation.
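
The paper does not describe its text normalizer in detail; as a toy illustration of the kind of expansion involved, the snippet below uses the third-party num2words package (an assumption, not the paper's tooling):

```python
from num2words import num2words  # pip install num2words

def normalize_numbers(text):
    """Toy normalizer: spell out standalone integers, e.g. '16' -> 'sixteen'.
    Illustration only; the paper's production normalizer is not described."""
    return " ".join(num2words(int(tok)) if tok.isdigit() else tok
                    for tok in text.split())

print(normalize_numbers("chapter 16 begins"))  # chapter sixteen begins
```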

5.2. Evaluation Metrics

The primary evaluation metric used in the paper is the Mean Opinion Score (MOS).

  1. Conceptual Definition: MOS is a subjective measure of quality used widely in audio and video processing. For TTS, it quantifies the perceived naturalness of the synthesized speech. A group of human subjects (in this case, native speakers) listen to speech samples and rate them on a numerical scale. The average of all ratings for a given system is its MOS.
  2. Mathematical Formula: The formula for MOS is the arithmetic mean of all individual scores: $ \text{MOS} = \frac{\sum_{i=1}^{N} S_i}{N} $
  3. Symbol Explanation:
    • $S_i$: The individual score given by the $i$-th rater to a stimulus. In this paper, this is an integer score from 1 to 5 on a Likert scale (e.g., 1=Bad, 2=Poor, 3=Fair, 4=Good, 5=Excellent).

    • $N$: The total number of ratings collected for that stimulus or system.

      The tests were crowdsourced, and only ratings from subjects who reported using headphones were included to ensure a controlled listening environment.
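
As a concrete illustration of the metric, a few lines of NumPy computing the MOS and a normal-approximation 95% confidence interval (the paper reports MOS ± an interval, e.g. 3.82 ± 0.085, without stating how the interval is computed, so the CI method here is an assumption):

```python
import numpy as np

def mean_opinion_score(ratings):
    """MOS = arithmetic mean of the individual 1-5 ratings, here with a
    normal-approximation 95% confidence interval for illustration."""
    r = np.asarray(ratings, dtype=float)
    mos = r.mean()
    ci95 = 1.96 * r.std(ddof=1) / np.sqrt(len(r))
    return mos, ci95

ratings = [4, 4, 3, 5, 4, 3, 4, 5, 4, 4]   # hypothetical crowdsourced scores
print(mean_opinion_score(ratings))
```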

5.3. Baselines

Tacotron was compared against two strong baseline systems, both of which were in production at Google:

  1. Parametric System: A state-of-the-art statistical parametric TTS system based on LSTMs, as described in Zen et al. (2016). This type of system represents the dominant paradigm before end-to-end models. It predicts acoustic features which are then passed to a separate vocoder for synthesis.
  2. Concatenative System: A high-quality unit selection-based system, as described in Gonzalvo et al. (2016). These systems work by selecting and concatenating segments of pre-recorded speech from a very large database. They are often considered the gold standard for naturalness, especially for in-domain text, as they use segments of real human speech.

6. Results & Analysis

6.1. Core Results Analysis

The main quantitative results are presented in a Mean Opinion Score (MOS) test for naturalness. 100 unseen phrases were evaluated by native speakers.

The following are the results from Table 2 of the original paper:

System          Mean Opinion Score (5-point scale)
Tacotron        3.82 ± 0.085
Parametric      3.69 ± 0.109
Concatenative   4.09 ± 0.119

From this table, we can draw several key conclusions:

  • Superiority over Parametric TTS: Tacotron achieves an MOS of 3.82, higher than the 3.69 MOS of the production parametric system. This was a major breakthrough, demonstrating that an end-to-end model trained from scratch could surpass the quality of a highly engineered, production-level SPSS system.
  • Gap with Concatenative TTS: The concatenative system still achieves the highest MOS of 4.09. This is expected, as concatenative systems often represent the upper bound of quality by using real audio clips. However, they are less flexible and can suffer from audible joins between speech units.
  • Promising Results: The authors argue that achieving a 3.82 MOS is a very promising result, especially considering the known artifacts introduced by the non-neural Griffin-Lim algorithm for waveform synthesis. This implies that the core Tacotron model was generating high-quality spectrograms and that the overall system quality could be further improved with a better vocoder.

6.2. Ablation Studies / Parameter Analysis

The paper provides several insightful ablation studies to justify its key design choices. These are primarily qualitative, relying on visual inspection of attention alignments and spectrograms.

6.2.1. Importance of Tacotron's Architecture over Vanilla Seq2Seq

The following figure (Figure 3 in the original paper) compares the attention alignments of different model configurations. The x-axis represents the encoder timesteps (input characters), and the y-axis represents the decoder timesteps (output frames). A clean, diagonal line indicates that the model is correctly stepping through the input text as it generates the speech output.

Figure 3 (attention alignments): decoder timesteps vs. encoder states for (a) a vanilla seq2seq model with scheduled sampling, (b) a GRU encoder, and (c) the proposed Tacotron model, showing how the learned alignments differ across configurations.

  • (a) Vanilla Seq2Seq: This model, which lacks the CBHG and pre-net components, produces a very poor alignment. The attention mechanism gets "stuck" on certain characters for many timesteps and then jumps, resulting in a blurry and non-monotonic path. This corresponds to audible repetitions and skips in the synthesized audio, destroying intelligibility.
  • (c) Tacotron: In stark contrast, the full Tacotron model learns a clean, sharp, and almost perfectly diagonal alignment. This demonstrates that its architecture successfully guides the attention mechanism to learn the monotonic relationship between text and speech.

6.2.2. Benefit of the CBHG Encoder

  • (b) GRU Encoder: This plot shows the alignment for a model where the CBHG encoder is replaced with a simpler 2-layer residual GRU encoder. While much better than the vanilla model, the alignment is visibly noisier and less sharp than the full Tacotron model in (c). The authors report that this noisy alignment often leads to mispronunciations, especially for long or complex phrases, and that the CBHG encoder generalizes better.

6.2.3. Benefit of the Post-Processing Net

The following figure (Figure 4 in the original paper) visualizes the spectrograms generated by the decoder directly versus the output after the post-processing net.

Figure 4 (spectrogram comparison): the top panel shows a spectrogram predicted without the post-processing net, and the bottom panel the spectrogram with it, illustrating the improvement in clarity from post-processing.

  • (a) Without Post-processing Net: The top spectrogram, generated directly by the decoder, shows less defined harmonic structures.
  • (b) With Post-processing Net: The bottom spectrogram, refined by the post-processing net, shows much clearer and better-resolved harmonics (visible as distinct horizontal lines) and more detailed high-frequency formant structure. The authors state this visual improvement corresponds to a reduction in synthesis artifacts in the final audio. This is because the post-processing net can use the full sequence context (both past and future frames) to correct errors, whereas the autoregressive decoder only has past information.

7. Conclusion & Reflections

7.1. Conclusion Summary

The paper successfully introduces Tacotron, an end-to-end generative model for text-to-speech synthesis that learns to produce a spectrogram directly from characters. This approach dramatically simplifies the traditional multi-stage TTS pipeline, removing the need for extensive domain expertise and complex feature engineering. Key architectural innovations, such as the CBHG module, a reduction factor for frame prediction, and a post-processing network, were shown to be crucial for the model's success. The resulting system achieved a naturalness score of 3.82 MOS, outperforming a production parametric system and demonstrating the viability and potential of end-to-end TTS. Furthermore, its frame-level generation makes it substantially faster for inference than sample-level autoregressive models like WaveNet.

7.2. Limitations & Future Work

The authors candidly acknowledge several limitations and areas for future improvement:

  • Waveform Synthesis: The most significant limitation identified is the reliance on the Griffin-Lim algorithm. While fast, it is known to introduce audible artifacts and is a major bottleneck for overall audio quality. The authors explicitly state that developing a fast, high-quality neural spectrogram-to-waveform inverter (a neural vocoder) is a key area for future work.
  • Model Components: Many early design choices (e.g., the specific attention module, output layer, loss function) were not exhaustively explored and are "ripe for improvement."
  • Text Normalization: The model still relies on a pre-processing step to normalize text (e.g., expanding numbers). The authors suggest that recent advances in learned text normalization could eventually make this step unnecessary.

7.3. Personal Insights & Critique

  • Landmark Paper: Tacotron is a landmark paper that effectively ushered in the era of end-to-end neural TTS. It provided a clear, successful blueprint that inspired a vast amount of subsequent research, including its successor, Tacotron 2 (which combined Tacotron with a WaveNet vocoder to achieve human-level naturalness).
  • Clever Engineering: The paper is an excellent example of thoughtful engineering to solve a difficult problem. The "tricks" introduced, such as the reduction factor r and the post-processing net, were not minor tweaks but crucial innovations that addressed fundamental challenges in applying seq2seq models to TTS (i.e., alignment stability and error correction).
  • Modularity and Pragmatism: The decision to predict a spectrogram and use a separate vocoder was a pragmatic and powerful choice. It decoupled the problem of learning text-to-speech alignment from the problem of generating a high-fidelity waveform. This modularity allowed the field to "plug and play" better vocoders as they were developed, leading to rapid quality improvements.
  • Potential Issues: The model's performance was demonstrated on a high-quality, single-speaker dataset. Its robustness on more challenging, "in-the-wild" data with multiple speakers, background noise, or expressive styles was not explored. Furthermore, the "black box" nature of end-to-end models makes them harder to debug and control compared to traditional pipelines. For instance, correcting a specific mispronunciation can be difficult without simply adding more data.
