Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions
TL;DR Summary
This paper presents Tacotron 2, a neural network architecture for direct text-to-speech synthesis. It features a sequence-to-sequence network predicting mel spectrograms and a modified WaveNet as a vocoder. The model achieves a mean opinion score of 4.53, comparable to professionally recorded speech.
Abstract
This paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text. The system is composed of a recurrent sequence-to-sequence feature prediction network that maps character embeddings to mel-scale spectrograms, followed by a modified WaveNet model acting as a vocoder to synthesize time-domain waveforms from those spectrograms. Our model achieves a mean opinion score (MOS) of 4.53, comparable to a MOS of 4.58 for professionally recorded speech. To validate our design choices, we present ablation studies of key components of our system and evaluate the impact of using mel spectrograms as the input to WaveNet instead of linguistic, duration, and F0 features. We further demonstrate that using a compact acoustic intermediate representation enables significant simplification of the WaveNet architecture.
In-depth Reading
1. Bibliographic Information
1.1. Title
Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions
1.2. Authors
Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, Rif A. Saurous, Yannis Agiomyrgiannakis, and Yonghui Wu.
The authors are predominantly from Google, Inc., with one author from the University of California, Berkeley. This team includes researchers who were also involved in the development of the original Tacotron and WaveNet models, indicating a deep expertise and continuity in the research line of neural speech synthesis at Google.
1.3. Journal/Conference
The paper was first released on the arXiv preprint server and was subsequently published at ICASSP 2018. The work is foundational in the field of modern text-to-speech (TTS).
1.4. Publication Year
The paper was first published on arXiv in December 2017.
1.5. Abstract
The paper introduces Tacotron 2, a neural network architecture for synthesizing speech directly from text. The system is composed of two main parts:
- A recurrent sequence-to-sequence network that predicts a sequence of mel-scale spectrograms from a sequence of input characters.
- A modified WaveNet model that acts as a vocoder, synthesizing a time-domain audio waveform from the predicted mel spectrograms.

The model achieves a Mean Opinion Score (MOS) of 4.53, which is very close to the 4.58 MOS of professionally recorded human speech. The authors validate their design through ablation studies, confirming the benefits of using mel spectrograms as an intermediate representation. This approach simplifies the WaveNet architecture significantly compared to using complex linguistic features.
1.6. Original Source Link
- Original Source Link: https://arxiv.org/abs/1712.05884
- PDF Link: https://arxiv.org/pdf/1712.05884v2
- Publication Status: This is a preprint available on arXiv. It is considered a canonical paper in the field of speech synthesis.
2. Executive Summary
2.1. Background & Motivation
The core problem is Text-to-Speech (TTS) synthesis, which aims to generate natural-sounding human speech from input text. For years, this field was dominated by two main paradigms:
- Concatenative Synthesis: This method stitches together small, pre-recorded audio units from a large database. While it can produce high-quality audio, it often suffers from audible artifacts at the boundaries where units are joined and lacks expressive control.
- Statistical Parametric Speech Synthesis (SPSS): This method uses statistical models (like Hidden Markov Models or, later, Deep Neural Networks) to predict a sequence of acoustic features (e.g., spectral parameters, fundamental frequency) from text. A separate algorithm called a vocoder then synthesizes audio from these features. SPSS produces smoother speech but often sounds muffled and less natural than human speech.
A major breakthrough came with WaveNet, a deep generative model that could produce raw audio waveforms of unprecedented quality, rivaling human speech. However, the original WaveNet required a complex front-end system to generate conditioning features, including linguistic information, phoneme durations, and fundamental frequency (F0). This pipeline was cumbersome and required significant domain expertise.
Another line of work, Tacotron, proposed an end-to-end sequence-to-sequence model that could generate a magnitude spectrogram directly from characters, greatly simplifying the TTS pipeline. However, Tacotron used the traditional Griffin-Lim algorithm to convert the spectrogram back to an audio waveform, which introduced artifacts and resulted in lower audio quality compared to WaveNet.
The paper's entry point is to combine the strengths of these two approaches. The authors propose a unified neural system, Tacotron 2, which uses a Tacotron-style architecture to generate a mel spectrogram (a compact and perceptually relevant acoustic representation) and then uses a modified WaveNet as a high-fidelity vocoder to convert that spectrogram into a raw audio waveform. This design aims to achieve the best of both worlds: the simplified end-to-end training of Tacotron and the state-of-the-art audio quality of WaveNet.
2.2. Main Contributions / Findings
The paper's primary contributions are:
- A Unified Neural TTS System (Tacotron 2): The paper presents a complete, two-stage neural architecture that synthesizes speech directly from text. It consists of a feature prediction network and a neural vocoder, trained separately but designed to work together seamlessly.
- Use of Mel Spectrograms as an Intermediate Representation: The paper demonstrates that mel spectrograms are an effective intermediate link between the text-to-feature network and the waveform synthesizer. This representation is simpler and lower-level than the linguistic features used by the original WaveNet but compact enough to make the vocoder model significantly simpler and more efficient.
- State-of-the-Art Speech Quality: The proposed system achieves a Mean Opinion Score (MOS) of 4.53, which is statistically very close to the 4.58 MOS of the ground truth professional recordings. This was a landmark result, demonstrating that neural TTS could produce speech almost indistinguishable from human speech.
- Ablation Studies Validating Design Choices: The authors conduct extensive experiments to justify their architectural decisions. They show that:
- Using a WaveNet vocoder is far superior to the Griffin-Lim algorithm.
- The convolutional post-net in the feature prediction network is crucial for high quality.
- Conditioning on mel spectrograms allows for a significantly smaller and simpler WaveNet architecture without sacrificing quality.
- Training the WaveNet vocoder on the predicted (and thus "oversmoothed") spectrograms from the first stage leads to better synthesis quality compared to training it on ground truth spectrograms.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
- Text-to-Speech (TTS): The process of artificially producing human speech from written text.
- Spectrogram: A visual representation of the spectrum of frequencies of a signal as it varies with time. In audio, it plots time on the x-axis, frequency on the y-axis, and the amplitude (loudness) of a frequency at a particular time is represented by color or intensity.
- Linear-Frequency Spectrogram: Frequencies are spaced linearly. It contains a lot of detail, especially at high frequencies.
- Mel-Frequency Spectrogram: The frequency axis is transformed using the mel scale, a perceptual scale of pitches judged by listeners to be equal in distance from one another. The mel scale gives more resolution to lower frequencies and less to higher frequencies, mimicking human auditory perception. This makes it a more compact and perceptually relevant representation for speech (a minimal computation sketch is shown after this list).
- Vocoder (Voice Coder): An algorithm or system that analyzes and synthesizes the human voice waveform. In traditional TTS, it takes acoustic features (like spectrograms) and generates an audio signal. Griffin-Lim and WaveNet are both examples of vocoders, though they operate on very different principles.
- Sequence-to-Sequence (Seq2Seq) Models: A class of neural network models designed to map an input sequence of one length to an output sequence of a different length. A typical Seq2Seq model consists of:
- An Encoder: A recurrent neural network (RNN), like an LSTM or GRU, that processes the input sequence and compresses it into a fixed-length context vector (or a sequence of hidden states).
- A Decoder: Another RNN that takes the encoder's representation and generates the output sequence one element at a time.
- Attention Mechanism: An enhancement to Seq2Seq models that allows the decoder to selectively focus on different parts of the input sequence at each step of generating the output. Instead of relying on a single fixed-length context vector from the encoder, the attention mechanism computes a weighted sum of all encoder hidden states, where the weights determine which input parts are most "relevant" for the current output step. This is crucial for long sequences, as it prevents the model from "forgetting" early parts of the input.
- Recurrent Neural Network (RNN): A type of neural network designed for processing sequential data. It maintains a hidden state that is updated at each step, allowing it to have "memory" of previous inputs in the sequence.
- Long Short-Term Memory (LSTM): A sophisticated type of RNN unit that uses gating mechanisms (input, forget, and output gates) to better control the flow of information, enabling it to learn long-range dependencies and avoid the vanishing/exploding gradient problems common in simple RNNs.
- Bidirectional LSTM: An LSTM that processes the input sequence in both forward and backward directions. This allows the hidden state at any point in time to contain information from both past and future elements of the sequence, providing a richer contextual representation.
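To make the mel spectrogram concept above concrete, here is a minimal sketch of computing an 80-channel log-mel spectrogram with librosa. The 24 kHz sample rate, 12.5 ms hop, and 80 mel channels match values quoted elsewhere in this analysis; the window length and frequency range are illustrative assumptions, not the authors' exact settings.

```python
import numpy as np
import librosa

def log_mel_spectrogram(wav_path: str) -> np.ndarray:
    """Compute an 80-channel log-mel spectrogram, roughly in the style of Tacotron 2.

    The 24 kHz sample rate, 12.5 ms hop (300 samples), and 80 mel channels follow
    values quoted in this analysis; window length and fmin/fmax are assumptions.
    """
    sr = 24000
    y, _ = librosa.load(wav_path, sr=sr)   # load and resample to 24 kHz
    mel = librosa.feature.melspectrogram(
        y=y,
        sr=sr,
        n_fft=1200,        # 50 ms analysis window (assumption)
        hop_length=300,    # 12.5 ms frame hop
        n_mels=80,         # 80 mel channels
        fmin=125.0,        # illustrative frequency range
        fmax=7600.0,
        power=1.0,         # magnitude (not power) spectrogram
    )
    # Log dynamic-range compression with a small floor to avoid log(0).
    return np.log(np.clip(mel, a_min=1e-2, a_max=None))
```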
3.2. Previous Works
- WaveNet (van den Oord et al., 2016): A deep generative model for raw audio waveforms. It uses a stack of dilated causal convolutions to achieve a very large receptive field, allowing it to model long-term dependencies in the audio signal. It predicts the probability distribution of the next audio sample conditioned on all previous samples. The original WaveNet was conditioned on complex linguistic features (phonemes, syllables), their durations, and fundamental frequency (F0) to steer the generation process for TTS. While producing extremely high-quality audio, this conditioning required a separate, complex text analysis pipeline.
- Tacotron (Wang et al., 2017): A Seq2Seq model for TTS that simplified the traditional pipeline. It takes characters as input and directly outputs a linear-scale magnitude spectrogram. It uses an attention mechanism to align the output spectrogram frames with the input characters. Tacotron's main limitation was its reliance on the Griffin-Lim algorithm for waveform synthesis. Griffin-Lim is an iterative algorithm that estimates the phase of a signal given only its magnitude spectrogram, but it often produces audible artifacts. Tacotron showed that a single neural network could replace the complex text analysis and acoustic modeling front-end.
- Deep Voice 1, 2, & 3 (Arik et al., 2017; Ping et al., 2017): A series of fully neural TTS systems developed concurrently with Tacotron 2. Deep Voice 3, like Tacotron 2, also uses a Seq2Seq architecture to predict an intermediate acoustic representation which is then fed to a neural vocoder. However, the authors of Tacotron 2 note that Deep Voice 3 had not been shown to achieve speech quality rivaling that of humans at the time of publication.
3.3. Technological Evolution
- Early Systems (Concatenative/Parametric): TTS relied on hand-engineered pipelines, with separate components for text analysis, duration modeling, acoustic feature prediction, and vocoding. Quality was limited by boundary artifacts (concatenative) or a "buzzy" and muffled sound (parametric).
- Deep Learning in SPSS: Deep Neural Networks (DNNs) replaced Hidden Markov Models (HMMs) for acoustic feature prediction, improving the naturalness of parametric systems, but the vocoders remained a bottleneck.
- Neural Vocoders (WaveNet): WaveNet revolutionized audio generation by directly modeling the raw waveform, producing state-of-the-art quality. However, it was slow and required a complex feature extraction front-end.
- End-to-End Spectrogram Prediction (Tacotron): Tacotron showed that the front-end could be replaced with a single Seq2Seq network, mapping text directly to spectrograms. This simplified the pipeline but was held back by the non-neural Griffin-Lim vocoder.
- Unified Neural TTS (Tacotron 2): This paper represents the synthesis of the previous two breakthroughs. It combines the end-to-end philosophy of Tacotron with the high-fidelity synthesis of WaveNet, creating a fully neural system that is both simple in design and produces near-human-quality speech.
3.4. Differentiation Analysis
Compared to previous works, Tacotron 2's core innovations are:
- vs. Original Tacotron: The key difference is the replacement of the Griffin-Lim vocoder with a WaveNet-based neural vocoder. This single change dramatically improves the final audio quality, eliminating the characteristic artifacts of Griffin-Lim.
- vs. Original WaveNet: The key difference is the conditioning input. Instead of a complex set of linguistic features, Tacotron 2's WaveNet is conditioned on mel spectrograms. This simplifies the entire system, as the mel spectrograms are predicted by the first stage of the network directly from text. This also allows for a much simpler and smaller WaveNet architecture.
- vs. Deep Voice 3 / Char2Wav: While these models also proposed similar two-stage neural TTS systems, Tacotron 2 was the first to be demonstrated with extensive evaluations (including Mean Opinion Scores) to achieve speech quality that is statistically comparable to professional human recordings.
4. Methodology
4.1. Principles
The core idea of Tacotron 2 is to break down the complex task of text-to-speech synthesis into two manageable, fully neural stages:
1. Feature Prediction: A sequence-to-sequence model translates the input text (a sequence of characters) into a compact, intermediate acoustic representation: a mel spectrogram. This stage is responsible for learning the alignment between text and audio, pronunciation, and prosody (rhythm and intonation).
2. Waveform Synthesis (Vocoding): A separate neural network, a modified WaveNet, takes the generated mel spectrogram and synthesizes a high-fidelity, time-domain audio waveform. This stage acts as a neural vocoder, filling in the fine-grained details of the audio signal, including phase information and high-frequency textures, which are not fully captured by the mel spectrogram.
By using a mel spectrogram as the bridge, the two components can be trained separately, which is more practical. The mel spectrogram is smooth and low-dimensional, making it a stable target for the Seq2Seq model to predict using a simple Mean Squared Error loss. It is also a rich enough representation for the WaveNet vocoder to reconstruct high-quality audio.
The overall architecture is depicted in the following figure from the paper.
Figure: A schematic of the Tacotron 2 architecture. It comprises the feature prediction network that converts text input into mel spectrograms, and the WaveNet model that synthesizes time-domain waveforms from those mel spectrograms. Key components in the pipeline include location-sensitive attention and multiple LSTM layers.
4.2. Core Methodology In-depth (Layer by Layer)
4.2.1. Spectrogram Prediction Network
This network is a sequence-to-sequence model with attention, consisting of an encoder and a decoder.
1. Encoder: The encoder's job is to convert the input character sequence into a rich, contextual hidden representation.
- Input: The input is a sequence of characters. Each character is first converted into a 512-dimensional learned character embedding.
- Convolutional Layers: These embeddings are passed through a stack of 3 convolutional layers. Each layer has 512 filters of shape 5x1, meaning each filter looks at a window of 5 characters at a time. These layers help the model capture local context, similar to learning N-gram-like patterns (e.g., how characters form common letter combinations or syllables). Each convolutional layer is followed by batch normalization and ReLU activation.
- Bidirectional LSTM: The output from the final convolutional layer is fed into a single bidirectional LSTM layer containing 512 units (256 for the forward pass, 256 for the backward pass). This layer captures long-term dependencies in the entire text sequence. The output of the encoder is a sequence of hidden states, one for each input character, where each state contains information about the entire sentence.
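As an illustration of the encoder described above, here is a minimal PyTorch sketch (not the authors' implementation): a 512-dimensional character embedding, three 5-wide convolutional layers with batch normalization and ReLU, and a bidirectional LSTM with 256 units per direction. Dropout and other training details are omitted.

```python
import torch
import torch.nn as nn

class TacotronEncoder(nn.Module):
    """Simplified sketch of the Tacotron 2 encoder: embedding -> 3 convs -> BiLSTM."""

    def __init__(self, num_chars: int, dim: int = 512):
        super().__init__()
        self.embedding = nn.Embedding(num_chars, dim)
        self.convs = nn.ModuleList([
            nn.Sequential(
                nn.Conv1d(dim, dim, kernel_size=5, padding=2),  # 5-character window
                nn.BatchNorm1d(dim),
                nn.ReLU(),
            )
            for _ in range(3)
        ])
        # 256 units per direction -> 512-dimensional encoder states.
        self.lstm = nn.LSTM(dim, dim // 2, batch_first=True, bidirectional=True)

    def forward(self, char_ids: torch.Tensor) -> torch.Tensor:
        # char_ids: (batch, text_len) integer character indices
        x = self.embedding(char_ids)          # (batch, text_len, 512)
        x = x.transpose(1, 2)                 # (batch, 512, text_len) for Conv1d
        for conv in self.convs:
            x = conv(x)
        x = x.transpose(1, 2)                 # back to (batch, text_len, 512)
        outputs, _ = self.lstm(x)             # (batch, text_len, 512) encoder states
        return outputs
```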
2. Attention Mechanism: The decoder uses an attention mechanism to decide which parts of the encoded input sequence to focus on when generating each frame of the spectrogram.
- The paper uses location-sensitive attention, which is an extension of additive attention. This mechanism is designed to encourage the model to move forward monotonically through the input sequence, which is crucial for speech synthesis to prevent skipping words or repeating parts of the sentence.
- In addition to the decoder's hidden state and the encoder's outputs, this attention mechanism uses the cumulative attention weights from previous decoder steps as an extra feature. These cumulative weights are processed by 32 1-D convolution filters of length 31 to create location features. By "seeing" where it has focused before, the model learns to shift its attention forward consistently.
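Below is a minimal PyTorch sketch of location-sensitive attention as described above: additive attention whose energies also depend on location features computed by convolving the cumulative attention weights with 32 filters of length 31. The projection dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocationSensitiveAttention(nn.Module):
    """Additive attention extended with location features (a simplified sketch)."""

    def __init__(self, query_dim=1024, memory_dim=512, attn_dim=128,
                 n_filters=32, kernel_size=31):
        super().__init__()
        self.query_layer = nn.Linear(query_dim, attn_dim, bias=False)
        self.memory_layer = nn.Linear(memory_dim, attn_dim, bias=False)
        # Location features: 32 conv filters of length 31 over cumulative weights.
        self.location_conv = nn.Conv1d(1, n_filters, kernel_size,
                                       padding=(kernel_size - 1) // 2, bias=False)
        self.location_layer = nn.Linear(n_filters, attn_dim, bias=False)
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, query, memory, cumulative_weights):
        # query: (batch, query_dim) decoder LSTM state
        # memory: (batch, text_len, memory_dim) encoder outputs
        # cumulative_weights: (batch, text_len) sum of previous attention weights
        loc = self.location_conv(cumulative_weights.unsqueeze(1))   # (batch, 32, text_len)
        loc = self.location_layer(loc.transpose(1, 2))              # (batch, text_len, attn_dim)
        energies = self.v(torch.tanh(
            self.query_layer(query).unsqueeze(1)   # (batch, 1, attn_dim)
            + self.memory_layer(memory)            # (batch, text_len, attn_dim)
            + loc
        )).squeeze(-1)                              # (batch, text_len)
        weights = F.softmax(energies, dim=-1)
        context = torch.bmm(weights.unsqueeze(1), memory).squeeze(1)  # (batch, memory_dim)
        return context, weights
```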
3. Decoder: The decoder is an autoregressive RNN, meaning it generates the mel spectrogram one frame at a time, with each new frame conditioned on the previously generated frames.
- Pre-Net: The spectrogram frame generated at the previous time step is first passed through a pre-net. This consists of 2 fully connected layers of 256 ReLU units. The authors state this pre-net acts as an "information bottleneck" and is essential for learning the attention alignment. It also helps introduce dropout and improve generalization.
- LSTM Layers: The output of the pre-net is concatenated with the attention context vector (the weighted sum of encoder outputs from the attention mechanism). This combined vector is then fed into a stack of 2 uni-directional LSTM layers with 1024 units each.
- Output Prediction: The output from the LSTM stack is concatenated again with the attention context vector. This final vector is passed through a linear projection layer (a single fully connected layer) to predict the 80-dimensional mel spectrogram frame for the current time step.
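A simplified sketch of one decoder step follows: the pre-net, two 1024-unit LSTM cells, and the linear projection to an 80-dimensional mel frame. The attention update itself is omitted for brevity (the context vector is passed in), and the exact dimensions are illustrative.

```python
import torch
import torch.nn as nn

class PreNet(nn.Module):
    """Two fully connected ReLU layers of 256 units, applied to the previous mel frame."""

    def __init__(self, in_dim=80, hidden=256, dropout=0.5):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden)
        self.fc2 = nn.Linear(hidden, hidden)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        x = self.dropout(torch.relu(self.fc1(x)))
        return self.dropout(torch.relu(self.fc2(x)))

class DecoderStep(nn.Module):
    """One autoregressive decoder step: pre-net -> 2 LSTM cells -> linear projection."""

    def __init__(self, mel_dim=80, context_dim=512, lstm_dim=1024):
        super().__init__()
        self.prenet = PreNet(mel_dim)
        self.lstm1 = nn.LSTMCell(256 + context_dim, lstm_dim)
        self.lstm2 = nn.LSTMCell(lstm_dim, lstm_dim)
        self.mel_proj = nn.Linear(lstm_dim + context_dim, mel_dim)

    def forward(self, prev_frame, context, states):
        # prev_frame: (batch, 80) previous mel frame; context: (batch, 512) attention context
        (h1, c1), (h2, c2) = states
        x = torch.cat([self.prenet(prev_frame), context], dim=-1)
        h1, c1 = self.lstm1(x, (h1, c1))
        h2, c2 = self.lstm2(h1, (h2, c2))
        mel_frame = self.mel_proj(torch.cat([h2, context], dim=-1))   # (batch, 80)
        return mel_frame, ((h1, c1), (h2, c2))
```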
4. Post-Net and Loss Function:
- Post-Net: After the decoder generates the entire spectrogram sequence, it is passed through a 5-layer convolutional post-processing network (post-net). Each layer has 512 filters of shape 5x1 with batch normalization and tanh activations (except for the final layer). This post-net predicts a residual that is added to the initial spectrogram prediction. Its purpose is to correct prediction errors and add finer details to the spectrogram by using context from both past and future frames of the generated spectrogram.
- Loss Function: The model is trained to minimize the Mean Squared Error (MSE) between the predicted and ground truth spectrograms. The total loss is the sum of the MSE before the post-net and the MSE after the post-net. This encourages both the main decoder and the post-net to produce accurate predictions.
$
L_{total} = \sum_{t} (y_t - \hat{y}_t)^2 + \sum_{t} \left(y_t - (\hat{y}_t + r_t)\right)^2
$
Where:
- $y_t$ is the ground truth mel spectrogram frame at time $t$.
- $\hat{y}_t$ is the spectrogram frame predicted by the decoder at time $t$.
- $r_t$ is the residual predicted by the post-net for frame $t$.
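A minimal sketch of the summed MSE loss above, assuming the decoder output, the post-net residual, and the ground truth are tensors of shape (batch, frames, 80):

```python
import torch
import torch.nn.functional as F

def spectrogram_loss(decoder_mel, postnet_residual, target_mel):
    """Summed MSE before and after the post-net, matching the L_total formula above."""
    refined_mel = decoder_mel + postnet_residual           # post-net predicts a residual
    loss_before = F.mse_loss(decoder_mel, target_mel)      # MSE on the raw decoder output
    loss_after = F.mse_loss(refined_mel, target_mel)       # MSE after adding the residual
    return loss_before + loss_after
```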
5. Stop Token Prediction: To enable the model to automatically stop generation when the sentence is complete, a "stop token" is predicted at each decoder step.
- The decoder LSTM output and the attention context vector are concatenated, projected to a single scalar value, and passed through a sigmoid activation function. This produces a probability.
- During inference, generation stops when this probability exceeds a threshold of 0.5.
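A small sketch of the stop-token head described above: the concatenated decoder output and attention context are projected to a scalar and passed through a sigmoid, and generation halts once the probability exceeds 0.5. Dimensions are illustrative.

```python
import torch
import torch.nn as nn

class StopTokenPredictor(nn.Module):
    """Projects [decoder output; attention context] to a stop probability."""

    def __init__(self, lstm_dim=1024, context_dim=512):
        super().__init__()
        self.proj = nn.Linear(lstm_dim + context_dim, 1)

    def forward(self, decoder_output, context):
        logit = self.proj(torch.cat([decoder_output, context], dim=-1))
        return torch.sigmoid(logit).squeeze(-1)   # probability that this is the last frame

# During inference, the generation loop would check something like:
#   if stop_prob.item() > 0.5: break
```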
4.2.2. Modified WaveNet Vocoder
The second component of Tacotron 2 is a modified WaveNet that acts as a neural vocoder. It takes the predicted 80-channel mel spectrogram and generates a 24 kHz raw audio waveform.
- Architecture: The core architecture is similar to the original WaveNet, with a stack of 30 dilated convolution layers organized into 3 dilation cycles of 10 layers each (dilation rates of 1, 2, 4, ..., 512 within each cycle).
- Conditioning: The key modification is how it's conditioned. The mel spectrogram frames have a hop size of 12.5 ms. To match this timing, the conditioning input is upsampled using a stack of transposed convolution layers to match the temporal resolution of the audio waveform (24,000 samples per second).
- Output Distribution: Instead of a softmax over 256 quantized values like the original WaveNet, this model uses a Mixture of 10 Logistic distributions (MoL) to generate 16-bit audio samples. This approach, inspired by PixelCNN++, allows for more expressive modeling of the continuous waveform and is more efficient.
- Loss Function: The WaveNet is trained to maximize the likelihood of the ground truth audio sample given the mel spectrogram and previous audio samples. The loss is the negative log-likelihood of the ground truth sample under the predicted mixture distribution. The probability density of a sample $x$ is given by:
$
p(x \mid c) = \sum_{i=1}^{K} \pi_i(c) \cdot \text{Logistic}(x \mid \mu_i(c), s_i(c))
$
Where:
- $K$ is the number of mixture components.
- $c$ is the conditioning input derived from the mel spectrogram.
- $\pi_i(c)$, $\mu_i(c)$, and $s_i(c)$ are the mixture weight, mean, and scale parameter for the $i$-th logistic component, respectively. These parameters are predicted by the WaveNet stack.
- $\text{Logistic}(\cdot)$ is the probability density function of the logistic distribution.
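A sketch of the negative log-likelihood under the logistic mixture density above, treating samples as continuous values. The actual model uses a discretized mixture over 16-bit samples (as in PixelCNN++); this simplified continuous version only illustrates the formula, with the WaveNet-predicted parameters passed in as tensors.

```python
import torch
import torch.nn.functional as F

def mixture_of_logistics_nll(x, logit_pi, mu, log_s):
    """Negative log-likelihood of samples x under a mixture of logistic densities.

    x:        (batch, T) audio samples
    logit_pi: (batch, T, K) unnormalized mixture weights
    mu:       (batch, T, K) component means
    log_s:    (batch, T, K) log of component scale parameters
    """
    x = x.unsqueeze(-1)                              # (batch, T, 1), broadcast over components
    z = (x - mu) * torch.exp(-log_s)                 # standardized value (x - mu) / s
    # Log density of a logistic distribution: -z - 2*softplus(-z) - log(s)
    log_prob = -z - 2.0 * F.softplus(-z) - log_s
    log_pi = F.log_softmax(logit_pi, dim=-1)         # normalize mixture weights
    log_mix = torch.logsumexp(log_pi + log_prob, dim=-1)   # (batch, T)
    return -log_mix.mean()
```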
5. Experimental Setup
5.1. Datasets
- Dataset: The models were trained on an internal US English dataset.
- Scale and Characteristics: The dataset contains 24.6 hours of speech from a single professional female speaker.
- Data Format: The text data is normalized. For example, numbers and symbols are spelled out (e.g., "16" is written as "sixteen"). This simplifies the learning task for the model, as it doesn't need to learn these normalization rules.
- Choice of Dataset: Using a high-quality, single-speaker dataset is a standard practice for training TTS systems, as it allows the model to learn a consistent voice without the added complexity of modeling speaker identity. The professional quality of the recordings ensures the target audio is clean and well-articulated, providing a strong learning signal.
5.2. Evaluation Metrics
The primary evaluation metric used is the Mean Opinion Score (MOS).
- Conceptual Definition: MOS is a subjective measure of speech quality. Human listeners are asked to rate the quality of synthesized audio samples on a numerical scale. In this paper, the scale is from 1 (bad) to 5 (excellent) with 0.5-point increments. The scores from multiple raters are then averaged to produce a single MOS value for each system. It is considered the gold standard for evaluating TTS quality because it directly measures human perception of naturalness.
- Mathematical Formula: $ \text{MOS} = \frac{\sum_{n=1}^{N} \sum_{r=1}^{R_n} S_{nr}}{ \sum_{n=1}^{N} R_n } $
- Symbol Explanation:
  - $N$: The total number of speech samples being evaluated.
  - $R_n$: The number of raters who scored the $n$-th sample.
  - $S_{nr}$: The score given by the $r$-th rater to the $n$-th sample.

The paper also uses side-by-side preference tests, where raters listen to two samples (e.g., ground truth vs. synthesized) and indicate their preference on a scale (e.g., -3 to +3). This provides a direct comparison of two systems.
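For completeness, a tiny sketch of the MOS formula above: a rating-weighted average over all samples, using made-up example scores.

```python
def mean_opinion_score(ratings_per_sample):
    """Average all individual ratings across samples, per the MOS formula above.

    ratings_per_sample: list of lists, where each inner list holds the scores
    (1.0-5.0, in 0.5 increments) given by raters to one synthesized sample.
    """
    all_scores = [s for sample in ratings_per_sample for s in sample]
    return sum(all_scores) / len(all_scores)

# Example with made-up ratings for three samples:
print(mean_opinion_score([[4.5, 5.0, 4.0], [4.5, 4.5], [5.0, 4.0, 4.5]]))  # 4.5
```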
5.3. Baselines
The authors compare their proposed Tacotron 2 system against several strong baselines:
- Parametric: A statistical parametric speech synthesis system based on LSTMs, representing a strong conventional SPSS baseline used in production at Google.
- Concatenative: A high-quality unit selection concatenative synthesis system, also used in production at Google, which was considered state-of-the-art for many years.
- Tacotron (Griffin-Lim): The original Tacotron model that predicts linear spectrograms and uses the Griffin-Lim algorithm for vocoding. This isolates the benefit of using a neural vocoder.
- WaveNet (Linguistic): A WaveNet vocoder conditioned on traditional linguistic features (phonemes, durations, F0), similar to the original WaveNet setup. This isolates the benefit of using mel spectrograms as the conditioning input.
- Ground truth: The original professional audio recordings. This serves as the upper bound for achievable quality.
6. Results & Analysis
6.1. Core Results Analysis
The main results demonstrate the superiority of the Tacotron 2 system.
The following are the results from Table 1 of the original paper:
| System | MOS |
|---|---|
| Parametric | 3.492 ± 0.096 |
| Tacotron (Griffin-Lim) | 4.001 ± 0.087 |
| Concatenative | 4.166 ± 0.091 |
| WaveNet (Linguistic) | 4.341 ± 0.051 |
| Tacotron 2 (this paper) | 4.526 ± 0.066 |
| Ground truth | 4.582 ± 0.053 |
Analysis:
- Tacotron 2 achieves a MOS of 4.526, which is significantly higher than all other synthesis systems.
- It dramatically outperforms both traditional systems (Parametric and Concatenative).
- It is substantially better than the original Tacotron (Griffin-Lim), with a MOS jump from 4.00 to 4.53. This clearly validates the decision to replace the Griffin-Lim algorithm with a WaveNet vocoder.
- It also outperforms WaveNet (Linguistic) (4.34 vs. 4.53), suggesting that the end-to-end spectrogram prediction approach is more effective than using a hand-engineered linguistic feature pipeline.
- Most impressively, its MOS is very close to that of the Ground truth recordings (4.526 vs. 4.582). The confidence intervals overlap, indicating that the difference is small, though a side-by-side test reveals a slight but statistically significant preference for ground truth.

The paper also presents a side-by-side evaluation between Tacotron 2 and ground truth audio. The results are shown in the figure below.

Analysis of Side-by-Side Comparison:
- The mean score was -0.270, indicating a small but statistically significant preference for ground truth audio.
- The most common rating was "About the Same," showing that in many cases, raters could not reliably distinguish the synthesized speech from real human speech.
- The authors note that the primary reason for the preference towards ground truth was occasional mispronunciations by the Tacotron 2 system, a known failure mode for attention-based end-to-end models.
6.2. Ablation Studies / Parameter Analysis
6.2.1. Predicted Features versus Ground Truth
This study investigates whether it's better to train the WaveNet vocoder on ground truth mel spectrograms or on the (imperfect) spectrograms predicted by the feature prediction network.
The following are the results from Table 2 of the original paper:
| Training | Synthesis (Predicted) | Synthesis (Ground truth) |
|---|---|---|
| Predicted | 4.526 ± 0.066 | 4.449 ± 0.060 |
| Ground truth | 4.362 ± 0.066 | 4.522 ± 0.055 |
Analysis:
- The best performance is achieved when the training condition matches the synthesis (inference) condition. The highest MOS (4.526) comes from training on predicted features and synthesizing from predicted features.
- Crucially, when the vocoder is trained on Ground truth spectrograms but used to synthesize from Predicted ones, the quality drops significantly (MOS 4.362).
- The reason is that the spectrograms predicted by the Seq2Seq model tend to be oversmoothed due to the MSE loss. A vocoder trained only on clean, detailed ground truth spectrograms does not learn to handle these smoother, less detailed inputs effectively. By training the vocoder on the actual type of data it will see during inference, it becomes robust to these imperfections, leading to better final audio quality.
6.2.2. Linear Spectrograms vs. Mel Spectrograms
This study compares using 80-dimensional mel spectrograms versus more detailed 1,025-dimensional linear-frequency spectrograms as the intermediate representation.
The following are the results from Table 3 of the original paper:
| System | MOS |
|---|---|
| Tacotron 2 (Linear + G-L) | 3.944 ± 0.091 |
| Tacotron 2 (Linear + WaveNet) | 4.510 ± 0.054 |
| Tacotron 2 (Mel + WaveNet) | 4.526 ± 0.066 |
Analysis:
- As expected, WaveNet as a vocoder is vastly superior to Griffin-Lim (G-L), even when both use linear spectrograms (4.510 vs. 3.944).
- There is no significant difference in quality between using linear-scale and mel-scale spectrograms when both are paired with a WaveNet vocoder (4.510 vs. 4.526).
- This finding is very important: since mel spectrograms are a much more compact representation (80 dimensions vs. 1025), they are a strictly better choice. They achieve the same quality while being easier and faster to predict and process.
6.2.3. Importance of the Post-Processing Network
The authors tested the impact of removing the 5-layer convolutional post-net from the spectrogram prediction network.
Analysis:
- Without the post-net, the model's MOS dropped from 4.526 to 4.429.
- This demonstrates that the post-net is an important component. Even though the WaveNet vocoder has a large receptive field, improving the quality of the intermediate spectrogram before it is passed to the vocoder still provides a tangible benefit to the final audio quality. The post-net helps refine the spectrogram by using non-causal context, correcting errors made by the autoregressive decoder.
6.2.4. Simplifying WaveNet
This study tests the hypothesis that conditioning on rich mel spectrograms allows for a simpler WaveNet architecture. They varied the number of layers and the receptive field size.
The following are the results from Table 4 of the original paper:
| Total layers | Num cycles | Dilation cycle size | Receptive field (samples / ms) | MOS |
|---|---|---|---|---|
| 30 | 3 | 10 | 6,139 / 255.8 | 4.526 ± 0.066 |
| 24 | 4 | 6 | 505 / 21.0 | 4.547 ± 0.056 |
| 12 | 2 | 6 | 253 / 10.5 | 4.481 ± 0.059 |
| 30 | 30 | 1 | 61 / 2.5 | 3.930 ± 0.076 |
Analysis:
- The results confirm the hypothesis. A much smaller WaveNet with only 12 layers and a receptive field of just 10.5 ms still achieves a very high MOS of 4.481.
- A model with 24 layers and a 21.0 ms receptive field even slightly outperformed the baseline 30-layer model.
- This shows that when conditioned on mel spectrograms, which already encode long-term dependencies, the WaveNet vocoder does not need an extremely large receptive field to produce high-quality audio. Its main job becomes high-fidelity local waveform generation.
- However, completely eliminating dilated convolutions (the last row, where the receptive field is only 2.5 ms) causes a drastic drop in quality (MOS 3.930). This indicates that some local context at the waveform level is still essential for the vocoder to function properly.
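The receptive fields in Table 4 can be reproduced with a short calculation. The sketch below assumes each dilated convolution has a kernel width of 3 (an assumption, but it is the width consistent with the numbers in the table) and that the dilation rate doubles from 1 up to 2^(cycle size - 1) within each cycle.

```python
def receptive_field(num_cycles: int, cycle_size: int, kernel_size: int = 3,
                    sample_rate: int = 24000) -> tuple[int, float]:
    """Receptive field of a stack of dilated convolutions, in samples and milliseconds."""
    dilations = [2 ** k for k in range(cycle_size)] * num_cycles
    samples = 1 + (kernel_size - 1) * sum(dilations)
    return samples, 1000.0 * samples / sample_rate

# Rows of Table 4: (num cycles, dilation cycle size)
for cycles, size in [(3, 10), (4, 6), (2, 6), (30, 1)]:
    print(cycles * size, receptive_field(cycles, size))
# -> 30 layers: 6139 samples / ~255.8 ms; 24: 505 / ~21.0 ms;
#    12: 253 / ~10.5 ms; 30 (no dilation): 61 / ~2.5 ms
```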
7. Conclusion & Reflections
7.1. Conclusion Summary
The paper successfully introduces Tacotron 2, a fully neural text-to-speech system that achieves a new state-of-the-art in speech quality. By combining a sequence-to-sequence model for mel spectrogram prediction with a WaveNet-based neural vocoder, the system synthesizes speech that is nearly indistinguishable from professional human recordings.
The key findings are that (1) this two-stage architecture effectively marries the simplicity of end-to-end models with the audio fidelity of neural vocoders, and (2) using mel spectrograms as an intermediate representation is highly effective, enabling simplification of the WaveNet vocoder while maintaining top-tier performance. The system can be trained directly from text and audio data, eliminating the need for complex, hand-engineered feature extraction pipelines.
7.2. Limitations & Future Work
The authors acknowledge some limitations and areas for improvement:
- Prosody Modeling: While generally natural, the model sometimes produces unnatural prosody, such as placing emphasis on the wrong word or syllable. They note there is still room for improvement in modeling the more subtle aspects of prosody.
- Error Modes: The system can still make occasional errors, most notably mispronunciations, especially for rare words, names, or out-of-domain text. This points to the brittleness of attention mechanisms in end-to-end models.
- Generalization: The model's performance relies on the coverage of its training data. When faced with out-of-domain text (like news headlines containing names), it is more prone to pronunciation errors than traditional systems with explicit lexicons.
- Inference Speed: Although not a primary focus of this paper, autoregressive models like WaveNet are notoriously slow at inference time, as samples must be generated sequentially. This was a major barrier to real-time deployment (a problem later addressed by models like Parallel WaveNet).
7.3. Personal Insights & Critique
- A Landmark Paper: Tacotron 2 represents a watershed moment in TTS research. It established a blueprint for modern neural TTS systems that is still influential today. The core idea of a text-to-spectrogram model followed by a neural vocoder became the dominant paradigm for several years.
- Power of Intermediate Representations: The paper's most significant insight is arguably the strategic use of the mel spectrogram as an intermediate representation. It perfectly balances several competing needs: it's simple enough for a Seq2Seq model to predict robustly, yet rich enough for a vocoder to reconstruct high-quality audio. This "divide and conquer" strategy proved much more effective than a single, monolithic end-to-end model trying to predict raw audio directly from text.
- Unverified Assumptions: The model's reliance on a single-speaker, professionally recorded, and highly curated dataset is a key factor in its success. The performance on more diverse, multi-speaker, or "in-the-wild" data is not explored and would likely be much lower without further adaptation. This highlights a general challenge for data-driven models: their quality is fundamentally bound by the quality and scope of their training data.
- Future Impact: The success of Tacotron 2 spurred a massive amount of research in two directions:
- Improving the Feature Predictor: Work on more robust attention mechanisms, better prosody control (e.g., through explicit prosody embeddings), and faster, non-autoregressive spectrogram prediction (e.g., FastSpeech).
- Improving the Vocoder: A major focus became developing fast, non-autoregressive neural vocoders (e.g., Parallel WaveNet, WaveGlow, MelGAN) to solve the slow inference speed of the original WaveNet, making high-quality neural TTS practical for real-world applications.
In essence, Tacotron 2 set the standard for quality and defined the core architectural problems that the TTS research community would work to solve for the next several years.