ClapFM-EVC: High-Fidelity and Flexible Emotional Voice Conversion with Dual Control from Natural Language and Speech
TL;DR Summary
ClapFM-EVC is introduced as a novel emotional voice conversion framework that generates high-quality speech with adjustable emotion intensity, using natural language prompts and reference speech. Key technologies include EVC-CLAP for emotional contrastive language-audio pre-training, a FuEncoder with an adaptive intensity gate for seamless emotion-content fusion, and a conditional flow matching decoder for high-fidelity Mel-spectrogram reconstruction.
Abstract
Despite great advances, achieving high-fidelity emotional voice conversion (EVC) with flexible and interpretable control remains challenging. This paper introduces ClapFM-EVC, a novel EVC framework capable of generating high-quality converted speech driven by natural language prompts or reference speech with adjustable emotion intensity. We first propose EVC-CLAP, an emotional contrastive language-audio pre-training model, guided by natural language prompts and categorical labels, to extract and align fine-grained emotional elements across speech and text modalities. Then, a FuEncoder with an adaptive intensity gate is presented to seamlessly fuse emotional features with Phonetic PosteriorGrams from a pre-trained ASR model. To further improve emotion expressiveness and speech naturalness, we propose a flow matching model conditioned on these captured features to reconstruct the Mel-spectrogram of source speech. Subjective and objective evaluations validate the effectiveness of ClapFM-EVC.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
The central topic of the paper is ClapFM-EVC: High-Fidelity and Flexible Emotional Voice Conversion with Dual Control from Natural Language and Speech. It focuses on developing an emotional voice conversion system that offers high quality and versatile control options using both natural language prompts and reference speech, with adjustable emotion intensity.
1.2. Authors
The authors and their affiliations are:
- Yu Pan (1, anyu.ztj@gmail.com) - Department of Information Science and Technology, Kyushu University, Japan
- Yanni Hu (3) - EverestAI, Ximalaya Inc., China
- Yuguang Yang (3) - EverestAI, Ximalaya Inc., China
- Jixun Yao (3) - EverestAI, Ximalaya Inc., China
- Jianhao Ye (3) - EverestAI, Ximalaya Inc., China
- Hongbin Zhou (3) - EverestAI, Ximalaya Inc., China
- Lei Ma (2, ma.lei@acm.org) - Department of Computer Science, The University of Tokyo, Japan
- Jianjun Zhao (1, zhao@ait.kyushu-u.ac.) - Department of Information Science and Technology, Kyushu University, Japan
1.3. Journal/Conference
The paper is published as a preprint on arXiv. The publication venue is not a peer-reviewed journal or conference at the time of this analysis, but a preprint server. arXiv is a well-known and influential platform for disseminating research in fields like computer science, physics, mathematics, and others, allowing researchers to share their work before formal peer review.
1.4. Publication Year
The paper was published on 2025-05-20.
1.5. Abstract
The paper introduces ClapFM-EVC, a novel framework for emotional voice conversion (EVC) that aims to generate high-fidelity converted speech with flexible and interpretable control. The system can be driven by natural language prompts or reference speech, and it allows for adjustable emotion intensity. Key methodological contributions include EVC-CLAP, an emotional contrastive language-audio pre-training model that aligns fine-grained emotional elements across speech and text using natural language prompts and categorical labels. This is followed by a FuEncoder with an adaptive intensity gate (AIG) to seamlessly fuse emotional features with Phonetic PosteriorGrams (PPGs) from an Automatic Speech Recognition (ASR) model. To further enhance emotion expressiveness and speech naturalness, a flow matching model is proposed, conditioned on these captured features, to reconstruct Mel-spectrograms. Subjective and objective evaluations confirm the effectiveness of ClapFM-EVC.
1.6. Original Source Link
- Official Source: https://arxiv.org/abs/2505.13805v1
- PDF Link: https://arxiv.org/pdf/2505.13805v1.pdf

The paper is available as a preprint on arXiv.
2. Executive Summary
2.1. Background & Motivation
The core problem the paper aims to solve is achieving high-fidelity emotional voice conversion (EVC) with flexible and interpretable control over emotional expressions. EVC is the task of changing the emotional state of a speaker's voice while preserving the linguistic content and speaker identity. This problem is crucial for various practical applications, including intelligent voice assistants, automatic audiobook production, and professional dubbing.
Existing EVC methods face several challenges:
- Limited Emotional Diversity, Naturalness, and Speech Quality: Many current systems, often based on Generative Adversarial Networks (GANs) or autoencoders, struggle to produce speech with a wide range of convincing emotional expressions, and sometimes lack overall naturalness and high speech quality.
- Constrained Control Paradigms: Traditional EVC systems typically rely on reference speech or a limited set of categorical text labels (e.g., "happy," "sad") as conditions for emotion control. This approach restricts the diversity of achievable emotional expressions, lacks intuitiveness for users, and offers limited interpretability regarding the nuanced emotions conveyed.
- Lack of Fine-Grained Intensity Control: While some studies have introduced intensity control modules, there is still potential for more precise manipulation of emotional expression and its intensity.
The paper's entry point and innovative idea is to address these issues by introducing ClapFM-EVC, a framework that leverages natural language prompts for intuitive, flexible, and fine-grained emotional control, combined with reference speech control, and incorporates an adaptive intensity gate. It also aims to improve speech naturalness and quality using a conditional flow matching (CFM) model.
2.2. Main Contributions / Findings
The paper's primary contributions are:
- ClapFM-EVC Framework: Introduction of a novel, high-fidelity, any-to-one EVC framework that supports dual control from both natural language prompts and reference speech, offering flexible and interpretable emotion manipulation.
- EVC-CLAP Model: Proposal of an emotional contrastive language-audio pre-training model, EVC-CLAP, which is guided by both natural language prompts and categorical emotion labels. This model is designed to extract and align fine-grained emotional elements across speech and text modalities effectively.
- FuEncoder with Adaptive Intensity Gate (AIG): Development of a FuEncoder that seamlessly fuses content features (from a pre-trained ASR model) with emotional embeddings (from EVC-CLAP). The AIG module within the FuEncoder provides flexible and explicit control over the intensity of the converted emotion.
- Conditional Flow Matching (CFM)-based Decoder: Integration of a CFM model to reconstruct Mel-spectrograms. This component is crucial for enhancing the expressiveness of the converted emotion and improving the overall naturalness and quality of the synthesized speech.
- Comprehensive Evaluation: Extensive subjective and objective experiments and ablation studies were conducted to validate the effectiveness of ClapFM-EVC, demonstrating its superiority over existing EVC approaches in terms of emotional expressiveness, speech naturalness, and speech quality.

The key conclusions and findings are that ClapFM-EVC successfully addresses the challenges of emotional diversity, naturalness, and control in EVC. It achieves state-of-the-art performance, demonstrating a remarkable capability to precisely capture and effectively transfer target emotional characteristics, while maintaining superior perceptual quality. The dual control mechanism and adaptive intensity gate offer significant advancements in user experience and interpretability compared to prior methods.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand the ClapFM-EVC paper, a grasp of several fundamental concepts in speech processing and deep learning is necessary:
- Emotional Voice Conversion (EVC): The core task of transforming the emotional style of a source speech utterance to a target emotion, while preserving the original linguistic content and speaker identity (timbre). For example, converting a "neutral" sentence into a "happy" one spoken in the same speaker's voice.
- Generative Models (e.g., GANs, Autoencoders, Flow-based Models):
  - Generative Adversarial Networks (GANs): A class of artificial intelligence algorithms used in unsupervised machine learning. A GAN consists of two neural networks, a generator and a discriminator, competing against each other. The generator learns to create new data instances that resemble the training data, while the discriminator learns to distinguish between real data and data generated by the generator. Through this adversarial process, the generator improves its ability to create realistic data.
  - Autoencoders: Neural networks designed for unsupervised learning of efficient data codings (representations). An autoencoder has an encoder that compresses input data into a latent-space representation (a lower-dimensional feature vector) and a decoder that reconstructs the input data from this latent representation. They are often used for dimensionality reduction, feature learning, and data generation.
  - Flow-based Models: A class of generative models that explicitly model the probability density function of complex data distributions. They achieve this by learning a sequence of invertible transformations (a "flow") that map a simple base distribution (like a Gaussian) to the complex data distribution. The invertibility allows for exact likelihood computation and efficient sampling. Conditional Flow Matching (CFM) is a recent advancement in this area, focusing on learning paths between distributions rather than direct mappings, often leading to more stable training and better sample quality.
- Contrastive Learning: A self-supervised learning paradigm where a model learns to pull "similar" samples closer in an embedding space while pushing "dissimilar" samples further apart. In CLAP, this is applied to align representations from different modalities (e.g., audio and text) by ensuring that corresponding audio-text pairs are close and non-corresponding pairs are far apart.
- Kullback-Leibler (KL) Divergence: A non-symmetric measure of the difference between two probability distributions $P$ and $Q$. It quantifies how much information is lost when $Q$ is used to approximate $P$. A lower KL divergence indicates that the distributions are more similar.
  $ \mathrm{KL}(P \| Q) = \sum_x P(x) \log\left(\frac{P(x)}{Q(x)}\right) $
  Where $P(x)$ and $Q(x)$ are the probabilities of event $x$ according to distributions $P$ and $Q$.
- Phonetic PosteriorGrams (PPGs): A sequence of probability distributions over phonetic units (e.g., phonemes or sub-phonemes) extracted from speech. PPGs are often derived from the output of an Automatic Speech Recognition (ASR) model's acoustic encoder. They are considered speaker-independent representations of linguistic content, meaning they capture what is being said, rather than who is saying it or how it is being said.
- Mel-spectrogram: A common acoustic feature representation of speech. It is a time-frequency representation where the frequencies are mapped to the Mel scale, which better approximates human auditory perception. It is often the target output for speech synthesis and voice conversion models.
- Vocoder: A component that converts acoustic features (like Mel-spectrograms) back into raw audio waveforms. It is essential for synthesizing audible speech from the model's output. Common vocoders include WaveNet, WaveGlow, HiFi-GAN, and BigVGAN.
- Pre-trained Models (e.g., HuBERT, XLMRoBERTa):
  - HuBERT (Hidden Unit Bidirectional Encoder Representations from Transformers): A self-supervised speech representation learning model. It learns robust speech features by masking portions of the input speech and predicting discrete hidden units (clusters of speech features) from the unmasked parts. It is often used as a powerful audio encoder.
  - XLMRoBERTa: A large, multilingual Transformer-based language model trained on a massive corpus of text in 100 languages. It is used as a powerful text encoder to generate contextualized embeddings for text inputs, including natural language prompts.
3.2. Previous Works
The paper discusses several categories of previous work in EVC:
- GANs and Autoencoder-based EVC:
  - StarGAN [11]: An early influential work that adapted StarGAN (originally for image-to-image translation) to EVC. It used a cycle-consistent and class-conditional GAN architecture to convert emotional states.
  - AINN [16]: An attention-based interactive disentangling network that used a two-stage pipeline for fine-grained EVC.
  - Limitation: While promising, these systems often lacked emotional diversity, naturalness, and high speech quality, particularly when dealing with a wide range of emotions or subtle nuances.
- EVC with Intensity Control:
  - Emovox [18]: This system focused on disentangling speaker style and encoding emotion in a continuous space, allowing for some control over emotional intensity.
  - EINet [17]: Predicted emotional class and intensity using an emotion evaluator and intensity mapper, aiming to enhance naturalness and diversity through controllable emotional intensity.
  - Limitation: These methods typically relied on reference speech or categorical labels, limiting flexibility and the diversity of expressible emotions. They also did not fully address the issues of intuitiveness and interpretability.
- Content-based ASR models for speech features:
  - HybridFormer [20]: A pre-trained ASR model used in the paper to extract Phonetic PosteriorGrams (PPGs) as content features. It improves Squeezeformer with hybrid attention and NSR mechanisms.
- Deep Learning Components:
  - ResNet [29]: Residual Networks are deep neural networks that introduce "skip connections" or "residual connections" to allow gradients to flow through the network more easily, enabling the training of much deeper models without performance degradation.
  - Multi-head Self-Attention [30]: A key mechanism from the Transformer architecture. It allows a model to weigh the importance of different parts of the input sequence when processing each element. Multi-head means the attention mechanism is run multiple times in parallel, each with different linear projections, to capture different aspects of relationships (a minimal code sketch follows this list).
    $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $
    Where:
    - $Q$: Query matrix.
    - $K$: Key matrix.
    - $V$: Value matrix.
    - $d_k$: Dimensionality of the key vectors.
    - The dot product $QK^T$ measures the similarity between queries and keys. The softmax normalizes these scores, and the result is multiplied by $V$ to obtain the weighted sum of values.
  - FiLM (Feature-wise Linear Modulation) [31]: A general conditioning layer that applies an affine transformation (scaling and shifting) to the features of a neural network layer based on a conditioning input. It allows a model to dynamically adapt its computations based on external information, such as emotional embeddings in this case.
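To make the attention formula above concrete, here is a minimal scaled dot-product attention sketch in PyTorch. The tensor shapes and names are illustrative only and are not taken from the paper; a multi-head version would split the model dimension into several heads, apply this function per head, and concatenate the results.

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q: torch.Tensor, K: torch.Tensor, V: torch.Tensor) -> torch.Tensor:
    """Compute softmax(Q K^T / sqrt(d_k)) V for tensors shaped (batch, seq_len, d_k)."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # (batch, seq_q, seq_k)
    weights = F.softmax(scores, dim=-1)                # attention weights sum to 1 over keys
    return weights @ V                                 # weighted sum of values

# Illustrative usage: a batch of 2 sequences, 10 frames, 64-dimensional keys/values.
Q = torch.randn(2, 10, 64)
K = torch.randn(2, 10, 64)
V = torch.randn(2, 10, 64)
out = scaled_dot_product_attention(Q, K, V)  # shape (2, 10, 64)
```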
3.3. Technological Evolution
The field of EVC has evolved significantly:
- Early Statistical Methods: Initially, EVC relied on statistical models like Gaussian Mixture Models (GMMs) to map acoustic features between different emotional states. These methods often struggled with naturalness and disentangling attributes.
- Deep Learning Wave (GANs & Autoencoders): The advent of deep learning brought models like StarGAN, CycleGAN, and Variational Autoencoders (VAEs) to EVC. These models showed improved performance in converting emotions and preserving speaker identity. However, they faced challenges in generating diverse emotions, maintaining high speech quality, and offering flexible control.
- Disentanglement and Intensity Control: Researchers started focusing on better disentangling content, timbre, and emotion. Modules for controlling emotion intensity emerged, allowing for finer-grained manipulation beyond simple categorical shifts.
- Reference-based and Language-based Control: The push for more intuitive control led to reference-based EVC (converting to the emotion of a given audio sample) and, more recently, the integration of natural language prompts to specify target emotions.
- Advanced Generative Models: The adoption of powerful generative models like Diffusion Models and Flow-based Models (e.g., Conditional Flow Matching) has become a key trend to achieve higher speech quality and naturalness in various speech synthesis tasks, including EVC.
- Self-supervised Pre-training and Multimodality: Leveraging large-scale self-supervised pre-trained models (like HuBERT for audio and XLMRoBERTa for text) has become standard for robust feature extraction. Combining modalities through contrastive learning (like CLAP for audio-text alignment) allows for richer, more interpretable control.

This paper's work fits within the latest stages of this evolution, specifically at the intersection of advanced generative models (flow matching), multimodal learning (audio-text contrastive pre-training), and flexible, intuitive control paradigms (natural language prompts with intensity adjustment).
3.4. Differentiation Analysis
Compared to the main methods in related work, ClapFM-EVC introduces several core differences and innovations:
- Dual Control Modalities: Unlike most prior methods that rely solely on reference speech or a limited set of categorical labels, ClapFM-EVC offers a flexible dual control mechanism. Users can specify emotions using either a natural language prompt (e.g., "speak in a slightly worried tone") or a reference speech utterance. This significantly enhances user experience and the diversity of achievable emotional expressions.
- Fine-Grained Emotion Capture via Natural Language: The proposed EVC-CLAP model is specifically designed to extract and align fine-grained emotional elements across speech and text modalities. This allows the system to understand and leverage the nuanced emotional information conveyed by descriptive natural language prompts, going beyond predefined emotional categories.
- Soft-Labels-Guided Contrastive Pre-training: EVC-CLAP utilizes a novel soft-labels-guided training strategy with the symKL-loss. This approach combines categorical labels with natural language prompt labels, making the emotional embeddings more robust and discriminative, as demonstrated by the ablation studies.
- Adaptive Emotion Intensity Control (AIG): The FuEncoder integrates an Adaptive Intensity Gate (AIG), which allows explicit and flexible control over the emotional intensity of the converted speech. This moves beyond simply changing emotions to allowing users to dial up or down the strength of the target emotion.
- High-Quality Speech Generation with Conditional Flow Matching: By employing a Conditional Flow Matching (CFM)-based decoder, ClapFM-EVC aims for superior speech naturalness and quality. CFM models are known for their ability to learn optimal transport paths between distributions, often leading to better generative performance than GANs or VAEs, which can sometimes suffer from mode collapse or blurry outputs.
- Any-to-One EVC: The framework is designed for any-to-one EVC, meaning it can convert emotional expressions from any source speech to a fixed target speaker's voice, allowing for consistent voice output.

In essence, ClapFM-EVC differentiates itself by offering a more intuitive, flexible, and high-fidelity EVC solution through the synergistic combination of multimodal contrastive learning, adaptive intensity control, and state-of-the-art flow-based generative modeling.
4. Methodology
4.1. Principles
The core idea behind ClapFM-EVC is to build a conditional latent model that effectively disentangles and controls various speech attributes. It operates on the principle of separating linguistic content from emotional information, allowing for independent manipulation of emotion while preserving content and speaker identity. The framework leverages a multi-stage approach:
- Emotional Feature Extraction and Alignment: Use a contrastive learning framework (EVC-CLAP) to extract fine-grained emotional embeddings that are aligned across both audio and natural language text modalities. This allows for flexible control using either input type.
- Content-Emotion Fusion with Intensity Control: Integrate the extracted emotional embeddings with linguistic content features (PPGs) in a FuEncoder. A crucial aspect is the Adaptive Intensity Gate (AIG), which allows explicit control over the strength of the target emotion during this fusion process.
- High-Fidelity Speech Reconstruction: Employ a Conditional Flow Matching (CFM) model to reconstruct the Mel-spectrogram of the target speech. CFM is chosen for its ability to generate high-quality, natural-sounding speech by learning a continuous transformation path from simple noise to the complex data distribution of Mel-spectrograms.
- Waveform Synthesis: Finally, a pre-trained vocoder converts the reconstructed Mel-spectrogram into the final audio waveform.
This architecture ensures that emotional control is flexible, interpretable, and high-fidelity, addressing the limitations of prior EVC systems.
4.2. Core Methodology In-depth (Layer by Layer)
The ClapFM-EVC framework is characterized as a conditional latent model, integrating several key components: EVC-CLAP, FuEncoder, a CFM-based decoder, and pre-trained ASR and vocoder models. The training process involves two main stages: training EVC-CLAP and then training AdaFM-VC (comprising FuEncoder and CFM-based decoder).
4.2.1. System Overview
As illustrated in Figure 1, the overall training architecture of ClapFM-EVC involves several interconnected modules.
The following figure (Figure 1 from the original paper) shows the overall training architecture of the proposed ClapFM-EVC framework.
The figure is a schematic of the overall training architecture of the ClapFM-EVC framework. It shows the audio and text encoding modules, the fusion of emotion labels via the symKL-Loss, and the generation of converted speech through conditional flow matching.
The EVC-CLAP model is initially trained to align emotional representations across audio and text. This is achieved using a symmetric Kullback-Leibler divergence-based contrastive loss (symKL-loss) with soft labels derived from natural language prompts and their corresponding categorical emotion labels. This pre-training step allows EVC-CLAP to capture fine-grained emotional information from natural language prompts, producing emotional embeddings.
Subsequently, the AdaFM-VC model is trained. AdaFM-VC utilizes the emotional elements obtained from EVC-CLAP and content representations (Phonetic PosteriorGrams or PPGs) extracted by a pre-trained HybridFormer ASR model.
The FuEncoder within AdaFM-VC is responsible for seamlessly integrating these emotional and content characteristics. A crucial component of the FuEncoder is the Adaptive Intensity Gate (AIG) module, which explicitly controls the intensity of the emotional conversion.
Concurrently, the CFM model, which is part of AdaFM-VC, samples outputs from the FuEncoder from random Gaussian noise. Conditioned on the target emotional vector generated by EVC-CLAP, the CFM model generates the Mel-spectrogram features of the target speech.
Finally, these generated Mel-spectrogram features are fed into a pre-trained vocoder (specifically, the paper mentions [23], which refers to BigVGAN) to synthesize the converted speech waveform.
During inference, ClapFM-EVC offers three distinct modes for obtaining the target emotional embeddings:
- Reference Speech: The system can directly extract emotional embeddings from a provided reference speech utterance.
- Natural Language Prompt: Users can directly input a natural language emotional prompt (e.g., "speak with a gentle, happy tone") to guide the emotion conversion.
- Corpus Retrieval: EVC-CLAP can retrieve relevant high-quality reference speech from a pre-constructed corpus using a specified natural language emotional prompt, and then extract target emotion elements from the retrieved speech (a retrieval sketch follows this list).
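The corpus-retrieval mode can be pictured as a nearest-neighbour search in the shared EVC-CLAP embedding space: embed the prompt with the text encoder, compare it against pre-computed audio embeddings of the corpus, and pick the closest utterances. The sketch below is a hypothetical illustration under that assumption; the function name, embedding dimension, and corpus layout are placeholders, not the paper's actual implementation.

```python
import numpy as np

def retrieve_reference(prompt_embedding: np.ndarray, corpus_embeddings: np.ndarray, top_k: int = 1):
    """Return indices of corpus utterances whose audio embeddings are most
    similar (cosine similarity) to the text-prompt embedding."""
    p = prompt_embedding / np.linalg.norm(prompt_embedding)
    c = corpus_embeddings / np.linalg.norm(corpus_embeddings, axis=1, keepdims=True)
    sims = c @ p                      # cosine similarity against every corpus item
    return np.argsort(-sims)[:top_k]  # best matches first

# Hypothetical usage: 512-dimensional embeddings, 1000 pre-embedded corpus utterances.
prompt_emb = np.random.randn(512)
corpus_embs = np.random.randn(1000, 512)
best = retrieve_reference(prompt_emb, corpus_embs, top_k=3)
```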
4.2.2. Soft-Labels-Guided EVC-CLAP
The primary goal of EVC-CLAP training is to minimize the distance between data pairs that belong to the same emotional class or share similar emotional prompts, while maximizing the distance between pairs from different categories. This is a typical objective of contrastive learning.
Assume an input training data pair is denoted as $(a_i, c_i, p_i)$, where:

- $a_i$ is the source speech waveform for the $i$-th sample.
- $c_i$ is its corresponding emotional categorical label (e.g., "happy", "sad").
- $p_i$ is its corresponding natural language prompt (e.g., "A slightly sad tone").
- $i \in \{1, \dots, B\}$, where $B$ is the batch size.

The process for EVC-CLAP involves:

- Feature Compression: EVC-CLAP first utilizes pre-trained encoders to convert the raw inputs into latent representations:
  - A pre-trained HuBERT [26]-based audio encoder compresses $a_i$ into a latent audio embedding $E_i^a \in \mathbb{R}^{D}$.
  - A pre-trained XLMRoBERTa [27]-based text encoder compresses $p_i$ into a latent text embedding $E_i^t \in \mathbb{R}^{D}$. Here, $D$ is the hidden state dimension, which is specified as 512.
- Similarity Prediction: Similarity scores between audio and text embeddings are calculated as:
  $ S^{a2t} = \tau_a \cdot \left( E^a (E^t)^T \right), \quad S^{t2a} = \tau_t \cdot \left( E^t (E^a)^T \right) $
  Where:
  - $S^{a2t}$: Predicted similarity matrix where rows correspond to audio samples and columns to text prompts.
  - $S^{t2a}$: Predicted similarity matrix where rows correspond to text prompts and columns to audio samples.
  - $E^a (E^t)^T$: Dot product between the audio embeddings and the transpose of the text embeddings, forming a $B \times B$ similarity matrix.
  - $\tau_a$ and $\tau_t$: Two learnable scaling hyperparameters, empirically initialized to 2.3. Such scales are often used in contrastive learning to make the scores more discriminative.
- Soft Label Generation: To guide the contrastive learning, soft labels are derived from both the categorical emotional labels and the natural language prompt labels:
  $ Y = \alpha \, Y^{c} + (1 - \alpha) \, Y^{p} $
  Where:
  - $Y$: The combined soft ground truth similarity matrix.
  - $Y^{c}$: A ground truth matrix based on categorical emotional labels. If two data pairs in the batch have identical categorical emotional labels, their corresponding entry in $Y^{c}$ is 1; otherwise, it is 0.
  - $Y^{p}$: A ground truth matrix based on natural language prompt labels. If two data pairs in the batch have identical natural language prompt labels, their corresponding entry in $Y^{p}$ is 1; otherwise, it is 0.
  - $\alpha$: A hyperparameter used to adjust the weighting between $Y^{c}$ and $Y^{p}$, empirically set to 0.2. This means categorical labels contribute 20% to the soft label and natural language prompts contribute 80%. To ensure consistent label distributions, both $Y^{c}$ and $Y^{p}$ are normalized such that the sum of each row equals 1, effectively capturing relative similarities.
- Training Loss (symKL-loss): The training loss for EVC-CLAP is formulated as a symmetric Kullback-Leibler (KL) divergence between the predicted similarity matrices and the smoothed soft labels. Note that the abstract and the main text of the paper present the loss with different numbers of terms: the two-term form shown in the paper's formula block compares one predicted similarity matrix against the smoothed soft labels in both directions, while a fully symmetric CLAP-style loss would additionally include the corresponding terms for the text-to-audio matrix $S^{t2a}$. Strictly adhering to the two-term form, the loss can be written as:
  $ \mathcal{L}_{\mathrm{symKL}} = \mathrm{KL}\big(S^{a2t} \,\|\, \tilde{Y}\big) + \mathrm{KL}\big(\tilde{Y} \,\|\, S^{a2t}\big) $
  Where:
  - $\mathrm{KL}(\cdot \| \cdot)$: The Kullback-Leibler divergence function, which measures the divergence between the predicted similarity matrix and the ground truth similarity matrix.
  - $\tilde{Y}$: A smoothed version of the soft ground truth matrix, used to prevent potential issues with zero probabilities in the KL divergence calculation:
    $ \tilde{Y} = (1 - \epsilon) \, Y + \frac{\epsilon}{B} $
    Where:
    - $\epsilon$: A small label-smoothing hyperparameter set empirically.
    - $B$: The batch size. This term ensures that all entries are slightly positive, making the logarithm well-defined.

  This symKL-loss aims to align the predicted audio-text similarities with the combined categorical and natural language soft labels, creating robust emotional embeddings (a code sketch of this objective follows below).
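To make the soft-label construction and the symKL-style objective concrete, here is a minimal PyTorch sketch following the notation above. It assumes the predicted similarity matrices are softmax-normalized per row before the KL terms are computed, and it symmetrizes over both the audio-to-text and text-to-audio views; drop the $S^{t2a}$ terms to match the two-term form quoted above. All names are illustrative, not the paper's code.

```python
import torch
import torch.nn.functional as F

def soft_labels(cat_labels: torch.Tensor, prompt_ids: torch.Tensor, alpha: float = 0.2) -> torch.Tensor:
    """Build Y = alpha*Y_c + (1-alpha)*Y_p, row-normalized so each row sums to 1."""
    y_c = (cat_labels[:, None] == cat_labels[None, :]).float()
    y_p = (prompt_ids[:, None] == prompt_ids[None, :]).float()
    y = alpha * y_c + (1.0 - alpha) * y_p
    return y / y.sum(dim=1, keepdim=True)

def sym_kl_loss(sim_a2t: torch.Tensor, sim_t2a: torch.Tensor, y: torch.Tensor, eps: float = 1e-3) -> torch.Tensor:
    """Symmetric KL between predicted similarity distributions and smoothed soft labels.
    eps is the label-smoothing constant (the paper uses a small empirical value)."""
    b = y.size(0)
    y_smooth = (1.0 - eps) * y + eps / b           # keeps every entry strictly positive
    p_a2t = F.softmax(sim_a2t, dim=1)              # assumption: rows normalized via softmax
    p_t2a = F.softmax(sim_t2a, dim=1)

    def kl(p, q):                                  # KL(p || q), summed over rows, averaged over batch
        return (p * (p.clamp_min(1e-8).log() - q.clamp_min(1e-8).log())).sum(dim=1).mean()

    # Symmetrized over direction and over both views; omit the t2a terms for the two-term form.
    return 0.5 * (kl(p_a2t, y_smooth) + kl(y_smooth, p_a2t)
                  + kl(p_t2a, y_smooth) + kl(y_smooth, p_t2a))

# Illustrative usage: B=4 pairs with D=512 embeddings and learnable scales initialized to 2.3.
E_a = F.normalize(torch.randn(4, 512), dim=1)
E_t = F.normalize(torch.randn(4, 512), dim=1)
tau_a = tau_t = 2.3
S_a2t = tau_a * (E_a @ E_t.T)
S_t2a = tau_t * (E_t @ E_a.T)
Y = soft_labels(torch.tensor([0, 0, 1, 2]), torch.tensor([0, 1, 1, 2]))
loss = sym_kl_loss(S_a2t, S_t2a, Y)
```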
4.2.3. AdaFM-VC
The AdaFM-VC model is the end-to-end voice conversion system that takes the content features and emotional embeddings to generate the Mel-spectrogram. It consists of the FuEncoder and the Conditional Flow Matching (CFM)-based decoder.
4.2.3.1. FuEncoder with AIG
The FuEncoder is a crucial intermediate component designed to integrate content features with emotional embeddings while providing flexible control over emotion intensity via the Adaptive Intensity Gate (AIG).
The FuEncoder comprises the following sub-components:
- Preprocessing Network (PreNet): This network compresses the source content features (extracted by HybridFormer) into a latent space. It includes a dropout mechanism to prevent overfitting.
- Positional Encoding Module: This module applies sinusoidal positional encoding to the latent content features and performs element-wise addition of these positional encodings with them. This ensures that the FuEncoder can understand and utilize the sequential and structural information inherent in the content features.
- Adaptive Intensity Gate (AIG) Module: This is a novel component designed for emotion intensity control. The AIG module multiplies a learnable intensity factor by the emotional features obtained from EVC-CLAP. This multiplication effectively scales the influence of the emotional features, allowing for flexible adjustment of the emotional intensity in the converted speech (see the sketch after this list).
- Adaptive Fusion Module: This is the core of the FuEncoder, responsible for combining the content and emotion information. It consists of multiple fusion blocks. Each fusion block includes:
  - A multi-head self-attention layer: This mechanism allows the module to weigh the importance of different parts of the input sequence (content features) when integrating emotional information.
  - Two emotion adaptive layer norm layers [28]: These layers normalize features based on the emotional embeddings, ensuring that the fusion process is adaptive to the target emotion.
  - A position-wise feed-forward network layer: This applies a separate, identical transformation to each position of the sequence.
  This module generates rich embedding representations that contain both linguistic content and emotional characteristics.
- Linear Mapping Layer: A fully connected layer that maps the fused features to a tensor of shape $(B, T, D)$, where $B$ is the batch size, $T$ is the sequence length, and $D$ is the feature dimension.
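A minimal sketch of the FuEncoder ideas above, assuming the AIG is a single learnable scalar that scales the EVC-CLAP emotion embedding and that the emotion-adaptive layer norm is realized as a scale-and-shift predicted from that embedding. Module names, dimensions, and the exact block layout are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class AdaptiveIntensityGate(nn.Module):
    """Scales the emotion embedding by a learnable intensity factor."""
    def __init__(self, init_intensity: float = 1.0):
        super().__init__()
        self.intensity = nn.Parameter(torch.tensor(init_intensity))

    def forward(self, emo_emb: torch.Tensor) -> torch.Tensor:
        return self.intensity * emo_emb  # dial the emotion strength up or down

class FusionBlock(nn.Module):
    """Self-attention plus emotion-adaptive conditioning, loosely following the description."""
    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(d_model, 2 * d_model)  # emotion-adaptive layer norm params
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, content: torch.Tensor, emo_emb: torch.Tensor) -> torch.Tensor:
        scale, shift = self.to_scale_shift(emo_emb).chunk(2, dim=-1)   # each (B, d_model)
        h, _ = self.attn(content, content, content)                    # self-attention over content
        h = self.norm(content + h) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
        return h + self.ffn(h)

# Illustrative usage: PPG-like content (B=2, T=100, d=256) and a 256-dim emotion embedding.
content = torch.randn(2, 100, 256)
emo = AdaptiveIntensityGate()(torch.randn(2, 256))
fused = FusionBlock()(content, emo)
```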
4.2.3.2. Conditional Flow Matching-based Decoder
To further improve speech naturalness and quality, an optimal transport (OT)-based Conditional Flow Matching (CFM) model is incorporated. This model reconstructs the target Mel-spectrogram $x_1 \sim p_1(x)$ from standard Gaussian noise $x_0 \sim p_0(x)$.
Here's how it works:

- OT Flow: Conditioned on the emotional embeddings captured by EVC-CLAP, an OT flow is adopted to train the CFM-based decoder. This flow represents a continuous path from the initial noise distribution to the target data distribution. The decoder itself consists of 6 CFM blocks, each containing a ResNet [29] module, a multi-head self-attention [30] module, and a FiLM [31] layer, along with timestep fusion.
- Vector Field Modeling: The CFM model uses an ordinary differential equation (ODE) to model a learnable, time-dependent vector field $v_t$. This vector field approximates the optimal transport path from the simple Gaussian noise distribution $p_0(x)$ to the complex target distribution $p_1(x)$ of Mel-spectrograms:
  $ \frac{d}{dt} \phi_t(x) = v_t\big(\phi_t(x), t\big), \quad \phi_0(x) = x $
  Where:
  - $\phi_t(x)$: The position along the flow at time $t$.
  - $v_t$: The vector field, representing the velocity at which a point moves along the flow at time $t$.
  - $t$: The time variable, ranging from 0 (starting distribution) to 1 (target distribution).
  - $\phi_0(x) = x$: The initial condition, meaning the flow starts at the input $x$.
- Simplified OT Flow Trajectories: Drawing inspiration from works that suggest adopting straighter trajectories for OT flows [32], the paper simplifies the OT flow:
  $ \phi_t(x) = \mu_t + \sigma_t \, x, \quad \mu_t = t \, x_1, \quad \sigma_t = 1 - (1 - \sigma_{\min}) t $
  Where:
  - $\phi_t(x)$: The simplified flow trajectory.
  - $\mu_t$: The mean of the distribution at time $t$, linearly interpolated from 0 to $x_1$.
  - $\sigma_t$: The standard deviation of the distribution at time $t$, which decreases linearly from 1 to $\sigma_{\min}$.
  - $x_1$: The random conditioned input, i.e., the target Mel-spectrogram sample in the context of the loss function below.
  - $\sigma_{\min}$: The minimum standard deviation of the white noise introduced to perturb individual samples, empirically set to 0.0001. This ensures that the trajectories are nearly straight while maintaining some numerical stability.
- Training Loss for AdaFM-VC: The training objective for AdaFM-VC is defined as (a simplified training-step sketch follows this list):
  $ \mathcal{L}_{\mathrm{CFM}} = \mathbb{E}_{t, x_0, x_1} \left\| v_t\big(\phi_t(x_0) \mid c\big) - \big(x_1 - (1 - \sigma_{\min}) x_0\big) \right\|^2 $
  Where:
  - $\mathcal{L}_{\mathrm{CFM}}$: The training loss for AdaFM-VC.
  - $\mathbb{E}_{t, x_0, x_1}$: Expectation over time $t$, initial noise $x_0$, and target data $x_1$.
  - $x_0$: A sample drawn from the standard Gaussian noise distribution.
  - $x_1$: A sample drawn from the true (potentially non-Gaussian) data distribution of target Mel-spectrograms.
  - $t$: Time step sampled uniformly between 0 and 1.
  - $\sigma_t$: The time-dependent variance schedule from the simplified flow above, representing the noise level at time $t$ used to form $\phi_t(x_0)$.
  - $x_1 - (1 - \sigma_{\min}) x_0$: The target vector field component, i.e., the direction of the nearly straight flow from $x_0$ to $x_1$.
  - $v_t(\phi_t(x_0) \mid c)$: The vector field predicted by the CFM model, conditioned on the current state of the flow and the conditional emotion embedding $c$ extracted by EVC-CLAP. This loss function trains the CFM model to accurately predict the vector field that transforms noise into the target Mel-spectrogram, guided by the emotional features.
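As a rough illustration of the OT-based flow matching objective above, the following sketch forms the interpolated sample $\phi_t(x_0)$ and regresses a conditional vector-field network onto the straight-line target $x_1 - (1 - \sigma_{\min}) x_0$. The network `v_theta` and its conditioning interface are placeholders for the paper's 6-block CFM decoder.

```python
import torch

def cfm_training_loss(v_theta, x1: torch.Tensor, emo_emb: torch.Tensor, sigma_min: float = 1e-4) -> torch.Tensor:
    """One OT-CFM training step: sample t and noise, form the interpolant
    phi_t(x0) = (1 - (1 - sigma_min) t) x0 + t x1, and regress the predicted
    vector field onto the target u_t = x1 - (1 - sigma_min) x0."""
    b = x1.size(0)
    t = torch.rand(b, 1, 1)                        # t ~ U[0, 1], broadcast over (B, n_mels, frames)
    x0 = torch.randn_like(x1)                      # Gaussian prior sample
    phi_t = (1.0 - (1.0 - sigma_min) * t) * x0 + t * x1
    target = x1 - (1.0 - sigma_min) * x0
    pred = v_theta(phi_t, t.squeeze(-1).squeeze(-1), emo_emb)  # conditioned on time and emotion
    return ((pred - target) ** 2).mean()

# Illustrative usage with a trivial stand-in network (the real decoder uses ResNet, attention, FiLM blocks).
v_theta = lambda x, t, c: torch.zeros_like(x)
loss = cfm_training_loss(v_theta, x1=torch.randn(2, 80, 100), emo_emb=torch.randn(2, 256))
```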
5. Experimental Setup
5.1. Datasets
The authors used an internally developed expressive single-speaker Mandarin corpus for training ClapFM-EVC.
- Scale: The corpus contains 20 hours of speech data.
- Sampling Rate: The speech data is sampled at .
- Utterances: From this corpus, 12,000 utterances were specifically selected.
- Emotion Classes: These utterances represent 7 original categorical emotion classes: neutral, happy, sad, angry, fear, surprise, and disgust.
- Annotations: To ensure high-quality annotations for natural language prompts, 15 professional annotators were enlisted to provide natural language prompts for the selected waveforms. This unique annotation process is crucial for training the EVC-CLAP component.

The choice of an internally developed corpus, specifically annotated with natural language prompts, is essential because there is no open-source EVC corpus with such comprehensive emotional natural language prompts available. This dataset directly supports the paper's core innovation of natural language prompt-driven EVC.
5.2. Evaluation Metrics
To assess the performance of ClapFM-EVC, both subjective and objective evaluations were conducted for speech quality and emotion similarity.
5.2.1. Objective Metrics
- Mel-cepstral Distortion (MCD):
  - Conceptual Definition: MCD quantifies the spectral distortion between two speech signals, typically between synthesized/converted speech and its corresponding ground truth or reference. It measures the average Euclidean distance between Mel-cepstral coefficient (MCC) vectors, which are compact representations of the speech spectrum. Lower MCD values indicate higher speech quality and closer spectral resemblance.
  - Mathematical Formula: $ \mathrm{MCD} = \frac{10}{\ln 10} \sqrt{2 \sum_{d=1}^{D} (mc_d^{\text{converted}} - mc_d^{\text{reference}})^2} $
  - Symbol Explanation:
    - $\mathrm{MCD}$: Mel-cepstral distortion.
    - $D$: Number of Mel-cepstral coefficients (typically 12-24).
    - $mc_d^{\text{converted}}$: The $d$-th Mel-cepstral coefficient of the converted speech.
    - $mc_d^{\text{reference}}$: The $d$-th Mel-cepstral coefficient of the reference speech.
- Root Mean Squared Error (RMSE):
  - Conceptual Definition: RMSE is a frequently used measure of the differences between values predicted by a model or an estimator and the observed values. It represents the square root of the average of the squared errors. In speech processing, it can measure the average magnitude of the error between generated acoustic features (e.g., Mel-spectrograms) and their target values. Lower RMSE indicates higher accuracy.
  - Mathematical Formula: $ \mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2} $
  - Symbol Explanation:
    - $\mathrm{RMSE}$: Root Mean Squared Error.
    - $N$: The number of samples or data points.
    - $y_i$: The actual (ground truth) value for the $i$-th sample.
    - $\hat{y}_i$: The predicted or converted value for the $i$-th sample.
- Character Error Rate (CER):
  - Conceptual Definition: CER is a common metric used to evaluate the accuracy of speech recognition systems or to assess linguistic content preservation in speech conversion tasks. It measures the number of character-level errors (substitutions, deletions, insertions) required to transform the recognized or converted text into the reference text, divided by the total number of characters in the reference text. A lower CER indicates better content preservation.
  - Mathematical Formula: $ \mathrm{CER} = \frac{S + D + I}{N} $
  - Symbol Explanation:
    - $\mathrm{CER}$: Character Error Rate.
    - $S$: The number of substitution errors (a character in the converted text differs from the reference).
    - $D$: The number of deletion errors (a character present in the reference is missing in the converted text).
    - $I$: The number of insertion errors (a character present in the converted text is not in the reference).
    - $N$: The total number of characters in the reference text.
  - Calculation: The CER is calculated using a pre-trained CTC-based ASR model.
- Predicted MOS (UTMOS):
  - Conceptual Definition: UTMOS is a universal MOS predictor, a deep learning model trained to predict human Mean Opinion Score (MOS) for speech quality. It provides an objective estimate of subjective speech quality, correlating well with human perception of naturalness and overall quality. Higher UTMOS scores indicate better predicted speech quality.
- Emotion Embedding Cosine Similarity (EECS):
  - Conceptual Definition: EECS measures the cosine similarity between the emotion embeddings of the converted speech and the reference speech. Cosine similarity is a metric that determines how similar two non-zero vectors are by measuring the cosine of the angle between them. A value of 1 means identical direction (perfect similarity), 0 means orthogonality (no similarity), and -1 means opposite direction. In this context, higher EECS values indicate greater similarity in emotional content between the converted and reference speech (a small computation sketch follows this list).
  - Mathematical Formula (Cosine Similarity): $ \mathrm{CosineSimilarity}(A, B) = \frac{A \cdot B}{\|A\| \, \|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \sqrt{\sum_{i=1}^{n} B_i^2}} $
  - Symbol Explanation:
    - $A$, $B$: Two vectors representing emotion embeddings (one from converted speech, one from reference speech).
    - $A_i$, $B_i$: The $i$-th component of vectors $A$ and $B$, respectively.
    - $n$: The dimensionality of the emotion embedding vectors.
    - $A \cdot B$: The dot product of vectors $A$ and $B$.
    - $\|A\|$: The Euclidean norm (magnitude) of vector $A$, calculated as $\sqrt{\sum_{i=1}^{n} A_i^2}$.
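A small sketch of how EECS could be computed once emotion embeddings for the converted and reference utterances are available; the embedding extractor and its dimensionality are unspecified in this summary and are treated as assumptions here.

```python
import numpy as np

def eecs(converted_emb: np.ndarray, reference_emb: np.ndarray) -> float:
    """Cosine similarity between emotion embeddings of converted and reference speech."""
    num = float(np.dot(converted_emb, reference_emb))
    den = float(np.linalg.norm(converted_emb) * np.linalg.norm(reference_emb))
    return num / den

# Illustrative usage with random 192-dimensional embeddings (the dimension is arbitrary here).
print(eecs(np.random.randn(192), np.random.randn(192)))
```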
5.2.2. Subjective Metrics
- Mean Opinion Score (MOS):
  - Conceptual Definition: MOS is a widely used subjective quality measure where human listeners rate the quality of speech samples on a numerical scale. In this paper, it is used for:
    - nMOS (naturalness MOS): Measures the perceived naturalness of the converted speech.
    - eMOS (emotion similarity MOS): Measures how well the emotion in the converted speech matches the target emotion.
  - Scoring Scale: The scoring scale ranges from 1 to 5, with increments of 1. Higher scores indicate better performance.
  - Raters: 12 professional raters were invited to participate in the evaluation.
  - Confidence Interval: The results are presented with a confidence interval, indicating the reliability of the average scores.
5.3. Baselines
The ClapFM-EVC system was compared against several existing EVC methods, specifically when using reference speech as input for emotion control:
- StarGAN-EVC [11]: A pioneering EVC method utilizing a cycle-consistent and class-conditional GAN architecture.
- Seq2seq-EVC [34]: An EVC approach leveraging two-stage sequence-to-sequence training, often using Text-to-Speech (TTS) components.
- MixEmo [35]: A method that integrates speech synthesis with emotions, likely focusing on mixing or blending emotional styles.

These baselines are representative of conventional EVC approaches that primarily rely on a reference waveform to facilitate emotion conversion, making them suitable for comparison with ClapFM-EVC's reference speech mode.
5.4. Implementation Details
The training and inference procedures involved specific configurations:
- Hardware: All experiments, including training EVC-CLAP and AdaFM-VC, were conducted using 8 NVIDIA RTX 3090 GPUs.
- Optimizer: Adam optimizer.
- Initial Learning Rate: .
- Batch Size: 16.
- Epochs: Trained for 40 epochs within the PyTorch framework.
- AdaFM-VC Training:
- Optimizer: AdamW optimizer.
- Initial Learning Rate: .
- Batch Size: 32.
- Iterations: Trained over 500,000 iterations on the same GPU setup.
- Inference:
  - Mel-spectrogram Sampling: The Mel-spectrogram of the target waveform is sampled using 25 Euler steps within the CFM-based decoder. Euler steps refer to numerical integration steps for solving the ODE of the flow matching model (a generic Euler sampling sketch is given after this list).
- Mel-spectrogram Sampling: The Mel-spectrogram of the target waveform is sampled using 25 Euler steps within the
6. Results & Analysis
6.1. Core Results Analysis
The experiments aim to validate the effectiveness of ClapFM-EVC by comparing its performance against state-of-the-art baselines and through ablation studies.
6.1.1. EVC by Reference Speech
The paper first evaluates ClapFM-EVC when the emotion is controlled by a reference speech, comparing it to StarGAN-EVC, Seq2seq-EVC, and MixEmo.
The following are the results from Table 1 of the original paper:
| Model | MCD (↓) | RMSE (↓) | CER (↓) | UTMOS (↑) | nMOS (↑) | EECS (↑) | eMOS (↑) |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| StarGAN-EVC [11] | 8.85 | 19.48 | 13.07 | 1.45 | 2.09 ± 0.12 | 0.49 | 1.97 ± 0.09 |
| Seq2seq-EVC [34] | 6.93 | 15.79 | 10.56 | 1.81 | 2.52 ± 0.11 | 0.54 | 2.23 ± 0.11 |
| MixEmo [35] | 6.28 | 13.84 | 8.93 | 2.09 | 2.98 ± 0.07 | 0.65 | 2.58 ± 0.13 |
| ClapFM-EVC | 5.83 | 10.91 | 6.76 | 3.68 | 4.09 ± 0.09 | 0.82 | 3.85 ± 0.06 |
Analysis:
- Emotion Similarity (EECS, eMOS): ClapFM-EVC demonstrates significant improvements. It achieves an EECS of 0.82, a clear relative improvement over MixEmo (the best baseline at 0.65). For subjective emotion similarity (eMOS), ClapFM-EVC scores 3.85, a remarkable improvement over MixEmo (2.58). This highlights ClapFM-EVC's superior ability to capture and transfer target emotional characteristics.
- Speech Quality (MCD, RMSE, CER, UTMOS, nMOS): ClapFM-EVC consistently outperforms baselines across all objective and subjective speech quality metrics.
  - Objective: It achieves the lowest MCD (5.83), RMSE (10.91), and CER (6.76), indicating higher spectral fidelity and better preservation of linguistic content. For UTMOS, ClapFM-EVC scores 3.68, a substantial improvement over MixEmo (2.09).
  - Subjective: In terms of perceived naturalness (nMOS), ClapFM-EVC scores 4.09, a clear improvement over MixEmo (2.98). These results confirm that ClapFM-EVC produces converted speech with excellent perceptual quality while accurately conveying emotions.
6.1.2. EVC by Natural Language Prompt
To demonstrate the flexibility of ClapFM-EVC, an ABX preference test was conducted to compare the performance when using reference speech (Reference) versus natural language prompts (Prompt) for emotion control.
The following figure (Figure 2 from the original paper) shows the ABX preference test results comparing the Reference and Prompt modes.

Analysis:
- Emotion Similarity: When participants were asked to rate emotion similarity (with Reference as the benchmark, -1 for Reference better, 0 for no preference, 1 for Prompt better), the majority chose "no preference". This suggests that, for most listeners, the emotion similarity achieved by Prompt-driven conversion is comparable to that of Reference-driven conversion. Furthermore, a share of listeners favored Prompt, indicating that natural language control can be highly effective and sometimes even preferred for emotional expression.
- Speech Quality: For speech quality (closeness to ground truth), Reference and Prompt received similar preference rates, and "no preference" was chosen by over half of the participants. This indicates that ClapFM-EVC maintains high-quality EVC regardless of whether a reference speech or a natural language prompt is used for emotion control.

These ABX test results validate the dual control capability of ClapFM-EVC and demonstrate that natural language prompts can serve as an effective and high-quality alternative for emotional control.
6.2. Ablation Studies / Parameter Analysis
To understand the contribution of each proposed component, ablation studies were performed, with results summarized in Table 2. These studies used natural language prompts to drive the EVC.
The following are the results from Table 2 of the original paper:
| Model | UTMOS (↑) | nMOS (↑) | EECS (↑) | eMOS (↑) |
| :--- | :--- | :--- | :--- | :--- |
| ClapFM-EVC | 3.63 | 4.01 ± 0.06 | 0.79 | 3.72 ± 0.08 |
| w/o emo label | 3.61 | 3.96 ± 0.11 | 0.66 | 3.01 ± 0.07 |
| w/o symKL | 3.57 | 3.89 ± 0.05 | 0.71 | 3.28 ± 0.08 |
| w/o AIG | 3.25 | 3.62 ± 0.12 | 0.74 | 3.52 ± 0.05 |
Analysis:
- w/o emo label (without emotional categorical labels):
  - Impact: When categorical emotional labels are removed from EVC-CLAP training, EECS drops significantly from 0.79 to 0.66, and eMOS drops from 3.72 to 3.01. UTMOS and nMOS (speech quality metrics) remain largely unaffected (3.61 vs. 3.63 and 3.96 vs. 4.01, respectively).
  - Conclusion: This validates the effectiveness of the proposed soft-label-guided training strategy in EVC-CLAP. Combining natural language prompts with categorical labels is crucial for enhancing the model's emotional representation capacity and achieving better emotion similarity.
- w/o symKL (without symKL-loss):
  - Impact: Replacing the symKL-loss with a standard KL-loss (presumably unidirectional) results in EECS dropping from 0.79 to 0.71 and eMOS dropping from 3.72 to 3.28. Speech quality metrics (UTMOS, nMOS) also show slight reductions.
  - Conclusion: This indicates that the symmetric KL-loss is more effective than a standard KL-loss in aligning audio and text emotional features. The bidirectional nature of the symKL-loss likely contributes to a more robust and enhanced emotion representation capability within EVC-CLAP.
- w/o AIG (without Adaptive Intensity Gate):
  - Impact: Removing the AIG module leads to a notable deterioration in speech quality metrics: UTMOS drops from 3.63 to 3.25, and nMOS drops from 4.01 to 3.62. There is also a slight reduction in emotional similarity (EECS from 0.79 to 0.74, eMOS from 3.72 to 3.52).
  - Conclusion: This underscores the critical role of the AIG module. It not only enables adaptive control of emotion intensity but also significantly contributes to overall speech quality by effectively integrating content and emotional characteristics. Its absence negatively impacts both the quality and emotional expressiveness of the converted speech.

The ablation studies collectively confirm that all proposed components, namely the soft-label-guided training, the symKL-loss, and the AIG module, are vital for the superior performance of the ClapFM-EVC system, contributing to both emotional accuracy and speech quality.
7. Conclusion & Reflections
7.1. Conclusion Summary
This study successfully introduced ClapFM-EVC, an innovative and effective high-fidelity any-to-one Emotional Voice Conversion (EVC) framework. A key strength of ClapFM-EVC is its flexible and interpretable emotion control, achieved through dual input modalities: natural language prompts and reference speech, coupled with adjustable emotion intensity.
The framework's core innovations include:
- EVC-CLAP: An emotional contrastive language-audio pre-training model designed to extract and align fine-grained emotional elements across audio and text. Its training is guided by a novel symKL-loss and soft labels derived from both natural language prompts and categorical emotion labels, significantly enhancing emotional representational capacity.
- AdaFM-VC: Comprising a FuEncoder with an Adaptive Intensity Gate (AIG) and a Conditional Flow Matching (CFM)-based decoder. The FuEncoder seamlessly fuses content and emotional features, with the AIG providing explicit control over emotion intensity. The CFM decoder, combined with a pre-trained vocoder, ensures high-fidelity speech reconstruction, improving naturalness and speech quality.

Extensive experiments, including comparisons with state-of-the-art baselines and detailed ablation studies, rigorously validated that ClapFM-EVC consistently generates converted speech with precise emotion control and high speech quality, particularly when driven by natural language prompts.
7.2. Limitations & Future Work
The paper does not explicitly detail a "Limitations" or "Future Work" section. However, based on the description and common challenges in EVC research, potential limitations and future directions can be inferred:
Inferred Limitations:
- Single-Speaker Model: The current system is described as an "any-to-one" EVC framework, implying conversion to a single target speaker's voice. This means it cannot perform "any-to-any" EVC (converting to arbitrary target speakers) or "one-to-many" EVC (converting one source speaker to multiple target speakers). This limits its flexibility in scenarios requiring diverse target voices.
- Proprietary Dataset: The reliance on an "internally developed expressive single-speaker Mandarin corpus" limits reproducibility and direct comparison by other researchers who do not have access to this specific dataset. The generalization capabilities to other languages or emotional styles not present in this dataset are also not explicitly tested.
- Interpretability of Intensity Control: While the AIG module allows "adjustable emotion intensity," the paper does not extensively quantify or subjectively evaluate the range and perceptual linearity of this intensity control. Further exploration could include how different intensity levels map to human perception.
- Computational Cost: Flow matching models, especially with many sampling steps (25 Euler steps at inference), and large pre-trained encoders (HuBERT, XLMRoBERTa) can be computationally intensive during training and inference, though this is a common trade-off for high quality.
Inferred Future Work:
- Any-to-Any EVC: Extending the framework to support any-to-any voice conversion, allowing transfer of emotion to arbitrary target speakers (including zero-shot scenarios).
- Quantitative Intensity Control Evaluation: Developing more robust quantitative and qualitative evaluations for the adjustable emotion intensity, perhaps through psychological experiments or fine-grained annotation of intensity levels.
- Real-time Performance Optimization: Exploring methods to reduce the computational overhead and accelerate inference for potential real-time applications.
- Broader Emotional Range: Training on even larger and more diverse emotional datasets, potentially including more nuanced or complex emotional states beyond the 7 categorical emotions.
- Integration with Text-to-Speech (TTS): Combining ClapFM-EVC with TTS systems to enable generation of emotional speech directly from text, with fine-grained emotional control.
7.3. Personal Insights & Critique
This paper presents a significant advancement in emotional voice conversion, particularly in addressing the critical aspects of flexible control and high-fidelity output.
Personal Insights:
- Synergy of Components: The most impressive aspect is the synergistic combination of CLAP-based multimodal learning, adaptive intensity control, and flow matching. This multi-pronged approach effectively tackles the disentanglement problem in EVC, allowing for robust control over emotion while maintaining content and quality.
- Natural Language Control as a Game Changer: The ability to control emotion via natural language prompts is a substantial leap forward for user experience and interpretability. It unlocks a far richer space of emotional expression compared to discrete labels or reliance on reference audio, making EVC systems more practical and intuitive for applications like creative content generation or personalized voice assistants.
- Robust Emotional Embeddings: The soft-labels-guided training and symKL-loss for EVC-CLAP are clever techniques to create strong, discriminative emotional embeddings by leveraging both structured categorical knowledge and unstructured linguistic descriptions. This likely contributes significantly to the model's ability to interpret subtle emotional nuances.
- Flow Matching for Quality: The integration of Conditional Flow Matching is a strong choice for achieving high naturalness and speech quality, aligning with current trends in state-of-the-art generative models for audio.
Critique / Areas for Improvement:
- Any-to-One vs. Any-to-Any Clarity: While stated as "any-to-one," which is clear, the implications for practical application (i.e., needing a specific target speaker's voice) could be discussed more explicitly in the abstract or introduction, as many desire "any-to-any" EVC for maximum flexibility.
- Detailed Evaluation of Intensity Control: Although the AIG module is introduced and its contribution is shown in ablation, a more detailed evaluation of the range and perceptual effect of the intensity control would be valuable. For instance, demonstrating conversion at "low," "medium," and "high" intensity levels for a given emotion and including subjective ratings specific to intensity.
- Generalizability to Other Languages/Speakers: The use of an internal Mandarin corpus raises questions about cross-lingual transferability and generalizability to diverse speaker characteristics. While the underlying models (HuBERT, XLMRoBERTa) are often multilingual, the EVC-specific components might benefit from training on more diverse datasets. Making the dataset public or providing a benchmark for it would greatly benefit the community.
- Ablation of CFM vs. Other Generative Models: While CFM is chosen for its benefits, a direct ablation study comparing CFM with alternative generative backbones (e.g., GAN-based or VAE-based decoders, or even diffusion models if CFM is considered distinct enough) could further emphasize its contribution to quality and naturalness.
- Ethical Considerations: As with any advanced voice synthesis technology, a brief discussion on ethical implications (e.g., deepfakes, misuse) would be a valuable addition, especially for a system offering such flexible emotional control.

Overall, ClapFM-EVC represents a robust and innovative step forward in EVC, pushing the boundaries of control flexibility and output quality. Its dual control mechanism and intelligent integration of multimodal and generative techniques pave the way for more natural and user-friendly emotional speech synthesis applications.