
ClapFM-EVC: High-Fidelity and Flexible Emotional Voice Conversion with Dual Control from Natural Language and Speech

Published: 05/20/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

ClapFM-EVC is introduced as a novel emotional voice conversion framework that generates high-quality speech with adjustable emotion intensity, driven by natural language prompts or reference speech. Key technologies include EVC-CLAP for emotional contrastive language-audio pre-training, a FuEncoder with an adaptive intensity gate for seamlessly fusing emotional and content features, and a conditional flow matching decoder for Mel-spectrogram reconstruction.

Abstract

Despite great advances, achieving high-fidelity emotional voice conversion (EVC) with flexible and interpretable control remains challenging. This paper introduces ClapFM-EVC, a novel EVC framework capable of generating high-quality converted speech driven by natural language prompts or reference speech with adjustable emotion intensity. We first propose EVC-CLAP, an emotional contrastive language-audio pre-training model, guided by natural language prompts and categorical labels, to extract and align fine-grained emotional elements across speech and text modalities. Then, a FuEncoder with an adaptive intensity gate is presented to seamless fuse emotional features with Phonetic PosteriorGrams from a pre-trained ASR model. To further improve emotion expressiveness and speech naturalness, we propose a flow matching model conditioned on these captured features to reconstruct Mel-spectrogram of source speech. Subjective and objective evaluations validate the effectiveness of ClapFM-EVC.

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

The central topic of the paper is ClapFM-EVC: High-Fidelity and Flexible Emotional Voice Conversion with Dual Control from Natural Language and Speech. It focuses on developing an emotional voice conversion system that offers high quality and versatile control options using both natural language prompts and reference speech, with adjustable emotion intensity.

1.2. Authors

The authors and their affiliations are:

  • Yu Pan (1, anyu.ztj@gmail.com) - Department of Information Science and Technology, Kyushu University, Japan
  • Yanni Hu (3) - EverestAI, Ximalaya Inc., China
  • Yuguang Yang (3) - EverestAI, Ximalaya Inc., China
  • Jixun Yao (3) - EverestAI, Ximalaya Inc., China
  • Jianhao Ye (3) - EverestAI, Ximalaya Inc., China
  • Hongbin Zhou (3) - EverestAI, Ximalaya Inc., China
  • Lei Ma (2, ma.lei@acm.org) - Department of Computer Science, The University of Tokyo, Japan
  • Jianjun Zhao (1, zhao@ait.kyushu-u.ac.) - Department of Information Science and Technology, Kyushu University, Japan

1.3. Journal/Conference

The paper is published as a preprint on arXiv. The publication venue is not a peer-reviewed journal or conference at the time of this analysis, but a preprint server. arXiv is a well-known and influential platform for disseminating research in fields like computer science, physics, mathematics, and others, allowing researchers to share their work before formal peer review.

1.4. Publication Year

The paper was published on 2025-05-20.

1.5. Abstract

The paper introduces ClapFM-EVC, a novel framework for emotional voice conversion (EVC) that aims to generate high-fidelity converted speech with flexible and interpretable control. The system can be driven by natural language prompts or reference speech, and it allows for adjustable emotion intensity. Key methodological contributions include EVC-CLAP, an emotional contrastive language-audio pre-training model that aligns fine-grained emotional elements across speech and text using natural language prompts and categorical labels. This is followed by a FuEncoder with an adaptive intensity gate (AIG) to seamlessly fuse emotional features with Phonetic PosteriorGrams (PPGs) from an Automatic Speech Recognition (ASR) model. To further enhance emotion expressiveness and speech naturalness, a flow matching model is proposed, conditioned on these captured features, to reconstruct Mel-spectrograms. Subjective and objective evaluations confirm the effectiveness of ClapFM-EVC.

2. Executive Summary

2.1. Background & Motivation

The core problem the paper aims to solve is achieving high-fidelity emotional voice conversion (EVC) with flexible and interpretable control over emotional expressions. EVC is the task of changing the emotional state of a speaker's voice while preserving the linguistic content and speaker identity. This problem is crucial for various practical applications, including intelligent voice assistants, automatic audiobook production, and professional dubbing.

Existing EVC methods face several challenges:

  1. Limited Emotional Diversity, Naturalness, and Speech Quality: Many current systems, often based on Generative Adversarial Networks (GANs) or autoencoders, struggle to produce speech with a wide range of convincing emotional expressions, and sometimes lack overall naturalness and high speech quality.

  2. Constrained Control Paradigms: Traditional EVC systems typically rely on reference speech or a limited set of categorical text labels (e.g., "happy," "sad") as conditions for emotion control. This approach restricts the diversity of achievable emotional expressions, lacks intuitiveness for users, and offers limited interpretability regarding the nuanced emotions conveyed.

  3. Lack of Fine-Grained Intensity Control: While some studies have introduced intensity control modules, there is still potential for more precise manipulation of emotional expression and its intensity.

    The paper's entry point or innovative idea is to address these issues by introducing ClapFM-EVC, a framework that leverages natural language prompts for intuitive, flexible, and fine-grained emotional control, combined with reference speech control, and incorporates an adaptive intensity gate. It also aims to improve speech naturalness and quality using a conditional flow matching (CFM) model.

2.2. Main Contributions / Findings

The paper's primary contributions are:

  1. ClapFM-EVC Framework: Introduction of a novel, high-fidelity, any-to-one EVC framework that supports dual control from both natural language prompts and reference speech, offering flexible and interpretable emotion manipulation.

  2. EVC-CLAP Model: Proposal of an emotional contrastive language-audio pre-training model, EVC-CLAP, which is guided by both natural language prompts and categorical emotion labels. This model is designed to extract and align fine-grained emotional elements across speech and text modalities effectively.

  3. FuEncoder with Adaptive Intensity Gate (AIG): Development of a FuEncoder that seamlessly fuses content features (from a pre-trained ASR model) with emotional embeddings (from EVC-CLAP). The AIG module within the FuEncoder provides flexible and explicit control over the intensity of the converted emotion.

  4. Conditional Flow Matching (CFM)-based Decoder: Integration of a CFM model to reconstruct Mel-spectrograms. This component is crucial for enhancing the expressiveness of the converted emotion and improving the overall naturalness and quality of the synthesized speech.

  5. Comprehensive Evaluation: Extensive subjective and objective experiments and ablation studies were conducted to validate the effectiveness of ClapFM-EVC, demonstrating its superiority over existing EVC approaches in terms of emotional expressiveness, speech naturalness, and speech quality.

    The key conclusions and findings are that ClapFM-EVC successfully addresses the challenges of emotional diversity, naturalness, and control in EVC. It achieves state-of-the-art performance, demonstrating a remarkable capability to precisely capture and effectively transfer target emotional characteristics, while maintaining superior perceptual quality. The dual control mechanism and adaptive intensity gate offer significant advancements in user experience and interpretability compared to prior methods.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand the ClapFM-EVC paper, a grasp of several fundamental concepts in speech processing and deep learning is necessary:

  • Emotional Voice Conversion (EVC): The core task of transforming the emotional style of a source speech utterance to a target emotion, while preserving the original linguistic content and speaker identity (timbre). For example, converting a "neutral" sentence into a "happy" one spoken in the same speaker's voice.

  • Generative Models (e.g., GANs, Autoencoders, Flow-based Models):

    • Generative Adversarial Networks (GANs): A class of artificial intelligence algorithms used in unsupervised machine learning. A GAN consists of two neural networks, a generator and a discriminator, competing against each other. The generator learns to create new data instances that resemble the training data, while the discriminator learns to distinguish between real data and data generated by the generator. Through this adversarial process, the generator improves its ability to create realistic data.
    • Autoencoders: Neural networks designed for unsupervised learning of efficient data codings (representations). An autoencoder has an encoder that compresses input data into a latent-space representation (a lower-dimensional feature vector) and a decoder that reconstructs the input data from this latent representation. They are often used for dimensionality reduction, feature learning, and data generation.
    • Flow-based Models: A class of generative models that explicitly model the probability density function of complex data distributions. They achieve this by learning a sequence of invertible transformations (a "flow") that map a simple base distribution (like a Gaussian) to the complex data distribution. The invertibility allows for exact likelihood computation and efficient sampling. Conditional Flow Matching (CFM) is a recent advancement in this area, focusing on learning paths between distributions rather than direct mappings, often leading to more stable training and better sample quality.
  • Contrastive Learning: A self-supervised learning paradigm where a model learns to pull "similar" samples closer in an embedding space while pushing "dissimilar" samples further apart. In CLAP, this is applied to align representations from different modalities (e.g., audio and text) by making sure that corresponding audio-text pairs are close, and non-corresponding pairs are far apart.

  • Kullback-Leibler (KL) Divergence: A non-symmetric measure of the difference between two probability distributions P and Q. It quantifies how much information is lost when Q is used to approximate P. A lower KL divergence indicates that the distributions are more similar (a small numeric sketch follows this list). $ \mathrm{KL}(P \| Q) = \sum_x P(x) \log\left(\frac{P(x)}{Q(x)}\right) $ Where $P(x)$ and $Q(x)$ are the probabilities of event $x$ under distributions P and Q, respectively.

  • Phonetic PosteriorGrams (PPGs): A sequence of probability distributions over phonetic units (e.g., phonemes or sub-phonemes) extracted from speech. PPGs are often derived from the output of an Automatic Speech Recognition (ASR) model's acoustic encoder. They are considered speaker-independent representations of linguistic content, meaning they capture what is being said, rather than who is saying it or how it's being said.

  • Mel-spectrogram: A common acoustic feature representation of speech. It is a time-frequency representation where the frequencies are mapped to the Mel scale, which better approximates human auditory perception. It's often the target output for speech synthesis and voice conversion models.

  • Vocoder: A component that converts acoustic features (like Mel-spectrograms) back into raw audio waveforms. It's essential for synthesizing audible speech from the model's output. Common vocoders include WaveNet, WaveGlow, HiFi-GAN, or BigVGAN.

  • Pre-trained Models (e.g., HuBERT, XLMRoBERTa):

    • HuBERT (Hidden Unit Bidirectional Encoder Representations from Transformers): A self-supervised speech representation learning model. It learns robust speech features by masking portions of the input speech and predicting discrete hidden units (clusters of speech features) from the unmasked parts. It's often used as a powerful audio encoder.
    • XLMRoBERTa: A large, multi-lingual Transformer-based language model trained on a massive corpus of text in 100 languages. It's used as a powerful text encoder to generate contextualized embeddings for text inputs, including natural language prompts.
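
As a quick numeric illustration of the KL divergence defined in the list above, the following minimal Python sketch evaluates KL(P || Q) and KL(Q || P) for two small discrete distributions; the values are arbitrary examples, not from the paper.

```python
import numpy as np

# Minimal numeric illustration of KL(P || Q) = sum_x P(x) * log(P(x) / Q(x)).
# P and Q are arbitrary example distributions, not values from the paper.
P = np.array([0.7, 0.2, 0.1])
Q = np.array([0.5, 0.3, 0.2])

kl_pq = np.sum(P * np.log(P / Q))   # how poorly Q approximates P
kl_qp = np.sum(Q * np.log(Q / P))   # note the asymmetry: KL(Q || P) != KL(P || Q)

print(f"KL(P||Q) = {kl_pq:.4f}, KL(Q||P) = {kl_qp:.4f}")
```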

3.2. Previous Works

The paper discusses several categories of previous work in EVC:

  • GANs and Autoencoder-based EVC:

    • StarGAN [11]: This was an early influential work that adapted StarGAN (originally for image-to-image translation) to EVC. It used a cycle-consistent and class-conditional GAN architecture to convert emotional states.
    • AINN [16]: An attention-based interactive disentangling network that used a two-stage pipeline for fine-grained EVC.
    • Limitation: While promising, these systems often lacked emotional diversity, naturalness, and high speech quality, particularly when dealing with a wide range of emotions or subtle nuances.
  • EVC with Intensity Control:

    • Emovox [18]: This system focused on disentangling speaker style and encoding emotion in a continuous space, allowing for some control over emotional intensity.
    • EINet [17]: Predicted emotional class and intensity using an emotion evaluator and intensity mapper, aiming to enhance naturalness and diversity through controllable emotional intensity.
    • Limitation: These methods typically relied on reference speech or categorical labels, limiting flexibility and the diversity of expressible emotions. They also didn't fully address the issues of intuitiveness and interpretability.
  • Content-based ASR models for speech features:

    • HybridFormer [20]: A pre-trained ASR model mentioned in the paper, used to extract Phonetic PosteriorGrams (PPGs) as content features. It improves Squeezeformer with hybrid attention and NSR mechanisms.
  • Deep Learning Components:

    • ResNet [29]: Residual Networks are deep neural networks that introduce "skip connections" or "residual connections" to allow gradients to flow through the network more easily, enabling the training of much deeper models without performance degradation.
    • Multi-head Self-Attention [30]: A key mechanism from the Transformer architecture. It allows a model to weigh the importance of different parts of the input sequence when processing each element. "Multi-head" means the attention mechanism is run multiple times in parallel, each with different linear projections, to capture different aspects of the relationships (a minimal sketch follows this list). $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $ Where:
      • $Q$: Query matrix.
      • $K$: Key matrix.
      • $V$: Value matrix.
      • $d_k$: Dimensionality of the key vectors.
      • The dot product $QK^T$ measures the similarity between queries and keys. The softmax normalizes these scores, and the result is multiplied by $V$ to get the weighted sum of values.
    • FiLM (Feature-wise Linear Modulation) [31]: A general conditioning layer that applies an affine transformation (scaling and shifting) to the features of a neural network layer based on a conditioning input. It allows a model to dynamically adapt its computations based on external information, such as emotional embeddings in this case.
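
As referenced in the attention item above, here is a minimal PyTorch sketch of scaled dot-product attention and a FiLM conditioning layer; the tensor shapes and dimensions are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k**0.5   # (batch, T_q, T_k) similarity scores
    return F.softmax(scores, dim=-1) @ V          # weighted sum of the values

class FiLM(torch.nn.Module):
    """Feature-wise Linear Modulation: scale and shift features using a conditioning vector."""
    def __init__(self, feat_dim, cond_dim):
        super().__init__()
        self.to_gamma_beta = torch.nn.Linear(cond_dim, 2 * feat_dim)

    def forward(self, x, cond):                   # x: (B, T, feat_dim), cond: (B, cond_dim)
        gamma, beta = self.to_gamma_beta(cond).chunk(2, dim=-1)
        return gamma.unsqueeze(1) * x + beta.unsqueeze(1)

# Toy usage with illustrative sizes.
Q = K = V = torch.randn(2, 10, 64)
x, cond = torch.randn(2, 10, 80), torch.randn(2, 256)
print(scaled_dot_product_attention(Q, K, V).shape, FiLM(80, 256)(x, cond).shape)
```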

3.3. Technological Evolution

The field of EVC has evolved significantly:

  1. Early Statistical Methods: Initially, EVC relied on statistical models like Gaussian Mixture Models (GMMs) to map acoustic features between different emotional states. These methods often struggled with naturalness and disentangling attributes.

  2. Deep Learning Wave (GANs & Autoencoders): The advent of deep learning brought models like StarGAN, CycleGAN, and Variational Autoencoders (VAEs) to EVC. These models showed improved performance in converting emotions and preserving speaker identity. However, they faced challenges in generating diverse emotions, maintaining high speech quality, and offering flexible control.

  3. Disentanglement and Intensity Control: Researchers started focusing on better disentangling content, timbre, and emotion. Modules for controlling emotion intensity emerged, allowing for finer-grained manipulation beyond simple categorical shifts.

  4. Reference-based and Language-based Control: The push for more intuitive control led to reference-based EVC (converting to the emotion of a given audio sample) and more recently, the integration of natural language prompts to specify target emotions.

  5. Advanced Generative Models: The adoption of powerful generative models like Diffusion Models and Flow-based Models (e.g., Conditional Flow Matching) has become a key trend to achieve higher speech quality and naturalness in various speech synthesis tasks, including EVC.

  6. Self-supervised Pre-training and Multimodality: Leveraging large-scale self-supervised pre-trained models (like HuBERT for audio and XLMRoBERTa for text) has become standard for robust feature extraction. Combining modalities through contrastive learning (like CLAP for audio-text alignment) allows for richer, more interpretable control.

    This paper's work fits within the latest stages of this evolution, specifically at the intersection of advanced generative models (flow matching), multimodal learning (audio-text contrastive pre-training), and flexible, intuitive control paradigms (natural language prompts with intensity adjustment).

3.4. Differentiation Analysis

Compared to the main methods in related work, ClapFM-EVC introduces several core differences and innovations:

  • Dual Control Modalities: Unlike most prior methods that rely solely on reference speech or a limited set of categorical labels, ClapFM-EVC offers a flexible dual control mechanism. Users can specify emotions using either a natural language prompt (e.g., "speak in a slightly worried tone") or a reference speech utterance. This significantly enhances user experience and the diversity of achievable emotional expressions.

  • Fine-Grained Emotion Capture via Natural Language: The proposed EVC-CLAP model is specifically designed to extract and align fine-grained emotional elements across speech and text modalities. This allows the system to understand and leverage the nuanced emotional information conveyed by descriptive natural language prompts, going beyond predefined emotional categories.

  • Soft-Labels-Guided Contrastive Pre-training: EVC-CLAP utilizes a novel soft-labels-guided training strategy with symKL-loss. This approach combines categorical labels with natural language prompt labels, making the emotional embeddings more robust and discriminative, as demonstrated by the ablation studies.

  • Adaptive Emotion Intensity Control (AIG): The FuEncoder integrates an Adaptive Intensity Gate (AIG) which allows explicit and flexible control over the emotional intensity of the converted speech. This moves beyond simply changing emotions to allowing users to dial up or down the strength of the target emotion.

  • High-Quality Speech Generation with Conditional Flow Matching: By employing a Conditional Flow Matching (CFM)-based decoder, ClapFM-EVC aims for superior speech naturalness and quality. CFM models are known for their ability to learn optimal transport paths between distributions, often leading to better generative performance than GANs or VAEs, which can sometimes suffer from mode collapse or blurry outputs.

  • Any-to-One EVC: The framework is designed for any-to-one EVC, meaning it can convert emotional expressions from any source speech to a fixed target speaker's voice, allowing for consistent voice output.

    In essence, ClapFM-EVC differentiates itself by offering a more intuitive, flexible, and high-fidelity EVC solution through the synergistic combination of multimodal contrastive learning, adaptive intensity control, and state-of-the-art flow-based generative modeling.

4. Methodology

4.1. Principles

The core idea behind ClapFM-EVC is to build a conditional latent model that effectively disentangles and controls various speech attributes. It operates on the principle of separating linguistic content from emotional information, allowing for independent manipulation of emotion while preserving content and speaker identity. The framework leverages a multi-stage approach:

  1. Emotional Feature Extraction and Alignment: Use a contrastive learning framework (EVC-CLAP) to extract fine-grained emotional embeddings that are aligned across both audio and natural language text modalities. This allows for flexible control using either input type.

  2. Content-Emotion Fusion with Intensity Control: Integrate the extracted emotional embeddings with linguistic content features (PPGs) in a FuEncoder. A crucial aspect is the Adaptive Intensity Gate (AIG), which allows explicit control over the strength of the target emotion during this fusion process.

  3. High-Fidelity Speech Reconstruction: Employ a Conditional Flow Matching (CFM) model to reconstruct the Mel-spectrogram of the target speech. CFM is chosen for its ability to generate high-quality, natural-sounding speech by learning a continuous transformation path from simple noise to the complex data distribution of Mel-spectrograms.

  4. Waveform Synthesis: Finally, a pre-trained vocoder converts the reconstructed Mel-spectrogram into the final audio waveform.

    This architecture ensures that emotional control is flexible, interpretable, and high-fidelity, addressing the limitations of prior EVC systems.

4.2. Core Methodology In-depth (Layer by Layer)

The ClapFM-EVC framework is characterized as a conditional latent model, integrating several key components: EVC-CLAP, FuEncoder, a CFM-based decoder, and pre-trained ASR and vocoder models. The training process involves two main stages: training EVC-CLAP and then training AdaFM-VC (comprising FuEncoder and CFM-based decoder).

4.2.1. System Overview

As illustrated in Figure 1, the overall training architecture of ClapFM-EVC involves several interconnected modules. The following figure (Figure 1 from the original paper) shows the overall training architecture of the proposed ClapFM-EVC framework.

Figure 1: Overall training architecture of the proposed ClapFM-EVC framework. The figure is a schematic of the overall training structure of ClapFM-EVC: it shows the audio and text encoding modules, the fusion of emotion labels via the symKL-Loss, and the generation of converted speech through conditional flow matching.

The EVC-CLAP model is initially trained to align emotional representations across audio and text. This is achieved using a symmetric Kullback-Leibler divergence-based contrastive loss (symKL-loss) with soft labels derived from natural language prompts and their corresponding categorical emotion labels. This pre-training step allows EVC-CLAP to capture fine-grained emotional information from natural language prompts, producing emotional embeddings.

Subsequently, the AdaFM-VC model is trained. AdaFM-VC utilizes the emotional elements obtained from EVC-CLAP and content representations (Phonetic PosteriorGrams or PPGs) extracted by a pre-trained HybridFormer ASR model. The FuEncoder within AdaFM-VC is responsible for seamlessly integrating these emotional and content characteristics. A crucial component of the FuEncoder is the Adaptive Intensity Gate (AIG) module, which explicitly controls the intensity of the emotional conversion. Concurrently, the CFM model, which is part of AdaFM-VC, samples outputs from the FuEncoder from random Gaussian noise. Conditioned on the target emotional vector generated by EVC-CLAP, the CFM model generates the Mel-spectrogram features of the target speech. Finally, these generated Mel-spectrogram features are fed into a pre-trained vocoder (specifically, the paper mentions [23], which refers to BigVGAN) to synthesize the converted speech waveform.

During inference, ClapFM-EVC offers three distinct modes for obtaining the target emotional embeddings:

  1. Reference Speech: The system can directly extract emotional embeddings from a provided reference speech utterance.
  2. Natural Language Prompt: Users can directly input a natural language emotional prompt (e.g., "speak with a gentle, happy tone") to guide the emotion conversion.
  3. Corpus Retrieval: EVC-CLAP can retrieve relevant high-quality reference speech from a pre-constructed corpus using a specified natural language emotional prompt, and then extract target emotion elements from the retrieved speech.
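
A hedged sketch of how these three inference modes could be wired around an EVC-CLAP encoder follows; the `clap` object, its `embed_audio`/`embed_text` methods, and the retrieval logic are hypothetical illustrations under stated assumptions, not the authors' released API.

```python
import torch
import torch.nn.functional as F

def get_emotion_embedding(clap, mode, reference_wav=None, prompt=None, corpus=None):
    """Return a target emotion embedding under one of the three inference modes.

    `clap` is assumed to expose embed_audio(waveform) and embed_text(prompt),
    both returning comparable embedding vectors; these names are hypothetical.
    """
    if mode == "reference":
        # 1) Extract the emotion embedding directly from a reference utterance.
        return clap.embed_audio(reference_wav)
    if mode == "prompt":
        # 2) Embed the natural language emotional prompt itself.
        return clap.embed_text(prompt)
    if mode == "retrieval":
        # 3) Retrieve the corpus utterance whose audio embedding is closest to the
        #    prompt embedding, then use that utterance's emotion embedding.
        text_emb = clap.embed_text(prompt)
        audio_embs = torch.stack([clap.embed_audio(wav) for wav in corpus])
        best = F.cosine_similarity(audio_embs, text_emb.unsqueeze(0), dim=-1).argmax()
        return audio_embs[best]
    raise ValueError(f"unknown mode: {mode}")
```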

4.2.2. Soft-Labels-Guided EVC-CLAP

The primary goal of EVC-CLAP training is to minimize the distance between data pairs that belong to the same emotional class or share similar emotional prompts, while maximizing the distance between pairs from different categories. This is a typical objective of contrastive learning.

Assume an input data pair for training is denoted as $\{ X_i^a, X_i^y, X_i^p \}$, where:

  • $X_i^a$ is the source speech waveform for the $i$-th sample.

  • $X_i^y$ is its corresponding emotional categorical label (e.g., "happy", "sad").

  • $X_i^p$ is its corresponding natural language prompt (e.g., "A slightly sad tone").

  • $i \in [0, N]$, where $N$ is the batch size.

    The process for EVC-CLAP involves:

  1. Feature Compression: EVC-CLAP first utilizes pre-trained encoders to convert the raw inputs into latent representations:

    • A pre-trained HuBERT [26]-based audio encoder compresses $X_i^a$ into a latent audio embedding $Z_a \in \mathbb{R}^{N \times D}$.
    • A pre-trained XLMRoBERTa [27]-based text encoder compresses $X_i^p$ into a latent text embedding $Z_p \in \mathbb{R}^{N \times D}$. Here, $D$ is the hidden state dimension, which is specified as 512.
  2. Similarity Prediction: Similarity scores between audio and text embeddings are calculated as: $ S_{pred}^a = \varepsilon_a \times ( Z_a \cdot Z_p^T ), \quad S_{pred}^p = \varepsilon_t \times ( Z_p \cdot Z_a^T ) $ Where:

    • $S_{pred}^a$: Predicted similarity matrix where rows correspond to audio samples and columns to text prompts.
    • $S_{pred}^p$: Predicted similarity matrix where rows correspond to text prompts and columns to audio samples.
    • $Z_a \cdot Z_p^T$: Dot product between the audio embeddings and the transpose of the text embeddings, forming a similarity matrix.
    • $\varepsilon_a$ and $\varepsilon_t$: Two learnable scaling hyperparameters, empirically initialized to 2.3. Such scales are commonly used in contrastive learning to make the scores more discriminative.
  3. Soft Label Generation: To guide the contrastive learning, soft labels $M_{GT}^s \in \mathbb{R}^{N \times N}$ are derived from both the categorical emotional labels $X_i^y$ and the natural language prompt labels $X_i^p$: $ M_{GT}^s = \alpha_e M_{GT}^y + ( 1 - \alpha_e ) M_{GT}^p $ Where:

    • $M_{GT}^s$: The combined soft ground truth similarity matrix.
    • $M_{GT}^y$: A ground truth matrix based on categorical emotional labels. If two data pairs in the batch have identical categorical emotional labels, their corresponding entry in $M_{GT}^y$ is 1; otherwise, it is 0.
    • $M_{GT}^p$: A ground truth matrix based on natural language prompt labels. If two data pairs in the batch have identical natural language prompt labels, their corresponding entry in $M_{GT}^p$ is 1; otherwise, it is 0.
    • $\alpha_e$: A hyperparameter that adjusts the weighting between $M_{GT}^y$ and $M_{GT}^p$, empirically set to 0.2, so categorical labels contribute 20% to the soft label and natural language prompts contribute 80%. To ensure consistent label distributions, both $M_{GT}^y$ and $M_{GT}^p$ are normalized such that the sum of each row equals 1, effectively capturing relative similarities.
  4. Training Loss (symKL-loss): The training loss for EVC-CLAP is formulated as a symmetric Kullback-Leibler (KL) divergence loss over both KL directions and both modalities (a code sketch of this loss follows this list): $ L_{symKL} = \frac{1}{4} \Big( KL \big( S_{pred}^a \,\|\, M_{GT}^s \big) + KL \big( \tilde{M}_{GT}^s \,\|\, S_{pred}^a \big) + KL \big( S_{pred}^p \,\|\, M_{GT}^s \big) + KL \big( \tilde{M}_{GT}^s \,\|\, S_{pred}^p \big) \Big) $ The four terms compare each predicted similarity matrix against the soft labels and, conversely, the smoothed soft labels against each predicted matrix, which makes the objective symmetric across the two modality directions. Where:

    • $KL(S \| M)$: The Kullback-Leibler divergence function, defined as: $ KL(S \| M) = \sum_{i,j} S(i,j) \log \frac{S(i,j)}{M(i,j)} $ This calculates the divergence between a predicted similarity matrix $S$ and a ground truth similarity matrix $M$.
    • $\tilde{M}_{GT}^s$: A smoothed version of the soft ground truth matrix, used to prevent potential issues with zero probabilities in the KL divergence calculation: $ \tilde{M}_{GT}^s = ( 1 - \alpha ) \cdot M_{GT}^s + \frac{\alpha}{N} $ Where:
      • $\alpha$: A small label-smoothing hyperparameter, empirically set to $1 \times 10^{-8}$.

      • $N$: The batch size. This term ensures that all entries are slightly positive, making the logarithm well-defined.

        This symKL-loss aims to align the predicted audio-text similarities with the combined categorical and natural language soft labels, creating robust emotional embeddings.
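
The code sketch referenced in item 4 above is given below: a minimal PyTorch implementation of the soft-labels-guided symKL objective as the equations present it. The hyperparameter values (2.3, 0.2, 1e-8) come from the text; the softmax used to turn raw similarity scores into row distributions is an assumption, since the paper only specifies the scaled dot products.

```python
import torch

def kl(P, Q, eps=1e-12):
    # Elementwise KL(P || Q) = sum_{i,j} P(i,j) * log(P(i,j) / Q(i,j)), averaged over the batch.
    return (P * ((P + eps) / (Q + eps)).log()).sum() / P.size(0)

def evc_clap_symkl_loss(Z_a, Z_p, labels_cat, labels_prompt,
                        eps_a=2.3, eps_t=2.3, alpha_e=0.2, alpha=1e-8):
    """Sketch of the soft-labels-guided symKL loss.

    Z_a, Z_p: (N, D) audio / prompt embeddings; labels_*: (N,) integer ids.
    Softmax normalization of the predicted similarities is an assumption.
    """
    N = Z_a.size(0)
    S_a = torch.softmax(eps_a * Z_a @ Z_p.T, dim=-1)   # audio-to-text similarities
    S_p = torch.softmax(eps_t * Z_p @ Z_a.T, dim=-1)   # text-to-audio similarities

    M_y = (labels_cat.unsqueeze(0) == labels_cat.unsqueeze(1)).float()        # categorical match
    M_p = (labels_prompt.unsqueeze(0) == labels_prompt.unsqueeze(1)).float()  # prompt match
    M_s = alpha_e * M_y + (1 - alpha_e) * M_p
    M_s = M_s / M_s.sum(dim=-1, keepdim=True)          # normalize rows to sum to 1
    M_sm = (1 - alpha) * M_s + alpha / N               # smoothed soft labels

    return 0.25 * (kl(S_a, M_s) + kl(M_sm, S_a) + kl(S_p, M_s) + kl(M_sm, S_p))

# Toy usage with random embeddings and labels (7 emotion classes, batch of 8).
Z_a, Z_p = torch.randn(8, 512), torch.randn(8, 512)
y = torch.randint(0, 7, (8,))
print(evc_clap_symkl_loss(Z_a, Z_p, y, y).item())
```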

4.2.3. AdaFM-VC

The AdaFM-VC model is the end-to-end voice conversion system that takes the content features and emotional embeddings to generate the Mel-spectrogram. It consists of the FuEncoder and the Conditional Flow Matching (CFM)-based decoder.

4.2.3.1. FuEncoder with AIG

The FuEncoder is a crucial intermediate component designed to integrate content features with emotional embeddings while providing flexible control over emotion intensity via the Adaptive Intensity Gate (AIG).

The FuEncoder comprises the following sub-components:

  1. Preprocessing Network (PreNet): This network compresses the source content features $Z_c$ (extracted by HybridFormer) into a latent space. It includes a dropout mechanism to prevent overfitting.
  2. Positional Encoding Module: This module applies sinusoidal positional encoding to the latent content features $Z_c$ and then performs element-wise addition of these positional encodings with $Z_c$. This ensures that the FuEncoder can understand and utilize the sequential and structural information inherent in the content features.
  3. Adaptive Intensity Gate (AIG) Module: This is a novel component designed for emotion intensity control. The AIG module multiplies a learnable hyperparameter by the emotional features obtained from EVC-CLAP. This multiplication effectively scales the influence of the emotional features, allowing for flexible adjustment of the emotional intensity in the converted speech.
  4. Adaptive Fusion Module: This is the core of the FuEncoder, responsible for combining the content and emotion information. It consists of multiple fusion blocks. Each fusion block includes:
    • A multi-head self-attention layer: This mechanism allows the module to weigh the importance of different parts of the input sequence (content features) when integrating emotional information.
    • Two emotion adaptive layer norm layers [28]: These layers normalize features based on the emotional embeddings, ensuring that the fusion process is adaptive to the target emotion.
    • A position-wise feed-forward network layer: This applies a separate, identical transformation to each position of the sequence. This module generates rich embedding representations $f$ that contain both linguistic content and emotional characteristics.
  5. Linear Mapping Layer: A fully connected layer that maps the fused features to a specific dimension $f \in \mathbb{R}^{B \times T \times D}$, where $B$ is the batch size, $T$ is the sequence length, and $D$ is the feature dimension.
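
A minimal sketch of how the AIG scaling and a single emotion-adaptive fusion block might be wired is given below; the 512-dimensional emotion embedding matches the stated EVC-CLAP dimension, while the content dimension, the exact form of the emotion adaptive layer norm, and the block layout are assumptions.

```python
import torch
import torch.nn as nn

class AdaptiveIntensityGate(nn.Module):
    """Scale the EVC-CLAP emotion embedding by a learnable (or user-set) intensity."""
    def __init__(self, init_intensity=1.0):
        super().__init__()
        self.intensity = nn.Parameter(torch.tensor(init_intensity))

    def forward(self, emo):                 # emo: (B, D_emo)
        return self.intensity * emo

class EmotionAdaptiveLayerNorm(nn.Module):
    """LayerNorm whose scale/shift are predicted from the emotion embedding (assumed form)."""
    def __init__(self, dim, emo_dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.proj = nn.Linear(emo_dim, 2 * dim)

    def forward(self, x, emo):              # x: (B, T, dim), emo: (B, emo_dim)
        gamma, beta = self.proj(emo).chunk(2, dim=-1)
        return gamma.unsqueeze(1) * self.norm(x) + beta.unsqueeze(1)

class FusionBlock(nn.Module):
    """One fusion block: self-attention, two emotion-adaptive norms, and a feed-forward layer."""
    def __init__(self, dim=256, emo_dim=512, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = EmotionAdaptiveLayerNorm(dim, emo_dim)
        self.norm2 = EmotionAdaptiveLayerNorm(dim, emo_dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))

    def forward(self, content, emo):
        h = self.norm1(content + self.attn(content, content, content)[0], emo)
        return self.norm2(h + self.ffn(h), emo)

# Toy usage: PPG-like content features fused with an AIG-scaled emotion embedding.
ppg, emo = torch.randn(2, 100, 256), torch.randn(2, 512)
print(FusionBlock()(ppg, AdaptiveIntensityGate(0.5)(emo)).shape)   # torch.Size([2, 100, 256])
```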

4.2.3.2. Conditional Flow Matching-based Decoder

To further improve speech naturalness and quality, an optimal transport (OT)-based Conditional Flow Matching (CFM) model is incorporated. This model reconstructs the target Mel-spectrogram $x_1 \sim p_1(x)$ from standard Gaussian noise $x_0 \sim p_0(x) = \mathcal{N}(x; 0, I)$.

Here's how it works:

  1. OT Flow: Conditioned on the emotional embeddings $h$ captured by EVC-CLAP, an OT flow $\psi_t : [0, 1] \times \mathbb{R}^d \to \mathbb{R}^d$ is adopted to train the CFM-based decoder. This flow represents a continuous path from the initial noise distribution to the target data distribution. The decoder itself consists of 6 CFM blocks, each containing a ResNet [29] module, a multi-head self-attention [30] module, and a FiLM [31] layer, along with timestep fusion.

  2. Vector Field Modeling: The CFM model uses an ordinary differential equation (ODE) to model a learnable, time-dependent vector field $v_t : [0, 1] \times \mathbb{R}^d \to \mathbb{R}^d$. This vector field approximates the optimal transport path from the simple Gaussian noise distribution $p_0(x)$ to the complex target distribution $p_1(x)$ (Mel-spectrograms): $ \frac{d}{dt} \psi_t(x) = v_t( \psi_t(x), t ) $ Where:

    • $\psi_t(x)$: The position along the flow at time $t$.
    • $v_t( \psi_t(x), t )$: The vector field, representing the velocity at which a point $x$ moves along the flow at time $t$.
    • $t \in [0, 1]$: The time variable, ranging from 0 (starting distribution) to 1 (target distribution).
    • $\psi_0(x) = x$: The initial condition, meaning the flow starts at the input $x$.
  3. Simplified OT Flow Trajectories: Drawing inspiration from works that suggest adopting straighter trajectories for OT flows [32], the paper simplifies the OT flow formula: $ \psi_{t, z}(x) = \mu_t(z) + \sigma_t(z) x $ Where:

    • $\psi_{t, z}(x)$: The simplified flow trajectory.
    • $\mu_t(z) = t z$: The mean of the distribution at time $t$, linearly interpolated from 0 to $z$.
    • $\sigma_t(z) = 1 - (1 - \sigma_{\min}) t$: The standard deviation of the distribution at time $t$, which decreases linearly from 1 to $\sigma_{\min}$.
    • $z$: The random conditioning variable, which in the loss below is the target Mel-spectrogram sample $x_1$.
    • $\sigma_{\min}$: The minimum standard deviation of the white noise introduced to perturb individual samples, empirically set to 0.0001. This ensures that the trajectories are nearly straight while maintaining numerical stability.
  4. Training Loss for AdaFM-VC: The training objective for AdaFM-VC is defined as (a code sketch of this training step follows this list): $ \mathcal{L} = \mathbb{E}_{t, p(x_0), q(x_1)} \big\lVert \big( x_1 - ( 1 - \sigma ) x_0 \big) - v_t \big( \psi_{t, x_1}(x_0) \mid h \big) \big\rVert^2 $ Where:

    • $\mathcal{L}$: The training loss for AdaFM-VC.
    • $\mathbb{E}_{t, p(x_0), q(x_1)}$: Expectation over the time $t$, the initial noise $x_0$, and the target data $x_1$.
    • $x_0 \sim p(x_0) = \mathcal{N}(0, I)$: A sample drawn from the standard Gaussian noise distribution.
    • $x_1 \sim q(x_1)$: A sample drawn from the true (potentially non-Gaussian) data distribution of target Mel-spectrograms.
    • $t \sim U[0, 1]$: Time step sampled uniformly between 0 and 1.
    • $\sigma$: Here equal to $\sigma_{\min}$, the minimum noise level of the simplified flow; differentiating $\psi_{t, x_1}(x_0) = t x_1 + (1 - (1 - \sigma_{\min}) t) x_0$ with respect to $t$ yields the target vector field $x_1 - (1 - \sigma_{\min}) x_0$.
    • $\big( x_1 - ( 1 - \sigma ) x_0 \big)$: The target vector field component (the direction of flow from $x_0$ toward $x_1$).
    • $v_t \big( \psi_{t, x_1}(x_0) \mid h \big)$: The vector field predicted by the CFM model, conditioned on the current state of the flow $\psi_{t, x_1}(x_0)$ and the conditional emotion embeddings $h$ extracted by EVC-CLAP. This loss trains the CFM model to accurately predict the vector field that transforms noise into the target Mel-spectrogram, guided by the emotional features.
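
The code sketch referenced in item 4 above is a minimal PyTorch rendering of one OT-CFM training step; `v_theta`, its call signature, and the tensor shapes are placeholders for the 6-block ResNet/attention/FiLM decoder described earlier, not the authors' implementation.

```python
import torch

def cfm_training_loss(v_theta, x1, h, sigma_min=1e-4):
    """One OT conditional flow matching training step (sketch).

    v_theta(x_t, t, h) -> predicted vector field, same shape as x1.
    x1: target Mel-spectrograms (B, n_mels, T); h: conditioning features.
    """
    B = x1.size(0)
    x0 = torch.randn_like(x1)                              # x0 ~ N(0, I)
    t = torch.rand(B, device=x1.device).view(B, 1, 1)      # t ~ U[0, 1]

    # psi_{t,x1}(x0) = t * x1 + (1 - (1 - sigma_min) * t) * x0  (straightened OT path)
    x_t = t * x1 + (1.0 - (1.0 - sigma_min) * t) * x0
    # Target vector field: u_t = x1 - (1 - sigma_min) * x0
    u_t = x1 - (1.0 - sigma_min) * x0

    return ((v_theta(x_t, t, h) - u_t) ** 2).mean()

# Toy usage with a placeholder network that ignores t and h.
v_theta = lambda x_t, t, h: torch.zeros_like(x_t)
x1, h = torch.randn(2, 80, 120), torch.randn(2, 512)
print(cfm_training_loss(v_theta, x1, h).item())
```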

5. Experimental Setup

5.1. Datasets

The authors used an internally developed expressive single-speaker Mandarin corpus for training ClapFM-EVC.

  • Scale: The corpus contains 20 hours of speech data.
  • Sampling Rate: The speech data is sampled at 24 kHz.
  • Utterances: From this corpus, 12,000 utterances were specifically selected.
  • Emotion Classes: These utterances represent 7 original categorical emotion classes: neutral, happy, sad, angry, fear, surprise, and disgust.
  • Annotations: To ensure high-quality annotations for natural language prompts, 15 professional annotators were enlisted to provide natural language prompts for the selected waveforms. This unique annotation process is crucial for training the EVC-CLAP component. The choice of an internally developed corpus, specifically annotated with natural language prompts, is essential as there is no open-source EVC corpus with such comprehensive emotional natural language prompts available. This dataset directly supports the paper's core innovation of natural language prompt-driven EVC.

5.2. Evaluation Metrics

To assess the performance of ClapFM-EVC, both subjective and objective evaluations were conducted for speech quality and emotion similarity.

5.2.1. Objective Metrics

  • Mel-cepstral distortion (MCD):

    • Conceptual Definition: MCD quantifies the spectral distortion between two speech signals, typically between synthesized/converted speech and its corresponding ground truth or reference. It measures the average Euclidean distance between Mel-cepstral coefficient (MCC) vectors, which are compact representations of the speech spectrum. Lower MCD values indicate higher speech quality and closer spectral resemblance.
    • Mathematical Formula: $ \mathrm{MCD} = \frac{10}{\ln 10} \sqrt{2 \sum_{d=1}^{D} (mc_d^{\text{converted}} - mc_d^{\text{reference}})^2} $
    • Symbol Explanation:
      • $\mathrm{MCD}$: Mel-cepstral distortion.
      • $D$: Number of Mel-cepstral coefficients (typically 12-24).
      • $mc_d^{\text{converted}}$: The $d$-th Mel-cepstral coefficient of the converted speech.
      • $mc_d^{\text{reference}}$: The $d$-th Mel-cepstral coefficient of the reference speech.
  • Root Mean Squared Error (RMSE):

    • Conceptual Definition: RMSE is a frequently used measure of the differences between values predicted by a model or an estimator and the values observed. It represents the square root of the average of the squares of the errors. In speech processing, it can measure the average magnitude of the error between generated acoustic features (e.g., Mel-spectrograms) and their target values. Lower RMSE indicates higher accuracy.
    • Mathematical Formula: $ \mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2} $
    • Symbol Explanation:
      • $\mathrm{RMSE}$: Root Mean Squared Error.
      • $N$: The number of samples or data points.
      • $y_i$: The actual (ground truth) value for the $i$-th sample.
      • $\hat{y}_i$: The predicted or converted value for the $i$-th sample.
  • Character Error Rate (CER):

    • Conceptual Definition: CER is a common metric used to evaluate the accuracy of speech recognition systems or to assess the linguistic content preservation in speech conversion tasks. It measures the number of character-level errors (substitutions, deletions, insertions) required to transform the recognized or converted text into the reference text, divided by the total number of characters in the reference text. A lower CER indicates better content preservation.
    • Mathematical Formula: $ \mathrm{CER} = \frac{S + D + I}{N} $
    • Symbol Explanation:
      • $\mathrm{CER}$: Character Error Rate.
      • $S$: The number of substitution errors (a character in the converted text differs from the reference).
      • $D$: The number of deletion errors (a character present in the reference is missing in the converted text).
      • $I$: The number of insertion errors (a character present in the converted text is not in the reference).
      • $N$: The total number of characters in the reference text.
    • Calculation: The CER is calculated using a pre-trained CTC-based ASR model.
  • Predicted MOS (UTMOS):

    • Conceptual Definition: UTMOS is a Universal MOS predictor, a deep learning model trained to predict human Mean Opinion Score (MOS) for speech quality. It provides an objective estimate of subjective speech quality, correlating well with human perception of naturalness and overall quality. Higher UTMOS scores indicate better predicted speech quality.
  • Emotion Embedding Cosine Similarity (EECS):

    • Conceptual Definition: EECS measures the cosine similarity between the emotion embeddings of the converted speech and the reference speech. Cosine similarity is a metric that determines how similar two non-zero vectors are by measuring the cosine of the angle between them. A value of 1 means identical direction (perfect similarity), 0 means orthogonality (no similarity), and -1 means opposite direction. In this context, higher EECS values indicate greater similarity in emotional content between the converted and reference speech.
    • Mathematical Formula (Cosine Similarity): $ \mathrm{CosineSimilarity}(A, B) = \frac{A \cdot B}{\|A\| \|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \sqrt{\sum_{i=1}^{n} B_i^2}} $
    • Symbol Explanation:
      • $A$, $B$: Two vectors representing emotion embeddings (one from the converted speech, one from the reference speech).
      • $A_i$, $B_i$: The $i$-th component of vectors $A$ and $B$, respectively.
      • $n$: The dimensionality of the emotion embedding vectors.
      • $A \cdot B$: The dot product of vectors $A$ and $B$.
      • $\|A\|$: The Euclidean norm (magnitude) of vector $A$, calculated as $\sqrt{\sum_{i=1}^{n} A_i^2}$.
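
To make these objective metrics concrete, the sketch below computes RMSE, an EECS-style cosine similarity, and a frame-averaged MCD on placeholder arrays; the array shapes are illustrative, and CER is omitted since it additionally requires an ASR transcript and an edit-distance alignment.

```python
import numpy as np

def rmse(y, y_hat):
    # Root mean squared error between target and generated feature arrays.
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))

def cosine_similarity(a, b):
    # EECS-style cosine similarity between two emotion embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def mcd(mc_ref, mc_conv):
    # Mel-cepstral distortion (dB) per frame, following the formula above, averaged over frames.
    diff = mc_ref - mc_conv                                     # (frames, D) MCC difference
    return float(np.mean(10.0 / np.log(10) * np.sqrt(2.0 * np.sum(diff ** 2, axis=-1))))

# Placeholder values, not results from the paper.
mel_ref, mel_conv = np.random.randn(100, 80), np.random.randn(100, 80)
emb_ref, emb_conv = np.random.randn(192), np.random.randn(192)
mc_ref, mc_conv = np.random.randn(100, 13), np.random.randn(100, 13)
print(rmse(mel_ref, mel_conv), cosine_similarity(emb_ref, emb_conv), mcd(mc_ref, mc_conv))
```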

5.2.2. Subjective Metrics

  • Mean Opinion Score (MOS):
    • Conceptual Definition: MOS is a widely used subjective quality measure where human listeners rate the quality of speech samples on a numerical scale. In this paper, it's used for:
      • nMOS (naturalness MOS): Measures the perceived naturalness of the converted speech.
      • eMOS (emotion similarity MOS): Measures how well the emotion in the converted speech matches the target emotion.
    • Scoring Scale: The scoring scale ranges from 1 to 5, with increments of 1. Higher scores indicate better performance.
    • Raters: 12 professional raters were invited to participate in the evaluation.
    • Confidence Interval: The results are presented with a 95% confidence interval, indicating the reliability of the average scores.

5.3. Baselines

The ClapFM-EVC system was compared against several existing EVC methods, specifically when using reference speech as input for emotion control:

  • StarGAN-EVC [11]: A pioneering EVC method utilizing a cycle-consistent and class-conditional GAN architecture.

  • Seq2seq-EVC [34]: An EVC approach leveraging a two-stage sequence-to-sequence training, often using Text-to-Speech (TTS) components.

  • MixEmo [35]: A method that integrates speech synthesis with emotions, likely focusing on mixing or blending emotional styles.

    These baselines are representative of conventional EVC approaches that primarily rely on a reference waveform to facilitate emotion conversion, making them suitable for comparison with ClapFM-EVC's reference speech mode.

5.4. Implementation Details

The training and inference procedures involved specific configurations:

  • Hardware: All experiments, including training EVC-CLAP and AdaFM-VC, were conducted using 8 NVIDIA RTX 3090 GPUs.
  • EVC-CLAP Training:
    • Optimizer: Adam optimizer.
    • Initial Learning Rate: $1 \times 10^{-5}$.
    • Batch Size: 16.
    • Epochs: Trained for 40 epochs within the PyTorch framework.
  • AdaFM-VC Training:
    • Optimizer: AdamW optimizer.
    • Initial Learning Rate: $2 \times 10^{-4}$.
    • Batch Size: 32.
    • Iterations: Trained over 500,000 iterations on the same GPU setup.
  • Inference:
    • Mel-spectrogram Sampling: The Mel-spectrogram of the target waveform is sampled using 25 Euler steps within the CFM-based decoder. Euler steps refer to numerical integration steps for solving the ODE of the flow matching model.
    • Guidance Scale: A guidance scale set to 1.0 was used, which typically influences how strongly the model adheres to the conditioning information (emotional embeddings).
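
A hedged sketch of the Euler-step sampling loop implied by the 25 Euler steps above follows; the vector-field call signature and the way the guidance scale enters are assumptions, not the paper's exact inference procedure.

```python
import torch

@torch.no_grad()
def sample_mel(v_theta, h, shape, n_steps=25, guidance_scale=1.0):
    """Integrate dx/dt = v_theta(x, t, h) from t=0 to t=1 with fixed Euler steps.

    v_theta: learned vector field; h: conditioning features from the FuEncoder/EVC-CLAP;
    shape: output Mel-spectrogram shape, e.g. (B, n_mels, T). The guidance handling
    (simple scaling of the conditional field) is an illustrative assumption.
    """
    x = torch.randn(shape)                       # start from Gaussian noise x0
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = torch.full((shape[0], 1, 1), i * dt)
        v = guidance_scale * v_theta(x, t, h)    # conditional vector field at time t
        x = x + dt * v                           # Euler update: x_{t+dt} = x_t + dt * v
    return x                                     # approximate sample from p_1 (Mel-spectrogram)

# Toy usage with a placeholder vector field.
v_theta = lambda x, t, h: -x
print(sample_mel(v_theta, h=None, shape=(1, 80, 120)).shape)
```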

6. Results & Analysis

6.1. Core Results Analysis

The experiments aim to validate the effectiveness of ClapFM-EVC by comparing its performance against state-of-the-art baselines and through ablation studies.

6.1.1. EVC by Reference Speech

The paper first evaluates ClapFM-EVC when the emotion is controlled by a reference speech, comparing it to StarGAN-EVC, Seq2seq-EVC, and MixEmo.

The following are the results from Table 1 of the original paper:

Model MCD (↓) RMSE (↓) CER (↓) UTMOS (↑) nMOS (↑) EECS (↑) eMOS (↑)
StarGAN-EVC [11] 8.85 19.48 13.07 1.45 2.09 ± 0.12 0.49 1.97 ± 0.09
Seq2seq-EVC [34] 6.93 15.79 10.56 1.81 2.52 ± 0.11 0.54 2.23 ± 0.11
MixEmo [35] 6.28 13.84 8.93 2.09 2.98 ± 0.07 0.65 2.58 ± 0.13
ClapFM-EVC 5.83 10.91 6.76 3.68 4.09 ± 0.09 0.82 3.85 ± 0.06

Analysis:

  • Emotion Similarity (EECS, eMOS): ClapFM-EVC demonstrates significant improvements. It achieves an EECS of 0.82, a relative improvement of roughly 26.2% ((0.82 - 0.65) / 0.65) over MixEmo, the best baseline at 0.65. For subjective emotion similarity (eMOS), ClapFM-EVC scores 3.85, a remarkable relative improvement of roughly 49.2% ((3.85 - 2.58) / 2.58) over MixEmo (2.58). This highlights ClapFM-EVC's superior ability to capture and transfer target emotional characteristics.
  • Speech Quality (MCD, RMSE, CER, UTMOS, nMOS): ClapFM-EVC consistently outperforms baselines across all objective and subjective speech quality metrics.
    • Objective: It achieves the lowest MCD (5.83), RMSE (10.91), and CER (6.76), indicating higher spectral fidelity and better preservation of linguistic content. For UTMOS, ClapFM-EVC scores 3.68, a substantial relative improvement of roughly 76.1% ((3.68 - 2.09) / 2.09) over MixEmo.
    • Subjective: In terms of perceived naturalness (nMOS), ClapFM-EVC scores 4.09, a relative improvement of roughly 37.2% ((4.09 - 2.98) / 2.98) over MixEmo (2.98). These results confirm that ClapFM-EVC produces converted speech with exceptional perceptual quality while accurately conveying emotions.

6.1.2. EVC by Natural Language Prompt

To demonstrate the flexibility of ClapFM-EVC, an ABX preference test was conducted to compare the performance when using reference speech (Reference) versus natural language prompts (Prompt) for emotion control.

The following figure (Figure 2 from the original paper) shows the ABX preference test results comparing the Reference mode with the Prompt mode.

Figure 2: ABX preference test results comparing the Reference mode with the Prompt mode.

Analysis:

  • Emotion Similarity: When participants were asked to rate emotion similarity (with Reference as the benchmark: -1 for Reference better, 0 for no preference, 1 for Prompt better), 57.4% chose "no preference". This suggests that for a majority of listeners, the emotion similarity achieved by Prompt-driven conversion is comparable to that of Reference-driven conversion. Furthermore, 19.1% favored Prompt, indicating that natural language control can be highly effective and sometimes even preferred for emotional expression.

  • Speech Quality: For speech quality (closeness to the ground truth), Reference received 25.5% preference, Prompt received 23.4% preference, and "no preference" was chosen by 51.1% of participants. The close preference rates between the Reference and Prompt modes, combined with over half of the participants having no preference, indicate that ClapFM-EVC maintains high-quality EVC regardless of whether a reference speech or a natural language prompt is used for emotion control.

    These ABX test results validate the dual control capability of ClapFM-EVC and demonstrate that natural language prompts can serve as an effective and high-quality alternative for emotional control.

6.2. Ablation Studies / Parameter Analysis

To understand the contribution of each proposed component, ablation studies were performed, with results summarized in Table 2. These studies used natural language prompts to drive the EVC.

The following are the results from Table 2 of the original paper:

Model UTMOS (↑) nMOS (↑) EECS (↑) eMOS (↑)
ClapFM-EVC 3.63 4.01 ± 0.06 0.79 3.72 ± 0.08
w/o emo label 3.61 3.96 ± 0.11 0.66 3.01 ± 0.07
w/o symKL 3.57 3.89 ± 0.05 0.71 3.28 ± 0.08
w/o AIG 3.25 3.62 ± 0.12 0.74 3.52 ± 0.05

Analysis:

  1. w/o emo label (without emotional categorical labels):

    • Impact: When categorical emotional labels are removed from EVC-CLAP training, EECS drops significantly from 0.79 to 0.66 (a relative drop of 16.5%), and eMOS drops from 3.72 to 3.01 (a relative drop of 19.1%). UTMOS and nMOS (speech quality metrics) remain largely unaffected (3.61 vs. 3.63 and 3.96 vs. 4.01, respectively).
    • Conclusion: This validates the effectiveness of the proposed soft-label-guided training strategy in EVC-CLAP. Combining natural language prompts with categorical labels is crucial for enhancing the model's emotional representation capacity and achieving better emotion similarity.
  2. w/o symKL (without symKL-loss):

    • Impact: Replacing the symKL-loss with a standard KL-Loss (presumably unidirectional) results in EECS dropping from 0.79 to 0.71 (a relative drop of 10.1%) and eMOS dropping from 3.72 to 3.28 (a relative drop of 11.8%). Speech quality metrics (UTMOS, nMOS) also show slight reductions.
    • Conclusion: This indicates that the symmetric KL-Loss is more effective than a standard KL-Loss in aligning audio and text emotional features. The bidirectional nature of symKL-loss likely contributes to a more robust and enhanced emotion representation capability within EVC-CLAP.
  3. w/o AIG (without Adaptive Intensity Gate):

    • Impact: Removing the AIG module leads to a notable deterioration in speech quality metrics: UTMOS drops from 3.63 to 3.25 (a relative drop of 10.5%), and nMOS drops from 4.01 to 3.62 (a relative drop of 9.6%). There is also a slight reduction in emotional similarity (EECS from 0.79 to 0.74, eMOS from 3.72 to 3.52).

    • Conclusion: This underscores the critical role of the AIG module. It not only enables adaptive control of emotion intensity but also significantly contributes to the overall speech quality by effectively integrating content and emotional characteristics. Its absence negatively impacts both the quality and emotional expressiveness of the converted speech.

      The ablation studies collectively confirm that all proposed components – the soft-label-guided training, symKL-loss, and the AIG module – are vital for the superior performance of the ClapFM-EVC system, contributing to both emotional accuracy and speech quality.

7. Conclusion & Reflections

7.1. Conclusion Summary

This study successfully introduced ClapFM-EVC, an innovative and effective high-fidelity any-to-one Emotional Voice Conversion (EVC) framework. A key strength of ClapFM-EVC is its flexible and interpretable emotion control, achieved through dual input modalities: natural language prompts and reference speech, coupled with adjustable emotion intensity.

The framework's core innovations include:

  1. EVC-CLAP: An emotional contrastive language-audio pre-training model designed to extract and align fine-grained emotional elements across audio and text. Its training is guided by a novel symKL-Loss and soft labels derived from both natural language prompts and categorical emotion labels, significantly enhancing emotional representational capacity.

  2. AdaFM-VC: Comprising a FuEncoder with an Adaptive Intensity Gate (AIG) and a Conditional Flow Matching (CFM)-based decoder. The FuEncoder seamlessly fuses content and emotional features, with AIG providing explicit control over emotion intensity. The CFM decoder, combined with a pre-trained vocoder, ensures high-fidelity speech reconstruction, improving naturalness and speech quality.

    Extensive experiments, including comparisons with state-of-the-art baselines and detailed ablation studies, rigorously validated that ClapFM-EVC consistently generates converted speech with precise emotion control and high speech quality, particularly when driven by natural language prompts.

7.2. Limitations & Future Work

The paper does not explicitly detail a "Limitations" or "Future Work" section. However, based on the description and common challenges in EVC research, potential limitations and future directions can be inferred:

Inferred Limitations:

  • Single-Speaker Model: The current system is described as an "any-to-one" EVC framework, implying conversion to a single target speaker's voice. This means it cannot perform "any-to-any" EVC (converting to arbitrary target speakers) or "one-to-many" EVC (converting one source speaker to multiple target speakers). This limits its flexibility in scenarios requiring diverse target voices.
  • Proprietary Dataset: The reliance on an "internally developed expressive single-speaker Mandarin corpus" limits reproducibility and direct comparison by other researchers who do not have access to this specific dataset. The generalization capabilities to other languages or emotional styles not present in this dataset are also not explicitly tested.
  • Interpretability of Intensity Control: While the AIG module allows "adjustable emotion intensity," the paper does not extensively quantify or subjectively evaluate the range and perceptual linearity of this intensity control. Further exploration could include how different intensity levels map to human perception.
  • Computational Cost: Flow Matching models, especially with many steps (25 Euler steps in inference), and large pre-trained encoders (HuBERT, XLMRoBERTa) can be computationally intensive during training and inference, though this is a common trade-off for high quality.

Inferred Future Work:

  • Any-to-Any EVC: Extending the framework to support any-to-any voice conversion, allowing transfer of emotion to arbitrary target speakers (including zero-shot scenarios).
  • Cross-Lingual EVC: Investigating the model's performance and adaptability for cross-lingual emotion conversion, where source and target speech are in different languages.
  • Quantitative Intensity Control Evaluation: Developing more robust quantitative and qualitative evaluations for the adjustable emotion intensity, perhaps through psychological experiments or fine-grained annotation of intensity levels.
  • Real-time Performance Optimization: Exploring methods to reduce the computational overhead and accelerate inference for potential real-time applications.
  • Broader Emotional Range: Training on even larger and more diverse emotional datasets, potentially including more nuanced or complex emotional states beyond the 7 categorical emotions.
  • Integration with Text-to-Speech (TTS): Combining ClapFM-EVC with TTS systems to enable generation of emotional speech directly from text, with fine-grained emotional control.

7.3. Personal Insights & Critique

This paper presents a significant advancement in emotional voice conversion, particularly in addressing the critical aspects of flexible control and high-fidelity output.

Personal Insights:

  • Synergy of Components: The most impressive aspect is the synergistic combination of CLAP-based multimodal learning, adaptive intensity control, and flow matching. This multi-pronged approach effectively tackles the disentanglement problem in EVC, allowing for robust control over emotion while maintaining content and quality.
  • Natural Language Control as a Game Changer: The ability to control emotion via natural language prompts is a substantial leap forward for user experience and interpretability. It unlocks a far richer space of emotional expression compared to discrete labels or reliance on reference audio, making EVC systems more practical and intuitive for applications like creative content generation or personalized voice assistants.
  • Robust Emotional Embeddings: The soft-labels-guided training and symKL-loss for EVC-CLAP are clever techniques to create strong, discriminative emotional embeddings by leveraging both structured categorical knowledge and unstructured linguistic descriptions. This likely contributes significantly to the model's ability to interpret subtle emotional nuances.
  • Flow Matching for Quality: The integration of Conditional Flow Matching is a strong choice for achieving high naturalness and speech quality, aligning with current trends in state-of-the-art generative models for audio.

Critique / Areas for Improvement:

  • Any-to-One vs. Any-to-Any Clarity: While stated as "any-to-one," which is clear, the implications for practical application (i.e., needing a specific target speaker's voice) could be discussed more explicitly in the abstract or introduction, as many desire "any-to-any" EVC for maximum flexibility.

  • Detailed Evaluation of Intensity Control: Although the AIG module is introduced and its contribution is shown in ablation, a more detailed evaluation of the range and perceptual effect of the intensity control would be valuable. For instance, demonstrating conversion at "low," "medium," and "high" intensity levels for a given emotion and including subjective ratings specific to intensity.

  • Generalizability to Other Languages/Speakers: The use of an internal Mandarin corpus raises questions about cross-lingual transferability and generalizability to diverse speaker characteristics. While the underlying models (HuBERT, XLMRoBERTa) are often multilingual, the EVC-specific components might benefit from training on more diverse datasets. Making the dataset public or providing a benchmark for it would greatly benefit the community.

  • Ablation of CFM vs. other generative models: While CFM is chosen for its benefits, a direct ablation study comparing CFM with alternative generative backbones (e.g., GAN-based or VAE-based decoders, or even diffusion models if CFM is considered distinct enough) could further emphasize its contribution to quality and naturalness.

  • Ethical Considerations: As with any advanced voice synthesis technology, a brief discussion on ethical implications (e.g., deepfakes, misuse) would be a valuable addition, especially for a system offering such flexible emotional control.

    Overall, ClapFM-EVC represents a robust and innovative step forward in EVC, pushing the boundaries of control flexibility and output quality. Its dual control mechanism and intelligent integration of multimodal and generative techniques pave the way for more natural and user-friendly emotional speech synthesis applications.
