
Conditional out-of-sample generation for unpaired data using trVAE

Published: 10/04/2019
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

trVAE applies MMD regularization in the first decoder layer of a CVAE to align distributions across conditions, improving out-of-sample generalization and demonstrating superior robustness and accuracy on high-dimensional image and single-cell gene expression data.

Abstract

While generative models have shown great success in generating high-dimensional samples conditional on low-dimensional descriptors (learning e.g. stroke thickness in MNIST, hair color in CelebA, or speaker identity in Wavenet), their generation out-of-sample poses fundamental problems. The conditional variational autoencoder (CVAE) as a simple conditional generative model does not explicitly relate conditions during training and, hence, has no incentive of learning a compact joint distribution across conditions. We overcome this limitation by matching their distributions using maximum mean discrepancy (MMD) in the decoder layer that follows the bottleneck. This introduces a strong regularization both for reconstructing samples within the same condition and for transforming samples across conditions, resulting in much improved generalization. We refer to the architecture as \emph{transformer} VAE (trVAE). Benchmarking trVAE on high-dimensional image and tabular data, we demonstrate higher robustness and higher accuracy than existing approaches. In particular, we show qualitatively improved predictions for cellular perturbation response to treatment and disease based on high-dimensional single-cell gene expression data, by tackling previously problematic minority classes and multiple conditions. For generic tasks, we improve Pearson correlations of high-dimensional estimated means and variances with their ground truths from 0.89 to 0.97 and 0.75 to 0.87, respectively.


In-depth Reading


1. Bibliographic Information

1.1. Title

Conditional out-of-sample generation for unpaired data using trVAE

1.2. Authors

The authors are M. Lotfollahi, Mohsen Naghipourfar, Fabian J. Theis, and F. Alexander Wolf. Their affiliations are primarily with the Institute of Computational Biology, Helmholtz Center Munich, Neuherberg, Germany, and the School of Life Sciences Weihenstephan, Technical University of Munich, Munich, Germany. Mohsen Naghipourfar is also affiliated with the Department of Computer Engineering, Sharif University of Technology, Tehran, Iran. Fabian J. Theis is also associated with the Department of Mathematics, Technische Universität München, Munich, Germany.

1.3. Journal/Conference

The paper was published on arXiv, an open-access preprint server for scientific papers. While arXiv itself is not a peer-reviewed journal or conference, it is a widely recognized platform for disseminating research in fields like machine learning, physics, mathematics, and computer science. The paper's content, particularly its focus on single-cell gene expression data, suggests it aligns with computational biology or machine learning conferences/journals such as NeurIPS, ICML, Nature Methods, or Cell Systems. The v2 indicates a revision after initial submission.

1.4. Publication Year

The paper was published (or revised, given v2) on October 4, 2019.

1.5. Abstract

This paper introduces transformer VAE (trVAE), a novel conditional generative model designed for out-of-sample generation with unpaired data. While existing conditional generative models like the conditional variational autoencoder (CVAE) struggle with out-of-sample generation due to their lack of explicit condition-relationship learning during training, trVAE overcomes this by incorporating maximum mean discrepancy (MMD) regularization. This MMD term is applied in the decoder layer immediately following the bottleneck, forcing the intermediate representations across different conditions to be more compact and aligned. This regularization significantly improves generalization, enabling trVAE to reconstruct samples within the same condition more effectively and transform samples across conditions more accurately. Benchmarked on high-dimensional image data (Morpho-MNIST, CelebA) and tabular single-cell gene expression data, trVAE demonstrates superior robustness and accuracy compared to existing approaches. Specifically, it achieves qualitatively improved predictions for cellular perturbation responses, particularly for problematic minority classes and multiple conditions in biological data. For generic tasks, trVAE boosts Pearson correlations for estimated means and variances with ground truths from 0.89 to 0.97 and 0.75 to 0.87, respectively.

Official Source Link: https://arxiv.org/abs/1910.01791v2
PDF Link: https://arxiv.org/pdf/1910.01791v2.pdf
Publication Status: This is a preprint version (v2) hosted on arXiv.

2. Executive Summary

2.1. Background & Motivation

The core problem the paper aims to solve is the limitation of existing conditional generative models, particularly conditional variational autoencoders (CVAEs), in performing out-of-sample generation. This refers to the task of generating samples for a target domain-condition combination $(d, s)$ when training data for that specific combination is unavailable. For instance, if a model is trained on black-haired men, blonde-haired women, and black-haired women, an out-of-sample task would be to predict how a black-haired man would look with blonde hair, given that blonde-haired men were not in the training set.

This problem is highly significant in applied settings such as computational biology. For example, one may want to predict how human cells (domain $d=0$) would respond to a drug treatment (condition $s=1$) when training data for treated human cells is absent, but data from in vitro ($d=1$) or mouse ($d=2$) experiments under treatment, and untreated human cells ($s=0$), are available. Existing CVAEs fall short because they do not explicitly enforce a relationship between different conditions during training. Without this incentive, the model's internal representations for different conditions can diverge significantly, making generalization to unseen combinations difficult. Prior methods either relied on hard-coded transformations or histogram matching, which are not data-driven or generalizable to multiple conditions.

The paper's entry point or innovative idea is to introduce a strong regularization using Maximum Mean Discrepancy (MMD) within the CVAE framework. Instead of regularizing the latent bottleneck layer, they propose applying MMD to the first layer of the decoder. This forces the intermediate representations for different conditions to be more similar, thereby learning a compact joint distribution across conditions and enabling more accurate out-of-sample predictions.

2.2. Main Contributions / Findings

The paper's primary contributions are:

  1. Introduction of trVAE (transformer VAE): A novel MMD-regularized CVAE architecture that explicitly relates conditions during training by matching their distributions in the decoder's first layer.

  2. Improved Generalization: The MMD regularization leads to a more compact representation that displays subtle, controlled differences across conditions, thereby significantly improving the model's ability to generalize to out-of-sample scenarios.

  3. Enhanced Robustness and Accuracy: trVAE demonstrates higher robustness and accuracy compared to existing generative models and baselines on diverse datasets, including high-dimensional image and tabular data.

  4. Qualitatively Improved Biological Predictions: For cellular perturbation response prediction using single-cell gene expression data, trVAE shows qualitatively superior results, especially in tackling minority classes and handling multiple conditions, which were previously problematic.

  5. Quantitative Performance Gains: The model achieves substantial improvements in quantitative metrics, increasing Pearson correlations of high-dimensional estimated means from 0.89 to 0.97 and variances from 0.75 to 0.87 with their ground truths for generic tasks.

    The key conclusions are that by strategically applying MMD regularization to an intermediate layer of the decoder, trVAE learns condition-invariant features that are crucial for effective out-of-sample transformation. This data-driven, end-to-end approach overcomes the limitations of prior methods, offering a more general and accurate solution for conditional generation tasks, especially in complex domains like biological data analysis.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To fully grasp the trVAE model, understanding several foundational machine learning concepts is essential.

3.1.1. Variational Autoencoder (VAE)

A Variational Autoencoder (VAE) (Kingma & Welling, 2013) is a type of generative model that learns a compressed, meaningful representation (a latent space) of input data and can generate new data samples similar to the training data.

  • Architecture: A VAE consists of two main parts:
    • Encoder: Maps input data $X$ to a distribution over a latent space $Z$. Instead of a single point, the encoder outputs the parameters (mean $\mu$ and variance $\sigma^2$) of a probability distribution (typically a Gaussian) in the latent space, $q_{\phi}(Z|X)$.
    • Decoder: Maps samples from the latent space $Z$ back to the original data space, generating $\hat{X}$. This is a generative distribution $p_{\theta}(X|Z)$.
  • Objective: The goal of the VAE is to maximize the evidence lower bound (ELBO). This objective function balances two terms:
    1. Reconstruction Loss: Measures how well the decoder reconstructs the original input data from its latent representation. This term typically uses binary cross-entropy for binary data or mean squared error for continuous data. The paper refers to it as $\mathbb{E}_{q_{\phi}(Z|X,S)}[\log p_{\theta}(X \mid Z, S)]$.
    2. Regularization Term: A Kullback-Leibler (KL) divergence term that forces the latent distribution $q_{\phi}(Z|X)$ to be close to a prior distribution $p(Z)$ (often a standard normal distribution). This ensures the latent space is continuous and structured, allowing for smooth interpolation and meaningful generation. The paper refers to this as $D_{\mathrm{KL}}(q_{\phi}(Z|X,S) \,\|\, p_{\theta}(Z|S))$.
  • Reparameterization Trick: To allow backpropagation through the sampling process, the reparameterization trick is used. Instead of sampling $Z \sim q_{\phi}(Z|X)$ directly, one sets $Z = \mu + \sigma \cdot \epsilon$ where $\epsilon \sim \mathcal{N}(0, 1)$ is drawn from a standard normal distribution, making the sampling step deterministic and differentiable (see the sketch below).
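Below is a minimal numerical sketch of the reparameterization trick. The batch size, latent dimensionality, and the log-variance parameterization are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch of the reparameterization trick: the randomness is moved into
# an auxiliary noise variable eps so that gradients can flow through mu and
# log_var. Shapes and values are illustrative only.
import numpy as np

rng = np.random.default_rng(0)

# Suppose the encoder produced these parameters for a batch of 4 samples
# with a 3-dimensional latent space.
mu = rng.normal(size=(4, 3))        # encoder mean
log_var = rng.normal(size=(4, 3))   # encoder log-variance (numerically convenient)

eps = rng.standard_normal(size=mu.shape)   # eps ~ N(0, I)
z = mu + np.exp(0.5 * log_var) * eps       # z = mu + sigma * eps

print(z.shape)  # (4, 3)
```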

3.1.2. Conditional Variational Autoencoder (CVAE)

A Conditional Variational Autoencoder (CVAE) (Sohn et al., 2015) extends the VAE by conditioning both the encoder and decoder on an additional conditional variable $S$. This allows the model to generate data specific to certain attributes or conditions.

  • How it works: The conditional variable $S$ (e.g., class label, hair color, treatment type) is fed as an additional input to both the encoder and the decoder.
    • Encoder: Learns $q_{\phi}(Z|X, S)$, meaning the latent representation is conditioned on both the input data and the desired condition.
    • Decoder: Learns $p_{\theta}(X|Z, S)$, meaning data generation is conditioned on both the sampled latent variable and the desired condition.
  • Objective: The ELBO objective is modified to include the conditional variable $S$: $ \mathcal{L}_{\mathrm{CVAE}} = \mathbb{E}_{q_{\phi}(Z|X,S)}[\log p_{\theta}(X|Z,S)] - D_{\mathrm{KL}}(q_{\phi}(Z|X,S) \,\|\, p_{\theta}(Z|S)) $. This allows CVAEs to generate samples that not only resemble the training data but also possess specific characteristics dictated by $S$.

3.1.3. Maximum Mean Discrepancy (MMD)

Maximum Mean Discrepancy (MMD) (Gretton et al., 2012) is a statistical distance metric used to measure the difference between two probability distributions. It is widely used in machine learning for tasks like domain adaptation and generative adversarial networks.

  • Intuition: MMD measures the distance between the mean embeddings of two distributions in a Reproducing Kernel Hilbert Space (RKHS). If two distributions are identical, their mean embeddings in a sufficiently rich RKHS will also be identical, and their MMD will be zero.
  • Reproducing Kernel Hilbert Space (RKHS): An RKHS is a special type of Hilbert space where evaluation functionals (functions that evaluate a function at a point) are continuous. This property allows for the definition of a kernel function k(x, x') that implicitly maps data points into a high-dimensional feature space. The key idea is that instead of explicitly working in this high-dimensional space, calculations are performed using the kernel function, which is often computationally more efficient (kernel trick).
  • Kernel Function: A kernel function $k(x, x')$ quantifies the similarity between two data points $x$ and $x'$. Common kernels include the Gaussian (RBF) kernel: $k(x, x') = e^{-\gamma \|x - x'\|^2}$. The paper uses a multi-scale RBF kernel, which is a sum of several RBF kernels with different bandwidths $\gamma_i$, allowing it to capture differences at various scales.
  • MMD Calculation: Given samples from two distributions $X = \{x_1, \dots, x_{n_0}\}$ and $X' = \{x'_1, \dots, x'_{n_1}\}$, the empirical MMD estimate is: $ \ell_{\mathrm{MMD}}(X, X') = \frac{1}{n_0^2} \sum_{n,m} k(x_n, x_m) + \frac{1}{n_1^2} \sum_{n,m} k(x'_n, x'_m) - \frac{2}{n_0 n_1} \sum_{n,m} k(x_n, x'_m) $. The goal is to minimize this value during training to make the two distributions as similar as possible (a numerical sketch follows this list).
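A minimal sketch of the empirical MMD estimate with a multi-scale RBF kernel is given below. The bandwidths, sample sizes, and synthetic data are illustrative assumptions, not the paper's settings.

```python
# Minimal sketch of the (biased) empirical MMD estimate with a multi-scale RBF kernel.
import numpy as np

def multi_scale_rbf(a, b, gammas=(0.1, 1.0, 10.0)):
    """Return the (n, m) matrix of summed RBF kernels k(a_i, b_j) over several bandwidths."""
    sq_dists = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return sum(np.exp(-g * sq_dists) for g in gammas)

def mmd(x, x_prime, gammas=(0.1, 1.0, 10.0)):
    """Empirical MMD estimate between two sample sets x and x_prime."""
    k_xx = multi_scale_rbf(x, x, gammas)
    k_yy = multi_scale_rbf(x_prime, x_prime, gammas)
    k_xy = multi_scale_rbf(x, x_prime, gammas)
    return k_xx.mean() + k_yy.mean() - 2.0 * k_xy.mean()

rng = np.random.default_rng(0)
same = mmd(rng.normal(size=(100, 5)), rng.normal(size=(100, 5)))                # close to 0
shifted = mmd(rng.normal(size=(100, 5)), rng.normal(2.0, 1.0, size=(100, 5)))   # clearly positive
print(round(same, 4), round(shifted, 4))
```

For two sample sets drawn from the same distribution the estimate is close to zero, while a mean shift between them yields a clearly positive value; minimizing this quantity therefore pulls the two distributions together.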

3.2. Previous Works

The paper contextualizes trVAE by discussing several related works, emphasizing their limitations or different application focuses.

3.2.1. Standard Conditional Variational Autoencoder (CVAE)

(Sohn et al., 2015) As discussed above, CVAEs are the direct predecessors and a baseline for trVAE. While they can generate conditional samples, their primary limitation for out-of-sample generation is that they do not explicitly enforce any relationship or compactness between the representations of different conditions. This means that during training, the CVAE has no incentive to learn a compact joint distribution across conditions, leading to distinct, potentially non-overlapping representations for different values of $S$. This makes generalization to unseen values of $S$, or transformations between them, difficult and unreliable.

3.2.2. MMD for Domain Adaptation and Latent Space Regularization

MMD has been successfully integrated into VAEs for various purposes:

  • Unsupervised Domain Adaptation: (Louizos et al., 2015) proposed the Variational Fair Autoencoder (VFAE), which uses MMD to match latent distributions $q_{\phi}(Z|s=0)$ and $q_{\phi}(Z|s=1)$ for different domains $s$. This aims to learn domain-invariant latent representations. The VFAE loss function is given as: $ \mathcal{L}_{\mathrm{VFAE}}(\phi, \theta; X, X', S, S') = \mathcal{L}_{\mathrm{VAE}}(\phi, \theta; X, S) + \mathcal{L}_{\mathrm{VAE}}(\phi, \theta; X', S') - \beta\, \ell_{\mathrm{MMD}}(Z_{s=0}, Z'_{s'=1}) $, where $Z_{s=0}$ and $Z'_{s'=1}$ represent the latent variables from two different conditions/domains. This is a crucial reference because it applies MMD specifically to the bottleneck (latent) layer $Z$, which trVAE explicitly avoids, arguing that it does not improve performance for their task.
  • Statistically Independent Latent Dimensions: (Lopez et al., 2018b) explored using MMD to learn statistically independent latent dimensions in VAEs.

3.2.3. Causal Inference and Counterfactual Prediction

(Johansson et al., 2016) This work focused on improving counterfactual inference by learning representations that enforce similarity between perturbed and control groups. They used a linear discrepancy measure for this purpose but also mentioned MMD as an alternative. While related in the goal of understanding effects across conditions, it wasn't specifically within the CVAE framework for out-of-sample generation in the same manner as trVAE.

3.2.4. Other Out-of-Sample Transformation Methods

  • Hardcoded Latent Space Vector Arithmetics: (Lotfollahi et al., 2019), which includes one of the trVAE authors, previously addressed transformation problems by performing arithmetic operations on latent space vectors, for example, vector_treated = vector_control + (vector_treated_avg - vector_control_avg). This approach is hard-coded and not data-driven or learned end-to-end.
  • Histogram Matching: (Amodio et al., 2018) used histogram matching to align distributions for transformation. Similar to vector arithmetics, this is a non-end-to-end, pre-defined approach.

3.3. Technological Evolution

The evolution in this field has moved from unsupervised generative models (VAEs, GANs) to conditional ones (CVAEs, conditional GANs), allowing for more control over generated samples. A key challenge remained in scenarios where specific condition-combinations were out-of-sample (unseen during training), limiting the practical applicability of these models for tasks like perturbation prediction.

Early attempts to bridge domain gaps or enforce specific latent space properties often involved MMD or adversarial training (GANs). VFAE (Louizos et al., 2015) represented a step towards domain adaptation in VAEs by using MMD on the latent bottleneck. However, the present paper argues that for out-of-sample transformation, MMD on the bottleneck isn't optimal.

trVAE represents an advancement by recognizing that for out-of-sample transformation, the crucial point of regularization is not the most abstract latent representation $Z$ (which the CVAE already makes somewhat condition-invariant by design), but rather the first decoder layer $Y$. This intermediate layer often retains strong conditional information that needs to be aligned across conditions to enable smooth and accurate transformations.

3.4. Differentiation Analysis

The core differences and innovations of trVAE compared to related work are:

  1. Placement of MMD Regularization:
    • trVAE vs. VFAE / MMD-CVAE: VFAE (and the paper's MMD-CVAE baseline) applies MMD regularization to the latent bottleneck layer $Z$ to make it domain-invariant. trVAE explicitly argues and demonstrates that this approach is not optimal for their out-of-sample transformation task. Instead, trVAE applies MMD to the first layer of the decoder, $Y = g_1(\hat{z}, s)$. This is the primary innovation.
  2. Rationale for MMD Placement:
    • The authors observe that while the bottleneck $Z$ in a CVAE often becomes condition-invariant ($Z \perp\!\!\!\perp S$) due to optimization incentives (the decoder directly receives $S$), the subsequent decoder layers, particularly the first one ($Y$), still retain strong conditional information ($Y \not\perp\!\!\!\perp S$). By regularizing $Y$ with MMD, trVAE forces these intermediate representations to align across conditions, leading to a more compact and condition-invariant feature space right before generating the high-dimensional output. This compactification directly facilitates out-of-sample transformation.
  3. Data-Driven and End-to-End Transformation:
    • trVAE vs. scGen / Amodio et al. (vector arithmetics, histogram matching): trVAE provides a fully data-driven, end-to-end approach for conditional generation and out-of-sample transformation. Unlike methods relying on hard-coded vector arithmetics in latent space or histogram matching, trVAE learns the transformation implicitly through its MMD-regularized training objective. This makes it more flexible and generalizable, especially to multiple conditions.
  4. Performance and Robustness:
    • trVAE consistently outperforms vanilla CVAE, MMD-CVAE, CycleGAN, scGen, scVI, and MMD-regularized autoencoder (SAUCIE) across various quantitative metrics (e.g., Pearson correlations for mean and variance) on both image and biological tabular data. It demonstrates higher robustness and better generalization for complex real-world scenarios, such as cellular perturbation prediction.
  5. Handling Multiple Conditions:
    • The model is shown to effectively handle multiple conditions simultaneously, a capability that was a limitation for some previous approaches that could only handle two conditions.

4. Methodology

4.1. Principles

The core principle behind trVAE is to improve the generalization capability of a Conditional Variational Autoencoder (CVAE) for out-of-sample conditional generation by enforcing condition-invariant features at a specific, crucial intermediate layer of the decoder. The intuition is that while the latent bottleneck of a CVAE might naturally become somewhat condition-independent (as the condition $S$ is explicitly provided to the decoder), the subsequent intermediate representations still carry strong conditional information, leading to separate, non-compact distributions for different conditions. By applying Maximum Mean Discrepancy (MMD) regularization to the first layer of the decoder, trVAE forces these intermediate representations to align, making them more compact and shared across conditions. This facilitates the learning of common features essential for accurate transformation from one condition to another, even for unseen condition combinations.

4.2. Core Methodology In-depth (Layer by Layer)

4.2.1. CVAE Framework

The trVAE builds upon the CVAE framework. The CVAE aims to maximize the likelihood of observed data $X$ given conditions $S$. This likelihood is expressed as an integral over the latent variable $Z$: $ p_{\theta}(X \mid S) = \int p_{\theta}(X \mid Z, S)\, p_{\theta}(Z \mid S)\, dZ $. Here:

  • $X$: A high-dimensional random variable representing the observed data (e.g., an image, gene expression profile).

  • $S$: A random variable representing conditions (e.g., stroke thickness, smile presence, treatment status).

  • $Z$: A latent random vector, a low-dimensional representation of $X$.

  • $\theta$: The parameters of the generative model (decoder).

  • $p_{\theta}(X \mid Z, S)$: The generative distribution that decodes $Z$ and $S$ into $X$.

  • $p_{\theta}(Z \mid S)$: The prior distribution over the latent variable $Z$, potentially conditioned on $S$.

    To make the integral tractable, a variational approximation $q_{\phi}(Z|X,S)$ (the encoder) is introduced. The CVAE then maximizes a weighted evidence lower bound (ELBO), which serves as its training objective: $ \mathcal{L}_{\mathrm{CVAE}}(\phi, \theta; X, S, \alpha, \eta) = \eta\, \mathbb{E}_{q_{\phi}(Z|X,S)}[\log p_{\theta}(X \mid Z, S)] - \alpha\, D_{\mathrm{KL}}(q_{\phi}(Z|X,S) \,\|\, p_{\theta}(Z|S)) $. In this CVAE objective:

  • $\phi$: The parameters of the inference model (encoder).

  • $\mathbb{E}_{q_{\phi}(Z|X,S)}[\log p_{\theta}(X \mid Z, S)]$: The reconstruction term. It measures how well the decoder $p_{\theta}$ can reconstruct the input $X$ given a latent sample $Z$ (drawn from the encoder's distribution $q_{\phi}$) and the condition $S$. $\eta$ is a weighting factor for this term.

  • $D_{\mathrm{KL}}(q_{\phi}(Z|X,S) \,\|\, p_{\theta}(Z|S))$: The KL divergence term. It measures the difference between the encoder's approximate posterior $q_{\phi}(Z|X,S)$ and the prior $p_{\theta}(Z|S)$. It acts as a regularizer, encouraging the latent space to be structured and well-behaved. $\alpha$ is a weighting factor for this term.

  • The paper adapts the notation from Lopez et al. (2018b), where $q_{\phi}$ is the encoder and $p_{\theta}$ is the decoder. A minimal numerical sketch of this objective follows below.
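The sketch below evaluates this weighted objective numerically, assuming a standard-normal prior $p_\theta(Z|S) = \mathcal{N}(0, I)$, a unit-variance Gaussian decoder, and a single Monte Carlo sample; all shapes, weights, and the stand-in decoder output are illustrative assumptions.

```python
# Minimal sketch of the weighted CVAE objective
#   eta * E_q[log p(x | z, s)] - alpha * KL(q(z | x, s) || p(z | s)),
# under the assumptions stated above (standard-normal prior, unit-variance
# Gaussian decoder, one Monte Carlo sample). Values are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
alpha, eta = 0.001, 1.0

x = rng.normal(size=(8, 20))                                      # a batch of observations
mu, log_var = rng.normal(size=(8, 5)), rng.normal(size=(8, 5))    # encoder outputs for q(z | x, s)
z = mu + np.exp(0.5 * log_var) * rng.standard_normal(mu.shape)    # reparameterized sample
x_hat = rng.normal(size=x.shape)                                  # stand-in for the decoder output g(z, s)

# Reconstruction term: log-likelihood of x under N(x_hat, I), up to an additive constant.
recon = -0.5 * np.sum((x - x_hat) ** 2, axis=1)

# Closed-form KL between N(mu, diag(exp(log_var))) and the standard-normal prior.
kl = 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=1)

objective = np.mean(eta * recon - alpha * kl)   # the weighted ELBO, maximized during training
loss = -objective                               # equivalently, this loss is minimized
print(loss)
```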

4.2.2. Maximum Mean Discrepancy (MMD)

The MMD is used to measure the distance between two distributions, and its empirical estimate from samples is given by: $ \ell_{\mathrm{MMD}}(X, X') = \frac{1}{n_0^2} \sum_{n,m} k(x_n, x_m) + \frac{1}{n_1^2} \sum_{n,m} k(x'_n, x'_m) - \frac{2}{n_0 n_1} \sum_{n,m} k(x_n, x'_m) $. Here:

  • $X = \{x_1, \dots, x_{n_0}\}$: A set of $n_0$ samples from the first distribution.
  • $X' = \{x'_1, \dots, x'_{n_1}\}$: A set of $n_1$ samples from the second distribution.
  • $k(x, x')$: A kernel function that measures the similarity between two points. The paper uses a multi-scale RBF kernel: $ k(x, x') = \sum_{i=1}^{l} k(x, x', \gamma_i) = \sum_{i=1}^{l} e^{-\gamma_i \|x - x'\|^2} $, where the $\gamma_i$ are hyperparameters defining the bandwidths of the RBF kernels, and $l$ is the number of scales.

4.2.3. trVAE Architecture and Data Flow

The trVAE architecture is a modified CVAE where the key distinction lies in the application of MMD regularization. The model defines the transformation as follows:

  1. Encoder: The encoder $f$ takes a high-dimensional observation $x$ and its corresponding condition $s$ as input and produces a latent representation $\hat{z}$: $ \hat{z} = f(x, s) $.
  2. Decoder Decomposition: The decoder $g$ is conceptually split into two parts:
    • First Layer(s) of Decoder: $g_1$ takes the latent representation $\hat{z}$ and the condition $s$, and outputs an intermediate representation $\hat{y}$: $ \hat{y} = g_1(\hat{z}, s) $.
    • Remaining Layers of Decoder: $g_2$ takes the intermediate representation $\hat{y}$ and transforms it into the reconstructed output $\hat{x}$. Note that $g_2$ does not receive $s$ directly in this formulation, implying the conditional information is integrated into $\hat{y}$: $ \hat{x} = g_2(\hat{y}) $. Thus, the full decoder is $g = g_2 \circ g_1$.

The crucial insight is that while the latent variable $Z$ in a CVAE often becomes statistically independent of the condition $S$ (denoted $Z \perp\!\!\!\perp S$), the intermediate representation $\hat{y}$ from the first decoder layer $g_1$ still retains a strong dependence on $S$ (denoted $Y \not\perp\!\!\!\perp S$). This strong $s$ component in $\hat{y}$ leads to distinct, non-compact distributions for $\hat{y}_{s=0}$ and $\hat{y}_{s=1}$.

To address this, trVAE introduces MMD regularization on the $\hat{y}$ representation. The goal is to compactify the distribution of $\hat{y}$ across different conditions $s$, forcing it to occupy the same region of its support $\mathcal{V}$. This encourages the model to learn common features across conditions, which is vital for accurate out-of-sample transformation.

The following figure (Figure 1 from the original paper) illustrates the trVAE architecture:

Figure 1: The transformer VAE (trVAE) is an MMD-regularized CVAE. It receives randomized batches of data $x$ and condition $s$ as input during training, stratified for approximately equal proportions across conditions. The MMD regularization acts on the first decoder layer, matching $\mathrm{MMD}(g_1(z_0', s=0),\, g_1(z_1', s=1))$ with $z_0' = f(x, s=0)$ and $z_1' = f(x, s=1)$, which enables transformations from condition $s=0$ to $s=1$.

The diagram shows that the MMD regularization layer is applied to the output of the first decoder layer, $g_1(\hat{z}, s)$. During training, randomized batches with different conditions $s$ are fed into the network, and the MMD loss term operates on the $\hat{y}$ outputs corresponding to these different conditions, as sketched below.
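The data flow can be sketched with simple stand-in layers, as below. The layer sizes, the ReLU dense layers, the one-hot condition encoding, and the deterministic encoder are illustrative assumptions rather than the paper's trained architecture; the point is only where $f$, $g_1$, and $g_2$ sit and on which activations the MMD acts.

```python
# Minimal sketch of the trVAE data flow f -> g1 -> g2 with randomly initialized
# dense stand-ins. Only the wiring matters: g1 receives (z_hat, s), g2 receives
# y_hat alone, and the MMD acts on y_hat batches from different conditions.
import numpy as np

rng = np.random.default_rng(0)

def dense(in_dim, out_dim):
    """Return a randomly initialized ReLU dense layer (illustrative stand-in)."""
    w = rng.normal(scale=0.1, size=(in_dim, out_dim))
    return lambda h: np.maximum(h @ w, 0.0)

x_dim, s_dim, z_dim, y_dim = 1000, 2, 50, 128
f = dense(x_dim + s_dim, z_dim)     # encoder f(x, s) -> z_hat (deterministic stand-in)
g1 = dense(z_dim + s_dim, y_dim)    # first decoder layer g1(z_hat, s) -> y_hat
g2 = dense(y_dim, x_dim)            # remaining decoder g2(y_hat) -> x_hat (no s input)

x0 = rng.normal(size=(8, x_dim))    # half-batch observed under condition 0
x1 = rng.normal(size=(8, x_dim))    # half-batch observed under condition 1
s0 = np.tile([1.0, 0.0], (8, 1))    # one-hot encoding of s = 0
s1 = np.tile([0.0, 1.0], (8, 1))    # one-hot encoding of s = 1

z0 = f(np.concatenate([x0, s0], axis=1))
z1 = f(np.concatenate([x1, s1], axis=1))
y0 = g1(np.concatenate([z0, s0], axis=1))   # MMD regularization is computed
y1 = g1(np.concatenate([z1, s1], axis=1))   # between these two y_hat batches
x0_hat, x1_hat = g2(y0), g2(y1)
print(z0.shape, y0.shape, x0_hat.shape)     # (8, 50) (8, 128) (8, 1000)
```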

4.2.4. trVAE Loss Function

The trVAE objective extends the standard CVAE objective with an MMD regularization term. To apply MMD between representations of different conditions, the CVAE term is duplicated for two observations ($X$ and $X'$), potentially from different conditions ($S$ and $S'$).

Let $\hat{y}_{s=0} = g_1(f(x, s=0), s=0)$ be the intermediate representation for an input $x$ under condition $s=0$, and let $\hat{y}_{s'=1} = g_1(f(x', s'=1), s'=1)$ be the intermediate representation for an input $x'$ under condition $s'=1$.

The full trVAE objective is defined as: $ \mathcal{L}_{\mathrm{trVAE}}(\phi, \theta; X, X', S, S', \alpha, \eta, \beta) = \mathcal{L}_{\mathrm{CVAE}}(\phi, \theta; X, S, \alpha, \eta) + \mathcal{L}_{\mathrm{CVAE}}(\phi, \theta; X', S', \alpha, \eta) - \beta\, \ell_{\mathrm{MMD}}(\hat{Y}_{s=0}, \hat{Y}_{s'=1}) $. Here:

  • $\mathcal{L}_{\mathrm{CVAE}}(\phi, \theta; X, S, \alpha, \eta)$: The standard CVAE objective for the first batch $(X, S)$.

  • $\mathcal{L}_{\mathrm{CVAE}}(\phi, \theta; X', S', \alpha, \eta)$: The standard CVAE objective for the second batch $(X', S')$. This duplication allows the model to process and reconstruct data from both conditions in a single training step.

  • $\beta$: A hyperparameter weighting the MMD regularization term.

  • $\ell_{\mathrm{MMD}}(\hat{Y}_{s=0}, \hat{Y}_{s'=1})$: The MMD distance between the intermediate representations obtained from the first decoder layer, where $\hat{Y}_{s=0}$ corresponds to samples from condition $s=0$ and $\hat{Y}_{s'=1}$ corresponds to samples from condition $s'=1$. Note the sign convention: because $\mathcal{L}_{\mathrm{CVAE}}$ is a (weighted) evidence lower bound that is maximized, subtracting $\beta\, \ell_{\mathrm{MMD}}$ from the objective drives the MMD between $\hat{Y}_{s=0}$ and $\hat{Y}_{s'=1}$ toward zero, i.e., the distributions are pulled together; the VFAE loss of Louizos et al. (2015), which the paper cites, uses the same convention. Readers who interpret $\mathcal{L}$ as a loss to be minimized should flip all signs accordingly.

    During training, trVAE uses randomized batches of data and conditions, stratified to ensure approximately equal proportions of different conditions. This allows the MMD term to compare and align the intermediate representations from different conditions, as sketched below.
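A minimal sketch of how the pieces combine into the training objective is given below. The weighted-ELBO values, the $\hat{y}$ activations, the kernel bandwidths, and $\beta$ are illustrative stand-ins, not outputs of the actual network or the paper's hyperparameters.

```python
# Minimal sketch of the trVAE objective: two weighted CVAE (ELBO) terms for the
# condition-stratified half-batches plus the MMD penalty on the first-decoder-
# layer activations. All quantities below are illustrative stand-ins.
import numpy as np

rng = np.random.default_rng(0)
beta = 100.0

def mmd(a, b, gammas=(0.1, 1.0, 10.0)):
    """Biased empirical MMD estimate with a multi-scale RBF kernel."""
    def k(u, v):
        d = np.sum((u[:, None, :] - v[None, :, :]) ** 2, axis=-1)
        return sum(np.exp(-g * d) for g in gammas)
    return k(a, a).mean() + k(b, b).mean() - 2.0 * k(a, b).mean()

elbo_0, elbo_1 = -120.3, -118.7             # stand-ins for L_CVAE(X, S) and L_CVAE(X', S')
y_s0 = rng.normal(size=(32, 128))           # stand-in y_hat activations for condition s = 0
y_s1 = rng.normal(loc=0.5, size=(32, 128))  # stand-in y_hat activations for condition s' = 1

objective = elbo_0 + elbo_1 - beta * mmd(y_s0, y_s1)   # maximized during training
loss = -objective                                      # what a gradient-descent optimizer minimizes
print(round(loss, 3))
```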

4.2.5. Prediction Time

The process of out-of-sample transformation in trVAE is as follows:

  1. Encoding Source: An input sample from a source condition (e.g., $x_0$ with $s=0$) is passed through the encoder $f$ to obtain its latent representation $\hat{z}_0$: $ \hat{z}_0 = f(x_0, s=0) $.
  2. Conditional Decoding to Target: The obtained latent representation $\hat{z}_0$ is then passed to the decoder $g$, but with the target condition (e.g., $s=1$): $ \hat{x}_{s=1} = g(\hat{z}_0, s=1) $. This transforms the input $x_{s=0}$ into $\hat{x}_{s=1}$, effectively predicting how $x_{s=0}$ would look or behave under condition $s=1$ (see the sketch after this list). The MMD regularization on the intermediate $y$ layer ensures that the decoder can handle the switch in condition $s$ and produce a meaningful output, leveraging the shared features learned across conditions.
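The two prediction-time steps can be sketched as follows, reusing the same kind of illustrative dense stand-ins as in the data-flow sketch above; `f`, `g1`, `g2`, and the one-hot condition encoding are assumptions, not the paper's trained network.

```python
# Minimal sketch of out-of-sample transformation at prediction time:
# encode with the source condition, decode with the target condition.
import numpy as np

rng = np.random.default_rng(0)

def dense(in_dim, out_dim):
    w = rng.normal(scale=0.1, size=(in_dim, out_dim))
    return lambda h: np.maximum(h @ w, 0.0)

x_dim, s_dim, z_dim, y_dim = 1000, 2, 50, 128
f = dense(x_dim + s_dim, z_dim)                 # encoder stand-in
g1 = dense(z_dim + s_dim, y_dim)                # first decoder layer stand-in
g2 = dense(y_dim, x_dim)                        # remaining decoder stand-in

x0 = rng.normal(size=(4, x_dim))                # samples observed in condition 0
s0 = np.tile([1.0, 0.0], (4, 1))                # source condition s = 0
s1 = np.tile([0.0, 1.0], (4, 1))                # target condition s = 1

z0 = f(np.concatenate([x0, s0], axis=1))               # step 1: encode with the source condition
x1_pred = g2(g1(np.concatenate([z0, s1], axis=1)))     # step 2: decode with the target condition
print(x1_pred.shape)                                   # predicted response under condition 1
```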

5. Experimental Setup

5.1. Datasets

The paper evaluates trVAE on a variety of high-dimensional image and tabular datasets to demonstrate its versatility and performance.

5.1.1. Morpho-MNIST

  • Domain: Image data (digits).

  • Characteristics: Contains 60,000 images each of "normal," "thin," and "thick" digits. This dataset allows for clear style transfer tasks.

  • Training Setup: For training, all normal-stroke digits ($s=0$) were used for all digit domains ($d \in \{0, 1, \ldots, 9\}$). For the transformed conditions (thin and thick strokes, $s \in \{1, 2\}$), only a subset of digit domains ($d \in \{1, 3, 6, 7\}$) was included. This creates an out-of-sample scenario where the model must transform digits not seen in the thin/thick condition during training.

  • Example: A normal '8' digit. The task is to transform it into a thin '8' or thick '8', where '8's were only seen as normal during training for the thin/thick conditions. The following figure (Figure 3 from the original paper) shows an example of style transfer for Morpho-MNIST:

    Figure 3: Out-of-sample style transfer for the Morpho-MNIST dataset containing normal, thin, and thick digits. trVAE successfully transforms normal digits that were never seen in the thin or thick condition during training to thin (a) and thick (b) strokes.

5.1.2. CelebA

  • Domain: Image data (celebrity faces).

  • Characteristics: CelebA (Liu et al., 2015) comprises 202,599 images with 40 binary attributes per image. The paper focuses on the smiling ($s$) and gender ($d$) attributes.

  • Training Setup: The model was trained with images of both smiling and non-smiling men (i.e., men in both the $s=0$ and $s=1$ conditions). However, for women, only non-smiling images were included in the training data (i.e., women only in the $s=0$ condition). This creates a challenging out-of-sample task: transforming non-smiling women into smiling women, a condition not directly observed during training.

  • Example: A non-smiling female face. The task is to generate the same face with a smile. The following figure (Figure 4 from the original paper) shows an example of CelebA transformation:

    Figure 4: CelebA dataset with images in two conditions: celebrities without a smile and with a smile on their face. trVAE successfully adds a smile to the faces of non-smiling women, despite smiling women being absent from the training data.

5.1.3. Gut Single-Cell Gene Expression Data (Infection Response)

  • Domain: Tabular data (single-cell gene expression).

  • Source: Haber et al., 2017.

  • Characteristics: Characterizes gut cell responses to Salmonella or Heligmosomoides polygyrus (H. poly) infections. The dataset includes eight different cell types and four conditions: control (healthy), H.Poly.Day3, H.Poly.Day10, and Salmonella infection. The normalized data has 1,000 dimensions (genes).

  • Training Setup: To compare with baselines limited to two conditions, experiments were initially conducted with control and H.Poly.Day10 conditions. Tuft cells were held out from training and validation in the infected condition, representing an out-of-sample challenge due to their potentially distinct features and fewer training instances. For testing trVAE's multi-condition capabilities, all three perturbed conditions were included, holding out each of the eight cell types.

  • Example: Gene expression profile of a control Tuft cell. The task is to predict its profile if it were H.Poly.Day10 infected. The following figure (Figure 5 from the original paper) presents results for the gut cell data:

    Figure 5: (a) UMAP visualization of conditions and cell types for gut cells. (b-c) Mean and variance expression of 1,000 genes, comparing trVAE-predicted and real infected Tuft cells. (d) Distribution of Defa24 expression compared across models. (e) Comparison of Pearson correlations of predicted mean and variance across models. (f) trVAE correlation of mean gene expression across all cell types and conditions, reflecting the model's prediction accuracy and generalization.

5.1.4. PBMC Single-Cell Gene Expression Data (Stimulation Response)

  • Domain: Tabular data (single-cell gene expression).

  • Source: Kang et al., 2018.

  • Characteristics: Comprises 7,217 IFN-β-stimulated and 6,359 control peripheral blood mononuclear cells (PBMCs) from eight human Lupus patients. IFN-β stimulation causes significant transcriptional changes. The data has 2,000 dimensions.

  • Training Setup: Natural killer (NK) cells were held out from training in the stimulated condition for out-of-sample prediction.

  • Example: Gene expression profile of a control NK cell. The task is to predict its profile if it were IFN-β stimulated. The following figure (Figure 6 from the original paper) presents results for the PBMC data:

    Figure 6: (a) UMAP visualization of peripheral blood mononuclear cells (PBMCs). (b-c) Mean and variance per 2,000 dimensions between trVAE-predicted and real natural killer (NK) cells. (d) Distribution of ISG15 expression after IFN-β stimulation. (e) Comparison of $R^2$ values across models for predicted mean and variance gene expression.

5.2. Evaluation Metrics

The primary evaluation metric used in the paper is the squared Pearson correlation coefficient ($R^2$) between predicted and ground-truth gene expression means and variances.

5.2.1. Pearson Correlation Coefficient ($r$)

  • Conceptual Definition: The Pearson correlation coefficient measures the linear relationship between two sets of data. It is a normalized measure of the covariance between two variables, ranging from -1 to +1. A value of +1 indicates a perfect positive linear relationship, -1 a perfect negative linear relationship, and 0 no linear relationship. In the context of the paper, $R^2$ (the coefficient of determination) is reported, which represents the proportion of the variance in the dependent variable that is predictable from the independent variable(s). Here, it quantifies how well the predicted gene expression values (or their statistics, such as mean and variance) linearly correspond to the real, observed values.

  • Mathematical Formula: For two sets of data, $X = \{x_1, \dots, x_n\}$ and $Y = \{y_1, \dots, y_n\}$, the Pearson correlation coefficient $r$ is calculated as: $ r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}} $. The coefficient of determination, $R^2$, is simply $r^2$.

  • Symbol Explanation:

    • $n$: The number of data points (e.g., genes, or samples in a comparison).
    • $x_i$: The $i$-th value in the first dataset (e.g., predicted mean expression for gene $i$).
    • $y_i$: The $i$-th value in the second dataset (e.g., ground-truth mean expression for gene $i$).
    • $\bar{x}$: The mean of the first dataset.
    • $\bar{y}$: The mean of the second dataset.
    • $\sum$: Summation operator. (A short computational sketch follows this list.)
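A short sketch of computing $r$ and $R^2$ between predicted and ground-truth per-gene means is given below; the arrays are illustrative placeholders.

```python
# Minimal sketch of the Pearson correlation r and the reported R^2 between
# predicted and ground-truth per-gene means (illustrative placeholder data).
import numpy as np

rng = np.random.default_rng(0)
truth = rng.normal(size=1000)                 # e.g. real mean expression per gene
pred = truth + 0.2 * rng.normal(size=1000)    # e.g. model-predicted means

r = np.corrcoef(pred, truth)[0, 1]            # Pearson correlation coefficient
r_squared = r ** 2                            # coefficient of determination reported in the paper
print(round(r, 3), round(r_squared, 3))
```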

5.3. Baselines

The trVAE model is benchmarked against a variety of existing methods and alternative VAE configurations:

  1. Vanilla CVAE (Conditional Variational Autoencoder): (Sohn et al., 2015)

    • A standard CVAE without any MMD regularization. It serves as a direct baseline to show the impact of the MMD term.
  2. CVAE with MMD on Bottleneck (MMD-CVAE): Similar to VFAE (Louizos et al., 2015)

    • This is a CVAE where MMD regularization is applied to the latent bottleneck layer $Z$. It directly tests the authors' hypothesis that MMD is more effective when applied to the first decoder layer rather than the bottleneck.
  3. MMD-regularized Autoencoder: (Dziugaite et al., 2015b; Amodio et al., 2019)

    • A non-variational autoencoder (AE) regularized with MMD. This baseline helps to assess if the VAE component itself is crucial or if MMD with a simpler AE is sufficient. This baseline is referred to as SAUCIE in some figures and tables.
  4. CycleGAN (Cycle-Consistent Generative Adversarial Network): (Zhu et al., 2017)

    • A prominent unpaired image-to-image translation model that uses adversarial losses and a cycle-consistency loss to learn mappings between two image domains without paired examples. It represents a strong generative adversarial network (GAN) baseline for style-transfer tasks.
  5. scGen (Single-Cell Gene Expression Network): (Lotfollahi et al., 2019)

    • A VAE combined with vector arithmetics in the latent space for predicting single-cell perturbation responses. This is a direct competitor for the biological tasks, and notably, one of trVAE's authors also co-authored scGen. It relies on hard-coded transformations rather than end-to-end learning.
  6. scVI (Single-Cell Variational Inference): (Lopez et al., 2018a)

    • A CVAE specifically designed for single-cell transcriptomics data, using a negative binomial output distribution to model count data inherent in gene expression. It's a specialized and highly-regarded baseline for biological applications.

      For the image examples (Morpho-MNIST and CelebA), convolutional layers were used, while for the single-cell gene expression datasets, fully connected layers were employed. The optimal hyperparameters for each model were determined via a parameter grid-search for each application. Detailed hyperparameters for trVAE and all baselines are provided in Tables 1-9 in Appendix A of the paper.

5.4. Hyperparameters

The following tables provide the detailed architectures and hyperparameters for trVAE and the baseline models used in the experiments.

The following are the results from Table 1 of the original paper:

Name Operation NoF/Kernel Dim. Dropout BN Activation Input
input (28, 28, 1) × ×
conditions 2 × ×
FC-1 FC 128 × × Leaky ReLU conditions
FC-2 FC 784 0.2 Leaky ReLU FC-1
FC-2_resh Reshape (28, 28, 1) × × × FC-2
Conv2D_1 Conv2D (4, 4, 64, 2) × × Leaky ReLU [FC-2_resh, input]
Conv2D_2 Conv2D (4, 4, 64, 64) × × Leaky ReLU Conv2D_1
FC-3 FC 128 × Leaky ReLU Flatten(Conv2D_2)
mean FC 50 × × Linear FC-3
var FC 50 × × Linear FC-3
Z FC 50 × × Linear [mean, var]
FC-4 FC 128 × × Leaky ReLU conditions
FC-5 FC 784 0.2 Leaky ReLU FC-4
FC-5_resh Reshape (28, 28, 1) × × × FC-5
MMD FC 128 × Leaky ReLU [z, FC-5_resh]
FC-6 FC 256 × × Leaky ReLU MMD
FC-7_resh Reshape (2, 2, 64) × × × FC-6
Conv_transp_1 Conv2D Transpose (4, 4, 128, 64) × × Leaky ReLU FC-7_resh
Conv_transp_2 Conv2D Transpose (4, 4, 64, 64) × × Leaky ReLU UpSampling2D(Conv_tr
Conv_transp_3 Conv2D Transpose (4, 4, 64, 64) × × Leaky ReLU Conv_transp_2
Conv_transp_4 Conv2D Transpose (4, 4, 2, 64) × × Leaky ReLU UpSampling2D(Conv_tr
output Conv2D Transpose (4, 4, 1, 2) × × ReLU UpSampling2D(Conv_tr
Optimizer Adam
Learning Rate 0.001
Leaky ReLU slope 0.2
Batch Size 1024
# of Epochs 5000
α 0.001
β 1000

The following are the results from Table 2 of the original paper:

Name Operation NoF/Kernel Dim. Dropout BN Activation Input
input (64, 64, 3) × ×
conditions 2 × ×
FC-1 FC 128 × × ReLU conditions
FC-2 FC 1024 0.2 ReLU FC-1
FC-2_reshaped Reshape (64, 64, 1) × × × FC-2
Conv_1 Conv2D (3, 3, 64, 4) × × ReLU [FC-2_reshaped, input]
Conv_2 Conv2D (3, 3, 64, 64) × × ReLU Conv_1
Pool_1 Pooling2D × × × × Conv_2
Conv_3 Conv2D (3, 3, 128, 64) × × ReLU Pool_1
Conv_4 Conv2D (3, 3, 128, 128) × × ReLU Conv_3
Pool_2 Pooling2D × × × × Conv_4
Conv_5 Conv2D (3, 3, 256, 128) × × ReLU Pool_2
Conv_6 Conv2D (3, 3, 256, 256) × × ReLU Conv_5
Conv_7 Conv2D (3, 3, 256, 256) × × ReLU Conv_6
Pool_3 Pooling2D × × × × Conv_7
Conv_8 Conv2D (3, 3, 512, 256) × × ReLU Pool_3
Conv_9 Conv2D (3, 3, 512, 512) × × ReLU Conv_8
Conv_10 Conv2D (3, 3, 512, 512) × × ReLU Conv_9
Pool_4 Pooling2D × × × × Conv_10
Conv_11 Conv2D (3, 3, 512, 256) × × ReLU Pool_4
Conv_12 Conv2D (3, 3, 512, 512) × × ReLU Conv_11
Conv_13 Conv2D (3, 3, 512, 512) × × ReLU Conv_12
Pool_4 Pooling2D × × × × Conv_13
flat Flatten × × × × Pool_4
FC-3 FC 1024 × × ReLU flat
FC-4 FC 256 0.2 × ReLU FC-3
mean FC 60 × × Linear FC-4
var FC 60 × × Linear FC-4
Z-sample FC 60 × × Linear [mean, var]
FC-5 FC 128 × × ReLU conditions
MMD FC 256 × ReLU [z-sample, FC-5]
FC-6 FC 1024 × × ReLU MMD
FC-7 FC 4096 × × ReLU FC-6
FC-7_reshaped Reshape × × × FC-7
Conv_transp_1 Conv2D Transpose (3, 3, 512, 512) × × ReLU FC-7_reshaped
Conv_transp_2 Conv2D Transpose (3, 3, 512, 512) × × ReLU Conv_transp_1
Conv_transp_3 Conv2D Transpose (3, 3, 512, 512) × × ReLU Conv_transp_2
up_sample_1 UpSampling2D × × × × Conv_transp_3
Conv_transp_4 Conv2D Transpose (3, 3, 512, 512) × × ReLU up_sample_1
Conv_transp_5 Conv2D Transpose (3, 3, 512, 512) × × ReLU Conv_transp_4
Conv_transp_6 Conv2D Transpose (3, 3, 512, 512) × × ReLU Conv_transp_5
up_sample_2 UpSampling2D × × × × Conv_transp_6
Conv_transp_7 Conv2D Transpose (3, 3, 128, 256) × × ReLU up_sample_2
Conv_transp_8 Conv2D Transpose (3, 3, 128, 128) × × ReLU Conv_transp_7
up_sample_3 UpSampling2D × × × × Conv_transp_8
Conv_transp_9 Conv2D Transpose (3, 3, 64, 128) × × ReLU up_sample_3
Conv_transp_10 Conv2D Transpose (3, 3, 64, 64) × × ReLU Conv_transp_9
output Conv2D Transpose (1, 1, 3, 64) × × ReLU Conv_transp_10

The following are the results from Table 3 of the original paper:

Name Operation NoF/Kernel Dim. Dropout BN Activation Input
input input_dim × ×
conditions n_conditions × ×
FC-1 FC 800 0.2 Leaky ReLU [input, conditions]
FC-2 FC 800 0.2 Leaky ReLU FC-1
FC-3 FC 128 0.2 Leaky ReLU FC-2
mean FC 50 × × Linear FC-3
var FC 50 × × Linear FC-3
z-sample FC 50 × × Linear [mean, var]
MMD FC 128 0.2 Leaky ReLU [z-sample, conditions]
FC-4 FC 800 0.2 Leaky ReLU MMD
FC-5 FC 800 0.2 Leaky ReLU FC-3
output FC input_dim × × ReLU FC-4
Optimizer Adam
Learning Rate 0.001
Leaky ReLU slope 0.2
Batch Size 512
# of Epochs 5000
α 0.00001
β 100
η 100

The following are the results from Table 4 of the original paper:

Name Operation NoF/Kernel Dim. Dropout BN Activation Input
input input_dim ×
FC-1 FC 800 0.2 Leaky ReLU input
FC-2 FC 800 0.2 Leaky ReLU FC-1
FC-3 FC 128 0.2 Leaky ReLU FC-2
mean FC 100 × × Linear FC-3
var FC 100 × × Linear FC-3
Z FC 100 × × Linear [mean, var]
MMD FC 128 0.2 Leaky ReLU Z
FC-4 FC 800 0.2 Leaky ReLU MMD
FC-5 FC 800 0.2 Leaky ReLU FC-3
output FC input_dim × × ReLU FC-4
Optimizer Adam
Learning Rate 0.001
Leaky ReLU slope 0.2
Batch Size 32
# of Epochs 300
α 0.00050
β 100
η 100

The following are the results from Table 5 of the original paper:

Name Operation NoF/Kernel Dim. Dropout BN Activation Input
input input_dim × ×
conditions 1 × ×
FC-1 FC 800 0.2 Leaky ReLU [input, conditions]
FC-2 FC 800 0.2 Leaky ReLU FC-1
FC-3 FC 128 0.2 Leaky ReLU FC-2
mean FC 50 × × Linear FC-3
var FC 50 × × Linear FC-3
Z-sample FC 50 × Linear [mean, var]
MMD FC 128 0.2 × Leaky ReLU [z-sample, conditions]
FC-4 FC 800 0.2 Leaky ReLU MMD
FC-5 FC 800 0.2 Leaky ReLU FC-3
output FC input_dim × × ReLU FC-4
Optimizer Adam
Learning Rate 0.001
Leaky ReLU slope 0.2
Batch Size 512
# of Epochs 300
α 0.001

The following are the results from Table 6 of the original paper:

Name Operation NoF/Kernel Dim. Dropout BN Activation Input
input input_dim × ×
conditions 1 × ×
FC-1 FC 800 0.2 Leaky ReLU [input, conditions]
FC-2 FC 800 0.2 Leaky ReLU FC-1
FC-3 FC 128 0.2 Leaky ReLU FC-2
mean FC 50 × × Linear FC-3
var FC 50 × × Linear FC-3
Z-sample FC 50 × × Linear [mean, var]
MMD FC 128 0.2 Leaky ReLU [z-sample, conditions]
FC-4 FC 800 0.2 Leaky ReLU MMD
FC-5 FC 800 0.2 Leaky ReLU FC-3
output FC input_dim × × ReLU FC-4
Optimizer Adam
Learning Rate 0.001
Leaky ReLU slope 0.2
Batch Size 512
# of Epochs 500
α 0.001
β 1

The following are the results from Table 7 of the original paper:

Name Operation NoF/Kernel Dim. Dropout BN Activation Input
input input_dim × Leaky ReLU input
FC-1 FC 700 0.5
FC-2 FC 100 0.5 Leaky ReLU FC-1
FC-3 FC 50 0.5 Leaky ReLU
FC-4 FC 100 0.5 Leaky ReLU FC-2
FC-5 FC 700 0.5 Leaky ReLU FC-3
generator_out FC 6,998 × ReLU FC-4
FC-6 FC 700 0.5 Leaky ReLU generator_out
FC-7 FC 100 0.5 Leaky ReLU FC-6
discriminator_out FC 1 Sigmoid FC-7
Generator Optimizer Adam
Discriminator Optimizer Adam
Optimizer Adam
Learning Rate 0.001
Leaky ReLU slope 0.2
# of Epochs 1000

The following are the results from Table 8 of the original paper:

Name Operation NoF/Kernel Dim. Dropout BN Activation Input
input input_dim × ×
conditions 1 × ×
FC-1 FC 128 0.2 ReLU input
mean FC 10 × × Linear FC-1
var FC 10 × × Linear FC-1
Z FC 10 × × Linear [mean, var]
FC-2 FC 128 0.2 ReLU [z, conditions]
output FC input_dim × × ReLU FC-2
Optimizer Adam

The following are the results from Table 9 of the original paper:

Name Operation NoF/Kernel Dim. Dropout BN Activation Input
input input_dim × ×
conditions 1 × ×
FC-1 FC 512 × Leaky ReLU [input, conditions]
FC-2 FC 256 × × Leaky ReLU FC-1
FC-3 FC 128 × × Leaky ReLU FC-2
FC-4 FC 20 × × Leaky ReLU FC-3
FC-5 FC 128 × × Leaky ReLU FC-4
FC-6 FC 256 × × Leaky ReLU FC-5
FC-7 FC 512 × × Leaky ReLU FC-6
output FC input_dim × × ReLU FC-4
Optimizer Adam
Learning Rate 0.001
Leaky ReLU slope 0.2
Batch Size 256
# of Epochs 1000

6. Results & Analysis

6.1. Core Results Analysis

The experimental results demonstrate trVAE's superior performance in out-of-sample generation and conditional transformation across diverse data types, both qualitatively for images and quantitatively for biological data.

6.1.1. Qualitative Results: Image Style Transfer

  • Morpho-MNIST (Figure 3): trVAE successfully performs style transfer for out-of-sample digits. It can transform normal digits to thin or thick strokes even when those specific digits were not seen in the thin/thick conditions during training. This indicates that the model learned abstract features of "thinness" and "thickness" that generalize across different digit identities.
  • CelebA (Figure 4): This is a more complex out-of-sample task. trVAE demonstrates the ability to add a smile to non-smiling female faces, despite never having seen smiling female faces in the training data (only non-smiling women, and smiling/non-smiling men). The model preserves most aspects of the original image, showing its capacity to handle intricate facial features and perform a targeted, unseen transformation. This highlights the power of MMD regularization in forcing condition-invariant features that allow for robust generalization.

6.1.2. Quantitative Results: Single-Cell Gene Expression Data

For high-dimensional tabular data, particularly single-cell gene expression, trVAE shows significant quantitative improvements.

  • Infection Response (Gut Cells - Figure 5):

    • Mean and Variance Prediction (Figure 5b-c): trVAE accurately predicts the mean and variance for 1,000 genes in Tuft cells infected with H.Poly.Day10, even though these specific cells were held out during training. The strong correlation ($R^2 = 0.97$ for the mean, $R^2 = 0.87$ for the variance) indicates high fidelity in predicting the complex transcriptional changes. Genes showing the highest differential expression (highlighted in red) are well captured.
    • Distribution of Marker Gene (Figure 5d): For Defa24, a gene with a strong response to H.Poly infection, trVAE's predicted expression distribution closely matches the real stimulated cells, outperforming other models (e.g., CVAE, MMD-CVAE, SAUCIE) which show larger discrepancies. This demonstrates trVAE's ability to model not just central tendencies but also the overall distribution of gene expression.
    • Model Comparison (Figure 5e): trVAE significantly outperforms all baselines in Pearson $R^2$ values for both mean and variance gene expression. It improves the mean $R^2$ from 0.89 (best baseline: scGen) to 0.97, and the variance $R^2$ from 0.75 (best baseline: scGen) to 0.87. This quantitatively validates the effectiveness of trVAE's MMD regularization strategy over existing approaches, including those with MMD in the bottleneck (MMD-CVAE). The poor performance of MMD-CVAE (MMD on the bottleneck) supports the paper's argument for placing MMD in the decoder's first layer.
    • Multiple Conditions (Figure 5f): Beyond two conditions, trVAE accurately predicts all eight cell types across all three perturbed conditions, which was a challenge for many existing models. This highlights its generalizability to more complex scenarios with multiple interacting conditions.
  • Stimulation Response (PBMCs - Figure 6):

    • Mean and Variance Prediction (Figure 6b-c): For IFN-β-stimulated NK cells (held out during training), trVAE accurately predicts both mean and variance across 2,000 dimensions. As with the gut data, strongly responding genes are well captured.
    • Distribution of Marker Gene (Figure 6d): trVAE correctly predicts the increase in ISG15 expression, a key gene responding to IFN-β perturbation, for NK cells it has never seen stimulated. Other models show less accurate predictions for this distribution.
    • Model Comparison (Figure 6e): Consistent with the previous dataset, trVAE achieves the highest Pearson $R^2$ values for both mean and variance predictions (0.97 for the mean, 0.87 for the variance), demonstrating its superior performance over CVAE, MMD-CVAE, CycleGAN, and SAUCIE for this task.

6.1.3. UMAP Visualization (Figure 2)

The following figure (Figure 2 from the original paper) qualitatively illustrates the effect of MMD regularization:

Figure 2: Comparison of representations for the MMD layer in trVAE and the corresponding layer in the vanilla CVAE on PBMC data, using UMAP (McInnes et al., 2018). The MMD regularization incentivizes the model to learn condition-invariant features, yielding a more compact representation of the samples.

This UMAP visualization demonstrates the core mechanism of trVAE. The MMD regularization applied to the first decoder layer forces the representations ($y$) from different conditions to overlap more significantly. In the vanilla CVAE, the representations for different conditions are distinct and separated. In contrast, trVAE's MMD regularization incentivizes the model to learn condition-invariant features, resulting in a more compact and unified representation space for the $y$ layer across conditions. This qualitative observation underpins the quantitative improvements, as such compactification is essential for robust out-of-sample transformation. A minimal sketch of how such a visualization could be produced follows below.
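Below is a minimal sketch of how such a comparison could be produced: embed the first-decoder-layer activations of two conditions with UMAP and color the points by condition. The umap-learn and matplotlib packages are assumed, and the activation arrays are random placeholders rather than the trained model's outputs.

```python
# Minimal sketch: UMAP embedding of first-decoder-layer activations, colored by
# condition. The activation arrays are random placeholders for illustration.
import numpy as np
import umap                      # pip install umap-learn
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
y_s0 = rng.normal(size=(300, 128))              # y_hat activations, condition 0 (placeholder)
y_s1 = rng.normal(loc=1.0, size=(300, 128))     # y_hat activations, condition 1 (placeholder)

embedding = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=0).fit_transform(
    np.vstack([y_s0, y_s1])
)
labels = np.array([0] * len(y_s0) + [1] * len(y_s1))

plt.scatter(embedding[:, 0], embedding[:, 1], c=labels, s=5, cmap="coolwarm")
plt.title("UMAP of first-decoder-layer activations by condition")
plt.show()
```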

6.2. Data Presentation (Tables)

The architecture tables were presented in Section 5.4.

6.3. Ablation Studies / Parameter Analysis

While the paper does not present a formal ablation study table, it explicitly discusses a key architectural decision that functions as an implicit ablation:

  • MMD Regularization Placement: The paper states, "Experiments below demonstrate that indeed, MMD regularization on the bottleneck layer $z$ does not improve performance." This is a crucial finding that justifies trVAE's design choice to place MMD regularization in the decoder's first layer ($y$) rather than at the bottleneck ($z$). The quantitative results in Figures 5e and 6e confirm this, showing that MMD-CVAE (which applies MMD to the bottleneck) performs worse than trVAE and often on par with, or only slightly better than, vanilla CVAE. This indicates that the specific placement of MMD regularization is critical for achieving the desired out-of-sample transformation capabilities. The authors argue that the $z$-representation is already incentivized to be free of condition information, whereas the $y$-representation still strongly covaries with $S$, making it the ideal target for regularization.

7. Conclusion & Reflections

7.1. Conclusion Summary

The paper successfully introduces trVAE, a Conditional Variational Autoencoder enhanced with Maximum Mean Discrepancy (MMD) regularization applied to the first layer of its decoder. The central insight is that while the latent bottleneck of a CVAE may already exhibit condition-invariance, the subsequent intermediate representations in the decoder still strongly depend on the conditional variable. By forcing these intermediate representations to be compact and aligned across different conditions using MMD, trVAE significantly improves its ability to perform out-of-sample generation and conditional transformation. The model demonstrates superior robustness and accuracy over existing methods, both qualitatively on image datasets (Morpho-MNIST, CelebA) and quantitatively on high-dimensional single-cell gene expression data, particularly for complex biological perturbation prediction tasks involving minority classes and multiple conditions. This data-driven, end-to-end approach represents a notable advancement in conditional generative modeling.

7.2. Limitations & Future Work

The authors acknowledge several limitations and propose future research directions:

  • Further Regularization at Later Layers: While trVAE focuses on the first decoder layer, the authors suggest that further regularization at later layers might be beneficial. However, they note that this could be numerically costly and unstable due to the high dimensionality of representations in deeper layers. This remains an area for systematic investigation.
  • Application to Larger and More Complex Data: Future work will involve applying trVAE to larger datasets and focusing on interaction effects among conditions. This is particularly relevant for drug interaction studies, an important application domain previously highlighted by Amodio et al. (2018).
  • Connections to Causal Inference Models: The paper aims to establish stronger conceptual links to causal-inference-inspired models beyond Johansson et al. (2016), such as CEVAE (Louizos et al., 2017). This direction seeks to formally re-frame faithful modeling of an interventional distribution as successful perturbation effect prediction across domains, which could provide deeper theoretical foundations for trVAE's capabilities.

7.3. Personal Insights & Critique

The trVAE paper presents an elegant and effective solution to a pertinent problem in conditional generation. The strategic placement of MMD regularization is a key innovation and a prime example of how a seemingly minor architectural tweak, guided by a deep understanding of model behavior, can yield significant performance improvements.

Strengths:

  • Clear Problem Definition: The paper clearly articulates the challenge of out-of-sample generation and transformation in CVAEs, highlighting why existing approaches fall short.
  • Effective Solution: The MMD regularization on the first decoder layer is intuitively appealing and empirically validated. The argument that $Z$ is already largely disentangled from the condition, while $Y$ still carries strong conditional information, is a crucial insight.
  • Broad Applicability: Demonstrated on diverse data types (images, single-cell gene expression) and tasks (style transfer, perturbation prediction), showing its versatility. The biological applications are particularly impactful, addressing real-world needs in drug discovery and understanding cellular responses.
  • Quantitative and Qualitative Evidence: The paper provides strong evidence through both compelling visual examples (Morpho-MNIST, CelebA, UMAPs) and significant improvements in quantitative metrics (Pearson correlations on gene expression data).

Potential Issues/Critique:

  • MMD Sensitivity: Like all MMD-based methods, trVAE can be sensitive to the choice of kernel function and its hyperparameters (e.g., the bandwidths $\gamma_i$ of the RBF kernel, and $\beta$ for the MMD weighting). The multi-scale RBF kernel is a good heuristic, but finding optimal $\gamma_i$ values might require careful tuning.
  • Computational Cost of MMD: Calculating MMD involves quadratic sums over batch sizes ($n_0^2$, $n_1^2$, $n_0 n_1$). For very large batch sizes or high-dimensional $y$ representations, this could become computationally intensive. The chosen batch size (e.g., 512 for gene expression data) seems reasonable, but scalability to extremely large datasets could be a concern.
  • Sign Convention of the MMD Term: In the paper's objective, the MMD term enters with a negative sign ($-\beta\, \ell_{\mathrm{MMD}}$). Since $\mathcal{L}_{\mathrm{CVAE}}$ is a weighted evidence lower bound that is maximized, this convention does minimize the MMD and pulls the two distributions together, matching the VFAE formulation of Louizos et al. (2015). However, because the text also refers to $\mathcal{L}$ as a "loss," a reader who assumes minimization may momentarily conclude the MMD is being maximized; stating the optimization direction explicitly would improve rigor.
  • Unpaired Data Handling: The paper states "unpaired data," which is a strength. The training strategy of using randomized batches stratified by condition allows for collecting samples from different conditions within a batch to calculate MMD, effectively using unpaired data.

Personal Insights & Future Value: The trVAE approach offers valuable insights for anyone working with conditional generative models and domain adaptation. The lesson regarding the optimal placement of regularization (not always the bottleneck!) is broadly applicable. Its success in single-cell biology is particularly exciting, as perturbation prediction is a critical component of systems biology and drug development. The ability to predict drug responses or disease effects for minority cell types or unseen conditions without expensive wet-lab experiments could significantly accelerate biological research. The conceptual connection to causal inference also points to a rich area for future theoretical development, potentially bridging generative modeling with stronger guarantees of counterfactual reasoning. This work could inspire similar MMD-based regularization strategies in other generative architectures (e.g., GANs) or other model types where condition-invariance or domain alignment in intermediate representations is desired.
