
A cross entropy test allows quantitative statistical comparison of t-SNE and UMAP representations

This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

This study introduces a robust cross-entropy-based KS test to quantitatively compare t-SNE and UMAP embeddings, capturing true biological variation and enabling distance-based analyses for complex single-cell datasets beyond visualization.

Abstract

Article: A cross entropy test allows quantitative statistical comparison of t-SNE and UMAP representations.

Highlights:

  • A cross entropy test enables evaluation of differences between t-SNE and UMAP projections
  • The cross entropy test can distinguish biological variation from technical variation
  • The cross entropy test can quantify differences between multiple samples
  • Full code and instructions are given for applying the test to single cell datasets

Authors: Carlos P. Roca, Oliver T. Burton, Julika Neumann, ..., Rafael V. Veiga, Stéphanie Humblet-Baron, Adrian Liston. Correspondence: al989@cam.ac.uk

In brief: Dimensionality-reduction tools such as t-SNE and UMAP allow visualizations of single-cell datasets. Roca et al. develop and validate the cross entropy test for robust comparison of dimensionality-reduced datasets in flow cytometry, mass cytometry, and single-cell sequencing. The test allows statistical significance assessment and quantification of differences.

Citation: Roca et al., 2023, Cell Reports Methods 3, 100390, January 23, 2023. © 2022 The Author(s). https://doi.org/10.1016/j.crmeth.2022.100390

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

A cross entropy test allows quantitative statistical comparison of t-SNE and UMAP representations

1.2. Authors

Carlos P. Roca, Oliver T. Burton, Julika Neumann, Samar Tareen, Carly E. Whyte, Vaclav Gergelits, Rafael V. Veiga, Stéphanie Humblet-Baron, and Adrian Liston.

Their affiliations include: Immunology Programme, The Babraham Institute, Babraham Research Campus, Cambridge, UK; VIB Center for Brain and Disease Research, Leuven, Belgium; and KU Leuven University of Leuven, Department of Microbiology and Immunology, Leuven, Belgium. Carlos P. Roca and Oliver T. Burton are noted to have contributed equally, and Adrian Liston is the lead contact.

1.3. Journal/Conference

Cell Reports Methods.

This journal is a peer-reviewed scientific journal publishing methods across biological and biomedical sciences. It is generally considered a reputable venue for methodological advances.

1.4. Publication Year

2023 (Received: August 3, 2022; Revised: October 29, 2022; Accepted: December 20, 2022; Published: January 13, 2023).

1.5. Abstract

The research addresses the growing need for quantitative analysis of high-dimensional single-cell data, particularly from single-cell sequencing, flow cytometry, and mass cytometry. While dimensionality-reduction tools like t-distributed stochastic neighbor embedding (t-SNE) and Uniform Manifold Approximation and Projection (UMAP) are widely used for visualization, a robust statistical method for comparing these representations has been lacking. The authors propose a novel statistical test based on applying the Kolmogorov-Smirnov (KS) test to the distributions of cross entropy of single cells within dimensionality-reduced datasets. This approach enables the quantitative statistical comparison of t-SNE and UMAP plots, effectively distinguishing true biological variation from technical noise or rotational symmetry inherent in these algorithms. Furthermore, the test provides a valid distance metric between single-cell datasets, allowing for the hierarchical organization (dendrograms) of multiple samples. The study demonstrates the broad utility of this test across various single-cell technologies, highlighting the untapped potential of dimensionality-reduction tools beyond mere visualization for biomedical data analysis.


2. Executive Summary

2.1. Background & Motivation (Why)

The explosion of high-dimensional single-cell data (e.g., from single-cell sequencing, flow cytometry, mass cytometry) has led to the widespread adoption of dimensionality-reduction techniques like t-SNE and UMAP for visualization. These tools are excellent for giving an "overview" and identifying visual patterns in complex datasets. However, their primary use has been limited to qualitative "eye-balling" for identifying differences or similarities between samples.

The core problem addressed by this paper is the lack of robust statistical methods to quantitatively compare these dimensionality-reduced datasets. This limitation means that researchers often revert to "pseudo-bulk" analyses (e.g., comparing cluster frequencies or aggregated mean expressions) for statistical rigor, thereby losing the rich, single-cell level information preserved by t-SNE and UMAP. The paper aims to bridge this gap by providing a quantitative, statistical test that can assess differences between t-SNE and UMAP representations, thereby unlocking their full analytical potential beyond just visualization.

2.2. Main Contributions / Findings (What)

The primary contributions and findings of this paper are:

  • Development of a Cross Entropy Test: The authors derived a novel statistical test for quantitatively comparing dimensionality-reduced datasets (t-SNE and UMAP) by applying the Kolmogorov-Smirnov (KS) test to the distributions of cross entropy values calculated for individual cells within each representation.
  • Robustness and Specificity: The test reliably:
    • Fails to detect significant differences between technical replicates, biological replicates, and independent runs of t-SNE/UMAP on the same sample (even with rotational symmetry).
    • Successfully identifies significant differences between biologically distinct samples.
  • Quantitative Distance Metric: The test provides the $L^\infty$ distance (also known as the Kolmogorov-Smirnov statistic or D-statistic) as a valid measure of difference between single-cell datasets, enabling the construction of dendrograms for hierarchical clustering and comparison of multiple samples.
  • Sensitivity to Both Quantitative and Qualitative Changes: The cross entropy test is sensitive to both shifts in inter-cluster frequency (changes in the proportion of different cell populations) and intra-cluster phenotype (subtle changes in marker expression or transcriptional profile within a cell population).
  • Broad Applicability: The test is validated across multiple single-cell technologies, including flow cytometry, mass cytometry, and single-cell sequencing, and works for both t-SNE and UMAP transformations.
  • Open-Source Implementation: Full code and instructions are provided, making the test accessible for application to single-cell datasets.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand this paper, a beginner needs to grasp several key concepts related to dimensionality reduction and information theory:

  • High-Dimensional Data: Data with many features or parameters (e.g., tens to thousands of protein markers in flow cytometry, thousands of gene expressions in single-cell sequencing). Visualizing or directly analyzing this data is challenging for humans.
  • Dimensionality Reduction: Techniques that transform high-dimensional data into a lower-dimensional space (e.g., 2D or 3D) while preserving important information, making it easier to visualize and analyze.
  • t-Distributed Stochastic Neighbor Embedding (t-SNE):
    • A non-linear dimensionality reduction algorithm widely used for visualizing high-dimensional data, especially in single-cell biology.
    • Goal: To convert similarities between data points in high-dimensional space into probabilities that represent similarities in low-dimensional space.
    • Mechanism: It models pairwise similarities between data points. In the high-dimensional space, Gaussian distributions are used to measure probabilities. In the low-dimensional space, Student's t-distributions (Cauchy distribution in 2D) are used. The algorithm iteratively adjusts the positions of points in the low-dimensional space to minimize the difference between these two probability distributions.
    • Focus: Primarily focuses on preserving local relationships (i.e., making sure points that are close in high-dimensional space are also close in low-dimensional space). It is less reliable for preserving global distances between clusters.
    • Perplexity: A key parameter in t-SNE that can be thought of as a smooth measure of the effective number of neighbors. It balances attention between local and global aspects of the data.
  • Uniform Manifold Approximation and Projection (UMAP):
    • Another non-linear dimensionality reduction technique, often used as an alternative to t-SNE.
    • Goal: Similar to t-SNE, it aims to create a low-dimensional representation of high-dimensional data.
    • Mechanism: UMAP is based on Riemannian geometry and algebraic topology. It constructs a fuzzy simplicial complex (a type of graph) in both high and low dimensions and then optimizes the low-dimensional graph to be as structurally similar as possible to the high-dimensional one.
    • Advantages: Generally faster than t-SNE and often better at preserving global structure (i.e., the relative distances between clusters) while still maintaining local relationships.
  • Entropy ($H(P)$): A measure of the uncertainty or randomness of a probability distribution $P$. Higher entropy means more uncertainty.
  • Cross-Entropy ($H(P,Q)$): A measure of the dissimilarity between two probability distributions $P$ and $Q$. If $P$ is the true distribution and $Q$ is an approximation, cross-entropy quantifies the average number of bits needed to encode an event from $P$ using an encoding scheme optimized for $Q$. The lower the cross-entropy, the more similar the distributions.
  • Kullback-Leibler (KL) Divergence ($D(P,Q)$): Also known as relative entropy, it quantifies how much one probability distribution $Q$ diverges from a reference probability distribution $P$. It is always non-negative, and zero if and only if $P$ and $Q$ are identical. The relationship $D(P,Q) = H(P,Q) - H(P)$ shows that KL divergence is the difference between cross-entropy and entropy. In the context of t-SNE, minimizing KL divergence is equivalent to minimizing cross-entropy because the entropy of the high-dimensional distribution $H(P)$ is fixed.
  • Kolmogorov-Smirnov (KS) Test:
    • A non-parametric statistical test used to determine if two samples are drawn from the same underlying probability distribution, or if a sample is drawn from a particular theoretical distribution.
    • Mechanism: It compares the cumulative distribution functions (CDFs) of two samples. The KS statistic (or D-statistic) is the maximum absolute difference between the two CDFs.
    • Interpretation: A large KS statistic (D-value) and a small p-value suggest that the two distributions are significantly different.
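As a concrete illustration of the two-sample KS test, here is a minimal Python sketch using `scipy.stats.ks_2samp` (not the paper's R implementation); the synthetic samples and effect size are hypothetical:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
# Two samples from the same distribution: expect a high p-value.
same = ks_2samp(rng.normal(0, 1, 5000), rng.normal(0, 1, 5000))
# A small location shift: expect a larger D-statistic and a tiny p-value.
shift = ks_2samp(rng.normal(0, 1, 5000), rng.normal(0.2, 1, 5000))
print(f"same distribution: D={same.statistic:.3f}, p={same.pvalue:.3f}")
print(f"shifted by 0.2:    D={shift.statistic:.3f}, p={shift.pvalue:.2e}")
```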

3.2. Previous Works

The paper builds upon the fundamental algorithms of t-SNE and UMAP, which are extensively used in single-cell analysis:

  • t-SNE: Introduced by Laurens van der Maaten and Geoffrey Hinton in 2008 (Ref. 3). It revolutionized visualization of high-dimensional data, especially for its ability to reveal complex local structures.
  • UMAP: Developed by Leland McInnes, John Healy, and James Melville in 2018 (Ref. 4). It emerged as a faster and often more globally preserving alternative to t-SNE.
  • Density-preserving variants: The paper mentions den-SNE and densMAP (Ref. 6), which are modifications to t-SNE and UMAP, respectively, designed to address the issue that cluster size in standard t-SNE/UMAP plots doesn't always reflect actual cell density. These demonstrate the ongoing evolution of these tools to refine their representations.
  • Graph-based clustering: Tools like FlowSOM (Ref. 8) are frequently paired with t-SNE/UMAP to provide biological meaning by clustering cells based on their high-dimensional phenotypes, which are then overlaid on the low-dimensional plots.

3.3. Technological Evolution

The field of single-cell analysis has rapidly evolved, moving from bulk measurements to single-cell resolution. This shift necessitated new data analysis tools:

  • Early methods: Traditional dimensionality reduction techniques like Principal Component Analysis (PCA) and Multidimensional Scaling (MDS) were useful but are linear and struggle to capture complex non-linear relationships prevalent in single-cell data.
  • Non-linear methods: t-SNE and UMAP filled this gap by providing non-linear representations that could accurately reflect the high similarity of cells within specific populations. However, their qualitative nature limited their utility as analytical tools.
  • The current challenge: The challenge addressed by this paper is moving beyond "visualization" to "quantification" for these powerful non-linear representations. Researchers often resorted to pseudo-bulk analysis, where clusters (identified from t-SNE/UMAP plots or other methods) are treated as bulk populations, and statistical comparisons are made on their frequencies or average marker expressions. This approach discards much of the single-cell level detail that t-SNE/UMAP preserve.

3.4. Differentiation

The proposed cross entropy test differentiates itself from existing approaches in several key ways:

  • Quantitative vs. Qualitative: Unlike the traditional "eye-balling" of t-SNE/UMAP plots, this test provides a rigorous statistical p-value and a quantitative distance ($L^\infty$) to compare datasets.
  • Single-Cell Level Information: It directly operates on the inter-relationship of single cells within the dimensionality-reduced space, rather than relying on aggregated cluster statistics (pseudo-bulk analysis). This means it retains the rich information about individual cell phenotypes.
  • Sensitivity to Subtle Changes: It is sensitive to both inter-cluster frequency changes (proportions of cell types) and intra-cluster phenotypic shifts (subtle changes in marker expression within a cell type), which are often missed by simpler frequency-based comparisons.
  • Robustness to Algorithmic Noise: It is specifically designed to be robust to common variations in t-SNE/UMAP outputs, such as run-based rotational symmetry or small parameter changes, ensuring that detected differences are biologically meaningful rather than algorithmic artifacts.
  • Broad Applicability: It is demonstrated to work consistently across diverse single-cell technologies (flow, mass, single-cell sequencing) and both t-SNE and UMAP, making it a versatile addition to the single-cell toolkit.

4. Methodology

4.1. Principles

The core principle of the cross entropy test is to leverage the inherent mathematical basis of t-SNE (and, by extension, UMAP) for quantitative comparison. t-SNE minimizes the Kullback-Leibler (KL) divergence between a high-dimensional probability distribution $P$ (representing pairwise similarities in the original data) and a low-dimensional probability distribution $Q$ (representing pairwise similarities in the embedding). This minimization is equivalent to minimizing the cross entropy $H(P,Q)$ because the entropy of the high-dimensional distribution $H(P)$ is fixed by the perplexity setting.

The key insight is that each cell in a t-SNE or UMAP representation can be associated with a point cross entropy $H_i^{(P,Q)}$. By calculating these point cross entropies for all cells in a dataset, we obtain a distribution of cross entropy values. If two datasets are truly similar (e.g., biological replicates or independent runs of the same sample), their distributions of point cross entropies should also be similar. Conversely, if they are biologically different, their cross entropy distributions should diverge.

The Kolmogorov-Smirnov (KS) test is then employed to statistically compare these two distributions of point cross entropies. The KS test yields a p-value, indicating the statistical significance of the difference between the distributions, and an $L^\infty$ distance (or D-statistic), quantifying the magnitude of this difference.
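Putting the pieces together, the comparison reduces to a compact statistic (with notation defined fully in section 4.3 below):

$$ h_i = H_i^{(P,Q)} = -\sum_{j \neq i} p_{ij}^* \log q_{ij}^*, \qquad D = L^\infty = \sup_x |F(x) - G(x)|, $$

where $F$ and $G$ are the empirical CDFs of the two sets of point cross entropies $\{h_i\}$ and $\{h_i'\}$, and the KS p-value follows from $D$ and the two sample sizes.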

4.2. Steps & Procedures

The overall procedure for applying the cross entropy test involves the following steps:

  1. High-Dimensional Data Input: Start with two (or more) high-dimensional single-cell datasets, denoted $\{x_i\}$ and $\{x_i'\}$, where $x_i$ represents the high-dimensional features of individual cell $i$.
  2. Dimensionality Reduction: Apply t-SNE or UMAP to each high-dimensional dataset to obtain their respective low-dimensional representations, $\{y_i\}$ and $\{y_i'\}$ (typically 2D coordinates).
  3. Calculate Pairwise Probabilities in High-Dimensional Space ($P$): For each cell $i$, calculate the probability $p_{j|i}$ that cell $j$ is a neighbor of cell $i$ in the high-dimensional space, based on their Euclidean distances and a Gaussian kernel (Equation 1). These probabilities are then symmetrized to get $p_{ij}$ (Equation 2).
  4. Calculate Pairwise Probabilities in Low-Dimensional Space ($Q$): For each cell $i$, calculate the probability $q_{ij}$ that cell $j$ is a neighbor of cell $i$ in the low-dimensional space, based on their Euclidean distances and a Student's t-distribution (Cauchy distribution in 2D) kernel (Equation 4).
  5. Calculate Point Cross Entropy ($H_i^{(P,Q)}$): For each individual cell $i$, calculate its point cross entropy from the local high-dimensional probabilities ($p_{ij}^*$) and low-dimensional probabilities ($q_{ij}^*$). This captures how well the local neighborhood of cell $i$ in the high-dimensional space is preserved in the low-dimensional embedding (Equation 11).
  6. Form Cross Entropy Distributions: Collect the $H_i^{(P,Q)}$ values for all cells in each dataset to form two distributions of point cross entropies: $\{h_i\}$ for the first dataset and $\{h_i'\}$ for the second.
  7. Apply Kolmogorov-Smirnov Test: Perform a two-sample KS test on the two distributions of point cross entropies ($\{h_i\}$ and $\{h_i'\}$).
    • P-value: The KS test yields a p-value, which indicates the probability of observing a difference as extreme as, or more extreme than, that observed, assuming the null hypothesis (that the two distributions are identical) is true. A low p-value (e.g., < 0.05) suggests a statistically significant difference between the two dimensionality-reduced representations.
    • $L^\infty$ Distance: The KS test also provides the $L^\infty$ distance (also known as the D-statistic), which is the maximum absolute difference between the empirical cumulative distribution functions (CDFs) of the two cross entropy distributions. This value quantifies the magnitude of the difference between the two datasets, allowing for quantitative comparison and dendrogram construction.
  8. Interpretation:
    • No significant difference (high p-value): The dimensionality-reduced representations are statistically similar, indicating consistency between technical replicates, biological replicates, or different runs of the same sample.
    • Significant difference (low p-value): The dimensionality-reduced representations are statistically different, suggesting a true biological variation between the samples.
    • $L^\infty$ distance: Can be used to build dendrograms to visualize the hierarchical relationships and overall similarity/dissimilarity between multiple samples.
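The procedure above maps directly onto a short script. The following Python sketch is an illustrative outline under stated assumptions, not the authors' R implementation: it uses scikit-learn's t-SNE for step 2 and a helper `point_cross_entropies` (a reference sketch is given in section 4.3) for steps 3-6.

```python
import numpy as np
from scipy.stats import ks_2samp
from sklearn.manifold import TSNE

def cross_entropy_test(X1, X2, perplexity=30.0):
    """Sketch of the full procedure: embed each dataset, compute the
    per-cell point cross entropies, then compare the two distributions
    with a two-sample KS test."""
    h = []
    for X in (X1, X2):
        Y = TSNE(n_components=2, perplexity=perplexity,
                 init="pca", random_state=0).fit_transform(X)
        h.append(point_cross_entropies(X, Y, perplexity))  # see section 4.3
    result = ks_2samp(h[0], h[1])
    # p-value for significance; the D statistic is the L-infinity distance.
    return result.pvalue, result.statistic
```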

4.3. Mathematical Formulas & Key Details

The paper details the mathematical foundation for t-SNE and how the cross entropy is derived.

Let $n$ be the number of data points (cells) and $d$ the number of dimensions in the original high-dimensional space. $x_i$ are data points in the original space, and $y_i$ are their corresponding points in the low-dimensional space.

  1. Conditional Probability in High-Dimensional Space: The probability $p_{j|i}$ that data point $x_j$ would be a neighbor of data point $x_i$ is defined using a Gaussian distribution centered at $x_i$:
$$ p_{j|i} = \frac{\exp\big(-\|x_j - x_i\|^2 / (2\sigma_i^2)\big)}{\sum_{k \neq i} \exp\big(-\|x_k - x_i\|^2 / (2\sigma_i^2)\big)} $$ (Equation 1)

    • $x_i, x_j, x_k$: Data points (vectors) in the high-dimensional space.
    • $\|x_j - x_i\|^2$: Squared Euclidean distance between $x_j$ and $x_i$.
    • $\sigma_i^2$: Variance of the Gaussian centered at $x_i$. This variance is adjusted for each $x_i$ such that the perplexity of the distribution $P_i$ (defined by $p_{j|i}$ for a fixed $i$) matches a predefined value.
    • $\exp(\cdot)$: Exponential function.
    • $\sum_{k \neq i}$: Sum over all data points $k$ except $i$.
    • This formula effectively measures how likely $x_j$ is to be a neighbor of $x_i$, with closer points having higher probabilities.
  2. Symmetrized Pairwise Probability in High-Dimensional Space: To simplify the gradient calculations, the conditional probabilities are symmetrized:
$$ p_{ij} = \frac{p_{i|j} + p_{j|i}}{2n} $$ (Equation 2)

    • $p_{i|j}$: Conditional probability of $x_i$ being a neighbor of $x_j$.
    • $n$: Total number of data points.
    • This ensures $p_{ij} = p_{ji}$, making the joint probabilities symmetric.
  3. Global Normalization of $p_{ij}$: The sum of all pairwise probabilities in the high-dimensional space is normalized to 1:
$$ \sum_{i=1}^n \sum_{j=1}^n p_{ij} = 1 $$ (Equation 3)

  4. Pairwise Probability in Low-Dimensional Space: In the low-dimensional space, the probability $q_{ij}$ that $y_j$ is a neighbor of $y_i$ is defined using a Student's t-distribution with one degree of freedom (equivalent to a Cauchy distribution):
$$ q_{ij} = \frac{(1 + \|y_j - y_i\|^2)^{-1}}{\sum_{k \neq i} (1 + \|y_k - y_i\|^2)^{-1}} $$ (Equation 4)

    • $y_i, y_j, y_k$: Data points (vectors) in the low-dimensional space (e.g., 2D coordinates).
    • $\|y_j - y_i\|^2$: Squared Euclidean distance between $y_j$ and $y_i$.
    • $(1 + \|y_j - y_i\|^2)^{-1}$: The kernel of the Cauchy distribution. Its heavy tails mean that moderately distant points in the low-dimensional space still exert a non-negligible force, which helps prevent crowding (where all points collapse into a single cluster).
  5. Global Normalization of $q_{ij}$: The sum of all pairwise probabilities in the low-dimensional space is also normalized to 1:
$$ \sum_{i=1}^n \sum_{j=1}^n q_{ij} = 1 $$ (Equation 5)

  6. Kullback-Leibler (KL) Divergence: t-SNE minimizes the KL divergence between the high-dimensional distribution $P$ and the low-dimensional distribution $Q$:
$$ D(P, Q) = \sum_{i=1}^n \sum_{j=1, j \neq i}^n p_{ij} \log \frac{p_{ij}}{q_{ij}} $$ (Equation 6)

    • $\log$: Natural logarithm.
    • This measures how much the low-dimensional representation $Q$ diverges from the high-dimensional representation $P$. The goal of t-SNE is to make $Q$ as similar to $P$ as possible, hence minimizing this value.
  7. Cross-Entropy: The KL divergence can be expressed in terms of cross-entropy $H(P,Q)$ and entropy $H(P)$:
$$ H(P, Q) = - \sum_{i=1}^n \sum_{j=1, j \neq i}^n p_{ij} \log q_{ij} $$ (Equation 7)

    • This is the cross-entropy between $P$ and $Q$.
  8. Entropy of $P$:
$$ H(P) = - \sum_{i=1}^n \sum_{j=1, j \neq i}^n p_{ij} \log p_{ij} $$ (Equation 8)

    • This is the entropy of the high-dimensional distribution $P$.
  9. Relationship between KL Divergence, Cross-Entropy, and Entropy:
$$ D(P, Q) = H(P, Q) - H(P) $$ (Equation 9)

    • Since $H(P)$ is fixed by the perplexity parameter during the t-SNE optimization, minimizing $D(P,Q)$ is equivalent to minimizing $H(P,Q)$.
  10. Local Point Entropy and Cross-Entropy: The paper introduces local (per-point) versions of entropy and cross-entropy, computed from local pair probabilities $p_{ij}^*$ and $q_{ij}^*$. These local probabilities are rescaled versions of $p_{ij}$ and $q_{ij}$ whose sum per point is close to 1 (Equations 12-15).

    • Local Point Entropy:
$$ H_i^{(P)} = - \sum_{j=1, j \neq i}^n p_{ij}^* \log p_{ij}^* $$ (Equation 10)
    • Local Point Cross-Entropy:
$$ H_i^{(P,Q)} = - \sum_{j=1, j \neq i}^n p_{ij}^* \log q_{ij}^* $$ (Equation 11)
      • $p_{ij}^* = n\,p_{ij}$ (Equation 12)
      • $q_{ij}^* = n\,q_{ij}$ (Equation 13)
      • These local point cross-entropies $H_i^{(P,Q)}$ are the key values used to form the distributions for the KS test. They reflect how well the local neighborhood of point $i$ is preserved from high to low dimension.
  11. Local Point KL Divergence:
$$ D_i^{(P,Q)} = H_i^{(P,Q)} - H_i^{(P)} $$ (Equation 16)

    • The global KL divergence $D(P,Q)$ is the average of these point divergences (Equation 17).
  12. Kolmogorov-Smirnov (KS) Test and $L^\infty$ Distance: Given two distributions of point cross-entropies, $\{h_i\}$ and $\{h_i'\}$, the KS test determines whether they are significantly different. The $L^\infty$ distance is the maximum absolute difference between their empirical cumulative distribution functions (CDFs):
$$ L^\infty(F, G) = \max_x |F(x) - G(x)| $$ (This formula follows from the KS test definition and is not explicitly numbered in the paper.)

    • $F(x)$: Cumulative distribution function (CDF) of the first cross entropy distribution.
    • $G(x)$: Cumulative distribution function (CDF) of the second cross entropy distribution.
    • $\max_x(\cdot)$: The maximum difference between the two CDFs over all values $x$.
    • This $L^\infty$ distance is the D-statistic of the KS test and serves as a quantitative measure of dissimilarity between the two dimensionality-reduced datasets.
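For concreteness, the following Python sketch implements Equations 1-13 directly with dense pairwise matrices. It is a naive $O(n^2)$ reference suitable only for small $n$ (the authors' R implementation uses nearest-neighbor approximations via RANN), the function names are illustrative, and the global normalization of $q_{ij}$ follows Equation 5.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def _conditional_probs(d2, perplexity, tol=1e-5, max_iter=50):
    """Row-wise p_{j|i} (Equation 1): binary-search beta_i = 1/(2 sigma_i^2)
    so that the entropy of row i matches log(perplexity)."""
    n = d2.shape[0]
    P = np.zeros((n, n))
    target = np.log(perplexity)
    for i in range(n):
        d = np.delete(d2[i], i)
        lo, hi, beta = 0.0, np.inf, 1.0
        for _ in range(max_iter):
            p = np.exp(-d * beta)
            p /= p.sum()
            h = -np.sum(p * np.log(np.maximum(p, 1e-12)))
            if abs(h - target) < tol:
                break
            if h > target:   # entropy too high -> narrower Gaussian
                lo = beta
                beta = beta * 2 if np.isinf(hi) else (beta + hi) / 2
            else:            # entropy too low -> wider Gaussian
                hi = beta
                beta = (lo + beta) / 2
        P[i, np.arange(n) != i] = p
    return P

def point_cross_entropies(X, Y, perplexity=30.0):
    """Per-cell H_i^{(P,Q)} (Equation 11) from high-dimensional data X and a
    2D embedding Y, using p* = n p and q* = n q (Equations 12-13)."""
    n = X.shape[0]
    cond = _conditional_probs(squareform(pdist(X, "sqeuclidean")), perplexity)
    P = (cond + cond.T) / (2 * n)                           # Equation 2
    W = 1.0 / (1.0 + squareform(pdist(Y, "sqeuclidean")))   # Cauchy kernel
    np.fill_diagonal(W, 0.0)
    Q = W / W.sum()                                         # Equation 5
    Pl, Ql = n * P, n * Q                                   # local probabilities
    return -(Pl * np.log(np.maximum(Ql, 1e-12))).sum(axis=1)
```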

4.4. Additional Dimensionality Reduction Analysis

The paper also mentions that the cross entropy test was applied to other dimensionality reduction techniques like PacMAP and triMAP using Python via the R package reticulate, suggesting its broader applicability. This implies that the underlying principle of comparing cross-entropy distributions is not strictly limited to t-SNE or UMAP but can extend to other embedding methods that rely on preserving local relationships.
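Because the point cross entropy computation only needs the high-dimensional data and the 2D coordinates, swapping in a different embedding is a one-line change. A hedged sketch, assuming the Python umap-learn package and the `point_cross_entropies` helper from section 4.3; note that for simplicity it reuses the t-SNE-style kernels to score the UMAP embedding, whereas the paper derives the analogous quantities for UMAP's own cost function (and drove PacMAP and triMAP from R via reticulate):

```python
import umap  # the umap-learn package
from scipy.stats import ks_2samp

def cross_entropy_test_umap(X1, X2, n_neighbors=15, perplexity=30.0):
    """Same test as in section 4.2, with UMAP supplying the embedding;
    the cross entropy distributions are compared exactly as before."""
    h = []
    for X in (X1, X2):
        Y = umap.UMAP(n_components=2, n_neighbors=n_neighbors).fit_transform(X)
        h.append(point_cross_entropies(X, Y, perplexity))
    result = ks_2samp(h[0], h[1])
    return result.pvalue, result.statistic
```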

4.5. General Implementation Details

The test was implemented in R and uses several standard R packages: FlowSOM, ggplot2, ggridges, RANN, RColorBrewer, reshape2, Rtsne, and umap. This indicates a standard bioinformatic programming environment, making the tool accessible to researchers familiar with R. The code is available on GitHub and Zenodo, with a guide provided online.

5. Experimental Setup

5.1. Datasets

The study validated the cross entropy test across various single-cell technologies and biological contexts.

  1. MUS (Mouse) Dataset:

    • Origin: High-dimensional flow cytometry data from C57BL/6 inbred mice.
    • Characteristics: Immunological profiling of lymphocytes from different tissues: lymph nodes, spleen, and small intestinal lamina propria (tissue).
    • Purpose: This dataset was specifically designed to test the robustness of the cross entropy test against:
      • Technical replicates: Splitting a single sample.
      • Biological replicates: Comparing analogous samples from different, but genetically identical, mice.
      • True biological differences: Comparing lymphocytes from distinct tissues within the same mice (e.g., spleen vs. lymph node vs. intestinal tissue).
      • Algorithmic variability: Independent t-SNE/UMAP runs on the same sample.
    • Availability: Accessible via FlowRepository (id/FR-FCM-Z48W).
  2. MC (Mass Cytometry) Dataset:

    • Origin: Human peripheral blood samples from patients with COVID-19.
    • Characteristics: Mass cytometric analysis of lymphocyte subsets at different time points: admission to ICU, during ICU stay (intermediate), and upon discharge from ICU. This dataset includes major immune subsets like neutrophils, CD4+ T cells, CD8+ T cells, and monocytes.
    • Purpose: To demonstrate the utility of the test in a clinical setting and with mass cytometry data, identifying immune landscape changes during disease progression and recovery.
    • Availability: Penttila et al. (2021), available via FlowRepository (id/FR-FCM-Z34U).
  3. SCS (Single-Cell Sequencing) Dataset:

    • Origin: Human bronchoalveolar lavage (BAL) fluid samples.

    • Characteristics: 10x single-cell RNA sequencing data from patients with COVID-19 pneumonia and non-COVID pneumonia. Includes annotated cell clusters such as epithelial, neutrophil, monocyte/macrophage, CD4 T cell, CD8 T cell, dendritic cell, B cell, and natural killer (NK) cell clusters.

    • Purpose: To demonstrate the test's applicability to single-cell gene expression data and its ability to recapitulate findings from complex transcriptional analyses.

    • Availability: Wauters et al. (2021), available in the EGA European Genome-Phenome Archive database (EGAS00001004717).

      Example Data Sample: While the paper describes the general nature of the data (flow cytometry, mass cytometry, single-cell sequencing), it doesn't provide a specific, raw data sample (e.g., a single cell's full marker expression profile or gene counts). However, the results are visualized as 2D scatter plots where each point represents a single cell, colored by FlowSOM clustering or tissue/patient source.

The figure below (Figure 1 of the original paper) provides a visual example of how the t-SNE plots of single-cell data look, with different clusters representing distinct cell populations. For instance, in Figure 1A, technical replicates are shown as t-SNE plots of splenocytes, where each point is a cell and colors represent FlowSOM clusters. The input to t-SNE for such plots would be the high-dimensional expression data for each cell (e.g., 20+ protein markers).

The figure shows t-SNE or UMAP embeddings of technical replicates, biological replicates, and samples from different tissues, together with the corresponding cross entropy test p-values, reflecting the statistical differences between datasets.

5.2. Evaluation Metrics

The primary evaluation metrics used in this paper are:

  1. P-value from Kolmogorov-Smirnov (KS) Test:

    • Conceptual Definition: The p-value assesses the statistical significance of the observed difference between two distributions of single-cell cross-entropy values. It quantifies the probability of obtaining a KS statistic (D-value) as extreme as, or more extreme than, the one calculated from the data, assuming that the two distributions are actually identical (the null hypothesis).
    • Importance: A low p-value (typically < 0.05 or < 0.001 in high-powered studies) indicates strong evidence against the null hypothesis, suggesting that the two dimensionality-reduced datasets are statistically different. Conversely, a high p-value suggests that any observed differences could be due to random chance, supporting the null hypothesis of no significant difference.
    • Mathematical Concept: The KS test compares the empirical cumulative distribution functions (ECDFs) of two samples. For two samples $X_1, \ldots, X_n$ and $Y_1, \ldots, Y_m$, the empirical distribution functions are
$$ F_n(x) = \frac{1}{n} \sum_{i=1}^n \mathbf{1}_{X_i \leq x}, \qquad G_m(x) = \frac{1}{m} \sum_{j=1}^m \mathbf{1}_{Y_j \leq x}, $$
where $\mathbf{1}_{(\cdot)}$ is the indicator function (1 if the condition is true, 0 otherwise). The KS statistic (D-value) is then
$$ D_{n,m} = \sup_x |F_n(x) - G_m(x)|. $$
      • $n, m$: Number of observations in the first and second samples, respectively.
      • $F_n(x), G_m(x)$: Empirical cumulative distribution functions (ECDFs) of the two samples.
      • $\sup_x$: Supremum (least upper bound) over all $x$; in practice, the maximum absolute difference between the two ECDFs at any point.
      • The p-value is calculated from the $D_{n,m}$ statistic and the sample sizes $n$ and $m$.
  2. $L^\infty$ Distance (Kolmogorov-Smirnov Statistic):

    • Conceptual Definition: The $L^\infty$ distance, in the context of the KS test, is the maximum absolute difference between the empirical cumulative distribution functions of the two samples being compared. It directly measures the magnitude of the difference between the two distributions.
    • Importance: Unlike the p-value, which is sensitive to sample size, the $L^\infty$ distance provides a direct quantification of how "far apart" the two distributions are, regardless of sample size. It is particularly useful for:
      • Quantifying effect size: Giving an intuition of the biological magnitude of the difference.
      • Dendrogram construction: Serving as a distance metric to hierarchically cluster multiple samples, allowing for a visual organization of their similarities and differences.
    • Mathematical Formula:
$$ L^\infty(F, G) = \max_x |F(x) - G(x)| $$
      • $F(x)$: CDF of the first distribution of cross entropies.
      • $G(x)$: CDF of the second distribution of cross entropies.
      • $\max_x$: The maximum absolute difference between the two CDFs over all values of $x$. This is identical to the KS $D_{n,m}$ statistic.
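A manual computation of this distance takes only a few lines. The sketch below evaluates both ECDFs on the pooled sample (which suffices, since step functions change only at sample points) and should agree with the `statistic` field returned by `scipy.stats.ks_2samp`:

```python
import numpy as np

def l_inf_distance(h1, h2):
    """D = sup_x |F_n(x) - G_m(x)| for two samples of point cross entropies."""
    grid = np.sort(np.concatenate([h1, h2]))      # pooled evaluation points
    F = np.searchsorted(np.sort(h1), grid, side="right") / len(h1)
    G = np.searchsorted(np.sort(h2), grid, side="right") / len(h2)
    return np.abs(F - G).max()
```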

5.3. Baselines

The paper does not compare the cross entropy test against other statistical tests for t-SNE/UMAP representations (as such robust tests were largely absent). Instead, the validation strategy involves comparing the test's performance against:

  • Null Hypothesis Scenarios:
    • Technical replicates: Two aliquots from the same biological sample. The test should not find a significant difference.
    • Biological replicates: Samples from different individuals but under similar biological conditions (e.g., age-/sex-matched mice). The test should not find a significant difference (or at least a much lower significance than true biological differences).
    • Independent t-SNE/UMAP runs: Multiple runs of the same algorithm on the same dataset. Due to the stochastic nature of these algorithms, minor visual variations (like rotational symmetry) can occur. The test should not find a significant difference.
  • Alternative Hypothesis Scenarios (True Biological Differences):
    • Different biological samples: Samples from distinct biological sources (e.g., spleen vs. lymph node vs. tissue). The test should find a significant difference.

    • Artificially constructed differences: Creating synthetic datasets with known degrees of difference (e.g., mixing cells from two distinct tissues at defined ratios) to assess the test's sensitivity and the proportionality of the $L^\infty$ distance, as illustrated in the sketch after this list.

    • Quantitative vs. Qualitative changes: Artificially manipulating cell frequencies (inter-cluster frequency) or phenotypic expressions (intra-cluster phenotype) to verify the test's sensitivity to both types of biological variation.

      The effectiveness of the test is thus validated by its ability to correctly identify these known similarities and differences, demonstrating its robustness and specificity.
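A defined-ratio mixture of this kind is straightforward to construct. The following sketch is purely illustrative (the function name, sizes, and ratio are hypothetical, not the paper's exact procedure):

```python
import numpy as np

def mix_datasets(A, B, frac_a=0.9, n_cells=20_000, seed=0):
    """Build an artificial sample with a known composition, e.g. 90% cells
    drawn from tissue A and 10% from tissue B, to test whether the
    L-infinity distance tracks the mixing ratio."""
    rng = np.random.default_rng(seed)
    n_a = int(round(frac_a * n_cells))
    rows_a = rng.choice(len(A), size=n_a, replace=False)
    rows_b = rng.choice(len(B), size=n_cells - n_a, replace=False)
    return np.vstack([A[rows_a], B[rows_b]])
```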

6. Results & Analysis

6.1. Core Results

The cross entropy test demonstrated robust performance across various validation scenarios, confirming its ability to distinguish biological variation from technical noise and algorithmic artifacts.

1. Robustness in Technical and Biological Replicates:

  • Technical Replicates (Figure 1A): When comparing t-SNE plots generated from technical replicates of a single splenocyte sample, the cross entropy test consistently yielded high p-values (0.370-1.0), indicating no significant difference. This confirms its ability to correctly dismiss non-biological variations arising from sample processing or data splitting.
  • Biological Replicates (Figure 1B): Similarly, comparing splenocyte samples from age-/sex-matched mice (biological replicates) resulted in high p-values (0.202-0.636), supporting the null hypothesis of no significant difference. This is crucial for ensuring the test doesn't over-report differences due to minor individual variability.

2. Identification of True Biological Differences (Figure 1C):

  • The test effectively identified significant differences when comparing lymphocytes from biologically distinct tissues (spleen, lymph node, and intestinal tissue). For example, comparisons between spleen and lymph node, spleen and tissue, and lymph node and tissue all showed highly significant p-values (<0.001 to <0.0001). This demonstrates the test's sensitivity to genuine biological distinctions.

3. Robustness to t-SNE Run Variations (Figure 1D):

  • Independent runs of t-SNE on the same lymph node sample often produce visually distinct plots due to rotational symmetry and stochastic initialization. Despite these visual disparities, the cross entropy test correctly assigned a high p-value (0.585) between such runs, indicating no significant underlying difference. This highlights its resilience to common artifacts of t-SNE generation that can mislead human visual assessment.

4. Quantitative Comparison with $L^\infty$ Distance (Figure 2):

  • To test the quantitative aspect, artificial datasets were created by mixing spleen and tissue cells (e.g., a "spleentissue" mixture with 90% spleen, 10% tissue, and a "tissuespleen" mixture with 10% spleen, 90% tissue).

  • The $L^\infty$ distance correctly reflected the known biological closeness: "spleentissue" was quantitatively closer to the pure spleen sample, and "tissuespleen" was closer to the pure tissue sample.

  • A dendrogram constructed using $L^\infty$ distances accurately grouped samples based on their composition (Figure 2). This validates $L^\infty$ as a valuable metric for quantifying relative distances between t-SNE plots and organizing complex datasets.
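Building such a dendrogram from pairwise $L^\infty$ distances is a standard hierarchical clustering step. A sketch with made-up distance values (not taken from the paper):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

labels = ["spleen", "90:10 mix", "10:90 mix", "tissue"]
# Hypothetical symmetric matrix of pairwise L-infinity distances.
D = np.array([
    [0.00, 0.12, 0.45, 0.55],
    [0.12, 0.00, 0.40, 0.48],
    [0.45, 0.40, 0.00, 0.10],
    [0.55, 0.48, 0.10, 0.00],
])
Z = linkage(squareform(D), method="average")  # condensed distances -> tree
dendrogram(Z, labels=labels)                  # plots via matplotlib if available
```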

Figure 2. $L^\infty$ provides a quantitative comparison of different t-SNE visualizations

5. Sensitivity to Inter-cluster Frequency and Intra-cluster Phenotype Shifts (Figure 3):

  • The test's sensitivity to two types of biological changes was assessed using artificial datasets:

    • Intra-cluster phenotype changes: A lymph%spleen dataset (lymph node cells with cluster frequencies normalized to spleen) was compared to the spleen sample. Despite similar cluster frequencies, the test found a significant difference (p < 0.0001), indicating its ability to detect subtle phenotypic shifts within cell populations.
    • Inter-cluster frequency changes: A spleen%lymph dataset (spleen cells with cluster frequencies normalized to lymph node) was compared to the spleen sample. The test also found a significant difference (p < 0.0001), showing its sensitivity to changes in the proportions of different cell types, even if the underlying cell phenotypes are identical.
  • This demonstrates that the cross entropy test captures both types of biological variation that are relevant in single-cell data analysis.

6. Broad Utility Across Technologies (Figure 4):

  • Mass Cytometry (Figure 4A, 4B): Applied to COVID-19 patient peripheral blood data, the test showed that the immune landscape during ICU stay resembled admission more closely overall (Figure 4A). However, focusing solely on monocytes, the intermediate time point resembled discharge, consistent with monocytes being among the first to recover in severe COVID-19 (Figure 4B).

  • Single-Cell Sequencing (Figure 4C, 4D): Comparing bronchoalveolar lavage from COVID vs. non-COVID pneumonia, significant changes in cross entropy were found for epithelial, neutrophil, monocyte/macrophage, CD4 T cell, and CD8 T cell populations. The $L^\infty$ distance identified neutrophils as having the largest change, with CD8 T cells more affected than CD4 T cells, all consistent with published biological findings from detailed traditional analyses.

The figure shows t-SNE embeddings, cell-type frequencies, and statistical comparisons of single-cell data across immune cell subsets at different ICU stages and COVID-19 statuses, illustrating population heterogeneity and statistical significance.

7. UMAP Validation (Figure 5):

  • The cross entropy test was successfully validated for UMAP representations using the MUS dataset.

  • It produced appropriate conclusions for technical replicates (high p-values), biological replicates (high p-values), biological differences (low p-values for different tissues), and repeat UMAP runs (high p-values despite visual variations).

  • Like with t-SNE, it was sensitive to both subcluster cellular phenotype and cluster frequency shifts in UMAP plots.

  • Crucially, the test was robust to changes in UMAP settings (iterations, number of neighbors, number of dimensions) and even other dimensionality-reduction approaches like PacMAP and triMAP (Supplemental Figures S2, S3), suggesting broad utility.

Figure 5. A cross entropy test provides a robust statistical test for UMAP comparison. The figure shows cross-entropy-based statistical comparisons between UMAP representations: technical replicates (A), biological replicates (B), different tissues (C), repeat runs (D), cumulative distribution functions (E), and comparisons between tissues (F), with adjusted p-values.

6.2. Data Presentation (Tables)

The paper contains one table, the Key Resources Table, reproduced below:

| REAGENT or RESOURCE | SOURCE | IDENTIFIER |
| --- | --- | --- |
| Deposited data | | |
| MUS dataset | This paper | https://flowrepository.org/id/FR-FCM-Z48W |
| MC dataset | Penttila et al. (2021) | https://flowrepository.org/id/FR-FCM-Z34U |
| SCS dataset | Wauters et al. (2021) | https://ega-archive.org/studies/EGAS00001004717 |
| Software and algorithms | | |
| Cross Entropy test | https://github.com/AdrianListon/Cross-Entropy-test | https://doi.org/10.5281/zenodo.7420921 |
| Guide to running the test | https://www.liston.babraham.ac.uk/flowcytoscript/ | |

6.3. Ablations / Parameter Sensitivity

The paper includes supplementary information regarding the sensitivity of the cross entropy test to various parameters of t-SNE and UMAP, which are crucial for understanding its robustness.

  • t-SNE Parameter Robustness (Figure S1, mentioned in text): The test was found to be robust to changes in t-SNE settings, such as perplexity values, iteration values, or analysis of independent runs. This means that minor adjustments to these algorithmic parameters, which can sometimes alter the visual layout of t-SNE plots, do not unduly affect the statistical conclusion of the cross entropy test. This is vital because t-SNE outputs can be sensitive to these parameters, and a robust test should not pick up these parameter-induced visual changes as significant biological differences.

  • UMAP Parameter Robustness (Figure S2, mentioned in text): Similarly, the UMAP cross entropy test was not sensitive to changes in UMAP iterations, the number of neighbors (a key UMAP parameter), or the number of dimensions used in the initial dimensionality reduction step before UMAP. This confirms its reliability across different UMAP configurations.

  • General Applicability to Other DR Methods (Figure S3, mentioned in text): The application of the cross entropy test to additional dimensionality-reduction approaches like PacMAP and triMAP (which are also non-linear embedding techniques) yielded highly similar p-values regardless of the specific reduction technique applied. This suggests that the principle of using cross entropy distributions to compare embeddings has broad utility beyond just t-SNE and UMAP, hinting at the generalizability of the method to the rapidly evolving landscape of embedding algorithms.

    These results from parameter sensitivity and broader applicability studies reveal that the test is primarily sensitive to the underlying data structure rather than the specific algorithmic choices or nuances of a particular dimensionality reduction method.
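A robustness check of this kind is easy to script. The sketch below assumes the `cross_entropy_test` helper from section 4.2 and uses synthetic stand-in data; the variable names and parameter grid are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
pool = rng.normal(size=(4_000, 20))          # synthetic stand-in for one sample
X_rep1, X_rep2 = pool[:2_000], pool[2_000:]  # split into two "technical replicates"

# Sweep t-SNE perplexity; for technical replicates every setting should
# return a non-significant p-value and a small L-infinity distance.
for perplexity in (10, 30, 50, 100):
    p_value, d = cross_entropy_test(X_rep1, X_rep2, perplexity=perplexity)
    print(f"perplexity={perplexity:>3}: p={p_value:.3f}, L-inf={d:.3f}")
```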

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper introduces a groundbreaking statistical test for the quantitative comparison of dimensionality-reduced single-cell datasets generated by t-SNE and UMAP. By leveraging the distributions of single-cell cross entropy values and applying the Kolmogorov-Smirnov test, the authors have developed a robust method that accurately distinguishes true biological variation from technical noise, biological replicates, or algorithmic artifacts like rotational symmetry. The test further provides the $L^\infty$ distance, a quantitative metric that allows for the hierarchical clustering and comparison of multiple samples. Validated across flow cytometry, mass cytometry, and single-cell sequencing, and applicable to both t-SNE and UMAP, this cross entropy test unlocks the largely untapped analytical potential of dimensionality-reduction tools, moving them beyond mere visualization into rigorous statistical analysis for biomedical data.

7.2. Limitations & Future Work

The authors candidly discuss several important limitations and provide guidance for appropriate use:

  • P-value Misinterpretation: They caution against misusing p-values as a direct measure of biological meaningfulness. A very small p-value might indicate a statistical difference, but not necessarily a biologically relevant one, especially with very large sample sizes.

  • Cell Number Dependency: The power of the cross entropy test (and thus the p-value) is dependent on cell number.

    • Low cell numbers: Biologically distinct samples might return non-significant p-values if the cell count is very low (e.g., thousands).
    • High cell numbers: Conversely, very high cell numbers (e.g., >10,000 cells) can provide enough statistical power to detect minute differences between biological replicates, which, while statistically real, might not be biologically meaningful; this sample-size effect is illustrated in the sketch after this list. The authors suggest a threshold of p < 0.001 for high cell number analyses to exclude biological replicates while identifying biological differences.
  • Recommendation for $L^\infty$ Distance: To mitigate the p-value dependency on cell number, they strongly recommend using the $L^\infty$ distance as a measure of difference magnitude. This metric is less sensitive to cell count and provides a more intuitive sense of "how different" samples are.

  • Importance of Controls: They advise including biologically distinct positive controls (to gauge the expected magnitude of difference) and negative controls (technical replicates) in analyses. A significant difference between technical replicates should be interpreted as unacceptable technical variability.

  • Focus on Similarity, Not Metric Preservation: The test evaluates the similarity between t-SNE/UMAP representations, not their fidelity in preserving original distances or metric properties.

  • Complementary to High-Dimensional Analysis: The cross entropy test does not replace the value of full analysis of the original high-dimensional dataset. Dimensionality reduction tools have inherent limitations, and their use should be mindful of these.

  • Data Quality and Experimental Design: Like all statistical tests, the utility of this analysis is entirely dependent on the quality of the input data and the soundness of the experimental design.

    Future work implications are not explicitly detailed, but the broad applicability suggests continued integration into diverse single-cell workflows and potential adaptation to new embedding algorithms. The authors also hint at the potential for clinical diagnostics, where this quantitative approach could offer higher sensitivity for detecting subtle immunological disorders or hematological malignancies by integrating aberrant marker expression and rare populations.
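The dependence of the KS p-value on sample size is easy to demonstrate with synthetic data. In the sketch below (hypothetical distributions, not the paper's data), a fixed, arguably negligible shift becomes "significant" once enough cells are sampled, while the D-statistic stays essentially flat:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(1)
for n in (1_000, 10_000, 100_000):
    a = rng.normal(0.00, 1, n)
    b = rng.normal(0.02, 1, n)   # tiny fixed effect
    r = ks_2samp(a, b)
    print(f"n={n:>6}: D={r.statistic:.4f}, p={r.pvalue:.3g}")
```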

7.3. Personal Insights & Critique

This paper addresses a critical gap in single-cell data analysis. For too long, t-SNE and UMAP have been "black boxes" for visualization, with researchers relying on subjective interpretation or falling back to aggregated data for statistical rigor. The cross entropy test provides a much-needed objective framework to quantify differences, elevating these visualization tools to powerful analytical instruments.

Strengths:

  • Rigorous Foundation: The test is grounded in information theory (cross entropy) and robust statistics (Kolmogorov-Smirnov test), providing a solid theoretical basis.
  • Practical Utility: The validation against diverse biological and technical scenarios, across multiple technologies, makes it highly practical and trustworthy for single-cell biologists.
  • Addressing Key Challenges: Its robustness to t-SNE/UMAP artifacts (like rotational symmetry) and sensitivity to both frequency and phenotypic shifts are crucial for real-world application.
  • Open Science: Providing code and instructions is excellent for reproducibility and adoption.
  • Quantitative Distance: The $L^\infty$ distance offers a magnitude of difference, which is invaluable for biological interpretation and building relationships between samples (dendrograms).

Potential Improvements/Critiques:

  • Interpretability of Cross Entropy Distribution: While the method is mathematically sound, understanding what a specific difference in cross entropy distribution means biologically might still require some training. Is a shift in the mean of the distribution more indicative than a change in its variance? Further work might explore specific biological interpretations of different forms of divergence in these distributions.

  • Guidance for Cell Number vs. P-value: The advice to use a p < 0.001 threshold for high cell numbers is helpful but still heuristic. More rigorous methods for adjusting p-values based on cell count, or focusing primarily on effect size metrics (like $L^\infty$), could be emphasized even more strongly. Perhaps a suggested minimum cell count for reliable statistical inference could be provided.

  • Scalability for Extremely Large Datasets: While t-SNE/UMAP are computationally intensive, the calculation of pairwise probabilities for all cells for cross-entropy could also be demanding for datasets with millions of cells. While not explicitly mentioned as a limitation, computational performance for very large $N$ might be a practical consideration.

  • Integration with Existing Workflows: The paper mentions that downstream analysis often involves clustering. How does the cross entropy test interact with or complement existing cluster-based statistical comparisons? For instance, if two samples show no difference in cluster frequencies but a significant difference in cross entropy, it points to subtle phenotypic shifts within clusters – this synergy could be explored further.

    Overall, this paper represents a significant step forward in quantitative single-cell data analysis. By providing a statistically sound method to compare dimensionality-reduced representations, it enables researchers to extract more nuanced and robust insights from their high-dimensional data, pushing the boundaries of what these powerful visualization tools can achieve.
