
DIVA: A Dirichlet Process Mixtures Based Incremental Deep Clustering Algorithm via Variational Auto-Encoder

Published: 05/23/2023
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

DIVA is a nonparametric incremental deep clustering framework that combines a Dirichlet Process Mixture Model with a Variational Auto-Encoder, allowing dynamic clustering adjustments without needing a predefined number of clusters, outperforming existing models on dynamic data.

Abstract

Generative model-based deep clustering frameworks excel in classifying complex data, but are limited in handling dynamic and complex features because they require prior knowledge of the number of clusters. In this paper, we propose a nonparametric deep clustering framework that employs an infinite mixture of Gaussians as a prior. Our framework utilizes a memoized online variational inference method that enables the "birth" and "merge" moves of clusters, allowing our framework to cluster data in a "dynamic-adaptive" manner, without requiring prior knowledge of the number of features. We name the framework as DIVA, a Dirichlet Process-based Incremental deep clustering framework via Variational Auto-Encoder. Our framework, which outperforms state-of-the-art baselines, exhibits superior performance in classifying complex data with dynamically changing features, particularly in the case of incremental features. We released our source code implementation at: https://github.com/Ghiara/diva

In-depth Reading

1. Bibliographic Information

1.1. Title

DIVA: A Dirichlet Process Mixtures Based Incremental Deep Clustering Algorithm via Variational Auto-Encoder

1.2. Authors

  • Zhenshan Bing: Technical University of Munich (TUM)
  • Yuan Meng: Technical University of Munich (TUM)
  • Yuqi Yun: Technical University of Munich (TUM)
  • Hang Su: Politecnico di Milano
  • Xiaojie Su: Chongqing University
  • Kai Huang: Sun Yat-sen University
  • Alois Knoll: Technical University of Munich (TUM)

1.3. Journal/Conference

This paper was published on arXiv on May 23, 2023. The Technical University of Munich (TUM) is a world-renowned research university, particularly strong in engineering and computer science.

1.4. Publication Year

2023

1.5. Abstract

Generative deep clustering models are powerful but typically suffer from a major limitation: they require the number of clusters (categories) to be defined in advance. This makes them unsuitable for dynamic environments where new features or classes appear over time. This paper introduces DIVA (Dirichlet Process-based Incremental deep clustering framework via Variational Auto-Encoder). It combines a Variational Auto-Encoder (VAE) with a Dirichlet Process Mixture Model (DPMM) prior. Using a "memoized online variational inference" method, DIVA can dynamically "birth" (create) and "merge" (combine) clusters during training. This allows the model to adaptively determine the number of clusters without prior knowledge, outperforming state-of-the-art baselines on complex and incremental datasets.

2. Executive Summary

2.1. Background & Motivation

The Core Problem: In unsupervised learning, specifically clustering, most deep learning algorithms (like Deep Embedded Clustering or VAE-based clustering) require the user to specify the number of clusters, denoted as K, before training begins. For example, if you want to cluster handwritten digits, you must tell the model "there are 10 clusters."

Why this matters: In real-world applications (e.g., a robot exploring a new environment or a news aggregator collecting stories), data is dynamic. New objects or topics appear over time. If a model is fixed to K = 10 but the data introduces an 11th category, the model will fail to recognize it, forcing the new data into an existing, incorrect cluster.

The Gap: While Bayesian nonparametric methods (which can theoretically handle infinite clusters) exist, integrating them with deep neural networks is difficult. Previous attempts (like stick-breaking VAEs) often struggled to capture complex cluster shapes or were difficult to train effectively due to optimization challenges.

2.2. Main Contributions / Findings

  1. Nonparametric Integration: The authors propose DIVA, which successfully integrates a Dirichlet Process Mixture Model (DPMM) as a prior into the latent space of a Variational Auto-Encoder (VAE). This removes the constraint of pre-defining K.

  2. Dynamic Adaptation Mechanism: They employ Memoized Online Variational Inference (memoVB). This is a specific training algorithm that allows the model to mathematically decide when to add a new cluster ("birth") or combine two existing ones ("merge") based on the data it observes.

  3. Incremental Learning Success: The experiments demonstrate that DIVA can handle "incremental features." For instance, if a dataset starts with 3 types of images and grows to 10, DIVA dynamically adjusts its internal structure to accommodate the new types without catastrophic forgetting, achieving superior accuracy compared to parametric baselines.


3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand DIVA, one must grasp three pillars: VAEs, DPMMs, and Variational Inference.

1. Variational Auto-Encoder (VAE): A VAE is a generative neural network with two parts:

  • Encoder: Compresses input data (e.g., an image x) into a lower-dimensional "latent vector" z. Instead of a single point, it predicts a probability distribution (usually a Gaussian with mean $\mu$ and variance $\sigma^2$).
  • Decoder: Takes a sampled vector z and tries to reconstruct the original image x.
  • Objective: It minimizes the "reconstruction error" (how badly x was restored) plus the "KL divergence" (how much the learned distribution deviates from a standard Gaussian prior). A minimal sketch of this objective follows below.
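
To make this objective concrete, here is a minimal, self-contained sketch of a VAE forward pass and loss in PyTorch; it is illustrative only (layer sizes, the MSE reconstruction term, and all names are assumptions, not the paper's code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVAE(nn.Module):
    """Minimal VAE: the encoder predicts a Gaussian (mu, log-variance), the decoder reconstructs x."""
    def __init__(self, x_dim=784, z_dim=10):
        super().__init__()
        self.enc = nn.Linear(x_dim, 2 * z_dim)   # outputs [mu, log_var]
        self.dec = nn.Linear(z_dim, x_dim)

    def forward(self, x):
        mu, log_var = self.enc(x).chunk(2, dim=-1)
        z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)   # reparameterization trick
        return self.dec(z), mu, log_var

def vae_loss(x, x_rec, mu, log_var):
    recon = F.mse_loss(x_rec, x, reduction="sum")                  # reconstruction error
    # KL( N(mu, diag(sigma^2)) || N(0, I) ), closed form for diagonal Gaussians
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return recon + kl

x = torch.rand(32, 784)                     # dummy batch of flattened images
model = TinyVAE()
x_rec, mu, log_var = model(x)
vae_loss(x, x_rec, mu, log_var).backward()
```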

2. Dirichlet Process (DP) & Stick-Breaking Process: How do we define "infinite" clusters mathematically?

  • Dirichlet Process (DP): Imagine a distribution over distributions. It allows a model to have a potentially infinite number of components (clusters), but only a finite number are used to explain the finite data observed so far.

  • Stick-Breaking Process: This is a constructive way to visualize the DP. Imagine a stick of length 1.

    1. Break off a fraction $\beta_1$ of the stick. This is the weight (probability) of Cluster 1.
    2. Take the remaining stick (length $1-\beta_1$) and break off a fraction $\beta_2$ of that. This is the weight of Cluster 2.
    3. Repeat forever.
    • The result is an infinite sequence of weights $\pi_k$ that sum to 1. This allows the model to always have "room" for a new cluster further down the stick (see the short sampling sketch after Figure 1).

      The following figure (Figure 1 from the original paper) illustrates this process:

  • (a) shows the stick being broken sequentially to determine the weights $\pi_k$.

  • (b) shows a histogram of data sampled from a DP, where bars represent the weights of different components.

    Figure 1: (a) Stick-breaking process. (b) Histogram of samples from $\mathrm{DP}(\alpha = 5, H = \mathcal{N}(0, 1))$.
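
A minimal numpy sketch of the (truncated) stick-breaking construction, using the same concentration α = 5 as in Figure 1; the function name and truncation level are illustrative assumptions:

```python
import numpy as np

def stick_breaking(alpha, truncation=50, seed=0):
    """Sample mixture weights pi_k by repeatedly breaking off pieces of a unit stick."""
    rng = np.random.default_rng(seed)
    betas = rng.beta(1.0, alpha, size=truncation)                  # beta_k ~ Beta(1, alpha)
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - betas)[:-1]])
    return betas * remaining                                       # pi_k = beta_k * prod_{i<k} (1 - beta_i)

weights = stick_breaking(alpha=5.0)
print(weights[:5], weights.sum())   # weights decay and sum to (nearly) 1 under truncation
```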

3. Variational Inference (VI): In Bayesian statistics, calculating the exact "posterior" distribution (the updated probability of clusters after seeing data) is often impossible because it requires integrating over all possible configurations.

  • VI turns this into an optimization problem. It proposes a simpler "variational distribution" q (usually from a known family like the Gaussian) and tries to adjust the parameters of q to make it look as close as possible to the true posterior p.

  • ELBO (Evidence Lower Bound): The metric we maximize in VI. Maximizing the ELBO is equivalent to minimizing the difference (KL divergence) between q and p, as the identity below makes explicit.
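
For reference, the standard decomposition behind this equivalence (a textbook identity, not something specific to this paper) is:

\log p(x) = \underbrace{\mathbb{E}_{q(z)}\left[ \log \frac{p(x, z)}{q(z)} \right]}_{\text{ELBO}} + \mathrm{KL}\big( q(z) \,\|\, p(z \mid x) \big)

Since the KL term is non-negative, pushing the ELBO up with respect to q simultaneously tightens the bound on \log p(x) and shrinks the gap between q and the true posterior.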

3.2. Previous Works

  • Parametric Methods (Fixed K):
    • DEC (Deep Embedded Clustering): Uses K-means concepts to iteratively refine clusters.
    • GMVAE (Gaussian Mixture VAE): Uses a Mixture of Gaussians as a prior. It is powerful but requires fixing K (e.g., K = 10). If the data has 11 classes, GMVAE fails.
  • Nonparametric Methods (Dynamic K):
    • SB-VAE (Stick-Breaking VAE): Replaces the Gaussian prior with a stick-breaking process. However, it often treats latent dimensions as features rather than clusters, lacking geometric shape information.
    • DeepDPM: Uses split/merge moves but relies on neural network splits which can be computationally heavy and data-inefficient.

3.3. Differentiation Analysis

DIVA distinguishes itself by:

  1. Hybrid Approach: It essentially embeds a full Bayesian Nonparametric clustering algorithm (DPMM) inside the latent space of a deep VAE.

  2. Explicit Cluster Management: Unlike SB-VAE which implicitly handles clusters, DIVA explicitly tracks cluster parameters (mean, covariance) and uses heuristic "birth" and "merge" moves (borrowed from the Memoized Online Variational Bayes algorithm) to actively manage the model complexity. This makes it more robust to incremental data.


4. Methodology

4.1. Principles

The core idea of DIVA is alternating optimization. The framework consists of two major modules that are updated in turns:

  1. The VAE Module: Learns to map complex data (images) into a simple latent space Z.

  2. The DPMM Module: Clusters the points in the latent space Z using an infinite mixture model.

    The intuition is that the VAE handles the complexity of high-dimensional perception (seeing the image), while the DPMM handles the logic of categorization (grouping the concepts), and the DPMM's flexibility allows the number of categories to grow.

The following figure (Figure 3 from the original paper) provides an overview of the architecture. The left side shows the VAE (Encoder/Decoder), and the right side shows the DPMM modeling the latent space with infinite components.

Figure 3: Overview of DIVA. The DPMM and the VAE are optimized alternately. When updating the DPMM, we use the current latent samples z obtained from the VAE. When updating the VAE, we assign the outputs of the encoder to the clusters of the DPMM and minimize $\mathcal{L}_{KL}$ with respect to the assigned cluster. (Schematic: the left side shows the encoder, latent space, and decoder; the right side shows the DPMM prior space as a collection of Gaussians $\mathcal{N}(\mu_1, \Sigma_1), \mathcal{N}(\mu_2, \Sigma_2), \ldots, \mathcal{N}(\mu_k, \Sigma_k)$, where $k \to \infty$ indicates that the number of clusters is not fixed.)

4.2. Core Methodology In-depth (Layer by Layer)

The algorithm alternates between Updating the DPMM and Updating the VAE. Let's break down the mathematical model and the steps.

4.2.1. The Generative Model (DPMM)

The paper assumes the data is generated by a specific probabilistic process. This is the "worldview" of the model.

  1. Stick-Breaking Prior: The mixing proportions (cluster weights) $\pi$ are generated by breaking a stick.

    • $\beta_k \sim \mathcal{B}(1, \alpha)$ (sample a fraction from a Beta distribution).
    • $\pi_k = \beta_k \prod_{i=1}^{k-1} (1 - \beta_i)$ (the weight of cluster k is the current fraction times what is left of the stick).
  2. Cluster Parameters: Each cluster k has a mean $\mu_k$ and covariance $\Sigma_k$. These are drawn from a base distribution H (Normal-Wishart).

  3. Assignment: For each data point i, a cluster assignment $v_i$ is drawn from the categorical distribution defined by $\pi$: $v_i \sim \mathrm{Cat}(\pi)$.

  4. Latent Variable: The latent vector $z_i$ is generated from the chosen cluster's Gaussian: $z_i \sim \mathcal{N}(\mu_{v_i}, \Sigma_{v_i})$.

  5. Observation: The actual data $x_i$ is generated by the decoder $f_\theta$ from $z_i$: $x_i \sim f_\theta(z_i)$. (A small sampling sketch of this process follows the symbol list below.)

    The joint probability of the entire system (DPMM part) is given by:

p(z,v,β,μ,Σ)=i=1NN(ziμvn,Σvn)Cat(vnπ(β))k=1B(βk1,α)k=1N(μkμ0,(λΣk)1)N(ΣkW,ν) p ( z , \pmb { v } , \beta , \pmb { \mu } , \Sigma ) = \prod _ { i = 1 } ^ { N } \mathcal { N } ( z _ { i } | \mu _ { v _ { n } } , \Sigma _ { v _ { n } } ) \mathsf { C a t } ( v _ { n } | \pmb { \pi } ( \beta ) ) \prod _ { k = 1 } ^ { \infty } \mathcal { B } ( \beta _ { k } | 1 , \alpha ) \prod _ { k = 1 } ^ { \infty } \mathcal { N } ( \mu _ { k } | \mu _ { 0 } , ( \lambda \Sigma _ { k } ) ^ { - 1 } ) \mathcal { N } ( \Sigma _ { k } | \mathbf { W } , \nu )

Symbol Explanation:

  • $z_i$: Latent vector for data point i.
  • $v_i$: Cluster assignment for data point i.
  • $\beta$: Stick-breaking variables.
  • $\mu_k, \Sigma_k$: Mean and covariance of cluster k.
  • $\mathcal{N}$: Normal (Gaussian) distribution.
  • $\mathrm{Cat}$: Categorical distribution.
  • $\mathcal{B}$: Beta distribution.
  • $\mu_0, \lambda, \mathbf{W}, \nu$: Hyperparameters of the Normal-Wishart prior (controlling the shape of clusters).
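
To make the generative story concrete, here is a small numpy sketch that samples from a truncated version of this process; the isotropic cluster covariances and the random linear map standing in for the decoder f_θ are simplifying assumptions, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, D, K_trunc, N = 1.0, 2, 20, 500

# 1. Stick-breaking weights pi_k (truncated at K_trunc components)
betas = rng.beta(1.0, alpha, size=K_trunc)
pis = betas * np.concatenate([[1.0], np.cumprod(1.0 - betas)[:-1]])
pis /= pis.sum()                                   # renormalize under truncation

# 2. Cluster parameters (isotropic Gaussians instead of a full Normal-Wishart draw)
mus = rng.normal(0.0, 3.0, size=(K_trunc, D))
sigmas = np.full(K_trunc, 0.5)

# 3.-4. Assignments v_i and latent vectors z_i
v = rng.choice(K_trunc, size=N, p=pis)
z = mus[v] + sigmas[v, None] * rng.standard_normal((N, D))

# 5. Observations x_i = f_theta(z_i); a random linear "decoder" purely for illustration
W_dec = rng.standard_normal((D, 8))
x = z @ W_dec
print(x.shape, len(np.unique(v)), "clusters actually used")
```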

4.2.2. Step 1: Updating the DPMM (Variational Inference)

In this step, the VAE is frozen. We use the latent vectors z generated by the VAE as fixed data points to train the DPMM. Since the true posterior is intractable, we approximate it using a variational distribution q.

The paper uses a mean-field variational distribution q, which assumes independence between factors:

q(v,β,μ,Σ)=n=1NCat(vnr^n1,,,r^nk)k=1KB(βkα^k1,α^k0)k=1KN(μkμ^0,(λ^Σk)1)N(ΣkN^,ν^) q ( \boldsymbol { v } , \boldsymbol { \beta } , \boldsymbol { \mu } , \Sigma ) = \prod _ { n = 1 } ^ { N } \mathrm { C a t } ( v _ { n } | \hat { r } _ { n _ { 1 } } , \cdot \cdot , \cdot , \hat { r } _ { n _ { k } } ) \prod _ { k = 1 } ^ { K } \mathcal { B } ( \beta _ { k } | \hat { \alpha } _ { k _ { 1 } } , \hat { \alpha } _ { k _ { 0 } } ) \prod _ { k = 1 } ^ { K } \mathcal { N } ( \mu _ { k } | \hat { \mu } _ { 0 } , ( \hat { \lambda } \Sigma _ { k } ) ^ { - 1 } ) \mathcal { N } ( \Sigma _ { k } | \hat { \mathcal { N } } , \boldsymbol { \hat { \nu } } )

Symbol Explanation:

  • $\hat{r}_{nk}$: The probability that data point n belongs to cluster k.
  • $\hat{\alpha}_{k_1}, \hat{\alpha}_{k_0}$: Parameters of the Beta distribution for stick-breaking.
  • The other hatted variables ($\hat{\mu}$, $\hat{\lambda}$, etc.) are the learnable variational parameters.

Optimization (ELBO): The goal is to maximize the Evidence Lower Bound (ELBO). The paper employs Memoized Online Variational Bayes (memoVB). This involves iterative updates:

  1. Local Update: Update the responsibilities $\hat{r}_{nk}$ (cluster assignments) for the current batch of data.
  2. Global Update: Update the global cluster parameters ($\hat{\mu}$, $\hat{\Sigma}$, etc.) based on the aggregated statistics from the batches. (A toy computation of the local step is sketched below.)
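
As a rough illustration of the local step, the responsibilities can be computed from the current mixture weights and cluster Gaussians. This is a simplified sketch: memoVB additionally caches per-batch sufficient statistics between passes, which is omitted here.

```python
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(z, weights, mus, covs):
    """r[n, k] proportional to pi_k * N(z_n | mu_k, Sigma_k), normalized over k (soft assignments)."""
    log_r = np.stack([
        np.log(w) + multivariate_normal(mean=m, cov=c).logpdf(z)
        for w, m, c in zip(weights, mus, covs)
    ], axis=1)
    log_r -= log_r.max(axis=1, keepdims=True)       # numerical stability
    r = np.exp(log_r)
    return r / r.sum(axis=1, keepdims=True)
```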

Birth and Merge Moves: To handle the dynamic number of clusters, the algorithm performs heuristic moves during the DPMM update:

  • Birth: It identifies data points that are poorly explained by the current clusters (low probability) and proposes a new cluster initialized with these points. If the ELBO increases with this new cluster, it is kept.
  • Merge: It considers pairs of clusters. If merging them into one cluster increases the ELBO (explains the data just as well but with less complexity), they are merged. (A simplified merge proposal is sketched below.)
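
Below is a rough, self-contained sketch of a merge proposal under simplifying assumptions: the candidate pair is merged by moment matching, and the acceptance test uses the mixture log-likelihood of the latent codes as a structural placeholder for the ELBO comparison described above (the real ELBO also contains prior/complexity terms that can favor the simpler merged model, which a plain likelihood check cannot reflect).

```python
import numpy as np
from scipy.stats import multivariate_normal

def mixture_loglik(z, weights, mus, covs):
    dens = np.stack([w * multivariate_normal(m, c).pdf(z)
                     for w, m, c in zip(weights, mus, covs)], axis=1)
    return np.log(dens.sum(axis=1)).sum()

def propose_merge(z, weights, mus, covs, a, b):
    """Merge clusters a and b by moment matching; accept only if the (proxy) score does not drop."""
    w = weights[a] + weights[b]
    mu = (weights[a] * mus[a] + weights[b] * mus[b]) / w
    # pooled covariance, including the spread between the two original means
    cov = (weights[a] * (covs[a] + np.outer(mus[a] - mu, mus[a] - mu)) +
           weights[b] * (covs[b] + np.outer(mus[b] - mu, mus[b] - mu))) / w
    keep = [i for i in range(len(weights)) if i not in (a, b)]
    new_w = np.append(weights[keep], w)
    new_mu = np.vstack([mus[keep], mu])
    new_cov = np.vstack([covs[keep], cov[None]])
    if mixture_loglik(z, new_w, new_mu, new_cov) >= mixture_loglik(z, weights, mus, covs):
        return new_w, new_mu, new_cov      # merge accepted
    return weights, mus, covs              # merge rejected
```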

4.2.3. Step 2: Updating the VAE

In this step, the DPMM is frozen. We update the VAE (Encoder/Decoder) parameters to minimize the VAE loss.

The Loss Function: The standard VAE loss is $\mathcal{L} = \mathcal{L}_{recon} + \mathcal{L}_{KL}$.

  1. Reconstruction Loss ($\mathcal{L}_{recon}$): Standard mean squared error between the input x and the reconstruction $x^*$.

  2. KL Divergence Loss ($\mathcal{L}_{KL}$): This is where DIVA differs from a standard VAE. Instead of measuring the distance to a single standard Gaussian $\mathcal{N}(0, I)$, we measure the distance to the specific cluster component assigned to the data point.

    However, we do not know for sure which cluster k a point i belongs to. We use the soft assignment probability $p_{ik}$ (calculated by the DPMM in the previous step).

    The weighted KL divergence for a single data point i is:

\mathcal{L}_{\mathrm{KL}_i} = \sum_{k=1}^{K} p_{ik}\, \mathcal{L}_{\mathrm{KL}_{ik}}

where $\mathcal{L}_{\mathrm{KL}_{ik}}$ is the KL divergence between the encoder's output distribution for $x_i$ and the specific Gaussian distribution of cluster k (with mean $\mu_k$ and covariance $\Sigma_k$ from the DPMM).

The specific formula for the KL divergence between the encoder output $\mathcal{N}(\mu(x_i), \Sigma(x_i))$ and the cluster component $\mathcal{N}(\mu_k, \Sigma_k)$ is (a direct numpy implementation is sketched after the symbol list below):

\mathcal{L}_{\mathrm{KL}_{ik}} = \frac{1}{2} \left[ \log \frac{|\Sigma_k|}{|\Sigma(x_i; \phi)|} - D + \mathrm{tr}\big\{ \Sigma_k^{-1} \Sigma(x_i; \phi) \big\} + \big(\mu_k - \mu(x_i; \phi)\big)^{T} \Sigma_k^{-1} \big(\mu_k - \mu(x_i; \phi)\big) \right]

Symbol Explanation:

  • D: Dimensionality of the latent space.
  • $\Sigma_k, \mu_k$: Covariance and mean of the k-th DPMM cluster.
  • $\Sigma(x_i; \phi), \mu(x_i; \phi)$: Covariance and mean output by the encoder for input $x_i$.
  • $\mathrm{tr}$: Trace of a matrix (sum of diagonal elements).
  • $|\cdot|$: Determinant of a matrix.
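
The same closed-form expression written directly in numpy (a minimal sketch; here the encoder covariance is taken to be diagonal, as is typical, while the cluster covariance may be full):

```python
import numpy as np

def kl_encoder_to_cluster(mu_x, var_x, mu_k, cov_k):
    """KL( N(mu_x, diag(var_x)) || N(mu_k, cov_k) ), matching the formula above."""
    D = mu_x.shape[0]
    cov_x = np.diag(var_x)
    cov_k_inv = np.linalg.inv(cov_k)
    diff = mu_k - mu_x
    return 0.5 * (np.log(np.linalg.det(cov_k) / np.linalg.det(cov_x))
                  - D
                  + np.trace(cov_k_inv @ cov_x)
                  + diff @ cov_k_inv @ diff)

# Sanity check: the KL divergence of a distribution to itself is zero.
print(kl_encoder_to_cluster(np.zeros(3), np.ones(3), np.zeros(3), np.eye(3)))   # -> 0.0
```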

4.2.4. Algorithm Summary

The complete workflow is summarized in Algorithm 1 of the paper (a runnable schematic of this loop follows the steps below):

  1. Initialize the VAE and the DPMM (with K = 1).

  2. Loop:

    • Sample a mini-batch of data x.
    • VAE Step: Pass x through the VAE to get z. Use the current DPMM to get assignment probabilities $p_{ik}$. Compute $\mathcal{L}_{KL}$ and $\mathcal{L}_{recon}$. Update the VAE parameters.
    • DPMM Step: (Every few epochs) Use the updated latent vectors z. Run variational inference to update the cluster parameters. Perform birth and merge moves to adjust K.
  3. Repeat until convergence.
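
The following runnable sketch ties the two steps together. It is not the authors' implementation: a tiny PyTorch VAE stands in for the real encoder/decoder, scikit-learn's BayesianGaussianMixture with a Dirichlet-process prior stands in for the memoVB DPMM (it truncates the number of components and has no explicit birth/merge moves), and the KL term is replaced by a responsibility-weighted squared-distance surrogate to keep the code short.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from sklearn.mixture import BayesianGaussianMixture

class TinyVAE(nn.Module):
    def __init__(self, x_dim=50, z_dim=5):
        super().__init__()
        self.enc = nn.Linear(x_dim, 2 * z_dim)    # predicts [mu, log_var]
        self.dec = nn.Linear(z_dim, x_dim)

    def forward(self, x):
        mu, log_var = self.enc(x).chunk(2, dim=-1)
        z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)   # reparameterization
        return self.dec(z), z, mu, log_var

x_data = torch.rand(256, 50)                      # dummy dataset (e.g., pre-extracted features)
vae = TinyVAE()
opt = torch.optim.Adam(vae.parameters(), lr=1e-3)
dpmm = BayesianGaussianMixture(n_components=20,   # truncation level, not a fixed K
                               weight_concentration_prior_type="dirichlet_process")

for epoch in range(10):
    # --- VAE step: reconstruction + a KL-like pull toward the assigned DPMM clusters ---
    x_rec, z, mu, log_var = vae(x_data)
    loss = F.mse_loss(x_rec, x_data, reduction="sum")
    if epoch > 0:                                 # the DPMM exists after its first fit
        resp = torch.tensor(dpmm.predict_proba(mu.detach().numpy()), dtype=torch.float32)
        centers = torch.tensor(dpmm.means_, dtype=torch.float32)
        # surrogate for L_KL: responsibility-weighted squared distance to cluster means
        dists = ((mu.unsqueeze(1) - centers.unsqueeze(0)) ** 2).sum(-1)
        loss = loss + (resp * dists).sum()
    opt.zero_grad(); loss.backward(); opt.step()

    # --- DPMM step: refit the mixture on the current latent codes ---
    dpmm.fit(z.detach().numpy())
```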


5. Experimental Setup

5.1. Datasets

The authors evaluated DIVA on a comprehensive set of 8 datasets, ranging from simple to complex.

  • MNIST: Handwritten digits (0-9). 60k training images.

  • Fashion-MNIST: Images of clothing (10 classes). More complex than digits.

  • Reuters10k: Text dataset of news articles (10 categories).

  • HHAR: Human Activity Recognition from sensors.

  • STL-10: Real-world images (10 classes). Higher resolution.

  • ImageNet-50: A subset of ImageNet with 50 classes. Very complex.

  • CIFAR-10: 10 classes of objects (cars, birds, etc.).

  • SVHN: Street View House Numbers.

    Note regarding STL-10 and ImageNet: Due to the difficulty of generating high-res images from scratch, the authors used pre-trained feature extractors (ResNet-50 and MoCo) to get feature vectors first, then trained DIVA on those vectors.

5.2. Evaluation Metrics

1. Unsupervised Clustering Accuracy (ACC)

  • Concept: Since the model doesn't know the true labels (e.g., it groups all "7"s together but calls them "Cluster 5"), we need to map the learned clusters to the ground truth labels to maximize accuracy.
  • Formula: $ \mathrm{ACC} = \max_{f} \frac{\sum_{i=1}^{N} \mathbf{1}\{ l_i = f(v_i) \}}{N} $
  • Symbol Explanation:
    • N: Total number of samples.
    • $l_i$: True label of sample i.
    • $v_i$: Predicted cluster assignment for sample i.
    • f: A mapping function (permutation) that matches predicted clusters to true labels, usually found via the Hungarian algorithm (a short implementation is sketched below).
    • $\mathbf{1}\{\cdot\}$: Indicator function (returns 1 if true, 0 otherwise).
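
The mapping f is typically found with the Hungarian algorithm; a compact implementation using scipy (standard practice for this metric, not code from the paper):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(true_labels, cluster_ids):
    """Best-match accuracy: find the cluster-to-label assignment that maximizes agreement."""
    true_labels, cluster_ids = np.asarray(true_labels), np.asarray(cluster_ids)
    cost = np.zeros((cluster_ids.max() + 1, true_labels.max() + 1), dtype=int)
    for c, t in zip(cluster_ids, true_labels):
        cost[c, t] += 1                            # contingency counts
    rows, cols = linear_sum_assignment(-cost)      # negate to maximize matches
    return cost[rows, cols].sum() / len(true_labels)

print(clustering_accuracy([0, 0, 1, 1, 2], [2, 2, 0, 0, 1]))   # -> 1.0
```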

2. Normalized Mutual Information (NMI)

  • Concept: Measures the shared information between the predicted clustering and the ground truth labels, normalized to be between 0 and 1. It is independent of the absolute number of clusters.
  • Formula: (Standard definition, implied in paper) $ \text{NMI}(Y, C) = \frac{2 \cdot I(Y; C)}{H(Y) + H(C)} $
  • Symbol Explanation:
    • Y: Ground truth labels.
    • C: Predicted clusters.
    • I(Y; C): Mutual information between Y and C.
    • $H(\cdot)$: Entropy.

3. Adjusted Rand Index (ARI)

  • Concept: Measures the similarity between two clusterings by counting pairs of samples that are assigned to the same or different clusters in both the predicted and true partitions, adjusted for chance. (A short usage snippet for NMI and ARI follows below.)
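
Both NMI and ARI are available off the shelf in scikit-learn, which is how they are typically computed in practice (a usage sketch, not tied to the paper's code):

```python
from sklearn.metrics import normalized_mutual_info_score, adjusted_rand_score

true_labels = [0, 0, 1, 1, 2, 2]
cluster_ids = [1, 1, 0, 0, 2, 2]     # same grouping, different cluster names

print(normalized_mutual_info_score(true_labels, cluster_ids))  # 1.0: identical partitions
print(adjusted_rand_score(true_labels, cluster_ids))           # 1.0: identical partitions
```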

5.3. Baselines

The authors compared DIVA against 8 state-of-the-art methods:

  • Parametric (Fixed K): GMM, DEC, GMVAE. These require knowing that K is 10 (for MNIST).
  • Nonparametric (Dynamic):
    • DPMM+memoVB: Traditional DPMM without deep learning (applied to raw data).

    • VSB-DVM: A variational stick-breaking method.

    • DDPM & DeepDPM: Recent deep clustering methods. DeepDPM is a strong competitor that splits clusters using neural networks.


6. Results & Analysis

6.1. Core Results Analysis (Static Datasets)

The primary experiment tested how well DIVA clusters standard datasets compared to baselines.

Key Finding: DIVA outperformed both parametric and nonparametric baselines on almost all datasets. Notably, it achieved 0.94 ACC on MNIST and 0.69 ACC on ImageNet-50, which is significant for unsupervised clustering.

The following table (Table 1 from the original paper) details these results.

| Frameworks | MNIST | Fashion-MNIST | Reuters10k (imb) | HHAR | STL-10 | ImageNet-50 |
| --- | --- | --- | --- | --- | --- | --- |
| GMM | .60 ± .01 | .49 ± .02 | .73 ± .06 | .43 ± .00 | .58 ± .03 | .60 ± .01 |
| DEC† | .84 ± .00‡ | .60 ± .04 | .72 ± .00‡ | .79 ± .01 | .80 ± .01 | .63 ± .01 |
| GMVAE† | .82 ± .04 | .61 ± .01 | .73 ± .08 | .65 ± .03 | .79 ± .04 | .62 ± .02 |
| DPMM+memoVB | .63 ± .02 | .57 ± .01 | .56 ± .05 | .68 ± .04 | .64 ± .05 | .57 ± .00 |
| VSB-DVM† | .86 ± .01 | .64 ± .03 | .60 ± .03 | .66 ± .06 | .52 ± .03 | .49 ± .02 |
| DDPM† | .91 ± .01 | .63 ± .02 | .71 ± .02 | .74 ± .01 | .72 ± .01 | .63 ± .02 |
| DeepDPM† | .93 ± .02 | .63 ± .01 | .83 ± .01 | .79 ± .02 | .81 ± .02 | .66 ± .01 |
| DIVA (Ours) | .94 ± .01 | .72 ± .01 | .83 ± .01 | .83 ± .01 | .88 ± .01 | .69 ± .02 |

Analysis:

  • vs. GMVAE: GMVAE requires a fixed K. Even with the correct K, DIVA outperforms it (e.g., 0.94 vs 0.82 on MNIST). This suggests the DPMM prior is better at capturing the latent structure than a static GMM.
  • vs. DeepDPM: DeepDPM is the closest competitor (0.93 on MNIST), but DIVA shows a larger margin on more complex image datasets like Fashion-MNIST (0.72 vs 0.63) and STL-10 (0.88 vs 0.81).

Latent Space Visualization: The paper provides t-SNE plots (Figure 4 below) showing that DIVA creates very distinct, well-separated clusters, whereas other methods like SB-VAE show overlapping or "sticky" clusters.

Figure 4: t-SNE projection on full static MNIST, colored by the ground truth. It is clear that DIVA learns a clustered latent representation with high distinction, while GMVAE with an improperly defined cluster number (Fig. 4f, 4g) cannot learn a distinct clustering representation. Notably, the advantages of our framework are more apparent when handling dynamically changing features. (The figure compares t-SNE projections on MNIST for DIVA, SB-VAE, VSB-DVM, DeepDPM and GMVAE; DIVA shows clearly separated clusters, while the other methods show visible deficiencies in their clustering.)

6.2. Incremental Representation Learning

This is the most innovative experiment. The authors simulated a scenario where the number of classes in the training data increases over time.

Setup:

  1. Train on digits {0, 1, 2} (3 features).
  2. Then train on {0, 1, 2, 3, 4} (5 features).
  3. Then {0...6} (7 features), and finally all 10 digits.

Results: Parametric models (GMVAE) fail completely if initialized with K = 3. Their accuracy drops because they force 5 digits into 3 clusters. DIVA, however, maintains high accuracy. As shown in the figure below (Figure 5 from the paper), DIVA (yellow bars) consistently scores high, while the others collapse as the feature count grows.

Figure 5: Clustering accuracy with incremental features for MNIST (left), Fashion-MNIST (middle) and STL-10 (right); mean ± std. dev. of 5 runs. We evaluate our framework and the parametric baselines on test datasets with an incremental number of features, e.g., for MNIST the x-axis entry "3 features" means the dataset contains 3 types of digits to be classified, namely 0, 1, 2.

Dynamic Adaptation Process: Figure 6 (below) tracks the training process over epochs.

  • Solid lines: Accuracy.

  • Dashed line (DIVA): Number of clusters (K). We see that when new data is introduced (at epochs 30, 60, and 90), DIVA's cluster count (dashed line) jumps up. The accuracy temporarily dips as the model adjusts, then recovers to a high level. DeepDPM (purple) fails to adapt in this zero-shot manner because its neural network splitter requires extensive retraining.

    Figure 6: (a) Zero-shot adaptation performance on MNIST. We introduce dynamic features in training, where the model is initially trained on 3 digits, and the number of digits is increased to 5, 7, and 10 at epochs 30, 60, and 90. We record the test accuracy (solid lines) and the number of clusters for DIVA (dashed line). (b) t-SNE projection of DIVA on MNIST at the last epoch, colored by clusters. Each coarse label is learned by 2 or 3 clusters, enabling the sub-features of individual digits to be captured. "pre": raw data pre-extracted to reduce the training burden. (The plot shows test accuracy and cluster count over training epochs; DIVA, in yellow, keeps improving in accuracy across the increments while its cluster count adapts, reflecting its dynamic adaptation ability relative to baselines such as GMVAE and DeepDPM.)

6.3. Generative Performance & Sub-Features

Can the clusters generate realistic images? The authors sampled from the learned DPMM clusters. Interestingly, DIVA often learned more clusters than the ground truth (e.g., more than 10 for MNIST). Why? The clusters captured "sub-features" or styles. For example, one cluster might capture "upright 0s" while another captures "slanted 0s".

Figure 7 (below) shows these reconstructions. In panel (a)-(c), different clusters generate variations of the digit '0'. This proves the model learns a hierarchical or fine-grained representation of the data.

Figure 7: Reconstruction images from DPMM-learned clusters on MNIST (a)-(c), (j)-(l); Fashion-MNIST (d)-(f), (m)-(o); CIFAR-10 (g)-(i) and SVHN (p)-(r). Each subfigure with 16 plots stems from one cluster. It is worth noting that our proposed framework DIVA can efficiently extract informative sub-features from coarse labels. More results are given in appendix Sec. A.4.4. (The figure shows cluster-wise reconstructions across MNIST, Fashion-MNIST, CIFAR-10 and SVHN; each subfigure contains 16 images from a single cluster, illustrating the framework's ability to extract informative sub-features.)


7. Conclusion & Reflections

7.1. Conclusion Summary

The paper presents DIVA, a robust framework for deep clustering that removes the need for a pre-defined number of clusters. By integrating a nonparametric Dirichlet Process Mixture Model into the VAE latent space and utilizing "birth/merge" variational updates, DIVA achieves:

  1. State-of-the-art performance on standard clustering benchmarks.
  2. True dynamic adaptation, successfully handling datasets where the number of classes grows over time.
  3. Rich representation learning, capturing fine-grained sub-features (styles) within broader categories.

7.2. Limitations & Future Work

  • Computational Complexity: The "memoized online variational inference" involves calculating sufficient statistics and performing birth/merge checks, which adds computational overhead compared to a simple standard VAE.
  • Pre-extracted Features: For high-resolution images (ImageNet), the method relies on pre-trained extractors (ResNet). Applying DIVA end-to-end on raw high-res pixels remains a challenge, likely due to the difficulty of training stable reconstructions simultaneously with complex probabilistic priors.
  • Future Directions: The authors suggest applications in Continuous Learning (to prevent catastrophic forgetting) and robotics, where agents must learn novel objects in open-ended environments.

7.3. Personal Insights & Critique

  • Innovation: The combination of classical Bayesian Nonparametrics (DPMM) with modern Deep Learning (VAE) is elegant. While "stick-breaking VAEs" existed, explicitly using the birth/merge heuristics from the MemoVB algorithm effectively bridges the gap between static deep learning and dynamic Bayesian inference.
  • Practicality: The ability to handle "unknown K" is a massive practical advantage. In most real-world data mining tasks, we simply do not know how many clusters exist. DIVA offers a solution that doesn't require running K-means with K = 2, …, 100 to find the elbow point.
  • Critique: The paper shows DIVA creates more clusters than the ground truth (e.g., splitting '0' into two styles). While this is touted as a feature ("sub-features"), in some strict classification tasks, this might be a bug if a 1-to-1 mapping is desired. Users might need a post-processing step to merge these sub-clusters if they strictly want K coarse categories.
