DIVA: A Dirichlet Process Mixtures Based Incremental Deep Clustering Algorithm via Variational Auto-Encoder
TL;DR Summary
DIVA is a nonparametric incremental deep clustering framework that combines a Dirichlet Process Mixture Model with a Variational Auto-Encoder, allowing dynamic clustering adjustments without needing a predefined number of clusters, outperforming existing models on dynamic data.
Abstract
Generative model-based deep clustering frameworks excel in classifying complex data, but are limited in handling dynamic and complex features because they require prior knowledge of the number of clusters. In this paper, we propose a nonparametric deep clustering framework that employs an infinite mixture of Gaussians as a prior. Our framework utilizes a memoized online variational inference method that enables the "birth" and "merge" moves of clusters, allowing our framework to cluster data in a "dynamic-adaptive" manner, without requiring prior knowledge of the number of features. We name the framework as DIVA, a Dirichlet Process-based Incremental deep clustering framework via Variational Auto-Encoder. Our framework, which outperforms state-of-the-art baselines, exhibits superior performance in classifying complex data with dynamically changing features, particularly in the case of incremental features. We released our source code implementation at: https://github.com/Ghiara/diva
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
DIVA: A Dirichlet Process Mixtures Based Incremental Deep Clustering Algorithm via Variational Auto-Encoder
1.2. Authors
- Zhenshan Bing: Technical University of Munich (TUM)
- Yuan Meng: Technical University of Munich (TUM)
- Yuqi Yun: Technical University of Munich (TUM)
- Hang Su: Politecnico di Milano
- Xiaojie Su: Chongqing University
- Kai Huang: Sun Yat-sen University
- Alois Knoll: Technical University of Munich (TUM)
1.3. Journal/Conference
This paper was published on arXiv on May 23, 2023. The Technical University of Munich (TUM) is a world-renowned research university, particularly strong in engineering and computer science.
1.4. Publication Year
2023
1.5. Abstract
Generative deep clustering models are powerful but typically suffer from a major limitation: they require the number of clusters (categories) to be defined in advance. This makes them unsuitable for dynamic environments where new features or classes appear over time. This paper introduces DIVA (Dirichlet Process-based Incremental deep clustering framework via Variational Auto-Encoder). It combines a Variational Auto-Encoder (VAE) with a Dirichlet Process Mixture Model (DPMM) prior. Using a "memoized online variational inference" method, DIVA can dynamically "birth" (create) and "merge" (combine) clusters during training. This allows the model to adaptively determine the number of clusters without prior knowledge, outperforming state-of-the-art baselines on complex and incremental datasets.
1.6. Original Source Link
- Source: arXiv:2305.14067
- Status: Preprint / Published online.
2. Executive Summary
2.1. Background & Motivation
The Core Problem: In unsupervised learning, specifically clustering, most deep learning algorithms (such as Deep Embedded Clustering or VAE-based clustering) require the user to specify the number of clusters, denoted as $K$, before training begins. For example, if you want to cluster handwritten digits, you must tell the model "there are 10 clusters."
Why this matters: In real-world applications (e.g., a robot exploring a new environment or a news aggregator collecting stories), data is dynamic. New objects or topics appear over time. If a model is fixed to $K = 10$ but the data introduces an 11th category, the model will fail to recognize it, forcing the new data into an existing, incorrect cluster.
The Gap: While Bayesian nonparametric methods (which can theoretically handle an unbounded number of clusters) exist, integrating them with deep neural networks is difficult. Previous attempts (such as stick-breaking VAEs) often struggled to capture complex cluster shapes or were difficult to train effectively due to optimization challenges.
2.2. Main Contributions / Findings
- Nonparametric Integration: The authors propose DIVA, which successfully integrates a Dirichlet Process Mixture Model (DPMM) as a prior into the latent space of a Variational Auto-Encoder (VAE). This removes the constraint of pre-defining $K$.
- Dynamic Adaptation Mechanism: They employ Memoized Online Variational Inference (memoVB). This is a specific training algorithm that allows the model to mathematically decide when to add a new cluster ("birth") or combine two existing ones ("merge") based on the data it observes.
- Incremental Learning Success: The experiments demonstrate that DIVA can handle "incremental features." For instance, if a dataset starts with 3 types of images and grows to 10, DIVA dynamically adjusts its internal structure to accommodate the new types without catastrophic forgetting, achieving superior accuracy compared to parametric baselines.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand DIVA, one must grasp three pillars: VAEs, DPMMs, and Variational Inference.
1. Variational Auto-Encoder (VAE) A VAE is a generative neural network with two parts:
- Encoder: Compresses input data (e.g., an image $x$) into a lower-dimensional "latent vector" $z$. Instead of a single point, it predicts a probability distribution (usually a Gaussian with mean $\mu$ and variance $\sigma^2$).
- Decoder: Takes a sampled vector $z$ and tries to reconstruct the original image $x$.
- Objective: It minimizes the "reconstruction error" (how badly $x$ was restored) plus the "KL divergence" (how much the learned distribution deviates from a standard Gaussian prior), as written out below.
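Written out (this is the textbook VAE objective, not a formula taken from this paper), the quantity being minimized is:
$$
\mathcal{L}_{\text{VAE}}(x) \;=\; \underbrace{-\,\mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]}_{\text{reconstruction error}} \;+\; \underbrace{D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\big\|\,\mathcal{N}(0, I)\big)}_{\text{KL regularization}}
$$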
2. Dirichlet Process (DP) & Stick-Breaking Process How do we define "infinite" clusters mathematically?
- Dirichlet Process (DP): Imagine a distribution over distributions. It allows a model to have a potentially infinite number of components (clusters), but only a finite number are used to explain the finite data observed so far.
- Stick-Breaking Process: This is a constructive way to visualize the DP. Imagine a stick of length 1.
  - Break off a fraction $v_1$ of the stick. This is the weight (probability) $\pi_1$ of Cluster 1.
  - Take the remaining stick (length $1 - v_1$) and break off a fraction $v_2$ of that. This is the weight $\pi_2$ of Cluster 2.
  - Repeat forever.
- The result is an infinite sequence of weights $\pi_1, \pi_2, \dots$ that sums to 1. This allows the model to always have "room" for a new cluster further down the stick.
The following figure (Figure 1 from the original paper) illustrates this process:
- (a) shows the stick being broken sequentially to determine the weights $\pi_k$.
- (b) shows a histogram of data sampled from a DP, where bars represent the weights of different components.
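As a concrete illustration, here is a minimal NumPy sketch (ours, not from the authors' repository) that samples cluster weights from a truncated stick-breaking construction; `alpha` is the DP concentration parameter and `truncation` caps the number of components purely for the simulation:

```python
import numpy as np

def stick_breaking_weights(alpha: float, truncation: int, rng=None) -> np.ndarray:
    """Sample mixture weights from a (truncated) stick-breaking process.

    alpha: concentration parameter of the Dirichlet Process.
    truncation: maximum number of components kept in the approximation.
    """
    rng = np.random.default_rng() if rng is None else rng
    v = rng.beta(1.0, alpha, size=truncation)                       # fraction broken off at each step
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))   # stick left before each break
    return v * remaining                                            # pi_k = v_k * prod_{j<k} (1 - v_j)

# Small alpha concentrates mass on a few clusters; large alpha spreads it over many.
print(stick_breaking_weights(alpha=1.0, truncation=10))
```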
3. Variational Inference (VI) In Bayesian statistics, calculating the exact "posterior" distribution (the updated probability of clusters after seeing data) is often impossible because it requires integrating over all possible configurations.
- VI turns this into an optimization problem. It proposes a simpler "variational distribution" $q$ (usually from a known family such as the Gaussian) and adjusts the parameters of $q$ to make it as close as possible to the true posterior $p$.
- ELBO (Evidence Lower Bound): The objective we maximize in VI. Maximizing the ELBO is equivalent to minimizing the difference (KL divergence) between $q$ and the true posterior $p$.
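In standard notation (the generic formulation rather than the paper's), the ELBO for data $x$ and latent variables $z$ satisfies
$$
\log p(x) \;\ge\; \mathcal{L}(q) \;=\; \mathbb{E}_{q(z)}\big[\log p(x, z)\big] - \mathbb{E}_{q(z)}\big[\log q(z)\big] \;=\; \log p(x) - D_{\mathrm{KL}}\big(q(z)\,\|\,p(z \mid x)\big),
$$
so maximizing $\mathcal{L}(q)$ indeed minimizes the KL divergence between $q$ and the true posterior.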
3.2. Previous Works
- Parametric Methods (Fixed K):
- DEC (Deep Embedded Clustering): Uses K-means concepts to iteratively refine clusters.
- GMVAE (Gaussian Mixture VAE): Uses a Mixture of Gaussians as a prior. It is powerful but requires fixing $K$ in advance (e.g., $K = 10$). If the data has 11 classes, GMVAE fails.
- Nonparametric Methods (Dynamic K):
- SB-VAE (Stick-Breaking VAE): Replaces the Gaussian prior with a stick-breaking process. However, it often treats latent dimensions as features rather than clusters, lacking geometric shape information.
- DeepDPM: Uses split/merge moves but relies on neural network splits which can be computationally heavy and data-inefficient.
3.3. Differentiation Analysis
DIVA distinguishes itself by:
- Hybrid Approach: It essentially embeds a full Bayesian nonparametric clustering algorithm (DPMM) inside the latent space of a deep VAE.
- Explicit Cluster Management: Unlike SB-VAE, which handles clusters implicitly, DIVA explicitly tracks cluster parameters (mean, covariance) and uses heuristic "birth" and "merge" moves (borrowed from the Memoized Online Variational Bayes algorithm) to actively manage model complexity. This makes it more robust to incremental data.
4. Methodology
4.1. Principles
The core idea of DIVA is alternating optimization. The framework consists of two major modules that are updated in turns:
- The VAE Module: Learns to map complex data (images) into a simple latent space $z$.
- The DPMM Module: Clusters the points in the latent space using an infinite mixture model.
The intuition is that the VAE handles the complexity of high-dimensional perception (seeing the image), while the DPMM handles the logic of categorization (grouping the concepts), and the DPMM's flexibility allows the number of categories to grow.
The following figure (Figure 3 from the original paper) provides an overview of the architecture. The left side shows the VAE (Encoder/Decoder), and the right side shows the DPMM modeling the latent space with infinite components.
This figure is a schematic of the DIVA framework. The left side shows the encoder, latent space, and decoder; the right side shows the DPMM prior space as multiple Gaussian components, where the number of clusters is not fixed.
4.2. Core Methodology In-depth (Layer by Layer)
The algorithm alternates between Updating the DPMM and Updating the VAE. Let's break down the mathematical model and the steps.
4.2.1. The Generative Model (DPMM)
The paper assumes the data is generated by a specific probabilistic process. This is the "worldview" of the model.
- Stick-Breaking Prior: The mixing proportions $\pi$ (weights of clusters) are generated by breaking a stick.
  - $v_k \sim \mathrm{Beta}(1, \alpha)$ (sample a fraction from a Beta distribution).
  - $\pi_k = v_k \prod_{j<k} (1 - v_j)$ (the weight of cluster $k$ is the current fraction times what is left of the stick).
- Cluster Parameters: Each cluster $k$ has a mean $\mu_k$ and covariance $\Sigma_k$. These are drawn from a base distribution (Normal-Wishart).
- Assignment: For each data point $n$, a cluster assignment $c_n$ is drawn from the categorical distribution defined by $\pi$: $c_n \sim \mathrm{Cat}(\pi)$.
- Latent Variable: The latent vector $z_n$ is generated from the chosen cluster's Gaussian: $z_n \sim \mathcal{N}(\mu_{c_n}, \Sigma_{c_n})$.
- Observation: The actual data $x_n$ is generated by the decoder from $z_n$: $x_n \sim p_\theta(x \mid z_n)$.
The joint probability of the entire system (DPMM part), written here in standard DPMM notation, factorizes as:
$$
p(z, c, v, \mu, \Sigma) \;=\; \prod_{n=1}^{N} \mathcal{N}\big(z_n \mid \mu_{c_n}, \Sigma_{c_n}\big)\,\mathrm{Cat}\big(c_n \mid \pi(v)\big)\;\prod_{k=1}^{\infty} \mathrm{Beta}(v_k \mid 1, \alpha)\,\mathrm{NW}\big(\mu_k, \Sigma_k \mid m_0, \beta_0, W_0, \nu_0\big)
$$
Symbol Explanation:
- $z_n$: Latent vector for data point $n$.
- $c_n$: Cluster assignment for data point $n$.
- $v_k$: Stick-breaking variables.
- $\mu_k, \Sigma_k$: Mean and covariance of cluster $k$.
- $\mathcal{N}$: Normal (Gaussian) distribution.
- $\mathrm{Cat}$: Categorical distribution.
- $\mathrm{Beta}$: Beta distribution.
- $m_0, \beta_0, W_0, \nu_0$: Hyperparameters of the Normal-Wishart prior (controlling the shape of clusters).
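To make the generative story concrete, the following NumPy sketch (our illustration, with simplified choices: isotropic covariances instead of Normal-Wishart draws, and a stubbed-out decoder) samples latent vectors from a truncated DPMM prior:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, K_trunc, latent_dim, N = 1.0, 20, 2, 500

# Stick-breaking weights, truncated at K_trunc components.
v = rng.beta(1.0, alpha, size=K_trunc)
pi = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
pi /= pi.sum()  # renormalize the truncated weights

# Cluster parameters: draw means from a broad Gaussian and use isotropic
# covariances for simplicity (the model uses a Normal-Wishart base measure).
mu = rng.normal(0.0, 5.0, size=(K_trunc, latent_dim))
sigma = np.full(K_trunc, 0.5)

# Assignments and latent vectors.
c = rng.choice(K_trunc, size=N, p=pi)                                # c_n ~ Cat(pi)
z = mu[c] + sigma[c, None] * rng.standard_normal((N, latent_dim))    # z_n ~ N(mu_c, sigma_c^2 I)

# x_n would then be produced by the VAE decoder: x_n ~ p_theta(x | z_n).
print("components actually used:", len(np.unique(c)))
```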
4.2.2. Step 1: Updating the DPMM (Variational Inference)
In this step, the VAE is frozen. We use the latent vectors $z$ generated by the VAE as fixed data points to train the DPMM. Since the true posterior is intractable, we approximate it with a variational distribution $q$.
The paper uses a mean-field variational distribution, which assumes independence between the factors (written here in standard notation):
$$
q(c, v, \mu, \Sigma) \;=\; \prod_{n=1}^{N} q(c_n \mid \hat{r}_n)\;\prod_{k=1}^{K} q(v_k \mid \hat{\alpha}_{k1}, \hat{\alpha}_{k0})\, q(\mu_k, \Sigma_k \mid \hat{\lambda}_k)
$$
Symbol Explanation:
- $\hat{r}_{nk}$: The probability (responsibility) that data point $n$ belongs to cluster $k$.
- $\hat{\alpha}_{k1}, \hat{\alpha}_{k0}$: Parameters of the Beta distribution for the stick-breaking variables.
- The other hatted variables ($\hat{\lambda}_k$, etc.) are the learnable variational parameters of the cluster components.
Optimization (ELBO): The goal is to maximize the Evidence Lower Bound (ELBO). The paper employs Memoized Online Variational Bayes (memoVB). This involves iterative updates:
- Local Update: Update $\hat{r}_{nk}$ (cluster responsibilities) for the current batch of data (see the sketch below).
- Global Update: Update the global cluster parameters ($\hat{\alpha}_k$, $\hat{\lambda}_k$, etc.) based on the aggregated sufficient statistics from the batches.
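For intuition, the local step has the familiar mixture-model form $\hat{r}_{nk} \propto \pi_k\, \mathcal{N}(z_n \mid \mu_k, \Sigma_k)$. A minimal sketch (ours; it uses point estimates of the cluster parameters, whereas memoVB uses expectations under $q$) could look like this:

```python
import numpy as np
from scipy.stats import multivariate_normal

def local_responsibilities(z, pi, mu, cov):
    """Soft cluster assignments r[n, k] for latent vectors z of shape (N, D).

    pi: mixture weights (K,), mu: cluster means (K, D), cov: covariances (K, D, D).
    """
    N, K = z.shape[0], len(pi)
    log_r = np.empty((N, K))
    for k in range(K):
        log_r[:, k] = np.log(pi[k] + 1e-12) + multivariate_normal.logpdf(z, mu[k], cov[k])
    log_r -= log_r.max(axis=1, keepdims=True)       # numerical stability
    r = np.exp(log_r)
    return r / r.sum(axis=1, keepdims=True)         # normalize rows to sum to 1
```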
Birth and Merge Moves: To handle the dynamic number of clusters, the algorithm performs heuristic moves during the DPMM update:
- Birth: It identifies data points that are poorly explained by current clusters (low probability). It proposes a new cluster initialized with these points. If the ELBO increases with this new cluster, it is kept.
- Merge: It considers pairs of clusters. If merging them into one cluster increases the ELBO (i.e., explains the data just as well with less complexity), they are merged. A simplified sketch of this check follows below.
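A rough sketch of the merge decision is given here; it is our simplification, and `copy`, `merge_components`, and `elbo` are hypothetical helpers standing in for the cached-sufficient-statistics machinery of memoVB:

```python
def try_merge(model, k_a, k_b, batch):
    """Tentatively merge clusters k_a and k_b; keep the merge only if the ELBO
    does not drop. `model` is assumed to expose `copy`, `merge_components`,
    and `elbo` (hypothetical helpers for illustration only)."""
    candidate = model.copy()
    candidate.merge_components(k_a, k_b)          # pool the two clusters' statistics
    if candidate.elbo(batch) >= model.elbo(batch):
        return candidate                          # accept: fewer clusters, same or better fit
    return model                                  # reject: keep the original model
```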
4.2.3. Step 2: Updating the VAE
In this step, the DPMM is frozen. We update the VAE (Encoder/Decoder) parameters to minimize the VAE loss.
The Loss Function: The standard VAE loss is $\mathcal{L} = \mathcal{L}_{\text{rec}} + \mathcal{L}_{\text{KL}}$.
- Reconstruction Loss ($\mathcal{L}_{\text{rec}}$): Standard mean squared error between the input $x$ and the reconstruction $\hat{x}$.
- KL Divergence Loss ($\mathcal{L}_{\text{KL}}$): This is where DIVA differs from a standard VAE. Instead of measuring the distance to a single standard Gaussian $\mathcal{N}(0, I)$, we measure the distance to the specific cluster component assigned to the data point.
However, we don't know for sure which cluster a point belongs to, so we use the soft assignment probability $\hat{r}_{nk}$ (calculated by the DPMM in the previous step).
The weighted KL divergence for a single data point $x_n$ is:
$$
\mathcal{L}_{\text{KL}}(x_n) \;=\; \sum_{k=1}^{K} \hat{r}_{nk}\, D_{\mathrm{KL}}\big(q_\phi(z \mid x_n)\,\big\|\,\mathcal{N}(\mu_k, \Sigma_k)\big)
$$
where $D_{\mathrm{KL}}\big(q_\phi(z \mid x_n)\,\|\,\mathcal{N}(\mu_k, \Sigma_k)\big)$ is the KL divergence between the encoder's output distribution for $x_n$ and the Gaussian distribution of cluster $k$ (with mean $\mu_k$ and covariance $\Sigma_k$ from the DPMM).
The specific formula for the KL divergence between the encoder output $\mathcal{N}(\hat{\mu}_n, \hat{\Sigma}_n)$ and cluster component $\mathcal{N}(\mu_k, \Sigma_k)$ is the standard closed form for two Gaussians:
$$
D_{\mathrm{KL}} \;=\; \frac{1}{2}\left[\operatorname{tr}\big(\Sigma_k^{-1}\hat{\Sigma}_n\big) + (\mu_k - \hat{\mu}_n)^{\top}\Sigma_k^{-1}(\mu_k - \hat{\mu}_n) - d + \ln\frac{\det \Sigma_k}{\det \hat{\Sigma}_n}\right]
$$
Symbol Explanation:
- $d$: Dimensionality of the latent space.
- $\Sigma_k, \mu_k$: Covariance and mean of the $k$-th DPMM cluster.
- $\hat{\Sigma}_n, \hat{\mu}_n$: Covariance and mean output by the encoder for input $x_n$.
- $\operatorname{tr}(\cdot)$: Trace of a matrix (sum of diagonal elements).
- $\det(\cdot)$: Determinant of a matrix.
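As an implementation sketch (ours, assuming diagonal covariances for both the encoder output and the DPMM components, which makes the formula element-wise), the responsibility-weighted KL term could be computed in PyTorch as:

```python
import torch

def weighted_kl_loss(mu_enc, logvar_enc, resp, mu_k, var_k):
    """Responsibility-weighted KL term, assuming diagonal covariances.

    mu_enc, logvar_enc: encoder outputs, shape (N, D).
    resp: soft assignments r[n, k] from the DPMM step, shape (N, K).
    mu_k, var_k: DPMM cluster means/variances, shape (K, D).
    """
    var_enc = logvar_enc.exp()                                               # (N, D)
    # KL(N(mu_enc, var_enc) || N(mu_k, var_k)) per sample and cluster: (N, K)
    kl = 0.5 * (
        (var_enc.unsqueeze(1) / var_k.unsqueeze(0)).sum(-1)                  # trace term
        + ((mu_k.unsqueeze(0) - mu_enc.unsqueeze(1)) ** 2
           / var_k.unsqueeze(0)).sum(-1)                                     # mean term
        - mu_enc.shape[1]                                                    # minus d
        + (var_k.log().unsqueeze(0) - logvar_enc.unsqueeze(1)).sum(-1)       # log-det ratio
    )
    return (resp * kl).sum(dim=1).mean()   # weight by responsibilities, average over batch
```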
4.2.4. Algorithm Summary
The complete workflow is summarized in Algorithm 1 of the paper:
- Initialize the VAE and the DPMM (with an initial cluster count).
- Loop:
  - Sample a mini-batch of data $x$.
  - VAE Step: Pass $x$ through the VAE to get $z$. Use the current DPMM to get assignment probabilities $\hat{r}_{nk}$. Compute $\mathcal{L}_{\text{rec}}$ and $\mathcal{L}_{\text{KL}}$. Update the VAE parameters.
  - DPMM Step: (Every few epochs) Use the updated latent $z$'s. Run variational inference to update the cluster parameters. Perform Birth and Merge moves to adjust $K$.
- Repeat until convergence.
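A high-level sketch of this alternating loop in PyTorch-style pseudocode is given below; it is our paraphrase of Algorithm 1, and `vae`, `dpmm`, and their methods are placeholders rather than the repository's actual API:

```python
# Alternating optimization sketch (illustrative, not the authors' code).
for epoch in range(num_epochs):
    for x in dataloader:
        # --- VAE step (DPMM frozen) ---
        mu_enc, logvar_enc, x_hat = vae(x)                  # encode + decode
        z = vae.reparameterize(mu_enc, logvar_enc)
        resp = dpmm.responsibilities(z.detach())            # soft assignments r[n, k]
        loss = reconstruction_loss(x_hat, x) \
             + weighted_kl_loss(mu_enc, logvar_enc, resp, dpmm.means, dpmm.vars)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # --- DPMM step (VAE frozen), run every few epochs ---
    if epoch % dpmm_update_interval == 0:
        z_all = encode_dataset(vae, dataloader)             # latent vectors for the whole set
        dpmm.fit_memoized(z_all)                            # local + global variational updates
        dpmm.birth_and_merge(z_all)                         # adjust the number of clusters K
```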
5. Experimental Setup
5.1. Datasets
The authors evaluated DIVA on a comprehensive set of 8 datasets, ranging from simple to complex.
- MNIST: Handwritten digits (0-9). 60k training images.
- Fashion-MNIST: Images of clothing (10 classes). More complex than digits.
- Reuters10k: Text dataset of roughly 10,000 news articles (4 root categories).
- HHAR: Human Activity Recognition from sensors.
- STL-10: Real-world images (10 classes). Higher resolution.
- ImageNet-50: A subset of ImageNet with 50 classes. Very complex.
- CIFAR-10: 10 classes of objects (cars, birds, etc.).
- SVHN: Street View House Numbers.
Note regarding STL-10 and ImageNet: Due to the difficulty of generating high-res images from scratch, the authors used pre-trained feature extractors (ResNet-50 and MOCO) to get feature vectors first, then trained DIVA on those vectors.
5.2. Evaluation Metrics
1. Unsupervised Clustering Accuracy (ACC)
- Concept: Since the model doesn't know the true labels (e.g., it groups all "7"s together but calls them "Cluster 5"), we need to map the learned clusters to the ground truth labels to maximize accuracy.
- Formula: $ACC = \max_{f} \frac{\sum_{i=1}^{N} \mathbf{1}\{ l_i = f(v_i) \}}{N}$
- Symbol Explanation:
  - $N$: Total number of samples.
  - $l_i$: True label of sample $i$.
  - $v_i$: Predicted cluster assignment for sample $i$.
  - $f$: A mapping function (permutation) that matches predicted clusters to true labels, usually found via the Hungarian algorithm (see the snippet below).
  - $\mathbf{1}\{\cdot\}$: Indicator function (returns 1 if true, 0 otherwise).
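For reference, a common way to compute this metric (not code from the paper) pairs a contingency table with `scipy.optimize.linear_sum_assignment`:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(true_labels, cluster_ids):
    """Best-match accuracy between predicted clusters and ground-truth labels."""
    true_labels = np.asarray(true_labels)
    cluster_ids = np.asarray(cluster_ids)
    n_clusters = cluster_ids.max() + 1
    n_classes = true_labels.max() + 1
    # Contingency table: counts[c, l] = points of cluster c whose true label is l.
    counts = np.zeros((n_clusters, max(n_clusters, n_classes)), dtype=np.int64)
    for c, l in zip(cluster_ids, true_labels):
        counts[c, l] += 1
    # Hungarian algorithm finds the cluster-to-label mapping maximizing matched points.
    row_ind, col_ind = linear_sum_assignment(counts.max() - counts)
    return counts[row_ind, col_ind].sum() / len(true_labels)
```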
2. Normalized Mutual Information (NMI)
- Concept: Measures the shared information between the predicted clustering and the ground truth labels, normalized to be between 0 and 1. It is independent of the absolute number of clusters.
- Formula: (Standard definition, implied in paper) $ \text{NMI}(Y, C) = \frac{2 \cdot I(Y; C)}{H(Y) + H(C)} $
- Symbol Explanation:
  - $Y$: Ground truth labels.
  - $C$: Predicted clusters.
  - $I(Y; C)$: Mutual information between $Y$ and $C$.
  - $H(\cdot)$: Entropy.
3. Adjusted Rand Index (ARI)
- Concept: Measures the similarity between two clusterings by counting pairs of samples that are assigned to the same or different clusters in both the predicted and true partitions. Adjusted for chance.
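Both NMI and ARI are available off the shelf in scikit-learn; a minimal usage example (ours) looks like this:

```python
from sklearn.metrics import normalized_mutual_info_score, adjusted_rand_score

true_labels = [0, 0, 1, 1, 2, 2]
cluster_ids = [1, 1, 0, 0, 2, 2]   # cluster indices need not match the label values

print(normalized_mutual_info_score(true_labels, cluster_ids))  # 1.0: identical partitions
print(adjusted_rand_score(true_labels, cluster_ids))           # 1.0: identical partitions
```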
5.3. Baselines
The authors compared DIVA against 8 state-of-the-art methods:
- Parametric (Fixed K): GMM, DEC, GMVAE. These require knowing that $K = 10$ (for MNIST).
- Nonparametric (Dynamic K):
  - DPMM+memoVB: Traditional DPMM without deep learning (applied to raw data).
  - VSB-DVM: A variational stick-breaking method.
  - DDPM & DeepDPM: Recent deep clustering methods. DeepDPM is a strong competitor that splits clusters using neural networks.
6. Results & Analysis
6.1. Core Results Analysis (Static Datasets)
The primary experiment tested how well DIVA clusters standard datasets compared to baselines.
Key Finding: DIVA outperformed both parametric and nonparametric baselines on almost all datasets. Notably, it achieved 0.94 ACC on MNIST and 0.69 ACC on ImageNet-50, which is significant for unsupervised clustering.
The following table (Table 1 from the original paper) details these results.
| Frameworks | MNIST | Fashion-MNIST | Reuters10k (imb) | HHAR | STL-10 | ImageNet-50 |
|---|---|---|---|---|---|---|
| GMM | .60 ±.01 | .49 ± .02 | .73 ± .06 | .43 ± .00 | .58 ± .03 | .60 ±.01 |
| DEC† | .84 ± .00‡ | .60 ±.04 | .72 ± .00‡ | .79 ±.01 | .80 ± .01 | .63 ±.01 |
| GMVAE† | .82 ±.04 | .61 ±.01 | .73 ± .08 | .65 ± .03 | .79 ± .04 | .62 ± .02 |
| DPMM+memoVB | .63 ± .02 | .57 ±.01 | .56 ± .05 | .68 ± .04 | .64 ± .05 | .57 ±.00 |
| VSB-DVM† | .86 ± .01 | .64 ±.03 | .60 ± .03 | .66 ± .06 | .52 ±.03 | .49 ±.02 |
| DDPM† | .91 ±.01 | .63 ±.02 | .71 ± .02 | .74±.01 | .72 ±.01 | .63 ±.02 |
| DeepDPM† | .93 ± .02 | .63 ±.01 | .83 ± .01 | .79 ±.02 | .81 ±.02 | .66 ± .01 |
| DIVA (Ours) | .94 ±.01 | .72 ±.01 | .83 ± .01 | .83 ±.01 | .88 ±.01 | .69 ±.02 |
Analysis:
- vs. GMVAE: GMVAE requires a fixed K. Even with the correct K, DIVA outperforms it (e.g., 0.94 vs 0.82 on MNIST). This suggests the DPMM prior is better at capturing the latent structure than a static GMM.
- vs. DeepDPM: DeepDPM is the closest competitor (0.93 on MNIST), but DIVA shows a larger margin on more complex image datasets like Fashion-MNIST (0.72 vs 0.63) and STL-10 (0.88 vs 0.81).
Latent Space Visualization: The paper provides t-SNE plots (Figure 4 below) showing that DIVA creates very distinct, well-separated clusters, whereas other methods like SB-VAE show overlapping or "sticky" clusters.
This figure shows t-SNE projections of the MNIST dataset for different algorithms, including DIVA, SB-VAE, VSB-DVM, DeepDPM, and GMVAE. DIVA clearly learns well-separated clusters, while the other methods show noticeable deficiencies in their groupings.
6.2. Incremental Representation Learning
This is the most innovative experiment. The authors simulated a scenario where the number of classes in the training data increases over time.
Setup:
- Train on digits {0, 1, 2} (3 features).
- Then train on {0, 1, 2, 3, 4} (5 features).
- Then {0...6} (7 features), and finally all 10 digits.
Results: Parametric models (GMVAE) fail completely if initialized with $K = 3$: their accuracy drops because they force 5 digits into 3 clusters. DIVA, however, maintains high accuracy. As shown in the figure below (Figure 5 from the paper), DIVA (yellow bars) consistently scores high, while the others collapse as the feature count grows.
Dynamic Adaptation Process: Figure 6 (below) tracks the training process over epochs.
- Solid lines: Accuracy.
- Dashed line (DIVA): Number of clusters ($K$). We see that when new data is introduced (at epochs 30, 60, and 90), DIVA's cluster count (dashed line) jumps up. The accuracy temporarily dips as the model adjusts, then recovers to a high level. DeepDPM (purple) fails to adapt in this zero-shot manner because its neural-network splitter requires extensive retraining.
This chart shows test accuracy and the number of clusters over training epochs under incremental training. DIVA (yellow curve) shows steadily increasing accuracy compared to the other algorithms (such as GMVAE and DDPM) across epochs, together with the evolving cluster count, reflecting its dynamic adaptability.
6.3. Generative Performance & Sub-Features
Can the clusters generate realistic images? The authors sampled from the learned DPMM clusters. Interestingly, DIVA often learned more clusters than the ground truth (e.g., more than 10 for MNIST). Why? The clusters captured "sub-features" or styles. For example, one cluster might capture "upright 0s" while another captures "slanted 0s".
Figure 7 (below) shows these reconstructions. In panel (a)-(c), different clusters generate variations of the digit '0'. This proves the model learns a hierarchical or fine-grained representation of the data.
This figure shows reconstructed images produced by the DIVA framework on different datasets, including MNIST, Fashion-MNIST, CIFAR-10, and SVHN. Each subplot contains 16 images from a single cluster, illustrating the framework's effectiveness at extracting informative sub-features.
7. Conclusion & Reflections
7.1. Conclusion Summary
The paper presents DIVA, a robust framework for deep clustering that removes the need for a pre-defined number of clusters. By integrating a nonparametric Dirichlet Process Mixture Model into the VAE latent space and utilizing "birth/merge" variational updates, DIVA achieves:
- State-of-the-art performance on standard clustering benchmarks.
- True dynamic adaptation, successfully handling datasets where the number of classes grows over time.
- Rich representation learning, capturing fine-grained sub-features (styles) within broader categories.
7.2. Limitations & Future Work
- Computational Complexity: The "memoized online variational inference" involves calculating sufficient statistics and performing birth/merge checks, which adds computational overhead compared to a simple standard VAE.
- Pre-extracted Features: For high-resolution images (ImageNet), the method relies on pre-trained extractors (ResNet). Applying DIVA end-to-end on raw high-res pixels remains a challenge, likely due to the difficulty of training stable reconstructions simultaneously with complex probabilistic priors.
- Future Directions: The authors suggest applications in Continuous Learning (to prevent catastrophic forgetting) and robotics, where agents must learn novel objects in open-ended environments.
7.3. Personal Insights & Critique
- Innovation: The combination of classical Bayesian Nonparametrics (DPMM) with modern Deep Learning (VAE) is elegant. While "stick-breaking VAEs" existed, explicitly using the birth/merge heuristics from the MemoVB algorithm effectively bridges the gap between static deep learning and dynamic Bayesian inference.
- Practicality: The ability to handle an unknown $K$ is a massive practical advantage. In most real-world data mining tasks, we simply do not know how many clusters exist. DIVA offers a solution that doesn't require running K-means over a range of candidate $K$ values to find the elbow point.
- Critique: The paper shows DIVA creates more clusters than ground truth (e.g., splitting '0' into two styles). While this is touted as a feature ("sub-features"), in some strict classification tasks, this might be a bug if a 1-to-1 mapping is desired. Users might need a post-processing step to merge these sub-clusters if they strictly want coarse categories.