Deep Plug-and-Play Clustering with Unknown Number of Clusters
TL;DR Summary
The paper introduces a plug-and-play clustering module that automatically adjusts the number of clusters without prior knowledge of K, utilizing a split-and-merge framework. Experiments demonstrate it achieves state-of-the-art performance across benchmark datasets.
Abstract
Clustering is an essential task that allows data points to be classified in an unsupervised manner. Most deep clustering algorithms are very effective when given the number of clusters $K$. However, when $K$ is unknown, finding an appropriate $K$ for these algorithms via model-selection criteria can be computationally expensive, and applying the algorithms with an inaccurate $K$ can hardly achieve state-of-the-art performance. This paper proposes a plug-and-play clustering module to automatically adjust the number of clusters, which can be easily embedded into existing deep parametric clustering methods. By analyzing the goal of clustering, a split-and-merge framework is introduced to reduce the intra-class diversity and increase the inter-class difference, which leverages the entropy between different clusters. Specifically, given an initial clustering number, clusters can be split into sub-clusters or merged into super-clusters and converge to a stable number of clusters at the end of training. Experiments on benchmark datasets demonstrate that the proposed method can achieve comparable performance with state-of-the-art works without requiring the number of clusters.
In-depth Reading
1. Bibliographic Information
1.1. Title
Deep Plug-and-Play Clustering with Unknown Number of Clusters
1.2. Authors
An Xiao, Hanting Chen, Tianyu Guo, Qinghua Zhang, and Yunhe Wang. All authors are affiliated with Huawei Noah's Ark Lab. This lab is a well-known corporate research institution focusing on artificial intelligence and data mining.
1.3. Journal/Conference
The paper was submitted to and reviewed on OpenReview, which is a platform for open peer review, commonly used for major AI conferences like ICLR, NeurIPS, and ICML. The link provided suggests it was submitted for consideration at a conference, but the specific venue is not explicitly stated in the provided text. OpenReview is a highly respected platform in the machine learning community.
1.4. Publication Year
The most recent reference in the paper is from 2022, and the OpenReview submission link is active, suggesting the work is from around 2022 or later.
1.5. Abstract
The paper addresses a significant limitation of most deep clustering algorithms: the requirement to know the number of clusters, denoted $K$, in advance. When $K$ is unknown, finding the optimal value is computationally expensive, and using an incorrect $K$ leads to poor performance. To solve this, the authors propose a plug-and-play clustering module that can be integrated into existing deep parametric clustering methods. This module employs a split-and-merge framework to automatically adjust the number of clusters during training. The decision to split or merge is based on minimizing intra-class diversity and maximizing inter-class difference, using entropy as a measure of similarity between clusters. The module starts with an initial guess for $K$ and dynamically adjusts it until it converges to a stable number. Experiments show that this approach allows existing state-of-the-art methods to achieve comparable performance without needing the true $K$ to be specified beforehand.
1.6. Original Source Link
- Official PDF Link: https://openreview.net/pdf?id=6rbcq0qacA
- Publication Status: Reviewed on OpenReview. This indicates the paper has undergone a peer-review process, but its final publication status in a specific conference or journal is not detailed in the provided text.
2. Executive Summary
2.1. Background & Motivation
- Core Problem: The majority of high-performing deep clustering algorithms are parametric, meaning they require the number of clusters ($K$) as a mandatory input parameter. In real-world applications, $K$ is often unknown.
- Importance & Challenges:
- Performance Degradation: Applying a parametric algorithm with an incorrect $K$ results in a significant drop in performance.
- Computational Cost: Traditional methods for finding the optimal $K$, such as grid search combined with model-selection criteria (e.g., silhouette score), are computationally prohibitive for deep learning models due to their long training times.
- Gap in Performance: Existing non-parametric deep clustering methods (which do not require $K$) generally underperform compared to their state-of-the-art parametric counterparts. There is a trade-off between the convenience of not needing $K$ and achieving top-tier accuracy.
- Paper's Entry Point: Instead of creating a new, standalone non-parametric clustering algorithm, the authors propose a more practical solution: a plug-and-play module. This module can be added to existing, powerful parametric algorithms, giving them the ability to automatically determine the correct $K$. This approach aims for the best of both worlds: the high performance of parametric methods without the need to pre-specify $K$.
2.2. Main Contributions / Findings
- A Novel Plug-and-Play Module: The primary contribution is a universal module that can be integrated into various deep parametric clustering methods to enable automatic determination of the number of clusters.
- A Principled Split-and-Merge Framework: The module is based on a theoretically grounded framework for splitting and merging clusters. The goal is to minimize intra-cluster distance (making clusters more compact) and maximize inter-cluster distance (making clusters more separate).
- Entropy-Based Decision Making: The decision to split or merge clusters is guided by the Jensen-Shannon (JS) divergence, an entropy-based metric that measures the similarity between the probability distributions of data points within different clusters. The paper derives dynamic thresholds for splitting and merging based on an overall clustering objective function.
- State-of-the-Art Performance without $K$: Experiments demonstrate that by equipping existing parametric models (like SCAN) with this module, the resulting system achieves performance comparable to the original models when they are given the true $K$, and significantly outperforms other non-parametric methods.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
- Unsupervised Learning: A category of machine learning where algorithms are trained on data without any labeled responses. The goal is to find hidden patterns or intrinsic structures in the input data. Clustering is a primary task in unsupervised learning.
- Clustering: The task of grouping a set of objects (e.g., images, documents) in such a way that objects in the same group (called a cluster) are more similar to each other than to those in other groups.
- Parametric Clustering: Algorithms that require the number of clusters ($K$) to be specified in advance. A classic example is K-means. Most deep clustering methods fall into this category.
- Non-Parametric Clustering: Algorithms that can determine the number of clusters automatically from the data. Examples include DBSCAN (density-based) and hierarchical clustering methods.
- Deep Clustering: A class of algorithms that use deep neural networks (DNNs) to learn meaningful feature representations of the data and perform clustering simultaneously. DNNs, such as Autoencoders, are powerful at extracting complex, non-linear features from high-dimensional data like images, which traditional methods struggle with.
- Autoencoder (AE): A type of neural network used for unsupervised learning of efficient data codings. An AE consists of two parts: an encoder that compresses the input data into a low-dimensional latent representation (or "embedding"), and a decoder that reconstructs the original input from this latent representation. The network is trained to minimize the reconstruction error. The learned latent space often captures the most salient features of the data, making it a good representation for clustering.
- Jensen-Shannon (JS) Divergence: A measure of the similarity between two probability distributions. It is based on the Kullback-Leibler (KL) divergence but is symmetric and always finite. The JS divergence between two distributions $P$ and $Q$ is 0 if they are identical and increases as they become more different. In this paper, it is used to quantify the similarity between clusters, where each cluster is represented as a probability distribution over its data points.
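For reference, the JS divergence between two distributions $P$ and $Q$ is the symmetrized, smoothed form of the KL divergence, defined via the mixture $M$:

$$\mathrm{JS}(P \,\|\, Q) \;=\; \tfrac{1}{2}\,\mathrm{KL}(P \,\|\, M) \;+\; \tfrac{1}{2}\,\mathrm{KL}(Q \,\|\, M), \qquad M = \tfrac{1}{2}(P + Q).$$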
3.2. Previous Works
- Traditional Clustering:
  - K-means: An iterative algorithm that partitions observations into $K$ clusters by minimizing the within-cluster sum of squares. It is simple and efficient but requires $K$ and is sensitive to initialization.
  - DBSCAN: A density-based algorithm that groups together points that are closely packed, marking as outliers points that lie alone in low-density regions. It does not require $K$ but depends on other hyperparameters.
- Deep Parametric Clustering: These methods typically learn a feature embedding and cluster assignments simultaneously.
  - DEC (Deep Embedding for Clustering): First pre-trains an autoencoder, then fine-tunes the encoder by iteratively refining cluster assignments using a Student's t-distribution-based clustering loss. It requires $K$.
  - SCAN (Learning to Classify Images without Labels): A high-performing method that separates feature learning from clustering. It first pre-trains a feature extractor using self-supervised contrastive learning, then identifies "semantic nearest neighbors" and uses them to train a classification head for clustering. This is the main baseline model the authors build upon.
  - CC (Contrastive Clustering): An online clustering method that uses contrastive learning principles to pull together positive pairs (augmented views of the same image) and push apart negative pairs, while also learning cluster centroids.
- Deep Non-Parametric Clustering: These methods attempt to find $K$ automatically.
  - DeepDPM (Deep Dirichlet Process Mixture Model): Extends the classical Dirichlet Process (a Bayesian non-parametric model) to the deep learning setting. It uses a split-or-merge strategy to adjust the number of clusters but, as the paper notes, its performance lags behind top parametric methods. The key difference is that DeepDPM is a standalone non-parametric model, whereas the work in this paper is a module that enhances existing parametric ones.
3.3. Technological Evolution
The field of clustering has evolved significantly:
- Early Stage (Traditional Methods): Algorithms like K-means and hierarchical clustering operated directly on the raw feature space. They were effective for low-dimensional, well-separated data but struggled with high-dimensional data like images (the "curse of dimensionality").
- Feature Engineering Era: To improve clustering, researchers manually designed feature extraction pipelines (e.g., SIFT, HOG for images) before applying clustering algorithms. This was domain-specific and labor-intensive.
- Deep Clustering Emergence: With the rise of deep learning, methods like DEC started using neural networks (especially autoencoders) to automatically learn powerful, low-dimensional feature representations tailored for clustering. This led to a massive performance leap.
- SOTA Deep Parametric Methods: Recent methods like SCAN and CC have further improved performance by incorporating techniques from self-supervised and contrastive learning, achieving clustering accuracies that rival supervised methods on some benchmarks. However, the reliance on a known $K$ remained a major constraint.
- Addressing the "Unknown K" Problem: A parallel track of research focused on non-parametric deep models like DeepDPM. This paper fits into this latest stage but takes a novel "plug-and-play" approach, aiming to bridge the performance gap between parametric and non-parametric solutions.
3.4. Differentiation Analysis
- Modular vs. Monolithic: The core innovation is the modularity. Most previous non-parametric deep methods (e.g., DeepDPM) are built from the ground up as integrated systems. This paper's method is a separate module that can be attached to any deep parametric clustering algorithm, making it more flexible and allowing it to leverage the power of the latest, most advanced parametric models without needing to reinvent them.
- Principled Thresholds vs. Heuristics: The split and merge decisions are not based on simple heuristics. The paper derives the conditions for splitting (Proposition 1) and merging (Proposition 2) directly from a global clustering objective function (Equation 2), providing a stronger theoretical justification for the algorithm's behavior.
- Goal: Augmentation, not Replacement: The paper's goal is not to propose a new best clustering algorithm, but rather to augment existing ones. This is a practical and impactful approach, as it improves the usability of a whole class of powerful models.
4. Methodology
4.1. Principles
The core idea is to frame clustering with an unknown $K$ as an optimization problem. The objective is to find a partitioning of the data and a number of clusters that simultaneously minimizes the intra-cluster distance (compactness term) and maximizes the inter-cluster distance (separation term). The proposed method solves this by iteratively adjusting $K$ with a greedy split-and-merge scheme. At each stage, the algorithm decides whether to increase $K$ by splitting a diverse cluster or decrease $K$ by merging two similar clusters, based on whether the action leads to a better value of the overall objective function.
4.2. Core Methodology In-depth (Layer by Layer)
The entire methodology is built around a central optimization objective. Let's start there.
4.2.1. The Optimization Objective for Clustering
The paper defines the goal of clustering with an unknown $K$ as minimizing a single objective function that balances intra-cluster compactness and inter-cluster separation.
Integrated Explanation:
The authors consider a dataset of samples with a pairwise distance metric $d(\cdot, \cdot)$. They define $D(k, k)$ as the total intra-cluster distance within cluster $k$ (a measure of how spread out it is) and $D(k, k')$ as the total inter-cluster distance between clusters $k$ and $k'$ (a measure of how far apart they are). The overall optimization problem is formulated as follows.
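A form of the objective consistent with this description and the symbol list below (the normalization of the separation term shown here is an assumption, not taken verbatim from the paper) is:

$$\min_{K,\;\{C_k\}} \; V(K) \;=\; \sum_{k \in [K]} D(k,k) \;-\; \frac{\lambda}{K(K-1)} \sum_{\substack{k, k' \in [K] \\ k \neq k'}} D(k,k')$$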
Symbol Explanation:
- $K$: The number of clusters, which is a variable to be optimized.
- $[K]$: The set of integers from 1 to $K$, representing the cluster indices.
- $D(k, k)$: The intra-cluster diversity (or lack of compactness) of cluster $k$. We want to minimize this.
- $D(k, k')$: The inter-cluster distance between clusters $k$ and $k'$. The formula sums these distances with a negative sign, so minimizing the objective maximizes separation.
- $\sum_{k \in [K]} D(k, k)$: The total intra-cluster diversity across all clusters.
- $\sum_{k \neq k'} D(k, k')$: The total inter-cluster distance.
- $\lambda$: A hyper-parameter that balances the trade-off between minimizing intra-cluster diversity and maximizing inter-cluster separation.
- $\frac{1}{K(K-1)}$: A normalization factor for the separation term, turning the sum into an average pairwise distance.
Since searching over all possible values of $K$ is computationally expensive for deep models, the authors propose a greedy approach that adjusts $K$ one step at a time through splitting and merging.
4.2.2. Cluster Split to Reduce Intra-Class Diversity
Integrated Explanation: This phase is triggered when the current number of clusters, $K$, is likely smaller than the optimal number. In this case, some clusters will contain dissimilar data points, leading to high intra-cluster diversity (a large first term in Equation 2). The goal is to identify and split such clusters.
The decision to split a cluster is made if splitting it would decrease the overall objective value. The paper formalizes this in Proposition 1: a cluster split into two sub-clusters is beneficial when the resulting inequality on their pairwise distances holds. The inequality is derived by comparing the objective function (Eq. 2) before the split (with $K$ clusters) and after the split (with $K+1$ clusters): a split is performed if $V(K+1) < V(K)$, where $V(K)$ denotes the objective value with $K$ clusters.
To practically measure the distance/similarity between clusters, the authors use Jensen-Shannon (JS) divergence. JS divergence measures the similarity between the probability distributions of the data points assigned to each cluster. A large JS divergence between two sub-clusters means they are dissimilar and should be split.
By substituting the JS divergence for the distance metric in the inequality above, the authors derive a dynamic threshold, $\tau_{\text{split}}$, for splitting: a cluster is split into its two tentative sub-clusters if their JS divergence is greater than this threshold, which is given by the right-hand side of the inequality.

To encourage the network to separate the newly formed sub-clusters, a split loss is introduced. It augments the original training loss of the host algorithm with a term that maximizes the JS divergence between the new sub-clusters.

Symbol Explanation:
- The original loss function of the host deep clustering algorithm.
- A hyper-parameter controlling the strength of the split-promoting term.
- A coefficient equal to 1 if two clusters are the pair of sub-clusters just created by a split, and 0 otherwise; this restricts the extra term to the newly split clusters.
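To make the split test concrete, a minimal NumPy sketch is given below. It assumes each (sub-)cluster is summarized by the average soft-assignment distribution of its members; the function names, this representation, and the way the threshold is passed in are illustrative assumptions rather than the paper's exact implementation.

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two discrete distributions."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log(a / b)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def should_split(sub_probs_a, sub_probs_b, tau_split):
    """Decide whether a tentative split of one cluster into two sub-clusters
    is kept: the sub-clusters must be sufficiently dissimilar.

    sub_probs_a, sub_probs_b: arrays of shape (n_a, C) and (n_b, C) holding
    the soft assignments of the points tentatively placed in each sub-cluster
    (an assumed representation of a cluster as a distribution)."""
    p = np.mean(sub_probs_a, axis=0)   # distribution summarizing sub-cluster a
    q = np.mean(sub_probs_b, axis=0)   # distribution summarizing sub-cluster b
    return js_divergence(p, q) > tau_split
```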
4.2.3. Cluster Merge to Increase Inter-Class Difference
Integrated Explanation: This phase handles the case where the current number of clusters is too high. This leads to some very similar clusters being separate, which hurts the separation term in the objective function. The goal is to identify and merge the most similar pair of clusters.
The algorithm first finds the pair of clusters that are most similar, i.e., the pair with the smallest JS divergence between them. As in the split phase, the decision to merge is based on whether the action decreases the objective value; Proposition 2 provides the condition under which merging two clusters into a new one is beneficial.

Substituting the JS divergence for the distance metric again yields a dynamic threshold, $\tau_{\text{merge}}$: the two most similar clusters are merged if their JS divergence is smaller than this threshold, which is given by the right-hand side of the inequality in Proposition 2.

When two clusters are merged, their weights in the final classification layer of the neural network are averaged to form the weight of the new, merged cluster.
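A matching sketch of the merge step, reusing js_divergence from the previous snippet. Averaging the classification-head weights of the merged pair follows the description above; the layer layout and helper signature are assumptions.

```python
import numpy as np

def merge_most_similar(cluster_dists, W, b, tau_merge):
    """Merge the two most similar clusters (smallest pairwise JS divergence)
    if their divergence falls below the merge threshold.

    cluster_dists: (K, C) per-cluster summary distributions.
    W, b: weights (K, d) and biases (K,) of the final classification layer."""
    K = cluster_dists.shape[0]
    best, pair = np.inf, None
    for i in range(K):
        for j in range(i + 1, K):
            d = js_divergence(cluster_dists[i], cluster_dists[j])
            if d < best:
                best, pair = d, (i, j)
    if pair is None or best >= tau_merge:
        return W, b                          # no merge this round
    i, j = pair
    w_new = 0.5 * (W[i] + W[j])              # average the merged clusters' weights
    b_new = 0.5 * (b[i] + b[j])
    keep = [k for k in range(K) if k not in pair]
    W = np.vstack([W[keep], w_new[None, :]])
    b = np.concatenate([b[keep], [b_new]])
    return W, b
```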
4.2.4. Overall Split-Merge Scheme
The complete process is detailed in Algorithm 1.
The following figure (Figure 1 from the original paper) provides a high-level illustration of the split and merge phases.

Integrated Explanation:
- Initialization: Start with a parametric clustering algorithm, an initial guess $K_0$ for the number of clusters, and the hyper-parameter $\lambda$.
- Training Loop: The process is iterative (see the code sketch after this list).
  a. Train Model: Train the clustering network with the host algorithm and the current number of clusters $K$ until the loss converges.
  b. Split Phase:
     i. For each existing cluster, tentatively split it into two sub-clusters.
     ii. Calculate the JS divergence between these two sub-clusters.
     iii. Calculate the current split threshold $\tau_{\text{split}}$ using Equation 6.
     iv. If the JS divergence is greater than $\tau_{\text{split}}$, the split is confirmed: the number of clusters is increased, and the network is retrained with the split loss (Equation 7) to enforce the separation. If a split happens, the merge phase for this iteration is skipped.
  c. Merge Phase:
     i. If no splits occurred, find the two most similar clusters (minimum pairwise JS divergence).
     ii. Calculate the current merge threshold $\tau_{\text{merge}}$ using Equation 12.
     iii. If the JS divergence of the most similar pair is less than $\tau_{\text{merge}}$, merge them, decrease $K$, and retrain the network.
- Convergence: The loop continues until no more splits or merges occur and the number of clusters becomes stable. The algorithm is guaranteed to converge because the objective value decreases with each valid split or merge, and the number of clusters is bounded (it cannot be less than 1 or more than the number of data points). The paper also proves that $K$ changes monotonically (it will not oscillate by splitting and then immediately merging).
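A high-level Python sketch of the overall loop, reusing should_split from the earlier snippet and using hypothetical callbacks for the host algorithm and the dynamic thresholds; it is a reading aid rather than a verbatim transcription of Algorithm 1.

```python
def split_merge_clustering(train_host, tentative_split, find_closest_pair,
                           merge_clusters, K0, tau_split_fn, tau_merge_fn,
                           max_rounds=50):
    """Greedy split-and-merge loop around a host parametric clusterer.
    All callbacks are hypothetical stand-ins for the host algorithm and for
    the paper's dynamic thresholds (its Eq. 6 and Eq. 12)."""
    K = K0
    for _ in range(max_rounds):
        train_host(K)                           # (a) train with the current K
        changed = False
        for k in range(K):                      # (b) split phase
            probs_a, probs_b = tentative_split(k)
            if should_split(probs_a, probs_b, tau_split_fn(K)):
                K += 1                          # confirm the split; the split loss
                changed = True                  #     is applied when retraining
                break
        if not changed:                         # (c) merge phase, only if no split
            (i, j), js = find_closest_pair()    # most similar pair + their JS value
            if js < tau_merge_fn(K):
                merge_clusters(i, j)            # e.g. average the two head weights
                K -= 1
                changed = True
        if not changed:
            break                               # K is stable: converged
    return K
```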
5. Experimental Setup
5.1. Datasets
The experiments were conducted on several standard image classification datasets used for unsupervised clustering evaluation:
- CIFAR-10: Contains 60,000 32x32 color images in 10 classes (e.g., airplane, dog, cat), with 6,000 images per class. The ground-truth $K$ is 10.
- CIFAR-100-20: A variant of CIFAR-100 in which the 100 fine-grained classes are grouped into 20 super-classes. The ground-truth $K$ is 20.
- STL-10: Consists of 13,000 96x96 color images in 10 classes. It has a smaller labeled training set but a large set of unlabeled images, making it suitable for unsupervised learning. The ground-truth $K$ is 10.
- ImageNet-10 & ImageNet-50: Subsets of the large-scale ImageNet dataset, containing 10 and 50 classes respectively. These are more challenging due to higher intra-class variation and inter-class similarity.
These datasets were chosen because they are widely recognized benchmarks in the field, allowing for direct comparison with previous state-of-the-art methods.
5.2. Evaluation Metrics
The paper uses three standard metrics to evaluate clustering performance. For each metric, a higher value indicates better performance.
5.2.1. Clustering Accuracy (ACC)
- Conceptual Definition: ACC measures the percentage of data points that are correctly assigned to their ground-truth classes. It finds the best one-to-one mapping between the predicted clusters and the ground-truth classes and then calculates the accuracy based on this optimal mapping.
- Mathematical Formula:
  $$\mathrm{ACC} = \max_{m} \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\{\, l_i = m(c_i) \,\}$$
- Symbol Explanation:
  - $N$: The total number of data points.
  - $l_i$: The ground-truth label of data point $i$.
  - $c_i$: The predicted cluster assignment of data point $i$.
  - $m(\cdot)$: A permutation function that maps each predicted cluster index to a ground-truth label index, found using an algorithm such as the Hungarian algorithm to maximize the overall accuracy.
  - $\mathbf{1}\{\cdot\}$: The indicator function, which is 1 if the condition inside is true and 0 otherwise.
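In practice, ACC is usually computed with the Hungarian algorithm; a short sketch using SciPy (not code from the paper):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    """ACC: find the best one-to-one mapping between predicted clusters and
    ground-truth classes (Hungarian algorithm), then compute plain accuracy."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    D = int(max(y_pred.max(), y_true.max())) + 1
    cost = np.zeros((D, D), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cost[p, t] += 1                          # co-occurrence counts
    rows, cols = linear_sum_assignment(-cost)    # maximize total matches
    return cost[rows, cols].sum() / y_true.size

print(clustering_accuracy([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0 (perfect up to relabeling)
```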
5.2.2. Normalized Mutual Information (NMI)
- Conceptual Definition: NMI measures the mutual dependence between the predicted cluster assignments and the ground-truth labels. It quantifies how much information the predicted clusters provide about the true labels, normalized to a scale of 0 (no mutual information) to 1 (perfect correlation). It is robust even if the number of predicted clusters differs from the number of true classes.
- Mathematical Formula (using the arithmetic-mean normalization; some works instead normalize by $\sqrt{H(C)\,H(L)}$):
  $$\mathrm{NMI}(C, L) = \frac{I(C, L)}{\tfrac{1}{2}\,\big[H(C) + H(L)\big]}$$
- Symbol Explanation:
  - $C$: The set of predicted cluster assignments.
  - $L$: The set of ground-truth labels.
  - $I(C, L)$: The mutual information between $C$ and $L$.
  - $H(C)$: The entropy of the cluster assignments $C$.
  - $H(L)$: The entropy of the ground-truth labels $L$.
5.2.3. Adjusted Rand Index (ARI)
- Conceptual Definition: ARI measures the similarity between two data clusterings (predicted vs. ground-truth) by considering all pairs of samples. It counts pairs that are assigned to the same or different clusters in both the predicted and true partitions, adjusted for chance. A value of 1 means the clusterings are identical, and a value close to 0 indicates random assignment.
- Mathematical Formula: ARI is calculated from the contingency table between the two partitions:
  $$\mathrm{ARI} = \frac{\sum_{ij} \binom{n_{ij}}{2} - \Big[\sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2}\Big] \big/ \binom{n}{2}}{\tfrac{1}{2}\Big[\sum_i \binom{a_i}{2} + \sum_j \binom{b_j}{2}\Big] - \Big[\sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2}\Big] \big/ \binom{n}{2}}$$
- Symbol Explanation:
  - $n_{ij}$: The number of objects in ground-truth class $i$ and predicted cluster $j$.
  - $a_i$: The number of objects in ground-truth class $i$ ($a_i = \sum_j n_{ij}$).
  - $b_j$: The number of objects in predicted cluster $j$ ($b_j = \sum_i n_{ij}$).
  - $n$: The total number of objects.
  - $\binom{n}{k}$: The binomial coefficient "n choose k".
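Both metrics are available in scikit-learn; a small usage sketch (note that the library's default NMI normalization may differ from the formula above):

```python
from sklearn.metrics import normalized_mutual_info_score, adjusted_rand_score

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [1, 1, 0, 0, 2, 2]   # label IDs are permuted but the grouping is identical

print(normalized_mutual_info_score(y_true, y_pred))  # 1.0
print(adjusted_rand_score(y_true, y_pred))           # 1.0
```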
5.3. Baselines
The proposed method (plugged into SCAN) is compared against a wide range of baselines:
- Traditional Methods: AC (Agglomerative Clustering), NMF (Non-negative Matrix Factorization).
- Early Deep Methods: AE, DAE (Denoising Autoencoder), VAE (Variational Autoencoder).
- Established Deep Clustering Methods: DEC, DAC, DCCM.
- State-of-the-Art Parametric Methods: SCAN, CC, MiCE, IDFD. These are the high-performance models that require a known $K$.
- Non-Parametric Methods: DBSCAN, moVB, DeepDPM. These are the main competitors for clustering without a known $K$.

These baselines are representative: they cover the historical evolution of clustering and include the best-performing methods in both the parametric and non-parametric categories.
6. Results & Analysis
6.1. Core Results Analysis
The authors demonstrate the effectiveness of their method by comparing it to numerous baselines on several datasets.
The following are the results from Table 1 of the original paper:
| Method | CIFAR-10 | | | CIFAR-100-20 | | | STL-10 | | | ImageNet-10 | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | NMI | ACC | ARI | NMI | ACC | ARI | NMI | ACC | ARI | NMI | ACC | ARI |
| AC (Gowda & Krishna, 1978) | 10.5 | 22.8 | 6.5 | 9.8 | 13.8 | 3.4 | 23.9 | 33.2 | 14.0 | 13.8 | 24.2 | 6.7 |
| NMF (Cai et al., 2009) | 8.1 | 19.0 | 3.4 | 7.9 | 11.8 | 2.6 | 9.6 | 18.0 | 4.6 | 13.2 | 23.0 | 6.5 |
| AE (Bengio et al., 2006) | 23.9 | 31.4 | 16.9 | 10.0 | 16.5 | 4.8 | 25.0 | 30.3 | 16.1 | 21.0 | 31.7 | 15.2 |
| DAE (Vincent et al., 2010) | 25.1 | 29.7 | 16.3 | 11.1 | 15.1 | 4.6 | 22.4 | 30.2 | 15.2 | 20.6 | 30.4 | 13.8 |
| DCGAN (Radford et al., 2015) | 26.5 | 31.5 | 17.6 | 12.0 | 15.1 | 4.5 | 21.0 | 29.8 | 13.9 | 22.5 | 34.6 | 15.7 |
| DeCNN (Zeiler et al., 2010) | 24.0 | 28.2 | 17.4 | 9.2 | 13.3 | 3.8 | 22.7 | 29.9 | 16.2 | 18.6 | 31.3 | 14.2 |
| VAE (Kingma & Welling, 2013) | 24.5 | 29.1 | 16.7 | 10.8 | 15.2 | 4.0 | 20.0 | 28.2 | 14.6 | 19.3 | 33.4 | 16.8 |
| JULE (Yang et al., 2016) | 19.2 | 27.2 | 13.8 | 10.3 | 13.7 | 3.3 | 18.2 | 27.7 | 16.4 | 17.5 | 30.0 | 13.8 |
| DEC (Xie et al., 2016) | 25.7 | 30.1 | 16.1 | 13.6 | 18.5 | 5.0 | 27.6 | 35.9 | 18.6 | 28.2 | 38.1 | 20.3 |
| DAC (Chang et al., 2017) | 39.6 | 52.2 | 30.6 | 18.5 | 23.8 | 8.8 | 36.6 | 47.0 | 25.7 | 39.4 | 52.7 | 30.2 |
| ADC (Haeusser et al., 2018) | - | 32.5 | - | - | 16.0 | - | - | 53.0 | - | - | - | - |
| DDC (Chang et al., 2019) | 42.4 | 52.4 | 32.9 | - | - | - | 37.1 | 48.9 | 26.7 | 43.3 | 57.7 | 34.5 |
| DCCM (Wu et al., 2019) | 49.6 | 62.3 | 40.8 | 28.5 | 32.7 | 17.3 | 37.6 | 48.2 | 26.2 | 60.8 | 71.0 | 55.5 |
| IIC (Ji et al., 2019) | 51.3 | 61.7 | 41.1 | - | 25.7 | - | 43.1 | 49.9 | 29.5 | - | - | - |
| MMDC (Shiran & Weinshall, 2021) | 57.2 | 70.0 | - | 25.9 | 31.2 | - | 49.8 | 61.1 | - | 71.9 | 81.1 | - |
| PICA (Huang et al., 2020) | 56.1 | 64.5 | 46.7 | 29.6 | 32.2 | 15.9 | - | - | - | 78.2 | 85.0 | 73.3 |
| DCCS (Zhao et al., 2020) | 56.9 | 65.6 | 46.9 | - | - | - | 37.6 | 48.2 | 26.2 | 60.8 | 71.0 | 55.5 |
| DHOG (Darlow & Storkey, 2020) | 58.5 | 66.6 | 49.2 | 25.8 | 26.1 | 11.8 | 41.3 | 48.3 | 27.2 | - | - | - |
| GATCluster (Niu et al., 2020) | 47.5 | 61.0 | 40.2 | 21.5 | 28.1 | 11.6 | 44.6 | 58.3 | 36.3 | 59.4 | 73.9 | 55.2 |
| IDFD (Tao et al., 2021) | 71.4 | 81.5 | 66.3 | 42.6 | 42.5 | 26.4 | 64.3 | 75.6 | 57.5 | 89.8 | 95.4 | 90.1 |
| CC (Li et al., 2021) | 70.5 | 79.0 | 63.7 | 43.1 | 42.9 | 26.6 | 76.4 | 85.0 | 72.6 | 85.9 | 89.3 | 82.2 |
| MiCE (Tsai et al., 2020) | 73.7 | 83.5 | 69.8 | 43.6 | 44.0 | 28.0 | 63.5 | 75.2 | 57.5 | - | - | - |
| SCAN (Van Gansbeke et al., 2020) | 71.2 | 81.8 | 66.5 | 44.1 | 42.2 | 26.7 | 65.4 | 75.5 | 59.0 | 86.2 | 92.0 | 83.3 |
| Ours (K0=3) | 71.9 | 82.4 | 67.5 | 45.2 | 43.8 | 28.1 | 64.3 | 74.5 | 57.6 | 88.6 | 91.2 | 87.1 |
| Ours (K0=20) | 71.1 | 81.6 | 66.2 | - | - | - | 65.1 | 74.7 | 58.9 | 86.9 | 91.8 | 84.7 |
| Ours (K0=30) | - | - | - | 44.9 | 43.1 | 27.8 | - | - | - | - | - | - |
Analysis of Table 1: The key takeaway is that the proposed method ("Ours") achieves performance on par with, and sometimes slightly better than, the top-performing parametric methods such as SCAN, CC, and MiCE. This is remarkable because all the other SOTA methods were given the true number of clusters, while the proposed method started from an incorrect $K_0$ (e.g., 3 or 20 for CIFAR-10, where the true $K$ is 10) and found the correct number automatically. This validates the central claim of the paper.
The following are the results from Table 3 of the original paper:
| Method | Init K | Inferred K | ACC | NMI | ARI |
|---|---|---|---|---|---|
| SCAN Van Gansbeke et al. (2020) | 50 | - | 73.7±1.7 | 79.7±0.6 | 61.8±1.3 |
| SCAN Van Gansbeke et al. (2020) | 10 | - | 19.4±0.1 | 60.6±0.4 | 23.0±0.3 |
| DBSCAN Ester et al. (1996) | - | 16.0 | 24.0±0.0 | 52.0±0.0 | 9.0±0.0 |
| moVB Hughes & Sudderth (2013) | - | 46.2±1.3 | 55.0±2.0 | 70.0±1.0 | 43.0±1.0 |
| DPM Sampler Dinari et al. (2019) | - | 72.0±2.6 | 57.0±1.0 | 72.0±0.0 | 43.0±1.0 |
| DeepDPM Ronen et al. (2022) | 10 | 55.3±1.5 | 66.0±1.0 | 77.0±0.0 | 54.0±1.0 |
| Ours | 10 | 50.6±1.7 | 73.3±1.3 | 80.1±0.7 | 62.3±1.5 |
Analysis of Table 3 (ImageNet-50): This table is particularly insightful.
- Parametric Method Sensitivity: SCAN with the correct $K = 50$ achieves 73.7% ACC, but when given $K = 10$ its accuracy plummets to 19.4%, showing extreme sensitivity to this hyperparameter.
- Non-Parametric Gap: The best non-parametric baseline, DeepDPM, only reaches 66.0% ACC, a significant drop from SCAN's best performance. It also overestimates $K$ (55.3 vs. 50).
- Bridging the Gap: The proposed method, starting from a poor guess of $K_0 = 10$, achieves 73.3% ACC, nearly matching SCAN's performance when SCAN knew the answer. Furthermore, it infers $K$ to be 50.6, extremely close to the ground truth of 50. This strongly supports the method's effectiveness on more complex datasets.
Inferred K Analysis: The following are the results from Table 2 of the original paper:
| Dataset | Initial K = 3 | Initial K = 20 | Initial K = 30 |
|---|---|---|---|
| CIFAR-10 | 10±0 | 10±0 | - |
| CIFAR-100-20 | 19.7±2.1 | - | 19.8±1.5 |
| STL-10 | 9.7±0.5 | 10.3±0.9 | - |
| ImageNet-10 | 10±0 | 10.3±0.5 | - |
Analysis of Table 2: This table shows the final number of clusters inferred by the model, starting from initial values both smaller and larger than the ground truth. In all cases, the inferred $K$ is very close to the true $K$, demonstrating that the split-and-merge framework is robust and can reliably converge to the correct number of clusters.
6.2. Ablation Studies / Parameter Analysis
The authors conduct several ablation studies to validate the design choices of their module.
6.2.1. Robustness to the Initial $K_0$
The following are the results from Table 5 of the original paper:
| K0 | 3 | 10 | 15 | 20 | 30 | 40 | 50 | 100 |
|---|---|---|---|---|---|---|---|---|
| SCAN | 28.6 | 82.2 | 58.8 | 45.9 | 33.4 | 25.9 | 21.9 | 11.9 |
| Ours | 82.4 | 82.4 | 82.9 | 81.6 | 82.0 | 77.5 | 70.9 | 72.4 |
| K Inferred | 10 | 10 | 10 | 10 | 10 | 11 | 12 | 12 |
Analysis of Table 5: This experiment on CIFAR-10 is a stress test of the initial number of clusters, $K_0$. The performance of the baseline SCAN degrades severely as $K_0$ moves away from the true value of 10. In contrast, the proposed method ("Ours") maintains a high accuracy (around 82%) over a very wide range of $K_0$ (from 3 to 30). Even with extreme initial values such as 100, it still achieves over 70% accuracy and infers a $K$ close to the true value. This demonstrates exceptional robustness.
6.2.2. Robustness to the Hyper-parameter $\lambda$
The following are the results from Table 6 of the original paper:
| λ | 1.5 | 1.8 | 2 | 2.2 | 2.50 |
|---|---|---|---|---|---|
| CIFAR-20 | 39.84 | 43.57 | 44.71 | 41.34 | 41.40 |
| STL-10 | 69.46 | 71.14 | 78.40 | 76.49 | 75.53 |
Analysis of Table 6: This table shows that performance is stable across a reasonable range of values for $\lambda$, the hyper-parameter balancing compactness and separation. The authors state they use the same $\lambda$ for all experiments, which avoids dataset-specific tuning and suggests the method is not overly sensitive to this choice.
6.2.3. Importance of Different Components
The following are the results from Table 7 of the original paper:
| | K0=3 | | | K0=10 | | | K0=20 | | |
|---|---|---|---|---|---|---|---|---|---|
| | ACC | NMI | ARI | ACC | NMI | ARI | ACC | NMI | ARI |
| No split/merge | 28.7 | 46.2 | 26.4 | 82.2 | 71.6 | 66.8 | 47.3 | 62.3 | 43.2 |
| No split | 28.9 | 46.1 | 26.4 | 75.4 | 68.9 | 62.1 | 81.2 | 70.9 | 65.6 |
| No merge | 79.7 | 71.3 | 66.9 | 82.1 | 71.5 | 66.8 | 47.1 | 61.7 | 42.6 |
| No split loss | 28.6 | 44.5 | 25.3 | 77.6 | 69.9 | 63.3 | 81.7 | 71.1 | 66.4 |
| Full method | 82.4 | 72.2 | 67.9 | 82.4 | 71.7 | 67.4 | 82.7 | 72.3 | 67.7 |
Analysis of Table 7: This table clearly shows that each component is crucial.
- When $K_0 = 3$ (less than the true $K$), removing the split phase ("No split") results in performance as bad as having no module at all (28.9% ACC). This confirms the split mechanism is essential for increasing $K$.
- When $K_0 = 20$ (greater than the true $K$), removing the merge phase ("No merge") leads to poor performance (47.1% ACC), as the model cannot reduce the inflated number of clusters.
- Removing the split loss ("No split loss") when $K_0 = 3$ also leads to failure (28.6% ACC). This shows that simply increasing $K$ is not enough; the loss is needed to actively push the new sub-clusters apart so they become meaningful.
- The full method consistently achieves the best results across all scenarios.
6.2.4. Visualization Results
The following figure (Figure 2 from the original paper) shows t-SNE visualizations of the learned features and outputs during training.
The figure is a schematic showing the training process and results for different numbers of clusters $K$. The top row shows t-SNE visualizations of the learned features, while the bottom row shows t-SNE visualizations of the outputs. As $K$ grows from 3 to 10, the ACC steadily improves, indicating better clustering.
Analysis of Figure 2: The visualizations provide an intuitive understanding of how the method works.
- Epoch 0 (k=3): At the beginning, with $K_0 = 3$, both the feature space (top row) and the output predictions (bottom row) show a messy, overlapping structure. The data is poorly separated into only three groups.
- Epoch 20 (k=5): As training progresses, the split mechanism has increased $K$ to 5. The clusters are becoming more defined and separated.
- Epoch 40 (k=8) & Epoch 80 (k=10): The number of clusters continues to increase until it converges at $K = 10$. In the final state, the feature space shows 10 distinct, well-separated clumps, and the output predictions are clean and correspond to these clusters. This visual evidence supports the claim that the module not only finds the right $K$ but also helps the network learn a more structured and separable feature representation.
7. Conclusion & Reflections
7.1. Conclusion Summary
The paper successfully tackles a major practical problem in deep clustering: the requirement to know the number of clusters beforehand. The authors propose a novel and effective plug-and-play module based on a split-and-merge framework. This module can be integrated into existing state-of-the-art parametric clustering algorithms, enabling them to automatically adjust $K$ during training. The decisions to split or merge are guided by a principled objective function and measured using JS divergence. Extensive experiments show that this approach allows parametric models to achieve their high performance without prior knowledge of $K$, effectively bridging the performance gap between powerful parametric methods and convenient non-parametric ones.
7.2. Limitations & Future Work
- Limitations:
- Increased Computational Cost: The paper mentions that the running time increases (e.g., 3x for one setting). The iterative process of training, checking for splits/merges, and retraining adds overhead compared to a single run of a parametric model. It is still far more efficient, however, than running the model multiple times with different values of $K$ in a grid search.
- Hyper-parameter $\lambda$: While shown to be robust, the method still relies on a pre-defined hyper-parameter $\lambda$ to balance the two terms of the objective function. An ideal system might learn this balance automatically.
- Greedy Approach: The split-and-merge process is greedy, making one change at a time. While effective, this may not always find the global optimum of the objective function.
- Future Work:
- The authors did not explicitly state future work, but a natural extension would be to explore other similarity metrics besides JS divergence to see if they offer better performance or stability.
- Applying the module to other domains beyond computer vision, such as natural language processing or time-series analysis, would be a valuable next step.
- Developing a more adaptive mechanism for setting $\lambda$, or removing it entirely, would further improve the method's autonomy.
7.3. Personal Insights & Critique
- Practicality and Impact: The plug-and-play approach is a significant strength. Instead of asking practitioners to abandon their favorite high-performing models for a new non-parametric one, this work provides a tool to enhance the models they already use. This makes the research highly practical and more likely to be adopted.
- Theoretical Grounding: The derivation of split/merge thresholds from a single, clear objective function provides a solid theoretical foundation for the method. It moves beyond simple heuristics and offers a clear explanation for why the algorithm behaves as it does.
- Critique:
  - The process of splitting a cluster involves training additional 2-class classifiers. The exact architecture and training of these "assistant" networks could be detailed more clearly; how they are integrated and trained adds complexity that could be a potential failure point.
  - The paper shows convergence theoretically, but in practice deep learning optimization landscapes are non-convex. The stability of the split/merge decisions could be sensitive to the quality of the learned features at any given stage. If the model is in a poor local minimum, the JS divergence estimates might not be meaningful, potentially leading to suboptimal splits or merges.
  - The claim that $K$ changes monotonically is interesting. This prevents oscillations, which aids convergence, but one can imagine a scenario where a "wrong" split is made and the ideal recovery would be to merge those same clusters back; a strictly monotonic change might prevent recovery from such a mistake.
Overall, this paper presents a clever, well-executed, and highly useful contribution to the field of unsupervised learning. It addresses a real-world pain point with an elegant and effective solution.