A Split-Merge DP-means Algorithm to Avoid Local Minima
TL;DR Summary
This paper extends DP-means with a split-merge technique to avoid local minima by dynamically splitting dense clusters, enabling near-optimal cluster counts with improved robustness demonstrated on multiple datasets.
Abstract
We present an extension of the DP-means algorithm, a hard-clustering approximation of nonparametric Bayesian models. Although a recent work [6] reports that the DP-means can converge to a local minimum, the condition for the DP-means to converge to a local minimum is still unknown. This paper demonstrates one reason the DP-means converges to a local minimum: the DP-means cannot assign the optimal number of clusters when many data points exist within small distances. As a first attempt to avoid the local minimum, we propose an extension of the DP-means by the split-merge technique. The proposed algorithm splits clusters when a cluster has many data points to assign the number of clusters near to optimal. The experimental results with multiple datasets show the robustness of the proposed algorithm.
English Analysis
1. Bibliographic Information
- Title: A Split-Merge DP-means Algorithm to Avoid Local Minima
- Authors: Shigeyuki Odashima, Miwa Ueki, and Naoyuki Sawasaki. All authors are affiliated with Fujitsu Laboratories Ltd., Kawasaki, Japan.
- Journal/Conference: The paper does not explicitly name the publication venue. Based on its structure and style, it is likely a conference paper in the field of data mining or machine learning.
- Publication Year: The paper does not specify a publication year, but citations go up to 2017, suggesting it was published around or after that time.
- Abstract: The paper presents an extension to the `DP-means` algorithm, a hard-clustering method derived from nonparametric Bayesian models. While a prior study [6] noted that `DP-means` can get trapped in local minima, the reasons were unknown. This paper identifies a key cause: `DP-means` fails to find the optimal number of clusters when many data points are concentrated in a small area. To address this, the authors propose a `split-merge` extension. The new algorithm splits clusters that become too dense, helping to find a near-optimal number of clusters. Experiments on multiple datasets demonstrate the proposed algorithm's improved robustness and ability to find better solutions.
- Original Source Link: /files/papers/68f8705e81e5ec727a02b076/paper.pdf. The paper appears to be formally published, given its polished format and citations.
2. Executive Summary
- Background & Motivation (Why):
  - The core problem is that `DP-means`, a popular algorithm for automatically determining the number of clusters in a dataset, can fail by converging to a local minimum, i.e., a suboptimal clustering solution with fewer clusters than is ideal.
  - This problem is significant because `DP-means` is valued for its efficiency and its ability to adapt model complexity (the number of clusters) to the data, a feature inherited from nonparametric Bayesian models. A previous paper [6] observed this failure but did not explain why it happens; this paper fills that gap.
  - The paper's fresh angle is a theoretical analysis identifying a specific condition for this failure: `DP-means` gets stuck when a large number of data points lie within a small spatial distance. It cannot create new clusters in such dense regions, even if doing so would lead to a better overall solution.
- Main Contributions / Findings (What): The paper makes three primary contributions:
  - Analysis of a Local Minimum Condition: It formally demonstrates that `DP-means` can converge to a solution with a single cluster in a dense region, even when a solution with multiple clusters would have a lower (better) cost.
  - A Novel Split-Merge DP-means Algorithm: It proposes an extension of `DP-means` that incorporates `split` and `merge` operations. The algorithm splits a cluster when it becomes too dense (contains too many points within its range) and merges clusters that were over-split, allowing it to escape the identified local minima.
  - Empirical Validation: It provides experimental results on four real-world datasets, showing that the proposed `split-merge DP-means` (`SMD`) algorithm consistently finds clustering solutions with lower cost than standard `DP-means` variants.
3. Prerequisite Knowledge & Related Work
- Foundational Concepts:
  - Clustering: The unsupervised machine learning task of grouping a set of objects such that objects in the same group (called a cluster) are more similar to each other than to those in other groups.
  - k-means Clustering: A classic "hard-clustering" algorithm where each data point belongs to exactly one cluster. It aims to partition the data points into $k$ clusters by minimizing the sum of squared distances from each point to its assigned cluster's center (centroid). Its main drawback is that the number of clusters, $k$, must be specified in advance, and it is sensitive to the initial placement of centroids, often getting stuck in local minima.
  - Nonparametric Bayesian Models: Statistical models where the complexity of the model (e.g., the number of parameters or components) is not fixed in advance but is determined from the data. For clustering, this means the model can infer the number of clusters automatically. The Dirichlet Process (DP) is a common example used for this purpose. These models are flexible but often computationally expensive.
  - DP-means (Dirichlet Process Means): An algorithm that acts as a computationally efficient, hard-clustering approximation of a Dirichlet Process Mixture Model. It was derived using a technique called Small-Variance Asymptotics (SVA), which simplifies the complex probabilistic Bayesian model into a deterministic algorithm similar to k-means. Unlike k-means, `DP-means` can automatically determine the number of clusters by adding a penalty term for each new cluster. A new cluster is created if a data point is farther than $\lambda$ from all existing cluster centroids.
  - Split-Merge Techniques: A class of methods used to improve clustering algorithms. To escape local minima, a `split` operation divides one cluster into two (or more), while a `merge` operation combines two (or more) clusters into one. These are common in MCMC-based Bayesian inference but less so for SVA-based methods like `DP-means`.
- Previous Works: The paper positions itself relative to three areas:
  - DP-means and extensions: Many extensions to `DP-means` exist for different applications (e.g., topic models, time-series data, subspace clustering) and for efficiency (distributed, online, coreset-based). However, the authors state that none of these have analyzed the local minimum problem. The work by Bachem et al. [6] is a key reference: it first reported that `DP-means` can get stuck but did not explain the cause.
  - Hard-clustering algorithms: `k-means` is the most famous example. Research in this area, such as `k-means++`, has focused on better initialization to avoid local minima. This paper's analysis of `DP-means` local minima could inform similar improvements for this class of algorithms.
  - Split-merge clustering algorithms: These techniques have been applied to both hard clustering and, more recently, to fully Bayesian models optimized with MCMC or variational inference. This paper is the first to apply a split-merge approach to a hard-clustering method derived from small-variance asymptotics.
- Differentiation: The key innovation is twofold: (1) it is the first work to provide a theoretical analysis explaining why `DP-means` gets trapped in local minima, linking the failure to data density, and (2) it is the first to propose a `split-merge` algorithm specifically designed to solve this problem for `DP-means` and related SVA-based methods.
4. Methodology (Core Technology & Implementation)
The paper's methodology is divided into two parts: first, an analysis of why the standard `DP-means` algorithm fails, and second, the proposal of the `split-merge` solution.
Analysis of Existing DP-means Approaches
The `DP-means` algorithm aims to minimize the following cost function:

$$\min_{\{C_k\},\,\{\mu_k\}} \; \sum_{k=1}^{K} \sum_{x \in C_k} \lVert x - \mu_k \rVert^2 \;+\; \lambda^2 K$$

- $x \in C_k$ ranges over the set of data points, partitioned into clusters $C_k$.
- $\mu_k$ is the centroid of cluster $C_k$.
- $\lVert x - \mu_k \rVert^2$ is the squared Euclidean distance between a data point and a centroid.
- $\lambda^2$ is a hyperparameter that penalizes the creation of new clusters.
- $K$ is the number of clusters.

The first term is the standard k-means objective (quantization error), while the second term is a penalty that controls the number of clusters. `DP-means` works like k-means but adds a new cluster centered at a data point if its squared distance to the nearest existing cluster centroid exceeds the threshold (i.e., $\min_k \lVert x - \mu_k \rVert^2 > \lambda^2$).
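To make this concrete, here is a minimal sketch of one online pass with this assignment rule; the function and variable names are illustrative rather than taken from the paper, and the squared-distance threshold is written as `lam2` ($\lambda^2$).

```python
import numpy as np

def online_dpmeans(points, lam2):
    """One-pass online DP-means sketch.

    points: (N, D) array of data points; lam2: squared-distance threshold (lambda^2).
    Returns the centroids and the per-cluster point counts.
    """
    centroids, counts = [], []
    for x in points:
        if not centroids:
            centroids.append(x.astype(float))
            counts.append(1)
            continue
        d2 = [np.sum((x - mu) ** 2) for mu in centroids]
        k = int(np.argmin(d2))
        if d2[k] > lam2:
            # Farther than lambda from every existing centroid: open a new cluster.
            centroids.append(x.astype(float))
            counts.append(1)
        else:
            # Assign to the nearest cluster and move its running mean.
            counts[k] += 1
            centroids[k] += (x - centroids[k]) / counts[k]
    return np.array(centroids), np.array(counts)
```

For example, `online_dpmeans(np.random.randn(500, 2), lam2=4.0)` returns the centroids found in a single pass; as the paper emphasizes, the result depends on the data order and can be a local minimum.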
The authors introduce a simplified scenario to analyze the algorithm's behavior.
Definition 1 (easy case for `DP-means` clustering): Data is in the "easy case" if the maximum squared Euclidean distance between any two data points is less than $\lambda^2$. This means all data points are contained within a hypersphere of radius $\lambda$.
Figure 2: Illustration of DP-means clustering results under different data distributions. (a) The easy case: sample points uniformly distributed within a circle of radius λ. (b) Samples highly concentrated at two points, with λ = 10. (c) A comparison of clustering results for different sample sizes, 20 points and 2,000 points.
- Image 2 Analysis: This figure illustrates the core problem. (a) shows the "easy case," where all points are close. (b) presents a specific example where two groups of points are close to each other but distinct; `DP-means` would group them into one cluster. (c) shows that as the number of points in these two groups increases, our intuition (and the underlying Bayesian model) suggests they should be two separate clusters, but `DP-means` fails to do this.

Based on this, the authors present two lemmas leading to their main theorem on local minima.
- Lemma 1: In the "easy case," both batch and online `DP-means` will always converge to a single cluster whose centroid is the mean of all data points.
  - Reasoning: Since all points lie within distance $\lambda$ of each other, the new-cluster condition ($\min_k \lVert x - \mu_k \rVert^2 > \lambda^2$) is never met after the first cluster is created; all subsequent points are simply assigned to this single cluster.
- Lemma 2: In the "easy case," there can exist a situation where a solution with two clusters has a lower `DP-means` cost than the one-cluster solution.
  - Proof by Example: Consider 1000 points at (-1, 0) and 1000 points at (1, 0), with $\lambda^2 = 100$. This is an "easy case" since the maximum squared distance is 4, which is less than 100. (A numeric check of the two costs follows below.)
    - Cost with 1 cluster: The centroid is at (0, 0). The cost is $2000 \times 1 + \lambda^2 \times 1 = 2100$.
    - Cost with 2 clusters: The centroids are at (-1, 0) and (1, 0). The cost is $0 + \lambda^2 \times 2 = 200$.
    - Since $200 < 2100$, the two-cluster solution is optimal.
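The arithmetic above can be checked directly; this small script recomputes the two costs under the cost form used in this summary (quantization error plus $\lambda^2$ per cluster), with illustrative variable names.

```python
import numpy as np

lam2 = 100.0
X = np.vstack([np.tile([-1.0, 0.0], (1000, 1)),   # 1000 points at (-1, 0)
               np.tile([ 1.0, 0.0], (1000, 1))])  # 1000 points at ( 1, 0)

# One cluster at the overall mean (0, 0): every point sits at squared distance 1.
cost_one = np.sum((X - X.mean(axis=0)) ** 2) + lam2 * 1   # 2000 + 100 = 2100

# Two clusters at (-1, 0) and (1, 0): zero quantization error, two cluster penalties.
cost_two = 0.0 + lam2 * 2                                  # 200
print(cost_one, cost_two)
```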
- Theorem 1: In the "easy case," `DP-means` can converge to a local minimum with fewer than the optimal number of clusters.
  - Proof: Lemma 1 shows that `DP-means` finds one cluster. Lemma 2 shows that the optimal solution can be two (or more) clusters. Therefore, `DP-means` gets stuck in the suboptimal one-cluster solution. The fundamental reason is that the algorithm does not account for the number of data points (density) when deciding whether to keep a region as a single cluster.
Split-Merge DP-means
To fix this, the authors propose an algorithm that explicitly considers cluster density to decide when to split.
3.1 Condition for Splitting One Cluster into Two
The authors derive a condition for when splitting a cluster is beneficial. They make a simplifying assumption that the data points within a cluster are uniformly distributed.
Consider a cluster with $n$ data points and a per-dimension extent (range) $r_d$ in each dimension $d$.

- Expected cost with one cluster: Under the uniform assumption, the expected quantization cost follows from the variance of a uniform distribution, giving roughly $\frac{n}{12}\sum_d r_d^2 + \lambda^2$.
- Expected cost with two clusters (split along dimension $d^*$): If the cluster is split in half along dimension $d^*$, the two new clusters each have a range of $r_{d^*}/2$ in that dimension, so the expected cost becomes roughly $\frac{n}{12}\left(\sum_d r_d^2 - \tfrac{3}{4} r_{d^*}^2\right) + 2\lambda^2$.
- Splitting Condition: A split is beneficial if the two-cluster cost is lower than the one-cluster cost, which under these assumptions simplifies to $n\, r_{d^*}^2 > 16\lambda^2$.
- Interpretation: A cluster should be split if it has many data points ($n$ is large) or if it is wide ($r_{d^*}$ is large). The algorithm splits along the dimension with the maximum range, as that dimension satisfies the condition first. (A minimal check of this rule is sketched after this list.)
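A minimal sketch of the split test, assuming the uniform-distribution cost model and the $\lambda^2$-per-cluster penalty used above (the 1/16 constant comes from that derivation, not verbatim from the paper):

```python
def should_split(n_points, ranges, lam2):
    """Return (split?, widest dimension) for the uniform-model split test sketched above."""
    d_star = max(range(len(ranges)), key=lambda d: ranges[d])  # dimension with the largest extent
    saving = n_points * ranges[d_star] ** 2 / 16.0             # expected cost saved by halving it
    return saving > lam2, d_star

# Example: a wide, heavily populated cluster should be split.
print(should_split(2000, [2.0, 0.0], lam2=100.0))   # (True, 0)
```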
3.2 Split DP-means
`Algorithm 3` is an online algorithm that incorporates this splitting rule. It maintains not just the centroid and point count for each cluster, but also its per-dimension range and min/max boundaries. When a new data point arrives (a sketch of one such update follows this list):

- It is assigned to the nearest cluster.
- The cluster's statistics (centroid, point count, per-dimension min/max, and range) are updated.
- The algorithm checks whether the split condition (Eq. 4) is now met for the updated cluster. If so, the cluster is split into two new clusters according to predefined rules (Eq. 6) that approximate the centroids and statistics of the two new sub-clusters.
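Below is a rough sketch of one online update with the split check folded in. The data structures and the way the two sub-clusters are initialized (halving the widest dimension and shifting the centroids) are simplified stand-ins for the paper's Eq. 6, and the split threshold uses the form derived in Section 3.1 above.

```python
import numpy as np

class Cluster:
    def __init__(self, x):
        self.n = 1
        self.mean = x.astype(float)
        self.lo = x.astype(float)          # per-dimension minimum seen so far
        self.hi = x.astype(float)          # per-dimension maximum seen so far

    def add(self, x):
        self.n += 1
        self.mean += (x - self.mean) / self.n
        self.lo = np.minimum(self.lo, x)
        self.hi = np.maximum(self.hi, x)

def split_dpmeans_update(clusters, x, lam2):
    """One online Split DP-means step: assign, update statistics, then try to split."""
    if not clusters:
        clusters.append(Cluster(x))
        return
    d2 = [np.sum((x - c.mean) ** 2) for c in clusters]
    k = int(np.argmin(d2))
    if d2[k] > lam2:                       # standard DP-means rule: open a new cluster
        clusters.append(Cluster(x))
        return
    c = clusters[k]
    c.add(x)
    r = c.hi - c.lo
    d = int(np.argmax(r))                  # widest dimension
    if c.n * r[d] ** 2 / 16.0 > lam2:      # split condition (assumed form from Section 3.1)
        mid = 0.5 * (c.lo[d] + c.hi[d])
        halves = []
        for lo_d, hi_d, shift in ((c.lo[d], mid, -r[d] / 4), (mid, c.hi[d], r[d] / 4)):
            sub = Cluster(c.mean.copy())
            sub.n = c.n // 2               # counts shared evenly (simplification)
            sub.mean[d] += shift           # approximate sub-cluster centroid
            sub.lo, sub.hi = c.lo.copy(), c.hi.copy()
            sub.lo[d], sub.hi[d] = lo_d, hi_d
            halves.append(sub)
        clusters[k:k + 1] = halves         # replace the dense cluster with its two halves
```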
3.3 Merge DP-means
`Split DP-means` can be too aggressive and create too many clusters. The merge step (`Algorithm 4`) corrects this.
- Merging Condition: A condition is derived for when merging two clusters $i$ and $j$ into one cluster $c$ reduces the total `DP-means` cost. The change in cost from merging is
  $$\Delta \mathcal{L} = n_i \lVert \mu_i - \mu_c \rVert^2 + n_j \lVert \mu_j - \mu_c \rVert^2 - \lambda^2,$$
  where:
  - $n_i$ and $n_j$ are the numbers of points in each cluster.
  - $\lVert \mu_i - \mu_c \rVert^2$ and $\lVert \mu_j - \mu_c \rVert^2$ are the squared distances from the original cluster centroids ($\mu_i$, $\mu_j$) to the new merged centroid $\mu_c = (n_i \mu_i + n_j \mu_j)/(n_i + n_j)$.
  - A merge is performed if $\Delta \mathcal{L} < 0$.
- Algorithm: The `Merge DP-means` algorithm takes the clusters produced by the `Split DP-means` algorithm. It greedily and iteratively merges the pair of clusters that provides the largest cost reduction (most negative $\Delta \mathcal{L}$) until no more beneficial merges are found; a sketch of this greedy loop follows the figure discussion below.

Figure 3: Example results of the split-merge DP-means algorithm on synthetic two-dimensional data, showing the input data points, the Split DP-means result, and the final split-merge DP-means clustering, and highlighting how the clustering changes as the number of data points grows.
- Image 3 Analysis: This figure provides a visual example of the full `split-merge` process. The initial data has five natural groups. (a) Standard `DP-means` might find only two clusters because the groups are close. (b) The `Split DP-means` algorithm is run; as more data arrives, the dense clusters on the left and right satisfy the split condition and are broken down into smaller clusters, resulting in more clusters than necessary. (c) Finally, the `Merge DP-means` algorithm cleans up the result by merging the over-split clusters, leading to a final solution that correctly identifies the five underlying groups.
5. Experimental Setup
- Datasets: Four real-world datasets with varying sizes and dimensions were used:
- USGS: 59,209 earthquake locations (3 dimensions).
- MNIST: 70,000 handwritten digit images, reduced to 10 dimensions via PCA.
- KDD2004BIO: 145,751 protein sequence features (74 dimensions).
- SUN SCENES 397: 198,500 image GIST features (512 dimensions).
- Evaluation Metrics:
  - DP-means Cost: The primary metric for solution quality (a small helper that evaluates it is sketched after this list).
    - Conceptual Definition: It measures the overall quality of the clustering by balancing the compactness of clusters (quantization error) against the number of clusters (model complexity). A lower cost is better.
    - Mathematical Formula:
      $$\mathcal{L}(\mathcal{X}, \mathcal{M}) = \sum_{x \in \mathcal{X}} \min_{\mu \in \mathcal{M}} \lVert x - \mu \rVert^2 + \lambda^2 K$$
    - Symbol Explanation:
      - $\mathcal{X}$: the set of all data points.
      - $\mathcal{M}$: the set of all cluster centroids.
      - $x$: an individual data point.
      - $\mu$: an individual cluster centroid.
      - $K$: the total number of clusters.
      - $\lambda^2$: the penalty parameter for creating a new cluster.
  - Computation Time: Measured in seconds, to evaluate the efficiency of the algorithms.
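For reference, a minimal helper that evaluates this cost for a given solution, under the $\lambda^2 K$ penalty form written above (names are illustrative):

```python
import numpy as np

def dpmeans_cost(X, centroids, lam2):
    """Total squared distance of each point to its nearest centroid, plus lam2 per cluster."""
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)   # (N, K) squared distances
    return d2.min(axis=1).sum() + lam2 * centroids.shape[0]
```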
- Baselines: The proposed algorithms were compared against two standard `DP-means` implementations:
  - BD: Batch DP-means (Algorithm 1).
  - OD: Online DP-means (Algorithm 2).
- Proposed Methods:
  - SD: Split DP-means (Algorithm 3).
  - SMD: Split-merge DP-means (Algorithms 3 and 4 combined).
6. Results & Analysis
The experiments were run five times with different data orders to measure robustness.
Core Results (Tables 1-4)
The results across all four datasets consistently show that the proposed `SMD` algorithm achieves the lowest `DP-means` cost.
- Manual Transcription of Table 1: DP-means cost and runtime comparison for USGS data.

| λ² | BD cost (×10²) | OD cost (×10²) | SD cost (×10²) | SMD cost (×10²) | BD time (s) | OD time (s) | SD time (s) | SMD time (s) |
|---|---|---|---|---|---|---|---|---|
| 0.1 | 4.68 ± 0.26 | 7.06 ± 1.13 | 1.54 ± 0.03 | 1.20 ± 0.01 | 2217.9 | 45.9 | 525.6 | 604.4 |
| 0.32 | 18.0 ± 4.31 | 24.8 ± 3.35 | 3.39 ± 0.15 | 2.50 ± 0.06 | 507.1 | 17.2 | 404.1 | 446.6 |
| 1.0 | 64.1 ± 3.15 | 80.7 ± 23.6 | 7.02 ± 0.40 | 5.07 ± 0.20 | 119.8 | 6.3 | 308.8 | 328.5 |
| 3.2 | 460 ± 0 | 422 ± 105 | 14.5 ± 0.22 | 10.2 ± 0.22 | 1.7 | 2.1 | 234.3 | 242.5 |

- Manual Transcription of Table 2: DP-means cost and runtime comparison for MNIST data.

| λ² | BD cost (×10⁵) | OD cost (×10⁵) | SD cost (×10⁵) | SMD cost (×10⁵) | BD time (s) | OD time (s) | SD time (s) | SMD time (s) |
|---|---|---|---|---|---|---|---|---|
| 8.0 × 10⁰ | 1.31 ± 0.01 | 1.73 ± 0.04 | 1.14 ± 0.01 | 1.12 ± 0.00 | 58255.7 | 134.5 | 1739.0 | 2379.1 |
| 4.0 × 10¹ | 7.00 ± 0 | 6.91 ± 0.26 | 1.86 ± 0.07 | 1.71 ± 0.04 | 2.1 | 2.6 | 1006.0 | 1180.3 |
| 2.0 × 10² | 7.00 ± 0 | 7.00 ± 0 | 2.78 ± 0.13 | 2.50 ± 0.07 | 2.1 | 2.4 | 440.2 | 458.4 |
| 1.0 × 10³ | 7.01 ± 0 | 7.01 ± 0 | 3.96 ± 0.15 | 3.60 ± 0.12 | 2.1 | 2.4 | 154.4 | 155.7 |
- Analysis of Tables 1 & 2:
  - In many cases, especially with larger λ² values (e.g., MNIST with λ² ≥ 4.0 × 10¹), the Batch DP-means (`BD`) has a cost variance of ± 0. This indicates it gets stuck in the exact same local minimum regardless of data order, confirming the paper's central hypothesis.
  - The `SMD` algorithm consistently achieves a significantly lower cost, demonstrating its ability to escape these local minima.
  - The `SD` algorithm alone often improves over `BD` and `OD`, but its cost is higher than `SMD`'s, showing that the merge step is crucial for refining the solution.
  - Computation time for `SD` and `SMD` is often higher than for `OD` and sometimes `BD`. This is expected, as `SD` tends to create many intermediate clusters, increasing the complexity of the nearest-neighbor search.
  - (Tables 3 and 4 show similar trends and are transcribed below for completeness.)
- Manual Transcription of Table 3: DP-means cost and runtime comparison for KDD2004BIO data.

| λ² | BD cost (×10¹¹) | OD cost (×10¹¹) | SD cost (×10¹¹) | SMD cost (×10¹¹) | BD time (s) | OD time (s) | SD time (s) | SMD time (s) |
|---|---|---|---|---|---|---|---|---|
| 3.2 × 10⁷ | 1.91 ± 0.01 | 3.37 ± 0.13 | 1.57 ± 0.05 | 1.45 ± 0.03 | 30349.0 | 49.6 | 2340.9 | 2522.5 |
| 1.0 × 10⁸ | 2.75 ± 0.00 | 4.68 ± 0.03 | 2.12 ± 0.12 | 1.86 ± 0.06 | 10810.1 | 27.1 | 1543.8 | 1593.9 |
| 3.2 × 10⁸ | 3.19 ± 0.04 | 5.69 ± 0.34 | 2.89 ± 0.19 | 2.45 ± 0.09 | 7723.6 | 20.2 | 948.9 | 960.8 |
| 1.0 × 10⁹ | 7.33 ± 0.01 | 9.27 ± 3.62 | 4.49 ± 0.75 | 3.59 ± 0.43 | 255.5 | 10.2 | 661.3 | 665.5 |

- Manual Transcription of Table 4: DP-means cost and runtime comparison for SUN SCENES 397 data.

| λ² | BD cost (×10⁴) | OD cost (×10⁴) | SD cost (×10⁴) | SMD cost (×10⁴) | BD time (s) | OD time (s) | SD time (s) | SMD time (s) |
|---|---|---|---|---|---|---|---|---|
| 0.250 | 1.15 ± 0.01 | 1.37 ± 0.01 | 1.13 ± 0.00 | 1.13 ± 0.00 | 279623.7 | 605.5 | 5084.7 | 5268.8 |
| 0.354 | 1.25 ± 0.01 | 1.47 ± 0.02 | 1.17 ± 0.00 | 1.17 ± 0.00 | 101439.4 | 173.3 | 4223.2 | 4346.9 |
| 0.500 | 1.41 ± 0.02 | 1.56 ± 0.04 | 1.21 ± 0.00 | 1.21 ± 0.00 | 12746.6 | 58.1 | 3518.2 | 3593.1 |
| 0.707 | 1.79 ± 0 | 1.81 ± 0.01 | 1.24 ± 0.01 | 1.24 ± 0.01 | 421.6 | 23.1 | 2966.5 | 3013.5 |
Number of Clusters (Table 5)
This table compares the number of clusters found by each method to the "optimal" number reported in a previous study [6], which used an exhaustive grid search.
- Manual Transcription of Table 5: Numbers of clusters for each dataset in original data order.

| Dataset | BD | OD | SD | SMD | Optimal [6] |
|---|---|---|---|---|---|
| USGS (λ² = 1.0) | 8 | 6 | 529 | 312 | 156 |
| MNIST (λ² = 1.0 × 10³) | 1 | 1 | 140 | 87 | 65 |
| KDD2004BIO (λ² = 1.0 × 10⁹) | - | 3 | 202 | 122 | 55 |
- Analysis: `BD` and `OD` find drastically fewer clusters than the optimal number, confirming that they get stuck in local minima with too few clusters; for MNIST, both find only 1 cluster when the optimum is 65. `SD` (Split DP-means) overcompensates, creating far more clusters than optimal (e.g., 529 for USGS versus an optimum of 156). `SMD` (Split-merge DP-means) successfully corrects this over-splitting: the number of clusters it finds (e.g., 312 for USGS, 87 for MNIST) is much closer to the optimal number than any other `DP-means` variant, though still an overestimation. This demonstrates the synergy of the split and merge steps: splitting escapes the local minimum, and merging refines the solution toward the correct complexity.
7. Conclusion & Reflections
- Conclusion Summary: The paper successfully diagnoses a key failure mode of the `DP-means` algorithm: its inability to create new clusters in dense regions of data, which leads to convergence in local minima with a suboptimal number of clusters. The authors provide a clear theoretical analysis of this phenomenon and then propose a novel `split-merge DP-means` algorithm that directly addresses the issue by splitting dense clusters and merging over-split ones. Empirical results convincingly show that the new algorithm finds better clustering solutions (lower cost) than standard `DP-means` across a variety of real-world datasets.
- Limitations & Future Work: The authors acknowledge three main limitations in their proposed algorithm:
- The splitting and merging criteria are derived assuming a uniform distribution within clusters, which may not be accurate for real-world data.
- The online update rules lose detailed information about the data point distribution within a cluster, as only summary statistics (mean, range) are kept.
- The split operation is restricted to a single dimension (the one with the maximum range), which may not be optimal for clusters with complex, multi-dimensional shapes.
- Future work aims to create more exact versions of the algorithm that relax these simplifying approximations.
- Personal Insights & Critique:
  - Strengths: The paper's main strength is its clear, intuitive, and elegant analysis of the `DP-means` local minimum problem. The "easy case" scenario is a powerful tool for demonstrating the core issue. The proposed solution is practical and directly motivated by the analysis; it moves beyond identifying a problem to offering a constructive algorithm to solve it.
  - Weaknesses: The reliance on the uniform distribution assumption is a significant simplification; real-world clusters are rarely hyper-rectangles. A Gaussian assumption, as hinted by the authors, would be more realistic but also more complex to implement. The greedy nature of the merge step also does not guarantee finding the global optimum, but it is a standard and practical heuristic.
  - Overall Impact: This work is a valuable contribution to the literature on `DP-means` and hard clustering. It provides a deeper understanding of the algorithm's failure modes and offers a robust method to mitigate them. The insights could inspire similar split-merge or density-aware improvements for other clustering algorithms derived from small-variance asymptotics. The trade-off between higher computational cost and better solution quality is clearly highlighted, providing practical guidance for users.