Revisiting k-means: New Algorithms via Bayesian Nonparametrics
TL;DR Summary
This paper revisits k-means from a Bayesian nonparametric viewpoint, proposing DP-means, which determines the number of clusters automatically while optimizing a k-means-like penalized objective. Extensions include hierarchical clustering across multiple datasets, a spectral relaxation based on thresholded eigenvectors, and a normalized-cut graph clustering method that does not fix the number of clusters.
Abstract
Bayesian models offer great flexibility for clustering applications---Bayesian nonparametrics can be used for modeling infinite mixtures, and hierarchical Bayesian models can be utilized for sharing clusters across multiple data sets. For the most part, such flexibility is lacking in classical clustering methods such as k-means. In this paper, we revisit the k-means clustering algorithm from a Bayesian nonparametric viewpoint. Inspired by the asymptotic connection between k-means and mixtures of Gaussians, we show that a Gibbs sampling algorithm for the Dirichlet process mixture approaches a hard clustering algorithm in the limit, and further that the resulting algorithm monotonically minimizes an elegant underlying k-means-like clustering objective that includes a penalty for the number of clusters. We generalize this analysis to the case of clustering multiple data sets through a similar asymptotic argument with the hierarchical Dirichlet process. We also discuss further extensions that highlight the benefits of our analysis: i) a spectral relaxation involving thresholded eigenvectors, and ii) a normalized cut graph clustering algorithm that does not fix the number of clusters in the graph.
English Analysis
1. Bibliographic Information
- Title: Revisiting k-means: New Algorithms via Bayesian Nonparametrics
- Authors: Brian Kulis (Ohio State University) and Michael I. Jordan (UC Berkeley). Both are prominent researchers in machine learning. Michael I. Jordan is a world-renowned figure, particularly known for his foundational work at the intersection of statistics and computer science, including contributions to Bayesian nonparametrics.
- Journal/Conference: The paper is available on arXiv, a preprint server, and was first submitted in November 2011. While not a peer-reviewed journal publication, it has been influential and widely cited.
- Publication Year: 2011
- Abstract: The authors aim to combine the flexibility of Bayesian models with the simplicity of classical clustering methods like k-means. They start from the known connection between k-means and Gaussian mixture models and extend it to Bayesian nonparametrics. They show that a Gibbs sampling algorithm for a Dirichlet Process (DP) mixture model, under a specific limiting condition, becomes a hard clustering algorithm. This new algorithm, which they call DP-means, is shown to minimize a k-means-like objective function that includes a penalty for the number of clusters, thus automatically determining the cluster count. This analysis is extended to the Hierarchical Dirichlet Process (HDP) to develop a method for clustering multiple datasets with shared clusters. The paper also presents extensions to spectral clustering and graph clustering.
- Original Source Link:
- Official Source: https://arxiv.org/abs/1111.0352v2
- PDF Link: https://arxiv.org/pdf/1111.0352v2.pdf
- Publication Status: This is a preprint and has not been formally published in a peer-reviewed journal or conference proceeding according to the link provided.
2. Executive Summary
- Background & Motivation (Why):
- The paper addresses a fundamental trade-off in clustering. On one hand, Bayesian nonparametric models (like Dirichlet Process mixtures) are highly flexible; they can model complex data structures and, most importantly, infer the number of clusters from the data itself. However, they typically rely on complex and computationally intensive inference methods like Gibbs sampling, which can be slow and hard to scale.
- On the other hand, k-means clustering is simple, fast, scalable, and widely used in practice. Its major drawback is its rigidity: the number of clusters, $k$, must be specified in advance, and it struggles with non-spherical clusters.
- The paper seeks to bridge this gap, asking: Can we derive a simple, scalable, k-means-like "hard clustering" algorithm that inherits the flexibility of Bayesian nonparametrics, especially the ability to automatically determine the number of clusters?
- Main Contributions / Findings (What):
- DP-means Algorithm: The authors derive a new hard clustering algorithm, DP-means, by taking the small-variance asymptotic limit of a Gibbs sampler for a Dirichlet Process (DP) mixture model. This algorithm is nearly as simple as k-means but includes a threshold parameter, $\lambda$, to decide whether to assign a point to an existing cluster or create a new one.
- A Penalized Objective Function: They prove that DP-means monotonically minimizes a novel objective function: the standard k-means sum-of-squared-errors plus a penalty term $\lambda k$, which penalizes the number of clusters. This provides a formal justification for the algorithm and guarantees convergence to a local minimum.
- Hierarchical Clustering Algorithm (Hard Gaussian HDP): They extend the same asymptotic analysis to the Hierarchical Dirichlet Process (HDP), a model for clustering grouped data. This results in a new algorithm for finding shared "global" clusters across multiple datasets while allowing for dataset-specific "local" clusters. This algorithm also minimizes an intuitive penalized objective.
- Novel Extensions: They demonstrate the power of their framework by deriving two further extensions:
- A spectral clustering method where the number of clusters is determined by thresholding the eigenvalues of the kernel matrix, rather than taking the top $k$ eigenvectors.
- A graph clustering algorithm for the normalized cut objective that does not require fixing the number of clusters beforehand.
3. Prerequisite Knowledge & Related Work
- Foundational Concepts:
- k-means Clustering: An iterative algorithm that partitions data points into a predefined number of clusters, $k$. It works by alternately assigning each point to the cluster with the nearest mean (centroid) and then recalculating the centroid for each cluster. Its goal is to minimize the within-cluster sum of squared distances.
- Gaussian Mixture Model (GMM): A probabilistic model that assumes the data is generated from a mixture of several Gaussian (normal) distributions with unknown parameters. Each Gaussian represents a cluster. Unlike k-means' "hard" assignments, GMM provides "soft" assignments (probabilities of a point belonging to each cluster). Parameters are typically estimated using the Expectation-Maximization (EM) algorithm.
- Bayesian Nonparametrics: A branch of Bayesian statistics that allows models to have an "infinite" number of parameters. For clustering, this means the number of clusters is not fixed in advance but can grow as more data is observed. This provides great modeling flexibility.
- Dirichlet Process (DP): The cornerstone of Bayesian nonparametrics for mixture models. A DP is a "distribution over distributions." A sample drawn from a DP is itself a discrete probability distribution. When used as a prior in a mixture model (a DP mixture), it allows for an infinite number of potential clusters, with the model automatically selecting a finite subset of them to explain the data (see the short simulation after this list).
- Gibbs Sampling: A Markov chain Monte Carlo (MCMC) algorithm for obtaining a sequence of observations from a specified multivariate probability distribution when direct sampling is difficult. It is commonly used for inference in complex Bayesian models by iteratively sampling each variable from its conditional distribution given the current values of all other variables.
- Hierarchical Dirichlet Process (HDP): An extension of the DP used for modeling grouped data. It allows different groups (e.g., different documents or datasets) to share a common set of clusters, but with different proportions. For instance, in topic modeling, all documents draw from a common set of topics (global clusters), but each document has its own distribution over those topics.
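To make the Dirichlet process's "growing number of clusters" behavior concrete, here is a tiny simulation of its Chinese restaurant process representation (a standard construction, written for this summary, not code from the paper): each point joins an existing cluster with probability proportional to that cluster's size, or starts a new one with probability proportional to the concentration parameter $\alpha$.

```python
import numpy as np

def crp_cluster_sizes(n, alpha, seed=0):
    """Simulate cluster sizes under the Chinese restaurant process view of a DP."""
    rng = np.random.default_rng(seed)
    sizes = []                                    # current cluster sizes
    for _ in range(n):
        probs = np.array(sizes + [alpha], dtype=float)
        choice = rng.choice(len(probs), p=probs / probs.sum())
        if choice == len(sizes):
            sizes.append(1)                       # start a new cluster (prob ~ alpha)
        else:
            sizes[choice] += 1                    # join an existing cluster (prob ~ its size)
    return sizes

print(crp_cluster_sizes(200, alpha=1.0))          # the number of clusters grows slowly with n
```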
- Previous Works & Differentiation:
- The paper is inspired by the well-known asymptotic connection between GMMs and k-means: if all clusters of a GMM share a spherical covariance $\sigma^2 I$ and the variance $\sigma^2$ is taken to zero, the EM updates become identical to the k-means updates, and the soft assignments collapse into hard assignments to the nearest centroid (illustrated numerically in the sketch after this list).
- This paper's key innovation is to apply a similar asymptotic argument not to the standard GMM, but to a Bayesian nonparametric (DP) mixture model. Instead of analyzing the EM algorithm, they analyze a Gibbs sampler.
- This approach is distinct from previous works. While other methods have tried to make k-means more flexible, they typically do not arise from such a principled Bayesian foundation. And while standard DP mixture inference is powerful, it remains slow. The authors' method uniquely combines the theoretical elegance of Bayesian nonparametrics with the practical efficiency of a hard clustering algorithm.
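The sketch below is a small numerical illustration of this limit (mine, not the paper's): for equal-weight spherical Gaussians, the EM responsibilities for a point collapse onto the nearest centroid as the shared variance $\sigma^2$ shrinks.

```python
import numpy as np

def gmm_responsibilities(x, means, sigma2):
    """Responsibilities of equal-weight spherical Gaussians N(mu_c, sigma2*I) for one point."""
    log_lik = -np.sum((x - means) ** 2, axis=1) / (2.0 * sigma2)  # shared constants cancel
    log_lik -= log_lik.max()                                      # numerical stabilization
    resp = np.exp(log_lik)
    return resp / resp.sum()

means = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])
x = np.array([1.0, 0.2])
for sigma2 in [1.0, 0.1, 0.01]:
    print(sigma2, np.round(gmm_responsibilities(x, means, sigma2), 3))
# As sigma2 -> 0, the soft responsibilities approach a one-hot vector on the nearest mean,
# which is exactly the hard k-means assignment.
```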
4. Methodology (Core Technology & Implementation)
4.1. From DP Gibbs Sampler to DP-means
The core idea is to analyze the behavior of a specific Gibbs sampler for a DP mixture of Gaussians in the limit as the cluster variance $\sigma^2 \to 0$.
- Model Setup:
- The data points $x_1, \dots, x_n$ are assumed to be drawn from Gaussian distributions $\mathcal{N}(\mu_{z_i}, \sigma^2 I)$, where $z_i$ indexes the cluster of point $x_i$.
- The cluster means are drawn from a distribution $G$, which is itself drawn from a Dirichlet Process: $G \sim \mathrm{DP}(\alpha, G_0)$.
- $G_0$ is the "base measure," i.e., the prior over the cluster means. The paper chooses a Gaussian prior: $G_0 = \mathcal{N}(0, \rho I)$.
- Gibbs Sampling Probabilities: In a Gibbs sampler for this model, for each data point $x_i$ we decide whether to assign it to an existing cluster or create a new one. The (unnormalized) probability of assigning $x_i$ to an existing cluster $c$ (containing $n_{-i,c}$ other points) is $n_{-i,c}\,\mathcal{N}(x_i \mid \mu_c, \sigma^2 I)$. The (unnormalized) probability of creating a new cluster is $\alpha \int \mathcal{N}(x_i \mid \mu, \sigma^2 I)\, dG_0(\mu)$. With the chosen Gaussian prior $G_0 = \mathcal{N}(0, \rho I)$, this integral has a closed form: $\alpha\,\mathcal{N}(x_i \mid 0, (\rho + \sigma^2) I)$.
- The Asymptotic Limit: The authors don't just let $\sigma^2 \to 0$. To get a non-trivial result, they cleverly define the DP concentration parameter $\alpha$ as a function of $\sigma^2$: $\alpha = (1 + \rho/\sigma^2)^{d/2} \exp\!\big(-\tfrac{\lambda}{2\sigma^2}\big)$, where $\lambda$ is a new hyperparameter. After substituting this into the Gibbs probabilities and simplifying, the posterior probability of assigning point $x_i$ to an existing cluster $c$ becomes proportional to $\exp\!\big(-\tfrac{\|x_i - \mu_c\|^2}{2\sigma^2}\big)$, and the probability of creating a new cluster becomes proportional to $\exp\!\big(-\tfrac{\lambda}{2\sigma^2} - \tfrac{\|x_i\|^2}{2(\rho + \sigma^2)}\big)$. As $\sigma^2 \to 0$, the exponential terms completely dominate, and the assignment is determined entirely by the exponent that is least negative (i.e., smallest in magnitude). The decision rule becomes:
- Find the cluster $c^\star$ that minimizes the squared distance $\|x_i - \mu_c\|^2$.
- Compare this minimum squared distance to $\lambda$.
- If $\|x_i - \mu_{c^\star}\|^2 \le \lambda$, assign $x_i$ to cluster $c^\star$.
- If $\|x_i - \mu_{c^\star}\|^2 > \lambda$, create a new cluster containing only $x_i$.
- Mean Update: In the Gibbs sampler, after updating assignments, the cluster means are resampled from their posterior. For a cluster $c$ containing $n_c$ points with empirical mean $\bar{x}_c$, the posterior mean is $\tilde{\mu}_c = \big(1 + \tfrac{\sigma^2}{\rho\, n_c}\big)^{-1} \bar{x}_c$ (and the posterior covariance shrinks to zero). As $\sigma^2 \to 0$, this posterior mean converges to the empirical mean $\bar{x}_c$, which is exactly the mean update step in k-means.
This leads to Algorithm 1: DP-means. It iterates through points, assigning them to the closest cluster centroid unless the squared distance exceeds , in which case a new cluster is created with the point as its first member. After each pass through the data (or after each point assignment), the centroids are recomputed as the mean of their assigned points.
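Below is a minimal runnable sketch of this procedure, written for this summary rather than taken from the authors' code; starting from a single cluster at the global mean and dropping empty clusters after each pass are pragmatic choices assumed here.

```python
import numpy as np

def dp_means(X, lam, max_iters=100, tol=1e-9):
    """DP-means sketch: k-means-style passes, plus new-cluster creation whenever
    the squared distance to the nearest centroid exceeds lam."""
    n, _ = X.shape
    centroids = [X.mean(axis=0)]                  # start with one cluster at the global mean
    labels = np.zeros(n, dtype=int)
    prev_obj = np.inf
    for _ in range(max_iters):
        # Sequential assignment pass: newly created clusters are visible to later points.
        for i in range(n):
            d2 = np.array([np.sum((X[i] - mu) ** 2) for mu in centroids])
            if d2.min() > lam:
                centroids.append(X[i].copy())     # open a new cluster at this point
                labels[i] = len(centroids) - 1
            else:
                labels[i] = int(d2.argmin())
        # Mean-update pass: recompute centroids; drop empty clusters and remap labels.
        new_centroids, remap = [], {}
        for c in range(len(centroids)):
            members = X[labels == c]
            if len(members) > 0:
                remap[c] = len(new_centroids)
                new_centroids.append(members.mean(axis=0))
        centroids = new_centroids
        labels = np.array([remap[c] for c in labels])
        # Penalized objective: within-cluster squared error plus lam per cluster.
        obj = sum(np.sum((X[labels == c] - mu) ** 2)
                  for c, mu in enumerate(centroids)) + lam * len(centroids)
        if prev_obj - obj < tol:
            break
        prev_obj = obj
    return np.array(centroids), labels
```

The quantity tracked for early stopping is the penalized k-means objective discussed in Section 4.2, mirroring the paper's monotonic-convergence guarantee.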
4.2. Underlying Objective Function
The paper proves that the DP-means algorithm monotonically decreases the following objective function: $\min_{\{l_c\},\,\{\mu_c\}} \sum_{c=1}^{k} \sum_{x \in l_c} \|x - \mu_c\|^2 + \lambda k$, where $l_1, \dots, l_k$ are the clusters, $\mu_1, \dots, \mu_k$ are their means, and $k$ is the total number of clusters.
- Interpretation: This is the standard k-means objective (minimizing within-cluster variance) plus a penalty term $\lambda k$ that is linear in the number of clusters. The parameter $\lambda$ controls the trade-off: a larger $\lambda$ imposes a higher cost for creating new clusters, leading to fewer clusters, and vice versa. This objective is related to information criteria like AIC and BIC, which also penalize model complexity.
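For concreteness, here is a small helper (an illustration for this summary, not code from the paper) that evaluates the penalized objective for a given clustering; each assignment and mean-update step of DP-means leaves this quantity no larger than before.

```python
import numpy as np

def dp_means_objective(X, labels, centroids, lam):
    """k-means distortion plus a penalty of lam for every cluster used."""
    k = len(centroids)
    distortion = sum(np.sum((X[labels == c] - centroids[c]) ** 2) for c in range(k))
    return distortion + lam * k
```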
4.3. Hierarchical Clustering: Hard Gaussian HDP
The same logic is extended to the Hierarchical Dirichlet Process (HDP) to handle multiple datasets. The HDP model allows datasets to share a set of global clusters.
- Intuition: Each dataset has its own set of "local" clusters. Each local cluster is then associated with one of a shared pool of "global" clusters. This structure is depicted in Figure 1.
Figure 1 (schematic): the hierarchical relationship between local clusters within each dataset and shared global clusters. Each dataset contains several local clusters; each local cluster is tied to a global mean $\mu_p$, computed from all points associated with that global cluster. When points are reassigned, the objective penalizes the creation of new local or global clusters.
- Algorithm and Objective: The asymptotic analysis of an HDP Gibbs sampler yields Algorithm 2: Hard Gaussian HDP. This algorithm involves two penalty parameters:
- $\lambda_\ell$: Penalty for creating a new local cluster within a dataset.
- $\lambda_g$: Penalty for creating a new global cluster to be shared across datasets.
The algorithm minimizes the following objective: $\min \sum_{p=1}^{g} \sum_{j=1}^{J} \sum_{x \in S_{jp}} \|x - \mu_p\|^2 + \lambda_\ell\, k + \lambda_g\, g$. Here, $S_{jp}$ denotes the points of dataset $j$ associated with global cluster $p$, the global clusters have means $\mu_p$, $g$ is the total number of global clusters, and $k$ is the total number of local clusters across all datasets. The algorithm iteratively assigns points to local clusters and local clusters to global clusters, creating new ones when the cost exceeds the penalty thresholds (a compact sketch of this per-point decision follows below).
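The sketch below is my reading of the per-point trade-offs implied by this objective (join an existing local cluster, open a new local cluster tied to an existing global mean at an extra cost $\lambda_\ell$, or open a new local and new global cluster at cost $\lambda_\ell + \lambda_g$); it is not a verbatim transcription of the paper's Algorithm 2.

```python
import numpy as np

def hard_hdp_point_decision(x, local_to_global, global_means, lam_local, lam_global):
    """Cheapest action for point x within one dataset.

    local_to_global: for each local cluster in this dataset, the index of its global cluster.
    global_means:    list/array of global cluster means.
    """
    candidates = []
    # (a) Join an existing local cluster: pay the distance to its global mean.
    for loc, p in enumerate(local_to_global):
        candidates.append((np.sum((x - global_means[p]) ** 2), ("join_local", loc)))
    # (b) Open a new local cluster attached to an existing global cluster: pay lam_local extra.
    for p, mu in enumerate(global_means):
        candidates.append((np.sum((x - mu) ** 2) + lam_local, ("new_local", p)))
    # (c) Open a new local cluster AND a new global cluster centered at x.
    candidates.append((lam_local + lam_global, ("new_local_new_global", None)))
    return min(candidates, key=lambda t: t[0])
```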
4.4. Further Extensions
- Spectral Meets Nonparametric: The authors show a connection to spectral clustering. The DP-means objective can be rewritten as a penalized trace maximization problem, $\max_{Z} \operatorname{tr}(Z^\top K Z) - \lambda k$, where $K$ is the kernel matrix (e.g., $K = XX^\top$ for a linear kernel) and $Z$ is a normalized cluster indicator matrix. Relaxing the constraints on $Z$ so that it may be any orthonormal matrix, the solution is given by the eigenvectors of $K$ corresponding to all eigenvalues greater than $\lambda$ (see the sketch after this list). This is a beautiful result:
- Standard spectral clustering takes the top k eigenvectors.
- This nonparametric version takes all eigenvectors whose "energy" (eigenvalue) surpasses a threshold $\lambda$. The number of clusters is thus determined by the data's spectral properties and $\lambda$.
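A minimal numpy sketch of the relaxed solution (assuming a linear kernel $K = XX^\top$ purely for illustration): keep every eigenvector of the kernel matrix whose eigenvalue exceeds $\lambda$.

```python
import numpy as np

def thresholded_spectral_embedding(X, lam):
    """Relaxed nonparametric spectral step: eigenvectors of K with eigenvalues > lam."""
    K = X @ X.T                            # linear kernel; any PSD kernel matrix would do
    eigvals, eigvecs = np.linalg.eigh(K)   # eigh returns eigenvalues in ascending order
    keep = eigvals > lam                   # the number kept plays the role of k
    return eigvecs[:, keep], int(keep.sum())
```

As in standard spectral clustering, a rounding step (for example, clustering the rows of the returned embedding) would recover discrete cluster assignments.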
- Graph Clustering: The paper leverages known connections between weighted kernel k-means and graph cuts. By extending their DP-means objective to the weighted kernel case, they show it is equivalent to minimizing a penalized normalized cut objective of the form $\sum_{c=1}^{k} \frac{\mathrm{cut}(\mathcal{V}_c,\, \mathcal{V} \setminus \mathcal{V}_c)}{\mathrm{deg}(\mathcal{V}_c)} + \lambda k$. This allows them to develop a version of the popular normalized cut algorithm that does not require specifying $k$ in advance (a small evaluator of this objective is sketched below).
5. Experimental Setup
- Datasets:
- Synthetic Data: A simple 2D dataset with 3 Gaussian clusters was used for illustration (Figure 2a). Another synthetic setup involved 50 datasets for the hard Gaussian HDP experiment (Figure 2d).
- UCI Datasets: A collection of standard real-world benchmark datasets from the UCI Machine Learning Repository: `Wine`, `Iris`, `Pima`, `Soybean`, `Car`, `Balance Scale`, `Breast Cancer`, and `Vehicle`.
- Evaluation Metrics:
- Normalized Mutual Information (NMI): This metric measures the agreement between two clusterings (e.g., the algorithm's output and the ground truth labels), correcting for chance. It is a value between 0 (no mutual information) and 1 (perfect correlation).
- Conceptual Definition: NMI quantifies the "information" that the predicted cluster labels $Y$ provide about the true class labels $C$. It is normalized by the entropies of both label sets, so the score is independent of the number of clusters. A higher NMI indicates a better match.
- Mathematical Formula: $\mathrm{NMI}(Y, C) = \dfrac{I(Y, C)}{\sqrt{H(Y)\, H(C)}}$ (a small reference implementation follows the symbol list below).
- Symbol Explanation:
- $Y$: The set of cluster assignments produced by the algorithm.
- $C$: The set of ground-truth class labels.
- $I(Y, C)$: The mutual information between $Y$ and $C$. It measures the dependency between the two sets of labels.
- $H(Y)$ and $H(C)$: The entropies of the cluster assignments and ground-truth labels, respectively. Entropy measures the uncertainty or "information content" of a distribution.
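A compact implementation of the formula above, using the geometric-mean normalization (scikit-learn's `normalized_mutual_info_score` provides a comparable, configurable metric):

```python
import numpy as np

def nmi(y_pred, y_true):
    """Normalized mutual information with geometric-mean normalization."""
    y_pred, y_true = np.asarray(y_pred), np.asarray(y_true)
    eps = 1e-12
    clusters, classes = np.unique(y_pred), np.unique(y_true)
    # Joint distribution over (predicted cluster, true class).
    joint = np.array([[np.mean((y_pred == c) & (y_true == t)) for t in classes]
                      for c in clusters])
    p_pred, p_true = joint.sum(axis=1), joint.sum(axis=0)
    mi = np.sum(joint * np.log((joint + eps) / (np.outer(p_pred, p_true) + eps)))
    h_pred = -np.sum(p_pred * np.log(p_pred + eps))
    h_true = -np.sum(p_true * np.log(p_true + eps))
    return mi / np.sqrt(h_pred * h_true)
```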
- Baselines:
- k-means: The classical algorithm, run with the true number of clusters $k$.
- Gibbs: A standard Gibbs sampler for a DP mixture model. This represents the full (but slow) Bayesian nonparametric approach.
- Parameter Selection:
- For DP-means, the crucial parameter $\lambda$ was set using a farthest-first heuristic. To target a desired number of clusters $k$, the heuristic iteratively builds a set of $k$ points, where each new point added is the one farthest from the current set. The final farthest distance becomes $\lambda$. This makes it easy to choose $\lambda$ to obtain approximately the desired number of clusters (a sketch follows below).
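A sketch of that heuristic; seeding the representative set with the global mean is an assumption of this sketch rather than a detail stated above.

```python
import numpy as np

def farthest_first_lambda(X, k):
    """Choose lambda so that DP-means tends to produce roughly k clusters."""
    reps = [X.mean(axis=0)]                  # seed the representative set (assumed detail)
    lam = None
    for _ in range(k):
        # Squared distance from every point to its nearest current representative.
        d2 = np.min([np.sum((X - r) ** 2, axis=1) for r in reps], axis=0)
        lam = float(d2.max())                # distance of the farthest point so far
        reps.append(X[d2.argmax()].copy())   # add that farthest point as a representative
    return lam
```

The returned value can be passed directly as `lam` to the DP-means sketch from Section 4.1.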
6. Results & Analysis
- Core Results:
- Synthetic Data Analysis (Figure 2):
- Figure 2b is the most telling. It shows the NMI score over 5000 Gibbs sampling iterations. The Gibbs sampler takes thousands of iterations to converge to a good solution. In stark contrast, DP-means is reported to converge in just 8 iterations on average, consistently finding 3 clusters and achieving a high NMI of 0.89. This highlights the massive speed and stability advantage of the derived hard clustering algorithm.
- Figure 2c demonstrates the role of $\lambda$. As the penalty increases, the number of clusters found by DP-means decreases, as expected from the objective function. This shows that $\lambda$ is an intuitive and effective control knob.
- Figure 2 (four panels): a) the synthetic dataset of three Gaussians; b) the NMI score over Gibbs sampling iterations; c) the average number of clusters found by DP-means for different values of $\lambda$; d) an example dataset from the hard Gaussian HDP experiment.
- UCI Dataset Results (Table 1): The table below is a transcription of the results from the paper. It compares the average NMI scores of DP-means, standard k-means (given the true $k$), and Gibbs sampling.

| Data set | DP-means | k-means | Gibbs |
| --- | --- | --- | --- |
| Wine | .41 | .43 | .42 |
| Iris | .75 | .76 | .73 |
| Pima | .02 | .03 | .02 |
| Soybean | .72 | .66 | .73 |
| Car | .07 | .05 | .15 |
| Balance Scale | .17 | .11 | .14 |
| Breast Cancer | .04 | .03 | .04 |
| Vehicle | .18 | .18 | .17 |

- Analysis: The results show that DP-means performs comparably to both k-means and the full Gibbs sampler. It is not uniformly better, but it is consistently competitive across a range of datasets. This is a strong result, as DP-means achieves this performance without needing $k$ to be specified, while being significantly faster and simpler to implement than Gibbs sampling.
7. Conclusion & Reflections
- Conclusion Summary: The paper successfully forges a principled connection between Bayesian nonparametrics and classical hard clustering. It derives DP-means, a simple, fast, and scalable algorithm that, like k-means, is easy to implement, but unlike k-means, can determine the number of clusters automatically. The algorithm is shown to optimize an intuitive penalized k-means objective. The authors demonstrate the generality of their framework by extending it to hierarchical clustering (HDP), spectral clustering, and graph cuts, yielding novel algorithms in each domain.
- Limitations & Future Work: The authors acknowledge one primary limitation: the DP-means algorithm is sequential, and its results can depend on the order in which data points are processed. They propose exploring adaptive methods for choosing a better ordering as an area for future work.
- Personal Insights & Critique:
- Elegance and Impact: This paper is a beautiful example of theoretical machine learning research that produces a practical and useful result. The derivation of DP-means from a DP mixture is elegant, and the resulting algorithm is brilliantly simple. It provides a strong "why" for a k-means-like algorithm that can infer $k$.
- Practical Utility: DP-means has become a popular and practical clustering algorithm. It hits a sweet spot: it is much more flexible than k-means but avoids the heavy machinery and slow convergence of MCMC-based Bayesian methods. The penalty parameter $\lambda$ is more intuitive to set than $k$, as it relates to a distance scale in the data.
- Conceptual Bridge: The work serves as a powerful conceptual bridge. The spectral clustering extension, which connects the penalty $\lambda$ to an eigenvalue threshold, is particularly insightful. It suggests a deep relationship between the "energy" of a cluster (related to an eigenvalue) and the Bayesian prior's preference for simplicity (encoded by $\lambda$).
- Untested Assumptions: The derivation relies on an asymptotic limit ($\sigma^2 \to 0$). While the resulting algorithm works well in practice, it is an approximation; its performance in real-world, finite-variance scenarios is an empirical question, though the experiments show it is effective. The choice of a spherical covariance is also a simplifying assumption, inherited from k-means, meaning the method is best suited to discovering spherical or convex clusters.