- Title: Revisiting k-means: New Algorithms via Bayesian Nonparametrics
- Authors: Brian Kulis (Ohio State University) and Michael I. Jordan (UC Berkeley). Both are prominent researchers in machine learning. Michael I. Jordan is a world-renowned figure, particularly known for his foundational work at the intersection of statistics and computer science, including contributions to Bayesian nonparametrics.
- Journal/Conference: The paper is available on arXiv, which is a preprint server. This version (v2) was submitted in November 2011. While not a peer-reviewed journal publication, it has been influential and widely cited.
- Publication Year: 2011
- Abstract: The authors aim to combine the flexibility of Bayesian models with the simplicity of classical clustering methods like k-means. They start from the known connection between k-means and Gaussian Mixture Models and extend this to Bayesian nonparametrics. They show that a Gibbs sampling algorithm for a Dirichlet Process (DP) mixture model, under a specific limiting condition, becomes a hard clustering algorithm. This new algorithm, which they call DP-means, is shown to minimize a k-means-like objective function that includes a penalty on the number of clusters, thus automatically determining the cluster count. This analysis is extended to the Hierarchical Dirichlet Process (HDP) to develop a method for clustering multiple datasets with shared clusters. The paper also presents extensions to spectral clustering and graph clustering.
- Original Source Link:
2. Executive Summary
4. Methodology (Core Technology & Implementation)
4.1. From DP Gibbs Sampler to DP-means
The core idea is to analyze the behavior of a specific Gibbs sampler for a DP mixture of Gaussians in the limit as the cluster variance σ→0.
- Model Setup:
- The data points $x_i$ are assumed to be drawn from Gaussian distributions $\mathcal{N}(\phi_i, \sigma I)$.
- The means $\phi_i$ are drawn from a distribution $G$, which is itself drawn from a Dirichlet Process: $G \sim \mathrm{DP}(\alpha, G_0)$.
- $G_0$ is the "base measure", i.e., the prior over the cluster means. The paper chooses a Gaussian prior: $G_0 = \mathcal{N}(0, \rho I)$.
- Gibbs Sampling Probabilities:
In a Gibbs sampler for this model, for each data point $x_i$, we decide whether to assign it to an existing cluster $c$ or create a new one. The (unnormalized) probability of assigning $x_i$ to an existing cluster $c$ (containing $n_{-i,c}$ other points) is
$p(\text{assign to } c) \propto n_{-i,c} \cdot \mathcal{N}(x_i \mid \mu_c, \sigma I)$
and the (unnormalized) probability of creating a new cluster is
$p(\text{new cluster}) \propto \alpha \int \mathcal{N}(x_i \mid \mu, \sigma I)\, dG_0(\mu)$
With the chosen Gaussian prior $G_0$, this integral has a closed form: it is the Gaussian marginal $\mathcal{N}(x_i \mid 0, (\rho + \sigma) I)$.
- The Asymptotic Limit:
The authors don't just let σ→0. To get a non-trivial result, they cleverly define the DP concentration parameter α as a function of σ:
$\alpha = \left(1 + \frac{\rho}{\sigma}\right)^{d/2} \exp\left(-\frac{\lambda}{2\sigma}\right)$
where λ is a new hyperparameter. After substituting this into the Gibbs probabilities and simplifying, the posterior probability of assigning point $i$ to an existing cluster $c$ becomes proportional to:
$\hat{\gamma}(z_{i,c}) \propto n_{-i,c} \cdot \exp\left(-\frac{1}{2\sigma}\|x_i - \mu_c\|^2\right)$
And the probability of creating a new cluster becomes proportional to:
$\hat{\gamma}(z_{i,\mathrm{new}}) \propto \exp\left(-\frac{1}{2\sigma}\left[\lambda + \frac{\sigma}{\rho + \sigma}\|x_i\|^2\right]\right)$
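For completeness, the new-cluster expression follows from the Gaussian marginal combined with the chosen α; the $(2\pi\sigma)^{-d/2}$ factor that appears also multiplies every existing-cluster term and therefore drops out under normalization (this intermediate step is implicit in the paper):
$\alpha \int \mathcal{N}(x_i \mid \mu, \sigma I)\, dG_0(\mu) = \left(1 + \tfrac{\rho}{\sigma}\right)^{d/2} e^{-\lambda/(2\sigma)} \cdot \left(2\pi(\rho+\sigma)\right)^{-d/2} e^{-\|x_i\|^2/(2(\rho+\sigma))} = (2\pi\sigma)^{-d/2} \exp\left(-\frac{1}{2\sigma}\left[\lambda + \frac{\sigma}{\rho+\sigma}\|x_i\|^2\right]\right)$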
As σ→0, the exponential terms completely dominate. The assignment will be determined entirely by the exponent argument that is least negative (i.e., smallest in magnitude). The decision rule becomes:
- Find the cluster $c^*$ that minimizes the squared distance $\|x_i - \mu_c\|^2$.
- Compare this minimum squared distance to λ.
- If $\|x_i - \mu_{c^*}\|^2 \leq \lambda$, assign $x_i$ to cluster $c^*$.
- If $\|x_i - \mu_{c^*}\|^2 > \lambda$, create a new cluster for $x_i$.
- Mean Update:
In the Gibbs sampler, after updating the assignments, the cluster means are resampled from their posterior. For a cluster $c$ with $n_c$ points, the posterior mean is $\tilde{\mu}_c = \left(1 + \frac{\sigma}{\rho\, n_c}\right)^{-1} \bar{x}_c$, where $\bar{x}_c$ is the empirical mean of the cluster. As σ→0, this posterior mean $\tilde{\mu}_c$ converges to the empirical mean $\bar{x}_c$, which is exactly the mean-update step of k-means.
This leads to Algorithm 1: DP-means. It iterates through points, assigning them to the closest cluster centroid unless the squared distance exceeds λ, in which case a new cluster is created with the point as its first member. After each pass through the data (or after each point assignment), the centroids are recomputed as the mean of their assigned points.
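To make the procedure concrete, here is a minimal Python sketch of the DP-means loop described above (illustrative code, not the authors' implementation; initializing with the global mean, the pass structure, and the convergence check are our own choices):

```python
import numpy as np

def dp_means(X, lam, max_iters=100):
    """Minimal DP-means sketch: assign each point to its nearest centroid
    unless the squared distance exceeds lam, in which case the point seeds
    a new cluster; then recompute centroids as cluster means."""
    labels = np.zeros(len(X), dtype=int)
    centroids = [X.mean(axis=0)]                 # start with one global cluster
    for _ in range(max_iters):
        old_labels = labels.copy()
        for i, x in enumerate(X):
            d2 = np.array([np.sum((x - c) ** 2) for c in centroids])
            if d2.min() > lam:                   # too far from every centroid
                centroids.append(x.copy())       # open a new cluster at x
                labels[i] = len(centroids) - 1
            else:
                labels[i] = int(d2.argmin())
        # Recompute centroids and drop any clusters left empty.
        kept = [c for c in range(len(centroids)) if np.any(labels == c)]
        centroids = [X[labels == c].mean(axis=0) for c in kept]
        labels = np.array([kept.index(c) for c in labels])
        if np.array_equal(labels, old_labels):
            break
    return np.array(centroids), labels
```

Note that, as the authors themselves point out, the outcome of such a sequential pass can depend on the order in which the points are visited.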
4.2. Underlying Objective Function
The paper proves that the DP-means algorithm monotonically optimizes the following objective function:
$\min_{\{\ell_c\}_{c=1}^{k}} \;\sum_{c=1}^{k} \sum_{x \in \ell_c} \|x - \mu_c\|^2 \;+\; \lambda k$
where $\ell_c$ are the clusters, $\mu_c$ are their means, and $k$ is the total number of clusters.
- Interpretation: This is the standard k-means objective (minimizing within-cluster variance) plus a penalty term λk that is linear in the number of clusters. The parameter λ controls the trade-off: a larger λ imposes a higher cost for creating new clusters, leading to fewer clusters, and vice-versa. This objective is related to information criteria like AIC and BIC, which also penalize model complexity.
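Because the objective is so simple, it can be evaluated directly as a sanity check; a minimal helper (our own code, named dp_means_objective for illustration, pairing with the sketch above):

```python
import numpy as np

def dp_means_objective(X, labels, centroids, lam):
    """Within-cluster squared error plus a penalty of lam per cluster."""
    k = len(centroids)
    sse = sum(np.sum((X[labels == c] - centroids[c]) ** 2) for c in range(k))
    return sse + lam * k
```

Evaluating this after each full pass of the DP-means sketch should give a non-increasing sequence, mirroring the monotonic convergence the paper proves.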
4.3. Hierarchical Clustering: Hard Gaussian HDP
The same logic is extended to the Hierarchical Dirichlet Process (HDP) to handle multiple datasets (j=1,...,D). The HDP model allows datasets to share a set of global clusters.
- Intuition: Each dataset $j$ has its own set of "local" clusters. Each local cluster is then associated with one of a shared pool of "global" clusters. This structure is depicted in Figure 1.
Figure 1: A schematic of the hierarchical relationship between local and global clusters across multiple datasets. Each dataset contains several local clusters; each local cluster is tied to a global mean $\mu_p$, which is computed from the points of all local clusters assigned to it. When points are reassigned, the objective penalizes the creation of new local or global clusters.
- Algorithm and Objective: The asymptotic analysis of an HDP Gibbs sampler yields Algorithm 2: Hard Gaussian HDP. This algorithm involves two penalty parameters:
- $\lambda_\ell$: penalty for creating a new local cluster within a dataset.
- $\lambda_g$: penalty for creating a new global cluster to be shared across datasets.
The algorithm minimizes the following objective:
$\min_{\{\ell_p\}_{p=1}^{g}} \;\sum_{p=1}^{g} \sum_{x_{ij} \in \ell_p} \|x_{ij} - \mu_p\|^2 \;+\; \lambda_\ell k \;+\; \lambda_g g$
Here, $\ell_p$ are the global clusters with means $\mu_p$, $g$ is the total number of global clusters, and $k = \sum_j k_j$ is the total number of local clusters across all datasets. The algorithm iteratively assigns points to local clusters and local clusters to global clusters, creating new ones when the cost exceeds the penalty thresholds; a sketch of the objective evaluation follows below.
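To make the objective concrete, here is a small evaluation sketch (our own code and data structures, not the paper's; it assumes non-empty local clusters and a dictionary per dataset mapping local cluster ids to global ids):

```python
import numpy as np

def hard_hdp_objective(datasets, local_labels, local_to_global,
                       lam_local, lam_global):
    """Squared error to global means + lam_local * (#local clusters)
    + lam_global * (#global clusters).

    datasets        : list of (n_j, d) arrays
    local_labels    : list of length-n_j integer arrays (local cluster ids)
    local_to_global : list of dicts mapping local cluster id -> global id
    """
    # Gather the points belonging to each global cluster.
    points_per_global = {}
    for X, labels, mapping in zip(datasets, local_labels, local_to_global):
        for local_id, global_id in mapping.items():
            pts = X[labels == local_id]
            points_per_global.setdefault(global_id, []).append(pts)
    sse = 0.0
    for pts_list in points_per_global.values():
        pts = np.vstack(pts_list)
        mu = pts.mean(axis=0)            # global mean over all member points
        sse += np.sum((pts - mu) ** 2)
    k = sum(len(m) for m in local_to_global)    # total local clusters
    g = len(points_per_global)                  # total global clusters
    return sse + lam_local * k + lam_global * g
```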
4.4. Further Extensions
- Spectral Meets Nonparametric: The authors show a connection to spectral clustering. The DP-means objective can be rewritten as a trace maximization problem:
$\max_{Y} \;\mathrm{tr}\!\left(Y^{\top}(K - \lambda I)\,Y\right)$
where $K$ is the kernel matrix ($K_{ij} = x_i^{\top} x_j$) and $Y$ is a normalized cluster indicator matrix. Relaxing the constraints so that $Y$ may be any orthonormal matrix, the solution is given by the eigenvectors of $K$ corresponding to all eigenvalues greater than λ. This is a beautiful result (see the sketch after this list):
- Standard spectral clustering takes the top k eigenvectors.
- This nonparametric version takes all eigenvectors whose "energy" (eigenvalue) surpasses the threshold λ. The number of clusters is thus determined by the data's spectral properties and λ.
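A minimal sketch of this relaxed solution, assuming the linear kernel used in the text (our own code):

```python
import numpy as np

def nonparametric_spectral_embedding(X, lam):
    """Relaxed DP-means trace objective: retain every eigenvector of the
    kernel matrix whose eigenvalue exceeds lam; the count of retained
    eigenvectors plays the role of k."""
    K = X @ X.T                                # linear kernel, K_ij = x_i^T x_j
    eigvals, eigvecs = np.linalg.eigh(K)       # eigenvalues in ascending order
    keep = eigvals > lam
    return eigvecs[:, keep], int(keep.sum())
```

As in standard spectral clustering pipelines, the rows of the retained eigenvector matrix can then be clustered to obtain a hard assignment.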
- Graph Clustering: The paper leverages known connections between weighted kernel k-means and graph cuts. By extending their DP-means objective to the weighted kernel case, they show it is equivalent to minimizing a penalized normalized cut objective:
$\mathrm{Cut}(\mathcal{A}) + \lambda' k$
This allows them to develop a version of the popular normalized cut algorithm that does not require specifying k in advance (a small scoring helper is sketched below).
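For illustration, a small scoring helper (our own code, using the standard normalized-cut definition, i.e., the sum over clusters of cut weight divided by cluster volume) that evaluates a given partition of a weighted graph under such a penalized objective:

```python
import numpy as np

def penalized_normalized_cut(W, labels, lam_prime):
    """Sum over clusters of cut(A_c, complement) / vol(A_c), plus lam' * k.

    W      : (n, n) symmetric affinity matrix
    labels : length-n integer cluster assignment
    """
    degrees = W.sum(axis=1)
    total = 0.0
    clusters = np.unique(labels)
    for c in clusters:
        in_c = labels == c
        cut = W[in_c][:, ~in_c].sum()          # edge weight leaving cluster c
        vol = degrees[in_c].sum()              # total degree inside cluster c
        total += cut / vol
    return total + lam_prime * len(clusters)
```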
5. Experimental Setup
- Datasets:
- Synthetic Data: A simple 2D dataset with 3 Gaussian clusters was used for illustration (Figure 2a). Another synthetic setup involved 50 datasets for the hard Gaussian HDP experiment (Figure 2d).
- UCI Datasets: A collection of standard real-world benchmark datasets from the UCI Machine Learning Repository: Wine, Iris, Pima, Soybean, Car, Balance Scale, Breast Cancer, and Vehicle.
- Evaluation Metrics:
- Normalized Mutual Information (NMI): This metric measures the agreement between two clusterings (e.g., the algorithm's output and the ground-truth labels), correcting for chance. It takes values between 0 (no mutual information) and 1 (perfect agreement).
- Conceptual Definition: NMI quantifies the "information" that the predicted cluster labels Y provide about the true class labels C. It is normalized by the entropies of both label sets, so that scores are comparable across clusterings with different numbers of clusters. A higher NMI indicates a better match. (A minimal computation is sketched after this list.)
- Mathematical Formula:
$\mathrm{NMI}(Y, C) = \frac{2 \cdot I(Y, C)}{H(Y) + H(C)}$
- Symbol Explanation:
- Y: the set of cluster assignments produced by the algorithm.
- C: the set of ground-truth class labels.
- I(Y, C): the mutual information between Y and C; it measures the dependency between the two sets of labels.
- H(Y) and H(C): the entropies of the cluster assignments and the ground-truth labels, respectively. Entropy measures the uncertainty or "information content" of a distribution.
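A minimal NMI computation matching the formula above (our own code; scikit-learn's normalized_mutual_info_score with arithmetic averaging computes the same quantity):

```python
import numpy as np

def nmi(y_pred, y_true):
    """Normalized mutual information: 2 * I(Y;C) / (H(Y) + H(C))."""
    y_pred, y_true = np.asarray(y_pred), np.asarray(y_true)
    clusters, classes = np.unique(y_pred), np.unique(y_true)
    # Joint distribution p(y, c) estimated from the contingency table.
    p_joint = np.array([[np.mean((y_pred == y) & (y_true == c)) for c in classes]
                        for y in clusters])
    p_y = p_joint.sum(axis=1)            # marginal over predicted clusters
    p_c = p_joint.sum(axis=0)            # marginal over true classes
    nz = p_joint > 0                     # avoid log(0) terms
    mi = np.sum(p_joint[nz] * np.log(p_joint[nz] / np.outer(p_y, p_c)[nz]))
    h_y = -np.sum(p_y[p_y > 0] * np.log(p_y[p_y > 0]))
    h_c = -np.sum(p_c[p_c > 0] * np.log(p_c[p_c > 0]))
    return 2.0 * mi / (h_y + h_c)
```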
- Baselines:
- k-means: The classical algorithm, run with the true number of clusters k.
- Gibbs: A standard Gibbs sampler for a DP mixture model. This represents the full (but slow) Bayesian nonparametric approach.
- Parameter Selection:
- For DP-means, the crucial parameter λ was set using a farthest-first heuristic. To get a desired number of clusters k, the heuristic iteratively builds a set of k points, where each new point added is the one farthest from the current set; the final distance threshold becomes λ. This makes it easy to choose λ to target an approximate number of clusters (a rough sketch of the heuristic follows).
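A rough sketch of that heuristic (our own code; seeding with the global mean follows the description above, while details such as returning the squared distance are our choices):

```python
import numpy as np

def farthest_first_lambda(X, k):
    """Grow a set of k far-apart 'centers' starting from the global mean;
    return the last maximin squared distance as the suggested lambda."""
    centers = [X.mean(axis=0)]
    lam = 0.0
    for _ in range(1, k):
        # Squared distance of every point to its nearest current center.
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        idx = int(np.argmax(d2))                 # farthest point from the set
        lam = float(d2[idx])
        centers.append(X[idx])
    return lam
```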
6. Results & Analysis
- Core Results:
- Synthetic Data Analysis (Figure 2):
- Figure 2b is the most telling. It shows the NMI score over 5000 Gibbs sampling iterations: the Gibbs sampler takes thousands of iterations to converge to a good solution. In stark contrast, DP-means is reported to converge in just 8 iterations on average, consistently finding 3 clusters and achieving a high NMI of 0.89. This highlights the massive speed and stability advantage of the derived hard clustering algorithm.
- Figure 2c demonstrates the role of λ. As the penalty λ increases, the number of clusters found by DP-means decreases, as expected from the objective function. This shows that λ is an intuitive and effective control knob.
Figure 2: A four-panel figure illustrating the methods on synthetic data: a) a dataset drawn from three Gaussians; b) NMI score over Gibbs sampling iterations; c) average number of clusters found by DP-means for different values of λ; d) an example dataset from the hard Gaussian HDP experiment.
- UCI Dataset Results (Table 1):
The table below is a transcription of the results from the paper. It compares the average NMI scores of DP-means, standard k-means (given the true k), and Gibbs sampling.

| Data set | DP-means | k-means | Gibbs |
|---|---|---|---|
| Wine | .41 | .43 | .42 |
| Iris | .75 | .76 | .73 |
| Pima | .02 | .03 | .02 |
| Soybean | .72 | .66 | .73 |
| Car | .07 | .05 | .15 |
| Balance Scale | .17 | .11 | .14 |
| Breast Cancer | .04 | .03 | .04 |
| Vehicle | .18 | .18 | .17 |
- Analysis: The results show that DP-means performs comparably to both k-means and the full Gibbs sampler. It is not uniformly better, but it is consistently competitive across a range of datasets. This is a strong result, as DP-means achieves this performance without needing k to be specified, while being significantly faster and simpler to implement than Gibbs sampling.
7. Conclusion & Reflections
- Conclusion Summary: The paper successfully forges a principled connection between Bayesian nonparametrics and classical hard clustering. It derives DP-means, a simple, fast, and scalable algorithm that, like k-means, is easy to implement, but unlike k-means, can determine the number of clusters automatically. The algorithm is shown to optimize an intuitive penalized k-means objective. The authors demonstrate the generality of their framework by extending it to hierarchical clustering (HDP), spectral clustering, and graph cuts, yielding novel algorithms in each domain.
- Limitations & Future Work: The authors acknowledge one primary limitation: the DP-means algorithm is sequential, so its results can depend on the order in which data points are processed. They propose exploring adaptive methods for choosing a better ordering as an area for future work.
- Personal Insights & Critique:
- Elegance and Impact: This paper is a beautiful example of theoretical machine learning research that produces a practical and useful result. The derivation of DP-means from a DP mixture is elegant, and the resulting algorithm is brilliantly simple. It provides a strong "why" for a k-means-like algorithm that can infer k.
- Practical Utility: DP-means has become a popular and practical clustering algorithm. It hits a sweet spot: it is much more flexible than k-means but avoids the heavy machinery and slow convergence of MCMC-based Bayesian methods. The penalty parameter λ is more intuitive to set than k, as it relates to a distance scale in the data.
- Conceptual Bridge: The work serves as a powerful conceptual bridge. The spectral clustering extension, which connects the penalty λ to an eigenvalue threshold, is particularly insightful. It suggests a deep relationship between the "energy" of a cluster (related to an eigenvalue) and the Bayesian prior's preference for simplicity (encoded by λ).
- Untested Assumptions: The derivation relies on an asymptotic limit (σ→0). While the resulting algorithm works well in practice, it is an approximation; its performance in real-world, finite-variance scenarios is an empirical question, though the experiments show it is effective. The choice of a spherical covariance $\sigma I$ (as in the model setup above) is also a simplifying assumption, inherited from k-means, meaning the method is best suited to discovering roughly spherical or convex clusters.