
Revisiting k-means: New Algorithms via Bayesian Nonparametrics

Keywords: Dirichlet Process Mixture Models, Bayesian Nonparametric Clustering Algorithms, k-means Algorithm Improvements, Multi-Dataset Clustering, Spectral Clustering and Normalized Cut
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

This paper revisits k-means from a Bayesian nonparametric viewpoint, proposing DP-means, a hard clustering algorithm that determines the number of clusters automatically while monotonically minimizing a penalized k-means-like objective. Extensions include hierarchical clustering across multiple datasets via the HDP, a spectral relaxation based on thresholded eigenvectors, and a normalized-cut graph clustering algorithm that does not fix the number of clusters.

Abstract

Bayesian models offer great flexibility for clustering applications---Bayesian nonparametrics can be used for modeling infinite mixtures, and hierarchical Bayesian models can be utilized for sharing clusters across multiple data sets. For the most part, such flexibility is lacking in classical clustering methods such as k-means. In this paper, we revisit the k-means clustering algorithm from a Bayesian nonparametric viewpoint. Inspired by the asymptotic connection between k-means and mixtures of Gaussians, we show that a Gibbs sampling algorithm for the Dirichlet process mixture approaches a hard clustering algorithm in the limit, and further that the resulting algorithm monotonically minimizes an elegant underlying k-means-like clustering objective that includes a penalty for the number of clusters. We generalize this analysis to the case of clustering multiple data sets through a similar asymptotic argument with the hierarchical Dirichlet process. We also discuss further extensions that highlight the benefits of our analysis: i) a spectral relaxation involving thresholded eigenvectors, and ii) a normalized cut graph clustering algorithm that does not fix the number of clusters in the graph.

English Analysis

1. Bibliographic Information

  • Title: Revisiting k-means: New Algorithms via Bayesian Nonparametrics
  • Authors: Brian Kulis (Ohio State University) and Michael I. Jordan (UC Berkeley). Both are prominent researchers in machine learning. Michael I. Jordan is a world-renowned figure, particularly known for his foundational work at the intersection of statistics and computer science, including contributions to Bayesian nonparametrics.
  • Journal/Conference: The paper is available on arXiv, which is a preprint server. This version (v2) was submitted in November 2011. While not a peer-reviewed journal publication, it has been influential and widely cited.
  • Publication Year: 2011
  • Abstract: The authors aim to combine the flexibility of Bayesian models with the simplicity of classical clustering methods like k-means. They start from the known connection between k-means and Gaussian Mixture Models and extend this to Bayesian nonparametrics. They show that a Gibbs sampling algorithm for a Dirichlet Process (DP) mixture model, under a specific limiting condition, becomes a hard clustering algorithm. This new algorithm, which they call DP-means, is shown to minimize a k-means-like objective function that includes a penalty for the number of clusters, thus automatically determining the cluster count. This analysis is extended to the Hierarchical Dirichlet Process (HDP) to develop a method for clustering multiple datasets with shared clusters. The paper also presents extensions to spectral clustering and graph clustering.

2. Executive Summary

  • Background & Motivation (Why):

    • The paper addresses a fundamental trade-off in clustering. On one hand, Bayesian nonparametric models (like Dirichlet Process mixtures) are highly flexible; they can model complex data structures and, most importantly, infer the number of clusters from the data itself. However, they typically rely on complex and computationally intensive inference methods like Gibbs sampling, which can be slow and hard to scale.
    • On the other hand, k-means clustering is simple, fast, scalable, and widely used in practice. Its major drawback is its rigidity: the number of clusters, $k$, must be specified in advance, and it struggles with non-spherical clusters.
    • The paper seeks to bridge this gap, asking: Can we derive a simple, scalable, k-means-like "hard clustering" algorithm that inherits the flexibility of Bayesian nonparametrics, especially the ability to automatically determine the number of clusters?
  • Main Contributions / Findings (What):

    1. DP-means Algorithm: The authors derive a new hard clustering algorithm, DP-means, by taking the small-variance asymptotic limit of a Gibbs sampler for a Dirichlet Process (DP) mixture model. This algorithm is nearly as simple as k-means but includes a threshold parameter, $\lambda$, to decide whether to assign a point to an existing cluster or create a new one.
    2. A Penalized Objective Function: They prove that DP-means monotonically minimizes a novel objective function: the standard k-means sum-of-squared-errors plus a penalty term $\lambda k$, which penalizes the number of clusters. This provides a formal justification for the algorithm and guarantees convergence to a local minimum.
    3. Hierarchical Clustering Algorithm (Hard Gaussian HDP): They extend the same asymptotic analysis to the Hierarchical Dirichlet Process (HDP), a model for clustering grouped data. This results in a new algorithm for finding shared "global" clusters across multiple datasets while allowing for dataset-specific "local" clusters. This algorithm also minimizes an intuitive penalized objective.
    4. Novel Extensions: They demonstrate the power of their framework by deriving two further extensions:
      • A spectral clustering method where the number of clusters is determined by thresholding the eigenvalues of the kernel matrix, rather than taking the top $k$ eigenvectors.
      • A graph clustering algorithm for the normalized cut objective that does not require fixing the number of clusters beforehand.

3. Prerequisite Knowledge & Related Work

  • Foundational Concepts:

    • k-means Clustering: An iterative algorithm that partitions $n$ data points into $k$ predefined clusters. It works by alternately assigning each point to the cluster with the nearest mean (centroid) and then recalculating the centroid of each cluster. Its goal is to minimize the within-cluster sum of squared distances.
    • Gaussian Mixture Model (GMM): A probabilistic model that assumes the data is generated from a mixture of several Gaussian (normal) distributions with unknown parameters. Each Gaussian represents a cluster. Unlike k-means' "hard" assignments, GMM provides "soft" assignments (probabilities of a point belonging to each cluster). Parameters are typically estimated using the Expectation-Maximization (EM) algorithm.
    • Bayesian Nonparametrics: A branch of Bayesian statistics that allows models to have an "infinite" number of parameters. For clustering, this means the number of clusters is not fixed in advance but can grow as more data is observed. This provides great modeling flexibility.
    • Dirichlet Process (DP): The cornerstone of Bayesian nonparametrics for mixture models. A DP is a "distribution over distributions." A sample drawn from a DP is itself a discrete probability distribution. When used as a prior in a mixture model (a DP mixture), it allows for an infinite number of potential clusters, with the model automatically selecting a finite subset of them to explain the data.
    • Gibbs Sampling: A Markov chain Monte Carlo (MCMC) algorithm for obtaining a sequence of observations from a specified multivariate probability distribution when direct sampling is difficult. It is commonly used for inference in complex Bayesian models by iteratively sampling each variable from its conditional distribution given the current values of all other variables.
    • Hierarchical Dirichlet Process (HDP): An extension of the DP used for modeling grouped data. It allows different groups (e.g., different documents or datasets) to share a common set of clusters, but with different proportions. For instance, in topic modeling, all documents draw from a common set of topics (global clusters), but each document has its own distribution over those topics.
  • Previous Works & Differentiation:

    • The paper is inspired by the well-known asymptotic connection between GMMs and k-means: if all clusters in a GMM share a spherical covariance $\sigma^2 I$ and the variance $\sigma^2$ is driven to zero, the EM updates become identical to the k-means updates, and the soft assignments become hard assignments to the nearest centroid (this limit is restated compactly after this list).
    • This paper's key innovation is to apply a similar asymptotic argument not to the standard GMM, but to a Bayesian nonparametric (DP) mixture model. Instead of analyzing the EM algorithm, they analyze a Gibbs sampler.
    • This approach is distinct from previous works. While other methods have tried to make k-means more flexible, they typically do not arise from such a principled Bayesian foundation. And while standard DP mixture inference is powerful, it remains slow. The authors' method uniquely combines the theoretical elegance of Bayesian nonparametrics with the practical efficiency of a hard clustering algorithm.
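
For orientation, the classical limit can be stated in one line (a standard derivation restated here for context, not taken from this paper): in a GMM with equal mixing weights and covariance $\sigma^2 I$, the EM responsibility of cluster $c$ for point $\pmb{x}_i$ satisfies

$$\gamma_{ic} = \frac{\exp\!\big(-\tfrac{1}{2\sigma^2}\|\pmb{x}_i-\pmb{\mu}_c\|^2\big)}{\sum_{c'}\exp\!\big(-\tfrac{1}{2\sigma^2}\|\pmb{x}_i-\pmb{\mu}_{c'}\|^2\big)} \;\xrightarrow{\;\sigma^2\to 0\;}\; \begin{cases} 1 & \text{if } c=\arg\min_{c'}\|\pmb{x}_i-\pmb{\mu}_{c'}\|^2,\\ 0 & \text{otherwise,} \end{cases}$$

so the E-step collapses to the hard nearest-centroid assignment of k-means and the M-step becomes the usual centroid update.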

4. Methodology (Core Technology & Implementation)

4.1. From DP Gibbs Sampler to DP-means

The core idea is to analyze the behavior of a specific Gibbs sampler for a DP mixture of Gaussians in the limit as the cluster variance $\sigma \to 0$.

  • Model Setup:

    • The data points $\pmb{x}_i$ are assumed to be drawn from Gaussian distributions $\mathcal{N}(\phi_i, \sigma I)$.
    • The means $\phi_i$ are drawn from a distribution $G$, which is itself drawn from a Dirichlet Process: $G \sim \text{DP}(\alpha, G_0)$.
    • $G_0$ is the "base measure" or prior over the cluster means. The paper chooses a Gaussian prior: $G_0 = \mathcal{N}(\mathbf{0}, \rho I)$.
  • Gibbs Sampling Probabilities: In a Gibbs sampler for this model, for each data point $\pmb{x}_i$ we decide whether to assign it to an existing cluster $c$ or create a new one. The (unnormalized) probability of assigning $\pmb{x}_i$ to an existing cluster $c$ (containing $n_{-i,c}$ other points) is
    $$p(\text{assign to } c) \propto n_{-i,c} \cdot \mathcal{N}(\pmb{x}_i \mid \pmb{\mu}_c, \sigma I),$$
    and the (unnormalized) probability of creating a new cluster is
    $$p(\text{new cluster}) \propto \alpha \int \mathcal{N}(\pmb{x}_i \mid \pmb{\mu}, \sigma I)\, dG_0(\pmb{\mu}).$$
    With the chosen Gaussian prior $G_0$, this integral can be computed in closed form.

  • The Asymptotic Limit: The authors do not simply let $\sigma \to 0$. To obtain a non-trivial limit, they define the DP concentration parameter $\alpha$ as a function of $\sigma$:
    $$\alpha = (1 + \rho/\sigma)^{d/2} \exp\!\left(-\frac{\lambda}{2\sigma}\right),$$
    where $\lambda$ is a new hyperparameter. Substituting this into the Gibbs probabilities and simplifying, the posterior probability of assigning point $i$ to an existing cluster $c$ becomes proportional to
    $$\hat{\gamma}(z_{ic}) \propto n_{-i,c} \cdot \exp\!\left(-\frac{1}{2\sigma} \|\pmb{x}_i - \pmb{\mu}_c\|^2\right),$$
    and the probability of creating a new cluster becomes proportional to
    $$\hat{\gamma}(z_{i,\text{new}}) \propto \exp\!\left(-\frac{1}{2\sigma}\left[\lambda + \frac{\sigma}{\rho+\sigma}\|\pmb{x}_i\|^2\right]\right).$$
    As $\sigma \to 0$, the exponential terms dominate: the option whose bracketed value is smallest (the nearest squared distance $\|\pmb{x}_i - \pmb{\mu}_c\|^2$ for an existing cluster, or effectively $\lambda$ for a new one) receives probability 1 in the limit, and all other options vanish. The decision rule becomes:

    • Find the cluster $c^*$ that minimizes the squared distance $\|\pmb{x}_i - \pmb{\mu}_c\|^2$.
    • Compare this minimum squared distance to $\lambda$.
    • If $\|\pmb{x}_i - \pmb{\mu}_{c^*}\|^2 \le \lambda$, assign $\pmb{x}_i$ to cluster $c^*$.
    • If $\|\pmb{x}_i - \pmb{\mu}_{c^*}\|^2 > \lambda$, create a new cluster for $\pmb{x}_i$.
  • Mean Update: In the Gibbs sampler, after updating the assignments, the cluster means are resampled from their posterior. For a cluster $c$ with $n_c$ points and empirical mean $\bar{\pmb{x}}_c$, the posterior mean is $\tilde{\pmb{\mu}}_c = \left(1 + \frac{\sigma}{\rho n_c}\right)^{-1} \bar{\pmb{x}}_c$. As $\sigma \to 0$, this posterior mean converges to the empirical mean $\bar{\pmb{x}}_c$, which is exactly the mean update step of k-means.

This leads to Algorithm 1: DP-means. It iterates through the points, assigning each to the closest cluster centroid unless the squared distance exceeds $\lambda$, in which case a new cluster is created with that point as its first member. After each pass through the data (or after each point assignment), the centroids are recomputed as the mean of their assigned points. A minimal code sketch is given below.
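
To make the procedure concrete, here is a minimal Python sketch of the DP-means iteration. It paraphrases Algorithm 1 as described above; the function name `dp_means`, the `max_iter` safeguard, the convergence check, and the single-global-cluster initialization are implementation choices of this sketch, not prescriptions from the paper.

```python
import numpy as np

def dp_means(X, lam, max_iter=100):
    """Minimal DP-means sketch.

    X   : (n, d) array of data points
    lam : cluster penalty; a point whose squared distance to every current
          centroid exceeds lam starts a new cluster
    Returns hard assignments z and the final centroids.
    """
    n, _ = X.shape
    centroids = [X.mean(axis=0)]          # start from a single global cluster
    z = np.zeros(n, dtype=int)            # hard assignments

    for _ in range(max_iter):
        changed = False
        # Assignment step: nearest centroid, or a brand-new cluster if too far.
        for i in range(n):
            d2 = np.array([np.sum((X[i] - mu) ** 2) for mu in centroids])
            c = int(np.argmin(d2))
            if d2[c] > lam:
                centroids.append(X[i].copy())
                c = len(centroids) - 1
            if z[i] != c:
                z[i] = c
                changed = True
        # Mean-update step: empirical mean of each non-empty cluster
        # (the k-means update recovered in the sigma -> 0 limit).
        labels = np.unique(z)
        centroids = [X[z == lbl].mean(axis=0) for lbl in labels]
        z = np.searchsorted(labels, z)    # relabel clusters as 0..k-1
        if not changed:
            break
    return z, np.array(centroids)
```

Unlike k-means, the only tuning knob is $\lambda$: the number of clusters emerges from the assignment rule itself.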

4.2. Underlying Objective Function

The paper proves that the DP-means algorithm monotonically minimizes the following objective function:
$$\min_{\{\ell_c\}_{c=1}^k} \; \sum_{c=1}^k \sum_{\pmb{x} \in \ell_c} \| \pmb{x} - \pmb{\mu}_c \|^2 + \lambda k$$
where the $\ell_c$ are the clusters, $\pmb{\mu}_c$ are their means, and $k$ is the total number of clusters.

  • Interpretation: This is the standard k-means objective (minimizing within-cluster variance) plus a penalty term $\lambda k$ that is linear in the number of clusters. The parameter $\lambda$ controls the trade-off: a larger $\lambda$ imposes a higher cost for creating new clusters, leading to fewer clusters, and vice versa. This objective is related to information criteria such as AIC and BIC, which also penalize model complexity. (A small helper for evaluating the objective follows below.)
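
As a quick sanity check of the monotonicity claim, one can evaluate this objective after every DP-means iteration and verify that it never increases. The helper below is our own naming, not code from the paper.

```python
import numpy as np

def dp_means_objective(X, z, centroids, lam):
    """Penalized k-means objective: within-cluster SSE + lambda * (number of clusters)."""
    sse = sum(np.sum((X[z == c] - mu) ** 2) for c, mu in enumerate(centroids))
    return sse + lam * len(centroids)

# Typical use: call this after each pass of the dp_means sketch above and
# confirm the returned value is non-increasing across iterations.
```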

4.3. Hierarchical Clustering: Hard Gaussian HDP

The same logic is extended to the Hierarchical Dirichlet Process (HDP) to handle multiple datasets ($j = 1, \dots, D$). The HDP model allows datasets to share a set of global clusters.

  • Intuition: Each dataset jj has its own set of "local" clusters. Each local cluster is then associated with one of a shared pool of "global" clusters. This structure is depicted in Figure 1.

    Figure 1 (caption, paraphrased): Overview of clustering with multiple data sets. Each data set has some number of local clusters, and each local cluster is associated with a global mean $\mu_p$; each global mean is computed from all points assigned to it across data sets. When points are reassigned, the objective penalizes the creation of new local or global clusters.

  • Algorithm and Objective: The asymptotic analysis of an HDP Gibbs sampler yields Algorithm 2: Hard Gaussian HDP. This algorithm involves two penalty parameters:

    • $\lambda_\ell$: Penalty for creating a new local cluster within a dataset.

    • $\lambda_g$: Penalty for creating a new global cluster to be shared across datasets.

      The algorithm minimizes the following objective:
      $$\min_{\{\ell_p\}_{p=1}^g} \; \sum_{p=1}^g \sum_{\pmb{x}_{ij} \in \ell_p} \| \pmb{x}_{ij} - \pmb{\mu}_p \|_2^2 + \lambda_\ell k + \lambda_g g$$
      Here, the $\ell_p$ are the global clusters with means $\pmb{\mu}_p$, $g$ is the total number of global clusters, and $k = \sum_j k_j$ is the total number of local clusters across all datasets. The algorithm iteratively assigns points to local clusters and local clusters to global clusters, creating new ones whenever the cost of reusing an existing cluster exceeds the corresponding penalty. (A sketch of how this objective is evaluated follows below.)
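
The bookkeeping behind this objective can be made explicit with a small evaluation routine. The data layout below (per-dataset local assignments plus a local-to-global map) is our own choice for illustration, not notation from the paper.

```python
import numpy as np

def hard_hdp_objective(datasets, local_z, local_to_global, global_means,
                       lam_local, lam_global):
    """Evaluate the penalized hard Gaussian HDP objective.

    datasets        : list of (n_j, d) arrays, one per dataset j
    local_z         : list of length-n_j arrays of local-cluster indices
    local_to_global : list of arrays; entry p of dataset j gives the global
                      cluster that local cluster p is associated with
    global_means    : (g, d) array of global means mu_p
    """
    sse, k = 0.0, 0
    for X, z, l2g in zip(datasets, local_z, local_to_global):
        k += len(l2g)                              # count local clusters
        for local_c, global_p in enumerate(l2g):
            pts = X[z == local_c]
            sse += np.sum((pts - global_means[global_p]) ** 2)
    g = len(global_means)
    return sse + lam_local * k + lam_global * g
```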

4.4. Further Extensions

  • Spectral Meets Nonparametric: The authors show a connection to spectral clustering. The DP-means objective can be rewritten as a trace maximization problem:
    $$\max_Y \; \operatorname{tr}\!\left(Y^T (K - \lambda I) Y\right)$$
    where $K$ is the kernel matrix ($K_{ij} = \pmb{x}_i^T \pmb{x}_j$) and $Y$ is a normalized cluster indicator matrix. Relaxing the constraints on $Y$ to any orthonormal matrix, the solution is given by the eigenvectors of $K$ corresponding to all eigenvalues greater than $\lambda$. This is a beautiful result:

    • Standard spectral clustering takes the top k eigenvectors.
    • This nonparametric version takes all eigenvectors whose "energy" (eigenvalue) surpasses the threshold $\lambda$. The number of clusters is thus determined by the data's spectral properties and $\lambda$ (a minimal sketch of this thresholding appears after this list).
  • Graph Clustering: The paper leverages known connections between weighted kernel k-means and graph cuts. By extending their DP-means objective to the weighted kernel case, they show it is equivalent to minimizing a penalized normalized cut objective: $\text{Cut}(A) + \lambda' k$. This allows them to develop a version of the popular normalized cut algorithm that does not require specifying $k$ in advance.
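
A minimal sketch of the thresholded-eigenvector relaxation is shown below. The RBF kernel, its bandwidth, the toy data, and the value of $\lambda$ are illustrative choices of this sketch, not taken from the paper (which discusses the linear-kernel case $K_{ij} = \pmb{x}_i^T \pmb{x}_j$).

```python
import numpy as np

def thresholded_spectral_clusters(K, lam):
    """Relaxed DP-means in kernel form: keep every eigenvector of the
    symmetric kernel matrix K whose eigenvalue exceeds lambda."""
    eigvals, eigvecs = np.linalg.eigh(K)          # ascending eigenvalues
    keep = eigvals > lam
    return eigvecs[:, keep], eigvals[keep]

# Toy usage with an RBF kernel: three well-separated Gaussian blobs give a
# nearly block-diagonal K, so roughly three eigenvalues rise above lambda.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=m, scale=0.3, size=(30, 2))
               for m in ([0.0, 0.0], [5.0, 5.0], [-5.0, 5.0])])
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / 2.0)                       # RBF kernel, bandwidth 1
Y, kept = thresholded_spectral_clusters(K, lam=10.0)
print("relaxed cluster count:", Y.shape[1])       # expected: 3 for this toy setup
```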

5. Experimental Setup

  • Datasets:

    • Synthetic Data: A simple 2D dataset with 3 Gaussian clusters was used for illustration (Figure 2a). Another synthetic setup involved 50 datasets for the hard Gaussian HDP experiment (Figure 2d).
    • UCI Datasets: A collection of standard real-world benchmark datasets from the UCI Machine Learning Repository: Wine, Iris, Pima, Soybean, Car, Balance Scale, Breast Cancer, and Vehicle.
  • Evaluation Metrics:

    • Normalized Mutual Information (NMI): This metric measures the agreement between two clusterings (e.g., the algorithm's output and the ground-truth labels). It takes values between 0 (no mutual information) and 1 (perfect agreement).
      1. Conceptual Definition: NMI quantifies the "information" that the predicted cluster labels $Y$ provide about the true class labels $C$. It is normalized by the entropies of the two label sets, making scores comparable across clusterings with different numbers of clusters. A higher NMI indicates a better match.
      2. Mathematical Formula:
      $$\text{NMI}(Y, C) = \frac{2 \cdot I(Y, C)}{H(Y) + H(C)}$$
      3. Symbol Explanation:
        • $Y$: The set of cluster assignments produced by the algorithm.
        • $C$: The set of ground-truth class labels.
        • $I(Y, C)$: The mutual information between $Y$ and $C$. It measures the dependency between the two sets of labels.
        • $H(Y)$ and $H(C)$: The entropy of the cluster assignments and of the ground-truth labels, respectively. Entropy measures the uncertainty or "information content" of a distribution.
  • Baselines:

    • k-means: The classical algorithm, run with the true number of clusters $k$.
    • Gibbs: A standard Gibbs sampler for a DP mixture model. This represents the full (but slow) Bayesian nonparametric approach.
  • Parameter Selection:

    • For DP-means, the crucial parameter $\lambda$ was set using a farthest-first heuristic. To target a desired number of clusters $k$, the heuristic iteratively builds a set of $k$ points, where each new point added is the one farthest from the current set; the final distance threshold becomes $\lambda$. This makes it easy to choose $\lambda$ so that DP-means finds approximately the desired number of clusters (see the sketch after this list).
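
A sketch of this heuristic is given below. Details that the text leaves open, such as starting from the global mean and returning a squared distance (to match the squared-distance comparison in DP-means), are assumptions of this sketch.

```python
import numpy as np

def farthest_first_lambda(X, k):
    """Choose lambda via the farthest-first heuristic (our reading of Section 5).

    Greedily grow a set of k anchors, each new anchor being the point whose
    squared distance to the current set is largest; return the squared
    distance at which the last anchor was added.
    """
    anchors = [X.mean(axis=0)]                    # assumed starting point
    lam = 0.0
    while len(anchors) < k:
        # Squared distance of every point to its nearest anchor so far.
        d2 = np.min([np.sum((X - a) ** 2, axis=1) for a in anchors], axis=0)
        idx = int(np.argmax(d2))
        lam = float(d2[idx])
        anchors.append(X[idx])
    return lam

# Hypothetical end-to-end usage, reusing the dp_means sketch from Section 4.1:
#   lam = farthest_first_lambda(X, k=3)
#   z, centroids = dp_means(X, lam)
#   from sklearn.metrics import normalized_mutual_info_score
#   nmi = normalized_mutual_info_score(true_labels, z)
```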

6. Results & Analysis

  • Core Results:

    • Synthetic Data Analysis (Figure 2):

      • Figure 2b is the most telling. It shows the NMI score over 5000 Gibbs sampling iterations. The Gibbs sampler takes thousands of iterations to converge to a good solution. In stark contrast, DP-means is reported to converge in just 8 iterations on average, consistently finding 3 clusters and achieving a high NMI of 0.89. This highlights the massive speed and stability advantage of the derived hard clustering algorithm.

      • Figure 2c demonstrates the role of $\lambda$. As the penalty $\lambda$ increases, the number of clusters found by DP-means decreases, as expected from the objective function. This shows that $\lambda$ is an intuitive and effective control knob.

        Figure 2 (caption, paraphrased): Synthetic results demonstrating advantages of the method. a) A simple data set of 3 Gaussians. b) NMI scores over the first 5000 Gibbs iterations; in contrast, across 100 runs, DP-means always converged (in 8 iterations on average). c) Average number of clusters found by DP-means for different values of $\lambda$. d) An example data set from the hard Gaussian HDP experiment.

    • UCI Dataset Results (Table 1): The table below is a transcription of the results from the paper. It compares the average NMI scores of DP-means, standard k-means (given the true $k$), and Gibbs sampling.

      | Data set      | DP-means | k-means | Gibbs |
      |---------------|----------|---------|-------|
      | Wine          | .41      | .43     | .42   |
      | Iris          | .75      | .76     | .73   |
      | Pima          | .02      | .03     | .02   |
      | Soybean       | .72      | .66     | .73   |
      | Car           | .07      | .05     | .15   |
      | Balance Scale | .17      | .11     | .14   |
      | Breast Cancer | .04      | .03     | .04   |
      | Vehicle       | .18      | .18     | .17   |
      • Analysis: The results show that DP-means performs comparably to both k-means and the full Gibbs sampler. It is not uniformly better, but it is consistently competitive across a range of datasets. This is a strong result, as DP-means achieves this performance without needing $k$ to be specified, while being significantly faster and simpler to implement than Gibbs sampling.

7. Conclusion & Reflections

  • Conclusion Summary: The paper successfully forges a principled connection between Bayesian nonparametrics and classical hard clustering. It derives DP-means, a simple, fast, and scalable algorithm that, like k-means, is easy to implement, but unlike k-means, can determine the number of clusters automatically. The algorithm is shown to optimize an intuitive penalized k-means objective. The authors demonstrate the generality of their framework by extending it to hierarchical clustering (HDP), spectral clustering, and graph cuts, yielding novel algorithms in each domain.

  • Limitations & Future Work: The authors acknowledge one primary limitation: the DP-means algorithm is sequential and its results can depend on the order in which data points are processed. They propose exploring adaptive methods for choosing a better ordering as an area for future work.

  • Personal Insights & Critique:

    • Elegance and Impact: This paper is a beautiful example of theoretical machine learning research that produces a practical and useful result. The derivation of DP-means from a DP mixture is elegant, and the resulting algorithm is brilliantly simple. It provides a strong "why" for a k-means-like algorithm that can infer $k$.
    • Practical Utility: DP-means has become a popular and practical clustering algorithm. It hits a sweet spot: it's much more flexible than k-means but avoids the heavy machinery and slow convergence of MCMC-based Bayesian methods. The penalty parameter $\lambda$ is more intuitive to set than $k$, as it relates to a distance scale in the data.
    • Conceptual Bridge: The work serves as a powerful conceptual bridge. The spectral clustering extension, which connects the penalty $\lambda$ to an eigenvalue threshold, is particularly insightful. It suggests a deep relationship between the "energy" of a cluster (related to an eigenvalue) and the Bayesian prior's preference for simplicity (encoded by $\lambda$).
    • Untested Assumptions: The derivation relies on an asymptotic limit ($\sigma \to 0$). While the resulting algorithm works well in practice, it is an approximation. The performance in real-world, finite-variance scenarios is an empirical question, though the experiments show it is effective. The choice of a spherical covariance $\sigma^2 I$ is also a simplifying assumption, inherited from k-means, meaning the method is best suited for discovering spherical or convex clusters.
