Leveraging BERT and TFIDF Features for Short Text Clustering via Alignment-Promoting Co-Training
TL;DR Summary
The COTC framework combines BERT and TFIDF features for enhanced short text clustering, achieving effective alignment through mutual learning of two modules. Experiments show significant performance improvements over state-of-the-art methods on eight benchmark datasets.
Abstract
BERT and TFIDF features excel in capturing rich semantics and important words, respectively. Since most existing clustering methods are solely based on the BERT model, they often fall short in utilizing keyword information, which, however, is very useful in clustering short texts. In this paper, we propose a CO-Training Clustering (COTC) framework to make use of the collective strengths of BERT and TFIDF features. Specifically, we develop two modules responsible for the clustering of BERT and TFIDF features, respectively. We use the deep representations and cluster assignments from the TFIDF module outputs to guide the learning of the BERT module, seeking to align them at both the representation and cluster levels. Reversely, we also use the BERT module outputs to train the TFIDF module, thus leading to the mutual promotion. We then show that the alternating co-training framework can be placed under a unified joint training objective, which allows the two modules to be connected tightly and the training signals to be propagated efficiently. Experiments on eight benchmark datasets show that our method outperforms current SOTA methods significantly.
In-depth Reading
1. Bibliographic Information
1.1. Title
Leveraging BERT and TFIDF Features for Short Text Clustering via Alignment-Promoting Co-Training
1.2. Authors
Zetong Li, Qinliang Su, Shijing Si, Jianxing Yu
Affiliations:
- School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China
- School of Economics and Finance, Shanghai International Studies University, Shanghai, China
- School of Artificial Intelligence, Sun Yat-sen University, Guangdong, China
1.3. Journal/Conference
Published at EMNLP (Conference on Empirical Methods in Natural Language Processing) 2024. EMNLP is a highly reputable and influential conference in the field of Natural Language Processing, known for showcasing cutting-edge research.
1.4. Publication Year
2024
1.5. Abstract
This paper introduces a novel CO-Training Clustering (COTC) framework for short text clustering, which effectively combines the strengths of Bidirectional Encoder Representations from Transformers (BERT) features and Term Frequency-Inverse Document Frequency (TFIDF) features. BERT excels at capturing rich semantic information, while TFIDF is proficient in identifying important keywords, a crucial aspect often overlooked by BERT-centric methods, especially for domain-specific short texts. The COTC framework comprises two distinct modules, one for BERT features and one for TFIDF features. These modules are designed to mutually enhance each other's learning process by aligning their deep representations and cluster assignments. The outputs (deep representations and cluster assignments) from the TFIDF module guide the learning of the BERT module, and vice-versa, leading to a symbiotic relationship. The authors demonstrate that this alternating co-training approach can be consolidated into a unified joint training objective, allowing for tighter module connections and more efficient propagation of training signals. Extensive experiments on eight benchmark datasets show that COTC significantly surpasses current state-of-the-art methods in clustering performance.
1.6. Original Source Link
- Original Source Link: https://aclanthology.org/2024.emnlp-main.828/
- PDF Link: https://aclanthology.org/2024.emnlp-main.828.pdf
- Publication Status: Officially published at EMNLP 2024.
2. Executive Summary
2.1. Background & Motivation
The core problem addressed by this paper is the accurate clustering of short text segments. Short text clustering is vital for various applications, including topic discovery, news recommendation, and spam detection.
Challenges and Gaps in Prior Research:
- Limited Information in Short Texts: Short texts inherently contain very few words, making it challenging to extract sufficient semantic information for effective clustering.
- Limitations of Traditional TFIDF: While traditional methods used TFIDF features, these features lack deep semantic understanding, leading to suboptimal performance.
- Limitations of BERT-based Methods: The advent of BERT models significantly improved text representation by capturing deep semantics. However, BERT models, primarily trained on general texts, often struggle to capture domain-specific keyword information effectively. For example, in specialized domains, unique technical terms (like QT, QMAKESPEC in StackOverflow data) are crucial for topic identification but might be rare in BERT's pre-training corpus, making BERT less sensitive to their importance. This can restrict clustering performance when keyword information is vital.
- Under-explored Collective Strengths: Prior attempts to combine BERT and TFIDF features often involved simple fusion techniques (e.g., concatenation), which failed to fully leverage the distinct and complementary strengths of both feature types due to their "intrinsically different natures."
Paper's Entry Point / Innovative Idea:
The paper's innovative idea is to recognize and explicitly leverage the complementary strengths of BERT (deep semantics) and TFIDF (keyword signals). Instead of simple fusion, it proposes a sophisticated co-training framework that allows two distinct modules, each specializing in one feature type, to mutually guide and learn from each other. This alignment-promoting co-training aims to overcome the individual shortcomings of BERT-only or TFIDF-only approaches.
2.2. Main Contributions / Findings
The paper makes several significant contributions:
- Proposed COTC Framework: Introduction of a novel CO-Training Clustering (COTC) framework that systematically combines BERT and TFIDF features for short text clustering. The framework designs two modules, one for BERT and one for TFIDF, allowing them to learn complementary information.
- Mutual Promotion through Alignment: Development of specific mechanisms within COTC for mutual promotion. The BERT module uses deep representations and cluster assignments from the TFIDF module for guidance, aligning at both the representation and cluster levels. Conversely, the TFIDF module is trained using outputs from the BERT module.
- Unified Joint Training Objective: Demonstrating that the alternating co-training process can be formulated under a unified joint training objective. This objective allows for a tighter connection between the two modules and more efficient propagation of training signals, empirically leading to better clustering performance.
- Rediscovery of TFIDF Value: Reaffirming the utility of TFIDF features, showing that when properly integrated with deep learning models, they are still valuable, especially for keyword-sensitive tasks, and can complement state-of-the-art BERT features.
- State-of-the-Art Performance: Achieving significant improvements over current state-of-the-art methods on eight benchmark datasets, validating the effectiveness and robustness of the COTC framework.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
3.1.1. Short Text Clustering
Short text clustering is the task of grouping short text segments (e.g., tweets, search queries, news headlines) into clusters such that texts within the same cluster are semantically similar, while texts in different clusters are dissimilar. Unlike text classification, clustering is an unsupervised task, meaning it does not rely on pre-labeled data. The brevity of short texts makes this task particularly challenging as they often lack sufficient context or word count to derive strong semantic signals using traditional methods.
3.1.2. TFIDF (Term Frequency-Inverse Document Frequency)
TFIDF is a statistical measure that evaluates how important a word is to a document in a collection or corpus. The TFIDF value increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus, which helps to adjust for the fact that some words appear more frequently in general.
- Term Frequency (TF): Measures how frequently a term appears in a document. A higher TF often indicates greater importance within that document.
  $ \mathrm{tf}(t, d) = \frac{\text{number of times term } t \text{ appears in document } d}{\text{total number of terms in document } d} $
  Where:
  - $\mathrm{tf}(t, d)$ is the term frequency of term $t$ in document $d$.
  - $t$ represents a specific term (word).
  - $d$ represents a document.
- Inverse Document Frequency (IDF): Measures how important a term is across the entire corpus. Rare words tend to have a high IDF score, while common words (like "the", "a") have a low IDF score.
  $ \mathrm{idf}(t, D) = \log\left(\frac{N}{\text{number of documents in } D \text{ containing } t}\right) $
  Where:
  - $\mathrm{idf}(t, D)$ is the inverse document frequency of term $t$ in the corpus $D$.
  - $N$ is the total number of documents in the corpus $D$.
  - $\log$ is typically the natural logarithm.
- TFIDF Score: The TFIDF score is the product of TF and IDF.
  $ \mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \times \mathrm{idf}(t, D) $
  Where:
  - $\mathrm{tfidf}(t, d, D)$ is the TFIDF score of term $t$ in document $d$ within corpus $D$.

TFIDF features are word-frequency based and excel at capturing keyword information, which is very useful for topics often defined by specific terms. A small code sketch of this computation follows.
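To make the computation concrete, here is a minimal sketch of building TFIDF vectors for a few short texts with scikit-learn. The toy sentences are illustrative assumptions, not from the paper, and scikit-learn's IDF uses a smoothed variant of the textbook formula above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus of short texts (illustrative only).
docs = [
    "how to parse xml with simplexml in php",
    "parse json string in python",
    "best xml parser for java",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)   # sparse matrix of shape (3, vocab_size)

# Inspect the highest-weighted terms of the first document.
vocab = vectorizer.get_feature_names_out()
row = tfidf[0].toarray().ravel()
top = row.argsort()[::-1][:3]
print([(vocab[i], round(row[i], 3)) for i in top])
```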
3.1.3. BERT (Bidirectional Encoder Representations from Transformers)
BERT is a powerful pre-trained language model developed by Google, based on the Transformer architecture. It revolutionized Natural Language Processing (NLP) by introducing bidirectional training, meaning it processes text by considering the context from both the left and right sides of a word simultaneously. This allows BERT to generate deep contextualized embeddings (vector representations of words) that capture rich semantic meanings based on the surrounding words in a sentence. BERT is typically pre-trained on a massive amount of text data using tasks like Masked Language Modeling (predicting masked words) and Next Sentence Prediction. After pre-training, it can be fine-tuned for various downstream NLP tasks with minimal architecture changes.
BERT features are known for capturing deep semantic information and contextual understanding.
3.1.4. K-Means Clustering
K-Means is a popular unsupervised learning algorithm used for clustering. The goal of K-Means is to partition $N$ observations into $K$ clusters, where each observation belongs to the cluster with the nearest mean (centroid), which serves as a prototype of the cluster.
The basic K-Means algorithm involves the following steps (a minimal sketch follows the list):
- Initialization: Randomly selecting $K$ initial centroids.
- Assignment Step: Assigning each data point to its closest centroid.
- Update Step: Recalculating the centroids as the mean of all data points assigned to that cluster.
Steps 2 and 3 are repeated until the centroids no longer change significantly or a maximum number of iterations is reached.
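A minimal sketch of this loop using scikit-learn on random 2-D points; the data and parameters are assumptions purely for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# 300 random 2-D points drawn around three illustrative centers.
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2)) for c in [(0, 0), (5, 5), (0, 5)]])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)      # assignment + update steps iterated internally

print(kmeans.cluster_centers_)      # final centroids
print(labels[:10])                  # cluster IDs of the first 10 points
```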
3.1.5. Deep Embedded Clustering (DEC)
Deep Embedded Clustering (DEC) is a deep learning-based approach to clustering that simultaneously learns feature representations and cluster assignments. It aims to map data points into a lower-dimensional latent space using a deep neural network (e.g., an autoencoder) and then performs clustering in this learned space. DEC iteratively refines both the feature representations and the cluster assignments by optimizing a clustering objective that encourages assigning data points to their high-confidence clusters. It often uses a Kullback-Leibler (KL) divergence-based loss to match a soft assignment distribution to a target distribution that sharpens the cluster assignments.
3.1.6. Contrastive Learning
Contrastive learning is a self-supervised learning paradigm where a model learns representations by comparing similar and dissimilar samples. The core idea is to "contrast" data points: make representations of positive pairs (augmented versions of the same sample, or semantically similar samples) closer in the embedding space, and representations of negative pairs (different samples) farther apart. This is often achieved using a contrastive loss function (e.g., NT-Xent loss), which encourages discriminative representations without explicit human-provided labels.
3.1.7. Pseudo-labelling
Pseudo-labelling is a semi-supervised learning technique where a model is trained on a small amount of labeled data and then used to predict labels for the unlabeled data (these predictions are called pseudo-labels). The model is then retrained on the combination of original labeled data and the newly pseudo-labeled data. This iterative process helps the model to leverage the vast amount of unlabeled data available. In clustering, it can be used to generate target assignments (pseudo-labels) for samples, which can then be used to train a classifier-like clustering head.
3.1.8. Variational Autoencoder (VAE)
A Variational Autoencoder (VAE) is a generative neural network model that learns a compressed, continuous latent representation of input data. It consists of two main parts:
- Encoder: Takes an input (e.g., an image, text) and maps it to a distribution over a latent space (typically a Gaussian distribution, defined by mean and variance vectors). This makes the latent space probabilistic rather than deterministic.
- Decoder: Takes a sample from the latent distribution and reconstructs the original input.

VAEs are trained to minimize a loss function that combines two terms: a reconstruction loss (how well the output matches the input) and a KL-divergence loss (which regularizes the latent space, forcing the latent distributions to be close to a prior distribution, often a standard normal distribution). This ensures that the latent space is well-structured and allows for generating new samples by sampling from the prior.
3.1.9. Optimal Transport (OT)
Optimal Transport (OT) is a mathematical framework for comparing probability distributions. In the context of machine learning, especially for pseudo-labelling in clustering, OT can be used to find a mapping (or "transport plan") that transforms one distribution of predicted cluster probabilities into another target distribution (e.g., a uniform distribution across clusters). This helps to prevent mode collapse (where all samples are assigned to a single cluster) and ensures a balanced distribution of samples across all clusters. The goal is to minimize the "cost" of transforming one distribution into another.
3.1.10. KL-Divergence (Kullback-Leibler Divergence)
The Kullback-Leibler (KL) divergence is a non-symmetric measure of how one probability distribution $P$ differs from a second, reference probability distribution $Q$. It quantifies the amount of information lost when $Q$ is used to approximate $P$.
$
D_{\mathrm{KL}}(P \| Q) = \sum_{x \in \mathcal{X}} P(x) \log\left(\frac{P(x)}{Q(x)}\right)
$
Where:
- $P(x)$ is the probability of event $x$ under distribution $P$.
- $Q(x)$ is the probability of event $x$ under distribution $Q$.
- $\mathcal{X}$ is the set of all possible events.
A KL-divergence of zero indicates that the two distributions are identical. Higher values indicate greater dissimilarity. It is often used as a regularization term or an alignment loss to encourage one distribution to match another.
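As a quick numerical illustration of the formula above, the two toy distributions below are arbitrary assumptions:

```python
import numpy as np
from scipy.special import rel_entr  # elementwise p * log(p / q)

p = np.array([0.7, 0.2, 0.1])   # "true" distribution P
q = np.array([0.5, 0.3, 0.2])   # approximation Q

kl_pq = rel_entr(p, q).sum()    # D_KL(P || Q)
kl_qp = rel_entr(q, p).sum()    # D_KL(Q || P), generally different (non-symmetric)
print(round(kl_pq, 4), round(kl_qp, 4))
```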
3.1.11. Gumbel-Softmax Trick
The Gumbel-Softmax trick (or Concrete distribution) is a technique used to enable gradient-based optimization for models that involve sampling from discrete categorical distributions. Standard sampling from discrete distributions is not differentiable, preventing gradients from flowing back through the sampling step. The Gumbel-Softmax trick approximates a discrete sample using a continuous, differentiable function, allowing the model to be trained end-to-end with backpropagation. It introduces Gumbel noise to the logits of the categorical distribution and then applies a softmax function with a temperature parameter. As the temperature approaches zero, the output approximates a one-hot sample from the categorical distribution.
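A minimal sketch of the trick with PyTorch tensors; the class probabilities and temperature are illustrative assumptions (PyTorch also ships `torch.nn.functional.gumbel_softmax` with the same behavior).

```python
import torch
import torch.nn.functional as F

def gumbel_softmax_sample(logits: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    """Draw a differentiable, approximately one-hot sample from a categorical distribution."""
    gumbel = -torch.log(-torch.log(torch.rand_like(logits) + 1e-20) + 1e-20)  # Gumbel(0, 1) noise
    return F.softmax((logits + gumbel) / tau, dim=-1)

logits = torch.log(torch.tensor([0.1, 0.6, 0.3]))   # log-probabilities of 3 classes
sample = gumbel_softmax_sample(logits, tau=0.5)
print(sample, sample.argmax().item())               # close to one-hot as tau -> 0
```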
3.2. Previous Works
The paper contextualizes its approach by reviewing the evolution of short text clustering methods:
- Early Studies (BoW/TFIDF + K-Means/Hierarchical Clustering):
  - Banerjee et al., 2007; Hu et al., 2009: These works focused on enriching Bag-of-Words (BoW) features with external knowledge (e.g., Wikipedia) before applying traditional clustering algorithms like K-Means or hierarchical agglomerative clustering.
  - TFIDF-K-Means: A baseline that directly applies K-Means to TFIDF features.
  - K-Means_IC (Rakib et al., 2020): Applies K-Means to TFIDF features enhanced with an iterative classification algorithm.
  - Limitation: These methods suffered from the sparsity of BoW features and their inability to capture deep semantic information.
- Word2Vec-based Methods:
  - Yang et al., 2017; Guo et al., 2017: Used neural networks to learn better representations from raw features.
  - Xu et al., 2017; Hadifar et al., 2019: Utilized Word2Vec embeddings, which provide dense, lower-dimensional representations, for clustering, often combined with the DEC (Deep Embedded Clustering) loss.
  - STC2-LPI (Xu et al., 2017): Uses Word2Vec to fit codes pre-obtained with LPI through a CNN, then clusters the deep representations with K-Means.
  - Self-Train (Hadifar et al., 2019): Employs an autoencoder to model Word2Vec embeddings enhanced with SIF (Smooth Inverse Frequency) and fine-tunes the encoder with the DEC loss.
  - Limitation: Word2Vec embeddings are shallow and cannot capture deep contextual semantic information as effectively as later models.
- BERT-based Methods:
  - Inspired by the success of BERT (Devlin et al., 2019), recent methods apply clustering heads on top of BERT features.
  - Huang et al., 2020: One of the first to fine-tune BERT with a Masked Language Model loss and the DEC loss for clustering.
  - SCCL (Zhang et al., 2021): Combines contrastive learning with DEC on BERT features to achieve clustering, believing contrastive learning yields better representations.
  - Yin et al., 2022: Builds on SCCL by adding a topic modeling module to enhance representation semantics.
  - RSTC (Zheng et al., 2023): Addresses DEC's tendency to assign all samples to one cluster by employing pseudo-labelling (via Optimal Transport (OT)) for clustering, achieving state-of-the-art performance among BERT-only methods.
  - BERT-K-Means: A baseline applying K-Means directly to BERT features.
  - Limitation: While powerful for semantics, BERT struggles with domain-specific keywords if they are rare in its pre-training corpus (as highlighted in Figure 1).
- Hybrid Baselines (Simple Fusion): The paper also introduces simple fusion baselines to show that naive combinations are not enough.
  - RSTCBERT-TFIDF-Linear: Linearly combines BERT features with a dimensionality-reduced TFIDF feature, then feeds the result into RSTC.
  - RSTCBERT-TFIDF-Concat-1: Concatenates BERT features with a dimensionality-reduced TFIDF feature, then feeds the result into RSTC.
  - RSTCBERT-TFIDF-Concat-2: Concatenates BERT features with the original TFIDF feature, then feeds the result into RSTC.
  - Limitation: These intuitive fusions often fail to fully exploit the distinct natures of BERT and TFIDF features.
- TFIDF-based VAE:
  - GMVAE (Jiang et al., 2017): A baseline that models TFIDF features via a VAE with a Gaussian mixture prior. This is used to benchmark the TFIDF module of COTC.
3.3. Technological Evolution
The field of short text clustering has evolved from simple statistical methods to sophisticated deep learning approaches:
- Early Statistical Methods (pre-2010s): Focused on Bag-of-Words (BoW) or TFIDF representations, often enriched with external knowledge, combined with traditional clustering algorithms like K-Means or hierarchical clustering. These suffered from sparsity and a lack of semantic understanding.
- Distributed Word Embeddings (2013 onwards): The introduction of Word2Vec (Mikolov et al., 2013) provided dense, continuous word embeddings, enabling better semantic representation than BoW. Methods like DEC were applied to these embeddings, marking a shift towards neural network-based representation learning.
- Contextualized Embeddings (2018 onwards): The Transformer architecture and models like BERT (Devlin et al., 2019) brought a revolution. BERT's ability to generate contextualized embeddings captured deep, dynamic semantics, leading to significant performance gains across many NLP tasks, including clustering. Recent works focused on fine-tuning BERT with contrastive learning or pseudo-labelling for clustering.

This paper's work, COTC, fits into this timeline by proposing a hybrid approach. It acknowledges the strengths of BERT but recognizes its limitations in specific contexts (keywords). It then re-integrates the "seemingly outdated but easily available TFIDF features" through a sophisticated co-training mechanism, rather than abandoning them. This represents an evolution from purely BERT-centric approaches to a more multi-modal feature learning strategy for clustering.
3.4. Differentiation Analysis
Compared to the main methods in related work, the core differences and innovations of COTC are:
- Beyond Single-Feature Reliance: Most previous SOTA methods (e.g., SCCL, RSTC) rely solely on BERT features, and earlier ones on TFIDF or Word2Vec. COTC fundamentally distinguishes itself by acknowledging that no single feature type is perfect and that BERT and TFIDF have complementary strengths (deep semantics vs. keyword signals).
- Sophisticated Co-Training vs. Simple Fusion: While some baselines (e.g., RSTCBERT-TFIDF-Linear/Concat) attempt to combine BERT and TFIDF, they do so through naive fusion (concatenation, linear combination). COTC proposes a co-training framework with two dedicated modules that mutually promote each other. This is a much deeper integration than simple feature concatenation.
- Explicit Alignment at Multiple Levels: COTC explicitly enforces alignment between the two modules at both the representation level (using similarity graphs from one module to guide contrastive learning in the other) and the cluster assignment level (aligning predicted cluster probabilities using KL-divergence). This ensures that the complementary information is effectively shared and integrated.
- Unified Joint Training Objective: The paper goes a step further by formulating the alternating co-training into a unified joint training objective. This allows for tighter coupling and more efficient gradient propagation between the BERT and TFIDF modules, which is a significant architectural and training innovation.
- Robustness to BERT's Keyword Weakness: COTC directly addresses BERT's weakness in capturing domain-specific keyword information by explicitly integrating TFIDF features, which are strong in this aspect. This makes COTC more robust for short text clustering in diverse domains.
4. Methodology
4.1. Principles
The core idea behind the CO-Training Clustering (COTC) framework is to leverage the complementary strengths of BERT features (for deep semantics) and TFIDF features (for keyword information) by developing two specialized modules that mutually promote each other's learning. The key principles are:
- Dual-Module Design: Two distinct modules, one for BERT and one for TFIDF, are responsible for generating representations and cluster assignments from their respective feature types.
- Alignment-Promoting Learning: The modules are designed to actively align their learned representations and cluster assignments. This means information from one module (e.g., TFIDF-induced similarity structures) guides the learning of the other module (e.g., BERT's contrastive learning).
- Mutual Promotion: Through this alignment, each module benefits from the unique strengths of the other, leading to improved performance that neither could achieve alone.
- Unified Optimization: The alternating co-training process is consolidated into a single joint training objective to enable more efficient gradient propagation and tighter coupling between the modules.
4.2. Core Methodology In-depth (Layer by Layer)
The COTC framework comprises two primary modules: one responsible for BERT features and one for TFIDF features. Let the original input text be $\pmb{x}_i$. The BERT transformation gives $\pmb{b}_i = \pmb{B}(\pmb{x}_i)$, and the TFIDF transformation gives $\pmb{t}_i = \pmb{T}(\pmb{x}_i)$.
The overall architecture of the COTC framework is shown in Figure 2 from the original paper:
Figure 2: Schematic overview of the COTC co-training framework. The figure shows the structure and role of the BERT module and the TFIDF module and the flow of knowledge between them. The co-training process exchanges deep representations and clustering results between the two modules, promoting collaborative learning and enabling effective information transfer and mutual enhancement; arrows and symbols mark the individual steps and their interactions.
The overall architecture illustrates how the BERT Module and TFIDF Module interact. Each module processes its respective features, generating representations and cluster probabilities. These outputs are then used to guide and align the learning of the other module, facilitating mutual promotion.
4.2.1. Implementation of the BERT Module
The BERT module aims to learn BERT-induced representations $\pmb{h}_i^b$ and cluster probabilities $\pmb{p}_i^b$ from the raw BERT features $\pmb{b}_i$. This module is guided by information from the TFIDF module through two main mechanisms: representation-level alignment via contrastive learning and cluster-level alignment via KL-divergence.
A. Representation-Level Alignment via Contrastive Learning
The paper argues that while BERT and TFIDF representations reside in different spaces, their local topological structures should be aligned. If two texts are similar in the TFIDF space, they should also be similar in the BERT space. To achieve this, the BERT module uses the TFIDF-induced similarity graph to guide its contrastive learning.
- Constructing the TFIDF Similarity Graph $\mathcal{G}^t$: First, a similarity graph $\mathcal{G}^t = (\mathcal{V}, \mathcal{E}^t)$ is constructed using the TFIDF representations $\pmb{h}_i^t$ (which are obtained from the TFIDF module).
  - $\mathcal{V}$ is the set of texts.
  - $\mathcal{E}^t$ is the set of edges.
  - $\mathcal{N}_i^t$ denotes the set of top-$L$ nearest neighbors of text $i$ in the TFIDF representation space, defined as:
  $ \mathcal{N}_i^t = \{j \mid j \neq i \ \& \ \cos(\pmb{h}_i^t, \pmb{h}_j^t) \ \mathrm{is \ top}\text{-}L \ \mathrm{largest}\} $
  Where:
  - $\cos(\pmb{h}_i^t, \pmb{h}_j^t)$ is the cosine similarity between the TFIDF representations of texts $i$ and $j$.
  - "top-$L$ largest" means selecting the $L$ texts with the highest cosine similarity to text $i$.
- Contrastive Loss $\mathcal{L}_{Contr}$: The BERT representation is learned under a contrastive learning framework. For each text $\pmb{x}_i$, three augmentations are generated (a sketch of the neighbor-sampling step follows this subsection):
  - $\pmb{x}_i^{(1)}$ and $\pmb{x}_i^{(2)}$: Obtained by standard contextual augmenter techniques (e.g., Kobayashi, 2018; Ma, 2019).
  - $\pmb{x}_i^{(3)}$: A new augmentation generated by randomly selecting a sample from the TFIDF neighbor set $\mathcal{N}_i^t$.
  These three augmentations are treated as positives. The contrastive loss is defined as:
  $ \mathcal{L}_{Contr} = - \frac{1}{N} \sum_{i=1}^N \sum_{j=2}^3 \log \ell_{ij} $
  Where:
  - $N$ is the total number of texts.
  - $\ell_{ij}$ is the similarity probability for the positive pair $(\pmb{x}_i^{(1)}, \pmb{x}_i^{(j)})$:
  $ \ell_{ij} \triangleq \frac{\Delta(\pmb{h}_i^{b(1)}, \pmb{h}_i^{b(j)})}{\Delta(\pmb{h}_i^{b(1)}, \pmb{h}_i^{b(j)}) + \sum_{k \neq i, m \in \{1, j\}} \Delta(\pmb{h}_i^{b(1)}, \pmb{h}_k^{b(m)})} $
  - $\pmb{h}_i^{b(m)}$ denotes the BERT representation of the $m$-th augmentation of text $i$. It is computed as:
  $ \pmb{h}_i^{b(m)} = f(\pmb{b}_i^{(m)}) = f(\pmb{B}(\pmb{x}_i^{(m)})) $
  Where:
  - $f(\cdot)$ is an MLP (Multi-Layer Perceptron) serving as a projection head.
  - $\pmb{B}(\cdot)$ is the BERT backbone.
  - $\Delta(\cdot, \cdot)$ measures the similarity between BERT representations and involves a temperature parameter $\tau$.
  - The denominator includes similarities with negative samples (other texts and their augmentations).
  Minimizing $\mathcal{L}_{Contr}$ encourages the BERT representations of positive pairs (including TFIDF-neighbors) to be close, thereby aligning the similarity structures in the BERT and TFIDF representation spaces.
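To illustrate the neighbor-based augmentation, here is a minimal sketch of building the top-$L$ neighbor sets from a matrix of TFIDF representations and sampling the third positive view. The tensor shapes, the value of L, and the helper names are assumptions for illustration, not the authors' code.

```python
import numpy as np

def top_l_neighbors(h_t: np.ndarray, L: int = 5) -> list:
    """For each row of h_t (TFIDF representations), return indices of its top-L cosine neighbors."""
    norm = h_t / (np.linalg.norm(h_t, axis=1, keepdims=True) + 1e-12)
    sim = norm @ norm.T                      # pairwise cosine similarities
    np.fill_diagonal(sim, -np.inf)           # exclude self (j != i)
    return [list(np.argsort(-sim[i])[:L]) for i in range(len(h_t))]

rng = np.random.default_rng(0)
h_t = rng.normal(size=(100, 128))            # stand-in for learned TFIDF representations
neighbors = top_l_neighbors(h_t, L=5)

# Third "positive" view of text i: a randomly chosen TFIDF neighbor.
i = 7
third_positive_index = rng.choice(neighbors[i])
print(neighbors[i], third_positive_index)
```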
B. Cluster Assignment with Pseudo-labelling and Consistency
- Computing Cluster Probability $\pmb{p}_i^b$: A clustering head is applied over the BERT feature $\pmb{b}_i$ to compute the cluster probability $\pmb{p}_i^b$:
  $ \pmb{p}_i^b = \delta(g(\pmb{b}_i)) = \delta(g(\pmb{B}(\pmb{x}_i))) $
  Where:
  - $\delta(\cdot)$ is the softmax function.
  - $g(\cdot)$ is an MLP neural network serving as the clustering head.
  - $\pmb{B}(\cdot)$ is the BERT backbone.
- Pseudo-labelling and Cross-Entropy Loss $\mathcal{L}_{CE}$: Pseudo-labels are inferred from the predicted probabilities by solving an Optimal Transport (OT) problem (details in Appendix A.1). Let $\pmb{q}_i$ be the one-hot pseudo-label obtained for text $\pmb{x}_i$. The model is trained to minimize the cross-entropy loss:
  $ \mathcal{L}_{CE} = - \frac{1}{N} \sum_{i=1}^N \sum_{m=1}^3 \pmb{q}_i^T \log \delta(g(\pmb{b}_i^{(m)})) $
  Where:
  - $\pmb{b}_i^{(m)} = \pmb{B}(\pmb{x}_i^{(m)})$ is the BERT feature of the $m$-th augmentation.
  This loss encourages the model to predict the same pseudo-label for all three augmented texts of $\pmb{x}_i$.
- Consistency Loss $\mathcal{L}_{Consist}$: To ensure robustness and consistency, the model is encouraged to output consistent probability distributions for the original text $\pmb{x}_i$ and its augmentations $\pmb{x}_i^{(m)}$:
  $ \mathcal{L}_{Consist} = \frac{1}{N} \sum_{i=1}^N \sum_{m=1}^3 D_{KL}(\delta(g(\pmb{b}_i)) \| \delta(g(\pmb{b}_i^{(m)}))) $
  Where:
  - $D_{KL}(\cdot \| \cdot)$ is the KL-divergence.
  This loss promotes robustness of the pseudo-labelling technique. The clustering loss for the BERT module is $\mathcal{L}_{Cluster} = \mathcal{L}_{CE} + \mathcal{L}_{Consist}$. A small sketch of these two losses follows.
C. Cluster-Level Alignment between Modules
To achieve cluster-level alignment between the BERT and TFIDF modules, the predicted probability $\pmb{p}_i^b$ from the BERT feature should be aligned with the cluster probability $\pmb{p}_i^t$ inferred from the TFIDF feature. This is done by minimizing:
$
\mathcal{L}_{Align} = \frac{1}{N} \sum_{i=1}^N D_{KL}(\delta(g(\pmb{b}_i)) \| \pmb{p}_i^t)
$
Where:
- $\pmb{p}_i^t$ is the cluster probability from the TFIDF module (described next).
D. Total Loss for BERT Module
The entire BERT module is trained by minimizing the combined loss:
$
\mathcal{L}_B = \mathcal{L}_{Contr} + \mathcal{L}_{Cluster} + \lambda \mathcal{L}_{Align}
$
Where:
- $\lambda$ is a weighting parameter balancing the alignment term.
4.2.2. Implementation of the TFIDF Module
The TFIDF module aims to learn TFIDF-induced representations $\pmb{h}_i^t$ and cluster probabilities $\pmb{p}_i^t$ from the raw TFIDF features $\pmb{t}_i$. Unlike the BERT module, which uses contrastive learning and pseudo-labelling, the TFIDF module uses a Variational Autoencoder (VAE) to model the TFIDF features, ensuring keyword information preservation and explicit clustering capability through a Gaussian mixture prior. This module is also guided by information from the BERT module.
A. Representation-Level Alignment via BERT-induced Graph
Similar to how the BERT module uses the TFIDF-induced graph, the TFIDF module uses a BERT-induced similarity graph to enforce representation-level alignment.
- Constructing the BERT Similarity Graph $\mathcal{G}^b$: A similarity graph $\mathcal{G}^b = (\mathcal{V}, \mathcal{E}^b)$ is constructed using the BERT representations $\pmb{h}_i^b$ (obtained from the BERT module), with
$
\mathcal{N}_i^b = \{j \mid j \neq i \ \& \ \cos(\pmb{h}_i^b, \pmb{h}_j^b) \ \mathrm{is \ top}\text{-}L \ \mathrm{largest}\}
$
This encodes the idea that if two texts are similar in the BERT space, their TFIDF representations should also align.
B. Generative Model for TFIDF Features (VAE)
A VAE is used to model the TFIDF feature $\pmb{t}_i$. This ensures that keyword information is preserved during reconstruction and that the TFIDF representation $\pmb{h}_i^t$ exhibits a cluster structure through a latent Gaussian mixture prior distribution. The generative model is defined as:
$
p(\{\pmb{t}_i\}_{i=1}^N, \mathcal{G}^b, \{\pmb{h}_i^t\}_{i=1}^N, \{c_i\}_{i=1}^N) = \prod_{i=1}^N p(\pmb{t}_i \mid \pmb{h}_i^t) \, p(\mathcal{G}_i^b \mid \{\pmb{h}_i^t\}_{i=1}^N) \, p(\pmb{h}_i^t \mid c_i) \, p(c_i)
$
Where:
- $c_i$ is a latent categorical variable (cluster assignment) randomly drawn from $\{1, \dots, K\}$, with $K$ the number of clusters.
- $p(c_i)$ is the categorical prior distribution over clusters, parameterized by a $K$-dimensional vector $\pmb{\pi}$.
- $p(\pmb{h}_i^t \mid c_i) = \mathcal{N}(\pmb{h}_i^t; \pmb{\mu}_{c_i}, \mathrm{diag}(\pmb{\sigma}_{c_i}^2))$ is a Gaussian prior for the latent representation $\pmb{h}_i^t$, where $\pmb{\mu}_{c_i}$ and $\pmb{\sigma}_{c_i}^2$ are the mean and variance for cluster $c_i$. This models the idea that data points belonging to the same cluster should have TFIDF representations drawn from the same Gaussian distribution.
- $p(\pmb{t}_i \mid \pmb{h}_i^t)$ is the decoder responsible for reconstructing the TFIDF feature $\pmb{t}_i$ from $\pmb{h}_i^t$. It is a softmax over an embedding matrix (details in Appendix A.2, Equation 42).
- $p(\mathcal{G}_i^b \mid \{\pmb{h}_i^t\}_{i=1}^N)$ is the decoder responsible for generating the BERT-induced similarity graph from the TFIDF representations. This term explicitly encourages the TFIDF representations to reflect the BERT similarity structure:
$
p(\mathcal{G}_i^b \mid \{\pmb{h}_i^t\}_{i=1}^N) = \prod_{j \in \mathcal{N}_i^b} \frac{\Delta(\pmb{h}_i^t, \pmb{h}_j^t)}{\sum_{k \neq i} \Delta(\pmb{h}_i^t, \pmb{h}_k^t)}
$
Where:
- $\Delta(\cdot, \cdot)$ measures the similarity between TFIDF representations.
C. Training the VAE with ELBO
The generative model is trained by minimizing the negative Evidence Lower Bound (ELBO):
$
\mathcal{L}_{ELBO} = \frac{1}{N} \sum_{i=1}^N \ell_i^{elbo}\left(q(\pmb{h}_i^t, c_i \mid \pmb{t}_i)\right)
$
Where the per-sample term is:
$
\ell_i^{elbo}\left(q(\pmb{h}_i^t, c_i \mid \pmb{t}_i)\right) = - \mathbb{E}_{q}\left[\log \frac{p(\pmb{t}_i \mid \pmb{h}_i^t) \, p(\mathcal{G}_i^b \mid \{\pmb{h}_i^t\}_{i=1}^N) \, p(\pmb{h}_i^t \mid c_i) \, p(c_i)}{q(\pmb{h}_i^t, c_i \mid \pmb{t}_i)}\right]
$
Here, $\mathbb{E}_q$ denotes the expectation with respect to the variational posterior $q(\pmb{h}_i^t, c_i \mid \pmb{t}_i)$.
By restricting $q(\pmb{h}_i^t, c_i \mid \pmb{t}_i)$ to the form $q(\pmb{h}_i^t \mid \pmb{t}_i) \, p(c_i \mid \pmb{h}_i^t)$, the ELBO simplifies to:
$
\ell_i^{elbo}\left(q(\pmb{h}_i^t \mid \pmb{t}_i) \, p(c_i \mid \pmb{h}_i^t)\right) = - \mathbb{E}_{q(\pmb{h}_i^t \mid \pmb{t}_i)}\left[\log \frac{p(\pmb{t}_i \mid \pmb{h}_i^t) \, p(\mathcal{G}_i^b \mid \{\pmb{h}_i^t\}_{i=1}^N) \, p(\pmb{h}_i^t)}{q(\pmb{h}_i^t \mid \pmb{t}_i)}\right]
$
Where:
- $q(\pmb{h}_i^t \mid \pmb{t}_i)$ is the encoder (a neural network) that outputs the mean $\pmb{\mu}(\pmb{t}_i)$ and variance $\pmb{\sigma}^2(\pmb{t}_i)$ of the Gaussian distribution for $\pmb{h}_i^t$.
- $p(c_i \mid \pmb{h}_i^t)$ is the posterior probability of cluster $c_i$ given $\pmb{h}_i^t$.

After the VAE is trained, the TFIDF representation $\pmb{h}_i^t$ and cluster probability $\pmb{p}_i^t$ are obtained from its encoder:
$
\pmb{h}_i^t = \mathbb{E}_{q(\pmb{h}_i^t \mid \pmb{t}_i)}[\pmb{h}] = \pmb{\mu}(\pmb{t}_i)
$
$
\pmb{p}_i^t[c] = \mathbb{E}_{q(\pmb{h}_i^t \mid \pmb{t}_i)}\left[\frac{\pi_c \, \mathcal{N}(\pmb{h}_i^t; \pmb{\mu}_c, \mathrm{diag}(\pmb{\sigma}_c^2))}{\sum_{c'=1}^{K} \pi_{c'} \, \mathcal{N}(\pmb{h}_i^t; \pmb{\mu}_{c'}, \mathrm{diag}(\pmb{\sigma}_{c'}^2))}\right]
$
Where:
- $\pmb{\mu}(\pmb{t}_i)$ is the mean output by the encoder.
- $\pmb{p}_i^t[c]$ is the $c$-th element of the cluster probability vector $\pmb{p}_i^t$. In practice, the expectation is approximated by a sample drawn from $q(\pmb{h}_i^t \mid \pmb{t}_i)$.
D. Cluster-Level Alignment between Modules
Similar to the BERT module, the TFIDF module also uses a KL-divergence term to align its predicted probabilities with those from the BERT module:
$
\mathcal{L}_{Align} = \frac{1}{N} \sum_{i=1}^N D_{KL}(\pmb{p}_i^b \| \pmb{p}_i^t)
$
This mirrors the alignment term in the BERT module; here it serves to pull the TFIDF module's cluster probabilities toward the BERT module's.
E. Total Loss for TFIDF Module
The TFIDF module is trained by minimizing:
$
\mathcal{L}_T = \mathcal{L}_{ELBO} + \lambda' \mathcal{L}_{Align}
$
Where:
- $\lambda'$ is a weighting parameter.
4.2.3. A Unified Training Objective
Initially, the two modules can be updated in an alternating manner. However, the paper shows that a more efficient approach is to use a unified joint training objective.

- Initial Joint Loss $\mathcal{L}_{Joint}$: A straightforward joint loss combining $\mathcal{L}_B$ and $\mathcal{L}_T$ would be:
  $ \mathcal{L}_{Joint} = \mathcal{L}_B + \lambda_1 \mathcal{L}_T = \mathcal{L}_{Contr} + \mathcal{L}_{Cluster} + \lambda \mathcal{L}_{Align} + \lambda_1 (\mathcal{L}_{ELBO} + \lambda' \mathcal{L}_{Align}) $
  By absorbing $\lambda$ and $\lambda'$ into a single weight $\lambda_2$, this can be rewritten as:
  $ \mathcal{L}_{Joint} = \mathcal{L}_{Contr} + \mathcal{L}_{Cluster} + \lambda_1 (\mathcal{L}_{ELBO} + \lambda_2 \mathcal{L}_{Align}) $
  Where:
  - $\lambda_1$ and $\lambda_2$ are weighting parameters. Note that the alignment term $\mathcal{L}_{Align}$ appears twice in the original sum, which is what allows the two weights to be merged.
- Derivation of the Unified Joint Loss (using Appendix A.2): The paper then derives an inequality that allows for a more tightly connected and efficient joint objective. From Appendix A.2:
  $ \ell_i^{elbo}(q(\pmb{h}_i^t \mid \pmb{t}_i) \, p(c_i \mid \pmb{h}_i^t)) + \ell_i^{align} \leq \ell_i^{elbo}(q(\pmb{h}_i^t \mid \pmb{t}_i) \, \pmb{p}_i^b[c_i]) $
  Where:
  - $\ell_i^{align}$ is the per-sample alignment loss $D_{KL}(\pmb{p}_i^b \| \pmb{p}_i^t)$.
  - $\pmb{p}_i^b[c]$ denotes the $c$-th element of $\pmb{p}_i^b$.
  This inequality implies that using the cluster probability from the BERT module directly as the variational posterior over $c_i$ in the TFIDF module (i.e., replacing $p(c_i \mid \pmb{h}_i^t)$ with $\pmb{p}_i^b[c_i]$) yields an upper bound on the sum of the standard ELBO and the alignment loss. Aggregating over all samples leads to:
  $ \mathcal{L}_{ELBO} + \mathcal{L}_{Align} \leq \mathcal{L}_{ELBO}' $
  Where:
  $ \mathcal{L}_{ELBO}' \triangleq \frac{1}{N} \sum_{i=1}^N \ell_i^{elbo}(q(\pmb{h}_i^t \mid \pmb{t}_i) \, \pmb{p}_i^b[c_i]) $
  By setting $\lambda_2 = 1$ in the initial joint loss (Equation 20) and substituting the inequality, a new unified joint training loss is obtained:
  $ \mathcal{L}_{Joint}' = \mathcal{L}_{Contr} + \mathcal{L}_{Cluster} + \lambda_1 \mathcal{L}_{ELBO}' $
  Benefits of $\mathcal{L}_{Joint}'$:
  - Efficient Gradient Propagation: The cluster probability $\pmb{p}_i^b$ from the BERT module is directly used as the variational posterior in the TFIDF module. This allows gradients from the TFIDF module to flow back to the BERT module via $\pmb{p}_i^b$, enabling more efficient propagation of training signals between the two modules.
  - Sharper Probabilities: This formulation avoids explicitly optimizing the KL-divergence term. Directly optimizing a KL-divergence can encourage distributions to be overly smoothed (allocating some probability to all clusters to avoid large losses), making it harder to yield sharp predicted probabilities. $\mathcal{L}_{Joint}'$ helps produce sharper cluster assignments.
  - Optimized with Gumbel-Softmax: The Gumbel-Softmax trick can be employed to approximate the expectation over the categorical variable $c_i$ in $\mathcal{L}_{ELBO}'$. Combined with the Gaussian re-parameterization trick for $\pmb{h}_i^t$, this allows $\mathcal{L}_{Joint}'$ to be optimized efficiently.

  After training, the predicted cluster probability $\pmb{p}_i^b$ from the BERT module is used to obtain the final clustering result.
4.2.4. Appendix A.1: Optimal Transport for Pseudo-labelling
The pseudo-labels $\pmb{q}_i$ for the BERT module are obtained by solving an Optimal Transport (OT) problem over the predicted probabilities $\pmb{p}_i^b$. This approach helps to generate balanced cluster assignments and avoid mode collapse.

- Problem Formulation: Collect the predicted cluster probabilities of the $N$ samples over the $K$ clusters into a matrix, and define a cost matrix $C \in \mathbb{R}^{N \times K}$ from it, where $C_{ij}$ is the cost of assigning sample $i$ to class $j$. The OT problem seeks a transport matrix $\pmb{r}$ that minimizes:
  $ \underset{\pmb{r}, \pmb{b}}{\operatorname{min}} \ \langle \pmb{r}, C \rangle - \epsilon_1 H(\pmb{r}) + \epsilon_2 U(\pmb{b}) $
  Subject to the constraints:
  $ \pmb{r} \mathbf{1} = \pmb{a}, \quad \pmb{r}^T \mathbf{1} = \pmb{b}, \quad \pmb{b}^T \mathbf{1} = 1, \quad \pmb{r} \geq 0, \quad \pmb{b} \geq 0 $
  Where:
  - $\pmb{r}_{ij}$ is the probability of transporting sample $i$ to class $j$.
  - $\langle \pmb{r}, C \rangle$ is the transport cost.
  - $\epsilon_1$ and $\epsilon_2$ are weighting parameters.
  - $H(\pmb{r}) = - \sum_{i=1}^N \sum_{j=1}^K \pmb{r}_{ij} (\log \pmb{r}_{ij} - 1)$ is the entropy regularization for $\pmb{r}$, preventing degenerate solutions.
  - $U(\pmb{b}) = - \sum_{j=1}^K (\log \pmb{b}_j + \log(1 - \pmb{b}_j))$ is a penalty function that encourages the adaptive marginal distribution $\pmb{b}$ to stay close to uniform, thus avoiding cluster collapse.
  - $\pmb{a}$ is a uniform marginal distribution over samples.
  - $\pmb{b}$ is an adaptive marginal distribution over classes, allowing for class imbalance.
  - $\mathbf{1}$ is a vector of ones.
- Lagrange Multiplier Formulation: The problem can be solved using Lagrange multipliers:
  $ \underset{\pmb{r}, \pmb{b}}{\operatorname{min}} \ L = \langle \pmb{r}, C \rangle - \epsilon_1 H(\pmb{r}) + \epsilon_2 U(\pmb{b}) - \pmb{f}^T (\pmb{r}\mathbf{1} - \pmb{a}) - \pmb{g}^T (\pmb{r}^T\mathbf{1} - \pmb{b}) - h (\pmb{b}^T\mathbf{1} - 1) $
  Where $\pmb{f}$, $\pmb{g}$, and $h$ are Lagrange multipliers.
- Iterative Solution: The optimal $\pmb{r}$ can be expressed in terms of the multipliers:
  $ \pmb{r}_{ij} = \exp\left(\frac{f_i + g_j - C_{ij}}{\epsilon_1}\right) $
  By applying the marginal constraints, iterative updates for the scaling vectors $\pmb{u}$ and $\pmb{v}$ are derived, where $W = \exp(-C/\epsilon_1)$:
  $ \pmb{u}^{(t+1)} \gets \frac{\pmb{a}}{W \pmb{v}^{(t)}}, \qquad \pmb{v}^{(t+1)} \gets \frac{\pmb{b}^{(t)}}{W^T \pmb{u}^{(t)}} $
  The adaptive marginal for classes is updated by solving a quadratic equation for each $\pmb{b}_j$, derived by setting $\partial L / \partial \pmb{b}_j = 0$:
  $ (\pmb{g}_j - h) \pmb{b}_j^2 + (-\pmb{g}_j + h - 2\epsilon_2) \pmb{b}_j + \epsilon_2 = 0 $
  The correct solution for $\pmb{b}_j$ is:
  $ \pmb{b}_j = \frac{\pmb{g}_j - h + 2\epsilon_2 - \sqrt{\Delta_j}}{2(\pmb{g}_j - h)} $
  Where $\Delta_j$ is the discriminant. The value of $h$ is found using Newton's method so that the constraint $\pmb{b}^T \mathbf{1} = 1$ is satisfied. These updates are performed iteratively until convergence.
- Pseudo-label Derivation: Once the transport matrix is obtained (after several iterations),
  $ \pmb{r} = \mathrm{diag}(\pmb{u}) \, W \, \mathrm{diag}(\pmb{v}) $
  the pseudo-labels are derived by taking the argmax of each row of $\pmb{r}$:
  $ \pmb{q}_{ij} = 1 \ \mathrm{if} \ \underset{j'}{\operatorname{argmax}} \ \pmb{r}_{ij'} = j, \ \mathrm{else} \ 0 $
  This converts the soft assignments in $\pmb{r}$ into hard one-hot pseudo-labels. A simplified Sinkhorn-style sketch follows.
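Below is a minimal sketch of the plain entropy-regularized Sinkhorn iteration with fixed uniform marginals, which is the core of the updates above; the paper's full variant additionally adapts the class marginal with the quadratic update and Newton's method, which is omitted here. The cost definition, epsilon, and iteration count are illustrative assumptions.

```python
import numpy as np

def sinkhorn_pseudo_labels(probs: np.ndarray, eps: float = 0.05, n_iter: int = 50) -> np.ndarray:
    """Balanced pseudo-labels from predicted cluster probabilities via a fixed-marginal Sinkhorn loop."""
    N, K = probs.shape
    C = -np.log(probs + 1e-12)            # illustrative cost: low cost for high-probability classes
    W = np.exp(-C / eps)                  # kernel W = exp(-C / eps)
    a = np.full(N, 1.0 / N)               # uniform marginal over samples
    b = np.full(K, 1.0 / K)               # fixed (not adaptive) marginal over classes
    u, v = np.ones(N), np.ones(K)
    for _ in range(n_iter):               # alternating scaling updates
        u = a / (W @ v + 1e-12)
        v = b / (W.T @ u + 1e-12)
    r = np.diag(u) @ W @ np.diag(v)       # transport plan
    return np.eye(K)[r.argmax(axis=1)]    # one-hot pseudo-labels via row-wise argmax

rng = np.random.default_rng(0)
p = rng.dirichlet(alpha=np.ones(4), size=16)   # fake predicted probabilities: 16 samples, 4 clusters
q = sinkhorn_pseudo_labels(p)
print(q.sum(axis=0))                           # roughly balanced counts across clusters
```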
4.2.5. Appendix A.2: Variational Autoencoder Details
This section elaborates on the VAE used in the TFIDF module and the derivation of the unified joint training objective.
- Generative Model Review: The generative model for TFIDF features and the BERT-induced graph is:
  $ p(\{\pmb{t}_i\}_{i=1}^N, \mathcal{G}^b, \{\pmb{h}_i^t\}_{i=1}^N, \{c_i\}_{i=1}^N) = \prod_{i=1}^N p(\pmb{t}_i \mid \pmb{h}_i^t) \, p(\mathcal{G}_i^b \mid \{\pmb{h}_i^t\}_{i=1}^N) \, p(\pmb{h}_i^t \mid c_i) \, p(c_i) $
  The negative ELBO to be minimized is:
  $ \mathcal{L}_{ELBO} = \sum_{i=1}^N \ell_i^{elbo}(q(\pmb{h}_i^t \mid \pmb{t}_i) \, p(c_i \mid \pmb{h}_i^t)) $
  Where:
  $ \ell_i^{elbo}\left(q(\pmb{h}_i^t \mid \pmb{t}_i) \, p(c_i \mid \pmb{h}_i^t)\right) = - \mathbb{E}_{q(\pmb{h}_i^t \mid \pmb{t}_i)}\left[\log \frac{p(\pmb{t}_i \mid \pmb{h}_i^t) \, p(\mathcal{G}_i^b \mid \{\pmb{h}_i^t\}_{i=1}^N) \, p(\pmb{h}_i^t)}{q(\pmb{h}_i^t \mid \pmb{t}_i)}\right] $
  The alignment loss is $\mathcal{L}_{Align} = \sum_{i=1}^N D_{KL}(\pmb{p}_i^b \| \pmb{p}_i^t)$, with:
  $ \pmb{p}_i^b[c_i] = \delta(g(\pmb{b}_i)), \qquad \pmb{p}_i^t[c_i] = \mathbb{E}_{q(\pmb{h}_i^t \mid \pmb{t}_i)}[p(c_i \mid \pmb{h}_i^t)] $
- Derivation of the Inequality for Unified Loss: To show $\mathcal{L}_{ELBO} + \mathcal{L}_{Align} \leq \mathcal{L}_{ELBO}'$, the derivation starts with the KL-divergence term:
  $ D_{KL}(P \| Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)} $
  Applying this to the alignment term:
  $ D_{KL}(\pmb{p}_i^b[c_i] \| \pmb{p}_i^t[c_i]) = \sum_{c'} \pmb{p}_i^b[c'] \log \frac{\pmb{p}_i^b[c']}{\mathbb{E}_{q(\pmb{h}_i^t \mid \pmb{t}_i)}[p(c' \mid \pmb{h}_i^t)]} $
  Using Jensen's inequality ($-\log \mathbb{E}[X] \leq \mathbb{E}[-\log X]$), the expectation in the denominator can be pulled out of the logarithm:
  $ \leq \mathbb{E}_{q(\pmb{h}_i^t \mid \pmb{t}_i)} \left[ \sum_{c'} \pmb{p}_i^b[c'] \log \frac{\pmb{p}_i^b[c']}{p(c' \mid \pmb{h}_i^t)} \right] $
  This expression is an expected KL-divergence in which the posterior is conditioned on $\pmb{h}_i^t$; it can be rewritten as:
  $ = \mathbb{E}_{q(\pmb{h}_i^t \mid \pmb{t}_i) \, \pmb{p}_i^b[c_i]} \left[ \log \frac{\pmb{p}_i^b[c_i]}{p(c_i \mid \pmb{h}_i^t)} \right] $
  Combining this with the ELBO term $\ell_i^{elbo}$, and after algebraic manipulation (cancelling terms and regrouping using properties of logarithm and expectation), the inequality is obtained:
  $ \mathcal{L}_{ELBO}' = \sum_{i=1}^N \ell_i^{elbo}(q(\pmb{h}_i^t \mid \pmb{t}_i) \, \pmb{p}_i^b[c_i]) \geq \mathcal{L}_{ELBO} + \mathcal{L}_{Align} $
  This inequality shows that optimizing $\mathcal{L}_{ELBO}'$ (where $\pmb{p}_i^b$ acts as the posterior for $c_i$) implicitly optimizes the original $\mathcal{L}_{ELBO}$ and the alignment loss.
- Decoder $p(\pmb{t}_i \mid \pmb{h}_i^t)$: The decoder in the VAE reconstructs the TFIDF feature $\pmb{t}_i$. The TFIDF feature is treated as a set of words $\{\pmb{w}_j\}$, where $\pmb{w}_j$ is a one-hot representation over the vocabulary $\mathcal{W}$. An embedding matrix $\pmb{\mathcal{E}}$ (with 128 rows, the dimension of $\pmb{h}_i^t$) is used as the decoder network. The probability of generating $\pmb{t}_i$ given $\pmb{h}_i^t$ is defined as:
  $ p(\pmb{t}_i \mid \pmb{h}_i^t) = \prod_{\pmb{w}_j \in \pmb{t}_i} p(\pmb{w}_j \mid \pmb{h}_i^t) = \prod_{\pmb{w}_j \in \pmb{t}_i} \frac{\exp(\pmb{h}_i^{t^T} \pmb{\mathcal{E}} \pmb{w}_j)}{\sum_{k=1}^{|\mathcal{W}|} \exp(\pmb{h}_i^{t^T} \pmb{\mathcal{E}} \pmb{w}_k)} $
  This is essentially a product of softmax probabilities, where each word embedding (a column of $\pmb{\mathcal{E}}$) is compared against the TFIDF representation $\pmb{h}_i^t$.
- Approximation of $\ell_i^{elbo}(q(\pmb{h}_i^t \mid \pmb{t}_i) \, \pmb{p}_i^b[c_i])$: The term in the new ELBO can be factorized into five subterms:
  $ \ell_i^{elbo}\left(q(\pmb{h}_i^t \mid \pmb{t}_i) \, \pmb{p}_i^b[c_i]\right) = - \mathbb{E}_{q} \left[ \log \frac{p(\pmb{t}_i \mid \pmb{h}_i^t) \, p(\mathcal{G}_i^b \mid \{\pmb{h}_i^t\}_{i=1}^N) \, p(\pmb{h}_i^t \mid c_i) \, p(c_i)}{q(\pmb{h}_i^t \mid \pmb{t}_i) \, \pmb{p}_i^b[c_i]} \right] = - \mathbb{E}_{q} [ \log p(\pmb{t}_i \mid \pmb{h}_i^t) ] - \mathbb{E}_{q} [ \log p(\mathcal{G}_i^b \mid \{\pmb{h}_i^t\}_{i=1}^N) ] + \mathbb{E}_{q} [ \log q(\pmb{h}_i^t \mid \pmb{t}_i) ] - \mathbb{E}_{q} \left[ \log \frac{p(c_i)}{\pmb{p}_i^b[c_i]} \right] - \mathbb{E}_{q} [ \log p(\pmb{h}_i^t \mid c_i) ] $
  These subterms are approximated for computation:
  - Reconstruction term (from $p(\pmb{t}_i \mid \pmb{h}_i^t)$):
    $ - \mathbb{E}_{q(\pmb{h}_i^t \mid \pmb{t}_i)} [ \log p(\pmb{t}_i \mid \pmb{h}_i^t) ] \approx - \sum_{\pmb{w}_j \in \pmb{t}_i} \log \frac{\exp(\tilde{\pmb{h}}_i^{t^T} \pmb{\mathcal{E}} \pmb{w}_j)}{\sum_{k=1}^{|\mathcal{W}|} \exp(\tilde{\pmb{h}}_i^{t^T} \pmb{\mathcal{E}} \pmb{w}_k)} $
    This uses the Gaussian re-parameterization trick, with $\tilde{\pmb{h}}_i^t = \pmb{\mu}(\pmb{t}_i) + \pmb{\sigma}(\pmb{t}_i) \odot \pmb{\epsilon}$ and $\pmb{\epsilon} \sim \mathcal{N}(\pmb{0}, \pmb{I})$.
  - Graph reconstruction term (from $p(\mathcal{G}_i^b \mid \{\pmb{h}_i^t\}_{i=1}^N)$):
    $ - \mathbb{E}_{q(\pmb{h}_i^t \mid \pmb{t}_i)} [ \log p(\mathcal{G}_i^b \mid \{\pmb{h}_i^t\}_{i=1}^N) ] \approx - \sum_{j \in \mathcal{N}_i^b} \log \frac{\Delta(\tilde{\pmb{h}}_i^t, \tilde{\pmb{h}}_j^t)}{\sum_{k \neq i} \Delta(\tilde{\pmb{h}}_i^t, \tilde{\pmb{h}}_k^t)} $
    This also uses re-parameterization.
  - Encoder entropy term (from $q(\pmb{h}_i^t \mid \pmb{t}_i)$): Can be computed analytically for a Gaussian distribution.
    $ \mathbb{E}_{q(\pmb{h}_i^t \mid \pmb{t}_i)} [ \log q(\pmb{h}_i^t \mid \pmb{t}_i) ] = - \frac{1}{2} \sum_{j=1}^{128} \left(\log 2\pi + 1 + \log \pmb{\sigma}^2(\pmb{t}_i)|_j\right) $
    Where 128 is the dimension of $\pmb{h}_i^t$.
  - Prior-posterior difference for $c_i$:
    $ - \mathbb{E}_{\pmb{p}_i^b[c_i]} \left[ \log \frac{p(c_i)}{\pmb{p}_i^b[c_i]} \right] \approx - \sum_{k=1}^K \tilde{c}_{ik} \log \frac{p(k)}{\pmb{p}_i^b[k]} $
    Here, $\tilde{c}_{ik}$ is obtained using the Gumbel-Softmax trick:
    $ \tilde{c}_{ik} = \frac{\exp((g_{ik} + \log \pmb{p}_i^b[k]) / \tau)}{\sum_{j=1}^K \exp((g_{ij} + \log \pmb{p}_i^b[j]) / \tau)} $
    Where $g_{ik} \sim \mathrm{Gumbel}(0, 1)$ and $\tau$ is the temperature parameter.
  - Prior-posterior difference for $\pmb{h}_i^t$ given $c_i$:
    $ - \mathbb{E}_{q(\pmb{h}_i^t \mid \pmb{t}_i) \, \pmb{p}_i^b[c_i]} [ \log p(\pmb{h}_i^t \mid c_i) ] \approx \frac{1}{2} \sum_{k=1}^K \tilde{c}_{ik} \sum_{j=1}^{128} \left( \log 2\pi + \log \sigma_k^2|_j + \frac{(\mu(\pmb{t}_i)|_j - \mu_k|_j)^2 + \sigma^2(\pmb{t}_i)|_j}{\sigma_k^2|_j} \right) $
  By summing these approximated terms, $\ell_i^{elbo}(q(\pmb{h}_i^t \mid \pmb{t}_i) \, \pmb{p}_i^b[c_i])$ can be computed and optimized.
4.2.6. Data Flows and Network Architectures
The paper provides detailed data flows and network architectures for both the BERT and TFIDF modules in Table 10 and Table 11.
The following are the data flows and network architectures for BERT features. $K$ is the number of clusters, 768 is the dimension of BERT features, and 128 is the dimension of BERT representations.
| Data Flow | Function | Network Architecture |
|---|---|---|
| Raw Text x | - | - |
| BERT Transformation b = B(x) | B(·) | BERT Backbone |
| Projection Head `h^b = f(b)` | f(·) | Linear(768, 768); ReLU(); Linear(768, 128); Normalize(). |
| Clustering Head `p^b = g(b)` | g(·) | Dropout(); Linear(768, 768); ReLU(); Dropout(); Linear(768, 768); ReLU(); Linear(768, K); Softmax(). |
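The two heads in the table translate directly into small PyTorch modules. The sketch below follows the layer list above; the dropout probability and the default cluster count are assumptions not specified in the table.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionHead(nn.Module):
    """f(.): maps 768-d BERT features to 128-d normalized representations h^b."""
    def __init__(self, in_dim: int = 768, out_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, in_dim), nn.ReLU(), nn.Linear(in_dim, out_dim))

    def forward(self, b: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.net(b), dim=-1)

class ClusteringHead(nn.Module):
    """g(.): maps 768-d BERT features to a K-dimensional cluster probability p^b."""
    def __init__(self, in_dim: int = 768, n_clusters: int = 20, p_drop: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Dropout(p_drop), nn.Linear(in_dim, in_dim), nn.ReLU(),
            nn.Dropout(p_drop), nn.Linear(in_dim, in_dim), nn.ReLU(),
            nn.Linear(in_dim, n_clusters),
        )

    def forward(self, b: torch.Tensor) -> torch.Tensor:
        return F.softmax(self.net(b), dim=-1)

b = torch.randn(4, 768)                                        # a batch of 4 BERT features
print(ProjectionHead()(b).shape, ClusteringHead()(b).shape)    # (4, 128), (4, 20)
```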
The following are the data flows and network architectures for TFIDF features. $K$ is the number of clusters, 2048 is the dimension of TFIDF features, and 128 is the dimension of TFIDF representations.
| Data Flow | Function | Network Architecture |
|---|---|---|
| Raw Text x | - | - |
| TFIDF Transformation t = T(x) | T(·) | TFIDF Vectorizer |
| Encoder Network µ = Enc-µ(t) | Enc-µ(·) | Linear(2048, 2048); ReLU(); Linear(2048, 128); Tanh(). |
| Encoder Network σ = Enc-σ(t) | Enc-σ(·) | Linear(2048, 2048); ReLU(); Linear(2048, 128); Exp(). |
| Sample Process h̃ = µ + σ ⊙ ε, ε ∼ N(0, 1) | - | - |
| Decoder Network t̂ = Dec(h̃) | Dec(·) | Linear(128, 2048); Softmax(). |
| Class Distribution | - | π |
| Gaussian Components | - | {µk, σk} |
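A sketch of the encoder and decoder in the second table as PyTorch modules, assuming the 2048-dimensional TFIDF input and 128-dimensional latent space stated above; initialization, the Gaussian mixture components, and the training loop are omitted.

```python
import torch
import torch.nn as nn

class TfidfVAEEncoder(nn.Module):
    """Outputs the mean and standard deviation of q(h^t | t), then reparameterizes a sample."""
    def __init__(self, in_dim: int = 2048, latent_dim: int = 128):
        super().__init__()
        self.enc_mu = nn.Sequential(nn.Linear(in_dim, in_dim), nn.ReLU(),
                                    nn.Linear(in_dim, latent_dim), nn.Tanh())
        self.enc_log_sigma = nn.Sequential(nn.Linear(in_dim, in_dim), nn.ReLU(),
                                           nn.Linear(in_dim, latent_dim))

    def forward(self, t: torch.Tensor):
        mu = self.enc_mu(t)
        sigma = torch.exp(self.enc_log_sigma(t))          # Exp() keeps sigma positive
        h = mu + sigma * torch.randn_like(sigma)          # Gaussian re-parameterization trick
        return mu, sigma, h

class TfidfVAEDecoder(nn.Module):
    """Maps a latent sample back to a distribution over the 2048 vocabulary slots."""
    def __init__(self, latent_dim: int = 128, out_dim: int = 2048):
        super().__init__()
        self.dec = nn.Linear(latent_dim, out_dim)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return torch.softmax(self.dec(h), dim=-1)

t = torch.rand(4, 2048)                  # a batch of 4 TFIDF vectors
mu, sigma, h = TfidfVAEEncoder()(t)
recon = TfidfVAEDecoder()(h)
print(h.shape, recon.shape)              # (4, 128), (4, 2048)
```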
5. Experimental Setup
5.1. Datasets
The authors evaluate COTC on eight benchmark datasets for short text clustering. These datasets cover a variety of domains and characteristics, making them suitable for validating the method's performance across different scenarios.
The following are the statistics of the datasets. N: the number of texts; Len: the average length of texts; K: the number of classes; L/S: the size ratio of the largest class versus the smallest one.
| Dataset | N | Len | K | L/S |
|---|---|---|---|---|
| AgNews | 8000 | 23 | 4 | 1 |
| SearchSnippets | 12340 | 18 | 8 | 7 |
| StackOverflow | 20000 | 9 | 20 | 1 |
| Biomedical | 20000 | 13 | 20 | 1 |
| GoogleNews-TS | 11109 | 28 | 152 | 143 |
| GoogleNews-T | 11109 | 6 | 152 | 143 |
| GoogleNews-S | 11109 | 22 | 152 | 143 |
| Tweet | 2472 | 9 | 89 | 249 |
Details of the Datasets:
- AgNews: A subset of news articles (from over 2000 sources) containing 8000 news titles from 4 topic categories.
- Example: "Wall St. Bears Claw Back Into the Black (Reuters)"
- SearchSnippets: A subset of 12340 snippets from web search results across 8 domains.
- Example: "Free Internet Software - Download the latest internet tools, utilities, browsers, and more."
- StackOverflow: 20000 question titles related to programming from 20 tags. This is a domain-specific dataset where keywords are crucial.
- Example: "How to parse XML with SimpleXML in PHP?"
- Biomedical: 20000 paper titles from 20 categories sourced from BioASQ. Another domain-specific dataset.
  - Example: "A novel approach for the detection of amyloid plaques in Alzheimer's disease using optical coherence tomography."
- GoogleNews (TS, T, S): Constructed by Yin and Wang (2014), containing titles and snippets of 11109 news articles corresponding to 152 events.
- GoogleNews-TS: Both titles and snippets (average length 28).
- GoogleNews-T: Titles only (average length 6).
- GoogleNews-S: Snippets only (average length 22).
- Example (title): "Google to acquire Motorola Mobility" (snippet might follow for -TS, -S).
- Tweet: Contains 2472 tweets from 89 queries. Tweets are very short and informal texts.
  - Example: "Just saw @ladygaga concert, amazing performance!"

These datasets are well-suited for evaluating short text clustering due to their varying lengths, numbers of clusters, and domain specificity, especially StackOverflow and Biomedical for keyword importance, and GoogleNews and Tweet for their brevity and potentially high class imbalance.
5.2. Evaluation Metrics
The performance of the clustering methods is evaluated using two standard metrics: Clustering Accuracy (ACC) and Normalized Mutual Information (NMI).
5.2.1. Clustering Accuracy (ACC)
Conceptual Definition: Clustering Accuracy (ACC) measures the agreement between the predicted cluster assignments and the true ground-truth labels, after finding the best possible mapping between predicted clusters and true classes. It quantifies how many samples are correctly assigned to their corresponding true class. Since clustering is unsupervised, the predicted cluster IDs don't inherently match the true class IDs, so a mapping step (e.g., using the Hungarian algorithm) is necessary to align them for comparison.
Mathematical Formula: $ ACC = \frac{\sum_{i=1}^N \mathbb{1}_{y_i = map(\hat{y}_i)}}{N} $
Symbol Explanation:
- $ACC$: The clustering accuracy.
- $N$: The total number of samples (texts) in the dataset.
- $\mathbb{1}_{(\cdot)}$: An indicator function that equals 1 if the condition is true, and 0 otherwise.
- $y_i$: The ground-truth label (true class ID) of the $i$-th sample.
- $\hat{y}_i$: The predicted cluster assignment (predicted cluster ID) for the $i$-th sample.
- $map(\cdot)$: A permutation mapping function that maps each predicted cluster ID to a ground-truth class ID. This mapping is chosen to maximize the accuracy (e.g., using the Hungarian algorithm). A short implementation sketch follows.
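A minimal sketch of computing ACC with the Hungarian algorithm from SciPy; the toy label arrays are assumptions for illustration.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """ACC: best one-to-one mapping of predicted cluster IDs to true class IDs, then accuracy."""
    K = max(y_true.max(), y_pred.max()) + 1
    count = np.zeros((K, K), dtype=int)
    for t, p in zip(y_true, y_pred):
        count[p, t] += 1                              # co-occurrence counts (pred x true)
    rows, cols = linear_sum_assignment(-count)        # maximize matched counts
    return count[rows, cols].sum() / len(y_true)

y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([1, 1, 0, 0, 2, 2])                 # cluster IDs permuted relative to classes
print(clustering_accuracy(y_true, y_pred))            # 1.0
```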
5.2.2. Normalized Mutual Information (NMI)
Conceptual Definition: Normalized Mutual Information (NMI) is a measure of the mutual dependence between two clusterings (the predicted clustering and the ground-truth clustering). It quantifies how much information is shared between the two clusterings, normalized to a value between 0 and 1. A value of 1 indicates perfect correlation (the predicted clustering perfectly matches the true clustering), while a value of 0 indicates no mutual information (the clusterings are independent). NMI is often preferred over ACC in some contexts because it is less sensitive to the number of clusters and is a direct measure of information overlap.
Mathematical Formula: $ NMI = \frac{2 I(Y; \hat{Y})}{H(Y) + H(\hat{Y})} $
Symbol Explanation:
- $NMI$: The normalized mutual information score.
- $Y$: The set of ground-truth labels for all samples.
- $\hat{Y}$: The set of predicted cluster assignments for all samples.
- $I(Y; \hat{Y})$: The Mutual Information between the ground-truth labels $Y$ and the predicted labels $\hat{Y}$. It is defined as:
  $ I(Y; \hat{Y}) = \sum_{y \in Y} \sum_{\hat{y} \in \hat{Y}} P(y, \hat{y}) \log\left(\frac{P(y, \hat{y})}{P(y)P(\hat{y})}\right) $
  Where $P(y, \hat{y})$ is the joint probability of a sample belonging to class $y$ and being assigned to cluster $\hat{y}$, and $P(y)$ and $P(\hat{y})$ are the marginal probabilities.
- $H(Y)$: The Entropy of the ground-truth labels $Y$. It is defined as:
  $ H(Y) = - \sum_{y \in Y} P(y) \log(P(y)) $
- $H(\hat{Y})$: The Entropy of the predicted cluster assignments $\hat{Y}$, defined similarly to $H(Y)$.
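In practice NMI is rarely computed by hand; scikit-learn provides it directly. The label arrays below are illustrative, and the arithmetic-mean normalization matches the formula above.

```python
from sklearn.metrics import normalized_mutual_info_score

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [1, 1, 0, 0, 2, 2]      # a relabeled but otherwise perfect clustering

# 'arithmetic' normalizes by (H(Y) + H(Y_hat)) / 2, i.e., 2*I / (H(Y) + H(Y_hat)).
print(normalized_mutual_info_score(y_true, y_pred, average_method="arithmetic"))  # 1.0
```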
5.3. Baselines
The paper compares COTC against a comprehensive set of baselines, including traditional, Word2Vec-based, BERT-based, and simple fusion methods. This allows for a thorough evaluation of COTC's performance relative to existing techniques.
- TFIDF-K-Means: Applies K-Means clustering directly to TFIDF features. Represents a traditional, basic approach using TFIDF.
- BERT-K-Means: Applies K-Means clustering directly to BERT features. Represents a basic approach leveraging BERT embeddings.
- K-Means_IC (Rakib et al., 2020): An enhanced TFIDF-based method that applies K-Means to TFIDF features combined with an iterative classification algorithm.
- STC2-LPI (Xu et al., 2017): Uses Word2Vec embeddings, pre-obtained LPI codes, and a CNN for deep representations, then K-Means for clustering.
- Self-Train (Hadifar et al., 2019): Uses an autoencoder to model Word2Vec embeddings enhanced with SIF (Smooth Inverse Frequency), then fine-tunes the encoder with the DEC loss.
- SCCL (Zhang et al., 2021): A BERT-based method that performs contrastive learning on BERT features and achieves clustering using the DEC loss.
- RSTC (Zheng et al., 2023): The current SOTA BERT-only method. It performs pseudo-labelling on BERT features, where pseudo-labels are generated by solving an Optimal Transport (OT) problem.
- GMVAE (Jiang et al., 2017): Models TFIDF features using a Variational Autoencoder (VAE) with a Gaussian mixture prior. This serves as a strong baseline for TFIDF-focused deep clustering and is directly relevant to the TFIDF module in COTC.
- RSTCBERT-TFIDF-Linear: A simple fusion baseline. Reduces TFIDF features to the same dimensionality as BERT features using an autoencoder, then linearly combines the two and feeds the result into RSTC.
- RSTCBERT-TFIDF-Concat-1: Another simple fusion baseline. Concatenates BERT features with a dimensionality-reduced TFIDF feature and feeds the combined feature into RSTC.
- RSTCBERT-TFIDF-Concat-2: Similar to RSTCBERT-TFIDF-Concat-1, but concatenates BERT features with the original, high-dimensional TFIDF feature before feeding into RSTC.

The default BERT model used for these baselines and COTC is distilbert-base-nli-stsb-mean-tokens, as specified in the original paper.
6. Results & Analysis
6.1. Core Results Analysis
The experimental results demonstrate that the proposed COTC method consistently achieves superior clustering performance across all eight benchmark datasets, significantly outperforming current state-of-the-art methods in terms of both ACC and NMI.
The following are the clustering performance of the baselines and our method COTC on eight benchmark datasets. The results for the baselines are quoted from Zheng et al., 2023. The best results are bold and the second one underlined.
| Method | AgNews | | SearchSnippets | | StackOverflow | | Biomedical | |
|---|---|---|---|---|---|---|---|---|
| | ACC | NMI | ACC | NMI | ACC | NMI | ACC | NMI |
| TFIDF-K-Means | 34.39 | 12.19 | 30.85 | 18.67 | 58.52 | 59.02 | 29.13 | 25.12 |
| BERT-K-Means | 65.95 | 31.55 | 55.83 | 32.07 | 60.55 | 51.79 | 39.50 | 32.63 |
| K-Means_IC | 66.30 | 42.03 | 63.84 | 42.77 | 74.96 | 70.27 | 40.44 | 32.16 |
| STC2-LPI | - | - | 76.98 | 62.56 | 51.14 | 49.10 | 43.37 | 38.02 |
| Self-Train | - | - | 72.69 | 56.74 | 59.38 | 52.81 | 40.06 | 34.46 |
| SCCL | 83.10 | 61.96 | 79.90 | 63.78 | 70.83 | 69.21 | 42.49 | 39.16 |
| RSTC | 84.24 | 62.45 | 80.10 | 69.74 | 83.30 | 74.11 | 48.40 | 40.12 |
| GMVAE | 82.62 | 55.76 | 80.11 | 58.96 | 82.90 | 71.44 | 48.17 | 40.57 |
| RSTCBERT-TFIDF-Linear | 84.45 | 60.86 | 83.21 | 71.17 | 78.79 | 76.14 | 50.17 | 45.18 |
| RSTCBERT-TFIDF-Concat-1 | 85.79 | 63.26 | 80.90 | 69.99 | 82.41 | 78.45 | 49.34 | 45.00 |
| RSTCBERT-TFIDF-Concat-2 | 85.80 | 63.11 | 82.54 | 70.74 | 78.55 | 73.95 | 49.24 | 43.15 |
| COTC | 87.56 | 67.09 | 90.32 | 77.09 | 87.78 | 79.19 | 53.20 | 46.09 |
| Method | GoogleNews-TS ACC | GoogleNews-TS NMI | GoogleNews-T ACC | GoogleNews-T NMI | GoogleNews-S ACC | GoogleNews-S NMI | Tweet ACC | Tweet NMI |
|---|---|---|---|---|---|---|---|---|
| TFIDF-K-Means | 69.00 | 87.78 | 58.36 | 79.14 | 62.30 | 83.00 | 54.34 | 78.47 |
| BERT-K-Means | 65.71 | 86.60 | 55.53 | 78.38 | 56.62 | 80.50 | 53.44 | 78.99 |
| K-Means_IC | 79.81 | 92.91 | 68.88 | 83.55 | 74.48 | 88.53 | 66.54 | 84.84 |
| SCCL | 82.51 | 93.01 | 69.01 | 85.10 | 73.44 | 87.98 | 73.10 | 86.66 |
| RSTC | 83.27 | 93.15 | 72.27 | 87.39 | 79.32 | 89.40 | 75.20 | 87.35 |
| GMVAE | 83.37 | 93.48 | 79.98 | 90.25 | 80.65 | 90.04 | 73.23 | 88.86 |
| RSTCBERT-TFIDF-Linear | 83.72 | 93.26 | 74.29 | 88.67 | 81.57 | 91.17 | 78.20 | 89.42 |
| RSTCBERT-TFIDF-Concat-1 | 83.74 | 93.79 | 79.31 | 91.06 | 82.91 | 91.55 | 75.61 | 88.50 |
| RSTCBERT-TFIDF-Concat-2 | 84.03 | 93.55 | 74.46 | 87.70 | 81.23 | 90.60 | 83.62 | 90.30 |
| COTC | 90.50 | 96.33 | 83.53 | 92.07 | 86.10 | 93.49 | 91.33 | 95.09 |
Key Observations:
- Weakness of Shallow Methods: TFIDF-K-Means and BERT-K-Means generally perform poorly, highlighting that raw features combined with simple clustering algorithms are insufficient for complex short text data.
- Deep Learning Advantage: Methods employing deep learning for representation learning (STC2-LPI, Self-Train, SCCL, RSTC) significantly outperform the shallow methods, validating the importance of learning better representations.
- Power of BERT: BERT-based methods (SCCL, RSTC) demonstrate strong performance, often outperforming Word2Vec-based methods, confirming BERT's ability to capture rich semantics. RSTC, in particular, shows competitive results, being the previous SOTA among BERT-only approaches.
- Rediscovered Value of TFIDF: GMVAE, which models TFIDF features with a VAE, surprisingly shows strong performance, beating many baselines and even the BERT-based SCCL on some datasets (e.g., SearchSnippets, StackOverflow, GoogleNews-T/S, Tweet). This validates the paper's premise that TFIDF features, when properly modeled, remain valuable.
- Limitations of Simple Fusion: The RSTCBERT-TFIDF-Linear, RSTCBERT-TFIDF-Concat-1, and RSTCBERT-TFIDF-Concat-2 baselines, which naively combine BERT and TFIDF, show some improvement over RSTC on certain datasets (e.g., AgNews, SearchSnippets, Biomedical), but they are not consistently superior and sometimes even fall behind RSTC. This suggests that simple fusion cannot fully exploit the complementary strengths due to the distinct nature of the features.
- Superiority of COTC: COTC consistently achieves the highest ACC and NMI scores across all eight datasets, often by a significant margin. For example, on SearchSnippets, COTC reaches 90.32 ACC and 77.09 NMI, compared to RSTC's 80.10 ACC and 69.74 NMI. This robust performance validates the effectiveness of COTC's alignment-promoting co-training framework in leveraging the collective strengths of BERT and TFIDF features.
6.2. Ablation Studies / Parameter Analysis
6.2.1. Ablation Study for BERT Features
The paper conducts an ablation study to analyze the contribution of different components to the BERT module's performance. The ACC results for variants of the BERT module (i.e., evaluated on the BERT module's outputs) are shown below.
The following are the ACC results of the basic variants for BERT features; "vs Last" denotes the average improvement of each row over the previous one.
| Variant |  | AN | SS | SO | Bio | GN-TS | GN-T | GN-S | Tw | vs Last |
|---|---|---|---|---|---|---|---|---|---|---|
| Basis | (M) | 85.55 | 80.78 | 83.23 | 50.97 | 83.25 | 74.79 | 79.98 | 83.32 | - |
| w/ | () | 86.06 | 88.30 | 86.35 | 52.16 | 87.08 | 81.76 | 82.13 | 87.27 | +3.66 |
| w/ | () | 86.56 | 89.03 | 87.22 | 52.55 | 89.85 | 82.73 | 85.23 | 90.64 | +1.59 |
| COTC | () | 87.56 | 90.32 | 87.78 | 53.20 | 90.50 | 83.53 | 86.10 | 91.33 | +0.81 |
Analysis:
- Basis: A basic BERT module with contrastive learning, pseudo-labelling, and consistency constraints, but without any direct interaction with TFIDF information for alignment. It provides the baseline ACC.
- w/ similarity graph from TFIDF representations: This variant introduces the TFIDF representations to construct the similarity graph, which is then used for neighborhood augmentations in the BERT module's contrastive learning (see the sketch after this list). The average ACC improvement of +3.66 over the Basis model is substantial, particularly on SearchSnippets (+7.52) and GoogleNews-TS (+3.83). This clearly demonstrates the value of aligning the BERT representation space with the similarity structure learned from TFIDF features, especially for capturing keyword-level relationships.
- w/ cluster-level alignment: This variant further incorporates cluster-level alignment by using the cluster probabilities from the TFIDF module. It achieves an additional average improvement of +1.59 over the previous variant. This indicates that aligning the predicted cluster assignments between the two feature spaces provides further benefits, guiding both modules towards more coherent cluster structures.
- COTC: The final proposed method, which uses the unified joint training objective. It yields an average improvement of +0.81 over the previous variant. This improvement, though smaller than the previous steps, is consistent across datasets and validates the superiority of the tighter coupling and more efficient gradient propagation enabled by the unified objective.

These results confirm the incremental benefits of COTC's components, showing that both representation-level alignment (via graph structures) and cluster-level alignment (via probability distributions), integrated through a unified objective, are crucial for achieving state-of-the-art performance.
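To illustrate what the representation-level alignment can look like mechanically, the sketch below builds an L-nearest-neighbor similarity graph from one module's representations (a random matrix stands in for the TFIDF-module embeddings) and samples neighbor-based positive pairs for the other module's contrastive objective. This is an illustrative reconstruction under assumed shapes and APIs, not the paper's exact implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
z_tfidf = rng.normal(size=(1000, 128))   # placeholder for TFIDF-module representations
L = 10                                   # number of neighbors, the value used in the paper

# Similarity graph: for each sample, its L nearest neighbors in the
# TFIDF representation space (cosine distance).
nn = NearestNeighbors(n_neighbors=L + 1, metric="cosine").fit(z_tfidf)
_, idx = nn.kneighbors(z_tfidf)          # idx[:, 0] is the sample itself
neighbor_ids = idx[:, 1:]                # shape (N, L)

# Neighborhood augmentation for the BERT module: treat (anchor, random neighbor)
# as an extra positive pair in contrastive learning.
anchors = np.arange(len(z_tfidf))
positives = neighbor_ids[anchors, rng.integers(0, L, size=len(anchors))]
print(anchors[:5], positives[:5])
```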
6.2.2. Ablation Study for TFIDF Features
The paper also presents ablation studies for the TFIDF module, showing how its performance is influenced by interactions with the BERT module. The ACC and NMI results for the TFIDF-module variants (i.e., evaluated on the TFIDF module's outputs) are provided in Table 13 and Table 14 in the appendix of the original paper and reproduced below.
The following are the ACC results of the basic variants for TFIDF features; "vs Last" denotes the average improvement of each row over the previous one.
| Variant |  | AN | SS | SO | Bio | GN-TS | GN-T | GN-S | Tw | vs Last |
|---|---|---|---|---|---|---|---|---|---|---|
| Basis | (M) | 82.62 | 80.11 | 82.90 | 48.17 | 83.37 | 79.98 | 80.65 | 73.23 | - |
| w/ | () | 84.89 | 87.41 | 84.12 | 49.68 | 85.07 | 80.82 | 82.11 | 74.99 | +2.26 |
| w/ | () | 85.83 | 88.95 | 85.04 | 51.55 | 87.86 | 81.68 | 84.40 | 87.65 | +2.98 |
| COTC | () | 87.26 | 90.00 | 86.87 | 52.41 | 90.35 | 83.36 | 86.03 | 91.05 | +1.80 |
The following are the NMI results of the basic variants for TFIDF features; "vs Last" denotes the average improvement of each row over the previous one.
| Variant |  | AN | SS | SO | Bio | GN-TS | GN-T | GN-S | Tw | vs Last |
|---|---|---|---|---|---|---|---|---|---|---|
| Basis | (M) | 55.76 | 58.96 | 71.44 | 40.57 | 93.48 | 90.25 | 90.04 | 88.86 | - |
| w/ | () | 60.57 | 71.91 | 78.77 | 44.25 | 94.44 | 91.13 | 91.17 | 89.70 | +4.07 |
| w/ | () | 63.94 | 74.80 | 75.07 | 43.39 | 94.33 | 90.83 | 92.29 | 92.85 | +0.70 |
| COTC | () | 66.16 | 76.53 | 78.97 | 45.69 | 96.19 | 91.91 | 93.41 | 94.72 | +2.01 |
Analysis (for TFIDF Module):
- Basis: The standalone GMVAE model for TFIDF features. Its performance is already decent, outperforming many non-BERT baselines.
- w/ similarity graph from BERT representations: Incorporating the BERT representations to construct the similarity graph and using it to guide the TFIDF module's VAE learning significantly boosts performance (average +2.26 ACC, +4.07 NMI). This confirms that the deep semantic structure from BERT is highly beneficial for enriching TFIDF representations.
- w/ cluster-level alignment: Adding cluster-level alignment with BERT's cluster probabilities further improves performance (average +2.98 ACC, +0.70 NMI). This shows the value of harmonizing cluster assignments across both feature spaces (a minimal sketch of one way such an alignment term can be written follows this list).
- COTC: The full COTC with the unified joint training objective delivers additional average improvements (+1.80 ACC, +2.01 NMI), demonstrating the superior integration and signal propagation of the unified approach.

General Conclusion from the Ablations: The ablation studies clearly show that the co-training and alignment mechanisms are highly effective for both the BERT and TFIDF modules. Each module benefits significantly from the information shared by the other, validating the principle of leveraging complementary strengths. The unified joint training objective consistently provides the best overall performance.
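As an illustration of one way cluster-level alignment can be written in code, the sketch below penalizes disagreement between the two modules' soft cluster assignments with a symmetric KL divergence; the exact loss form, tensor shapes, and function name are assumptions for illustration rather than the paper's precise objective.

```python
import torch
import torch.nn.functional as F

def cluster_alignment_loss(logits_bert, logits_tfidf):
    """Symmetric KL between the two modules' soft cluster assignments."""
    log_p_b = F.log_softmax(logits_bert, dim=1)
    log_p_t = F.log_softmax(logits_tfidf, dim=1)
    kl_bt = F.kl_div(log_p_b, log_p_t.exp(), reduction="batchmean")  # KL(p_t || p_b)
    kl_tb = F.kl_div(log_p_t, log_p_b.exp(), reduction="batchmean")  # KL(p_b || p_t)
    return 0.5 * (kl_bt + kl_tb)

# Toy usage: a batch of 32 samples and 8 clusters.
logits_bert = torch.randn(32, 8, requires_grad=True)
logits_tfidf = torch.randn(32, 8, requires_grad=True)
loss = cluster_alignment_loss(logits_bert, logits_tfidf)
loss.backward()  # gradients flow into both modules' cluster heads
```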
6.2.3. Hyperparameter Sensitivity
- Sensitivity of the number of neighbors L: The sensitivity to the number of neighbors L is presented in Figure 3 of the original paper. [Figure 3: sensitivity of ACC, NMI, and neighbor precision to L on Biomedical (left) and GoogleNews-TS (right); as L grows, precision declines while ACC and NMI stay relatively stable.] The figure shows that using a proper number of neighbors (L > 0) generally benefits BERT features in clustering, compared to no neighbor augmentation. However, for some datasets like GoogleNews-TS, increasing L beyond an optimal point can decrease precision (the ratio of neighbors in the same class as the anchor), introducing noise and hurting performance. To balance performance across datasets, L is set to 10.
- Sensitivity of the weighting parameter $\lambda_1$: The sensitivity to the weighting parameter $\lambda_1$ is shown in Figure 5 of the original paper. [Figure 5: ACC and NMI on AgNews (left) and GoogleNews-S (right) under different values of $\lambda_1$.] The weighting parameter $\lambda_1$ balances the BERT module's loss ($\mathcal{L}_B$) and the TFIDF module's ELBO ($\mathcal{L}_{ELBO}'$). The figure indicates that keeping $\lambda_1$ within a proper range is important: extreme values can degrade performance, while values in the range of 0.04 to 0.16 give relatively stable results, suggesting the model is not overly sensitive within a reasonable range. For convenience, $\lambda_1$ is fixed to 0.1 for all datasets.
- Sensitivity of the temperature parameter $\tau$ (Gumbel-Softmax): The sensitivity to the temperature parameter $\tau$ of the Gumbel-Softmax trick is presented in Figure 6 of the original paper. [Figure 6: ACC (left) and NMI (right) under different values of $\tau$.] The results show that utilizing the Gumbel-Softmax trick with a temperature parameter $\tau$ generally enhances clustering performance, likely because it facilitates exploration during training by providing soft, differentiable approximations of discrete samples. $\tau$ is simply set to 0.5 for all datasets.

6.2.4. Case Study

- Keyword Analysis (SearchSnippets): To understand the keywords learned by the TFIDF module, the paper maps the cluster centers of the TFIDF module back to the vocabulary space using the embedding matrix $\pmb{\mathcal{E}}$ from the VAE's decoder (a minimal sketch of this mapping follows the table below). The top-5 relevant keywords for each cluster on SearchSnippets are listed below, showing that the TFIDF module successfully identifies domain-specific keywords for each topic. The following are the keywords revealed by the cluster centers in the TFIDF module on SearchSnippets.
| clusters | keywords | topics |
|---|---|---|
| #1 | business, market, services, financial, finance | Business |
| #2 | computer, software, programming, linux, web | Computers |
| #3 | movie, music, com, movies, film | Culture-Arts-Entertainment |
| #4 | edu, research, science, university, theory | Education-Science |
| #5 | electrical, car, motor, engine, products | Engineering |
| #6 | health, medical, information, disease, gov | Health |
| #7 | political, party, democracy, government, democratic | Politics-Society |
| #8 | sports, football, news, games, com | Sports |
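The keyword extraction above can be reproduced in spirit by scoring every vocabulary word against each cluster center through the decoder embedding matrix; the sketch below uses random placeholder matrices and a toy vocabulary, so only the mechanics (not the learned values) mirror the paper.

```python
import numpy as np

vocab = np.array(["business", "market", "movie", "health", "sports", "linux"])  # toy vocabulary
n_clusters, dim = 3, 16

rng = np.random.default_rng(0)
E = rng.normal(size=(len(vocab), dim))        # decoder embedding matrix (one row per word)
centers = rng.normal(size=(n_clusters, dim))  # cluster centers in the TFIDF module's latent space

# Score every word against every cluster center and take the top-k per cluster.
scores = centers @ E.T                        # shape (n_clusters, vocab_size)
top_k = np.argsort(-scores, axis=1)[:, :5]
for c, word_idx in enumerate(top_k):
    print(f"cluster #{c + 1}:", ", ".join(vocab[word_idx]))
```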
- Visualization (SearchSnippets): The visualization on SearchSnippets (Figure 4 of the original paper) qualitatively demonstrates the improved cluster separation achieved by COTC. [Figure 4: the text feature distribution before training (left) and after training (right), with numeric labels marking the different clusters; the post-training clusters are clearly better separated.] The left panel shows the initial feature distribution, while the right panel, post-training, shows well-defined and separated clusters, confirming the model's ability to learn distinct groupings.
- Keyword Analysis (StackOverflow): For StackOverflow, a domain-specific dataset, the TFIDF module also reveals highly relevant keywords for each of the 20 clusters, highlighting TFIDF's strength in capturing professional terms. The following are the keywords revealed by the cluster centers in the TFIDF module on StackOverflow.
| clusters | keywords | topics |
|---|---|---|
| #1 | excel, vba, cell, macro, data | excel |
| #2 | haskell, type, function, scala, list | haskell |
| #3 | mac, os, osx, application, app | OSX |
| #4 | linq, sql, query, using, join | linq |
| #5 | ajax, jquery, javascript, request, php | ajax |
| #6 | visual, studio, 2008, 2005, project | visual-studio |
| #7 | cocoa, using, file, use, text | cocoa |
| #8 | hibernate, mapping, criteria, query, hql | hibernate |
| #9 | sharepoint, web, site, 2007, list | sharepoint |
| #10 | bash, script, command, shell, file | bash |
| #11 | apache, rewrite, mod, htaccess, redirect | apache |
| #12 | wordpress, posts, post, page, blog | wordpress |
| #13 | svn, subversion, repository, files, commit | svn |
| #14 | drupal, node, views, module, content | drupal |
| #15 | qt, widget, window, creator, application | qt |
| #16 | scala, java, class, type, actors | scala |
| #17 | magento, product, products, page, admin | magento |
| #18 | matlab, matrix, plot, array, function | matlab |
| #19 | oracle, sql, table, pl, database | oracle |
| #20 | spring, bean, hibernate, security, using | spring |
- Visualization (StackOverflow): The visualization on StackOverflow (Figure 7 of the original paper) further illustrates the benefits of COTC. [Figure 7: the sample distribution before training (left) and after training (right); after training the samples cluster much more tightly, with star markers indicating the samples of the highlighted central cluster.] The figure from the introduction (Figure 1) showed BERT features separating TFIDF neighbors. Figure 7, the post-training visualization, demonstrates that COTC successfully clusters these samples together, and the decision boundaries between different clusters become clearer. This supports the paper's claim that COTC addresses BERT's keyword weakness.
6.2.5. Investigation of Other Features and Base Models
- Replacing TFIDF with Other Keyword Features: The paper investigates whether TFIDF can be replaced by other keyword-reflecting features such as BoW (Bag-of-Words) or Word2Vec. The following are the ACC results using BoW or Word2Vec instead of TFIDF features; COTCBERT-TFIDF is the final proposed method. (A small sketch of the three feature extractors follows this item.)

| dataset | AN | SO | Bio | GN-TS | Tw |
|---|---|---|---|---|---|
| RSTCBERT | 84.24 | 83.30 | 48.40 | 83.27 | 75.20 |
| COTCBERT-W2V | 34.24 | 28.06 | 27.50 | 74.06 | 14.36 |
| COTCBERT-BoW | 87.41 | 84.91 | 52.68 | 89.15 | 88.43 |
| COTCBERT-TFIDF | 87.56 | 87.78 | 53.20 | 90.50 | 91.33 |

Analysis:

- COTCBERT-BoW performs nearly as well as COTCBERT-TFIDF, indicating that BoW features, like TFIDF, effectively capture keyword information and can be successfully integrated into the co-training framework.
- COTCBERT-W2V performs very poorly. The authors conjecture that Word2Vec is more similar to BERT in nature (dense embeddings) and strongly relies on its pre-training corpus, thus not providing the complementary, keyword-focused information that TFIDF or BoW do.
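For clarity on what gets swapped in this comparison, the sketch below builds the three alternative feature sets (TFIDF, BoW, and averaged Word2Vec vectors) for a toy corpus; the tiny Word2Vec model trained here is purely illustrative and not comparable to properly pre-trained embeddings.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from gensim.models import Word2Vec

texts = ["excel vba macro cell", "haskell type function list",
         "bash shell script command", "oracle sql table database"]  # toy corpus

# TFIDF and BoW: sparse, vocabulary-indexed features that directly reflect keywords.
tfidf_feats = TfidfVectorizer().fit_transform(texts)
bow_feats = CountVectorizer().fit_transform(texts)

# Word2Vec: dense embeddings averaged over the words of each text
# (trained here on the toy corpus only, for illustration).
tokens = [t.split() for t in texts]
w2v = Word2Vec(sentences=tokens, vector_size=50, min_count=1, epochs=20)
w2v_feats = np.stack([np.mean([w2v.wv[w] for w in sent], axis=0) for sent in tokens])

print(tfidf_feats.shape, bow_feats.shape, w2v_feats.shape)
```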
- Using Different BERT Backbones: The paper tests the robustness of COTC with different BERT base models (instead of the default distilbert-base-nli-stsb-mean-tokens). The following are the ACC results using different base models instead of the default sentence-distilbert.

| dataset | AN | GN-TS | Tw |
|---|---|---|---|
| RSTCXLNet-base-uncased | 71.75 | 34.47 | 10.07 |
| COTCXLNet-base-uncased | 84.60 | 80.21 | 71.97 |
| RSTCBERT-base-uncased | 82.23 | 77.45 | 73.87 |
| COTCBERT-base-uncased | 87.83 | 89.10 | 89.81 |
| RSTCRoBERTa-base | 85.76 | 75.63 | 71.08 |
| COTCRoBERTa-base | 87.44 | 88.32 | 90.45 |

Analysis:

- The performance of RSTC (a BERT-only SOTA method) is highly dependent on the quality of the BERT backbone. For instance, RSTC with XLNet-base-uncased performs very poorly on Tweet (10.07 ACC).
- In contrast, COTC demonstrates much higher stability and consistently good performance across different BERT backbones. Even when the base BERT model is weak (XLNet-base-uncased), COTC still achieves significantly better results (e.g., 71.97 ACC on Tweet with XLNet, a massive improvement over RSTC's 10.07 ACC). This confirms that the complementary TFIDF features and the co-training framework enhance the robustness of the clustering, compensating for weaknesses in the underlying BERT representations.
6.2.6. Clustering with Noisy Data
The authors test COTC's stability under noisy data conditions by adding random samples from the Biomedical dataset as noise to the StackOverflow dataset.
The following are the clustering results when COTC performs clustering under noisy data conditions, i.e., StackOverflow contaminated by Biomedical samples.
| percentage of noisy samples | 0% | 1% | 2% | 3% | 4% |
|---|---|---|---|---|---|
| ACC | 87.78 | 87.13 | 84.32 | 83.59 | 81.72 |
| NMI | 79.19 | 78.54 | 77.29 | 77.20 | 76.60 |
Analysis:
- As expected, performance (ACC and NMI) gradually declines as the percentage of noisy samples increases.
- However, even with 4% noise from a completely different domain (Biomedical vs. StackOverflow), COTC maintains a relatively high ACC of 81.72% and NMI of 76.60%. This indicates that COTC is relatively robust to noise and can maintain a certain level of stability even when clustering contaminated data. (A small sketch of the contamination setup follows this list.)
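The contamination setup can be approximated with a small helper like the one below; the dataset variables are placeholders, and whether the noise percentage is taken relative to the clean corpus or the mixed corpus is an assumption here.

```python
import random

def contaminate(target_texts, noise_texts, noise_ratio, seed=0):
    """Mix a given fraction of out-of-domain samples into the target corpus."""
    random.seed(seed)
    n_noise = int(len(target_texts) * noise_ratio)
    noisy = target_texts + random.sample(noise_texts, n_noise)
    random.shuffle(noisy)
    return noisy

stackoverflow_texts = ["how to join two tables in sql", "bash loop over files"]    # placeholders
biomedical_texts = ["gene expression in tumor cells", "protein folding pathways"]  # placeholders
corpus_2pct = contaminate(stackoverflow_texts, biomedical_texts, noise_ratio=0.02)
```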
6.2.7. Comparison with LLM Zero-Shot Clustering
The paper includes a brief comparison with zero-shot clustering using a Large Language Model (LLM), specifically Qwen2-7B-Instruct, on the AgNews dataset.
The following are the clustering results of the LLM for zero-shot short text clustering on AgNews.
| Method | ACC | NMI |
|---|---|---|
| Qwen2-7B-Instruct-zero-shot | 75.28 | 48.27 |
| COTC | 87.56 | 67.09 |
Analysis:
- Qwen2-7B-Instruct achieves a respectable ACC of 75.28% and NMI of 48.27% in a zero-shot setting, even when provided with the specific category names in the prompt. This highlights the impressive capabilities of LLMs for text understanding without explicit training on the clustering task. (A minimal prompting sketch follows this list.)
- However, COTC still significantly outperforms the LLM (87.56 ACC, 67.09 NMI). This suggests that while LLMs are powerful generalists, specialized models like COTC, which are carefully designed and trained for the specific task of short text clustering by leveraging complementary features, remain highly competitive and often superior for this task. The paper notes that LLMs of this size might be practical for large datasets, but dedicated models still hold an advantage.
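For reference, the zero-shot LLM protocol (assigning each text to one of the known category names via a prompt) can be sketched with the Hugging Face transformers API as below; the prompt wording, generation settings, and post-processing are assumptions rather than the paper's exact setup, and ACC/NMI would still have to be computed over the collected answers.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto")

categories = ["World", "Sports", "Business", "Sci/Tech"]  # AgNews class names

def zero_shot_label(text):
    messages = [
        {"role": "system", "content": "You are a text classification assistant."},
        {"role": "user", "content": f"Assign this news snippet to exactly one category "
                                    f"from {categories}. Answer with the category name only.\n\n{text}"},
    ]
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=8, do_sample=False)
    # Keep only the newly generated tokens and strip whitespace.
    return tokenizer.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True).strip()

print(zero_shot_label("Stocks rallied after the central bank held interest rates steady."))
```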
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper successfully addresses the challenge of short text clustering by proposing the CO-Training Clustering (COTC) framework. COTC innovatively combines the deep semantic understanding capabilities of BERT features with the keyword information strengths of TFIDF features, which BERT-only models often miss. The core of COTC lies in its dual-module design, where a BERT module and a TFIDF module are developed to mutually promote each other. This mutual promotion is achieved through explicit alignment at both the representation level (using similarity graphs) and the cluster assignment level (aligning predicted probabilities). Furthermore, the alternating co-training process is unified into a single, efficient joint training objective, allowing for seamless gradient propagation. Extensive experiments across eight benchmark datasets consistently demonstrate COTC's superior performance, outperforming existing state-of-the-art methods by a significant margin and showcasing its robustness across different BERT backbones and even in the presence of noisy data. The work effectively rediscovers the value of TFIDF features in a modern deep learning context.
7.2. Limitations & Future Work
The authors acknowledge several limitations of the current COTC framework:
- Known Number of Clusters: Similar to many previous clustering methods, COTC requires the number of clusters to be known beforehand. This can be a practical limitation in real-world scenarios where it is unknown.
- Hyperparameter Tuning Complexity: Due to the inherent differences between BERT (low-dimensional, dense) and TFIDF (high-dimensional, sparse) features, the neural networks in the respective modules often require different learning rates, which necessitates additional hyperparameter tuning effort.
- Increased Computational Cost: Compared to methods relying solely on BERT features, the introduction of a TFIDF module incurs extra time and space for training and inference.

For future work, the authors plan to:

- Explore a more compact and efficient way to leverage the collective strengths of different text features without significantly increasing computational costs. This could involve more lightweight TFIDF modules or more integrated architectures.
7.3. Personal Insights & Critique
This paper presents a highly insightful and practical approach to short text clustering. The core idea of explicitly leveraging complementary feature types rather than solely relying on a single, albeit powerful, feature is a valuable lesson. The strength of BERT in capturing broad semantics is undeniable, but the paper rightly points out its blind spot for domain-specific keywords. TFIDF, despite being a traditional technique, effectively fills this gap.
Inspirations and Applications:
- Multi-Modal Feature Fusion: The co-training and alignment strategy could inspire similar approaches in other domains where multiple data modalities or feature types (e.g., text and images, or different types of text embeddings) possess complementary information but have intrinsically different distributions.
- Domain-Specific Adaptation: For fields requiring high precision on specialized terminology (e.g., legal documents, medical records, scientific papers), this framework offers a robust way to ensure that critical keywords are not overlooked by general-purpose language models.
- Robustness Enhancement: The improved robustness of COTC when using weaker BERT backbones or facing noisy data is a significant advantage, suggesting that this framework can make clustering pipelines more resilient.
Potential Issues and Areas for Improvement:
- Automated K-Determination: The requirement of a known number of clusters is a common but persistent challenge in clustering. Future work could integrate methods for automatic cluster number detection (e.g., silhouette score, elbow method, or Bayesian non-parametric models) into the COTC framework.
- Adaptive Hyperparameter Tuning: The need for extensive hyperparameter tuning (especially different learning rates) highlights a practical hurdle. Research into adaptive optimization strategies or meta-learning for dynamic learning rate adjustment across disparate modules could reduce this burden.
- Scalability for Very Large Vocabularies: While TFIDF is effective, for extremely large and dynamic vocabularies the TFIDF feature vector can become very high-dimensional, potentially increasing memory and computational demands for the VAE. Exploring more sparse-aware or approximate techniques for the TFIDF module could be beneficial.
- Beyond BERT and TFIDF: The paper opens the door to integrating other types of specialized features (e.g., graph-based features for texts in knowledge graphs, or syntactic features) into a similar co-training paradigm, further enhancing the model's capabilities for complex text data.

Overall, COTC provides a well-reasoned and empirically strong solution for short text clustering, emphasizing the enduring value of traditional methods when thoughtfully integrated with modern deep learning.