
Leveraging BERT and TFIDF Features for Short Text Clustering via Alignment-Promoting Co-Training

Published: 11/01/2024
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

The COTC framework combines BERT and TFIDF features for enhanced short text clustering, achieving effective alignment through mutual learning of two modules. Experiments show significant performance improvements over state-of-the-art methods on eight benchmark datasets.

Abstract

BERT and TFIDF features excel in capturing rich semantics and important words, respectively. Since most existing clustering methods are solely based on the BERT model, they often fall short in utilizing keyword information, which, however, is very useful in clustering short texts. In this paper, we propose a CO-Training Clustering (COTC) framework to make use of the collective strengths of BERT and TFIDF features. Specifically, we develop two modules responsible for the clustering of BERT and TFIDF features, respectively. We use the deep representations and cluster assignments from the TFIDF module outputs to guide the learning of the BERT module, seeking to align them at both the representation and cluster levels. Reversely, we also use the BERT module outputs to train the TFIDF module, thus leading to the mutual promotion. We then show that the alternating co-training framework can be placed under a unified joint training objective, which allows the two modules to be connected tightly and the training signals to be propagated efficiently. Experiments on eight benchmark datasets show that our method outperforms current SOTA methods significantly.

In-depth Reading

1. Bibliographic Information

1.1. Title

Leveraging BERT and TFIDF Features for Short Text Clustering via Alignment-Promoting Co-Training

1.2. Authors

Zetong Li, Qinliang Su, Shijing Si, Jianxing Yu

Affiliations:

  • School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China
  • School of Economics and Finance, Shanghai International Studies University, Shanghai, China
  • School of Artificial Intelligence, Sun Yat-sen University, Guangdong, China

1.3. Journal/Conference

Published at EMNLP (Conference on Empirical Methods in Natural Language Processing) 2024. EMNLP is a highly reputable and influential conference in the field of Natural Language Processing, known for showcasing cutting-edge research.

1.4. Publication Year

2024

1.5. Abstract

This paper introduces a novel CO-Training Clustering (COTC) framework for short text clustering, which effectively combines the strengths of Bidirectional Encoder Representations from Transformers (BERT) features and Term Frequency-Inverse Document Frequency (TFIDF) features. BERT excels at capturing rich semantic information, while TFIDF is proficient in identifying important keywords, a crucial aspect often overlooked by BERT-centric methods, especially for domain-specific short texts. The COTC framework comprises two distinct modules, one for BERT features and one for TFIDF features. These modules are designed to mutually enhance each other's learning process by aligning their deep representations and cluster assignments. The outputs (deep representations and cluster assignments) from the TFIDF module guide the learning of the BERT module, and vice-versa, leading to a symbiotic relationship. The authors demonstrate that this alternating co-training approach can be consolidated into a unified joint training objective, allowing for tighter module connections and more efficient propagation of training signals. Extensive experiments on eight benchmark datasets show that COTC significantly surpasses current state-of-the-art methods in clustering performance.

  • Original Source Link: https://aclanthology.org/2024.emnlp-main.828/
  • PDF Link: https://aclanthology.org/2024.emnlp-main.828.pdf
  • Publication Status: Officially published at EMNLP 2024.

2. Executive Summary

2.1. Background & Motivation

The core problem addressed by this paper is the accurate clustering of short text segments. Short text clustering is vital for various applications, including topic discovery, news recommendation, and spam detection.

Challenges and Gaps in Prior Research:

  1. Limited Information in Short Texts: Short texts inherently contain very few words, making it challenging to extract sufficient semantic information for effective clustering.
  2. Limitations of Traditional TFIDF: While traditional methods used TFIDF features, these features often lack deep semantic understanding, leading to suboptimal performance.
  3. Limitations of BERT-based Methods: The advent of BERT models significantly improved text representation by capturing deep semantics. However, BERT models, primarily trained on general texts, often struggle to capture domain-specific keyword information effectively. For example, in specialized domains, unique technical terms (like QT, QMAKESPEC in StackOverflow data) are crucial for topic identification but might be rare in BERT's pre-training corpus, making BERT less sensitive to their importance. This can restrict clustering performance when keyword information is vital.
  4. Under-explored Collective Strengths: Prior attempts to combine BERT and TFIDF features often involved simple fusion techniques (e.g., concatenation), which failed to fully leverage the distinct and complementary strengths of both feature types due to their "intrinsically different natures."

Paper's Entry Point / Innovative Idea: The paper's innovative idea is to recognize and explicitly leverage the complementary strengths of BERT (deep semantics) and TFIDF (keyword signals). Instead of simple fusion, it proposes a sophisticated co-training framework that allows two distinct modules, each specializing in one feature type, to mutually guide and learn from each other. This alignment-promoting co-training aims to overcome the individual shortcomings of BERT-only or TFIDF-only approaches.

2.2. Main Contributions / Findings

The paper makes several significant contributions:

  1. Proposed COTC Framework: Introduction of a novel CO-Training Clustering (COTC) framework that systematically combines BERT and TFIDF features for short text clustering. This framework designs two modules, one for BERT and one for TFIDF, allowing them to learn complementary information.
  2. Mutual Promotion through Alignment: Development of specific mechanisms within COTC for mutual promotion. The BERT module uses deep representations and cluster assignments from the TFIDF module for guidance, aligning at both representation and cluster levels. Conversely, the TFIDF module is trained using outputs from the BERT module.
  3. Unified Joint Training Objective: Demonstrating that the alternating co-training process can be formulated under a unified joint training objective. This objective allows for a tighter connection between the two modules and more efficient propagation of training signals, empirically leading to better clustering performance.
  4. Rediscovery of TFIDF Value: Reaffirming the utility of TFIDF features, showing that when properly integrated with deep learning models, they are still valuable, especially for keyword-sensitive tasks, and can complement state-of-the-art BERT features.
  5. State-of-the-Art Performance: Achieving significant improvements over current state-of-the-art methods on eight benchmark datasets, validating the effectiveness and robustness of the COTC framework.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

3.1.1. Short Text Clustering

Short text clustering is the task of grouping short text segments (e.g., tweets, search queries, news headlines) into clusters such that texts within the same cluster are semantically similar, while texts in different clusters are dissimilar. Unlike text classification, clustering is an unsupervised task, meaning it does not rely on pre-labeled data. The brevity of short texts makes this task particularly challenging as they often lack sufficient context or word count to derive strong semantic signals using traditional methods.

3.1.2. TFIDF (Term Frequency-Inverse Document Frequency)

TFIDF is a statistical measure that evaluates how important a word is to a document in a collection or corpus. The TFIDF value increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus, which helps to adjust for the fact that some words appear more frequently in general.

  • Term Frequency (TF): Measures how frequently a term appears in a document. A higher TF often indicates greater importance within that document. $ \mathrm{tf}(t, d) = \frac{\text{number of times term } t \text{ appears in document } d}{\text{total number of terms in document } d} $ Where:
    • $\mathrm{tf}(t, d)$ is the term frequency of term $t$ in document $d$.
    • $t$ represents a specific term (word).
    • $d$ represents a document.
  • Inverse Document Frequency (IDF): Measures how important a term is across the entire corpus. Rare words tend to have a high IDF score, while common words (like "the", "a") have a low IDF score. $ \mathrm{idf}(t, D) = \log\left(\frac{N}{\text{number of documents in } D \text{ containing } t}\right) $ Where:
    • $\mathrm{idf}(t, D)$ is the inverse document frequency of term $t$ in the corpus $D$.
    • $N$ is the total number of documents in the corpus $D$.
    • $\log$ is typically the natural logarithm.
  • TFIDF Score: The TFIDF score is the product of TF and IDF. $ \mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \times \mathrm{idf}(t, D) $ Where:
    • $\mathrm{tfidf}(t, d, D)$ is the TFIDF score of term $t$ in document $d$ within corpus $D$. TFIDF features are word-frequency based and excel at capturing keyword information, which is very useful for topics often defined by specific terms.
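
As a concrete illustration, the following is a minimal sketch of extracting TFIDF features with scikit-learn; the example texts are invented, and the 2048-term vocabulary cap only mirrors the TFIDF feature dimension listed later in Section 4.2.6 (the paper's own preprocessing may differ).

```python
# Minimal sketch: turning short texts into TFIDF vectors with scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer

texts = [
    "How to parse XML with SimpleXML in PHP?",
    "Parsing XML files in PHP with DOMDocument",
    "Best way to brew espresso at home",
]

vectorizer = TfidfVectorizer(max_features=2048)  # keep at most the 2048 most frequent terms
tfidf = vectorizer.fit_transform(texts)          # sparse (N, |vocab|) matrix of TFIDF scores

print(tfidf.shape)
print(vectorizer.get_feature_names_out()[:10])
```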

3.1.3. BERT (Bidirectional Encoder Representations from Transformers)

BERT is a powerful pre-trained language model developed by Google, based on the Transformer architecture. It revolutionized Natural Language Processing (NLP) by introducing bidirectional training, meaning it processes text by considering the context from both the left and right sides of a word simultaneously. This allows BERT to generate deep contextualized embeddings (vector representations of words) that capture rich semantic meanings based on the surrounding words in a sentence. BERT is typically pre-trained on a massive amount of text data using tasks like Masked Language Modeling (predicting masked words) and Next Sentence Prediction. After pre-training, it can be fine-tuned for various downstream NLP tasks with minimal architecture changes. BERT features are known for capturing deep semantic information and contextual understanding.

3.1.4. K-Means Clustering

K-Means is a popular unsupervised learning algorithm used for clustering. The goal of K-Means is to partition $n$ observations into $k$ clusters, where each observation belongs to the cluster with the nearest mean (centroid), which serves as a prototype of the cluster. The basic K-Means algorithm involves:

  1. Initialization: Randomly selecting $k$ initial centroids.
  2. Assignment Step: Assigning each data point to its closest centroid.
  3. Update Step: Recalculating the centroids as the mean of all data points assigned to that cluster. Steps 2 and 3 are repeated until the centroids no longer change significantly or a maximum number of iterations is reached.
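
For concreteness, here is a minimal sketch of this loop using scikit-learn's `KMeans` on random 128-dimensional vectors standing in for learned text representations (the data and dimensions are illustrative, not the paper's setup).

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 128))            # stand-in for 128-d text representations

kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)             # alternates assignment and centroid-update steps
print(labels[:10], kmeans.cluster_centers_.shape)
```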

3.1.5. Deep Embedded Clustering (DEC)

Deep Embedded Clustering (DEC) is a deep learning-based approach to clustering that simultaneously learns feature representations and cluster assignments. It aims to map data points into a lower-dimensional latent space using a deep neural network (e.g., an autoencoder) and then performs clustering in this learned space. DEC iteratively refines both the feature representations and the cluster assignments by optimizing a clustering objective that encourages assigning data points to their high-confidence clusters. It often uses a Kullback-Leibler (KL) divergence-based loss to match a soft assignment distribution to a target distribution that sharpens the cluster assignments.

3.1.6. Contrastive Learning

Contrastive learning is a self-supervised learning paradigm where a model learns representations by comparing similar and dissimilar samples. The core idea is to "contrast" data points: make representations of positive pairs (augmented versions of the same sample, or semantically similar samples) closer in the embedding space, and representations of negative pairs (different samples) farther apart. This is often achieved using a contrastive loss function (e.g., NT-Xent loss), which encourages discriminative representations without explicit human-provided labels.

3.1.7. Pseudo-labelling

Pseudo-labelling is a semi-supervised learning technique where a model is trained on a small amount of labeled data and then used to predict labels for the unlabeled data (these predictions are called pseudo-labels). The model is then retrained on the combination of original labeled data and the newly pseudo-labeled data. This iterative process helps the model to leverage the vast amount of unlabeled data available. In clustering, it can be used to generate target assignments (pseudo-labels) for samples, which can then be used to train a classifier-like clustering head.

3.1.8. Variational Autoencoder (VAE)

A Variational Autoencoder (VAE) is a generative neural network model that learns a compressed, continuous latent representation of input data. It consists of two main parts:

  1. Encoder: Takes an input (e.g., an image, text) and maps it to a distribution over a latent space (typically a Gaussian distribution, defined by mean and variance vectors). This makes the latent space probabilistic rather than deterministic.
  2. Decoder: Takes a sample from the latent distribution and reconstructs the original input. VAEs are trained to minimize a loss function that combines two terms: a reconstruction loss (how well the output matches the input) and a KL-divergence loss (which regularizes the latent space, forcing the latent distributions to be close to a prior distribution, often a standard normal distribution). This ensures that the latent space is well-structured and allows for generating new samples by sampling from the prior.

3.1.9. Optimal Transport (OT)

Optimal Transport (OT) is a mathematical framework for comparing probability distributions. In the context of machine learning, especially for pseudo-labelling in clustering, OT can be used to find a mapping (or "transport plan") that transforms one distribution of predicted cluster probabilities into another target distribution (e.g., a uniform distribution across clusters). This helps to prevent mode collapse (where all samples are assigned to a single cluster) and ensures a balanced distribution of samples across all clusters. The goal is to minimize the "cost" of transforming one distribution into another.

3.1.10. KL-Divergence (Kullback-Leibler Divergence)

The Kullback-Leibler (KL) divergence is a non-symmetric measure of how one probability distribution $P$ differs from a second, reference probability distribution $Q$. It quantifies the amount of information lost when $Q$ is used to approximate $P$. $ D_{\mathrm{KL}}(P \| Q) = \sum_{x \in \mathcal{X}} P(x) \log\left(\frac{P(x)}{Q(x)}\right) $ Where:

  • $P(x)$ is the probability of event $x$ in distribution $P$.
  • $Q(x)$ is the probability of event $x$ in distribution $Q$.
  • $\mathcal{X}$ is the set of all possible events. A KL-divergence of zero indicates that the two distributions are identical. Higher values indicate greater dissimilarity. It is often used as a regularization term or an alignment loss to encourage one distribution to match another.
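
As a quick numerical illustration, the following sketch computes the discrete KL-divergence with NumPy; the two example distributions are arbitrary.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """D_KL(P || Q) for two discrete distributions given as arrays."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.5, 0.3, 0.2])
print(kl_divergence(p, q))   # > 0, while kl_divergence(p, p) == 0
```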

3.1.11. Gumbel-Softmax Trick

The Gumbel-Softmax trick (or Concrete distribution) is a technique used to enable gradient-based optimization for models that involve sampling from discrete categorical distributions. Standard sampling from discrete distributions is not differentiable, preventing gradients from flowing back through the sampling step. The Gumbel-Softmax trick approximates a discrete sample using a continuous, differentiable function, allowing the model to be trained end-to-end with backpropagation. It introduces Gumbel noise to the logits of the categorical distribution and then applies a softmax function with a temperature parameter. As the temperature approaches zero, the output approximates a one-hot sample from the categorical distribution.
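
A minimal PyTorch sketch of the trick is shown below; note that PyTorch also ships a built-in `torch.nn.functional.gumbel_softmax`, so this hand-rolled version is purely for illustration.

```python
import torch
import torch.nn.functional as F

def gumbel_softmax_sample(logits, tau=0.5):
    """Differentiable approximation of sampling from Categorical(softmax(logits))."""
    gumbel = -torch.log(-torch.log(torch.rand_like(logits) + 1e-20) + 1e-20)
    return F.softmax((logits + gumbel) / tau, dim=-1)

logits = torch.tensor([[2.0, 0.5, 0.1]], requires_grad=True)
sample = gumbel_softmax_sample(logits, tau=0.5)   # close to one-hot for small tau
sample.sum().backward()                           # gradients flow back to the logits
print(sample, logits.grad)
```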

3.2. Previous Works

The paper contextualizes its approach by reviewing the evolution of short text clustering methods:

  • Early Studies (BoW/TFIDF + K-Means/Hierarchical Clustering):

    • Banerjee et al., 2007; Hu et al., 2009: These works focused on enriching Bag-of-Words (BoW) features with external knowledge (e.g., Wikipedia) before applying traditional clustering algorithms like K-Means or hierarchical agglomerative clustering.
    • Limitation: These methods suffered from the sparsity of BoW features and their inability to capture deep semantic information.
    • TFIDF-K-Means: A baseline mentioned, directly applying K-Means to TFIDF features.
    • K-Means_IC (Rakib et al., 2020): Applies K-Means to TFIDF features enhanced with an iterative classification algorithm.
  • Word2Vec-based Methods:

    • Yang et al., 2017; Guo et al., 2017: Used neural networks to learn better representations from features.
    • Xu et al., 2017; Hadifar et al., 2019: Utilized Word2Vec embeddings, which provide dense, lower-dimensional representations, for clustering, often combined with DEC (Deep Embedded Clustering) loss.
    • STC2-LPI (Xu et al., 2017): Uses Word2Vec to fit codes pre-obtained with LPI (Locality Preserving Projections) through a CNN, then clusters deep representations with K-Means.
    • Self-Train (Hadifar et al., 2019): Employs an autoencoder to model Word2Vec enhanced with SIF (Smooth Inverse Frequency) and fine-tunes the encoder with DEC loss.
    • Limitation: Word2Vec embeddings are shallow and cannot capture deep contextual semantic information as effectively as later models.
  • BERT-based Methods:

    • Inspired by the success of BERT (Devlin et al., 2019), recent methods apply clustering heads on top of BERT features.
    • Huang et al., 2020: One of the first to fine-tune BERT with Masked Language Model loss and DEC loss for clustering.
    • SCCL (Zhang et al., 2021): Combines contrastive learning with DEC on BERT features to achieve clustering, believing contrastive learning yields better representations.
    • Yin et al., 2022: Builds on SCCL by adding a topic modeling module to enhance representation semantics.
    • RSTC (Zheng et al., 2023): Addresses DEC's tendency to assign all samples to one cluster by employing pseudo-labelling (using Optimal Transport (OT)) for clustering, achieving state-of-the-art performance among BERT-only methods.
    • BERT-K-Means: A baseline applying K-Means directly to BERT features.
    • Limitation: While powerful for semantics, BERT struggles with domain-specific keywords if they are rare in its pre-training corpus (as highlighted in Figure 1).
  • Hybrid Baselines (Simple Fusion): The paper also introduces simple fusion baselines to show that naive combinations are not enough.

    • RSTCBERT-TFIDF-Linear: Linearly combines BERT features with a dimensionality-reduced TFIDF feature, then feeds into RSTC.
    • RSTCBERT-TFIDF-Concat-1: Concatenates BERT features with a dimensionality-reduced TFIDF feature, then feeds into RSTC.
    • RSTCBERT-TFIDF-Concat-2: Concatenates BERT features with the original TFIDF feature, then feeds into RSTC.
    • Limitation: These intuitive fusions often fail to fully exploit the distinct natures of BERT and TFIDF features.
  • TFIDF-based VAE:

    • GMVAE (Jiang et al., 2017): A baseline that models TFIDF features via VAE with a Gaussian mixture prior. This is used to benchmark the TFIDF module of COTC.

3.3. Technological Evolution

The field of short text clustering has evolved from simple statistical methods to sophisticated deep learning approaches:

  1. Early Statistical Methods (pre-2010s): Focused on Bag-of-Words (BoW) or TFIDF representations, often enriched with external knowledge, combined with traditional clustering algorithms like K-Means or Hierarchical Clustering. These suffered from sparsity and lack of semantic understanding.

  2. Distributed Word Embeddings (2013 onwards): The introduction of Word2Vec (Mikolov et al., 2013) provided dense, continuous word embeddings, enabling better semantic representation than BoW. Methods like DEC were applied to these embeddings, marking a shift towards neural network-based representation learning.

  3. Contextualized Embeddings (2018 onwards): The Transformer architecture and models like BERT (Devlin et al., 2019) brought a revolution. BERT's ability to generate contextualized embeddings captured deep, dynamic semantics, leading to significant performance gains across many NLP tasks, including clustering. Recent works focused on fine-tuning BERT with contrastive learning or pseudo-labelling for clustering.

    This paper's work, COTC, fits into this timeline by proposing a hybrid approach. It acknowledges the strengths of BERT but recognizes its limitations in specific contexts (keywords). It then re-integrates the "seemingly outdated but easily available TFIDF features" through a sophisticated co-training mechanism, rather than abandoning them. This represents an evolution from purely BERT-centric approaches to a more multi-modal feature learning strategy for clustering.

3.4. Differentiation Analysis

Compared to the main methods in related work, the core differences and innovations of COTC are:

  • Beyond Single-Feature Reliance: Most previous SOTA methods (e.g., SCCL, RSTC) rely solely on BERT features, or earlier ones on TFIDF/Word2Vec. COTC fundamentally distinguishes itself by acknowledging that no single feature type is perfect and that BERT and TFIDF have complementary strengths (deep semantics vs. keyword signals).
  • Sophisticated Co-Training vs. Simple Fusion: While some baselines (e.g., RSTCBERT-TFIDF-Linear/Concat) attempt to combine BERT and TFIDF, they do so through naive fusion (concatenation, linear combination). COTC proposes a co-training framework with two dedicated modules that mutually promote each other. This is a much deeper integration than simple feature concatenation.
  • Explicit Alignment at Multiple Levels: COTC explicitly enforces alignment between the two modules at both the representation level (using similarity graphs from one module to guide contrastive learning in the other) and the cluster assignment level (aligning predicted cluster probabilities using KL-divergence). This ensures that the complementary information is effectively shared and integrated.
  • Unified Joint Training Objective: The paper goes a step further by formulating the alternating co-training into a unified joint training objective. This allows for tighter coupling and more efficient gradient propagation between the BERT and TFIDF modules, which is a significant architectural and training innovation.
  • Robustness to BERT's Keyword Weakness: COTC directly addresses BERT's weakness in capturing domain-specific keyword information by explicitly integrating TFIDF features, which are strong in this aspect. This makes COTC more robust for short text clustering in diverse domains.

4. Methodology

4.1. Principles

The core idea behind the CO-Training Clustering (COTC) framework is to leverage the complementary strengths of BERT features (for deep semantics) and TFIDF features (for keyword information) by developing two specialized modules that mutually promote each other's learning. The key principles are:

  1. Dual-Module Design: Two distinct modules, one for BERT and one for TFIDF, are responsible for generating representations and cluster assignments from their respective feature types.
  2. Alignment-Promoting Learning: The modules are designed to actively align their learned representations and cluster assignments. This means information from one module (e.g., TFIDF-induced similarity structures) guides the learning of the other module (e.g., BERT's contrastive learning).
  3. Mutual Promotion: Through this alignment, each module benefits from the unique strengths of the other, leading to improved performance that neither could achieve alone.
  4. Unified Optimization: The alternating co-training process is consolidated into a single joint training objective to enable more efficient gradient propagation and tighter coupling between the modules.

4.2. Core Methodology In-depth (Layer by Layer)

The COTC framework comprises two primary modules: $\mathcal{F}_B(\cdot)$ for BERT features and $\mathcal{F}_T(\cdot)$ for TFIDF features. Let us denote the original input text as $\pmb{x}_i$. The BERT transformation gives $\pmb{b}_i = \pmb{B}(\pmb{x}_i)$, and the TFIDF transformation gives $\pmb{t}_i = \mathcal{T}(\pmb{x}_i)$.

The overall architecture of the COTC framework is shown in Figure 2 from the original paper:

Figure 2: The overall architecture of the co-training clustering framework COTC. The figure is a schematic showing the structures and functions of the BERT and TFIDF modules and the flow of knowledge between them. Co-training proceeds through the exchange of deep representations and clustering results, promoting collaborative learning of the two modules and enabling effective information transfer and enhancement; arrows and symbols mark the individual steps and their interactions.

The overall architecture illustrates how the BERT Module and TFIDF Module interact. Each module processes its respective features, generating representations and cluster probabilities. These outputs are then used to guide and align the learning of the other module, facilitating mutual promotion.

4.2.1. Implementation of Module $\mathcal{F}_B(\cdot)$ (BERT Module)

The BERT module aims to learn BERT-induced representations $\{\pmb{h}_i^b\}_{i=1}^N$ and cluster probabilities $\{\pmb{p}_i^b\}_{i=1}^N$ from the raw BERT features $\{\pmb{b}_i\}_{i=1}^N$. This module is guided by information from the TFIDF module through two main mechanisms: representation-level alignment via contrastive learning and cluster-level alignment via KL-divergence.

A. Representation-Level Alignment via Contrastive Learning

The paper argues that while BERT and TFIDF representations reside in different spaces, their local topological structures should be aligned. If two texts are similar in the TFIDF space, they should also be similar in the BERT space. To achieve this, the BERT module uses the TFIDF-induced similarity graph to guide its contrastive learning.

  1. Constructing the TFIDF Similarity Graph $\mathcal{G}^t$: First, a similarity graph $\mathcal{G}^t(\mathcal{V}, \mathcal{E}^t)$ is constructed using the TFIDF representations $\{\pmb{h}_i^t\}_{i=1}^N$ (which are obtained from the TFIDF module).

    • $\mathcal{V} = \{1, 2, \dots, N\}$ is the set of texts.
    • $\mathcal{E}^t = \{(i, j) | i \in \mathcal{V}, j \in \mathcal{N}_i^t\}$ is the set of edges.
    • $\mathcal{N}_i^t$ denotes the set of top-$L$ nearest neighbors of text $\pmb{x}_i$ in the TFIDF representation space. This is defined as: $ \mathcal{N}_i^t = \{j \ | \ j \neq i \ \& \ \cos(\pmb{h}_i^t, \pmb{h}_j^t) \ \mathrm{is\ top\text{-}}L \ \mathrm{largest}\} $ Where:
      • $\cos(\pmb{h}_i^t, \pmb{h}_j^t)$ is the cosine similarity between the TFIDF representation $\pmb{h}_i^t$ of text $i$ and $\pmb{h}_j^t$ of text $j$.
      • top-$L$ largest means selecting the $L$ texts with the highest cosine similarity to $\pmb{x}_i$.
  2. Contrastive Loss $\mathcal{L}_{Contr}$: The BERT representation $\pmb{h}_i^b$ is learned under a contrastive learning framework. For each text $\pmb{x}_i$, three augmentations are generated:

    • $\pmb{x}_i^{(1)}$ and $\pmb{x}_i^{(2)}$: Obtained by standard contextual augmenter techniques (e.g., Kobayashi, 2018; Ma, 2019).
    • $\pmb{x}_i^{(3)}$: A new augmentation generated by randomly selecting a sample from its TFIDF neighbor set $\mathcal{N}_i^t$. These three augmentations $\{ \pmb{x}_i^{(1)}, \pmb{x}_i^{(2)}, \pmb{x}_i^{(3)} \}$ are treated as positives. The contrastive loss is defined as: $ \mathcal{L}_{Contr} = - \frac{1}{N} \sum_{i=1}^N \sum_{j=2}^3 \log \ell_{ij} $ Where:
    • $N$ is the total number of texts.
    • $\ell_{ij}$ is the similarity probability for the positive pair $(\pmb{x}_i^{(1)}, \pmb{x}_i^{(j)})$: $ \ell_{ij} \triangleq \frac{\Delta(\pmb{h}_i^{b(1)}, \pmb{h}_i^{b(j)})}{\Delta(\pmb{h}_i^{b(1)}, \pmb{h}_i^{b(j)}) + \sum_{k \neq i,\, m \in \{1, j\}} \Delta(\pmb{h}_i^{b(1)}, \pmb{h}_k^{b(m)})} $
    • $\pmb{h}_i^{b(m)}$ denotes the BERT representation of the $m$-th augmentation of text $\pmb{x}_i$. It is computed as: $ \pmb{h}_i^{b(m)} = f(\pmb{b}_i^{(m)}) = f(\pmb{B}(\pmb{x}_i^{(m)})) $ Where:
      • $f(\cdot)$ is an MLP (Multi-Layer Perceptron) neural network serving as a projection head.
      • $\pmb{B}(\cdot)$ is the BERT backbone.
    • $\Delta(\pmb{h}_i^{b(1)}, \pmb{h}_i^{b(j)}) = e^{\cos(\pmb{h}_i^{b(1)}, \pmb{h}_i^{b(j)}) / \tau}$ measures the similarity between BERT representations, where $\tau$ is a temperature parameter.
    • The denominator $\sum_{k \neq i,\, m \in \{1, j\}} \Delta(\pmb{h}_i^{b(1)}, \pmb{h}_k^{b(m)})$ includes similarities with negative samples (other texts or their augmentations). Minimizing $\mathcal{L}_{Contr}$ encourages the BERT representations of positive pairs (including TFIDF neighbors) to be close, thereby aligning the similarity structures in the BERT and TFIDF representation spaces. A simplified implementation sketch follows below.
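
The following is a simplified PyTorch sketch of this loss under the definitions above: it assumes three L2-normalized batches of BERT representations (one per augmentation) and uses only the two views $\{1, j\}$ of the other texts as negatives. It is an illustration under those assumptions, not the authors' implementation.

```python
import math
import torch
import torch.nn.functional as F

def contrastive_loss(h1, h2, h3, tau=0.5):
    """L_Contr sketch: h1, h2 come from contextual augmentations, h3 from the
    TFIDF-neighbor augmentation; all are (N, d) and L2-normalized."""
    n = h1.size(0)
    loss = 0.0
    for pos in (h2, h3):                                 # j = 2, 3
        candidates = torch.cat([pos, h1], dim=0)         # views {j, 1} of every text
        sim = torch.exp(h1 @ candidates.t() / tau)       # (N, 2N) pairwise Delta values
        positive = sim[torch.arange(n), torch.arange(n)]             # Delta(h_i^(1), h_i^(j))
        denom = sim.sum(dim=1) - math.exp(1.0 / tau)                 # drop anchor-vs-itself (cos = 1)
        loss = loss - torch.log(positive / denom).mean()
    return loss

h1, h2, h3 = (F.normalize(torch.randn(8, 128), dim=1) for _ in range(3))
print(contrastive_loss(h1, h2, h3))
```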

B. Cluster Assignment with Pseudo-labelling and Consistency

  1. Computing Cluster Probability $\pmb{p}_i^b$: A clustering head is applied over the BERT feature $\pmb{b}_i$ to compute the cluster probability $\pmb{p}_i^b$: $ \pmb{p}_i^b = \delta(g(\pmb{b}_i)) = \delta(g(\pmb{B}(\pmb{x}_i))) $ Where:

    • $\delta(\cdot)$ is the softmax function.
    • $g(\cdot)$ is an MLP neural network serving as the clustering head.
    • $\pmb{B}(\cdot)$ is the BERT backbone.
  2. Pseudo-labelling and Cross-Entropy Loss $\mathcal{L}_{CE}$: Pseudo-labels are inferred from the predicted probabilities $\{\pmb{p}_i^b\}_{i=1}^N$ by solving an Optimal Transport (OT) problem (details in Appendix A.1). Let $\pmb{q}_i$ be the one-hot pseudo-label obtained for text $\pmb{x}_i$. The model is trained to minimize the cross-entropy loss: $ \mathcal{L}_{CE} = - \frac{1}{N} \sum_{i=1}^N \sum_{m=1}^3 \pmb{q}_i^T \log \delta(g(\pmb{b}_i^{(m)})) $ Where:

    • $\pmb{b}_i^{(m)} = \pmb{B}(\pmb{x}_i^{(m)})$. This loss encourages the model to predict the same pseudo-label for all three augmented texts of $\pmb{x}_i$.
  3. Consistency Loss $\mathcal{L}_{Consist}$: To ensure robustness and consistency, the model is encouraged to output consistent probability distributions for the original text $\pmb{x}_i$ and its augmentations $\pmb{x}_i^{(m)}$: $ \mathcal{L}_{Consist} = \frac{1}{N} \sum_{i=1}^N \sum_{m=1}^3 D_{KL}(\delta(g(\pmb{b}_i)) || \delta(g(\pmb{b}_i^{(m)}))) $ Where:

    • $D_{KL}(\cdot || \cdot)$ is the KL-divergence. This loss promotes robustness of the pseudo-labelling technique. The clustering loss for the BERT module is $\mathcal{L}_{Cluster} \triangleq \mathcal{L}_{CE} + \mathcal{L}_{Consist}$; a small computational sketch follows below.
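
Assuming hard pseudo-labels from the OT step and clustering-head logits for the original text and its three augmentations, a minimal PyTorch sketch of $\mathcal{L}_{CE} + \mathcal{L}_{Consist}$ might look as follows; names and shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def bert_cluster_loss(logits_orig, logits_augs, pseudo_labels):
    """Sketch of L_CE + L_Consist given clustering-head logits.
    logits_orig: (N, K) logits for the original texts; logits_augs: list of
    (N, K) logits for the augmentations; pseudo_labels: (N,) hard labels from OT."""
    p_orig = F.softmax(logits_orig, dim=1)
    ce, consist = 0.0, 0.0
    for logits_m in logits_augs:
        ce = ce + F.cross_entropy(logits_m, pseudo_labels)                     # L_CE term
        log_p_m = F.log_softmax(logits_m, dim=1)
        consist = consist + F.kl_div(log_p_m, p_orig, reduction="batchmean")   # KL(p_orig || p_m)
    return ce + consist

N, K = 8, 4
logits = torch.randn(N, K)
augs = [torch.randn(N, K) for _ in range(3)]
labels = torch.randint(0, K, (N,))
print(bert_cluster_loss(logits, augs, labels))
```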

C. Cluster-Level Alignment between Modules $\mathcal{L}_{Align}$

To achieve cluster-level alignment between the BERT and TFIDF modules, the predicted probability $\delta(g(\pmb{b}_i))$ from the BERT feature should be aligned with the cluster probability $\pmb{p}_i^t$ inferred from the TFIDF feature. This is done by minimizing: $ \mathcal{L}_{Align} = \frac{1}{N} \sum_{i=1}^N D_{KL}(\delta(g(\pmb{b}_i)) || \pmb{p}_i^t) $ Where:

  • $\pmb{p}_i^t$ is the cluster probability from the TFIDF module (to be described next).

D. Total Loss for BERT Module $\mathcal{L}_B$

The entire BERT module is trained by minimizing the combined loss: $ \mathcal{L}_B = \mathcal{L}_{Contr} + \mathcal{L}_{Cluster} + \lambda \mathcal{L}_{Align} $ Where:

  • $\lambda$ is a weighting parameter balancing the alignment term.

4.2.2. Implementation of Module $\mathcal{F}_T(\cdot)$ (TFIDF Module)

The TFIDF module aims to learn TFIDF-induced representations $\{\pmb{h}_i^t\}_{i=1}^N$ and cluster probabilities $\{\pmb{p}_i^t\}_{i=1}^N$ from the raw TFIDF features $\{\pmb{t}_i\}_{i=1}^N$. Unlike the BERT module, which uses contrastive learning and pseudo-labelling, the TFIDF module uses a Variational Autoencoder (VAE) to model the TFIDF features, ensuring keyword information preservation and explicit clustering capabilities through a Gaussian mixture prior. This module is also guided by information from the BERT module.

A. Representation-Level Alignment via BERT-induced Graph

Similar to how the BERT module uses the TFIDF-induced graph, the TFIDF module uses a BERT-induced similarity graph to enforce representation-level alignment.

  1. Constructing the BERT Similarity Graph $\mathcal{G}^b$: A similarity graph $\mathcal{G}^b(\mathcal{V}, \mathcal{E}^b)$ is constructed using the BERT representations $\{\pmb{h}_i^b\}_{i=1}^N$ (obtained from the BERT module).
    • $\mathcal{E}^b = \{(i, j) | i \in \mathcal{V}, j \in \mathcal{N}_i^b\}$ with $ \mathcal{N}_i^b = \{j \ | \ j \neq i \ \& \ \cos(\pmb{h}_i^b, \pmb{h}_j^b) \ \mathrm{is\ top\text{-}}L \ \mathrm{largest}\} $ This indicates that if two texts are similar in the BERT space, their TFIDF representations should also align. A small sketch of this top-$L$ neighbor construction follows below.
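
The same top-$L$ construction is used for both $\mathcal{G}^t$ and $\mathcal{G}^b$; a minimal NumPy sketch, run here on random vectors standing in for the learned representations, is given below.

```python
import numpy as np

def top_l_neighbors(reps, L=5):
    """Return, for each row of `reps`, the indices of its top-L cosine-similarity
    neighbors (excluding itself); this mirrors the N_i construction above."""
    normed = reps / np.linalg.norm(reps, axis=1, keepdims=True)
    sim = normed @ normed.T
    np.fill_diagonal(sim, -np.inf)                 # exclude j = i
    return np.argsort(-sim, axis=1)[:, :L]         # (N, L) neighbor indices

reps = np.random.default_rng(0).normal(size=(100, 128))
neighbors = top_l_neighbors(reps, L=5)
print(neighbors[0])
```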

B. Generative Model for TFIDF Features (VAE)

A VAE is used to model the TFIDF feature $\pmb{t}_i$. This ensures that keyword information is preserved during reconstruction and that the TFIDF representation $\pmb{h}_i^t$ exhibits a cluster structure through a latent Gaussian mixture prior distribution. The generative model is defined as: $ p(\{\pmb{t}_i\}_{i=1}^N, \mathcal{G}^b, \{\pmb{h}_i^t\}_{i=1}^N, \{c_i\}_{i=1}^N) = \prod_{i=1}^N p(\pmb{t}_i | \pmb{h}_i^t) \, p(\mathcal{G}_i^b | \{\pmb{h}_i^t\}_{i=1}^N) \, p(\pmb{h}_i^t | c_i) \, p(c_i) $ Where:

  • $c_i$ is a latent categorical variable (cluster assignment) randomly drawn from $\{1, 2, \dots, K\}$, with $K$ the number of clusters.
  • $p(c_i) = \mathrm{Cat}(c_i; \pi)$ is the categorical prior distribution over clusters, with $\pi$ being a $K$-dimensional vector.
  • $p(\pmb{h}_i^t | c_i) = \mathcal{N}(\pmb{h}_i^t; \pmb{\mu}_{c_i}, \mathrm{diag}(\pmb{\sigma}_{c_i}^2))$ is a Gaussian prior for the latent representation $\pmb{h}_i^t$, where $\pmb{\mu}_{c_i}$ and $\pmb{\sigma}_{c_i}^2$ are the mean and variance for cluster $c_i$. This models the idea that data points belonging to the same cluster should have TFIDF representations drawn from the same Gaussian distribution.
  • $p(\pmb{t}_i | \pmb{h}_i^t)$ is the decoder responsible for reconstructing the TFIDF feature $\pmb{t}_i$ from $\pmb{h}_i^t$. It is a softmax over an embedding matrix (details in Appendix A.2, Equation 42).
  • $p(\mathcal{G}_i^b | \{\pmb{h}_i^t\}_{i=1}^N)$ is the decoder responsible for generating the BERT-induced similarity graph $\mathcal{G}^b$ from TFIDF representations. This term explicitly encourages TFIDF representations to reflect the BERT similarity structure: $ p(\mathcal{G}_i^b | \{\pmb{h}_i^t\}_{i=1}^N) = \prod_{j \in \mathcal{N}_i^b} \frac{\Delta(\pmb{h}_i^t, \pmb{h}_j^t)}{\sum_{k \neq i} \Delta(\pmb{h}_i^t, \pmb{h}_k^t)} $ Where:
    • $\Delta(\pmb{h}_i^t, \pmb{h}_j^t) = e^{\cos(\pmb{h}_i^t, \pmb{h}_j^t) / \tau}$ measures similarity between TFIDF representations.

C. Training the VAE with the ELBO $\mathcal{L}_{ELBO}$

The generative model is trained by minimizing the negative Evidence Lower Bound (ELBO): $ \mathcal{L}_{ELBO} = \frac{1}{N} \sum_{i=1}^N \ell_i^{elbo}\left(q(\pmb{h}_i^t, c_i | \pmb{t}_i)\right) $ Where $\ell_i^{elbo}$ for a single sample $i$ is: $ \ell_i^{elbo}\left(q(\pmb{h}_i^t, c_i | \pmb{t}_i)\right) = - \mathbb{E}_{q}\left[\log \frac{p(\pmb{t}_i | \pmb{h}_i^t) \, p(\mathcal{G}_i^b | \{\pmb{h}_i^t\}_{i=1}^N) \, p(\pmb{h}_i^t | c_i) \, p(c_i)}{q(\pmb{h}_i^t, c_i | \pmb{t}_i)}\right] $ Here, $\mathbb{E}_q[\cdot]$ denotes the expectation with respect to the variational posterior $q(\pmb{h}_i^t, c_i | \pmb{t}_i)$. By restricting $q(\pmb{h}_i^t, c_i | \pmb{t}_i)$ to the form $q(\pmb{h}_i^t, c_i | \pmb{t}_i) = q(\pmb{h}_i^t | \pmb{t}_i) \, p(c_i | \pmb{h}_i^t)$, the ELBO simplifies to: $ \ell_i^{elbo}\left(q(\pmb{h}_i^t | \pmb{t}_i) \, p(c_i | \pmb{h}_i^t)\right) = - \mathbb{E}_{q(\pmb{h}_i^t | \pmb{t}_i)}\left[\log \frac{p(\pmb{t}_i | \pmb{h}_i^t) \, p(\mathcal{G}_i^b | \{\pmb{h}_i^t\}_{i=1}^N) \, p(\pmb{h}_i^t)}{q(\pmb{h}_i^t | \pmb{t}_i)}\right] $ Where:

  • $q(\pmb{h}_i^t | \pmb{t}_i) = \mathcal{N}(\pmb{h}_i^t; \pmb{\mu}(\pmb{t}_i), \mathrm{diag}(\pmb{\sigma}^2(\pmb{t}_i)))$ is the encoder (a neural network) that outputs the mean $\pmb{\mu}(\pmb{t}_i)$ and variance $\pmb{\sigma}^2(\pmb{t}_i)$ of the Gaussian distribution for $\pmb{h}_i^t$.

  • $p(c_i | \pmb{h}_i^t) = \frac{p(\pmb{h}_i^t | c_i) \, p(c_i)}{\sum_c p(\pmb{h}_i^t | c) \, p(c)}$ is the posterior probability of cluster $c_i$ given $\pmb{h}_i^t$.

    After the VAE is trained, the TFIDF representation $\pmb{h}_i^t$ and cluster probability $\pmb{p}_i^t$ are obtained from its encoder: $ \pmb{h}_i^t = \mathbb{E}_{q(\pmb{h}_i^t | \pmb{t}_i)}[\pmb{h}] = \pmb{\mu}(\pmb{t}_i) $ and $ \pmb{p}_i^t[c] = \mathbb{E}_{q(\pmb{h}_i^t | \pmb{t}_i)}\left[\frac{\pi_c \, \mathcal{N}(\pmb{h}_i^t; \pmb{\mu}_c, \mathrm{diag}(\pmb{\sigma}_c^2))}{\sum_{c'} \pi_{c'} \, \mathcal{N}(\pmb{h}_i^t; \pmb{\mu}_{c'}, \mathrm{diag}(\pmb{\sigma}_{c'}^2))}\right] $ Where:

  • $\pmb{\mu}(\pmb{t}_i)$ is the mean output by the encoder.

  • $\pmb{p}_i^t[c]$ is the $c$-th element of the cluster probability vector $\pmb{p}_i^t$. In practice, the expectation is approximated by a sample drawn from $q(\pmb{h}_i^t | \pmb{t}_i)$.

D. Cluster-Level Alignment between Modules $\mathcal{L}_{Align}$

Similar to the BERT module, the TFIDF module also uses a KL-divergence term to align its predicted probabilities with those from the BERT module: $ \mathcal{L}_{Align} = \frac{1}{N} \sum_{i=1}^N D_{KL}(\pmb{p}_i^b || \pmb{p}_i^t) $ This is the same alignment term as in the BERT module; when training the TFIDF module it pulls $\pmb{p}_i^t$ toward the BERT-side probabilities $\pmb{p}_i^b$, whereas in the BERT module it pulls $\delta(g(\pmb{b}_i))$ toward $\pmb{p}_i^t$.

E. Total Loss for TFIDF Module $\mathcal{L}_T$

The TFIDF module is trained by minimizing: $ \mathcal{L}_T = \mathcal{L}_{ELBO} + \lambda' \mathcal{L}_{Align} $ Where:

  • $\lambda'$ is a weighting parameter.

4.2.3. A Unified Training Objective

Initially, the two modules $\mathcal{F}_B(\cdot)$ and $\mathcal{F}_T(\cdot)$ can be updated in an alternating manner. However, the paper shows that a more efficient approach is to use a unified joint training objective.

  1. Initial Joint Loss $\mathcal{L}_{Joint}$: A straightforward joint loss combining $\mathcal{L}_B$ and $\mathcal{L}_T$ would be: $ \mathcal{L}_{Joint} = \mathcal{L}_B + \lambda_1 \mathcal{L}_T = \mathcal{L}_{Contr} + \mathcal{L}_{Cluster} + \lambda \mathcal{L}_{Align} + \lambda_1 (\mathcal{L}_{ELBO} + \lambda' \mathcal{L}_{Align}) $ By absorbing $\lambda$ and $\lambda'$ into $\lambda_2$, this can be rewritten as: $ \mathcal{L}_{Joint} = \mathcal{L}_{Contr} + \mathcal{L}_{Cluster} + \lambda_1 (\mathcal{L}_{ELBO} + \lambda_2 \mathcal{L}_{Align}) $ Where:

    • $\lambda_1$ and $\lambda_2$ are weighting parameters. Note that the $\mathcal{L}_{Align}$ term appears twice in the sum.
  2. Derivation of Unified Joint Loss $\mathcal{L}_{Joint}'$ (using Appendix A.2): The paper then shows an inequality that allows for a more tightly connected and efficient joint objective. From Appendix A.2, the following inequality is derived: $ \ell_i^{elbo}(q(\pmb{h}_i^t | \pmb{t}_i) \, p(c_i | \pmb{h}_i^t)) + \ell_i^{align} \leq \ell_i^{elbo}(q(\pmb{h}_i^t | \pmb{t}_i) \, \pmb{p}_i^b[c_i]) $ Where:

    • $\ell_i^{align} = D_{KL}(\pmb{p}_i^b || \pmb{p}_i^t)$.

    • $\pmb{p}_i^b[c]$ denotes the $c$-th element of $\pmb{p}_i^b$. This inequality implies that if we use the cluster probability $\pmb{p}_i^b[c]$ from the BERT module directly as the variational posterior for the TFIDF module (i.e., replacing $p(c_i | \pmb{h}_i^t)$ with $\pmb{p}_i^b[c]$), it yields an upper bound on the sum of the standard ELBO and the alignment loss. Aggregating over all samples, this leads to: $ \mathcal{L}_{ELBO} + \mathcal{L}_{Align} \leq \mathcal{L}_{ELBO}' $ Where: $ \mathcal{L}_{ELBO}' \triangleq \frac{1}{N} \sum_{i=1}^N \ell_i^{elbo}(q(\pmb{h}_i^t | \pmb{t}_i) \, \pmb{p}_i^b[c]) $ By setting $\lambda_2 = 1$ in the initial joint loss (Equation 20) and substituting the inequality, a new unified joint training loss is obtained: $ \mathcal{L}_{Joint}' = \mathcal{L}_{Contr} + \mathcal{L}_{Cluster} + \lambda_1 \mathcal{L}_{ELBO}' $ Benefits of $\mathcal{L}_{Joint}'$:

    • Efficient Gradient Propagation: The cluster probability $\pmb{p}_i^b[c]$ from the BERT module is directly used as the variational posterior in the TFIDF module. This allows gradients from the TFIDF module to flow directly back to the BERT module via $\pmb{p}_i^b[c]$, enabling more efficient propagation of training signals between the two modules.

    • Sharper Probabilities: This formulation avoids explicitly optimizing the KL-divergence term $D_{KL}(\pmb{p}_i^b || \pmb{p}_i^t)$ directly. Directly optimizing the KL-divergence can encourage distributions to be overly smoothed (allocating some probability to all clusters to avoid large losses), making it harder to yield sharp predicted probabilities. $\mathcal{L}_{Joint}'$ helps in producing sharper cluster assignments.

    • Optimized with Gumbel-Softmax: The Gumbel-Softmax trick can be employed to approximate the expectation over the categorical variable $c_i$ in $\ell_i^{elbo}(q(\pmb{h}_i^t | \pmb{t}_i) \, \pmb{p}_i^b[c_i])$. Combined with the Gaussian re-parameterization trick for $\pmb{h}_i^t$, this allows $\mathcal{L}_{Joint}'$ to be optimized efficiently.

      After training, the predicted cluster probability $\pmb{p}_i^b$ from the BERT module is used to obtain the final clustering result.

4.2.4. Appendix A.1: Optimal Transport for Pseudo-labelling

The pseudo-labels $\{\pmb{q}_i\}_{i=1}^N$ for the BERT module are obtained by solving an Optimal Transport (OT) problem over the predicted probabilities $\{\pmb{p}_i^b\}_{i=1}^N$. This approach helps to generate balanced cluster assignments and avoid mode collapse.

  1. Problem Formulation: Let $P = [\pmb{p}_1^b, \dots, \pmb{p}_N^b]^T \in [0, 1]^{N \times K}$ be the matrix of predicted cluster probabilities for $N$ samples and $K$ clusters. A cost matrix $C$ is defined as $C = -\log P$. The OT problem seeks to find a transport matrix $\pmb{r} \in [0, 1]^{N \times K}$ that minimizes the following objective: $ \underset{\pmb{r}, \pmb{b}}{\operatorname{min}} \ \langle \pmb{r}, C \rangle - \epsilon_1 H(\pmb{r}) + \epsilon_2 U(\pmb{b}) $ Subject to the constraints: $ \pmb{r} \mathbf{1} = \pmb{a}, \quad \pmb{r}^T \mathbf{1} = \pmb{b}, \quad \pmb{b}^T \mathbf{1} = 1, \quad \pmb{r} \geq 0, \quad \pmb{b} \geq 0 $ Where:

    • $\pmb{r}_{ij}$ is the probability of transporting sample $i$ to class $j$.
    • $\langle \pmb{r}, C \rangle = \sum_{i,j} \pmb{r}_{ij} C_{ij}$ is the transport cost.
    • $\epsilon_1, \epsilon_2$ are weighting parameters.
    • $H(\pmb{r}) = - \sum_{i=1}^N \sum_{j=1}^K \pmb{r}_{ij} (\log \pmb{r}_{ij} - 1)$ is the entropy regularization for $\pmb{r}$, preventing degenerate solutions.
    • $U(\pmb{b}) = - \sum_{j=1}^K (\log \pmb{b}_j + \log(1 - \pmb{b}_j))$ is a penalty function that encourages the adaptive marginal distribution $\pmb{b}$ to be uniform, thus avoiding cluster collapse.
    • $\pmb{a} = \frac{1}{N} \mathbf{1}$ is a uniform marginal distribution over samples.
    • $\pmb{b}$ is an adaptive marginal distribution over classes, allowing for class imbalance.
    • $\mathbf{1}$ is a vector of ones.
  2. Lagrange Multiplier Formulation: The problem can be solved using Lagrange multipliers: $ \underset{\pmb{r}, \pmb{b}}{\operatorname{min}} \ L = \langle \pmb{r}, C \rangle - \epsilon_1 H(\pmb{r}) + \epsilon_2 U(\pmb{b}) - \pmb{f}^T (\pmb{r}\mathbf{1} - \pmb{a}) - \pmb{g}^T (\pmb{r}^T\mathbf{1} - \pmb{b}) - h (\pmb{b}^T\mathbf{1} - 1) $ Where $\pmb{f}, \pmb{g}, h$ are Lagrange multipliers.

  3. Iterative Solution: The optimal $\pmb{r}_{ij}$ can be expressed in terms of the multipliers: $ \pmb{r}_{ij} = \exp\left(\frac{f_i + g_j - C_{ij}}{\epsilon_1}\right) $ By applying the marginal constraints, iterative updates for $\pmb{u} = \exp(\pmb{f}/\epsilon_1)$ and $\pmb{v} = \exp(\pmb{g}/\epsilon_1)$ are derived, where $W = \exp(-C/\epsilon_1)$: $ \pmb{u}^{(t+1)} \gets \frac{\pmb{a}}{W \pmb{v}^{(t)}} \qquad \pmb{v}^{(t+1)} \gets \frac{\pmb{b}^{(t)}}{W^T \pmb{u}^{(t)}} $ The adaptive marginal $\pmb{b}_j$ for classes is updated by solving a quadratic equation for each $j$, derived by setting $\frac{\partial L}{\partial \pmb{b}_j} = 0$: $ (\pmb{g}_j - h) \pmb{b}_j^2 + (-\pmb{g}_j + h - 2\epsilon_2) \pmb{b}_j + \epsilon_2 = 0 $ The correct solution for $\pmb{b}_j$ is: $ \pmb{b}_j = \frac{\pmb{g}_j - h + 2\epsilon_2 - \sqrt{\Delta_j}}{2(\pmb{g}_j - h)} $ Where $\Delta_j = (\pmb{g}_j - h)^2 + 4\epsilon_2^2 > 0$ is the discriminant. The value of $h$ is found using Newton's method to satisfy the constraint $\pmb{b}^T \mathbf{1} = 1$. These updates are performed iteratively until convergence.

  4. Pseudo-label Derivation: Once the transport matrix $\pmb{r}$ is obtained (after several iterations), with entries $\pmb{r}_{ij} = \pmb{u}_i W_{ij} \pmb{v}_j$, i.e. $ \pmb{r} = \mathrm{diag}(\pmb{u}) \, W \, \mathrm{diag}(\pmb{v}) $, the pseudo-labels $\{\pmb{q}_i\}_{i=1}^N$ are derived by taking the argmax of each row of $\pmb{r}$: $ \pmb{q}_{ij} = 1 \ \mathrm{if} \ \underset{j'}{\operatorname{argmax}} \, \pmb{r}_{ij'} = j, \ \mathrm{else} \ 0 $ This converts the soft assignments in $\pmb{r}$ into hard one-hot pseudo-labels. A simplified implementation sketch follows below.
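
Below is a simplified Sinkhorn-style sketch of this pseudo-labelling step in NumPy. Unlike the paper's version, it fixes the class marginal $\pmb{b}$ to uniform instead of updating it adaptively with the quadratic/Newton step, so it only illustrates the overall mechanics.

```python
import numpy as np

def ot_pseudo_labels(P, eps1=0.05, n_iter=50):
    """Simplified Sinkhorn-style sketch of OT pseudo-labelling.
    P: (N, K) predicted cluster probabilities. The class marginal b is fixed
    to uniform here rather than being updated adaptively as in the paper."""
    N, K = P.shape
    C = -np.log(P + 1e-12)                 # cost matrix
    W = np.exp(-C / eps1)                  # kernel W = exp(-C / eps1)
    a = np.full(N, 1.0 / N)                # uniform marginal over samples
    b = np.full(K, 1.0 / K)                # fixed (non-adaptive) marginal over classes
    u, v = np.ones(N), np.ones(K)
    for _ in range(n_iter):
        u = a / (W @ v + 1e-12)
        v = b / (W.T @ u + 1e-12)
    r = u[:, None] * W * v[None, :]        # transport plan, r_ij = u_i W_ij v_j
    return r.argmax(axis=1)                # hard pseudo-labels

P = np.random.dirichlet(np.ones(4), size=100)
print(np.bincount(ot_pseudo_labels(P), minlength=4))   # roughly balanced cluster counts
```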

4.2.5. Appendix A.2: Variational Autoencoder Details

This section elaborates on the VAE used in the TFIDF module and the derivation of the unified joint training objective.

  1. Generative Model Review: The generative model for TFIDF features and the BERT-induced graph is: $ p(\{\pmb{t}_i\}_{i=1}^N, \mathcal{G}^b, \{\pmb{h}_i^t\}_{i=1}^N, \{c_i\}_{i=1}^N) = \prod_{i=1}^N p(\pmb{t}_i | \pmb{h}_i^t) \, p(\mathcal{G}_i^b | \{\pmb{h}_i^t\}_{i=1}^N) \, p(\pmb{h}_i^t | c_i) \, p(c_i) $ The negative ELBO to be minimized is: $ \mathcal{L}_{ELBO} = \frac{1}{N} \sum_{i=1}^N \ell_i^{elbo}(q(\pmb{h}_i^t | \pmb{t}_i) \, p(c_i | \pmb{h}_i^t)) $ Where: $ \ell_i^{elbo}\left(q(\pmb{h}_i^t | \pmb{t}_i) \, p(c_i | \pmb{h}_i^t)\right) = - \mathbb{E}_{q(\pmb{h}_i^t | \pmb{t}_i)}\left[\log \frac{p(\pmb{t}_i | \pmb{h}_i^t) \, p(\mathcal{G}_i^b | \{\pmb{h}_i^t\}_{i=1}^N) \, p(\pmb{h}_i^t)}{q(\pmb{h}_i^t | \pmb{t}_i)}\right] $ The alignment loss is $\mathcal{L}_{Align} = \frac{1}{N} \sum_{i=1}^N D_{KL}(\pmb{p}_i^b[c_i] || \pmb{p}_i^t[c_i])$, with: $ \pmb{p}_i^b[c_i] = \delta(g(\pmb{b}_i)) $ and $ \pmb{p}_i^t[c_i] = \mathbb{E}_{q(\pmb{h}_i^t | \pmb{t}_i)}[p(c_i | \pmb{h}_i^t)] $

  2. Derivation of the Inequality for the Unified Loss: To show $\mathcal{L}_{ELBO} + \mathcal{L}_{Align} \leq \mathcal{L}_{ELBO}'$, the derivation starts with the KL-divergence: $ D_{KL}(P || Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)} $ Applying this to the alignment loss term $\ell_i^{align} = D_{KL}(\pmb{p}_i^b[c_i] || \pmb{p}_i^t[c_i])$: $ D_{KL}(\pmb{p}_i^b[c_i] || \pmb{p}_i^t[c_i]) = \sum_{c'} \pmb{p}_i^b[c'] \log \frac{\pmb{p}_i^b[c']}{\mathbb{E}_{q(\pmb{h}_i^t | \pmb{t}_i)}[p(c' | \pmb{h}_i^t)]} $ Using Jensen's inequality ($\mathbb{E}[\log X] \le \log \mathbb{E}[X]$), we have $\log \mathbb{E}[p(c' | \pmb{h}_i^t)] \ge \mathbb{E}[\log p(c' | \pmb{h}_i^t)]$. Applying this (in the denominator with a negative sign): $ D_{KL}(\pmb{p}_i^b[c_i] || \pmb{p}_i^t[c_i]) \le \mathbb{E}_{q(\pmb{h}_i^t | \pmb{t}_i)} \left[ \sum_{c'} \pmb{p}_i^b[c'] \log \frac{\pmb{p}_i^b[c']}{p(c' | \pmb{h}_i^t)} \right] = \mathbb{E}_{q(\pmb{h}_i^t | \pmb{t}_i) \, \pmb{p}_i^b[c_i]} \left[ \log \frac{\pmb{p}_i^b[c_i]}{p(c_i | \pmb{h}_i^t)} \right] $ Combining this with the ELBO term $\ell_i^{elbo}(q(\pmb{h}_i^t | \pmb{t}_i) \, p(c_i | \pmb{h}_i^t))$, and after algebraic manipulation (cancelling terms and regrouping using properties of the logarithm and expectation), the inequality is obtained: $ \mathcal{L}_{ELBO}' = \frac{1}{N} \sum_{i=1}^N \ell_i^{elbo}(q(\pmb{h}_i^t | \pmb{t}_i) \, \pmb{p}_i^b[c_i]) \geq \mathcal{L}_{ELBO} + \mathcal{L}_{Align} $ This inequality shows that optimizing $\mathcal{L}_{ELBO}'$ (where $\pmb{p}_i^b[c_i]$ acts as the posterior for $c_i$) implicitly optimizes the original $\mathcal{L}_{ELBO}$ and the alignment loss $\mathcal{L}_{Align}$.

  3. Decoder $p(\pmb{t}_i | \pmb{h}_i^t)$: The decoder $p(\pmb{t}_i | \pmb{h}_i^t)$ in the VAE reconstructs the TFIDF feature $\pmb{t}_i$. The TFIDF feature $\pmb{t}_i$ is treated as a set of words $\{\pmb{w}_j \in \pmb{t}_i\}$, where $\pmb{w}_j$ is a one-hot representation over the vocabulary $\mathcal{W}$. An embedding matrix $\pmb{\mathcal{E}} \in \mathbb{R}^{|\mathcal{W}| \times 128}$ (where 128 is the dimension of $\pmb{h}_i^t$) is used as the decoder network. The probability of generating $\pmb{t}_i$ given $\pmb{h}_i^t$ is defined as: $ p(\pmb{t}_i | \pmb{h}_i^t) = \prod_{\pmb{w}_j \in \pmb{t}_i} p(\pmb{w}_j | \pmb{h}_i^t) = \prod_{\pmb{w}_j \in \pmb{t}_i} \frac{\exp(\pmb{h}_i^{t^T} \pmb{\mathcal{E}} \pmb{w}_j)}{\sum_{k=1}^{|\mathcal{W}|} \exp(\pmb{h}_i^{t^T} \pmb{\mathcal{E}} \pmb{w}_k)} $ This is essentially a product of softmax probabilities, where each word embedding (column of $\pmb{\mathcal{E}}$) is compared against the TFIDF representation $\pmb{h}_i^t$.

  4. Approximation of $\mathcal{L}_{ELBO}'$: The term $\ell_i^{elbo}(q(\pmb{h}_i^t | \pmb{t}_i) \, \pmb{p}_i^b[c_i])$ in the new ELBO can be factorized into five subterms: $ \ell_i^{elbo}\left(q(\pmb{h}_i^t | \pmb{t}_i) \, \pmb{p}_i^b[c_i]\right) = - \mathbb{E}_{q} \left[ \log \frac{p(\pmb{t}_i | \pmb{h}_i^t) \, p(\mathcal{G}_i^b | \{\pmb{h}_i^t\}_{i=1}^N) \, p(\pmb{h}_i^t | c_i) \, p(c_i)}{q(\pmb{h}_i^t | \pmb{t}_i) \, \pmb{p}_i^b[c_i]} \right] = - \mathbb{E}_{q} [ \log p(\pmb{t}_i | \pmb{h}_i^t) ] - \mathbb{E}_{q} [ \log p(\mathcal{G}_i^b | \{\pmb{h}_i^t\}_{i=1}^N) ] + \mathbb{E}_{q} [ \log q(\pmb{h}_i^t | \pmb{t}_i) ] - \mathbb{E}_{q} \left[ \log \frac{p(c_i)}{\pmb{p}_i^b[c_i]} \right] - \mathbb{E}_{q} [ \log p(\pmb{h}_i^t | c_i) ] $ These subterms are approximated for computation:

    • Reconstruction term (from $p(\pmb{t}_i | \pmb{h}_i^t)$): $ - \mathbb{E}_{q(\pmb{h}_i^t | \pmb{t}_i)} [ \log p(\pmb{t}_i | \pmb{h}_i^t) ] \approx - \sum_{\pmb{w}_j \in \pmb{t}_i} \log \frac{\exp(\tilde{\pmb{h}}_i^{t^T} \pmb{\mathcal{E}} \pmb{w}_j)}{\sum_{k=1}^{|\mathcal{W}|} \exp(\tilde{\pmb{h}}_i^{t^T} \pmb{\mathcal{E}} \pmb{w}_k)} $ This uses the Gaussian re-parameterization trick, where $\tilde{\pmb{h}}_i^t = \pmb{\mu}(\pmb{t}_i) + \pmb{\epsilon}^T \pmb{\sigma}(\pmb{t}_i)$ with $\pmb{\epsilon} \sim \mathcal{N}(\pmb{\epsilon}; 0, 1)$.
    • Graph reconstruction term (from $p(\mathcal{G}_i^b | \{\pmb{h}_i^t\}_{i=1}^N)$): $ - \mathbb{E}_{q(\pmb{h}_i^t | \pmb{t}_i)} [ \log p(\mathcal{G}_i^b | \{\pmb{h}_i^t\}_{i=1}^N) ] \approx - \sum_{j \in \mathcal{N}_i^b} \log \frac{\Delta(\tilde{\pmb{h}}_i^t, \tilde{\pmb{h}}_j^t)}{\sum_{k \neq i} \Delta(\tilde{\pmb{h}}_i^t, \tilde{\pmb{h}}_k^t)} $ This also uses re-parameterization.
    • Encoder entropy term (from $q(\pmb{h}_i^t | \pmb{t}_i)$): Can be computed analytically for a Gaussian distribution. $ \mathbb{E}_{q(\pmb{h}_i^t | \pmb{t}_i)} [ \log q(\pmb{h}_i^t | \pmb{t}_i) ] = - \frac{1}{2} \sum_{j=1}^{128} (\log 2\pi + 1 + \log \pmb{\sigma}^2(\pmb{t}_i)|_j) $ Where 128 is the dimension of $\pmb{h}_i^t$.
    • Prior-posterior difference for $c_i$: $ - \mathbb{E}_{\pmb{p}_i^b[c_i]} \left[ \log \frac{p(c_i)}{\pmb{p}_i^b[c_i]} \right] \approx - \sum_{k=1}^K \tilde{c}_{ik} \log \frac{p(k)}{\pmb{p}_i^b[k]} $ Here, $\tilde{c}_{ik}$ is obtained using the Gumbel-Softmax trick: $ \tilde{c}_{ik} = \frac{\exp((g_{ik} + \log \pmb{p}_i^b[k]) / \tau)}{\sum_{j=1}^K \exp((g_{ij} + \log \pmb{p}_i^b[j]) / \tau)} $ Where $g_{ik} \sim \mathrm{Gumbel}(0, 1)$ and $\tau$ is the temperature parameter.
    • Prior-posterior difference for $\pmb{h}_i^t$ given $c_i$: $ - \mathbb{E}_{q(\pmb{h}_i^t | \pmb{t}_i) \, \pmb{p}_i^b[c_i]} [ \log p(\pmb{h}_i^t | c_i) ] \approx \frac{1}{2} \sum_{k=1}^K \tilde{c}_{ik} \left( \sum_{j=1}^{128} \log 2\pi + \log \pmb{\sigma}_k^2|_j + \frac{(\pmb{\mu}(\pmb{t}_i)|_j - \pmb{\mu}_k|_j)^2 + \pmb{\sigma}^2(\pmb{t}_i)|_j}{\pmb{\sigma}_k^2|_j} \right) $ By summing these approximated terms, $\mathcal{L}_{ELBO}'$ can be computed and optimized.

4.2.6. Data Flows and Network Architectures

The paper provides detailed data flows and network architectures for both the BERT and TFIDF modules in Table 10 and Table 11.

The following are the data flows and network architectures for BERT features. $K$ is the number of clusters, 768 is the dimension of BERT features, and 128 is the dimension of BERT representations.

Data Flow | Network Architecture
Raw Text: x | -
BERT Transformation: b = B(x) | B(·): BERT Backbone
Projection Head: h^b = f(b) | f(·): Linear(768, 768); ReLU(); Linear(768, 128); Normalize()
Clustering Head: p^b = g(b) | g(·): Dropout(); Linear(768, 768); ReLU(); Dropout(); Linear(768, 768); ReLU(); Linear(768, K); Softmax()
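
To make the table concrete, here is a minimal PyTorch sketch of the projection head f(·) and clustering head g(·) as listed; the BERT backbone itself (e.g., from Hugging Face transformers) is omitted, and the choice K = 20 is only illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

K = 20  # number of clusters (illustrative; dataset-dependent)

projection_head = nn.Sequential(                 # f(.)
    nn.Linear(768, 768), nn.ReLU(),
    nn.Linear(768, 128),                         # L2 normalization applied below
)
clustering_head = nn.Sequential(                 # g(.)
    nn.Dropout(), nn.Linear(768, 768), nn.ReLU(),
    nn.Dropout(), nn.Linear(768, 768), nn.ReLU(),
    nn.Linear(768, K), nn.Softmax(dim=1),
)

b = torch.randn(4, 768)                          # a batch of BERT features
h_b = F.normalize(projection_head(b), dim=1)     # h^b = f(b), then Normalize()
p_b = clustering_head(b)                         # p^b = g(b), rows sum to 1
print(h_b.shape, p_b.shape)
```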

The following are the data flows and network architectures for TFIDF features. $K$ is the number of clusters, 2048 is the dimension of TFIDF features, and 128 is the dimension of TFIDF representations.

Data Flow | Network Architecture
Raw Text: x | -
TFIDF Transformation: t = T(x) | T(·): TFIDF Vectorizer
Encoder Networks: µ = Enc-µ(t), σ = Enc-σ(t) | Enc-µ(·): Linear(2048, 2048); ReLU(); Linear(2048, 128); Tanh(). Enc-σ(·): Linear(2048, 2048); ReLU(); Linear(2048, 128); Exp()
Sampling: ε ∼ N(ε; 0, 1), h^t = µ + ε^T σ | -
Decoder Network: t_rec = Dec(h^t) | Dec(·): Linear(128, 2048); Softmax()
Class Distribution: π ∈ [0, 1]^K with Σ_k π_k = 1 | -
Gaussian Components: {µ_k, σ_k}_{k=1}^K | -
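
A minimal PyTorch sketch of these TFIDF-module components is given below; the parameter names, the batch of random TFIDF vectors, and K = 20 are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

K = 20  # number of clusters (illustrative)

enc_mu = nn.Sequential(nn.Linear(2048, 2048), nn.ReLU(), nn.Linear(2048, 128), nn.Tanh())
enc_sigma = nn.Sequential(nn.Linear(2048, 2048), nn.ReLU(), nn.Linear(2048, 128))  # Exp() applied below
decoder = nn.Sequential(nn.Linear(128, 2048), nn.Softmax(dim=1))

t = torch.rand(4, 2048)                        # a batch of TFIDF features
mu, sigma = enc_mu(t), torch.exp(enc_sigma(t))
h_t = mu + torch.randn_like(sigma) * sigma     # reparameterized sample of h^t
t_rec = decoder(h_t)                           # softmax reconstruction over the vocabulary

# Gaussian-mixture prior parameters: class weights pi and per-cluster means/variances
pi = nn.Parameter(torch.full((K,), 1.0 / K))
mu_k = nn.Parameter(torch.randn(K, 128))
log_sigma2_k = nn.Parameter(torch.zeros(K, 128))
print(h_t.shape, t_rec.shape)
```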

5. Experimental Setup

5.1. Datasets

The authors evaluate COTC on eight benchmark datasets for short text clustering. These datasets cover a variety of domains and characteristics, making them suitable for validating the method's performance across different scenarios.

The following are the statistics of the datasets. $N$: the number of texts; Len: the average length of texts; $K$: the number of classes; L/S: the size ratio of the largest class versus the smallest one.

Dataset N Len K L/S
AgNews 8000 23 4 1
SearchSnippets 12340 18 8 7
StackOverflow 20000 9 20 1
Biomedical 20000 13 20 1
GoogleNews-TS 11109 28 152 143
GoogleNews-T 11109 6 152 143
GoogleNews-S 11109 22 152 143
Tweet 2472 9 89 249

Details of the Datasets:

  • AgNews: A subset of news articles (from over 2000 sources) containing 8000 news titles from 4 topic categories.
    • Example: "Wall St. Bears Claw Back Into the Black (Reuters)"
  • SearchSnippets: A subset of 12340 snippets from web search results across 8 domains.
    • Example: "Free Internet Software - Download the latest internet tools, utilities, browsers, and more."
  • StackOverflow: 20000 question titles related to programming from 20 tags. This is a domain-specific dataset where keywords are crucial.
    • Example: "How to parse XML with SimpleXML in PHP?"
  • Biomedical: 20000 paper titles from 20 categories sourced from BioASQ. Another domain-specific dataset.
    • Example: "A novel approach for the detection of amyloid plaques in Alzheimer's disease using optical coherence tomography."
  • GoogleNews (TS, T, S): Constructed by Yin and Wang (2014), containing titles and snippets of 11109 news articles corresponding to 152 events.
    • GoogleNews-TS: Both titles and snippets (average length 28).
    • GoogleNews-T: Titles only (average length 6).
    • GoogleNews-S: Snippets only (average length 22).
    • Example (title): "Google to acquire Motorola Mobility" (snippet might follow for -TS, -S).
  • Tweet: Contains 2472 tweets from 89 queries. Tweets are very short and informal texts.
    • Example: "Just saw @ladygaga concert, amazing performance!"

      These datasets are well-suited for evaluating short text clustering due to their varying lengths, numbers of clusters, and domain specificity, especially StackOverflow and Biomedical for keyword importance, and GoogleNews and Tweet for their brevity and potentially high class imbalance.

5.2. Evaluation Metrics

The performance of the clustering methods is evaluated using two standard metrics: Clustering Accuracy (ACC) and Normalized Mutual Information (NMI).

5.2.1. Clustering Accuracy (ACC)

Conceptual Definition: Clustering Accuracy (ACC) measures the agreement between the predicted cluster assignments and the true ground-truth labels, after finding the best possible mapping between predicted clusters and true classes. It quantifies how many samples are correctly assigned to their corresponding true class. Since clustering is unsupervised, the predicted cluster IDs don't inherently match the true class IDs, so a mapping step (e.g., using the Hungarian algorithm) is necessary to align them for comparison.

Mathematical Formula: $ ACC = \frac{\sum_{i=1}^N \mathbb{1}_{y_i = map(\hat{y}_i)}}{N} $

Symbol Explanation:

  • ACC: The clustering accuracy.
  • N: The total number of samples (texts) in the dataset.
  • $\mathbb{1}_{\text{condition}}$: An indicator function that equals 1 if the condition is true, and 0 otherwise.
  • $y_i$: The ground-truth label (true class ID) of the $i$-th sample.
  • $\hat{y}_i$: The predicted cluster assignment (predicted cluster ID) for the $i$-th sample.
  • $map(\cdot)$: A permutation mapping function that maps each predicted cluster ID to a ground-truth class ID. This mapping is found to maximize the accuracy (e.g., using the Hungarian algorithm).
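A minimal sketch of this computation, using the Hungarian algorithm from SciPy for the mapping step (function and variable names are my own):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    """ACC: best one-to-one mapping of predicted cluster IDs to true class IDs."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    n_classes = max(y_true.max(), y_pred.max()) + 1
    # Contingency matrix: counts of (predicted cluster, true class) co-occurrences
    cont = np.zeros((n_classes, n_classes), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cont[p, t] += 1
    # Hungarian algorithm maximizes the total count of matched samples
    row_ind, col_ind = linear_sum_assignment(-cont)
    return cont[row_ind, col_ind].sum() / len(y_true)

print(clustering_accuracy([0, 0, 1, 1, 2], [2, 2, 0, 0, 1]))  # 1.0
```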

5.2.2. Normalized Mutual Information (NMI)

Conceptual Definition: Normalized Mutual Information (NMI) is a measure of the mutual dependence between two clusterings (the predicted clustering and the ground-truth clustering). It quantifies how much information is shared between the two clusterings, normalized to a value between 0 and 1. A value of 1 indicates perfect correlation (the predicted clustering perfectly matches the true clustering), while a value of 0 indicates no mutual information (the clusterings are independent). NMI is often preferred over ACC in some contexts because it is less sensitive to the number of clusters and is a direct measure of information overlap.

Mathematical Formula: $ NMI = \frac{2 I(Y; \hat{Y})}{H(Y) + H(\hat{Y})} $

Symbol Explanation:

  • NMI: The normalized mutual information score.
  • $Y$: The set of ground-truth labels for all samples.
  • $\hat{Y}$: The set of predicted cluster assignments for all samples.
  • $I(Y; \hat{Y})$: The Mutual Information between the ground-truth labels $Y$ and the predicted labels $\hat{Y}$. It is defined as: $ I(Y; \hat{Y}) = \sum_{y \in Y} \sum_{\hat{y} \in \hat{Y}} P(y, \hat{y}) \log\left(\frac{P(y, \hat{y})}{P(y)P(\hat{y})}\right) $ Where $P(y, \hat{y})$ is the joint probability of a sample belonging to class $y$ and being assigned to cluster $\hat{y}$, and $P(y)$ and $P(\hat{y})$ are the marginal probabilities.
  • H(Y): The Entropy of the ground-truth labels $Y$. It is defined as: $ H(Y) = - \sum_{y \in Y} P(y) \log(P(y)) $
  • $H(\hat{Y})$: The Entropy of the predicted cluster assignments $\hat{Y}$. It is defined similarly to H(Y).
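In practice, NMI can be computed directly with scikit-learn; the 'arithmetic' averaging corresponds to the $2I / (H(Y) + H(\hat{Y}))$ normalization above:

```python
from sklearn.metrics import normalized_mutual_info_score

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [1, 1, 0, 0, 2, 2]

# average_method='arithmetic' normalizes I(Y; Y_hat) by the mean of the entropies
nmi = normalized_mutual_info_score(y_true, y_pred, average_method='arithmetic')
print(nmi)  # 1.0, since the two clusterings agree up to relabeling
```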

5.3. Baselines

The paper compares COTC against a comprehensive set of baselines, including traditional, Word2Vec-based, BERT-based, and simple fusion methods. This allows for a thorough evaluation of COTC's performance relative to existing techniques.

  1. TFIDF-K-Means: Applies K-Means clustering directly to TFIDF features. Represents a traditional, basic approach using TFIDF.

  2. BERT-K-Means: Applies K-Means clustering directly to BERT features. Represents a basic approach leveraging BERT embeddings.

  3. K-Means_IC (Rakib et al., 2020): An enhanced TFIDF-based method that applies K-Means to TFIDF features combined with an iterative classification algorithm.

  4. STC2-LPI (Xu et al., 2017): Uses Word2Vec embeddings, pre-obtained LPI codes, a CNN for deep representations, and then K-Means for clustering.

  5. Self-Train (Hadifar et al., 2019): Uses an autoencoder to model Word2Vec embeddings enhanced with SIF (Smooth Inverse Frequency), then fine-tunes the encoder with DEC loss.

  6. SCCL (Zhang et al., 2021): A BERT-based method that performs contrastive learning on BERT features and achieves clustering using DEC loss.

  7. RSTC (Zheng et al., 2023): Current SOTA BERT-only method. It performs pseudo-labelling on BERT features, where pseudo-labels are generated by solving an Optimal Transport (OT) problem.

  8. GMVAE (Jiang et al., 2017): Models TFIDF features using a Variational Autoencoder (VAE) with a Gaussian mixture prior. This serves as a strong baseline for TFIDF-focused deep clustering and is directly relevant to the TFIDF module in COTC.

  9. RSTCBERT-TFIDF-Linear: A simple fusion baseline. Reduces TFIDF features to the same dimensionality as BERT features using an autoencoder, then linearly combines them (e.g., $0.5 \cdot \pmb{b} + 0.5 \cdot \hat{\pmb{t}}$) and feeds the result into RSTC (a feature-extraction sketch follows this list).

  10. RSTCBERT-TFIDF-Concat-1: Another simple fusion baseline. Concatenates BERT features with a dimensionality-reduced TFIDF feature ($[\pmb{b}; \hat{\pmb{t}}]$) and feeds the combined feature into RSTC.

  11. RSTCBERT-TFIDF-Concat-2: Similar to RSTCBERT-TFIDF-Concat-1, but concatenates BERT features with the original, high-dimensional TFIDF feature ($[\pmb{b}; \pmb{t}]$) before feeding into RSTC.

    The default BERT model used for these baselines and COTC is distilbert-base-nli-stsb-mean-tokens, as specified in the original paper.
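A short sketch of how the BERT and TFIDF features used by these baselines can be extracted, using the default sentence-transformers model named above and scikit-learn (the autoencoder-based reduction of TFIDF to 768 dimensions is omitted here; the texts are sample snippets from the dataset descriptions):

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["How to parse XML with SimpleXML in PHP?",
         "Wall St. Bears Claw Back Into the Black (Reuters)",
         "Google to acquire Motorola Mobility"]

# BERT features b (768-d, mean-pooled) and TFIDF features t (vocabulary capped at 2048)
bert = SentenceTransformer("distilbert-base-nli-stsb-mean-tokens")
b = bert.encode(texts)                                                 # (N, 768)
t = TfidfVectorizer(max_features=2048).fit_transform(texts).toarray()  # (N, V <= 2048)

# Simple fusion used by RSTCBERT-TFIDF-Concat-2; Linear and Concat-1 additionally
# require the autoencoder-reduced TFIDF feature t_hat, which is not shown here.
fused_concat2 = np.concatenate([b, t], axis=1)
print(b.shape, t.shape, fused_concat2.shape)
```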

6. Results & Analysis

6.1. Core Results Analysis

The experimental results demonstrate that the proposed COTC method consistently achieves superior clustering performance across all eight benchmark datasets, significantly outperforming current state-of-the-art methods in terms of both ACC and NMI.

The following are the clustering performance of the baselines and our method COTC on eight benchmark datasets. The results for the baselines are quoted from Zheng et al., 2023. The best results are in bold and the second-best are underlined.

Method AgNews SearchSnippets StackOverflow Biomedical
ACC NMI ACC NMI ACC NMI ACC NMI
TFIDF-K-Means 34.39 12.19 30.85 18.67 58.52 59.02 29.13 25.12
BERT-K-Means 65.95 31.55 55.83 32.07 60.55 51.79 39.50 32.63
K-Means_IC 66.30 42.03 63.84 42.77 74.96 70.27 40.44 32.16
STC2-LPI - - 76.98 62.56 51.14 49.10 43.37 38.02
Self-Train - - 72.69 56.74 59.38 52.81 40.06 34.46
SCCL 83.10 61.96 79.90 63.78 70.83 69.21 42.49 39.16
RSTC 84.24 62.45 80.10 69.74 83.30 74.11 48.40 40.12
GMVAE 82.62 55.76 80.11 58.96 82.90 71.44 48.17 40.57
RSTCBERT-TFIDF-Linear 84.45 60.86 83.21 71.17 78.79 76.14 50.17 45.18
RSTCBERT-TFIDF-Concat-1 85.79 63.26 80.90 69.99 82.41 78.45 49.34 45.00
RSTCBERT-TFIDF-Concat-2 85.80 63.11 82.54 70.74 78.55 73.95 49.24 43.15
COTC 87.56 67.09 90.32 77.09 87.78 79.19 53.20 46.09
Method GoogleNews-TS GoogleNews-T GoogleNews-S Tweet
ACC NMI ACC NMI ACC NMI ACC NMI
TFIDF-K-Means 69.00 87.78 58.36 79.14 62.30 83.00 54.34 78.47
BERT-K-Means 65.71 86.60 55.53 78.38 56.62 80.50 53.44 78.99
K-Means_IC 79.81 92.91 68.88 83.55 74.48 88.53 66.54 84.84
SCCL 82.51 93.01 69.01 85.10 73.44 87.98 73.10 86.66
RSTC 83.27 93.15 72.27 87.39 79.32 89.40 75.20 87.35
GMVAE 83.37 93.48 79.98 90.25 80.65 90.04 73.23 88.86
RSTCBERT-TFIDF-Linear 83.72 93.26 74.29 88.67 81.57 91.17 78.20 89.42
RSTCBERT-TFIDF-Concat-1 83.74 93.79 79.31 91.06 82.91 91.55 75.61 88.50
RSTCBERT-TFIDF-Concat-2 84.03 93.55 74.46 87.70 81.23 90.60 83.62 90.30
COTC 90.50 96.33 83.53 92.07 86.10 93.49 91.33 95.09

Key Observations:

  • Weakness of Shallow Methods: TFIDF-K-Means and BERT-K-Means generally perform poorly, highlighting that raw features combined with simple clustering algorithms are insufficient for complex short text data.
  • Deep Learning Advantage: Methods employing deep learning for representation learning (STC2-LPI, Self-Train, SCCL, RSTC) significantly outperform the shallow methods, validating the importance of learning better representations.
  • Power of BERT: BERT-based methods (SCCL, RSTC) demonstrate strong performance, often outperforming Word2Vec-based methods, confirming BERT's ability to capture rich semantics. RSTC, in particular, shows competitive results, being the previous SOTA among BERT-only approaches.
  • Rediscovered Value of TFIDF: GMVAE, which models TFIDF features with a VAE, surprisingly shows strong performance, beating many baselines and even BERT-based SCCL on some datasets (e.g., SearchSnippets, StackOverflow, GoogleNews-T/S, Tweet). This validates the paper's premise that TFIDF features, when properly modeled, remain valuable.
  • Limitations of Simple Fusion: The RSTCBERT-TFIDF-Linear, RSTCBERT-TFIDF-Concat-1, and RSTCBERT-TFIDF-Concat-2 baselines, which naively combine BERT and TFIDF, show some improvement over RSTC on certain datasets (e.g., AgNews, SearchSnippets, Biomedical), but they are not consistently superior and sometimes even fall behind RSTC. This suggests that simple fusion cannot fully exploit the complementary strengths due to the distinct nature of the features.
  • Superiority of COTC: COTC consistently achieves the highest ACC and NMI scores across all eight datasets, often by a significant margin. For example, on SearchSnippets, COTC reaches 90.32 ACC and 77.09 NMI, compared to RSTC's 80.10 ACC and 69.74 NMI. This robust performance validates the effectiveness of COTC's alignment-promoting co-training framework in leveraging the collective strengths of BERT and TFIDF features.

6.2. Ablation Studies / Parameter Analysis

6.2.1. Ablation Study for BERT Features

The paper conducts an ablation study to analyze the contribution of different components to the BERT module's performance. The ACC results for variants of the BERT module are shown below, focusing on the outputs from $\pmb{p}_i^b$.

The following are the ACC results of the basic variants for BERT features. vs Last means the average improvement comparing the current row with the last one.

Variant AN SS SO Bio GN-TS GN-T GN-S Tw vs Last
Basis ($\mathcal{M}$) 85.55 80.78 83.23 50.97 83.25 74.79 79.98 83.32 -
w/ $h^t$ ($\mathcal{M}_{Graph}$) 86.06 88.30 86.35 52.16 87.08 81.76 82.13 87.27 +3.66
w/ $p^t$ ($\mathcal{M}_{Align}$) 86.56 89.03 87.22 52.55 89.85 82.73 85.23 90.64 +1.59
COTC ($\mathcal{M}_{Joint}$) 87.56 90.32 87.78 53.20 90.50 83.53 86.10 91.33 +0.81

Analysis:

  • Basis ($\mathcal{M}$): This is a basic BERT module with contrastive learning, pseudo-labelling, and consistency constraints, but without direct interaction with TFIDF information for alignment. It achieves a baseline ACC.

  • w/ $h^t$ ($\mathcal{M}_{Graph}$): This variant introduces the TFIDF representations $\{h_i^t\}$ to construct the similarity graph $\mathcal{G}^t$, which is then used for neighborhood augmentations in the BERT module's contrastive learning (a neighbor-retrieval sketch is given after this list). The average ACC improvement of +3.66 over the Basis model is substantial, particularly on SearchSnippets (+7.52) and GoogleNews-TS (+3.83). This clearly demonstrates the value of aligning the BERT representation space with the similarity structure learned from TFIDF features, especially for capturing keyword-level relationships.

  • w/ $p^t$ ($\mathcal{M}_{Align}$): This variant further incorporates cluster-level alignment by using the cluster probabilities $\{p_i^t\}$ from the TFIDF module via $\mathcal{L}_{Align}$. It achieves an additional average improvement of +1.59 over $\mathcal{M}_{Graph}$. This indicates that aligning the predicted cluster assignments between the two feature spaces provides further benefits, guiding both modules towards more coherent cluster structures.

  • COTC ($\mathcal{M}_{Joint}$): This is the final proposed method, which uses the unified joint training objective $\mathcal{L}_{Joint}'$. It yields an average improvement of +0.81 over $\mathcal{M}_{Align}$. This improvement, though smaller than the previous steps, is consistent across datasets and validates the superiority of the tighter coupling and more efficient gradient propagation enabled by the unified objective.

    These results confirm the incremental benefits of COTC's components, showing that both representation-level alignment (via graph structures) and cluster-level alignment (via probability distributions), integrated through a unified objective, are crucial for achieving state-of-the-art performance.
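A minimal sketch of the neighbor-retrieval step referenced in the $\mathcal{M}_{Graph}$ variant, i.e., finding each text's nearest neighbors in the other module's representation space to use as augmentations (the cosine-similarity choice and variable names are assumptions):

```python
import numpy as np

def knn_neighbors(h, L=10):
    """Indices of the L nearest neighbors of each row of h (cosine similarity).

    h : (N, D) array of deep representations (e.g., the TFIDF-side h^t).
    Returns an (N, L) integer array; row i excludes i itself.
    """
    h = h / np.linalg.norm(h, axis=1, keepdims=True)   # cosine similarity via dot product
    sim = h @ h.T
    np.fill_diagonal(sim, -np.inf)                      # exclude the anchor itself
    return np.argsort(-sim, axis=1)[:, :L]

h_t = np.random.default_rng(0).normal(size=(100, 128))
neighbors = knn_neighbors(h_t, L=10)
print(neighbors.shape)  # (100, 10)
```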

6.2.2. Ablation Study for TFIDF Features

The paper also presents ablation studies for the TFIDF module, showing how its performance is influenced by interactions with the BERT module. The ACC and NMI results for TFIDF variants (outputs from $\pmb{p}_i^t$) are provided in Table 13 and Table 14 in the Appendix.

The following are the ACC results of the basic variants for TFIDF features. vs Last means the average improvement comparing the current row with the last one.

Variant AN SS SO Bio GN-TS GN-T GN-S Tw vs Last
Basis ($\mathcal{M}$) 82.62 80.11 82.90 48.17 83.37 79.98 80.65 73.23 -
w/ $h^b$ ($\mathcal{M}_{Graph}$) 84.89 87.41 84.12 49.68 85.07 80.82 82.11 74.99 +2.26
w/ $p^b$ ($\mathcal{M}_{Align}$) 85.83 88.95 85.04 51.55 87.86 81.68 84.40 87.65 +2.98
COTC ($\mathcal{M}_{Joint}$) 87.26 90.00 86.87 52.41 90.35 83.36 86.03 91.05 +1.80

The following are the NMI results of the basic variants for TFIDF features. vs Last means the average improvement comparing the current row with the last one.

Variant AN SS SO Bio GN-TS GN-T GN-S Tw vs Last
Basis ($\mathcal{M}$) 55.76 58.96 71.44 40.57 93.48 90.25 90.04 88.86 -
w/ $h^b$ ($\mathcal{M}_{Graph}$) 60.57 71.91 78.77 44.25 94.44 91.13 91.17 89.70 +4.07
w/ $p^b$ ($\mathcal{M}_{Align}$) 63.94 74.80 75.07 43.39 94.33 90.83 92.29 92.85 +0.70
COTC ($\mathcal{M}_{Joint}$) 66.16 76.53 78.97 45.69 96.19 91.91 93.41 94.72 +2.01

Analysis (for TFIDF Module):

  • Basis ($\mathcal{M}$): This is the standalone GMVAE model for TFIDF features. Its performance is already decent, outperforming many non-BERT baselines.

  • w/ $h^b$ ($\mathcal{M}_{Graph}$): Incorporating the BERT representations $\{h_i^b\}$ to construct the similarity graph $\mathcal{G}^b$ and using it to guide the TFIDF module's VAE learning significantly boosts performance (average +2.26 ACC, +4.07 NMI). This confirms that the deep semantic structure from BERT is highly beneficial for enriching TFIDF representations.

  • w/ $p^b$ ($\mathcal{M}_{Align}$): Adding cluster-level alignment with BERT's cluster probabilities $\{p_i^b\}$ further improves performance (average +2.98 ACC, +0.70 NMI). This shows the value of harmonizing cluster assignments across both feature spaces (a sketch of such an alignment loss is given below).

  • COTC ($\mathcal{M}_{Joint}$): The full COTC with the unified joint training objective delivers additional average improvements (+1.80 ACC, +2.01 NMI), demonstrating the superior integration and signal propagation of the unified approach.

    General Conclusion from Ablation: The ablation studies clearly show that the co-training and alignment mechanisms are highly effective for both the BERT and TFIDF modules. Each module benefits significantly from the information shared by the other, validating the principle of leveraging complementary strengths. The unified joint training objective consistently provides the best overall performance.
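As an illustration of the cluster-level alignment referenced in the $\mathcal{M}_{Align}$ variants, a cross-entropy-style loss that pushes one module's assignments toward the other's; the exact form of $\mathcal{L}_{Align}$ in the paper may differ, so treat this as a hedged sketch:

```python
import torch

def cluster_alignment_loss(p_t, p_b, eps=1e-8):
    """Cross-entropy between one module's assignments (teacher) and the other's.

    p_t, p_b : (N, K) cluster probability distributions from the two modules.
    p_b is detached here so it acts as a soft target for the TFIDF side.
    """
    return -(p_b.detach() * torch.log(p_t + eps)).sum(dim=1).mean()

p_b = torch.softmax(torch.randn(8, 20), dim=1)
p_t = torch.softmax(torch.randn(8, 20, requires_grad=True), dim=1)
print(cluster_alignment_loss(p_t, p_b).item())
```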

6.2.3. Hyperparameter Sensitivity

  • Sensitivity of number of neighbors $L$: The sensitivity of the number of neighbors $L$ is presented in Figure 3 from the original paper.

    Figure 3: The sensitivity of the number of neighbors $L$ on the Biomedical and GoogleNews-TS datasets. Pre. means precision, i.e., the ratio of neighbors in the same class as the anchor; as $L$ increases, precision declines while ACC and NMI remain relatively stable.

    The figure shows that using a proper number of neighbors ($L > 0$) generally benefits BERT features in clustering, compared to $L = 0$ (no neighbor augmentation). However, for some datasets like GoogleNews-TS, increasing $L$ beyond an optimal point can decrease precision (the ratio of neighbors in the same class as the anchor), introducing noise and hurting performance. To balance performance across datasets, $L$ is set to 10.

  • Sensitivity of weighting parameter $\lambda_1$: The sensitivity of the weighting parameter $\lambda_1$ is shown in Figure 5 from the original paper.

    Figure 5: The sensitivity of the weighting parameter $\lambda_1$, showing ACC and NMI on AgNews (left) and GoogleNews-S (right).

    The weighting parameter $\lambda_1$ balances the BERT module's loss $\mathcal{L}_B$ and the TFIDF module's ELBO $\mathcal{L}_{ELBO}'$. The figure indicates that keeping $\lambda_1$ in a proper range is important: extreme values can degrade performance, whereas values between 0.04 and 0.16 yield relatively stable results, suggesting the model is not overly sensitive within a reasonable range. For convenience, $\lambda_1$ is fixed to 0.1 for all datasets.

  • Sensitivity of temperature parameter $\tau$ (Gumbel-Softmax): The sensitivity of the temperature parameter $\tau$ for the Gumbel-Softmax trick is presented in Figure 6 from the original paper.

    Figure 6: The sensitivity of the temperature parameter $\tau$ of the Gumbel trick, showing ACC (left) and NMI (right) across different $\tau$ values.
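A minimal PyTorch sketch of Gumbel-Softmax sampling with temperature $\tau$ (how exactly COTC draws its cluster assignments is not detailed here, so the usage below is illustrative):

```python
import torch
import torch.nn.functional as F

def gumbel_softmax_sample(logits, tau=0.5):
    """Draw a differentiable, approximately one-hot sample from a categorical."""
    gumbel_noise = -torch.log(-torch.log(torch.rand_like(logits) + 1e-20) + 1e-20)
    return F.softmax((logits + gumbel_noise) / tau, dim=-1)

logits = torch.randn(4, 20)            # e.g., unnormalized cluster scores
soft_one_hot = gumbel_softmax_sample(logits, tau=0.5)
print(soft_one_hot.sum(dim=-1))        # each row sums to 1

# PyTorch also ships this as F.gumbel_softmax(logits, tau=0.5)
```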

    The graph shows that utilizing the Gumbel-Softmax trick with a temperature parameter $\tau$ generally enhances clustering performance. This is likely due to the trick's ability to facilitate exploration during training by allowing for soft, differentiable approximations of discrete samples. $\tau$ is simply set to 0.5 for all datasets.

6.2.4. Case Study

  • Keyword Analysis (SearchSnippets): To understand the keywords learned by the TFIDF module, the paper maps the cluster centers from the TFIDF module back to the vocabulary space using the embedding matrix $\pmb{\mathcal{E}}$ from the VAE's decoder. The top-5 relevant keywords for each cluster on SearchSnippets are listed below (a small keyword-extraction sketch follows the table). This shows that the TFIDF module successfully identifies domain-specific keywords for each topic.

    The following are different keywords revealed by the cluster centers in the TFIDF module on SearchSnippets.

    clusters keywords topics
    #1 business, market, services, financial, finance Business
    #2 computer, software, programming, linux, web Computers
    #3 movie, music, com, movies, film Culture-Arts-Entertainment
    #4 edu, research, science, university, theory Education-Science
    #5 electrical, car, motor, engine, products Engineering
    #6 health, medical, information, disease, gov Health
    #7 political, party, democracy, government, democratic Politics-Society
    #8 sports, football, news, games, com Sports
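A rough sketch of the keyword-extraction step described above, scoring vocabulary terms against a cluster center through the decoder's embedding matrix (matrix orientation and variable names are assumptions):

```python
import numpy as np

def top_keywords(center, embedding, vocab, k=5):
    """Top-k vocabulary terms for one cluster center.

    center    : (128,)   cluster-center vector in the latent space
    embedding : (V, 128) decoder embedding matrix E (one row per vocabulary term)
    vocab     : list of V terms from the TFIDF vectorizer
    """
    scores = embedding @ center                 # relevance of each term to the center
    return [vocab[i] for i in np.argsort(-scores)[:k]]

rng = np.random.default_rng(0)
vocab = [f"term_{i}" for i in range(2048)]
print(top_keywords(rng.normal(size=128), rng.normal(size=(2048, 128)), vocab))
```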
  • Visualization (SearchSnippets): The visualization on SearchSnippets (Figure 4 from the original paper) qualitatively demonstrates the improved cluster separation achieved by COTC.

    Figure 4: The visualization on SearchSnippets, comparing the original feature distribution (left) with the clustering result after training (right); numeric labels mark the different clusters.

    The left panel shows the initial feature distribution, while the right panel, post-training, clearly indicates well-defined and separated clusters, confirming the model's ability to learn distinct groupings.

  • Keyword Analysis (StackOverflow): For StackOverflow, a domain-specific dataset, the TFIDF module also reveals highly relevant keywords for each of the 20 clusters. This highlights TFIDF's strength in capturing professional terms.

    The following are different keywords revealed by the cluster centers in the TFIDF module on StackOverflow.

    clusters keywords topics
    #1 excel, vba, cell, macro, data excel
    #2 haskell, type, function, scala, list haskell
    #3 mac, os, osx, application, app OSX
    #4 linq, sql, query, using, join linq
    #5 ajax, jquery, javascript, request, php ajax
    #6 visual, studio, 2008, 2005, project visual-studio
    #7 cocoa, using, file, use, text cocoa
    #8 hibernate, mapping, criteria, query, hql hibernate
    #9 sharepoint, web, site, 2007, list sharepoint
    #10 bash, script, command, shell, file bash
    #11 apache, rewrite, mod, htaccess, redirect apache
    #12 wordpress, posts, post, page, blog wordpress
    #13 svn, subversion, repository, files, commit svn
    #14 drupal, node, views, module, content drupal
    #15 qt, widget, window, creator, application qt
    #16 scala, java, class, type, actors scala
    #17 magento, product, products, page, admin magento
    #18 matlab, matrix, plot, array, function matlab
    #19 oracle, sql, table, pl, database oracle
    #20 spring, bean, hibernate, security, using spring
  • Visualization (StackOverflow): The visualization on StackOverflow (Figure 7 from the original paper) further illustrates the benefits of COTC.

    Figure 7: The visualization on StackOverflow, comparing the untrained sample distribution (left) with the more tightly clustered result after training (right); star markers denote samples at the cluster centers.

    The figure from the introduction (Figure 1) showed BERT features separating TFIDF neighbors. Figure 7 (post-training visualization) demonstrates that COTC can successfully cluster these samples together, and the decision boundaries between different clusters become clearer. This supports the paper's claim that COTC addresses BERT's keyword weakness.

6.2.5. Investigation of Other Features and Base Models

  • Replacing TFIDF with Other Keyword Features: The paper investigates if TFIDF can be replaced by other keyword-reflecting features like BoW (Bag-of-Words) or Word2Vec.

    The following are the ACC results using BoW or Word2Vec instead of TFIDF features. COTCBERT-TFIDF is our final method.

    dataset AN SO Bio GN-TS Tw
    RSTCBERT 84.24 83.30 48.40 83.27 75.20
    COTCBERT-W2V 34.24 28.06 27.50 74.06 14.36
    COTCBERT-BoW 87.41 84.91 52.68 89.15 88.43
    COTCBERT-TFIDF 87.56 87.78 53.20 90.50 91.33

    Analysis:

    • COTCBERT-BoW performs nearly as well as COTCBERT-TFIDF, indicating that BoW features, like TFIDF, effectively capture keyword information and can be successfully integrated into the co-training framework.
    • COTCBERT-W2V performs very poorly. The authors conjecture that Word2Vec is more similar to BERT in nature (dense embeddings) and strongly relies on its pre-training corpus, thus not providing the complementary keyword-focused information that TFIDF or BoW do.
  • Using Different BERT Backbones: The paper tests the robustness of COTC with different BERT base models (instead of the default distilbert-base-nli-stsb-mean-tokens).

    The following are the ACC results using different base models instead of the default sentence-distilbert.

    dataset AN GN-TS Tw
RSTCXLNet-base-uncased 71.75 34.47 10.07
    COTCXLNet-base-uncased 84.60 80.21 71.97
    RSTCBERT-base-uncased 82.23 77.45 73.87
    COTCBERT-base-uncased 87.83 89.10 89.81
    RSTCRoBERTa-base 85.76 75.63 71.08
    COTCRoBERTa-base 87.44 88.32 90.45

    Analysis:

    • The performance of RSTC (a BERT-only SOTA method) is highly dependent on the quality of the BERT backbone. For instance, RSTC with XLNet-base-uncased performs very poorly on Tweet (10.07 ACC).
    • In contrast, COTC demonstrates much higher stability and consistently good performance across different BERT backbones. Even when the base BERT model is weak (XLNet-base-uncased), COTC still achieves significantly better results (e.g., 71.97 ACC on Tweet with XLNet, a massive improvement over RSTC's 10.07 ACC). This confirms that the complementary TFIDF features and the co-training framework enhance the robustness of the clustering, compensating for weaknesses in the underlying BERT representations.

6.2.6. Clustering with Noisy Data

The authors test COTC's stability under noisy data conditions by adding random samples from the Biomedical dataset as noise to the StackOverflow dataset.

The following are the clustering results when our method performs clustering under noisy data condition, i.e., StackOverflow contaminated by Biomedical.

percentage of noisy samples 0% 1% 2% 3% 4%
ACC 87.78 87.13 84.32 83.59 81.72
NMI 79.19 78.54 77.29 77.20 76.60

Analysis:

  • As expected, performance (ACC and NMI) gradually declines as the percentage of noisy samples increases.
  • However, even with 4% noise from a completely different domain (Biomedical vs. StackOverflow), COTC maintains a relatively high ACC of 81.72% and NMI of 76.60%. This indicates that COTC is relatively robust to noise and can maintain a certain level of stability even when clustering contaminated data.

6.2.7. Comparison with LLM Zero-Shot Clustering

The paper includes a brief comparison with zero-shot clustering using a Large Language Model (LLM), specifically Qwen2-7B-Instruct, on the AgNews dataset.

The following are the clustering results of LLM for zero-shot short text clustering on AgNews.

ACC NMI
Qwen2-7B-Instruct-zero-shot 75.28 48.27
COTC 87.56 67.09

Analysis:

  • Qwen2-7B-Instruct achieves a respectable ACC of 75.28% and NMI of 48.27% in a zero-shot setting, even when provided with specific category names in the prompt. This highlights the impressive capabilities of LLMs for text understanding without explicit training on the clustering task.
  • However, COTC still significantly outperforms the LLM (87.56 ACC, 67.09 NMI). This suggests that while LLMs are powerful generalists, specialized models like COTC, which are carefully designed and trained for the specific task of short text clustering by leveraging complementary features, remain highly competitive and often superior for this specific task. The paper notes that LLMs of this size might be practical for large datasets, but dedicated models still hold an advantage.

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper successfully addresses the challenge of short text clustering by proposing the CO-Training Clustering (COTC) framework. COTC innovatively combines the deep semantic understanding capabilities of BERT features with the keyword information strengths of TFIDF features, which BERT-only models often miss. The core of COTC lies in its dual-module design, where a BERT module and a TFIDF module are developed to mutually promote each other. This mutual promotion is achieved through explicit alignment at both the representation level (using similarity graphs) and the cluster assignment level (aligning predicted probabilities). Furthermore, the alternating co-training process is unified into a single, efficient joint training objective, allowing for seamless gradient propagation. Extensive experiments across eight benchmark datasets consistently demonstrate COTC's superior performance, outperforming existing state-of-the-art methods by a significant margin and showcasing its robustness across different BERT backbones and even in the presence of noisy data. The work effectively rediscovers the value of TFIDF features in a modern deep learning context.

7.2. Limitations & Future Work

The authors acknowledge several limitations of the current COTC framework:

  1. Known Number of Clusters: Similar to many previous clustering methods, COTC requires the number of clusters ($K$) to be known beforehand. This can be a practical limitation in real-world scenarios where $K$ is unknown.

  2. Hyperparameter Tuning Complexity: Due to the inherent differences between BERT (low-dimensional, dense) and TFIDF (high-dimensional, sparse) features, the neural networks in their respective modules often require different learning rates. This necessitates additional effort for hyperparameter tuning.

  3. Increased Computational Cost: Compared to methods relying solely on BERT features, the introduction of a TFIDF module incurs extra time and space for training and inference.

    For future work, the authors plan to:

  • Explore a more compact and efficient way to leverage the collective strengths of different text features without significantly increasing computational costs. This could involve more lightweight TFIDF modules or more integrated architectures.

7.3. Personal Insights & Critique

This paper presents a highly insightful and practical approach to short text clustering. The core idea of explicitly leveraging complementary feature types rather than solely relying on a single, albeit powerful, feature is a valuable lesson. The strength of BERT in capturing broad semantics is undeniable, but the paper rightly points out its blind spot for domain-specific keywords. TFIDF, despite being a traditional technique, effectively fills this gap.

Inspirations and Applications:

  • Multi-Modal Feature Fusion: The co-training and alignment strategy could inspire similar approaches in other domains where multiple data modalities or feature types (e.g., text and images, or different types of text embeddings) possess complementary information but have intrinsically different distributions.
  • Domain-Specific Adaptation: For fields requiring high precision on specialized terminology (e.g., legal documents, medical records, scientific papers), this framework offers a robust way to ensure that critical keywords are not overlooked by general-purpose language models.
  • Robustness Enhancement: The improved robustness of COTC when using weaker BERT backbones or facing noisy data is a significant advantage, suggesting that this framework can make clustering pipelines more resilient.

Potential Issues and Areas for Improvement:

  • Automated K-determination: The limitation of requiring a known number of clusters ($K$) is a common but persistent challenge in clustering. Future work could integrate methods for automatic cluster number detection (e.g., silhouette score, elbow method, or Bayesian non-parametric models) into the COTC framework.

  • Adaptive Hyperparameter Tuning: The need for extensive hyperparameter tuning (especially different learning rates) highlights a practical hurdle. Research into adaptive optimization strategies or meta-learning for dynamic learning rate adjustment across disparate modules could reduce this burden.

  • Scalability for Very Large Vocabularies: While TFIDF is effective, for extremely large and dynamic vocabularies, the TFIDF feature vector can become very high-dimensional, potentially increasing memory and computational demands for the VAE. Exploring more sparse-aware or approximate techniques for the TFIDF module could be beneficial.

  • Beyond BERT and TFIDF: The paper opens the door for integrating other types of specialized features (e.g., graph-based features for texts in knowledge graphs, syntactic features) into a similar co-training paradigm, further enhancing the model's capabilities for complex text data.

    Overall, COTC provides a well-reasoned and empirically strong solution for short text clustering, emphasizing the enduring value of traditional methods when thoughtfully integrated with modern deep learning.
