
TELEClass: Taxonomy Enrichment and LLM-Enhanced Hierarchical Text Classification with Minimal Supervision

Published: 03/01/2024
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

TELEClass introduces a hierarchical text classification method that combines LLMs with features mined from unannotated corpora, using class names as the only supervision. It automates taxonomy enrichment and generates additional class-indicative features, outperforming previous weakly-supervised baselines while matching zero-shot LLM prompting at a fraction of the inference cost.

Abstract

Hierarchical text classification aims to categorize each document into a set of classes in a label taxonomy, which is a fundamental web text mining task with broad applications such as web content analysis and semantic indexing. Most earlier works focus on fully or semi-supervised methods that require a large amount of human annotated data which is costly and time-consuming to acquire. To alleviate human efforts, in this paper, we work on hierarchical text classification with a minimal amount of supervision: using the sole class name of each node as the only supervision. Recently, large language models (LLM) have shown competitive performance on various tasks through zero-shot prompting, but this method performs poorly in the hierarchical setting because it is ineffective to include the large and structured label space in a prompt. On the other hand, previous weakly-supervised hierarchical text classification methods only utilize the raw taxonomy skeleton and ignore the rich information hidden in the text corpus that can serve as additional class-indicative features. To tackle the above challenges, we propose TELEClass, Taxonomy Enrichment and LLM-Enhanced weakly-supervised hierarchical text Classification, which combines the general knowledge of LLMs and task-specific features mined from an unlabeled corpus. TELEClass automatically enriches the raw taxonomy with class-indicative features for better label space understanding and utilizes novel LLM-based data annotation and generation methods specifically tailored for the hierarchical setting. Experiments show that TELEClass can significantly outperform previous baselines while achieving comparable performance to zero-shot prompting of LLMs with drastically less inference cost.


In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

The central topic of this paper is TELEClass: Taxonomy Enrichment and LLM-Enhanced Hierarchical Text Classification with Minimal Supervision. It focuses on classifying text documents into a hierarchical structure of labels using very little human annotation.

1.2. Authors

The authors and their affiliations are:

  • Yunyi Zhang (University of Illinois Urbana-Champaign, Urbana, IL, USA)

  • Ruozhen Yang* (University of Illinois Urbana-Champaign, Urbana, IL, USA)

  • Xueqiang Xu* (University of Illinois Urbana-Champaign, Urbana, IL, USA)

  • Rui Li* (University of Science and Technology of China, Hefei, China)

  • Jiaming Shen (Google DeepMind, New York, NY, USA)

  • Jinfeng Xiao (University of Illinois Urbana-Champaign, Urbana, IL, USA)

  • Jiawei Han (University of Illinois Urbana-Champaign, Urbana, IL, USA)

    The asterisks next to some authors' names typically indicate shared contribution or another specific acknowledgment (often co-first or corresponding authorship), though they are not explicitly defined in this paper's excerpt. Most authors are affiliated with the University of Illinois Urbana-Champaign, a prominent computer science research institution; one author is from Google DeepMind, a leading AI research company, and another is from the University of Science and Technology of China. This mix suggests a strong academic-industry collaboration on the research.

1.3. Journal/Conference

The paper is published at the ACM Web Conference 2025 (WWW '25). The WWW conference is a highly reputable and influential venue in the field of web technologies, data mining, and information retrieval. Publication at WWW signifies that the research is considered significant and of high quality within the web and data science communities.

1.4. Publication Year

The paper was published at WWW '25, indicating a publication year of 2025. However, the provided metadata indicates Published at (UTC): 2024-02-29T22:26:07.000Z, which means it was made available as a preprint on arXiv in 2024 and accepted for a 2025 conference.

1.5. Abstract

The paper addresses the challenge of hierarchical text classification with minimal supervision, specifically using only class names as input. Traditional methods require extensive human-annotated data, which is costly. While Large Language Models (LLMs) show promise in zero-shot tasks, they struggle with the large, structured label spaces of hierarchical classification due to prompt limitations. Existing weakly-supervised methods overlook rich class-indicative features within unlabeled text corpora.

To overcome these issues, the paper proposes TELEClass, which combines LLMs' general knowledge with task-specific features mined from an unlabeled corpus. TELEClass automatically enriches the raw taxonomy with class-indicative features for better label space understanding. It also introduces novel LLM-based data annotation and generation methods specifically designed for hierarchical settings. Experiments demonstrate that TELEClass significantly outperforms previous baselines and achieves performance comparable to zero-shot LLM prompting, but with drastically lower inference costs.

The official source link for this paper is: https://arxiv.org/abs/2403.00165v3. The PDF link is: https://arxiv.org/pdf/2403.00165v3.pdf. This is a preprint published on arXiv, often a precursor to formal publication. The abstract also mentions "Proceedings of the ACM Web Conference 2025 (WWW '25)", indicating it has been accepted for formal publication.

2. Executive Summary

2.1. Background & Motivation

The paper tackles the hierarchical text classification (HTC) problem, a fundamental task in web text mining and natural language processing (NLP). HTC involves categorizing documents into one or multiple classes within a structured label taxonomy. This task is crucial for applications such as web content organization, semantic indexing, and query classification.

Core Problem: The main challenge lies in the significant cost and time required to acquire human-annotated data for training robust HTC models. Most prior work relies on fully-supervised or semi-supervised methods, which necessitate large volumes of labeled examples.

Specific Challenges/Gaps:

  1. Cost of Supervision: Human annotation is expensive and not scalable, especially for large and dynamic taxonomies.

  2. LLM Limitations in Hierarchy: While Large Language Models (LLMs) like GPT-4 perform well in flat text classification, they struggle with hierarchical settings. Directly including hundreds of classes in prompts is inefficient, leads to information loss, and incurs prohibitive inference costs due to long prompts.

  3. Prior Weak Supervision Deficiencies: Existing weakly-supervised HTC methods, such as TaxoClass, primarily use only class names or a few keywords. They often ignore the rich, class-indicative features hidden within large unlabeled text corpora, which could significantly aid in understanding the nuances of fine-grained and long-tail classes. They also suffer from unreliable pseudo label selection, as their underlying models are not optimized for comparing class relevance.

    Paper's Entry Point / Innovative Idea: The paper proposes to advance minimally-supervised HTC by synergistically combining the general knowledge of LLMs with task-specific knowledge extracted from an unlabeled text corpus. The core innovation is to enrich the label taxonomy with more informative features and to use LLMs more effectively for data annotation and generation in a hierarchical context.

2.2. Main Contributions / Findings

The paper's primary contributions and key findings are:

  1. Novel Method TELEClass: Proposes TELEClass, a new method for minimally-supervised hierarchical text classification that requires only class names as supervision. It trains a multi-label text classifier by combining LLM capabilities and corpus-mined features.

  2. Taxonomy Enrichment with Hybrid Knowledge: Introduces a novel approach to enrich the label taxonomy with class-indicative terms from two sources:

    • LLM generation: Leveraging LLMs' general knowledge to generate descriptive keywords.
    • Automated corpus extraction: Mining corpus-specific topical terms through semantic and statistical analysis of the unlabeled text. This combined enrichment significantly improves label space understanding and pseudo label quality.
  3. LLM-Enhanced Data Annotation and Generation for Hierarchy: Enhances LLMs' utility in hierarchical classification by:

    • Efficient Core Class Annotation: Using a taxonomy-guided candidate search to reduce the label space for LLMs, making annotation more efficient and effective. This process identifies document "core classes" (fine-grained, accurate descriptors).
    • Path-Based Data Augmentation: Optimizing LLM-based document generation to create more precise pseudo data by guiding LLMs with taxonomy paths, ensuring coverage for fine-grained and long-tail classes that might otherwise be missed.
  4. Significant Performance Improvement: Experiments on two datasets (Amazon-531 and DBPedia-298) demonstrate that TELEClass can significantly outperform previous zero-shot and weakly-supervised hierarchical text classification baselines.

  5. Cost-Effective Solution: Achieves performance comparable to zero-shot prompting of LLMs (like GPT-4), but with drastically less inference cost and significantly shorter inference time once the classifier is trained. This makes it a more practical solution for real-world applications.

    These findings collectively solve the problem of requiring extensive human annotation for HTC, provide a more effective way to leverage LLMs in complex hierarchical structures, and improve the quality of pseudo-labeled data by integrating diverse sources of knowledge.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

  • Hierarchical Text Classification (HTC): This is a specialized form of text classification where documents are assigned to one or more categories organized in a hierarchy or taxonomy. Unlike flat classification, where labels are independent, HTC considers parent-child relationships (e.g., "Electronics" -> "Smartphones" -> "Android Phones"). A document might be classified at multiple levels and along multiple paths (e.g., "Electronics", "Smartphones", "Android Phones"). The label space is often large and structured.

  • Taxonomy (Label Taxonomy): A hierarchical structure, often represented as a Directed Acyclic Graph (DAG) or a tree, where nodes represent classes and edges represent relationships (e.g., is-a or part-of). In HTC, it defines the target label space.

  • Minimal Supervision / Weakly-Supervised Learning: A machine learning paradigm where the model is trained with very limited, noisy, or indirect supervision signals, rather than large amounts of precisely human-labeled data. In this paper, minimal supervision refers to using only the class names of the taxonomy nodes.

  • Large Language Models (LLMs): Advanced deep learning models (e.g., GPT-3, GPT-4, Claude) trained on vast amounts of text data to understand, generate, and process human language. They exhibit strong capabilities in various NLP tasks, often through zero-shot or few-shot prompting.

  • Zero-Shot Prompting: A technique where an LLM performs a task without any specific training examples for that task. The instructions are provided entirely within the prompt (input query), relying on the LLM's pre-trained general knowledge. For example, asking an LLM to classify a document into categories without providing any labeled examples.

  • Semantic Embedding: A numerical representation (vector) of text (words, phrases, sentences, documents) that captures its meaning. Texts with similar meanings are represented by vectors that are "close" to each other in the embedding space (e.g., high cosine similarity).

    • Sentence Transformer: A type of pre-trained model (often based on BERT or other transformers) specifically designed to generate dense vector embeddings for sentences or short paragraphs. These embeddings are optimized for semantic similarity tasks, meaning that sentences with similar meanings will have similar embeddings. A popular example is all-mpnet-base-v2.
    • BERT (Bidirectional Encoder Representations from Transformers): A powerful pre-trained language model that processes text by considering the context from both the left and right sides of a word simultaneously. It generates contextualized word and sentence embeddings and can be fine-tuned for various NLP tasks. BERT-base-uncased is a common version.
  • Cosine Similarity: A measure of similarity between two non-zero vectors in an inner product space. It quantifies the cosine of the angle between them. A cosine similarity of 1 means the vectors are perfectly aligned (same direction), 0 means they are orthogonal (no similarity), and -1 means they are diametrically opposed. $ \cos(\vec{A}, \vec{B}) = \frac{\vec{A} \cdot \vec{B}}{\|\vec{A}\| \|\vec{B}\|} = \frac{\sum_{i=1}^n A_i B_i}{\sqrt{\sum_{i=1}^n A_i^2}\sqrt{\sum_{i=1}^n B_i^2}} $ where $\vec{A}$ and $\vec{B}$ are two vectors, $A_i$ and $B_i$ are their components, and $\|\vec{A}\|$ and $\|\vec{B}\|$ are their magnitudes.

  • BM25 (Best Match 25): A ranking function used by search engines to estimate the relevance of documents to a given search query. It's a bag-of-words model that considers term frequency (how often a term appears in a document), inverse document frequency (how rare a term is across the corpus), and document length. $ \mathrm{BM25}(Q, D) = \sum_{t \in Q} \mathrm{IDF}(t) \cdot \frac{f(t,D) \cdot (k_1 + 1)}{f(t,D) + k_1 \cdot (1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}})} $ where:

    • $Q$ is the query (in this context, a term $t$).
    • $D$ is the document (here, a set of documents $D_c^0$ relevant to class $c$).
    • $f(t,D)$ is the term frequency of $t$ in $D$.
    • $|D|$ is the length of document $D$.
    • $\mathrm{avgdl}$ is the average document length in the collection.
    • $k_1$ and $b$ are free parameters, typically $k_1 \in [1.2, 2.0]$ and $b = 0.75$.
    • $\mathrm{IDF}(t)$ is the Inverse Document Frequency of term $t$, often calculated as $\log \frac{N - n(t) + 0.5}{n(t) + 0.5}$, where $N$ is the total number of documents and $n(t)$ is the number of documents containing $t$.
  • Binary Cross-Entropy (BCE) Loss: A common loss function used in binary classification and multi-label classification tasks. It measures the performance of a classification model whose output is a probability value between 0 and 1. For a single sample and a single class: $ L(y, \hat{y}) = - (y \log(\hat{y}) + (1-y) \log(1-\hat{y})) $ where $y$ is the true label (0 or 1) and $\hat{y}$ is the predicted probability. In multi-label classification, it is typically summed or averaged over all classes (a short code sketch of cosine similarity and this loss follows at the end of this list).

  • Multi-label Classification: A classification problem where each instance can be assigned to multiple labels simultaneously. For example, a movie might be categorized as "Action", "Comedy", and "Sci-Fi" at the same time.

  • Directed Acyclic Graph (DAG): A graph data structure where all edges are directed (point from one node to another), and there are no cycles. Taxonomies are often represented as DAGs, allowing a class to have multiple parents (e.g., "Apple" could be a subclass of "Fruit" and "Company" in a different context).
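The following NumPy sketch illustrates two of the quantities above, cosine similarity and a multi-label binary cross-entropy; the function names and toy inputs are illustrative only, not taken from the paper.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """cos(A, B) = (A . B) / (||A|| ||B||)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def multilabel_bce(y_true: np.ndarray, y_prob: np.ndarray, eps: float = 1e-12) -> float:
    """Binary cross-entropy averaged over the classes of one example.

    y_true: 0/1 indicator vector over classes; y_prob: predicted probabilities.
    """
    y_prob = np.clip(y_prob, eps, 1.0 - eps)  # avoid log(0)
    per_class = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    return float(per_class.mean())

# Example: two parallel vectors have cosine similarity 1.0.
print(cosine_similarity(np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 6.0])))
# Example: BCE for a 4-class multi-label prediction.
print(multilabel_bce(np.array([1, 0, 1, 0]), np.array([0.9, 0.2, 0.7, 0.1])))
```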

3.2. Previous Works

The paper discusses several lines of related work, primarily focusing on weakly-supervised and zero-shot HTC methods.

  • Fully-supervised and Semi-supervised HTC: Earlier works (e.g., [9, 24, 58] for fully-supervised; [5, 20] for semi-supervised) rely on substantial human-labeled data, which TELEClass aims to reduce. These methods often involve complex neural architectures that learn directly from large training sets.

  • Weakly-supervised Hierarchical Text Classification:

    • WeSHClass [35]: This method uses a small set of keywords or labeled documents per class. It generates pseudo documents for pretraining text classifiers and then performs self-training. The challenge here is still the human effort required to compile keyword lists or obtain representative documents for many classes.
    • TaxoClass [45]: This is a key baseline, as it shares the same minimal supervision setting as TELEClass (using only class names). It employs a textual entailment model (which determines if one text logically follows from another) with a top-down search and corpus-level comparison to identify core classes for each document. These core classes then serve as pseudo training data for a multi-label classifier.
      • Textual Entailment Model: A model (e.g., based on BERT or other transformer architectures) trained to recognize relationships between two text segments: entailment (the first text implies the second), contradiction (the texts contradict each other), or neutral (no clear relationship). In TaxoClass, it's used to determine if a document "entails" a class name, effectively assigning a label.
      • Limitation addressed by TELEClass: TaxoClass overlooks additional class-relevant features in the corpus and suffers from unreliable pseudo label selection because the entailment model isn't specifically trained to compare which class is most relevant.
  • Zero-Shot LLM Prompting for HTC:

    • Hier-0Shot-TC [57]: A zero-shot approach that leverages a pre-trained textual entailment model to iteratively find the most similar class at each level of the hierarchy for a document. It performs top-down traversal.
    • Direct LLM Prompting (e.g., GPT-3.5-turbo, GPT-4): Directly queries a powerful LLM by providing all classes in the prompt and asking it to classify the document.
      • Limitation addressed by TELEClass: The paper highlights that directly including hundreds of classes in prompts for hierarchical settings is ineffective and inefficient. It leads to structural information loss, diminishes clarity for LLMs in distinguishing fine-grained classes, and results in prohibitively expensive inference costs due to long prompts.
  • Other Weakly-Supervised Text Classification (Flat): The paper also mentions various methods for flat (non-hierarchical) weakly-supervised text classification, which inform aspects of TELEClass:

    • LOTClass [36]: Uses MLM-based PLMs to extract class-indicative keywords.
    • X-Class [53]: Extracts keywords and creates static class representations through clustering.
    • PESCO [52]: Combines zero-shot classification with semantic matching and iterative contrastive learning.
    • PIEClass [65]: Employs PLM zero-shot prompting for pseudo labels with noise-robust ensemble training.

3.3. Technological Evolution

The field of text classification has evolved significantly:

  1. Rule-based & Traditional ML (Pre-deep learning): Early methods relied on handcrafted rules, feature engineering (TF-IDF), and traditional machine learning algorithms (SVM, Naive Bayes).

  2. Deep Learning (2010s): Introduction of Recurrent Neural Networks (RNNs), Convolutional Neural Networks (CNNs), and later Attention mechanisms for text, improving feature learning and performance, especially with large labeled datasets.

  3. Pre-trained Language Models (PLMs) (late 2010s): Models like Word2Vec, GloVe provided word embeddings, followed by contextualized embeddings from ELMo, BERT, and GPT. These models, pre-trained on massive corpora, demonstrated strong transfer learning capabilities.

  4. Large Language Models (LLMs) (early 2020s): Extremely large-scale PLMs (GPT-3, GPT-4, Claude) with billions of parameters, excelling in zero-shot and few-shot learning across diverse NLP tasks due to their vast general knowledge.

  5. Focus on Low-Resource & Weak Supervision: Parallel to model advancements, there's a growing focus on reducing human annotation effort, leading to research in weak supervision, zero-shot, and self-supervised learning, often by leveraging PLMs/LLMs.

    This paper's work fits into the current wave of integrating powerful LLMs into weakly-supervised settings, specifically for the complex hierarchical text classification task. It addresses the practical limitations of directly using LLMs for large taxonomies and the incompleteness of previous weak supervision methods.

3.4. Differentiation Analysis

Compared to the main methods in related work, TELEClass introduces several core differences and innovations:

  • Vs. Fully/Semi-Supervised Methods: TELEClass completely alleviates the need for large amounts of human-annotated data, requiring only class names as supervision, which is a significant cost and time saver.
  • Vs. Prior Weakly-Supervised HTC (e.g., TaxoClass):
    • Taxonomy Enrichment: TELEClass introduces a novel taxonomy enrichment step that combines LLM-generated key terms (general knowledge) and corpus-based class-indicative terms (task-specific knowledge). TaxoClass primarily relies on class names and the entailment model, overlooking corpus-specific features. This hybrid enrichment provides a richer, more robust understanding of fine-grained classes.
    • Improved Pseudo-Label Quality: The enriched taxonomy and the subsequent embedding-based document-class matching in TELEClass lead to more accurate core class refinement compared to TaxoClass's entailment-based approach, which can suffer from unreliable comparisons.
  • Vs. Zero-Shot LLM Prompting:
    • Hierarchy-Aware LLM Usage: TELEClass addresses the core limitation of LLMs in hierarchical settings by using a structure-aware candidate selection for LLM annotation (reducing prompt length) and path-based generation for data augmentation. This contrasts with naive prompting that struggles with large label spaces and structural information.

    • Cost and Efficiency: While LLM prompting can be effective, it's often prohibitively expensive and time-consuming for large test sets, especially with models like GPT-4. TELEClass trains a smaller, efficient classifier, incurring LLM costs only during the initial pseudo-label generation and taxonomy enrichment phases, leading to drastically lower inference costs and faster classification on new documents.

    • Data Scarcity for Fine-Grained Classes: TELEClass explicitly tackles the long-tail problem in hierarchical data by path-based data augmentation using LLMs. This ensures that even rare or fine-grained classes are adequately represented in the training data, a challenge not directly addressed by simple zero-shot prompting or even methods like TaxoClass which might miss such classes during core class selection.

      In essence, TELEClass differentiates itself by intelligently combining the strengths of LLMs (general knowledge, generation capabilities) with the nuances of corpus-specific information and the structure of the taxonomy, creating a more robust, accurate, and cost-effective minimally-supervised solution for hierarchical text classification.

4. Methodology

The TELEClass framework is designed to perform hierarchical text classification with minimal supervision, using only class names. It achieves this by combining the general knowledge of Large Language Models (LLMs) with task-specific features mined from an unlabeled text corpus. The methodology consists of four major steps: (1) LLM-Enhanced Core Class Annotation, (2) Corpus-Based Taxonomy Enrichment, (3) Core Class Refinement with Enriched Taxonomy, and (4) Text Classifier Training with Path-Based Data Augmentation.

The following figure (Figure 2 from the original paper) shows an overview of the TELEClass framework.

Figure 2: Overview of the TELEClass framework. The figure is a schematic of the TELEClass framework for hierarchical (three-level) text classification, showing its main components: LLM-enhanced core class annotation, core class refinement with the enriched taxonomy, and classifier training with path-based data augmentation. It also depicts the unlabeled corpus, the label taxonomy enriched with key terms, and the training of the multi-label text classifier.

4.1. LLM-Enhanced Core Class Annotation

This initial step aims to identify "core classes" for each document, which are the most accurate and fine-grained classes describing it. This mimics how humans might approach hierarchical classification. The process leverages LLMs while managing the complexity of the large hierarchical label space.

  1. LLM-Generated Class-Indicative Terms: To enable LLMs to better understand and distinguish similar classes, the raw taxonomy structure is first enriched with class-relevant keywords. For each class $c$, an LLM is prompted to generate a set of key terms, denoted as $T_c^{\mathrm{LLM}}$. These terms are specifically designed to uniquely identify class $c$ by being relevant to $c$ and its parent, but irrelevant to its sibling classes. For example, for "shampoo", terms like "flakes" might be generated, distinguishing it from "conditioner", which might have "moisture". This step provides the LLM with a more granular understanding of each class's semantic boundaries.

  2. Similarity Score with LLM-Generated Terms: A similarity score between a document $d$ and a class $c$ is defined to guide candidate selection. This score uses the LLM-generated key terms $T_c^{\mathrm{LLM}}$ and pre-trained semantic embeddings from a model like Sentence Transformer. $ \mathrm{sim}(c, d) = \operatorname*{max}_{t \in T_c^{\mathrm{LLM}}} \cos(\vec{t}, \vec{d}) $ where $\vec{t}$ is the vector representation of a key term $t$ from the set $T_c^{\mathrm{LLM}}$, and $\vec{d}$ is the vector representation of the document $d$. The function $\cos(\cdot, \cdot)$ denotes the cosine similarity between the two embeddings. The maximum similarity is taken over all key terms for class $c$, effectively finding the strongest semantic match between any of the class's defining terms and the document.

  3. Structure-Aware Candidate Core Class Selection: To mitigate the challenge of LLMs comprehending a large, structured label space in a single prompt, a top-down tree search algorithm is applied. For each document, starting from the root node (level $l=0$):

    • At each level $l$, the algorithm selects the top $l+3$ most similar child classes to the document using the $\mathrm{sim}(c, d)$ score defined above.
    • This process continues to deeper levels, only exploring the branches stemming from the selected classes. The increasing number of selected nodes ($l+3$) accounts for the natural growth in the number of classes deeper in the taxonomy.
    • All classes selected at any point during this process form the set of candidate core classes for that document, ensuring that the candidates are semantically relevant and conform to the hierarchical structure (a code sketch of this search follows this list).
  4. LLM Selection of Core Classes: Finally, an LLM is instructed to select the most precise core classes for each document from this reduced set of candidates. This leverages the LLM's strong understanding and reasoning capabilities on a manageable subset of labels. The output is an initial set of core classes, denoted as $\mathbb{C}_i^0$, for each document $d_i$ in the corpus $\mathcal{D}$.
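To make the candidate search concrete, here is a minimal Python sketch of the similarity score and the top-down, structure-aware selection. The inputs are hypothetical and assumed precomputed: `children` maps each class to its child classes, `term_embs` maps each class to the embeddings of its LLM-generated key terms, and `doc_emb` is the document embedding (e.g., from a Sentence Transformer). This is an illustrative reconstruction, not the authors' implementation.

```python
import numpy as np

def cos(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def sim(cls: str, doc_emb: np.ndarray, term_embs: dict) -> float:
    # sim(c, d) = max over t in T_c^LLM of cos(t, d)
    return max(cos(t, doc_emb) for t in term_embs[cls])

def candidate_core_classes(root: str, children: dict, doc_emb: np.ndarray, term_embs: dict) -> set:
    """Top-down search: at level l, keep the (l + 3) children most similar to the document."""
    candidates, frontier, level = set(), [root], 0
    while frontier:
        # Pool together the children of all classes selected at the previous level.
        child_pool = [c for node in frontier for c in children.get(node, [])]
        if not child_pool:
            break
        ranked = sorted(child_pool, key=lambda c: sim(c, doc_emb, term_embs), reverse=True)
        frontier = ranked[: level + 3]   # keep the top (l + 3) classes at this level
        candidates.update(frontier)      # every selected class becomes a candidate core class
        level += 1
    return candidates
```

The returned candidate set is what would be passed to the LLM in step 4, so the prompt only needs to enumerate a handful of classes instead of the whole taxonomy.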

4.2. Corpus-Based Taxonomy Enrichment

While LLM-generated terms capture general knowledge, corpus-specific knowledge is crucial for understanding nuanced and fine-grained classes. This step further enriches the taxonomy with terms mined directly from the unlabeled text corpus.

  1. Collecting Relevant Documents: For each class $c \in C$, an initial set of relevant documents, $D_c^0 \subset \mathcal{D}$, is collected. This set includes all documents whose initial core classes (from the previous step) contain $c$ or any of its descendants. This provides a rough, but useful, initial grouping of documents for each class.

  2. Class-Indicative Term Selection Factors: For each class $c$ and each of its parents $c_p$ (since a DAG allows multiple parents), the goal is to find a set of corpus-based terms $T(c, c_p)$ that signify $c$ and distinguish it from its siblings under $c_p$. Three factors are considered for term selection:

    • Popularity: A term $t$ is indicative if it appears frequently in documents relevant to class $c$. $ \mathrm{pop}(t, c) = \log(1 + \mathrm{df}(t, D_c^0)) $ where $\mathrm{df}(t, D_c^0)$ is the document frequency of term $t$ in the set of documents $D_c^0$ relevant to class $c$. The log normalization helps to dampen the effect of extremely frequent terms.

    • Distinctiveness: A term $t$ is indicative if it is infrequent in documents relevant to $c$'s siblings. This is quantified using a softmax of the BM25 relevance function over sibling classes: $ \mathrm{dist}(t, c, c_p) = \frac{\exp(\mathrm{BM25}(t, D_c^0))}{1 + \sum_{c' \in \mathrm{Sib}(c, c_p)} \exp(\mathrm{BM25}(t, D_{c'}^0))} $ where $\mathrm{Sib}(c, c_p)$ is the set of sibling classes of $c$ under parent $c_p$, and $\mathrm{BM25}(t, D_{c'}^0)$ measures the relevance of term $t$ to the document set $D_{c'}^0$ of a sibling class $c'$. This formula ensures that a term is highly distinctive if its BM25 score for class $c$ is high while its BM25 scores for sibling classes are low.

    • Semantic Similarity: A term $t$ should be semantically similar to the class name of $c$. This is measured using cosine similarity between their embeddings, derived from a pre-trained encoder like BERT. $ \mathrm{sem}(c, t) = \cos(\vec{s_c}, \vec{t}) $ where $\vec{s_c}$ is the embedding of the class name $s_c$ and $\vec{t}$ is the embedding of the term $t$.

  3. Affinity Score and Term Selection: The overall affinity score between a term $t$ and a class $c$ (with respect to parent $c_p$) is the geometric mean of the three factors: $ \mathrm{aff}(t, c, c_p) = \sqrt[3]{\mathrm{pop}(t, c) \cdot \mathrm{dist}(t, c, c_p) \cdot \mathrm{sem}(c, t)} $ First, a phrase mining tool like AutoPhrase is used to extract quality single- and multi-token phrases from the corpus as candidate terms. Then, for each class $c$ and each of its parents $c_p$, the top $k$ terms with the highest affinity scores are selected, forming $T(c, c_p)$ (see the code sketch after this list).

  4. Final Enriched Class-Indicative Terms: The corpus-based terms $T(c, c_p)$ are combined with the LLM-generated terms $T_c^{\mathrm{LLM}}$ from the previous step to form the final set of enriched class-indicative terms for class $c$: $ T_c = \left( \bigcup_{c_p \in \mathrm{Par}(c)} T(c, c_p) \right) \bigcup T_c^{\mathrm{LLM}} $ where $\mathrm{Par}(c)$ denotes the set of parents of class $c$. This union provides a comprehensive set of terms, combining general and corpus-specific knowledge.
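The term scoring above can be summarized in a short Python sketch. Document frequencies, BM25 scoring over the per-class document sets, and the embedding-based cosine similarity are abstracted behind hypothetical arguments (`doc_freq`, `bm25_score`, `sem_sim`); only the combination of the three factors into the affinity score is shown.

```python
import math

def popularity(doc_freq: int) -> float:
    # pop(t, c) = log(1 + df(t, D_c^0))
    return math.log(1 + doc_freq)

def distinctiveness(term: str, cls: str, siblings: list, bm25_score) -> float:
    # dist(t, c, c_p) = exp(BM25(t, D_c^0)) / (1 + sum over siblings c' of exp(BM25(t, D_c'^0)))
    numerator = math.exp(bm25_score(term, cls))
    denominator = 1.0 + sum(math.exp(bm25_score(term, sib)) for sib in siblings)
    return numerator / denominator

def affinity(term: str, cls: str, siblings: list, doc_freq: int, bm25_score, sem_sim) -> float:
    # aff(t, c, c_p) = cube root of (popularity * distinctiveness * semantic similarity)
    pop = popularity(doc_freq)
    dist = distinctiveness(term, cls, siblings, bm25_score)
    sem = sem_sim(cls, term)  # cosine similarity between class-name and term embeddings
    return (pop * dist * sem) ** (1.0 / 3.0)
```

For each (class, parent) pair, the top k candidate phrases by this affinity score would then form T(c, c_p).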

The following figure (Figure 1 from the original paper) provides an example of taxonomy enrichment.

Figure 1: An example document tagged with 3 classes. We automatically enrich each node with class-indicative terms and utilize LLMs to facilitate classification. The figure is a schematic showing how the label taxonomy is enriched using a large language model (LLM) and an unlabeled text corpus; it includes an example document tagged at three levels of the taxonomy and the key terms generated by the LLM to enhance classification.

4.3. Core Class Refinement with Enriched Taxonomy

With the enriched class-indicative terms $T_c$ for each class, this step refines the initial core classes ($\mathbb{C}_i^0$) using an embedding-based document-class matching method.

  1. Document and Class Representations:

    • Document Representations: A pre-trained Sentence Transformer model encodes the entire document $d$ into a vector representation, denoted as $\vec{d}$.
    • Class Representations: For each class $c$, a subset of its initially assigned documents $D_c^0$ is identified. These are the "most confident" documents, defined as those that explicitly mention at least one of the enriched class-indicative keywords $T_c$. This subset is denoted as $D_c = \{d \in D_c^0 \mid \exists w \in T_c, w \in d\}$. The class representation $\vec{c}$ is then computed as the average of the document embeddings in $D_c$: $ \vec{c} = \frac{1}{|D_c|} \sum_{d \in D_c} \vec{d} $ The document-class matching score is simply the cosine similarity between these representations: $\cos(\vec{d}, \vec{c})$.
  2. Identifying Refined Core Classes using Similarity Gap: The key insight here is that true core classes should have significantly higher matching scores with a document than non-core classes. For each document $d_i$, the classes are ranked by their matching scores. Then, the "largest similarity gap" is used to identify the refined core classes.

    • A ranked list of classes for document $d_i$ is obtained: $[c_1^i, c_2^i, \ldots, c_{|C|}^i]$, where $c_j^i$ is the $j$-th ranked class.
    • The similarity difference between adjacent ranks is calculated: $\mathrm{diff}^i(j) := \cos(\vec{d}_i, \vec{c}_j^i) - \cos(\vec{d}_i, \vec{c}_{j+1}^i)$, for $j \in \{1, \ldots, |C|-1\}$. Only positive differences are considered.
    • The position $m_i$ with the highest similarity difference is found: $ m_i = \underset{j \in \{1, \dots, |C|-1\}}{\arg\max} \ \mathrm{diff}^i(j) $
    • The classes ranked above this position $m_i$ are designated as the document's refined core classes, $\mathbb{C}_i = \{c_1^i, \dots, c_{m_i}^i\}$.
    • The similarity gap itself, $\mathrm{conf}_i = \mathrm{diff}^i(m_i)$, serves as a confidence estimation for these refined core classes.
  3. Selecting Confident Documents: Finally, only the top $75\%$ of documents $d_i$ (and their corresponding refined core classes $\mathbb{C}_i$) with the highest confidence scores $\mathrm{conf}_i$ are selected. This set of highly confident pseudo-labeled documents is denoted as $\mathcal{D}^{\mathrm{core}}$, and it will be used for classifier training (a code sketch of this refinement step follows this list).
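A minimal Python sketch of the similarity-gap refinement, assuming the document embedding and the averaged class embeddings have already been computed as described above; all names are illustrative.

```python
import numpy as np

def refine_core_classes(doc_emb: np.ndarray, class_embs: dict):
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Rank all classes by the matching score cos(d, c), highest first.
    ranked = sorted(class_embs, key=lambda c: cos(doc_emb, class_embs[c]), reverse=True)
    scores = [cos(doc_emb, class_embs[c]) for c in ranked]

    # diff(j) = cos(d, c_j) - cos(d, c_{j+1}); cut the ranked list at the largest gap.
    diffs = [scores[j] - scores[j + 1] for j in range(len(scores) - 1)]
    m = int(np.argmax(diffs))
    core_classes = ranked[: m + 1]   # classes ranked above the largest gap
    confidence = diffs[m]            # the gap size doubles as the confidence score
    return core_classes, confidence
```

Documents would then be sorted by the returned confidence, and the top 75% kept as the pseudo-labeled set.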

4.4. Text Classifier Training with Path-Based Data Augmentation

The final module trains a hierarchical text classifier using the confident refined core classes. A critical challenge is the data scarcity for fine-grained and long-tail classes, which might not be sufficiently represented in $\mathcal{D}^{\mathrm{core}}$.

  1. Path-Based Document Generation by LLMs: To address the data scarcity for fine-grained classes, LLMs are used to generate a small number ($q=5$) of augmented documents for each distinct label path from a level-1 node to a leaf node in the taxonomy.

    • Generating documents for full paths (e.g., "hair care" → "shampoo") instead of single classes ensures that the generated text accurately reflects the meaning of lower-level classes, which is often conditioned on their parents (e.g., distinguishing "hair shampoo" from "pet shampoo").
    • A small, constant number of documents $q$ per path ensures that all classes are covered without over-representing frequent classes.
    • To promote diversity, the LLM is prompted to generate $q$ diverse documents for each path. This set of generated documents is denoted as $\mathcal{D}^{\mathrm{gen}}$. Appendix B of the paper provides examples of prompts used for this purpose.
  2. Classifier Architecture: A simple text matching network is used, similar to TaxoClass.

    • Document Encoder: Initialized with a pre-trained BERT-base model. It encodes the input document.
    • Class Representations: Initialized by class name embeddings (from Section 3.2), but these are detached from the encoder model. This means only the class embeddings are updated during training, not the backbone BERT model.
    • Log-Bilinear Matching Network: This network predicts the probability of document $d_i$ belonging to class $c_j$: $ p(c_j | d_i) = \mathcal{P}(y_j = 1 | d_i) = \sigma(\exp(\mathbf{c}_j^T \mathbf{W} \mathbf{d}_i)) $ where:
      • $\sigma$ is the sigmoid function, which squashes the output to a probability between 0 and 1.
      • $\mathbf{c}_j$ is the encoded representation of class $c_j$.
      • $\mathbf{d}_i$ is the encoded representation of document $d_i$.
      • $\mathbf{W}$ is a learnable interaction matrix that captures the relationship between document and class representations.
      • The term $\exp(\mathbf{c}_j^T \mathbf{W} \mathbf{d}_i)$ represents a similarity or matching score between the document and class.
  3. Training Process and Loss Function: The classifier is trained using two sets of data: $\mathcal{D}^{\mathrm{core}}$ (pseudo-labeled documents) and $\mathcal{D}^{\mathrm{gen}}$ (LLM-generated documents). A multi-label binary cross-entropy (BCE) loss is applied.

    • Pseudo-labels for $\mathcal{D}^{\mathrm{core}}$: For a document $d_i \in \mathcal{D}^{\mathrm{core}}$ with refined core classes $\mathbb{C}_i$:

      • Positive Classes $\mathbb{C}_{i,+}^{\mathrm{core}}$: These are the union of its core classes and all their ancestors in the taxonomy. This is because if a document belongs to a specific class, it also implicitly belongs to all its broader parent categories. $ \mathbb{C}_{i,+}^{\mathrm{core}} = \mathbb{C}_i \cup \left(\bigcup_{c \in \mathbb{C}_i} \mathrm{Anc}(c)\right) $ where $\mathrm{Anc}(c)$ denotes the set of all ancestors of class $c$.
      • Negative Classes $\mathbb{C}_{i,-}^{\mathrm{core}}$: These are all classes in the taxonomy $C$ that are not positive and are not descendants of the core classes. Descendants are excluded from negative labels because the automatically generated core classes are not perfect, and a document might actually belong to a descendant class that was not identified as a core class. $ \mathbb{C}_{i,-}^{\mathrm{core}} = C - \mathbb{C}_{i,+}^{\mathrm{core}} - \bigcup_{c \in \mathbb{C}_i} \mathrm{Des}(c) $ where $\mathrm{Des}(c)$ denotes the set of all descendants of class $c$.
    • Pseudo-labels for $\mathcal{D}^{\mathrm{gen}}$: For a generated document $d_i^p \in \mathcal{D}^{\mathrm{gen}}$ from a path $\mathbb{C}_p$:

      • Positive Classes $\mathbb{C}_{p,+}^{\mathrm{gen}}$: All classes within the corresponding generation path $\mathbb{C}_p$ are considered positive, because LLM generation, guided by the path, is assumed to be highly confident in its labels. $ \mathbb{C}_{p,+}^{\mathrm{gen}} = \mathbb{C}_p $
      • Negative Classes $\mathbb{C}_{p,-}^{\mathrm{gen}}$: All other classes in the taxonomy are considered negative. $ \mathbb{C}_{p,-}^{\mathrm{gen}} = C - \mathbb{C}_p $
    • Loss Calculation: The BCE loss is calculated for both sets of data. $ \mathcal{L}^{\mathrm{core}} = - \sum_{d_i \in \mathcal{D}^{\mathrm{core}}} \left( \sum_{c_j \in \mathbb{C}_{i,+}^{\mathrm{core}}} \log \hat{p}(c_j | d_i) + \sum_{c_j \in \mathbb{C}_{i,-}^{\mathrm{core}}} \log \left(1 - \hat{p}(c_j | d_i)\right) \right) $ $ \mathcal{L}^{\mathrm{gen}} = - \sum_{d_i^p \in \mathcal{D}^{\mathrm{gen}}} \left( \sum_{c_j \in \mathbb{C}_{p,+}^{\mathrm{gen}}} \log \hat{p}(c_j | d_i^p) + \sum_{c_j \in \mathbb{C}_{p,-}^{\mathrm{gen}}} \log \left(1 - \hat{p}(c_j | d_i^p)\right) \right) $ The total loss $\mathcal{L}$ is a weighted sum of the two loss terms, with weights determined by their relative sizes: $ \mathcal{L} = \mathcal{L}^{\mathrm{core}} + \frac{|\mathcal{D}^{\mathrm{core}}|}{|\mathcal{D}^{\mathrm{gen}}|} \cdot \mathcal{L}^{\mathrm{gen}} $ The weights ensure that both data sources contribute proportionally to the training. The training is performed without self-training, which is often used in other weakly-supervised methods (a code sketch of the matching network and loss follows this list).
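The following PyTorch sketch illustrates the pieces described above: the log-bilinear matching network, the construction of positive/negative masks from core classes (ancestors positive, descendants excluded from the negatives), and the masked multi-label BCE loss. The BERT document encoder is omitted (`doc_emb` stands in for its output), and the helper names and data layout are assumptions for illustration, not the authors' code.

```python
import torch
import torch.nn as nn

class LogBilinearMatcher(nn.Module):
    def __init__(self, num_classes: int, hidden: int = 768):
        super().__init__()
        # Class representations (initialized from class-name embeddings in the paper).
        self.class_emb = nn.Parameter(torch.randn(num_classes, hidden) * 0.02)
        self.W = nn.Parameter(torch.eye(hidden))  # learnable interaction matrix

    def forward(self, doc_emb: torch.Tensor) -> torch.Tensor:
        # p(c_j | d_i) = sigmoid(exp(c_j^T W d_i)), following the formula above.
        scores = doc_emb @ self.W @ self.class_emb.T  # (batch, num_classes)
        return torch.sigmoid(torch.exp(scores))

def core_label_masks(core: list, ancestors: dict, descendants: dict, num_classes: int):
    """Positives = core classes plus their ancestors; descendants are neither positive nor negative."""
    pos, skip = torch.zeros(num_classes), torch.zeros(num_classes)
    for c in core:
        pos[c] = 1.0
        for a in ancestors[c]:
            pos[a] = 1.0
        for d in descendants[c]:
            skip[d] = 1.0
    neg = (1.0 - pos) * (1.0 - skip)
    return pos, neg

def masked_bce(probs: torch.Tensor, pos_mask: torch.Tensor, neg_mask: torch.Tensor, eps: float = 1e-7):
    """BCE summed over positive and negative classes only; all other classes are ignored."""
    probs = probs.clamp(eps, 1.0 - eps)
    return -(pos_mask * torch.log(probs) + neg_mask * torch.log(1.0 - probs)).sum()

# Total objective: L = L_core + (|D_core| / |D_gen|) * L_gen, as in the weighting above.
```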

This comprehensive approach allows TELEClass to generate high-quality pseudo-labeled data from minimal supervision and an unlabeled corpus, thereby training an effective hierarchical text classifier. The full procedure is given as Algorithm 1 in Appendix A of the paper.

5. Experimental Setup

5.1. Datasets

The experiments utilize two publicly available datasets from different domains to evaluate the performance of TELEClass.

The following are the results from Table 1 of the original paper:

Dataset        # unlabeled train    # test    # labels
Amazon-531     29,487               19,685    531
DBPedia-298    196,665              49,167    298
  • Amazon-531 [32]:

    • Source: Consists of Amazon product reviews.
    • Scale: Contains 29,487 unlabeled training documents and 19,685 test documents.
    • Characteristics: Features a three-layer label taxonomy of 531 product types.
    • Domain: E-commerce, product reviews.
    • Example Data Sample (Hypothetical, based on context): A document might be a review like: "This shampoo made my hair feel so soft and clean. It really helped with my oily scalp without stripping moisture." The corresponding labels could be "hair care", "shampoo", "scalp treatment".
  • DBPedia-298 [28]:

    • Source: Comprises Wikipedia articles.

    • Scale: Contains 196,665 unlabeled training documents and 49,167 test documents.

    • Characteristics: Features a three-layer label taxonomy of 298 categories.

    • Domain: General knowledge, encyclopedic articles.

    • Example Data Sample (Hypothetical, based on context): A document might be an excerpt from a Wikipedia article like: "The Empire State Building is a 102-story Art Deco skyscraper in Midtown Manhattan, New York City. It was designed by Shreve, Lamb and Harmon and completed in 1931." The corresponding labels could be "building", "skyscraper", "landmark".

      Rationale for Dataset Choice: These datasets represent different domains (product reviews vs. encyclopedic articles) and vary in scale and the number of labels, providing a robust evaluation of TELEClass's generalization capabilities in hierarchical text classification settings. The three-layer taxonomy structure in both datasets aligns well with the hierarchical nature of the problem.

5.2. Evaluation Metrics

Following previous studies, the paper uses three standard evaluation metrics for multi-label hierarchical text classification. Let $\mathbb{C}_i^{\mathrm{true}}$ denote the set of true labels and $\mathbb{C}_i^{\mathrm{pred}}$ denote the set of predicted labels for document $d_i$. If a method provides ranked predictions, its top-$k$ predicted labels are denoted as $\mathbb{C}_{i,k}^{\mathrm{pred}}$. A combined code sketch of the three metrics appears at the end of Section 5.2.

5.2.1. Example-F1

  • Conceptual Definition: Example-F1, also known as micro-Dice coefficient, evaluates the overall quality of multi-label classification results without considering the ranking of predicted labels. It computes the F1-score for each document individually and then averages these scores across all documents. It is useful for assessing the general accuracy of assigning multiple labels.
  • Mathematical Formula: $ \mathrm{Example-F1} = \frac{1}{|\mathcal{D}|} \sum_{d_i \in \mathcal{D}} \frac{2 \cdot |\mathbb{C}_i^{\mathrm{true}} \cap \mathbb{C}_i^{\mathrm{pred}}|}{|\mathbb{C}_i^{\mathrm{true}}| + |\mathbb{C}_i^{\mathrm{pred}}|} $
  • Symbol Explanation:
    • $|\mathcal{D}|$: The total number of documents in the test set.
    • $d_i \in \mathcal{D}$: An individual document from the test set.
    • $\mathbb{C}_i^{\mathrm{true}}$: The set of true (ground-truth) labels for document $d_i$.
    • $\mathbb{C}_i^{\mathrm{pred}}$: The set of predicted labels for document $d_i$.
    • $|\mathbb{C}_i^{\mathrm{true}} \cap \mathbb{C}_i^{\mathrm{pred}}|$: The number of labels common to both the true and predicted sets for document $d_i$ (i.e., true positives).
    • $|\mathbb{C}_i^{\mathrm{true}}|$: The total number of true labels for document $d_i$.
    • $|\mathbb{C}_i^{\mathrm{pred}}|$: The total number of predicted labels for document $d_i$.
    • The fraction $\frac{2 \cdot \mathrm{TruePositives}}{\mathrm{TrueLabels} + \mathrm{PredictedLabels}}$ is the F1-score for a single document.

5.2.2. Precision at k (P@k)

  • Conceptual Definition: Precision at k (P@k) is a ranking-based metric that evaluates the precision of the top-kk predicted classes. It measures how many of the top-kk predicted labels are actually relevant (true positives) for a given document. It's particularly useful for applications where users only look at a few top results.
  • Mathematical Formula: $ \mathrm{P@k} = \frac{1}{|\mathcal{D}|} \sum_{d_i \in \mathcal{D}} \frac{\lvert \mathbb{C}_i^{\mathrm{true}} \cap \mathbb{C}_{i,k}^{\mathrm{pred}} \rvert}{\operatorname*{min}(k, \lvert \mathbb{C}_i^{\mathrm{true}} \rvert)} $
  • Symbol Explanation:
    • $k$: The number of top predicted labels to consider (e.g., $k=1, 3$).
    • $d_i \in \mathcal{D}$: An individual document from the test set.
    • $\mathbb{C}_i^{\mathrm{true}}$: The set of true labels for document $d_i$.
    • $\mathbb{C}_{i,k}^{\mathrm{pred}}$: The set of top-$k$ predicted labels for document $d_i$.
    • $|\mathbb{C}_i^{\mathrm{true}} \cap \mathbb{C}_{i,k}^{\mathrm{pred}}|$: The number of true labels found within the top-$k$ predicted labels.
    • $\operatorname*{min}(k, |\mathbb{C}_i^{\mathrm{true}}|)$: This ensures that the denominator does not exceed the actual number of true labels, preventing unfair penalization when $k$ is larger than the number of true labels.

5.2.3. Mean Reciprocal Rank (MRR)

  • Conceptual Definition: Mean Reciprocal Rank (MRR) is another ranking-based metric used to evaluate how high true labels appear in the ranked list of predictions. For each document, it calculates the reciprocal of the rank of each true label within the predicted list and then averages these reciprocal ranks. A higher MRR indicates that true labels are ranked higher.
  • Mathematical Formula: $ \mathrm{MRR} = \frac{1}{|\mathcal{D}|} \sum_{d_i \in \mathcal{D}} \frac{1}{|\mathbb{C}_i^{\mathrm{true}}|} \sum_{c_j \in \mathbb{C}_i^{\mathrm{true}}} \frac{1}{\operatorname*{min}\{k \mid c_j \in \mathbb{C}_{i,k}^{\mathrm{pred}}\}} $
  • Symbol Explanation:
    • $|\mathcal{D}|$: The total number of documents in the test set.
    • $d_i \in \mathcal{D}$: An individual document from the test set.
    • $\mathbb{C}_i^{\mathrm{true}}$: The set of true labels for document $d_i$.
    • $|\mathbb{C}_i^{\mathrm{true}}|$: The total number of true labels for document $d_i$.
    • $c_j \in \mathbb{C}_i^{\mathrm{true}}$: An individual true label for document $d_i$.
    • $\operatorname*{min}\{k \mid c_j \in \mathbb{C}_{i,k}^{\mathrm{pred}}\}$: This finds the rank $k$ at which the true label $c_j$ first appears in the ordered list of predicted labels. For example, if $c_j$ is the first predicted label, its rank is 1; if it is the fifth, its rank is 5.
    • $\frac{1}{\mathrm{rank}}$: The reciprocal rank for a true label.
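The following Python sketch computes the three metrics for one test collection, assuming `true_labels` is a list of label sets, `pred_labels` is a list of (unranked) predicted label sets, and `ranked_preds` is a list of ranked label lists (most confident first); all names are illustrative. Treating a true label that never appears in the ranked list as contributing a reciprocal rank of 0 is an assumption made here for completeness.

```python
def example_f1(true_labels, pred_labels):
    total = 0.0
    for truth, pred in zip(true_labels, pred_labels):
        pred = set(pred)
        tp = len(truth & pred)                      # true positives for this document
        total += 2 * tp / (len(truth) + len(pred))  # per-document F1
    return total / len(true_labels)

def precision_at_k(true_labels, ranked_preds, k):
    total = 0.0
    for truth, pred in zip(true_labels, ranked_preds):
        tp = len(truth & set(pred[:k]))
        total += tp / min(k, len(truth))
    return total / len(true_labels)

def mrr(true_labels, ranked_preds):
    total = 0.0
    for truth, pred in zip(true_labels, ranked_preds):
        recip = sum(1.0 / (pred.index(c) + 1) for c in truth if c in pred)  # 1-based ranks
        total += recip / len(truth)
    return total / len(true_labels)
```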

5.3. Baselines

The proposed TELEClass method is compared against several zero-shot, weakly-supervised, and a fully-supervised baseline.

  • Zero-Shot Approaches:

    • Hier-0Shot-TC [57]: A zero-shot method that uses a pre-trained textual entailment model to iteratively find the most similar class at each level of the taxonomy.
    • GPT-3.5-turbo (ChatGPT): A direct zero-shot approach where the GPT-3.5-turbo model is queried by providing all possible classes directly in the prompt for each document.
  • Weakly-Supervised Approaches:

    • Hier-doc2vec [27]: An approach that trains document and class representations in a shared embedding space. It then iteratively selects the most similar class at each level of the hierarchy.
    • WeSHClass [35]: A weakly-supervised method that uses a small set of keywords for each class. It first generates pseudo documents to pretrain text classifiers and then improves performance through self-training.
    • TaxoClass-NoST [45]: A variation of TaxoClass that uses only class names as supervision. It identifies core classes using a textual entailment model with a top-down search and corpus-level comparison. NoST indicates that it does not apply the self-training step on the trained classifier, aligning its training strategy with TELEClass for a fairer comparison of pseudo-label generation quality.
    • TaxoClass [45]: The full TaxoClass model, which includes the self-training step on the classifier trained with pseudo-labels. This often leads to higher performance but may introduce noise.
    • TELEClass: The proposed method, which also uses only class names, combining LLM knowledge and corpus-based features for pseudo-label generation and training.
  • Fully-Supervised Baseline:

    • Fully-Supervised: This serves as an upper bound for performance, training the exact same text matching network (used in TELEClass) on the entire labeled training data. This indicates the maximum potential performance if perfect supervision were available.

5.4. Implementation Details

The paper details specific models and parameters used for implementation:

  • Text Encoder for Similarity Measures (Sections 3.1 & 3.3):
    • Sentence Transformer all-mpnet-base-v2 is used to generate embeddings for documents and terms when calculating similarity scores (e.g., $\mathrm{sim}(c, d)$ for candidate selection and document/class representations for refinement).
  • LLM for Enrichment, Annotation, and Generation:
    • GPT-3.5-turbo-0125 is used for all LLM-based components:
      • LLM-based taxonomy enrichment (generating $T_c^{\mathrm{LLM}}$).
      • Core class annotation (selecting initial core classes $\mathbb{C}_i^0$).
      • Path-based document generation (creating $\mathcal{D}^{\mathrm{gen}}$).
  • Corpus-Based Taxonomy Enrichment (Section 3.2):
    • Term and class name embeddings are obtained using a pre-trained BERT-base-uncased model.
    • The number of selected enriched terms for each class is $k=20$.
  • Path-Based Data Augmentation (Section 3.4):
    • $q = 5$ documents are generated for each distinct path from a level-1 node to a leaf node in the taxonomy.
  • Final Classifier Architecture (Section 3.4):
    • The document encoder is initialized with BERT-base-uncased for fair comparison with baselines.
    • Class representations are initialized by class name embeddings.
  • Training Parameters:
    • Optimizer: AdamW.
    • Learning rate: $5 \times 10^{-5}$.
    • Batch size: 64.
  • Hardware: Experiments were run on one NVIDIA RTX A6000 GPU.

6. Results & Analysis

6.1. Core Results Analysis

The experimental results demonstrate the effectiveness of TELEClass across two datasets, Amazon-531 and DBPedia-298, evaluated using Example-F1, P@1, P@3, and MRR.

The following are the results from Table 2 of the original paper:

Supervision Type     Method             | Amazon-531                          | DBPedia-298
                                        | Example-F1  P@1     P@3     MRR     | Example-F1  P@1     P@3     MRR
Zero-Shot            Hier-0Shot-TC†     | 0.4742      0.7144  0.4610  –       | 0.6765      0.7871  0.6765  –
                     ChatGPT            | 0.5164      0.6807  0.4752  –       | 0.4816      0.5328  0.4547  –
Weakly-Supervised    Hier-doc2vec†      | 0.3157      0.5805  0.3115  –       | 0.1443      0.2635  0.1443  –
                     WeSHClass†         | 0.2458      0.5773  0.2517  –       | 0.3047      0.5359  0.3048  –
                     TaxoClass-NoST†    | 0.5431      0.7918  0.5414  0.5911  | 0.7712      0.8621  0.7712  0.8221
                     TaxoClass          | 0.5934      0.8120  0.5894  0.6332  | 0.8156      0.8942  0.8156  0.8762
                     TELEClass          | 0.6483      0.8505  0.6421  0.6865  | 0.8633      0.9351  0.8633  0.8864
Fully-Supervised                        | 0.8843      0.9524  0.8758  0.9085  | 0.9786      0.9945  0.9786  0.9826

Key Observations:

  1. TELEClass Outperforms Baselines Significantly: TELEClass achieves the best performance among all zero-shot and weakly-supervised baselines on both Amazon-531 and DBPedia-298 across all metrics. For instance, on Amazon-531, TELEClass achieves an Example-F1 of 0.6483, notably higher than TaxoClass (0.5934) and TaxoClass-NoST (0.5431). On DBPedia-298, its Example-F1 is 0.8633, also significantly surpassing TaxoClass (0.8156). This validates the effectiveness of combining LLM general knowledge and corpus-specific features for hierarchical text classification with minimal supervision.
  2. Superiority over Strong Weakly-Supervised Methods: When compared to TaxoClass-NoST (the strongest baseline that, like TELEClass, does not use self-training), TELEClass shows substantial gains. This implies that TELEClass's methodologies for taxonomy enrichment and pseudo-label generation are inherently more effective, leading to substantially better training data, even with a simpler classifier model.
  3. LLM Prompting Limitations in Hierarchy: Naively prompting ChatGPT (GPT-3.5-turbo) performs poorly in the hierarchical setting, especially on DBPedia-298 (Example-F1 of 0.4816) compared to Hier-0Shot-TC (0.6765) and other strong weakly-supervised methods. This highlights the necessity of incorporating corpus-based knowledge and structure-aware strategies to effectively utilize LLMs in complex hierarchical label spaces.
  4. Gap to Fully-Supervised: While TELEClass significantly advances the state-of-the-art for minimal supervision, there is still a considerable gap to the Fully-Supervised baseline (e.g., Example-F1 of 0.6483 vs. 0.8843 on Amazon-531). This indicates that while minimal supervision is effective, substantial human annotation still provides a performance ceiling that is hard to reach without direct labels.
  5. Temporal Complexity: The paper notes that TELEClass has an overall running time similar to TaxoClass (around 5-5.5 hours on Amazon-531): its taxonomy enrichment step simplifies document-class matching, which offsets the time spent on LLM prompting.

6.2. Ablation Studies

Ablation studies were conducted to understand the contribution of each component of TELEClass to its overall performance.

The following are the results from Table 3 of the original paper:

Method                      | Amazon-531                          | DBPedia-298
                            | Example-F1  P@1     P@3     MRR     | Example-F1  P@1     P@3     MRR
Gen-Only                    | 0.5151      0.7477  0.5096  0.5357  | 0.7930      0.9421  0.7930  0.8209
TELEClass-NoLLMEnrich       | 0.5520      0.7370  0.5463  0.5900  | 0.8319      0.9108  0.8319  0.8563
TELEClass-NoCorpusEnrich    | 0.6143      0.8358  0.6082  0.6522  | 0.8185      0.8916  0.8185  0.8463
TELEClass-NoGen             | 0.6449      0.8348  0.6387  0.6792  | 0.8494      0.9187  0.8494  0.8730
TELEClass                   | 0.6483      0.8505  0.6421  0.6865  | 0.8633      0.9351  0.8633  0.8864

Observations from Ablation Studies:

  1. Full TELEClass is Optimal: The complete TELEClass model consistently achieves the best performance, indicating that all its components contribute positively to the final results.
  2. Impact of Taxonomy Enrichment:
    • TELEClass-NoLLMEnrich: Excluding the LLM-based taxonomy enrichment leads to a drop in performance on both datasets (e.g., Example-F1 from 0.6483 to 0.5520 on Amazon-531). This shows the value of LLMs' general knowledge in defining clear class boundaries.
    • TELEClass-NoCorpusEnrich: Removing the corpus-based taxonomy enrichment also reduces performance (e.g., Example-F1 from 0.6483 to 0.6143 on Amazon-531). This highlights the importance of task-specific features mined from the unlabeled corpus.
  3. Differential Contribution of Enrichment: The paper notes an interesting difference in the contribution levels:
    • LLM-based enrichment brings more improvement on Amazon-531. This is likely because Amazon-531 classes (product types) are more commonly understood by LLMs, allowing reliable enrichment based on general knowledge.
    • Corpus-based enrichment contributes more on DBPedia-298. This suggests that DBPedia-298 contains more subtle or niche categories, where corpus-specific knowledge is crucial to consolidate class meanings and facilitate better distinction. This aligns with the lower performance of zero-shot LLM prompting on DBPedia, indicating LLMs might struggle more with its specific domain.
  4. Effectiveness of Path-Based Generation (TELEClass-NoGen vs. Gen-Only):
    • TELEClass-NoGen: Excluding the path-based LLM generation ($\mathcal{D}^{\mathrm{gen}}$) leads to a slight decrease in performance (e.g., Example-F1 from 0.6483 to 0.6449 on Amazon-531). This demonstrates that this component effectively addresses data scarcity for fine-grained classes and improves overall coverage.
    • Gen-Only: Training the classifier only on documents generated by path-based LLM generation (without core classes) still achieves competitive performance, even comparable to TaxoClass-NoST (e.g., Example-F1 of 0.5151 on Amazon-531 vs. TaxoClass-NoST's 0.5431). This underscores the power and effectiveness of the LLM-based data augmentation strategy.

6.3. Comparison with Zero-Shot LLM Prompting

This section provides a deeper comparison with various configurations of zero-shot LLM prompting, focusing on performance, estimated cost, and inference time. MRR is not reported here due to difficulties in obtaining ranked predictions from LLMs. Costs and times are estimates for the entire test set.

The following are the results from Table 4 of the original paper:

Method                  | Amazon-531                                            | DBPedia-298
                        | Example-F1  P@1     P@3     Est. Cost  Est. Time      | Example-F1  P@1     P@3     Est. Cost  Est. Time
GPT-3.5-turbo           | 0.5164      0.6807  0.4752  \$60       240 mins       | 0.4816      0.5328  0.4547  \$80       400 mins
GPT-3.5-turbo (level)   | 0.6621      0.8574  0.6444  \$20       800 mins       | 0.6649      0.8301  0.6488  \$60       1,000 mins
GPT-4                   | 0.6994      0.8220  0.6890  \$800      400 mins       | 0.6054      0.6520  0.5920  \$2,500    1,000 mins
TELEClass               | 0.6483      0.8505  0.6421  <\$1       3 mins         | 0.8633      0.9351  0.8633  <\$1       7 mins

Key Findings:

  1. TELEClass vs. LLMs Performance:

    • On DBPedia-298, TELEClass consistently and significantly outperforms all LLM prompting methods (e.g., Example-F1 0.8633 vs. 0.6649 for GPT-3.5-turbo (level) and 0.6054 for GPT-4).
    • On Amazon-531, TELEClass's performance is comparable to GPT-3.5-turbo (level) and GPT-4. It slightly underperforms GPT-4 (Example-F1 0.6483 vs. 0.6994) and GPT-3.5-turbo (level) (Example-F1 0.6483 vs. 0.6621) but remains very close.
    • GPT-3.5-turbo (level) (which uses a level-by-level search to find a path) consistently outperforms the naive GPT-3.5-turbo (which includes all classes in one prompt). This further underscores the importance of considering the taxonomy structure when querying LLMs for hierarchical tasks; a minimal sketch of this level-by-level strategy appears after the summary below.
  2. Cost and Time Efficiency: This is where TELEClass demonstrates a drastic advantage.

    • Inference Cost: Once trained, TELEClass incurs less than \$1 in inference cost for the entire test set, whereas direct LLM prompting can be prohibitively expensive (e.g., roughly \$800 for GPT-4 on Amazon-531, and up to \$2,500 for GPT-4 on DBPedia-298). Even GPT-3.5-turbo costs \$60-80. This makes TELEClass a much more practical solution for real-world deployments.

    • Inference Time: TELEClass also boasts substantially shorter inference times (3-7 minutes) compared to LLM prompting methods (240-1000 minutes). The level-by-level prompting, while more accurate, takes even longer due to multiple queries per document.

      In summary, TELEClass offers a compelling balance of high performance and extreme cost-effectiveness, particularly excelling where LLMs struggle with large label spaces or become too expensive for extensive inference.
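
To make the level-by-level baseline concrete, the following is a minimal sketch of taxonomy-guided, top-down prompting. The prompt template, the toy taxonomy, and the `query_llm` callable are illustrative assumptions rather than the paper's exact setup; the key point is that each document requires one LLM query per taxonomy level, which is consistent with the longer wall-clock time reported above despite the shorter prompts.

```python
# Minimal sketch of level-by-level zero-shot prompting over a label taxonomy.
# The taxonomy, prompt template, and `query_llm` callable are illustrative
# assumptions; plug in a real LLM API call to use this pattern.
from typing import Callable

def classify_level_by_level(document: str,
                            taxonomy: dict,
                            query_llm: Callable[[str], str],
                            root: str = "ROOT") -> list:
    """Walk the taxonomy top-down, asking the LLM to pick one child class per level."""
    path, node = [], root
    while taxonomy.get(node):                       # stop when the current node is a leaf
        children = taxonomy[node]
        prompt = (
            f"Document: {document}\n"
            "Which of the following categories best describes the document? "
            "Answer with exactly one category name.\n"
            f"Categories: {', '.join(children)}"
        )
        answer = query_llm(prompt).strip().lower()
        # Fall back to the first child if the answer is not a listed category.
        node = next((c for c in children if c.lower() == answer), children[0])
        path.append(node)
    return path                                     # one root-to-leaf label path

# Illustrative taxonomy (parent -> children; leaves map to empty lists) and a
# mock LLM that always picks the first option, just to show the call pattern.
toy_taxonomy = {
    "ROOT": ["electronics", "home improvement"],
    "electronics": ["headphones", "cameras"],
    "home improvement": ["bathroom aids safety", "lighting"],
    "headphones": [], "cameras": [], "bathroom aids safety": [], "lighting": [],
}
mock_llm = lambda prompt: prompt.rsplit("Categories: ", 1)[1].split(", ")[0]
print(classify_level_by_level("grab bars for the shower", toy_taxonomy, mock_llm))
```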

6.4. Case Studies

To provide qualitative insights, the paper presents case studies of two documents, illustrating the intermediate results of core class selection by different methods.

The following are the results from Table 5 of the original paper:

Dataset: DBPedia

  • Document: "The Lindenhurst Memorial Library (LML) is located in Lindenhurst, New York, and is one of the fifty six libraries that are part of the Suffolk Cooperative Library System ..."
  • Core classes by TaxoClass: village
  • Core classes by TELEClass (initial): building
  • Core classes by TELEClass (refined): library
  • True labels: library, agent, educational institution
  • Corresponding enrichment for "library": national library, central library, collection, volumes, ...

Dataset: Amazon

  • Document: "Since mom (89 yrs young) isn't steady on her feet, we have placed these grab bars around the room. It gives her the stability and security she needs."
  • Core classes by TaxoClass: personal care, health personal care, safety
  • Core classes by TELEClass (initial): daily living aids, medical supplies equipment, safety
  • Core classes by TELEClass (refined): bathroom aids safety
  • True labels: health personal care, medical supplies equipment, bathroom aids safety
  • Corresponding enrichment for "bathroom aids safety": seat, toilet, shower, safety, handles, ...

Key Observations from Case Studies:

  1. Improved Core Class Accuracy: TELEClass's refined core classes are consistently the most accurate compared to TaxoClass and TELEClass's own initial core classes.

    • DBPedia Example (Library): For a Wikipedia article about a library:
      • TaxoClass incorrectly selects "village".
      • TELEClass initial (LLM-selected) finds a more relevant but still general class: "building".
      • TELEClass refined correctly identifies the optimal core class: "library". This improvement is attributed to the guidance from enriched class-indicative features (e.g., "national library", "central library", "collection", "volumes" for "library"), which helps in pinpointing the specific fine-grained class.
    • Amazon Example (Grab Bars): For a product review about grab bars for safety:
      • TaxoClass provides more general labels like "personal care", "health personal care", "safety".
      • TELEClass initial finds slightly more specific but still broad terms: "daily living aids", "medical supplies equipment", "safety".
      • TELEClass refined accurately identifies the most precise core class: "bathroom aids safety". The enriched terms (e.g., "seat", "toilet", "shower", "safety", "handles" for "bathroom aids safety") played a crucial role in this specific distinction.
  2. LLM Enhancement and Enrichment Effectiveness: The cases show that LLM-enhanced initial core class annotation (moving from "village" to "building") and the subsequent refinement with the enriched taxonomy (moving from "building" to "library") are both critical steps. The enriched terms provide the granularity needed to distinguish similar classes; a simplified sketch of this enrichment-guided refinement follows this list.

  3. Limitations and Future Work (Bias Example): The paper acknowledges a limitation with a case from Amazon-531 where GPT correctly predicted "beauty" and "skin care" for "glycolic treatment pads," while TELEClass predicted "health care." The authors suspect that the word "treatment" in the review might lead to an error due to the bias of term-based pseudo-labeling. This highlights a known issue in keyword-based methods, where highly discriminative but potentially misleading terms can cause misclassification. This points to a potential area for future research in debiasing these methods in the hierarchical setting.
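
As a rough illustration of how enriched class-indicative terms can guide core class refinement, the sketch below scores each candidate class by the embedding similarity between the document and the class's name plus its enriched terms. The encoder choice (sentence-transformers), the mean-pooling of term embeddings, and the toy candidates are assumptions for illustration; the paper's actual refinement procedure may differ in its scoring details.

```python
# Minimal sketch: refine a document's core class by comparing the document
# embedding with the mean embedding of each candidate class's name plus its
# enriched class-indicative terms. The encoder and candidates below are
# illustrative assumptions, not the paper's exact configuration.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")     # any sentence encoder works here

def refine_core_class(document: str, candidates: dict) -> str:
    """candidates: class name -> list of enriched class-indicative terms."""
    doc_vec = model.encode(document, normalize_embeddings=True)
    best_class, best_score = None, -1.0
    for cls, terms in candidates.items():
        class_vecs = model.encode([cls] + terms, normalize_embeddings=True)
        class_vec = class_vecs.mean(axis=0)
        class_vec /= np.linalg.norm(class_vec)      # re-normalize the mean vector
        score = float(doc_vec @ class_vec)          # cosine similarity
        if score > best_score:
            best_class, best_score = cls, score
    return best_class

# Illustrative usage mirroring the DBPedia case above.
doc = ("The Lindenhurst Memorial Library is one of the libraries in the "
       "Suffolk Cooperative Library System.")
candidates = {
    "building": ["tower", "structure", "architecture"],
    "library": ["national library", "central library", "collection", "volumes"],
}
print(refine_core_class(doc, candidates))           # expected to favor "library"
```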

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper introduces TELEClass, a novel and highly effective method for minimally-supervised hierarchical text classification. Its core innovation lies in strategically combining the broad general knowledge of Large Language Models (LLMs) with task-specific insights extracted from an unlabeled text corpus. TELEClass addresses the critical challenges of high annotation costs and the inefficiency of LLMs in handling complex, large hierarchical label spaces.

The method's primary contributions include:

  1. Hybrid Taxonomy Enrichment: Automatically enriches raw label taxonomies with class-indicative terms derived from both LLM generation and corpus-based statistical/semantic analysis, leading to a more comprehensive understanding of class semantics. A simplified sketch of the corpus-based term scoring appears after this list.

  2. LLM-Enhanced Data Annotation: Utilizes LLMs for core class annotation through a structure-aware candidate search, making the annotation process more efficient and effective for hierarchical settings.

  3. Path-Based Data Augmentation: Leverages LLMs to generate high-quality pseudo documents for specific taxonomy paths, effectively combating the data scarcity problem for fine-grained and long-tail classes.

    Experiments on Amazon-531 and DBPedia-298 datasets demonstrate that TELEClass significantly outperforms existing zero-shot and weakly-supervised baselines. Crucially, it achieves comparable performance to powerful zero-shot LLM prompting (e.g., GPT-4) but with drastically lower inference costs and faster classification, making it a highly practical and scalable solution.
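
As a rough illustration of the corpus-based side of contribution 1, the sketch below scores candidate terms for each class by a simple popularity-times-distinctiveness heuristic over documents coarsely assigned to sibling classes. The paper additionally uses embedding-based semantic similarity and proper phrase mining; the whitespace tokenization and scoring here are simplifying assumptions.

```python
# Minimal sketch of corpus-based term scoring for taxonomy enrichment, assuming
# documents have already been coarsely assigned to sibling classes. The paper
# combines popularity, distinctiveness, and embedding similarity; only a simple
# popularity-times-distinctiveness score over whitespace tokens is shown here.
import math
from collections import Counter

def class_indicative_terms(class_docs: dict, top_k: int = 5) -> dict:
    """class_docs: sibling class name -> list of documents assigned to it."""
    term_counts = {cls: Counter(tok for doc in docs for tok in doc.lower().split())
                   for cls, docs in class_docs.items()}
    enriched = {}
    for cls, counts in term_counts.items():
        scores = {}
        for term, freq in counts.items():
            popularity = math.log(1 + freq)
            # Distinctiveness: share of this term's occurrences (among siblings)
            # that fall inside this class.
            total = sum(tc[term] for tc in term_counts.values())
            scores[term] = popularity * (freq / total)
        enriched[cls] = [t for t, _ in sorted(scores.items(), key=lambda x: -x[1])[:top_k]]
    return enriched

# Illustrative usage with two sibling classes.
print(class_indicative_terms({
    "library": ["the central library holds a large collection of volumes",
                "a national library with rare book collection"],
    "museum": ["the museum exhibits paintings and sculpture",
               "a gallery of modern art exhibits"],
}))
```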

7.2. Limitations & Future Work

The authors acknowledge several limitations and propose promising directions for future research:

  1. Generalization to Other Low-Resource Tasks: The core idea of combining LLMs with data-specific knowledge is powerful. Future work could explore applying TELEClass's principles to other low-resource text mining tasks with hierarchical label spaces, such as fine-grained entity typing or aspect-based sentiment analysis with hierarchical aspects.
  2. Advanced Model Architectures and Training Objectives: This paper primarily focuses on generating high-quality pseudo-labeled data while using a relatively simple text matching network and BCE loss. Future research could investigate how more advanced network structures (e.g., graph neural networks that explicitly model the hierarchy) and noise-robust training objectives (to handle potential errors in pseudo-labels) could further enhance performance.
  3. Addressing Harder Scenarios:
    • Private Domains: Extending TELEClass to scenarios where existing LLMs lack domain knowledge for initial annotation (e.g., highly specialized private company taxonomies) would be challenging but valuable. This might require domain adaptation techniques or more sophisticated prompting strategies.
    • Limited Unlabeled Corpus: Investigating how TELEClass performs and can be improved when the unlabeled corpus itself is limited is another important direction.
    • Extremely Large Label Spaces: Scaling TELEClass to even more complex label spaces, such as those with millions of classes, presents new challenges in terms of computational efficiency and managing the vast taxonomy.

7.3. Personal Insights & Critique

The TELEClass paper offers a pragmatic and innovative approach to hierarchical text classification under minimal supervision. The core idea of integrating LLM general knowledge with corpus-specific features is particularly strong, as it addresses the inherent limitations of each source when used in isolation. LLMs provide broad semantic understanding, while corpus analysis grounds this understanding in the specific domain and helps distinguish fine-grained categories that LLMs might otherwise misinterpret due to their general nature.

Transferability: The methodology of taxonomy enrichment and LLM-enhanced pseudo-label generation is highly transferable. This framework could be adapted to any domain where hierarchical categorization is needed, unlabeled text data is available, and human annotation is prohibitive, such as scientific literature categorization, legal document classification, or medical diagnosis coding, where taxonomies are complex and constantly evolving.

Potential Issues & Areas for Improvement:

  1. Bias in Term-Based Labeling: As acknowledged in the case study, term-based pseudo-labeling can introduce bias from highly frequent or misleading keywords. While distinctiveness and semantic similarity factors aim to mitigate this, it remains a challenge. Future work could explore more context-aware term weighting or adversarial training to reduce such biases.

  2. LLM Dependency: While TELEClass reduces LLM inference costs for classification, it still heavily relies on LLMs (GPT-3.5-turbo in this case) for initial enrichment and data generation. The quality and cost of these initial steps are still tied to LLM API performance and pricing. As LLMs evolve, this dependency might become more robust or cheaper, but for highly sensitive or private domains, reliance on external LLM services might be a concern.

  3. Computational Cost of Enrichment: The corpus-based taxonomy enrichment process, while effective, involves phrase mining, embedding calculations, and affinity score computations across potentially large corpora and many classes. While not as expensive as per-document LLM prompting, it can still be computationally intensive. Optimization of this step, especially for extremely large corpora, could be a future research avenue.

  4. Confidence Threshold Tuning: The selection of the top 75% of documents with the highest confidence scores for D_core is a heuristic. A more dynamic or adaptive thresholding mechanism, possibly driven by model uncertainty or active learning principles, might further improve pseudo-label quality. A minimal sketch of this filtering heuristic appears after this list.

  5. Multi-path Labeling: The paper assumes multi-label output, meaning a document can belong to multiple classes across different paths. While the training loss accounts for this, the top-down search in core class annotation and path-based generation implicitly favor single paths or limited branching. Exploring how to more explicitly encourage discovering multiple, disparate paths early in the pseudo-labeling process could be beneficial.

    Overall, TELEClass represents a significant step forward in making hierarchical text classification more accessible and scalable, bridging the gap between expensive supervised methods and often underperforming zero-shot approaches, particularly in cost-sensitive applications.
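
To make the thresholding heuristic from point 4 concrete, the sketch below keeps the top 75% most confident pseudo-labeled documents and notes one possible adaptive alternative. The data format, confidence scores, and fixed quantile are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of the confidence-based filtering heuristic discussed in point
# 4: keep the top 75% most confident pseudo-labeled documents. The data format,
# confidence scores, and fixed quantile are illustrative assumptions.
def filter_by_confidence(pseudo_labeled, keep_ratio: float = 0.75):
    """pseudo_labeled: list of (document, core_class, confidence) tuples."""
    ranked = sorted(pseudo_labeled, key=lambda x: x[2], reverse=True)
    cutoff = max(1, int(len(ranked) * keep_ratio))
    return ranked[:cutoff]

# One possible adaptive alternative: derive the cutoff from the score
# distribution (e.g., keep documents scoring above the mean) instead of a
# fixed quantile.
def filter_above_mean(pseudo_labeled):
    mean_conf = sum(x[2] for x in pseudo_labeled) / len(pseudo_labeled)
    return [x for x in pseudo_labeled if x[2] >= mean_conf]

# Illustrative usage.
data = [("doc a", "library", 0.92), ("doc b", "museum", 0.40),
        ("doc c", "library", 0.75), ("doc d", "village", 0.61)]
print(filter_by_confidence(data))   # keeps the 3 most confident examples
```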
