HierPrompt: Zero-Shot Hierarchical Text Classification with LLM-Enhanced Prototypes
TL;DR Summary
HierPrompt is proposed for zero-shot hierarchical text classification, enhancing prototype representation and informativeness. It introduces Example Text Prototypes and Category Name Prototypes, utilizing Maximum Similarity Propagation to improve prototype construction, and shows substantial improvements over existing zero-shot methods on three benchmark datasets.
Abstract
Hierarchical Text Classification is a challenging task which classifies texts into categories arranged in a hierarchy. Zero-Shot Hierarchical Text Classification (ZS-HTC) further assumes only the availability of hierarchical taxonomy, without any training data. Existing works of ZS-HTC are typically built on the prototype-based framework by embedding the category names into prototypes, which, however, do not perform very well due to the ambiguity and impreciseness of category names. In this paper, we propose HierPrompt, a method that leverages hierarchy-aware prompts to instruct LLM to produce more representative and informative prototypes. Specifically, we first introduce Example Text Prototype (ETP), in conjunction with Category Name Prototype (CNP), to enrich the information contained in hierarchical prototypes. A Maximum Similarity Propagation (MSP) technique is also proposed to consider the hierarchy in similarity calculation. Then, the hierarchical prototype refinement module is utilized to (i) contextualize the category names for more accurate CNPs and (ii) produce detailed example texts for each leaf category to form ETPs. Experiments on three benchmark datasets demonstrate that HierPrompt substantially outperforms existing ZS-HTC methods.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
The central topic of the paper is "HierPrompt: Zero-Shot Hierarchical Text Classification with LLM-Enhanced Prototypes".
1.2. Authors
The authors are:
- Qian Zhang (School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China)
- Qinliang Su (School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China; Guangdong Key Laboratory of Big Data Analysis and Processing, Guangzhou, China)
- Wei Zhu (China Mobile Internet Company Ltd.)
- Yachun Pang (China Mobile Internet Company Ltd.)
1.3. Journal/Conference
The paper was published at EMNLP Findings, which is a reputable venue in the field of Natural Language Processing (NLP). EMNLP (Empirical Methods in Natural Language Processing) is one of the premier conferences, known for publishing high-quality research in various aspects of NLP. The "Findings" track typically includes strong papers that might not fit the main conference's thematic scope or length constraints but still contribute significantly.
1.4. Publication Year
The paper was published in 2025 (specifically, January 1, 2025).
1.5. Abstract
The paper addresses Zero-Shot Hierarchical Text Classification (ZS-HTC), a challenging task where texts are categorized into a hierarchical taxonomy without any labeled training data, relying solely on the taxonomy itself. Existing ZS-HTC methods, often prototype-based using category names, struggle due to the ambiguity and imprecision of these names.
The proposed method, HierPrompt, utilizes hierarchy-aware prompts to instruct Large Language Models (LLMs) to generate more representative and informative prototypes. Specifically, it introduces Example Text Prototypes (ETP) alongside Category Name Prototypes (CNP) to enrich prototype information. It also proposes a Maximum Similarity Propagation (MSP) technique to incorporate hierarchical structure into similarity calculations. A hierarchical prototype refinement module is used to (i) contextualize category names for more accurate CNPs and (ii) produce detailed example texts for leaf categories to form ETPs. Experiments on three benchmark datasets demonstrate that HierPrompt substantially outperforms existing ZS-HTC methods.
1.6. Original Source Link
The original source link is https://aclanthology.org/2025.findings-emnlp.207.pdf. It is officially published as part of the Findings of EMNLP 2025.
2. Executive Summary
2.1. Background & Motivation
The core problem the paper aims to solve is the effective classification of texts into hierarchical categories in a Zero-Shot Hierarchical Text Classification (ZS-HTC) setting.
This problem is crucial for several reasons:
- Real-world applications: Hierarchical text classification (HTC) is fundamental for tasks like product categorization in e-commerce, document organization, and web content classification.
- Cost of labeled data: Most existing HTC methods rely on extensive labeled datasets, which are expensive and time-consuming to create.
- Evolving taxonomies: Hierarchical taxonomies frequently change, making it impractical to relabel data for every update. This highlights the need for methods that don't require retraining on new labels.
Specific challenges or gaps in prior ZS-HTC research include:
- Limited prototype quality: Existing prototype-based ZS-HTC methods typically embed category names directly as prototypes. However, category names are often general, ambiguous, and lack context, leading to unrepresentative and imprecise prototypes. For instance, the word 'work' can have multiple meanings ('job' or 'creative content') without context.
- Lack of stylistic information: Category names are simplified phrases that fail to capture the fine-grained stylistic nuances of actual texts within a category. This can lead to chaos in the embedding space, where texts from the same conceptual category but different datasets (e.g., product features vs. customer reviews for 'video games') appear semantically distant.

The paper's entry point is to leverage the advanced capabilities of Large Language Models (LLMs) and the inherent hierarchical structure of taxonomies to dynamically generate richer, more representative prototypes. Instead of merely embedding raw category names, HierPrompt creates contextualized category names and example texts as prototypes, and then incorporates the hierarchy directly into the similarity calculation process.
2.2. Main Contributions / Findings
The paper makes several significant contributions to the field of Zero-Shot Hierarchical Text Classification:
- Generalized Prototype Framework: It refines the standard prototype-based framework by introducing Example Text Prototypes (ETP) to complement Category Name Prototypes (CNP). ETPs capture stylistic and detailed characteristics of texts, addressing the limitations of CNPs alone.
- Maximum Similarity Propagation (MSP) Technique: A novel MSP technique explicitly integrates hierarchical relationships into the similarity calculation. Unlike previous propagation methods, MSP ensures that parent categories benefit from the maximum similarities of their descendants for both CNP and ETP, thus better exploiting the structural dependencies of the hierarchy.
- LLM-Enhanced Prototype Refinement: HierPrompt leverages the understanding and generation capabilities of LLMs through hierarchy-aware prompts. This module performs two key functions:
  - Category Name Contextualization: LLMs are prompted to contextualize category names at different hierarchical levels, considering their parent, sibling, and child categories. This leads to more accurate and representative CNPs.
  - Example Text Generation: LLMs are instructed to generate concrete example texts for each leaf category, forming the ETPs. A Chain Of Generation (COG) prompting strategy is introduced to guide LLMs in producing high-quality, stylistically relevant examples.
- Superior Experimental Performance: Experiments on three public benchmark datasets (NYT, DBpedia, Amazon) demonstrate that HierPrompt substantially outperforms existing state-of-the-art ZS-HTC methods across almost all hierarchical levels. Ablation studies further validate the effectiveness of each proposed component, including MSP, hierarchy-aware prompting for CNP contextualization, and ETP generation with COG-based prompting.

These findings address the core problems by providing a robust framework for generating high-quality, informative prototypes in a zero-shot setting, significantly improving the accuracy and representativeness of hierarchical text classification without requiring any labeled training data.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand this paper, a beginner should be familiar with the following concepts:
- Text Classification: The task of assigning one or more predefined categories or labels to a piece of text. For example, classifying a news article as "Sports" or "Politics."
- Hierarchical Text Classification (HTC): An extension of text classification where categories are organized in a tree-like or graph-like structure (a hierarchy or taxonomy). A text might belong to "Sports" (level 1), then "Basketball" (level 2), and further "NBA" (level 3). This means classification needs to be consistent across levels, e.g., if a text is "NBA," it must also be "Basketball" and "Sports."
- Zero-Shot Learning (ZSL) / Zero-Shot Text Classification (ZSC): A machine learning paradigm where the model is tasked with classifying data into categories it has never seen during training. The model relies on auxiliary information (like semantic descriptions of categories) to infer the correct label. In ZS-HTC, this means no labeled texts are available, only the hierarchical taxonomy itself.
- Prototype-based Classification: A common approach in ZSL. Each category is represented by a "prototype" (an embedding vector) in a shared semantic space. To classify a new input text, its embedding is compared to all category prototypes, and the text is assigned to the category whose prototype is most similar. This is often framed as a 1-nearest neighbor problem.
- Embedding: A numerical representation of text (words, phrases, sentences, or documents) in a high-dimensional vector space, where semantically similar texts are mapped to nearby points. An encoder (e.g., a pre-trained language model) is used to generate these embeddings.
- Large Language Models (LLMs): Advanced artificial intelligence models (like GPT-3, GPT-4, Qwen) trained on vast amounts of text data. They are capable of understanding natural language, generating coherent and contextually relevant text, summarizing, and even performing complex reasoning tasks through prompting.
- Prompting: The technique of providing natural language instructions or questions to an LLM to guide its behavior and elicit a desired output. Hierarchy-aware prompts specifically leverage structural information (like parent, sibling, and child categories) within the prompt itself.
- Cosine Similarity: A measure of similarity between two non-zero vectors in an inner product space. It measures the cosine of the angle between them. A cosine similarity of 1 means the vectors are identical in direction (most similar), 0 means they are orthogonal (no similarity), and -1 means they are opposite in direction (most dissimilar). In text classification, a higher cosine similarity between a text embedding and a prototype embedding indicates a stronger semantic match.
- The cosine similarity between two vectors $\mathbf{A}$ and $\mathbf{B}$ is calculated as:
$
\text{similarity} = \cos(\theta) = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\| \|\mathbf{B}\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \sqrt{\sum_{i=1}^{n} B_i^2}}
$
Where:
  - $\mathbf{A}$ and $\mathbf{B}$ are the two vectors being compared (e.g., text embedding and prototype embedding).
  - $A_i$ and $B_i$ are the $i$-th components of vectors $\mathbf{A}$ and $\mathbf{B}$ respectively.
  - $\mathbf{A} \cdot \mathbf{B}$ denotes the dot product.
  - $\|\mathbf{A}\|$ and $\|\mathbf{B}\|$ denote the Euclidean norms (magnitudes) of vectors $\mathbf{A}$ and $\mathbf{B}$ respectively.
- Taxonomy/Hierarchy: A system of classification where categories are organized into a tree-like structure.
  - Root node: The topmost category, with no parent.
  - Parent node: A category directly above another category in the hierarchy.
  - Child node: A category directly below another category in the hierarchy.
  - Ancestor: Any category above a given category in the hierarchy (including parents, grandparents, etc.).
  - Descendant: Any category below a given category in the hierarchy (including children, grandchildren, etc.).
  - Leaf category: A category that has no children.
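To make the cosine-similarity definition above concrete, here is a minimal pure-Python sketch (the vectors are toy values, not real embeddings):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Identical direction -> 1.0; orthogonal -> 0.0; opposite -> -1.0
print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))   # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))   # 0.0
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0])) # -1.0
```

Note that cosine similarity depends only on direction, not magnitude, which is why prototype matching works on unnormalized embedding vectors.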
3.2. Previous Works
The paper contextualizes its contributions by discussing existing approaches in Zero-Shot Text Classification (ZSC) and Hierarchical Text Classification (HTC), particularly focusing on the intersection: ZS-HTC.
- Zero-Shot Classification (ZSC):
  - In-context learning: Some studies leverage LLMs to perform ZSC by providing examples or instructions within the prompt itself (Edwards and Camacho-Collados, 2024; Lepagnol et al., 2024).
  - Textual entailment: Another approach treats ZSC as a textual entailment problem, where an LLM determines if a text "entails" a category label (Williams et al., 2018; Pamies et al., 2023).
  - Prototype-based ZSC: This is a common and relevant approach, where category names are embedded as prototypes, and texts are classified by finding the nearest prototype in the embedding space (Snell et al., 2017; Liu et al., 2022). The current paper builds upon this framework.
  - Self-training: Some ZSC definitions include using unlabeled data for training (Gera et al., 2022), but the paper strictly adheres to a "no training data at all" scenario.
- Hierarchical Text Classification (HTC):
  - Supervised HTC: Earlier works are heavily reliant on labeled data (Kowsari et al., 2017; Huang et al., 2019).
  - Weakly-supervised HTC: Methods that reduce the need for labeled data, often using a small number of labeled texts or keywords (Meng et al., 2019; Shen et al., 2021). Some use only the label hierarchy and unlabeled data (Zhang et al., 2025).
- Zero-Shot Hierarchical Text Classification (ZS-HTC):
  - Upward Score Propagation (USP) (Bongiovanni et al., 2023): This method is built on the prototype-based framework. It addresses the hierarchical structure by propagating similarity scores from leaf nodes upwards to parent categories. If a text is highly similar to a child category, that similarity contributes to the parent's score, which helps ensure hierarchical consistency. The core idea is that the similarity score of a parent node is a function of its own similarity and the propagated similarity from its children.
  - HiLA (Hierarchical Label Augmentation) (Paletto et al., 2024): This work extends USP by using LLMs. HiLA addresses the USP limitation of not providing hierarchical information for most leaf categories. It exploits LLMs to generate new subclasses (more fine-grained categories) for the original leaf categories, thereby enriching the hierarchy. The USP technique is then applied to this augmented hierarchy. This allows for better estimation of similarities, but still relies on category names and their augmentations.
  - TELEClass (Zhang et al., 2025): The paper mentions TELEClass as a similar model but explicitly states it is not included as a baseline because it addresses a weakly-supervised task, not a strictly zero-shot one like HierPrompt. TELEClass uses LLMs to generate keywords and pseudo texts, and incorporates the hierarchy as an input to the LLM.
3.3. Technological Evolution
The field of text classification has evolved significantly:
- Early Supervised Methods: Initially, HTC relied heavily on large, manually labeled datasets and rule-based systems or traditional machine learning algorithms.
- Deep Learning for Supervised HTC: With the rise of deep learning, neural networks improved performance but still required substantial labeled data.
- Weakly Supervised & Few-Shot HTC: Researchers sought to reduce the annotation burden by using minimal labels, keywords, or unlabeled data.
- Zero-Shot Classification (ZSC): The most constrained scenario, where no labeled data is available, pushing methods to rely on semantic embeddings of labels and texts. Early ZSC primarily used direct embedding of category names as prototypes.
- LLM Era for ZSC/ZS-HTC: The advent of powerful LLMs marked a new phase.
  - LLMs were initially used for in-context learning or textual entailment in ZSC.
  - For ZS-HTC, models like HiLA began using LLMs to augment the taxonomy by generating sub-labels, then applying existing propagation techniques like USP.
  - This paper (HierPrompt) represents a further evolution, moving beyond augmenting labels to using LLMs to generate richer prototypes themselves (contextualized category names and example texts) and to integrate the hierarchy into the similarity calculation more directly (MSP).
3.4. Differentiation Analysis
Compared to the main methods in related work, HierPrompt introduces several core differences and innovations:
- Prototype Quality Enhancement:
  - From Category Name Prototypes (CNP) only (baseline ZSC, USP) to CNP + Example Text Prototypes (ETP): HierPrompt explicitly acknowledges the limitation of CNPs in capturing stylistic and detailed information. It addresses this by introducing ETPs, which are generated example texts, thus providing a richer representation of each category.
  - LLM-driven Prototype Generation vs. Augmentation: While HiLA uses LLMs to expand the hierarchy by generating new subclasses, HierPrompt uses LLMs to refine and generate the content of the prototypes themselves. This includes contextualizing existing category names and creating synthetic example texts, directly enriching the information embedded in the prototypes.
  - Hierarchy-aware Prompting for Prototype Refinement: HierPrompt designs specific hierarchy-aware prompts that let LLMs perceive parent, sibling, and child categories while generating both contextualized category names and example texts. This deeper integration of hierarchical context leads to more accurate and representative prototypes.
- Improved Hierarchical Similarity Calculation:
  - Maximum Similarity Propagation (MSP) vs. Upward Score Propagation (USP): While both HierPrompt and USP propagate similarity through the hierarchy, MSP specifically considers the maximum similarity score of descendants for both the CNP and ETP components. This is a more aggressive and potentially more informative propagation strategy than USP's general score accumulation, especially when categories have diverse descendant characteristics. MSP also operates on the refined CNPs and ETPs, giving it a stronger foundation.

In essence, HierPrompt's innovation lies in its holistic approach to prototype quality: it recognizes the need for richer prototypes beyond raw category names, leverages LLMs and hierarchical information in a sophisticated manner to generate these high-quality prototypes, and then integrates the hierarchy into the classification step with a refined propagation mechanism.
4. Methodology
4.1. Principles
The core idea behind HierPrompt is to overcome the limitations of existing prototype-based Zero-Shot Hierarchical Text Classification (ZS-HTC) methods, which suffer from ambiguous and unrepresentative category name prototypes. It achieves this by:
- Enriching Prototypes: Moving beyond simple category names to include Example Text Prototypes (ETP) alongside Category Name Prototypes (CNP), thereby capturing both general semantic meaning and specific stylistic/detailed information.
- Leveraging Large Language Models (LLMs): Utilizing LLMs' strong understanding and generation capabilities to refine these prototypes, by contextualizing category names based on their hierarchical position and generating synthetic example texts for leaf categories.
- Integrating Hierarchy into Similarity: Developing a Maximum Similarity Propagation (MSP) technique that explicitly considers the hierarchical structure during the text-prototype similarity calculation, ensuring that classifications are consistent and benefit from deeper hierarchical relationships.

The theoretical basis is that more informative and contextually rich prototypes, combined with a hierarchy-aware similarity measure, lead to more accurate and robust zero-shot hierarchical text classification.
4.2. Core Methodology In-depth (Layer by Layer)
The HierPrompt framework aims to classify an input text into hierarchical categories. The overall process involves several steps, as illustrated in Figure 2.

Figure 2: Overview of the proposed HierPrompt. First, the hierarchy-aware prompts are designed. Then the prompts serve as input of LLM for Category Name Contextualization and Example Text Generation (Sect.3.2). Finally, the refined prototypes are used to perform classification according to the Maximum Score Propagation technique (Sect. 3.1).
The method starts by defining a hierarchical taxonomy. The notation used is as follows:
The following are the results from Table 1 of the original paper:
| Symbol | Meaning |
| --- | --- |
| $l$ | A level in the hierarchy, $l \in \{1, \dots, L\}$ |
| $c_i^l$ | The $i$-th category in level $l$ |
| $n^l$ | The number of categories in level $l$ |
| $\uparrow c_i^l$ | Parent of $c_i^l$ |
| $\Uparrow c_i^l$ | The set of ancestors of $c_i^l$ |
| $\downarrow c_i^l$ | The set of children of $c_i^l$ |
| $\Downarrow c_i^l$ | The set of descendants of $c_i^l$ |

For a non-root node ($l > 1$), its parent $\uparrow c_i^l$ refers to a single category. For a leaf category ($l = L$), its child and descendant sets are empty, i.e., $\downarrow c_i^l = \emptyset$ and $\Downarrow c_i^l = \emptyset$.
4.2.1. Framework of ZS-HTC
Traditionally, Zero-Shot Classification (ZSC) on texts is framed as a 1-nearest neighbor problem. Given an input text , an encoder (e.g., a pre-trained language model like mpnet-all) is used to convert it into an embedding vector E(x). Similarly, category names are also encoded to form Category Name Prototypes (CNP). The similarity between the text embedding and each category prototype is calculated (typically using cosine similarity). The text is then assigned to the category with the highest similarity.
For hierarchical classification, the task is to determine the predicted category for each level of the hierarchy. This is formulated as:
$
\hat{c}^l = \arg\max_{i} \; \cos\big(E(x), \mathrm{CNP}_i^l\big), \quad l = 1, \dots, L
$
Here, $\cos(E(x), \mathrm{CNP}_i^l)$ denotes the similarity between the embedding of the input text $E(x)$ and the Category Name Prototype (CNP) of category $c_i^l$.
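The baseline framework above can be sketched as a per-level nearest-prototype lookup. This is a toy illustration with hypothetical 2-D "embeddings" and category names, not the paper's actual encoder or taxonomy:

```python
import math

def cos(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def classify_per_level(text_emb, cnp_by_level):
    """Baseline ZS-HTC: at each level, pick the category whose CNP is
    most cosine-similar to the text embedding (1-nearest-neighbor)."""
    prediction = {}
    for level, prototypes in cnp_by_level.items():
        prediction[level] = max(prototypes, key=lambda name: cos(text_emb, prototypes[name]))
    return prediction

# Hypothetical 2-D prototypes for two levels of a toy taxonomy:
cnp_by_level = {
    1: {"Sports": [1.0, 0.1], "Politics": [0.0, 1.0]},
    2: {"Basketball": [0.9, 0.2], "Tennis": [0.8, -0.3], "Elections": [0.1, 0.9]},
}
print(classify_per_level([1.0, 0.0], cnp_by_level))
# {1: 'Sports', 2: 'Basketball'}
```

Note that this flat argmax treats each level independently; the hierarchy is only brought in later by the propagation step.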
4.2.1.1. Example Text Prototype (ETP)
The paper identifies that CNPs alone are insufficient because category names are often general, lack detail, and do not capture the stylistic nuances of actual texts. For example, "video game" in one dataset might describe features, while in another, it might describe user feelings, leading to chaos in the embedding space (as shown in Figure 1).

Figure 1: Category name lacks detailed and stylistic information, leading to chaos in embedding space.
To address this, HierPrompt introduces the Example Text Prototype (ETP). ETPs are derived from example texts that belong to a category and reflect its typical writing style and content. These example texts provide richer contextual information compared to simple category names.
- ETP for Leaf Categories: For leaf categories (level $L$), the ETP is obtained by encoding an example text associated with that category using the encoder $E$:
$
\mathrm{ETP}_i^L = E\big(t_i^L\big)
$
Where:
  - $\mathrm{ETP}_i^L$ is the Example Text Prototype for the $i$-th leaf category in level $L$.
  - $E$ is the text encoder.
  - $t_i^L$ is an example text for the $i$-th leaf category.
- ETP for Higher-Level Categories: For categories in higher levels ($l < L$), their ETPs are obtained by averaging the ETPs of their descendant categories. This hierarchical aggregation ensures that higher-level prototypes encapsulate the characteristics of their sub-categories:
$
\mathrm{ETP}_i^l = \frac{\sum_{j \in \downarrow c_i^l} \mathrm{ETP}_j}{|\downarrow c_i^l|}
$
Where:
  - $\mathrm{ETP}_i^l$ is the Example Text Prototype for the $i$-th category in level $l$.
  - $\downarrow c_i^l$ denotes the children of $c_i^l$. Note: the paper's notation defines $\downarrow$ as children, but since the recursion bottoms out at leaf ETPs, averaging children's ETPs level by level ultimately aggregates the leaf-derived ETPs.
  - $|\downarrow c_i^l|$ is the number of children of $c_i^l$.
- Combined Similarity Score: With both CNPs and ETPs available, a more accurate similarity score between the input text and a category is calculated by summing the cosine similarities from both prototype types:
$
\mathrm{sim}(x, c_i^l) = \cos\big(E(x), \mathrm{CNP}_i^l\big) + \cos\big(E(x), \mathrm{ETP}_i^l\big)
$
Where:
  - $\mathrm{sim}(x, c_i^l)$ is the overall similarity score.
  - $\cos(E(x), \mathrm{CNP}_i^l)$ denotes the cosine similarity between the text embedding $E(x)$ and the Category Name Prototype.
  - $\cos(E(x), \mathrm{ETP}_i^l)$ denotes the cosine similarity between the text embedding $E(x)$ and the Example Text Prototype.
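The ETP aggregation and the combined score can be sketched as follows, again with toy 2-D vectors standing in for real encoder output (the category names and values are hypothetical):

```python
import math

def cos(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def average(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def combined_score(text_emb, cnp, etp):
    """sim(x, c) = cos(E(x), CNP) + cos(E(x), ETP); each term lies in [-1, 1]."""
    return cos(text_emb, cnp) + cos(text_emb, etp)

# Leaf ETPs come from encoded example texts; a parent's ETP is the mean
# of its children's ETPs (toy values):
leaf_etps = {"Basketball": [0.9, 0.2], "Tennis": [0.7, 0.4]}
sports_etp = average([leaf_etps["Basketball"], leaf_etps["Tennis"]])
```

A text embedding close to both a category's CNP and its ETP therefore scores near the maximum of 2.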
4.2.2. Maximum Similarity Propagation (MSP)
The combined similarity score in the previous step still treats each category independently, ignoring the explicit hierarchical relationships. The paper argues that hierarchy exhibits transitivity: if a text is related to a lower-level category, it should also be related to its parent. To exploit this, HierPrompt introduces Maximum Similarity Propagation (MSP).
MSP recursively propagates the maximum similarity scores from descendant categories up the hierarchy. This means that for a non-leaf category, its similarity score is augmented by the strongest similarity found among its descendants, for both CNP and ETP components.
The MSP score for a category is defined as:
$
\mathrm{MSP}(x, c_i^l) =
\begin{cases}
\mathrm{sim}(x, c_i^l), & l = L \\
\mathrm{sim}(x, c_i^l) + \Delta_i^l, & l < L
\end{cases}
$
Where:
- $\mathrm{MSP}(x, c_i^l)$ is the Maximum Similarity Propagation score for text $x$ and category $c_i^l$.
- $\mathrm{sim}(x, c_i^l)$ is the combined similarity score (from CNP and ETP) as defined in the previous section.
- $l$ is the current level, and $L$ is the deepest (leaf) level.
- For leaf categories ($l = L$), the MSP score is just the direct combined similarity.
- For non-leaf categories ($l < L$), the MSP score is the direct combined similarity plus an additional term $\Delta_i^l$, which captures the maximum similarity from the descendants.

The term $\Delta_i^l$ is calculated by summing the maximum CNP similarity and the maximum ETP similarity found among all descendant categories of $c_i^l$:
$
\Delta_i^l = \max_{j \in \Downarrow c_i^l} \cos\big(E(x), \mathrm{CNP}_j\big) + \max_{j \in \Downarrow c_i^l} \cos\big(E(x), \mathrm{ETP}_j\big)
$
Where:
- $\Downarrow c_i^l$ represents the set of all descendant categories of $c_i^l$.
- $\max_{j \in \Downarrow c_i^l}$ takes the maximum similarity score over all descendants for a given prototype type.

This means that if a text is strongly related to a specific leaf category (e.g., "NBA"), that strong similarity propagates upwards and boosts the scores of its parent ("Basketball") and grandparent ("Sports") categories, making the hierarchical classification more robust and consistent.
4.2.3. Hierarchical Prototype Refinement With Hierarchy-Aware Prompts
The effectiveness of CNP and ETP heavily relies on the quality of the category names and example texts. Since this is a zero-shot scenario, actual example texts are unavailable, and raw category names can be ambiguous. HierPrompt addresses this by leveraging LLMs to refine prototypes using hierarchy-aware prompts.
4.2.3.1. Category Name Contextualization
The goal is to generate more accurate and representative CNPs by providing LLMs with hierarchical context for each category name.
- Problem with Coarse-Grained Categories: The paper observes that classification performance can be counter-intuitively lower for coarse-grained (Level 1) categories than for finer-grained ones (Figure 3). This is attributed to overly general category names (e.g., 'work' having multiple meanings).
  Figure 3: Directly using category name for classification in DBpedia.
- Contextualization for Level 1 Categories: To resolve the ambiguity of coarse-grained category names, LLMs are prompted to summarize their meaning based on their descendant categories. The prompt for a Level 1 category instantiates the template fill-Template_coarse, in which:
  - [dataset] is a placeholder (e.g., "DBpedia article").
  - The list of immediate child categories of the Level 1 category is supplied. The LLM processes this prompt and outputs a revised category name and an explanatory description.
- Contextualization for Other Levels (l > 1): For finer-grained categories, their parent and sibling categories provide crucial context. The prompt for a category at level $l > 1$ instantiates the template fill-Template_text, in which:
  - [dataset] is a placeholder.
  - The parent category of the target category is supplied.
  - The sibling categories (i.e., the other children of the same parent) are supplied. The LLM then provides a concise summary of the features of the category given this context.
- Generating CNPs: The LLM's output for each category (either the revised coarse-grained name and description, or the fine-grained summary) is then encoded by $E$ to form the refined Category Name Prototype $\mathrm{CNP}_i^l$.
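To illustrate how hierarchy-aware prompts are assembled, here is a sketch of filling two templates from the taxonomy. The template wording here is hypothetical; the paper's exact fill-Template_coarse and fill-Template_text strings are not reproduced:

```python
def coarse_prompt(dataset, category, children):
    """Hypothetical stand-in for fill-Template_coarse: a Level 1 category
    is contextualized via its immediate child categories."""
    return (f"A {dataset} labelled '{category}' covers subtopics such as "
            f"{', '.join(children)}. Give a more precise name and a one-sentence "
            f"description for this category.")

def fine_prompt(dataset, category, parent, siblings):
    """Hypothetical stand-in for fill-Template_text: a deeper category is
    contextualized via its parent and sibling categories."""
    return (f"In a {dataset} taxonomy, '{category}' is a child of '{parent}' "
            f"alongside {', '.join(siblings)}. Summarize the distinguishing "
            f"features of '{category}' in one sentence.")

print(coarse_prompt("DBpedia article", "Work", ["MusicalWork", "WrittenWork", "Film"]))
```

The LLM's answer to such a prompt, rather than the raw category name, is what gets encoded into the refined CNP.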
4.2.3.2. Example Text Generation
To create ETPs in a zero-shot setting, LLMs are prompted to generate synthetic example texts for each leaf category.
- Initial Prompt Design: As with CNP contextualization, hierarchical information is integrated into the prompt to guide the LLM. An initial prompt for a leaf category instantiates the template fill-Template_text, in which:
  - [dataset] refers to the domain (e.g., "DBpedia article").
  - The parent of the leaf category is supplied.
  - The sibling categories of the leaf category are supplied.
- Chain Of Generation (COG) Prompting: The authors found that LLMs may struggle to generate concrete examples directly, sometimes producing explanations instead of actual texts. This stems from a missing logical step: first exemplifying a concrete instance of the category, then generating text for that instance. To address this, a Chain Of Generation (COG) strategy is explicitly embedded into the prompt, guiding the LLM through a sequence of thought: (1) understanding the category semantics; (2) exemplifying a concrete instance of the category; (3) generating the corresponding text for that instance.
  The COG-enhanced prompt instantiates the template fill-Template_COG, which also specifies the number of examples to generate per leaf category. This prompt explicitly instructs the LLM to first brainstorm specific instances within the category's context and then generate text for those instances, leading to higher-quality example texts.
- Generating ETPs from COG: The LLM generates $K$ example texts for each leaf category. These texts are individually encoded by $E$, and their embeddings are averaged to form the ETP for that leaf category:
$
\mathrm{ETP}_i^L = \frac{1}{K} \sum_{k=1}^{K} E\big(t_{i,k}\big)
$
Where:
  - $\mathrm{ETP}_i^L$ is the Example Text Prototype for the $i$-th leaf category.
  - $E$ is the text encoder.
  - $t_{i,k}$ is the $k$-th example text generated for leaf category $c_i^L$.
  - $K$ is the number of generated example texts.

The ETPs for higher levels are then obtained by averaging the ETPs of their descendants, as described in Section 4.2.1.1 (Example Text Prototype). This complete process of CNP contextualization and ETP generation, combined with MSP, forms the HierPrompt method.
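The COG prompt and the averaging of the generated examples into a leaf ETP can be sketched as follows. The template wording and the stand-in encoder are hypothetical; the paper's fill-Template_COG and its actual encoder are not reproduced:

```python
def cog_prompt(dataset, leaf, parent, siblings, k):
    """Hypothetical stand-in for fill-Template_COG: understand the category,
    name a concrete instance, then write text about that instance."""
    return (f"Write {k} short {dataset} texts for the category '{leaf}' "
            f"(child of '{parent}', siblings: {', '.join(siblings)}). "
            f"For each text: (1) recall what the category means, "
            f"(2) name one concrete instance of it, "
            f"(3) write the text about that instance.")

def leaf_etp(encode, generated_texts):
    """Average the embeddings of the K generated example texts."""
    embs = [encode(t) for t in generated_texts]
    dim = len(embs[0])
    return [sum(e[i] for e in embs) / len(embs) for i in range(dim)]

prompt = cog_prompt("Amazon review", "video games", "electronics",
                    ["headphones", "cameras"], 3)
print(prompt)
```

In practice the generated texts would be fed through the same sentence encoder used for the input documents, so that the ETP lives in the same embedding space.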
5. Experimental Setup
5.1. Datasets
Three public datasets with hierarchical labels were selected to evaluate HierPrompt:
- NYT (Tao et al., 2018): Contains 13,081 news documents. Its taxonomy has two levels, with 5 categories at Level 1 (L1) and 26 categories at Level 2 (L2). The average document length is 648.13 words. This dataset likely represents formal news reporting style.
- DBpedia (Lehmann et al., 2015): Contains 50,000 Wikipedia articles. It has a three-level taxonomy: 9 categories at L1, 70 at L2, and 219 at L3. The average document length is 103.37 words. Texts are typically neutral, objective, and descriptive.
- Amazon (Kashnitsky, 2020): Contains 50,000 Amazon reviews. It has a three-level taxonomy: 6 categories at L1, 64 at L2, and 522 at L3. The average document length is 90.29 words. The language is typically casual and informal, reflecting personal opinions and experiences.
The following are the results from Table 2 of the original paper:
| Dataset | L1 | L2 | L3 | DocNum | AvgLen |
| NYT | 5 | 26 | - | 13081 | 648.13 |
| DBpedia | 9 | 70 | 219 | 50000 | 103.37 |
| Amazon | 6 | 64 | 522 | 50000 | 90.29 |
These datasets were chosen because they represent diverse domains (news, encyclopedic knowledge, e-commerce reviews) and varying text styles (formal, objective, informal), as well as different hierarchical depths and category counts. This diversity allows for a comprehensive evaluation of the proposed method's robustness.
To help the reader intuitively understand the data's form, Figure 1 (shown previously in the Methodology section) illustrates how the category "video games" can have different text styles in DBpedia (neutral, objective descriptions of game features) versus Amazon (informal, subjective reviews of game experience). This highlights the need for Example Text Prototypes (ETP) to capture these stylistic differences.
5.2. Evaluation Metrics
The paper reports Macro-F1 score in its experiments. The supplementary materials (Table 7 and Table 8) also include Micro-F1 score. Both are standard metrics for classification tasks, especially when dealing with multi-class or multi-label scenarios.
5.2.1. Macro-F1 Score
Conceptual Definition: Macro-F1 calculates the F1-score independently for each class and then takes the average. It treats all classes equally, regardless of their size, making it suitable for evaluating performance on imbalanced datasets where some classes might have significantly fewer instances than others. It emphasizes the model's performance on rare classes.
Mathematical Formula:
First, for each class :
$
\text{Precision}_k = \frac{\text{TP}_k}{\text{TP}_k + \text{FP}_k}
$
$
\text{Recall}_k = \frac{\text{TP}_k}{\text{TP}_k + \text{FN}_k}
$
$
\text{F1}_k = 2 \cdot \frac{\text{Precision}_k \cdot \text{Recall}_k}{\text{Precision}_k + \text{Recall}_k}
$
Then, the Macro-F1 is the average of the F1-scores for all classes:
$
\text{Macro-F1} = \frac{1}{N_c} \sum_{k=1}^{N_c} \text{F1}_k
$
Symbol Explanation:
- $\text{TP}_k$: True Positives for class $k$. The number of instances correctly identified as class $k$.
- $\text{FP}_k$: False Positives for class $k$. The number of instances incorrectly identified as class $k$ (they belong to another class).
- $\text{FN}_k$: False Negatives for class $k$. The number of instances that belong to class $k$ but were incorrectly identified as another class.
- $\text{Precision}_k$: The proportion of instances predicted as class $k$ that were actually class $k$.
- $\text{Recall}_k$: The proportion of actual class $k$ instances that were correctly predicted as class $k$.
- $\text{F1}_k$: The F1-score for class $k$, the harmonic mean of precision and recall. It balances both metrics.
- $N_c$: The total number of classes.
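The macro average can be computed directly from per-class counts; a minimal sketch, independent of any library:

```python
def macro_f1(y_true, y_pred):
    """Macro-F1: compute F1 per class from TP/FP/FN counts, then average
    with equal weight per class (classes taken from both label lists)."""
    classes = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for k in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == k and p == k)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != k and p == k)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == k and p != k)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        f1_scores.append(f1)
    return sum(f1_scores) / len(f1_scores)
```

Because every class contributes the same weight to the mean, a rare class with poor F1 pulls the score down as much as a frequent one would.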
5.2.2. Micro-F1 Score
Conceptual Definition: Micro-F1 calculates global averages of True Positives (TP), False Positives (FP), and False Negatives (FN) across all classes. It then computes precision, recall, and F1-score based on these global counts. This metric is influenced by the frequency of each class, meaning that larger classes contribute more to the overall score. It's often preferred when class imbalance is not a major concern or when accuracy across all instances is paramount.
Mathematical Formula:
First, sum TP, FP, and FN across all classes:
$
\text{TP}_{\text{total}} = \sum_{k=1}^{N_c} \text{TP}_k
$
$
\text{FP}_{\text{total}} = \sum_{k=1}^{N_c} \text{FP}_k
$
$
\text{FN}_{\text{total}} = \sum_{k=1}^{N_c} \text{FN}_k
$
Then, calculate Micro-Precision, Micro-Recall, and Micro-F1:
$
\text{Micro-Precision} = \frac{\text{TP}_{\text{total}}}{\text{TP}_{\text{total}} + \text{FP}_{\text{total}}}
$
$
\text{Micro-Recall} = \frac{\text{TP}_{\text{total}}}{\text{TP}_{\text{total}} + \text{FN}_{\text{total}}}
$
$
\text{Micro-F1} = 2 \cdot \frac{\text{Micro-Precision} \cdot \text{Micro-Recall}}{\text{Micro-Precision} + \text{Micro-Recall}}
$
Symbol Explanation:
- $\text{TP}_{\text{total}}$, $\text{FP}_{\text{total}}$, $\text{FN}_{\text{total}}$: Sums of True Positives, False Positives, and False Negatives across all classes, respectively.
- $\text{Micro-Precision}$: Global precision calculated from the total counts.
- $\text{Micro-Recall}$: Global recall calculated from the total counts.
- $\text{Micro-F1}$: The overall F1-score, the harmonic mean of Micro-Precision and Micro-Recall.
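A matching sketch for the micro average, pooling the counts before computing precision and recall:

```python
def micro_f1(y_true, y_pred):
    """Micro-F1: pool TP/FP/FN over all classes before computing P, R, F1.
    For single-label multi-class data this reduces to plain accuracy."""
    classes = set(y_true) | set(y_pred)
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for k in classes for t, p in pairs if t == k and p == k)
    fp = sum(1 for k in classes for t, p in pairs if t != k and p == k)
    fn = sum(1 for k in classes for t, p in pairs if t == k and p != k)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Contrasting this with the macro version above makes the difference concrete: here every instance carries equal weight, so large classes dominate the score.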
5.3. Baselines
The proposed HierPrompt method was compared against the following baselines:
-
base-label: This is the most basic zero-shot text classification baseline. It performs ZSC solely by embedding raw category names as prototypes and then classifying texts based on similarity to these prototypes. This highlights the performance without any advanced prototype refinement or hierarchical propagation.
-
base-text: This baseline uses actual example texts from the dataset to form prototypes. Specifically, it randomly selects one example text for each leaf category within the dataset and encodes it to create a prototype. While not a strictly zero-shot approach (as it uses real data, albeit limited), it serves as a strong reference to understand the upper bound of performance achievable if high-quality example texts were available.
- USP (Bongiovanni et al., 2023): Upward Score Propagation is a core ZS-HTC method that propagates similarity scores from finer-grained (child) categories up to coarser-grained (parent) categories according to certain rules. It enhances the base-label approach by incorporating hierarchical structure during similarity calculation.
- HiLA (Paletto et al., 2024): Hierarchical Label Augmentation builds upon USP. It leverages LLMs (specifically gpt-3.5-turbo) to expand the leaf categories by generating new subclasses. This augmented hierarchy is then used with the USP technique to perform ZS-HTC. HiLA is considered the state of the art for ZS-HTC prior to HierPrompt.

TELEClass (Zhang et al., 2025) is mentioned as a similar model that also uses LLMs and hierarchy. However, the authors explicitly state it is not included as a direct baseline because TELEClass addresses a weakly-supervised task, whereas HierPrompt focuses on the stricter zero-shot scenario (where no labeled or unlabeled training data is provided).
5.4. Implementation Details
- LLMs for Prototype Refinement:
  - Category Name Extension: Alibaba Cloud's qwen-turbo (the Qwen model with the fewest parameters) was used, which indicates that this task is considered relatively straightforward for LLMs.
  - Example Text Generation: Alibaba Cloud's qwen-plus (a model with a moderate number of parameters) was used, which suggests that text generation requires a more capable LLM than name contextualization alone.
- Text Encoder: Consistent with USP and HiLA, the mpnet-all model (Song et al., 2020) was used as the zero-shot text classification encoder. This ensures a fair comparison with the baselines.
- Prompt Customization: The [dataset] placeholder in the prompts was filled according to the dataset: NYT: 'news report'; DBpedia: 'dbpedia article'; Amazon: 'amazon review'.
- Number of Generated Examples (n): For COG-based prompting in example text generation, the number of example texts was set to 4.
- Evaluation Metric: Macro-F1 score is the primary metric reported in the main experiments.
- Hardware: All experiments were conducted on an NVIDIA GeForce RTX 2080 GPU.
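The prompt customization above amounts to simple template substitution. A hypothetical sketch (the helper name, mapping variable, and template text are illustrative, not the paper's actual prompts):

```python
# Hypothetical helper: fill the [dataset] placeholder as Section 5.4 describes.
DATASET_PHRASE = {
    "NYT": "news report",
    "DBpedia": "dbpedia article",
    "Amazon": "amazon review",
}

def fill_prompt(template, dataset):
    """Replace the [dataset] placeholder with the dataset-specific phrase."""
    return template.replace("[dataset]", DATASET_PHRASE[dataset])
```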
6. Results & Analysis
6.1. Core Results Analysis
The main experimental results, demonstrating the superiority of HierPrompt on three public benchmarks, are summarized in Table 4.
The following are the results from Table 4 of the original paper:
| method | NYT L1 | NYT L2 | DBpedia L1 | DBpedia L2 | DBpedia L3 | Amazon L1 | Amazon L2 | Amazon L3 |
| base-label | 70.60 | 66.13 | 31.60 | 33.64 | 63.60 | 56.35 | 26.97 | 14.22 |
| base-text | 89.05 | 58.22 | 62.72 | 55.33 | 63.88 | 79.57 | 43.92 | 17.84 |
| USP | N.A. | 66.13 | 64.70 | 65.60 | 62.80 | 71.20 | 34.80 | 17.30 |
| HILA | N.A. | N.A. | 76.80 | 66.00 | 62.90 | 76.20 | 39.30 | 24.90 |
| ours-label | 85.78 | 68.71 | 76.95 | 66.47 | 66.20 | 82.44 | 43.39 | 19.45 |
| ours-text | 86.48 | 63.79 | 88.17 | 65.90 | 56.09 | 82.69 | 47.58 | 22.09 |
| ours | 87.00 | 69.98 | 83.56 | 71.32 | 65.78 | 82.56 | 48.67 | 22.69 |
Note: base-text is not strictly zero-shot classification; it is listed for reference only.
Analysis of Results:
- Effect of Category Name Contextualization (ours-label vs. base-label):
  - The ours-label row shows the Macro-F1 scores after applying the Category Name Extension module (contextualizing category names with LLMs).
  - There is a substantial improvement over base-label across all datasets and levels:
    - NYT L1: ours-label (85.78%) vs. base-label (70.60%), an increase of 15.18 percentage points.
    - DBpedia L1: ours-label (76.95%) vs. base-label (31.60%), a dramatic increase of 45.35 points (the paper's abstract states 50.57%, which differs slightly from the table calculation, but it is still a huge jump).
    - Amazon L1: ours-label (82.44%) vs. base-label (56.35%), an increase of 26.09 points.
  - This confirms the paper's hypothesis that coarse-grained labels are often too general and lead to misleading prototypes. Contextualizing these names with LLMs significantly improves their representativeness and classification performance.
- Effect of Example Text Generation (ours-text vs. base-text):
  - The ours-text row presents F1 scores after generating example texts using the LLM and COG-based prompting.
  - In most cases, the LLM-generated texts (ours-text) perform comparably to, or even better than, randomly selected real examples (base-text).
  - For example, on DBpedia L1, ours-text (88.17%) significantly outperforms base-text (62.72%). On Amazon L2, ours-text (47.58%) is better than base-text (43.92%).
  - This demonstrates the effectiveness of the Example Text Generation module, showing that LLMs can generate high-quality synthetic examples that accurately capture the stylistic and detailed information of categories, even in a zero-shot scenario.
- Overall Performance of HierPrompt (ours vs. baselines):
  - The ours row represents the full HierPrompt method, combining CNP contextualization, ETP generation, and Maximum Similarity Propagation (MSP).
  - HierPrompt consistently outperforms all baselines, including the state-of-the-art HiLA, across almost all levels and datasets.
  - DBpedia: ours (L1: 83.56%, L2: 71.32%, L3: 65.78%) significantly surpasses HiLA (L1: 76.80%, L2: 66.00%, L3: 62.90%). The improvements are 6.76 (L1), 5.32 (L2), and 2.88 (L3) points. The abstract mentions 5.41%, 5.24%, and 4.99% for DBpedia, a slight discrepancy, but the trend of significant improvement is clear.
  - Amazon: ours (L1: 82.56%, L2: 48.67%, L3: 22.69%) also shows strong gains over HiLA (L1: 76.20%, L2: 39.30%, L3: 24.90%). The improvements are 6.36 (L1) and 9.37 (L2) points. For L3, ours is slightly lower than HiLA, but still competitive.
  - HierPrompt also generally outperforms USP by a wide margin, especially on L1 and L2 for DBpedia and Amazon.

In summary, the results strongly validate HierPrompt's effectiveness. The combination of LLM-enhanced prototypes (both CNP and ETP) and Maximum Similarity Propagation provides a robust solution for ZS-HTC, significantly improving upon prior methods by generating more representative and informative category representations.
6.2. Ablation Studies / Parameter Analysis
Ablation studies were conducted to understand the contribution of each key component of HierPrompt.
6.2.1. Maximum Similarity Propagation (MSP)
This study investigates the effectiveness of the proposed MSP technique compared to other propagation methods and direct classification. The experiments were conducted on DBpedia and Amazon datasets, focusing on non-leaf layers (L1, L2) where propagation is relevant.
The following are the results from Table 5 of the original paper:
| method | DBpedia L1 | DBpedia L2 | Amazon L1 | Amazon L2 |
| USP | 64.70 | 65.60 | 71.20 | 34.80 |
| USP-label | 68.60 | 69.80 | 72.94 | 36.00 |
| direct-label | 38.48 | 35.37 | 74.10 | 30.51 |
| MSP-label | 76.95 | 66.47 | 82.44 | 43.39 |
| direct-text | 58.11 | 60.66 | 75.89 | 41.59 |
| MSP-text | 79.21 | 67.94 | 77.71 | 44.54 |
Note: The rows for direct-text and MSP-text are slightly misaligned in the original paper's table; they are treated here as separate rows of results.
Analysis:
- Direct vs. Propagation: Comparing direct-label (no propagation) with USP-label or MSP-label clearly shows the benefit of any similarity propagation. direct-label performs significantly worse (e.g., DBpedia L1: 38.48%) than methods employing propagation.
- MSP vs. USP: MSP-label (using contextualized category names with MSP) consistently outperforms USP (using raw category names) and USP-label (using contextualized names with USP).
  - On DBpedia L1, MSP-label (76.95%) is much higher than USP (64.70%) and USP-label (68.60%), an improvement of over 8 points compared to USP-label.
  - On Amazon L1, MSP-label (82.44%) again outperforms USP (71.20%) and USP-label (72.94%) by significant margins.
  - The paper notes that for DBpedia L2, USP-label (69.80%) is slightly better than MSP-label (66.47%), but MSP's advantage becomes clear when propagating further up to L1 (8.35% better than USP for DBpedia L1, as stated in the text). For Amazon, MSP consistently outperforms USP on both L1 (9.5% better) and L2 (7.39% better).
- MSP with ETP: MSP-text (using generated example texts with MSP) also shows strong performance, often outperforming MSP-label and direct-text. This indicates that MSP is effective for propagating similarities derived from both CNPs and ETPs.

The results confirm that MSP greatly improves performance, especially for higher levels of the hierarchy, and demonstrates a significant advantage over the traditional USP method in most cases.
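The propagation compared above can be sketched as follows. This is an illustrative reading of MSP under one assumption: each non-leaf category's score is the maximum of its children's propagated scores (how the paper combines CNP and ETP similarities into the leaf scores is omitted here).

```python
def msp_scores(leaf_sims, children):
    """Maximum Similarity Propagation (sketch): leaves keep their own
    text-to-prototype similarity; a non-leaf category receives the max
    of its children's propagated scores."""
    scores = dict(leaf_sims)

    def resolve(cat):
        if cat not in scores:
            scores[cat] = max(resolve(c) for c in children[cat])
        return scores[cat]

    for cat in children:
        resolve(cat)
    return scores
```

Taking the max (rather than, say, the mean) means one strongly matching descendant is enough to give its ancestors a high score, which is why the gains in Table 5 concentrate at the higher levels.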
6.2.2. Category Name Contextualization with hierarchy-aware Prompting
This study evaluates the importance of integrating hierarchical information into prompts for contextualizing category names.
- Coarse-grained (L1) Extension: The following are the results from Table 7 of the original paper:

| method | NYT miF1 | NYT maF1 | DBpedia miF1 | DBpedia maF1 | Amazon miF1 | Amazon maF1 |
| base-label | 88.72 | 70.60 | 46.90 | 31.60 | 59.02 | 56.35 |
| ours-hier (∆) | 93.99 (+5.27) | 85.78 (+15.18) | 76.95 (+30.05) | 76.95 (+45.35) | 73.20 (+14.18) | 74.10 (+17.75) |

Note: The abstract reports slightly different percentage increases; the table values are used here. Also, the miF1 for ours-hier on DBpedia shows 76.95, the same as its maF1, indicating a possible typo or a specific characteristic of the dataset at that level. ours-hier (using P_coarse with hierarchical information) significantly boosts performance over base-label for L1 categories across all datasets. This validates the premise that contextualizing overly general coarse-grained names improves CNP quality.

- Fine-grained (L2) Extension: The following are the results from Table 8 of the original paper:

| method | NYT miF1 | NYT maF1 | DBpedia miF1 | DBpedia maF1 | Amazon miF1 | Amazon maF1 |
| base-label | 84.29 | 66.13 | 36.76 | 33.64 | 35.64 | 26.97 |
| ours-w/hier (∆) | 73.09 (-11.20) | 62.61 (-3.52) | 39.34 (+2.58) | 36.65 (+3.01) | 36.16 (+0.52) | 30.63 (+3.66) |
| ours-hier (∆) | 84.72 (+0.43) | 68.71 (+2.58) | 37.16 (+0.40) | 36.69 (+3.05) | 39.86 (+4.22) | 32.74 (+5.77) |

Note: The delta values (∆) are calculated as ours - base-label.
- ours-w/hier uses prompts without hierarchical information (P_fine_direct).
- ours-hier uses P_fine with hierarchical information (parent and sibling context).

Figure 4: Ablation: The Effect of utilizing hierarchical information in prompts in Category Name Extension (Row 1) and Example Text Generation (Row 2).

As depicted in the first row of Figure 4 (Category Name Extension for L2), and in Table 8:
- ours-hier generally performs better than base-label and ours-w/hier.
- On NYT L2, ours-w/hier (62.61%) actually performs worse than base-label (66.13%), indicating that an uncontextualized LLM output can be detrimental. However, ours-hier (68.71%) surpasses base-label.
- On DBpedia L2 and Amazon L2, ours-hier consistently shows improvements over both base-label and ours-w/hier. This confirms that integrating hierarchical information (parent and sibling categories) into prompts is crucial for LLMs to generate accurate CNPs for fine-grained categories.
6.2.3. Example Text Generation with hierarchy-aware Prompting
This study investigates the impact of hierarchical information in prompts for generating example texts. Five variants were tested for L1 and L2.
The following are the results from Table 9 of the original paper:
| method | NYT miF1 | NYT maF1 | DBpedia miF1 | DBpedia maF1 | Amazon miF1 | Amazon maF1 |
| base-text | 95.57 | 89.05 | 74.94 | 56.81 | 71.06 | 70.19 |
| ours-w/hier | 61.92 | 78.49 | 56.60 | 42.90 | 75.50 | 74.77 |
| ours-L1 | 94.27 | 85.04 | 59.56 | 42.97 | 76.70 | 76.40 |
| ours-L2 | 94.97 | 86.72 | 71.34 | 51.09 | 75.48 | 74.91 |
| ours-L1+L2 | 93.92 | 82.44 | 72.58 | 51.41 | 76.94 | 76.31 |
Note: The table formatting of the original paper had some values listed on the same line. I have separated them into distinct rows for clarity according to the method names.
The following are the results from Table 10 of the original paper:
| method | NYT miF1 | NYT maF1 | DBpedia miF1 | DBpedia maF1 | Amazon miF1 | Amazon maF1 |
| base-text | 65.19 | 58.22 | 25.56 | 25.62 | 25.44 | 22.53 |
| ours-w/hier | 68.32 | 61.92 | 26.22 | 19.67 | 37.40 | 27.14 |
| ours-L1 | 78.73 | 62.66 | 29.52 | 25.78 | 35.18 | 27.95 |
| ours-L2 | 70.60 | 62.60 | 33.34 | 31.88 | 39.88 | 31.22 |
| ours-L1+L2 | 75.35 | 62.62 | 33.54 | 31.21 | 40.82 | 32.81 |
Note: The original table formatting had some values listed on the same line. I have separated them into distinct rows for clarity according to the method names.
Analysis (referencing Figure 4, row 2 and Tables 9, 10):
- ours-w/hier: prompting the LLM with non-hierarchical prompts (P_text_direct).
- ours-L1: prompting with coarse-grained (L1) information.
- ours-L2: prompting with fine-grained (L2) information.
- ours-L1+L2: prompting with full hierarchical information (P_text).

The results show that:
- Generally, incorporating any hierarchical information (ours-L1, ours-L2) into the prompts for example text generation yields better results than using non-hierarchical prompts (ours-w/hier).
- Considering more complete hierarchical information (ours-L1+L2) tends to perform best among the variants that use generated texts, especially for L2, demonstrating the cumulative benefit of providing richer context to the LLM.
- For example, on Amazon L2, ours-L1+L2 (32.81%) significantly outperforms ours-w/hier (27.14%).
- This validates the design choice of hierarchy-aware prompts for ETP generation.
6.2.4. Example Text Generation with COG-based Prompting
This study isolates the effect of the Chain-Of-Generation (COG) strategy within the example text generation module.
The following are the results from Table 6 of the original paper:
| Prompt | DBpedia L1 | DBpedia L2 | DBpedia L3 | Amazon L1 | Amazon L2 | Amazon L3 |
| single example, no COG | 64.60 | 61.57 | 51.19 | 76.87 | 43.13 | 21.91 |
| n examples, no COG | 59.06 | 58.06 | 57.28 | 76.06 | 41.95 | 21.89 |
| n examples, with COG | 67.17 | 60.13 | 56.09 | 77.58 | 43.43 | 22.69 |
| single example, no COG (*) | 82.29 | 65.22 | 51.19 | 77.73 | 45.47 | 21.91 |
| n examples, no COG (*) | 79.70 | 65.69 | 57.28 | 76.70 | 46.03 | 21.89 |
| n examples, with COG (*) | 88.17 | 65.90 | 56.09 | 82.69 | 47.58 | 22.69 |

Note: The asterisk (*) indicates that MSP (Maximum Similarity Propagation) was applied for similarity calculation. The prompt symbols in the first column of the original table did not survive extraction, so the rows are labeled descriptively here: the first prompt generates 1 example, the second generates n examples without COG, and the third generates n examples with COG.
Analysis:
- Impact of COG: Comparing the COG prompt (n examples with COG) against the single-example prompt and the multi-example prompt without COG shows the benefit of the COG strategy.
- On DBpedia, the COG prompt shows advantages in L1 and L3 over the single-example prompt; its advantage over the multi-example prompt without COG is less obvious here.
- On Amazon, the COG prompt demonstrates a significant performance improvement over both alternatives across all levels, regardless of whether MSP is applied. For example, with MSP, Amazon L1: $P_{\text{text}}^{\text{COG}}$* (82.69%) vs. $P'_{\text{text}}$* (76.70%).
- This indicates that explicitly guiding the LLM through the logical steps of generation (COG) is effective in producing more concrete and useful example texts, particularly for certain datasets or hierarchical levels.
6.2.5. Sensitivity Analysis
This analysis explores how the choice of LLM affects HierPrompt's performance. Three Alibaba Cloud Qwen models (qwen-turbo, qwen-plus, qwen-max) were tested on the DBpedia dataset for both category name extension and example text generation.

Figure 5: Sensitivity Analysis: Effect of different LLM for the proposed HierPrompt
Analysis:
- Category Name Extension (CNP): F1-scores for CNP (blue bars) show minimal variation across different LLMs. This suggests that the task of understanding and extending category names, especially with hierarchical context, may be relatively straightforward even for less complex LLMs.
- Example Text Generation (ETP & All): For ETP (orange bars) and the combined All (grey bars), there are noticeable differences in performance across LLMs.
  - qwen-turbo and qwen-max generally show a decrease in performance as category generality increases (L3 to L1), meaning they perform better on finer-grained categories.
  - Conversely, qwen-plus shows the opposite trend: its performance increases with the generality of the category (L3 to L1).
- This indicates that while CNP contextualization is robust to LLM choice, ETP generation is more sensitive. The choice of LLM significantly impacts the quality and utility of the generated example texts, especially at different levels of category generality. qwen-plus appears to be a good balance, or even superior, for coarse-grained example text generation.
6.3. Computational Budget
The paper notes that the computational cost for LLM querying (Category Name Contextualization and Example Text Generation) grows linearly with the number of categories in the hierarchical taxonomy, i.e., $O(N)$, where $N$ is the number of categories.
- For the largest dataset (Amazon), the non-batch querying for Category Name Contextualization takes approximately 120 seconds.
- Example Text Generation takes around 6000 seconds (100 minutes) in non-batch querying.
- During inference (classification of a new text), the process takes less than 5 seconds. The experiments were run on an NVIDIA GeForce RTX 2080 GPU. This suggests that while prototype generation can be time-consuming for large taxonomies, the inference time is very low, making the method practical for real-time classification once prototypes are generated.
7. Conclusion & Reflections
7.1. Conclusion Summary
The paper introduces HierPrompt, a novel method for Zero-Shot Hierarchical Text Classification (ZS-HTC) that significantly enhances prototype quality and leverages hierarchical structure. HierPrompt refines the conventional prototype-based framework by integrating Example Text Prototypes (ETP) alongside Category Name Prototypes (CNP), thus capturing both semantic and stylistic information of categories. It employs a Maximum Similarity Propagation (MSP) technique to incorporate hierarchical relationships into the similarity calculation, improving classification consistency across levels.
The core innovation lies in using hierarchy-aware prompts to instruct Large Language Models (LLMs) to perform two key functions: (i) contextualizing category names from various hierarchical levels to generate more accurate CNPs, and (ii) generating detailed example texts for leaf categories to form ETPs, notably using a Chain Of Generation (COG) strategy to ensure high-quality, concrete examples. Experimental results on NYT, DBpedia, and Amazon datasets consistently demonstrate that HierPrompt substantially outperforms existing state-of-the-art ZS-HTC methods, validating the effectiveness of its LLM-enhanced prototype refinement and hierarchical propagation mechanism.
7.2. Limitations & Future Work
The authors acknowledge several limitations of HierPrompt and suggest future research directions:
- LLM Knowledge Assumption and Hallucinations: The method assumes that the LLM possesses accurate background knowledge about the taxonomy. However, this may not hold true for highly specialized or evolving taxonomies. The risk of LLM hallucinations (generating factually incorrect or nonsensical information) could degrade the quality of contextualized category names and generated example texts, thereby negatively impacting model performance.
  - Future Work: Incorporating explicit mechanisms like self-consistency or self-correctness into the LLM prompting process to mitigate hallucinations and ensure more reliable prototype generation.
- Prototype Refinement Scope: HierPrompt primarily focuses on refining prototypes at the textual level (i.e., improving the text content that is then encoded into prototypes). It does not explicitly address the adjustment or optimization of these prototypes within the embedding space itself (e.g., fine-tuning the encoder or applying post-processing to prototype vectors).
  - Future Work: Exploring methods to improve hierarchical prototypes in both the textual space (as done in this paper) and the embedding space, potentially through techniques like prototype regularization or metric learning.
7.3. Personal Insights & Critique
HierPrompt presents an elegant and effective solution to a challenging problem in NLP.
Inspirations and Strengths:
- Clever use of LLMs for Prototype Generation: The paper moves beyond merely using LLMs for label augmentation (like HiLA) or in-context learning. Instead, it ingeniously leverages LLMs to generate rich, detailed, and context-aware prototypes from scratch in a zero-shot setting. This is a significant conceptual leap, transforming LLMs into powerful prototype engineers.
- Hierarchy-Aware Prompting: The meticulous design of hierarchy-aware prompts is a key strength. By explicitly feeding LLMs information about parent, sibling, and child categories, the paper demonstrates how to guide complex models to generate highly relevant and contextualized outputs, which is crucial for hierarchical tasks.
- Dual Prototype Approach (CNP + ETP): The recognition that category names alone are insufficient, and the introduction of Example Text Prototypes (ETP), is a pragmatic and impactful decision. ETPs bridge the gap by providing stylistic and fine-grained information that static category names cannot, which is especially critical in diverse text domains like Amazon reviews.
- Maximum Similarity Propagation (MSP): The MSP technique is a well-reasoned enhancement to hierarchical classification. By propagating the maximum similarity from descendants, it ensures that strong signals from lower levels are effectively utilized by their ancestors, leading to more robust and consistent hierarchical predictions.
- Practicality for Zero-Shot: Despite the upfront cost of LLM inference for prototype generation, the final classification step is efficient, making it practical for real-world deployment once prototypes are established.
Potential Issues, Unverified Assumptions, and Areas for Improvement:
- Reliance on LLM Quality and Cost: The performance of HierPrompt is inherently tied to the capabilities and cost of the underlying LLM. The sensitivity analysis confirms that different LLMs yield varying results, especially for ETP generation. Future work needs to explore how robust this method is to different LLM architectures and how to optimize for cost-effectiveness, particularly for very large taxonomies where ETP generation can be computationally intensive.
- "Zero-Shot" Nuance: While the paper adheres to a strict definition of zero-shot (no labeled training data), the use of LLMs to generate example texts can be viewed as "synthetic data generation" or "LLM-enhanced data augmentation" rather than purely relying on semantic descriptions. The LLM itself has been trained on vast amounts of data, potentially including examples similar to the target domains. This raises questions about the true "zero-shotness" in terms of novelty of content for the LLM.
- Generalizability of Prompts: The prompts are dataset-specific (e.g., the [dataset] placeholder). While effective, ensuring that these prompts are universally applicable across vastly different domains or highly specialized taxonomies might require manual tuning. An adaptive prompting mechanism could be an interesting future direction.
- Handling Long/Complex Taxonomies: For extremely deep hierarchies or those with very broad nodes and many children/descendants, the category lists (such as sibling or descendant names) included in prompts could become very long, potentially exceeding LLM context windows or affecting their performance. The averaging approach for ETP aggregation at higher levels might also dilute specificity in very deep hierarchies.
- Embedding Space Refinement: As noted by the authors, neglecting prototype refinement in the embedding space is a limitation. Combining the textual refinement of HierPrompt with techniques like prototype network fine-tuning or metric learning on synthetic data could lead to even more powerful zero-shot classifiers.

Overall, HierPrompt offers a compelling paradigm for ZS-HTC, demonstrating the immense potential of LLMs to create rich, dynamic knowledge representations for categories. Its innovative use of prompts and hierarchical awareness makes it a strong contender in the evolving landscape of zero-shot learning.