HierPrompt: Zero-Shot Hierarchical Text Classification with LLM-Enhanced Prototypes
TL;DR Summary
HierPrompt is proposed for zero-shot hierarchical text classification, enhancing prototype representation and informativeness. It introduces Example Text Prototypes and Category Name Prototypes, utilizing Maximum Similarity Propagation to improve prototype construction, and shows substantial improvements over existing zero-shot methods on three benchmark datasets.
Abstract
Hierarchical Text Classification is a challenging task which classifies texts into categories arranged in a hierarchy. Zero-Shot Hierarchical Text Classification (ZS-HTC) further assumes only the availability of hierarchical taxonomy, without any training data. Existing works of ZS-HTC are typically built on the prototype-based framework by embedding the category names into prototypes, which, however, do not perform very well due to the ambiguity and impreciseness of category names. In this paper, we propose HierPrompt, a method that leverages hierarchy-aware prompts to instruct LLM to produce more representative and informative prototypes. Specifically, we first introduce Example Text Prototype (ETP), in conjunction with Category Name Prototype (CNP), to enrich the information contained in hierarchical prototypes. A Maximum Similarity Propagation (MSP) technique is also proposed to consider the hierarchy in similarity calculation. Then, the hierarchical prototype refinement module is utilized to (i) contextualize the category names for more accurate CNPs and (ii) produce detailed example texts for each leaf category to form ETPs. Experiments on three benchmark datasets demonstrate that HierPrompt substantially outperforms existing ZS-HTC methods.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
The central topic of the paper is "HierPrompt: Zero-Shot Hierarchical Text Classification with LLM-Enhanced Prototypes".
1.2. Authors
The authors are:
- Qian Zhang (School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China)
- Qinliang Su (School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China; Guangdong Key Laboratory of Big Data Analysis and Processing, Guangzhou, China)
- Wei Zhu (China Mobile Internet Company Ltd.)
- Yachun Pang (China Mobile Internet Company Ltd.)
1.3. Journal/Conference
The paper was published at EMNLP Findings, which is a reputable venue in the field of Natural Language Processing (NLP). EMNLP (Empirical Methods in Natural Language Processing) is one of the premier conferences, known for publishing high-quality research in various aspects of NLP. The "Findings" track typically includes strong papers that might not fit the main conference's thematic scope or length constraints but still contribute significantly.
1.4. Publication Year
The paper was published in 2025 (specifically, January 1, 2025).
1.5. Abstract
The paper addresses Zero-Shot Hierarchical Text Classification (ZS-HTC), a challenging task where texts are categorized into a hierarchical taxonomy without any labeled training data, relying solely on the taxonomy itself. Existing ZS-HTC methods, often prototype-based using category names, struggle due to the ambiguity and imprecision of these names.
The proposed method, HierPrompt, utilizes hierarchy-aware prompts to instruct Large Language Models (LLMs) to generate more representative and informative prototypes. Specifically, it introduces Example Text Prototypes (ETP) alongside Category Name Prototypes (CNP) to enrich prototype information. It also proposes a Maximum Similarity Propagation (MSP) technique to incorporate hierarchical structure into similarity calculations. A hierarchical prototype refinement module is used to (i) contextualize category names for more accurate CNPs and (ii) produce detailed example texts for leaf categories to form ETPs. Experiments on three benchmark datasets demonstrate that HierPrompt substantially outperforms existing ZS-HTC methods.
1.6. Original Source Link
The original source link is https://aclanthology.org/2025.findings-emnlp.207.pdf. It is officially published as part of the Findings of EMNLP 2025.
2. Executive Summary
2.1. Background & Motivation
The core problem the paper aims to solve is the effective classification of texts into hierarchical categories in a Zero-Shot Hierarchical Text Classification (ZS-HTC) setting.
This problem is crucial for several reasons:
- Real-world applications: Hierarchical text classification (HTC) is fundamental for tasks like product categorization in e-commerce, document organization, and web content classification.
- Cost of labeled data: Most existing HTC methods rely on extensive labeled datasets, which are expensive and time-consuming to create.
- Evolving taxonomies: Hierarchical taxonomies frequently change, making it impractical to relabel data for every update. This highlights the need for methods that don't require retraining on new labels.
Specific challenges or gaps in prior ZS-HTC research include:
- Limited prototype quality: Existing prototype-based ZS-HTC methods typically embed category names directly as prototypes. However, category names are often general, ambiguous, and lack context, leading to unrepresentative and imprecise prototypes. For instance, the word 'work' can have multiple meanings ('job' or 'creative content') without context.
- Lack of stylistic information: Category names are simplified phrases that fail to capture the fine-grained stylistic nuances of actual texts within a category. This can lead to chaos in the embedding space, where texts from the same conceptual category but different datasets (e.g., product features vs. customer reviews for 'video games') appear semantically distant.

The paper's entry point is to leverage the advanced capabilities of Large Language Models (LLMs) and the inherent hierarchical structure of taxonomies to dynamically generate richer, more representative prototypes. Instead of merely embedding raw category names, HierPrompt creates contextualized category names and example texts as prototypes, and then incorporates the hierarchy directly into the similarity calculation process.
2.2. Main Contributions / Findings
The paper makes several significant contributions to the field of Zero-Shot Hierarchical Text Classification:
- Generalized Prototype Framework: It refines the standard prototype-based framework by introducing Example Text Prototypes (ETP) to complement Category Name Prototypes (CNP). ETPs capture stylistic and detailed characteristics of texts, addressing the limitations of CNPs alone.
- Maximum Similarity Propagation (MSP) Technique: A novel MSP technique explicitly integrates hierarchical relationships into the similarity calculation. Unlike previous propagation methods, MSP ensures that parent categories benefit from the maximum similarities of their descendants for both CNP and ETP, thus better exploiting the structural dependencies of the hierarchy.
- LLM-Enhanced Prototype Refinement: HierPrompt leverages the understanding and generation capabilities of LLMs through hierarchy-aware prompts. This module performs two key functions:
  - Category Name Contextualization: LLMs are prompted to contextualize category names at different hierarchical levels, considering their parent, sibling, and child categories. This leads to more accurate and representative CNPs.
  - Example Text Generation: LLMs are instructed to generate concrete example texts for each leaf category, forming the ETPs. A Chain Of Generation (COG) prompting strategy is introduced to guide LLMs in producing high-quality, stylistically relevant examples.
- Superior Experimental Performance: Experiments on three public benchmark datasets (NYT, DBpedia, Amazon) demonstrate that HierPrompt substantially outperforms existing state-of-the-art ZS-HTC methods across almost all hierarchical levels. Ablation studies further validate the effectiveness of each proposed component, including MSP, hierarchy-aware prompting for CNP contextualization, and ETP generation with COG-based prompting.

These findings address the core problems by providing a robust framework for generating high-quality, informative prototypes in a zero-shot setting, significantly improving the accuracy and representativeness of hierarchical text classification without requiring any labeled training data.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand this paper, a beginner should be familiar with the following concepts:
- Text Classification: The task of assigning one or more predefined categories or labels to a piece of text. For example, classifying a news article as "Sports" or "Politics."
- Hierarchical Text Classification (HTC): An extension of text classification where categories are organized in a tree-like or graph-like structure (a hierarchy or taxonomy). A text might belong to "Sports" (level 1), then "Basketball" (level 2), and further "NBA" (level 3). This means classification needs to be consistent across levels, e.g., if a text is "NBA," it must also be "Basketball" and "Sports."
- Zero-Shot Learning (ZSL) / Zero-Shot Text Classification (ZSC): A machine learning paradigm where the model is tasked with classifying data into categories it has never seen during training. The model relies on auxiliary information (like semantic descriptions of categories) to infer the correct label. In ZS-HTC, this means no labeled texts are available, only the hierarchical taxonomy itself.
- Prototype-based Classification: A common approach in ZSL. Each category is represented by a "prototype" (an embedding vector) in a shared semantic space. To classify a new input text, its embedding is compared to all category prototypes, and the text is assigned to the category whose prototype is most similar. This is often framed as a 1-nearest neighbor problem.
- Embedding: A numerical representation of text (words, phrases, sentences, or documents) in a high-dimensional vector space, where semantically similar texts are mapped to nearby points. An encoder (e.g., a pre-trained language model) is used to generate these embeddings.
- Large Language Models (LLMs): Advanced artificial intelligence models (like GPT-3, GPT-4, Qwen) trained on vast amounts of text data. They are capable of understanding natural language, generating coherent and contextually relevant text, summarizing, and even performing complex reasoning tasks through prompting.
- Prompting: The technique of providing natural language instructions or questions to an LLM to guide its behavior and elicit a desired output. Hierarchy-aware prompts specifically leverage structural information (like parent, sibling, and child categories) within the prompt itself.
- Cosine Similarity: A measure of similarity between two non-zero vectors in an inner product space. It measures the cosine of the angle between them. A cosine similarity of 1 means the vectors are identical in direction (most similar), 0 means they are orthogonal (no similarity), and -1 means they are opposite in direction (most dissimilar). In text classification, a higher cosine similarity between a text embedding and a prototype embedding indicates a stronger semantic match.
- The cosine similarity between two vectors $\mathbf{A}$ and $\mathbf{B}$ is calculated as:
$
\text{similarity} = \cos(\theta) = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\| \|\mathbf{B}\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \sqrt{\sum_{i=1}^{n} B_i^2}}
$
Where:
  - $\mathbf{A}$ and $\mathbf{B}$ are the two vectors being compared (e.g., text embedding and prototype embedding).
  - $A_i$ and $B_i$ are the $i$-th components of vectors $\mathbf{A}$ and $\mathbf{B}$ respectively.
  - $\mathbf{A} \cdot \mathbf{B}$ denotes the dot product.
  - $\|\mathbf{A}\|$ and $\|\mathbf{B}\|$ denote the Euclidean norms (magnitudes) of vectors $\mathbf{A}$ and $\mathbf{B}$ respectively.
- Taxonomy/Hierarchy: A system of classification where categories are organized into a tree-like structure.
  - Root node: The topmost category, with no parent.
  - Parent node: A category directly above another category in the hierarchy.
  - Child node: A category directly below another category in the hierarchy.
  - Ancestor: Any category above a given category in the hierarchy (including parents, grandparents, etc.).
  - Descendant: Any category below a given category in the hierarchy (including children, grandchildren, etc.).
  - Leaf category: A category that has no children.
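To make the cosine-similarity definition above concrete, here is a minimal pure-Python sketch (the vectors are toy values, not real embeddings):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Identical direction -> 1.0; orthogonal -> 0.0; opposite -> -1.0
print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))   # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))   # 0.0
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0])) # -1.0
```

Note that cosine similarity depends only on direction, not magnitude, which is why prototype matching works on unnormalized embedding vectors.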
3.2. Previous Works
The paper contextualizes its contributions by discussing existing approaches in Zero-Shot Text Classification (ZSC) and Hierarchical Text Classification (HTC), particularly focusing on the intersection: ZS-HTC.
- Zero-Shot Classification (ZSC):
  - In-context learning: Some studies leverage LLMs to perform ZSC by providing examples or instructions within the prompt itself (Edwards and Camacho-Collados, 2024; Lepagnol et al., 2024).
  - Textual entailment: Another approach treats ZSC as a textual entailment problem, where an LLM determines if a text "entails" a category label (Williams et al., 2018; Pamies et al., 2023).
  - Prototype-based ZSC: This is a common and relevant approach, where category names are embedded as prototypes, and texts are classified by finding the nearest prototype in the embedding space (Snell et al., 2017; Liu et al., 2022). The current paper builds upon this framework.
  - Self-training: Some ZSC definitions include using unlabeled data for training (Gera et al., 2022), but the paper strictly adheres to a "no training data at all" scenario.
- Hierarchical Text Classification (HTC):
  - Supervised HTC: Earlier works are heavily reliant on labeled data (Kowsari et al., 2017; Huang et al., 2019).
  - Weakly-supervised HTC: Methods that reduce the need for labeled data, often using a small number of labeled texts or keywords (Meng et al., 2019; Shen et al., 2021). Some use only the label hierarchy and unlabeled data (Zhang et al., 2025).
- Zero-Shot Hierarchical Text Classification (ZS-HTC):
  - Upward Score Propagation (USP) (Bongiovanni et al., 2023): This method is built on the prototype-based framework. It addresses the hierarchical structure by propagating similarity scores from leaf nodes upwards to parent categories. If a text is highly similar to a child category, that similarity contributes to the parent's score, which helps ensure hierarchical consistency. The core idea is that the similarity score of a parent node is a function of its own similarity and the propagated similarity from its children.
  - HiLA (Hierarchical Label Augmentation) (Paletto et al., 2024): This work extends USP by using LLMs. HiLA addresses the USP limitation of not providing hierarchical information for most leaf categories. It exploits LLMs to generate new subclasses (more fine-grained categories) for the original leaf categories, thereby enriching the hierarchy. The USP technique is then applied to this augmented hierarchy. This allows for better estimation of similarities, but still relies on category names and their augmentations.
  - TELEClass (Zhang et al., 2025): The paper mentions TELEClass as a similar model but explicitly states it is not included as a baseline because it addresses a weakly-supervised task, not a strictly zero-shot one like HierPrompt. TELEClass uses LLMs to generate keywords and pseudo texts, and incorporates the hierarchy as an input to the LLM.
3.3. Technological Evolution
The field of text classification has evolved significantly:
- Early Supervised Methods: Initially, HTC relied heavily on large, manually labeled datasets and rule-based systems or traditional machine learning algorithms.
- Deep Learning for Supervised HTC: With the rise of deep learning, neural networks improved performance but still required substantial labeled data.
- Weakly Supervised & Few-Shot HTC: Researchers sought to reduce the annotation burden by using minimal labels, keywords, or unlabeled data.
- Zero-Shot Classification (ZSC): The most constrained scenario, where no labeled data is available, pushing methods to rely on semantic embeddings of labels and texts. Early ZSC primarily used direct embedding of category names as prototypes.
- LLM Era for ZSC/ZS-HTC: The advent of powerful LLMs marked a new phase.
  - LLMs were initially used for in-context learning or textual entailment in ZSC.
  - For ZS-HTC, models like HiLA began using LLMs to augment the taxonomy by generating sub-labels, then applying existing propagation techniques like USP.
  - This paper (HierPrompt) represents a further evolution, moving beyond augmenting labels to using LLMs to generate richer prototypes themselves (contextualized category names and example texts) and to integrate the hierarchy into the similarity calculation more directly (MSP).
3.4. Differentiation Analysis
Compared to the main methods in related work, HierPrompt introduces several core differences and innovations:
- Prototype Quality Enhancement:
  - From Category Name Prototypes (CNP) only (baseline ZSC, USP) to CNP + Example Text Prototypes (ETP): HierPrompt explicitly acknowledges the limitation of CNPs in capturing stylistic and detailed information. It addresses this by introducing ETPs, which are generated example texts, thus providing a richer representation of each category.
  - LLM-driven Prototype Generation vs. Augmentation: While HiLA uses LLMs to expand the hierarchy by generating new subclasses, HierPrompt uses LLMs to refine and generate the content of the prototypes themselves. This includes contextualizing existing category names and creating synthetic example texts, directly enriching the information embedded in the prototypes.
  - Hierarchy-aware Prompting for Prototype Refinement: HierPrompt designs specific hierarchy-aware prompts that let LLMs perceive parent, sibling, and child categories while generating both contextualized category names and example texts. This deeper integration of hierarchical context leads to more accurate and representative prototypes.
- Improved Hierarchical Similarity Calculation:
  - Maximum Similarity Propagation (MSP) vs. Upward Score Propagation (USP): While both HierPrompt and USP propagate similarity through the hierarchy, MSP specifically considers the maximum similarity score of descendants for both the CNP and ETP components. This is a more aggressive and potentially more informative propagation strategy than USP's general score accumulation, especially when categories have diverse descendant characteristics. MSP also operates on the refined CNPs and ETPs, giving it a stronger foundation.

In essence, HierPrompt's innovation lies in its holistic approach to prototype quality: it recognizes the need for richer prototypes beyond raw category names, leverages LLMs and hierarchical information in a sophisticated manner to generate these high-quality prototypes, and then integrates the hierarchy into the classification step with a refined propagation mechanism.
4. Methodology
4.1. Principles
The core idea behind HierPrompt is to overcome the limitations of existing prototype-based Zero-Shot Hierarchical Text Classification (ZS-HTC) methods, which suffer from ambiguous and unrepresentative category name prototypes. It achieves this by:
- Enriching Prototypes: Moving beyond simple category names to include Example Text Prototypes (ETP) alongside Category Name Prototypes (CNP), thereby capturing both general semantic meaning and specific stylistic/detailed information.
- Leveraging Large Language Models (LLMs): Utilizing LLMs' strong understanding and generation capabilities to refine these prototypes, by contextualizing category names based on their hierarchical position and generating synthetic example texts for leaf categories.
- Integrating Hierarchy into Similarity: Developing a Maximum Similarity Propagation (MSP) technique that explicitly considers the hierarchical structure during the text-prototype similarity calculation, ensuring that classifications are consistent and benefit from deeper hierarchical relationships.

The theoretical basis is that more informative and contextually rich prototypes, combined with a hierarchy-aware similarity measure, lead to more accurate and robust zero-shot hierarchical text classification.
4.2. Core Methodology In-depth (Layer by Layer)
The HierPrompt framework aims to classify an input text into hierarchical categories. The overall process involves several steps, as illustrated in Figure 2.

Figure 2: Overview of the proposed HierPrompt. First, the hierarchy-aware prompts are designed. Then the prompts serve as input of LLM for Category Name Contextualization and Example Text Generation (Sect.3.2). Finally, the refined prototypes are used to perform classification according to the Maximum Score Propagation technique (Sect. 3.1).
The method starts by defining a hierarchical taxonomy. The notation used is as follows:
The following are the results from Table 1 of the original paper:
| Symbol | Meaning |
| --- | --- |
| $l$ | A level in the hierarchy, $l \in \{1, \dots, L\}$ |
| $c_i^l$ | The $i$-th category in level $l$ |
| $n^l$ | The number of categories in level $l$ |
| $\uparrow c_i^l$ | Parent of $c_i^l$ |
| $\Uparrow c_i^l$ | The set of ancestors of $c_i^l$ |
| $\downarrow c_i^l$ | The set of children of $c_i^l$ |
| $\Downarrow c_i^l$ | The set of descendants of $c_i^l$ |

For a non-root node ($l > 1$), its parent $\uparrow c_i^l$ refers to a single category. For a leaf category ($l = L$), its child and descendant sets are empty, i.e., $\downarrow c_i^l = \emptyset$ and $\Downarrow c_i^l = \emptyset$.
4.2.1. Framework of ZS-HTC
Traditionally, Zero-Shot Classification (ZSC) on texts is framed as a 1-nearest neighbor problem. Given an input text , an encoder (e.g., a pre-trained language model like mpnet-all) is used to convert it into an embedding vector E(x). Similarly, category names are also encoded to form Category Name Prototypes (CNP). The similarity between the text embedding and each category prototype is calculated (typically using cosine similarity). The text is then assigned to the category with the highest similarity.
For hierarchical classification, the task is to determine the predicted category for each level of the hierarchy. This is formulated as:
$
\hat{c}^l = \arg\max_{i} \; \cos\big(E(x), \mathrm{CNP}_i^l\big), \quad l = 1, \dots, L
$
Here, $\cos(E(x), \mathrm{CNP}_i^l)$ denotes the similarity between the embedding of the input text $E(x)$ and the Category Name Prototype (CNP) of category $c_i^l$.
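The baseline framework above can be sketched as a per-level nearest-prototype lookup. This is a toy illustration with hypothetical 2-D "embeddings" and category names, not the paper's actual encoder or taxonomy:

```python
import math

def cos(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def classify_per_level(text_emb, cnp_by_level):
    """Baseline ZS-HTC: at each level, pick the category whose CNP is
    most cosine-similar to the text embedding (1-nearest-neighbor)."""
    prediction = {}
    for level, prototypes in cnp_by_level.items():
        prediction[level] = max(prototypes, key=lambda name: cos(text_emb, prototypes[name]))
    return prediction

# Hypothetical 2-D prototypes for two levels of a toy taxonomy:
cnp_by_level = {
    1: {"Sports": [1.0, 0.1], "Politics": [0.0, 1.0]},
    2: {"Basketball": [0.9, 0.2], "Tennis": [0.8, -0.3], "Elections": [0.1, 0.9]},
}
print(classify_per_level([1.0, 0.0], cnp_by_level))
# {1: 'Sports', 2: 'Basketball'}
```

Note that this flat argmax treats each level independently; the hierarchy is only brought in later by the propagation step.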
4.2.1.1. Example Text Prototype (ETP)
The paper identifies that CNPs alone are insufficient because category names are often general, lack detail, and do not capture the stylistic nuances of actual texts. For example, "video game" in one dataset might describe features, while in another, it might describe user feelings, leading to chaos in the embedding space (as shown in Figure 1).

Figure 1: Category name lacks detailed and stylistic information, leading to chaos in embedding space.
To address this, HierPrompt introduces the Example Text Prototype (ETP). ETPs are derived from example texts that belong to a category and reflect its typical writing style and content. These example texts provide richer contextual information compared to simple category names.
- ETP for Leaf Categories: For leaf categories (level $L$), the ETP is obtained by encoding an example text associated with that category using the encoder $E$:
$
\mathrm{ETP}_i^L = E\big(t_i^L\big)
$
Where:
  - $\mathrm{ETP}_i^L$ is the Example Text Prototype for the $i$-th leaf category in level $L$.
  - $E$ is the text encoder.
  - $t_i^L$ is an example text for the $i$-th leaf category.
- ETP for Higher-Level Categories: For categories in higher levels ($l < L$), their ETPs are obtained by averaging the ETPs of their descendant categories. This hierarchical aggregation ensures that higher-level prototypes encapsulate the characteristics of their sub-categories:
$
\mathrm{ETP}_i^l = \frac{\sum_{j \in \downarrow c_i^l} \mathrm{ETP}_j}{|\downarrow c_i^l|}
$
Where:
  - $\mathrm{ETP}_i^l$ is the Example Text Prototype for the $i$-th category in level $l$.
  - $\downarrow c_i^l$ denotes the children of $c_i^l$. Note: the paper's notation defines $\downarrow$ as children, but since the recursion bottoms out at leaf ETPs, averaging children's ETPs level by level ultimately aggregates the leaf-derived ETPs.
  - $|\downarrow c_i^l|$ is the number of children of $c_i^l$.
- Combined Similarity Score: With both CNPs and ETPs available, a more accurate similarity score between the input text and a category is calculated by summing the cosine similarities from both prototype types:
$
\mathrm{sim}(x, c_i^l) = \cos\big(E(x), \mathrm{CNP}_i^l\big) + \cos\big(E(x), \mathrm{ETP}_i^l\big)
$
Where:
  - $\mathrm{sim}(x, c_i^l)$ is the overall similarity score.
  - $\cos(E(x), \mathrm{CNP}_i^l)$ denotes the cosine similarity between the text embedding $E(x)$ and the Category Name Prototype.
  - $\cos(E(x), \mathrm{ETP}_i^l)$ denotes the cosine similarity between the text embedding $E(x)$ and the Example Text Prototype.
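The ETP aggregation and the combined score can be sketched as follows, again with toy 2-D vectors standing in for real encoder output (the category names and values are hypothetical):

```python
import math

def cos(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def average(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def combined_score(text_emb, cnp, etp):
    """sim(x, c) = cos(E(x), CNP) + cos(E(x), ETP); each term lies in [-1, 1]."""
    return cos(text_emb, cnp) + cos(text_emb, etp)

# Leaf ETPs come from encoded example texts; a parent's ETP is the mean
# of its children's ETPs (toy values):
leaf_etps = {"Basketball": [0.9, 0.2], "Tennis": [0.7, 0.4]}
sports_etp = average([leaf_etps["Basketball"], leaf_etps["Tennis"]])
```

A text embedding close to both a category's CNP and its ETP therefore scores near the maximum of 2.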
4.2.2. Maximum Similarity Propagation (MSP)
The combined similarity score in the previous step still treats each category independently, ignoring the explicit hierarchical relationships. The paper argues that hierarchy exhibits transitivity: if a text is related to a lower-level category, it should also be related to its parent. To exploit this, HierPrompt introduces Maximum Similarity Propagation (MSP).
MSP recursively propagates the maximum similarity scores from descendant categories up the hierarchy. This means that for a non-leaf category, its similarity score is augmented by the strongest similarity found among its descendants, for both CNP and ETP components.
The MSP score for a category is defined as:
$
\mathrm{MSP}(x, c_i^l) =
\begin{cases}
\mathrm{sim}(x, c_i^l), & l = L \\
\mathrm{sim}(x, c_i^l) + \Delta_i^l, & l < L
\end{cases}
$
Where:
- $\mathrm{MSP}(x, c_i^l)$ is the Maximum Similarity Propagation score for text $x$ and category $c_i^l$.
- $\mathrm{sim}(x, c_i^l)$ is the combined similarity score (from CNP and ETP) as defined in the previous section.
- $l$ is the current level, and $L$ is the deepest (leaf) level.
- For leaf categories ($l = L$), the MSP score is just the direct combined similarity.
- For non-leaf categories ($l < L$), the MSP score is the direct combined similarity plus an additional term $\Delta_i^l$, which captures the maximum similarity from the descendants.

The term $\Delta_i^l$ is calculated by summing the maximum CNP similarity and the maximum ETP similarity found among all descendant categories of $c_i^l$:
$
\Delta_i^l = \max_{j \in \Downarrow c_i^l} \cos\big(E(x), \mathrm{CNP}_j\big) + \max_{j \in \Downarrow c_i^l} \cos\big(E(x), \mathrm{ETP}_j\big)
$
Where:
- $\Downarrow c_i^l$ represents the set of all descendant categories of $c_i^l$.
- $\max_{j \in \Downarrow c_i^l}$ takes the maximum similarity score over all descendants for a given prototype type.

This means that if a text is strongly related to a specific leaf category (e.g., "NBA"), that strong similarity propagates upwards and boosts the scores of its parent ("Basketball") and grandparent ("Sports") categories, making the hierarchical classification more robust and consistent.
4.2.3. Hierarchical Prototype Refinement With Hierarchy-Aware Prompts
The effectiveness of CNP and ETP heavily relies on the quality of the category names and example texts. Since this is a zero-shot scenario, actual example texts are unavailable, and raw category names can be ambiguous. HierPrompt addresses this by leveraging LLMs to refine prototypes using hierarchy-aware prompts.
4.2.3.1. Category Name Contextualization
The goal is to generate more accurate and representative CNPs by providing LLMs with hierarchical context for each category name.
- Problem with Coarse-Grained Categories: The paper observes that classification performance can be counter-intuitively lower for coarse-grained (Level 1) categories than for finer-grained ones (Figure 3). This is attributed to overly general category names (e.g., 'work' having multiple meanings).
  Figure 3: Directly using category name for classification in DBpedia.
- Contextualization for Level 1 Categories: To resolve the ambiguity of coarse-grained category names, LLMs are prompted to summarize their meaning based on their descendant categories. The prompt for a Level 1 category instantiates the template fill-Template_coarse, in which:
  - [dataset] is a placeholder (e.g., "DBpedia article").
  - The list of immediate child categories of the Level 1 category is supplied. The LLM processes this prompt and outputs a revised category name and an explanatory description.
- Contextualization for Other Levels (l > 1): For finer-grained categories, their parent and sibling categories provide crucial context. The prompt for a category at level $l > 1$ instantiates the template fill-Template_text, in which:
  - [dataset] is a placeholder.
  - The parent category of the target category is supplied.
  - The sibling categories (i.e., the other children of the same parent) are supplied. The LLM then provides a concise summary of the features of the category given this context.
- Generating CNPs: The LLM's output for each category (either the revised coarse-grained name and description, or the fine-grained summary) is then encoded by $E$ to form the refined Category Name Prototype $\mathrm{CNP}_i^l$.
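To illustrate how hierarchy-aware prompts are assembled, here is a sketch of filling two templates from the taxonomy. The template wording here is hypothetical; the paper's exact fill-Template_coarse and fill-Template_text strings are not reproduced:

```python
def coarse_prompt(dataset, category, children):
    """Hypothetical stand-in for fill-Template_coarse: a Level 1 category
    is contextualized via its immediate child categories."""
    return (f"A {dataset} labelled '{category}' covers subtopics such as "
            f"{', '.join(children)}. Give a more precise name and a one-sentence "
            f"description for this category.")

def fine_prompt(dataset, category, parent, siblings):
    """Hypothetical stand-in for fill-Template_text: a deeper category is
    contextualized via its parent and sibling categories."""
    return (f"In a {dataset} taxonomy, '{category}' is a child of '{parent}' "
            f"alongside {', '.join(siblings)}. Summarize the distinguishing "
            f"features of '{category}' in one sentence.")

print(coarse_prompt("DBpedia article", "Work", ["MusicalWork", "WrittenWork", "Film"]))
```

The LLM's answer to such a prompt, rather than the raw category name, is what gets encoded into the refined CNP.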
4.2.3.2. Example Text Generation
To create ETPs in a zero-shot setting, LLMs are prompted to generate synthetic example texts for each leaf category.
- Initial Prompt Design: As with CNP contextualization, hierarchical information is integrated into the prompt to guide the LLM. An initial prompt for a leaf category instantiates the template fill-Template_text, in which:
  - [dataset] refers to the domain (e.g., "DBpedia article").
  - The parent of the leaf category is supplied.
  - The sibling categories of the leaf category are supplied.
- Chain Of Generation (COG) Prompting: The authors found that LLMs may struggle to generate concrete examples directly, sometimes producing explanations instead of actual texts. This stems from a missing logical step: first exemplifying a concrete instance of the category, then generating text for that instance. To address this, a Chain Of Generation (COG) strategy is explicitly embedded into the prompt, guiding the LLM through a sequence of thought: (1) understanding the category semantics; (2) exemplifying a concrete instance of the category; (3) generating the corresponding text for that instance.
  The COG-enhanced prompt instantiates the template fill-Template_COG, which also specifies the number of examples to generate per leaf category. This prompt explicitly instructs the LLM to first brainstorm specific instances within the category's context and then generate text for those instances, leading to higher-quality example texts.
- Generating ETPs from COG: The LLM generates $K$ example texts for each leaf category. These texts are individually encoded by $E$, and their embeddings are averaged to form the ETP for that leaf category:
$
\mathrm{ETP}_i^L = \frac{1}{K} \sum_{k=1}^{K} E\big(t_{i,k}\big)
$
Where:
  - $\mathrm{ETP}_i^L$ is the Example Text Prototype for the $i$-th leaf category.
  - $E$ is the text encoder.
  - $t_{i,k}$ is the $k$-th example text generated for leaf category $c_i^L$.
  - $K$ is the number of generated example texts.

The ETPs for higher levels are then obtained by averaging the ETPs of their descendants, as described in Section 4.2.1.1 (Example Text Prototype). This complete process of CNP contextualization and ETP generation, combined with MSP, forms the HierPrompt method.
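The COG prompt and the averaging of the generated examples into a leaf ETP can be sketched as follows. The template wording and the stand-in encoder are hypothetical; the paper's fill-Template_COG and its actual encoder are not reproduced:

```python
def cog_prompt(dataset, leaf, parent, siblings, k):
    """Hypothetical stand-in for fill-Template_COG: understand the category,
    name a concrete instance, then write text about that instance."""
    return (f"Write {k} short {dataset} texts for the category '{leaf}' "
            f"(child of '{parent}', siblings: {', '.join(siblings)}). "
            f"For each text: (1) recall what the category means, "
            f"(2) name one concrete instance of it, "
            f"(3) write the text about that instance.")

def leaf_etp(encode, generated_texts):
    """Average the embeddings of the K generated example texts."""
    embs = [encode(t) for t in generated_texts]
    dim = len(embs[0])
    return [sum(e[i] for e in embs) / len(embs) for i in range(dim)]

prompt = cog_prompt("Amazon review", "video games", "electronics",
                    ["headphones", "cameras"], 3)
print(prompt)
```

In practice the generated texts would be fed through the same sentence encoder used for the input documents, so that the ETP lives in the same embedding space.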
5. Experimental Setup
5.1. Datasets
Three public datasets with hierarchical labels were selected to evaluate HierPrompt:
- NYT (Tao et al., 2018): Contains 13,081 news documents. Its taxonomy has two levels, with 5 categories at Level 1 (L1) and 26 categories at Level 2 (L2). The average document length is 648.13 words. This dataset likely represents formal news reporting style.
- DBpedia (Lehmann et al., 2015): Contains 50,000 Wikipedia articles. It has a three-level taxonomy: 9 categories at L1, 70 at L2, and 219 at L3. The average document length is 103.37 words. Texts are typically neutral, objective, and descriptive.
- Amazon (Kashnitsky, 2020): Contains 50,000 Amazon reviews. It has a three-level taxonomy: 6 categories at L1, 64 at L2, and 522 at L3. The average document length is 90.29 words. The language is typically casual and informal, reflecting personal opinions and experiences.
The following are the results from Table 2 of the original paper:
| Dataset | L1 | L2 | L3 | DocNum | AvgLen |
| NYT | 5 | 26 | - | 13081 | 648.13 |
| DBpedia | 9 | 70 | 219 | 50000 | 103.37 |
| Amazon | 6 | 64 | 522 | 50000 | 90.29 |
These datasets were chosen because they represent diverse domains (news, encyclopedic knowledge, e-commerce reviews) and varying text styles (formal, objective, informal), as well as different hierarchical depths and category counts. This diversity allows for a comprehensive evaluation of the proposed method's robustness.
To help the reader intuitively understand the data's form, Figure 1 (shown previously in the Methodology section) illustrates how the category "video games" can have different text styles in DBpedia (neutral, objective descriptions of game features) versus Amazon (informal, subjective reviews of game experience). This highlights the need for Example Text Prototypes (ETP) to capture these stylistic differences.
5.2. Evaluation Metrics
The paper reports Macro-F1 score in its experiments. The supplementary materials (Table 7 and Table 8) also include Micro-F1 score. Both are standard metrics for classification tasks, especially when dealing with multi-class or multi-label scenarios.
5.2.1. Macro-F1 Score
Conceptual Definition: Macro-F1 calculates the F1-score independently for each class and then takes the average. It treats all classes equally, regardless of their size, making it suitable for evaluating performance on imbalanced datasets where some classes might have significantly fewer instances than others. It emphasizes the model's performance on rare classes.
Mathematical Formula:
First, for each class :
$
\text{Precision}_k = \frac{\text{TP}_k}{\text{TP}_k + \text{FP}_k}
$
$
\text{Recall}_k = \frac{\text{TP}_k}{\text{TP}_k + \text{FN}_k}
$
$
\text{F1}_k = 2 \cdot \frac{\text{Precision}_k \cdot \text{Recall}_k}{\text{Precision}_k + \text{Recall}_k}
$
Then, the Macro-F1 is the average of the F1-scores for all classes:
$
\text{Macro-F1} = \frac{1}{N_c} \sum_{k=1}^{N_c} \text{F1}_k
$
Symbol Explanation:
- $\text{TP}_k$: True Positives for class $k$. The number of instances correctly identified as class $k$.
- $\text{FP}_k$: False Positives for class $k$. The number of instances incorrectly identified as class $k$ (they belong to another class).
- $\text{FN}_k$: False Negatives for class $k$. The number of instances that belong to class $k$ but were incorrectly identified as another class.
- $\text{Precision}_k$: The proportion of instances predicted as class $k$ that were actually class $k$.
- $\text{Recall}_k$: The proportion of actual class $k$ instances that were correctly predicted as class $k$.
- $\text{F1}_k$: The F1-score for class $k$, the harmonic mean of precision and recall. It balances both metrics.
- $N_c$: The total number of classes.
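The macro average can be computed directly from per-class counts; a minimal sketch, independent of any library:

```python
def macro_f1(y_true, y_pred):
    """Macro-F1: compute F1 per class from TP/FP/FN counts, then average
    with equal weight per class (classes taken from both label lists)."""
    classes = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for k in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == k and p == k)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != k and p == k)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == k and p != k)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        f1_scores.append(f1)
    return sum(f1_scores) / len(f1_scores)
```

Because every class contributes the same weight to the mean, a rare class with poor F1 pulls the score down as much as a frequent one would.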
5.2.2. Micro-F1 Score
Conceptual Definition: Micro-F1 calculates global averages of True Positives (TP), False Positives (FP), and False Negatives (FN) across all classes. It then computes precision, recall, and F1-score based on these global counts. This metric is influenced by the frequency of each class, meaning that larger classes contribute more to the overall score. It's often preferred when class imbalance is not a major concern or when accuracy across all instances is paramount.
Mathematical Formula:
First, sum TP, FP, and FN across all classes:
$
\text{TP}_{\text{total}} = \sum_{k=1}^{N_c} \text{TP}_k
$
$
\text{FP}_{\text{total}} = \sum_{k=1}^{N_c} \text{FP}_k
$
$
\text{FN}_{\text{total}} = \sum_{k=1}^{N_c} \text{FN}_k
$
Then, calculate Micro-Precision, Micro-Recall, and Micro-F1:
$
\text{Micro-Precision} = \frac{\text{TP}_{\text{total}}}{\text{TP}_{\text{total}} + \text{FP}_{\text{total}}}
$
$
\text{Micro-Recall} = \frac{\text{TP}_{\text{total}}}{\text{TP}_{\text{total}} + \text{FN}_{\text{total}}}
$
$
\text{Micro-F1} = 2 \cdot \frac{\text{Micro-Precision} \cdot \text{Micro-Recall}}{\text{Micro-Precision} + \text{Micro-Recall}}
$
Symbol Explanation:
- $\text{TP}_{\text{total}}$, $\text{FP}_{\text{total}}$, $\text{FN}_{\text{total}}$: Sums of True Positives, False Positives, and False Negatives across all classes, respectively.
- $\text{Micro-Precision}$: Global precision calculated from the total counts.
- $\text{Micro-Recall}$: Global recall calculated from the total counts.
- $\text{Micro-F1}$: The overall F1-score, the harmonic mean of Micro-Precision and Micro-Recall.
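A matching sketch for the micro average, pooling the counts before computing precision and recall:

```python
def micro_f1(y_true, y_pred):
    """Micro-F1: pool TP/FP/FN over all classes before computing P, R, F1.
    For single-label multi-class data this reduces to plain accuracy."""
    classes = set(y_true) | set(y_pred)
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for k in classes for t, p in pairs if t == k and p == k)
    fp = sum(1 for k in classes for t, p in pairs if t != k and p == k)
    fn = sum(1 for k in classes for t, p in pairs if t == k and p != k)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Contrasting this with the macro version above makes the difference concrete: here every instance carries equal weight, so large classes dominate the score.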
5.3. Baselines
The proposed HierPrompt method was compared against the following baselines:
-
base-label: This is the most basic zero-shot text classification baseline. It performs ZSC solely by embedding raw category names as prototypes and then classifying texts based on similarity to these prototypes. This highlights the performance without any advanced prototype refinement or hierarchical propagation.
-
base-text: This baseline uses actual example texts from the dataset to form prototypes. Specifically, it randomly selects one example text for each leaf category within the dataset and encodes it to create a prototype. While not a strictly zero-shot approach (as it uses real data, albeit limited), it serves as a strong reference to understand the upper bound of performance achievable if high-quality example texts were available.
- USP (Bongiovanni et al., 2023): Upward Score Propagation is a core ZS-HTC method that propagates similarity scores from finer-grained (child) categories up to coarser-grained (parent) categories according to certain rules. It enhances the base-label approach by incorporating hierarchical structure during similarity calculation.
- HiLA (Paletto et al., 2024): Hierarchical Label Augmentation builds upon USP. It leverages LLMs (specifically gpt-3.5-turbo) to expand the leaf categories by generating new subclasses. This augmented hierarchy is then used with the USP technique to perform ZS-HTC. HiLA is considered the state of the art for ZS-HTC prior to HierPrompt.

TELEClass (Zhang et al., 2025) is mentioned as a similar model that also uses LLMs and hierarchy. However, the authors explicitly state it is not included as a direct baseline because TELEClass addresses a weakly-supervised task, whereas HierPrompt focuses on the stricter zero-shot scenario (where no labeled or unlabeled training data is provided).
5.4. Implementation Details
- LLMs for Prototype Refinement:
  - Category Name Extension: Alibaba Cloud's qwen-turbo (the Qwen model with the fewest parameters) was used, which indicates that this task is considered relatively straightforward for LLMs.
  - Example Text Generation: Alibaba Cloud's qwen-plus (a model with a moderate number of parameters) was used, which suggests that text generation requires a more capable LLM than name contextualization alone.
- Text Encoder: Consistent with USP and HiLA, the mpnet-all model (Song et al., 2020) was used as the zero-shot text classification encoder. This ensures a fair comparison with the baselines.
- Prompt Customization: The [dataset] placeholder in the prompts was filled according to the dataset: NYT: 'news report'; DBpedia: 'dbpedia article'; Amazon: 'amazon review'.
- Number of Generated Examples (n): For COG-based prompting in example text generation, the number of example texts was set to 4.
- Evaluation Metric: Macro-F1 score is the primary metric reported in the main experiments.
- Hardware: All experiments were conducted on an NVIDIA GeForce RTX 2080 GPU.
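The prompt customization above amounts to simple template substitution. A hypothetical sketch (the helper name, mapping variable, and template text are illustrative, not the paper's actual prompts):

```python
# Hypothetical helper: fill the [dataset] placeholder as Section 5.4 describes.
DATASET_PHRASE = {
    "NYT": "news report",
    "DBpedia": "dbpedia article",
    "Amazon": "amazon review",
}

def fill_prompt(template, dataset):
    """Replace the [dataset] placeholder with the dataset-specific phrase."""
    return template.replace("[dataset]", DATASET_PHRASE[dataset])
```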
6. Results & Analysis
6.1. Core Results Analysis
The main experimental results, demonstrating the superiority of HierPrompt on three public benchmarks, are summarized in Table 4.
The following are the results from Table 4 of the original paper:
| method | NYT L1 | NYT L2 | DBpedia L1 | DBpedia L2 | DBpedia L3 | Amazon L1 | Amazon L2 | Amazon L3 |
| base-label | 70.60 | 66.13 | 31.60 | 33.64 | 63.60 | 56.35 | 26.97 | 14.22 |
| base-text | 89.05 | 58.22 | 62.72 | 55.33 | 63.88 | 79.57 | 43.92 | 17.84 |
| USP | N.A. | 66.13 | 64.70 | 65.60 | 62.80 | 71.20 | 34.80 | 17.30 |
| HILA | N.A. | N.A. | 76.80 | 66.00 | 62.90 | 76.20 | 39.30 | 24.90 |
| ours-label | 85.78 | 68.71 | 76.95 | 66.47 | 66.20 | 82.44 | 43.39 | 19.45 |
| ours-text | 86.48 | 63.79 | 88.17 | 65.90 | 56.09 | 82.69 | 47.58 | 22.09 |
| ours | 87.00 | 69.98 | 83.56 | 71.32 | 65.78 | 82.56 | 48.67 | 22.69 |
Note: base-text is not strictly zero-shot classification; it is listed for reference only.
Analysis of Results:
- Effect of Category Name Contextualization (ours-label vs. base-label):
  - The ours-label row shows the Macro-F1 scores after applying the Category Name Extension module (contextualizing category names with LLMs).
  - There is a substantial improvement over base-label across all datasets and levels:
    - NYT L1: ours-label (85.78%) vs. base-label (70.60%), an increase of 15.18 percentage points.
    - DBpedia L1: ours-label (76.95%) vs. base-label (31.60%), a dramatic increase of 45.35 points (the paper's abstract states 50.57%, which differs slightly from the table calculation, but it is still a huge jump).
    - Amazon L1: ours-label (82.44%) vs. base-label (56.35%), an increase of 26.09 points.
  - This confirms the paper's hypothesis that coarse-grained labels are often too general and lead to misleading prototypes. Contextualizing these names with LLMs significantly improves their representativeness and classification performance.
- Effect of Example Text Generation (ours-text vs. base-text):
  - The ours-text row presents F1 scores after generating example texts using the LLM and COG-based prompting.
  - In most cases, the LLM-generated texts (ours-text) perform comparably to, or even better than, randomly selected real examples (base-text).
  - For example, on DBpedia L1, ours-text (88.17%) significantly outperforms base-text (62.72%). On Amazon L2, ours-text (47.58%) is better than base-text (43.92%).
  - This demonstrates the effectiveness of the Example Text Generation module, showing that LLMs can generate high-quality synthetic examples that accurately capture the stylistic and detailed information of categories, even in a zero-shot scenario.
- Overall Performance of HierPrompt (ours vs. baselines):
  - The ours row represents the full HierPrompt method, combining CNP contextualization, ETP generation, and Maximum Similarity Propagation (MSP).
  - HierPrompt consistently outperforms all baselines, including the state-of-the-art HiLA, across almost all levels and datasets.
  - DBpedia: ours (L1: 83.56%, L2: 71.32%, L3: 65.78%) significantly surpasses HiLA (L1: 76.80%, L2: 66.00%, L3: 62.90%). The improvements are 6.76 (L1), 5.32 (L2), and 2.88 (L3) points. The abstract mentions 5.41%, 5.24%, and 4.99% for DBpedia, a slight discrepancy, but the trend of significant improvement is clear.
  - Amazon: ours (L1: 82.56%, L2: 48.67%, L3: 22.69%) also shows strong gains over HiLA (L1: 76.20%, L2: 39.30%, L3: 24.90%). The improvements are 6.36 (L1) and 9.37 (L2) points. For L3, ours is slightly lower than HiLA, but still competitive.
  - HierPrompt also generally outperforms USP by a wide margin, especially on L1 and L2 for DBpedia and Amazon.

In summary, the results strongly validate HierPrompt's effectiveness. The combination of LLM-enhanced prototypes (both CNP and ETP) and Maximum Similarity Propagation provides a robust solution for ZS-HTC, significantly improving upon prior methods by generating more representative and informative category representations.
6.2. Ablation Studies / Parameter Analysis
Ablation studies were conducted to understand the contribution of each key component of HierPrompt.
6.2.1. Maximum Similarity Propagation (MSP)
This study investigates the effectiveness of the proposed MSP technique compared to other propagation methods and direct classification. The experiments were conducted on DBpedia and Amazon datasets, focusing on non-leaf layers (L1, L2) where propagation is relevant.
The following are the results from Table 5 of the original paper:
| method | DBpedia L1 | DBpedia L2 | Amazon L1 | Amazon L2 |
| USP | 64.70 | 65.60 | 71.20 | 34.80 |
| USP-label | 68.60 | 69.80 | 72.94 | 36.00 |
| direct-label | 38.48 | 35.37 | 74.10 | 30.51 |
| MSP-label | 76.95 | 66.47 | 82.44 | 43.39 |
| direct-text | 58.11 | 60.66 | 75.89 | 41.59 |
| MSP-text | 79.21 | 67.94 | 77.71 | 44.54 |
Note: The rows for direct-text and MSP-text are slightly misaligned in the original paper's table; they are treated here as separate rows of results.
Analysis:
- Direct vs. Propagation: Comparing direct-label (no propagation) with USP-label or MSP-label clearly shows the benefit of any similarity propagation. direct-label performs significantly worse (e.g., DBpedia L1: 38.48%) than methods employing propagation.
- MSP vs. USP: MSP-label (using contextualized category names with MSP) consistently outperforms USP (using raw category names) and USP-label (using contextualized names with USP).
  - On DBpedia L1, MSP-label (76.95%) is much higher than USP (64.70%) and USP-label (68.60%), an improvement of over 8 points compared to USP-label.
  - On Amazon L1, MSP-label (82.44%) again outperforms USP (71.20%) and USP-label (72.94%) by significant margins.
  - The paper notes that for DBpedia L2, USP-label (69.80%) is slightly better than MSP-label (66.47%), but MSP's advantage becomes clear when propagating further up to L1 (8.35% better than USP for DBpedia L1, as stated in the text). For Amazon, MSP consistently outperforms USP on both L1 (9.5% better) and L2 (7.39% better).
- MSP with ETP: MSP-text (using generated example texts with MSP) also shows strong performance, often outperforming MSP-label and direct-text. This indicates that MSP is effective for propagating similarities derived from both CNPs and ETPs.

The results confirm that MSP greatly improves performance, especially for higher levels of the hierarchy, and demonstrates a significant advantage over the traditional USP method in most cases.
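The propagation compared above can be sketched as follows. This is an illustrative reading of MSP under one assumption: each non-leaf category's score is the maximum of its children's propagated scores (how the paper combines CNP and ETP similarities into the leaf scores is omitted here).

```python
def msp_scores(leaf_sims, children):
    """Maximum Similarity Propagation (sketch): leaves keep their own
    text-to-prototype similarity; a non-leaf category receives the max
    of its children's propagated scores."""
    scores = dict(leaf_sims)

    def resolve(cat):
        if cat not in scores:
            scores[cat] = max(resolve(c) for c in children[cat])
        return scores[cat]

    for cat in children:
        resolve(cat)
    return scores
```

Taking the max (rather than, say, the mean) means one strongly matching descendant is enough to give its ancestors a high score, which is why the gains in Table 5 concentrate at the higher levels.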
6.2.2. Category Name Contextualization with hierarchy-aware Prompting
This study evaluates the importance of integrating hierarchical information into prompts for contextualizing category names.
- Coarse-grained (L1) Extension: The following are the results from Table 7 of the original paper:

| method | NYT miF1 | NYT maF1 | DBpedia miF1 | DBpedia maF1 | Amazon miF1 | Amazon maF1 |
| base-label | 88.72 | 70.60 | 46.90 | 31.60 | 59.02 | 56.35 |
| ours-hier (∆) | 93.99 (+5.27) | 85.78 (+15.18) | 76.95 (+30.05) | 76.95 (+45.35) | 73.20 (+14.18) | 74.10 (+17.75) |

Note: The abstract reports slightly different percentage increases; the table values are used here. Also, the miF1 for ours-hier on DBpedia shows 76.95, the same as its maF1, indicating a possible typo or a specific characteristic of the dataset at that level. ours-hier (using P_coarse with hierarchical information) significantly boosts performance over base-label for L1 categories across all datasets. This validates the premise that contextualizing overly general coarse-grained names improves CNP quality.

- Fine-grained (L2) Extension: The following are the results from Table 8 of the original paper:

| method | NYT miF1 | NYT maF1 | DBpedia miF1 | DBpedia maF1 | Amazon miF1 | Amazon maF1 |
| base-label | 84.29 | 66.13 | 36.76 | 33.64 | 35.64 | 26.97 |
| ours-w/hier (∆) | 73.09 (-11.20) | 62.61 (-3.52) | 39.34 (+2.58) | 36.65 (+3.01) | 36.16 (+0.52) | 30.63 (+3.66) |
| ours-hier (∆) | 84.72 (+0.43) | 68.71 (+2.58) | 37.16 (+0.40) | 36.69 (+3.05) | 39.86 (+4.22) | 32.74 (+5.77) |

Note: The delta values (∆) are calculated as ours - base-label.
- ours-w/hier uses prompts without hierarchical information (P_fine_direct).
- ours-hier uses P_fine with hierarchical information (parent and sibling context).

Figure 4: Ablation: The Effect of utilizing hierarchical information in prompts in Category Name Extension (Row 1) and Example Text Generation (Row 2).

As depicted in the first row of Figure 4 (Category Name Extension for L2), and in Table 8:
- ours-hier generally performs better than base-label and ours-w/hier.
- On NYT L2, ours-w/hier (62.61%) actually performs worse than base-label (66.13%), indicating that an uncontextualized LLM output can be detrimental. However, ours-hier (68.71%) surpasses base-label.
- On DBpedia L2 and Amazon L2, ours-hier consistently shows improvements over both base-label and ours-w/hier. This confirms that integrating hierarchical information (parent and sibling categories) into prompts is crucial for LLMs to generate accurate CNPs for fine-grained categories.
6.2.3. Example Text Generation with hierarchy-aware Prompting
This study investigates the impact of hierarchical information in prompts for generating example texts. Five variants were tested for L1 and L2.
The following are the results from Table 9 of the original paper:
| method | NYT miF1 | NYT maF1 | DBpedia miF1 | DBpedia maF1 | Amazon miF1 | Amazon maF1 |
| base-text | 95.57 | 89.05 | 74.94 | 56.81 | 71.06 | 70.19 |
| ours-w/hier | 61.92 | 78.49 | 56.60 | 42.90 | 75.50 | 74.77 |
| ours-L1 | 94.27 | 85.04 | 59.56 | 42.97 | 76.70 | 76.40 |
| ours-L2 | 94.97 | 86.72 | 71.34 | 51.09 | 75.48 | 74.91 |
| ours-L1+L2 | 93.92 | 82.44 | 72.58 | 51.41 | 76.94 | 76.31 |
Note: The table formatting of the original paper had some values listed on the same line. I have separated them into distinct rows for clarity according to the method names.
The following are the results from Table 10 of the original paper:
| method | NYT miF1 | NYT maF1 | DBpedia miF1 | DBpedia maF1 | Amazon miF1 | Amazon maF1 |
| base-text | 65.19 | 58.22 | 25.56 | 25.62 | 25.44 | 22.53 |
| ours-w/hier | 68.32 | 61.92 | 26.22 | 19.67 | 37.40 | 27.14 |
| ours-L1 | 78.73 | 62.66 | 29.52 | 25.78 | 35.18 | 27.95 |
| ours-L2 | 70.60 | 62.60 | 33.34 | 31.88 | 39.88 | 31.22 |
| ours-L1+L2 | 75.35 | 62.62 | 33.54 | 31.21 | 40.82 | 32.81 |
Note: The original table formatting had some values listed on the same line. I have separated them into distinct rows for clarity according to the method names.
Analysis (referencing Figure 4, row 2 and Tables 9, 10):
- ours-w/hier: prompting the LLM with non-hierarchical prompts (P_text_direct).
- ours-L1: prompting with coarse-grained (L1) information.
- ours-L2: prompting with fine-grained (L2) information.
- ours-L1+L2: prompting with full hierarchical information (P_text).

The results show that:
- Generally, incorporating any hierarchical information (ours-L1, ours-L2) into the prompts for example text generation yields better results than using non-hierarchical prompts (ours-w/hier).
- Considering more complete hierarchical information (ours-L1+L2) tends to perform best among the variants that use generated texts, especially for L2, demonstrating the cumulative benefit of providing richer context to the LLM.
- For example, on Amazon L2, ours-L1+L2 (32.81%) significantly outperforms ours-w/hier (27.14%).
- This validates the design choice of hierarchy-aware prompts for ETP generation.
6.2.4. Example Text Generation with COG-based Prompting
This study isolates the effect of the Chain-Of-Generation (COG) strategy within the example text generation module.
The following are the results from Table 6 of the original paper:
| Prompt | DBpedia L1 | DBpedia L2 | DBpedia L3 | Amazon L1 | Amazon L2 | Amazon L3 |
| single example, no COG | 64.60 | 61.57 | 51.19 | 76.87 | 43.13 | 21.91 |
| n examples, no COG | 59.06 | 58.06 | 57.28 | 76.06 | 41.95 | 21.89 |
| n examples, with COG | 67.17 | 60.13 | 56.09 | 77.58 | 43.43 | 22.69 |
| single example, no COG (*) | 82.29 | 65.22 | 51.19 | 77.73 | 45.47 | 21.91 |
| n examples, no COG (*) | 79.70 | 65.69 | 57.28 | 76.70 | 46.03 | 21.89 |
| n examples, with COG (*) | 88.17 | 65.90 | 56.09 | 82.69 | 47.58 | 22.69 |

Note: The asterisk (*) indicates that MSP (Maximum Similarity Propagation) was applied for similarity calculation. The prompt symbols in the first column of the original table did not survive extraction, so the rows are labeled descriptively here: the first prompt generates 1 example, the second generates n examples without COG, and the third generates n examples with COG.
Analysis:
- Impact of COG: Comparing the COG prompt (n examples with COG) against the single-example prompt and the multi-example prompt without COG shows the benefit of the COG strategy.
- On DBpedia, the COG prompt shows advantages in L1 and L3 over the single-example prompt; its advantage over the multi-example prompt without COG is less obvious here.
- On Amazon, the COG prompt demonstrates a significant performance improvement over both alternatives across all levels, regardless of whether MSP is applied. For example, with MSP, Amazon L1: $P_{\text{text}}^{\text{COG}}$* (82.69%) vs. $P'_{\text{text}}$* (76.70%).
- This indicates that explicitly guiding the LLM through the logical steps of generation (COG) is effective in producing more concrete and useful example texts, particularly for certain datasets or hierarchical levels.
6.2.5. Sensitivity Analysis
This analysis explores how the choice of LLM affects HierPrompt's performance. Three Alibaba Cloud Qwen models (qwen-turbo, qwen-plus, qwen-max) were tested on the DBpedia dataset for both category name extension and example text generation.

Figure 5: Sensitivity Analysis: Effect of different LLM for the proposed HierPrompt
Analysis:
- Category Name Extension (CNP): F1-scores for CNP (blue bars) show minimal variation across different LLMs. This suggests that the task of understanding and extending category names, especially with hierarchical context, may be relatively straightforward even for less complex LLMs.
- Example Text Generation (ETP & All): For ETP (orange bars) and the combined All (grey bars), there are noticeable differences in performance across LLMs.
  - qwen-turbo and qwen-max generally show a decrease in performance as category generality increases (L3 to L1), meaning they perform better on finer-grained categories.
  - Conversely, qwen-plus shows the opposite trend: its performance increases with the generality of the category (L3 to L1).
- This indicates that while CNP contextualization is robust to LLM choice, ETP generation is more sensitive. The choice of LLM significantly impacts the quality and utility of the generated example texts, especially at different levels of category generality. qwen-plus appears to be a good balance, or even superior, for coarse-grained example text generation.
6.3. Computational Budget
The paper notes that the computational cost for LLM querying (Category Name Contextualization and Example Text Generation) grows linearly with the number of categories in the hierarchical taxonomy, i.e., $O(N)$, where $N$ is the number of categories.
- For the largest dataset (Amazon), the non-batch querying for Category Name Contextualization takes approximately 120 seconds.
- Example Text Generation takes around 6000 seconds (100 minutes) in non-batch querying.
- During inference (classification of a new text), the process takes less than 5 seconds. The experiments were run on an NVIDIA GeForce RTX 2080 GPU. This suggests that while prototype generation can be time-consuming for large taxonomies, the inference time is very low, making the method practical for real-time classification once prototypes are generated.
7. Conclusion & Reflections
7.1. Conclusion Summary
The paper introduces HierPrompt, a novel method for Zero-Shot Hierarchical Text Classification (ZS-HTC) that significantly enhances prototype quality and leverages hierarchical structure. HierPrompt refines the conventional prototype-based framework by integrating Example Text Prototypes (ETP) alongside Category Name Prototypes (CNP), thus capturing both semantic and stylistic information of categories. It employs a Maximum Similarity Propagation (MSP) technique to incorporate hierarchical relationships into the similarity calculation, improving classification consistency across levels.
The core innovation lies in using hierarchy-aware prompts to instruct Large Language Models (LLMs) to perform two key functions: (i) contextualizing category names from various hierarchical levels to generate more accurate CNPs, and (ii) generating detailed example texts for leaf categories to form ETPs, notably using a Chain Of Generation (COG) strategy to ensure high-quality, concrete examples. Experimental results on NYT, DBpedia, and Amazon datasets consistently demonstrate that HierPrompt substantially outperforms existing state-of-the-art ZS-HTC methods, validating the effectiveness of its LLM-enhanced prototype refinement and hierarchical propagation mechanism.
7.2. Limitations & Future Work
The authors acknowledge several limitations of HierPrompt and suggest future research directions:
- LLM Knowledge Assumption and Hallucinations: The method assumes that the LLM possesses accurate background knowledge about the taxonomy. However, this may not hold true for highly specialized or evolving taxonomies. The risk of LLM hallucinations (generating factually incorrect or nonsensical information) could degrade the quality of contextualized category names and generated example texts, thereby negatively impacting model performance.
  - Future Work: Incorporating explicit mechanisms like self-consistency or self-correctness into the LLM prompting process to mitigate hallucinations and ensure more reliable prototype generation.
- Prototype Refinement Scope: HierPrompt primarily focuses on refining prototypes at the textual level (i.e., improving the text content that is then encoded into prototypes). It does not explicitly address the adjustment or optimization of these prototypes within the embedding space itself (e.g., fine-tuning the encoder or applying post-processing to prototype vectors).
  - Future Work: Exploring methods to improve hierarchical prototypes in both the textual space (as done in this paper) and the embedding space, potentially through techniques like prototype regularization or metric learning.
7.3. Personal Insights & Critique
HierPrompt presents an elegant and effective solution to a challenging problem in NLP.
Inspirations and Strengths:
- Clever use of LLMs for Prototype Generation: The paper moves beyond merely using LLMs for label augmentation (like HiLA) or in-context learning. Instead, it ingeniously leverages LLMs to generate rich, detailed, and context-aware prototypes from scratch in a zero-shot setting. This is a significant conceptual leap, transforming LLMs into powerful prototype engineers.
- Hierarchy-Aware Prompting: The meticulous design of hierarchy-aware prompts is a key strength. By explicitly feeding LLMs information about parent, sibling, and child categories, the paper demonstrates how to guide complex models to generate highly relevant and contextualized outputs, which is crucial for hierarchical tasks.
- Dual Prototype Approach (CNP + ETP): The recognition that category names alone are insufficient, and the introduction of Example Text Prototypes (ETP), is a pragmatic and impactful decision. ETPs bridge the gap by providing stylistic and fine-grained information that static category names cannot, which is especially critical in diverse text domains like Amazon reviews.
- Maximum Similarity Propagation (MSP): The MSP technique is a well-reasoned enhancement to hierarchical classification. By propagating the maximum similarity from descendants, it ensures that strong signals from lower levels are effectively utilized by their ancestors, leading to more robust and consistent hierarchical predictions.
- Practicality for Zero-Shot: Despite the upfront cost of LLM inference for prototype generation, the final classification step is efficient, making it practical for real-world deployment once prototypes are established.
Potential Issues, Unverified Assumptions, and Areas for Improvement:
- Reliance on LLM Quality and Cost: The performance of HierPrompt is inherently tied to the capabilities and cost of the underlying LLM. The sensitivity analysis confirms that different LLMs yield varying results, especially for ETP generation. Future work needs to explore how robust this method is to different LLM architectures and how to optimize for cost-effectiveness, particularly for very large taxonomies where ETP generation can be computationally intensive.
- "Zero-Shot" Nuance: While the paper adheres to a strict definition of zero-shot (no labeled training data), the use of LLMs to generate example texts can be viewed as "synthetic data generation" or "LLM-enhanced data augmentation" rather than purely relying on semantic descriptions. The LLM itself has been trained on vast amounts of data, potentially including examples similar to the target domains. This raises questions about the true "zero-shotness" in terms of novelty of content for the LLM.
- Generalizability of Prompts: The prompts are dataset-specific (e.g., the [dataset] placeholder). While effective, ensuring that these prompts are universally applicable across vastly different domains or highly specialized taxonomies might require manual tuning. An adaptive prompting mechanism could be an interesting future direction.
- Handling Long/Complex Taxonomies: For extremely deep hierarchies or those with very broad nodes and many children/descendants, the category lists (such as sibling or descendant names) included in prompts could become very long, potentially exceeding LLM context windows or affecting their performance. The averaging approach for ETP aggregation at higher levels might also dilute specificity in very deep hierarchies.
- Embedding Space Refinement: As noted by the authors, neglecting prototype refinement in the embedding space is a limitation. Combining the textual refinement of HierPrompt with techniques like prototype network fine-tuning or metric learning on synthetic data could lead to even more powerful zero-shot classifiers.

Overall, HierPrompt offers a compelling paradigm for ZS-HTC, demonstrating the immense potential of LLMs to create rich, dynamic knowledge representations for categories. Its innovative use of prompts and hierarchical awareness makes it a strong contender in the evolving landscape of zero-shot learning.