Hierarchical Text Classification Using Black Box Large Language Models
TL;DR Summary
This study investigates the use of black box Large Language Models for Hierarchical Text Classification, evaluating three prompting strategies in zero-shot and few-shot settings. Results show that few-shot settings improve accuracy, especially with the DH strategy, although API costs rise sharply with deeper label hierarchies, creating a trade-off between accuracy and cost.
Abstract
Hierarchical Text Classification (HTC) aims to assign texts to structured label hierarchies; however, it faces challenges due to data scarcity and model complexity. This study explores the feasibility of using black box Large Language Models (LLMs) accessed via APIs for HTC, as an alternative to traditional machine learning methods that require extensive labeled data and computational resources. We evaluate three prompting strategies -- Direct Leaf Label Prediction (DL), Direct Hierarchical Label Prediction (DH), and Top-down Multi-step Hierarchical Label Prediction (TMH) -- in both zero-shot and few-shot settings, comparing the accuracy and cost-effectiveness of these strategies. Experiments on two datasets show that a few-shot setting consistently improves classification accuracy compared to a zero-shot setting. While a traditional machine learning model achieves high accuracy on a dataset with a shallow hierarchy, LLMs, especially DH strategy, tend to outperform the machine learning model on a dataset with a deeper hierarchy. API costs increase significantly due to the higher input tokens required for deeper label hierarchies on DH strategy. These results emphasize the trade-off between accuracy improvement and the computational cost of prompt strategy. These findings highlight the potential of black box LLMs for HTC while underscoring the need to carefully select a prompt strategy to balance performance and cost.
In-depth Reading
1. Bibliographic Information
1.1. Title
Hierarchical Text Classification Using Black Box Large Language Models
1.2. Authors
Kosuke Yoshimura and Hisashi Kashima, both affiliated with Kyoto University, Kyoto, Japan.
1.3. Journal/Conference
This paper is a preprint published on arXiv (https://arxiv.org/abs/2508.04219), a repository for preprints of scientific papers. As a preprint, it has not yet undergone formal peer review, but arXiv is a widely recognized platform for disseminating early research in fields like computer science, machine learning, and natural language processing.
1.4. Publication Year
2025
1.5. Abstract
The paper addresses the challenges of Hierarchical Text Classification (HTC), such as data scarcity and model complexity. It investigates the use of black box Large Language Models (LLMs), accessed via APIs, as a lightweight alternative to traditional machine learning methods. The study evaluates three prompting strategies—Direct Leaf Label Prediction (DL), Direct Hierarchical Label Prediction (DH), and Top-down Multi-step Hierarchical Label Prediction (TMH)—in both zero-shot and few-shot settings. Experiments on two datasets reveal that few-shot settings consistently improve accuracy. While traditional machine learning models perform well on shallow hierarchies, LLMs, particularly with the DH strategy, can outperform them on deeper hierarchies. The study also highlights a significant increase in API costs for deeper label hierarchies with the DH strategy due to higher input token requirements, emphasizing a trade-off between accuracy and computational cost. The findings suggest that black box LLMs have potential for HTC, provided prompt strategies are carefully chosen to balance performance and cost.
1.6. Original Source Link
https://arxiv.org/abs/2508.04219 (Preprint)
1.7. PDF Link
https://arxiv.org/pdf/2508.04219v1.pdf (Preprint)
2. Executive Summary
2.1. Background & Motivation
The core problem the paper aims to solve is the efficient and accurate classification of texts into hierarchically structured labels, known as Hierarchical Text Classification (HTC). This problem is crucial for organizing vast amounts of digital content, such as medical texts, academic articles, and product reviews, enabling efficient information retrieval.
The main challenges in HTC that prior research faces are:
- Data Scarcity: As the number of hierarchical labels grows (often hundreds to thousands), obtaining sufficient labeled training data for each specific category becomes increasingly difficult, leading to sparse data.
- Model Complexity: Traditional machine learning and deep learning models struggle with the vast label space, often leading to issues like overfitting (the model learns the training data too well and performs poorly on new data) or underfitting (the model is too simple to capture the patterns) due to insufficient data per label. Furthermore, these models typically require extensive computational resources for training and fine-tuning.
The paper's innovative idea is to leverage the capabilities of Large Language Models (LLMs), specifically black box LLMs accessible via APIs, for HTC. LLMs have demonstrated strong zero-shot (no examples) and few-shot (minimal examples) learning abilities, offering a potential solution to data scarcity and the need for complex, resource-intensive model training. By using black box LLMs, the study seeks a lightweight and practical implementation that avoids the substantial computational overhead required for deploying and fine-tuning white box LLMs (publicly available models trained or hosted on one's own resources).
2.2. Main Contributions / Findings
The paper's primary contributions are:
- Adaptation of HTC Strategies to LLM Prompting: The study applies three distinct prompting strategies, Direct Leaf Label Prediction (DL), Direct Hierarchical Label Prediction (DH), and Top-down Multi-step Hierarchical Label Prediction (TMH), to adapt typical HTC approaches for use with black box LLMs.
- Comprehensive Evaluation of Prompting Strategies: It conducts a thorough evaluation of these prompting strategies on real-world datasets, assessing both classification accuracy and API invocation cost.
- Comparison of Zero-shot and Few-shot Performance: The research compares the performance of LLMs in zero-shot settings (no in-context examples) against few-shot settings (with a small number of in-context examples) to highlight the effectiveness and limitations of black box LLMs for HTC.

The key conclusions and findings are:
- Few-shot Learning Improves Accuracy: A few-shot setting consistently improves classification accuracy compared to a zero-shot setting across all evaluated prompting strategies. This highlights the value of providing even a small number of examples to LLMs for HTC tasks.
- LLMs vs. Traditional ML: While a traditional machine learning model (HPT) achieves high accuracy on datasets with shallow hierarchies (like Web of Science), LLMs (especially the DH strategy) tend to outperform the machine learning model on datasets with deeper hierarchies (like Amazon Product Reviews). This suggests LLMs are particularly suited for complex hierarchical structures where traditional models might struggle with data sparsity at deeper levels.
- Accuracy-Cost Trade-off: API costs increase significantly with deeper label hierarchies and the DH strategy due to the higher number of input tokens required. This underscores a crucial trade-off: improved accuracy, especially with more complex strategies and few-shot examples, comes with a higher computational (and monetary) cost.
- Potential of Black Box LLMs: The findings highlight the potential of black box LLMs for HTC, offering a viable alternative to traditional machine learning models, especially in low-resource scenarios (where limited labeled data is available). However, careful selection of a prompt strategy is essential to balance performance and cost-effectiveness.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
- Hierarchical Text Classification (HTC): This is a specialized type of text classification where the target labels are organized in a hierarchy, similar to a family tree or a file system. Instead of assigning a text to a single flat category (e.g., "Sports"), HTC aims to classify it into one or more categories that are structured from general to specific (e.g., "Sports" -> "Outdoor Sports" -> "Hiking"). The hierarchy can be a tree (each child has only one parent) or a Directed Acyclic Graph (DAG) (a child can have multiple parents). The goal is to identify the most appropriate path or set of paths down this hierarchy for a given text. This is challenging because of the large number of potential labels and the need to maintain consistency across hierarchical levels.
- Large Language Models (LLMs): These are advanced artificial intelligence models trained on vast amounts of text data (e.g., internet text, books). They are designed to understand, generate, and process human language. LLMs are typically based on the Transformer architecture, which uses self-attention mechanisms to weigh the importance of different words in a sequence. Key characteristics include:
  - Scale: They have billions or even trillions of parameters, allowing them to capture complex patterns in language.
  - Pre-training: They undergo extensive pre-training on diverse text corpora, learning grammar, facts, reasoning abilities, and common knowledge.
  - Generative Capabilities: They can generate coherent and contextually relevant text and complete tasks like summarization, translation, question answering, and text classification.
- Black Box LLMs: These refer to LLMs whose internal architecture, parameters, and training data are not publicly accessible. Users interact with them purely through Application Programming Interfaces (APIs). Examples include OpenAI's GPT series (like gpt-4o-mini) or Google's Gemini. The user cannot fine-tune these models or inspect their internal workings directly; they only send an input (a prompt) and receive an output.
- White Box LLMs: In contrast to black box LLMs, white box LLMs are those whose models or architectures are publicly available or can be deployed and fine-tuned on a user's own computational resources. Examples include models from the Llama series by Meta. While they offer greater control (e.g., fine-tuning for specific tasks), they require substantial computational power and expertise to set up and maintain.
- Prompting: This is the technique of guiding an LLM to perform a specific task by providing it with carefully crafted text instructions (the prompt). Instead of traditional programming or fine-tuning, prompting involves framing the task as a natural language query or command. For example, to classify text, a prompt might include instructions like "Classify the following text into one of these categories: [list categories]."
- Zero-shot Learning: In the context of LLMs, zero-shot learning means performing a task without providing any explicit examples of that task in the prompt. The LLM relies solely on its pre-trained knowledge to understand the instructions and generate an appropriate response, for example, classifying a text without showing any prior examples of classified texts.
- Few-shot Learning: This is an extension of prompting where a small number of examples (typically 1 to 20) are included in the prompt to guide the LLM's behavior for a specific task. These examples serve as in-context learning demonstrations, helping the LLM understand the desired format, style, and logic of the task without requiring any weight updates or fine-tuning of the model. For instance, to classify text, the prompt might include a few example texts paired with their correct hierarchical labels before asking the LLM to classify a new text.
- Levenshtein Distance: Also known as edit distance, it is a metric for measuring the difference between two sequences (e.g., strings). It quantifies the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one string into the other. For example, the Levenshtein distance between "kitten" and "sitting" is 3 (substitute k -> s, substitute e -> i, insert g). In this paper, it is used to find the closest matching label when an LLM's output doesn't exactly match the provided candidate labels; a minimal sketch appears after this list.
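To make the matching step concrete, here is a minimal sketch (illustrative, not from the paper): a standard dynamic-programming edit distance plus a helper that snaps a free-form LLM output to the nearest candidate label. The candidate labels in the usage example are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def closest_label(output: str, candidates: list[str]) -> str:
    """Map a raw LLM response to the nearest candidate label."""
    return min(candidates, key=lambda c: levenshtein(output.lower(), c.lower()))

print(levenshtein("kitten", "sitting"))  # 3
# Hypothetical case: the LLM replied with a slight variation of a real label.
print(closest_label("Machine learning", ["Machine Learning", "Symbolic Computation"]))
```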
3.2. Previous Works
The paper discusses related work in three main areas: classical Hierarchical Text Classification, general applications of LLMs, and HTC with LLMs.
3.2.1. Hierarchical Text Classification
- Classical Machine Learning Approaches: Early methods focused on creating features from input text and applying specialized classification models like Support Vector Machines (SVMs) for hierarchical classification [9,10]. These often involved training separate classifiers for each level or for parent-child relationships.
- Deep Learning Approaches (e.g., HDLTex [1]): Kowsari et al. (2017) proposed HDLTex (Hierarchical Deep Learning for Text Classification), which uses a distinct deep learning architecture for each level of the hierarchy. It trains models to estimate child labels conditioned on parent labels. While addressing the issue of increasing labels, it still requires a large amount of training data.
- Hierarchical Label Set Expansion (HLSE [2]): Gargiulo et al. addressed the problem in datasets where experts might not label all levels from root to leaf. They proposed HLSE to infer and complement missing parent-child relationships, enhancing the dataset.
- Prompt Tuning for Hierarchical Classification (HPT [11]): Wang et al. (2022) proposed HPT (Hierarchy-aware Prompt Tuning). This method integrates label hierarchy information into dynamic virtual templates and hierarchy-aware label words, bridging the gap between conventional prompt tuning and Pre-trained Language Model (PLM) training tasks. It also introduced a zero-bounded multi-label cross-entropy loss to handle label imbalance and low-resource scenarios. HPT is based on transformer architectures like BERT, not full-scale LLMs.
- Multi-Verbalizer Framework (HierVerb [12]): Ji et al. (2023) proposed HierVerb for few-shot HTC. It directly embeds hierarchical information into layer-specific verbalizers (mappings from labels to words or phrases for PLMs). By combining a hierarchy-aware constraint chain and a flat hierarchical contrastive loss, it leverages PLM knowledge effectively. Like HPT, HierVerb operates on PLMs such as BERT, not black box LLMs.
3.2.2. Applications of LLMs
- General LLM Applications: The widespread availability of LLMs like ChatGPT has spurred many studies using them for various tasks [8,13,7].
- LLMs as Zero-Shot Text Classifiers [7]: Wang et al. (2023) validated the use of LLMs as zero-shot text classifiers. Their work is similar in exploring zero-shot classification but does not consider hierarchical structures, which is a key focus of the current paper.
3.2.3. Hierarchical Text Classification with LLMs
Several recent studies have begun to explore LLMs for HTC:
- LLMs with Entailment Predictors [4]: Bhambhoria et al. (2023) proposed a framework combining LLMs with entailment predictors for zero-shot hierarchical classification, converting HTC into a long-tail prediction task. This differs from the current paper by focusing on a combined framework rather than solely prompting strategies, and it does not explicitly compare strategies that consider the hierarchical structure within prompts.
- TELEClass [14]: Zhang et al. (2024) introduced TELEClass, a method for high-performance HTC model training that uses LLMs for annotation and taxonomy expansion. Their research primarily focuses on fine-tuning a pre-trained model with minimal supervision and is limited to zero-shot settings for LLM usage. The current paper, in contrast, compares accuracy and cost across different prompting strategies in both zero-shot and few-shot settings, without fine-tuning.
- Retrieval-based In-context Learning (ICL) [15]: Chen et al. (2024) proposed a retrieval-based ICL approach for few-shot HTC. This method requires task-specific training and building a retrieval database. The current paper, however, exclusively investigates black box LLMs relying solely on prompting, without external retrieval or fine-tuning, making it a more lightweight approach.
- Single-pass HTC with LLMs [16]: Schmidt et al. (2024) focused on zero-shot HTC, using LLMs with hierarchical label structures to improve performance. Their study is limited to zero-shot experiments, while the current paper systematically examines both zero-shot and few-shot settings, providing a more comprehensive analysis of the impact of in-context learning.
3.3. Technological Evolution
Text classification has evolved significantly:
- Early Methods: Initially, statistical methods (Naive Bayes, SVM) with handcrafted features were common. These struggled with large, high-dimensional text data and complex label structures.
- Deep Learning Era: The advent of deep learning brought models like Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs), which could automatically learn features from text. For HTC, specialized deep learning architectures were developed (e.g., HDLTex), but they still demanded substantial labeled data.
- Pre-trained Language Models (PLMs): Models like BERT revolutionized NLP by pre-training on massive corpora and then fine-tuning for downstream tasks. Methods like HPT and HierVerb extended PLM capabilities to HTC. While powerful, they still require task-specific training or fine-tuning.
- Large Language Models (LLMs) and Prompting: The latest wave involves very large, general-purpose LLMs (e.g., GPT, Llama). These models, due to their vast pre-training, can perform many tasks (including classification) with zero-shot or few-shot prompting, significantly reducing the need for task-specific training data and model complexity.

This paper's work fits into the most recent stage of this evolution, exploring black box LLMs and prompting as a lightweight, data-efficient, and computationally less demanding alternative to previous methods, particularly for HTC where data scarcity is a major issue.
3.4. Differentiation Analysis
Compared to related work, the core differences and innovations of this paper's approach are:
- Focus on Black Box LLMs via API: Unlike HPT and HierVerb, which use BERT-like PLMs requiring fine-tuning, or white box LLMs, which need significant computational resources, this study exclusively uses black box LLMs accessed via APIs. This emphasizes a lightweight, practical, and resource-efficient solution for HTC without training or fine-tuning overhead.
- Systematic Evaluation of Prompting Strategies for HTC: The paper systematically evaluates different prompting strategies (DL, DH, TMH) specifically designed to adapt traditional HTC approaches to the LLM paradigm. Previous LLM-based HTC studies often focus on frameworks (like combining with entailment predictors [4]) or specific fine-tuning approaches (TELEClass [14]) rather than a direct comparison of diverse prompting mechanisms for black box LLMs.
- Comprehensive Zero-shot and Few-shot Analysis: While some prior LLM-based HTC research focuses solely on zero-shot settings (TELEClass [14], Schmidt et al. [16]), this paper provides a more thorough analysis by comparing both zero-shot and various few-shot settings across all prompting strategies. This allows for a deeper understanding of the accuracy-cost trade-off associated with in-context learning.
- Emphasis on Cost-Effectiveness: A unique aspect is the explicit cost analysis in terms of API tokens for different prompting strategies and few-shot example counts. This highlights the practical considerations for deploying LLM-based HTC solutions, which are often overlooked in academic evaluations.
- Addressing Deeper Hierarchies: The paper finds that LLMs, particularly the DH strategy, can outperform traditional ML models on datasets with deeper hierarchies where data scarcity is more pronounced, offering a compelling advantage in such challenging scenarios.

In essence, this paper provides a practical blueprint and empirical insights into effectively utilizing readily available black box LLMs for HTC by rigorously comparing different prompting strategies and their associated accuracy-cost trade-offs in diverse supervision settings.
4. Methodology
The study aims to provide a low-cost and high-accuracy Hierarchical Text Classification (HTC) solution using only black box LLMs for inference, operating in zero-shot or few-shot settings. This means the LLM itself is not trained or fine-tuned; its capabilities are leveraged solely through carefully crafted prompts.
Given an input text x and a black box large language model accessed via an API, the goal is to assign to x the most accurate labels from a hierarchically structured candidate label set by devising effective prompting strategies. In a few-shot setting, a small set of k labeled examples (where k is small; the experiments use k between 1 and 20) is provided as part of the prompt.
The paper evaluates three distinct prompting strategies:
- Direct Leaf Label Prediction (DL)
- Direct Hierarchical Label Prediction (DH)
- Top-down Multi-step Hierarchical Label Prediction (TMH)

These strategies adapt typical approaches used in HTC to the LLM prompting paradigm.
4.1. Direct Leaf Label Prediction Strategy (DL)
The Direct Leaf Label Prediction (DL) strategy focuses on predicting only the leaf node label (the most specific label at the deepest level of the hierarchy) directly. Once a leaf label is predicted, its hierarchical path (i.e., its parent labels up to the root) is inferred by tracing back through the predefined label hierarchy.
4.1.1. Principles
The core idea is to treat the HTC problem as a flat classification task over all possible leaf labels. The LLM is given the input text and a comprehensive list of all candidate leaf labels. It is then instructed to select the single most appropriate leaf label from this list. This simplifies the prediction task for the LLM by removing the explicit need for multi-level hierarchical reasoning during the initial prediction step. The hierarchy above the leaf is then reconstructed.
4.1.2. Core Methodology
- Input Text and Instructions: The LLM receives the input text and instructions to perform classification.
- Candidate Labels: A complete list of all leaf nodes from the candidate label set is presented to the LLM.
- Prediction: The LLM predicts one leaf label that best matches the input text.
- Hierarchy Reconstruction: After the leaf label is predicted, the full hierarchical path from the root to this predicted leaf label is automatically determined by looking up the predefined hierarchy structure.

The following figure (Figure 1 from the original paper) visually represents the Direct Leaf Label Prediction Strategy. The schematic shows the user-provided instructions, candidate labels, and input text on the left, and the label hierarchy on the right, containing categories such as Medical Sciences and Computer Science.

Fig. 1: Direct Leaf Label Prediction Strategy.
The prompt template for the DL strategy includes sections for Instructions, Candidates (listing all leaf labels), and Input Text. The Output Format specifies that the LLM should respond with only the chosen label.
The following figure (Figure 4 from the original paper) shows an example of the prompt template used for the DL strategy on the Web of Science dataset. The schematic shows the template's classification instructions, candidate labels, and the {input data} placeholder, illustrating how a single relevant Area label is selected.

Fig. 4: The prompt template for the DL strategy on the Web of Science dataset. The {input data} area is replaced with actual input text. In this template:
- Instructions: Guides the LLM on how to perform the classification. It asks the LLM to select an Area label (which corresponds to a leaf label in this context) from the provided candidates that matches the input data.
- Candidates: Lists all possible leaf labels that the LLM can choose from. For instance, in the Web of Science dataset, these could be specific sub-disciplines.
- Input Data: This placeholder is replaced with the actual text that needs to be classified.
- Output Format: Specifies that the output should be just the selected Area.
4.2. Direct Hierarchical Label Prediction Strategy (DH)
The Direct Hierarchical Label Prediction (DH) strategy instructs the LLM to output the complete hierarchical path of labels directly, from the first depth level down to the leaf node.
4.2.1. Principles
This strategy provides the LLM with full contextual information about the label hierarchy upfront. Instead of individual labels, the LLM is presented with full hierarchical paths as candidate outputs. The intuition is that by seeing the entire path structure, the LLM can better reason about the relationships between parent and child nodes and make a more coherent, hierarchically consistent prediction.
4.2.2. Core Methodology
- Input Text and Instructions: The LLM receives the input text and instructions to classify it into a hierarchical path.
- Candidate Paths: Instead of individual labels, the LLM is provided with a list of all possible hierarchical paths from the root to each leaf node, in a format such as "first-depth label > second-depth label > ... > leaf label".
- Prediction: The LLM directly predicts a single complete hierarchical path from the given candidates that best matches the input text.

The following figure (Figure 2 from the original paper) visually represents the Direct Hierarchical Label Prediction Strategy. The schematic shows the input text, instructions, and candidate labels on the left; on the right is the hierarchy to which the text belongs, with the root node connected to fields such as Medical Sciences and Computer Science, further subdivided into specific disciplines and concepts.

Fig. 2: Direct Hierarchical Label Prediction Strategy.
The prompt template for the DH strategy is similar to DL but with Candidates defined as full paths.
The following figure (Figure 5 from the original paper) shows an example of the prompt template used for the DH strategy on the Web of Science dataset.

Fig. 5: The prompt template for the DH strategy on the Web of Science dataset. The {input data} area is replaced with actual input text. In this template:
- Instructions: Directs the LLM to classify the input data into a specific Area by outputting the full hierarchical path.
- Candidates: Lists all possible hierarchical paths as options. Each path is structured as "Top-Level Category > Sub-Category > ... > Leaf Category", for example, "Medical Sciences > Anatomy & Physiology".
- Input Data: Placeholder for the text to be classified.
- Output Format: Specifies that the output should be the full path.
4.3. Top-down Multi-step Hierarchical Label Prediction Strategy (TMH)
The Top-down Multi-step Hierarchical Label Prediction (TMH) strategy breaks down the HTC problem into a sequence of single-level classification steps. The LLM predicts labels level-by-level, from the broadest (top-most) category to the most specific (leaf) category.
4.3.1. Principles
This strategy mimics a human decision-making process where one first categorizes broadly, then refines the classification at each subsequent level. It leverages the LLM's ability to reason conditionally: the prediction at a given depth guides the candidate labels presented at the next deeper depth. This reduces the number of candidate labels at each step compared to DL (which lists all leaf nodes) or DH (which lists all full paths), potentially making each step easier for the LLM.
4.3.2. Core Methodology
- Step 1: First Depth Prediction: The LLM is prompted to select a label from the first-depth (top-level) categories.
  - The prompt includes the input text and the candidate labels for the first depth.
  - The LLM predicts a first-depth label.
- Subsequent Steps: Deeper Depth Prediction: For each subsequent depth d+1:
  - The LLM is prompted again with the input text.
  - Crucially, the candidate labels provided for this step are limited to only the child labels of the label predicted at the previous depth d. This means the search space is narrowed dynamically.
  - The LLM predicts a depth-(d+1) label.
- Iteration: This process continues until a leaf node is reached or the deepest level of the hierarchy is classified.
- Handling Non-Exact Matches: A key challenge with LLMs is that they might not always return an output that perfectly matches one of the provided candidate labels. To address this:
  - If the LLM's predicted label is not found in the current set of candidate labels, the system identifies the candidate label with the smallest Levenshtein distance from the LLM's output.
  - The child labels corresponding to this closest-matching candidate label are then used as the candidate set for the next depth level. This ensures the process can continue even with slight LLM output variations.

The following figure (Figure 3 from the original paper) visually represents the Top-down Multi-step Hierarchical Label Prediction Strategy:

Fig. 3: Top-down Multi-step Hierarchical Label Prediction Strategy. In this figure, since the LLM selects "Computer Science" at the 1st depth, the approach provides only the child nodes of "Computer Science" as the candidate labels in the prompt at the 2nd depth; only its direct child categories (e.g., "Artificial Intelligence", "Software Engineering") are presented as candidates.

The prompts for each depth in TMH are structurally similar to the DL strategy (Figure 4), but the Candidates section changes dynamically to include only the relevant child labels based on the previous step's prediction. A minimal sketch of the full TMH loop follows.
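The sketch below is illustrative, not the authors' code. It assumes a nested-dict hierarchy (as in the DH sketch) and a hypothetical `query_llm(prompt) -> str` helper wrapping the chat API; the Levenshtein fallback reuses the matching idea from Section 3.1.

```python
def levenshtein(a: str, b: str) -> int:
    """Standard DP edit distance (same helper as the Section 3.1 sketch)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def classify_tmh(text: str, hierarchy: dict, query_llm) -> list[str]:
    """TMH sketch: one LLM call per depth, with candidates narrowed to the
    children of the previously predicted label."""
    path, level = [], hierarchy
    while level:  # an empty dict marks a leaf, ending the descent
        candidates = list(level)
        prompt = (
            "Select the one label from the Candidates that best matches the text.\n"
            f"Candidates: {', '.join(candidates)}\n"
            f"Text: {text}\n"
            "Respond with only the chosen label."
        )
        raw = query_llm(prompt).strip()
        # Fallback: if the output is not an exact candidate, snap to the
        # candidate with the smallest Levenshtein distance.
        predicted = raw if raw in candidates else min(
            candidates, key=lambda c: levenshtein(raw, c))
        path.append(predicted)
        level = level[predicted]  # the children become the next candidates
    return path
```

Note that each iteration issues a separate API call whose prompt repeats the input text (and, in the few-shot setting, the examples), which is why TMH's cumulative token count grows quickly in Section 6.2.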
5. Experimental Setup
5.1. Datasets
Experiments were conducted using two real-world datasets: Web of Science (WOS) and Amazon Product Reviews (APR).
5.1.1. Web of Science (WOS)
- Source: A collection of 46,985 published papers from the Web of Science [1].
- Characteristics: For each article, the abstract is used as the input text. The domains are used as first-depth labels, and keywords are used as second-depth labels.
- Hierarchy Depth: 2 depths.
- Data Splits: The original WOS dataset lacks predefined train/valid/test splits. The study used ChatGPT-Cheat? [18] and TimeTravel-in-LLMs [19] for contamination assessment and instance-level analysis. 1,800 uncontaminated instances were selected as the test set. From the remaining data, 1,250 instances were randomly sampled for the training set (used for few-shot examples).
5.1.2. Amazon Product Reviews (APR)
- Source: Review and product categories scraped from amazon.com and published on kaggle.com [17].
- Characteristics: Originally, 40,000 records with three tiers of labels. The study used a subset of these records.
- Hierarchy Depth: 3 depths.
- Data Splits: The original dataset is split into train and validation sets. As with WOS, 1,800 uncontaminated instances were selected as the test set, and 1,250 instances were randomly sampled for the training set.

The following are the results from Table 1 of the original paper:

| dataset name | #(data) train | #(data) test | #(candidate labels) 1st | 2nd | 3rd |
| --- | --- | --- | --- | --- | --- |
| Web of Science | 1,250 | 1,800 | 7 | 136 | - |
| Amazon Product Reviews | 1,250 | 1,800 | 6 | 62 | 309 |

- WOS: Has a shallow hierarchy with 7 labels at the 1st depth and 136 at the 2nd. The third depth is not applicable.
- APR: Has a deeper hierarchy with 6 labels at the 1st depth, 62 at the 2nd, and 309 at the 3rd.

These datasets were chosen because they represent different complexities in hierarchical structures (2-depth vs. 3-depth) and are common benchmarks in HTC research, making them effective for validating the method's performance across varying hierarchical depths.
5.1.3. Data Contamination Checks
To ensure the integrity of the evaluation and prevent data contamination (where LLMs might have seen the test data during their training), ChatGPT-Cheat? [18] and TimeTravel-in-LLMs [19] were employed.
- ChatGPT-Cheat? was used to check for dataset-level contamination. It classifies responses into 'contaminated', 'suspicious', 'safety-filtered', and 'clean'. Prompts varied to include or exclude split names (train/validation) and format types.
- TimeTravel-in-LLMs was used for a more rigorous instance-level contamination assessment.
- Results: No contamination or suspicious cases were detected for either dataset, confirming their validity for evaluating LLM performance.

The following figure (Figure 6 from the original paper) shows the prompts used for contamination validation; panel (a) is the prompt without the split name, and panel (b) is the prompt including it.

Fig. 6: The prompts of ChatGPT-Cheat? to validate data contamination. The dataset-name placeholder is replaced with a target dataset name, the split-name placeholder with a target split name, and the format-type placeholder with a target data format type.
5.2. Evaluation Metrics
The performance is evaluated using accuracy at different depths of the hierarchy and conditional accuracy for child labels. Before evaluation, text normalization (removing symbols and decapitalizing text) is applied to both LLM outputs and ground truth labels to account for slight variations in LLM responses.
5.2.1. Accuracy at Depth d (ACC_d)
- Conceptual Definition: Accuracy at depth d measures the proportion of instances whose predicted label at that specific hierarchical level exactly matches the true (ground truth) label at the same level. It quantifies the overall correctness of classification at a particular depth.
- Mathematical Formula: $ ACC_d = \frac{\sum_{i=1}^{N} \mathbb{I}(\hat{y}_{i,d} = y_{i,d})}{N} $
- Symbol Explanation:
  - $N$: Total number of instances in the test set.
  - $\hat{y}_{i,d}$: The predicted label for instance $i$ at depth $d$.
  - $y_{i,d}$: The true (ground truth) label for instance $i$ at depth $d$.
  - $\mathbb{I}(\cdot)$: The indicator function, which returns 1 if the condition inside is true, and 0 otherwise.
5.2.2. Conditional Accuracy ($P(p_{d+1}^{True} \mid p_d^{True})$)
- Conceptual Definition: This metric, denoted $P(p_{d+1}^{True} \mid p_d^{True})$, measures the accuracy of predicting a child label at depth $d+1$, given that the parent label at depth $d$ was already correctly predicted. It assesses the consistency and correctness of the LLM's hierarchical reasoning as it moves down the hierarchy. A high value indicates that once a correct parent is identified, the model is good at finding the correct child within that branch.
- Mathematical Formula: $ P(p_{d+1}^{True} \mid p_d^{True}) = \frac{\sum_{i=1}^{N} \mathbb{I}(\hat{y}_{i,d+1} = y_{i,d+1} \land \hat{y}_{i,d} = y_{i,d})}{\sum_{i=1}^{N} \mathbb{I}(\hat{y}_{i,d} = y_{i,d})} $
- Symbol Explanation:
  - $N$: Total number of instances in the test set.
  - $\hat{y}_{i,d+1}$: The predicted label for instance $i$ at depth $d+1$.
  - $y_{i,d+1}$: The true label for instance $i$ at depth $d+1$.
  - $\hat{y}_{i,d}$: The predicted label for instance $i$ at depth $d$.
  - $y_{i,d}$: The true label for instance $i$ at depth $d$.
  - $\mathbb{I}(\cdot)$: The indicator function.
  - $\land$: Logical AND operator.
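Both metrics, together with the text normalization described above, are straightforward to implement. Below is an illustrative sketch (the normalization regex is one plausible reading of "removing symbols and decapitalizing", not the authors' exact code); predictions and ground truths are lists of label paths, one list of labels per instance.

```python
import re

def normalize(label: str) -> str:
    """One plausible normalization: lowercase, then drop non-alphanumerics."""
    return re.sub(r"[^a-z0-9 ]", "", label.lower()).strip()

def accuracy_at_depth(preds, truths, d):
    """ACC_d: fraction of instances whose depth-d label matches (depth is 1-indexed)."""
    hits = sum(normalize(p[d - 1]) == normalize(t[d - 1])
               for p, t in zip(preds, truths))
    return hits / len(preds)

def conditional_accuracy(preds, truths, d):
    """P(p_{d+1}^True | p_d^True): depth-(d+1) accuracy among instances
    already correct at depth d."""
    correct_at_d = [(p, t) for p, t in zip(preds, truths)
                    if normalize(p[d - 1]) == normalize(t[d - 1])]
    if not correct_at_d:
        return 0.0
    hits = sum(normalize(p[d]) == normalize(t[d]) for p, t in correct_at_d)
    return hits / len(correct_at_d)

# Tiny example with hypothetical 2-depth label paths:
preds  = [["CS", "ML"], ["CS", "CV"], ["Med", "Anatomy"]]
truths = [["CS", "ML"], ["CS", "ML"], ["CS", "Anatomy"]]
print(accuracy_at_depth(preds, truths, 1))     # 2/3: two correct depth-1 labels
print(conditional_accuracy(preds, truths, 1))  # 1/2: one correct child among those
```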
5.3. Baselines
The black box LLM-based prompting strategies are compared against Hierarchy-aware Prompt Tuning for Hierarchical Text Classification (HPT) [11].
- HPT: This is a non-LLM, machine learning based approach. It leverages a transformer-based architecture (like BERT) and integrates hierarchical label dependencies into prompt tuning to enhance classification accuracy.
- Parameters: For HPT, batch_size was set to 16, with all other parameters at their default values, using the official implementation.
- Representativeness: HPT is a strong and relatively recent baseline for HTC that explicitly incorporates hierarchical information using PLMs, making it a suitable comparison for assessing the effectiveness of LLM-based prompting.
5.4. LLM Configuration
- Model: gpt-4o-mini (specifically, gpt-4o-mini-2024-07-18) provided by OpenAI.
- Parameters: temperature and top_p were both set to 1.0.
  - temperature: Controls the randomness of the output. Higher values (like 1.0) make the output more diverse and creative, while lower values make it more deterministic. This choice allows the LLM to explore a wider range of possible label interpretations.
  - top_p: Controls the diversity of the output by sampling from the most probable tokens whose cumulative probability exceeds p. Setting it to 1.0 means considering all possible tokens, again prioritizing diversity.
- Prompting Settings: Experiments were conducted in zero-shot (no examples in the prompt) and various few-shot settings (1, 3, 5, 10, 20 examples). For few-shot prompting, examples were randomly sampled from the training data for each dataset; a sketch of assembling such a prompt follows.
6. Results & Analysis
6.1. Core Results Analysis
The experiments compared the three prompt strategies (DL, DH, TMH) in zero-shot and few-shot settings against the HPT machine learning model on two datasets: Web of Science (WOS) and Amazon Product Reviews (APR).
6.1.1. Web of Science Dataset Results (Shallow Hierarchy)
The following are the results from Table 2 of the original paper:
| Method | #(Few Shot) | ACC1 | P(p2True\|p1True) | ACC2 |
| --- | --- | --- | --- | --- |
| Machine Learning Model | | | | |
| HPT | - | 0.826 | 0.655 | 0.571 |
| Prompt Strategies | | | | |
| DL | 0 | 0.677 | 0.581 | 0.393 |
| DL | 1 | 0.707 | 0.604 | 0.427 |
| DL | 3 | 0.708 | 0.620 | 0.439 |
| DL | 5 | 0.713 | 0.617 | 0.440 |
| DL | 10 | 0.712 | 0.605 | 0.431 |
| DL | 20 | 0.710 | 0.611 | 0.434 |
| DH | 0 | 0.627 | 0.601 | 0.401 |
| DH | 1 | 0.693 | 0.598 | 0.434 |
| DH | 3 | 0.688 | 0.579 | 0.417 |
| DH | 5 | 0.691 | 0.572 | 0.413 |
| DH | 10 | 0.688 | 0.567 | 0.407 |
| DH | 20 | 0.684 | 0.575 | 0.416 |
| TMH | 0 | 0.616 | 0.652 | 0.405 |
| TMH | 1 | 0.654 | 0.664 | 0.436 |
| TMH | 3 | 0.652 | 0.665 | 0.434 |
| TMH | 5 | 0.651 | 0.653 | 0.427 |
| TMH | 10 | 0.656 | 0.657 | 0.433 |
| TMH | 20 | 0.654 | 0.663 | 0.437 |
- HPT Baseline Performance: The HPT machine learning model significantly outperforms all LLM-based prompting strategies on the WOS dataset, achieving ACC1 = 0.826, P(p2True|p1True) = 0.655, and ACC2 = 0.571. This dataset has a relatively shallow hierarchy (2 depths) and a moderate number of labels (7 at the 1st depth, 136 at the 2nd).
- Few-shot vs. Zero-shot: For LLMs, few-shot prompting consistently improves performance over zero-shot prompting. For example, DL improves from ACC1 = 0.677 (0-shot) to ACC1 = 0.713 (5-shot). Similar trends are observed for DH and TMH.
- Best LLM Strategy: DL with 5-shot prompting achieves the highest ACC1 (0.713) and ACC2 (0.440) among LLM strategies. TMH with 3-shot prompting has the highest conditional accuracy (0.665), suggesting its top-down approach helps maintain consistency if the first level is correct.
- Overall LLM Performance: Even with few-shot prompting, the LLM strategies do not reach the performance of the HPT model on this dataset. This suggests that for shallow hierarchies, with more underlying training data available per label, dedicated machine learning models can still hold an advantage.
6.1.2. Amazon Product Reviews Dataset Results (Deeper Hierarchy)
The following are the results from Table 3 of the original paper:
| Method | #(Few Shot) | ACC1 | P(p2True\|p1True) | ACC2 | P(p3True\|p2True) | ACC3 |
| --- | --- | --- | --- | --- | --- | --- |
| Machine Learning Model | | | | | | |
| HPT | - | 0.823 | 0.657 | 0.556 | 0.641 | 0.377 |
| Prompt Strategies | | | | | | |
| DL | 0 | 0.637 | 0.561 | 0.357 | 0.720 | 0.257 |
| DL | 1 | 0.667 | 0.629 | 0.419 | 0.768 | 0.322 |
| DL | 3 | 0.693 | 0.675 | 0.468 | 0.785 | 0.367 |
| DL | 5 | 0.690 | 0.688 | 0.474 | 0.783 | 0.372 |
| DL | 10 | 0.701 | 0.679 | 0.476 | 0.788 | 0.375 |
| DL | 20 | 0.709 | 0.707 | 0.502 | 0.781 | 0.392 |
| DH | 0 | 0.817 | 0.718 | 0.591 | 0.782 | 0.491 |
| DH | 1 | 0.854 | 0.718 | 0.616 | 0.784 | 0.510 |
| DH | 3 | 0.862 | 0.732 | 0.633 | 0.770 | 0.507 |
| DH | 5 | 0.868 | 0.733 | 0.640 | 0.769 | 0.517 |
| DH | 10 | 0.867 | 0.744 | 0.649 | 0.769 | 0.521 |
| DH | 20 | 0.854 | 0.744 | 0.646 | 0.796 | 0.532 |
| TMH | 0 | 0.847 | 0.680 | 0.576 | 0.754 | 0.436 |
| TMH | 1 | 0.824 | 0.679 | 0.560 | 0.783 | 0.440 |
| TMH | 3 | 0.828 | 0.673 | 0.558 | 0.793 | 0.442 |
| TMH | 5 | 0.825 | 0.678 | 0.560 | 0.811 | 0.455 |
| TMH | 10 | 0.836 | 0.681 | 0.570 | 0.842 | 0.481 |
| TMH | 20 | 0.828 | 0.691 | 0.573 | 0.853 | 0.490 |
- HPT Baseline Performance: On the APR dataset (a deeper hierarchy with 3 depths), HPT achieves ACC1 = 0.823, ACC2 = 0.556, and ACC3 = 0.377.
- LLMs Outperform HPT on the Deeper Hierarchy: In contrast to WOS, several LLM-based strategies, particularly DH and TMH in few-shot settings, achieve performance comparable to, or better than, HPT on this dataset.
  - DH (5-shot) achieves the highest ACC1 (0.868) and a high ACC2 (0.640), surpassing HPT significantly.
  - DH (20-shot) achieves the highest ACC3 (0.532) and a conditional accuracy P(p3True|p2True) of 0.796, again outperforming HPT.
  - TMH also shows strong performance, with TMH (20-shot) achieving ACC3 = 0.490 and the highest P(p3True|p2True) of 0.853.
- Effectiveness of Few-shot Prompting: The improvement from zero-shot to few-shot is even more pronounced for APR. For example, DH improves from ACC1 = 0.817 (0-shot) to ACC1 = 0.868 (5-shot). This suggests that for more complex hierarchical tasks, in-context learning is highly beneficial for LLMs.
- Strategy Comparison:
  - DH generally exhibits the strongest performance for APR, particularly in ACC1, ACC2, and ACC3, indicating its effectiveness in directly predicting full hierarchical paths in a deeper structure.
  - TMH shows very high conditional accuracy (P(p3True|p2True) of 0.853 at 20-shot), implying that once a parent is correctly classified, it is very good at finding the correct child.
  - DL shows consistent improvement with few-shot examples but generally lags behind DH and TMH on this deeper hierarchy.
6.1.3. Overall Trends
- Few-shot Advantage: Across both datasets and all strategies, few-shot prompting consistently leads to higher accuracy than zero-shot prompting. This reinforces the importance of in-context learning for leveraging LLM capabilities in specific tasks.
- Hierarchy Depth Matters: For shallow hierarchies (WOS), the traditional machine learning model (HPT) still maintains an edge. However, for deeper hierarchies (APR), where data sparsity becomes more severe for traditional models, LLMs (especially DH and TMH with few-shot prompting) demonstrate superior or comparable performance. This suggests LLMs are particularly valuable in complex, low-resource HTC scenarios.
6.2. Cost Analysis
The cost analysis focuses on the average number of input tokens (prompt tokens) and output tokens (completion tokens) required for different few-shot settings and prompt strategies.
The following are the results from Table 4 of the original paper:
| dataset | prompt | 0 | 1 | 3 | 5 | 10 | 20 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| prompt tokens | | | | | | | |
| WOS | DL | 833.33 | 1105.00 | 1662.39 | 2210.69 | 3594.35 | 6326.98 |
| WOS | DH | 1249.33 | 1523.39 | 2080.72 | 2642.91 | 4034.88 | 6822.23 |
| WOS | TMH | 783.70 | 1305.11 | 2389.67 | 3491.28 | 6250.63 | 11755.44 |
| APR | DL | 1337.16 | 1440.54 | 1653.19 | 1866.96 | 2377.60 | 3424.61 |
| APR | DH | 3354.16 | 3465.70 | 3689.27 | 3912.13 | 4460.17 | 5574.73 |
| APR | TMH | 511.23 | 828.81 | 1444.18 | 2057.71 | 3559.82 | 6472.83 |
| completion tokens | | | | | | | |
| WOS | DL | 4.47 | 3.90 | 3.64 | 3.67 | 3.47 | 3.83 |
| WOS | DH | 6.30 | 6.23 | 6.21 | 6.22 | 6.23 | 6.33 |
| WOS | TMH | 7.51 | 6.81 | 7.07 | 6.78 | 6.95 | 7.03 |
| APR | DL | 4.49 | 3.83 | 3.92 | 4.03 | 4.08 | 4.11 |
| APR | DH | 9.99 | 10.06 | 10.10 | 10.08 | 10.10 | 10.07 |
| APR | TMH | 12.58 | 11.32 | 11.45 | 11.25 | 11.51 | 11.33 |

The numeric column headers (0, 1, 3, 5, 10, 20) give the number of few-shot examples.
- Prompt Tokens Dominate Cost: The number of prompt tokens increases significantly with the number of few-shot examples, and this is the primary driver of computational cost. For instance, DL on WOS jumps from about 833 tokens (0-shot) to over 6,300 tokens (20-shot); TMH on WOS reaches almost 12,000 tokens at 20-shot.
- Completion Token Stability: In contrast, completion tokens (output tokens) remain relatively stable across settings and prompt types, indicating that the cost is largely determined by the input prompt size rather than the generated output.
- Strategy-specific Cost Characteristics:
  - DH: Requires the highest number of prompt tokens at low to moderate shot counts, especially on the deeper APR dataset (from roughly 3,350 tokens at 0-shot to roughly 5,575 at 20-shot). This is expected, as DH includes full hierarchical paths as candidate labels in the prompt, which become very long for deeper hierarchies. This higher token count translates directly to higher API costs.
  - DL: Shows a more moderate increase in token usage. It generally requires fewer tokens than DH because it lists only leaf labels, rather than full paths.
  - TMH: Starts with relatively few tokens in zero-shot but scales dramatically with few-shot examples. For example, on WOS, TMH at 20-shot uses about 11,755 tokens, significantly more than DL or DH at the same shot count. This is because TMH issues multiple prompts (one per depth level), and each prompt includes the few-shot examples; as the number of depths and few-shot examples increases, the cumulative token count for TMH can become very high.
- Accuracy-Cost Trade-off: The cost analysis highlights the crucial trade-off. While DH and TMH can achieve higher accuracy, particularly on deeper hierarchies with few-shot examples, they also incur significantly higher API costs due to increased prompt token usage; DH's high accuracy on APR comes with a correspondingly high prompt token count.
6.3. Ablation Studies / Parameter Analysis
The paper does not explicitly present ablation studies or detailed parameter analyses (e.g., the impact of temperature or top_p values). The focus is on comparing the three prompting strategies and few-shot vs. zero-shot settings with a fixed LLM configuration. The number of few-shot examples (#(Few Shot)) is the key parameter varied, showing a general trend of increasing accuracy with more examples, up to a certain point (e.g., DL at 5-shot for WOS, DH at 10-shot for APR).
7. Conclusion & Reflections
7.1. Conclusion Summary
This study successfully demonstrated the feasibility of using black box Large Language Models (LLMs) via APIs for Hierarchical Text Classification (HTC). By evaluating three prompting strategies (Direct Leaf Label Prediction (DL), Direct Hierarchical Label Prediction (DH), and Top-down Multi-step Hierarchical Label Prediction (TMH)) in both zero-shot and few-shot settings, the research provided crucial insights into their accuracy and cost-effectiveness.
The key findings are:
- Few-shot prompting consistently and significantly enhances classification accuracy across all strategies and datasets, underscoring the value of in-context learning.
- LLM-based methods, particularly the DH strategy, show strong performance on datasets with deeper hierarchies (Amazon Product Reviews), often outperforming traditional machine learning models (HPT). This highlights their potential in low-resource scenarios where data sparsity is a significant challenge for traditional approaches.
- A critical trade-off exists between accuracy and computational cost. Strategies like DH, while achieving high accuracy, incur substantial API costs due to increased input token requirements for deeper hierarchies and more few-shot examples. This implies that careful strategy selection is paramount to balance performance and financial considerations.

The study concludes that prompting-based HTC with black box LLMs is a viable and often superior alternative to traditional machine learning models, especially for complex hierarchical structures, provided that computational cost is meticulously managed.
7.2. Limitations & Future Work
The authors acknowledge several limitations:
- Limited LLM Scope: The analysis was restricted to OpenAI's GPT-4o mini. Future work should broaden this to include other black box LLMs (e.g., from Google, Anthropic) and explore fine-tuned white box LLMs to gain a more comprehensive understanding of cost-effectiveness and performance trade-offs across different model types and deployment strategies.
- Hierarchy Depth: Experiments were conducted on datasets with only two- to three-depth hierarchies. While DH showed promise for deeper hierarchies, further experiments on more deeply nested and complex datasets are needed to confirm these benefits and assess the scalability of the prompting strategies.
- Specific Prompt Engineering: The study used predefined prompt templates. Future work could involve more advanced prompt engineering techniques, such as chain-of-thought prompting or self-consistency, to further improve reasoning capabilities and accuracy, potentially with cost implications.
7.3. Personal Insights & Critique
This paper offers a highly practical perspective on HTC by focusing on black box LLMs and API usage. The explicit cost analysis is particularly valuable, as it translates theoretical performance into real-world deployment considerations, a factor often overlooked in research. The finding that LLMs can outperform traditional ML on deeper hierarchies is significant, suggesting a niche where their extensive pre-trained knowledge helps overcome data scarcity at granular levels.
Potential Issues/Unverified Assumptions:
- Generalizability of gpt-4o-mini: While gpt-4o-mini is cost-effective, its performance might not fully represent the capabilities of larger, more expensive black box LLMs (e.g., GPT-4). The "mini" version might have inherent limitations not present in its larger counterparts, potentially understating the true potential of LLMs for HTC.
- Robustness to Prompt Variations: The study used fixed prompt templates. LLMs can be sensitive to subtle changes in prompt wording, and the robustness of these strategies to minor variations in prompt engineering or few-shot example selection is not explored.
- Error Propagation in TMH: The Top-down Multi-step Hierarchical Label Prediction (TMH) strategy inherently suffers from error propagation: an incorrect prediction at an upper level will inevitably lead to incorrect candidate labels, and thus incorrect predictions, at lower levels. While Levenshtein distance helps recover from slight mismatches, it cannot fix a fundamentally wrong high-level classification. TMH's conditional accuracy is high, but its overall accuracy is generally lower than DH's on APR, indicating that initial errors may be dragging down its end-to-end performance.
- Implicit Knowledge vs. Explicit Structure: The DH strategy performs well on deeper hierarchies, which suggests that the LLM is effectively leveraging the explicit hierarchical path provided in the prompt. However, it is not entirely clear whether the LLM truly "understands" the hierarchical relationships or is simply performing sophisticated pattern matching on the long path strings. Further probing might reveal the depth of this understanding.

Transferability and Applications: The methodology of adapting HTC to prompting strategies is highly transferable. This approach could be applied to any domain where hierarchical classification is needed and labeled data is sparse, such as:

- Legal Document Classification: Classifying legal cases or documents into complex legal codes and sub-sections.
- Biological Taxonomy: Categorizing species or biological entities based on their hierarchical classifications.
- Customer Support Ticket Routing: Directing customer inquiries through a multi-level support hierarchy to the most appropriate department or agent.
- Patent Classification: Assigning patents to a deep, specialized classification system.

The study provides a strong foundation for future work, especially in optimizing cost-performance for practical HTC applications using black box LLMs. The trade-off identified is crucial for practitioners, enabling them to make informed decisions based on their specific accuracy requirements and budget constraints.