
Hierarchical Text Classification Using Black Box Large Language Models

Published: 08/06/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

This study investigates the use of black box Large Language Models for Hierarchical Text Classification, evaluating three prompting strategies in zero-shot and few-shot settings. Results show that few-shot settings improve accuracy, especially with the DH strategy, although API costs rise significantly for deeper label hierarchies, underscoring a trade-off between accuracy and cost.

Abstract

Hierarchical Text Classification (HTC) aims to assign texts to structured label hierarchies; however, it faces challenges due to data scarcity and model complexity. This study explores the feasibility of using black box Large Language Models (LLMs) accessed via APIs for HTC, as an alternative to traditional machine learning methods that require extensive labeled data and computational resources. We evaluate three prompting strategies -- Direct Leaf Label Prediction (DL), Direct Hierarchical Label Prediction (DH), and Top-down Multi-step Hierarchical Label Prediction (TMH) -- in both zero-shot and few-shot settings, comparing the accuracy and cost-effectiveness of these strategies. Experiments on two datasets show that a few-shot setting consistently improves classification accuracy compared to a zero-shot setting. While a traditional machine learning model achieves high accuracy on a dataset with a shallow hierarchy, LLMs, especially DH strategy, tend to outperform the machine learning model on a dataset with a deeper hierarchy. API costs increase significantly due to the higher input tokens required for deeper label hierarchies on DH strategy. These results emphasize the trade-off between accuracy improvement and the computational cost of prompt strategy. These findings highlight the potential of black box LLMs for HTC while underscoring the need to carefully select a prompt strategy to balance performance and cost.

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

Hierarchical Text Classification Using Black Box Large Language Models

1.2. Authors

Kosuke Yoshimura and Hisashi Kashima, both affiliated with Kyoto University, Kyoto, Japan.

1.3. Journal/Conference

This paper is a preprint published on arXiv (https://arxiv.org/abs/2508.04219), a repository for preprints of scientific papers. As a preprint, it has not yet undergone formal peer review, but arXiv is a widely recognized platform for disseminating early research in fields like computer science, machine learning, and natural language processing.

1.4. Publication Year

2025

1.5. Abstract

The paper addresses the challenges of Hierarchical Text Classification (HTC), such as data scarcity and model complexity. It investigates the use of black box Large Language Models (LLMs), accessed via APIs, as a lightweight alternative to traditional machine learning methods. The study evaluates three prompting strategies—Direct Leaf Label Prediction (DL), Direct Hierarchical Label Prediction (DH), and Top-down Multi-step Hierarchical Label Prediction (TMH)—in both zero-shot and few-shot settings. Experiments on two datasets reveal that few-shot settings consistently improve accuracy. While traditional machine learning models perform well on shallow hierarchies, LLMs, particularly with the DH strategy, can outperform them on deeper hierarchies. The study also highlights a significant increase in API costs for deeper label hierarchies with the DH strategy due to higher input token requirements, emphasizing a trade-off between accuracy and computational cost. The findings suggest that black box LLMs have potential for HTC, provided prompt strategies are carefully chosen to balance performance and cost.

https://arxiv.org/abs/2508.04219 (Preprint)

https://arxiv.org/pdf/2508.04219v1.pdf (Preprint)

2. Executive Summary

2.1. Background & Motivation

The core problem the paper aims to solve is the efficient and accurate classification of texts into hierarchically structured labels, known as Hierarchical Text Classification (HTC). This problem is crucial for organizing vast amounts of digital content, such as medical texts, academic articles, and product reviews, enabling efficient information retrieval.

The main challenges in HTC that prior research faces are:

  1. Data Scarcity: As the number of hierarchical labels grows (often hundreds to thousands), obtaining sufficient labeled training data for each specific category becomes increasingly difficult, leading to sparse data.

  2. Model Complexity: Traditional machine learning and deep learning models struggle with the vast label space, often leading to issues like overfitting (model learns training data too well, performs poorly on new data) or underfitting (model is too simple to capture patterns) due to insufficient data per label. Furthermore, these models typically require extensive computational resources for training and fine-tuning.

    The paper's innovative idea is to leverage the capabilities of Large Language Models (LLMs)—specifically, black box LLMs accessible via APIs—for HTC. LLMs have demonstrated strong zero-shot (no examples) and few-shot (minimal examples) learning abilities, offering a potential solution to data scarcity and the need for complex, resource-intensive model training. By using black box LLMs, the study seeks a lightweight and practical implementation that avoids the substantial computational overhead required for deploying and fine-tuning white box LLMs (publicly available or self-trained models on one's own resources).

2.2. Main Contributions / Findings

The paper's primary contributions are:

  • Adaptation of HTC Strategies to LLM Prompting: The study applies three distinct prompting strategies—Direct Leaf Label Prediction (DL), Direct Hierarchical Label Prediction (DH), and Top-down Multi-step Hierarchical Label Prediction (TMH)—to adapt typical HTC approaches for use with black box LLMs.

  • Comprehensive Evaluation of Prompting Strategies: It conducts a thorough evaluation of these prompting strategies on real-world datasets, assessing both classification accuracy and API invocation cost.

  • Comparison of Zero-shot and Few-shot Performance: The research compares the performance of LLMs in zero-shot settings (no in-context examples) against few-shot settings (with a small number of in-context examples) to highlight the effectiveness and limitations of black box LLMs for HTC.

    The key conclusions and findings are:

  • Few-shot Learning Improves Accuracy: A few-shot setting consistently improves classification accuracy compared to a zero-shot setting across all evaluated prompting strategies. This highlights the value of providing even a small number of examples to LLMs for HTC tasks.

  • LLMs vs. Traditional ML: While a traditional machine learning model (HPT) achieves high accuracy on datasets with shallow hierarchies (like Web of Science), LLMs (especially the DH strategy) tend to outperform the machine learning model on datasets with deeper hierarchies (like Amazon Product Reviews). This suggests LLMs are particularly suited for complex hierarchical structures where traditional models might struggle with data sparsity at deeper levels.

  • Accuracy-Cost Trade-off: API costs increase significantly with deeper label hierarchies and the DH strategy due to the higher number of input tokens required. This underscores a crucial trade-off: improved accuracy, especially with more complex strategies and few-shot examples, comes with a higher computational (and monetary) cost.

  • Potential of Black Box LLMs: The findings highlight the potential of black box LLMs for HTC, offering a viable alternative to traditional machine learning models, especially in low-resource scenarios (where limited labeled data is available). However, careful selection of a prompt strategy is essential to balance performance and cost-effectiveness.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

  • Hierarchical Text Classification (HTC): This is a specialized type of text classification where the target labels are organized in a hierarchy, similar to a family tree or a file system. Instead of assigning a text to a single flat category (e.g., "Sports"), HTC aims to classify it into one or more categories that are structured from general to specific (e.g., "Sports" -> "Outdoor Sports" -> "Hiking"). The hierarchy can be a tree (each child has only one parent) or a Directed Acyclic Graph (DAG) (a child can have multiple parents). The goal is to identify the most appropriate path or set of paths down this hierarchy for a given text. This is challenging because of the large number of potential labels and the need to maintain consistency across hierarchical levels.

  • Large Language Models (LLMs): These are advanced artificial intelligence models trained on vast amounts of text data (e.g., internet text, books). They are designed to understand, generate, and process human language. LLMs are typically based on the Transformer architecture, which uses self-attention mechanisms to weigh the importance of different words in a sequence. Key characteristics include:

    • Scale: They have billions or even trillions of parameters, allowing them to capture complex patterns in language.
    • Pre-training: They undergo extensive pre-training on diverse text corpora, learning grammar, facts, reasoning abilities, and common knowledge.
    • Generative Capabilities: They can generate coherent and contextually relevant text, complete tasks like summarization, translation, question answering, and text classification.
  • Black Box LLMs: These refer to LLMs where the internal architecture, parameters, and training data are not publicly accessible. Users interact with them purely through Application Programming Interfaces (APIs). Examples include OpenAI's GPT series (like gpt-4o-mini) or Google's Gemini. The user cannot fine-tune these models or inspect their internal workings directly; they only send an input (a prompt) and receive an output.

  • White Box LLMs: In contrast to black box LLMs, white box LLMs are those whose models or architectures are publicly available or can be deployed and fine-tuned on a user's own computational resources. Examples include models from the Llama series by Meta. While they offer greater control (e.g., fine-tuning for specific tasks), they require substantial computational power and expertise to set up and maintain.

  • Prompting: This is the technique of guiding an LLM to perform a specific task by providing it with carefully crafted text instructions (the prompt). Instead of traditional programming or fine-tuning, prompting involves framing the task as a natural language query or command. For example, to classify text, a prompt might include instructions like "Classify the following text into one of these categories: [list categories]."

  • Zero-shot Learning: In the context of LLMs, zero-shot learning means performing a task without providing any explicit examples of that task in the prompt. The LLM relies solely on its pre-trained knowledge to understand the instructions and generate an appropriate response. For example, classifying a text without showing any prior examples of classified texts.

  • Few-shot Learning: This is an extension of prompting where a small number of examples (typically 1 to 20) are included in the prompt to guide the LLM's behavior for a specific task. These examples serve as "in-context learning" demonstrations, helping the LLM understand the desired format, style, and logic of the task without requiring any weight updates or fine-tuning of the model. For instance, to classify text, the prompt might include a few example texts paired with their correct hierarchical labels before asking the LLM to classify a new text.

  • Levenshtein Distance: Also known as edit distance, it is a metric for measuring the difference between two sequences (e.g., strings). It quantifies the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one string into the other. For example, the Levenshtein distance between "kitten" and "sitting" is 3 (k -> s, e -> i, insert g). In this paper, it is used to find the closest matching label when an LLM's output does not exactly match any of the provided candidate labels.
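As a concrete illustration, the following is a minimal Python sketch (not from the paper; the helper names are ours) of the Levenshtein distance and of the closest-candidate lookup later used to recover from non-exact LLM outputs:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution (0 if equal)
        prev = curr
    return prev[-1]

def closest_label(output: str, candidates: list[str]) -> str:
    """Pick the candidate label with the smallest edit distance to the LLM output."""
    return min(candidates, key=lambda c: levenshtein(output, c))

# levenshtein("kitten", "sitting") == 3
# closest_label("Machine-Learning", ["Machine Learning", "Computer Vision"]) == "Machine Learning"
```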

3.2. Previous Works

The paper discusses related work in three main areas: classical Hierarchical Text Classification, general applications of LLMs, and HTC with LLMs.

3.2.1. Hierarchical Text Classification

  • Classical Machine Learning Approaches: Early methods focused on creating features from input text and applying specialized classification models like Support Vector Machines (SVMs) for hierarchical classification [9,10]. These often involved training separate classifiers for each level or for parent-child relationships.
  • Deep Learning Approaches (e.g., HDLTex [1]): Kowsari et al. (2017) proposed HDLTex (Hierarchical Deep Learning for Text Classification), which uses a distinct deep learning architecture for each level of the hierarchy. It trains models to estimate child labels conditioned on parent labels. While addressing the issue of increasing labels, it still requires a large amount of training data.
  • Hierarchical Label Set Expansion (HLSE [2]): Gargiulo et al. addressed the problem in datasets where experts might not label all levels from root to leaf. They proposed HLSE to infer and complement missing parent-child relationships, enhancing the dataset.
  • Prompt Tuning for Hierarchical Classification (HPT [11]): Wang et al. (2022) proposed HPT (Hierarchy-aware Prompt Tuning). This method integrates label hierarchy information into dynamic virtual templates and hierarchy-aware label words, bridging the gap between conventional prompt tuning and Pre-trained Language Model (PLM) training tasks. It also introduced a zero-bounded multi-label cross-entropy loss to handle label imbalance and low-resource scenarios. HPT is based on transformer-based architectures like BERT, not full-scale LLMs.
  • Multi-Verbalizer Framework (HierVerb [12]): Ji et al. (2023) proposed HierVerb for few-shot HTC. It directly embeds hierarchical information into layer-specific verbalizers (mapping labels to words or phrases for PLMs). By combining a hierarchy-aware constraint chain and fat hierarchical contrastive loss, it leverages PLM knowledge effectively. Like HPT, HierVerb operates on PLMs such as BERT, not black box LLMs.

3.2.2. Applications of LLMs

  • General LLM Applications: The widespread availability of LLMs like ChatGPT has spurred many studies using them for various tasks [8,13,7].
  • LLMs as Zero-Shot Text Classifiers [7]: Wang et al. (2023) validated the use of LLMs as zero-shot text classifiers. Their work is similar in exploring zero-shot classification but does not consider hierarchical structures, which is a key focus of the current paper.

3.2.3. Hierarchical Text Classification with LLMs

Several recent studies have begun to explore LLMs for HTC:

  • LLMs with Entailment Predictors [4]: Bhambhoria et al. (2023) proposed a framework combining LLMs with Entailment Predictors for zero-shot hierarchical classification. They convert HTC into a long-tail prediction task. This differs from the current paper by focusing on a combined framework rather than solely prompting strategies and does not explicitly compare strategies considering hierarchical structure within prompts.
  • TELEClass [14]: Zhang et al. (2024) introduced TELEClass, a method for high-performance HTC model training using LLMs for annotation and taxonomy expansion. Their research primarily focuses on fine-tuning a pre-trained model with minimal supervision and is limited to zero-shot settings for LLM usage. The current paper, in contrast, aims to compare accuracy and cost across different prompting strategies in both zero-shot and few-shot settings, without fine-tuning.
  • Retrieval-based In-context Learning (ICL) [15]: Chen et al. (2024) proposed a retrieval-based ICL approach for few-shot HTC. This method requires task-specific training and building a retrieval database. The current paper, however, exclusively investigates black box LLMs relying solely on prompting, without external retrieval or fine-tuning, making it a more lightweight approach.
  • Single-pass HTC with LLMs [16]: Schmidt et al. (2024) focused on zero-shot HTC, using LLMs with hierarchical label structures to improve performance. Their study is limited to zero-shot experiments, while the current paper systematically examines both zero-shot and few-shot settings, providing a more comprehensive analysis of the impact of in-context learning.

3.3. Technological Evolution

Text classification has evolved significantly:

  • Early Methods: Initially, statistical methods (Naive Bayes, SVM) with handcrafted features were common. These struggled with large, high-dimensional text data and complex label structures.

  • Deep Learning Era: The advent of deep learning brought models like Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs) which could automatically learn features from text. For HTC, specialized deep learning architectures were developed (e.g., HDLTex), but they still demanded substantial labeled data.

  • Pre-trained Language Models (PLMs): Models like BERT revolutionized NLP by pre-training on massive corpora and then fine-tuning for downstream tasks. Methods like HPT and HierVerb extended PLM capabilities to HTC. While powerful, they still require task-specific training or fine-tuning.

  • Large Language Models (LLMs) and Prompting: The latest wave involves very large, general-purpose LLMs (e.g., GPT, Llama). These models, due to their vast pre-training, can perform many tasks (including classification) with zero-shot or few-shot prompting, significantly reducing the need for task-specific training data and model complexity.

    This paper's work fits into the most recent stage of this evolution, exploring black box LLMs and prompting as a lightweight, data-efficient, and computationally less demanding alternative to previous methods, particularly for HTC where data scarcity is a major issue.

3.4. Differentiation Analysis

Compared to related work, the core differences and innovations of this paper's approach are:

  • Focus on Black Box LLMs via API: Unlike HPT and HierVerb which use BERT-like PLMs requiring fine-tuning, or white box LLMs which need significant computational resources, this study exclusively uses black box LLMs accessed via APIs. This emphasizes a lightweight, practical, and resource-efficient solution for HTC without training or fine-tuning overhead.

  • Systematic Evaluation of Prompting Strategies for HTC: The paper systematically evaluates different prompting strategies (DL, DH, TMH) specifically designed to adapt traditional HTC approaches to the LLM paradigm. Previous LLM-based HTC studies often focus on frameworks (like combining with entailment predictors [4]) or specific fine-tuning approaches (TELEClass [14]) rather than a direct comparison of diverse prompting mechanisms for black box LLMs.

  • Comprehensive Zero-shot and Few-shot Analysis: While some prior LLM-based HTC research focuses solely on zero-shot settings (TELEClass [14], Schmidt et al. [16]), this paper provides a more thorough analysis by comparing both zero-shot and various few-shot settings across all prompting strategies. This allows for a deeper understanding of the accuracy-cost trade-off associated with in-context learning.

  • Emphasis on Cost-Effectiveness: A unique aspect is the explicit cost analysis in terms of API tokens for different prompting strategies and few-shot examples. This highlights the practical considerations for deploying LLM-based HTC solutions, which is often overlooked in academic evaluations.

  • Addressing Deeper Hierarchies: The paper finds that LLMs, particularly the DH strategy, can outperform traditional ML models on datasets with deeper hierarchies where data scarcity is more pronounced, offering a compelling advantage in such challenging scenarios.

    In essence, this paper provides a practical blueprint and empirical insights into effectively utilizing readily available black box LLMs for HTC by rigorously comparing different prompting strategies and their associated accuracy-cost trade-offs in diverse supervision settings.

4. Methodology

The study aims to provide a low-cost and high-accuracy Hierarchical Text Classification (HTC) solution using only black box LLMs for inference, operating in zero-shot or few-shot settings. This means the LLM itself is not trained or fine-tuned; its capabilities are leveraged solely through carefully crafted prompts.

Given an input text X and a black box large language model f (accessed via an API), the goal is to assign the most accurate labels from a hierarchically structured candidate label set Y to X by devising effective prompting strategies. In a few-shot setting, a small set of labeled examples \{(X_i, y_i)\}_i (where y_i \in Y) is provided as part of the prompt.

The paper evaluates three distinct prompting strategies:

  1. Direct Leaf Label Prediction (DL)

  2. Direct Hierarchical Label Prediction (DH)

  3. Top-down Multi-step Hierarchical Label Prediction (TMH)

    These strategies adapt typical approaches used in HTC to the LLM prompting paradigm.

4.1. Direct Leaf Label Prediction Strategy (DL)

The Direct Leaf Label Prediction (DL) strategy focuses on predicting only the leaf node label (the most specific label at the deepest level of the hierarchy) directly. Once a leaf label is predicted, its hierarchical path (i.e., its parent labels up to the root) is inferred by tracing back through the predefined label hierarchy.

4.1.1. Principles

The core idea is to treat the HTC problem as a flat classification task over all possible leaf labels. The LLM is given the input text and a comprehensive list of all candidate leaf labels. It is then instructed to select the single most appropriate leaf label from this list. This simplifies the prediction task for the LLM by removing the explicit need for multi-level hierarchical reasoning during the initial prediction step. The hierarchy above the leaf is then reconstructed.

4.1.2. Core Methodology

  1. Input Text and Instructions: The LLM receives the input text X and instructions to perform classification.

  2. Candidate Labels: A complete list of all leaf nodes from the candidate label set Y is presented to the LLM.

  3. Prediction: The LLM predicts one leaf label that best matches the input text.

  4. Hierarchy Reconstruction: After the leaf label is predicted, the full hierarchical path from the root to this predicted leaf label is automatically determined by looking up the predefined hierarchy structure.

    The following figure (Figure 1 from the original paper) visually represents the Direct Leaf Label Prediction Strategy:

Fig. 1: Direct Leaf Label Prediction Strategy. The figure shows the user's instructions, candidate labels, and input text on the left, and the label hierarchy on the right, with categories such as Medical Sciences and Computer Science.

The prompt template for the DL strategy includes sections for Instructions, Candidates (listing all leaf labels), and Input Text. The Output Format specifies that the LLM should respond with only the chosen label.

The following figure (Figure 4 from the original paper) shows an example of the prompt template used for the DL strategy on the Web of Science dataset:

Fig. 4: The prompt template for the DL strategy on the Web of Science dataset. The {input data} placeholder is replaced with the actual input text. The template contains the classification instructions, the candidate leaf labels, and the input text placeholder. In this template:

  • Instructions: Guides the LLM on how to perform the classification. It asks the LLM to select an Area label (which corresponds to a leaf label in this context) from the provided candidates that matches the input data.
  • Candidates: Lists all possible leaf labels that the LLM can choose from. For instance, in the Web of Science dataset, these could be specific sub-disciplines.
  • Input Data: The {input data} placeholder is replaced with the actual text that needs to be classified.
  • Output Format: Specifies that the output should be just the selected Area.
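To make the DL workflow concrete, here is a minimal sketch, under assumed helper names and a toy two-level hierarchy, of how such a prompt can be assembled and how the hierarchical path is reconstructed from the predicted leaf; it is illustrative, not the authors' implementation.

```python
# Toy hierarchy stored as a child -> parent map so the path can be traced back.
parent = {
    "Machine Learning": "Computer Science",
    "Computer Vision": "Computer Science",
    "Anatomy & Physiology": "Medical Sciences",
}
leaf_labels = list(parent)  # in this toy example every key is a leaf node

def build_dl_prompt(text: str) -> str:
    """Assemble a DL prompt: instructions, all candidate leaf labels, input text."""
    candidates = "\n".join(f"- {label}" for label in leaf_labels)
    return (
        "Select the single Area label from the candidates that best matches the input.\n"
        f"Candidates:\n{candidates}\n"
        f"Input text:\n{text}\n"
        "Answer with the label only."
    )

def reconstruct_path(leaf: str) -> list[str]:
    """Trace parent links from the predicted leaf up to the root."""
    path = [leaf]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return list(reversed(path))  # root -> ... -> leaf

# Example: if the LLM answered "Machine Learning",
# reconstruct_path("Machine Learning") == ["Computer Science", "Machine Learning"]
```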

4.2. Direct Hierarchical Label Prediction Strategy (DH)

The Direct Hierarchical Label Prediction (DH) strategy instructs the LLM to output the complete hierarchical path of labels directly, from the first depth level down to the leaf node.

4.2.1. Principles

This strategy provides the LLM with full contextual information about the label hierarchy upfront. Instead of individual labels, the LLM is presented with full hierarchical paths as candidate outputs. The intuition is that by seeing the entire path structure, the LLM can better reason about the relationships between parent and child nodes and make a more coherent, hierarchically consistent prediction.

4.2.2. Core Methodology

  1. Input Text and Instructions: The LLM receives the input text X and instructions to classify it into a hierarchical path.

  2. Candidate Paths: Instead of individual labels, the LLM is provided with a list of all possible hierarchical paths from the root to each leaf node in the format "(1st depth) > (2nd depth) > ... > (leaf node)".

  3. Prediction: The LLM directly predicts a single complete hierarchical path from the given candidates that best matches the input text.

    The following figure (Figure 2 from the original paper) visually represents the Direct Hierarchical Label Prediction Strategy:

Fig. 2: Direct Hierarchical Label Prediction Strategy. The figure shows the input text, instructions, and candidate paths on the left, and the label hierarchy on the right, where the root connects to areas such as Medical Sciences and Computer Science, which are further subdivided into specific disciplines and concepts.

The prompt template for the DH strategy is similar to DL but with Candidates defined as full paths.

The following figure (Figure 5 from the original paper) shows an example of the prompt template used for the DH strategy on the Web of Science dataset:

Fig. 5: The prompt template for the DH strategy on the Web of Science dataset. The {input data} placeholder is replaced with the actual input text. The template contains the instructions, the candidate hierarchical paths, and the input text placeholder. In this template:

  • Instructions: Directs the LLM to classify the input data into a specific Area by outputting the full hierarchical path.
  • Candidates: Lists all possible hierarchical paths as options. Each path is structured as "Top-Level Category > Sub-Category > ... > Leaf Category". For example, "Medical Sciences > Anatomy & Physiology".
  • Input Data: The {input data} placeholder is replaced with the text to be classified.
  • Output Format: Specifies that the output should be the full path.
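The main difference from DL is the candidate list. The sketch below, again with hypothetical names and a toy hierarchy, shows one way to enumerate the full root-to-leaf candidate paths that a DH prompt lists:

```python
# Toy hierarchy as nested dicts; leaves have empty child dicts.
hierarchy = {
    "Medical Sciences": {"Anatomy & Physiology": {}},
    "Computer Science": {"Machine Learning": {}, "Computer Vision": {}},
}

def enumerate_paths(tree: dict, prefix=None) -> list[str]:
    """Recursively list every root-to-leaf path as '(1st depth) > ... > (leaf node)'."""
    prefix = prefix or []
    paths = []
    for label, children in tree.items():
        current = prefix + [label]
        if children:
            paths.extend(enumerate_paths(children, current))
        else:
            paths.append(" > ".join(current))
    return paths

# enumerate_paths(hierarchy) ->
# ["Medical Sciences > Anatomy & Physiology",
#  "Computer Science > Machine Learning",
#  "Computer Science > Computer Vision"]
```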

4.3. Top-down Multi-step Hierarchical Label Prediction Strategy (TMH)

The Top-down Multi-step Hierarchical Label Prediction (TMH) strategy breaks down the HTC problem into a sequence of single-level classification steps. The LLM predicts labels level-by-level, from the broadest (top-most) category to the most specific (leaf) category.

4.3.1. Principles

This strategy mimics a human decision-making process where one first categorizes broadly, then refines the classification at each subsequent level. It leverages the LLM's ability to reason conditionally: the prediction at a given depth guides the candidate labels presented at the next deeper depth. This reduces the number of candidate labels at each step compared to DL (which lists all leaf nodes) or DH (which lists all full paths), potentially making each step easier for the LLM.

4.3.2. Core Methodology

  1. Step 1: First Depth Prediction: The LLM is prompted to select a label from the first depth (top-level) categories.
    • The prompt includes the input text and the candidate labels for the first depth.
    • The LLM predicts a first-depth label.
  2. Subsequent Steps: Deeper Depth Prediction: For each subsequent depth d+1:
    • The LLM is prompted again with the input text.
    • Crucially, the candidate labels provided for this step are limited to only the child labels of the label predicted at the previous depth d. This means the search space is narrowed dynamically.
    • The LLM predicts a depth-(d+1) label.
  3. Iteration: This process continues until a leaf node is reached or the deepest level of the hierarchy is classified.
  4. Handling Non-Exact Matches: A key challenge with LLMs is that they might not always return an output that perfectly matches one of the provided candidate labels. To address this:
    • If the LLM's predicted label is not found in the current set of candidate labels, the system identifies the candidate label with the smallest Levenshtein distance from the LLM's output.

    • The child labels corresponding to this closest-matching candidate label are then used as the candidate set for the next depth level. This ensures the process can continue even with slight LLM output variations.

      The following figure (Figure 3 from the original paper) visually represents the Top-down Multi-step Hierarchical Label Prediction Strategy:

Fig. 3: Top-down Multi-step Hierarchical Label Prediction Strategy. In this figure, since the LLM selects "Computer Science" at the 1st depth, the approach provides only the child nodes of "Computer Science" (e.g., "Artificial Intelligence", "Software Engineering") as candidate labels in the prompt for the 2nd depth.

The prompts for each depth in TMH are structurally similar to the DL strategy (Figure 4), but the Candidates section changes dynamically to include only the relevant child labels based on the previous step's prediction.
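Putting the pieces together, the following sketch (hypothetical helpers, not the authors' code) shows the TMH loop: one LLM call per depth, with the candidate set narrowed to the children of the previous prediction and the Levenshtein-based closest_label fallback sketched in Section 3.1 for non-exact outputs.

```python
def predict_tmh(text: str, hierarchy: dict, call_llm, closest_label) -> list[str]:
    """Predict one label per depth, top-down, until a leaf is reached.

    hierarchy maps each label to a dict of its children ({} for leaves);
    call_llm(prompt) queries the black box model and returns its raw answer.
    """
    path = []
    level = hierarchy
    while level:
        candidates = list(level)
        prompt = (
            "Select the label from the candidates that best matches the input.\n"
            f"Candidates: {', '.join(candidates)}\n"
            f"Input text: {text}\n"
            "Answer with the label only."
        )
        answer = call_llm(prompt)
        # Snap non-exact LLM outputs to the nearest candidate by edit distance.
        label = answer if answer in candidates else closest_label(answer, candidates)
        path.append(label)
        level = level[label]  # descend to the children of the chosen label
    return path
```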

5. Experimental Setup

5.1. Datasets

Experiments were conducted using two real-world datasets: Web of Science (WOS) and Amazon Product Reviews (APR).

5.1.1. Web of Science (WOS)

  • Source: A collection of 46,985 published papers from the Web of Science [1].
  • Characteristics: For each article, the abstract is used as the input text. The domains are used as first-depth labels, and keywords are used as second-depth labels.
  • Hierarchy Depth: 2 depths.
  • Data Splits: The original WOS dataset lacks predefined train/valid/test splits. The study used ChatGPT-Cheat? [18] and TimeTravel-in-LLMs [19] for contamination assessment and instance-level analysis. A total of 1,800 uncontaminated instances were selected as the test set, and 1,250 instances were randomly sampled from the remaining data for the training set (used for few-shot examples).

5.1.2. Amazon Product Reviews (APR)

  • Source: Review and Product categories scraped from amazon.com and published on kaggle.com [17].

  • Characteristics: The original dataset contains 40,000 records with three tiers of labels; the study used a subset of these records.

  • Hierarchy Depth: 3 depths.

  • Data Splits: The original dataset is split into train and validation sets. Similar to WOS, 1,800 uncontaminated instances were selected as the test set, and 1,250 instances were randomly sampled for the training set.

    The following are the results from Table 1 of the original paper:

    dataset name              #(train)   #(test)   #(1st labels)   #(2nd labels)   #(3rd labels)
    Web of Science            1,250      1,800     7               136             -
    Amazon Product Reviews    1,250      1,800     6               62              309
  • WOS: Has a shallow hierarchy with 7 labels at the 1st depth and 136 at the 2nd. The third depth is not applicable.

  • APR: Has a deeper hierarchy with 6 labels at the 1st depth, 62 at the 2nd, and 309 at the 3rd.

    These datasets were chosen because they represent different complexities in hierarchical structures (2-depth vs. 3-depth) and are common benchmarks in HTC research, making them effective for validating the method's performance across varying hierarchical depths.

5.1.3. Data Contamination Checks

To ensure the integrity of the evaluation and prevent data contamination (where LLMs might have seen the test data during their training), ChatGPT-Cheat? [18] and TimeTravel-in-LLMs [19] were employed.

  • ChatGPT-Cheat? was used to check for dataset-level contamination. It classifies responses into 'contaminated', 'suspicious', 'safety-filtered', and 'clean'. Prompts varied to include or exclude split names (train/validation) and format types.

  • TimeTravel-in-LLMs was used for a more rigorous instance-level contamination assessment.

  • Results: No contamination or suspicious cases were detected for either dataset, confirming their validity for evaluating LLM performance.

    The following figure (Figure 6 from the original paper) shows the prompts used for contamination validation:

Fig. 6: The prompts of ChatGPT-Cheat? used to validate data contamination. The {dataset_name} placeholder is replaced with the target dataset name, {split} with the target split name, and {format} with the target data format type. The left panel (a) shows the prompt without the split name; the right panel (b) includes the split name.

5.2. Evaluation Metrics

The performance is evaluated using accuracy at different depths of the hierarchy and conditional accuracy for child labels. Before evaluation, text normalization (removing symbols and lowercasing) is applied to both the LLM outputs and the ground truth labels to account for slight variations in LLM responses.

5.2.1. Accuracy at Depth d (ACC_d)

  • Conceptual Definition: Accuracy at depth d measures the proportion of instances where the predicted label at that specific hierarchical level exactly matches the true (ground truth) label at the same level. It quantifies the overall correctness of classification at a particular depth.
  • Mathematical Formula: $ ACC_d = \frac{\sum_{i=1}^{N} \mathbb{I}(\hat{y}_{i,d} = y_{i,d})}{N} $
  • Symbol Explanation:
    • N: Total number of instances in the test set.
    • \hat{y}_{i,d}: The predicted label for instance i at depth d.
    • y_{i,d}: The true (ground truth) label for instance i at depth d.
    • \mathbb{I}(\cdot): The indicator function, which returns 1 if the condition inside is true, and 0 otherwise.

5.2.2. Conditional Accuracy (P(p_{d+1}^{True} | p_d^{True}))

  • Conceptual Definition: This metric, denoted P(p_{d+1}^{True} | p_d^{True}), measures the accuracy of predicting a child label at depth d+1, given that the parent label at depth d was already correctly predicted. It assesses the consistency and correctness of the LLM's hierarchical reasoning as it moves down the hierarchy. A high value indicates that once a correct parent is identified, the model is good at finding the correct child within that branch.
  • Mathematical Formula: $ P(p_{d+1}^{True} | p_d^{True}) = \frac{\sum_{i=1}^{N} \mathbb{I}(\hat{y}_{i,d+1} = y_{i,d+1} \land \hat{y}_{i,d} = y_{i,d})}{\sum_{i=1}^{N} \mathbb{I}(\hat{y}_{i,d} = y_{i,d})} $
  • Symbol Explanation:
    • N: Total number of instances in the test set.
    • \hat{y}_{i,d+1}: The predicted label for instance i at depth d+1.
    • y_{i,d+1}: The true label for instance i at depth d+1.
    • \hat{y}_{i,d}: The predicted label for instance i at depth d.
    • y_{i,d}: The true label for instance i at depth d.
    • \mathbb{I}(\cdot): The indicator function.
    • \land: Logical AND operator.
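For concreteness, here is a minimal sketch (with assumed variable names, not the authors' code) of how both metrics can be computed from lists of predicted and gold label paths, after the normalization step described above:

```python
import re

def normalize(label: str) -> str:
    """Strip symbols and lowercase, so minor LLM output variations still match."""
    return re.sub(r"[^a-z0-9 ]", "", label.lower()).strip()

def accuracy_at_depth(preds: list, golds: list, d: int) -> float:
    """ACC_d: fraction of instances whose predicted label at depth d matches the gold label."""
    hits = sum(normalize(p[d]) == normalize(g[d]) for p, g in zip(preds, golds))
    return hits / len(golds)

def conditional_accuracy(preds: list, golds: list, d: int) -> float:
    """P(p_{d+1}^True | p_d^True): accuracy at depth d+1 among instances correct at depth d."""
    parent_correct = [(p, g) for p, g in zip(preds, golds)
                      if normalize(p[d]) == normalize(g[d])]
    if not parent_correct:
        return 0.0
    hits = sum(normalize(p[d + 1]) == normalize(g[d + 1]) for p, g in parent_correct)
    return hits / len(parent_correct)

# Example (depth index 0 = first depth):
# preds = [["Computer Science", "Machine Learning"], ["Medical Sciences", "Anatomy & Physiology"]]
# golds = [["Computer Science", "Computer Vision"],  ["Medical Sciences", "Anatomy & Physiology"]]
# accuracy_at_depth(preds, golds, 0) == 1.0; conditional_accuracy(preds, golds, 0) == 0.5
```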

5.3. Baselines

The black box LLM-based prompting strategies are compared against Hierarchy-aware Prompt Tuning for Hierarchical Text Classification (HPT) [11].

  • HPT: This is a non-LLM, machine learning-based approach. It leverages a transformer-based architecture (like BERT) and integrates hierarchical label dependencies into prompt tuning to enhance classification accuracy.
  • Parameters: For HPT, batch_size was set to 16, with all other parameters at their default values, using the official implementation.
  • Representativeness: HPT is a strong and relatively recent baseline for HTC that explicitly incorporates hierarchical information using PLMs, making it a suitable comparison for assessing the effectiveness of LLM-based prompting.

5.4. LLM Configuration

  • Model: gpt-4o-mini (specifically, gpt-4o-mini-2024-07-18) provided by OpenAI.
  • Parameters: temperature and top_p were both set to 1.0.
    • temperature: Controls the randomness of the output. Higher values (like 1.0) make the output more diverse and creative, while lower values make it more deterministic. This choice allows the LLM to explore a wider range of possible label interpretations.
    • top_p: Controls the diversity of the output by sampling from the most probable tokens whose cumulative probability exceeds top_p. Setting it to 1.0 means considering all possible tokens, again prioritizing diversity.
  • Prompting Settings: Experiments were conducted in zero-shot (no examples in the prompt) and various few-shot settings (1, 3, 5, 10, 20 examples). For few-shot prompting, examples were randomly sampled from the training data for each dataset.
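As an illustration of this configuration, the sketch below issues a single classification request through the OpenAI Python client; the prompt string is assumed to come from one of the strategies above, and this is not the authors' code.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def classify(prompt: str) -> str:
    """Send one classification prompt to the black box model and return its raw answer."""
    response = client.chat.completions.create(
        model="gpt-4o-mini-2024-07-18",  # model version used in the paper
        messages=[{"role": "user", "content": prompt}],
        temperature=1.0,
        top_p=1.0,
    )
    return response.choices[0].message.content.strip()

# Few-shot prompting simply prepends example (text, label) pairs to the prompt string,
# e.g. prompt = examples_block + "\n\n" + strategy_prompt
```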

6. Results & Analysis

6.1. Core Results Analysis

The experiments compared the three prompt strategies (DL, DH, TMH) in zero-shot and few-shot settings against the HPT machine learning model on two datasets: Web of Science (WOS) and Amazon Product Reviews (APR).

6.1.1. Web of Science Dataset Results (Shallow Hierarchy)

The following are the results from Table 2 of the original paper:

Method #(Few Shot) ACC1 P(p2True |p1True) ACC2
Machine Learning Model
HPT 0.826 0.655 0.571
Prompt Strategies
DL 0 0.677 0.581 0.393
DL 1 0.707 0.604 0.427
DL 3 0.708 0.620 0.439
DL 5 0.713 0.617 0.440
DL 10 0.712 0.605 0.431
DL 20 0.710 0.611 0.434
DH 0 0.627 0.601 0.401
DH 1 0.693 0.598 0.434
DH 3 0.688 0.579 0.417
DH 5 0.691 0.572 0.413
DH 10 0.688 0.567 0.407
DH 20 0.684 0.575 0.416
TMH 0 0.616 0.652 0.405
TMH 1 0.654 0.664 0.436
TMH 3 0.652 0.665 0.434
TMH 5 0.651 0.653 0.427
TMH 10 0.656 0.657 0.433
TMH 20 0.654 0.663 0.437
  • HPT Baseline Performance: The HPT machine learning model significantly outperforms all LLM-based prompting strategies on the WOS dataset, achieving ACC_1 = 0.826, P(p_2^True | p_1^True) = 0.655, and ACC_2 = 0.571. This dataset has a relatively shallow hierarchy (2 depths) and a moderate number of labels (7 at the 1st depth, 136 at the 2nd).
  • Few-shot vs. Zero-shot: For LLMs, few-shot prompting consistently improves performance over zero-shot prompting. For example, DL improves from ACC_1 = 0.677 (0-shot) to ACC_1 = 0.713 (5-shot). Similar trends are observed for DH and TMH.
  • Best LLM Strategy: DL with 5-shot prompting achieves the highest ACC_1 (0.713) and ACC_2 (0.440) among LLM strategies. TMH with 3-shot prompting has the highest conditional accuracy P(p_2^True | p_1^True) (0.665), suggesting its top-down approach helps maintain consistency once the first level is correct.
  • Overall LLM Performance: Even with few-shot prompting, the LLM strategies do not reach the performance of the HPT model on this dataset. This suggests that for shallow hierarchies, a dedicated machine learning model trained on the available labeled data can still hold an advantage.

6.1.2. Amazon Product Reviews Dataset Results (Deeper Hierarchy)

The following are the results from Table 3 of the original paper:

Method #(Few Shot) ACC1 P(p2True |p1True) ACC2 P(p3True |p2True) ACC3
Machine Learning Model
HPT 0.823 0.657 0.556 0.641 0.377
Prompt Strategies
DL 0 0.637 0.561 0.357 0.720 0.257
DL 1 0.667 0.629 0.419 0.768 0.322
DL 3 0.693 0.675 0.468 0.785 0.367
DL 5 0.690 0.688 0.474 0.783 0.372
DL 10 0.701 0.679 0.476 0.788 0.375
DL 20 0.709 0.707 0.502 0.781 0.392
DH 0 0.817 0.718 0.591 0.782 0.491
DH 1 0.854 0.718 0.616 0.784 0.510
DH 3 0.862 0.732 0.633 0.770 0.507
DH 5 0.868 0.733 0.640 0.769 0.517
DH 10 0.867 0.744 0.649 0.769 0.521
DH 20 0.854 0.744 0.646 0.796 0.532
TMH 0 0.847 0.680 0.576 0.754 0.436
TMH 1 0.824 0.679 0.560 0.783 0.440
TMH 3 0.828 0.673 0.558 0.793 0.442
TMH 5 0.825 0.678 0.560 0.811 0.455
TMH 10 0.836 0.681 0.570 0.842 0.481
TMH 20 0.828 0.691 0.573 0.853 0.490
  • HPT Baseline Performance: On the APR dataset (deeper hierarchy with 3 depths), HPT achieves ACC_1 = 0.823, ACC_2 = 0.556, and ACC_3 = 0.377.
  • LLMs Outperform HPT on Deeper Hierarchy: In contrast to WOS, several LLM-based strategies, particularly DH and TMH in few-shot settings, achieve performance comparable to, or better than, HPT on this dataset with a deeper hierarchy.
    • DH (5-shot) achieves the highest ACC_1 (0.868) and ACC_2 (0.640), surpassing HPT significantly.
    • DH (20-shot) achieves the highest ACC_3 (0.532) among all LLM strategies and the highest P(p_3^True | p_2^True) (0.796) within the DH settings, again outperforming HPT.
    • TMH also shows strong performance, with TMH (20-shot) achieving ACC_3 = 0.490 and P(p_3^True | p_2^True) = 0.853.
  • Effectiveness of Few-shot Prompting: The improvement from zero-shot to few-shot is even more pronounced for APR. For example, DH improves from ACC_1 = 0.817 (0-shot) to ACC_1 = 0.868 (5-shot). This suggests that for more complex hierarchical tasks, in-context learning is highly beneficial for LLMs.
  • Strategy Comparison: DH generally exhibits the strongest performance for APR, particularly in ACC_1, ACC_2, and ACC_3, indicating its effectiveness in directly predicting full hierarchical paths in a deeper structure. TMH shows very high conditional accuracy (P(p_3^True | p_2^True) of 0.853 at 20-shot), implying that once a parent is correctly classified, it is very good at finding the correct child. DL improves consistently with few-shot examples but generally lags behind DH and TMH on this deeper hierarchy.
  • Few-shot Advantage: Across both datasets and all strategies, few-shot prompting consistently leads to higher accuracy than zero-shot prompting. This reinforces the importance of in-context learning for leveraging LLM capabilities in specific tasks.
  • Hierarchy Depth Matters: For shallow hierarchies (WOS), the traditional machine learning model (HPT) still maintains an edge. However, for deeper hierarchies (APR), where data sparsity becomes more severe for traditional models, LLMs (especially DH and TMH with few-shot prompting) demonstrate superior or comparable performance. This suggests LLMs are particularly valuable in complex, low-resource HTC scenarios.

6.2. Cost Analysis

The cost analysis focuses on the average number of input tokens (prompt tokens) and output tokens (completion tokens) required for different few-shot settings and prompt strategies.

The following are the results from Table 4 of the original paper:

#(few shot examples)
dataset prompt 0 1 3 5 10 20
prompt tokens
WOS DL 833.33 1105.00 1662.39 2210.69 3594.35 6326.98
DH 1249.33 1523.39 2080.72 2642.91 4034.88 6822.23
TMH 783.70 1305.11 2389.67 3491.28 6250.63 11755.44
APR DL 1337.16 1440.54 1653.19 1866.96 2377.60 3424.61
DH 3354.16 3465.70 3689.27 3912.13 4460.17 5574.73
TMH 511.23 828.81 1444.18 2057.71 3559.82 6472.83
completion tokens
WOS DL 4.47 3.90 3.64 3.67 3.47 3.83
DH 6.30 6.23 6.21 6.22 6.23 6.33
TMH 7.51 6.81 7.07 6.78 6.95 7.03
APR DL 4.49 3.83 3.92 4.03 4.08 4.11
DH 9.99 10.06 10.10 10.08 10.10 10.07
TMH 12.58 11.32 11.45 11.25 11.51 11.33
  • Prompt Tokens Dominate Cost: The number of prompt tokens increases significantly with the number of few-shot examples. This is the primary driver of computational cost. For instance, DL on WOS jumps from 833 tokens (0-shot) to over 6,300 tokens (20-shot). TMH on WOS reaches almost 12,000 tokens for 20-shot.
  • Completion Tokens Stability: In contrast, completion tokens (output tokens) remain relatively stable across different settings and prompt types, indicating that the cost is largely determined by the input prompt size rather than the generated output.
  • Strategy-specific Cost Characteristics:
    • DH: Consistently requires the highest number of prompt tokens, especially on the deeper APR dataset (from ~3,300 to ~5,500 tokens). This is expected as DH includes full hierarchical paths as candidate labels in the prompt, which can become very long for deeper hierarchies. This higher token count translates directly to higher API costs.
    • DL: Shows a more moderate increase in token usage. It generally requires fewer tokens than DH because it only lists leaf labels directly, rather than full paths.
    • TMH: Starts with relatively few tokens in zero-shot but scales dramatically with few-shot examples. For example, on WOS, TMH 20-shot uses ~11,755 tokens, significantly more than DL or DH at the same shot count. This is because TMH generates multiple prompts (one per depth level), and each prompt for few-shot includes examples. As the number of depths and few-shot examples increase, the cumulative token count for TMH can become very high.
  • Accuracy-Cost Trade-off: The cost analysis highlights the crucial trade-off. While DH and TMH can achieve higher accuracy, particularly on deeper hierarchies with few-shot examples, they also incur significantly higher API costs due to increased prompt token usage. DH's high accuracy on APR comes with the highest prompt token count.
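To make the trade-off tangible, the following back-of-the-envelope sketch converts Table 4's average token counts into an estimated API cost; the per-token prices are placeholders (assumptions, not figures from the paper).

```python
# Rough cost estimate from average token counts (illustrative only).
PRICE_PER_M_INPUT = 0.15    # assumed USD per 1M prompt tokens
PRICE_PER_M_OUTPUT = 0.60   # assumed USD per 1M completion tokens

def api_cost(prompt_tokens: float, completion_tokens: float, n_instances: int) -> float:
    """Estimated API cost for classifying n_instances texts."""
    per_instance = (prompt_tokens * PRICE_PER_M_INPUT +
                    completion_tokens * PRICE_PER_M_OUTPUT) / 1_000_000
    return per_instance * n_instances

# Example: DH on APR with 20-shot prompts (~5,574.73 prompt tokens and ~10.07
# completion tokens on average, from Table 4) over the 1,800 test instances:
# api_cost(5574.73, 10.07, 1800) ≈ 1.52 under the assumed prices
```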

6.3. Ablation Studies / Parameter Analysis

The paper does not explicitly present ablation studies or detailed parameter analyses (e.g., the impact of temperature or top_p values). The focus is on comparing the three prompting strategies and few-shot vs. zero-shot settings with a fixed LLM configuration. The number of few-shot examples (#(Few Shot)) is a key parameter varied, showing a general trend of increasing accuracy with more examples, up to a certain point (e.g., DL 5-shot for WOS, DH 10-shot for APR).

7. Conclusion & Reflections

7.1. Conclusion Summary

This study successfully demonstrated the feasibility of using black box Large Language Models (LLMs) via APIs for Hierarchical Text Classification (HTC). By evaluating three prompting strategies (Direct Leaf Label Prediction (DL), Direct Hierarchical Label Prediction (DH), and Top-down Multi-step Hierarchical Label Prediction (TMH)) in both zero-shot and few-shot settings, the research provided crucial insights into their accuracy and cost-effectiveness.

The key findings are:

  • Few-shot prompting consistently and significantly enhances classification accuracy across all strategies and datasets, underscoring the value of in-context learning.

  • LLM-based methods, particularly the DH strategy, show strong performance on datasets with deeper hierarchies (Amazon Product Reviews), often outperforming traditional machine learning models (HPT). This highlights their potential in low-resource scenarios where data sparsity is a significant challenge for traditional approaches.

  • A critical trade-off exists between accuracy and computational cost. Strategies like DH, while achieving high accuracy, incur substantial API costs due to increased input token requirements for deeper hierarchies and more few-shot examples. This implies that careful strategy selection is paramount to balance performance and financial considerations.

    The study concludes that prompting-based HTC with black box LLMs is a viable and often superior alternative to traditional machine learning models, especially for complex hierarchical structures, provided that computational cost is meticulously managed.

7.2. Limitations & Future Work

The authors acknowledge several limitations:

  • Limited LLM Scope: The analysis was restricted to OpenAI's GPT-4o mini. Future work should broaden this to include other black box LLMs (e.g., from Google, Anthropic) and explore fine-tuned white box LLMs to gain a more comprehensive understanding of cost-effectiveness and performance trade-offs across different model types and deployment strategies.
  • Hierarchy Depth: Experiments were conducted on datasets with only two- to three-depth hierarchies. While DH showed promise for deeper hierarchies, further experiments on more profoundly nested and complex datasets are needed to confirm these benefits and assess the scalability of prompting strategies.
  • Specific Prompt Engineering: The study used predefined prompt templates. Future work could involve more advanced prompt engineering techniques, such as chain-of-thought prompting or self-consistency, to further improve reasoning capabilities and accuracy, potentially with cost implications.

7.3. Personal Insights & Critique

This paper offers a highly practical perspective on HTC by focusing on black box LLMs and API usage. The explicit cost analysis is particularly valuable, as it translates theoretical performance into real-world deployment considerations, a factor often overlooked in research. The finding that LLMs can outperform traditional ML on deeper hierarchies is significant, suggesting a niche where their extensive pre-trained knowledge helps overcome data scarcity at granular levels.

Potential Issues/Unverified Assumptions:

  1. Generalizability of gpt-4o-mini: While gpt-4o-mini is cost-effective, its performance might not fully represent the capabilities of larger, more expensive black box LLMs (e.g., GPT-4). The "mini" version might have inherent limitations that are not present in its larger counterparts, potentially understating the true potential of LLMs for HTC.

  2. Robustness to Prompt Variations: The study used fixed prompt templates. LLMs can be sensitive to subtle changes in prompt wording. The robustness of these strategies to minor variations in prompt engineering or few-shot example selection is not explored.

  3. Error Propagation in TMH: The Top-down Multi-step Hierarchical Label Prediction (TMH) strategy inherently suffers from error propagation: an incorrect prediction at an upper level will inevitably lead to incorrect candidate labels and thus incorrect predictions at lower levels. While Levenshtein distance helps recover from slight mismatches, it cannot fix a fundamentally wrong high-level classification. The paper's high conditional accuracy for TMH is good, but the overall ACC for TMH is generally lower than DH for APR, indicating that initial errors might be dragging down its overall performance.

  4. Implicit Knowledge vs. Explicit Structure: The DH strategy performs well on deeper hierarchies, which suggests that the LLM is effectively leveraging the explicit hierarchical path provided in the prompt. However, it's not entirely clear whether the LLM is truly "understanding" the hierarchical relationships or simply performing a sophisticated pattern matching on the long string of the path. Further probing might reveal the depth of this understanding.

    Transferability and Applications: The methodology of adapting HTC to prompting strategies is highly transferable. This approach could be applied to any domain where hierarchical classification is needed and labeled data is sparse, such as:

  • Legal Document Classification: Classifying legal cases or documents into complex legal codes and sub-sections.

  • Biological Taxonomy: Categorizing species or biological entities based on their hierarchical classifications.

  • Customer Support Ticket Routing: Directing customer inquiries through a multi-level support hierarchy to the most appropriate department or agent.

  • Patent Classification: Assigning patents to a deep, specialized classification system.

    The study provides a strong foundation for future work, especially in optimizing cost-performance for practical HTC applications using black box LLMs. The trade-off identified is crucial for practitioners, enabling them to make informed decisions based on their specific accuracy requirements and budget constraints.
