TagRouter: Learning Route to LLMs through Tags for Open-Domain Text Generation Tasks
TL;DR Summary
TagRouter is a training-free routing method leveraging tags to optimize synergy among multiple LLMs for open-domain text generation, boosting acceptance by 6.15% and cutting costs by 17.20%, achieving superior performance and cost-efficiency.
Abstract
Model routing allocates queries to the most suitable model, improving system performance while reducing costs. However, existing routing methods face practical limitations that hinder scalability in large-scale applications and struggle to keep up with the rapid growth of the large language model (LLM) ecosystem. To tackle these challenges, we propose TagRouter, a training-free model routing method designed to optimize the synergy among multiple LLMs for open-domain text generation tasks. Experimental results demonstrate that TagRouter outperforms 13 baseline methods, increasing the accept rate of the system by 6.15% and reducing costs by 17.20%, achieving optimal cost-efficiency. Our findings provide the LLM community with an efficient and scalable solution for model ensembling, offering users an evolvable "super model."
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
- Title: TagRouter: Learning Route to LLMs through Tags for Open-Domain Text Generation Tasks
- Authors: Zhou Chen¹, Zhiqiang Wei², Yuqi Bai¹, Xue Xiong², Jianmin Wu²
- Affiliations: ¹Tsinghua University, ²AI Cloud Group, Baidu Inc. The authors are a mix of academic researchers and industry professionals, indicating a focus on practical, real-world applications.
- Journal/Conference: The paper is available on arXiv, a preprint server for academic papers. This means it has been shared publicly but may not have completed a formal peer-review process yet. The paper itself references an ICLR 2025 Workshop, suggesting it may be intended for or has been submitted to that venue.
- Publication Year: The arXiv ID (2506.12473) indicates a submission date of June 2025.
- Abstract: The abstract introduces model routing as a technique to assign user queries to the most suitable Large Language Model (LLM) to improve performance and reduce costs. The authors argue that existing methods have practical limitations in scalability and adaptability. They propose TagRouter, a training-free routing method that uses "tags" to synergize multiple LLMs for open-domain text generation. Experiments show TagRouter outperforms 13 other methods, achieving a 6.15% higher acceptance rate and a 17.20% cost reduction. The authors position their work as an efficient, scalable solution for creating an evolvable "super model."
- Original Source Link:
- arXiv Page: https://arxiv.org/abs/2506.12473
- PDF Link: https://arxiv.org/pdf/2506.12473v1.pdf
2. Executive Summary
- Background & Motivation (Why):
The world of AI is now filled with thousands of Large Language Models (LLMs), each with unique strengths, weaknesses, and costs. For any given user query, simply using the biggest, most expensive model (like GPT-4) is often wasteful, as a smaller, cheaper model might produce an equally good or even better answer. The core problem is how to automatically and efficiently choose the right model for each specific query. This is called model routing.
Prior solutions to this problem suffered from several key drawbacks:
- High Latency & Cost: Some methods call multiple models and pick the best response, which is slow and expensive.
- Lack of Adaptability: Many routers need to be completely retrained whenever a new LLM is added to the system, making them unsuitable for the rapidly evolving model ecosystem.
- Limited Scope: Some approaches only work for specific tasks or can only choose between two models (a "large" one and a "small" one).
- Practical Barriers: Some methods require access to internal model details (like
logits), which is impossible for proprietary, black-box models (e.g., those accessed via API).
- Main Contributions / Findings (What):
The paper introduces TagRouter, a novel and practical approach to model routing that addresses these challenges. Instead of analyzing the raw, complex user query, TagRouter first simplifies it into a set of structured tags that capture its core semantic meaning (e.g., "Role Playing", "Text Generation", "Summarization"). The routing decision is then based on these simple tags. The primary contributions are:
- A Novel Method (TagRouter): A system with three components (TagGenerator, TagScorer, TagDecider) that is training-free in its routing logic. This means adding new models to the system doesn't require expensive retraining of the router itself.
- Superior Performance: TagRouter is shown to outperform 13 baseline methods, increasing the system's overall quality (Accept Rate up by 6.15%) while significantly cutting operational costs (down by 17.20%).
- A New Tag-Based Framework: The paper demonstrates that using tags as an intermediate representation boosts the performance of even existing routing methods, proving the general power of this "tag-based" idea.
- Practicality and Scalability: TagRouter is designed for the real world. It supports routing among multiple models, works with proprietary APIs, controls costs, and avoids redundant model calls, offering a truly scalable solution.
3. Prerequisite Knowledge & Related Work
- Foundational Concepts:
- Large Language Models (LLMs): These are massive neural networks trained on vast amounts of text data (e.g., the internet). They can understand and generate human-like text for a wide range of tasks, such as answering questions, writing essays, and creating code. Examples include OpenAI's GPT series, Baidu's ERNIE, and Meta's Llama.
- Model Routing: Imagine a dispatcher for LLMs. Given a user's request, the model router's job is to send that request to the best-suited LLM from a pool of available models, balancing performance and cost.
- Model Ensembling: A technique where predictions from multiple models are combined to produce a final result that is often better than any single model's prediction. Model routing can be seen as a dynamic form of ensembling.
- Knowledge Distillation: A machine learning technique where a large, powerful "teacher" model trains a smaller, more efficient "student" model to mimic its behavior. TagRouter uses this to create a lightweight tag generator.
- Previous Works (The State of Model Routing): The paper categorizes previous routing methods based on when the routing decision is made relative to the LLM generating a response.
  - Routing After Inference: These methods first get responses from multiple candidate models and then use another model (a "judge") to pick the best one.
    - Examples: FrugalGPT, LLM-Blender.
    - Limitation: This is very slow and expensive because it requires running multiple LLMs for a single user query.
  - Routing During Inference: These methods make routing decisions on the fly, token-by-token, as the response is being generated, often switching between a small and a large model.
    - Example: BiLD.
    - Limitation: This is technically complex and often requires the models to have compatible architectures, hindering scalability.
  - Routing Before Inference: These methods analyze the input query and select a single model before any response is generated. This is the most efficient approach.
    - Examples: FORC, RouteLLM, RouterBench.
    - Limitation: While efficient, these methods may not be as accurate. More importantly, they often need to be retrained from scratch whenever the pool of candidate models changes, which is a major bottleneck.
- Differentiation: TagRouter is a routing-before-inference method, but it cleverly overcomes the key limitations of its predecessors. Its core innovation is the use of tags. By converting a complex query into a simple set of tags, the routing decision becomes a much simpler lookup problem. This approach makes TagRouter:
  - Training-Free (for the router): When a new LLM is added, you only need to evaluate its performance on a pre-defined set of tags and add these new scores to a lookup table. No complex retraining is needed.
  - Highly Scalable: It easily supports multiple candidate models, not just two.
  - Practical: It works with proprietary models since it doesn't need internal access.
4. Methodology (Core Technology & Implementation)
TagRouter is composed of three main modules: TagGenerator, TagScorer, and TagDecider. The overall workflow is illustrated in Figure 1.
(Figure 1: A flowchart of the TagRouter method, showing the modules and data flow of the training and inference phases. Four modules are shown: ① tagging of the training data, ② the TagGenerator producing fine-grained tags, ③ the TagScorer computing model scores, and ④ the TagDecider selecting a model based on scores and a cost threshold.)
The process is divided into a one-time Training phase (①) and a real-time Inference phase (②, ③, ④).
4.1. TAGGENERATOR
The goal of this module is to take a raw user query and convert it into a set of meaningful, standardized tags.
- Step 1: Open Tagging: Instead of using a fixed list of tags, the authors first use a very powerful LLM (ERNIE-4.0-Turbo-8K) to generate descriptive tags for every query in a large dataset (BCUQ). This resulted in over 14,000 unique raw tags.
- Step 2: Tag Normalization: To make this large, noisy set of tags useful, it is refined through a three-step normalization process:
  - Frequency Filtering: Rare tags (appearing fewer than five times) are discarded.
  - Rule Aggregation: Tags are cleaned up by replacing special characters with spaces and standardizing capitalization.
  - Semantic Aggregation: To group similar tags (e.g., "Travel Plan" and "Trip Itinerary"), the authors use PhraseBERT to get vector embeddings for each tag. Then, DBSCAN clustering groups semantically similar tags. This process reduces the 14,000+ tags to a clean, manageable set of 1,601 unique tags.
- Step 3: Training the TAGGENERATOR: Continuously using a huge LLM for tagging is expensive, so the authors use knowledge distillation to train a much smaller, faster model (Qwen2.5-0.5B) to perform the tagging task. The training data consists of pairs of queries and their normalized tags. A special sampling algorithm (Hybrid Weight-Based Data Sampling) is used to ensure that rare but important tags are well-represented in the training process.
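The three-step normalization pipeline can be sketched in a few lines of Python. This is a toy version under simplifying assumptions: `bow_embed` and the greedy threshold merge stand in for the paper's PhraseBERT embeddings and DBSCAN clustering, and `min_freq` is lowered from the paper's five to two so the example stays small.

```python
import math
import re
from collections import Counter

def rule_normalize(tag):
    # Rule aggregation: special characters become spaces, case is standardized.
    return " ".join(re.sub(r"[^0-9A-Za-z]+", " ", tag).lower().split())

def bow_embed(tag):
    # Toy bag-of-words "embedding"; the paper uses PhraseBERT vectors instead.
    return Counter(tag.split())

def cosine(a, b):
    dot = sum(a[w] * b.get(w, 0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def normalize_tags(raw_tags, min_freq=2, sim=0.5):
    """Map each surviving raw tag to a canonical representative tag."""
    counts = Counter(rule_normalize(t) for t in raw_tags)
    kept = [t for t, c in counts.items() if c >= min_freq]  # frequency filtering
    reps, mapping = [], {}
    # Greedy threshold merge stands in for the paper's DBSCAN clustering:
    # each tag joins the first frequent representative it is similar to.
    for t in sorted(kept, key=lambda t: -counts[t]):
        hit = next((r for r in reps if cosine(bow_embed(t), bow_embed(r)) >= sim), None)
        mapping[t] = hit if hit else t
        if hit is None:
            reps.append(t)
    return mapping
```

For example, `normalize_tags(["Travel Plan!"] * 3 + ["travel plan ideas"] * 2 + ["rare tag"])` drops the rare tag and collapses the two travel variants onto one canonical tag.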
4.2. TAGSCORER
This module's job is to pre-calculate how well each candidate LLM performs on each of the 1,601 standardized tags. This creates a "capability map" for every model.
- Step 1: Tag-Score Mapping:
  For each tag, the system gathers all queries associated with it. For each of these queries, responses are generated from a smaller candidate model $m$ and a powerful reference model (e.g., ERNIE 3.5). An LLM-as-a-judge (e.g., ERNIE 4.0) then compares the two responses and labels the outcome as a win, tie, or loss for the smaller model.
  The performance score of model $m$ on tag $t$ is then calculated as (notation reconstructed from the symbol definitions below):
  $$S_{m,t} = w_t \sum_{r \in \{\text{win},\, \text{tie},\, \text{loss}\}} N_{m,t,r} \cdot v_r$$
  - $S_{m,t}$: The final score of model $m$ for tag $t$.
  - $w_t$: A weight for tag $t$. More frequent and consistent tags get a higher weight.
  - $N_{m,t,r}$: The number of times model $m$ achieved result $r$ (win, tie, or loss) for queries with tag $t$.
  - $v_r$: A numerical value assigned to each outcome. Based on experiments, the authors choose values such that a tie is positive but not as good as a clear win.
- Step 2: Tag Alignment:
  During inference, the lightweight TAGGENERATOR might produce a tag that isn't in the pre-defined set of 1,601. In this case, PhraseBERT embeddings are used to find the most semantically similar tag from the set, ensuring every generated tag can be mapped to a score.

This scoring process results in a simple key-value map, where the key is a (Model, Tag) pair and the value is its performance score.
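A minimal sketch of building that key-value map, assuming the weighted win/tie/loss scoring described above. The outcome values `v_win`, `v_tie`, and `v_loss` are placeholders rather than the paper's tuned numbers, and `jaccard` stands in for PhraseBERT cosine similarity in the alignment step.

```python
def tag_score(wins, ties, losses, tag_weight=1.0, v_win=1.0, v_tie=0.5, v_loss=0.0):
    # Fold a model's win/tie/loss counts on one tag into a single score.
    # The v_* outcome values here are illustrative placeholders.
    return tag_weight * (wins * v_win + ties * v_tie + losses * v_loss)

def build_score_map(judged):
    # judged: {(model, tag): (wins, ties, losses)} -> {(model, tag): score}
    return {key: tag_score(*counts) for key, counts in judged.items()}

def jaccard(a, b):
    # Toy word-set similarity; the paper uses PhraseBERT embedding similarity.
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def align_tag(tag, known_tags):
    # Tag alignment: map an out-of-vocabulary tag to the most similar known tag.
    return max(known_tags, key=lambda k: jaccard(tag, k))
```

For instance, `align_tag("trip plan", ["travel plan", "summarization"])` maps the unseen tag onto "travel plan", so every generated tag resolves to a scored entry.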
4.3. TAGDECIDER
This final module uses the scores from the TAGSCORER to make the routing decision for a given query.
- Step 1: Model Selection:
  For a query $q$, the TAGGENERATOR produces a set of tags $T_q$. The TAGDECIDER then calculates a total score for each candidate model by summing up its pre-computed scores for all the generated tags. The model with the highest total score is chosen.
- Step 2: Cost-Awareness Control:
  To balance performance and cost, a cost-awareness threshold $\tau$ is introduced. This logic is primarily for a two-model (large/small) scenario but can be extended. If the best-scoring model is the expensive, large model $M_l$, the system checks whether the smaller model $M_s$ could have done a "good enough" job by computing the score difference $\Delta S = S(M_l) - S(M_s)$ (symbols reconstructed from context):
  - If $\Delta S > \tau$, the query is sent to the large model $M_l$: the large model's advantage is big enough to justify its extra cost.
  - If $\Delta S \le \tau$, the query is re-routed to the smaller model $M_s$, saving cost. The paper reports that the default value of $\tau$ works well out-of-the-box, providing a good balance without per-deployment tuning.
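The two-model decision logic above can be sketched as follows; the function signature, the `tau` default, and the scores in the usage example are illustrative, not the paper's tuned values.

```python
def route(tags, score_map, large, small, tau=0.2):
    """Pick a model for a query given its generated tags.

    score_map: {(model, tag): score} produced by the TagScorer stage.
    tau: cost-awareness threshold (the default here is illustrative).
    """
    def total(model):
        # Step 1: sum the pre-computed scores over the query's tags.
        return sum(score_map.get((model, t), 0.0) for t in tags)

    s_large, s_small = total(large), total(small)
    if s_large <= s_small:
        return small  # the cheap model scores at least as well
    # Step 2: cost-awareness control -- the large model must beat the
    # small one by more than tau to justify its extra cost.
    return large if (s_large - s_small) > tau else small
```

With `score_map = {("L", "a"): 0.9, ("L", "b"): 0.8, ("S", "a"): 0.6, ("S", "b"): 0.7}`, the large model leads by 0.4, so a threshold of 0.5 re-routes the query to the small model while a threshold of 0.3 keeps it on the large one.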
5. Experimental Setup
- Datasets:
- BCUQ (Baidu AI Cloud User Queries): The primary dataset for training and evaluation. It contains 95,559 real-world query logs from Baidu's ERNIE Bot platform. It covers eight diverse task categories, with 'classification' and 'content creation' being the most common.
  (Figure: a pie chart of the task distribution in BCUQ. Classification is the largest share at 40.42%, content creation 20.17%, and 'other' 14.95%; the remaining tasks include closed QA, rewriting, summarization, brainstorming, and open QA.)
- Alpaca & Dolly: Two well-known public datasets used to test the generalization capabilities of TagRouter.
- Evaluation Metrics:
- Accept Rate (AR):
  - Conceptual Definition: This metric measures the quality of the routing system. It's the percentage of queries for which the model selected by the router produces a response that is judged as either a "win" or a "tie" when compared to the response from the best and most expensive model in the pool. A higher AR means the router is making better choices.
  - Mathematical Formula (reconstructed from the symbol definitions below):
  $$\mathrm{AR} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\big(R(q_i)\big)$$
  - Symbol Explanation:
    - $N$: The total number of queries in the evaluation set.
    - $R(q_i)$: The model chosen by the router for a specific query $q_i$.
    - $\mathbb{1}(\cdot)$: A function that returns 1 if the response from $R(q_i)$ is a win or tie, and 0 otherwise.
- GPT-Rank (Rank):
  - Conceptual Definition: The average rank of the router-selected model's response among all candidate models' responses for each query. A rank of 1 is the best. A lower average rank indicates the router consistently picks top-performing models.
  - Mathematical Formula: The paper does not provide a formula, but it is conceptually the mean of the per-query ranks, $\mathrm{Rank} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{rank}\big(R(q_i)\big)$.
- Area Under Curve (AUC):
  - Conceptual Definition: This metric evaluates the router's performance across all possible cost-performance trade-offs. It's the area under the curve that plots Accept Rate (y-axis) against the proportion of queries sent to the most expensive model (x-axis). A higher AUC means the router is robustly effective at various budget constraints.
  - Mathematical Formula (reconstructed from the symbol definitions below):
  $$\mathrm{AUC} = \int_{0}^{1} \mathrm{AR}(p)\, dp$$
  - Symbol Explanation:
    - $p$: The ratio of queries routed to the most expensive model (from 0 to 1).
    - $\mathrm{AR}(p)$: The Accept Rate achieved at a specific routing ratio $p$.
- Partial Area Under Curve (PAUC):
  - Conceptual Definition: This is a stricter version of AUC. It measures the "added value" of the router by calculating the area under the curve only for the parts where the router's Accept Rate is higher than the Accept Rate of simply always using the best model. A high PAUC score is strong evidence that the routing system is genuinely improving upon the best single model.
  - Mathematical Formula (reconstructed from the symbol definitions below):
  $$\mathrm{PAUC} = \int_{0}^{1} \max\big(\mathrm{AR}(p) - \mathrm{AR}_{M_l},\, 0\big)\, dp$$
  - Symbol Explanation:
    - $\mathrm{AR}_{M_l}$: The baseline Accept Rate achieved by always routing to the large model $M_l$.
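For concreteness, here is a toy computation of AR, AUC (via the trapezoid rule), and PAUC over an accept-rate curve. The curve points in the usage example are made up for illustration, not taken from the paper.

```python
def accept_rate(outcomes):
    # outcomes: booleans, True = the routed response wins or ties
    # against the large reference model.
    return sum(outcomes) / len(outcomes)

def auc(curve):
    # curve: [(p, AR(p)), ...] sorted by routing ratio p in [0, 1];
    # integrate the accept-rate curve with the trapezoid rule.
    return sum((p1 - p0) * (a0 + a1) / 2.0
               for (p0, a0), (p1, a1) in zip(curve, curve[1:]))

def pauc(curve, ar_large):
    # Keep only the area where the router beats always-using-the-large-model.
    return auc([(p, max(a - ar_large, 0.0)) for p, a in curve])
```

For example, with `curve = [(0.0, 0.5), (0.5, 0.8), (1.0, 0.7)]` and a large-model baseline of 0.7, `pauc` rewards only the middle region where the router exceeds the baseline.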
- Baselines:
  - Individual Models: ERNIE-3.5-8K (the powerful, expensive model) and ERNIE-Speed-8K (the smaller, cheaper model).
  - Existing Routing Methods: 10 diverse methods including FrugalGPT, PairRanker, RouteLLMMF, RouterBenchKNN, and FORC.
  - Tag-based Variants (proposed by authors): To isolate the benefit of tags, the authors also created versions of the top 3 baselines (RouteLLMMF, RouterBenchKNN, FORC) that use tags as input instead of the raw query.
6. Results & Analysis
Core Results on BCUQ
The main results are summarized in Table 2, which compares TagRouter to all baselines on the BCUQ dataset.
(This is a transcribed version of Table 2 from the paper.)
| Category | Method | AR(%)↑ | Uplift(%)↑ | Cost↓ | Rank↓ | AUC(%)↑ | PAUC(%)↑ |
|---|---|---|---|---|---|---|---|
| Individual LLM | EBspeed | 59.78 | -24.1 | 2.01 | 1.400 | - | 0 |
| Individual LLM | EB3.5 | 78.76 | 0 | 13.49 | 1.212 | - | 0 |
| Existing Routing Methods | FrugalGPT (Chen et al., 2023) | 78.88 | 0.15 | 13.24 | 1.211 | 70.11 | 0.01 |
| Existing Routing Methods | PairRanker (Jiang et al., 2023) | 78.76 | 0 | 13.49 | 1.212 | 72.17 | 0 |
| Existing Routing Methods | Blending (Lu et al., 2024d) | 78.76 | 0 | 13.49 | 1.212 | 69.22 | 0 |
| Existing Routing Methods | RouteLLMSWR (Ong et al., 2024) | 78.76 | 0 | 13.49 | 1.212 | 70.88 | 0 |
| Existing Routing Methods | RouteLLMBERT (Ong et al., 2024) | 78.76 | 0 | 13.43 | 1.212 | 71.35 | 0 |
| Existing Routing Methods | RouteLLMLLM (Ong et al., 2024) | 78.76 | 0 | 13.49 | 1.212 | 73.02 | 0 |
| Existing Routing Methods | RouteLLMMF (Ong et al., 2024) | 80.34 | 2.01 | 11.82 | 1.197 | 73.94 | 0.12 |
| Existing Routing Methods | RouterBenchMLP (Hu et al., 2024) | 78.88 | 0.15 | 13.40 | 1.211 | 73.58 | 0.01 |
| Existing Routing Methods | RouterBenchKNN (Hu et al., 2024) | 80.45 | 2.15 | 11.77 | 1.196 | 75.15 | 0.40 |
| Existing Routing Methods | FORC (Sakota et al., 2024) | 81.80 | 3.86 | 11.81 | 1.182 | 75.73 | 0.76 |
| Tag-based Methods (ours) | RouteLLMMF w/ TAGGENERATOR | 82.02 | 4.14 | 11.66 | 1.180 | 76.08 | 0.76 |
| Tag-based Methods (ours) | RouterBenchKNN w/ TAGGENERATOR | 81.57 | 3.57 | 11.76 | 1.184 | 74.48 | 0.98 |
| Tag-based Methods (ours) | FORC w/ TAGGENERATOR | 81.91 | 4.00 | 11.79 | 1.181 | 75.97 | 0.59 |
| Tag-based Methods (ours) | TAGROUTER | 83.60 | 6.15 | 11.17 | 1.164 | 76.10 | 1.46 |

(AR, Uplift, Cost, and Rank are reported at each method's maximum-AR operating point, per the original table's "Performance at Max AR" grouping.)
Key Findings:
- Tags Boost Performance: The "Tag-based Methods" (e.g., FORC w/ TAGGENERATOR) consistently outperform their original counterparts that use raw queries. This proves that converting queries to tags is a powerful feature engineering step.
- TagRouter Achieves State-of-the-Art: TagRouter outperforms all other methods across every key metric. It achieves the highest AR (83.60%), a 6.15% improvement over just using the best single model (EB3.5). At the same time, it reduces the system cost to 11.17 (a 17.20% reduction from 13.49). Its superior AUC and PAUC scores confirm its robustness and value-add.

These results are visualized in Figure 5, where TagRouter's performance curve (red) is consistently above the other methods.
(Figure 5: Performance comparison on BCUQ between TAGROUTER and the baselines: (a) against the top three existing routing methods and (b) against the other tag-based routing methods. TAGROUTER achieves the best accept rate and cost-efficiency.)
Performance Across Different Tasks
Figure 2 breaks down performance by task type. It shows that even a large model like EB3.5 isn't universally superior; the smaller EBspeed outperforms it on "summarization" tasks. TagRouter consistently achieves the highest AUC across most tasks, demonstrating its ability to exploit these nuanced model strengths.
(Figure 2: Comparison of TagRouter with the top three existing routing methods across the eight task categories, plotting Accept Rate against cost ratio relative to EB3.5. TagRouter outperforms the baselines on most tasks, improving overall system utility.)
Scalability of TagRouter
Figure 3 demonstrates that TagRouter's performance improves as more models are added to the candidate pool. Increasing the number of models from two to three, and then to five, progressively raises the overall system AUC from 0.7610 to 0.8043. This confirms the method's ability to effectively orchestrate a larger and more diverse set of LLMs.
(Figure: TagRouter's performance on the Alpaca and Dolly datasets, comparing accept-rate curves with the original versus an enhanced TagScorer; the x-axis is the ratio of queries routed to EB3.5, and the corresponding AUC values are annotated.)
Ablation Studies
- Tag Normalization & Alignment: Figure 11 shows that both the tag normalization and tag alignment steps are crucial. The full system (red line), with both components enabled, achieves the highest AUC.
  (Figure 11: The effect of tag normalization and tag alignment on routing performance. Colors and markers indicate whether normalization and alignment are enabled; the curves show Accept Rate versus routing ratio, with horizontal lines marking the EB3.5 and EBspeed baselines.)
- Tuning the Tie Score: The authors experimented with the score assigned to a "tie" outcome. Figure 12 shows that their chosen tie score provides the best balance of Max AR, AUC, and cost-effectiveness, validating their choice over the simpler value of 1 used in prior work.
  (Figure 12: The effect of different tie-score values on system performance. "1/Relative Cost" denotes the normalized inverse cost at the point where AR reaches its maximum; the chosen value is marked.)
Generalization Across Datasets and Models
Table 3 shows TagRouter's performance on different datasets (Alpaca, Dolly) and with a different pair of candidate models (GLM4-9B and Qwen2-7B). In all scenarios, TagRouter consistently achieves the highest average AUC, demonstrating strong generalization.
(This is a transcribed version of Table 3 from the paper.)
| Method | GLM4-9B and Qwen2-7B | EB3.5 and EBspeed | Average | |||
| Alpaca | Dolly | BCUQ | Alpaca | Dolly | BCUQ | ||
| RouteLLMMF (Ong et al., 2024) | 0.7142 | 0.7566 | 0.7626 | 0.6950 | 0.6475 | 0.7394 | 0.7192 |
| RouterBenchKNN (Hu et al., 2024) | 0.7326 | 0.7583 | 0.7548 | 0.6978 | 0.6216 | 0.7515 | 0.7194 |
| FORC (Sakota et al., 2024) | 0.7384 | 0.7620 | 0.7659 | 0.7077 | 0.6700 | 0.7573 | 0.7336 |
| TAGROUTER | 0.7438 | 0.7623 | 0.7706 | 0.7239 | 0.7016 | 0.7610 | 0.7439 |
7. Conclusion & Reflections
- Conclusion Summary: The paper successfully introduces TagRouter, a highly effective and practical model routing method. By converting complex queries into a simple, structured set of tags, it creates a "super model" that outperforms any single LLM. The method is training-free (for routing), scalable to multiple models, and cost-efficient. The extensive experiments convincingly demonstrate its superiority over 13 other methods, establishing it as a state-of-the-art solution for orchestrating LLMs in open-domain text generation.
Limitations & Future Work (Acknowledged by Authors):
- Language Capability: The current
TagGeneratorwas trained on Chinese and English queries and is thus limited to these languages. - Evaluation Methods: The reliance on an
LLM-as-a-judgeis a practical necessity but is known to have biases and is less reliable than large-scale human evaluation. They also suggest that using an Elo rating system (like in Chatbot Arena) to evaluate responses could provide more robust scores and better support scaling.
- Language Capability: The current
- Personal Insights & Critique:
  - Elegant Simplicity: The core idea of TagRouter is powerful because of its simplicity. It transforms a complex, high-dimensional problem (routing based on raw text) into a simple, low-dimensional one (routing based on a few tags). This abstraction is what makes the system so efficient and scalable.
  - High Practical Value: The "training-free" aspect for adding new models is a game-changer for real-world deployment. Companies can easily integrate new open-source or proprietary models into their systems without incurring massive retraining costs and downtime. The lightweight inference process (a small tagger model and key-value lookups) is also a major practical advantage.
  - Dependency on the "Teacher": The entire system's quality depends heavily on the quality of the initial tag generation and pairwise comparisons, which are performed by a powerful "teacher" LLM (ERNIE 4.0). Any biases or blind spots in this teacher model will inevitably be baked into TagRouter's scoring map.
  - Future Directions: This tag-based framework could be extended beyond routing. The tags themselves are a valuable, structured representation of user intent and could be used for analytics, content moderation, or even to fine-tune specialist models for specific tag categories. The concept of a pre-computed "model capability map" based on semantic tags is a significant contribution that will likely influence future research in model ensembling and MLOps.