RouterEval: A Comprehensive Benchmark for Routing LLMs to Explore Model-level Scaling Up in LLMs
TL;DR Summary
The RouterEval benchmark, built from performance records of more than 8,500 LLMs, reveals that a capable router improves performance as the candidate pool grows, even surpassing the best single model. It enables comprehensive router evaluation and shows that current routing methods still have substantial room for improvement.
Abstract
Routing large language models (LLMs) is a new paradigm that uses a router to recommend the best LLM from a pool of candidates for a given input. In this paper, our comprehensive analysis with more than 8,500 LLMs reveals a novel model-level scaling up phenomenon in Routing LLMs, i.e., a capable router can significantly enhance the performance of this paradigm as the number of candidates increases. This improvement can even surpass the performance of the best single model in the pool and many existing strong LLMs, confirming it as a highly promising paradigm. However, the lack of comprehensive and open-source benchmarks for Routing LLMs has hindered the development of routers. In this paper, we introduce RouterEval, a benchmark tailored for router research, which includes over 200,000,000 performance records for 12 popular LLM evaluations across various areas such as commonsense reasoning, semantic understanding, etc., based on over 8,500 various LLMs. Using RouterEval, extensive evaluations of existing Routing LLM methods reveal that most still have significant room for improvement. See https://github.com/MilkThink-Lab/RouterEval for all data, code and tutorial.
In-depth Reading
1. Bibliographic Information
- Title: RouterEval: A Comprehensive Benchmark for Routing LLMs to Explore Model-level Scaling Up in LLMs
- Authors: Zhongzhan Huang, Guoming Ling, Yupei Lin, Yandong Chen, Shanshan Zhong, Hefeng Wu, Liang Lin
- Affiliations: All authors are affiliated with Sun Yat-sen University.
- Journal/Conference: The paper is available on arXiv, a preprint server for academic papers, which means the work has not yet undergone formal peer review for a conference or journal. The arXiv identifier (2503.10657) indicates a March 2025 submission.
- Publication Year: 2025 (as listed on arXiv).
- Abstract: The paper introduces "Routing LLMs," a new paradigm where a router model selects the best-suited Large Language Model (LLM) from a candidate pool for a given input. Through a massive analysis involving over 8,500 LLMs, the authors discover a "model-level scaling up" phenomenon: the performance of the routing system significantly improves as the number of candidate LLMs increases, potentially outperforming even the best single model in the pool. To address the lack of standardized evaluation tools which has hindered research in this area, the authors introduce RouterEval, a comprehensive benchmark. RouterEval contains over 200 million performance records for 12 popular LLM evaluation tasks. Using this benchmark, the paper evaluates existing routing methods and concludes that there is substantial room for improvement.
- Original Source Links:
  - arXiv Page: https://arxiv.org/abs/2503.10657
  - PDF Link: https://arxiv.org/pdf/2503.10657v2.pdf
  - Code and Data: https://github.com/MilkThink-Lab/RouterEval
2. Executive Summary
- Background & Motivation (Why): The AI landscape is saturated with thousands of Large Language Models (LLMs), each with unique strengths and weaknesses. For any given task, it is unlikely that a single LLM is universally the best. The core problem is how to dynamically and efficiently select the optimal LLM for a specific input query to maximize performance or other objectives (like cost-efficiency). This emerging field, which the paper calls "Routing LLMs," has been hampered by a critical lack of large-scale, open-source benchmarks to train and evaluate "router" models. Existing benchmarks are often too small, not diverse enough, or have closed-source data.
- Main Contributions / Findings (What):
- Discovery of the "Model-Level Scaling Up" Phenomenon: The paper's most significant finding is that routing systems exhibit a scaling law at the model level. With a capable router, the overall system's performance increases predictably and substantially as more LLMs are added to the candidate pool. This suggests that a collection of individually weaker models can, when properly routed, collectively outperform a single, much stronger model.
- Introduction of the RouterEval Benchmark: To catalyze research, the authors constructed and open-sourced RouterEval, a massive benchmark for developing and testing LLM routers. It is built upon:
  - Over 8,500 LLMs, primarily open-source models.
  - Over 200,000,000 performance records, detailing how each model performed on specific inputs.
  - 12 popular LLM evaluation datasets covering diverse areas like commonsense reasoning, math, and instruction following.
- Comprehensive Evaluation of Existing Routers: Using RouterEval, the paper provides the first large-scale evaluation of existing router design methods. The results show that current approaches are a clear improvement over random selection but still fall far short of the theoretical optimum, often failing even to surpass the best single model in the candidate pool. This highlights a significant opportunity for future research.
3. Prerequisite Knowledge & Related Work
To understand this paper, a few key concepts are essential.
- Foundational Concepts:
  - Large Language Models (LLMs): These are advanced AI models (e.g., GPT-4, Llama 2) trained on vast amounts of text data to understand and generate human-like language. They vary in size, capability, and specialization.
  - Routing LLMs: As defined in this paper, this is a paradigm where a dedicated "router" model analyzes an incoming query and dispatches it to the most suitable LLM from a predefined pool. This is distinct from using a single, general-purpose LLM for all tasks. As illustrated in Figure 1, different inputs can be sent to different LLMs to optimize for accuracy or other goals.

    Figure 1: Schematic of the Routing LLMs workflow. Two inputs are dispatched by a router to different models in an LLM pool, yielding different outputs and illustrating how routing choices affect task handling.

- Technological Evolution and Differentiation: The Routing LLMs paradigm is related to, but distinct from, several other techniques for combining models:
- Mixture-of-Experts (MoE): Traditional MoE models (like Mixtral 8x7B) operate within a single neural network. They have multiple "expert" sub-modules (e.g., feed-forward networks), and a gating network routes each input token to a small subset of these experts. Routing LLMs can be seen as a model-level MoE, where the "experts" are entire, independent LLMs. This allows for routing heterogeneous models with different architectures.
- LLM Ensembling: Ensemble methods typically run an input through all candidate models and then aggregate their outputs (e.g., via majority vote). This can improve accuracy but is computationally expensive as it requires multiple inferences. Routing LLMs are more efficient because they only require inference from one selected model per input.
- LLM Fusion/Merging: This involves combining the parameters of multiple LLMs (usually with identical architectures) into a single, new model. Routing LLMs do not merge the models; they keep them separate and simply choose among them, allowing for greater flexibility and the inclusion of heterogeneous models (e.g., models of different sizes or from different families).
- Recommender Systems (RS): The paper astutely frames LLM routing as a recommendation problem. The input query is the "user," the candidate LLMs are the "items," and the router's job is to recommend the best item for each user. The performance records in RouterEval act as the historical user-item interaction data.
4. Methodology (Core Technology & Implementation)
The paper formalizes the task of building an LLM router and introduces the structure of the RouterEval benchmark.
- Principles: The core idea is to treat the selection of the best LLM as a multi-class classification problem. The router's goal is to learn a mapping from an input query to the index of the best-performing LLM in the candidate pool.
- Steps & Procedures:
  - Input Representation: A given input sentence $x$ is encoded into a fixed-size numerical vector using a pre-trained sentence encoder such as Sentence-BERT or RoBERTa.
  - LLM Candidate Pool: A pool of $m$ candidate LLMs, denoted $\{M_1, M_2, \ldots, M_m\}$, is defined.
  - Ground-Truth Labels: For each input $x_i$, a target selection vector $y_i \in \{0, 1\}^m$ is created based on pre-computed performance records. If the metric is correctness, any LLM that answers correctly has its corresponding dimension in $y_i$ set to 1. For continuous metrics, models performing within 95% of the best performance are marked as optimal (1). This allows for multiple correct choices.
  - Router Training: A router model $R_\theta$ with learnable parameters $\theta$ is trained on the dataset $\{(x_i, y_i)\}_{i=1}^N$. The training objective is to make the router's prediction for a given input match the ground-truth selection vector, which can be formalized as
    $$\min_{\theta} \; \sum_{i} \mathcal{L}\big(R_{\theta}(E(x_i)),\, y_i\big),$$
    where $E(\cdot)$ is the sentence encoder and where optional external data $\mathcal{D}_{\text{extra}}$ can additionally be used for training. A minimal sketch of this pipeline appears right after this list.
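To make the formulation concrete, here is a minimal sketch of the router-as-classifier pipeline described above, assuming a precomputed performance matrix and precomputed input embeddings. All names (`perf`, `embeddings`, the MLP choice) are illustrative, not taken from the paper's codebase.

```python
# Minimal sketch of the router-as-classifier pipeline described above.
# Assumptions (not from the paper's code): `perf` is an (N, m) matrix of
# per-input performance scores for the m candidate LLMs, and `embeddings`
# stands in for Sentence-BERT-style input vectors.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
N, m, d = 1000, 5, 384                    # samples, candidate LLMs, embedding dim
embeddings = rng.normal(size=(N, d))      # stand-in for encoded inputs
perf = rng.random(size=(N, m))            # stand-in for performance records

# Ground-truth selection vectors: every model within 95% of the per-input
# best is marked optimal, so several models can be labeled 1 for one input.
best = perf.max(axis=1, keepdims=True)
y_multi = (perf >= 0.95 * best).astype(int)   # (N, m) multi-label targets

# Simplest single-label reduction: train on the index of the best model.
y = perf.argmax(axis=1)
router = MLPClassifier(hidden_layer_sizes=(256,), max_iter=200, random_state=0)
router.fit(embeddings, y)

# Routing a new input: encode it, then dispatch it to the predicted LLM.
new_input = rng.normal(size=(1, d))
print(f"route query to candidate LLM #{router.predict(new_input)[0]}")
```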
- Simulating Router Capability: To study the impact of router quality, the authors define a simulated router $r(p)$:
  - $r_o$ is an oracle router that perfectly selects an optimal LLM whenever one exists in the pool for the given input.
  - $r_{\text{rand}}$ is a random router that picks an LLM uniformly at random from the candidates.
  - $p$ is the probability of using the oracle: $r(p)$ behaves as $r_o$ with probability $p$ and as $r_{\text{rand}}$ otherwise. A higher $p$ simulates a more capable router, while $p = 0$ corresponds to a random router. This elegant construction allows the authors to precisely study how performance scales with both router capability ($p$) and the number of candidates ($m$); a minimal simulation sketch follows this list.
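Below is a minimal simulation of $r(p)$ under the definitions above; `y_multi` is a 0/1 matrix marking which candidate LLMs are optimal for each input, as in the previous sketch. The toy data and names are assumptions for illustration.

```python
# Minimal simulation of r(p): with probability p act as the oracle (pick an
# optimal LLM for the input if one exists), otherwise pick uniformly at
# random. `y_multi` is an (N, m) 0/1 matrix of optimal-model labels.
import numpy as np

def simulate_r_p(y_multi: np.ndarray, p: float, rng: np.random.Generator) -> float:
    """Fraction of inputs routed to an optimal LLM under the r(p) router."""
    N, m = y_multi.shape
    hits = 0
    for i in range(N):
        optimal = np.flatnonzero(y_multi[i])      # indices of optimal LLMs
        if rng.random() < p and optimal.size > 0:
            choice = rng.choice(optimal)          # oracle branch
        else:
            choice = rng.integers(m)              # random branch
        hits += int(y_multi[i, choice] == 1)
    return hits / N

rng = np.random.default_rng(0)
y_toy = (rng.random((1000, 10)) > 0.7).astype(int)   # toy labels, ~30% optimal

# Sweeping p mimics the paper's router-capability axis.
for p in (0.0, 0.3, 0.5, 0.7, 1.0):
    print(f"p={p:.1f}  accuracy={simulate_r_p(y_toy, p, rng):.3f}")
```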
- Construction of RouterEval:
  - Data Format: The benchmark provides pairs of input embeddings and target selection vectors, $\{(E(x_i), y_i)\}_{i=1}^{N}$.
  - LLM Candidate Groups: To ensure robust evaluation, for each benchmark and each number of candidates $m$, three types of candidate groups are created:
- "all-strong": Candidates are sampled from the top 20% of performing LLMs.
- "all-weak": Candidates are sampled from the bottom 20% of performing LLMs.
- "strong-to-weak": A mix of models from across the performance spectrum. This setup allows for analyzing router performance under different scenarios, such as whether a router can effectively leverage a pool of individually weak but complementary models.
  - Extra Training Data: In addition to the direct training pairs, RouterEval provides over 200 million raw performance records. This massive dataset can be used for more advanced training techniques such as pre-training, data augmentation, or methods inspired by recommender systems.
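Here is a sketch of how the three candidate-group types could be sampled from per-model average scores. The 20% slices follow the description above; the exact sampling procedure and all names are illustrative assumptions, not the paper's code.

```python
# Sketch of sampling the three candidate-group types from per-model
# average scores; details are illustrative assumptions.
import numpy as np

def sample_candidate_group(scores: np.ndarray, m: int, kind: str,
                           rng: np.random.Generator) -> np.ndarray:
    """Return the indices of m candidate LLMs for one group type."""
    order = np.argsort(scores)[::-1]              # models ranked best-first
    k = max(m, int(0.2 * len(order)))             # at least m models per slice
    if kind == "all-strong":
        pool = order[:k]                          # top 20% of performers
    elif kind == "all-weak":
        pool = order[-k:]                         # bottom 20% of performers
    elif kind == "strong-to-weak":
        pool = order                              # the whole spectrum
    else:
        raise ValueError(f"unknown group type: {kind}")
    return rng.choice(pool, size=m, replace=False)

rng = np.random.default_rng(0)
avg_scores = rng.random(500)                      # toy per-model averages
for kind in ("all-strong", "all-weak", "strong-to-weak"):
    print(kind, sample_candidate_group(avg_scores, m=5, kind=kind, rng=rng))
```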
5. Experimental Setup
- Datasets: RouterEval is built from 12 diverse and popular LLM evaluation benchmarks: ARC (reasoning), HellaSwag (commonsense), MMLU (multitask knowledge), TruthfulQA (truthfulness), Winogrande (commonsense), GSM8k (math word problems), IFEval (instruction following), BBH (Big-Bench Hard), GPQA (graduate-level physics, biology, and chemistry), MUSR (multi-step reasoning), MATH Lvl 5 (advanced math), and MMLU-PRO (professional-level MMLU).
- Evaluation Metrics: The paper uses four metrics to evaluate router performance:
  - Original Metric ($\mu_o$): The final performance (e.g., accuracy) of the system after routing. A higher value is better.
  - Reference Value ($V_R$):
    - Conceptual Definition: This metric compares the router's performance to that of a strong, state-of-the-art reference LLM (such as GPT-4). A value greater than 1 means the routing system outperforms the strong reference model.
    - Symbol Explanation: $V_R = \mu_o / \text{Perf.(ref.)}$, where $\mu_o$ is the router's performance and Perf.(ref.) is the reference model's performance on the same benchmark.
  - Best Single Model Value ($V_B$):
    - Conceptual Definition: This metric measures whether the router adds value beyond simply identifying and always picking the single best model from the candidate pool. A value greater than 1 indicates that the router is successfully leveraging the complementary strengths of multiple models.
    - Symbol Explanation: $V_B = \mu_o / \text{Perf.(BSM)}$, where Perf.(BSM) is the performance of the best-performing single model within the given candidate pool.
  - Classification Bias ($E_p$):
    - Conceptual Definition: This metric uses Shannon entropy to measure the diversity of the router's selections. A high entropy value indicates the router is selecting a wide variety of LLMs, while a value near 0 implies classification bias, where the router almost always picks the same model, defeating the purpose of routing.
    - Symbol Explanation: $E_p = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{m} p_{ij}\log_2 p_{ij}$, where $N$ is the number of test samples, $m$ is the number of candidate LLMs, and $p_{ij}$ is the probability assigned by the router to selecting the $j$-th LLM for the $i$-th sample. (A sketch computing all four metrics follows this list.)
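The sketch below computes the four metrics from a performance matrix and router selection probabilities, following the definitions above. Entropy is taken in base 2, so a uniform random router over $m$ candidates scores $\log_2 m$ (about 1.59 for $m=3$ and 2.32 for $m=5$), matching the Random rows in the tables below. Variable names are illustrative.

```python
# Sketch of the four router metrics, reconstructed from the definitions above.
import numpy as np

def router_metrics(perf: np.ndarray, probs: np.ndarray, ref_perf: float) -> dict:
    """perf: (N, m) per-input scores; probs: (N, m) router probabilities
    (rows sum to 1); ref_perf: score of a strong reference LLM."""
    N, m = perf.shape
    chosen = probs.argmax(axis=1)                 # routed model per input
    mu_o = perf[np.arange(N), chosen].mean()      # original metric
    v_r = mu_o / ref_perf                         # vs. the reference LLM
    v_b = mu_o / perf.mean(axis=0).max()          # vs. the best single model
    # Classification bias: mean Shannon entropy of the selection distribution.
    e_p = -(probs * np.log2(np.clip(probs, 1e-12, 1.0))).sum(axis=1).mean()
    return {"mu_o": mu_o, "V_R": v_r, "V_B": v_b, "E_p": e_p}

rng = np.random.default_rng(0)
perf = rng.random((1000, 5))                      # toy performance records
logits = rng.normal(size=(1000, 5))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
print(router_metrics(perf, probs, ref_perf=0.8))
```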
- Baselines:
  - Strong Routers: The oracle router $r_o$ and $r(0.5)$ (a 50/50 mix of oracle and random) serve as performance upper bounds.
  - Existing Routers: LinearR (a linear classifier), MLPR (a multi-layer perceptron), C-RoBERTa (a fine-tuned RoBERTa classifier), MLC, and PRknn (see the nearest-neighbor sketch after this list).
  - Trivial Baseline: Random selection.
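For intuition, here is a minimal nearest-neighbor router in the spirit of the PRknn baseline. The behavior is an assumption inferred from its name (route each query to the LLM with the best average score on the $k$ most similar training queries); it is not necessarily the paper's exact implementation.

```python
# Minimal kNN-style router sketch (assumed behavior, not the paper's code).
import numpy as np

def knn_route(train_emb: np.ndarray, train_perf: np.ndarray,
              query_emb: np.ndarray, k: int = 10) -> int:
    """Return the index of the candidate LLM chosen for one query."""
    # Cosine similarity between the query and every training embedding.
    a = train_emb / np.linalg.norm(train_emb, axis=1, keepdims=True)
    q = query_emb / np.linalg.norm(query_emb)
    neighbors = np.argsort(a @ q)[-k:]            # k most similar queries
    return int(train_perf[neighbors].mean(axis=0).argmax())

rng = np.random.default_rng(0)
train_emb = rng.normal(size=(1000, 384))          # toy query embeddings
train_perf = rng.random((1000, 5))                # toy (N, m) records
query = rng.normal(size=384)
print("route query to candidate LLM", knn_route(train_emb, train_perf, query))
```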
6. Results & Analysis
- The Model-Level Scaling Up Phenomenon: Figure 2 is the cornerstone of the paper's primary claim. Across four different benchmarks (ARC, MMLU-PRO, MATH Lvl 5, TruthfulQA), the plots consistently show that as the number of LLM candidates (x-axis) increases, the overall accuracy (y-axis) rises. This effect is drastically more pronounced for more capable routers (warmer colors, representing higher $p$). With a sufficiently capable router and a large enough pool (e.g., 100+ models), the routing system's performance can easily surpass that of a powerful reference LLM (the dashed grey line). This demonstrates that routing is a viable path to "scale up" performance by adding more models, not just by making a single model bigger.
  Figure 2: Four subplots showing the model-level scaling-up effect in Routing LLMs. The x-axis is the number of LLM candidates, the y-axis is accuracy, colors correspond to different values of $p$, and the dashed line marks the reference LLM's performance.
- Performance of Existing Routers: The tables below (transcribed from Tables 1 and 2 in the paper) show the performance of baseline routers on the RouterEval benchmark for the easy settings ($m = 3$ and $m = 5$).

Table 1: The Results on RouterEval (part 1), transcribed from the paper's content. $m$ denotes the number of candidate LLMs.

| m | Router | ARC μo↑ | VR↑ | VB↑ | Ep↑ | HellaSwag μo↑ | VR↑ | VB↑ | Ep↑ | MMLU μo↑ | VR↑ | VB↑ | Ep↑ | TruthfulQA μo↑ | VR↑ | VB↑ | Ep↑ |
|---|--------|---------|-----|-----|-----|----------------|-----|-----|-----|-----------|-----|-----|-----|-----------------|-----|-----|-----|
| 3 | Oracle $r_o$ | 0.80 | 0.94 | 1.34 | 1.02 | 0.80 | 0.84 | 1.08 | 1.32 | 0.89 | 1.03 | 1.35 | 1.00 | 0.85 | 1.27 | 1.21 | 1.05 |
|   | $r(0.5)$ | 0.67 | 0.79 | 1.11 | 1.47 | 0.74 | 0.78 | 1.00 | 1.53 | 0.75 | 0.87 | 1.11 | 1.47 | 0.74 | 1.10 | 1.04 | 1.47 |
|   | LinearR | 0.61 | 0.71 | 0.96 | 1.42 | 0.75 | 0.79 | 1.00 | 1.43 | 0.74 | 0.85 | 1.04 | 1.30 | 0.72 | 1.08 | 1.00 | 1.36 |
|   | MLPR | 0.61 | 0.71 | 0.96 | 1.42 | 0.75 | 0.78 | 1.00 | 1.43 | 0.74 | 0.86 | 1.04 | 1.26 | 0.71 | 1.06 | 0.96 | 1.30 |
|   | C-RoBERTa | 0.62 | 0.73 | 1.00 | 1.03 | 0.75 | 0.79 | 1.00 | 0.29 | 0.73 | 0.84 | 1.02 | 0.62 | 0.71 | 1.06 | 0.96 | 0.31 |
|   | MLC | 0.63 | 0.74 | 1.00 | 0.81 | 0.75 | 0.78 | 1.00 | 1.01 | 0.73 | 0.85 | 1.02 | 0.79 | 0.70 | 1.05 | 0.95 | 0.49 |
|   | PRknn | 0.60 | 0.71 | 0.97 | 1.56 | 0.72 | 0.76 | 0.97 | 1.57 | 0.70 | 0.81 | 0.98 | 1.55 | 0.70 | 1.04 | 0.95 | 1.55 |
|   | Random | 0.54 | 0.64 | 0.89 | 1.59 | 0.68 | 0.71 | 0.91 | 1.59 | 0.62 | 0.71 | 0.88 | 1.59 | 0.62 | 0.93 | 0.86 | 1.59 |
| 5 | Oracle $r_o$ | 0.85 | 1.00 | 1.34 | 1.57 | 0.81 | 0.85 | 1.10 | 2.00 | 0.92 | 1.07 | 1.63 | 1.49 | 0.89 | 1.33 | 1.27 | 1.72 |
|   | $r(0.5)$ | 0.70 | 0.82 | 1.09 | 2.16 | 0.74 | 0.78 | 1.00 | 2.25 | 0.75 | 0.87 | 1.24 | 2.14 | 0.75 | 1.12 | 1.05 | 2.19 |
|   | LinearR | 0.64 | 0.75 | 0.93 | 2.15 | 0.75 | 0.79 | 1.00 | 2.19 | 0.69 | 0.80 | 1.01 | 2.04 | 0.72 | 1.08 | 0.97 | 2.15 |
|   | MLPR | 0.64 | 0.75 | 0.93 | 2.13 | 0.75 | 0.79 | 1.01 | 2.20 | 0.70 | 0.81 | 1.02 | 2.00 | 0.71 | 1.05 | 0.93 | 2.11 |
|   | C-RoBERTa | 0.66 | 0.78 | 0.97 | 0.82 | 0.75 | 0.79 | 1.00 | 0.52 | 0.68 | 0.79 | 0.98 | 1.02 | 0.70 | 1.04 | 0.92 | 0.84 |
|   | MLC | 0.63 | 0.74 | 0.90 | 1.28 | 0.75 | 0.78 | 1.01 | 1.65 | 0.69 | 0.79 | 0.99 | 1.11 | 0.68 | 1.02 | 0.91 | 1.04 |
|   | PRknn | 0.63 | 0.74 | 0.95 | 2.30 | 0.71 | 0.74 | 0.95 | 2.31 | 0.64 | 0.74 | 0.94 | 2.30 | 0.70 | 1.04 | 0.95 | 2.29 |
|   | Random | 0.55 | 0.65 | 0.83 | 2.32 | 0.67 | 0.71 | 0.91 | 2.32 | 0.58 | 0.67 | 0.86 | 2.32 | 0.61 | 0.92 | 0.83 | 2.32 |
Table 2: The Results on RouterEval (part 2), transcribed from the paper's content.

| m | Router | Winogrande μo↑ | VR↑ | VB↑ | Ep↑ | GSM8k μo↑ | VR↑ | VB↑ | Ep↑ | IFEval μo↑ | VR↑ | VB↑ | Ep↑ | BBH μo↑ | VR↑ | VB↑ | Ep↑ |
|---|--------|-----------------|-----|-----|-----|------------|-----|-----|-----|-------------|-----|-----|-----|----------|-----|-----|-----|
| 3 | Oracle $r_o$ | 0.95 | 1.09 | 1.22 | 1.20 | 0.87 | 0.95 | 1.29 | 1.10 | 0.79 | 1.02 | 1.33 | 1.04 | 0.82 | 0.99 | 1.42 | 0.97 |
|   | $r(0.5)$ | 0.86 | 0.98 | 1.09 | 1.51 | 0.76 | 0.82 | 1.10 | 1.49 | 0.67 | 0.87 | 1.08 | 1.47 | 0.68 | 0.82 | 1.15 | 1.46 |
|   | LinearR | 0.76 | 0.87 | 0.95 | 1.45 | 0.71 | 0.77 | 0.97 | 1.37 | 0.70 | 0.91 | 1.08 | 1.10 | 0.63 | 0.76 | 1.04 | 1.34 |
|   | MLPR | 0.78 | 0.89 | 0.98 | 1.30 | 0.69 | 0.75 | 0.95 | 1.33 | 0.70 | 0.91 | 1.08 | 0.94 | 0.63 | 0.76 | 1.05 | 1.30 |
|   | C-RoBERTa | 0.78 | 0.89 | 0.98 | 0.60 | 0.69 | 0.75 | 0.94 | 0.61 | 0.70 | 0.91 | 1.09 | 0.79 | 0.60 | 0.72 | 0.98 | 0.80 |
|   | MLC | 0.76 | 0.87 | 0.96 | 1.56 | 0.70 | 0.76 | 0.97 | 0.74 | 0.68 | 0.88 | 0.98 | 0.40 | 0.62 | 0.74 | 1.02 | 0.38 |
|   | PRknn | 0.74 | 0.84 | 0.92 | 1.57 | 0.70 | 0.76 | 0.99 | 1.56 | 0.69 | 0.90 | 1.04 | 1.55 | 0.61 | 0.73 | 1.00 | 1.56 |
|   | Random | 0.77 | 0.88 | 0.96 | 1.59 | 0.64 | 0.70 | 0.90 | 1.59 | 0.54 | 0.71 | 0.82 | 1.59 | 0.53 | 0.64 | 0.88 | 1.59 |
| 5 | Oracle $r_o$ | 0.98 | 1.12 | 1.31 | 1.77 | 0.89 | 0.96 | 1.33 | 1.67 | 0.81 | 1.06 | 1.36 | 1.63 | 0.88 | 1.06 | 1.69 | 1.43 |
|   | $r(0.5)$ | 0.85 | 0.97 | 1.12 | 2.21 | 0.74 | 0.81 | 1.09 | 2.19 | 0.67 | 0.87 | 1.06 | 2.17 | 0.70 | 0.84 | 1.29 | 2.13 |
|   | LinearR | 0.75 | 0.85 | 0.96 | 2.15 | 0.72 | 0.78 | 0.98 | 2.01 | 0.67 | 0.87 | 0.95 | 1.86 | 0.63 | 0.75 | 1.08 | 2.11 |
|   | MLPR | 0.80 | 0.91 | 1.03 | 2.08 | 0.72 | 0.78 | 0.98 | 1.99 | 0.67 | 0.87 | 0.96 | 1.80 | 0.62 | 0.74 | 1.05 | 2.05 |
|   | C-RoBERTa | 0.76 | 0.87 | 0.97 | 0.83 | 0.72 | 0.78 | 0.99 | 0.82 | 0.67 | 0.87 | 0.92 | 1.02 | 0.59 | 0.71 | 0.99 | 1.03 |
|   | MLC | 0.74 | 0.84 | 0.93 | 2.21 | 0.71 | 0.78 | 0.96 | 1.11 | 0.53 | 0.69 | 0.75 | 0.57 | 0.60 | 0.72 | 1.00 | 0.41 |
|   | PRknn | 0.72 | 0.83 | 0.93 | 2.30 | 0.71 | 0.77 | 1.00 | 2.30 | 0.62 | 0.80 | 0.91 | 2.29 | 0.58 | 0.70 | 1.00 | 2.29 |
|   | Random | 0.72 | 0.82 | 0.93 | 2.32 | 0.60 | 0.65 | 0.85 | 2.32 | 0.53 | 0.68 | 0.76 | 2.32 | 0.52 | 0.62 | 0.89 | 2.32 |

The results show that while existing routers outperform random selection, their performance is modest. Critically, their $V_B$ values are almost always $\leq 1$, meaning they rarely outperform the best single model in their pool. Furthermore, the large gap between existing methods and the Oracle $r_o$ highlights a massive potential for improvement.
- Analysis of Candidate Groups & Bias: Figure 3 shows how different types of candidate pools ("all-strong", "all-weak", "strong-to-weak") affect performance. With a perfect oracle router ($r_o$, i.e., $p = 1$), even a pool of "weak" models can achieve performance comparable to a strong reference LLM. This confirms that heterogeneous, weaker models possess complementary knowledge that a good router can exploit. However, existing routers like C-RoBERTa and PRknn struggle significantly with the "all-weak" group, indicating their inability to effectively manage and leverage diversity.

  Figure 3: A bar chart showing the accuracy of different candidate-model groups on TruthfulQA and MMLU. Bar colors distinguish strong, weak, and mixed groups, and the dashed line marks the reference LLM's accuracy.

  This is further explained by the classification bias. Table 3 (transcribed below) shows the entropy ($E_p$) for different routers on the MMLU benchmark.
Table 3: The $E_p$ on Various Candidate Groups (MMLU), transcribed from the paper's content.

| m | Router | all-strong | all-weak | strong-to-weak |
|---|--------|------------|----------|-----------------|
| 3 | Oracle $r_o$ | 1.39 | 0.77 | 0.96 |
|   | $r(0.5)$ | 1.55 | 1.42 | 1.45 |
|   | LinearR | 1.54 | 1.54 | 0.81 |
|   | MLPR | 1.50 | 1.52 | 0.76 |
|   | C-RoBERTa | 0.93 | 0.94 | 0.00 |
|   | MLC | 1.52 | 0.34 | 0.52 |
|   | PRknn | 1.58 | 1.56 | 1.52 |
|   | Random | 1.59 | 1.59 | 1.59 |
| 5 | Oracle $r_o$ | 2.09 | 0.90 | 1.49 |
|   | $r(0.5)$ | 2.27 | 2.00 | 2.15 |
|   | LinearR | 2.27 | 2.28 | 1.58 |
|   | MLPR | 2.26 | 2.25 | 1.50 |
|   | C-RoBERTa | 1.53 | 1.53 | 0.00 |
|   | MLC | 2.25 | 0.03 | 1.06 |
|   | PRknn | 2.31 | 2.30 | 2.28 |
|   | Random | 2.32 | 2.32 | 2.32 |
Notice the stark result for C-RoBERTa on the "strong-to-weak" group: its entropy ($E_p$) is 0.00. This means it has learned to ignore the weaker models entirely and only ever picks the strongest model, effectively degenerating into a non-routing "best single model" selector. This demonstrates a critical failure mode of current routers: they suffer from classification bias and fail to harness the collective power of the pool.
7. Conclusion & Reflections
- Conclusion Summary: This paper makes two primary contributions. First, it identifies and empirically demonstrates the "model-level scaling up" phenomenon in LLMs, showing that routing is a powerful and promising paradigm for performance enhancement. Second, it introduces RouterEval, a large-scale, open-source benchmark designed to accelerate research on LLM routers. The comprehensive evaluation on RouterEval reveals that existing router methods are still in their infancy and have significant room for improvement, particularly in overcoming classification bias and effectively leveraging model diversity.
Limitations & Future Work:
- Authors' Limitations: The authors acknowledge that deploying a large number of LLMs for a routing system can be challenging. However, they argue that significant performance gains are seen with just 3-10 candidates, which is a manageable number. They also state that the current 200 million data points in
RouterEval, while massive, are still insufficient to train a truly exceptional router, highlighting the need for a community-wide effort to collect more performance data. - Future Work: The authors propose several exciting research directions:
- Advanced Training Strategies: Using the provided raw data for pre-training, few-shot learning, or data augmentation to build more robust routers.
- Recommender System Techniques: Applying classic RS methods to tackle challenges like representation learning for inputs/LLMs, the "cold start" problem (new models or new tasks), and using causal inference to debias router predictions.
- Multi-Objective Routing: Extending the paradigm beyond just performance to also optimize for computational cost, latency, or reducing hallucinations.
- Authors' Limitations: The authors acknowledge that deploying a large number of LLMs for a routing system can be challenging. However, they argue that significant performance gains are seen with just 3-10 candidates, which is a manageable number. They also state that the current 200 million data points in
- Personal Insights & Critique:
  - Impact: This paper is a significant contribution to the field. By providing both a novel insight (model-level scaling) and a practical tool (RouterEval), it lays the groundwork for a new wave of research into intelligent, efficient, multi-model AI systems.
  - Novelty: The concept of "model-level scaling up" is an elegant and powerful framing. While the idea of routing is not entirely new, this paper is the first to study it at such a massive scale and formalize it as a scaling law.
  - Critique: The current evaluation is a simulation based on pre-computed performance records. A real-world deployment would introduce additional factors such as the router's own inference latency and the engineering complexity of maintaining a large pool of models. While the paper defers these practical concerns, they will be crucial for real-world adoption.
  - Open Questions: The "chicken-and-egg" problem remains the biggest hurdle: training great routers requires vast performance datasets, but generating these datasets is extremely expensive. RouterEval is a monumental first step, but the ultimate solution will likely involve more data-efficient training methods or community-driven data-sharing platforms. This paper successfully highlights this challenge and provides the community with the tools to start tackling it.