
RouterEval: A Comprehensive Benchmark for Routing LLMs to Explore Model-level Scaling Up in LLMs

Published: 03/08/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

The RouterEval benchmark covers 8,500+ LLMs and shows that a capable router improves performance as the number of candidate models grows, even surpassing the best single model in the pool. It enables comprehensive router evaluation and reveals that current routing methods still have substantial room for improvement.

Abstract

Routing large language models (LLMs) is a new paradigm that uses a router to recommend the best LLM from a pool of candidates for a given input. In this paper, our comprehensive analysis with more than 8,500 LLMs reveals a novel model-level scaling up phenomenon in Routing LLMs, i.e., a capable router can significantly enhance the performance of this paradigm as the number of candidates increases. This improvement can even surpass the performance of the best single model in the pool and many existing strong LLMs, confirming it a highly promising paradigm. However, the lack of comprehensive and open-source benchmarks for Routing LLMs has hindered the development of routers. In this paper, we introduce RouterEval, a benchmark tailored for router research, which includes over 200,000,000 performance records for 12 popular LLM evaluations across various areas such as commonsense reasoning, semantic understanding, etc., based on over 8,500 various LLMs. Using RouterEval, extensive evaluations of existing Routing LLM methods reveal that most still have significant room for improvement. See https://github.com/MilkThink-Lab/RouterEval for all data, code and tutorial.

In-depth Reading

1. Bibliographic Information

  • Title: RouterEval: A Comprehensive Benchmark for Routing LLMs to Explore Model-level Scaling Up in LLMs
  • Authors: Zhongzhan Huang, Guoming Ling, Yupei Lin, Yandong Chen, Shanshan Zhong, Hefeng Wu, Liang Lin
  • Affiliations: All authors are affiliated with Sun Yat-sen University.
  • Journal/Conference: The paper is available on arXiv, a preprint server for academic papers, meaning the work has not yet undergone formal peer review at a conference or journal. The arXiv identifier (2503.10657) and the provided link indicate a March 2025 submission.
  • Publication Year: 2025 (as listed on arXiv).
  • Abstract: The paper introduces "Routing LLMs," a new paradigm where a router model selects the best-suited Large Language Model (LLM) from a candidate pool for a given input. Through a massive analysis involving over 8,500 LLMs, the authors discover a "model-level scaling up" phenomenon: the performance of the routing system significantly improves as the number of candidate LLMs increases, potentially outperforming even the best single model in the pool. To address the lack of standardized evaluation tools which has hindered research in this area, the authors introduce RouterEval, a comprehensive benchmark. RouterEval contains over 200 million performance records for 12 popular LLM evaluation tasks. Using this benchmark, the paper evaluates existing routing methods and concludes that there is substantial room for improvement.
  • Original Source Link:
    • arXiv Page: https://arxiv.org/abs/2503.10657
    • PDF Link: https://arxiv.org/pdf/2503.10657v2.pdf
    • Code and Data: https://github.com/MilkThink-Lab/RouterEval

2. Executive Summary

  • Background & Motivation (Why): The AI landscape is saturated with thousands of Large Language Models (LLMs), each with unique strengths and weaknesses. For any given task, it is unlikely that a single LLM is universally the best. The core problem is how to dynamically and efficiently select the optimal LLM for a specific input query to maximize performance or other objectives (like cost-efficiency). This emerging field, which the paper calls "Routing LLMs," has been hampered by a critical lack of large-scale, open-source benchmarks to train and evaluate "router" models. Existing benchmarks are often too small, not diverse enough, or have closed-source data.
  • Main Contributions / Findings (What):
    1. Discovery of the "Model-Level Scaling Up" Phenomenon: The paper's most significant finding is that routing systems exhibit a scaling law at the model level. With a capable router, the overall system's performance increases predictably and substantially as more LLMs are added to the candidate pool. This suggests that a collection of individually weaker models can, when properly routed, collectively outperform a single, much stronger model.
    2. Introduction of the RouterEval Benchmark: To catalyze research, the authors constructed and open-sourced RouterEval, a massive benchmark for developing and testing LLM routers. It is built upon:
      • Over 8,500 LLMs, primarily open-source models.
      • Over 200,000,000 performance records, detailing how each model performed on specific inputs.
      • 12 popular LLM evaluation datasets covering diverse areas like commonsense reasoning, math, and instruction following.
    3. Comprehensive Evaluation of Existing Routers: Using RouterEval, the paper provides the first large-scale evaluation of existing router design methods. The results show that current approaches are a clear improvement over random selection but still fall far short of the theoretical optimal performance, often failing to even surpass the best single model in the candidate pool. This highlights a significant opportunity for future research.

3. Prerequisite Knowledge & Related Work

To understand this paper, a few key concepts are essential.

  • Foundational Concepts:

    • Large Language Models (LLMs): These are advanced AI models (e.g., GPT-4, Llama 2) trained on vast amounts of text data to understand and generate human-like language. They have varying sizes, capabilities, and specializations.

    • Routing LLMs: As defined in this paper, this is a paradigm where a dedicated "router" model analyzes an incoming query and dispatches it to the most suitable LLM from a predefined pool. This is distinct from using a single, general-purpose LLM for all tasks. As illustrated in Figure 1, different inputs can be sent to different LLMs to optimize for accuracy or other goals.

      Figure 1: The Overview of Routing LLMs. For each given input, the router distributes it to the appropriate LLM to achieve specific objectives, such as high accuracy, low computational cost, or reduced hallucination. The figure is a schematic of the Routing LLMs workflow: two inputs are assigned by the router to different models in the LLM pool and yield different outputs, illustrating how the routing decision affects task outcomes.

  • Technological Evolution and Differentiation: The Routing LLMs paradigm is related to, but distinct from, several other techniques for combining models:

    • Mixture-of-Experts (MoE): Traditional MoE models (like Mixtral 8x7B) operate within a single neural network. They have multiple "expert" sub-modules (e.g., feed-forward networks), and a gating network routes each input token to a small subset of these experts. Routing LLMs can be seen as a model-level MoE, where the "experts" are entire, independent LLMs. This allows for routing heterogeneous models with different architectures.
    • LLM Ensembling: Ensemble methods typically run an input through all candidate models and then aggregate their outputs (e.g., via majority vote). This can improve accuracy but is computationally expensive as it requires multiple inferences. Routing LLMs are more efficient because they only require inference from one selected model per input.
    • LLM Fusion/Merging: This involves combining the parameters of multiple LLMs (usually with identical architectures) into a single, new model. Routing LLMs do not merge the models; they keep them separate and simply choose among them, allowing for greater flexibility and the inclusion of heterogeneous models (e.g., models of different sizes or from different families).
    • Recommender Systems (RS): The paper astutely frames LLM routing as a recommendation problem. The input query is the "user," the candidate LLMs are the "items," and the router's job is to recommend the best item for the user. The performance records in RouterEval act as the historical user-item interaction data.
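
To make the recommender-system analogy concrete, here is a minimal, self-contained sketch using synthetic data and assumed shapes (not the paper's implementation): past queries play the role of users, candidate LLMs the role of items, and recorded scores the role of interaction history. A k-nearest-neighbor router recommends the LLM that performed best on the most similar historical queries; the PRknn baseline evaluated later appears to be in this spirit, though its exact details may differ.

```python
import numpy as np

# Toy illustration of routing-as-recommendation with synthetic data.
rng = np.random.default_rng(0)

n_train, n_llms, dim = 1000, 5, 64            # hypothetical sizes
train_emb = rng.normal(size=(n_train, dim))   # embeddings of past queries ("users")
# perf[j, i] = recorded score of LLM i on past query j (0/1 correctness here)
perf = rng.integers(0, 2, size=(n_train, n_llms)).astype(float)

def knn_route(query_emb: np.ndarray, k: int = 10) -> int:
    """Return the index of the LLM with the best average score
    over the k most similar historical queries (cosine similarity)."""
    q = query_emb / np.linalg.norm(query_emb)
    t = train_emb / np.linalg.norm(train_emb, axis=1, keepdims=True)
    sims = t @ q                               # cosine similarity to each past query
    neighbors = np.argsort(-sims)[:k]          # indices of the k nearest neighbors
    return int(perf[neighbors].mean(axis=0).argmax())

chosen = knn_route(rng.normal(size=dim))
print(f"route this query to LLM #{chosen}")
```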

4. Methodology (Core Technology & Implementation)

The paper formalizes the task of building an LLM router and introduces the structure of the RouterEval benchmark.

  • Principles: The core idea is to treat the selection of the best LLM as a multi-class classification problem. The router's goal is to learn a mapping from an input query to the index of the best-performing LLM in the candidate pool.

  • Steps & Procedures:

    1. Input Representation: A given input sentence $s_j$ is encoded into a fixed-size numerical vector $\kappa(s_j)$ using a pre-trained sentence encoder such as Sentence-BERT or RoBERTa.
    2. LLM Candidate Pool: A pool of $m$ LLMs, denoted $\{\ell_i\}_{i=1}^m$, is defined.
    3. Ground-Truth Labels: For each input $s_j$, a target selection vector $v_j \in \{0, 1\}^m$ is created based on pre-computed performance records. If the metric is correctness, any LLM that answers correctly can have its corresponding dimension in $v_j$ set to 1. For continuous metrics, models performing within 95% of the best performance are marked as optimal (1). This allows for multiple correct choices.
    4. Router Training: A router model $r_\theta$ with learnable parameters $\theta$ is trained on the dataset $\{(\kappa(s_j), v_j)\}_{j=1}^n$. The training objective is to make the router's prediction for a given input match the ground-truth selection vector. This is formalized as $r_\theta[\kappa(s_j) \mid \mathcal{D}] \rightarrow v_j$, where $\mathcal{D}$ represents optional external data that can be used for training.
  • Simulating Router Capability: To study the impact of router quality, the authors define a simulated router $r_o(p)$:
    $$r_o(p) = \begin{cases} r_o, & \text{with probability } p, \\ \omega_m, & \text{with probability } 1 - p. \end{cases}$$

    • $r_o$ is an oracle router that perfectly selects an optimal LLM whenever one exists in the pool for the given input.
    • $\omega_m$ is a random router that picks an LLM uniformly at random from the $m$ candidates.
    • $p \in [0, 1]$ is the probability of using the oracle. A higher $p$ simulates a more capable router, while $p = 0$ corresponds to a random router. This elegant construction allows the authors to study precisely how performance scales with both router capability ($p$) and the number of candidates ($m$). (A toy sketch combining a trained router with this simulated $r_o(p)$ appears at the end of this section.)
  • Construction of RouterEval:

    • Data Format: The benchmark provides pairs of input embeddings and target selection vectors: $(\mathcal{X}, \mathcal{Y}) = \{(\kappa(s_j), v_j)\}_{j=1}^n$.
    • LLM Candidate Groups: To ensure robust evaluation, for each benchmark and number of candidates $m$, three types of candidate groups are created:
      1. "all-strong": Candidates are sampled from the top 20% of performing LLMs.
      2. "all-weak": Candidates are sampled from the bottom 20% of performing LLMs.
      3. "strong-to-weak": A mix of models from across the performance spectrum. This setup allows for analyzing router performance under different scenarios, such as whether a router can effectively leverage a pool of individually weak but complementary models.
    • Extra Training Data: In addition to the direct training pairs, RouterEval provides over 200 million raw performance records. This massive dataset can be used for more advanced training techniques like pre-training, data augmentation, or methods inspired by recommender systems.
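
The formulation above can be made concrete with a short sketch. The following is a minimal illustration with assumed shapes and synthetic data, not the released RouterEval code: it derives target selection vectors $v_j$ from performance records using the 95% rule, trains a simple linear classifier as a stand-in for a LinearR-style router, and simulates the oracle/random mixture $r_o(p)$.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, m, dim = 2000, 5, 64                        # samples, candidate LLMs, embedding size (assumed)
X = rng.normal(size=(n, dim))                  # kappa(s_j): precomputed query embeddings
scores = rng.random(size=(n, m))               # raw performance records (continuous metric)

# v_j: mark every LLM within 95% of the best score on that input as optimal.
best = scores.max(axis=1, keepdims=True)
V = (scores >= 0.95 * best).astype(int)

# Train the router as a multi-class classifier; here the label is the single
# best LLM per input (one simple reduction of the multi-label target V).
y = scores.argmax(axis=1)
router = LogisticRegression(max_iter=1000).fit(X[:1500], y[:1500])

def route(x: np.ndarray) -> int:
    """Predict which candidate LLM should handle embedding x."""
    return int(router.predict(x[None, :])[0])

def simulated_router(j: int, p: float) -> int:
    """r_o(p): with probability p pick an optimal LLM (oracle), else pick uniformly at random."""
    if rng.random() < p and V[j].any():
        optimal = np.flatnonzero(V[j])
        return int(optimal[rng.integers(len(optimal))])
    return int(rng.integers(m))

# Hit rate (did the selected LLM belong to the optimal set?) on held-out inputs.
test = np.arange(1500, n)
learned_acc = np.mean([V[j, route(X[j])] for j in test])
sim_acc = np.mean([V[j, simulated_router(j, 0.5)] for j in test])
print(f"learned router hit-rate: {learned_acc:.3f}, r_o(0.5) hit-rate: {sim_acc:.3f}")
```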

5. Experimental Setup

  • Datasets: RouterEval is built from 12 diverse and popular LLM evaluation benchmarks: ARC (reasoning), HellaSwag (commonsense), MMLU (multitask knowledge), TruthfulQA (truthfulness), Winogrande (commonsense), GSM8k (math word problems), IFEval (instruction following), BBH (Big-Bench Hard), GPQA (graduate-level physics, biology, chemistry), MUSR (multi-step reasoning), MATH Lvl 5 (advanced math), and MMLU-PRO (professional-level MMLU).
  • Evaluation Metrics: The paper uses four metrics to evaluate router performance (a small worked computation of all four appears after the baselines list below):
    1. Original Metric ($\mu_o$): The final performance (e.g., accuracy) of the system after routing. A higher value is better.
    2. Reference Value ($V_R$): $V_R = \frac{\mu_o(r_\theta)}{\text{Perf.(ref.)}}$
      • Conceptual Definition: This metric compares the router's performance to that of a strong, state-of-the-art reference LLM (like GPT-4). A value greater than 1 means the routing system outperforms the strong reference model.
      • Symbol Explanation: $\mu_o(r_\theta)$ is the router's performance, and Perf.(ref.) is the reference model's performance on the same benchmark.
    3. Best Single Model Value ($V_B$): $V_B = \frac{\mu_o(r_\theta)}{\text{Perf.(BSM)}}$
      • Conceptual Definition: This metric measures whether the router adds value beyond simply identifying and always picking the single best model from the candidate pool. A value greater than 1 indicates that the router is successfully leveraging the complementary strengths of multiple models.
      • Symbol Explanation: Perf.(BSM) is the performance of the best-performing single model within the given candidate pool.
    4. Classification Bias ($E_p$): $E_p = -\frac{1}{n} \sum_{j=1}^{n} \sum_{i=1}^{m} P_i^{(j)} \log P_i^{(j)}$
      • Conceptual Definition: This metric uses Shannon entropy to measure the diversity of the router's selections. A high entropy value indicates the router is selecting a wide variety of LLMs, while a value near 0 implies classification bias, where the router almost always picks the same model, defeating the purpose of routing.
      • Symbol Explanation: $n$ is the number of test samples, $m$ is the number of candidate LLMs, and $P_i^{(j)}$ is the probability assigned by the router to selecting the $i$-th LLM for the $j$-th sample.
  • Baselines:
    • Strong Routers: The oracle router ($r_o$) and $r_o(0.5)$ (a 50/50 mix of oracle and random selection) serve as strong reference points, with the oracle marking the performance ceiling.
    • Existing Routers: LinearR (a linear classifier), MLPR (a multi-layer perceptron), C-RoBERTa (a fine-tuned RoBERTa classifier), MLC, and PRknn.
    • Trivial Baseline: Random selection.
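
As a small worked example of the four metrics referenced above, the sketch below computes $\mu_o$, $V_R$, $V_B$, and $E_p$ from synthetic performance records and router selection probabilities; the reference score of 0.80 is an arbitrary placeholder, not a number from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 500, 5
perf = rng.random(size=(n, m))              # perf[j, i]: score of LLM i on test input j (synthetic)
probs = rng.dirichlet(np.ones(m), size=n)   # router's selection distribution per input (synthetic)
choices = probs.argmax(axis=1)              # the LLM actually selected for each input

mu_o = perf[np.arange(n), choices].mean()   # original metric: realized performance after routing
perf_ref = 0.80                             # assumed reference LLM score (placeholder)
perf_bsm = perf.mean(axis=0).max()          # best single model within this candidate pool

V_R = mu_o / perf_ref                       # > 1: the routed system beats the reference LLM
V_B = mu_o / perf_bsm                       # > 1: routing beats always using the best single model
E_p = -(probs * np.log(probs + 1e-12)).sum(axis=1).mean()  # mean selection entropy (bias check)

print(f"mu_o={mu_o:.3f}  V_R={V_R:.3f}  V_B={V_B:.3f}  E_p={E_p:.3f}")
```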

6. Results & Analysis

  • The Model-Level Scaling Up Phenomenon: Figure 2 is the cornerstone of the paper's primary claim. Across four different benchmarks (ARC, MMLU-PRO, MATH Lvl 5, TruthfulQA), the plots consistently show that as the number of LLM candidates (x-axis) increases, the overall accuracy (y-axis) rises. This effect is drastically more pronounced for more capable routers (warmer colors, representing higher $p$). With a sufficiently capable router (e.g., $p \ge 0.5$) and a large enough pool (e.g., 100+ models), the routing system's performance can easily surpass that of a powerful reference LLM (the dashed grey line). This demonstrates that routing is a viable path to "scale up" performance by adding more models, not just by making a single model bigger.

    Figure 2: The Model-level Scaling Up Phenomenon in Routing LLMs. The probability $p$ reflects router capability; if $p = 0$, then $r_o(p)$ degenerates into a random sampler. Four subplots show accuracy (y-axis) against the number of LLM candidates (x-axis), with colors corresponding to different values of $p$ and a dashed line marking the reference LLM's performance.

  • Performance of Existing Routers: The tables below (transcribed from Tables 1 and 2 in the paper) show the performance of baseline routers on the RouterEval benchmark for the easy settings ($m = 3$ and $m = 5$).

    Table 1: The Results on RouterEval (part 1). Transcribed from the paper's content; $m$ denotes the number of candidate LLMs. Each cell lists $\mu_o$ / $V_R$ / $V_B$ / $E_p$ (higher is better for all four).

    | m | Router | ARC | HellaSwag | MMLU | TruthfulQA |
    |---|--------|-----|-----------|------|------------|
    | 3 | Oracle $r_o$ | 0.80 / 0.94 / 1.34 / 1.02 | 0.80 / 0.84 / 1.08 / 1.32 | 0.89 / 1.03 / 1.35 / 1.00 | 0.85 / 1.27 / 1.21 / 1.05 |
    | 3 | $r_o(0.5)$ | 0.67 / 0.79 / 1.11 / 1.47 | 0.74 / 0.78 / 1.00 / 1.53 | 0.75 / 0.87 / 1.11 / 1.47 | 0.74 / 1.10 / 1.04 / 1.47 |
    | 3 | LinearR | 0.61 / 0.71 / 0.96 / 1.42 | 0.75 / 0.79 / 1.00 / 1.43 | 0.74 / 0.85 / 1.04 / 1.30 | 0.72 / 1.08 / 1.00 / 1.36 |
    | 3 | MLPR | 0.61 / 0.71 / 0.96 / 1.42 | 0.75 / 0.78 / 1.00 / 1.43 | 0.74 / 0.86 / 1.04 / 1.26 | 0.71 / 1.06 / 0.96 / 1.30 |
    | 3 | C-RoBERTa | 0.62 / 0.73 / 1.00 / 1.03 | 0.75 / 0.79 / 1.00 / 0.29 | 0.73 / 0.84 / 1.02 / 0.62 | 0.71 / 1.06 / 0.96 / 0.31 |
    | 3 | MLC | 0.63 / 0.74 / 1.00 / 0.81 | 0.75 / 0.78 / 1.00 / 1.01 | 0.73 / 0.85 / 1.02 / 0.79 | 0.70 / 1.05 / 0.95 / 0.49 |
    | 3 | PRknn | 0.60 / 0.71 / 0.97 / 1.56 | 0.72 / 0.76 / 0.97 / 1.57 | 0.70 / 0.81 / 0.98 / 1.55 | 0.70 / 1.04 / 0.95 / 1.55 |
    | 3 | Random | 0.54 / 0.64 / 0.89 / 1.59 | 0.68 / 0.71 / 0.91 / 1.59 | 0.62 / 0.71 / 0.88 / 1.59 | 0.62 / 0.93 / 0.86 / 1.59 |
    | 5 | Oracle $r_o$ | 0.85 / 1.00 / 1.34 / 1.57 | 0.81 / 0.85 / 1.10 / 2.00 | 0.92 / 1.07 / 1.63 / 1.49 | 0.89 / 1.33 / 1.27 / 1.72 |
    | 5 | $r_o(0.5)$ | 0.70 / 0.82 / 1.09 / 2.16 | 0.74 / 0.78 / 1.00 / 2.25 | 0.75 / 0.87 / 1.24 / 2.14 | 0.75 / 1.12 / 1.05 / 2.19 |
    | 5 | LinearR | 0.64 / 0.75 / 0.93 / 2.15 | 0.75 / 0.79 / 1.00 / 2.19 | 0.69 / 0.80 / 1.01 / 2.04 | 0.72 / 1.08 / 0.97 / 2.15 |
    | 5 | MLPR | 0.64 / 0.75 / 0.93 / 2.13 | 0.75 / 0.79 / 1.01 / 2.20 | 0.70 / 0.81 / 1.02 / 2.00 | 0.71 / 1.05 / 0.93 / 2.11 |
    | 5 | C-RoBERTa | 0.66 / 0.78 / 0.97 / 0.82 | 0.75 / 0.79 / 1.00 / 0.52 | 0.68 / 0.79 / 0.98 / 1.02 | 0.70 / 1.04 / 0.92 / 0.84 |
    | 5 | MLC | 0.63 / 0.74 / 0.90 / 1.28 | 0.75 / 0.78 / 1.01 / 1.65 | 0.69 / 0.79 / 0.99 / 1.11 | 0.68 / 1.02 / 0.91 / 1.04 |
    | 5 | PRknn | 0.63 / 0.74 / 0.95 / 2.30 | 0.71 / 0.74 / 0.95 / 2.31 | 0.64 / 0.74 / 0.94 / 2.30 | 0.70 / 1.04 / 0.95 / 2.29 |
    | 5 | Random | 0.55 / 0.65 / 0.83 / 2.32 | 0.67 / 0.71 / 0.91 / 2.32 | 0.58 / 0.67 / 0.86 / 2.32 | 0.61 / 0.92 / 0.83 / 2.32 |

    Table 2: The Results on RouterEval (part 2, continued). Transcribed from the paper's content; each cell lists $\mu_o$ / $V_R$ / $V_B$ / $E_p$ (higher is better for all four).

    | m | Router | Winogrande | GSM8k | IFEval | BBH |
    |---|--------|------------|-------|--------|-----|
    | 3 | Oracle $r_o$ | 0.95 / 1.09 / 1.22 / 1.20 | 0.87 / 0.95 / 1.29 / 1.10 | 0.79 / 1.02 / 1.33 / 1.04 | 0.82 / 0.99 / 1.42 / 0.97 |
    | 3 | $r_o(0.5)$ | 0.86 / 0.98 / 1.09 / 1.51 | 0.76 / 0.82 / 1.10 / 1.49 | 0.67 / 0.87 / 1.08 / 1.47 | 0.68 / 0.82 / 1.15 / 1.46 |
    | 3 | LinearR | 0.76 / 0.87 / 0.95 / 1.45 | 0.71 / 0.77 / 0.97 / 1.37 | 0.70 / 0.91 / 1.08 / 1.10 | 0.63 / 0.76 / 1.04 / 1.34 |
    | 3 | MLPR | 0.78 / 0.89 / 0.98 / 1.30 | 0.69 / 0.75 / 0.95 / 1.33 | 0.70 / 0.91 / 1.08 / 0.94 | 0.63 / 0.76 / 1.05 / 1.30 |
    | 3 | C-RoBERTa | 0.78 / 0.89 / 0.98 / 0.60 | 0.69 / 0.75 / 0.94 / 0.61 | 0.70 / 0.91 / 1.09 / 0.79 | 0.60 / 0.72 / 0.98 / 0.80 |
    | 3 | MLC | 0.76 / 0.87 / 0.96 / 1.56 | 0.70 / 0.76 / 0.97 / 0.74 | 0.68 / 0.88 / 0.98 / 0.40 | 0.62 / 0.74 / 1.02 / 0.38 |
    | 3 | PRknn | 0.74 / 0.84 / 0.92 / 1.57 | 0.70 / 0.76 / 0.99 / 1.56 | 0.69 / 0.90 / 1.04 / 1.55 | 0.61 / 0.73 / 1.00 / 1.56 |
    | 3 | Random | 0.77 / 0.88 / 0.96 / 1.59 | 0.64 / 0.70 / 0.90 / 1.59 | 0.54 / 0.71 / 0.82 / 1.59 | 0.53 / 0.64 / 0.88 / 1.59 |
    | 5 | Oracle $r_o$ | 0.98 / 1.12 / 1.31 / 1.77 | 0.89 / 0.96 / 1.33 / 1.67 | 0.81 / 1.06 / 1.36 / 1.63 | 0.88 / 1.06 / 1.69 / 1.43 |
    | 5 | $r_o(0.5)$ | 0.85 / 0.97 / 1.12 / 2.21 | 0.74 / 0.81 / 1.09 / 2.19 | 0.67 / 0.87 / 1.06 / 2.17 | 0.70 / 0.84 / 1.29 / 2.13 |
    | 5 | LinearR | 0.75 / 0.85 / 0.96 / 2.15 | 0.72 / 0.78 / 0.98 / 2.01 | 0.67 / 0.87 / 0.95 / 1.86 | 0.63 / 0.75 / 1.08 / 2.11 |
    | 5 | MLPR | 0.80 / 0.91 / 1.03 / 2.08 | 0.72 / 0.78 / 0.98 / 1.99 | 0.67 / 0.87 / 0.96 / 1.80 | 0.62 / 0.74 / 1.05 / 2.05 |
    | 5 | C-RoBERTa | 0.76 / 0.87 / 0.97 / 0.83 | 0.72 / 0.78 / 0.99 / 0.82 | 0.67 / 0.87 / 0.92 / 1.02 | 0.59 / 0.71 / 0.99 / 1.03 |
    | 5 | MLC | 0.74 / 0.84 / 0.93 / 2.21 | 0.71 / 0.78 / 0.96 / 1.11 | 0.53 / 0.69 / 0.75 / 0.57 | 0.60 / 0.72 / 1.00 / 0.41 |
    | 5 | PRknn | 0.72 / 0.83 / 0.93 / 2.30 | 0.71 / 0.77 / 1.00 / 2.30 | 0.62 / 0.80 / 0.91 / 2.29 | 0.58 / 0.70 / 1.00 / 2.29 |
    | 5 | Random | 0.72 / 0.82 / 0.93 / 2.32 | 0.60 / 0.65 / 0.85 / 2.32 | 0.53 / 0.68 / 0.76 / 2.32 | 0.52 / 0.62 / 0.89 / 2.32 |

    The results show that while existing routers outperform random selection, their performance is modest. Critically, their $V_B$ values are almost always $\le 1$, meaning they rarely outperform the best single model in their pool. Furthermore, the large gap between existing methods and the Oracle $r_o$ highlights a massive potential for improvement.

  • Analysis of Candidate Groups & Bias: Figure 3 shows how different types of candidate pools ("all-strong", "all-weak", "strong-to-weak") affect performance. With a perfect oracle router ($r_o$), even a pool of "weak" models can achieve performance comparable to a strong reference LLM. This confirms that heterogeneous, weaker models possess complementary knowledge that a good router can exploit. However, existing routers like C-RoBERTa and PRknn struggle significantly with the "all-weak" group, indicating their inability to effectively manage and leverage diversity.

    Figure 3: The Results on Different Candidate Groups. A bar chart showing accuracy on TruthfulQA and MMLU for different candidate groups; bar colors distinguish strong, weak, and mixed groups, and the dashed line marks the reference LLM's accuracy.

    This is further explained by the classification bias. Table 3 (transcribed below) shows the entropy ($E_p$) for different routers on the MMLU benchmark.

    Table 3: The $E_p$ on Various Candidate Groups (MMLU). Transcribed from the paper's content.

    | m | Router | all-strong | all-weak | strong-to-weak |
    |---|--------|------------|----------|----------------|
    | 3 | Oracle $r_o$ | 1.39 | 0.77 | 0.96 |
    | 3 | $r_o(0.5)$ | 1.55 | 1.42 | 1.45 |
    | 3 | LinearR | 1.54 | 1.54 | 0.81 |
    | 3 | MLPR | 1.50 | 1.52 | 0.76 |
    | 3 | C-RoBERTa | 0.93 | 0.94 | 0.00 |
    | 3 | MLC | 1.52 | 0.34 | 0.52 |
    | 3 | PRknn | 1.58 | 1.56 | 1.52 |
    | 3 | Random | 1.59 | 1.59 | 1.59 |
    | 5 | Oracle $r_o$ | 2.09 | 0.90 | 1.49 |
    | 5 | $r_o(0.5)$ | 2.27 | 2.00 | 2.15 |
    | 5 | LinearR | 2.27 | 2.28 | 1.58 |
    | 5 | MLPR | 2.26 | 2.25 | 1.50 |
    | 5 | C-RoBERTa | 1.53 | 1.53 | 0.00 |
    | 5 | MLC | 2.25 | 0.03 | 1.06 |
    | 5 | PRknn | 2.31 | 2.30 | 2.28 |
    | 5 | Random | 2.32 | 2.32 | 2.32 |

    Notice the stark result for C-RoBERTa on the "strong-to-weak" group: its entropy ($E_p$) is 0.00. This means it has learned to ignore the weaker models entirely and only ever picks the strongest model, effectively degenerating into a non-routing "best single model" selector. This demonstrates a critical failure mode of current routers: they suffer from classification bias and fail to harness the collective power of the pool.

7. Conclusion & Reflections

  • Conclusion Summary: This paper makes two primary contributions. First, it identifies and empirically demonstrates the "model-level scaling up" phenomenon in LLMs, showing that routing is a powerful and promising paradigm for performance enhancement. Second, it introduces RouterEval, a large-scale, open-source benchmark designed to accelerate research on LLM routers. The comprehensive evaluation on RouterEval reveals that existing router methods are still in their infancy and have significant room for improvement, particularly in overcoming classification bias and effectively leveraging model diversity.

  • Limitations & Future Work:

    • Authors' Limitations: The authors acknowledge that deploying a large number of LLMs for a routing system can be challenging. However, they argue that significant performance gains are seen with just 3-10 candidates, which is a manageable number. They also state that the current 200 million data points in RouterEval, while massive, are still insufficient to train a truly exceptional router, highlighting the need for a community-wide effort to collect more performance data.
    • Future Work: The authors propose several exciting research directions:
      1. Advanced Training Strategies: Using the provided raw data for pre-training, few-shot learning, or data augmentation to build more robust routers.
      2. Recommender System Techniques: Applying classic RS methods to tackle challenges like representation learning for inputs/LLMs, the "cold start" problem (new models or new tasks), and using causal inference to debias router predictions.
      3. Multi-Objective Routing: Extending the paradigm beyond just performance to also optimize for computational cost, latency, or reducing hallucinations.
  • Personal Insights & Critique:

    • Impact: This paper is a significant contribution to the field. By providing both a novel insight (model-level scaling) and a practical tool (RouterEval), it lays the groundwork for a new wave of research into intelligent, efficient, and multi-model AI systems.
    • Novelty: The concept of "model-level scaling up" is an elegant and powerful framing. While the idea of routing isn't entirely new, this paper is the first to study it at such a massive scale and formalize it as a scaling law.
    • Critique: The current evaluation is a simulation based on pre-computed performance records. A real-world deployment would introduce additional factors like the router's own inference latency and the engineering complexity of maintaining a large pool of models. While the paper defers these practical concerns, they will be crucial for real-world adoption.
    • Open Questions: The "chicken-and-egg" problem remains the biggest hurdle: to train great routers, we need vast performance datasets, but generating these datasets is extremely expensive. RouterEval is a monumental first step, but the ultimate solution will likely involve more clever, data-efficient training methods or community-driven data-sharing platforms. This paper successfully highlights this challenge and provides the community with the tools to start tackling it.
