Router-R1: Teaching LLMs Multi-Round Routing and Aggregation via Reinforcement Learning
TL;DR Summary
Router-R1, an RL framework, tackles complex multi-LLM tasks via sequential routing and aggregation. It uses an LLM to interleave reasoning and model calls, optimizing the performance-cost trade-off with a novel reward. Generalizing to unseen models via simple descriptors, it outperforms strong baselines on seven QA benchmarks while maintaining robust generalization and cost management.
Abstract
The rapid emergence of diverse large language models (LLMs) has spurred the development of LLM routers that assign user queries to the most suitable model. However, existing LLM routers typically perform a single-round, one-to-one mapping (i.e., assigning each query to a single model in isolation), which limits their capability to tackle complex tasks that demand the complementary strengths of multiple LLMs. In this paper, we present Router-R1, a reinforcement learning (RL)-based framework that formulates multi-LLM routing and aggregation as a sequential decision process. Router-R1 instantiates the router itself as a capable LLM, leveraging its reasoning ability to interleave "think" actions (internal deliberation) with "route" actions (dynamic model invocation), and integrates each response into its evolving context. To facilitate learning, we employ a lightweight rule-based reward comprising format rewards, final outcome rewards, and a novel cost reward for optimizing the balance between performance and cost, opening a pathway toward enhancing performance-cost trade-offs via RL. Router-R1 also conditions only on simple model descriptors such as pricing, latency, and example performance, enabling strong generalization to unseen model selection. Experiments on seven general and multi-hop QA benchmarks show that Router-R1 outperforms several strong baselines, achieving superior performance while maintaining robust generalization and cost management.
In-depth Reading
English Analysis
1. Bibliographic Information
- Title: Router-R1: Teaching LLMs Multi-Round Routing and Aggregation via Reinforcement Learning
- Authors: Haozhen Zhang, Tao Feng, Jiaxuan You (University of Illinois at Urbana-Champaign)
- Journal/Conference: This paper is available on arXiv, which is a preprint server. This means it has not yet undergone formal peer review for publication in a conference or journal but is shared to disseminate research quickly.
- Publication Year: 2025 (as per the arXiv submission date; the arXiv ID 2506.09033 corresponds to June 2025).
- Abstract: The paper addresses the limitation of existing Large Language Model (LLM) routers, which typically assign a user query to a single "best" model in one step. This approach is insufficient for complex tasks that could benefit from the combined strengths of multiple LLMs. The authors propose Router-R1, a framework based on Reinforcement Learning (RL) that treats routing as a sequential decision-making process. Router-R1 uses an LLM as the router itself, allowing it to alternate between internal reasoning (think) and calling other LLMs (route). The framework is trained with a simple rule-based reward system that considers output format, correctness, and a novel cost component to balance performance and expense. A key feature is its ability to generalize to new, unseen LLMs using simple text descriptions of their capabilities, price, and latency, without needing to be retrained. Experiments on seven question-answering benchmarks show Router-R1 outperforms strong baselines in performance, generalization, and cost management.
- Original Source Link:
- arXiv Link: https://arxiv.org/abs/2506.09033
- PDF Link: http://arxiv.org/pdf/2506.09033v2
2. Executive Summary
- Background & Motivation (Why):
- Core Problem: With a vast and growing number of LLMs available, each with unique strengths (e.g., creativity, factual accuracy, coding ability), a key challenge is selecting the right one for a given task. Current LLM "routers" simplify this problem by mapping one query to one model in a single shot. This is a significant limitation for complex problems, such as multi-hop question answering, which often require breaking down a problem and integrating information from multiple sources or perspectives—something a single model might struggle with.
- Importance & Gaps: The single-round, one-to-one mapping approach fails to orchestrate the complementary abilities of different LLMs. For instance, one model might be good at decomposing a question, another at finding factual details, and a third at synthesizing a coherent answer. Existing routers cannot manage such a collaborative workflow. Furthermore, training a system to make a sequence of discrete model choices is difficult for standard gradient-based machine learning methods.
- Fresh Angle: Router-R1 re-frames the problem from a single "dispatch" decision to a sequential decision-making process. It introduces the idea of an LLM acting as the "coordinator" or router, which can introspect (think), delegate sub-tasks to other LLMs (route), and integrate the results to build a final answer over multiple rounds. This process is optimized using Reinforcement Learning, which is well-suited for learning sequences of actions to maximize a long-term reward.
- Main Contributions / Findings (What):
- A Novel RL-based Framework for Multi-Round Routing: Router-R1 is the first framework to formulate multi-LLM coordination as a sequential decision process trained with RL. It enables dynamic, multi-step interactions with a pool of LLMs.
- An LLM as the Router: By instantiating the router itself as a capable LLM, the framework naturally combines internal reasoning (deliberation) with external tool use (model invocation), allowing for more sophisticated and adaptive problem-solving strategies.
- A Lightweight and Effective Reward Function: The paper introduces a simple rule-based reward function with three components: a format reward to ensure structured output, a final outcome reward for correctness, and a novel cost reward to optimize the trade-off between performance and the financial/computational cost of using powerful models.
- Strong Generalization to Unseen Models: Router-R1 can incorporate new LLMs into its routing pool at inference time without retraining, simply by being provided with their text descriptions (e.g., price, specialization), as illustrated in the sketch below. This is a crucial feature for practical deployment in the rapidly evolving LLM ecosystem.
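To make the descriptor idea concrete, here is a minimal Python sketch of what such a candidate pool might look like. The field names, values, and rendering format are illustrative assumptions, not the paper's exact schema; the key point is that each candidate is summarized as plain text the router LLM can read at inference time.

```python
# Illustrative only: field names and values are assumptions, not the paper's schema.
candidate_pool = [
    {
        "name": "llama-3.1-70b-instruct",
        "description": "General-purpose 70B model; strong factual recall.",
        "price_per_1k_tokens": 0.9,   # hypothetical pricing
        "avg_latency_s": 2.1,         # hypothetical latency
        "example_performance": "EM 0.41 on a held-out QA sample",
    },
    {
        "name": "mixtral-8x22b-instruct",
        "description": "Mixture-of-experts model; good at question decomposition.",
        "price_per_1k_tokens": 1.2,
        "avg_latency_s": 2.8,
        "example_performance": "EM 0.39 on a held-out QA sample",
    },
]

def render_descriptors(pool):
    """Flatten descriptors into the text block shown to the router LLM,
    so new models can be added by appending an entry, without retraining."""
    return "\n".join(
        f"- {m['name']}: {m['description']} "
        f"(price: ${m['price_per_1k_tokens']}/1k tok, "
        f"latency: ~{m['avg_latency_s']}s, {m['example_performance']})"
        for m in pool
    )
```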
3. Prerequisite Knowledge & Related Work
- Foundational Concepts:
- Large Language Models (LLMs): These are massive neural networks trained on vast amounts of text data (e.g., GPT-4, LLaMA). They can understand and generate human-like text, answer questions, write code, and perform a wide range of language tasks. Different LLMs excel at different things.
- LLM Router: An LLM router is a system that acts as a traffic controller. Given a user's query, it decides which LLM from a pool of available models is best suited to handle it. The goal is typically to maximize answer quality, minimize cost, or reduce latency.
- Reinforcement Learning (RL): A paradigm of machine learning where an "agent" learns to make decisions by performing actions in an "environment" to maximize a cumulative "reward." The agent learns a "policy"—a strategy for choosing actions—through trial and error. This is different from supervised learning, where the model is given explicit correct answers. RL is ideal for tasks involving sequences of decisions, like playing a game or, in this case, deciding which LLM to call next.
- Multi-hop Question Answering (QA): A type of complex question answering where finding the final answer requires finding and combining multiple pieces of information (or "hops"). For example, "Who was the U.S. president when the director of Inception was born?" requires finding the director's name (Christopher Nolan), his birth year (1970), and then who was president in that year (Richard Nixon).
- Previous Works:
- Query-based Routers (Single-Round): The paper positions itself against a line of work on single-shot routers. FrugalGPT and FORC use cascades or simple routing to balance cost and performance, often starting with cheaper models and escalating if necessary. GraphRouter and RouterDC use more advanced techniques like graph-based prediction or contrastive learning to learn better mappings between queries and models. However, all these methods are fundamentally single-round: they make one decision and stop; they do not orchestrate a sequence of calls.
- Optimizing LLMs with RL: RL has been successfully used to fine-tune LLM behavior. RLHF (Reinforcement Learning from Human Feedback) is a well-known technique for aligning LLMs with human preferences, making them more helpful and harmless. Search-R1 is a closely related work that uses RL to teach an LLM how to interact with a search engine over multiple turns to answer questions. This is similar in spirit to Router-R1, but Search-R1 focuses on a single external tool (a search engine), whereas Router-R1 orchestrates a pool of diverse LLMs.
- Differentiation: Router-R1's key innovation is its multi-round, reasoning-interleaved routing process. Unlike prior routers that perform a static, one-time assignment, Router-R1 engages in a dynamic dialogue with its pool of LLMs. It can pose a sub-question to one model, analyze the response, and then formulate a new sub-question for another model based on the new information. This sequential and adaptive nature, enabled by using an LLM as the router and training it with RL, is its primary distinction.
4. Methodology (Core Technology & Implementation)
The core of Router-R1 is an LLM (the "policy LLM") trained via reinforcement learning to decide on a sequence of actions: either think internally or route a query to an external LLM from a pool.
- Principles: The central idea is to model LLM coordination as a sequential task. At each step, the policy LLM observes the current state (original question + all previous interactions) and decides what to do next. The goal is to learn a policy that generates a high-quality final answer efficiently.
- Reinforcement Learning Formulation: The training objective is to find an optimal policy that maximizes the expected reward. The paper uses a standard regularized policy optimization objective:

  $$\max_{\pi_\theta}\; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x;\, \mathcal{M})}\big[\, r_\phi(x, y) \,\big] \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\big[\, \pi_\theta(y \mid x;\, \mathcal{M}) \,\|\, \pi_{\mathrm{ref}}(y \mid x;\, \mathcal{M}) \,\big]$$

- $\pi_\theta$: The policy LLM being trained (e.g., Qwen2.5-3B-Instruct).
- $\pi_{\mathrm{ref}}$: A reference LLM, which is a stable copy of the policy model. It is used to prevent the policy from changing too drastically during training, which aids stability.
- $x$: The input query, drawn from the training dataset $\mathcal{D}$.
- $y$: The full generated output sequence from the policy LLM, including all reasoning blocks and interactions with external LLMs.
- $\mathcal{M}$: The pool of available candidate LLMs that can be called (e.g., LLaMA-3.1-70B, Mixtral-8x22B).
- $r_\phi(x, y)$: The reward function that scores the quality of the generated output $y$.
- $\beta\, \mathbb{D}_{\mathrm{KL}}[\pi_\theta \,\|\, \pi_{\mathrm{ref}}]$: A KL-divergence penalty term. It measures how much the new policy $\pi_\theta$ has diverged from the reference policy $\pi_{\mathrm{ref}}$. The coefficient $\beta$ controls the strength of this penalty. Its purpose is to ensure stable learning by preventing the model from making overly large updates.
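As a concrete reading of this objective, the following minimal sketch estimates the KL-regularized objective from a single sampled rollout. It is an illustration only: the paper optimizes this objective with a standard policy-gradient algorithm, whose exact loss, clipping, and batching differ.

```python
def kl_regularized_objective(logp_policy, logp_ref, reward, beta=1e-3):
    """Single-rollout estimate of E[r(x, y)] - beta * KL(pi_theta || pi_ref).

    logp_policy : summed token log-probs of the rollout y under pi_theta
    logp_ref    : summed token log-probs of y under the frozen reference pi_ref
    reward      : scalar reward r(x, y) for the full rollout
    beta        : KL penalty coefficient (value here is an arbitrary example)
    """
    # Crude one-sample KL estimate: log pi_theta(y|x) - log pi_ref(y|x).
    kl_sample = logp_policy - logp_ref
    return reward - beta * kl_sample
```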
- Reward Curation: The reward function $r_\phi(x, y)$ is a crucial component that guides the learning process. It is a weighted sum of three simple, rule-based rewards.
- Format Reward: This reward ensures the LLM generates output in the correct, parsable format. It acts as a strong structural constraint. An incorrect format receives a large penalty, effectively forcing the model to learn the required syntax (the think, route, information, and answer tags). The appendix specifies rules such as: all tags must be properly closed, the response must start with a think block and end with an answer block, and each route call must have a corresponding information block.
- Final Outcome Reward: This measures the correctness of the final answer. The paper uses Exact Match (EM): the predicted answer extracted from the answer tag is compared against the ground-truth answer, and the reward is 1 if they are identical and 0 otherwise.
- Cost Reward: This novel component encourages cost-efficiency by penalizing the use of large, expensive models. The reward is inversely proportional to the cost of the routed calls, where the cost of each call is roughly $f(N_{\mathrm{params}}) \cdot T_{\mathrm{out}}$:
- $N_{\mathrm{params}}$: The number of parameters of the selected candidate LLM (a proxy for its cost).
- $T_{\mathrm{out}}$: The number of tokens generated by the candidate LLM.
- $f(\cdot)$: A function mapping model size to a per-token cost (e.g., based on API pricing). In practice, this raw cost is normalized using a sliding window of recent costs and then inverted, so that lower costs yield higher rewards (closer to 1) and higher costs yield lower rewards (closer to 0).
- Hierarchical Reward: The paper mentions a crucial implementation detail: the rewards are applied hierarchically. If the format reward is -1, the other two rewards are set to 0. This prioritizes learning the correct structure above all else, which stabilizes training.
- Cost-Performance Trade-off: The hyperparameter $\alpha$ in the overall reward formula controls the balance. If $\alpha = 0$, the model only cares about getting the right answer. As $\alpha$ increases, the model is incentivized more strongly to use cheaper LLMs, even at the risk of slightly lower accuracy.
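The sketch below shows one way these three components could be combined hierarchically. The helper functions, penalty constants, window size, and the exact weighting by $\alpha$ are assumptions for illustration, not the paper's verbatim implementation; only the overall structure follows the description above.

```python
import re

def is_well_formed(text):
    # Placeholder structural check (assumption): at least one think block and
    # exactly one answer block. The paper's full rule set is stricter.
    return bool(re.search(r"<think>.*?</think>", text, re.DOTALL)) and \
           len(re.findall(r"<answer>.*?</answer>", text, re.DOTALL)) == 1

def normalize(ans):
    # Minimal answer normalization for exact match.
    return " ".join(ans.lower().split())

def per_token_price(n_params_b):
    # Hypothetical mapping from model size (billions of params) to $/token.
    return 1e-6 * n_params_b

def compute_reward(output_text, pred_answer, gold_answer, calls,
                   recent_costs, alpha=0.5):
    """Hierarchical rule-based reward sketch (structure per the description above)."""
    # 1) Format reward: a malformed output gets the penalty and nothing else.
    if not is_well_formed(output_text):
        return -1.0

    # 2) Final outcome reward: exact match on the normalized answer string.
    r_outcome = float(normalize(pred_answer) == normalize(gold_answer))

    # 3) Cost reward: per-call cost = per-token price x generated tokens,
    #    normalized by a sliding window of recent costs and inverted so that
    #    cheaper trajectories score closer to 1.
    raw_cost = sum(per_token_price(c["n_params_b"]) * c["n_tokens"] for c in calls)
    window = (recent_costs + [raw_cost])[-100:]
    r_cost = 1.0 - raw_cost / (max(window) + 1e-9)

    # alpha trades correctness against cost; alpha = 0 ignores cost entirely.
    return (1.0 - alpha) * r_outcome + alpha * r_cost
```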
- Steps & Procedures (Multi-Round Interaction):
Figure 1: Schematic of the Router-R1 architecture comparing two routing paradigms. Panel (a) shows single-round routing, where the router assigns the query to a single LLM that returns the answer. Panel (b) shows Router-R1's multi-round routing, which decomposes the query into sub-queries, calls multiple LLMs for information, and iterates between internal reasoning and external LLM interaction to produce a more accurate final answer.
As shown in Figure 1, the process is as follows:
- Input: The policy LLM receives the initial question and a list of available candidate LLMs with their descriptions.
- Think: The LLM first generates text within think tags to reason about the problem, assess what information is needed, and plan its next action.
- Route (Optional): If it decides external knowledge is needed, it generates a routing tag that specifies which LLM from the pool to call and what sub-question to ask.
- Execution & Integration: The system detects the routing tag, calls the specified LLM API, and feeds the response back to the policy LLM inside information tags.
- Iteration: The policy LLM now has new information in its context. It can repeat steps 2-4, reasoning further and potentially calling other LLMs. This can happen for a maximum of 4 routing steps.
- Answer: Once the LLM believes it has enough information, it generates the final answer within answer tags, which concludes the sequence.
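A minimal control loop for this interaction might look like the sketch below. The exact tag markup, the `policy_llm`/`call_llm` placeholders, and the "model_name: sub-question" routing format are assumptions that mirror the description above rather than the released code.

```python
import re

ROUTE_RE = re.compile(r"<route>(.*?)</route>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)

def run_router(question, descriptors, policy_llm, call_llm, max_routes=4):
    """Multi-round think/route/answer loop.

    policy_llm(context) -> text generated by the router until it emits either
        a route request or a final answer (placeholder for the trained model).
    call_llm(model_name, sub_question) -> response from a candidate LLM in the
        pool (placeholder for the actual API call).
    """
    context = f"Question: {question}\nCandidate LLMs:\n{descriptors}\n"
    for _ in range(max_routes + 1):
        generation = policy_llm(context)          # includes <think> ... blocks
        context += generation

        answer = ANSWER_RE.search(generation)
        if answer:                                # router decided to stop
            return answer.group(1).strip()

        route = ROUTE_RE.search(generation)
        if route:                                 # assumed "model_name: sub-question"
            model_name, sub_q = route.group(1).split(":", 1)
            info = call_llm(model_name.strip(), sub_q.strip())
            context += f"<information>{info}</information>\n"
    return None  # no answer produced within the routing budget
```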
5. Experimental Setup
- Datasets: The evaluation was conducted on seven diverse question-answering (QA) datasets:
- General QA (single-hop): Natural Questions (NQ), TriviaQA, PopQA. These typically require retrieving a single fact.
- Multi-Hop QA: HotpotQA (HpQA), 2WikiMultiHopQA (2wiki), Musique, Bamboogle. These require connecting multiple pieces of information to derive the answer.
- Training: The model was trained on a mix of 7k samples from NQ and 7k from HotpotQA. This means NQ and HotpotQA are "in-domain" datasets for evaluation, while the other five are "out-of-domain" to test generalization.
- Evaluation Metrics:
- Exact Match (EM):
- Conceptual Definition: This metric measures the percentage of predictions that match the ground truth answer exactly, after normalization (e.g., removing articles, punctuation, and converting to lowercase). It is a strict measure of accuracy.
- Mathematical Formula:

  $$\mathrm{EM} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\big[\, \mathrm{norm}(\hat{a}_i) = \mathrm{norm}(a_i) \,\big]$$

- Symbol Explanation:
- $N$: The total number of questions in the test set.
- $\hat{a}_i$: The predicted answer for the $i$-th question.
- $a_i$: The ground truth answer for the $i$-th question.
- $\mathrm{norm}(\cdot)$: A function that standardizes the text.
- $\mathbb{1}[\cdot]$: An indicator function that is 1 if the condition inside is true, and 0 otherwise.
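For reference, a common implementation of normalized exact match looks like this. It is a generic QA-style EM sketch, not necessarily the paper's exact normalization rules.

```python
import re
import string

def normalize(text):
    """Lowercase, strip articles, punctuation, and extra whitespace."""
    text = text.lower()
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())

def exact_match(predictions, references):
    """Fraction of predictions that equal the reference after normalization."""
    hits = sum(int(normalize(p) == normalize(r))
               for p, r in zip(predictions, references))
    return hits / len(references)
```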
- F1-Score:
- Conceptual Definition: This metric is a more lenient measure than EM. It treats the prediction and ground truth as bags of words and computes the harmonic mean of precision and recall. It is useful when the predicted answer contains the correct information but is not an exact string match.
- Mathematical Formula:

  $$F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}, \quad \text{where} \quad \mathrm{Precision} = \frac{|G \cap P|}{|P|}, \quad \mathrm{Recall} = \frac{|G \cap P|}{|G|}$$

- Symbol Explanation:
- $G$: The set of unique words (tokens) in the ground truth answer.
- $P$: The set of unique words (tokens) in the predicted answer.
- $|\cdot|$: The size of the set.
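A matching token-level F1 implementation, reusing the `normalize()` helper from the EM sketch above, might look like this; it follows the standard QA-style token-overlap F1 rather than the paper's exact code.

```python
from collections import Counter

def token_f1(prediction, reference):
    """Token-level F1 between a predicted and a gold answer string."""
    pred_tokens = normalize(prediction).split()  # normalize() from the EM sketch
    gold_tokens = normalize(reference).split()
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)

    # Bag-of-words overlap between prediction and reference.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```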
- Baselines: Router-R1 was compared against two groups of baselines:
- Basic Baselines:
- Direct: Directly prompting the base LLM.
- CoT: Using Chain-of-Thought prompting.
- SFT: Supervised fine-tuning the base LLM on the QA task.
- RAG: Retrieval-Augmented Generation using a Wikipedia-based retriever.
- Search-R1: The most similar baseline, which uses RL to interact with a search engine.
- Query-based LLM Routers:
- Prompt LLM: Prompting the base LLM to select a candidate.
- Largest LLM: A simple strategy of always picking the biggest (and presumably best) model.
- KNN Router, MLP Router, BERT Router: Simple router models that use different techniques to map queries to LLMs.
- RouterDC, GraphRouter: More advanced, state-of-the-art single-round routers.
- Prompt LLM*, KNN Router*: Enhanced versions that first decompose the question into sub-queries and then route each one. This simulates a multi-step process but without the adaptive reasoning of Router-R1.
6. Results & Analysis
- Core Results: The main results are presented in Table 1. As no image was provided, the table is transcribed below.
Manual Transcription of Table 1: Experimental results on seven QA datasets w.r.t. Exact Match.
| Methods | NQ† | TriviaQA | PopQA | HpQA† | 2wiki | Musique | Bamb | Avg. |
|---|---|---|---|---|---|---|---|---|
| Qwen2.5-3B-Instruct | | | | | | | | |
| Direct | 0.092 | 0.260 | 0.122 | 0.140 | 0.266 | 0.026 | 0.040 | 0.135 |
| CoT | 0.126 | 0.358 | 0.160 | 0.168 | 0.208 | 0.046 | 0.224 | 0.184 |
| SFT | 0.212 | 0.400 | 0.160 | 0.198 | 0.256 | 0.052 | 0.112 | 0.199 |
| RAG | 0.298 | 0.540 | 0.366 | 0.216 | 0.146 | 0.078 | 0.224 | 0.267 |
| Search-R1 | 0.328 | 0.510 | 0.324 | 0.236 | 0.278 | 0.090 | 0.272 | 0.291 |
| Prompt LLM | 0.300 | 0.580 | 0.340 | 0.268 | 0.262 | 0.108 | 0.448 | 0.329 |
| Largest LLM | 0.296 | 0.578 | 0.354 | 0.278 | 0.274 | 0.104 | 0.480 | 0.338 |
| ... (other routers) | ... | ... | ... | ... | ... | ... | ... | ... |
| Router-R1-Qwen | 0.388 | 0.706 | 0.384 | 0.352 | 0.434 | 0.138 | 0.512 | 0.416 |
| Llama-3.2-3B-Instruct | | | | | | | | |
| Direct | 0.202 | 0.328 | 0.176 | 0.144 | 0.134 | 0.018 | 0.048 | 0.150 |
| ... (other baselines) | ... | ... | ... | ... | ... | ... | ... | ... |
| Router-R1-Llama | 0.416 | 0.680 | 0.432 | 0.322 | 0.368 | 0.128 | 0.520 | 0.409 |

Note: NQ, TriviaQA, and PopQA are General QA datasets; HpQA, 2wiki, Musique, and Bamb are Multi-Hop QA datasets. For brevity, not all baselines are transcribed, but the main conclusions are based on the full table in the paper.
- Key Findings:
- Dominant Performance: Router-R1, using either Qwen or Llama as the base model, consistently achieves the highest Exact Match score across all seven datasets. The average scores (0.416 for Qwen, 0.409 for Llama) are significantly higher than all baselines.
- Superiority over Single-Round Routers: It clearly outperforms all query-based LLM routers, including advanced ones like GraphRouter and RouterDC. This demonstrates the advantage of its multi-round, reasoning-interleaved strategy over single-shot decisions. Even the enhanced Prompt LLM* and KNN Router* baselines, which try to mimic multi-step behavior, fall short, indicating that the adaptive reasoning of Router-R1 is crucial.
- Strong Generalization: Despite being trained only on NQ and HotpotQA data, Router-R1 performs exceptionally well on the five unseen, out-of-domain datasets (e.g., TriviaQA, Musique), showcasing its ability to learn generalizable routing strategies.
- Analysis of Cost Rewards:
Figure 3: Effect of the cost reward on NQ, PopQA, HotpotQA (HpQA), and 2WikiMultiHopQA (2wiki). The left chart shows that as the cost coefficient $\alpha$ increases from 0.6 to 0.9, Exact Match (EM) performance generally declines; the right chart shows that the cost reward rises as $\alpha$ increases, meaning a higher $\alpha$ trades some performance for greater cost savings.
Figure 3 illustrates the trade-off between performance (EM) and cost.
- Left Chart (Performance): As the cost coefficient $\alpha$ increases from 0.6 to 0.9, the Exact Match performance generally decreases across all four datasets. This is expected, as a higher $\alpha$ makes the model prioritize cost savings over correctness.
- Right Chart (Cost Reward): Conversely, as $\alpha$ increases, the Cost Reward rises. This indicates that the policy is successfully learning to make cheaper choices (e.g., calling smaller models or making fewer calls) to maximize this component of the reward.
- Conclusion: This analysis confirms that the cost reward mechanism is effective and provides a controllable knob to tune the balance between performance and computational expense, enabling an emergent adaptive routing strategy.
- Generalization Capability to Unseen Candidate LLMs: Table 2 shows the results when two new, unseen LLMs are added to the routing pool at inference time without retraining.
Manual Transcription of Table 2: Generalization capability w.r.t. Exact Match and F1-Score.
| Methods | NQ† EM | NQ† F1 | TriviaQA EM | TriviaQA F1 | PopQA EM | PopQA F1 | HpQA† EM | HpQA† F1 | Avg. EM | Avg. F1 |
|---|---|---|---|---|---|---|---|---|---|---|
| Router-R1-Qwen | 0.388 | 0.484 | 0.706 | 0.772 | 0.384 | 0.447 | 0.352 | 0.449 | 0.458 | 0.538 |
| Router-R1-Qwen‡ | 0.382 | 0.493 | 0.722 | 0.778 | 0.402 | 0.464 | 0.346 | 0.459 | 0.463 | 0.549 |

Note: ‡ indicates the routing pool was extended with unseen LLMs.
- Key Finding: When the routing pool is extended (Router-R1-Qwen‡), the performance of Router-R1 not only remains robust but actually improves on average (EM: 0.458 -> 0.463; F1: 0.538 -> 0.549). This shows that Router-R1 can effectively leverage the capabilities of new models by interpreting their text descriptions at runtime, without needing to be retrained. In contrast, other baselines showed limited or inconsistent gains.
- Discussion:
Figure 4: LLM API call counts and training convergence of Router-R1. (a) The average number of LLM API calls per query ranges from 1.01 to 1.36 across benchmarks. (b) and (c) plot the training reward and policy entropy curves, comparing the "w/ format reward" and "w/o format reward" settings: without the format reward, training collapses after roughly 150 steps, whereas with it training is stable and converges.
- LLM API Call Count (Figure 4a): The bar chart shows that Router-R1 makes more API calls on average for the more complex multi-hop QA datasets (HotpotQA, 2wiki, Musique) than for the simpler general QA datasets (NQ, TriviaQA). This is strong evidence of adaptive behavior: the model correctly assesses that complex tasks require more external information and adjusts its strategy accordingly.
- Convergence Analysis (Figure 4b, 4c): The training curves show that Router-R1 converges quickly (within ~100 steps). More importantly, they highlight the critical role of the format reward. The training run "w/o format reward" becomes unstable and crashes, while the run "w/ format reward" is smooth and stable. This is because the format reward provides a strong, consistent signal that prevents the model from generating nonsensical, unparsable outputs.
7. Conclusion & Reflections
- Conclusion Summary: The paper successfully introduces Router-R1, an RL-based framework that redefines LLM routing as a multi-round, sequential decision process. By using an LLM as the router, it interleaves internal reasoning with external model calls, enabling it to solve complex tasks by coordinating the strengths of multiple LLMs. The simple yet powerful rule-based reward function, including a novel cost component, allows it to achieve state-of-the-art performance while managing a flexible trade-off between accuracy and cost. Its ability to generalize to unseen datasets and unseen LLMs makes it a robust and practical solution for real-world LLM orchestration.
- Limitations & Future Work: The authors acknowledge several limitations:
- Task Scope: The evaluation is confined to QA. Its effectiveness on other tasks like dialogue or code generation is unknown.
- Reward Simplicity: While effective, the rule-based reward might not capture nuances like factual consistency or creativity. Future work could explore using learned reward models or human feedback.
- Inference Latency: The multi-round nature inherently adds latency due to sequential API calls, which may be unsuitable for real-time applications.
- Dependence on Model Descriptors: Generalization to unseen models relies on simple text descriptions, which may not fully capture a model's true capabilities or weaknesses.
- Personal Insights & Critique:
- Novelty and Impact: The core idea of treating routing as a sequential RL problem and using an LLM as the reasoning agent is highly innovative and powerful. It moves beyond simple classification-style routing to a more general framework for agentic behavior and tool use, where the "tools" are other LLMs. This paradigm has significant potential for building more sophisticated and capable AI systems.
- Practicality: The ability to generalize to new models without retraining is a standout feature, addressing a major pain point in the fast-moving LLM landscape. The cost-performance trade-off knob ($\alpha$) is also a very practical feature for production systems.
- Potential Weaknesses: The framework's performance is likely bottlenecked by the capability of the base "router" LLM. A smaller, less capable router may struggle with the complex reasoning needed to orchestrate more powerful models effectively. The added latency is also a significant practical hurdle that needs to be addressed, perhaps through parallelization of independent sub-queries.
- Future Directions: This work opens up exciting avenues. One could explore more complex aggregation strategies beyond simple context integration. The router could learn not just which model to call, but also how to prompt it differently based on the task. Finally, applying this framework to a heterogeneous pool of tools (e.g., LLMs, search engines, code interpreters, databases) would be a natural and powerful extension.