
RadialRouter: Structured Representation for Efficient and Robust Large Language Models Routing

Published: 06/04/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

RadialRouter uses a lightweight radial Transformer to model query-LLM relations, optimized with KL divergence and contrastive loss, achieving up to 9.2% improvement over existing routing methods on RouterBench.

Abstract

The rapid advancements in large language models (LLMs) have led to the emergence of routing techniques, which aim to efficiently select the optimal LLM from diverse candidates to tackle specific tasks, optimizing performance while reducing costs. Current LLM routing methods are limited in effectiveness due to insufficient exploration of the intrinsic connection between user queries and the characteristics of LLMs. To address this issue, in this paper, we present RadialRouter, a novel framework for LLM routing which employs a lightweight Transformer-based backbone with a radial structure named RadialFormer to articulate the query-LLMs relationship. The optimal LLM selection is performed based on the final states of RadialFormer. The pipeline is further refined by an objective function that combines Kullback-Leibler divergence with the query-query contrastive loss to enhance robustness. Experimental results on RouterBench show that RadialRouter significantly outperforms existing routing methods by 9.2% and 5.8% in the Balance and Cost First scenarios, respectively. Additionally, its adaptability toward different performance-cost trade-offs and the dynamic LLM pool demonstrates practical application potential.

In-depth Reading


1. Bibliographic Information

  • Title: RadialRouter: Structured Representation for Efficient and Robust Large Language Models Routing
  • Authors: Ruihan Jin, Pengpeng Shao, Zhengqi Wen, Jinyang Wu, Mingkuan Feng, Shuai Zhang, and Jianhua Tao.
  • Affiliations: The authors are from the Department of Automation at Tsinghua University and the Beijing National Research Center for Information Science and Technology, both highly respected institutions in China.
  • Journal/Conference: The paper is available on arXiv, a preprint server for academic papers. This means it has not yet undergone formal peer review for publication in a conference or journal, but it represents cutting-edge research. The arXiv identifier is 2506.03880.
  • Publication Year: The preprint was submitted in June 2025.
  • Abstract: The paper addresses the challenge of efficiently using multiple Large Language Models (LLMs). It introduces RadialRouter, a new method to select the best LLM for a specific user query. The core of RadialRouter is a novel, lightweight Transformer architecture called RadialFormer, which models the relationship between the query and the available LLMs. The system is trained using a combination of Kullback-Leibler (KL) divergence and a query-query contrastive loss to improve its accuracy and robustness. On the RouterBench benchmark, RadialRouter significantly outperforms existing methods, demonstrating its practical potential for optimizing performance and cost.
  • Original Source Link: https://arxiv.org/abs/2506.03880

2. Executive Summary

  • Background & Motivation (Why):

    • The Problem: Using a collection of powerful LLMs (an "LLM ensemble") can produce excellent results, but querying all of them for every task is extremely expensive and slow. The solution is LLM Routing: an intelligent system that, for any given user query, selects the single most suitable LLM from the pool. This aims to achieve the best possible performance at the lowest possible cost.
    • The Gap: Existing routing methods are often too simplistic. They might only choose between a "small" and "large" model or use basic similarity matching. The paper argues that these methods fail to capture the deep, intrinsic connection between the nuances of a user's query and the specific strengths and weaknesses of each LLM. This leads to suboptimal routing decisions.
    • The Innovation: RadialRouter introduces a structured approach. Instead of treating the query and LLMs as separate entities, it creates a unified representation where the query and all candidate LLMs "interact" within a specialized neural network. This allows the router to make a more informed decision.
  • Main Contributions / Findings (What):

    1. A Novel Framework (RadialRouter): A complete system for dynamically routing queries to the most appropriate LLM, designed for efficiency and robustness.
    2. A Lightweight Backbone (RadialFormer): A custom Transformer-based architecture with a radial (star-shaped) structure. It models the interaction between a central "query node" and several "LLM nodes," capturing their relationships efficiently.
    3. An Advanced Optimization Strategy: The model is trained with a dual-objective loss function:
      • Kullback-Leibler (KL) Divergence Loss: To align the router's predictions with a target distribution of LLM suitability.
      • Query-Query Contrastive Loss: To make the model's internal representations more robust by grouping semantically similar queries.
    4. State-of-the-Art Performance: Experimental results on the RouterBench benchmark show that RadialRouter significantly outperforms previous methods, especially in scenarios where balancing performance and cost is critical. It also shows promise in adapting to a changing pool of available LLMs.

3. Prerequisite Knowledge & Related Work

This section explains the foundational concepts needed to understand the paper and how it fits into the broader field.

  • Foundational Concepts:

    • Large Language Models (LLMs): These are massive neural networks (e.g., GPT-4, Llama 2) trained on vast amounts of text data. They can understand and generate human-like language for various tasks like question answering, summarization, and coding. Their size makes them computationally expensive to run.
    • LLM Ensemble: The practice of using multiple different LLMs together. Just like a team of human experts with different specialties, an LLM ensemble can tackle a wider range of problems more effectively than any single model. However, naively querying every model is impractical.
    • LLM Routing: The core topic of this paper. It's a strategy to manage an LLM ensemble by creating a "router" or "dispatcher." This router analyzes an incoming query and directs it to the one LLM in the ensemble that is most likely to answer it well and cost-effectively.
    • Transformer Architecture: The neural network design that powers most modern LLMs. Its key mechanism is self-attention, which allows the model to weigh the importance of different words in a sequence when processing information. RadialRouter's RadialFormer is a specialized, lightweight variant of the Transformer.
    • Contrastive Learning: A machine learning technique used to learn good data representations. The basic idea is to train a model to pull embeddings of "similar" data points (positive pairs) closer together in a high-dimensional space, while pushing embeddings of "dissimilar" data points (negative pairs) farther apart.
    • Kullback-Leibler (KL) Divergence: A measure from information theory that quantifies how much one probability distribution differs from a second, reference probability distribution. A KL divergence of zero means the two distributions are identical. In this paper, it's used to train the router to predict an LLM probability distribution that matches a "ground truth" distribution based on known performance and cost.
  • Previous Works and Their Limitations: The paper categorizes previous routing methods to highlight their shortcomings, which RadialRouter aims to solve.

    Figure 1: Paradigm comparison between different LLM routing methods. Existing methods lack modeling of the interrelation between the query and the LLMs, while the proposed RadialRouter unifies the routing process through a structured query-LLM representation. The panels show LLMs-pair routing (top-left), LLM cascade (top-right), similarity-based routing (bottom-left), and the proposed RadialRouter (bottom-right).

    As shown in Figure 1, prior methods include:

    • LLMs-pair router: The simplest form, where a router decides between just two models, typically a cheap/weak one and an expensive/strong one (e.g., HybridLLM). This is not scalable to a large pool of LLMs.
    • LLM cascade: Models are queried sequentially, starting with the cheapest. If a model's output is deemed "good enough," the process stops. Otherwise, the next more expensive model is tried (e.g., FrugalGPT). This can be slow due to multiple sequential calls.
    • Similarity-based router: The query is converted into an embedding (a vector of numbers), and this embedding is compared to pre-defined embeddings for each LLM. The LLM with the most similar embedding is chosen (e.g., RouterDC). The paper argues this doesn't fully capture the complex query-LLM relationship.
  • Differentiation: RadialRouter (bottom-right in Figure 1) is fundamentally different. It doesn't treat the query and LLMs as separate items to be matched. Instead, it creates a dynamic, interactive system where the query representation and LLM representations are jointly updated. This structured representation is the key innovation, allowing for a more nuanced and holistic routing decision.

4. Methodology (Core Technology & Implementation)

This section details the inner workings of RadialRouter.

Figure 2: Overview of the RadialRouter methodology, showing the full pipeline from query encoding, through feature initialization and the update of the radial RadialFormer structure, to selecting the optimal LLM from predicted scores via an MLP, trained with the combined $\mathcal{L}_{\mathrm{q\text{-}q}}$ and $\mathcal{L}_{\mathrm{KL}}$ losses.

Figure 2 provides a complete overview of the RadialRouter framework, which can be broken down into three main steps:

Step 1: Feature Initialization

The process starts by creating initial representations for the query and the candidate LLMs.

  • A user query $\mathbf{x}$ is fed into a pre-trained language encoder (such as DeBERTa) to produce a query embedding $\mathbf{q} = \mathcal{E}(\mathbf{x})$.
  • The system maintains a set of $n$ learnable embeddings $\{\mathbf{m}_1, \ldots, \mathbf{m}_n\}$, one for each of the $n$ LLMs in the pool. These embeddings are parameters learned during training.
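A minimal sketch of this initialization step, assuming PyTorch; the class and argument names (e.g., FeatureInit) are illustrative, not the authors' code:

```python
# Minimal sketch of Step 1 (feature initialization), assuming PyTorch.
# `encoder` stands in for the pretrained language encoder E (e.g., DeBERTa);
# names are illustrative, not taken from the paper's implementation.
import torch
import torch.nn as nn

class FeatureInit(nn.Module):
    def __init__(self, encoder: nn.Module, n_llms: int, d_model: int):
        super().__init__()
        self.encoder = encoder  # query encoder E, returning one vector per query
        # One learnable embedding m_i per candidate LLM, trained end-to-end.
        self.llm_embeddings = nn.Parameter(torch.randn(n_llms, d_model) * 0.02)

    def forward(self, query_inputs):
        q = self.encoder(query_inputs)      # q = E(x), shape (d_model,)
        return q, self.llm_embeddings       # relay init and satellite inits
```

In practice the encoder would be a pretrained model with a pooling step that yields one vector per query; only its output dimension needs to match the RadialFormer dimension.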

Step 2: Update of RadialFormer

This is the core of the framework, where the RadialFormer architecture processes the initial embeddings.

  • Principles and Architecture: RadialFormer is inspired by the Star-Transformer but is tailored for the routing task. It has a radial (star-like) structure consisting of:

    • One Relay Node ($\mathbf{r}$): Represents the user query. It is initialized with the query embedding $\mathbf{q}$.

    • $n$ Satellite Nodes ($\mathbf{s}_i$): Each represents one of the $n$ candidate LLMs. They are initialized with the learnable LLM embeddings $\mathbf{m}_i$.

      (Schematic of the RadialFormer structure: a central relay node $r$ interacts with the surrounding satellite nodes $s_1, s_2, \ldots, s_n$ via multi-head attention, and this structure is stacked for $T$ layers.)

    As shown in the schematic above, information flows radially: each satellite node communicates only with the central relay node, not with other satellite nodes. This design is much more computationally efficient than a standard Transformer, in which every node attends to every other node: the complexity is reduced from $O(l^2 d)$ to $O(ld)$, where $l$ is the sequence length and $d$ is the embedding dimension.

  • Update Procedure (Algorithm 1): The nodes are updated iteratively over $T$ layers. In each layer $t$ (a simplified code sketch follows this list):

    1. Satellite Node Update: Each satellite node $\mathbf{s}_i^t$ is updated based on its previous state $\mathbf{s}_i^{t-1}$, its initial LLM embedding $\mathbf{m}_i$, and the current state of the relay node $\mathbf{r}^{t-1}$. This allows each LLM representation to be refined with the query's information:
       $$\mathbf{C}_i^t = [\mathbf{s}_i^{t-1}; \mathbf{m}_i; \mathbf{r}^{t-1}], \qquad \mathbf{s}_i^t = \mathrm{MHAttn}(\mathbf{s}_i^{t-1}, \mathbf{C}_i^t)$$
       Here $\mathrm{MHAttn}$ is the standard multi-head attention mechanism, and the context $\mathbf{C}_i^t$ is the concatenation of the three vectors.
    2. Relay Node Update: The relay node $\mathbf{r}^t$ is then updated by gathering information from all the newly updated satellite nodes $\mathbf{S}^t$ and its own previous state $\mathbf{r}^{t-1}$, so the query representation is informed by how it relates to all the candidate LLMs:
       $$\mathbf{r}^t = \mathrm{MHAttn}(\mathbf{r}^{t-1}, [\mathbf{r}^{t-1}; \mathbf{S}^t])$$
       After $T$ layers, the final satellite states $\mathbf{S}^T$ contain rich, context-aware information about the suitability of each LLM for the given query.
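A simplified PyTorch sketch of this update loop is shown below. It assumes `nn.MultiheadAttention` stands in for MHAttn and omits layer normalization, feed-forward sublayers, and dropout, so it illustrates the information flow rather than reproducing the authors' implementation:

```python
# Sketch of the RadialFormer update procedure (T layers), assuming PyTorch.
# nn.MultiheadAttention stands in for MHAttn; normalization/dropout are omitted.
import torch
import torch.nn as nn

class RadialFormer(nn.Module):
    def __init__(self, d_model: int, n_heads: int = 4, n_layers: int = 2):
        super().__init__()
        self.n_layers = n_layers
        self.sat_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.relay_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, q: torch.Tensor, m: torch.Tensor):
        """q: (d,) query embedding; m: (n, d) learnable LLM embeddings."""
        r = q.unsqueeze(0)        # relay node r^0, shape (1, d)
        s = m.clone()             # satellite nodes s^0 = m, shape (n, d)
        for _ in range(self.n_layers):
            # Satellite update: s_i attends over its context [s_i^{t-1}; m_i; r^{t-1}].
            new_s = []
            for i in range(s.size(0)):
                ctx = torch.stack([s[i], m[i], r[0]]).unsqueeze(0)   # (1, 3, d)
                out, _ = self.sat_attn(s[i].view(1, 1, -1), ctx, ctx)
                new_s.append(out.view(-1))
            s = torch.stack(new_s)                                   # S^t, (n, d)
            # Relay update: r attends over [r^{t-1}; S^t].
            ctx = torch.cat([r, s], dim=0).unsqueeze(0)              # (1, n+1, d)
            out, _ = self.relay_attn(r.view(1, 1, -1), ctx, ctx)
            r = out.view(1, -1)                                      # r^t, (1, d)
        return s, r   # final satellite states S^T and relay state r^T
```

The per-satellite loop makes the radial pattern explicit; a batched implementation would attend over all satellite contexts at once, but the cost still grows linearly in the number of LLM nodes rather than quadratically.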

Step 3: Optimal LLM Selection and Optimization

  • LLM Selection: The final satellite states $\{\mathbf{s}_1^T, \ldots, \mathbf{s}_n^T\}$ are passed through a simple multi-layer perceptron (MLP) to produce a score for each LLM. The scores are converted into a probability distribution $p$ with a softmax, and the LLM with the highest probability is selected as the optimal choice:
    $$\hat{i} = \arg\max_i \, p_i$$

  • Optimization Objective: The model is trained by minimizing a combined loss function.

    1. Kullback-Leibler Divergence Loss ($\mathcal{L}_{\mathrm{KL}}$): The primary loss for the routing task. For each query, a "ground truth" probability distribution $q$ over the LLMs is computed from their pre-computed performance and cost scores, and the router's predicted distribution $p$ is pushed toward it:
       $$\mathcal{L}_{\mathrm{KL}}(\mathbf{x}; \boldsymbol{\theta}) = D_{\mathrm{KL}}(p \,\|\, q) = \sum_{i=1}^{n} p_i \log \frac{p_i}{q_i}$$
    2. Query-Query Contrastive Loss ($\mathcal{L}_{\mathrm{q\text{-}q}}$): This loss enhances the robustness of the query encoder by encouraging similar embeddings for semantically related queries (e.g., from the same task domain):
       $$\mathcal{L}_{\mathrm{q\text{-}q}}(\mathbf{x}; \boldsymbol{\theta}) = -\log \frac{e^{\mathrm{sim}\langle \mathcal{E}(\mathbf{x}), \mathcal{E}(\mathbf{x}^{+}) \rangle}}{e^{\mathrm{sim}\langle \mathcal{E}(\mathbf{x}), \mathcal{E}(\mathbf{x}^{+}) \rangle} + \sum_{t} e^{\mathrm{sim}\langle \mathcal{E}(\mathbf{x}), \mathcal{E}(\mathbf{x}_t^{-}) \rangle}}$$
       Here $\mathbf{x}^{+}$ is a positive sample (a similar query from the same cluster), the $\mathbf{x}_t^{-}$ are negative samples (dissimilar queries from other clusters), and $\mathrm{sim}\langle \cdot, \cdot \rangle$ denotes cosine similarity.
    3. Final Loss: The two losses are combined with a hyperparameter $\lambda$ that balances their contributions:
       $$\boldsymbol{\theta}^{*} = \arg\min_{\boldsymbol{\theta}} \; \mathbb{E}_{\mathbf{x} \sim \mathcal{D}_{\mathrm{train}}} \left[ \mathcal{L}_{\mathrm{KL}}(\mathbf{x}; \boldsymbol{\theta}) + \lambda \, \mathcal{L}_{\mathrm{q\text{-}q}}(\mathbf{x}; \boldsymbol{\theta}) \right]$$
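The selection step and the combined objective can be sketched as follows (PyTorch). The target distribution `q_target` is assumed to be precomputed from the per-LLM performance and cost scores as described above, `pos_emb`/`neg_embs` come from the query clustering used by the contrastive loss, and all names are illustrative rather than the paper's API:

```python
# Sketch of Step 3: MLP scoring, routing distribution, and the combined loss.
# Assumes PyTorch; `mlp` maps each satellite state (d,) to a scalar score.
import torch
import torch.nn.functional as F

def route_and_loss(sat_states, mlp, q_target, q_emb, pos_emb, neg_embs, lam=0.5):
    # Routing distribution p from the final satellite states S^T.
    logits = mlp(sat_states).squeeze(-1)          # (n,)
    p = F.softmax(logits, dim=-1)
    chosen = int(p.argmax())                      # \hat{i} = argmax_i p_i

    # KL divergence between the prediction p and the target distribution q.
    l_kl = torch.sum(p * torch.log(p / q_target))

    # Query-query contrastive loss over cosine similarities.
    sim_pos = F.cosine_similarity(q_emb, pos_emb, dim=-1)               # scalar
    sim_neg = F.cosine_similarity(q_emb.unsqueeze(0), neg_embs, dim=-1) # (k,)
    l_qq = -sim_pos + torch.logsumexp(torch.cat([sim_pos.view(1), sim_neg]), dim=0)

    return chosen, l_kl + lam * l_qq
```

At inference time only the first three lines matter: score the satellite states, apply the softmax, and route the query to the arg-max LLM.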

5. Experimental Setup

  • Datasets: The experiments use RouterBench, a benchmark specifically designed for evaluating LLM routers. It includes queries from 6 datasets across 4 domains:
    • Commonsense Reasoning: Hellaswag, Winogrande, ARC Challenge
    • Knowledge-based language understanding: MMLU
    • Math: GSM8K
    • Coding: MBPP
  • Candidate LLMs: A diverse pool of 11 popular open-source and proprietary models, including GPT-4, Claude-v2, Llama-70B-chat, and Mistral-7B-chat.
  • Evaluation Metrics:
    • Performance: The average accuracy of the responses generated by the routed LLMs. Higher is better.
    • Cost: The average cost in dollars to generate responses. This is based on the pricing of the chosen LLMs. Lower is better.
    • Score: The primary metric for evaluating the trade-off between performance and cost.
      1. Conceptual Definition: The Score is the model's performance minus a penalty for its cost; a hyperparameter $\alpha$ controls how heavily cost is penalized.
      2. Mathematical Formula:
         $$\mathrm{score}_{ij} = \mathrm{performance}_{ij} - \alpha \cdot \mathrm{cost}_{i}$$
      3. Symbol Explanation:
        • $\mathrm{performance}_{ij}$: the accuracy of $\mathrm{LLM}_i$ on query $j$.
        • $\mathrm{cost}_{i}$: the inference cost of $\mathrm{LLM}_i$.
        • $\alpha$: a non-negative coefficient that determines the preference for cost-saving; a higher $\alpha$ means cost matters more. The paper defines three scenarios based on $\alpha$ (a worked numerical example is given after the baselines list below):
          • Performance First: $\alpha = 0$ (cost is ignored).
          • Balance: $\alpha = 0.02$ (a balance between performance and cost).
          • Cost First: $\alpha = 0.1$ (cost is heavily prioritized).
  • Baselines: RadialRouter is compared against several methods:
    • CosineClassifier: A simple baseline that scores candidate LLMs with a cosine-similarity classifier over the query embedding.
    • HybridLLM: A router for a small-large model pair.
    • FrugalGPT: A cascade-based approach.
    • RouterDC: A state-of-the-art method using dual contrastive learning.
    • GraphRouter: A graph-based routing framework.
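As a quick sanity check of the Score metric defined above, here is a worked example using RadialRouter's Cost First entries from Table 1 below (average performance 0.763, average cost 0.476, $\alpha = 0.1$); since the score is linear in performance and cost, the relation carries over to the reported averages:

$$\mathrm{score} = 0.763 - 0.1 \times 0.476 = 0.763 - 0.0476 \approx 0.715,$$

which matches the Score of 0.715 reported for RadialRouter in the Cost First columns.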

6. Results & Analysis

Core Results: Comparison with Baselines

This is a manual transcription of Table 1 from the paper (PF = Performance First, BA = Balance, CF = Cost First).

| Method | PF Perf.↑ | PF Cost↓ | PF Score↑ | BA Perf.↑ | BA Cost↓ | BA Score↑ | CF Perf.↑ | CF Cost↓ | CF Score↑ |
|---|---|---|---|---|---|---|---|---|---|
| Best candidate | 0.813 | 7.185 | 0.813 | 0.709 | 0.562 | 0.698 | 0.704 | 0.439 | 0.660 |
| Random | 0.627 | 1.847 | 0.627 | 0.627 | 1.847 | 0.590 | 0.627 | 1.847 | 0.442 |
| CosineClassifier | 0.662 | 1.448 | 0.662 | 0.584 | 0.189 | 0.580 | 0.566 | 0.162 | 0.549 |
| HybridLLM | 0.801 | 6.869 | 0.801 | 0.791 | 6.612 | 0.659 | 0.517 | 0.107 | 0.506 |
| FrugalGPT | 0.813 | 7.185 | 0.813 | 0.671 | 0.336 | 0.664 | 0.549 | 0.124 | 0.536 |
| RouterDC | 0.815 | 6.768 | 0.815 | 0.716 | 1.313 | 0.690 | 0.718 | 0.418 | 0.676 |
| GraphRouter | 0.813 | 7.185 | 0.813 | 0.713 | 0.987 | 0.693 | 0.709 | 0.500 | 0.659 |
| RadialRouter | 0.816 | 6.759 | 0.816 | 0.781 | 1.179 | 0.757 | 0.763 | 0.476 | 0.715 |
| Oracle | 0.925 | 1.015 | 0.925 | 0.917 | 0.393 | 0.909 | 0.891 | 0.258 | 0.865 |
  • Analysis:
    • In the Performance First scenario, most advanced routers perform similarly, as they all learn to pick the best-performing (and most expensive) model, GPT-4.
    • The superiority of RadialRouter is most evident in the Balance and Cost First scenarios. On the primary Score metric, it outperforms the next best method by 9.2% in the Balance scenario (0.757 vs. GraphRouter's 0.693) and by 5.8% in the Cost First scenario (0.715 vs. RouterDC's 0.676).
    • This demonstrates that RadialRouter's structured approach is particularly effective at navigating the complex trade-offs between performance and cost. It also achieves over 82% of the theoretical Oracle score, indicating a highly effective routing strategy.

Ablation Studies

This is a manual transcription of Table 2 from the paper (Scores under the PF / BA / CF scenarios; "w/o RF" replaces RadialFormer with the listed backbone).

| Setting | PF | BA | CF | Time/ms |
|---|---|---|---|---|
| RadialRouter | 0.816 | 0.757 | 0.715 | 10.7 |
| w/o RF, + Star-Transformer | 0.813 | 0.751 | 0.709 | 13.5 |
| w/o RF, + Transformer | 0.815 | 0.753 | 0.705 | 15.8 |
| w/o RF, + MLP | 0.781 | 0.732 | 0.701 | 4.6 |
| w/o L_KL | 0.548 | 0.442 | 0.017 | - |
| w/o L_q-q | 0.813 | 0.740 | 0.711 | - |
  • Impact of RadialFormer: Replacing RadialFormer with a standard Transformer (+ Transformer), a Star-Transformer (+ Star-Transformer), or a simple MLP (+ MLP) lowers the Score in every scenario. This confirms that the specific radial design of RadialFormer is key to its success. RadialFormer is also faster at inference than the heavier Transformer variants (10.7 ms versus 13.5-15.8 ms).

  • Impact of Loss Functions:

    • Removing the KL divergence loss (w/o L_KL) causes a catastrophic drop in performance, showing it is essential for guiding the router.

    • Removing the query-query contrastive loss (w/o L_q-q) also degrades performance, confirming its role in building a robust query representation. This is visually supported by the t-SNE plot below.

      Figure 3: t-SNE visualization of test query embeddings extracted by the learned language encoder of RadialRouter, with and without the query-query contrastive loss $\mathcal{L}_{\mathrm{q\text{-}q}}$, illustrating how well the learned features cluster.

    Figure 3 shows that without the contrastive loss (a), query embeddings from different tasks are mixed together. With the loss (b), the embeddings form distinct, well-separated clusters, which provides a cleaner signal for the routing mechanism.

Further Analysis

  • Adaptability to Trade-Offs:

    (Figure: Score achieved by different routing methods as the trade-off parameter $\alpha$ varies; RadialRouter maintains the highest score throughout, most notably in the Balance and Cost First regimes.)

    The line plot above shows the Score of different methods as the cost-penalty parameter $\alpha$ increases. RadialRouter (the dark blue line) consistently achieves the highest score across all trade-offs, demonstrating its superior adaptability.

    (Figure: performance-versus-cost comparison of CosineClassifier, HybridLLM, FrugalGPT, RouterDC, GraphRouter, and RadialRouter, with RadialRouter showing the best performance-cost balance.)

    The performance-cost plot above places RadialRouter's curve in the desirable top-left region, meaning it achieves higher performance at a given cost than the baselines.

  • Adaptability to a Dynamic LLM Pool:

    Figure 5: Effects of different numbers of LLMs in the candidate pool; as the number of LLMs grows, both performance and Score rise overall.

    Figure 5 and the accompanying Table 4 show that as more LLMs are added to the pool (from 1 to 11), both the performance and Score of RadialRouter consistently increase, demonstrating the framework's ability to leverage an expanding set of resources.

7. Conclusion & Reflections

  • Conclusion Summary: The paper successfully introduces RadialRouter, a novel and effective framework for LLM routing. Its key strength lies in the RadialFormer architecture, which provides a structured and efficient way to model the complex relationships between a user query and a pool of candidate LLMs. Combined with a principled optimization strategy using both KL divergence and contrastive loss, RadialRouter achieves state-of-the-art results, demonstrating strong adaptability to various performance-cost trade-offs and dynamic LLM pools.

  • Limitations & Future Work: The authors acknowledge two main limitations:

    1. Static LLM Pool: The current model requires full retraining whenever a new LLM is added to the pool. A future direction is to enable "training-free" adaptation to new models, perhaps by learning a general embedding space for LLMs.
    2. Scope: The experiments were limited to English language text tasks. The framework's applicability to multilingual and multimodal (e.g., text-and-image) LLMs remains to be explored.
  • Personal Insights & Critique:

    • Practical Importance: This work tackles a highly practical and urgent problem in the field of applied AI. As more specialized LLMs become available, intelligent routing is not just a "nice-to-have" but a necessity for building cost-effective and performant applications.
    • Elegant Design: The adaptation of the Star-Transformer architecture into RadialFormer is an elegant solution. It directly maps the problem structure (one query vs. many LLMs) onto the model architecture, leading to both efficiency and effectiveness.
    • The "Ground Truth" Challenge: A potential weakness in the overall approach (common to many router models) is the reliance on pre-computed "true scores" to generate the target distribution for the KL loss. In a real-world setting, obtaining these exhaustive ground-truth scores for every possible query and LLM is infeasible. The model's performance in a production environment where these scores must be estimated or approximated is an open question.
    • Future Potential: The idea of learnable LLM embeddings is powerful. If these embeddings could be generalized to capture fundamental properties of LLMs (e.g., their reasoning ability, knowledge scope, creativity), it could pave the way for a universal router that can reason about new, unseen LLMs without retraining, as hinted in the paper's limitations. This would be a major breakthrough for LLM operations (LLMOps).
