LLM4WM: Adapting LLM for Wireless Multi-Tasking
TL;DR Summary
The LLM4WM framework is designed for wireless channel tasks, employing Mixture of Experts and Low-Rank Adaptation for multi-task fine-tuning. It aligns channel data with model features and outperforms existing methods in both full-sample and few-shot evaluations.
Abstract
The wireless channel is fundamental to communication, encompassing numerous tasks collectively referred to as channel-associated tasks. These tasks can leverage joint learning based on channel characteristics to share representations and enhance system design. To capitalize on this advantage, LLM4WM is proposed--a large language model (LLM) multi-task fine-tuning framework specifically tailored for channel-associated tasks. This framework utilizes a Mixture of Experts with Low-Rank Adaptation (MoE-LoRA) approach for multi-task fine-tuning, enabling the transfer of the pre-trained LLM's general knowledge to these tasks. Given the unique characteristics of wireless channel data, preprocessing modules, adapter modules, and multi-task output layers are designed to align the channel data with the LLM's semantic feature space. Experiments on a channel-associated multi-task dataset demonstrate that LLM4WM outperforms existing methodologies in both full-sample and few-shot evaluations, owing to its robust multi-task joint modeling and transfer learning capabilities.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
The central topic of the paper is LLM4WM: Adapting LLM for Wireless Multi-Tasking. It focuses on developing a framework that leverages Large Language Models (LLMs) for handling multiple channel-associated tasks in wireless communication systems.
1.2. Authors
The authors of this paper are:
- Xuanyu Liu (State Key Laboratory of Advanced Optical Communication Systems and Networks, School of Electronics, Peking University, Beijing, China)
- Shijian Gao (Internet of Things Thrust, The Hong Kong University of Science and Technology (Guangzhou), Guangzhou, China)
- Boxun Liu (State Key Laboratory of Advanced Optical Communication Systems and Networks, School of Electronics, Peking University, Beijing, China)
- Xiang Cheng (State Key Laboratory of Advanced Optical Communication Systems and Networks, School of Electronics, Peking University, Beijing, China)
- Liuqing Yang (Internet of Things Thrust and Intelligent Transportation Thrust, The Hong Kong University of Science and Technology (Guangzhou), Guangzhou, China, and Department of Electronic and Computer Engineering and Department of Civil and Environmental Engineering, The Hong Kong University of Science and Technology, Hong Kong SAR, China)
The authors have backgrounds in electrical engineering, communications, and computer science, with affiliations to prominent universities and research institutions known for work in wireless communication, signal processing, and AI.
1.3. Journal/Conference
This paper was published as a preprint on arXiv. While not a peer-reviewed journal or conference yet, arXiv is a highly reputable platform for disseminating cutting-edge research in physics, mathematics, computer science, and related fields. Papers on arXiv often precede formal publication and allow for early sharing and feedback within the academic community.
1.4. Publication Year
The paper was published on arXiv on 2025-01-22 (UTC).
1.5. Abstract
The abstract introduces the concept of channel-associated tasks in wireless communication, highlighting their potential for joint learning to enhance system design. To leverage this, the paper proposes LLM4WM, a Large Language Model (LLM) multi-task fine-tuning framework specifically designed for these tasks. The framework employs a Mixture of Experts with Low-Rank Adaptation (MoE-LoRA) approach for multi-task fine-tuning, which transfers the pre-trained LLM's general knowledge to the wireless domain. It also incorporates preprocessing modules, adapter modules, and multi-task output layers to align the unique characteristics of wireless channel data with the LLM's semantic feature space. Experimental results on a channel-associated multi-task dataset demonstrate that LLM4WM surpasses existing methodologies in both full-sample and few-shot scenarios, owing to its robust multi-task joint modeling and transfer learning capabilities.
1.6. Original Source Link
- Original Source Link: https://arxiv.org/abs/2501.12983
- PDF Link: https://arxiv.org/pdf/2501.12983v2.pdf
- Publication Status: Preprint on arXiv.
2. Executive Summary
2.1. Background & Motivation
The core problem the paper aims to solve is the efficient and robust handling of multiple interconnected tasks in wireless communication systems, often referred to as channel-associated tasks. These tasks, such as channel estimation, prediction, and beam management, are fundamental to communication quality and system design.
This problem is important because:
- Complexity of Wireless Channels: Modern wireless systems, especially those using millimeter-wave (mmWave) and Multiple-Input Multiple-Output (MIMO) technologies, face increasing complexity due to a growing number of antennas and dynamic environments. Accurate understanding and management of the wireless channel are crucial for optimizing performance.
- Limitations of Existing AI Approaches: While Artificial Intelligence (AI) has significantly improved various communication tasks, current AI-powered methods often require large amounts of high-quality data, incur substantial communication overhead for data collection and model retraining as the environment changes, and struggle in complex, highly dynamic scenarios because of limited model scale and generalization issues.
- Inefficiency of Single-Task Learning: Many communication tasks are deeply rooted in the characteristics of the wireless channel. Jointly learning these channel-associated tasks could enable shared representations and significant training benefits, which single-task approaches fail to capitalize on. Existing multi-task learning methods, however, often face data imbalance and the seesaw effect, and have limited capacity to scale with the number and diversity of tasks.

The paper's entry point and innovative idea revolve around leveraging Large Language Models (LLMs) for wireless multi-tasking. Inspired by the remarkable multi-task handling, reasoning, and generalization capabilities of LLMs in other domains (NLP, healthcare, finance), the authors propose adapting these powerful pre-trained models to wireless communication tasks. The innovation lies in developing a specialized framework, LLM4WM, that effectively transfers the general knowledge of LLMs to the unique context of wireless channel data, while addressing challenges like cross-domain knowledge transfer and task diversity.
2.2. Main Contributions / Findings
The primary contributions of the paper are:
- Introduction of LLM4WM: The paper proposes LLM4WM, a novel method that pioneers the use of LLMs for wireless multi-tasking. It is the first to fine-tune an LLM using Mixture of Experts with Low-Rank Adaptation (MoE-LoRA) to extract a joint representation specifically for wireless multi-task scenarios, setting a new standard in this research area.
- Customized Cross-Domain Alignment Mechanisms: To bridge the gap between the LLM's semantic feature space and the specific feature space of wireless tasks, the authors designed:
  - A customized pre-processing method for each task.
  - Corresponding output headers for each task.
  - Multi-task adapters at both the input and output layers of the LLM.
  These ensure effective alignment and enhance adaptability and performance.
- Demonstrated Robust Performance and Generalization: The LLM4WM model exhibits excellent performance across a range of wireless communication tasks, including channel estimation, channel prediction, localization enhancement, and beam management. Furthermore, it demonstrates impressive generalization capabilities, highlighting its robustness and versatility for diverse applications in the wireless domain, even in few-shot and transfer learning scenarios.

Key conclusions and findings reached by the paper include:
- LLM4WM significantly outperforms existing methodologies (traditional, small single-task, small multi-task, and single-task large models) across various channel-associated tasks.
- The MoE-LoRA fine-tuning approach effectively enables tasks to share common knowledge through shared expert weights while maintaining differentiation for task-specific features via independent experts and a gating mechanism.
- Large models, when properly adapted, are better suited for handling multiple downstream tasks in wireless communication than small models, which struggle with conflicting task knowledge and limited capacity for joint representations.
- The multi-task adapter modules and the preprocessing/output layer designs are crucial for aligning the unique characteristics of wireless channel data with the LLM's semantic space, thereby enhancing performance.
- Ablation studies confirm the critical contribution of each proposed module (preprocessor, multi-task adapters, and backbone LLM) to the overall system performance.
- Hyper-parameter analysis provides insights into optimizing the LoRA rank and the number of experts for balanced performance and computational efficiency.
- Efficiency evaluations demonstrate that MoE-LoRA makes LLM4WM comparable in trainable parameters to smaller models, indicating practical deployability.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand this paper, a reader should be familiar with the following fundamental concepts:
- Wireless Channel: The medium through which electromagnetic waves travel to transmit information. It is not a perfect conduit; signals are affected by phenomena like fading (variation in signal strength), interference (unwanted signals), and multipath propagation (signals arriving at the receiver via multiple paths due to reflections, diffractions, or scattering). Understanding these characteristics is crucial for designing robust communication systems.
- Millimeter-Wave (mmWave): A band of radio frequencies in the electromagnetic spectrum between 30 GHz and 300 GHz. mmWave offers very high bandwidth, enabling extremely fast data rates. However, it is highly susceptible to blockages and experiences significant path loss, necessitating directional antennas and sophisticated beamforming techniques.
- Multiple-Input Multiple-Output (MIMO): A wireless technology that uses multiple antennas at both the transmitter and receiver to improve communication performance. By exploiting spatial diversity, MIMO can increase data throughput and link reliability without requiring additional bandwidth or transmit power.
- Orthogonal Frequency Division Multiplexing (OFDM): A digital modulation scheme that divides a single high-rate data stream into several lower-rate data streams, which are transmitted simultaneously over multiple orthogonal subcarriers. This makes OFDM robust against frequency-selective fading and inter-symbol interference (ISI).
- Channel State Information (CSI): The known channel properties of a communication link. This information describes how a signal propagates from the transmitter to the receiver, including effects like scattering, fading, and power decay. Accurate CSI is vital for tasks like channel estimation, beamforming, and channel prediction.
- Channel Estimation: The process of acquiring an estimate of the CSI. In wireless systems, the channel characteristics can change rapidly, so accurate and timely estimation is critical for coherent detection and optimizing transmission parameters. Methods often involve sending known pilot signals and inferring the channel from their received versions.
- Channel Prediction: The task of forecasting future channel conditions based on past and current CSI. This is particularly important in dynamic environments with mobile users, where predicting future channel states can enable proactive resource allocation and beamforming.
- Beamforming: A signal processing technique used in smart antennas for directional signal transmission or reception. By combining signals from multiple antennas with appropriate phase and amplitude weights, a beam can be steered towards a desired user, enhancing signal strength and reducing interference. In mmWave systems, analog beamforming uses phase shifters, while digital beamforming uses independent digital signal processing for each antenna element.
- Artificial Intelligence (AI) in Wireless Communication: The application of machine learning techniques (e.g., neural networks) to solve problems in wireless systems. This includes optimizing network performance, improving signal processing tasks, and enabling intelligent decision-making, often by learning complex patterns from data.
- Multi-Task Learning (MTL): A machine learning paradigm where a single model is trained to perform multiple related tasks simultaneously. The goal is to improve learning efficiency and prediction accuracy for all tasks by sharing representations that capture common underlying structures. Benefits include improved generalization, reduced overfitting, and the ability to learn from small datasets for individual tasks. Challenges include data imbalance (some tasks have more data than others) and the seesaw effect (improving one task's performance degrades another's).
- Large Language Models (LLMs): A type of artificial intelligence model characterized by a vast number of parameters (billions to trillions) and trained on massive amounts of text data. LLMs exhibit powerful capabilities in natural language understanding, generation, reasoning, and generalization across various domains. They are pre-trained on a general corpus and then often fine-tuned for specific downstream tasks.
- Mixture of Experts (MoE): A neural network architecture that consists of multiple "expert" sub-networks and a "gating network." For a given input, the gating network learns to activate (or weight the outputs of) a subset of experts. This allows the model to selectively use specialized experts for different parts of the input space or different tasks, increasing model capacity without a proportional increase in inference cost.
- Low-Rank Adaptation (LoRA): A parameter-efficient fine-tuning technique for large pre-trained models. Instead of fine-tuning all parameters of a large model, LoRA freezes the pre-trained model weights and injects small, trainable low-rank matrices into selected layers (typically the attention layers). These low-rank matrices capture the task-specific adaptations, significantly reducing the number of trainable parameters and the computational cost of fine-tuning, while preserving the general knowledge of the pre-trained model (a rough parameter-count comparison is sketched right after this list).
- Adapter Modules: Small neural network modules inserted into a pre-trained model (e.g., within transformer layers) to adapt it to new tasks or domains. Like LoRA, they aim to reduce the number of trainable parameters during fine-tuning. Adapters typically consist of a down-projection, a non-linearity, and an up-projection, acting as a bottleneck. They allow the pre-trained backbone to remain mostly frozen, preserving its knowledge while the adapters learn task-specific features.
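To make LoRA's parameter-efficiency argument concrete, the back-of-the-envelope calculation below compares trainable parameters for full fine-tuning versus LoRA on a single weight matrix. The layer size and rank are illustrative assumptions, not values taken from the paper.

```python
# Rough trainable-parameter comparison for one weight matrix W of shape (d_out, d_in);
# the layer size and LoRA rank below are illustrative assumptions.
d_in, d_out, r = 768, 768, 8

full_finetune_params = d_in * d_out        # update all of W
lora_params = r * d_in + d_out * r         # train only A (r x d_in) and B (d_out x r)

print(full_finetune_params)                # 589824
print(lora_params)                         # 12288
print(lora_params / full_finetune_params)  # ~0.02, i.e. about 2% of the parameters
```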
3.2. Previous Works
The paper contextualizes its approach by discussing several prior studies and existing methodologies:
- AI for Channel Estimation and Communication Tasks:
  - The paper acknowledges the significant improvements AI has brought to channel estimation accuracy [6]-[8] and its enhancements to tasks like channel prediction, beamforming, and positioning.
  - Limitation: These methods often require vast amounts of high-quality data, incur substantial communication overhead for retraining in dynamic environments, and frequently struggle in complex scenarios due to limited model scale and generalization issues.
- Multi-modal Sensing and Synesthesia of Machines (SoM):
  - [9] proposed multi-modal sensing to capture wireless channel propagation characteristics, a concept summarized as Synesthesia of Machines (SoM).
  - [10], [11] further explored this for optimizing communication systems.
  - Relevance: This highlights the idea of extracting richer information from the wireless environment, which aligns with the goal of LLM4WM to learn comprehensive channel representations.
- Traditional Multi-Task Learning in Wireless:
  - [12] jointly trained signal classification and modulation recognition.
  - [13] proposed joint training and inference for direct channel and cascaded channel estimation in reconfigurable intelligent surface (RIS) systems to reduce pilot overhead.
  - Limitation: These methods often face data imbalance and the seesaw effect (where optimizing one task degrades another), and struggle to scale with an increasing number and diversity of tasks due to limited model capacity, typically combining only two closely related tasks.
- LLMs in Wireless Communication:
  - LLM4CP [19] was an early work that applied LLMs to channel prediction, demonstrating improved few-shot generalization.
  - WiFo [20] introduced a foundational channel model trained on diverse channel datasets for tasks like time-domain and frequency-domain prediction, focusing on channel reconstruction with zero-shot learning.
  - Limitation: These LLM applications primarily focus on single channel reconstruction tasks or specific predictive capabilities, and do not sufficiently consider the complex relationships and joint learning across multiple diverse channel-associated tasks, which limits their ability to acquire truly general representations across a wider range of wireless objectives.
3.3. Technological Evolution
The evolution of technologies addressing wireless communication challenges can be broadly categorized:
- Traditional Model-Based Approaches: Early methods relied on mathematical models of the wireless channel and signal processing techniques (e.g., Least Squares for channel estimation, classical codebooks for beamforming). These methods are deterministic but struggle with the complexity and dynamism of real-world channels.
- Small-Scale AI Models: With the rise of deep learning, specialized neural networks like CNNs, LSTMs, and MLPs were applied to specific wireless tasks. These models learned patterns from data, outperforming traditional methods.
  - Examples: CNNs treating CSI data as images [24], LSTMs for fading channel prediction [39], MLPs for mapping complex relationships [37], [38], and Transformers for channel prediction [23].
  - Multi-task extensions: Approaches like Cross-stitch networks [40] attempted to enable multi-task learning by combining activations from multiple networks.
  - Limitations: These models are often task-specific, require significant data, and struggle to generalize across highly dynamic environments or diverse tasks, necessitating retraining and potentially facing capacity limitations in complex multi-task learning scenarios.
- Large Pre-trained Models (Foundational Models / LLMs): The recent paradigm shift in AI, especially with Transformers and LLMs, has shown that models pre-trained on vast, diverse datasets can acquire broad general knowledge and remarkable few-shot and zero-shot learning capabilities.
  - Examples: GPT-4 in NLP [14], TTM for time series [18], BloombergGPT in finance [17].
  - Early wireless adaptations: LLM4CP [19] and WiFo [20] demonstrated the potential of LLMs for single wireless tasks.
  - Contribution of This Paper: LLM4WM fits into this timeline by pushing the boundaries of LLM application in wireless. It moves beyond single-task adaptation to a holistic multi-task framework, specifically addressing the unique challenges of learning joint representations for diverse channel-associated tasks by proposing MoE-LoRA and customized adapters.
3.4. Differentiation Analysis
Compared to the main methods in related work, LLM4WM offers several core differences and innovations:
- From Single-Task to Multi-Task LLM Adaptation:
  - Prior LLM works (e.g., LLM4CP, WiFo): Primarily focused on adapting LLMs for single wireless tasks, such as channel prediction or reconstruction. While demonstrating strong performance for their specific tasks, they do not fully exploit the LLM's potential for joint learning across diverse communication objectives.
  - LLM4WM's Innovation: It is a pioneering effort in fine-tuning LLMs for multiple, concurrent channel-associated tasks. This allows the model to learn shared, generalized representations of the wireless channel that benefit all tasks simultaneously.
- Advanced Multi-Task Fine-tuning with MoE-LoRA:
  - Traditional Multi-Task Learning (e.g., Cross-stitch): Often relies on shared bottom layers or explicit feature fusion mechanisms. These approaches can suffer from seesaw effects (conflicting gradients) and limited scalability when the number and diversity of tasks increase. Small models typically lack the capacity to capture diverse task knowledge without conflict.
  - LLM4WM's Innovation: Employs Mixture of Experts (MoE) combined with Low-Rank Adaptation (LoRA). MoE-LoRA allows different tasks to activate specific "expert" sub-networks, enabling both shared knowledge acquisition (through common components) and task-specific specialization (through distinct experts). This dynamically balances commonality and specificity, mitigating the seesaw effect and enhancing scalability. The LoRA component ensures parameter efficiency, making fine-tuning a large LLM feasible.
- Comprehensive Feature Alignment for Wireless Data:
  - General LLM Adaptation: Often involves tokenization and direct mapping of inputs/outputs to the LLM's semantic space. This can be insufficient for the unique structure and characteristics of wireless channel data (e.g., complex numbers, spatio-temporal correlations).
  - LLM4WM's Innovation: Designs specialized preprocessing modules to transform raw channel data (e.g., flattening CSI, DFT for angle-domain conversion) before feeding it to the LLM. It also introduces multi-task adapter modules at both input and output to align the specific feature space of wireless tasks with the LLM's general semantic embedding space. Additionally, multi-task output layers with CNNs or MLPs are tailored to task-specific output formats, overcoming the textual output limitation of standard LLMs.
- Leveraging LLM's General Knowledge for Wireless:
  - LLM4WM explicitly aims to transfer the pre-trained LLM's general knowledge (acquired from massive, diverse data) to the wireless domain. This is a significant advantage over training models from scratch on limited wireless datasets, leading to improved few-shot learning and generalization capabilities.

In essence, LLM4WM innovatively combines the power of LLMs with a MoE-LoRA fine-tuning strategy and customized domain-specific adapters to create a robust, generalized, and efficient framework for wireless multi-task learning, addressing the shortcomings of previous single-task LLM adaptations and traditional multi-task approaches.
4. Methodology
4.1. Principles
The core idea behind LLM4WM is to adapt a pre-trained Large Language Model (LLM) for diverse wireless channel-associated tasks by leveraging its inherent multi-task learning capabilities and general knowledge. The theoretical basis and intuition are:
- Shared Representations: Many wireless tasks, despite their different objectives (e.g., estimation, prediction, beamforming), are fundamentally rooted in the characteristics of the wireless channel. A powerful model should be able to learn a joint representation that captures these common underlying features, leading to improved performance on individual tasks. LLMs, with their vast parameter counts and exposure to diverse data, are hypothesized to excel at learning such rich, generalized representations.
- Transfer Learning: Pre-trained LLMs possess a broad "understanding" of patterns and relationships from their extensive training data. This general knowledge can be transferred to a new, specialized domain (wireless communication) through fine-tuning, allowing the model to quickly adapt and perform well even with limited domain-specific data (few-shot learning).
- Parameter Efficiency: Fine-tuning an entire LLM for specific tasks can be computationally prohibitive. Parameter-efficient fine-tuning techniques like Low-Rank Adaptation (LoRA) allow adapting the LLM with a small fraction of trainable parameters, making the process feasible.
- Task Specialization and Collaboration: While shared representations are beneficial, tasks also have unique requirements. A Mixture of Experts (MoE) architecture allows for flexible task specialization by dynamically activating different "expert" sub-networks for different tasks, while still enabling collaboration and knowledge sharing.
- Domain Alignment: The input and output data in wireless communication (e.g., complex channel matrices, specific metrics) differ significantly from the textual data LLMs are typically trained on. Specialized preprocessing modules and multi-task adapter modules are necessary to effectively translate wireless data into the LLM's semantic feature space and vice versa.
4.2. Core Methodology In-depth (Layer by Layer)
The LLM4WM framework is designed to integrate the learning of multiple channel-associated tasks within a unified LLM-based network. The complete workflow is illustrated in Figure 1 of the paper.
The following figure (Figure 1 from the original paper) shows the system architecture:
The image is a schematic contrasting wireless single-task and multi-task learning frameworks. The left side shows a distributed modeling network of small models, while the right side shows LLM4WM, a joint modeling network built on a large model that covers multiple tasks such as channel estimation, distance estimation, and path loss estimation, and uses a Mixture of Experts model with Low-Rank Adaptation. The framework aims to improve the learning efficiency of wireless communication tasks.
The framework consists of four main components: a Pre-process Module, Multi-Task Adapter Module, Backbone LLM (fine-tuned with MoE-LoRA), and Multi-Task Output Module.
The following figure (Figure 2 from the original paper) shows the detailed structure of the LLM4WM framework:
The image is a schematic of the LLM4WM framework structure, comprising the preprocessing module, multi-task adapters, backbone LLM, and multi-task output module. It involves task-specific operations such as data normalization, domain transformation, and linear projection, and performs multi-task fine-tuning via Mixture of Experts with Low-Rank Adaptation.
4.2.1. System Description (Section II)
The paper considers a dual-frequency communication system with both sub-6G and mmWave frequency bands. This system comprises one Base Station (BS) and one User Equipment (UE), both equipped with transceivers for both bands. Multiple-Input Single-Output (MISO) and Orthogonal Frequency Division Multiplexing (OFDM) technologies are applied. The sub-6G and mmWave antennas are co-located and aligned, implying they share spatial features. The BS uses an analog beamforming architecture with a Uniform Linear Array (ULA) for mmWave and a fully digital beamforming architecture with a ULA for sub-6G (64 and 8 transmit antennas in the experiments, respectively). The UE has an omnidirectional antenna for both links.
4.2.1.1. Channel Model (II.A)
The channel for both sub-6G and mmWave is described by a classical cluster-based multipath channel model for the downlink and uplink CSI at time $t$ and frequency $f$.
The channel impulse response $h(t, f)$ is given by:
$
h(t, f) = \sum_{n=1}^{N} \sum_{p=1}^{P_n} \beta_{n,p} e^{j[2\pi(\upsilon_{n,p} t - f \tau_{n,p}) + \Phi_{n,p}]} \mathbf{a}(\theta_{n,p}).
$
Here:
- $N$: Number of clusters.
- $P_n$: Number of paths in the $n$-th cluster.
- $\beta_{n,p}$: Complex path gain of the $p$-th path in the $n$-th cluster.
- $\upsilon_{n,p}$: Doppler frequency shift of the $p$-th path in the $n$-th cluster.
- $\tau_{n,p}$: Delay of the $p$-th path in the $n$-th cluster.
- $\Phi_{n,p}$: Random phase of the $p$-th path in the $n$-th cluster.
- $\mathbf{a}(\theta_{n,p})$: Steering vector of the corresponding path, where $\theta_{n,p}$ is the azimuth angle of departure (AoD).

For a ULA antenna array, the steering vector is expressed as:
$
\mathbf{a}(\theta_{n,p}) = \left[1,\ e^{j \frac{2\pi f d_{\mathrm{v}} \sin(\theta_{n,p})}{c}},\ \ldots,\ e^{j \frac{2\pi (N_{\mathrm{t}}-1) f d_{\mathrm{v}} \sin(\theta_{n,p})}{c}}\right].
$
Here:
- $d_{\mathrm{v}}$: Antenna spacing in the vertical direction.
- $c$: Speed of light.
- $N_{\mathrm{t}}$: Number of transmit antennas.
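To make the channel model concrete, the following NumPy sketch evaluates the ULA steering vector and assembles one realization of $h(t, f)$ from randomly drawn path parameters. The half-wavelength spacing, path count, and parameter ranges are illustrative assumptions rather than the paper's simulation settings.

```python
import numpy as np

def ula_steering(theta, n_t, f, c=3e8):
    """Steering vector a(theta) of an n_t-element ULA at frequency f (half-wavelength spacing assumed)."""
    d_v = 0.5 * c / f
    k = np.arange(n_t)
    return np.exp(1j * 2 * np.pi * k * f * d_v * np.sin(theta) / c)

def channel_response(t, f, paths, n_t):
    """Cluster-based multipath response h(t, f) as a length-n_t vector.

    `paths` is a flat list of (beta, doppler, delay, phase, theta) tuples;
    clusters are merged into a single list for simplicity.
    """
    h = np.zeros(n_t, dtype=complex)
    for beta, nu, tau, phi, theta in paths:
        phase = 2 * np.pi * (nu * t - f * tau) + phi
        h += beta * np.exp(1j * phase) * ula_steering(theta, n_t, f)
    return h

# Toy example: 4 paths, 8-antenna ULA at 1.9 GHz
rng = np.random.default_rng(0)
paths = [(rng.normal() + 1j * rng.normal(),    # complex path gain beta
          rng.uniform(-200, 200),              # Doppler shift (Hz)
          rng.uniform(0, 1e-6),                # delay (s)
          rng.uniform(0, 2 * np.pi),           # random phase
          rng.uniform(-np.pi / 3, np.pi / 3))  # azimuth AoD (rad)
         for _ in range(4)]
print(channel_response(t=1e-3, f=1.9e9, paths=paths, n_t=8).shape)  # (8,)
```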
4.2.1.2. Signal Model (II.B)
For mmWave links, a downlink MISO-OFDM signal transmission is considered over multiple subcarriers. The downlink CSI at time $t$ and the $k$-th subcarrier is denoted $h_{t,k}$.
The received downlink mmWave signal at the user side for time $t$ and the $k$-th subcarrier is:
$
y_{t,k} = h_{t,k}^{\mathrm{H}} w_t x_{t,k} + n_{t,k},
$
Here:
- $h_{t,k}^{\mathrm{H}}$: Conjugate transpose of the channel vector.
- $w_t$: Beam vector selected from a predefined codebook $\mathcal{W}$, i.e., $w_t \in \mathcal{W}$. The codebook size is related to the number of antennas.
- $x_{t,k}$: Transmitted symbol.
- $n_{t,k}$: Additive white Gaussian noise (AWGN).

The achievable spectral efficiency (SE) of the downlink transmission is:
$
R_t = \sum_{k=1}^{K_{\mathrm{s}}} \log_2 \left( 1 + \frac{|h_{t,k}^{\mathrm{H}} w_t|^2}{\sigma_n^2} \right).
$
Here:
- $K_{\mathrm{s}}$: Number of active subcarriers.
- $|h_{t,k}^{\mathrm{H}} w_t|^2$: Squared magnitude of the effective channel gain with beamforming.
- $\sigma_n^2$: Noise power.

The optimal beam vector is selected through beam training to maximize $R_t$:
$
w_t^{*} = \arg\max_{w_t \in \mathcal{W}} R_t.
$
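Below is a minimal NumPy sketch of the SE computation and the exhaustive beam search over a codebook. The over-sampled DFT codebook construction, noise power, and toy random CSI are assumptions for illustration, not the paper's setup.

```python
import numpy as np

def spectral_efficiency(h, w, noise_power):
    """R_t = sum_k log2(1 + |h_k^H w|^2 / sigma_n^2) over active subcarriers.

    h: (K_s, N_t) downlink CSI, w: (N_t,) unit-norm beam vector.
    """
    gain = np.abs(h.conj() @ w) ** 2          # |h_{t,k}^H w_t|^2 per subcarrier
    return np.sum(np.log2(1.0 + gain / noise_power))

def dft_codebook(n_t, n_beams):
    """Over-sampled DFT codebook with n_beams unit-norm columns (assumption)."""
    angles = np.arange(n_beams) / n_beams
    grid = np.exp(-2j * np.pi * np.outer(np.arange(n_t), angles))
    return grid / np.sqrt(n_t)

# Exhaustive beam training: pick w* that maximizes R_t over the codebook
n_t, k_s = 64, 64
rng = np.random.default_rng(1)
h = (rng.normal(size=(k_s, n_t)) + 1j * rng.normal(size=(k_s, n_t))) / np.sqrt(2)
codebook = dft_codebook(n_t, n_beams=256)
rates = [spectral_efficiency(h, codebook[:, i], noise_power=0.1) for i in range(codebook.shape[1])]
best = int(np.argmax(rates))
print("optimal beam index:", best, "SE:", rates[best])
```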
For sub-6G links, the transmission process is similar but uses fully digital precoding. The uplink channel at pilot positions is estimated using the Least Squares (LS) method:
$
\tilde{h}_{LS}(t, f_k) = \tilde{y}_{t,k} / \tilde{x}_{t,k},
$
Here:
- $\tilde{x}_{t,k}$: Uplink pilot signal sent by the user.
- $\tilde{y}_{t,k}$: Signal received by the base station.

To maximize SE, matched filtering based precoding is applied:
$
\tilde{w}_{t,k}^{*} = \frac{\tilde{h}_{LS}(t, f_k)}{\lVert \tilde{h}_{LS}(t, f_k) \rVert},
$
Here:
- $\tilde{w}_{t,k}^{*}$: Optimal precoding vector for time $t$ and the $k$-th subcarrier. Its accuracy depends on the quality of $\tilde{h}_{LS}(t, f_k)$.
4.2.2. Task Description (Section III)
The paper defines a set of channel-associated tasks $\mathcal{T}$, categorized into three classes: Channel Reconstruction, Beam Management, and Radio Environment Mining. Each task $n$ corresponds to a dataset $\{X_n^{I}, X_n^{L}\}$, where $X_n^{I}$ is the input and $X_n^{L}$ is the label.
All tasks are formulated as an optimization problem:
$
\begin{array}{rl}
\underset{\Omega_n}{\max} & \mathrm{Score}_n = \mathbb{E}\left\{ f_{\mathrm{eval},n}(X_n^{L}, X_n^{O}) \right\} \\
\mathrm{s.t.} & X_n^{O} = f_{\Omega_n}(X_n^{I}), \quad n \in \mathcal{T},
\end{array}
$
Here:
- $f_{\mathrm{eval},n}$: Evaluation function for task $n$.
- $f_{\Omega_n}$: Model function for task $n$, parameterized by $\Omega_n$.
- $X_n^{O}$: Output results (prediction/estimation) of the model for task $n$.

The following are the results from Table I of the original paper:

| Task class | Task id | Task content |
| --- | --- | --- |
| Channel Reconstruction | CE | Channel estimation |
| Channel Reconstruction | CP | Temporal domain channel prediction |
| Channel Reconstruction | PF | Frequency domain channel prediction |
| Beam Management | BF | Sub-6G assisted mmWave beamforming |
| Radio Environment Mining | DE | Distance estimation |
| Radio Environment Mining | PE | Path loss estimation |
4.2.2.1. Channel Reconstruction: Interpolation and Prediction (III.A)
These tasks aim to predict or interpolate target channel matrices from known ones, capturing inter-domain correlations across the time, frequency, and antenna domains. The channel matrix $H$ is indexed along these three dimensions:
$
H[i, j, :] = h(i \Delta t,\ f_1 + (j-1)\Delta f),
$
Here:
- $i$: Time index with interval $\Delta t$.
- $j$: Frequency index with interval $\Delta f$ and lowest frequency $f_1$.

For Channel Estimation (CE): A comb-type pilot pattern is used. The network interpolates the missing channels.
$
\begin{array}{l}
X_{CE}^{I} = H[1:\tilde{T},\ 1:\tilde{K}/n_{pilot}:\tilde{K},\ 1:\tilde{N}_t] \\
X_{CE}^{L} = H[1:\tilde{T},\ 1:\tilde{K},\ 1:\tilde{N}_t].
\end{array}
$
Here:
- $\tilde{T}$: Total number of timestamps.
- $\tilde{K}$: Number of OFDM subcarriers.
- $n_{pilot}$: Number of pilots.
- $X_{CE}^{I}$: Input to CE, representing pilots sparse in frequency.
- $X_{CE}^{L}$: Label for CE, representing the full channel matrix.

For Time-Domain Channel Prediction (CP):
$
\begin{array}{l}
X_{CP}^{I} = H[1:\tilde{T},\ 1:\tilde{K},\ 1:\tilde{N}_t] \\
X_{CP}^{L} = H[\tilde{T}+1:\tilde{T}+\tilde{P},\ 1:\tilde{K},\ 1:\tilde{N}_t],
\end{array}
$
Here:
- $X_{CP}^{I}$: Historical channel data up to time $\tilde{T}$.
- $X_{CP}^{L}$: Future channel data for the $\tilde{P}$ moments to be predicted.

For Frequency-Domain Channel Prediction (PF):
$
\begin{array}{l}
X_{PF}^{I} = H[1:\tilde{T},\ 1:\tilde{K}/2,\ 1:\tilde{N}_t] \\
X_{PF}^{L} = H[1:\tilde{T},\ \tilde{K}/2:\tilde{K},\ 1:\tilde{N}_t].
\end{array}
$
Here:
- $X_{PF}^{I}$: Channel data for the first half of the frequencies.
- $X_{PF}^{L}$: Label for the second half of the frequencies.
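As an illustration of how the three reconstruction tasks slice the channel tensor, here is a small NumPy sketch. The tensor sizes, pilot count, and prediction horizon are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

# Toy channel tensor H with dimensions (time, frequency, antenna)
T, K, Nt = 16, 64, 8        # historical timestamps, subcarriers, antennas
P, n_pilot = 4, 8           # prediction horizon, number of pilots
H = np.random.randn(T + P, K, Nt) + 1j * np.random.randn(T + P, K, Nt)

# Channel estimation (CE): comb-type pilots, sparse in frequency -> full grid
X_CE_in = H[:T, ::K // n_pilot, :]     # a pilot every K/n_pilot subcarriers
X_CE_lab = H[:T, :, :]

# Time-domain prediction (CP): past T frames -> next P frames
X_CP_in = H[:T, :, :]
X_CP_lab = H[T:T + P, :, :]

# Frequency-domain prediction (PF): first half of subcarriers -> second half
X_PF_in = H[:T, :K // 2, :]
X_PF_lab = H[:T, K // 2:, :]

print(X_CE_in.shape, X_CP_lab.shape, X_PF_lab.shape)
# (16, 8, 8) (4, 64, 8) (16, 32, 8)
```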
4.2.2.2. Beam Management: Sub-6G Aided Beamforming (III.B)
This task focuses on acquiring the optimal weight vector $w_t^{*}$ from a super-resolution DFT codebook $\mathcal{W}$ whose size exceeds the number of antennas. It leverages the spatial correlation between the sub-6G and mmWave bands, particularly in Line-of-Sight (LoS) conditions, to assist mmWave beamforming and reduce pilot overhead. The input and label for this task are:
$
\begin{array}{l}
X_{BF}^{I} = H[1,\ 1:\tilde{K},\ 1:\tilde{N}_t] \\
X_{BF}^{L} = w_t^{*}.
\end{array}
$
Here:
- $X_{BF}^{I}$: Sub-6G channel matrix at time 1, across all subcarriers and antennas.
- $X_{BF}^{L}$: Optimal mmWave beam vector $w_t^{*}$.
4.2.2.3. Radio Environment Mining: Distance Estimation and Path Loss Estimation (III.C)
These tasks extract environmental information from the channel data, such as the distance $x_d$ and path loss $x_{pl}$, to adjust communication system configurations.
For Distance Estimation (DE):
$
\begin{array}{l}
X_{DE}^{I} = H[1,\ 1:\tilde{K},\ 1:\tilde{N}_t] \\
X_{DE}^{L} = x_d.
\end{array}
$
Here:
- $X_{DE}^{I}$: Sub-6G channel matrix.
- $x_d$: Distance from the UE to the BS.

For Path Loss Estimation (PE):
$
\begin{array}{l}
X_{PE}^{I} = H[1,\ 1:\tilde{K},\ 1:\tilde{N}_t] \\
X_{PE}^{L} = x_{pl}.
\end{array}
$
Here:
- $X_{PE}^{I}$: Sub-6G channel matrix.
- $x_{pl}$: Path loss.
4.2.3. LLM for Wireless Channel-Associated Tasks (Section IV)
4.2.3.1. Preprocessor Module (IV.A)
Given that each task requires different channel characteristics, a specific preprocessing function $f_{\mathrm{pre},n}$ is designed for each task $n$.
The preprocessed data for task $n$ is obtained as:
$
X_n^{pre} = f_{\mathrm{pre},n}(X_n^{I}),
$
Here:
- $X_n^{I}$: Input data for task $n$.
- $f_{\mathrm{pre},n}$: Preprocessing operation for task $n$.

For channel reconstruction tasks (CE, CP, PF), the preprocessing involves tokenizing the CSI by flattening its spatial and frequency features:
$
X_n^{pre} = \mathrm{Flatten}(X_n^{I}, -2),
$
Here:
- $\mathrm{Flatten}(X, -2)$: Operation that flattens the second-to-last dimension of tensor $X$ and all subsequent dimensions into a single dimension. For example, if $X$ has shape $(\tilde{T}, \tilde{K}, \tilde{N}_t)$, it flattens the $\tilde{K}$ and $\tilde{N}_t$ dimensions, resulting in shape $(\tilde{T}, \tilde{K}\tilde{N}_t)$.

For tasks requiring channel angle features (BF, DE, PE), the CSI data undergoes a domain transformation from the spatial domain to the angle domain using a Discrete Fourier Transform (DFT) matrix:
$
X_n^{pre} = X_n^{I} F_{\tilde{N}_t},
$
Here:
- $F_{\tilde{N}_t}$: An $\tilde{N}_t$-dimensional DFT matrix. This transforms the spatial-domain CSI into its angular-spectrum representation.
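The sketch below illustrates the two preprocessing operations in NumPy. The normalized DFT matrix and the toy tensor shapes are assumptions for illustration.

```python
import numpy as np

def flatten_last_two(x):
    """Flatten(X, -2): merge the last two dimensions into one token dimension."""
    return x.reshape(*x.shape[:-2], -1)

def to_angle_domain(x, n_t):
    """Spatial -> angle domain via a normalized N_t-point DFT on the antenna axis."""
    F = np.fft.fft(np.eye(n_t)) / np.sqrt(n_t)  # DFT matrix F_{N_t}
    return x @ F

T, K, Nt = 16, 64, 8
H = np.random.randn(T, K, Nt) + 1j * np.random.randn(T, K, Nt)

X_ce_pre = flatten_last_two(H)      # shape (T, K * Nt), used for CE/CP/PF
X_bf_pre = to_angle_domain(H, Nt)   # shape (T, K, Nt), angular spectrum for BF/DE/PE
print(X_ce_pre.shape, X_bf_pre.shape)
```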
4.2.3.2. Multi-Task Adapter Module (IV.B)
This module extends conventional single-task adapters to address multiple tasks simultaneously. It consists of multiple individual adapters, each assigned to a specific task. These adapters perform dimensional alignment and intrinsic representation alignment between the wireless task features and the LLM's semantic space.
The following figure (Figure 3 from the original paper) shows the structure of the multi-task adapter module:
The image is a diagram showing the dimension-alignment and feature-alignment processes in the multi-task adapter module. The left side shows dimension alignment implemented with linear layers and a transpose operation; the right side shows the feature-alignment pipeline, including residual blocks and the GELU activation function.
An individual input adapter, denoted as $\mathrm{Adapter}_n^{in}$ for task $n$, first uses a linear alignment layer:
$
X_n^{f} = \mathrm{Linear}(X_n^{pre}) \in \mathbb{R}^{L \times D_{llm}},
$
Here:
- $X_n^{pre}$: Preprocessed data for task $n$.
- $\mathrm{Linear}(\cdot)$: Operation including at least two fully connected layers that linearly map the first and second dimensions of the input to the specified dimensions.
- $L$: Token length of the LLM's input.
- $D_{llm}$: Hidden dimension of the LLM.
- $X_n^{f}$: Feature map after linear alignment.

Following the linear alignment, residual feature extraction networks and an activation function are applied to obtain feature maps with aligned semantic features:
$
X_n^{a} = \mathrm{Res}\big(\mathrm{GELU}\big(\mathrm{Res}(X_n^{f})\big)\big) \in \mathbb{R}^{L \times D_{llm}},
$
Here:
- $\mathrm{Res}(\cdot)$: Operation consisting of Res-blocks. Each Res-block contains two 1-dimensional convolution kernels (size 3, stride 1) and a ReLU activation function.
- $\mathrm{GELU}$: Gaussian Error Linear Unit, a smooth activation function [30].
- $X_n^{a}$: Aligned feature map that is fed into the LLM.

The entire process for the input adapter can be simplified as:
$
X_n^{a} = \mathrm{Adapter}_n^{in}(X_n^{pre}).
$
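For concreteness, here is a PyTorch sketch of one per-task input adapter combining the linear dimension alignment with Res-block/GELU feature alignment. The exact layer sizes, the axis the 1-D convolutions run over, and the token/feature dimensions are assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class ResBlock1D(nn.Module):
    """Residual block with two 1-D convolutions (kernel 3, stride 1) and ReLU."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):                      # x: (batch, L, D)
        y = x.transpose(1, 2)                  # run the convs over the token dimension
        y = self.conv2(self.relu(self.conv1(y)))
        return x + y.transpose(1, 2)

class TaskAdapter(nn.Module):
    """Per-task input adapter: linear dimension alignment + residual feature alignment."""
    def __init__(self, in_tokens, in_dim, llm_tokens, llm_dim):
        super().__init__()
        self.token_proj = nn.Linear(in_tokens, llm_tokens)     # align token length L
        self.feat_proj = nn.Linear(in_dim, llm_dim)            # align hidden size D_llm
        self.res1, self.res2 = ResBlock1D(llm_dim), ResBlock1D(llm_dim)
        self.act = nn.GELU()

    def forward(self, x):                      # x: (batch, in_tokens, in_dim)
        x = self.feat_proj(x)                  # (batch, in_tokens, llm_dim)
        x = self.token_proj(x.transpose(1, 2)).transpose(1, 2)  # (batch, L, llm_dim)
        return self.res2(self.act(self.res1(x)))

adapter = TaskAdapter(in_tokens=16, in_dim=512, llm_tokens=64, llm_dim=768)
print(adapter(torch.randn(4, 16, 512)).shape)  # torch.Size([4, 64, 768])
```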
4.2.3.3. Mixture-of-LoRA Based Fine-tuning (IV.C)
The backbone LLM module is fine-tuned using MoE-LoRA to adapt it to wireless channel tasks while keeping most of its parameters frozen.
First, standard LoRA is introduced. For pre-trained weights $W_0$ (mapping an input of dimension $d_{in}$ to an output of dimension $d_{out}$), two trainable low-rank matrices $A$ and $B$ are used. The fine-tuned weights are:
$
W = W_0 + \frac{\alpha}{r} BA,
$
Here:
- $W_0$: Original pre-trained weights.
- $A$, $B$: Low-rank matrices ($A$ maps the $d_{in}$-dimensional input to an $r$-dimensional space, and $B$ maps it to the $d_{out}$-dimensional output).
- $r$: Rank of the low-rank approximation.
- $\alpha$: Scaling hyperparameter.

If the input to the feed-forward network is $x_t$ and the output is $y_t$, the forward propagation is:
$
y_t = W x_t = W_0 x_t + \frac{\alpha}{r} B A x_t.
$

To extend this to multi-task learning, MoE principles are incorporated. A collection of $N_e$ independent low-rank matrix pairs (experts) is used, and a gating network selects and combines them.
The MoE-LoRA forward propagation is:
$
y_t = W x_t = W_0 x_t + \frac{\alpha}{r} \sum_{k=1}^{N_e} \omega_k B_k A_k x_t,
$
Here:
- $A_k$ and $B_k$: The $k$-th pair of low-rank matrices (an "expert").
- $N_e$: Number of experts.
- $\omega_k$: Weight of the $k$-th expert, determined by a gating network. A single-layer linear network followed by $\mathrm{Softmax}(\cdot)$ is used as the gating network to generate and normalize the expert weights.

The following figure (Figure 4 from the original paper) illustrates the MoE-LoRA fine-tuning method:
The image is a schematic of the MoE-LoRA fine-tuning structure, showing how multiple LoRA components are combined with the pre-trained weights W0 and interact with the add-and-normalize and multi-head attention layers to achieve task-specific behavior.
MoE-LoRA is applied to the linear layers within the feed-forward network (FFN) of the LLM, freezing other parameters. This reduces trainable parameters and training costs.
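The following PyTorch sketch shows one way to wrap a frozen linear layer with LoRA experts and a softmax gate in the spirit of the MoE-LoRA formula above. The token-wise gating, initialization, and layer sizes are assumptions rather than the paper's exact implementation (the paper's gate may, for instance, condition on task identity).

```python
import torch
import torch.nn as nn

class MoELoRALinear(nn.Module):
    """Frozen linear layer W0 augmented with N_e LoRA experts and a softmax gate.

    Forward: y = W0 x + (alpha / r) * sum_k w_k(x) * B_k A_k x
    """
    def __init__(self, base: nn.Linear, num_experts=8, r=8, alpha=8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False            # keep pre-trained weights frozen
        d_in, d_out = base.in_features, base.out_features
        self.A = nn.Parameter(torch.randn(num_experts, r, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(num_experts, d_out, r))
        self.gate = nn.Linear(d_in, num_experts)   # single-layer gating network
        self.scale = alpha / r

    def forward(self, x):                      # x: (..., d_in)
        weights = torch.softmax(self.gate(x), dim=-1)                    # (..., N_e)
        low_rank = torch.einsum('erd,...d->...er', self.A, x)            # (..., N_e, r)
        expert_out = torch.einsum('eor,...er->...eo', self.B, low_rank)  # (..., N_e, d_out)
        mixed = (weights.unsqueeze(-1) * expert_out).sum(dim=-2)
        return self.base(x) + self.scale * mixed

layer = MoELoRALinear(nn.Linear(768, 3072), num_experts=8, r=8)
print(layer(torch.randn(2, 64, 768)).shape)  # torch.Size([2, 64, 3072])
```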
4.2.3.4. Multi-Task Output Module (IV.D)
Standard LLMs map their output features to a probability distribution over a text vocabulary, which is unsuitable for wireless tasks. Therefore, a specialized output layer is designed.
First, a multi-task adapter (identical in structure to the input adapter, denoted as $\mathrm{Adapter}_n^{out}$) is connected to the LLM's output for task $n$:
$
X_n^{p} = \mathrm{Adapter}_n^{out}(X_n^{LLM}),
$
Here:
- $X_n^{LLM}$: LLM's output feature for task $n$.
- $X_n^{p}$: Output of the multi-task adapter for task $n$.

Then, different processing networks are applied based on the task type:
- For channel estimation and channel prediction tasks (CE, CP, PF), which are sensitive to local features, CNNs are used for processing and dimensional alignment.
- For tasks requiring a global feature representation (BF, DE, PE), such as beamforming, distance estimation, and path loss estimation, the feature map is flattened and processed by an MLP network for feature processing and dimensional alignment.

The final prediction/estimation result for task $n$ is:
$
X_n^{\mathrm{o}} = \left\{ \begin{array}{ll} \mathrm{CNN}(X_n^{p}), & n \in \{CE, CP, PF\} \\ \mathrm{MLP}(X_n^{p}), & n \in \{BF, DE, PE\} \end{array} \right.
$
Both the CNN and MLP networks are typically three layers, with the final layer being a single fully connected layer to align output dimensions.
4.2.3.5. Training Configuration (IV.E)
The network is trained on a multi-task mixed dataset using a two-stage training approach:
- Stage 1: Only the multi-task adapters and the output layer are trained; the LLM parameters are frozen. In this stage, the model learns the mapping between the task feature space and the pre-trained LLM's text feature space.
- Stage 2: The LLM is fine-tuned using MoE-LoRA. The multi-task adapters become frozen, but the output layer remains trainable. This stage leverages the LLM for joint modeling and generalized representations.

The same loss function is used for both stages:
$
\mathrm{Loss} = \sum_n \omega_n f_{loss,n}(X_n^{\mathrm{o}}, X_n^{\mathrm{l}}),
$
Here:
- $f_{loss,n}$: Loss function for task $n$.
- $X_n^{\mathrm{o}}$: Predicted/estimated output for task $n$.
- $X_n^{\mathrm{l}}$: True label for task $n$.
- $\omega_n$: Task weighting. The Dynamic Weight Average (DWA) algorithm [33] is used to dynamically adjust $\omega_n$ each epoch based on the task losses, ensuring all tasks are well-trained.

The choice of $f_{loss,n}$ depends on the task:
- For classification problems (e.g., BF), cross-entropy loss is used.
- For regression problems (e.g., CP), normalized mean square error (NMSE) [23] is used.
5. Experimental Setup
5.1. Datasets
The experiments utilize time-varying CSI datasets simulated using the QuaDRiGa channel generator [34], which is compliant with 3GPP standards.
- System Setup: A dual-frequency wireless system with a 1.9 GHz sub-6G link and a 28 GHz mmWave link is considered.
- Scenario: UMa-LOS (Urban Macrocell Line-of-Sight) is the primary scenario for dataset generation.
- Data Generation Parameters: The following are the results from Table II of the original paper:

| Parameter | mmWave | sub-6G |
| --- | --- | --- |
| Scenario | 3GPP_38.901_UMa_LOS | 3GPP_38.901_UMa_LOS |
| Active BSs | 1 | 1 |
| Codebook size | 256 | N/A |
| Transmit antennas | 64 | 8 |
| Center frequency (GHz) | 28 | 1.9 |
| Bandwidth (GHz) | 0.5 | 0.06 |
| Antenna spacing | 0.5 | 0.5 |
| OFDM sub-carriers | 64 | 64 |
| Clusters | N/A | 21 |
| Paths per cluster | N/A | 20 |

- Sub-6G Link Details: Operates in FDD mode. The uplink and downlink channels are adjacent. A pilot is placed every 8 subcarriers for the uplink channel.
  - For channel prediction (CP): Future RBs are predicted based on historical RBs, with pilots placed at a fixed time interval.
  - For frequency-domain prediction (PF): The downlink channel at pilot locations is inferred from the uplink channel (estimated or predicted).
- mmWave Link Details: Employs TDD mode.
  - For sub-6G assisted mmWave beamforming (BF): The downlink analog precoding is derived based on the spatial correlation of the uplink sub-6G channel estimated from the uplink pilots.
- User Mobility: The initial user position is randomized. The motion trajectory is linear at a constant speed.
- Dataset Size: Total of 20,000 samples.
  - Training set: 15,000 samples.
  - Validation set: 1,600 samples.
  - Test set: 3,400 samples.

These datasets are effective for validating the method's performance because they are simulated using a widely accepted channel generator (QuaDRiGa) and adhere to 3GPP standards, making them representative of realistic wireless communication environments. The inclusion of dual-frequency bands and user mobility adds to the complexity and realism, allowing for a comprehensive evaluation of LLM4WM's capabilities across diverse tasks.
5.2. Evaluation Metrics
The paper uses task-specific metrics for evaluation, along with an average metric and Spectral Efficiency (SE).
- Normalized Mean Square Error (NMSE): Used for channel reconstruction tasks (CE, CP, PF) and path loss estimation (PE).
  - Conceptual Definition: NMSE is a normalized measure of the average squared difference between predicted values and ground truth values. It is commonly used to quantify the accuracy of regression models, especially when the magnitude of the target variable can vary significantly. Normalization by the power of the ground truth makes it scale-independent and comparable across different channel conditions. A lower NMSE indicates better performance.
  - Mathematical Formula: $ \mathrm{NMSE} = \frac{\mathbb{E}\left\{ \|\hat{\mathbf{X}} - \mathbf{X}\|_2^2 \right\}}{\mathbb{E}\left\{ \|\mathbf{X}\|_2^2 \right\}} $
  - Symbol Explanation:
    - $\mathbb{E}\{\cdot\}$: Statistical expectation, typically approximated by averaging over the test samples.
    - $\|\cdot\|_2^2$: Squared $\ell_2$-norm, representing the squared Euclidean distance or power.
    - $\hat{\mathbf{X}}$: Predicted channel matrix, path loss, or other regression target.
    - $\mathbf{X}$: Ground truth (actual) channel matrix, path loss, or other regression target.
- Top-1 Accuracy (Acc): Used for the beam management task (BF).
  - Conceptual Definition: Top-1 accuracy measures the proportion of samples for which the model's highest-probability prediction matches the true label. In the context of beamforming, it indicates how often the model correctly predicts the optimal beam index. A higher accuracy indicates better performance.
  - Mathematical Formula: $ \mathrm{Acc} = \frac{1}{N_{samples}} \sum_{i=1}^{N_{samples}} \mathbb{I}(\hat{y}_i = y_i) $
  - Symbol Explanation:
    - $N_{samples}$: Total number of samples in the evaluation set.
    - $\mathbb{I}(\cdot)$: Indicator function, which equals 1 if the condition inside is true, and 0 otherwise.
    - $\hat{y}_i$: The model's predicted label (e.g., beam index) for the $i$-th sample.
    - $y_i$: The true label for the $i$-th sample.
- Mean Absolute Error (MAE): Used for distance estimation (DE).
  - Conceptual Definition: MAE measures the average magnitude of the errors between predicted values and ground truth values. It provides a linear score where all individual differences are weighted equally. MAE is less sensitive to outliers compared to MSE. A lower MAE indicates better performance.
  - Mathematical Formula: $ \mathrm{MAE} = \frac{1}{N_{samples}} \sum_{i=1}^{N_{samples}} |\hat{y}_i - y_i| $
  - Symbol Explanation:
    - $N_{samples}$: Total number of samples in the evaluation set.
    - $\hat{y}_i$: The model's predicted value for the $i$-th sample.
    - $y_i$: The true value for the $i$-th sample.
    - $|\cdot|$: Absolute value.
- Average Metric (Avg.): A composite metric across all tasks for intuitive comparison.
  - Conceptual Definition: This metric provides a single normalized score that averages the performance across all six diverse tasks. It is constructed such that a lower value always indicates better overall performance. For the accuracy-based metric, (1 - Accuracy) is used to align with the "lower is better" convention of the error metrics.
  - Mathematical Formula: $ \mathrm{Avg.} = \frac{1}{6} \left[ \mathrm{NMSE}(\mathrm{CE}) + \mathrm{NMSE}(\mathrm{CP}) + \mathrm{NMSE}(\mathrm{PF}) + (1 - \mathrm{Acc}(\mathrm{BF})) + \mathrm{MAE}(\mathrm{DE}) + \mathrm{NMSE}(\mathrm{PE}) \right] $
  - Symbol Explanation:
    - $\mathrm{NMSE}(\cdot)$: Normalized Mean Square Error for the specified task.
    - $\mathrm{Acc}(\mathrm{BF})$: Top-1 Accuracy for the beamforming task.
    - $\mathrm{MAE}(\mathrm{DE})$: Mean Absolute Error for the distance estimation task.
    - CE, CP, PF, BF, DE, PE: Task IDs representing Channel Estimation, Channel Prediction, Frequency-domain Prediction, Beamforming, Distance Estimation, and Path Loss Estimation, respectively.
- Spectral Efficiency (SE): Reflects the overall performance of the communication system.
  - Conceptual Definition: SE measures the achievable data rate per unit of bandwidth (bits/s/Hz) that can be transmitted over a wireless communication link. It is a crucial metric for evaluating the efficiency of a communication system. Higher SE indicates a more efficient use of the available spectrum. In this context, it is calculated using the channel information predicted or estimated by the model.
  - Calculation: Computed with the SE formula of Section 4.2.1.2 (Eq. (4) of the paper), using the actual CSI together with the beam/precoding vector obtained (per Eq. (7) of the paper) from the model's predicted or estimated channel. The communication SNR is fixed for the evaluation.
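For reference, minimal NumPy implementations of the three task metrics are sketched below; the toy data is random and purely illustrative.

```python
import numpy as np

def nmse(pred, target):
    """Normalized mean square error: E{||pred - target||^2} / E{||target||^2}."""
    return np.sum(np.abs(pred - target) ** 2) / np.sum(np.abs(target) ** 2)

def top1_accuracy(logits, labels):
    """Fraction of samples whose highest-scoring beam index matches the label."""
    return float(np.mean(np.argmax(logits, axis=-1) == labels))

def mae(pred, target):
    """Mean absolute error for regression targets such as distance."""
    return float(np.mean(np.abs(pred - target)))

# Toy check with random data
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 64)) + 1j * rng.normal(size=(100, 64))
print(nmse(x + 0.1 * rng.normal(size=x.shape), x))
print(top1_accuracy(rng.normal(size=(100, 256)), rng.integers(0, 256, 100)))
print(mae(rng.normal(size=100), rng.normal(size=100)))
```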
5.3. Baselines
To demonstrate the superiority of LLM4WM, the authors compare it against a comprehensive set of baselines, categorized by their approach:
- Traditional Methods (without deep learning): These methods do not involve a training process and rely on inherent channel characteristics.
  - BI (Bilinear Interpolation): Treats CSI as a time series and uses bilinear interpolation to complete channel reconstruction tasks.
  - Codebook [35]: For beam management, this method uses a super-resolution codebook and spatial correlation in the sub-6G band to find the optimal mmWave downlink beam.
  - FIFS (Fine-Grained Indoor Fingerprinting System) [36]: A CSI-based fingerprinting system that utilizes a coherence bandwidth-enhanced probability algorithm and a correlation filter to map objects to fingerprints, implemented for radio environment mining tasks.
- Single-Task Small Model Methods: These employ specially designed, relatively small-parameter models for specific downstream tasks.
  - MLP (MultiLayer Perceptron) [37], [38]: Used for radio environment sensing and beam management.
  - LSTM (Long Short-Term Memory) [39]: A recurrent neural network designed for sequence data, implemented with 4 LSTM layers for channel reconstruction tasks.
  - CNN (Convolutional Neural Network) [24]: A CNN-based predictor (ten convolutional layers) for FDD systems, treating time-frequency CSI data as a 2D image processing task. Implemented for channel reconstruction tasks.
  - WiT (Wireless Transformer) [26]: A transformer-based location estimation method leveraging the attention mechanism for robust learning. Implemented for radio environment sensing tasks.
  - Transformer [23]: A transformer-based parallel channel predictor (3 encoders, 2 decoders) for TDD systems to mitigate error propagation. Implemented for channel reconstruction tasks.
- Multi-Task Small Model Methods: These use techniques like low-level sharing and cross-feature fusion to enable feature sharing across different tasks.
  - Cross-stitch [40]: A convolutional multi-task learning neural network with "cross-stitch" units to combine activations from multiple networks. Uses ResNet [41] as the backbone.
  - Cross-stitch(s): A variant of Cross-stitch that performs only a single task, serving as a baseline to illustrate the impact of multi-task learning for small models.
- Single-Task Large Model Methods: These fine-tune large pre-trained models for a single downstream task.
  - LLM4CP [19]: The first method to apply LLMs to the channel prediction task through fine-tuning. Uses GPT-2 as the backbone LLM and LN Tuning [42] as the fine-tuning method, implemented for channel reconstruction tasks.
  - LLM4WM(s): A single-task fine-tuning version of the proposed LLM4WM framework, performing only a single task. This serves as an ablation to show the benefit of multi-task learning within the LLM4WM architecture.

These baselines are representative because they cover a wide spectrum of approaches, from traditional signal processing to state-of-the-art deep learning models, including both small and large models, and both single-task and multi-task learning paradigms relevant to wireless communication. This diverse comparison provides a robust validation of LLM4WM's performance.
5.4. Network and Training Parameters
- Multi-Task Adapter Module: Used for both input and output feature alignment, as described in Section 4.2.3.2.
- MoE-LoRA Fine-tuning:
  - Number of experts: 8.
  - LoRA rank ($r$): 8 for each LoRA matrix.
- Multi-Task Output Module:
  - For CE, CP, PF: Three-layer CNN.
  - For BF, DE, PE: Three-layer MLP with 768-dimensional features.
  - Both types of output modules use only one single-layer fully connected network for output dimension alignment.
- Backbone LLM: Smallest version of GPT-2 [43], with a feature dimension of 768. Only the first few layers of GPT-2 are deployed.
- Scheduler: Both warm-up and a cosine annealing scheduler are employed (see the sketch after this list).
  - Warm-up phase: During the first 50 epochs, the learning rate increases linearly from 1 × 10−5 to 1 × 10−3.
  - Subsequent training: The learning rate is dynamically adjusted using cosine annealing.
- Hyperparameters for Network Training: The following are the results from Table III of the original paper:

| Parameter | Value |
| --- | --- |
| Batch size | 512 |
| Epochs | 250 |
| Optimizer | Adam (betas = (0.9, 0.999)) |
| Learning rate scheduler | Cosine Annealing |
| Cosine annealing period | 100 epochs |
| Learning rate range | [1 × 10−5, 1 × 10−3] |
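Referring to the scheduler bullet above, here is a minimal PyTorch sketch of a warm-up plus cosine-annealing schedule. The dummy model and the mapping of Table III values onto LinearLR/CosineAnnealingLR arguments are assumptions for illustration.

```python
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR

model = torch.nn.Linear(768, 768)                      # placeholder module
optimizer = Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))

# 50 warm-up epochs from 1e-5 (0.01 * base lr) to 1e-3, then cosine annealing with a 100-epoch period
warmup = LinearLR(optimizer, start_factor=1e-2, end_factor=1.0, total_iters=50)
cosine = CosineAnnealingLR(optimizer, T_max=100, eta_min=1e-5)
scheduler = SequentialLR(optimizer, schedulers=[warmup, cosine], milestones=[50])

for epoch in range(250):
    optimizer.step()       # placeholder for one training epoch
    scheduler.step()
```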
6. Results & Analysis
6.1. Core Results Analysis
The LLM4WM framework is evaluated against various baselines across six channel-associated tasks. The primary performance indicators are NMSE (for channel reconstruction and path loss estimation, lower is better), Top-1 Accuracy (for beamforming, higher is better), MAE (for distance estimation, lower is better), and Spectral Efficiency (SE) (for overall system performance, higher is better). The custom Avg. metric provides a unified measure where lower values indicate better overall performance.
The following are the results from Table IV of the original paper:
| Method | CE NMSE↓ | CE SE↑ | CP NMSE↓ | CP SE↑ | PF NMSE↓ | PF SE↑ | Method | BF Acc↑ | BF SE↑ | Method | DE MAE↓ | PE NMSE↓ | Avg.↓ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BI | 0.654 | 5.612 | 1.796 | 2.965 | 1.293 | 5.321 | Codebook | 0.288 | 7.868 | FIFS | 0.249 | 0.204 | 0.818 |
| CNN | 0.119 | 6.043 | 0.125 | 6.038 | 0.283 | 5.888 | CNN | 0.356 | 6.852 | WiT | 0.160 | 0.053 | 0.230 |
| LSTM | 1.000 | 4.182 | 0.161 | 5.994 | 0.280 | 5.902 | MLP | 0.831 | 8.522 | MLP | 0.218 | 0.091 | 0.320 |
| Cross-stitch(s) | 0.153 | 5.999 | 0.112 | 6.058 | 0.226 | 5.947 | Cross-stitch(s) | 0.884 | 8.545 | Cross-stitch(s) | 0.177 | 0.054 | 0.140 |
| Cross-stitch | 0.157 | 5.996 | 0.112 | 6.059 | 0.232 | 5.947 | Cross-Stitch | 0.858 | 8.525 | Cross-stitch | 0.131 | 0.032 | 0.134 |
| LLM4CP | 0.106 | 6.062 | 0.106 | 6.066 | 0.151 | 6.027 | LLM4CP | 0.682 | 8.430 | LLM4CP | 0.199 | 0.122 | 0.167 |
| LLM4WM(s) | 0.108 | 6.060 | 0.106 | 6.057 | 0.114 | 6.061 | LLM4WM(s) | 0.878 | 8.530 | LLM4WM(s) | 0.153 | 0.052 | 0.109 |
| LLM4WM | 0.103 | 6.069 | 0.106 | 6.068 | 0.100 | 6.081 | LLM4WM | 0.904 | 8.557 | LLM4WM | 0.087 | 0.028 | 0.087 |
Analysis:
- Overall Dominance of LLM4WM: LLM4WM consistently achieves the best performance across almost all tasks and metrics, and attains the lowest Avg.↓ score of 0.087. This demonstrates its robust multi-task joint modeling and transfer learning capabilities.
- Superiority over Traditional and Small Models: Traditional methods (BI, Codebook, FIFS) and many small models (CNN, LSTM, MLP) perform significantly worse than LLM4WM, highlighting the limitations of non-learning or capacity-constrained approaches. For instance, BI and LSTM show very high NMSE on the CE, CP, and PF tasks, and Codebook has very low Acc on BF.
- Advantage over Single-Task LLM Adaptation (LLM4CP, LLM4WM(s)):
  - LLM4WM outperforms LLM4CP in most tasks (e.g., lower NMSE for CE and PF; higher Acc for BF; lower MAE for DE; lower NMSE for PE). This indicates the benefit of LLM4WM's multi-task design over single-task LLM fine-tuning.
  - When comparing LLM4WM to its single-task variant LLM4WM(s), LLM4WM still performs better across most metrics (e.g., lower NMSE for CE and PF; higher Acc for BF; lower MAE for DE; lower NMSE for PE). This clearly validates the effectiveness of the multi-task learning aspect of the proposed framework. The authors note that single-task fine-tuning is prone to overfitting, which LLM4WM addresses through its multi-task approach.
- Impact of Multi-Task Learning (Small vs. Large Models): The paper highlights a crucial distinction in the benefits of multi-task learning for small versus large models.
  - Small models (Cross-stitch vs. Cross-stitch(s)): Cross-stitch (multi-task) shows only a marginal improvement over Cross-stitch(s) (single-task), with Avg. scores of 0.134 vs. 0.140. This suggests small models struggle to effectively extract generalized representations due to conflicting task knowledge.
  - Large models (LLM4WM vs. LLM4WM(s)): LLM4WM (multi-task) achieves a much larger improvement over LLM4WM(s) (single-task), with Avg. scores of 0.087 vs. 0.109. This finding strongly supports the hypothesis that large models possess a greater capacity to learn and utilize joint representations across diverse tasks.
- Spectral Efficiency (SE) Improvement: LLM4WM achieves the highest SE values across the CE, CP, PF, and BF tasks, reaching up to 6.081 for PF and 8.557 for BF (see Table IV). This directly translates to improved communication system performance.

The following figure (Figure 5 from the original paper) illustrates the performance comparison of large and small models before and after wireless multi-task learning:
Figure 5 (description): a hexagonal radar chart comparing large models (LM) and small models (SM) under multi-task learning (MTL) and single-task learning (STL), with one axis per evaluated task (CE, CP, PF, BF, DE, PE).
This radar chart visually reinforces the findings from Table IV regarding the multi-task learning impact. It shows that Large Models (LM) exhibit a more significant improvement when transitioning from Single-Task Learning (STL) to Multi-Task Learning (MTL) compared to Small Models (SM). The LM-MTL polygon (blue) covers a larger area than LM-STL (green), while the difference between SM-MTL (orange) and SM-STL (red) is less pronounced. This confirms that LLMs are better suited for multi-task learning in this context.
The following figure (Figure 6 from the original paper) shows the Pearson correlation coefficient heatmap of expert combination weights for various tasks:
Analysis of Expert Combination Weights:
This heatmap visualizes the Pearson correlation coefficients between the expert combination weights (i.e., the gating weights defined in Eq. (24) of the paper) learned by the gating network in two randomly selected MoE-LoRA layers.

- Distinct Expert Allocation: The generally low correlation between the expert weights of most task pairs (lighter cells, smaller coefficients) suggests that the gating network learns to activate distinct expert combinations for different task types. This is a key mechanism of MoE, enabling task-specific specialization without training entirely separate models.
- Task Similarity Reflected: Tasks with similar characteristics (e.g., those within the Channel Reconstruction class or the related Radio Environment Mining tasks) tend to exhibit higher correlations (darker cells), indicating that their optimal expert combinations share common experts and thus allow efficient knowledge sharing. This supports the design principle of MoE-LoRA: balancing shared knowledge with task-specific differentiation.
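As an illustration of this analysis, the sketch below computes a Pearson correlation matrix over per-task expert-combination weights. The Dirichlet-sampled weights are synthetic placeholders for the gating outputs of Eq. (24); only the task list and the choice of 8 experts follow the paper's setup.

```python
import numpy as np

tasks = ["CE", "CP", "PF", "BF", "DE", "PE"]
num_experts = 8

rng = np.random.default_rng(0)
# Placeholder for the per-task expert weights actually produced by the gating network.
task_weights = {t: rng.dirichlet(np.ones(num_experts)) for t in tasks}

W = np.stack([task_weights[t] for t in tasks])   # (num_tasks, num_experts)
corr = np.corrcoef(W)                            # pairwise Pearson correlation of rows

for i, name in enumerate(tasks):
    row = " ".join(f"{corr[i, j]:+.2f}" for j in range(len(tasks)))
    print(f"{name}: {row}")
```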
6.2. Generalization Experiments
Generalization is crucial for real-world deployment. The paper tests LLM4WM's generalization in two scenarios:

- Scenario Transfer: training on the UMa (Urban Macrocell) dataset and testing on the RMa (Rural Macrocell) dataset, using only 10% of the RMa data for transfer.
- Frequency Transfer: training on the 1.9 GHz sub-6G link dataset and testing on a 2.4 GHz sub-6G link dataset.

The following are the results from Table V of the original paper:

| Train Set | Test Set | Method | CE NMSE ↓ | CP NMSE ↓ | PF NMSE ↓ | Method | BF Acc ↑ | Method | DE MAE ↓ | PE NMSE ↓ | Avg. ↓ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| UMa 1.9 GHz | RMa 1.9 GHz | LLM4WM | 0.143 | 0.145 | 0.162 | LLM4WM | 0.413 | LLM4WM | 0.336 | 0.285 | 0.276 |
| | | LLM4CP | 0.177 | 0.133 | 0.292 | LLM4CP | 0.306 | LLM4CP | 0.370 | 0.311 | 0.330 |
| | | CNN | 0.187 | 0.137 | 0.384 | CNN | 0.215 | WiT | 0.339 | 0.220 | 0.376 |
| | | LSTM | 1.000 | 0.309 | 0.545 | MLP | 0.365 | MLP | 0.539 | 0.473 | 0.584 |
| UMa 1.9 GHz | UMa 2.4 GHz | LLM4WM | 0.101 | 0.110 | 0.135 | LLM4WM | 0.785 | LLM4WM | 0.126 | 0.047 | 0.122 |
| | | LLM4CP | 0.110 | 0.113 | 0.196 | LLM4CP | 0.685 | LLM4CP | 0.182 | 0.073 | 0.165 |
| | | CNN | 0.115 | 0.121 | 0.381 | CNN | 0.375 | WiT | 0.143 | 0.047 | 0.239 |
| | | LSTM | 1.000 | 0.174 | 0.340 | MLP | 0.769 | MLP | 0.256 | 0.134 | 0.356 |
Analysis:

- Robust Generalization of LLM4WM: In both scenario transfer (UMa to RMa) and frequency transfer (1.9 GHz to 2.4 GHz), LLM4WM achieves the best overall Avg. ↓ metric (0.276 and 0.122, respectively), indicating superior generalization compared with the other methods.
- Performance in Complex Tasks: LLM4WM particularly excels in complex tasks such as channel estimation (CE) and channel prediction (CP, PF), which require understanding multi-dimensional features. For instance, in the UMa-to-RMa transfer, LLM4WM has significantly lower NMSE for CE and PF than LLM4CP and the small models, confirming the value of large models in handling dynamic real-world communication scenarios.
- Small Models Struggle with Generalization: Small models such as LSTM and CNN generally show much poorer generalization; LSTM in particular has an NMSE of 1.000 for CE in both transfer scenarios, indicating complete failure on that task under new conditions, and MLP also performs poorly.
- Radio Environment Mining Anomaly: The paper notes a slight dip in radio environment mining performance for LLM4WM in some generalization cases, where smaller models such as WiT can be comparable or slightly better on specific metrics (e.g., WiT matches LLM4WM's NMSE for PE on the 2.4 GHz test set). This is attributed to the relative simplicity of these tasks in the LOS (Line-of-Sight) scenario, where simpler models can suffice; the overall Avg. metric still strongly favors LLM4WM.
- Multi-Task Generalization: LLM4WM's ability to generalize well across multiple tasks simultaneously in new environments and frequencies is a key strength, reducing the need for frequent retraining or task-specific models in dynamic wireless settings.
6.3. Hyper-parameter Analysis
The paper investigates the impact of LoRA rank and the number of experts on LLM4WM's performance.
The following figure (Figure 7 from the original paper) shows the performance of LLM4WM under different LoRA ranks and numbers of experts:

Analysis:

- Effect of LoRA Rank: As the LoRA rank increases (from 2 to 10 in the left plot), the performance of LLM4WM (measured by average loss) gradually improves. This is expected because a higher rank introduces more trainable parameters, allowing the model to adapt more finely to the data distribution. However, increasing the rank also raises the training overhead. The authors found that a LoRA rank of 8 provides the best balance between performance gain and computational efficiency, with diminishing returns beyond that point.
- Effect of Number of Experts: With the LoRA rank fixed at 8, increasing the number of experts (from 2 to 10 in the right plot) also generally reduces the average loss. A larger number of experts enhances the model's analytical and representational capacity, allowing more specialized learning paths for different tasks or input types, but training and inference costs grow roughly linearly with the number of experts. The authors concluded that 8 experts is the most appropriate choice.

This analysis validates the selected hyperparameter settings as the result of a careful trade-off between model accuracy and practical training/inference costs.
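To make the chosen configuration concrete, here is a minimal, hypothetical PyTorch sketch of a single MoE-LoRA-adapted linear layer with the paper's settings (rank 8, 8 experts). It assumes a standard low-rank update B·A mixed by a softmax gating network; it illustrates the general technique rather than the authors' implementation, and names such as MoELoRALinear are made up.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELoRALinear(nn.Module):
    """Frozen base linear layer plus several low-rank expert updates
    mixed by a learned gating network (rank 8, 8 experts)."""

    def __init__(self, base: nn.Linear, rank: int = 8, num_experts: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():       # backbone weights stay frozen
            p.requires_grad_(False)
        d_in, d_out = base.in_features, base.out_features
        self.lora_A = nn.Parameter(torch.randn(num_experts, rank, d_in) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(num_experts, d_out, rank))
        self.gate = nn.Linear(d_in, num_experts)   # per-token expert combination weights
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, d_in)
        weights = F.softmax(self.gate(x), dim=-1)              # (B, T, E)
        low = torch.einsum("erd,btd->bter", self.lora_A, x)    # per-expert A_e @ x
        up = torch.einsum("eor,bter->bteo", self.lora_B, low)  # per-expert B_e @ (A_e @ x)
        delta = torch.einsum("bte,bteo->bto", weights, up)     # mix experts with gating weights
        return self.base(x) + self.scaling * delta

# Hypothetical usage: wrap one projection of a frozen GPT-2-sized block.
layer = MoELoRALinear(nn.Linear(768, 768))
y = layer(torch.randn(2, 16, 768))
print(y.shape)  # torch.Size([2, 16, 768])
```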
6.4. Ablation Experiments
Ablation studies were conducted to determine the effectiveness and contribution of individual modules within the LLM4WM framework.
The following are the results from Table VI of the original paper:
| Metric | LLM4WM | w/o Adapter_in | w/o Adapter_out | w/o Adapter | w/o LLM | Frozen LLM |
| --- | --- | --- | --- | --- | --- | --- |
| Average Loss | 0.087 | 0.092 | 0.095 | 0.102 | 0.117 | 0.092 |
| Loss Increase Ratio | 0.00% | 6.50% | 9.54% | 17.62% | 34.40% | 6.15% |
Analysis:

- Effectiveness of Multi-Task Adapters:
  - w/o Adapter_in: removing the input adapter increases the average loss by 6.50% (from 0.087 to 0.092).
  - w/o Adapter_out: removing the output adapter increases the average loss by 9.54% (from 0.087 to 0.095).
  - w/o Adapter: removing both adapters causes the largest drop among the adapter ablations, a 17.62% increase in average loss (from 0.087 to 0.102).
  - These results demonstrate the critical role of the multi-task adapter modules in aligning the feature spaces of wireless data and the LLM's semantic space. Both input and output adapters are necessary, with the output adapter appearing slightly more impactful in this setup.
- Critical Role of the Backbone LLM:
  - w/o LLM: removing the LLM backbone entirely causes the most significant degradation, a 34.40% increase in average loss (from 0.087 to 0.117). This highlights that the backbone LLM is the critical engine for multi-task joint learning on wireless tasks, confirming that the large model's general knowledge and capacity are essential.
  - Frozen LLM: freezing the LLM parameters entirely (i.e., skipping MoE-LoRA fine-tuning) yields a 6.15% increase in average loss (from 0.087 to 0.092). The pre-trained LLM thus provides a strong foundation, but MoE-LoRA fine-tuning is needed to adapt its general knowledge to the specific nuances of wireless tasks and reach optimal performance.

In summary, all modules (input adapter, output adapter, and the fine-tuned LLM backbone) contribute significantly to LLM4WM's performance, validating the integrated design of the framework.
6.5. Efficiency Evaluation
The paper assesses the training and inference costs of LLM4WM compared to baselines to evaluate its practical deployability.
The following are the results from Table VII of the original paper:
| Metric | MLP | CNN | LSTM | WiT | LLM4CP | LLM4WM |
| --- | --- | --- | --- | --- | --- | --- |
| Trainable network parameters (M) | 1.29 | 2.14 | 1.17 | 19.19 | 1.80 | 1.13 |
| Total network parameters (M) | 1.29 | 2.14 | 1.17 | 19.19 | 82.91 | 88.71 |
| Inference time (ms) | 0.32 | 0.49 | 6.49 | 2.97 | 8.62 | 6.00 |
Analysis:

- Parameter Efficiency of MoE-LoRA: LLM4WM has 88.71 M total network parameters, which is large because of the GPT-2 backbone. However, only 1.13 M parameters are trainable, which is remarkably low: smaller than MLP (1.29 M), CNN (2.14 M), WiT (19.19 M), and LLM4CP (1.80 M), and comparable to LSTM (1.17 M). This is a direct result of the MoE-LoRA fine-tuning method, which freezes the LLM backbone and trains only small low-rank matrices and the adapters, significantly reducing training cost and making adaptation of a large model feasible. The paper emphasizes that adding a new task increases the model by only about 1.13 M parameters, a small fraction of the total, indicating high parameter efficiency.
- Acceptable Inference Time: LLM4WM's inference time is 6.00 ms. While higher than very small models such as MLP (0.32 ms) and CNN (0.49 ms), it is comparable to or better than other complex models such as LSTM (6.49 ms) and notably faster than LLM4CP (8.62 ms). This is achieved by using a lightweight GPT-2 version and by the parameter-efficient nature of MoE-LoRA: during inference, only the activated experts and the small LoRA matrices add computation on top of the frozen backbone.

The efficiency evaluation shows that LLM4WM addresses the challenge of deploying large models in practical wireless scenarios: its parameter efficiency keeps training manageable, and its inference speed is acceptable for real-time or near real-time communication systems. This suggests LLM4WM has significant potential for future wireless communication systems that require robust multi-tasking and customization.
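The trainable-versus-total distinction in Table VII can be checked with a generic parameter counter; in the toy example below, a frozen "backbone" plus a small trainable adapter stands in for the frozen GPT-2 with MoE-LoRA modules (all names are illustrative).

```python
import torch.nn as nn

def count_parameters(model: nn.Module) -> tuple[int, int]:
    """Return (trainable, total) parameter counts, the two quantities contrasted in Table VII."""
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return trainable, total

# Toy illustration: the frozen backbone dominates the total count,
# while only the small adapter contributes trainable parameters.
backbone = nn.Linear(768, 768)
for p in backbone.parameters():
    p.requires_grad_(False)
adapter = nn.Linear(768, 8)
model = nn.Sequential(backbone, adapter)

trainable, total = count_parameters(model)
print(f"trainable: {trainable / 1e6:.3f} M of {total / 1e6:.3f} M total")
```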
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper introduces LLM4WM, a novel multi-task fine-tuning framework specifically designed to adapt Large Language Models (LLMs) for various channel-associated tasks in wireless communication systems. By leveraging a diverse multi-task dataset, LLM4WM enables concurrent execution of tasks like channel estimation, prediction, beam management, and radio environment mining.
The core innovations enabling this are:

- MoE-LoRA Integration: The framework incorporates Mixture of Experts with Low-Rank Adaptation (MoE-LoRA) into the fine-tuning process. This lets the LLM backbone adapt dynamically by optimally combining specialized expert modules, facilitating the extraction of shared representations across tasks while improving task-specific performance and preserving parameter efficiency.
- Multi-Task Adapter for Feature Alignment: Custom-designed multi-task adapter modules at both the input and output layers harmonize the distinct feature spaces of wireless tasks with the LLM's general semantic embedding space, ensuring coherent task alignment and improved adaptability. Task-specific preprocessing modules and output layers further refine this alignment.

Extensive simulation results demonstrate that LLM4WM exhibits robust multi-task learning and generalization capabilities, outperforming existing methodologies in both full-sample and few-shot evaluations. Ablation studies confirm the critical contributions of each module (adapters, MoE-LoRA, and the LLM backbone) to overall system performance. The expert-weight heatmap further validates the efficacy of the MoE mechanism in adaptively allocating resources, enhancing model specialization and flexibility. The efficiency evaluation indicates practical deployability thanks to parameter-efficient fine-tuning and acceptable inference times.
7.2. Limitations & Future Work
The paper's conclusion presents what are still preliminary simulation results, implying that real-world deployment and validation remain future steps. Although the authors do not include an explicit "Limitations" section, several limitations can be inferred:

- Simulation-Based Evaluation: All experiments are conducted on simulated datasets (QuaDRiGa). Real-world wireless channels are notoriously complex and diverse, posing challenges that simulations might not fully capture (e.g., unexpected interference, hardware impairments, diverse environmental dynamics).
- Specific LLM Backbone: The paper uses a specific, smaller version of GPT-2. While MoE-LoRA makes it efficient, the choice of backbone and its pre-training data might inherently limit the kinds of knowledge transferable to wireless; the LLM has not "seen" anything analogous to wireless channel data during its language pre-training.
- Static Expert Allocation: While MoE allows dynamic selection of experts, the number of experts and their underlying structure are fixed. Adapting these dynamically could be a future enhancement.
- Interpretability: LLMs are often black-box models. Understanding why the model makes certain predictions, or how it internally represents wireless channel features, is challenging yet important for critical communication systems.
- Computational Overhead: Although MoE-LoRA significantly reduces trainable parameters and achieves acceptable inference speeds, the total parameter count of the LLM backbone remains substantial (88.71 M), which may still challenge deployment on the resource-constrained edge devices common in wireless networks.

The paper implicitly suggests future work by stating that LLM4WM "demonstrates significant potential for deployment in future communication scenarios marked by increasing demand and the need for customized services involving numerous tasks." This implies:

- Real-world Deployment and Validation: moving beyond simulations to evaluate LLM4WM in live communication networks.
- Larger Scale and Diversity of Tasks: exploring the framework's scalability to an even greater number and variety of wireless tasks.
- Adaptive LLM Architectures: exploring other LLM backbones or more advanced MoE designs that could further optimize performance or efficiency.
- On-Device Deployment: researching optimizations for deploying LLM4WM on edge devices with limited computational resources.
7.3. Personal Insights & Critique
This paper presents an exciting and highly relevant direction for integrating cutting-edge AI (LLMs) with fundamental wireless communication challenges. The idea of leveraging the vast pre-trained knowledge of LLMs for multi-task learning in wireless is conceptually very powerful.
Insights:

- Paradigm Shift for Wireless AI: The work represents a significant step towards "foundational models" in wireless. Instead of designing a new neural network from scratch for each wireless task, this approach fine-tunes a powerful general-purpose model, potentially accelerating research and development in the field.
- Implicit Feature Extraction: LLMs, despite being language models, learn complex patterns and relationships. The success of LLM4WM implies that the abstract numerical patterns in wireless channel data can be effectively mapped into and processed by these models, possibly by aligning wireless data to a "semantic space" where the LLM finds analogies to its learned representations; this is a fascinating cross-domain transfer.
- Efficiency of MoE-LoRA: The demonstrated parameter efficiency and acceptable inference times are crucial. Without these techniques, directly applying LLMs to wireless would likely be impractical, so MoE-LoRA makes the approach far more viable.
- Multi-Task Synergy: The compelling evidence that large models benefit significantly more from multi-task learning than small models is a key takeaway. It highlights that the sheer capacity of LLMs lets them learn truly generalized representations without the conflicting gradients often seen in smaller multi-task models.
Critique / Areas for Improvement:

- Interpretability for Wireless Engineers: While LLMs perform well, their lack of interpretability can be a hurdle in mission-critical wireless systems. For example, if the model predicts a specific beam, understanding why it chose that beam (e.g., which channel features it prioritized) would be invaluable for system validation and debugging. Future work could explore methods to make the LLM's internal "reasoning" about wireless tasks more transparent.
- Real-World vs. Simulation Gap: The reliance on simulated data, while standard, leaves a significant gap to bridge before practical deployment. Real-world channels have complexities (e.g., non-stationary statistics, hardware imperfections, measurement noise, diverse interference) that QuaDRiGa might not fully model; validation on measured channel data would be a strong next step.
- Computational Cost for Fine-tuning: Although the trainable parameter count is low, the initial pre-training of the LLM and loading the full model still require significant computational resources, which may limit who can replicate or extend such models and favor well-resourced institutions.
- Cold Start Problem for New Tasks: While MoE-LoRA handles new tasks efficiently, designing the task-specific preprocessing modules, multi-task adapters, and output layers for each new type of wireless task still requires domain expertise and effort.
- Data Representation Nuances: The paper details how wireless data is preprocessed for the LLM (e.g., flattening, DFT). A deeper analysis of how these representations align with the LLM's internal token embeddings (e.g., what a "token" actually represents in the wireless domain) could offer more theoretical understanding.

Overall, LLM4WM offers a promising new direction for enhancing wireless communication systems with the power of LLMs, demonstrating the feasibility and benefits of transfer learning and multi-task learning in this domain. Its robust performance and efficiency indicate that foundational models for wireless are becoming a tangible reality.