ModRWKV: Transformer Multimodality in Linear Time
TL;DR Summary
This study introduces ModRWKV, a framework based on the RWKV architecture that achieves multimodal processing with linear time complexity, offering a competitive alternative to traditional quadratic-complexity Transformer models. It balances performance and computational efficiency for multi-source information fusion.
Abstract
Currently, most multimodal studies are based on large language models (LLMs) with quadratic-complexity Transformer architectures. While linear models like RNNs enjoy low inference costs, their application has been largely limited to the text-only modality. This work explores the capabilities of modern RNN architectures in multimodal contexts. We propose ModRWKV—a decoupled multimodal framework built upon the RWKV7 architecture as its LLM backbone—which achieves multi-source information fusion through dynamically adaptable heterogeneous modality encoders. We designed the multimodal modules in ModRWKV with an extremely lightweight architecture and, through extensive experiments, identified a configuration that achieves an optimal balance between performance and computational efficiency. ModRWKV leverages the pretrained weights of the RWKV7 LLM for initialization, which significantly accelerates multimodal training. Comparative experiments with different pretrained checkpoints further demonstrate that such initialization plays a crucial role in enhancing the model’s ability to understand multimodal signals. Supported by extensive experiments, we conclude that modern RNN architectures present a viable alternative to Transformers in the domain of multimodal large language models (MLLMs). Furthermore, we identify the optimal configuration of the ModRWKV architecture through systematic exploration.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
ModRWKV: Transformer Multimodality in Linear Time
The title clearly states the paper's core contribution: achieving multimodal capabilities, which are typically associated with Transformer models, using an architecture that operates in linear time complexity. This points to a focus on computational efficiency.
1.2. Authors
Jiale Kang, Ziyin Yue, Qingyu Yin, Jiang Rui, Weile Li, Zening Lu, and Zhouran Ji.
The authors are affiliated with Yuanishi Inc, Zhejiang University, and The Hong Kong University of Science and Technology. This mix of industry and academic researchers suggests the work has both practical application and theoretical grounding.
1.3. Journal/Conference
The paper is slated for publication at the EMNLP 2025 main conference, as indicated by the ACL Anthology link (aclanthology.org/2025.emnlp-main.204/). EMNLP (Conference on Empirical Methods in Natural Language Processing) is a top-tier, highly competitive, and influential conference in the field of natural language processing. Publication at EMNLP signifies a high level of peer-reviewed validation and impact.
1.4. Publication Year
2025, with the specific publication date given as October 31, 2025.
1.5. Abstract
The abstract introduces ModRWKV, a multimodal framework built upon the RWKV architecture, a modern Recurrent Neural Network (RNN). Current multimodal models are predominantly based on Transformers, which have quadratic computational complexity. In contrast, linear models like RNNs are more efficient but have been mostly limited to text. ModRWKV aims to bridge this gap by using a decoupled design with dynamically adaptable encoders for different modalities (e.g., vision, audio). The framework uses an extremely lightweight multimodal module and leverages pretrained weights from the RWKV7 language model to speed up training and enhance understanding. Through extensive experiments, the paper concludes that modern RNNs are a viable alternative to Transformers for Multimodal Large Language Models (MLLMs) and identifies an optimal architectural configuration for ModRWKV.
1.6. Original Source Link
- Official Source: https://aclanthology.org/2025.emnlp-main.204/
- Code Repository: https://github.com/JL-er/WordRWKV

The paper is scheduled for official publication.
2. Executive Summary
2.1. Background & Motivation
- Core Problem: The dominant architecture for Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) is the Transformer. While powerful, its core self-attention mechanism has a computational and memory complexity that scales quadratically, O(n^2), with the input sequence length n. This makes processing long sequences—common in high-resolution images, audio, and video—prohibitively expensive.
- Identified Gap: In contrast, architectures like Recurrent Neural Networks (RNNs) have linear time complexity, O(n), for inference, making them highly efficient for long sequences. However, their application has been largely confined to text-only domains, and they have historically struggled with long-range dependencies and parallelizable training. Modern RNNs, such as RWKV, have overcome many of these limitations, but their potential in multimodal contexts remains largely unexplored.
- Innovative Idea: The paper's entry point is to investigate whether a modern, efficient RNN architecture (RWKV7) can serve as the backbone for a high-performing MLLM, offering a computationally cheaper alternative to Transformer-based systems without a significant performance trade-off. The core idea is to build a unified framework, ModRWKV, that combines the RWKV7 LLM with lightweight, plug-and-play encoders for various modalities.
2.2. Main Contributions / Findings
The paper presents three main contributions:
- A Novel RNN-based Multimodal Framework: It proposes ModRWKV, which the authors claim is the first unified multimodal training paradigm based on an RNN architecture. Its "plug-and-play" design for modality encoders enhances scalability and allows the model to easily switch between processing different data types like images, audio, and time series.
- Comprehensive Benchmarking: The paper conducts a systematic and wide-ranging evaluation of ModRWKV's capabilities across vision, audio, and time-series tasks. This establishes a strong benchmark for assessing the cross-modal performance of RNN-based architectures, which has been lacking in the field.
- Optimal Design Identification: Through extensive ablation studies, the authors identify the most effective components and configurations for ModRWKV. This includes finding the best-performing vision encoders, sequence compression techniques, and adapter designs that strike an optimal balance between performance and computational efficiency.

Key Finding: The central conclusion is that modern RNN architectures, exemplified by RWKV, are a viable and competitive alternative to Transformers for building MLLMs, offering significant advantages in inference efficiency and scalability with long sequences (a rough scaling illustration follows).
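To make the efficiency argument concrete, here is a back-of-the-envelope comparison of per-layer compute growth; it is not taken from the paper, and the hidden size d = 2048 is an assumption for illustration only.

```python
# Rough per-layer FLOP scaling (illustrative only, not figures from the paper).
# Self-attention grows roughly as n^2 * d with sequence length n, while an
# RNN/RWKV-style recurrence grows roughly as n * d^2, i.e., linearly in n.
d = 2048                                   # assumed hidden size
for n in (1_000, 8_000, 64_000):           # sequence lengths in tokens
    attention = n * n * d                  # quadratic in n
    recurrence = n * d * d                 # linear in n
    print(f"n={n:>6}: attention ~{attention:.1e} FLOPs, recurrence ~{recurrence:.1e} FLOPs")
```

At 64k tokens the attention term dominates by more than an order of magnitude under these assumptions, which qualitatively mirrors the throughput collapse the paper reports for the Transformer baseline in Table 10.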
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
- Recurrent Neural Networks (RNNs): An RNN is a type of neural network designed for sequential data. It processes inputs one step at a time, maintaining an internal hidden state that captures information from previous steps. This "memory" allows it to handle tasks where context is important, like language. However, simple RNNs suffer from the vanishing/exploding gradient problem, making it difficult for them to learn long-range dependencies.
- Transformer Architecture: The Transformer, introduced in "Attention Is All You Need," revolutionized sequence modeling. Instead of recurrence, it relies on a self-attention mechanism. This mechanism allows every token in the input sequence to directly attend to every other token, calculating a weighted sum of their values based on their compatibility. This enables the model to capture complex, long-range dependencies effectively and allows for high parallelization during training. However, this all-to-all comparison is what leads to its quadratic complexity.
- Linear Complexity Models: These are models designed to have a computational cost that grows linearly with sequence length, i.e., O(n). This is in contrast to the standard Transformer's O(n^2) complexity. Examples include linear attention variants, State Space Models (SSMs) like Mamba, and RNNs like RWKV. They are particularly advantageous for very long sequences.
- Multimodal Large Language Models (MLLMs): MLLMs are models that can process and understand information from multiple modalities, not just text. A typical MLLM architecture consists of:
  - A Modality Encoder: A specialized model that converts non-text data (e.g., an image) into a sequence of numerical representations (embeddings). Examples include CLIP for images or WavLM for audio.
  - An LLM Backbone: A pretrained text-based LLM (e.g., Llama, GPT) that performs reasoning.
  - An Adapter/Projector: A small neural network (often a simple MLP) that maps the embeddings from the modality encoder into the same vector space as the LLM's text embeddings. This allows the LLM to "read" the non-text information.
3.2. Previous Works
- RWKV (Receptance Weighted Key Value): This is the core architecture used in the paper. RWKV is a novel RNN that combines the parallelizable training of Transformers with the efficient inference of RNNs. It reformulates self-attention into a linear, recurrent form. Its key components are:
  - Time-mixing Block: An RNN-based block that acts as a replacement for the Transformer's self-attention layer. It updates its state based on the current input and the state from the previous time step.
  - Channel-mixing Block: A simple feed-forward network that acts as a replacement for the Transformer's feed-forward layer, processing information within each time step.
  The RWKV7 version used in this paper introduces enhancements like a generalized delta rule, vector-valued gating, and in-context learning rates to improve its expressiveness, as detailed in the methodology section.
- LLaVA (Large Language and Vision Assistant): A pioneering work in building open-source MLLMs. LLaVA's architecture is simple yet effective: it connects a pretrained vision encoder (CLIP) to a pretrained LLM (Vicuna) using a simple MLP projector. It introduced a two-stage training strategy that ModRWKV also adopts (a minimal freezing sketch follows this item):
  - Feature Alignment (Phase I): Freeze the vision encoder and LLM, and train only the MLP projector on image-caption pairs. This quickly aligns the visual features with the LLM's word embedding space.
  - Instruction Tuning (Phase II): Unfreeze the projector and the LLM (or parts of it), and fine-tune on a more complex dataset of multimodal instructions (e.g., visual question answering).
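As a rough illustration of this two-phase recipe, the freezing scheme can be expressed by toggling `requires_grad`. The modules below are placeholders, not the actual LLaVA or ModRWKV code.

```python
import torch.nn as nn

def set_trainable(module: nn.Module, flag: bool) -> None:
    """Enable or disable gradient updates for every parameter of a module."""
    for p in module.parameters():
        p.requires_grad = flag

# Placeholder modules standing in for the vision encoder, MLP projector, and LLM backbone.
vision_encoder = nn.Linear(768, 1024)
projector = nn.Linear(1024, 2048)
llm_backbone = nn.Linear(2048, 2048)

# Phase I (feature alignment): train only the projector on image-caption pairs.
set_trainable(vision_encoder, False)
set_trainable(llm_backbone, False)
set_trainable(projector, True)

# Phase II (instruction tuning): unfreeze the LLM and fine-tune on multimodal instruction data.
set_trainable(llm_backbone, True)
```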
- State Space Models (SSMs) & Mamba: SSMs are another class of linear-time models inspired by classical state space systems in control theory. Mamba is a recent, highly successful SSM that uses a selection mechanism to allow its state to be context-aware, selectively remembering or forgetting information. The paper compares ModRWKV to VL-Mamba, a multimodal version of Mamba, positioning RWKV as a competitive RNN-based alternative to SSMs.
- Modality Encoders: The paper leverages several established encoders:
  - Vision: CLIP (Contrastive Language-Image Pre-training) and SigLIP2 (Sigmoid Loss for Language Image Pre-training) are models that learn to associate images with text. They are powerful visual feature extractors.
  - Audio: WavLM and Whisper are large models pretrained on vast amounts of audio data, making them excellent for extracting features from raw speech for tasks like speech recognition.
  - Time Series: WaveNet uses causal dilated convolutions to capture temporal dependencies, while Timer is a Transformer-based model for time series forecasting.
3.3. Technological Evolution
The field of large-scale modeling has evolved from text-only Transformers (e.g., GPT-3) to Transformer-based MLLMs (e.g., GPT-4V, LLaVA), which extended these models' capabilities to vision and other modalities. The major bottleneck of this paradigm has been the quadratic complexity of attention. This has spurred research into more efficient, linear-time architectures. This research has largely split into two camps: non-recurrent models like Mamba (SSMs) and modern recurrent models like RWKV. This paper firmly places itself in the latter camp, pushing the boundaries of what RNNs can do by systematically adapting them to the complex, multi-domain challenges of multimodality.
3.4. Differentiation Analysis
Compared to mainstream MLLMs like LLaVA, the core innovation of ModRWKV is its backbone architecture.
- LLaVA (and similar models): Uses a Transformer backbone. Inference is slow for long sequences, and the key-value cache grows linearly with context length.
- ModRWKV: Uses an RWKV (RNN) backbone. Inference is extremely fast and has constant memory usage per new token, as it only needs to maintain a fixed-size state. Prefill computation scales linearly with sequence length, a major advantage over the Transformer's quadratic scaling.

Compared to other linear-time MLLMs like VL-Mamba:
- VL-Mamba: Based on a State Space Model.
- ModRWKV: Based on a Recurrent Neural Network. This paper provides a direct comparison point between these two distinct approaches to linear-time sequence modeling in the multimodal domain.

ModRWKV's design is intentionally simple (lightweight adapters) to rigorously test the inherent multimodal reasoning capability of the RWKV architecture itself.
4. Methodology
4.1. Principles
The core principle of ModRWKV is to create a computationally efficient yet powerful MLLM by replacing the standard Transformer backbone with a modern RNN, RWKV7. The design is decoupled and modular: a shared RWKV7 language model serves as the central reasoning engine, while specialized, "plug-and-play" modules handle the initial processing of different modalities. This allows the framework to adapt to new data types by simply adding a new encoder, without altering the core LLM. The methodology intentionally uses a lightweight adapter to force the RWKV7 backbone to perform the complex cross-modal fusion and reasoning, thereby providing a strong test of the RNN's capabilities.
4.2. Core Methodology In-depth
The overall architecture of ModRWKV is shown in Figure 1. A multimodal input (e.g., an image) and a text prompt are processed through parallel streams before being fused and fed into the RWKV7 LLM.
Figure 1: A schematic of the multimodal data processing pipeline in the ModRWKV framework. Multimodal inputs are encoded via a 1D convolution and a Mod Encoder, combined with text embeddings for information fusion, and then fed into the RWKV module. The figure depicts the components for tasks such as visual understanding and audio recognition.
The data flow is as follows:
- Modality Encoding: Non-text data is first processed by a modality-specific encoder (e.g., SigLIP2 for images) to extract a sequence of high-level feature vectors.
- Sequence Compression: The resulting feature sequence, which can be very long, is passed through a 1D convolutional layer to reduce its length, making subsequent processing more efficient.
- Dimension Alignment: The compressed feature sequence is then fed into a lightweight Adapter (an MLP) that projects the features into the same embedding dimension as the RWKV7 model's vocabulary embeddings.
- Text Encoding: Concurrently, the input text prompt is converted into a sequence of embeddings using the RWKV7 model's standard text embedding layer.
- Concatenation & Fusion: The projected multimodal features and the text embeddings are concatenated into a single sequence. This combined sequence is then fed into the RWKV7 backbone, which processes the information autoregressively to generate a response (a shape-level sketch follows).
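The sketch below traces these shapes end to end. The module names, dimensions (enc_dim, llm_dim, vocabulary size), and the (3, 2) convolution setting are illustrative assumptions, not the repository's actual API.

```python
import torch
import torch.nn as nn

enc_dim, llm_dim, vocab = 1024, 2048, 65536                # assumed dimensions

conv_compress = nn.Conv1d(enc_dim, enc_dim, kernel_size=3, stride=2)  # sequence compression
adapter = nn.Linear(enc_dim, llm_dim)                      # lightweight adapter (detailed in Sec. 4.2.3)
text_embed = nn.Embedding(vocab, llm_dim)                  # RWKV7 text embedding table

image_feats = torch.randn(1, 577, enc_dim)                 # vision-encoder output: (batch, tokens, dim)
compressed = conv_compress(image_feats.transpose(1, 2)).transpose(1, 2)  # (1, 288, enc_dim)
visual_embeds = adapter(compressed)                        # (1, 288, llm_dim)

prompt_ids = torch.randint(0, vocab, (1, 32))              # tokenized text prompt
prompt_embeds = text_embed(prompt_ids)                     # (1, 32, llm_dim)

fused = torch.cat([visual_embeds, prompt_embeds], dim=1)   # sequence fed to the RWKV7 backbone
print(fused.shape)                                         # torch.Size([1, 320, 2048])
```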
4.2.1. RWKV7 Backbone
RWKV7 is a modern RNN that serves as the LLM backbone. The paper provides the state update equations that define its core "time-mixing" block. While a simple RNN can be written as:
$
h_t = Wh_{t - 1} + Ux_t \quad (1)
$
RWKV7 uses a much more expressive formulation to update its state at each time step $t$. The state update is given by:
$
s_t = G_{t}s_{t - 1} + a_{t}k_{t}v_{t}^{T} \quad (3)
$
Where:
- $s_{t-1}$ is the state from the previous time step.
- $k_t$ and $v_t$ are the key and value vectors, which are linearly projected from the input token $x_t$, similar to a Transformer.
- $a_t$ is the in-context learning rate, a vector that controls how much of the new information ($k_t v_t^T$) should be added to the state. It is dynamically computed from the input $x_t$, which makes the update content-dependent.
- $G_t$ is the dynamic transition matrix, which controls how the previous state decays and transforms; it is likewise parameterized by the input.
- A vector-valued gating parameter, also computed from the input, enters $G_t$ and allows different channels in the hidden state to decay at different rates, making the model's memory highly adaptive.
- The term $a_t k_t v_t^T$ represents a "relaxed value replacement rule."
- Note on notation: $k_t v_t^T$ is an outer product, which results in a matrix, so the state $s_t$ is matrix-valued rather than a vector as in the simple RNN of Eq. (1). While unconventional relative to classic RNNs, this analysis adheres strictly to the formula presented in the paper; the formulation allows complex interactions between key and value dimensions to be incorporated into the state matrix.
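The recurrence can be pictured with the schematic NumPy sketch below. It keeps the matrix-valued state and the content-dependent learning rate, but simplifies the transition $G_t$ to a per-channel diagonal decay; the projection names (W_k, W_v, W_a, W_g) are hypothetical, and the actual RWKV7 parameterization (generalized delta rule) is richer.

```python
import numpy as np

def rwkv7_like_state_update(s_prev, x_t, params):
    """One step of a simplified RWKV7-style state update (schematic, not the real kernel).

    s_prev : (d, d) matrix-valued state from the previous step
    x_t    : (d,) current input token embedding
    params : dict of hypothetical projection matrices
    """
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    k_t = params["W_k"] @ x_t                      # key projection, (d,)
    v_t = params["W_v"] @ x_t                      # value projection, (d,)
    a_t = sigmoid(params["W_a"] @ x_t)             # in-context learning rate, (d,)
    w_t = sigmoid(params["W_g"] @ x_t)             # vector-valued decay/gate, (d,)
    G_t = np.diag(w_t)                             # simplified dynamic transition matrix
    # Decay the old state and add a rank-1 outer-product update (cf. Eq. 3).
    return G_t @ s_prev + np.outer(a_t * k_t, v_t)

d = 8
rng = np.random.default_rng(0)
params = {name: rng.standard_normal((d, d)) * 0.1 for name in ("W_k", "W_v", "W_a", "W_g")}
s = np.zeros((d, d))
for _ in range(5):                                  # process a few random "tokens"
    s = rwkv7_like_state_update(s, rng.standard_normal(d), params)
print(s.shape)                                      # (8, 8): fixed-size state, independent of sequence length
```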
4.2.2. Multimodal Encoders
ModRWKV is designed to be agnostic to the choice of encoder. The paper experiments with several options for different modalities:
- Vision Encoder:
CLIPandSigLIP2are tested. These encoders take a raw image and output a sequence of patch embeddings. - Audio Encoder:
WavLMandWhisperare used. They process raw audio sampled at 16,000 Hz and generate feature vectors. - Time Series Encoder:
WaveNetandTimerare used to transform raw time-series data into feature embeddings.
4.2.3. Adapter Design
To connect the encoders to the LLM, the paper uses a very simple adapter consisting of a single MLP block with a ReLU activation function. This forces the RWKV7 backbone to handle the bulk of the cross-modal reasoning.
$
h = \mathrm{Linear}_2(\mathrm{ReLU}(\mathrm{Linear}_1(\boldsymbol{x}))) \quad (4)
$
Where:
- $\boldsymbol{x}$ is the input feature sequence from the encoder.
- $\mathrm{Linear}_1$ projects the features to a hidden dimension (e.g., 4x the input dimension).
- $\mathrm{Linear}_2$ projects the hidden representation to the final dimension required by the RWKV model.
- $h$ is the final sequence of embeddings, ready to be concatenated with the text embeddings.
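A minimal sketch of Eq. (4) as a module, assuming an encoder feature dimension of 1024, an LLM embedding dimension of 2048, and the 4x hidden expansion favored by the paper's ablations; the class and argument names are illustrative.

```python
import torch
import torch.nn as nn

class ModalityAdapter(nn.Module):
    """Two-layer MLP with ReLU, mapping encoder features into the LLM embedding space (Eq. 4)."""
    def __init__(self, enc_dim: int, llm_dim: int, scale: int = 4):
        super().__init__()
        self.linear1 = nn.Linear(enc_dim, enc_dim * scale)   # expand to the hidden dimension
        self.act = nn.ReLU()
        self.linear2 = nn.Linear(enc_dim * scale, llm_dim)   # project to the RWKV embedding size

    def forward(self, x: torch.Tensor) -> torch.Tensor:      # x: (batch, seq_len, enc_dim)
        return self.linear2(self.act(self.linear1(x)))       # (batch, seq_len, llm_dim)

adapter = ModalityAdapter(enc_dim=1024, llm_dim=2048)
print(adapter(torch.randn(1, 288, 1024)).shape)              # torch.Size([1, 288, 2048])
```

Keeping this module deliberately small pushes the cross-modal fusion work onto the RWKV7 backbone, which is exactly what the paper sets out to test.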
4.2.4. Sequence Compression
To manage the computational cost of long sequences from high-resolution modalities, ModRWKV employs a 1D convolution layer. This layer acts as a downsampler, reducing the sequence length while preserving important features. The computation for the $c$-th output channel is:
$
y_c = \sum_{i = 1}^{C_{\mathrm{in}}}\left(\sum_{j = 0}^{k - 1}W_{c,i,j}\cdot {\pmb{x}}_{i,s \cdot t + j}\right) + b_c \quad (5)
$
Where:
- $y$ is the output sequence, with $y_c$ being its $c$-th channel.
- $x$ is the input sequence with $C_{\mathrm{in}}$ channels and length $L$.
- $W$ is the convolutional kernel of size $k$.
- $s$ is the stride, which determines the downsampling factor.
- $b_c$ is a bias term.
- The output length is reduced according to $L_{\mathrm{out}} = \lfloor (L + 2p - k)/s \rfloor + 1$, where $p$ is the padding.
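The sketch below shows the layer in isolation, using the (k, s) = (3, 2) setting from the paper's ablation and an assumed channel count of 1024; the resulting length matches the output-length formula above.

```python
import torch
import torch.nn as nn

tokens = torch.randn(1, 1024, 577)   # (batch, channels, seq_len): 577 visual tokens, assumed 1024 channels
compress = nn.Conv1d(in_channels=1024, out_channels=1024, kernel_size=3, stride=2)  # no padding
out = compress(tokens)
# floor((577 + 2*0 - 3) / 2) + 1 = 288
print(out.shape)                     # torch.Size([1, 1024, 288])
```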
5. Experimental Setup
5.1. Datasets
The authors used a diverse set of public datasets for training and evaluation across three modalities. The following are the benchmark datasets used for evaluation, as listed in Table 1:
- Vision:
  - VQA-v2: A visual question answering dataset requiring image understanding.
  - TextVQA: Requires reading and understanding text within images to answer questions (OCR-VQA).
  - GQA: Focuses on compositional reasoning and real-world visual understanding.
  - ScienceQA: Multimodal reasoning over scientific diagrams and text.
  - POPE: A benchmark designed to evaluate object hallucination in vision-language models.
  - MMMU: A challenging benchmark with college-level problems spanning multiple disciplines.
  - MMBench: A comprehensive benchmark for evaluating MLLMs on a wide range of capabilities.
- Audio:
  - LibriSpeech: A large corpus (960 hours) of English read-aloud speech, used for speech recognition.
  - Aishell-1: A corpus (178 hours) of Mandarin Chinese speech, used for non-English speech recognition.
- Time Series:
  - GIFT-Eval: A benchmark for general time series forecasting.
  - UTSD: Another public dataset for time series analysis.
  - The paper also mentions ECL, ETTh1, ETTh2, ETTm1, ETTm2, WTH, and Traffic as specific forecasting datasets.

For training, the model used LLaVA-595K for the initial feature alignment phase and LLaVA-665K for the instruction tuning phase in the vision domain. For audio and time series, models were trained on the training splits of their respective datasets (e.g., LibriSpeech, GIFT-Eval).
5.2. Evaluation Metrics
The paper uses several standard metrics to evaluate performance.
- Word Error Rate (WER): Used for speech recognition. It measures the number of errors (substitutions, deletions, and insertions) made by the model compared to a reference transcript. A lower WER is better.
  - Conceptual Definition: WER quantifies the difference between the predicted text and the ground-truth text at the word level.
  - Mathematical Formula: $ \mathrm{WER} = \frac{S + D + I}{N} $
  - Symbol Explanation:
    - $S$: Number of substitutions (words replaced incorrectly).
    - $D$: Number of deletions (words missed by the model).
    - $I$: Number of insertions (words added by the model that are not in the reference).
    - $N$: Total number of words in the reference transcript.
- Character Error Rate (CER): Also used for speech recognition, especially for languages like Chinese where word segmentation is ambiguous. It is analogous to WER but operates at the character level. A lower CER is better.
  - Conceptual Definition: CER quantifies the difference between the predicted text and the ground-truth text at the character level.
  - Mathematical Formula: $ \mathrm{CER} = \frac{S + D + I}{N_c} $
  - Symbol Explanation:
    - $S$, $D$, $I$: Substitutions, deletions, and insertions at the character level.
    - $N_c$: Total number of characters in the reference transcript.
- Mean Squared Error (MSE): Used for time series forecasting. It measures the average of the squared differences between the predicted values and the actual values. A lower MSE is better.
  - Conceptual Definition: MSE quantifies the average squared difference between predicted and actual values, heavily penalizing large errors.
  - Mathematical Formula: $ \mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 $
  - Symbol Explanation:
    - $n$: The number of data points.
    - $Y_i$: The actual value for the $i$-th data point.
    - $\hat{Y}_i$: The value predicted by the model for the $i$-th data point.
- VQA Accuracy: Used for benchmarks like VQA-v2. It measures the percentage of questions for which the model provides the correct answer. For VQA-v2, the metric is more nuanced, accounting for inter-annotator agreement, but it is generally reported as a percentage score.
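As a small worked example of the WER formula, the toy function below computes word-level edit distance and divides by the reference length. This is an illustration, not the paper's evaluation script; real ASR scoring typically normalizes text first.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Toy word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                                   # deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                                   # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])   # substitution or match
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the cat sat on the mat", "the cat sat on mat"))   # 1 deletion / 6 words ≈ 0.167
```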
5.3. Baselines
The paper compares ModRWKV against several state-of-the-art models in its class:
- Transformer-based MLLMs:
  - LLaVA-1.5 / LLaVA-1.6: Popular open-source vision-language models based on Vicuna (a Llama variant).
  - LLaVA-Phi: A smaller version using the Phi-2 LLM.
  - MobileVLM-3B: A lightweight MLLM designed for mobile applications.
  - Qwen2.5-VL-3B: A strong MLLM from Alibaba.
- Linear-Time MLLMs:
  - VL-Mamba: A multimodal model using the Mamba SSM backbone, representing a direct competitor in the linear-time space.
- Time Series Models:
  - TimeFM, Timer, UniTS, TTM, MOIRAI, ROSE: Various state-of-the-art models for time series forecasting.

These baselines are representative because they include both dominant Transformer-based models of various sizes and a direct architectural competitor (VL-Mamba), allowing for a fair and comprehensive comparison.
6. Results & Analysis
6.1. Core Results Analysis
6.1.1. Vision Understanding
The primary results for vision tasks are presented in Table 2, comparing ModRWKV with other MLLMs.
The following are the results from Table 2 of the original paper:
| Method | LLM | PT | IT | VQAv2 | GQA | SQA1 | VQA T | POPE | MMB | MMMU |
|---|---|---|---|---|---|---|---|---|---|---|
| LLaVA-1.5 | Vicuna-7B | 558K | 665K | 78.5 | 62.0 | 66.8 | 58.2 | 86.5 | 64.3 | - |
| LLaVA-1.6 | Vicuna-7B | 558K | 665K | 81.8 | 64.2 | 72.8 | 65.7 | 86.7 | 67.7 | 35.8 |
| LLaVA-Phi | Phi-2-2.7B | 558K | 665K | 71.4 | - | 68.4 | 48.6 | 85.0 | 59.8 | - |
| MobileVLM-3B | MobileLLaMA-2.7B | 558K | 665K | - | 59.0 | 61.2 | 47.5 | 84.9 | 59.6 | - |
| VL-Mamba | Mamba LLM-2.8B | 558K | 665K | 76.6 | 56.2 | 65.4 | 48.9 | 84.4 | 57.0 | - |
| +ShareGPT4V | LLaMA2-13B | 558K | 665K | 80.6 | 63.2 | 73.1 | 65.3 | 84.8 | 70.8 | - |
| ModRWKV | RWKV7 LLM-3B | 558K | 665K | 78.3 | 60.8 | 70.9 | 51.1 | 87.1 | 66.6 | 38.7 |
(Note: SQA1 is ScienceQA-IMG, VQA T is TextVQA, MMB is MMBench)
- Comparison with VL-Mamba: ModRWKV-3B consistently and significantly outperforms VL-Mamba-2.8B across all reported benchmarks. This suggests that, for a similar parameter count, the RWKV architecture is more effective for multimodal reasoning than the Mamba architecture in this setup.
- Comparison with Transformer Models: ModRWKV (3B) is highly competitive with LLaVA-1.5 (7B), a model more than twice its size. It surpasses LLaVA-1.5 on ScienceQA, POPE, and MMBench.
- Standout Performance: ModRWKV achieves the highest score (38.7) on the challenging MMMU benchmark among all compared models, highlighting its strong generalization and reasoning capabilities on difficult, knowledge-intensive tasks.
6.1.2. Audio Recognition
Table 3 shows the model's performance on English (LibriSpeech) and Chinese (Aishell-1) speech recognition.
The following are the results from Table 3 of the original paper:
| Dataset | Data (h) | Encoder | LibriSpeech Clean WER (%) | LibriSpeech Other WER (%) | Aishell-1 Dev CER (%) | Aishell-1 Test CER (%) |
|---|---|---|---|---|---|---|
| LibriSpeech | 960 | wavlm large | 2.43 | 6.51 | - | - |
| | | wavlm base+ | 3.08 | 10.38 | - | - |
| | | whisper medium | 5.33 | 12.28 | - | - |
| | | whisper small | 6.24 | 16.92 | - | - |
| Aishell-1 | 178 | wavlm large | - | - | 9.68 | 10.33 |
| | | wavlm base+ | - | - | 12.40 | 13.46 |
| | | whisper medium | - | - | 5.08 | 5.83 |
| | | whisper small | - | - | 6.29 | 6.95 |
- Encoder Performance: For English (LibriSpeech), WavLM encoders outperform Whisper encoders, with wavlm large achieving the best WER of 2.43% on clean speech. Conversely, for Chinese (Aishell-1), Whisper encoders perform significantly better, with whisper medium achieving the lowest CER of 5.83% on the test set. This highlights that the optimal choice of audio encoder can be language-dependent.
- Overall Capability: The results demonstrate that ModRWKV can achieve strong performance on speech recognition tasks in different languages, confirming its effectiveness in the audio modality.
6.1.3. Time Series Forecasting
Tables 4 and 5 evaluate ModRWKV on time series forecasting tasks.
- Encoder Choice (Table 4): The paper finds that the WaveNet encoder outperforms the Timer encoder, hypothesizing this is due to WaveNet's causal dilated convolutions and point-wise embedding strategy, which better capture fine-grained temporal features.
- Adapter Scaling (Table 5): Experiments show that an adapter scaling factor of 4x (i.e., the hidden layer is 4 times the input dimension) provides the best performance, outperforming the 2x and 8x settings. This suggests that the adapter needs sufficient capacity but can suffer from being oversized.
- Performance (Table 4): When trained on 100% of the GIFT-Eval dataset, ModRWKV achieves competitive Mean Squared Error (MSE) scores against specialized time series models like TimeFM and TTM, even outperforming them on the ETTm1 benchmark.
6.1.4. Efficiency and Scalability Analysis
A key claim of the paper is efficiency. Tables 9, 10, and 11 provide strong evidence.
- Inference Throughput (Table 9): ModRWKV achieves significantly higher throughput (tokens/s) than LLaVA and Qwen2.5-VL. At a batch size of 16, it is nearly 3x faster than LLaVA-1.6-7B and ~29x faster than Qwen2.5-VL-3B. This is attributed to its efficient RNN backbone and lightweight SigLIP2 vision encoder.
- Scalability with Sequence Length (Table 10): This is the most critical result for the paper's core thesis. The throughput of the Transformer-based Qwen2.5-3B plummets by over 90% as the sequence length increases from 1k to 64k tokens, demonstrating its quadratic complexity bottleneck. In stark contrast, ModRWKV's underlying RWKV model maintains a nearly constant throughput across all sequence lengths, empirically confirming its linear scaling and suitability for long-context applications.
- Evaluation Time (Table 11): ModRWKV demonstrates faster evaluation times on benchmarks like GQA and TextVQA compared to the larger Qwen2.5-VL-3B, highlighting its overall efficiency.
6.2. Ablation Studies / Parameter Analysis
6.2.1. Effect of Different Vision Encoders (Table 6)
The following are the results from Table 6 of the original paper:
| Vision | LLM Size | VQAv2 | VQA^T (TextVQA) | GQA | SQA1 |
|---|---|---|---|---|---|
| CLIP | 0.4B | 62.04 | 31.72 | 49.32 | 51.10 |
| | 1.5B | 72.31 | 40.27 | 54.56 | 62.77 |
| | 3B | 73.13 | 45.56 | 57.00 | 70.66 |
| SigLIP2 | 0.4B | 72.04 | 38.75 | 55.52 | 43.32 |
| | 1.5B | 76.95 | 44.96 | 58.88 | 63.10 |
| | 3B | 78.30 | 51.09 | 60.75 | 70.93 |
The SigLIP2 encoder outperforms the CLIP encoder across nearly all model sizes and benchmarks (the 0.4B ScienceQA score is the one exception). Notably, the SigLIP2 encoder used has only 90M parameters, compared to CLIP's ~300M. This demonstrates that the encoder's design and pretraining methodology are more critical for multimodal performance than sheer parameter count.
6.2.2. Efficiency of Sequence Compression (Table 7)
The following are the results from Table 7 of the original paper:
| LLM Size | (k, s) | Tokens | VQAv2 | VQA^T (TextVQA) | GQA | SQA1 |
|---|---|---|---|---|---|---|
| 1.5B | (0, 0) | 577 | 76.95 | 44.96 | 58.88 | 63.10 |
| | (3, 2) | 288 | 75.21 | 45.75 | 58.28 | 66.02 |
| | (4, 3) | 192 | 74.17 | 44.27 | 57.53 | 65.72 |
| | (5, 4) | 144 | 73.21 | 42.65 | 57.07 | 65.29 |
This study explores using a 1D convolution with kernel size $k$ and stride $s$ to compress the visual token sequence from 577 tokens. Compressing the sequence reduces performance on VQAv2 and GQA, which may rely on fine-grained visual details. However, it surprisingly improves performance on ScienceQA (SQA1), suggesting that compressing sequences might help the model focus on more salient global features, which is beneficial for high-level reasoning. This shows a trade-off between efficiency and performance that is task-dependent.
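The Tokens column in Table 7 follows directly from the output-length formula in Section 4.2.4; a quick check, assuming no padding:

```python
# Verify Table 7's token counts: L_out = floor((577 + 2p - k) / s) + 1 with p = 0.
for k, s in [(3, 2), (4, 3), (5, 4)]:
    tokens = (577 - k) // s + 1
    print(f"(k={k}, s={s}) -> {tokens} tokens")   # 288, 192, 144
```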
6.2.3. Impact of Pretrained LLM Weights (Table 8)
The following are the results from Table 8 of the original paper:
| LLM Size | Model | VQAv2 | VQA^T (TextVQA) | GQA | SQA1 |
|---|---|---|---|---|---|
| 0.4B | base | 72.04 | 38.75 | 55.52 | 43.32 |
| | g1 | 73.21 | 41.13 | 57.34 | 55.58 |
| 1.5B | base | 76.95 | 44.96 | 58.88 | 63.10 |
| | g1 | 77.87 | 50.91 | 60.18 | 64.63 |
This experiment compares two sets of pretrained weights for the RWKV7 backbone: base and g1. The g1 checkpoint was post-trained on a dataset rich in "think-type" (reasoning) data. Although both perform similarly on text-only benchmarks, the g1-initialized model shows significantly better performance across all multimodal benchmarks. The improvement is particularly dramatic for ScienceQA (SQA1), with a roughly 28% relative improvement for the 0.4B model. This strongly confirms that enhancing the underlying reasoning capabilities of the language model is crucial for improving its ability to understand and process multimodal information.
7. Conclusion & Reflections
7.1. Conclusion Summary
The paper successfully introduces and validates ModRWKV, a multimodal framework built on the RWKV7 RNN architecture. It demonstrates that modern RNNs, contrary to their limited use in the past, can serve as a powerful and efficient backbone for MLLMs. By integrating lightweight, interchangeable encoders, ModRWKV achieves strong performance across vision, audio, and time-series tasks. It is competitive with or superior to other linear-time models like VL-Mamba and even rivals larger Transformer-based models on several benchmarks. The paper's most compelling finding is the empirical confirmation of the RWKV architecture's linear scalability and constant-time inference, positioning it as a highly promising alternative for building efficient, long-context MLLMs.
7.2. Limitations & Future Work
The authors acknowledge a primary limitation: the current work does not explore more complex multimodal fusion scenarios. For instance, it does not address tasks that require simultaneous understanding of three or more modalities (e.g., video, which involves vision, audio, and language). Future work will aim to extend the ModRWKV framework to these richer, tri-modal (and beyond) settings.
7.3. Personal Insights & Critique
-
Strengths and Inspirations:
- This paper is a significant contribution to the growing field of efficient LLMs. It provides a robust, empirical answer to an important question: "Can we escape the quadratic bottleneck of Transformers in multimodality?" The head-to-head performance and scalability comparisons (especially Table 10) are powerful and convincing.
- The modular "plug-and-play" design is a practical and elegant engineering choice, making the framework highly extensible.
- The finding that improving the LLM's text-based reasoning (the g1 weights) directly boosts its multimodal capabilities is a crucial insight. It reinforces the idea that the LLM acts as the "brain" of the operation, and a smarter brain can better interpret sensory inputs, regardless of the modality.
-
Potential Issues and Areas for Improvement:
- Training Stability: The paper briefly mentions an "inconsistent" emergence of capabilities during adapter training, which sometimes failed to converge. This hints at potential training instabilities that are not fully explored. A more detailed analysis of training dynamics would be beneficial.
- Performance Gap: While ModRWKV is very efficient and competitive for its size, it still lags behind larger, state-of-the-art Transformer models like LLaVA-1.6 on several standard benchmarks (e.g., VQAv2, GQA). The work proves viability, but there is still a path to closing the performance gap with top-tier models.
- Clarity of Technical Details: The paper contains some minor but potentially confusing elements, such as the unconventional outer product in the RWKV7 state update formula, which is not explained in depth. While the directive is to be faithful, this mathematical choice warrants more justification than provided. Additionally, minor formatting errors and typos in the source document slightly detract from its polish.
- "First" Claim: The claim of being the "first RNN-based linear model that extends its capabilities to the cross-modal domain" is strong. While it is certainly a pioneering work for modern RNNs like RWKV in a full MLLM context, earlier research may have explored simpler RNNs for specific multimodal tasks. The novelty lies in its comprehensive MLLM paradigm and its use of a state-of-the-art RNN.