
ModRWKV: Transformer Multimodality in Linear Time

Published: 11/01/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

This study introduces ModRWKV, a framework based on the RWKV architecture that achieves multimodal processing with linear time complexity, rivaling traditional quadratic-complexity Transformer models. It balances performance and computational efficiency for multi-source information fusion across modalities such as vision, audio, and time series.

Abstract

Currently, most multimodal studies are based on large language models (LLMs) with quadratic-complexity Transformer architectures. While linear models like RNNs enjoy low inference costs, their application has been largely limited to the text-only modality. This work explores the capabilities of modern RNN architectures in multimodal contexts. We propose ModRWKV—a decoupled multimodal framework built upon the RWKV7 architecture as its LLM backbone—which achieves multi-source information fusion through dynamically adaptable heterogeneous modality encoders. We designed the multimodal modules in ModRWKV with an extremely lightweight architecture and, through extensive experiments, identified a configuration that achieves an optimal balance between performance and computational efficiency. ModRWKV leverages the pretrained weights of the RWKV7 LLM for initialization, which significantly accelerates multimodal training. Comparative experiments with different pretrained checkpoints further demonstrate that such initialization plays a crucial role in enhancing the model’s ability to understand multimodal signals. Supported by extensive experiments, we conclude that modern RNN architectures present a viable alternative to Transformers in the domain of multimodal large language models (MLLMs). Furthermore, we identify the optimal configuration of the ModRWKV architecture through systematic exploration.

In-depth Reading

1. Bibliographic Information

1.1. Title

ModRWKV: Transformer Multimodality in Linear Time

The title clearly states the paper's core contribution: achieving multimodal capabilities, which are typically associated with Transformer models, using an architecture that operates in linear time complexity. This points to a focus on computational efficiency.

1.2. Authors

Jiale Kang, Ziyin Yue, Qingyu Yin, Jiang Rui, Weile Li, Zening Lu, and Zhouran Ji.

The authors are affiliated with Yuanishi Inc, Zhejiang University, and The Hong Kong University of Science and Technology. This mix of industry and academic researchers suggests the work has both practical application and theoretical grounding.

1.3. Journal/Conference

The paper is slated for publication at the EMNLP 2025 main conference, as indicated by the ACL Anthology link (aclanthology.org/2025.emnlp-main.204/). EMNLP (Conference on Empirical Methods in Natural Language Processing) is a top-tier, highly competitive, and influential conference in the field of natural language processing. Publication at EMNLP signifies a high level of peer-reviewed validation and impact.

1.4. Publication Year

2025, with the specific publication date given as October 31, 2025.

1.5. Abstract

The abstract introduces ModRWKV, a multimodal framework built upon the RWKV architecture, a modern Recurrent Neural Network (RNN). Current multimodal models are predominantly based on Transformers, which have quadratic computational complexity. In contrast, linear models like RNNs are more efficient but have been mostly limited to text. ModRWKV aims to bridge this gap by using a decoupled design with dynamically adaptable encoders for different modalities (e.g., vision, audio). The framework uses an extremely lightweight multimodal module and leverages pretrained weights from the RWKV7 language model to speed up training and enhance understanding. Through extensive experiments, the paper concludes that modern RNNs are a viable alternative to Transformers for Multimodal Large Language Models (MLLMs) and identifies an optimal architectural configuration for ModRWKV.

2. Executive Summary

2.1. Background & Motivation

  • Core Problem: The dominant architecture for Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) is the Transformer. While powerful, its core self-attention mechanism has a computational and memory complexity that scales quadratically ($O(N^2)$) with the input sequence length ($N$). This makes processing long sequences (common in high-resolution images, audio, and video) prohibitively expensive.
  • Identified Gap: In contrast, architectures like Recurrent Neural Networks (RNNs) have linear time complexity ($O(N)$) for inference, making them highly efficient for long sequences. However, their application has been largely confined to text-only domains, and they have historically struggled with long-range dependencies and parallelizable training. Modern RNNs, such as RWKV, have overcome many of these limitations, but their potential in multimodal contexts remains largely unexplored.
  • Innovative Idea: The paper's entry point is to investigate whether a modern, efficient RNN architecture (RWKV7) can serve as the backbone for a high-performing MLLM, offering a computationally cheaper alternative to Transformer-based systems without a significant performance trade-off. The core idea is to build a unified framework, ModRWKV, that combines the RWKV7 LLM with lightweight, plug-and-play encoders for various modalities.

2.2. Main Contributions / Findings

The paper presents three main contributions:

  1. A Novel RNN-based Multimodal Framework: It proposes ModRWKV, which the authors claim is the first unified multimodal training paradigm based on an RNN architecture. Its "plug-and-play" design for modality encoders enhances scalability and allows the model to easily switch between processing different data types like images, audio, and time series.

  2. Comprehensive Benchmarking: The paper conducts a systematic and wide-ranging evaluation of ModRWKV's capabilities across vision, audio, and time-series tasks. This establishes a strong benchmark for assessing the cross-modal performance of RNN-based architectures, which has been lacking in the field.

  3. Optimal Design Identification: Through extensive ablation studies, the authors identify the most effective components and configurations for ModRWKV. This includes finding the best-performing vision encoders, sequence compression techniques, and adapter designs that strike an optimal balance between performance and computational efficiency.

    Key Finding: The central conclusion is that modern RNN architectures, exemplified by RWKV, are a viable and competitive alternative to Transformers for building MLLMs, offering significant advantages in inference efficiency and scalability with long sequences.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

  • Recurrent Neural Networks (RNNs): An RNN is a type of neural network designed for sequential data. It processes inputs one step at a time, maintaining an internal hidden state that captures information from previous steps. This "memory" allows it to handle tasks where context is important, like language. However, simple RNNs suffer from the vanishing/exploding gradient problem, making it difficult for them to learn long-range dependencies.
  • Transformer Architecture: The Transformer, introduced in "Attention Is All You Need," revolutionized sequence modeling. Instead of recurrence, it relies on a self-attention mechanism. This mechanism allows every token in the input sequence to directly attend to every other token, calculating a weighted sum of their values based on their compatibility. This enables the model to capture complex, long-range dependencies effectively and allows for high parallelization during training. However, this all-to-all comparison is what leads to its quadratic complexity.
  • Linear Complexity Models: These are models designed to have a computational cost that grows linearly with sequence length, i.e., $O(N)$. This is in contrast to the standard Transformer's $O(N^2)$ complexity. Examples include linear attention variants, State Space Models (SSM) like Mamba, and RNNs like RWKV. They are particularly advantageous for very long sequences.
  • Multimodal Large Language Models (MLLMs): MLLMs are models that can process and understand information from multiple modalities, not just text. A typical MLLM architecture consists of:
    1. A Modality Encoder: A specialized model that converts non-text data (e.g., an image) into a sequence of numerical representations (embeddings). Examples include CLIP for images or WavLM for audio.
    2. An LLM Backbone: A pretrained text-based LLM (e.g., Llama, GPT) that performs reasoning.
    3. An Adapter/Projector: A small neural network (often a simple MLP) that maps the embeddings from the modality encoder into the same vector space as the LLM's text embeddings. This allows the LLM to "read" the non-text information.

3.2. Previous Works

  • RWKV (Receptance Weighted Key Value): This is the core architecture used in the paper. RWKV is a novel RNN that combines the parallelizable training of Transformers with the efficient inference of RNNs. It reformulates self-attention into a linear, recurrent form. Its key components are:

    • Time-mixing Block: An RNN-based block that acts as a replacement for the Transformer's self-attention layer. It updates its state based on the current input and the state from the previous time step.
    • Channel-mixing Block: A simple feed-forward network that acts as a replacement for the Transformer's feed-forward layer, processing information within each time step. The RWKV7 version used in this paper introduces enhancements like a generalized delta rule, vector-valued gating, and in-context learning rates to improve its expressiveness, as detailed in the methodology section.
  • LLaVA (Large Language and Vision Assistant): A pioneering work in building open-source MLLMs. LLaVA's architecture is simple yet effective: it connects a pretrained vision encoder (CLIP) to a pretrained LLM (Vicuna) using a simple MLP projector. It introduced a two-stage training strategy that ModRWKV also adopts (a minimal code sketch of the two-phase schedule appears at the end of this subsection):

    1. Feature Alignment (Phase I): Freeze the vision encoder and LLM, and train only the MLP projector on image-caption pairs. This quickly aligns the visual features with the LLM's word embedding space.
    2. Instruction Tuning (Phase II): Unfreeze the projector and the LLM (or parts of it), and fine-tune on a more complex dataset of multimodal instructions (e.g., visual question answering).
  • State Space Models (SSMs) & Mamba: SSMs are another class of linear-time models inspired by classical state space systems in control theory. Mamba is a recent, highly successful SSM that uses a selection mechanism to allow its state to be context-aware, selectively remembering or forgetting information. The paper compares ModRWKV to VL-Mamba, a multimodal version of Mamba, positioning RWKV as a competitive RNN-based alternative to SSMs.

  • Modality Encoders: The paper leverages several established encoders:

    • Vision: CLIP (Contrastive Language-Image Pre-training) and SigLIP2 (Sigmoid Loss for Language Image Pre-training) are models that learn to associate images with text. They are powerful visual feature extractors.
    • Audio: WavLM and Whisper are large models pretrained on vast amounts of audio data, making them excellent for extracting features from raw speech for tasks like speech recognition.
    • Time Series: WaveNet uses causal dilated convolutions to capture temporal dependencies, while Timer is a Transformer-based model for time series forecasting.
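
A minimal PyTorch-style sketch of the two-phase LLaVA-style training schedule referenced above (Phase I: feature alignment; Phase II: instruction tuning). The module names vision_encoder, projector, and llm_backbone are placeholders, not the paper's actual code:

```python
import torch

def set_trainable(module: torch.nn.Module, trainable: bool) -> None:
    """Freeze or unfreeze every parameter of a module."""
    for p in module.parameters():
        p.requires_grad = trainable

def configure_phase(vision_encoder, projector, llm_backbone, phase: int) -> None:
    """Apply the LLaVA-style two-phase freezing schedule (illustrative only)."""
    if phase == 1:
        # Phase I (feature alignment): train only the projector on image-caption pairs.
        set_trainable(vision_encoder, False)
        set_trainable(llm_backbone, False)
        set_trainable(projector, True)
    else:
        # Phase II (instruction tuning): unfreeze the LLM backbone as well.
        set_trainable(vision_encoder, False)
        set_trainable(llm_backbone, True)
        set_trainable(projector, True)
```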

3.3. Technological Evolution

The field of large-scale modeling has evolved from text-only Transformers (e.g., GPT-3) to Transformer-based MLLMs (e.g., GPT-4V, LLaVA), which extended these models' capabilities to vision and other modalities. The major bottleneck of this paradigm has been the quadratic complexity of attention. This has spurred research into more efficient, linear-time architectures. This research has largely split into two camps: non-recurrent models like Mamba (SSMs) and modern recurrent models like RWKV. This paper firmly places itself in the latter camp, pushing the boundaries of what RNNs can do by systematically adapting them to the complex, multi-domain challenges of multimodality.

3.4. Differentiation Analysis

Compared to mainstream MLLMs like LLaVA, the core innovation of ModRWKV is its backbone architecture.

  • LLaVA (and similar models): Uses a Transformer backbone. Inference is slow for long sequences, and the key-value cache grows linearly with context length.

  • ModRWKV: Uses an RWKV (RNN) backbone. Inference is extremely fast and has constant memory usage per new token, as it only needs to maintain a fixed-size state. Prefill computation scales linearly with sequence length, a major advantage over the Transformer's quadratic scaling.

    Compared to other linear-time MLLMs like VL-Mamba:

  • VL-Mamba: Based on a State Space Model.

  • ModRWKV: Based on a Recurrent Neural Network. This paper provides a direct comparison point between these two distinct approaches to linear-time sequence modeling in the multimodal domain. ModRWKV's design is intentionally simple (lightweight adapters) to rigorously test the inherent multimodal reasoning capability of the RWKV architecture itself.

4. Methodology

4.1. Principles

The core principle of ModRWKV is to create a computationally efficient yet powerful MLLM by replacing the standard Transformer backbone with a modern RNN, RWKV7. The design is decoupled and modular: a shared RWKV7 language model serves as the central reasoning engine, while specialized, "plug-and-play" modules handle the initial processing of different modalities. This allows the framework to adapt to new data types by simply adding a new encoder, without altering the core LLM. The methodology intentionally uses a lightweight adapter to force the RWKV7 backbone to perform the complex cross-modal fusion and reasoning, thereby providing a strong test of the RNN's capabilities.

4.2. Core Methodology In-depth

The overall architecture of ModRWKV is shown in Figure 1. A multimodal input (e.g., an image) and a text prompt are processed through parallel streams before being fused and fed into the RWKV7 LLM.

Figure 1 (caption translated): A schematic of the multimodal data-processing flow in the ModRWKV framework. Multimodal inputs are encoded through a 1D convolution and a Mod Encoder, combined with text embeddings for information fusion, and finally passed to the RWKV module. The figure depicts components for tasks such as visual understanding and audio recognition.

The data flow is as follows:

  1. Modality Encoding: Non-text data is first processed by a modality-specific encoder (e.g., SigLIP2 for images) to extract a sequence of high-level feature vectors.
  2. Sequence Compression: The resulting feature sequence, which can be very long, is passed through a 1D Convolutional layer to reduce its length, making subsequent processing more efficient.
  3. Dimension Alignment: The compressed feature sequence is then fed into a lightweight Adapter (an MLP) that projects the features into the same embedding dimension as the RWKV7 model's vocabulary.
  4. Text Encoding: Concurrently, the input text prompt is converted into a sequence of embeddings using the RWKV7 model's standard Text Embedding layer.
  5. Concatenation & Fusion: The projected multimodal features and the text embeddings are concatenated into a single sequence. This combined sequence is then fed into the RWKV7 backbone, which processes the information autoregressively to generate a response.
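
To make this flow concrete, below is a minimal PyTorch-style sketch of the forward pass. The module wiring and the rwkv7.embed / inputs_embeds interface are illustrative assumptions, not the paper's released code:

```python
import torch
import torch.nn as nn

class ModRWKVPipeline(nn.Module):
    """Illustrative forward pass: encode -> compress -> project -> fuse with text -> RWKV7."""

    def __init__(self, mod_encoder, seq_compress, adapter, rwkv7):
        super().__init__()
        self.mod_encoder = mod_encoder    # e.g., a frozen SigLIP2 / WavLM / WaveNet encoder
        self.seq_compress = seq_compress  # 1D convolution downsampler (Section 4.2.4)
        self.adapter = adapter            # lightweight MLP adapter (Section 4.2.3)
        self.rwkv7 = rwkv7                # RWKV7 backbone; assumed to expose embed() and inputs_embeds

    def forward(self, modality_input, text_ids):
        feats = self.mod_encoder(modality_input)                          # (B, L, D_enc)
        feats = self.seq_compress(feats.transpose(1, 2)).transpose(1, 2)  # (B, L', D_enc), L' < L
        mod_emb = self.adapter(feats)                                     # (B, L', D_llm)
        txt_emb = self.rwkv7.embed(text_ids)                              # (B, T, D_llm)
        fused = torch.cat([mod_emb, txt_emb], dim=1)                      # one combined sequence
        return self.rwkv7(inputs_embeds=fused)                            # autoregressive output
```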

4.2.1. RWKV7 Backbone

RWKV7 is a modern RNN that serves as the LLM backbone. The paper provides the state update equations that define its core "time-mixing" block. While a simple RNN can be written as: $ h_t = Wh_{t - 1} + Ux_t \quad (1) $ RWKV7 uses a much more expressive formulation to update its state $s_t$ at each time step $t$. The state update is given by: $ s_t = G_{t}s_{t - 1} + a_{t}k_{t}v_{t}^{T} \quad (3) $ Where:

  • $s_{t-1}$ is the state from the previous time step.
  • $k_t$ and $v_t$ are the key and value vectors, which are linearly projected from the input token $x_t$, similar to a Transformer.
  • $a_t$ is the in-context learning rate, a vector that controls how much of the new information ($k_t v_t^T$) should be added to the state. It is dynamically computed from the input as $a_t = W_a x_t$, which makes the update content-dependent.
  • $G_t$ is the dynamic transition matrix, which controls how the previous state $s_{t-1}$ decays and transforms. It is defined as $G_t = (I - a_t k_t k_t^T)\,\mathrm{diag}(e^{-e^{w_t}})$.
    • $w_t$ is a vector-valued gating parameter, also computed from the input ($w_t = W_w x_t$). This allows different channels in the hidden state to decay at different rates, making the model's memory highly adaptive.
    • The term $(I - a_t k_t k_t^T)$ implements a "relaxed value replacement rule."
    • Note on notation: $k_t v_t^T$ is an outer product, so the state $s_t$ is matrix-valued rather than a vector, in line with linear-attention-style recurrences. The formula is reproduced here exactly as presented in the paper; this formulation lets interactions between the key and value dimensions be accumulated directly in the state matrix.
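
For illustration, the NumPy sketch below runs one step of the recurrence in Eq. (3), treating the state as a $d \times d$ matrix and reading the vector-valued $a_t$ as a diagonal matrix; the official RWKV7 implementation differs in layout and in its fused kernels:

```python
import numpy as np

def rwkv7_state_step(s_prev, k, v, a, w):
    """One step of s_t = G_t s_{t-1} + a_t k_t v_t^T with a matrix-valued state.

    s_prev : (d, d) state carried over from the previous token
    k, v   : (d,)   key and value vectors projected from the current token
    a, w   : (d,)   in-context learning rate (W_a x_t) and gating vector (W_w x_t)
    """
    d = k.shape[0]
    decay = np.diag(np.exp(-np.exp(w)))           # diag(e^{-e^{w_t}}): per-channel decay
    A = np.diag(a)                                # vector a_t interpreted as a diagonal matrix
    G = (np.eye(d) - A @ np.outer(k, k)) @ decay  # dynamic transition matrix G_t
    return G @ s_prev + A @ np.outer(k, v)        # updated state s_t

# Toy usage: d = 4, random token projections.
rng = np.random.default_rng(0)
s = np.zeros((4, 4))
s = rwkv7_state_step(s, rng.normal(size=4), rng.normal(size=4),
                     rng.uniform(size=4), rng.normal(size=4))
```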

4.2.2. Multimodal Encoders

ModRWKV is designed to be agnostic to the choice of encoder. The paper experiments with several options for different modalities:

  • Vision Encoder: CLIP and SigLIP2 are tested. These encoders take a raw image and output a sequence of patch embeddings.
  • Audio Encoder: WavLM and Whisper are used. They process raw audio sampled at 16,000 Hz and generate feature vectors.
  • Time Series Encoder: WaveNet and Timer are used to transform raw time-series data into feature embeddings.

4.2.3. Adapter Design

To connect the encoders to the LLM, the paper uses a very simple adapter consisting of a single MLP block with a ReLU activation function. This forces the RWKV7 backbone to handle the bulk of the cross-modal reasoning. $ h = \mathrm{Linear}_2(\mathrm{ReLU}(\mathrm{Linear}_1(\boldsymbol {\mathfrak{x}}))) \quad (4) $ Where:

  • $\mathfrak{x}$ is the input feature sequence from the encoder.
  • $\mathrm{Linear}_1$ projects the features to a hidden dimension (e.g., 4x the input dimension).
  • $\mathrm{Linear}_2$ projects the hidden representation to the final dimension required by the RWKV model.
  • $h$ is the final sequence of embeddings ready to be concatenated with the text embeddings.
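
A minimal PyTorch sketch of this adapter, assuming the 4x hidden expansion mentioned above (class and argument names are illustrative):

```python
import torch.nn as nn

class Adapter(nn.Module):
    """MLP adapter of Eq. (4): Linear -> ReLU -> Linear."""

    def __init__(self, enc_dim: int, llm_dim: int, scale: int = 4):
        super().__init__()
        self.linear1 = nn.Linear(enc_dim, scale * enc_dim)  # project to the hidden dimension
        self.act = nn.ReLU()
        self.linear2 = nn.Linear(scale * enc_dim, llm_dim)  # match the RWKV7 embedding dimension

    def forward(self, x):
        # x: (batch, seq_len, enc_dim) feature sequence from the modality encoder
        return self.linear2(self.act(self.linear1(x)))
```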

4.2.4. Sequence Compression

To manage the computational cost of long sequences from high-resolution modalities, ModRWKV employs a 1D convolution layer. This layer acts as a downsampler, reducing the sequence length while preserving important features. The computation for the $c$-th output channel is: $ y_c = \sum_{i = 1}^{C_{\mathrm{in}}}\left(\sum_{j = 0}^{k - 1}W_{c,i,j}\cdot {\pmb{x}}_{i,s \cdot t + j}\right) + b_c \quad (5) $ Where:

  • $Y$ is the output sequence, with $y_c$ being one of its channels.
  • $x$ is the input sequence with $C_{\mathrm{in}}$ channels and length $L$.
  • $W$ is the convolutional kernel of size $k$.
  • $s$ is the stride, which determines the downsampling factor.
  • $b_c$ is a bias term.
  • The output length $L'$ is given by $L' = \left\lfloor \frac{L + 2p - k}{s} \right\rfloor + 1$, where $p$ is the padding.
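
The short PyTorch check below illustrates the downsampling and the output-length formula, using the 577 visual tokens and the (k=3, s=2) setting from Table 7; zero padding and the channel count are assumptions for the example:

```python
import torch
import torch.nn as nn

L, C_in, k, s, p = 577, 768, 3, 2, 0   # 577 visual tokens; channel count is illustrative
compress = nn.Conv1d(in_channels=C_in, out_channels=C_in, kernel_size=k, stride=s, padding=p)

x = torch.randn(1, C_in, L)            # Conv1d expects (batch, channels, length)
y = compress(x)

expected_len = (L + 2 * p - k) // s + 1  # L' = floor((L + 2p - k) / s) + 1
print(y.shape[-1], expected_len)         # both 288, matching Table 7's (3, 2) row
```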

5. Experimental Setup

5.1. Datasets

The authors used a diverse set of public datasets for training and evaluation across three modalities. The following are the benchmark datasets used for evaluation, as listed in Table 1:

  • Vision:
    • VQA-v2: A visual question answering dataset requiring image understanding.
    • TextVQA: Requires reading and understanding text within images to answer questions (OCR-VQA).
    • GQA: Focuses on compositional reasoning and real-world visual understanding.
    • ScienceQA: Multimodal reasoning over scientific diagrams and text.
    • POPE: A benchmark designed to evaluate object hallucination in vision-language models.
    • MMMU: A challenging benchmark with college-level problems spanning multiple disciplines.
    • MMBench: A comprehensive benchmark for evaluating MLLMs on a wide range of capabilities.
  • Audio:
    • LibriSpeech: A large corpus (960 hours) of English read-aloud speech, used for speech recognition.
    • Aishell-1: A corpus (178 hours) of Mandarin Chinese speech, used for non-English speech recognition.
  • Time Series:
    • GIFT-Eval: A benchmark for general time series forecasting.

    • UTSD: Another public dataset for time series analysis.

    • The paper also mentions ECL, ETTh1, ETTh2, ETTm1, ETTm2, WTH, and Traffic as specific forecasting datasets.

      For training, the model used LLaVA-595K for the initial feature alignment phase and LLaVA-665K for the instruction tuning phase in the vision domain. For audio and time series, models were trained on the training splits of their respective datasets (e.g., LibriSpeech, GIFT-Eval).

5.2. Evaluation Metrics

The paper uses several standard metrics to evaluate performance.

  • Word Error Rate (WER): Used for speech recognition. It measures the number of errors (substitutions, deletions, and insertions) made by the model compared to a reference transcript. A lower WER is better.

    • Conceptual Definition: WER quantifies the difference between the predicted text and the ground-truth text at the word level.
    • Mathematical Formula: $ \mathrm{WER} = \frac{S + D + I}{N} $
    • Symbol Explanation:
      • $S$: Number of substitutions (words replaced incorrectly).
      • $D$: Number of deletions (words missed by the model).
      • $I$: Number of insertions (words added by the model that are not in the reference).
      • $N$: Total number of words in the reference transcript.
  • Character Error Rate (CER): Also used for speech recognition, especially for languages like Chinese where word segmentation is ambiguous. It is analogous to WER but operates at the character level. A lower CER is better.

    • Conceptual Definition: CER quantifies the difference between the predicted text and the ground-truth text at the character level.
    • Mathematical Formula: $ \mathrm{CER} = \frac{S + D + I}{N_c} $
    • Symbol Explanation:
      • S, D, I: Substitutions, deletions, and insertions at the character level.
      • $N_c$: Total number of characters in the reference transcript.
  • Mean Squared Error (MSE): Used for time series forecasting. It measures the average of the squares of the errors—that is, the average squared difference between the estimated values and the actual value. A lower MSE is better.

    • Conceptual Definition: MSE quantifies the average squared difference between predicted and actual values, heavily penalizing large errors.
    • Mathematical Formula: $ \mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 $
    • Symbol Explanation:
      • $n$: The number of data points.
      • $Y_i$: The actual value for the $i$-th data point.
      • $\hat{Y}_i$: The value predicted by the model for the $i$-th data point.
  • VQA Accuracy: Used for benchmarks like VQA-v2. It measures the percentage of questions for which the model provides the correct answer. For VQA-v2, the formula is more nuanced, accounting for inter-annotator agreement, but is generally reported as a percentage score.
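
For reference, a small self-contained Python sketch of these metrics, using a standard Levenshtein distance for WER/CER (this is not the paper's evaluation code, and VQA accuracy is omitted because its annotator-agreement weighting is benchmark-specific):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: minimum substitutions + deletions + insertions."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (r != h))
    return dp[-1]

def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    return edit_distance(ref, hyp) / len(ref)

def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: the same computation at the character level."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)

def mse(y_true, y_pred) -> float:
    """Mean Squared Error for forecasting."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)

print(wer("the cat sat on the mat", "the cat sit on mat"))  # 2 errors / 6 words ~= 0.33
```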

5.3. Baselines

The paper compares ModRWKV against several state-of-the-art models in its class:

  • Transformer-based MLLMs:
    • LLaVA-1.5 / LLaVA-1.6: Popular open-source vision-language models based on Vicuna (a Llama variant).
    • LLaVA-Phi: A smaller version using the Phi-2 LLM.
    • MobileVLM-3B: A lightweight MLLM designed for mobile applications.
    • Qwen2.5-VL-3B: A strong MLLM from Alibaba.
  • Linear-Time MLLMs:
    • VL-Mamba: A multimodal model using the Mamba SSM backbone, representing a direct competitor in the linear-time space.
  • Time Series Models:
    • TimeFM, Timer, UniTS, TTM, MOIRAI, ROSE: Various state-of-the-art models for time series forecasting.

      These baselines are representative because they include both dominant Transformer-based models of various sizes and a direct architectural competitor (VL-Mamba), allowing for a fair and comprehensive comparison.

6. Results & Analysis

6.1. Core Results Analysis

6.1.1. Vision Understanding

The primary results for vision tasks are presented in Table 2, comparing ModRWKV with other MLLMs. The following are the results from Table 2 of the original paper:

| Method | LLM | PT | IT | VQAv2 | GQA | SQA^I | VQA^T | POPE | MMB | MMMU |
|---|---|---|---|---|---|---|---|---|---|---|
| LLaVA-1.5 | Vicuna-7B | 558K | 665K | 78.5 | 62.0 | 66.8 | 58.2 | 86.5 | 64.3 | - |
| LLaVA-1.6 | Vicuna-7B | 558K | 665K | 81.8 | 64.2 | 72.8 | 65.7 | 86.7 | 67.7 | 35.8 |
| LLaVA-Phi | Phi-2-2.7B | 558K | 665K | 71.4 | - | 68.4 | 48.6 | 85.0 | 59.8 | - |
| MobileVLM-3B | MobileLLaMA-2.7B | 558K | 665K | - | 59.0 | 61.2 | 47.5 | 84.9 | 59.6 | - |
| VL-Mamba | Mamba LLM-2.8B | 558K | 665K | 76.6 | 56.2 | 65.4 | 48.9 | 84.4 | 57.0 | - |
| +ShareGPT4V | LLaMA2-13B | 558K | 665K | 80.6 | 63.2 | 73.1 | 65.3 | 84.8 | 70.8 | - |
| ModRWKV | RWKV7 LLM-3B | 558K | 665K | 78.3 | 60.8 | 70.9 | 51.1 | 87.1 | 66.6 | 38.7 |

(Note: SQA^I is ScienceQA-IMG, VQA^T is TextVQA, MMB is MMBench; PT and IT denote the sizes of the pretraining and instruction-tuning datasets.)

  • Comparison with VL-Mamba: ModRWKV-3B consistently and significantly outperforms VL-Mamba-2.8B across all reported benchmarks. This suggests that for a similar parameter count, the RWKV architecture is more effective for multimodal reasoning than the Mamba architecture in this setup.
  • Comparison with Transformer Models: ModRWKV (3B) is highly competitive with LLaVA-1.5 (7B), a model more than twice its size. It surpasses LLaVA-1.5 on ScienceQA, POPE, and MMBench.
  • Standout Performance: ModRWKV achieves the highest score (38.7) on the challenging MMMU benchmark among all compared models, highlighting its strong generalization and reasoning capabilities on difficult, knowledge-intensive tasks.

6.1.2. Audio Recognition

Table 3 shows the model's performance on English (LibriSpeech) and Chinese (Aishell-1) speech recognition. The following are the results from Table 3 of the original paper:

| Dataset | Data (h) | Encoder | Clean WER (%) | Other WER (%) | Dev CER (%) | Test CER (%) |
|---|---|---|---|---|---|---|
| LibriSpeech | 960 | wavlm large | 2.43 | 6.51 | - | - |
| | | wavlm base+ | 3.08 | 10.38 | - | - |
| | | whisper medium | 5.33 | 12.28 | - | - |
| | | whisper small | 6.24 | 16.92 | - | - |
| Aishell-1 | 178 | wavlm large | - | - | 9.68 | 10.33 |
| | | wavlm base+ | - | - | 12.40 | 13.46 |
| | | whisper medium | - | - | 5.08 | 5.83 |
| | | whisper small | - | - | 6.29 | 6.95 |

(WER columns are measured on LibriSpeech; CER columns are measured on Aishell-1.)

  • Encoder Performance: For English (LibriSpeech), WavLM encoders outperform Whisper encoders, with wavlm large achieving the best WER of 2.43% on clean speech. Conversely, for Chinese (Aishell-1), Whisper encoders perform significantly better, with whisper medium achieving the lowest CER of 5.83% on the test set. This highlights that the optimal choice of audio encoder can be language-dependent.
  • Overall Capability: The results demonstrate that ModRWKV can achieve strong performance on speech recognition tasks in different languages, confirming its effectiveness in the audio modality.

6.1.3. Time Series Forecasting

Tables 4 and 5 evaluate ModRWKV on time series forecasting tasks.

  • Encoder Choice (Table 4): The paper finds that the WaveNet encoder outperforms the Timer encoder, hypothesizing this is due to WaveNet's causal dilated convolutions and point-wise embedding strategy, which better capture fine-grained temporal features.
  • Adapter Scaling (Table 5): Experiments show that an adapter scaling factor of 4x (i.e., the hidden layer is 4 times the input dimension) provides the best performance, outperforming 2x and 8x settings. This suggests that the adapter needs sufficient capacity but can suffer from being oversized.
  • Performance (Table 4): When trained on 100% of the GIFT-Eval dataset, ModRWKV achieves competitive Mean Squared Error (MSE) scores against specialized time series models like TimeFM and TTM, even outperforming them on the ETTm1 benchmark.

6.1.4. Efficiency and Scalability Analysis

A key claim of the paper is efficiency. Tables 9, 10, and 11 provide strong evidence.

  • Inference Throughput (Table 9): ModRWKV achieves significantly higher throughput (tokens/s) than LLaVA and Qwen2.5-VL. At a batch size of 16, it is nearly 3x faster than LLaVA-1.6-7B and ~29x faster than Qwen2.5-VL-3B. This is attributed to its efficient RNN backbone and lightweight SigLIP2 vision encoder.
  • Scalability with Sequence Length (Table 10): This is the most critical result for the paper's core thesis. The throughput of the Transformer-based Qwen2.5-3B plummets by over 90% as the sequence length increases from 1k to 64k tokens, demonstrating its quadratic complexity bottleneck. In stark contrast, ModRWKV's underlying RWKV model maintains a nearly constant throughput across all sequence lengths, empirically confirming its linear scaling and suitability for long-context applications.
  • Evaluation Time (Table 11): ModRWKV demonstrates faster evaluation times on benchmarks like GQA and TextVQA compared to the larger Qwen2.5-VL-3B, highlighting its overall efficiency.

6.2. Ablation Studies / Parameter Analysis

6.2.1. Effect of Different Vision Encoders (Table 6)

The following are the results from Table 6 of the original paper:

| Vision | Size | VQAv2 | VQA^T (TextVQA) | GQA | SQA^I |
|---|---|---|---|---|---|
| CLIP | 0.4B | 62.04 | 31.72 | 49.32 | 51.10 |
| | 1.5B | 72.31 | 40.27 | 54.56 | 62.77 |
| | 3B | 73.13 | 45.56 | 57.00 | 70.66 |
| SigLIP2 | 0.4B | 72.04 | 38.75 | 55.52 | 43.32 |
| | 1.5B | 76.95 | 44.96 | 58.88 | 63.10 |
| | 3B | 78.30 | 51.09 | 60.75 | 70.93 |

(Size refers to the RWKV7 LLM backbone.)

The SigLIP2 encoder consistently outperforms the CLIP encoder across all model sizes and benchmarks. Notably, the SigLIP2 encoder used has only 90M parameters, compared to CLIP's ~300M. This demonstrates that the encoder's design and pretraining methodology are more critical for multimodal performance than sheer parameter count.

6.2.2. Efficiency of Sequence Compression (Table 7)

The following are the results from Table 7 of the original paper:

| Size | (k, s) | Tokens | VQAv2 | VQA^T (TextVQA) | GQA | SQA^I |
|---|---|---|---|---|---|---|
| 1.5B | (0, 0) | 577 | 76.95 | 44.96 | 58.88 | 63.10 |
| | (3, 2) | 288 | 75.21 | 45.75 | 58.28 | 66.02 |
| | (4, 3) | 192 | 74.17 | 44.27 | 57.53 | 65.72 |
| | (5, 4) | 144 | 73.21 | 42.65 | 57.07 | 65.29 |

This study explores using a 1D convolution with kernel size $k$ and stride $s$ to compress the visual token sequence from 577 tokens. Compressing the sequence reduces performance on VQAv2 and GQA, which may rely on fine-grained visual details. However, it surprisingly improves performance on ScienceQA (SQA^I), suggesting that compressing sequences might help the model focus on more salient global features, which is beneficial for high-level reasoning. This shows a trade-off between efficiency and performance that is task-dependent.

6.2.3. Impact of Pretrained LLM Weights (Table 8)

The following are the results from Table 8 of the original paper:

| Size | Model | VQAv2 | VQA^T (TextVQA) | GQA | SQA^I |
|---|---|---|---|---|---|
| 0.4B | base | 72.04 | 38.75 | 55.52 | 43.32 |
| | g1 | 73.21 | 41.13 | 57.34 | 55.58 |
| 1.5B | base | 76.95 | 44.96 | 58.88 | 63.10 |
| | g1 | 77.87 | 50.91 | 60.18 | 64.63 |

This experiment compares two sets of pretrained weights for the RWKV7 backbone: base and g1. The g1 model was post-trained on a dataset rich in "think-type" (reasoning) data. Although both perform similarly on text-only benchmarks, the g1-initialized model shows significantly better performance across all multimodal benchmarks. The improvement is particularly dramatic for ScienceQA (SQA^I), with a 28% relative improvement for the 0.4B model. This strongly confirms that enhancing the underlying reasoning capabilities of the language model is crucial for improving its ability to understand and process multimodal information.

7. Conclusion & Reflections

7.1. Conclusion Summary

The paper successfully introduces and validates ModRWKV, a multimodal framework built on the RWKV7 RNN architecture. It demonstrates that modern RNNs, contrary to their limited use in the past, can serve as a powerful and efficient backbone for MLLMs. By integrating lightweight, interchangeable encoders, ModRWKV achieves strong performance across vision, audio, and time-series tasks. It is competitive with or superior to other linear-time models like VL-Mamba and even rivals larger Transformer-based models on several benchmarks. The paper's most compelling finding is the empirical confirmation of the RWKV architecture's linear scalability and constant-time inference, positioning it as a highly promising alternative for building efficient, long-context MLLMs.

7.2. Limitations & Future Work

The authors acknowledge a primary limitation: the current work does not explore more complex multimodal fusion scenarios. For instance, it does not address tasks that require simultaneous understanding of three or more modalities (e.g., video, which involves vision, audio, and language). Future work will aim to extend the ModRWKV framework to these richer, tri-modal (and beyond) settings.

7.3. Personal Insights & Critique

  • Strengths and Inspirations:

    • This paper is a significant contribution to the growing field of efficient LLMs. It provides a robust, empirical answer to an important question: "Can we escape the quadratic bottleneck of Transformers in multimodality?" The head-to-head performance and scalability comparisons (especially Table 10) are powerful and convincing.
    • The modular "plug-and-play" design is a practical and elegant engineering choice, making the framework highly extensible.
    • The finding that improving the LLM's text-based reasoning (g1 weights) directly boosts its multimodal capabilities is a crucial insight. It reinforces the idea that the LLM acts as the "brain" of the operation, and a smarter brain can better interpret sensory inputs, regardless of the modality.
  • Potential Issues and Areas for Improvement:

    • Training Stability: The paper briefly mentions an "inconsistent" emergence of capabilities during adapter training, which sometimes failed to converge. This hints at potential training instabilities that are not fully explored. A more detailed analysis of training dynamics would be beneficial.
    • Performance Gap: While ModRWKV is very efficient and competitive for its size, it still lags behind larger, state-of-the-art Transformer models like LLaVA-1.6 on several standard benchmarks (e.g., VQAv2, GQA). The work proves viability, but there is still a path to closing the performance gap with top-tier models.
    • Clarity of Technical Details: The paper contains some minor but potentially confusing elements, such as the matrix-valued state update introduced by the $k_t v_t^T$ outer product in the RWKV7 formulation, which is presented with little explanation. This mathematical choice warrants more justification than the paper provides. Additionally, minor formatting errors and typos in the source document slightly detract from its polish.
    • "First" Claim: The claim of being the "first RNN-based linear model that extends its capabilities to the cross-modal domain" is strong. While it is certainly a pioneering work for modern RNNs like RWKV in a full MLLM context, earlier research may have explored simpler RNNs for specific multimodal tasks. The novelty lies in its comprehensive MLLM paradigm and its use of a state-of-the-art RNN.
