Paper status: completed

RWKV-7 "Goose" with Expressive Dynamic State Evolution

Published: 03/19/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

RWKV-7 "Goose" is a novel sequence modeling architecture that achieves constant memory usage and constant inference time per token. This 2.9 billion parameter model sets a new state of the art on multilingual tasks and matches existing 3B benchmarks in English, while introducing a generalized delta rule with vector-valued gating and in-context learning rates.

Abstract

We present RWKV-7 "Goose", a new sequence modeling architecture with constant memory usage and constant inference time per token. Despite being trained on dramatically fewer tokens than other top models, our 2.9 billion parameter language model achieves a new 3B SoTA on multilingual tasks and matches the current 3B SoTA on English language downstream performance. RWKV-7 introduces a newly generalized formulation of the delta rule with vector-valued gating and in-context learning rates, as well as a relaxed value replacement rule. We show that RWKV-7 can perform state tracking and recognize all regular languages, while retaining parallelizability of training. This exceeds the capabilities of Transformers under standard complexity conjectures, which are limited to $\mathsf{TC}^0$. To demonstrate RWKV-7's language modeling capability, we also present an extended open source 3.1 trillion token multilingual corpus, and train four RWKV-7 models ranging from 0.19 billion to 2.9 billion parameters on this dataset. To foster openness, reproduction, and adoption, we release our models and dataset component listing at https://huggingface.co/RWKV, and our training and inference code at https://github.com/RWKV/RWKV-LM all under the Apache 2.0 License.

Mind Map

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

RWKV-7 "Goose" with Expressive Dynamic State Evolution

1.2. Authors

The paper lists a large number of authors, indicating a collaborative effort across multiple institutions and projects. The primary authors mentioned at the start are Bo Peng, Ruichong Zhang, Daniel Goldstein, Eric Alcaide, Xingjian Du, Haowen Hou, Jiaju Lin, Jiaxing Liu, Janna Lu, William Merrill, Guangyu Song, Kaifeng Tan, Saiteja Utpala, Nathan Wilce, Johan S. Wind, Tianyi Wu, Daniel Wuttke, and Christian Zhou-Zheng.

Their affiliations include:

  • RWKV Project (under Linux Foundation AI & Data)

  • EleutherAI

  • Tsinghua University

  • Recursal AI

  • Dalle Molle Institute for Artificial Intelligence USI-SUPSI

  • University of Rochester

  • Other universities, including Zhejiang University, George Mason University, New York University, the University of Oslo, and Beijing Normal University.

    This diverse set of affiliations suggests a broad range of expertise contributing to the project, from core architectural design and optimization (RWKV Project, EleutherAI) to academic research (universities) and applied AI (Recursal AI).

1.3. Journal/Conference

The paper is published as a preprint on arXiv (arXiv:2503.14456). While arXiv is not a peer-reviewed journal or conference, it is a highly influential platform for rapid dissemination of research in physics, mathematics, computer science, and other fields. Papers posted here often undergo community scrutiny and may later be submitted to formal venues. The publication date (2025-03-18) indicates it's a very recent work.

1.4. Publication Year

2025

1.5. Abstract

The paper introduces RWKV-7 "Goose", a novel sequence modeling architecture designed to overcome the limitations of traditional Transformers, particularly their quadratic scaling with sequence length. RWKV-7 achieves constant memory usage and inference time per token, making it highly efficient for long sequences. Despite being trained on significantly less data than competitor models, the 2.9 billion parameter version of RWKV-7 establishes a new State-of-the-Art (SoTA) for 3B parameter models on multilingual tasks and matches the current 3B SoTA on English language benchmarks.

Key innovations in RWKV-7 include a generalized delta rule with vector-valued gating and in-context learning rates, along with a relaxed value replacement rule. The authors demonstrate RWKV-7's capability for state tracking and its ability to recognize all regular languages, a computational power that exceeds that of Transformers under standard complexity conjectures (which are limited to $\mathsf{TC}^0$).

To support RWKV-7's language modeling capabilities, the paper also introduces RWKV World v3, an extended open-source multilingual corpus of 3.1 trillion tokens. Four RWKV-7 models, ranging from 0.19 billion to 2.9 billion parameters, were trained on this new dataset. To promote transparency and reproducibility, the models, dataset component listings, and training/inference code are open-sourced under the Apache 2.0 License.

2. Executive Summary

2.1. Background & Motivation

The field of sequence modeling has been largely dominated by Transformer architectures, particularly for autoregressive tasks like language modeling. Transformers excel at in-context processing and offer highly parallelizable training, primarily due to their softmax attention mechanism. However, this very mechanism is also their Achilles' heel: it incurs quadratic computational complexity and memory usage with respect to the sequence length. This quadratic scaling makes Transformer inference increasingly costly and impractical for very long sequences, a critical limitation in many real-world applications.

This limitation has spurred significant research into alternative architectures, particularly recurrent neural networks (RNNs) and State Space Models (SSMs), that aim to achieve linear computational complexity and constant memory usage per token during inference, while retaining efficient parallel training capabilities. These models typically rely on compressive states—fixed-size representations of past information that are updated recurrently.

One notable line of research has focused on linear attention variants and architectures incorporating the delta rule. The delta rule, originally from adaptive filtering, enables models to explicitly learn and update a key-value compressive state, allowing for both adding new memories and selectively removing old ones. Previous RWKV models (e.g., RWKV-4, RWKV-5, RWKV-6) have shown increasing potential to rival Transformers in performance while significantly reducing inference costs.

The core problem this paper aims to solve is to further enhance the efficiency and expressivity of RNN-based models to match or exceed Transformer performance at comparable scales, especially for long-context scenarios, without sacrificing their superior inference characteristics (constant memory and linear time). The challenge lies in designing a recurrent update mechanism that is both powerful enough to capture complex dependencies and efficient enough for large-scale deployment, while also being parallelizable for training.

The paper's innovative idea centers on generalizing the delta rule to create a more expressive and dynamically evolving state update mechanism for RWKV-7, moving beyond scalar learning rates and fixed decay mechanisms seen in prior delta rule variants.

2.2. Main Contributions / Findings

The paper makes several significant contributions:

  • The RWKV-7 "Goose" Architecture: Introduction of a novel RNN architecture that dramatically improves downstream benchmark performance over its predecessor, RWKV-6. It achieves state-of-the-art (SoTA) multilingual performance and near SoTA English language performance for 3B parameter models, despite being trained on significantly fewer tokens than competing top models. This demonstrates its superior parameter and data efficiency.

  • Generalized Delta Rule with Enhanced Expressivity: RWKV-7 introduces a new formulation of the delta rule that includes vector-valued state gating and in-context learning rates, as well as a relaxed value replacement rule. These modifications allow for more flexible and channel-wise state updates, enhancing the model's ability to selectively modify its memory.

  • Theoretical Expressivity Beyond $\mathsf{TC}^0$: The paper provides theoretical proofs that RWKV-7 can perform state tracking and recognize all regular languages. Under standard complexity conjectures (specifically, $\mathsf{TC}^0 \neq \mathsf{NC}^1$), this capability exceeds that of Transformers, which are limited to $\mathsf{TC}^0$. This includes solving an $\mathsf{NC}^1$-complete state tracking problem (swaps on 5 elements) with a single layer and recognizing any regular language with a constant number of layers.

  • The RWKV World v3 Public Dataset: Release of a new, expanded multilingual corpus totaling 3.1 trillion tokens, designed to enhance performance across English, code, and multilingual tasks. This dataset helps address the data scale gap with modern large language models.

  • Publicly Released Pre-Trained Models: Release of four RWKV-7 models (0.19B to 2.9B parameters) trained on the RWKV World v3 dataset, and three RWKV-7 Pile models (0.17B to 1.47B parameters) trained on The Pile dataset. These models are open-sourced under the Apache 2.0 License, fostering research and adoption.

  • Efficient Model Upgrade Method: A method is presented for upgrading existing RWKV models (e.g., from RWKV-5/RWKV-6 checkpoints) to the RWKV-7 architecture without training from scratch, reducing computational expense while producing competitive models.

  • Multimodal Capabilities: Demonstration of RWKV-7's versatility through adaptations to VisualRWKV-7 (for image understanding) and AudioRWKV-7 (for audio embedding analysis), showing competitive or superior performance to Transformer-based and other RNN-based multimodal models.

    In essence, RWKV-7 addresses the critical balance between efficiency and expressivity in sequence modeling, offering a compelling RNN-based alternative to Transformers with strong theoretical and empirical backing, coupled with open-source releases to accelerate community research.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand RWKV-7, a grasp of several fundamental concepts in neural networks and sequence modeling is essential:

  • Recurrent Neural Networks (RNNs):

    • Conceptual Definition: RNNs are a class of neural networks designed to process sequential data (like text, speech, time series) by maintaining an internal hidden state that captures information from previous time steps. Unlike feedforward networks, RNNs have connections that loop back on themselves, allowing information to persist.
    • How it works: At each time step $t$, an RNN takes an input $x_t$ and its previous hidden state $h_{t-1}$ to compute a new hidden state $h_t$ and an output $y_t$. The same set of weights is used across all time steps, enabling RNNs to handle sequences of arbitrary length.
    • Basic Update Equation: $ h_t = f(W_h h_{t-1} + W_x x_t + b_h) $, $ y_t = W_y h_t + b_y $
      • $h_t$: Hidden state at time $t$
      • $x_t$: Input at time $t$
      • $y_t$: Output at time $t$
      • $W_h$, $W_x$, $W_y$: Weight matrices
      • $b_h$, $b_y$: Bias vectors
      • $f$: Non-linear activation function (e.g., tanh or ReLU)
    • Limitations: Traditional RNNs suffer from vanishing or exploding gradients, making it difficult to learn long-range dependencies. Variants like LSTMs and GRUs address this with gating mechanisms.
  • Transformers and Attention Mechanism:

    • Conceptual Definition: Transformers revolutionized sequence modeling by replacing recurrence with attention mechanisms. They process all tokens in a sequence simultaneously, allowing them to capture long-range dependencies much more effectively than traditional RNNs.
    • Attention Mechanism: The core innovation is self-attention, which allows each token in a sequence to "attend" to (i.e., weigh the importance of) all other tokens in the same sequence. This is done by computing query (Q), key (K), and value (V) vectors for each token.
    • Standard Attention Formula (softmax attention): $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $
      • $Q$: Query matrix (from current tokens, asking "what am I looking for?")
      • $K$: Key matrix (from all tokens, providing "what I have")
      • $V$: Value matrix (from all tokens, providing "information associated with what I have")
      • $d_k$: Dimension of the key vectors (used for scaling)
      • $\mathrm{softmax}$: Normalization function to get attention weights.
    • Limitations: The softmax attention mechanism computes a matrix of attention scores of size sequence length $\times$ sequence length, leading to quadratic computational complexity ($O(N^2)$) and memory usage ($O(N^2)$) with respect to sequence length ($N$). This makes Transformers expensive for very long sequences.
  • State Space Models (SSMs):

    • Conceptual Definition: SSMs are a class of models rooted in control theory, adapted for sequence modeling. They maintain a continuous latent state that evolves over time, summarizing the history of the input. They aim to combine the efficiency of RNNs (linear complexity) with the ability to capture long-range dependencies, often through specialized structured state space formulations.
    • How it works (simplified): An SSM can be seen as having a hidden state $h_t$ that is updated via linear dynamics ($h_t = A h_{t-1} + B x_t$) and then mapped to an output ($y_t = C h_t + D x_t$). The matrices $A$, $B$, $C$, $D$ can be learned.
    • Motivation: SSMs like S4 and Mamba aim to provide linear scaling for inference and parallelizable training (via their convolutional representation), making them a promising alternative to Transformers.
  • Delta Rule:

    • Conceptual Definition: The delta rule (also known as the Widrow-Hoff learning rule) is a fundamental algorithm in neural networks for updating weights to minimize the error between a network's output and a target output. It's akin to a single step of stochastic gradient descent (SGD).
    • Application in Sequence Models: In the context of recurrent sequence models, the delta rule is applied to update a key-value compressive state. The state acts as a memory, where for a given input key, the model learns to retrieve a corresponding value. The update rule allows the model to "add" new key-value associations and "remove" (or diminish) old ones, addressing the fixed-size state limitation of simple linear attention which only adds to the state.
    • Core Idea: It reframes state updates as an online learning problem, where the state $S_t$ is trained at test time to output desired values $\nu_t$ for keys $k_t$ (see the sketch after this list).
  • Complexity Classes ($\mathsf{TC}^0$, $\mathsf{NC}^1$):

    • Conceptual Definition: These are computational complexity classes used to classify the difficulty of problems based on the resources (time, memory, parallel operations) required to solve them.
    • $\mathsf{TC}^0$: This class consists of problems solvable by constant-depth, polynomial-size circuits using threshold gates. Transformers and many SSMs (especially those with diagonal transition matrices) are conjectured to be limited to $\mathsf{TC}^0$. Problems in $\mathsf{TC}^0$ can perform parallel counting and simple arithmetic, but struggle with tasks requiring unbounded state tracking or complex sequential logic.
    • $\mathsf{NC}^1$: This class contains problems solvable by logarithmic-depth, polynomial-size circuits using AND, OR, NOT gates. $\mathsf{NC}^1$ is considered more powerful than $\mathsf{TC}^0$ ($\mathsf{TC}^0 \subseteq \mathsf{NC}^1$, and it is widely conjectured that $\mathsf{TC}^0 \neq \mathsf{NC}^1$). Problems in $\mathsf{NC}^1$ include group multiplication and recognizing all regular languages, which require more sophisticated state tracking than $\mathsf{TC}^0$ allows.
    • Significance: Proving that RWKV-7 can solve $\mathsf{NC}^1$-complete problems (like group multiplication) or recognize all regular languages suggests it has fundamentally greater expressive power than architectures limited to $\mathsf{TC}^0$.
  • Token Shift:

    • Conceptual Definition: A mechanism used in RWKV models where, for a given input token, a portion of its features are linearly interpolated with features from the previous token. This provides a form of local context or short 1D convolution without increasing computational complexity.
    • Purpose: Allows the model to consider information from the immediate past token, potentially aiding in induction heads (a type of circuit that copies previous inputs) and capturing local dependencies.
  • Low-Rank MLPs:

    • Conceptual Definition: A type of Multi-Layer Perceptron (MLP) where the hidden layer has a significantly smaller dimension compared to the input and output dimensions.
    • Purpose: Used to implement data dependency with minimal parameters. By projecting inputs into a lower-dimensional space, applying a non-linearity, and then projecting back, low-rank MLPs can create complex, input-dependent transformations while keeping the parameter count and computational cost low.
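To make the delta rule concrete, here is a minimal sketch (PyTorch; shapes, names, and the learning-rate argument are illustrative assumptions, not the paper's implementation) contrasting a plain linear-attention state update with a delta-rule update on a key-value compressive state:

```python
import torch

d = 8                      # head dimension (illustrative)
S = torch.zeros(d, d)      # compressive key-value state

def linear_attention_step(S, k, v):
    # Additive update: old associations are never removed, only diluted.
    return S + torch.outer(v, k)

def delta_rule_step(S, k, v, lr=0.5):
    # Online-learning view: nudge the state so that S @ k moves toward v.
    v_old = S @ k                        # value currently stored for key k
    return S + lr * torch.outer(v - v_old, k)

k = torch.randn(d); k = k / k.norm()
v = torch.randn(d)
S = delta_rule_step(S, k, v, lr=1.0)
print(torch.allclose(S @ k, v, atol=1e-5))   # True: with lr=1 and unit-norm k, v is stored exactly
```

With a unit-norm key and learning rate 1, the delta-rule step stores the new value exactly, whereas the purely additive update can only superimpose it on whatever was already associated with that key.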

3.2. Previous Works

The paper builds upon a rich history of RNN and linear attention research, explicitly referencing several key models:

  • Linear Attention Variants (e.g., RWKV-4, RWKV-5, RWKV-6, RetNet, Gated Linear Attention (GLA), Mamba-2):

    • Core Idea: These models aim to approximate softmax attention with a linear kernel (linear attention) or use recurrent formulations to achieve $O(N)$ inference time and $O(1)$ memory. They maintain a fixed-size state that aggregates past information.
    • Limitation (addressed by delta rule): Early linear attention models often suffered from the "numerically increasing state" problem: old state contents were never truly "removed," only diluted. This can lead to state mixing or muddying of information over long sequences. Modern variants introduce per-time-step decay (e.g., RetNet, Mamba-2) to mitigate this, but decay is a "blunt tool" that cannot selectively remove specific memories associated with particular keys.
    • RWKV (Receptance Weighted Key Value): A family of RNN-based models.
      • RWKV-4 (Peng et al., 2023): Marked a significant step, showing that RNNs could rival Transformers in performance. Introduced token-shift and a form of vector-valued decay.
      • RWKV-5 & RWKV-6 (Peng et al., 2024a, 2024b): Continued architectural improvements, with RWKV-6 introducing matrix-valued states and dynamic recurrence. RWKV-6c (a sub-version) attempted to address state normalization issues.
    • RetNet (Sun et al., 2023): Another linear attention model that employs a per-time-step decay mechanism to manage state growth.
    • Gated Linear Attention (GLA) (Yang et al., 2023a): Further refined linear attention with gating mechanisms.
    • Mamba 2 (Dao & Gu, 2024): A state-space model that also incorporates per-time-step decay and selective state spaces.
  • DeltaNet (Schlag et al., 2021):

    • Core Idea: The first to apply the Error Correcting Delta Rule to key-value compressive states in the context of RNN-like sequence models. It directly addresses the state mixing problem by explicitly "replacing" values.
    • Update Rule: $ S_t = S_{t-1}(I - a k_t^T k_t) + a \nu_t^T k_t $
      • $S_t$: The state matrix at time $t$.
      • $I$: Identity matrix.
      • $a$: A scalar learning rate (how much of the old value is removed and the new value is added).
      • $k_t$: Current input key vector.
      • $\nu_t$: Desired output value vector for $k_t$.
    • Significance: Allowed for partial replacement of values associated with a specific key, enabling true "forgetting" and "learning" in the state. Showed diagonal plus low-rank (DPLR) state evolution, enabling parallelizable training.
  • Concurrent Work (building on DeltaNet or similar ideas):

    • Longhorn (Liu et al., 2024): Approximates a globally optimal update objective, applied to a Mamba architecture.
    • Gated Delta Networks (Yang et al., 2024a): Applies gating to the DeltaNet state, multiplying the transition matrix by a data-dependent scalar per head. Combines delta rule with scalar decay.
      • Update Rule: $ S_t = S_{t-1}(\mathrm{diag}(w_t) - k_t^T k_t \mathrm{diag}(a_t)) + \nu_t^T k_t \mathrm{diag}(a_t) $
        • $w_t$: data-dependent scalar decay.
        • $a_t$: data-dependent scalar learning rate. (A sketch contrasting the DeltaNet and Gated Delta updates follows this list.)
    • TTT (Test-Time Training) (Sun et al., 2024) and Titans (Behrouz et al., 2024): Both apply scalar decay but use batched multi-timestep approaches instead of per-step gradient descent updates. Titans also adds momentum.
    • Unlocking State-Tracking in Linear RNNs Through Negative Eigenvalues (Grazzi et al., 2024): Explores the expressiveness gained by allowing the state transition matrix to have negative eigenvalues, hinting at richer dynamics.
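The two update rules quoted above can be sketched directly. The code below is a hedged reading of those equations, treating keys and values as row vectors and the decay and learning rate as scalars; the conventions and function names are ours, not taken from the papers' code:

```python
import torch

# Hedged sketches of the quoted update rules; k, v are 1-D tensors, S is (d, d).
def deltanet_step(S, k, v, a):
    # DeltaNet: S_t = S_{t-1} (I - a k^T k) + a v^T k, with scalar learning rate a.
    I = torch.eye(k.shape[0])
    return S @ (I - a * torch.outer(k, k)) + a * torch.outer(v, k)

def gated_deltanet_step(S, k, v, w, a):
    # Gated Delta Networks: S_t = S_{t-1} (diag(w_t) - k^T k diag(a_t)) + v^T k diag(a_t),
    # with scalar (per-head) decay w and scalar learning rate a.
    I = torch.eye(k.shape[0])
    return S @ (w * I - torch.outer(k, k) * a) + torch.outer(v, k) * a
```

RWKV-7's generalization (Section 4) replaces the scalars w and a with per-channel vectors and decouples the removal key from the replacement key.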

3.3. Technological Evolution

The field of sequence modeling has undergone a significant evolution:

  1. Early RNNs (e.g., vanilla RNN, LSTM, GRU): Focused on sequential processing and state memory. Suffered from gradient issues and limited parallelism.
  2. Attention Mechanisms (Transformers): Revolutionized by enabling parallel processing and global context capture, leading to rapid advancements in language understanding. However, introduced quadratic scaling bottlenecks.
  3. Efficiency-Focused Models (Linear Attention, SSMs): A response to Transformer limitations, aiming to regain linear scaling for inference while retaining parallel training. Examples include RWKV, RetNet, Mamba, GLA. These often struggle with state mixing or require complex decay mechanisms.
  4. Delta Rule Integration: DeltaNet introduced an explicit replacement mechanism for state updates, allowing for more precise memory management than simple decay. This marked a shift towards online learning paradigms within RNN states.
  5. Generalized Delta Rule (RWKV-7): The current paper represents an evolution of the delta rule by making its components (gating, learning rates) vector-valued and data-dependent, and by decoupling removal and replacement keys. This aims to enhance expressivity and control over the state beyond previous scalar-based or fixed-decay approaches. The theoretical demonstration that RWKV-7 achieves $\mathsf{NC}^1$ expressivity, while previous efficient models are often confined to $\mathsf{TC}^0$, positions it as a significant step in pushing the computational boundaries of RNN-like architectures.

3.4. Differentiation Analysis

RWKV-7 differentiates itself from previous RNN-based and delta rule-inspired architectures in several key ways:

  • Generalized Delta Rule Formulation:

    • RWKV-7's update: $S_t = S_{t-1}(\mathrm{diag}(w_t) + z_t^T b_t) + \nu_t^T k_t$. This is a more general diagonal-plus-rank-one update than that of DeltaNet or Gated Delta Networks.
    • Vector-valued Gating and In-Context Learning Rates: Unlike DeltaNet, Gated Delta Networks, TTT, or Titans, which use scalar decay ($w_t$) or learning rates ($a_t$), RWKV-7 employs vector-valued $w_t$ and $a_t$. This means each channel or dimension within the state can have its own independent decay and learning rate, allowing for much finer-grained control over information flow and memory updates. This is a significant expressive boost over scalar approaches.
    • Decoupled Removal and Replacement Keys: RWKV-7 explicitly separates the removal key ($\hat{\kappa}_t$) from the replacement key ($\tilde{k}_t$), and the in-context learning rate ($a_t$) from the replacement rate booster ($\alpha$). This allows the model to "remove" information associated with one key (or a transformed version of it) while "adding" information based on another, or to control the amount added independently. This is more flexible than prior delta rule formulations, where the removal and addition amounts were tightly coupled by a single scalar learning rate.
  • Expressivity Beyond $\mathsf{TC}^0$:

    • The paper formally proves that RWKV-7 can recognize all regular languages and solve $\mathsf{NC}^1$-complete problems (like tracking swaps on five elements) with a constant number of layers. This is a crucial theoretical advantage over Transformers and many SSMs (like S4 and Mamba), which are conjectured to be limited to $\mathsf{TC}^0$. This implies RWKV-7 can inherently handle more complex state-tracking problems.
    • This expressive power comes from RWKV-7's non-diagonal and input-dependent transition matrix, particularly its ability to represent the "copy" state transition.
  • Modified Architecture Components:

    • Simplified Token Shift: Reverted to a simpler, non-data-dependent token-shift from RWKV-6 to improve training speed, making a pragmatic trade-off.
    • Modified MLP Module: Removed the receptance gating matrix and expanded the hidden dimension to maintain parameter count, indicating a streamlined feedforward component.
    • Low-Rank Projections: Increased use of low-rank projections for intermediate calculations, optimizing the balance between parameters, speed, and performance.
  • Numerical Stability: The modifications in RWKV-7 (e.g., restricting the entries of $w_t$, the replacement rate booster) lead to better numerical stability of the WKV state, preventing values from accumulating into the thousands, a problem observed in earlier RWKV-6 versions.

    In summary, RWKV-7 advances the RNN paradigm by offering a more generalized, expressive, and numerically stable delta rule implementation, providing stronger theoretical guarantees for complex state tracking while maintaining linear scaling properties.

4. Methodology

4.1. Principles

The core principle behind RWKV-7 "Goose" is to overcome the quadratic complexity of Transformer attention by employing a highly efficient recurrent neural network (RNN) architecture that scales linearly with sequence length and maintains constant memory usage per token during inference. This is achieved through a dynamically evolving, matrix-valued state that learns and adapts in-context.

The central theoretical basis is the generalized delta rule. Instead of simply summing information into a state (as in some linear attention models), the delta rule enables explicit online learning at test time. The model's state acts as a programmable memory, where new key-value associations can be added, and older, less relevant information can be selectively removed or updated. This "compressive state" mechanism is crucial for retaining long-range dependencies within a fixed memory footprint.

RWKV-7 extends this delta rule by making its key components vector-valued and data-dependent, rather than scalar. This allows for a much richer and fine-grained control over how information is stored and retrieved across different channels of the state. The architecture is designed to be fully parallelizable during training, leveraging efficient CUDA kernels despite its recurrent nature.

A significant intuition is that by enabling explicit state manipulation (e.g., selective replacement, vector-valued learning rates, decoupled keys), the model gains greater expressive power. This is formally backed by proofs that RWKV-7 can recognize all regular languages and solve $\mathsf{NC}^1$-complete problems, which are beyond the capabilities of architectures limited to $\mathsf{TC}^0$, like Transformers under common complexity conjectures. This expressive dynamic state evolution allows RWKV-7 to perform more sophisticated state tracking and in-context learning.

4.2. Core Methodology In-depth (Layer by Layer)

The RWKV-7 architecture processes sequences by stacking multiple identical RWKV-7 blocks. Each block consists of a Time Mixing module and an MLP module, with LayerNorm applications interspersed. The overall architecture is depicted in Figure 1 and Figure 11.

The following figure (Figure 1 from the original paper) presents the overall architecture of RWKV-7. Please refer to Appendix F for more details.

Figure 1: RWKV-7's overall architecture.

The following figure (Figure 11 from the original paper) shows the detailed architecture of RWKV-7.

Figure 11: The architecture of RWKV-7, drawn in detail.

4.2.1. Time Mixing Module

The Time Mixing module is the core of RWKV-7, responsible for its recurrent state evolution. It processes the input token by first applying token-shift operations and then computing various data-dependent parameters that control the Weighted Key Value (WKV) state update.

1. Token Shift: The module starts by preparing a token-shifted input: $x_i^\sharp = \mathrm{lerp}(x_i, x_{i-1}, \mu_1)$

  • $x_i$: The current input feature vector for the token at position $i$.
  • $x_{i-1}$: The input feature vector for the token at position $i-1$.
  • $\mu_1$: A learned parameter controlling the interpolation amount.
  • $\mathrm{lerp}(A, B, \mu) = (1-\mu)A + \mu B$: Linear interpolation function. The token-shifted input $x_i^\sharp$ is a linear interpolation between the current token's input $x_i$ and the previous token's input $x_{i-1}$. This provides a form of local context or short 1D convolution without data dependency (unlike RWKV-6). The shift_state variable in the pseudocode (Appendix G, line 11) is used to store the last token's input for the next time step.
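A minimal sketch of this non-data-dependent token shift, assuming an input of shape (T, D) and a learned per-channel mixing vector mu (names are ours, not the reference code):

```python
import torch

def token_shift(x: torch.Tensor, mu: torch.Tensor) -> torch.Tensor:
    # Shift the sequence right by one token (position 0 sees zeros), then mix per channel.
    x_prev = torch.nn.functional.pad(x, (0, 0, 1, 0))[:-1]
    return torch.lerp(x, x_prev, mu)       # (1 - mu) * x + mu * x_prev
```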

2. Weight Preparation: Following token shift, several data-dependent parameters are computed. These parameters control the WKV state evolution, acting as gates or modifiers. RWKV-7 leverages low-rank MLPs (loramlp) for efficiency in computing these parameters.

The loramlp function is defined as: $ \mathrm{loramlp}_{\Pi}(f, x, \mathrm{bias}) = f(x A_{\Pi}) B_{\Pi} + (\lambda_{\Pi} \text{ if bias else } 0) $

  • ff: An activation function.
  • xx: The input vector.
  • $A_{\Pi}, B_{\Pi}$: Weight matrices for the low-rank MLP.
  • $\lambda_{\Pi}$: A bias term (optional, depending on the bias flag). This function represents a 2-layer MLP whose hidden dimension is smaller than the input/output dimensions, minimizing parameters.
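A hedged sketch of such a low-rank MLP, following the formula above (class name, activation choice, and initialization are illustrative assumptions):

```python
import torch
import torch.nn as nn

class LoraMLP(nn.Module):
    def __init__(self, d_model: int, d_hidden: int, activation=torch.tanh, bias: bool = True):
        super().__init__()
        self.A = nn.Parameter(torch.randn(d_model, d_hidden) * 0.01)      # down-projection A_Pi
        self.B = nn.Parameter(torch.zeros(d_hidden, d_model))             # up-projection B_Pi
        self.lam = nn.Parameter(torch.zeros(d_model)) if bias else None   # optional bias lambda_Pi
        self.activation = activation

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.activation(x @ self.A) @ self.B
        return out + self.lam if self.lam is not None else out
```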

The following parameters are computed (Appendix G, lines 14-19):

  • Receptance ($r$): Acts like a query in Transformers. $ x_{\mathrm{receptance}} = \mathrm{lerp}(x, x_{\mathrm{shifted}}, \mathrm{params.mu_r}) $ $ r = x_{\mathrm{receptance}} @ \mathrm{params.W_{receptance}} $
    • params.mu_r: Learned interpolation parameter.
    • params.W_receptance: Weight matrix.
  • Decay ($d$): Used to compute the in-context weight decay ($w_t$). $ x_{\mathrm{decay}} = \mathrm{lerp}(x, x_{\mathrm{shifted}}, \mathrm{params.mu_d}) $ $ d = \mathrm{params.decay_lora}(x_{\mathrm{decay}}) $
    • params.mu_d: Learned interpolation parameter.
    • params.decay_lora: A low-rank MLP (Figure 11 shows dL).
  • Key ($k$): A precursor to both the removal key ($\hat{\kappa}$) and the replacement key ($\tilde{k}$). $ x_{\mathrm{key}} = \mathrm{lerp}(x, x_{\mathrm{shifted}}, \mathrm{params.mu_k}) $ $ k = x_{\mathrm{key}} @ \mathrm{params.W_{key}} $
    • params.mu_k: Learned interpolation parameter.
    • params.W_key: Weight matrix.
  • Value Precursor (vprime): An initial value for value residual learning. $ x_{\mathrm{value}} = \mathrm{lerp}(x, x_{\mathrm{shifted}}, \mathrm{params.mu_v}) $ $ \mathrm{vprime} = x_{\mathrm{value}} @ \mathrm{params.W_{value}} $
    • params.mu_v: Learned interpolation parameter.
    • params.W_value: Weight matrix.
  • In-Context Learning Rate (ICLR) Precursor: Used to compute the in-context learning rate ($a$). $ x_{\mathrm{iclr}} = \mathrm{lerp}(x, x_{\mathrm{shifted}}, \mathrm{params.mu_a}) $ $ \mathrm{iclr\_raw} = \mathrm{params.iclr_lora}(x_{\mathrm{iclr}}) $
    • params.mu_a: Learned interpolation parameter.
    • params.iclr_lora: A low-rank MLP (Figure 11 shows da). The final in-context learning rate $a$ is then obtained by a sigmoid activation, $a = \mathrm{iclr\_raw.sigmoid()}$, to restrict its elements to $(0,1)$.
  • Gate Precursor: Used to compute the rwkv gate ($g$). $ x_{\mathrm{gate}} = \mathrm{lerp}(x, x_{\mathrm{shifted}}, \mathrm{params.mu_g}) $ $ \mathrm{gate\_raw} = \mathrm{params.gate_lora}(x_{\mathrm{gate}}) $
    • params.mu_g: Learned interpolation parameter.
    • params.gate_lora: A low-rank MLP (Figure 11 shows dg). The final rwkv gate $g$ is then obtained by a sigmoid activation: $g = \mathrm{gate\_raw.sigmoid()}$.

The raw value $v$ is derived from vprime. For the first layer (layer_id == 0), $v$ is simply vprime. For subsequent layers, value residual learning is applied: $ \mathrm{value\_residual\_gate} = \mathrm{th.sigmoid}(\mathrm{params.nu\_lora}(x_{\mathrm{value}})) $ $ v = \mathrm{th.lerp}(\mathrm{vprime}, \mathrm{vprime}_0, \mathrm{value\_residual\_gate}) $

  • params.nu_lora: A low-rank MLP (Figure 11 shows dnu).
  • $\mathrm{vprime}_0$: The vprime from the first layer, carried forward. This allows later layers to learn to blend their own vprime with the vprime from the initial layer, which can be seen as a form of residual connection for the value.

Next, the actual decay $w_t$, removal key $\hat{\kappa}_t$, and replacement key $\tilde{k}_t$ are formed:

  • Decay ($w_t$): $ w_t = \mathrm{th.exp}(-\mathrm{math.exp}(-0.5) * d.\mathrm{to(th.float).sigmoid()}) $
    • This formula ensures all entries of $w_t$ lie within $(\exp(-e^{-0.5}), 1)$, which is approximately $(0.545, 1)$. This range is chosen to maintain training stability and a smaller condition number for $\mathrm{diag}(w_t)$.
  • Removal Key ($\hat{\kappa}_t$): $ \mathrm{removal\_k\_raw} = k * \mathrm{params.removal\_key\_multiplier} $ $ \hat{\kappa}_t = \mathrm{F.normalize}(\mathrm{removal\_k\_raw.view}(B,T,H,-1), \mathrm{dim}=-1).\mathrm{view}(B,T,C) $
    • params.removal_key_multiplier ($\xi$ in the paper): A learned parameter to scale the key.
    • F.normalize: The removal key is L2-normalized per head. This is crucial for the delta rule variations to prevent unwanted changes in the amount removed due to implicit squaring of its length.
  • Replacement Key ($\tilde{k}_t$): $ \tilde{k}_t = \mathrm{th.lerp}(k, k * \mathrm{iclr}, \mathrm{params.iclr\_mix\_amt}) $
    • params.iclr_mix_amt ($\alpha$ in the paper): A learned replacement rate booster that interpolates between the raw key $k$ and the key scaled by the in-context learning rate iclr. This allows dynamic control over the amount added to the state, per channel.
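Putting these three quantities together, the following is an illustrative sketch of the weight preparation step (tensor shapes and argument names are assumptions, not the reference code):

```python
import math
import torch
import torch.nn.functional as F

# d, k, iclr are per-token tensors of shape (H, N) = (heads, head_dim).
def prepare_wkv_inputs(d, k, iclr, removal_key_multiplier, iclr_mix_amt):
    # Decay: entries constrained to (exp(-e^{-0.5}), 1) ~ (0.545, 1) for stability.
    w = torch.exp(-math.exp(-0.5) * d.sigmoid())
    # Removal key: scaled, then L2-normalized per head so the amount removed stays controlled.
    removal_k = F.normalize(k * removal_key_multiplier, dim=-1)
    # Replacement key: interpolate between k and k * iclr, per channel.
    replacement_k = torch.lerp(k, k * iclr, iclr_mix_amt)
    return w, removal_k, replacement_k
```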

3. Weighted Key Value (WKV) State Evolution: The WKV is a multi-headed matrix-valued state (fast weights) that undergoes dynamic evolution. The WKV state (wkv_state in the pseudocode) is crucial for encoding context information by learning key-value mappings. The state update is defined by the recurrence relation: $ \mathbf{wkv}_0 = \mathbf{0} $, $ \mathbf{wkv}_t = \mathbf{wkv}_{t-1}\left(\mathrm{diag}(w_t) - \hat{\kappa}_t^T (a_t \odot \hat{\kappa}_t)\right) + \nu_t^T \tilde{k}_t $

  • $\mathbf{wkv}_t$: The WKV state matrix at time $t$, with dimensions $(D/h) \times (D/h)$ for each head.

  • $\mathbf{wkv}_{t-1}$: The WKV state from the previous time step.

  • $\mathrm{diag}(w_t)$: A diagonal matrix whose diagonal elements are the vector-valued decay $w_t$.

  • $\hat{\kappa}_t^T$: Transpose of the normalized removal key vector.

  • $a_t$: Vector-valued in-context learning rate.

  • $\odot$: Element-wise product (Hadamard product).

  • $\nu_t^T$: Transpose of the value vector.

  • $\tilde{k}_t$: Replacement key vector.

    The term $G_t = \mathrm{diag}(w_t) - \hat{\kappa}_t^T (a_t \odot \hat{\kappa}_t)$ is the transition matrix. This matrix is a diagonal plus rank-one update, which allows for efficient parallelization. The transition matrix is no longer a Householder matrix but a scaled approximation, offering expanded dynamics while keeping its eigenvalues in the stable range $[-1, 1]$. This formulation combines dynamic state evolution with an approximation of a forget gate.

The recurrent formulation can also be written in a parallel manner (Appendix G, lines 41-52 demonstrate the recurrent loop for clarity, but parallel implementations exist): $ \mathbf{wkv}_t = \sum_{i=1}^{t} \left( \nu_i^T \tilde{k}_i \prod_{j=i+1}^{t} \left( \mathrm{diag}(w_j) - \hat{\kappa}_j^T (a_j \odot \hat{\kappa}_j) \right) \right) \in \mathbb{R}^{(D/h) \times (D/h)} $ This formulation allows RWKV-7 to be trained in parallel across the time dimension, similar to Transformers or SSMs, despite its recurrent inference.
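For readability, here is a minimal single-head sketch of this recurrence as a plain Python loop, under assumed row-vector conventions (this is an illustration, not the parallel CUDA kernel):

```python
import torch

# All per-token vectors have shape (N,) = head dimension; wkv has shape (N, N).
def wkv_recurrence(w, removal_k, iclr, v, replacement_k, r):
    T, N = r.shape
    wkv = torch.zeros(N, N)
    out = torch.zeros(T, N)
    for t in range(T):
        # Transition: diag(w_t) - removal_k_t^T (iclr_t ⊙ removal_k_t), diagonal plus rank-1.
        G = torch.diag(w[t]) - torch.outer(removal_k[t], iclr[t] * removal_k[t])
        wkv = wkv @ G + torch.outer(v[t], replacement_k[t])   # decay/remove, then add v^T k~
        out[t] = wkv @ r[t]                                   # apply receptance to read the state
    return out
```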

4. WKV Bonus and Output: After the WKV state is updated, the receptance $r_t$ (query) is applied to the state to retrieve information: $ y_t = \mathrm{wkv\_state} @ r_t $

  • wkv_state: The current WKV state.
  • $r_t$: The receptance vector. The result $y_t$ is then passed through LayerNorm (specifically GroupNorm per head in the pseudocode, Appendix G, lines 56-57) to ensure numerical stability and consistent scaling: $ p_t = \mathrm{LayerNorm}(r_t \mathrm{wkv}_t^T) + u_t $ where $u_t$ is a bonus term: $ u_t = \left( r_t \cdot (\rho \odot \tilde{k}_t)^T \right) \nu_t $
  • $\rho$: A trainable parameter (params.bonus_multiplier in the pseudocode) which weights the bonus. This term ensures that the current shifted input token can receive extra attention without necessarily being stored in the state. The heads are then recombined (y.view(B, T, -1) in the pseudocode) to form $p_t \in \mathbb{R}^D$. Finally, this recombined output is gated and transformed into the module's output: $ o_t = (g_t \odot p_t) W_o \in \mathbb{R}^D $
  • $g_t$: The rwkv gate (computed in weight preparation).
  • $W_o$: Output weight matrix (params.W_output).

4.2.2. MLP Module (Channel Mixing)

The MLP module in RWKV-7 differs from previous RWKV versions. It is a two-layer feedforward network without the receptance gating matrix ($W_r$) found in RWKV-4, 5, and 6. The hidden dimension is set to $4D$ to compensate for the removed gating parameters and maintain a similar parameter count.

The MLP module operates as follows: $ k_t^\prime = \mathbf{lerp}(x_t^\prime, x_{t-1}^\prime, \mu_k^\prime) \mathbf{W}_{k^\prime} \in \mathbb{R}^{4D} $

  • $x_t^\prime$: Input to the MLP module.
  • $x_{t-1}^\prime$: Token-shifted input to the MLP module.
  • $\mu_k^\prime$: Learned interpolation parameter (params.mu_x in the pseudocode).
  • $\mathbf{W}_{k^\prime}$: Weight matrix (params.W_k). The activation function used is ReLU squared ($\mathrm{ReLU}(k_t^\prime)^2$). This non-linearity is then followed by a final linear transformation: $ o_t^\prime = \mathrm{ReLU}(k_t^\prime)^2 \mathbf{W}_{\nu^\prime} \in \mathbb{R}^D $
  • $\mathbf{W}_{\nu^\prime}$: Weight matrix (params.W_v). The output $o_t^\prime$ is then added back to the residual stream.
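A minimal sketch of this channel-mixing module, under the shape assumptions noted in the comments (names are illustrative):

```python
import torch

# x, x_prev: (T, D); mu_x: (D,); W_k: (D, 4D); W_v: (4D, D).
def channel_mix(x, x_prev, mu_x, W_k, W_v):
    xk = torch.lerp(x, x_prev, mu_x)        # token shift on the MLP input
    k = xk @ W_k                            # up-projection to the 4D hidden dimension
    return torch.relu(k).square() @ W_v     # ReLU^2 activation, then down-projection
```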

4.2.3. Pseudocode For RWKV-7 (Appendix G)

The provided pseudocode outlines the forward pass of RWKV-7. It demonstrates the sequential processing within a batch and across time steps for the Time Mixing and Channel Mixing modules.

rwkv_timemix function:

  • Lines 11-12: Implements the token-shift operation by concatenating the previous shift_state (last token of the previous sequence or batch) with the current sequence, then updating shift_state with the last token of the current sequence.
  • Lines 14-19: Compute the lerp for x_receptance, x_decay, x_key, x_value, x_iclr, x_gate using mu parameters.
  • Lines 21-26: Apply linear transformations (@ params.W_...) or low-rank MLPs (params...._lora) and sigmoid activation for iclr to get rr, dd, kk, vprime, gate, iclr.
  • Lines 29-33: Handle value residual learning. For layer_id == 0 (the first layer), $v$ is simply vprime. For other layers, $v$ is an interpolation between vprime and $\mathrm{vprime}_0$ (from the first layer), controlled by value_residual_gate.
  • Lines 35-38: Compute the decay ($w_t$), the normalized removal_k ($\hat{\kappa}_t$), and replacement_k ($\tilde{k}_t$) as described above.
  • Lines 42-52: The core WKV state evolution loop. For each time step $t$:
    • It extracts decay_t, iclr_t, removal_k_t, replacement_k_t, $v_t$, and $r_t$ for the current token.
    • Line 51: wkv_state = wkv_state * decay_t.mT - wkv_state @ removal_k_t @ (iclr_t * removal_k_t).mT. This updates the wkv_state based on decay_t and removes information using removal_k_t and iclr_t. Note that .mT (matrix transpose) implies these are $(N,1)$ column vectors becoming $(1,N)$ rows, so their products form $(N,N)$ matrices. The line is equivalent to wkv_state @ (diag(decay_t) - removal_k_t @ (iclr_t * removal_k_t).mT).
    • Line 52: wkv_state = wkv_state + v_t @ replacement_k_t.mT. This adds new information using $v_t$ and replacement_k_t, i.e., the outer product of the value and the replacement key.
  • Lines 53-54: Computes y = wkv_state @ r_t (applying receptance) and stores it in out.
  • Lines 56-57: Applies GroupNorm to $y$.
  • Lines 60-61: Calculates and adds the bonus term.
  • Line 63: Applies the rwkv gate and final output linear transformation.

rwkv_channelmix function:

  • Lines 69-70: token-shift for the MLP module.
  • Line 71: xk = th.lerp(x, x_shifted, params.mu_x). Interpolated input.
  • Line 72: k = params.W_k @ xk. First linear layer.
  • Line 73: v = params.W_v @ th.relu(k).square(). ReLU-squared activation and second linear layer.

rwkv_model function:

  • Lines 78-79: Initial embedding lookup and LayerNorm.
  • Lines 81-99: Loop through layers, applying rwkv_timemix and rwkv_channelmix to the residual stream.
  • Lines 101-102: Final LayerNorm and linear head for logits.

4.2.4. PyTorch code For Naive WKV7 Kernel (Forward and Backward) (Appendix H)

The provided WKV7_Kernel class in PyTorch illustrates a naive implementation of the WKV state evolution, including both forward and backward passes.

  • forward method:
    • Lines 12-13: self.state_cache stores the WKV state at each time step, starting from the initial state (line 13). This is crucial for the backward pass.
    • Line 14: W = torch.exp(-torch.exp(w.view(B, T, H, N))). This calculates the decay matrix elements (the diagonal elements of $\mathrm{diag}(w_t)$).
    • Lines 16-28: The loop iterates through time steps tt.
      • Lines 23-25: These lines implement the WKV state update:
        • state * W[:, t, :, None, :]: This applies the $\mathrm{diag}(w_t)$ term element-wise.
        • + torch.einsum('bhik,bhk,bhj->bhij', state, aa, bb): This term corresponds to $S_{t-1}(z_t^T b_t)$ in the generalized delta rule $S_t = S_{t-1}(\mathrm{diag}(w_t) + z_t^T b_t) + \nu_t^T k_t$. Here, $z_t^T b_t$ is computed as a rank-1 update from aa and bb. Note: the main text's formula $S_{t-1}(\mathrm{diag}(w_t) - \hat{\kappa}_t^T(a_t \odot \hat{\kappa}_t))$ maps onto this kernel by letting the rank-1 term $z_t^T b_t$ implement $-\hat{\kappa}_t^T(a_t \odot \hat{\kappa}_t)$.
        • + torch.einsum('bhj,bhi->bhij', kk, vv): This corresponds to $\nu_t^T \tilde{k}_t$ (written $\nu_t^T k_t$ in the kernel).
      • Line 28: out[:, t, :] = torch.einsum('bhj,bhij->bhi', rr, state). This applies the receptance rr to the state to produce the output out.
  • backward method:
    • Lines 32-38: Initializes gradients for rr, ww, kk, vv, aa, bb to zeros.
    • Lines 40-58: Iterates backward through time tt from T-1 down to 0. This is typical for RNN backpropagation, using the state_cache from the forward pass.
    • It computes gradients gr, gk, gv, ga, gb, gw using einsum for matrix multiplications involving the gout (gradient of output) and gstate (gradient of state) and cached states.
    • Line 57: gstate = torch.einsum('bhj,bhij->bhij', w[:, t, :], gstate) + torch.einsum('bhk,bhj,bhij->bhik', a[:, t, :], b[:, t, :], gstate). This updates gstate for the previous time step, effectively backpropagating through the recurrent state update.
    • Line 58: gw = -torch.exp(w0 - torch.exp(w0)) * gw. This calculates the gradient of $w$ with respect to w0 (the raw $w$ values).
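Based on the description above, a forward-only sketch of such a naive kernel might look as follows; the backward pass is omitted, and while the shapes and the log-log parameterization of the decay follow the description, this is an illustrative reconstruction rather than the repository's kernel:

```python
import torch

# r, w, k, v, a, b: tensors of shape (B, T, H, N); w is in log-log space.
def wkv7_forward(r, w, k, v, a, b):
    B, T, H, N = r.shape
    W = torch.exp(-torch.exp(w))                       # diagonal decay entries in (0, 1)
    state = torch.zeros(B, H, N, N, dtype=r.dtype)
    out = torch.zeros(B, T, H, N, dtype=r.dtype)
    for t in range(T):
        state = (state * W[:, t, :, None, :]                                    # apply diag(w_t)
                 + torch.einsum('bhik,bhk,bhj->bhij', state, a[:, t], b[:, t])  # rank-1 state term
                 + torch.einsum('bhj,bhi->bhij', k[:, t], v[:, t]))             # add v_t^T k_t
        out[:, t] = torch.einsum('bhj,bhij->bhi', r[:, t], state)               # read with receptance
    return out
```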

4.2.5. Expressivity of RWKV-7 (Appendix D)

This section provides theoretical proofs of RWKV-7's expressive power, claiming it surpasses $\mathsf{TC}^0$ and can recognize all regular languages.

D.1 Warmup: Expressivity Beyond TC0

  • Theorem 2: RWKV-7 can solve an $\mathsf{NC}^1$-complete problem under $\mathsf{AC}^0$ reductions. This is significant because Transformers and diagonal SSMs are limited to $\mathsf{TC}^0$, and it is conjectured that $\mathsf{TC}^0 \neq \mathsf{NC}^1$.
  • Lemma 1 (Arbitrary Swap Matrix): The RWKV-7 transition matrix can represent an arbitrary swap matrix (identity matrix with two rows swapped).
    • Proof Sketch: By setting $w_t = \mathbf{1}$, $c = 2$, $\hat{\kappa}_t = (e_x - e_y) / \sqrt{2}$, and $a_t = \mathbf{1}$, the transition matrix becomes $A_t = I - e_x^T e_x - e_y^T e_y + e_x^T e_y + e_y^T e_x$, which is the permutation matrix that swaps indices $x$ and $y$.
  • Lemma 2 (Tracking Swaps on 5 elements): A one-layer RWKV-7 model can track sequences of swaps on 5 elements and determine if the final permutation is the identity.
    • Proof Sketch: Uses 5 WKV heads of dimension 5.
      • A special beginning-of-sequence token initializes the $i$-th head state to $e_i^T e_i$ (representing state $i$).
      • Swap tokens are handled by setting parameters ($w=\mathbf{1}$, $c=2$, $\hat{\kappa}=(e_x-e_y)/\sqrt{2}$, $a=\mathbf{1}$, $\nu=\tilde{k}=\mathbf{0}$) such that the state correctly swaps $x$ and $y$ if the state was originally $x$ or $y$.
      • The MLP combines outputs to check if all 5 heads are back in their initial states (identity permutation).
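Lemma 1's construction is easy to verify numerically. The sketch below checks that the stated parameter choice yields exactly the swap permutation matrix (the dimension and indices are chosen arbitrarily for illustration):

```python
import torch

n, x, y = 5, 1, 3
e = torch.eye(n)
kappa = (e[x] - e[y]) / (2 ** 0.5)     # removal key (e_x - e_y)/sqrt(2)
a = torch.ones(n)                      # in-context learning rate vector of ones

# Transition matrix diag(w) - c * kappa^T (a ⊙ kappa) with w = 1 and c = 2.
A = torch.eye(n) - 2 * torch.outer(kappa, a * kappa)

swap = torch.eye(n)
swap[[x, y]] = swap[[y, x]]            # identity with rows x and y swapped
print(torch.allclose(A, swap))          # True
```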

D.2 Main Result: RWKV-7 Can Recognize Any Regular Language

  • Theorem 3: For any regular language, there exists a 4-layer RWKV-7 model that recognizes it.
    • Proof Strategy: Simulating a Deterministic Finite Automaton (DFA). A DFA is defined as a tuple $\mathcal{A} = (Q, \Sigma, \delta, q_0, F)$, where $Q$ is a finite set of states, $\Sigma$ is a vocabulary, $\delta_\sigma$ is a transition function, $q_0$ is the initial state, and $F$ is a set of accepting states. DFA computation can be represented by matrix products: $\alpha \cdot M_{w_1} \cdots M_{w_T} \cdot \omega^T$.
    • Challenge: An arbitrary DFA transition matrix can have rank $|Q|$, while a single WKV head only implements a simple rank-1 update.
    • Solution: Factor each DFA transition matrix into a product of $n = |Q|$ elementary transition matrices (Lemma 3), each of which can be directly implemented by WKV heads (Lemma 4). The RWKV-7 model uses its first three layers to compute these elementary matrices for blocks of DFA transitions and the fourth layer to multiply them. LayerNorm is key to preserving one-hot encoded information between layers.

D.3 Lemmas for Theorem 3

  • Lemma 3 (DFA Transition Matrix Factorization): Any DFA transition matrix $M$ (which has a single 1 in each column) can be factored into a product of $n$ elementary transition matrices ($M = G_1 G_2 \dots G_n$), where each $G_i$ is either an identity matrix, a swap matrix ($x \leftrightarrow y$), or a copy matrix ($x \to y$, replacing column $y$ with a copy of column $x$).
    • Proof Sketch: Greedily build $M$ from an identity matrix by right-multiplying elementary transition matrices in three stages: fixing mismatched columns by swaps, then by copies, then filling the remainder with identities.
  • Lemma 4 (Implementing Elementary Transition Matrices): For any elementary transition matrix $G$ (from Lemma 3), there exist vectors $\hat{\kappa}$ and $\vec{a}$ such that $G = \mathrm{diag}(w) - c \hat{\kappa}^T (\vec{a} \odot \hat{\kappa})$ with $c=2$ and $w=\mathbf{1}$.
    • Proof Sketch: Shows specific settings of $\hat{\kappa}$ and $\vec{a}$ for identity, swap, and copy matrices.
      • Identity: $\hat{\kappa} = e_1$, $\vec{a} = \mathbf{0}$.
      • Swap ($x \leftrightarrow y$): $\hat{\kappa} = (e_x - e_y) / \sqrt{2}$, $\vec{a} = \mathbf{1}$.
      • Copy ($x \to y$): $\hat{\kappa} = (e_x - e_y) / \sqrt{2}$, $\vec{a} = e_x$.
  • Lemma 5 (Position Tracking - First & Parity): A 1-layer RWKV-7 can output if the current position is first, and if it's even or odd.
    • Proof Sketch: Uses token-shift for "first". For parity, the WKV state is initialized and then updated by $(I - 2 e_1^T e_1)$ for $t \ge 2$, causing the state to flip sign, which receptance can read out.
  • Lemma 6 (Position Modulo 2n): A 2-layer RWKV-7 can output the position modulo 2n.
    • Proof Sketch: Builds on Lemma 5. Uses a rotation mechanism in the WKV state for odd $t$ ($(I - 2 \hat{\kappa}_t^T \hat{\kappa}_t)$ where $\hat{\kappa}_t$ rotates) and the identity for even $t$, allowing the state to track an angle. Multiple WKV heads with different receptance vectors can then read out the position modulo 2n.
  • Lemma 7 (Lookup Table Simulation): A layer of RWKV-7 can simulate a lookup table $\Xi[\tilde{t}, w_t, \dots, w_{t-(2n-1)}]$ that takes the current position modulo 2n and the 2n most recent tokens as keys. A 3-layer RWKV-7 can compute $\Xi$.
    • Proof Sketch: The WKV state is updated as $\mathbf{wkv}_t = \mathbf{wkv}_{t-1}(I - e_{\tilde{t}}^T e_{\tilde{t}}) + e_{w_t}^T e_{\tilde{t}}$. This effectively stores the last 2n tokens in the state. n WKV heads with receptance $e_i$ can read out the state, and an MLP layer then performs the lookup. Lemmas 5 and 6 provide $\tilde{t}$ and the recent tokens.

      Removing the Assumption $c=2$: The construction uses $c=2$ for elementary transition matrices, while the actual model uses $c=1$. The paper notes that halving $c$ and $w_t$ simply halves the transition matrix $A_t$. Since GroupNorm immediately follows, the magnitude of the WKV state does not affect the calculations. Also, for constructions requiring $c \neq 1$, $\nu_t = \tilde{k}_t = \mathbf{0}$ is used, so any mismatch in their scales is not an issue.

4.2.6. Additional Architectural and Training Details (Appendix E)

  • Parameters and Dimensions:
    • Model dimension $D$, number of layers $L$, number of heads $h = D/D_h$, vocabulary size $V$.
    • All models use head size $D_h = 64$.
    • Pile models: $V = 50304$ (GPT-NeoX 20B tokenizer).
    • World models: $V = 65536$ (RWKV World tokenizer).
    • Table 15 provides detailed parameters for released models.
    • RWKV-7 uses four low-rank MLPs for decay ($d_w$), in-context learning rate ($d_a$), value residual ($d_\nu$), and gate ($d_g$). Intermediate dimensions for these loramlps are listed in Table 16.
    • The total number of parameters is calculated by: $ \#(\mathrm{Params}) = 2DV + 4D + LD(12D + 2(d_w + d_a + d_\nu + d_g) + 19) - (2Dd_\nu + D) $
      • $2DV + 4D$: For embeddings, the head, and LayerNorms.
      • $D(12D + 2(d_w + d_a + d_\nu + d_g) + 19)$: Parameters per layer (except the first).
      • $(2Dd_\nu + D)$: Subtracted because the value residual low-rank MLP is not present in the first layer.
  • Parameter Initializations: Emphasizes the importance of specific initialization strategies detailed in the official code repository for replicating results.
  • Dataset Loading:
    • Uses mmap for the 3.1 trillion token dataset.
    • Employs a custom pseudo-random number generator $f(x) = ax^3 \pmod p$ over $\mathbb{Z}/p\mathbb{Z}$ for diverse sampling (a minimal sketch follows this list).
    • $p$ is the largest prime of the form $3n+2$ smaller than [dataset_size/4096]; $a$ is close to $0.618p$. This ensures uniform access and pseudo-randomness.
  • Training Details:
    • bfloat16 format on Nvidia H800 GPUs.
    • AdamW optimizer: $\beta_1 = 0.9$, $\beta_2 = 0.99$, $\epsilon = 1 \times 10^{-18}$, weight decay 0.1 (applied only to linear layers and embeddings). The small $\epsilon$ is used to stabilize training and mitigate loss spikes.
    • Context length: 4096 tokens.
    • Base decay rate $w_0$ parameters have a 2x learning rate multiplier.
    • Cosine learning rate decay schedule combined with phased dynamic batch size scaling (inspired by critical batch size and Smith et al. (2018)). This increases batch size and adjusts LR over phases, optimizing GPU resource utilization.
    • Figure 12 shows stable training loss curves without spikes.
    • NaN loss was occasionally observed, attributed to the very low AdamW $\epsilon$; it was handled by rewinding to a prior checkpoint and clearing the optimizer states.
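As referenced in the dataset-loading notes above, the sampling scheme can be sketched as follows; the values of p and a here are illustrative, not the actual training configuration:

```python
# Sketch of the pseudo-random sampling permutation f(x) = a * x^3 mod p over Z/pZ.
def is_prime(m: int) -> bool:
    if m < 2:
        return False
    i = 2
    while i * i <= m:
        if m % i == 0:
            return False
        i += 1
    return True

def largest_prime_3n_plus_2_below(limit: int) -> int:
    p = limit - 1
    while not (p % 3 == 2 and is_prime(p)):
        p -= 1
    return p

num_chunks = 100_000                       # e.g. dataset_size // 4096 (illustrative)
p = largest_prime_3n_plus_2_below(num_chunks)
a = int(0.618 * p)                         # multiplier close to 0.618 * p
sample_order = [(a * pow(x, 3, p)) % p for x in range(p)]
# Because p = 3n + 2, x -> x^3 is a bijection on Z/pZ, so sample_order is a permutation.
```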

4.2.7. Additional Architecture Discussion (Appendix F)

This appendix offers a more qualitative discussion of RWKV-7's design choices and their rationale.

  • Token Shift: Simpler, non-data-dependent token-shift (like RWKV-4,5) was chosen over RWKV-6's data-dependent version for improved training/inference efficiency, despite slight loss decrease per step. It still allows induction heads and local context.
  • Delta Rule Philosophy: RWKV-7 operates on the principle of decay, removal, and addition to the state, akin to SGD. Decay is like weight decay.
  • Vector-Valued Decay: RWKV models use vector-valued decay instead of scalar-valued (like Mamba-2) because it offers significant loss per step improvement, despite being harder to compute efficiently.
  • Decay Limit: The maximum decay is limited (e.g., at most $45.5\%$ removal per timestep) for stability and efficient kernel implementation.
  • Vector-Valued In-Context Learning Rate (ICLR): More expressive than scalar ICLRs in TTT, Gated DeltaNet, Titans. Allows each key channel its own learning rate.
  • Decoupling Keys: The removal key is decoupled and normalized separately from the replacement key. This contrasts with RWKV-6c which enforced normalization by pre-multiplying the key by (1 - decay), but RWKV-7 lets the model learn these decisions.
  • Expressivity Beyond $\mathsf{TC}^0$ Significance: Emphasizes that RWKV-7's ability to edit its state at each token (unlike the Transformer's immutable KV cache) allows complex operations like swapping entries, leading to greater computational abilities for fixed inputs and solving problems (e.g., tracking swaps) that Transformers cannot. The RWKV-7 state is like an "internal scratchpad."
  • WKV Head as Linear Model: A WKV head can be viewed as a linear model that updates its weights (WKV state entries) as context progresses.
  • Normalization after Receptance: Group Normalization after receptance is common practice to ensure numerical stability by preventing state magnitude changes from impacting model use.
  • Bonus Term: The per-head bonus term $u_t = \left( r_t \cdot (\rho \odot \tilde{k}_t)^T \right) \nu_t$ (related to the time-first $u$ term in RWKV-4 through RWKV-6) allows special treatment of the current token's information, now extracted from the WKV kernel for simplicity.

5. Experimental Setup

5.1. Datasets

The RWKV-7 models were trained and evaluated on a combination of existing and newly constructed datasets.

1. RWKV World v3 Dataset:

  • Source: Extended open-source multilingual corpus.

  • Scale: 3.119 trillion tokens.

  • Characteristics: Designed for enhanced English, code, and multilingual task performance. It builds upon RWKV World v2 (0.6 trillion tokens) and v2.1 (1.4 trillion tokens) by adding new components and slightly enhancing Chinese novels. All tokens are given equal weighting unless specified.

  • Domain: Diverse, including Web, Books, Code, Science & Wiki, Fiction, Chat & QA & Instruction, Math, Law & Government, Poetry & Lyrics.

  • Components: The RWKV World v2.1 dataset components (added to v2 to sum to ~1.4 trillion tokens): The following are the results from Table 11 of the original paper:

    Dataset                      Domain      Dataset                                Domain
    slimpajama C4                Web         Llama-3-Magpie-Pro-1M-v0.1             Align
    dolma v1.6 (reddit only)^a   Forums      Magpie-Pro-MT-300K-v0.1                Align
    glaive-code-assistant-v3     Code        Magpie-Air-MT-300K-v0.1                Align
    m-a-p_Code-Feedback          Code        Magpie-Qwen2-Pro-1M-v0.1               Align
    cosmopedia-v0.1              Synthetic   Magpie-Phi3-Pro-300K-Filtered-v0.1     Align
    SystemChat-2.0               Instruct    Magpie-Gemma2-Pro-200K-Filtered-v0.1   Align
    Tess-v1.5                    Instruct
    UltraInteract_sft            Instruct

    ^a We added only the reddit datasets from dolma v1.6.

    The RWKV World v3 dataset components (added to v2.1 to sum to ~3.1 trillion tokens): The following are the results from Table 12 of the original paper:

    Dataset                      Domain    Dataset           Domain
    REMOVED slimpajama parts^a   Web       StarCoder^c       Code
    dclm-baseline-10-of-10^b     Web       python-edu        Code
    ccnews                       Web       cosmopedia-v0.2   Synthetic
    fineweb-edu                  Web Edu   WebInstructSub    Forums
    TemplateGSM                  Math      Buzz-v1.2         Instruct
    open-web-math                Math      SKGInstruct       Instruct
    algebraic-stack              Math      FLAN              Instruct
    ^a We removed the CC and C4 components of SlimPajama from the corpus for World v3. ^b For dclm-baseline, we include only global-shard 10 of 10. ^c For StarCoder, we now include all datasets, instead of just those datasets with at least 10 stars.

    A summary of categories and token counts for RWKV World v3: The following are the results from Table 14 of the original paper:

    Category Tokens (B)
    Web 1945.2
    Books 337.2
    Code 258.4
    Science & Wiki 222.7
    Fiction 192.6
    Chat & QA & Instruction 110.0
    Math 32.3
    Law & Government 19.0
    Poetry & Lyrics 1.7
    Total 3119.2

    Why chosen: This dataset is designed to provide a broad and deep multilingual, code, and English corpus, aiming to close the data gap with very large modern LLMs and achieve strong generalization across various domains.

2. The Pile Dataset (Gao et al., 2020):

  • Source: A large, diverse, open-source dataset of text.
  • Scale: 332 billion tokens.
  • Characteristics: Composed of 22 smaller, high-quality datasets spanning various domains (e.g., scientific papers, books, web text, code).
  • Domain: General English language.
  • Why chosen: It is a standard benchmark dataset for training and evaluating language models, allowing for comparative study with other architectures like Pythia and Mamba.

3. PG19 Dataset (Rae et al., 2019):

  • Source: A collection of books from Project Gutenberg, filtered for quality.
  • Characteristics: Contains long documents.
  • Why chosen: Used for evaluating long context capabilities by measuring loss versus sequence position.

4. Recent Internet Data:

  • Source: Temporally novel internet data created after January 2025.
  • Characteristics: Includes new computer science/physics papers on arXiv, Python/C++ repositories on GitHub, Wikipedia entries, fiction on Archive of Our Own (AO3), and recent news articles.
  • Why chosen: To complement traditional benchmarks and address concerns about data leakage by evaluating on data not present in training sets.

5. Associative Recall (AR) Datasets (e.g., MQAR):

  • Source: Synthetic datasets designed for associative recall tasks.
  • Why chosen: To evaluate the model's ability to recall previously encountered information within a context, reflecting its effectiveness in in-context learning.

6. Mechanistic Architecture Design (MAD) Benchmark (Poli et al., 2024):

  • Source: A suite of synthetic token manipulation tasks.
  • Why chosen: To probe architectural capabilities in sequence modeling, such as in-context recall, fuzzy recall, memorization, and selective copying.

7. Group Multiplication Tasks (Merrill et al., 2024):

  • Source: Synthetic sequences of group elements (e.g., from $A_5$, $A_4 \times \mathbb{Z}_5$, or $\mathbb{Z}_{60}$).
  • Why chosen: To evaluate state-tracking capabilities for problems known to be $NC^1$-complete, directly testing RWKV-7's theoretical expressivity (a data-generation sketch follows this item).
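To make the task format concrete, the following sketch generates state-tracking sequences in the spirit of this setup: each example is a sequence of random group elements, and the label at each position is the running product. The exact encoding and tokenization used in the experiments are not specified here, so treat this purely as an illustration.

```python
import random
from itertools import permutations

def running_products_z60(length, seed=0):
    """Random elements of Z_60; labels are the prefix sums modulo 60."""
    rng = random.Random(seed)
    xs = [rng.randrange(60) for _ in range(length)]
    labels, acc = [], 0
    for x in xs:
        acc = (acc + x) % 60
        labels.append(acc)
    return xs, labels

def running_products_a5(length, seed=0):
    """Random even permutations of 5 elements; labels are the prefix compositions."""
    rng = random.Random(seed)
    def parity(p):  # number of inversions mod 2
        return sum(p[i] > p[j] for i in range(len(p)) for j in range(i + 1, len(p))) % 2
    a5 = [p for p in permutations(range(5)) if parity(p) == 0]   # the 60 even permutations
    xs = [rng.choice(a5) for _ in range(length)]
    labels, acc = [], tuple(range(5))
    for p in xs:
        acc = tuple(acc[p[i]] for i in range(5))                 # compose with the new element
        labels.append(acc)
    return xs, labels
```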

8. Context Length Extension Dataset:

  • Source: Public and custom sources (Table 8).

  • Characteristics: Constructed to prioritize longer documents via a length-based weighting scheme (documents under 32,768 characters weighted 1.0, longer documents weighted 2.0-3.0, with sequences up to 128k tokens); a small weighting sketch follows Table 8.

  • Why chosen: For fine-tuning RWKV-7 models on very long contexts (128k tokens) to improve retrieval accuracy at extreme lengths. The following are the results from Table 8 of the original paper:

    Dataset Type Amount
    dclm-baseline-1.0 Public 25%
    fineweb-edu Public 15%
    fineweb Public 5%
    codeparrot/github-code Public 10%
    arXiv-CC0-v0.5 Custom 10%
    SuperWikiNEXT-32B Custom 10%
    public domain books Custom 15%
    the-stack (filtered) Custom 10%
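A minimal sketch of the length-based weighting described above; the 1.0 and 2.0-3.0 weights follow the text, while the linear ramp and the 512k-character cap are assumptions made purely for illustration.

```python
def doc_weight(num_chars):
    """Sampling weight by document length (thresholds from the text; ramp and cap assumed)."""
    if num_chars < 32_768:
        return 1.0
    capped = min(num_chars, 524_288)
    return 2.0 + (capped - 32_768) / (524_288 - 32_768)   # ramps from 2.0 up to 3.0
```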

9. AudioSet Dataset (Gemmeke et al., 2017):

  • Source: A large-scale collection of human-labeled audio events.
  • Why chosen: For evaluating AudioRWKV-7 performance on audio embedding analysis tasks.

10. Othello/Reversi Board Game Data:

  • Source: Custom-designed training data based on the game of Othello (Reversi).
  • Data Sample: The following is an example of a training data sample from Appendix I of the original paper:
    <input>
      [8x8 Othello board, one row per line, using * and O for the two players and . for empty squares]
      NEXT [player to move]
      MAX_WIDTH-2
      MAX_DEPTH-2
    </input>
    <reasoning>
      Possible moves and score: g1 -19 h1 -01 b2 -08 h2 -23 b7 -12 g7 -09
      <stack>
      Remaining_Depth:2
      Max_Node Alpha: -in Beta: +in Best: -- Current: h1 -01 Unexplored: b2 -08 ...
      </stack>
      => Search next node
      [Depth limit not reached]
      <board> [board after the candidate move] </board>
      ... (the trace continues with alternating Max_Node/Min_Node stack updates, alpha-beta
      bound updates, annotations such as "[Depth limit reached - evaluate all leaves]" and
      "[Internal node - expand]", and intermediate boards, until "[End of search]") ...
      > Playing h1
    </reasoning>
    <output>
      h1
      [resulting 8x8 board after playing h1]
    </output>
    
    • Description: Each sample includes an Input section (game state, search parameters), a Reasoning section (legal moves, evaluations, Alpha-Beta pruning steps for optimal moves, generated by Egaroucid engine), and an Output section (final move, resulting board).
    • Why chosen: To evaluate RWKV-7's state tracking and in-context search capabilities for strategic decision-making in complex environments (a generic Alpha-Beta search sketch follows).
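For reference, the Alpha-Beta search that produces these reasoning traces has the following generic, depth-limited shape. This is the textbook algorithm, not the Egaroucid engine's implementation; legal_moves, apply, and evaluate are placeholder callbacks for the game rules.

```python
def alphabeta(state, depth, alpha, beta, maximizing, legal_moves, apply, evaluate):
    """Depth-limited Alpha-Beta search returning (score, best_move)."""
    moves = legal_moves(state)
    if depth == 0 or not moves:
        return evaluate(state), None
    best_move = None
    if maximizing:
        best = float("-inf")
        for move in moves:
            score, _ = alphabeta(apply(state, move), depth - 1, alpha, beta,
                                 False, legal_moves, apply, evaluate)
            if score > best:
                best, best_move = score, move
            alpha = max(alpha, best)
            if beta <= alpha:      # prune: the opponent will never allow this line
                break
    else:
        best = float("inf")
        for move in moves:
            score, _ = alphabeta(apply(state, move), depth - 1, alpha, beta,
                                 True, legal_moves, apply, evaluate)
            if score < best:
                best, best_move = score, move
            beta = min(beta, best)
            if beta <= alpha:
                break
    return best, best_move
```

The <stack> blocks in the training samples serialize this kind of bookkeeping (remaining depth, alpha/beta bounds, best and current moves) so that the model can reproduce the search in-context.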

5.2. Evaluation Metrics

The paper employs various metrics to evaluate RWKV-7 across different tasks:

1. Accuracy (ACC):

  • Conceptual Definition: Measures the proportion of correctly predicted instances out of the total number of instances. It is a fundamental metric for classification tasks, including language modeling where the task is to predict the next token correctly.
  • Mathematical Formula: $ \mathrm{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}} $
  • Symbol Explanation:
    • Number of Correct Predictions: The count of times the model's output matches the ground truth.
    • Total Number of Predictions: The total number of predictions made by the model.

2. Perplexity (PPL):

  • Conceptual Definition: A common metric for evaluating language models, particularly for how well they predict a sample of text. Lower perplexity indicates a better model, as it means the model is less "perplexed" (more confident and accurate) by the text it is evaluating. It is the exponentiated average negative log-likelihood of a sequence.
  • Mathematical Formula: For a sequence of tokens $W = (w_1, w_2, \ldots, w_N)$, the perplexity is defined as: $ \mathrm{PPL}(W) = \left( \prod_{i=1}^N \frac{1}{P(w_i | w_1, \ldots, w_{i-1})} \right)^{1/N} $ This can also be expressed in terms of the cross-entropy loss $L$: $ \mathrm{PPL}(W) = e^{L(W)} = e^{-\frac{1}{N} \sum_{i=1}^N \log P(w_i | w_1, \ldots, w_{i-1})} $
  • Symbol Explanation:
    • $W$: A sequence of $N$ tokens $(w_1, w_2, \ldots, w_N)$.
    • $N$: The total number of tokens in the sequence.
    • $P(w_i | w_1, \ldots, w_{i-1})$: The probability assigned by the language model to the $i$-th token $w_i$, given the preceding tokens $(w_1, \ldots, w_{i-1})$.
    • $L(W)$: The cross-entropy loss for the sequence $W$.
    • $e$: Euler's number (base of the natural logarithm).

3. Compression Rate:

  • Conceptual Definition: Used for evaluating models on temporally novel internet data. It measures how efficiently a model can compress new, unseen data. A lower compression rate indicates that the model has learned better representations and can predict the data more effectively, requiring fewer bits to encode it. Inspired by the idea that language modeling is compression.
  • Mathematical Formula: While the paper doesn't explicitly provide the formula, compression rate in this context typically refers to the average number of bits per token (BPT) or character (BPC) required by the model to encode the data. This is often approximated by the cross-entropy loss (converted to bits) or related to perplexity. For a model $M$ and data $D$: $ \mathrm{CompressionRate}(M, D) \approx \mathrm{BitsPerToken} = \frac{\sum_{i=1}^N -\log_2 P_M(w_i | \text{context}_i)}{N} $
  • Symbol Explanation:
    • $N$: Total number of tokens in the data.
    • $P_M(w_i | \text{context}_i)$: The probability assigned by model $M$ to token $w_i$ given its preceding context.
    • $\log_2$: Logarithm base 2, used to convert probabilities into bits.
    • The paper reports compression rate as a percentage (unit: %), likely a normalized variant or the output of a specific compression scheme (a short computational sketch of both metrics follows).
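Both perplexity and the bits-per-token view of compression follow from the same per-token log-probabilities; a tiny sketch with made-up values standing in for actual model outputs:

```python
import math

# per-token natural-log probabilities from a language model (toy values)
log_probs = [-2.1, -0.7, -3.4, -1.2, -0.9]

n = len(log_probs)
cross_entropy = -sum(log_probs) / n            # average negative log-likelihood (nats/token)
perplexity = math.exp(cross_entropy)
bits_per_token = cross_entropy / math.log(2)   # convert nats to bits

print(f"PPL = {perplexity:.2f}, bits/token = {bits_per_token:.2f}")
```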

4. Mean Average Precision (mAP):

  • Conceptual Definition: A widely used metric for object detection and information retrieval tasks, particularly multi-label classification or ranking tasks like audio event detection. It calculates the Average Precision (AP) for each class/category and then averages these AP values across all classes. AP itself is the area under the Precision-Recall Curve. Higher mAP indicates better performance.
  • Mathematical Formula: $ \mathrm{mAP} = \frac{1}{|Q|} \sum_{q=1}^{|Q|} \mathrm{AP}_q $ where $\mathrm{AP}_q$ for a query (or class) $q$ is often calculated as $ \mathrm{AP}_q = \sum_{k=1}^N P(k) \Delta r(k) $, or equivalently as the area under the Precision-Recall curve for class $q$.
  • Symbol Explanation:
    • $|Q|$: The total number of queries or classes.
    • $\mathrm{AP}_q$: The Average Precision for query (or class) $q$.
    • $N$: The number of retrieved documents (or predicted events).
    • $P(k)$: The precision at cut-off $k$ in the ranked list.
    • $\Delta r(k)$: The change in recall from $k-1$ to $k$ (a small computational sketch follows).
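As a concrete reference, average precision for a single class can be computed from ranked scores as a plain sum of precision-at-k times the change in recall; the toy values below are illustrative, and library implementations may differ in interpolation details.

```python
import numpy as np

def average_precision(y_true, y_score):
    """AP for one class: sum over ranks of precision@k times the change in recall."""
    order = np.argsort(-np.asarray(y_score))          # rank predictions by descending score
    y = np.asarray(y_true)[order]
    cum_tp = np.cumsum(y)
    precision_at_k = cum_tp / (np.arange(len(y)) + 1)
    delta_recall = y / max(1, y.sum())                # recall only increases at positives
    return float(np.sum(precision_at_k * delta_recall))

# mAP is the mean of AP over all classes; toy single-class example:
print(round(average_precision([1, 0, 1, 1, 0], [0.9, 0.8, 0.7, 0.4, 0.2]), 3))  # 0.806
```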

5. Stable Rank (SR):

  • Conceptual Definition: A measure of the "effective rank" of a matrix. It quantifies how close a matrix is to a low-rank matrix. A lower stable rank suggests that the matrix's information can be represented more compactly or that it has fewer "active" dimensions, even if its mathematical rank is high.
  • Mathematical Formula (Rudelson & Vershynin, 2007): $ \mathrm{SR}(A) := \left( \frac{\|A\|_F}{\|A\|_2} \right)^2 $
  • Symbol Explanation:
    • $A$: The matrix (e.g., a WKV state matrix).
    • $\|A\|_F$: The Frobenius norm of $A$, defined as $\sqrt{\sum_{i,j} |A_{i,j}|^2}$.
    • $\|A\|_2$: The spectral norm (or operator norm) of $A$, defined as its largest singular value.

6. Root Mean Square (RMS):

  • Conceptual Definition: A statistical measure of the magnitude of a varying quantity. For a matrix, it gives a sense of the typical magnitude of its elements.
  • Mathematical Formula: For an $m \times n$ matrix $A$: $ \mathrm{RMS}(A) = \sqrt{\frac{1}{mn} \sum_{i=1}^m \sum_{j=1}^n A_{i,j}^2} $
  • Symbol Explanation:
    • $A$: The matrix.
    • $m$, $n$: The dimensions of the matrix.
    • $A_{i,j}$: The element at row $i$, column $j$ (a short numpy sketch computing both SR and RMS follows).
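Both quantities are easy to compute directly from the definitions above; the random matrix below is just a stand-in for a WKV state.

```python
import numpy as np

def stable_rank(A):
    """(||A||_F / ||A||_2)^2, per Rudelson & Vershynin (2007)."""
    return (np.linalg.norm(A, "fro") / np.linalg.norm(A, 2)) ** 2

def rms(A):
    """Root mean square of the matrix entries."""
    return np.sqrt(np.mean(A ** 2))

A = np.random.default_rng(0).standard_normal((64, 64))
print(f"stable rank = {stable_rank(A):.1f}, RMS = {rms(A):.3f}")
```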

5.3. Baselines

The paper compares RWKV-7 against a range of contemporary large language models (LLMs) and sequence modeling architectures, representing different paradigms and scales:

1. Transformer-based Models:

  • Qwen2.5 (Qwen et al., 2025): A highly optimized Transformer model, often representing state-of-the-art performance, especially for its large training data. RWKV-7 aims to match its performance with significantly fewer tokens.
  • Llama-3.2 (Grattafiori et al., 2024): Another prominent Transformer series, known for strong performance. The paper notes that Llama-3.2 models are created via pruning and distillation from larger models, so their FLOPs are not directly comparable for training efficiency.
  • Pythia (e.g., pythia-1.4b-v0, pythia-2.8b-v0): An open-source suite of Transformer models trained on The Pile dataset, serving as a direct comparison for models trained on identical data.
  • Falcon3-1B-Base: Another Transformer-based model.

2. State Space Models (SSMs):

  • Mamba (Gu & Dao, 2023): A foundational SSM that achieves linear time complexity during inference.
  • Mamba2 (Dao & Gu, 2024): An improved version of Mamba, often used as a strong baseline for efficient sequence modeling.
  • mamba-1.4b-hf, mamba-2.8b-hf: Specific Mamba variants.
  • mamba2attn-2.7b: A Mamba variant that likely integrates attention.
  • RecurrentGemma-2B: A recurrent model from Google, part of the Gemma family.
  • gemma-2-2b: Another Gemma family model.

3. Other RNN / Linear Attention Models:

  • RWKV (Predecessors): RWKV-4, RWKV-5, RWKV-6 (including RWKV6-World2.1-1.6B, RWKV6-World2.1-3B, RWKV5-World1-0.1B, RWKV5-World2-0.4B, RWKV5-World3B, etc.). These serve as direct architectural baselines to demonstrate the improvements of RWKV-7.
  • SmolLM2 (Allal et al., 2025): A small language model series.
  • Index-1.9B, MobileLLM-1.5B, MobileLLM-1B, Minitron-4B-Base, Zamba2-1.2B, Zamba2-2.7B: Other efficient or small-scale LLMs.
  • S4: A foundational Structured State Space Model used in group multiplication experiments.
  • GLA (Gated Linear Attention): A linear attention variant.
  • Hyena / Multihead Hyena (Poli et al., 2023): Convolutional language models used in MAD benchmark.
  • DeltaNet (Schlag et al., 2021): The original Delta Rule model, used in MAD benchmark.

Why these baselines are representative:

  • SOTA Comparison: Qwen2.5 and Llama-3.2 represent the highest-performing Transformer models in their size class, crucial for evaluating RWKV-7's competitive performance despite lower token counts.
  • Architectural Comparison: Mamba, S4, GLA, Hyena represent the leading alternatives to Transformers that also aim for efficiency, allowing direct comparison of architectural strengths.
  • Evolutionary Comparison: RWKV-4,5,6 models provide a clear baseline to show the incremental improvements of the RWKV-7 architecture.
  • Dataset Consistency: Pythia models trained on The Pile offer a controlled comparison for RWKV-7 Pile models on the exact same training data.
  • Specific Task Benchmarks: Baselines like DeltaNet, Hyena, S4, Mamba, and classical RNNs are included for specialized tasks like Mechanistic Architecture Design and Group Multiplication to probe specific capabilities.

6. Results & Analysis

6.1. Core Results Analysis

The experimental results validate RWKV-7's effectiveness across various benchmarks, demonstrating its competitive performance despite being trained on significantly fewer tokens, and highlighting its efficiency and advanced state-tracking capabilities.

1. Language Modeling Experiments (LM Evaluation Harness Benchmarks): RWKV-7 models were evaluated on common English-focused and multilingual benchmarks using LM Evaluation Harness.

  • English-Focused Benchmarks (Table 3): The following are the results from Table 3 of the original paper:

    Model (Name) Tokens (T) lmb.o acc↑ hella acc_n↑ piqa acc↑ arcE acc↑ arcC acc↑ glue acc↑ WG acc↑ sciq acc↑ mmlu acc↑ avg acc↑
    RWKV5-World1-0.1B SmolLM2-135M 0.6 38.4 31.9 61.4 44.2 19.9 45.5 52.9 76.3 23.1 43.7
    RWKV7-World2.8-0.1B 2.0 1.6 42.9 48.1 43.1 68.4 64.4 28.1 49.0 53.0 84.0 25.8 51.0 50.5
    42.1 67.3 59.3 25.5 48.1 52.7 86.3 25.4
    RWKV5-World2-0.4B 1.1 54.0 40.9 66.5 54.0 24.0 50.0 53.2 86.9 23.8 50.4
    SmolLM2-360M 4.0 53.8 56.4 72.1 70.4 36.5 50.7 59.0 91.2 26.3 57.4
    Qwen2.5-0.5B 18.0 52.5 52.1 70.2 64.6 29.5 54.7 56.4 93.1 47.8 57.9
    RWKV7-World2.9-0.4B 3.1 58.6 56.8 72.9 68.7 31.9 49.4 59.9 89.7 26.1 57.1
    RWKV6-World2.1-1.6B 2.5 67.4 61.1 74.4 64.3 31.0 51.0 60.7 89.5 25.1 58.3
    Llama3.2-1B^b 15.0 63.0 63.7 74.5 65.5 31.3 49.7 60.7 91.4 32.1 59.1
    SmolLM2-1.7B 11.0 67.7 71.5 77.0 77.7 44.7 51.5 66.1 93.3 50.3 66.6
    Qwen2.5-1.5B 18.0 63.0 67.7 75.8 75.5 41.2 65.0 63.4 94.2 61.0 67.4
    RWKV7-World3-1.5B 5.6 69.5 70.8 77.1 78.1 44.5 62.4 68.2 94.3 43.3 67.6
    RWKV6-World2.1-3B 2.5 71.7 68.4 76.4 71.2 35.6 56.3 66.3 92.2 28.3 62.9
    Llama3.2-3B^b 15.0 70.5 73.6 76.7 74.5 42.2 50.7 69.9 95.7 56.5 67.8
    Qwen2.5-3B 18.0 67.1 73.5 78.6 77.4 45.0 70.2 68.5 96.2 65.7 71.4
    RWKV7-World3-2.9B 5.6 73.4 76.4 79.7 81.0 48.7 61.8 72.8 95.0 55.0 71.5

    Analysis:

    • Competitive English Performance: RWKV-7 models generally match the English performance of Qwen2.5, despite Qwen2.5 being trained on significantly more tokens (e.g., RWKV7-World3-2.9B (5.6T tokens, 71.5 avg acc) vs. Qwen2.5-3B (18.0T tokens, 71.4 avg acc)). This highlights RWKV-7's superior data efficiency.
    • MMLU Leap: RWKV-7 models show substantial improvements in MMLU performance compared to RWKV-6, indicating better multi-task language understanding.
    • Overall Scaling: As model size increases, RWKV-7 consistently shows strong performance gains, often outperforming RWKV-6 and SmolLM2 at comparable sizes.
  • Multilingual Benchmarks (Table 4): The following are the results from Table 4 of the original paper:

    Model (Name) Tokens (T) lmb.m ppl↓ lmb.m acc↑ pawsx acc↑ xcopa acc↑ xnli acc↑ xsClz acc↑ xwin acc↑ avg acc↑
    RWKV5-World1-0.1B 0.6 270 22.0 48.6 53.0 36.1 51.7 59.5 45.1
    SmolLM2-135M 2.0 1514 18.6 51.2 52.2 34.9 50.6 61.7 44.9
    RWKV7-0.1B 1.6 114 31.6 46.1 53.3 37.6 52.6 64.1 47.5
    RWKV5-World2-0.4B 1.1 66 36.8 49.5 54.0 38.5 54.1 65.6 49.8
    SmolLM2-360M 4.0 389 25.8 51.4 51.7 36.0 51.2 67.8 47.3
    Qwen2.5-0.5B 18.0 108 32.9 52.6 54.4 38.6 53.9 67.8 50.0
    RWKV7-World3-0.4B 3.1 52 39.6 48.7 55.4 40.3 55.3 72.9 52.0
    RWKV6-World2.1-1.6B 2.5 28 47.2 52.5 58.1 41.4 58.2 76.5 55.7
    Llama3.2-1B^b 15.0 52 39.0 53.9 55.3 41.2 56.6 72.2 53.0
    SmolLM2-1.7B 11.0 85 37.1 56.5 53.1 38.1 54.1 72.8 52.0
    Qwen2.5-1.5B 18.0 49 40.0 55.3 57.4 40.6 57.7 75.8 54.5
    RWKV7-World3-1.5B 5.6 25 48.4 54.8 59.7 43.7 61.4 79.8 58.0
    RWKV6-World2.1-3B 2.5 21 51.0 53.4 60.2 42.7 61.3
    Llama3.2-3B^b 15.0 30 45.9 58.5 60.6 78.8 57.9
    Qwen2.5-3B 18.0 36 43.5 59.9 53.3 59.0 44.2 38.5 59.6 79.2 79.8 58.1
    RWKV7-World3-2.9B 5.6 18 52.9 58.2 63.1 45.4 64.7 82.4 55.6 61.1

    Analysis:

    • SoTA Multilingual Performance: RWKV-7-World models show significant improvements on multilingual benchmarks, outperforming SmolLM2, Llama-3.2, and Qwen-2.5 by a notable margin at comparable parameter counts (e.g., RWKV7-World3-2.9B (5.6T tokens, 61.1 avg acc) vs. Qwen2.5-3B (18.0T tokens, 58.1 avg acc)). This indicates strong multilingual capabilities.
  • FLOPs vs. Accuracy (Figures 3 & 4): The following figure (Figure 3 from the original paper) shows model comparisons across multilingual benchmarks.

    Figure 3: Model Comparisons across Multilingual Benchmarks

    The following figure (Figure 4 from the original paper) shows model comparisons across English-language benchmarks.

    Figure 4: Model Comparisons across English-Language Benchmarks

    Analysis:

    • RWKV-7 models demonstrate a Pareto improvement in multilingual evaluations against Transformer models (Figure 3a).
    • For English language evaluations, RWKV-7-World models achieve similar scores to highly trained Transformer models but with dramatically lower total FLOPs usage (Figure 4a). This suggests higher computational efficiency. The authors theorize this difference would be even more dramatic if models were trained from scratch with equivalent total tokens.

2. Recent Internet Data Evaluation (Table 5): The following are the results from Table 5 of the original paper:

Model arXiv CS↓ arXiv Phys. ↓ Github Python ↓ Github C++↓ A03 Eng ↓ BBC news ↓ Wiki Eng ↓ average ↓
Qwen2.5-1.5B 8.12 8.65 4.42 4.40 11.76 9.58 9.49 8.06
RWKV-7 1.5B 8.25 8.77 5.57 5.29 10.93 9.34 8.97 8.16
Llama-3.2-1B 8.37 8.76 5.18 5.16 11.69 9.34 9.07 8.23
SmolLM2-1.7B 8.38 9.04 5.17 4.94 11.20 9.40 9.46 8.23
Index-1.9B 8.34 8.59 5.65 5.29 11.49 9.51 9.23 8.30
stablelm-2-1.6b 8.58 9.08 5.54 5.45 11.42 9.24 9.06 8.34
RWKV-6 1.5B 8.62 9.00 6.06 5.80 11.09 9.57 9.30 8.49
RWKV-5 1.5B 8.77 9.11 6.20 5.92 11.25 9.75 9.50 8.64
mamba2-1.3b 8.74 8.74 6.32 5.71 11.63 9.74 9.86 8.68
MobileLLM-1.5B 8.82 9.29 6.79 6.29 11.59 9.15 9.22 8.73
mamba-1.4b-hf 8.88 8.86 6.43 5.81 11.70 9.83 9.97 8.78
Zamba2-1.2B 8.57 9.21 6.91 7.08 11.39 9.38 9.26 8.83
SmolLM-1.7B 8.38 9.02 5.76 6.55 12.68 9.85 9.89 8.88
MobileLLM-1B 9.03 9.57 7.03 6.53 11.86 9.35 9.43 8.97
RWKV-4 1.5B 9.34 9.80 6.54 6.16 11.33 10.00 9.82 9.00
pythia-1.4b-v0 9.12 9.20 6.79 6.15 12.19 10.20 10.43 9.15
Falcon3-1B-Base 8.60 9.20 6.92 7.16 13.04 10.45 10.75 9.45
Llama-3.2-3B 7.78 8.10 4.15 4.59 10.90 8.70 8.28 7.57
Qwen2.5-3B 7.79 8.25 4.15 4.12 11.23 9.15 8.96 7.66
RWKV-7 2.9B 7.90 8.34 5.16 4.88 10.48 8.92 8.47 7.74
stablelm-3b-4e1t 8.15 8.50 5.28 4.85 10.89 8.82 8.51 7.86
Minitron-4B-Base 8.09 8.70 5.13 4.74 11.05 9.08 8.90 7.96
recurrentgemma-2b 8.24 8.52 5.22 4.80 11.30 8.94 8.88 7.99
RWKV-6 3B 8.27 8.58 5.66 5.39 10.67 9.17 8.82 8.08
gemma-2-2b 8.39 8.81 5.36 5.01 11.35 8.90 9.03 8.12
mamba2attn-2.7b 8.33 8.29 5.78 5.22 11.13 9.28 9.26 8.18
RWKV-5 3B 8.42 8.70 5.78 5.51 10.83 9.36 9.00 8.23
mamba2-2.7b 8.43 8.37 5.93 5.34 11.21 9.37 9.38 8.29
Zamba2-2.7B 8.17 8.70 6.30 6.39 10.97 8.95 8.74 8.32
mamba-2.8b-hf 8.57 8.52 6.03 5.46 11.31 9.49 9.53 8.41
RWKV-4 3B 8.90 9.27 6.07 5.67 10.90 9.57 9.30 8.53
pythia-2.8b-v0 8.72 8.73 6.29 5.71 11.66 9.74 9.82 8.67

Analysis:

  • RWKV-7 Goose shows competitive performance on temporally novel data (data created after training periods), despite being trained on significantly less data than models like Qwen2.5 and Llama-3.2.
  • For 3B-scale models, RWKV-7 2.9B achieves an average compression rate of 7.74%, very close to Qwen2.5-3B (7.66%) and Llama-3.2-3B (7.57%). This indicates good generalization to unseen data and robust language modeling capabilities without data leakage.

3. Associative Recall (Table 6): The following are the results from Table 6 of the original paper:

Dim WKV state dim (sequence length, key-value pairs): (64,4) (128,8) (256,16) (512,64) (1024,128) (2048,256)
64 8192 ✓ 98.43 95.01 72.93
128 16384 94.97
256 32768 98.97
512 65536

Analysis:

  • RWKV-7 demonstrates strong associative recall capabilities. With a WKV state dimension of 8192, it recalls 72.93% of information for 256 Key-value pairs (sequence length 2048).
  • The results suggest an information density of 0.547 bits per dimension, indicating efficient storage in the fixed-size state.
  • Achieving >99% accuracy (indicated by '✓') for various configurations, including up to 512 Key-value pairs and 65536 WKV state dimension, underscores its ability to effectively learn from context.

4. Mechanistic Architecture Design (MAD) Benchmark (Table 7): The following are the results from Table 7 of the original paper:

Model Compress Fuzzy Recall In-Context Recall Memorize Noisy Recall Selective Copy Avg
RWKV-7 44.5 43.2 100 89.1 100 98.8 79.3
Transformer 51.6 29.8 94.1 85.2 86.8 99.6 74.5
Multihead Hyena 44.8 14.4 99.0 89.4 98.6 93.0 73.2
DeltaNet 42.2 35.7 100 52.8 100 100 71.8
Mamba 52.7 6.7 90.4 89.5 90.1 86.3 69.3
Hyena 45.2 7.9 81.7 89.5 78.8 93.1 66.0
GLA 38.8 6.9 80.8 63.3 81.6 88.6 60.0

Analysis:

  • RWKV-7 achieves the highest average score (79.3) across the six MAD tasks, outperforming Transformers, Mamba, DeltaNet, and Hyena.
  • It demonstrates perfect accuracy on In-Context and Noisy Recall tasks, matching DeltaNet.
  • Sets a new state-of-the-art for Fuzzy Recall (43.2), indicating strong robustness to noisy inputs.
  • Strong performance in Memorization and Selective Copying suggests an effective combination of attention-based and recurrent model strengths.

5. Long Context Experiments:

  • PG19 Loss vs. Sequence Position (Figures 5 & 6): The following figure (Figure 5 from the original paper) shows PG19 loss versus sequence position for RWKV and Mamba models trained on The Pile datasets.

    Figure 5: PG19 loss versus sequence position for RWKV and Mamba models trained on The Pile datasets.

    The following figure (Figure 6 from the original paper) shows PG19 loss versus sequence position for RWKV7 models and predecessors trained on the World dataset.

    Figure 6: PG19 loss versus sequence position for RWKV7 models and predecessors trained on the World dataset.

    Analysis:

    • Pile-trained RWKV-7 shows more significant loss reduction on long contexts compared to its predecessors (Figure 5), demonstrating effective long-context extrapolation.
    • Conversely, World-trained RWKV-7 exhibits an increasing loss trend for contexts longer than 10k (Figure 6). This is speculated to be due to inductive biases from larger datasets/models causing overfitting to specific context lengths. However, fine-tuning on long contexts can restore this capability.
  • Pass-Key Retrieval (Figure 7): The following figure (Figure 7 from the original paper) shows RWKV7-World3 pass-key retrieval evaluation.

    Figure 7: RWKV7-World3 pass-key retrieval evaluation

    Analysis:

    • RWKV7-World3-1.5B achieves perfect retrieval up to 19,600 tokens, degrading beyond 20,600.
    • RWKV7-World3-2.9B extends perfect retrieval up to 35,000 tokens, showing benefits of scaling. Performance degrades around 50k tokens.
    • Fine-tuning on a specially constructed dataset with 128k token sequences further improves performance: RWKV-7 (1.5B) reliably retrieves up to 29k tokens (degrading around 40k), and RWKV-7 (2.9B) reliably retrieves up to 30k (degrading around 50k). This demonstrates the effectiveness of targeted long-context training.

6. Evaluating State Tracking Using Group Multiplication (Figure 8): The following figure (Figure 8 from the original paper) shows the minimum number of layers (lower is better) required to attain >95% validation accuracy on group multiplication problems by sequence length and group.

Figure 8: Minimum number of layers (lower is better) required to attain >95% validation accuracy on group multiplication problems by sequence length and group.

Analysis:

  • RWKV-7 exhibits stronger state-tracking capabilities than Transformers, Mamba, and S4, requiring fewer layers to achieve high accuracy on group multiplication tasks.
  • It performs slightly weaker than classical RNNs in this specific benchmark, but classical RNNs often suffer from gradient vanishing/exploding and cannot be parallelized efficiently, unlike RWKV-7.
  • The results align with the theoretical prediction (Appendix D.2) that RWKV-7 can perform state tracking and recognize any regular language with a constant number of layers, highlighting its theoretical expressivity advantage over $\mathsf{TC}^0$-limited models.

7. Speed and Memory Usage (Figure 9): The following figure (Figure 9 from the original paper) shows time vs. sequence length (H100).

Figure 9: Time vs. Sequence Length (H100)

Analysis:

  • RWKV-7 models scale linearly with sequence length, while Flash Attention v3 scales quadratically. This makes RWKV models faster for large sequence lengths.
  • The optimized RWKV-7 kernel is approximately three times faster than the official RWKV-6 kernel.
  • For a sequence length of 16k, the RWKV-7 forward pass (without storing state) takes 7.9 ms, while the Flash Attention v3 forward pass takes 33.9 ms, demonstrating a significant inference speed advantage.
  • Memory usage is constant for single token inference and scales linearly with chunk size for pre-fill, allowing flexible memory trade-offs.

8. Multimodal Experiments:

  • VisualRWKV-7 (Table 9, Figure 10): The following figure (Figure 10 from the original paper) shows the architecture of VisualRWKV-7. The input image is processed by three vision encoders, and the obtained features are concatenated. Afterward, they are projected through an MLP with context gating to align with the dimensions of the RWKV-7 block. Finally, the image features are concatenated with the text embeddings and fed into the RWKV-7 LLM.

    Figure 10: The architecture of VisualRWKV-7. The input image is processed by three vision encoders, and the obtained features are concatenated. Afterward, they are projected through an MLP with context gating to align with the dimensions of the RWKV-7 block. Finally, the image features are concatenated with the text embeddings and fed into the RWKV-7 LLM.

    The following are the results from Table 9 of the original paper:

    Method Vision Encoder LLM VQA SQA TQA GQA
    VisualRWKV-6 SigLIP+DINOv2+SAM RWKV6-1.6B 73.6 57.0 48.7 58.2
    VisualRWKV-6 SigLIP+DINOv2+SAM RWKV6-3.1B 79.1 62.9 52.7 61.0
    VisualRWKV-7 SigLIP+DINOv2+SAM RWKV7-0.1B 75.2 50.6 37.9 59.9
    VisualRWKV-7 SigLIP+DINOv2+SAM RWKV7-0.4B 77.9 55.0 41.1 62.3
    VisualRWKV-7 SigLIP+DINOv2+SAM RWKV7-1.5B 79.8 59.7 49.5 63.2
    VisualRWKV-7 SigLIP+DINOv2+SAM RWKV7-2.9B 80.5 63.4 58.0 63.7

    Analysis:

    • VisualRWKV-7 models (e.g., 0.1B, 0.4B) outperform VisualRWKV-6 1.6B on VQAv2 and GQA benchmarks with significantly fewer parameters, demonstrating the powerful modeling capabilities of the RWKV-7 block.
    • VisualRWKV-7 2.9B outperforms VisualRWKV-6 3.1B on the out-of-domain SQA benchmark, indicating stronger generalization ability.
    • A 5.3-point improvement on TextQA (TQA) for VisualRWKV-7 2.9B over VisualRWKV-6 3.1B further confirms its superior associative recall.
  • AudioRWKV-7 (Table 10): The following are the results from Table 10 of the original paper:

    Model #Parameters Architecture mAP ↑
    DeepRes (Ford et al., 2019) 26M CNN 0.392
    HST-AT 88.5M Transformer 0.433*
    HST-AT pretrained (Chen et al., 2022) 88.5M Transformer 0.429*
    MambaOut (Yu & Wang, 2024) 101.3M Mamba 0.397
    AudioRWKV-6 8.9M RWKV6 0.381
    AudioRWKV-6 19.8M RWKV6 0.426
    AudioRWKV-7 8.9M RWKV7 0.392
    AudioRWKV-7 19.8M RWKV7 0.431

    Analysis:

    • AudioRWKV-7 achieves comparable performance to CNN, Transformer, and Mamba-based architectures with a much smaller parameter count.
    • It exceeds the performance of AudioRWKV-6 at both 8.9M and 19.8M parameters, demonstrating improved capabilities for audio embedding analysis.

6.2. Ablation Studies / Parameter Analysis

1. Ablation Experiments on The Pile (Table 18): The following are the results from Table 18 of the original paper:

Model Tokens (B) lmb.o ppl ↓ lmb.o acc ↑ hella acc_n ↑ piqa acc ↑ arcE acc ↑ arcC acc ↑ glue acc ↑ WG acc ↑ sciq acc↑ avg acc
RWKV4-169M-Pile 332 29.2 33.2 32.2 64.8 47.1 19.9 47.6 51.2 77.6 46.7
Pythia-160M 300 37.3 35.4 30.3 62.3 43.6 19.5 46.5 51.3 75.4 45.5
Mamba-130M 300 16.0 44.3 35.3 64.5 48.0 19.7 48.5 52.1 78.2 48.8
Mamba2-130M 300 16.8 43.9 35.3 64.9 47.4 20.9 45.8 52.6 81.0 49.0
RWKV6-173M-Pile 332 16.0 44.5 34.9 64.4 48.3 19.7 48.9 51.9 80.6 49.2
RWKV7-168M-Pile 332 14.2 45.7 36.9 65.5 47.9 19.7 49.1 52.4 81.6 49.8
RWKV4-430M-Pile 332 13.1 45.1 40.8 67.7 52.8 24.1 49.4 52.0 80.7 51.6
Pythia-410M 300 10.8 51.6 40.6 66.7 51.9 21.4 44.1 53.3 81.5 51.4
Mamba-370M 300 8.1 55.6 46.5 69.5 55.0 25.0 46.8 55.5 84.5 54.8
Mamba2-370M 300 8.0 55.9 46.9 70.5 54.8 25.1 48.1 55.4 85.3 55.2
RWKV7-421M-Pile 332 7.2 57.9 48.0 69.3 56.3 23.5 50.3 56.4 85.9 56.0
RWKV4-1.5B-Pile 332 7.1 56.4 52.8 72.2 60.7 24.9 50.5 54.3 85.8 57.2
Pythia-1.4B 300 6.1 61.7 52.0 70.8 60.5 26.1 47.7 57.5 86.6 57.9
Mamba-1.4B 300 5.0 65.0 59.1 74.2 65.5 29.8 46.2 61.4 87.3 61.1
Mamba2-1.3B 300 5.0 65.6 59.9 73.2 64.2 29.9 46.1 61.0 89.8 61.2
RWKV7-1.47B-Pile 332 4.8 67.0 61.8 73.6 64.9 30.2 48.0 64.4 91.1 62.6

Analysis:

  • This table showcases the consistent improvements of RWKV-7 over its predecessors (RWKV-4, RWKV-6) when all models are trained on the same dataset (The Pile) under identical configurations.
  • Across all three size scales (168M, 421M, 1.47B parameters), RWKV-7 achieves lower perplexity (e.g., 4.8 vs. 5.0 for Mamba2-1.3B) and higher average accuracy (e.g., 62.6 vs. 61.2 for Mamba2-1.3B) compared to RWKV-6 and often Mamba models.
  • The performance gap sustains as model size increases, suggesting that the RWKV-7 architecture scales more effectively.

2. Architecture Choice Ablations (Table 19): Ablation studies were conducted on a small 6-layer, 768-dimension Goose model trained on minipile to validate specific design choices. The following are the results from Table 19 of the original paper:

Model Training Loss Validation Loss
Goose 2.834
Goose, scalar decay 2.873
Goose, scalar in-context learning rate 2.609 2.843 2.591
Goose, same removal/replacement keys 2.840 2.560
Goose, no bonus term 2.841

Analysis:

  • The baseline Goose model (with all RWKV-7 innovations) achieves the lowest validation loss (2.560, reading the final entry of the table as the baseline's validation loss).
  • Scalar Decay: Using a scalar-valued decay instead of vector-valued (2.873 vs 2.834 training loss) results in a higher training loss, validating the benefit of vector-valued decay.
  • Scalar In-Context Learning Rate: Switching to a scalar-valued in-context learning rate (2.843 vs 2.834 training loss) also leads to a higher training loss, confirming the advantage of vector-valued ICLR.
  • Same Removal/Replacement Keys: Forcing the use of the same removal and replacement keys (2.840 vs 2.834 training loss) degrades training performance, supporting the design choice to decouple these keys.
  • No Bonus Term: Removing the bonus term ($u_t$) slightly increases training loss (2.841 vs 2.834), indicating its positive contribution.
  • These ablations confirm that the novel design choices in RWKV-7 (vector-valued decay, vector-valued ICLR, decoupled keys, and bonus term) each contribute positively to the model's performance.

3. Parameter Statistics (Appendix L, Figures 17-23):

  • $\xi$ (Removal Key Multiplier) and $\alpha$ (Replacement Rate Booster): The following figure (Figure 17 from the original paper) shows box plots of $\xi$ and $\alpha$ across models.

    Figure 17: Box Plots of $\xi$ and $\alpha$ Across Models

    The following figure (Figure 18 from the original paper) shows mean $\xi$ and $\alpha$ across layers for different models.

    Figure 18: Mean $\xi$ and $\alpha$ Across Layers for Different Models

    The following figure (Figure 19 from the original paper) shows the maximum and minimum of $\xi$ across layers in different models.

    Figure 19: Maximum and Minimum of $\xi$ Across Layers in Different Models

    The following figure (Figure 20 from the original paper) shows the maximum and minimum of $\alpha$ across layers in different models.

    Figure 20: Maximum and Minimum of $\alpha$ Across Layers in Different Models

    Analysis: Box plots show the distribution of these learned parameters across models. Mean values across layers (Figures 18, 22) show how these parameters evolve and are utilized at different depths of the network. The maximum and minimum values (Figures 19, 20, 23) indicate the range of dynamic adjustments the model makes; e.g., $\xi$ ranges roughly $[-5.3, 9.4]$, showing significant scaling of the removal key.

  • Biases of $d_t$ (Decay Precursor): The following figure (Figure 21 from the original paper) shows box plots of the biases of $d_t$ across models.

    Figure 21: Box Plots of biases of $d_t$ Across Models

    The following figure (Figure 22 from the original paper) shows mean biases of $d_t$ across layers for different models.

    Figure 22: Mean biases of $d_t$ Across Layers for Different Models

    The following figure (Figure 23 from the original paper) shows the maximum and minimum of biases of $d_t$ across layers in different models.

    Figure 23: Maximum and Minimum of biases of $d_t$ Across Layers in Different Models

    Analysis: The statistics for the biases of $d_t$ provide insights into the behavior of the decay mechanism. Their distribution and range across layers show how the model learns to control the in-context weight decay.

4. Board Game Modeling (Reversi/Othello):

  • Training Loss (Figure 13): The following figure (Figure 13 from the original paper) shows Reversi Training loss for different token types.

    Figure 13: Reversi Training loss for different token types

    Analysis: The training loss for different token types indicates a phased learning process:

    1. Model first mastered output formatting.
    2. Then developed board state tracking capability.
    3. Continuously improved evaluation accuracy throughout training. This suggests RWKV-7 can learn complex reasoning strategies in-context.
  • Win Rates with Search (Figure 14): The following figure (Figure 14 from the original paper) shows Reversi Token consumption and win rates under different search configurations.

    Figure 14: Reversi Token consumption and win rates under different search configurations

    Analysis: By increasing the thinking budget (max width and depth for Alpha-Beta pruning), RWKV-7 can effectively search for better strategies in Othello, demonstrating a positive test-time scaling law. This indicates that the model can leverage its state-tracking abilities for complex in-context search.

5. State Inspections (WKV Matrix RMS and Stable Rank):

  • Visualization (Figure 15): The following figure (Figure 15 from the original paper) shows visualization example of RWKV's WKV matrices.

    Figure 15: Visualization example of RWKV's WKV matrices.

    Analysis: Visual inspection reveals that RWKV-7's WKV states have elements consistently of order $O(1)$, without the thousands-order outliers seen in RWKV-5 and RWKV-6. This confirms improved numerical stability.

  • RMS and Stable Rank (Figure 16): The following figure (Figure 16 from the original paper) shows the global RMS and average stable rank of WKV matrices, plotted over sequence length.

    Figure 16: The global RMS and average stable rank of WKV matrices, plotted over sequence length.

    Analysis:

    • RWKV-7 exhibits significantly smaller RMS values for its WKV states compared to RWKV-5 and RWKV-6, further confirming better numerical stability.
    • The stable rank of RWKV-7's WKV matrix is lower than its predecessors for contexts longer than 32. This seemingly contradictory observation (lower stable rank implies less information, yet RWKV-7 performs better) is hypothesized to be due to enhanced information compression and utilization capabilities in RWKV-7's state evolution, allowing it to maintain important information in a more compact form.

6. Initial Token Sensitivity (Table 20): The following are the results from Table 20 of the original paper:

Model EOS padding PPL ACC (%) Significance
RWKV7 World 0.1B 0 357 9.2 ***
RWKV7 World 0.1B 1 16.4 36.6
RWKV7 World 0.4B 0 42.7 28.9 ***
RWKV7 World 0.4B 1 7.25 48.6
SmolLM2 360M 0 21.1 39.4 *
SmolLM2 360M 1 9.17 49.3
Qwen2.5 0.5B 0 12.2 47.9 NS
Qwen2.5 0.5B 1 7.97 54.9

Analysis:

  • RWKV-7 models exhibit significant prompt sensitivity to the inclusion of the <|endoftext|> token at the beginning of the input, especially for smaller models.
  • For RWKV7 World 0.1B, including the token improves accuracy from 9.2% to 36.6% (PPL from 357 to 16.4).
  • This suggests RWKV-7 struggles to retain the very first token in memory without proper state initialization or context setting.
  • Interestingly, Transformer-based models like Qwen2.5 show less pronounced impact, implying better mechanisms for attending to initial tokens.
  • The authors even found that two consecutive <|endoftext|> tokens can further improve performance, suggesting the model leverages these special tokens for state initialization.

6.3. Training Details (Figures 12)

The following figure (Figure 12 from the original paper) shows the training loss curve for RWKV-7 World models. Blue line: loss; Red line: smoothed loss; Green line: actual learning rate; Dotted red line: learning rate schedule.

Figure 12: Training loss curve for RWKV-7 World models. Blue line: loss; Red line: smoothed loss; Green line: actual learning rate; Dotted red line: learning rate schedule.

Analysis:

  • The training loss curves for all four RWKV-7 World models demonstrate extreme stability, with no observed loss spikes. This indicates robust training dynamics for the RWKV-7 architecture.
  • The cosine learning rate decay schedule and phased dynamic batch size scaling strategy appear to be effective in managing training stability and efficiency.
  • The stability is attributed partly to the small AdamW $\epsilon$ value chosen, which helps mitigate sudden loss increases.

6.4. Transition Matrix Eigenvalues and Stability (Appendix C)

  • Theorem 1: When the decay entries $w$ lie in $(u, 1)$, the learning rate entries $a$ lie in $(0, 1)$, and $c \in (0, 1+u)$, the transition matrix $A = \mathrm{diag}(w) - c \hat{\kappa}^T (a \odot \hat{\kappa})$ is similar to a symmetric matrix, all its eigenvalues lie in $(-1, 1)$, and it admits at most one negative eigenvalue. If $a$ is time-independent, the update formula is guaranteed to be stable.
  • Analysis: This theoretical result is crucial for RWKV-7's stability. It shows that the transition matrix in RWKV-7 is a contraction matrix (spectral norm $\le 1$), preventing unbounded growth of the WKV state. The ability to have negative eigenvalues (at most one) is distinct from some prior DeltaNet variants and hints at richer dynamics that might contribute to its enhanced expressivity, as discussed in concurrent work by Grazzi et al. (2024). A quick numerical check of the eigenvalue bound follows.
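The eigenvalue bound of Theorem 1 is easy to sanity-check numerically by sampling parameters from the stated ranges; the sampling distributions and dimensions below are arbitrary choices made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, u, worst = 16, 0.1, 0.0
for _ in range(1000):
    w = rng.uniform(u, 1.0, d)                 # decay entries in (u, 1)
    a = rng.uniform(0.0, 1.0, d)               # in-context learning rates in (0, 1)
    kappa = rng.standard_normal(d)
    kappa /= np.linalg.norm(kappa)             # unit removal key
    c = rng.uniform(0.0, 1.0 + u)              # c in (0, 1 + u)
    A = np.diag(w) - c * np.outer(kappa, a * kappa)
    worst = max(worst, np.max(np.abs(np.linalg.eigvals(A))))
print(f"largest |eigenvalue| over 1000 random draws: {worst:.4f}")   # stays below 1
```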

6.5. RWKV-7a for Board Game Modeling (Appendix I)

  • RWKV-7a Formula: $ \pmb{S}_t = \hat{\pmb{S}}_{t-1}\, \mathrm{diag}(w_t) \left( \pmb{I} - c\, \hat{\kappa}_t^T (a_t \odot \hat{\kappa}_t) \right) + \nu_t^T k_t $
    • This expanded formulation is useful for Othello, allowing the full range of $(-1, 1)$ eigenvalues when $c = 2$. This may give more control over state updates for tasks requiring fine-grained strategic adjustments (a toy demonstration of the $c = 2$ swap operation follows).
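As a toy illustration of why $c = 2$ matters, choosing the removal key along $(e_i - e_j)/\sqrt{2}$ with $a_t = 1$ turns the update factor $I - 2\hat{\kappa}^T\hat{\kappa}$ into a Householder reflection that exactly swaps two state columns, the kind of operation invoked in the state-tracking discussion; this particular key choice is purely illustrative.

```python
import numpy as np

d, i, j = 6, 1, 4
kappa = np.zeros(d)
kappa[i], kappa[j] = 1 / np.sqrt(2), -1 / np.sqrt(2)    # unit vector along e_i - e_j

H = np.eye(d) - 2 * np.outer(kappa, kappa)              # c = 2, a = 1: a reflection
S = np.arange(d * d, dtype=float).reshape(d, d)         # toy state matrix

S_swapped = S @ H
assert np.allclose(S_swapped[:, [i, j]], S[:, [j, i]])  # columns i and j are exchanged
```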

7. Conclusion & Reflections

7.1. Conclusion Summary

The paper introduces RWKV-7 "Goose", a significant advancement in recurrent neural network (RNN) architectures for sequence modeling. It successfully demonstrates that RNN-based models can achieve state-of-the-art (SoTA) performance competitive with or superior to Transformer-based models in the 3 billion parameter class, particularly in multilingual tasks, despite being trained on dramatically fewer tokens. This highlights RWKV-7's exceptional parameter and data efficiency.

The core innovation lies in a newly generalized delta rule formulation, which incorporates vector-valued gating, in-context learning rates, and decoupled removal/replacement keys. These features provide RWKV-7 with a highly expressive and dynamic state evolution mechanism, allowing for fine-grained control over its internal memory. Critically, the paper provides theoretical proofs that RWKV-7 possesses computational power beyond $\mathsf{TC}^0$ (the conjectured limit of Transformers), enabling it to perform state tracking and recognize all regular languages with a constant number of layers, thereby solving $NC^1$-complete problems.

Beyond its core language modeling capabilities, RWKV-7 maintains inherent RNN advantages: linear time complexity and constant memory usage per token during inference, making it highly efficient for long contexts. The release of RWKV World v3, a 3.1 trillion token multilingual corpus, and four pre-trained RWKV-7 models under the Apache 2.0 License, further supports openness, reproducibility, and adoption within the research community. The architecture also shows promising results in multimodal applications (vision and audio) and in-context search for board games, demonstrating its versatility and robustness.

7.2. Limitations & Future Work

7.2.1. Limitations

The authors acknowledge several limitations of RWKV-7 and the current research:

  • Numerical Precision: Certain operators, especially the WKV7 kernel, are sensitive to numerical precision. Differences in kernel implementations and precision settings during training can impact dynamics and results, necessitating careful handling for deployment. The observation of NaN loss during training, despite overall stability, points to this sensitivity.
  • Lack of Instruction Tuning and Alignment: The released RWKV-7 models are pretrained base models. They have not undergone Supervised Fine-Tuning (SFT) for instruction following or alignment with human preferences (RLHF). This limits their direct usability in real-world conversational or instruction-based applications.
  • Prompt Sensitivity: RWKV-7 models exhibit significant prompt sensitivity, particularly to the presence of the <|endoftext|> special token at the beginning of inputs. Omitting it can lead to degraded performance, suggesting challenges in retaining initial token information or properly initializing the state.
  • Compute Resources: Training was constrained by available compute resources (max $12 \times 8 = 96$ Nvidia H800 GPUs). This is less than resources used for very large modern LLM training (e.g., DeepSeek-V3). The need to continue training from pre-existing checkpoints of earlier RWKV versions and re-use parts of the dataset might limit the full potential of RWKV-7 compared to training from scratch.

7.2.2. Future Work

The authors propose several directions for future research:

  • Training Larger Models and Datasets: Scaling RWKV-7 to even larger sizes and training on more extensive datasets is a primary goal, contingent on securing additional computational resources.
  • Speedup Techniques: Exploring various speed optimization techniques (e.g., Dual Pipelining Mechanism, Mixture-of-Experts, Multi-Token Prediction, FP8 Training) highlighted in other works like DeepSeek-V3. Many of these are orthogonal to RWKV-7's architectural optimizations and could be integrated. Further kernel-level optimizations and distributed training strategies are also planned.
  • Incorporating Chain-of-Thought Reasoning: Developing reinforcement learning pipelines to integrate Chain-of-Thought (CoT) reasoning (Wei et al., 2022) into RWKV-7. The authors believe that as a linear RNN, RWKV-7 is well-suited for efficient CoT due to its state-tracking capabilities, which could enable it to excel in multi-step logical reasoning and complex problem-solving.

7.3. Personal Insights & Critique

This paper presents a compelling case for the continued relevance and potential superiority of RNN-based architectures in the era of Transformers, particularly for efficiency and deep state-tracking. The RWKV-7 architecture, with its generalized delta rule, stands out as a thoughtful and theoretically grounded approach to overcoming long-standing RNN limitations while avoiding the quadratic scaling of Transformers.

Innovations and Strengths:

  • Theoretical Expressivity: The formal proofs that RWKV-7 can recognize all regular languages and solve $NC^1$-complete problems are a standout contribution. This isn't just an empirical gain; it suggests a fundamental architectural advantage over $\mathsf{TC}^0$-limited models. This expressive power for state tracking is highly valuable for tasks requiring sequential logic, programmatic execution, or complex memory operations, which Transformers often struggle with or require external tools for.
  • Efficiency and Performance: Achieving SoTA multilingual and near-SoTA English performance with significantly fewer training tokens than competitors like Qwen2.5 is a testament to RWKV-7's data and parameter efficiency. Coupled with its linear inference time and constant memory usage, it positions RWKV-7 as a highly practical and scalable choice for long-context applications where Transformers become prohibitive.
  • Fine-Grained State Control: The vector-valued gating and in-context learning rates, along with decoupled removal/replacement keys, represent a sophisticated evolution of the delta rule. This level of control over the WKV state likely contributes to its superior memory management and associative recall capabilities. The analogy of the RWKV-7 state as an "internal scratchpad" is very intuitive and highlights its dynamic nature.
  • Open Science: The commitment to open-sourcing models, datasets, and code is commendable and crucial for fostering research and adoption of RWKV as a viable alternative.

Potential Issues and Areas for Improvement/Critique:

  • Numerical Precision and Stability Trade-offs: While stability is generally good, the observed NaN loss and sensitivity to AdamW $\epsilon$ suggest that the underlying numerical stability, especially in complex CUDA kernels, is a continuous engineering challenge. The balance between maximum decay, expressivity, and numerical stability might require further exploration.
  • Complexity of State Dynamics vs. Interpretability: The generalized delta rule is powerful, but its vector-valued and decoupled nature might make mechanistic interpretability more challenging than simpler delta rule variants or Transformers. Understanding why specific channels decay or learn at certain rates could be a complex research area.
  • Pre-training from Checkpoints: The limitation of continuing training from older RWKV checkpoints rather than starting from scratch on the full RWKV-7 architecture and World v3 dataset is a practical constraint. It's possible that RWKV-7's full potential has not yet been unlocked, and some inductive biases from older architectures might persist. This is acknowledged by the authors but remains an area where future research could yield even stronger results.
  • Prompt Sensitivity (Practical Implications): The prompt sensitivity to <|endoftext|> is a practical hurdle. While a recommendation is given, it implies that RWKV-7's state initialization is not as robust or context-agnostic as Transformers. This might require more careful prompt engineering or architectural modifications to make the models more user-friendly out-of-the-box.
  • Role of $c=2$ in Expressivity Proofs: The proof for $NC^1$ expressivity relies on setting $c=2$ in the transition matrix, while the current model uses $c=1$. Although the authors explain that GroupNorm can compensate for magnitude scaling, the theoretical reliance on $c=2$ (which enables operations like perfect swaps) for certain proofs might indicate a slight mismatch or a need for fine-tuning the model to fully exploit this theoretical capability in practice.

Transferability and Future Value:

  • Alternative to Transformers: RWKV-7 provides a strong, efficient alternative to Transformers, especially as sequence lengths grow and computational resources become a bottleneck. Its linear scaling is a fundamental advantage that will become increasingly important.

  • Foundation for State-Tracking AI: The proven $NC^1$ expressivity opens doors for RNNs to excel in tasks that inherently require complex state tracking, such as symbolic reasoning, code execution simulation, or advanced agents that maintain sophisticated internal states. This could be applied to domains like robotics, formal verification, or even more advanced board game AI.

  • Multimodal Potential: The successful application to VisualRWKV-7 and AudioRWKV-7 highlights the architecture's versatility beyond pure text. Its efficiency can be a significant advantage in multimodal settings where input sequences (e.g., high-resolution images, long audio streams) are inherently long and high-dimensional.

  • Bridging Theory and Practice: The paper successfully bridges theoretical insights into computational complexity with practical, high-performing models, providing a valuable framework for understanding the capabilities of different neural network architectures.

    Overall, RWKV-7 "Goose" is a significant and exciting development. It demonstrates that with innovative architectural design and a deep understanding of recurrent dynamics, RNNs can not only compete with but theoretically surpass Transformers in specific areas of expressivity and practical efficiency.
