RWKV-7 "Goose" with Expressive Dynamic State Evolution
TL;DR Summary
RWKV-7 "Goose" is a novel sequence modeling architecture that achieves constant memory usage and inference time. This 2.9 billion parameter model sets new state-of-the-art performance on multilingual tasks and matches existing benchmarks in English, while introducing generalized
Abstract
We present RWKV-7 "Goose", a new sequence modeling architecture with constant memory usage and constant inference time per token. Despite being trained on dramatically fewer tokens than other top models, our 2.9 billion parameter language model achieves a new 3B SoTA on multilingual tasks and matches the current 3B SoTA on English language downstream performance. RWKV-7 introduces a newly generalized formulation of the delta rule with vector-valued gating and in-context learning rates, as well as a relaxed value replacement rule. We show that RWKV-7 can perform state tracking and recognize all regular languages, while retaining parallelizability of training. This exceeds the capabilities of Transformers under standard complexity conjectures, which are limited to $\mathsf{TC}^0$. To demonstrate RWKV-7's language modeling capability, we also present an extended open source 3.1 trillion token multilingual corpus, and train four RWKV-7 models ranging from 0.19 billion to 2.9 billion parameters on this dataset. To foster openness, reproduction, and adoption, we release our models and dataset component listing at https://huggingface.co/RWKV, and our training and inference code at https://github.com/RWKV/RWKV-LM all under the Apache 2.0 License.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
RWKV-7 "Goose" with Expressive Dynamic State Evolution
1.2. Authors
The paper lists a large number of authors, indicating a collaborative effort across multiple institutions and projects. The primary authors mentioned at the start are Bo Peng, Ruichong Zhang, Daniel Goldstein, Eric Alcaide, Xingjian Du, Haowen Hou, Jiaju Lin, Jiaxing Liu, Janna Lu, William Merrill, Guangyu Song, Kaifeng Tan, Saiteja Utpala, Nathan Wilce, Johan S. Wind, Tianyi Wu, Daniel Wuttke, and Christian Zhou-Zheng.
Their affiliations include:
- RWKV Project (under Linux Foundation AI & Data)
- EleutherAI
- Tsinghua University
- Recursal AI
- Dalle Molle Institute for Artificial Intelligence USI-SUPSI
- University of Rochester
- Other universities, including Zhejiang University, George Mason University, New York University, the University of Oslo, and Beijing Normal University.
This diverse set of affiliations suggests a broad range of expertise contributing to the project, from core architectural design and optimization (RWKV Project, EleutherAI) to academic research (universities) and applied AI (Recursal AI).
1.3. Journal/Conference
The paper is published as a preprint on arXiv (arXiv:2503.14456). While arXiv is not a peer-reviewed journal or conference, it is a highly influential platform for rapid dissemination of research in physics, mathematics, computer science, and other fields. Papers posted here often undergo community scrutiny and may later be submitted to formal venues. The publication date (2025-03-18) indicates it's a very recent work.
1.4. Publication Year
2025
1.5. Abstract
The paper introduces RWKV-7 "Goose", a novel sequence modeling architecture designed to overcome the limitations of traditional Transformers, particularly their quadratic scaling with sequence length. RWKV-7 achieves constant memory usage and inference time per token, making it highly efficient for long sequences. Despite being trained on significantly less data than competitor models, the 2.9 billion parameter version of RWKV-7 establishes a new State-of-the-Art (SoTA) for 3B parameter models on multilingual tasks and matches the current 3B SoTA on English language benchmarks.
Key innovations in RWKV-7 include a generalized delta rule with vector-valued gating and in-context learning rates, along with a relaxed value replacement rule. The authors demonstrate RWKV-7's capability for state tracking and its ability to recognize all regular languages, a computational power that exceeds that of Transformers under standard complexity conjectures (which are limited to $\mathsf{TC}^0$).
To support RWKV-7's language modeling capabilities, the paper also introduces RWKV World v3, an extended open-source multilingual corpus of 3.1 trillion tokens. Four RWKV-7 models, ranging from 0.19 billion to 2.9 billion parameters, were trained on this new dataset. To promote transparency and reproducibility, the models, dataset component listings, and training/inference code are open-sourced under the Apache 2.0 License.
1.6. Original Source Link
- Original Source Link: https://arxiv.org/abs/2503.14456
- PDF Link: https://arxiv.org/pdf/2503.14456v2.pdf
- Publication Status: Preprint on arXiv.
2. Executive Summary
2.1. Background & Motivation
The field of sequence modeling has been largely dominated by Transformer architectures, particularly for autoregressive tasks like language modeling. Transformers excel at in-context processing and offer highly parallelizable training, primarily due to their softmax attention mechanism. However, this very mechanism is also their Achilles' heel: it incurs quadratic computational complexity and memory usage with respect to the sequence length. This quadratic scaling makes Transformer inference increasingly costly and impractical for very long sequences, a critical limitation in many real-world applications.
This limitation has spurred significant research into alternative architectures, particularly recurrent neural networks (RNNs) and State Space Models (SSMs), that aim to achieve linear computational complexity and constant memory usage per token during inference, while retaining efficient parallel training capabilities. These models typically rely on compressive states—fixed-size representations of past information that are updated recurrently.
One notable line of research has focused on linear attention variants and architectures incorporating the delta rule. The delta rule, originally from adaptive filtering, enables models to explicitly learn and update a key-value compressive state, allowing for both adding new memories and selectively removing old ones. Previous RWKV models (e.g., RWKV-4, RWKV-5, RWKV-6) have shown increasing potential to rival Transformers in performance while significantly reducing inference costs.
The core problem this paper aims to solve is to further enhance the efficiency and expressivity of RNN-based models to match or exceed Transformer performance at comparable scales, especially for long-context scenarios, without sacrificing their superior inference characteristics (constant memory and linear time). The challenge lies in designing a recurrent update mechanism that is both powerful enough to capture complex dependencies and efficient enough for large-scale deployment, while also being parallelizable for training.
The paper's innovative idea centers on generalizing the delta rule to create a more expressive and dynamically evolving state update mechanism for RWKV-7, moving beyond scalar learning rates and fixed decay mechanisms seen in prior delta rule variants.
2.2. Main Contributions / Findings
The paper makes several significant contributions:
- The RWKV-7 "Goose" Architecture: Introduction of a novel RNN architecture that dramatically improves downstream benchmark performance over its predecessor, RWKV-6. It achieves state-of-the-art (SoTA) multilingual performance and near-SoTA English language performance for 3B parameter models, despite being trained on significantly fewer tokens than competing top models, demonstrating superior parameter and data efficiency.
- Generalized Delta Rule with Enhanced Expressivity: RWKV-7 introduces a new formulation of the delta rule that includes vector-valued state gating and in-context learning rates, as well as a relaxed value replacement rule. These modifications allow for more flexible, channel-wise state updates, enhancing the model's ability to selectively modify its memory.
- Theoretical Expressivity Beyond $\mathsf{TC}^0$: The paper provides theoretical proofs that RWKV-7 can perform state tracking and recognize all regular languages. Under standard complexity conjectures (specifically, that $\mathsf{TC}^0 \neq \mathsf{NC}^1$), this capability exceeds that of Transformers, which are limited to $\mathsf{TC}^0$. This includes solving an $\mathsf{NC}^1$-complete state tracking problem (swaps on 5 elements) with a single layer and recognizing any regular language with a constant number of layers.
- The RWKV World v3 Public Dataset: Release of a new, expanded multilingual corpus totaling 3.1 trillion tokens, designed to enhance performance across English, code, and multilingual tasks. This dataset helps address the data scale gap with modern large language models.
- Publicly Released Pre-Trained Models: Release of four RWKV-7 models (0.19B to 2.9B parameters) trained on the RWKV World v3 dataset, and three RWKV-7 Pile models (0.17B to 1.47B parameters) trained on The Pile dataset, all open-sourced under the Apache 2.0 License, fostering research and adoption.
- Efficient Model Upgradation Method: A method for upgrading existing RWKV models (e.g., from RWKV-5/RWKV-6 checkpoints) to the RWKV-7 architecture without training from scratch, reducing computational expense while producing competitive models.
- Multimodal Capabilities: Demonstration of RWKV-7's versatility through adaptations to VisualRWKV-7 (for image understanding) and AudioRWKV-7 (for audio embedding analysis), showing competitive or superior performance to Transformer-based and other RNN-based multimodal models.

In essence, RWKV-7 addresses the critical balance between efficiency and expressivity in sequence modeling, offering a compelling RNN-based alternative to Transformers with strong theoretical and empirical backing, coupled with open-source releases to accelerate community research.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand RWKV-7, a grasp of several fundamental concepts in neural networks and sequence modeling is essential:
-
Recurrent Neural Networks (RNNs):
- Conceptual Definition: RNNs are a class of neural networks designed to process sequential data (like text, speech, time series) by maintaining an internal hidden state that captures information from previous time steps. Unlike feedforward networks, RNNs have connections that loop back on themselves, allowing information to persist.
- How it works: At each time step $t$, an RNN takes an input $x_t$ and its previous hidden state $h_{t-1}$ to compute a new hidden state $h_t$ and an output $y_t$. The same set of weights is used across all time steps, enabling RNNs to handle sequences of arbitrary length.
- Basic Update Equation:
$
h_t = f(W_h h_{t-1} + W_x x_t + b_h)
$
$
y_t = W_y h_t + b_y
$
  - $h_t$: Hidden state at time $t$
  - $x_t$: Input at time $t$
  - $y_t$: Output at time $t$
  - $W_h$, $W_x$, $W_y$: Weight matrices
  - $b_h$, $b_y$: Bias vectors
  - $f$: Non-linear activation function (e.g., tanh or ReLU)
- Limitations: Traditional RNNs suffer from vanishing or exploding gradients, making it difficult to learn long-range dependencies. Variants like LSTMs and GRUs address this with gating mechanisms. A minimal sketch of a single recurrent step is shown below.
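To make the recurrence concrete, here is a minimal PyTorch sketch (ours, not from the paper) of a single vanilla RNN step; tensor names mirror the symbols above and the dimensions are illustrative.

```python
import torch

def rnn_step(x_t, h_prev, W_x, W_h, W_y, b_h, b_y):
    # h_t = f(W_h h_{t-1} + W_x x_t + b_h), with f = tanh
    h_t = torch.tanh(h_prev @ W_h + x_t @ W_x + b_h)
    # y_t = W_y h_t + b_y
    y_t = h_t @ W_y + b_y
    return h_t, y_t

# toy usage: input size 8, hidden size 16, output size 8
d_in, d_hid, d_out = 8, 16, 8
W_x = torch.randn(d_in, d_hid) * 0.1
W_h = torch.randn(d_hid, d_hid) * 0.1
W_y = torch.randn(d_hid, d_out) * 0.1
b_h, b_y = torch.zeros(d_hid), torch.zeros(d_out)

h = torch.zeros(d_hid)
for x in torch.randn(5, d_in):   # a length-5 sequence, processed step by step
    h, y = rnn_step(x, h, W_x, W_h, W_y, b_h, b_y)
```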
-
Transformers and Attention Mechanism:
- Conceptual Definition: Transformers revolutionized sequence modeling by replacing recurrence with attention mechanisms. They process all tokens in a sequence simultaneously, allowing them to capture long-range dependencies much more effectively than traditional RNNs.
- Attention Mechanism: The core innovation is self-attention, which allows each token in a sequence to "attend" to (i.e., weigh the importance of) all other tokens in the same sequence. This is done by computing query (Q), key (K), and value (V) vectors for each token.
- Standard Attention Formula (softmax attention): $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $
  - $Q$: Query matrix (from current tokens, asking "what am I looking for?")
  - $K$: Key matrix (from all tokens, providing "what I have")
  - $V$: Value matrix (from all tokens, providing "information associated with what I have")
  - $d_k$: Dimension of the key vectors (used for scaling)
  - $\mathrm{softmax}$: Normalization function to get attention weights.
- Limitations: The softmax attention mechanism computes a matrix of attention scores of size sequence length × sequence length, leading to quadratic computational complexity ($O(N^2)$) and memory usage ($O(N^2)$) with respect to sequence length ($N$). This makes Transformers expensive for very long sequences; a minimal sketch of the computation follows below.
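Below is a small, self-contained sketch (ours, not from the paper) of scaled dot-product attention in PyTorch, illustrating why the score matrix grows quadratically with sequence length.

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (seq_len, d_k); scores: (seq_len, seq_len) -> quadratic in seq_len
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
    weights = torch.softmax(scores, dim=-1)
    return weights @ V

# toy usage with a sequence of 6 tokens and d_k = 16
seq_len, d_k = 6, 16
Q, K, V = (torch.randn(seq_len, d_k) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)   # (6, 16)
```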
-
State Space Models (SSMs):
- Conceptual Definition: SSMs are a class of models rooted in control theory, adapted for sequence modeling. They maintain a continuous latent state that evolves over time, summarizing the history of the input. They aim to combine the efficiency of RNNs (linear complexity) with the ability to capture long-range dependencies, often through specialized structured state space formulations.
- How it works (simplified): An SSM can be seen as having a hidden state $h_t$ that is updated via linear dynamics ($h_t = A h_{t-1} + B x_t$) and then mapped to an output ($y_t = C h_t + D x_t$). The matrices A, B, C, D can be learned.
- Motivation: SSMs like S4 and Mamba aim to provide linear scaling for inference and parallelizable training (via their convolutional representation), making them a promising alternative to Transformers.
-
Delta Rule:
- Conceptual Definition: The delta rule (also known as the Widrow-Hoff learning rule) is a fundamental algorithm in neural networks for updating weights to minimize the error between a network's output and a target output. It is akin to a single step of stochastic gradient descent (SGD).
- Application in Sequence Models: In the context of recurrent sequence models, the delta rule is applied to update a key-value compressive state. The state acts as a memory, where for a given input key, the model learns to retrieve a corresponding value. The update rule allows the model to "add" new key-value associations and "remove" (or diminish) old ones, addressing the fixed-size state limitation of simple linear attention, which only adds to the state.
- Core Idea: It reframes state updates as an online learning problem, where the state $S_t$ is trained at test time to output desired values $v_t$ for keys $k_t$.
-
Complexity Classes ($\mathsf{TC}^0$, $\mathsf{NC}^1$):
- Conceptual Definition: These are computational complexity classes used to classify the difficulty of problems based on the resources (time, memory, parallel operations) required to solve them.
- $\mathsf{TC}^0$: This class consists of problems solvable by constant-depth, polynomial-size circuits using threshold gates. Transformers and many SSMs (especially those with diagonal transition matrices) are conjectured to be limited to $\mathsf{TC}^0$. Problems in $\mathsf{TC}^0$ can perform parallel counting and simple arithmetic, but struggle with tasks requiring unbounded state tracking or complex sequential logic.
- $\mathsf{NC}^1$: This class contains problems solvable by logarithmic-depth, polynomial-size circuits using AND, OR, NOT gates. $\mathsf{NC}^1$ is considered at least as powerful as $\mathsf{TC}^0$ ($\mathsf{TC}^0 \subseteq \mathsf{NC}^1$, and it is widely conjectured that $\mathsf{TC}^0 \neq \mathsf{NC}^1$). Problems in $\mathsf{NC}^1$ can perform tasks like group multiplication and recognizing all regular languages, which require more sophisticated state tracking than $\mathsf{TC}^0$ allows.
- Significance: Proving that RWKV-7 can solve $\mathsf{NC}^1$-complete problems (like group multiplication) or recognize all regular languages suggests it has fundamentally greater expressive power than architectures limited to $\mathsf{TC}^0$.
-
Token Shift:
- Conceptual Definition: A mechanism used in RWKV models where, for a given input token, a portion of its features are linearly interpolated with features from the previous token. This provides a form of local context or short 1D convolution without increasing computational complexity.
- Purpose: Allows the model to consider information from the immediate past token, potentially aiding in induction heads (a type of circuit that copies previous inputs) and capturing local dependencies.
-
Low-Rank MLPs:
- Conceptual Definition: A type of Multi-Layer Perceptron (MLP) where the hidden layer has a significantly smaller dimension compared to the input and output dimensions.
- Purpose: Used to implement data dependency with minimal parameters. By projecting inputs into a lower-dimensional space, applying a non-linearity, and then projecting back, low-rank MLPs can create complex, input-dependent transformations while keeping the parameter count and computational cost low.
3.2. Previous Works
The paper builds upon a rich history of RNN and linear attention research, explicitly referencing several key models:
-
Linear Attention Variants (e.g., RWKV-4, RWKV-5, RWKV-6, RetNet, Gated Linear Attention (GLA), Mamba-2):
- Core Idea: These models aim to approximate softmax attention with a linear kernel (linear attention) or use recurrent formulations to achieve constant per-token inference time and memory. They maintain a fixed-size state that aggregates past information.
- Limitation (addressed by the delta rule): Early linear attention models often suffered from the "numerically increasing state" problem: old state contents were never truly "removed," only diluted. This can lead to state mixing or muddying of information over long sequences. Modern variants introduce per-time-step decay (e.g., RetNet, Mamba-2) to mitigate this, but decay is a "blunt tool" that cannot selectively remove specific memories associated with particular keys.
- RWKV (Receptance Weighted Key Value): A family of RNN-based models.
  - RWKV-4 (Peng et al., 2023): Marked a significant step, showing that RNNs could rival Transformers in performance. Introduced token-shift and a form of vector-valued decay.
  - RWKV-5 & RWKV-6 (Peng et al., 2024a, 2024b): Continued architectural improvements, with RWKV-6 introducing matrix-valued states and dynamic recurrence. RWKV-6c (a sub-version) attempted to address state normalization issues.
- RetNet (Sun et al., 2023): Another linear attention model that employs a per-time-step decay mechanism to manage state growth.
- Gated Linear Attention (GLA) (Yang et al., 2023a): Further refined linear attention with gating mechanisms.
- Mamba 2 (Dao & Gu, 2024): A state-space model that also incorporates per-time-step decay and selective state spaces.
-
DeltaNet (Schlag et al., 2021):
- Core Idea: The first to apply the Error Correcting Delta Rule to key-value compressive states in the context of RNN-like sequence models. It directly addresses the state mixing problem by explicitly "replacing" values.
- Update Rule:
$
S_t = S_{t-1}(I - a k_t^T k_t) + a \nu_t^T k_t
$
  - $S_t$: The state matrix at time $t$.
  - $I$: Identity matrix.
  - $a$: A scalar learning rate (how much of the old value is removed and the new value is added).
  - $k_t$: Current input key vector.
  - $\nu_t$: Desired output value vector for $k_t$.
- Significance: Allowed for partial replacement of values associated with a specific key, enabling true "forgetting" and "learning" in the state. Showed diagonal plus low-rank (DPLR) state evolution, enabling parallelizable training. A short numerical sketch of this update rule follows below.
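The following is a minimal sketch (ours, not from the paper) of the DeltaNet-style update in PyTorch, using row-vector keys and values so that $\nu^T k$ is an outer product; names are illustrative.

```python
import torch

def deltanet_step(S, k, v, a):
    """One DeltaNet-style update: S_t = S_{t-1} (I - a k^T k) + a v^T k.

    S: (d_v, d_k) state matrix, k: (d_k,) unit-norm key, v: (d_v,) value,
    a: scalar in-context learning rate in [0, 1].
    """
    d_k = k.shape[0]
    I = torch.eye(d_k)
    transition = I - a * torch.outer(k, k)         # removes the old value stored under k
    return S @ transition + a * torch.outer(v, k)  # writes the new value under k

# toy usage: after the update, reading the state with the same key returns ~v
d_k, d_v = 4, 4
S = torch.zeros(d_v, d_k)
k = torch.nn.functional.normalize(torch.randn(d_k), dim=0)
v = torch.randn(d_v)
S = deltanet_step(S, k, v, a=1.0)
print(torch.allclose(S @ k, v, atol=1e-5))  # True: the value is retrievable by its key
```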
-
Concurrent Work (building on DeltaNet or similar ideas):
- Longhorn (Liu et al., 2024): Approximates a globally optimal update objective, applied to a Mamba architecture.
- Gated Delta Networks (Yang et al., 2024a): Applies gating to the DeltaNet state, multiplying the transition matrix by a data-dependent scalar per head. Combines the delta rule with scalar decay.
  - Update Rule:
$
S_t = S_{t-1}(\mathrm{diag}(w_t) - k_t^T k_t \mathrm{diag}(a_t)) + \nu_t^T k_t \mathrm{diag}(a_t)
$
    - $w_t$: data-dependent scalar decay.
    - $a_t$: data-dependent scalar learning rate.
- TTT (Test-Time Training) (Sun et al., 2024) and Titans (Behrouz et al., 2024): Both apply scalar decay but use batched multi-timestep approaches instead of per-step gradient descent updates. Titans also adds momentum.
- Unlocking State-Tracking in Linear RNNs Through Negative Eigenvalues (Grazzi et al., 2024): Explores the expressiveness gained by allowing the state transition matrix to have negative eigenvalues, hinting at richer dynamics.
3.3. Technological Evolution
The field of sequence modeling has undergone a significant evolution:
- Early RNNs (e.g., vanilla RNN, LSTM, GRU): Focused on sequential processing and state memory. Suffered from gradient issues and limited parallelism.
- Attention Mechanisms (Transformers): Revolutionized the field by enabling parallel processing and global context capture, leading to rapid advancements in language understanding. However, they introduced quadratic scaling bottlenecks.
- Efficiency-Focused Models (Linear Attention, SSMs): A response to Transformer limitations, aiming to regain linear scaling for inference while retaining parallel training. Examples include RWKV, RetNet, Mamba, and GLA. These often struggle with state mixing or require complex decay mechanisms.
- Delta Rule Integration: DeltaNet introduced an explicit replacement mechanism for state updates, allowing for more precise memory management than simple decay. This marked a shift towards online learning paradigms within RNN states.
- Generalized Delta Rule (RWKV-7): The current paper represents an evolution of the delta rule by making its components (gating, learning rates) vector-valued and data-dependent, and by decoupling removal and replacement keys. This aims to enhance expressivity and control over the state beyond previous scalar-based or fixed-decay approaches. The theoretical demonstration of RWKV-7 achieving $\mathsf{NC}^1$-level expressivity, while previous efficient models are often confined to $\mathsf{TC}^0$, positions it as a significant step in pushing the computational boundaries of RNN-like architectures.
3.4. Differentiation Analysis
RWKV-7 differentiates itself from previous RNN-based and delta rule-inspired architectures in several key ways:
- Generalized Delta Rule Formulation:
  - RWKV-7's update uses the transition matrix $\mathrm{diag}(w_t) - \hat{\kappa}_t^T (a_t \odot \hat{\kappa}_t)$, a more general diagonal plus rank one update than in DeltaNet or Gated Delta Networks.
  - Vector-valued Gating and In-Context Learning Rates: Unlike DeltaNet, Gated Delta Networks, TTT, or Titans, which use scalar decay ($w_t$) or learning rates ($a_t$), RWKV-7 employs vector-valued $w_t$ and $a_t$. This means each channel or dimension within the state can have its own independent decay and learning rate, allowing for much finer-grained control over information flow and memory updates. This is a significant expressive boost over scalar approaches.
  - Decoupled Removal and Replacement Keys: RWKV-7 explicitly separates the removal key ($\hat{\kappa}_t$) from the replacement key ($\tilde{k}_t$) and the in-context learning rate ($a_t$) from the replacement rate booster. This allows the model to "remove" information associated with one key (or a transformed version of it) while "adding" information based on another, or to control the amount added independently. This is more flexible than prior delta rule formulations where removal and addition amounts were tightly coupled by a single scalar learning rate.
- Expressivity Beyond $\mathsf{TC}^0$:
  - The paper formally proves that RWKV-7 can recognize all regular languages and solve $\mathsf{NC}^1$-complete problems (like tracking swaps on five elements) with a constant number of layers. This is a crucial theoretical advantage over Transformers and many SSMs (like S4 and Mamba), which are conjectured to be limited to $\mathsf{TC}^0$. This implies RWKV-7 can handle more complex state-tracking problems inherently.
  - This expressive power comes from RWKV-7's non-diagonal and input-dependent transition matrix, particularly its ability to represent the "copy" state transition.
- Modified Architecture Components:
  - Simplified Token Shift: Reverted to a simpler, non-data-dependent token shift from RWKV-6 to improve training speed, a pragmatic trade-off.
  - Modified MLP Module: Removed the receptance gating matrix and expanded the hidden dimension to maintain parameter count, streamlining the feedforward component.
  - Low-Rank Projections: Increased use of low-rank projections for intermediate calculations, optimizing the balance between parameters, speed, and performance.
- Numerical Stability: The modifications in RWKV-7 (e.g., restricting the entries of $w_t$, the replacement rate booster) lead to better numerical stability of the WKV state, preventing values from accumulating to thousands, a problem observed in earlier RWKV-6 versions.

In essence, RWKV-7 advances the RNN paradigm by offering a more generalized, expressive, and numerically stable delta rule implementation, providing stronger theoretical guarantees for complex state tracking while maintaining linear scaling properties.
4. Methodology
4.1. Principles
The core principle behind RWKV-7 "Goose" is to overcome the quadratic complexity of Transformer attention by employing a highly efficient recurrent neural network (RNN) architecture that scales linearly with sequence length and maintains constant memory usage per token during inference. This is achieved through a dynamically evolving, matrix-valued state that learns and adapts in-context.
The central theoretical basis is the generalized delta rule. Instead of simply summing information into a state (as in some linear attention models), the delta rule enables explicit online learning at test time. The model's state acts as a programmable memory, where new key-value associations can be added, and older, less relevant information can be selectively removed or updated. This "compressive state" mechanism is crucial for retaining long-range dependencies within a fixed memory footprint.
RWKV-7 extends this delta rule by making its key components vector-valued and data-dependent, rather than scalar. This allows for a much richer and fine-grained control over how information is stored and retrieved across different channels of the state. The architecture is designed to be fully parallelizable during training, leveraging efficient CUDA kernels despite its recurrent nature.
A significant intuition is that by enabling explicit state manipulation (e.g., selective replacement, vector-valued learning rates, decoupled keys), the model gains greater expressive power. This is formally backed by proofs that RWKV-7 can recognize all regular languages and solve $\mathsf{NC}^1$-complete problems, which are beyond the capabilities of architectures limited to $\mathsf{TC}^0$, like Transformers, under common complexity conjectures. This expressive dynamic state evolution allows RWKV-7 to perform more sophisticated state tracking and in-context learning.
4.2. Core Methodology In-depth (Layer by Layer)
The RWKV-7 architecture processes sequences by stacking multiple identical RWKV-7 blocks. Each block consists of a Time Mixing module and an MLP module, with LayerNorm applications interspersed. The overall architecture is depicted in Figure 1 and Figure 11.
The following figure (Figure 1 from the original paper) presents the overall architecture of RWKV-7. Please refer to Appendix F for more details.
Figure 1: RWKV-7's overall architecture.
The following figure (Figure 11 from the original paper) shows the detailed architecture of RWKV-7.
Figure 11: The architecture of RWKV-7, drawn in detail.
4.2.1. Time Mixing Module
The Time Mixing module is the core of RWKV-7, responsible for its recurrent state evolution. It processes the input token by first applying token-shift operations and then computing various data-dependent parameters that control the Weighted Key Value (WKV) state update.
1. Token Shift:
The module starts by preparing a token-shifted input.
- $x_i$: The current input feature vector for the token at position $i$.
- $x_{i-1}$: The input feature vector for the token at position $i-1$.
- $\mu$: A learned parameter controlling the interpolation amount.
- $\mathrm{lerp}(a, b, \mu) = a + (b - a) \odot \mu$: Linear interpolation function.
The token-shifted input is a linear interpolation between the current token's input $x_i$ and the previous token's input $x_{i-1}$. This provides a form of local context or short 1D convolution without data dependency (unlike RWKV-6). The shift_state variable in the pseudocode (Appendix G, line 11) is used to store the last token's input for the next time step. A small sketch of this operation over a whole sequence follows below.
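As a concrete illustration (a sketch, not the official implementation), token shift over a batch of sequences can be written in PyTorch as follows; `mu` is the learned interpolation parameter.

```python
import torch

def token_shift(x, mu, shift_state=None):
    """Blend each token's features with the previous token's features.

    x: (B, T, C) inputs, mu: (C,) learned mixing weights in [0, 1],
    shift_state: (B, C) last token of the previous chunk (zeros if None).
    Returns the shifted mix and the new shift_state.
    """
    B, T, C = x.shape
    prev = torch.zeros(B, 1, C) if shift_state is None else shift_state.unsqueeze(1)
    x_prev = torch.cat([prev, x[:, :-1, :]], dim=1)   # features of token i-1 at position i
    x_shifted = torch.lerp(x, x_prev, mu)             # x + mu * (x_prev - x)
    return x_shifted, x[:, -1, :]                     # carry the last token forward

# toy usage
x = torch.randn(2, 5, 8)
mu = torch.full((8,), 0.3)
x_shifted, shift_state = token_shift(x, mu)
```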
2. Weight Preparation:
Following token shift, several data-dependent parameters are computed. These parameters control the WKV state evolution, acting as gates or modifiers. RWKV-7 leverages low-rank MLPs (loramlp) for efficiency in computing these parameters.
The loramlp function is defined as:
$
\mathrm{loramlp}_{\Pi}(f, x, \mathrm{bias}) = f(x A_{\Pi}) B_{\Pi} + (\lambda_{\Pi} \text{ if bias else } 0)
$
- $f$: An activation function.
- $x$: The input vector.
- $A_{\Pi}$, $B_{\Pi}$: Weight matrices for the low-rank MLP.
- $\lambda_{\Pi}$: A bias term (optional, depending on the bias flag).
This function represents a 2-layer MLP where the hidden dimension is smaller than the input/output, minimizing parameters. A minimal sketch is given below.
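A minimal PyTorch sketch (assumed shapes and names, not the official code) of such a low-rank MLP:

```python
import torch
import torch.nn as nn

class LoRAMLP(nn.Module):
    """Two-layer MLP with a small hidden ("rank") dimension, plus an optional bias."""

    def __init__(self, dim, rank, activation=torch.tanh, bias=True):
        super().__init__()
        self.A = nn.Parameter(torch.randn(dim, rank) * 0.01)   # down-projection
        self.B = nn.Parameter(torch.zeros(rank, dim))          # up-projection
        self.lam = nn.Parameter(torch.zeros(dim)) if bias else None
        self.activation = activation

    def forward(self, x):
        out = self.activation(x @ self.A) @ self.B
        return out + self.lam if self.lam is not None else out

# toy usage: model dimension 512, low-rank intermediate dimension 64
lora = LoRAMLP(dim=512, rank=64)
y = lora(torch.randn(2, 10, 512))   # (2, 10, 512)
```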
The following parameters are computed (Appendix G, lines 14-19):
- Receptance ($r$): Acts like a query in Transformers.
  $ x_{\mathrm{receptance}} = \mathrm{lerp}(x, x_{\mathrm{shifted}}, \mathrm{params.mu_r}) $
  $ r = x_{\mathrm{receptance}} @ \mathrm{params.W_{receptance}} $
  - params.mu_r: Learned interpolation parameter.
  - params.W_receptance: Weight matrix.
- Decay precursor ($d$): Used to compute the in-context weight decay ($w_t$).
  $ x_{\mathrm{decay}} = \mathrm{lerp}(x, x_{\mathrm{shifted}}, \mathrm{params.mu_d}) $
  $ d = \mathrm{params.decay\_lora}(x_{\mathrm{decay}}) $
  - params.mu_d: Learned interpolation parameter.
  - params.decay_lora: A low-rank MLP (see Figure 11).
- Key ($k$): A precursor to both the removal key ($\hat{\kappa}_t$) and the replacement key ($\tilde{k}_t$).
  $ x_{\mathrm{key}} = \mathrm{lerp}(x, x_{\mathrm{shifted}}, \mathrm{params.mu_k}) $
  $ k = x_{\mathrm{key}} @ \mathrm{params.W_{key}} $
  - params.mu_k: Learned interpolation parameter.
  - params.W_key: Weight matrix.
- Value Precursor (vprime): An initial value for the value residual learning.
  $ x_{\mathrm{value}} = \mathrm{lerp}(x, x_{\mathrm{shifted}}, \mathrm{params.mu_v}) $
  $ \mathrm{vprime} = x_{\mathrm{value}} @ \mathrm{params.W_{value}} $
  - params.mu_v: Learned interpolation parameter.
  - params.W_value: Weight matrix.
- In-Context Learning Rate (ICLR) Precursor: Used to compute the in-context learning rate ($a_t$).
  $ x_{\mathrm{iclr}} = \mathrm{lerp}(x, x_{\mathrm{shifted}}, \mathrm{params.mu_a}) $
  $ \mathrm{iclr\_raw} = \mathrm{params.iclr\_lora}(x_{\mathrm{iclr}}) $
  - params.mu_a: Learned interpolation parameter.
  - params.iclr_lora: A low-rank MLP (see Figure 11). The final in-context learning rate is then obtained by a sigmoid activation, $\mathrm{iclr} = \sigma(\mathrm{iclr\_raw})$, to restrict elements to $(0, 1)$.
- Gate Precursor: Used to compute the rwkv gate ($g$).
  $ x_{\mathrm{gate}} = \mathrm{lerp}(x, x_{\mathrm{shifted}}, \mathrm{params.mu_g}) $
  $ \mathrm{gate\_raw} = \mathrm{params.gate\_lora}(x_{\mathrm{gate}}) $
  - params.mu_g: Learned interpolation parameter.
  - params.gate_lora: A low-rank MLP (see Figure 11). The final rwkv gate is then obtained by a sigmoid activation, $g = \sigma(\mathrm{gate\_raw})$.
The raw value $\nu$ is derived from vprime. For the first layer (layer_id == 0), $\nu$ is simply vprime. For subsequent layers, value residual learning is applied:
$
\mathrm{value\_residual\_gate} = \mathrm{th.sigmoid}(\mathrm{params.nu\_lora}(x_{\mathrm{value}}))
$
$
\nu = \mathrm{th.lerp}(\mathrm{vprime}, \mathrm{vprime}_0, \mathrm{value\_residual\_gate})
$
- params.nu_lora: A low-rank MLP (see Figure 11).
- $\mathrm{vprime}_0$: The vprime from the first layer, carried forward. This allows later layers to learn to blend their own vprime with the vprime from the initial layer, which can be seen as a form of residual connection for the value.
Next, the actual decay $w_t$, removal key $\hat{\kappa}_t$, and replacement key $\tilde{k}_t$ are formed:
- Decay ($w_t$):
$
w_t = \mathrm{th.exp}(-\mathrm{math.exp}(-0.5) \cdot \mathrm{d.to(th.float).sigmoid()})
$
  - This formula ensures all entries of $w_t$ lie within $(e^{-e^{-0.5}}, 1)$, which is approximately $(0.545, 1)$. This range is chosen to maintain training stability and a smaller condition number for $\mathrm{diag}(w_t)$.
- Removal Key ($\hat{\kappa}_t$):
$
\mathrm{removal\_k\_raw} = k \cdot \mathrm{params.removal\_key\_multiplier}
$
$
\hat{\kappa}_t = \mathrm{F.normalize}(\mathrm{removal\_k\_raw.view}(B,T,H,-1), \mathrm{dim}=-1).\mathrm{view}(B,T,C)
$
  - params.removal_key_multiplier: A learned parameter used to scale the key.
  - F.normalize: The removal key is L2-normalized per head. This is crucial for the delta rule variations to prevent unwanted changes in the amount removed due to implicit squaring of its length.
- Replacement Key ($\tilde{k}_t$):
$
\tilde{k}_t = \mathrm{th.lerp}(k, k \odot \mathrm{iclr}, \mathrm{params.iclr\_mix\_amt})
$
  - params.iclr_mix_amt (the replacement rate booster): A learned parameter that interpolates between the raw key and the key scaled by the in-context learning rate iclr. This allows dynamic, per-channel control over the amount added to the state. A small sketch of these preparations follows below.
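The following PyTorch sketch (shapes and names assumed for illustration, not the official kernel) shows how the decay, removal key, and replacement key can be derived from the precursors above:

```python
import math
import torch
import torch.nn.functional as F

def prepare_wkv_inputs(d, k, iclr, removal_key_multiplier, iclr_mix_amt, n_heads):
    """d, k, iclr: (B, T, C); returns decay w, removal key kappa_hat, replacement key k_tilde."""
    B, T, C = k.shape
    # decay in (exp(-exp(-0.5)), 1) ~= (0.545, 1), one value per channel
    w = torch.exp(-math.exp(-0.5) * torch.sigmoid(d))
    # removal key: scaled, then L2-normalized per head
    kappa = k * removal_key_multiplier
    kappa_hat = F.normalize(kappa.view(B, T, n_heads, -1), dim=-1).view(B, T, C)
    # replacement key: interpolate between k and k scaled by the in-context learning rate
    k_tilde = torch.lerp(k, k * iclr, iclr_mix_amt)
    return w, kappa_hat, k_tilde

# toy usage with C = 64 channels split across 4 heads
B, T, C, H = 1, 3, 64, 4
d, k, iclr = torch.randn(B, T, C), torch.randn(B, T, C), torch.rand(B, T, C)
w, kappa_hat, k_tilde = prepare_wkv_inputs(d, k, iclr, torch.ones(C), torch.full((C,), 0.5), H)
```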
3. Weighted Key Value (WKV) State Evolution:
The WKV is a multi-headed matrix-valued state (fast weights) that undergoes dynamic evolution. The WKV state (wkv_state in pseudocode) is crucial for encoding context information by learning key-value mappings.
The state update is defined by the recurrence relation:
$
\pmb{wkv}_0 = \mathbf{0}
$
$
\pmb{wkv}_t = \pmb{wkv}_{t-1} \left( \mathrm{diag}(w_t) - \hat{\kappa}_t^T (a_t \odot \hat{\kappa}_t) \right) + \nu_t^T \tilde{k}_t
$
- $\pmb{wkv}_t$: The WKV state matrix at time $t$, with dimensions $(D/h) \times (D/h)$ for each head.
- $\pmb{wkv}_{t-1}$: The WKV state from the previous time step.
- $\mathrm{diag}(w_t)$: A diagonal matrix whose diagonal elements are the vector-valued decay $w_t$.
- $\hat{\kappa}_t^T$: Transpose of the normalized removal key vector.
- $a_t$: Vector-valued in-context learning rate.
- $\odot$: Element-wise product (Hadamard product).
- $\nu_t^T$: Transpose of the value vector.
- $\tilde{k}_t$: Replacement key vector.
The term $\mathrm{diag}(w_t) - \hat{\kappa}_t^T (a_t \odot \hat{\kappa}_t)$ is the transition matrix. This matrix is a diagonal plus rank one update, which allows for efficient parallelization. The transition matrix is no longer a Householder matrix but a scaled approximation, offering expanded dynamics while keeping its eigenvalues in a stable range. This formulation combines dynamic state evolution with an approximation of a forget gate.
The recurrent formulation can also be written in a parallel manner (Appendix G, lines 41-52 demonstrate the recurrent loop for clarity, but parallel implementations exist):
$
\pmb{wkv}_t = \sum_{i=1}^{t} \left( \nu_i^T \tilde{k}_i \prod_{j=i+1}^{t} \left( \mathrm{diag}(w_j) - \hat{\kappa}_j^T (a_j \odot \hat{\kappa}_j) \right) \right) \in \mathbb{R}^{(D/h) \times (D/h)}
$
This formulation allows RWKV-7 to be trained in parallel across the time dimension, similar to Transformers or SSMs, despite its recurrent inference.
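To make the recurrence concrete, here is a naive sequential sketch in PyTorch (illustrative only; the released kernels are chunked and parallelized). It follows the single-head recurrence above, with row-vector keys and values so that $\nu^T \tilde{k}$ is an outer product, and ends by reading the state out with the receptance.

```python
import torch

def wkv7_recurrent(w, kappa_hat, a, v, k_tilde, r):
    """Naive single-head RWKV-7 state evolution.

    All inputs are (T, N) with N = head size. Returns outputs y of shape (T, N).
    wkv_t = wkv_{t-1} (diag(w_t) - kappa_hat_t^T (a_t * kappa_hat_t)) + v_t^T k_tilde_t
    """
    T, N = w.shape
    state = torch.zeros(N, N)               # wkv_0 = 0
    outputs = []
    for t in range(T):
        transition = torch.diag(w[t]) - torch.outer(kappa_hat[t], a[t] * kappa_hat[t])
        state = state @ transition + torch.outer(v[t], k_tilde[t])
        outputs.append(state @ r[t])        # readout: apply the receptance (query)
    return torch.stack(outputs)

# toy usage for one head of size 8 over 4 time steps
T, N = 4, 8
w = torch.full((T, N), 0.9)                          # per-channel decay in (0.545, 1)
kappa_hat = torch.nn.functional.normalize(torch.randn(T, N), dim=-1)
a = torch.rand(T, N)                                  # vector-valued in-context learning rate
v, k_tilde, r = torch.randn(T, N), torch.randn(T, N), torch.randn(T, N)
y = wkv7_recurrent(w, kappa_hat, a, v, k_tilde, r)    # (4, 8)
```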
4. WKV Bonus and Output:
After the WKV state is updated, the receptance (query) is applied to the state to retrieve information:
$
y_t = \mathrm{wkv\_state} \; @ \; r_t
$
- wkv_state: The current WKV state.
- $r_t$: The receptance vector.
The result is then passed through LayerNorm (specifically GroupNorm per head in the pseudocode, Appendix G, lines 56-57) to ensure numerical stability and consistent scaling:
$ p_t = \mathrm{LayerNorm}(r_t \, \mathrm{wkv}_t^T) + u_t $
where $u_t$ is a bonus term:
$ u_t = \left( r_t \cdot (\rho \odot \tilde{k}_t)^T \right) \nu_t $
- $\rho$: A trainable parameter (params.bonus_multiplier in the pseudocode) which weights the bonus. This term ensures that the current shifted input token can receive extra attention without necessarily being stored in the state.
The heads are then recombined (y.view(B, T, -1) in the pseudocode) to form $p_t$. Finally, this recombined output is gated and transformed into the module's output:
$ o_t = (g_t \odot p_t) W_o \in \mathbb{R}^D $
- $g_t$: The rwkv gate (computed in weight preparation).
- $W_o$: Output weight matrix (params.W_output).
4.2.2. MLP Module (Channel Mixing)
The MLP module in RWKV-7 differs from previous RWKV versions. It is a two-layer feedforward network without the receptance gating matrix () found in RWKV-4,5,6. The hidden dimension is set to 4D to compensate for the removed gating parameters and maintain similar parameter counts.
The MLP module operates as follows:
$
k_t^\prime = \mathbf{lerp}(x_t^\prime, x_{t-1}^\prime, \mu_k^\prime) \mathbf{W}_{k^\prime} \in \mathbb{R}^{4D}
$
- $x_t^\prime$: Input to the MLP module.
- $x_{t-1}^\prime$: Token-shifted input to the MLP module.
- $\mu_k^\prime$: Learned interpolation parameter (see the pseudocode).
- $\mathbf{W}_{k^\prime}$: Weight matrix ($D \times 4D$).
The activation function used is ReLU squared ($\mathrm{ReLU}(\cdot)^2$). This non-linearity is then followed by a final linear transformation:
$ o_t^\prime = \mathrm{ReLU}(k_t^\prime)^2 \mathbf{W}_{\nu^\prime} \in \mathbb{R}^D $
- $\mathbf{W}_{\nu^\prime}$: Weight matrix ($4D \times D$).
The output is then added back to the residual stream. A brief sketch of this module appears below.
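A compact PyTorch sketch (illustrative, with assumed parameter names) of this channel-mixing MLP:

```python
import torch
import torch.nn as nn

class RWKV7ChannelMix(nn.Module):
    """Two-layer MLP with token shift and squared-ReLU activation (no receptance gate)."""

    def __init__(self, dim):
        super().__init__()
        self.mu_k = nn.Parameter(torch.full((dim,), 0.5))  # token-shift mixing weight
        self.W_k = nn.Linear(dim, 4 * dim, bias=False)     # D -> 4D
        self.W_v = nn.Linear(4 * dim, dim, bias=False)     # 4D -> D

    def forward(self, x):
        # x: (B, T, D); blend each token with the previous token's features
        x_prev = torch.cat([torch.zeros_like(x[:, :1]), x[:, :-1]], dim=1)
        k = self.W_k(torch.lerp(x, x_prev, self.mu_k))
        return self.W_v(torch.relu(k) ** 2)                # added to the residual stream by the caller

# toy usage
mix = RWKV7ChannelMix(dim=64)
out = mix(torch.randn(2, 5, 64))   # (2, 5, 64)
```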
4.2.3. Pseudocode For RWKV-7 (Appendix G)
The provided pseudocode outlines the forward pass of RWKV-7. It demonstrates the sequential processing within a batch and across time steps for the Time Mixing and Channel Mixing modules.
rwkv_timemix function:
- Lines 11-12: Implements the
token-shiftoperation by concatenating the previousshift_state(last token of the previous sequence or batch) with the current sequence, then updatingshift_statewith the last token of the current sequence. - Lines 14-19: Compute the
lerpforx_receptance,x_decay,x_key,x_value,x_iclr,x_gateusingmuparameters. - Lines 21-26: Apply linear transformations (
@ params.W_...) orlow-rank MLPs(params...._lora) andsigmoidactivation foriclrto get , , ,vprime,gate,iclr. - Lines 29-33: Handle
value residual learning. Forlayer_id == 0(first layer), is simplyvprime. For other layers, is an interpolation betweenvprimeand (from the first layer), controlled byvalue_residual_gate. - Lines 35-38: Compute
decay(), (), and () as described above. - Lines 42-52: The core
WKV state evolutionloop. For each time step :- It extracts , , removal_k_t, replacement_k_t, , for the current token.
- Line 51: wkv_state = wkv_state * decay_t.mT - wkv_state @ removal_k_t @ (iclr_t * removal_k_t).mT. This updates the
wkv_statebased on and removes information using removal_k_t and . Note the.mT(matrix transpose) implies these are vectors becoming or(N,N)matrices for matrix multiplication. This is equivalent to wkv_state * diag(decay_t) - wkv_state @ (removal_k_t.T @ (iclr_t * removal_k_t)). - Line 52:
wkv_state = wkv_state + v_t @ replacement_k_t.mT. This adds new information using and replacement_k_t. Equivalent to wkv_state + (v_t.T @ replacement_k_t).
- Lines 53-54: Computes (applying receptance) and stores it in
out. - Lines 56-57: Applies
GroupNormto . - Lines 60-61: Calculates and adds the
bonusterm. - Line 63: Applies the
rwkv gateand final output linear transformation.
rwkv_channelmix function:
- Lines 69-70:
token-shiftfor theMLPmodule. - Line 71: . Interpolated input.
- Line 72: . First linear layer.
- Line 73: .
ReLUsquared activation and second linear layer.
rwkv_model function:
- Lines 78-79: Initial embedding lookup and
LayerNorm. - Lines 81-99: Loop through layers, applying
rwkv_timemixandrwkv_channelmixto the residual stream. - Lines 101-102: Final
LayerNormand linearheadfor logits.
4.2.4. PyTorch code For Naive WKV7 Kernel (Forward and Backward) (Appendix H)
The provided WKV7_Kernel class in PyTorch illustrates a naive implementation of the WKV state evolution, including both forward and backward passes.
forwardmethod:- Lines 12-13:
self.state_cachestores theWKV stateat each time step, starting from the initial state (line 13). This is crucial for the backward pass. - Line 14: . This calculates the
decaymatrix elements (the diagonal elements of ). - Lines 16-28: The loop iterates through time steps .
- Lines 23-25: These lines implement the
WKV stateupdate:state * W[:, t, :, None, :]: This applies the term element-wise.- : This term corresponds to in the generalized delta rule . Here, is effectively computed as a
rank-1 updatefromaaandbb. Note: The pseudocode and the main text's formula differ slightly from this kernel. The kernel uses where is a rank-1 update fromaaandbb, which implies that is meant to implement in the original formula. - : This corresponds to (or in the kernel).
- Line 28: . This applies
receptancerrto thestateto get the outputout.
- Lines 23-25: These lines implement the
- Lines 12-13:
backwardmethod:- Lines 32-38: Initializes gradients for , , , , , to zeros.
- Lines 40-58: Iterates backward through time from
T-1down to0. This is typical forRNNbackpropagation, using thestate_cachefrom the forward pass. - It computes gradients
gr,gk,gv,ga,gb,gwusingeinsumfor matrix multiplications involving thegout(gradient of output) andgstate(gradient of state) and cached states. - Line 57: . This updates
gstatefor the previous time step, effectively backpropagating through the recurrent state update. - Line 58: . This calculates the gradient of with respect to (raw values).
4.2.5. Expressivity of RWKV-7 (Appendix D)
This section provides theoretical proofs for RWKV-7's expressive power, claiming it surpasses $\mathsf{TC}^0$ and can recognize all regular languages.
D.1 Warmup: Expressivity Beyond TC0
- Theorem 2: RWKV-7 can solve an $\mathsf{NC}^1$-complete problem under $\mathsf{AC}^0$ reductions. This is significant because Transformers and diagonal SSMs are limited to $\mathsf{TC}^0$, and it is conjectured that $\mathsf{TC}^0 \neq \mathsf{NC}^1$.
- Lemma 1 (Arbitrary Swap Matrix): The RWKV-7 transition matrix can represent an arbitrary swap matrix (an identity matrix with two rows swapped).
  - Proof Sketch: Setting, for example, $w_t = \mathbf{1}$, $a_t = 2 \cdot \mathbf{1}$, and $\hat{\kappa}_t = (e_i - e_j)/\sqrt{2}$ makes the transition matrix equal to $I - (e_i - e_j)(e_i - e_j)^T$, which is the permutation matrix that swaps indices $i$ and $j$ (this construction is easy to verify numerically).
- Lemma 2 (Tracking Swaps on 5 elements): A one-layer RWKV-7 model can track sequences of swaps on 5 elements and determine whether the final permutation is the identity.
  - Proof Sketch: Uses 5 WKV heads of dimension 5.
    - A special beginning-of-sequence token initializes the $i$-th head's state to a one-hot encoding of state $i$.
    - Swap tokens are handled by setting the parameters ($w_t$, $a_t$, $\hat{\kappa}_t$) so that the state correctly swaps entries $j$ and $k$ whenever the state was originally $j$ or $k$.
    - The MLP combines the heads' outputs to check whether all 5 heads are back in their initial states (the identity permutation).
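As a quick numerical check of Lemma 1's swap construction (our own sketch, not from the paper), the transition matrix $\mathrm{diag}(w) - \hat{\kappa}^T(a \odot \hat{\kappa})$ reduces to a swap matrix for the parameter choice above:

```python
import torch

def transition(w, kappa_hat, a):
    # diag(w) - kappa_hat^T (a * kappa_hat), with row-vector kappa_hat
    return torch.diag(w) - torch.outer(kappa_hat, a * kappa_hat)

n, i, j = 5, 1, 3
e = torch.eye(n)
w = torch.ones(n)                          # no decay
a = 2.0 * torch.ones(n)                    # learning rate of 2 per channel, as in the construction
kappa_hat = (e[i] - e[j]) / (2.0 ** 0.5)   # unit-norm removal key along e_i - e_j

M = transition(w, kappa_hat, a)
P = torch.eye(n)
P[[i, j]] = P[[j, i]]                      # permutation matrix swapping rows i and j
print(torch.allclose(M, P))                # True
```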
D.2 Main Result: RWKV-7 Can Recognize Any Regular Language
- Theorem 3: For any regular language, there exists a 4-layer RWKV-7 model that recognizes it.
  - Proof Strategy: Simulating a Deterministic Finite Automaton (DFA). A DFA is defined as a tuple $(Q, \Sigma, \delta, q_0, F)$, where $Q$ is a finite set of states, $\Sigma$ is a vocabulary, $\delta$ is a transition function, $q_0$ is the initial state, and $F$ is a set of accepting states. DFA computation can be represented by matrix products: each input symbol corresponds to a transition matrix, and running the DFA amounts to multiplying the one-hot encoded initial state by the transition matrices of the input symbols in order.
  - Challenge: An arbitrary DFA transition matrix can have rank up to $|Q|$, while a single WKV head only implements a simple rank-1 update.
  - Solution: Factor each DFA transition matrix into a product of elementary transition matrices (Lemma 3), each of which can be directly implemented by WKV heads (Lemma 4). The RWKV-7 model uses its first three layers to compute these elementary matrices for blocks of DFA transitions and the fourth layer to multiply them. LayerNorm is key to preserving one-hot encoded information between layers. The matrix-product view of DFA execution is illustrated below.
D.3 Lemmas for Theorem 3
- Lemma 3 (DFA Transition Matrix Factorization): Any
DFA transition matrix(which has a single 1 in each column) can be factored into a product ofelementary transition matrices(), where each is either anidentity matrix, aswap matrix(), or acopy matrix(, replacing column with a copy of column ).- Proof Sketch: Greedily build from an
identity matrixby right-multiplyingelementary transition matricesin three stages: fixing mismatched columns by swaps, then by copies, then filling remaining with identities.
- Proof Sketch: Greedily build from an
- Lemma 4 (Implementing Elementary Transition Matrices): For any
elementary transition matrix(from Lemma 3), there exist vectors and such that where and .- Proof Sketch: Shows specific settings for and for
identity,swap, andcopymatrices.- Identity: , .
- Swap (): , .
- Copy (): , .
- Proof Sketch: Shows specific settings for and for
- Lemma 5 (Position Tracking - First & Parity): A 1-layer
RWKV-7can output if the current position is first, and if it's even or odd.- Proof Sketch: Uses
token-shiftfor "first". For parity, theWKV stateis initialized and then updated by for , causing the state to flip sign, whichreceptancecan read out.
- Proof Sketch: Uses
- Lemma 6 (Position Modulo
2n): A 2-layerRWKV-7can output the position modulo2n.- Proof Sketch: Builds on Lemma 5. Uses a rotation mechanism in the
WKV stateforodd t( where rotates) and an identity foreven t, allowing the state to track angle. MultipleWKV headswith differentreceptancevectors can then read out the position modulo2n.
- Proof Sketch: Builds on Lemma 5. Uses a rotation mechanism in the
- Lemma 7 (Lookup Table Simulation): A layer of
RWKV-7can simulate a lookup table that takes the current position modulo2nand the2nmost recent tokens as keys. A 3-layerRWKV-7can compute .-
Proof Sketch: The
WKV stateis updated as . This effectively stores the last2ntokens in the state.n WKV headswithreceptancecan read out the state. AnMLPlayer then performs the lookup. Lemmas 5 and 6 provide the and recent tokens.Removing the Assumption : The construction uses for
elementary transition matrices, while the actual model uses . The paper notes that halving and simply halves thetransition matrix. SinceGroupNormimmediately follows, the magnitude of theWKV statedoes not affect calculations. Also, for constructions requiring , is used, so mismatch with their scales is not an issue.
-
4.2.6. Additional Architectural and Training Details (Appendix E)
- Parameters and Dimensions:
- Model dimension , number of layers , number of heads , vocabulary size .
- All models use head size .
Pilemodels: (GPT-NeoX 20B tokenizer).Worldmodels: (RWKV World tokenizer).- Table 15 provides detailed parameters for released models.
RWKV-7uses fourlow-rank MLPsfordecay(),in-context learning rate(),value residual(), andgate(). Intermediate dimensions for theseloramlpsare listed in Table 16.- The total number of parameters is calculated by:
$
# (\mathrm{Params}) = 2DV + 4D + LD(12D + 2(d_w + d_a + d_\nu + d_g) + 19) - (2Dd_\nu + D)
$
- : For embeddings, head, and
LayerNorms. - : Parameters per layer (except first).
- : Subtracted because the
value residual low-rank MLPis not in the first layer.
- : For embeddings, head, and
- Parameter Initializations: Emphasizes the importance of specific initialization strategies detailed in the official code repository for replicating results.
- Dataset Loading:
- Uses
mmapfor the 3.1 trillion token dataset. - Employs a custom
pseudo-random number generatorover for diverse sampling. - is the largest prime of form smaller than
[dataset_size/4096]. is close to0.618p. This ensures uniform access and pseudo-randomness.
- Uses
- Training Details:
bfloat16format on Nvidia H800 GPUs.AdamWoptimizer: , weight decay 0.1 (only to linear layers and embeddings). The small is to stabilize training and mitigate loss spikes.- Context length: 4096 tokens.
- Base
decay rateparameters have a 2x learning rate multiplier. Cosine learning rate decayschedule combined withphased dynamic batch size scaling(inspired bycritical batch sizeandSmith et al. (2018)). This increases batch size and adjusts LR over phases, optimizing GPU resource utilization.- Figure 12 shows stable training loss curves without spikes.
- Observed
NaNloss sometimes, attributed to very lowAdamW. Handled by rewinding to prior checkpoint and clearing optimizer states.
4.2.7. Additional Architecture Discussion (Appendix F)
This appendix offers a more qualitative discussion of RWKV-7's design choices and their rationale.
- Token Shift: Simpler, non-data-dependent
token-shift(likeRWKV-4,5) was chosen overRWKV-6's data-dependent version for improved training/inference efficiency, despite slight loss decrease per step. It still allowsinduction headsand local context. - Delta Rule Philosophy:
RWKV-7operates on the principle ofdecay,removal, andadditionto the state, akin toSGD.Decayis likeweight decay. - Vector-Valued Decay:
RWKVmodels usevector-valued decayinstead ofscalar-valued(likeMamba-2) because it offers significantloss per stepimprovement, despite being harder to compute efficiently. - Decay Limit: Max decay is limited (e.g., removal per timestep) for stability and efficient kernel implementation.
- Vector-Valued In-Context Learning Rate (ICLR): More expressive than scalar
ICLRsinTTT,Gated DeltaNet,Titans. Allows each key channel its own learning rate. - Decoupling Keys: The
removal keyis decoupled and normalized separately from thereplacement key. This contrasts withRWKV-6cwhich enforced normalization by pre-multiplying the key by (1 - decay), butRWKV-7lets the model learn these decisions. - Expressivity Beyond Significance: Emphasizes that
RWKV-7's ability to edit its state at each token (unlikeTransformer's immutable KV Cache) allowscomplex operationslike swapping entries, leading to greatercomputational abilitiesfor fixed inputs and solving problems (e.g., tracking swaps) thatTransformerscannot. TheRWKV-7 stateis like an "internal scratchpad." - WKV Head as Linear Model:
A WKV headcan be viewed as a linear model that updates its weights (WKV state entries) as context progresses. - Normalization after Receptance:
Group Normalizationafterreceptanceis common practice to ensurenumerical stabilityby preventing state magnitude changes from impacting model use. - Bonus Term: The
per-headbonus term (related totime-firstterm inRWKV-4toRWKV-6) allows special treatment for the current token's information, now extracted from theWKV kernelfor simplicity.
5. Experimental Setup
5.1. Datasets
The RWKV-7 models were trained and evaluated on a combination of existing and newly constructed datasets.
1. RWKV World v3 Dataset:
-
Source: Extended open-source multilingual corpus.
-
Scale: 3.119 trillion tokens.
-
Characteristics: Designed for enhanced English, code, and multilingual task performance. It builds upon
RWKV World v2(0.6 trillion tokens) andv2.1(1.4 trillion tokens) by adding new components and slightly enhancing Chinese novels. All tokens are given equal weighting unless specified. -
Domain: Diverse, including Web, Books, Code, Science & Wiki, Fiction, Chat & QA & Instruction, Math, Law & Government, Poetry & Lyrics.
-
Components: The
RWKV World v2.1dataset components (added to to sum to ~1.4 trillion tokens): The following are the results from Table 11 of the original paper:Dataset Domain Dataset Domain slimpajama C4 Web Llama-3-Magpie-Pro-1M-v0.1 Align dolma v1.6 (reddit only)a Forums Magpie-Pro-MT-300K-v0.1 Align glaive-code-assistant-v3 Code Magpie-Air-MT-300K-v0.1 Align m-a-p_Code-Feedback Code Magpie-Qwen2-Pro-1M-v0.1 Align cosmopedia-v0.1 Synthetic Magpie-Phi3-Pro-300K-Filtered- Align SystemChat-2.0 Instruct vl Tess-v1.5 Instruct Magpie-Gemma2-Pro-200K- Align UltraInteract_sft Instruct Filtered-v0.1 aWe added only the reddit datasets from dolma v1.6
The
RWKV World v3dataset components (added tov2.1to sum to ~3.1 trillion tokens): The following are the results from Table 12 of the original paper:Dataset Domain Dataset Domain REMOVED slimpajama partsa Web StarCoderc Code dclm-baseline-10-of-10b Web python-edu Code ccnews Web cosmopedia-v0.2 Synthetic fineweb-edu Web Edu WebInstructSub Forums TemplateGSM Math Buzz-v1.2 Instruct open-web-math Math SKGInstruct Instruct algebraic-stack Math FLAN Instruct aWe removed the CC and C4 components of SlimPajama from the corpus for World v3 CL-baseie, clueonly global-shard100 For StarCoder, we nowinclude alldatasets, insteadof just those datases with at least 10 stars
A summary of categories and token counts for
RWKV World v3: The following are the results from Table 14 of the original paper:Category Tokens (B) Web 1945.2 Books 337.2 Code 258.4 Science & Wiki 222.7 Fiction 192.6 Chat & QA & Instruction 110.0 Math 32.3 Law & Government 19.0 Poetry & Lyrics 1.7 Total 3119.2 Why chosen: This dataset is designed to provide a broad and deep multilingual, code, and English corpus, aiming to close the data gap with very large modern LLMs and achieve strong generalization across various domains.
2. The Pile Dataset (Gao et al., 2020):
- Source: A large, diverse, open-source dataset of text.
- Scale: 332 billion tokens.
- Characteristics: Composed of 22 smaller, high-quality datasets spanning various domains (e.g., scientific papers, books, web text, code).
- Domain: General English language.
- Why chosen: It is a standard benchmark dataset for training and evaluating language models, allowing for comparative study with other architectures like
PythiaandMamba.
3. PG19 Dataset (Rae et al., 2019):
- Source: A collection of books from Project Gutenberg, filtered for quality.
- Characteristics: Contains long documents.
- Why chosen: Used for evaluating
long contextcapabilities by measuring loss versus sequence position.
4. Recent Internet Data:
- Source: Temporally novel internet data created after January 2025.
- Characteristics: Includes new computer science/physics papers on
arXiv, Python/C++ repositories onGitHub, Wikipedia entries, fiction onArchive of Our Own (AO3), and recent news articles. - Why chosen: To complement traditional benchmarks and address concerns about
data leakageby evaluating on data not present in training sets.
5. Associative Recall (AR) Datasets (e.g., MQAR):
- Source: Synthetic datasets designed for
associative recalltasks. - Why chosen: To evaluate the model's ability to recall previously encountered information within a context, reflecting its effectiveness in
in-context learning.
6. Mechanistic Architecture Design (MAD) Benchmark (Poli et al., 2024):
- Source: A suite of synthetic token manipulation tasks.
- Why chosen: To probe architectural capabilities in sequence modeling, such as
in-context recall,fuzzy recall,memorization, andselective copying.
7. Group Multiplication Tasks (Merrill et al., 2024):
- Source: Synthetic sequences of group elements (e.g., from , , or ).
- Why chosen: To evaluate
state-tracking capabilitiesfor problems known to be -complete, directly testingRWKV-7's theoretical expressivity.
8. Context Length Extension Dataset:
-
Source: Public and custom sources (Table 8).
-
Characteristics: Constructed to prioritize longer documents with a length-based weighting scheme (documents < 32,768 characters weighted 1.0, longer documents 2.0-3.0, up to 128k tokens).
-
Why chosen: For fine-tuning RWKV-7 models on very long contexts (128k tokens) to improve retrieval accuracy at extreme lengths. The composition (Table 8 of the original paper) is:

| Dataset | Type | Amount |
| --- | --- | --- |
| dclm-baseline-1.0 | Public | 25% |
| fineweb-edu | Public | 15% |
| fineweb | Public | 5% |
| codeparrot/github-code | Public | 10% |
| arXiv-CC0-v0.5 | Custom | 10% |
| SuperWikiNEXT-32B | Custom | 10% |
| public domain books | Custom | 15% |
| the-stack (filtered) | Custom | 10% |
9. AudioSet Dataset (Gemmeke et al., 2017):
- Source: A large-scale collection of human-labeled audio events.
- Why chosen: For evaluating
AudioRWKV-7performance onaudio embedding analysistasks.
10. Othello/Reversi Board Game Data:
- Source: Custom-designed training data based on the game of Othello (Reversi).
- Data Sample:
Each training data sample from Appendix I of the original paper has the following structure: an "input" block giving the current board, the player to move (NEXT), and the search limits (MAX_WIDTH-2, MAX_DEPTH-2); a "reasoning" block listing candidate moves with scores (e.g., "g1 -19 h1 -01 b2 -08 h2 -23 b7 -12 g7 -09"), Alpha-Beta stack frames tracking Remaining_Depth, Alpha, Beta, Best, Current, and Unexplored moves, intermediate board states, and the final decision ("> Playing h1"); and an "output" block giving the chosen move (here, h1) and the resulting board.
- Description: Each sample includes an Input section (game state, search parameters), a Reasoning section (legal moves, evaluations, Alpha-Beta pruning steps for optimal moves, generated by the Egaroucid engine), and an Output section (final move, resulting board). A minimal Alpha-Beta search sketch follows this list.
- Why chosen: To evaluate RWKV-7's state tracking and in-context search capabilities for strategic decision-making in complex environments.
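Since the reasoning traces above follow an Alpha-Beta pruning search, here is a minimal generic sketch of that algorithm (an illustration only; the paper's traces are produced by the Egaroucid engine, and the `legal_moves`, `apply_move`, and `evaluate` callbacks here are placeholders to be supplied by the caller):

```python
import math

def alpha_beta(state, depth, alpha, beta, maximizing, legal_moves, apply_move, evaluate):
    """Generic Alpha-Beta pruning over a game tree; callbacks define the actual game."""
    moves = legal_moves(state)
    if depth == 0 or not moves:
        return evaluate(state), None
    best_move = None
    if maximizing:
        value = -math.inf
        for move in moves:
            child_value, _ = alpha_beta(apply_move(state, move), depth - 1,
                                        alpha, beta, False, legal_moves, apply_move, evaluate)
            if child_value > value:
                value, best_move = child_value, move
            alpha = max(alpha, value)
            if beta <= alpha:  # prune: the minimizing opponent will avoid this branch
                break
        return value, best_move
    value = math.inf
    for move in moves:
        child_value, _ = alpha_beta(apply_move(state, move), depth - 1,
                                    alpha, beta, True, legal_moves, apply_move, evaluate)
        if child_value < value:
            value, best_move = child_value, move
        beta = min(beta, value)
        if beta <= alpha:  # prune: the maximizing player will avoid this branch
            break
    return value, best_move
```

The depth limit plays the role of MAX_DEPTH in the samples, while restricting the number of candidate moves explored per node corresponds to MAX_WIDTH.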
5.2. Evaluation Metrics
The paper employs various metrics to evaluate RWKV-7 across different tasks:
1. Accuracy (ACC):
- Conceptual Definition: Measures the proportion of correctly predicted instances out of the total number of instances. It is a fundamental metric for classification tasks, including language modeling where the task is to predict the next token correctly.
- Mathematical Formula: $ \mathrm{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}} $
- Symbol Explanation:
- Number of Correct Predictions: The count of times the model's output matches the ground truth.
- Total Number of Predictions: The total number of predictions made by the model.
2. Perplexity (PPL):
- Conceptual Definition: A common metric for evaluating language models, particularly for how well they predict a sample of text. Lower perplexity indicates a better model, as it means the model is less "perplexed" (more confident and accurate) by the text it is evaluating. It is the exponentiated average negative log-likelihood of a sequence.
- Mathematical Formula: For a sequence of tokens $W = (w_1, \ldots, w_N)$, the perplexity is defined as: $ \mathrm{PPL}(W) = \left( \prod_{i=1}^N \frac{1}{P(w_i \mid w_1, \ldots, w_{i-1})} \right)^{1/N} $ This can also be expressed in terms of the cross-entropy loss $L(W)$: $ \mathrm{PPL}(W) = e^{L(W)} = e^{-\frac{1}{N} \sum_{i=1}^N \log P(w_i \mid w_1, \ldots, w_{i-1})} $
- Symbol Explanation:
  - $W$: A sequence of tokens $w_1, \ldots, w_N$.
  - $N$: The total number of tokens in the sequence.
  - $P(w_i \mid w_1, \ldots, w_{i-1})$: The probability assigned by the language model to the $i$-th token $w_i$, given the preceding tokens.
  - $L(W)$: The cross-entropy loss for the sequence $W$.
  - $e$: Euler's number (base of the natural logarithm).
  (A consolidated sketch computing several of these metrics appears after this metrics list.)
3. Compression Rate:
- Conceptual Definition: Used for evaluating models on temporally novel internet data. It measures how efficiently a model can compress new, unseen data. A lower compression rate indicates that the model has learned better representations and can predict the data more effectively, requiring fewer bits to encode it. Inspired by the idea that language modeling is compression.
- Mathematical Formula: While the paper doesn't explicitly provide the formula, compression rate in this context typically refers to the average number of bits per token (BPT) or character (BPC) required by the model to encode the data. This is often approximated by the cross-entropy loss (converted to bits) or related to perplexity. For a model $M$ and data $D$: $ \mathrm{CompressionRate}(M, D) \approx \mathrm{BitsPerToken} = \frac{\sum_{i=1}^N -\log_2 P_M(w_i \mid \text{context}_i)}{N} $
- Symbol Explanation:
  - $N$: Total number of tokens in the data.
  - $P_M(w_i \mid \text{context}_i)$: The probability assigned by model $M$ to token $w_i$ given its preceding context.
  - $\log_2$: Logarithm base 2, to convert probabilities into bits.
- The paper expresses compression rate as a percentage (unit: %), likely indicating a normalized version or a specific compression algorithm's output.
4. Mean Average Precision (mAP):
- Conceptual Definition: A widely used metric for object detection and information retrieval tasks, particularly multi-label classification or ranking tasks like audio event detection. It calculates the Average Precision (AP) for each class/category and then averages these AP values across all classes. AP itself is the area under the Precision-Recall Curve. Higher mAP indicates better performance.
- Mathematical Formula:
  $ \mathrm{mAP} = \frac{1}{|Q|} \sum_{q=1}^{|Q|} \mathrm{AP}_q $
  where $\mathrm{AP}_q$ for a query (or class) $q$ is often calculated as:
  $ \mathrm{AP}_q = \sum_{k=1}^{N} P(k)\, \Delta r(k) $
  or as the area under the Precision-Recall curve for class $q$.
- Symbol Explanation:
  - $|Q|$: The total number of queries or classes.
  - $\mathrm{AP}_q$: The Average Precision for query (or class) $q$.
  - $N$: The number of retrieved documents (or predicted events).
  - P(k): The precision at cut-off $k$ in the ranked list.
  - $\Delta r(k)$: The change in recall from $k-1$ to $k$.
5. Stable Rank (SR):
- Conceptual Definition: A measure of the "effective rank" of a matrix. It quantifies how close a matrix is to a low-rank matrix. A lower stable rank suggests that the matrix's information can be represented more compactly or that it has fewer "active" dimensions, even if its mathematical rank is high.
- Mathematical Formula (Rudelson & Vershynin, 2007): $ \mathrm{SR}(A) := \left( \frac{\|A\|_F}{\|A\|_2} \right)^2 $
- Symbol Explanation:
  - $A$: The matrix (e.g., the WKV state matrix).
  - $\|A\|_F$: The Frobenius norm of $A$, defined as $\sqrt{\sum_{i,j} A_{i,j}^2}$.
  - $\|A\|_2$: The spectral norm (or operator norm) of $A$, defined as its largest singular value.
6. Root Mean Square (RMS):
- Conceptual Definition: A statistical measure of the magnitude of a varying quantity. For a matrix, it gives a sense of the typical magnitude of its elements.
- Mathematical Formula: For an $m \times n$ matrix $A$: $ \mathrm{RMS}(A) = \sqrt{\frac{1}{mn} \sum_{i=1}^m \sum_{j=1}^n A_{i,j}^2} $
- Symbol Explanation:
  - $A$: The matrix.
  - $m, n$: Dimensions of the matrix.
  - $A_{i,j}$: The element at row $i$, column $j$.
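As a consolidated illustration of the metrics defined above, here is a minimal NumPy sketch (not from the paper; variable names and toy data are illustrative) that computes perplexity and bits per token from per-token log-probabilities, and the stable rank and RMS of a state matrix:

```python
import numpy as np

def perplexity(log_probs):
    """PPL = exp(-(1/N) * sum of natural-log probabilities of the observed tokens)."""
    return float(np.exp(-np.mean(log_probs)))

def bits_per_token(log_probs):
    """Average bits needed to encode each token (a compression-rate proxy)."""
    return float(np.mean(-np.asarray(log_probs) / np.log(2)))

def stable_rank(A):
    """SR(A) = (||A||_F / ||A||_2)^2."""
    fro = np.linalg.norm(A, ord="fro")
    spec = np.linalg.norm(A, ord=2)  # largest singular value
    return float((fro / spec) ** 2)

def rms(A):
    """Root mean square of the matrix entries."""
    return float(np.sqrt(np.mean(np.square(A))))

# Toy usage with random data (illustrative only).
rng = np.random.default_rng(0)
fake_log_probs = np.log(rng.uniform(0.05, 0.9, size=1000))
state = rng.normal(scale=0.3, size=(64, 64))  # stand-in for a WKV state matrix
print(perplexity(fake_log_probs), bits_per_token(fake_log_probs))
print(stable_rank(state), rms(state))
```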
5.3. Baselines
The paper compares RWKV-7 against a range of contemporary large language models (LLMs) and sequence modeling architectures, representing different paradigms and scales:
1. Transformer-based Models:
- Qwen2.5 (Qwen et al., 2025): A highly optimized Transformer model, often representing state-of-the-art performance, especially given its large training data. RWKV-7 aims to match its performance with significantly fewer tokens.
- Llama-3.2 (Grattafiori et al., 2024): Another prominent Transformer series, known for strong performance. The paper notes that Llama-3.2 models are created via pruning and distillation from larger models, so their FLOPs are not directly comparable for training efficiency.
- Pythia (e.g., pythia-1.4b-v0, pythia-2.8b-v0): An open-source suite of Transformer models trained on The Pile dataset, serving as a direct comparison for models trained on identical data.
- Falcon3-1B-Base: Another Transformer-based model.
2. State Space Models (SSMs):
- Mamba (Gu & Dao, 2023): A foundational SSM that achieves linear time complexity during inference.
- Mamba2 (Dao & Gu, 2024): An improved version of Mamba, often used as a strong baseline for efficient sequence modeling.
- mamba-1.4b-hf, mamba-2.8b-hf: Specific Mamba variants.
- mamba2attn-2.7b: A Mamba variant that likely integrates attention.
- RecurrentGemma-2B: A recurrent model from Google, part of the Gemma family.
- gemma-2-2b: Another Gemma family model.
3. Other RNN / Linear Attention Models:
- RWKV (Predecessors): RWKV-4, RWKV-5, RWKV-6 (including RWKV6-World2.1-1.6B, RWKV6-World2.1-3B, RWKV5-World1-0.1B, RWKV5-World2-0.4B, RWKV5-World3B, etc.). These serve as direct architectural baselines to demonstrate the improvements of RWKV-7.
- SmolLM2 (Allal et al., 2025): A small language model series.
- Index-1.9B, MobileLLM-1.5B, MobileLLM-1B, Minitron-4B-Base, Zamba2-1.2B, Zamba2-2.7B: Other efficient or small-scale LLMs.
- S4: A foundational Structured State Space Model used in group multiplication experiments.
- GLA (Gated Linear Attention): A linear attention variant.
- Hyena / Multihead Hyena (Poli et al., 2023): Convolutional language models used in the MAD benchmark.
- DeltaNet (Schlag et al., 2021): The original Delta Rule model, used in the MAD benchmark.
Why these baselines are representative:
- SOTA Comparison: Qwen2.5 and Llama-3.2 represent the highest-performing Transformer models in their size class, crucial for evaluating RWKV-7's competitive performance despite lower token counts.
- Architectural Comparison: Mamba, S4, GLA, and Hyena represent the leading alternatives to Transformers that also aim for efficiency, allowing direct comparison of architectural strengths.
- Evolutionary Comparison: RWKV-4, 5, and 6 models provide a clear baseline to show the incremental improvements of the RWKV-7 architecture.
- Dataset Consistency: Pythia models trained on The Pile offer a controlled comparison for RWKV-7 Pile models on the exact same training data.
- Specific Task Benchmarks: Baselines like DeltaNet, Hyena, S4, Mamba, and classical RNNs are included for specialized tasks like Mechanistic Architecture Design and Group Multiplication to probe specific capabilities.
6. Results & Analysis
6.1. Core Results Analysis
The experimental results validate RWKV-7's effectiveness across various benchmarks, demonstrating its competitive performance despite being trained on significantly fewer tokens, and highlighting its efficiency and advanced state-tracking capabilities.
1. Language Modeling Experiments (LM Evaluation Harness Benchmarks):
RWKV-7 models were evaluated on common English-focused and multilingual benchmarks using LM Evaluation Harness.
-
English-Focused Benchmarks (Table 3): The following are the results from Table 3 of the original paper:
Model (Name) Tokens (T) lmb.o acc↑ hella acc_n↑ piqa acc↑ arcE acc↑ arcC acc↑ glue aacc WG acc↑ sciq acc↑ mmlu acc↑ avg acc↑ RWKV5-World1-0.1B SmolLM2-135M 0.6 38.4 31.9 61.4 44.2 19.9 45.5 52.9 76.3 23.1 43.7 RWKV7-World2.8-0.1B 2.0 1.6 42.9 48.1 43.1 68.4 64.4 28.1 49.0 53.0 84.0 25.8 51.0 50.5 42.1 67.3 59.3 25.5 48.1 52.7 86.3 25.4 RWKV5-World2-0.4B 1.1 54.0 40.9 66.5 54.0 24.0 50.0 53.2 86.9 23.8 50.4 SmolLM2-360M 4.0 53.8 56.4 72.1 70.4 36.5 50.7 59.0 91.2 26.3 57.4 Qwen2.5-0.5B 18.0 52.5 52.1 70.2 64.6 29.5 54.7 56.4 93.1 47.8 57.9 RWKV7-World2.9-0.4B 3.1 58.6 56.8 72.9 68.7 31.9 49.4 59.9 89.7 26.1 57.1 RWKV6-World2.1-1.6B 2.5 67.4 61.1 74.4 64.3 31.0 51.0 60.7 89.5 25.1 58.3 Llama3.2-1B b15.0 63.0 63.7 74.5 65.5 31.3 49.7 60.7 91.4 32.1 59.1 SmolLM2-1.7B 11.0 67.7 71.5 77.0 77.7 44.7 51.5 66.1 93.3 50.3 66.6 Qwen2.5-1.5B 18.0 63.0 67.7 75.8 75.5 41.2 65.0 63.4 94.2 61.0 67.4 RWKV7-World3-1.5B 5.6 69.5 70.8 77.1 78.1 44.5 62.4 68.2 94.3 43.3 67.6 RWKV6-World2.1-3B 2.5 71.7 68.4 76.4 71.2 35.6 56.3 66.3 92.2 28.3 62.9 Llama3.2-3B b15.0 70.5 73.6 76.7 74.5 42.2 50.7 69.9 95.7 56.5 67.8 Qwen2.5-3B 18.0 67.1 73.5 78.6 77.4 45.0 70.2 68.5 96.2 65.7 71.4 RWKV7-World3-2.9B 5.6 73.4 76.4 79.7 81.0 48.7 61.8 72.8 95.0 55.0 71.5 Analysis:
- Competitive English Performance: RWKV-7 models generally match the English performance of Qwen2.5, despite Qwen2.5 being trained on significantly more tokens (e.g., RWKV7-World3-2.9B (5.6T tokens, 71.5 avg acc) vs. Qwen2.5-3B (18.0T tokens, 71.4 avg acc)). This highlights RWKV-7's superior data efficiency.
- MMLU Leap: RWKV-7 models show substantial improvements in MMLU performance compared to RWKV-6, indicating better multi-task language understanding.
- Overall Scaling: As model size increases, RWKV-7 consistently shows strong performance gains, often outperforming RWKV-6 and SmolLM2 at comparable sizes.
-
Multilingual Benchmarks (Table 4): The following are the results from Table 4 of the original paper:
Model (Name) Tokens (T) lmb.m appll lmb.m acc↑ pawsx acc↑ xcopa acc↑ xnli acc↑ xsClz acc↑ xwin acc↑ avg acc↑ RWKV5-World1-0.1B SmolLM2-135M 0.6 2.0 270 1514 22.0 18.6 48.6 51.2 53.0 52.2 36.1 34.9 51.7 50.6 59.5 61.7 45.1 44.9 RWKV7-0.1B 1.6 114 31.6 46.1 53.3 37.6 52.6 64.1 47.5 RWKV5-World2-0.4B 1.1 66 36.8 49.5 54.0 38.5 54.1 65.6 49.8 SmolLM2-360M 4.0 389 25.8 51.4 51.7 36.0 51.2 67.8 47.3 Qwen2.5-0.5B 18.0 108 32.9 52.6 54.4 38.6 53.9 67.8 50.0 RWKV7-World3-0.4B 3.1 52 39.6 48.7 55.4 40.3 55.3 72.9 52.0 RWKV6-World2.1-1.6B 2.5 28 47.2 52.5 58.1 41.4 58.2 76.5 55.7 Llama3.2-1B b15.0 52 39.0 53.9 55.3 41.2 56.6 72.2 53.0 SmolLM2-1.7B 11.0 85 37.1 56.5 53.1 38.1 54.1 72.8 52.0 Qwen2.5-1.5B 18.0 49 40.0 55.3 57.4 40.6 57.7 75.8 54.5 RWKV7-World3-1.5B 5.6 25 48.4 54.8 59.7 43.7 61.4 79.8 58.0 RWKV6-World2.1-3B 2.5 21 51.0 53.4 60.2 42.7 61.3 Llama3.2-3B b15.0 30 45.9 58.5 60.6 78.8 57.9 Qwen2.5-3B 18.0 36 43.5 59.9 53.3 59.0 44.2 38.5 59.6 79.2 79.8 58.1 RWKV7-World3-2.9B 5.6 18 52.9 58.2 63.1 45.4 64.7 82.4 55.6 61.1 Analysis:
- SoTA Multilingual Performance: RWKV-7-World models show significant improvements on multilingual benchmarks, outperforming SmolLM2, Llama-3.2, and Qwen-2.5 by a notable margin at comparable parameter counts (e.g., RWKV7-World3-2.9B (5.6T tokens, 61.1 avg acc) vs. Qwen2.5-3B (18.0T tokens, 58.1 avg acc)). This indicates strong multilingual capabilities.
-
FLOPs vs. Accuracy (Figures 3 & 4): The following figure (Figure 3 from the original paper) shows model comparisons across multilingual benchmarks.
Figure 3: Model Comparisons across Multilingual Benchmarks. The following figure (Figure 4 from the original paper) shows model comparisons across English-language benchmarks.
Analysis:
- RWKV-7 models demonstrate a Pareto improvement in multilingual evaluations against Transformer models (Figure 3a).
- For English language evaluations, RWKV-7-World models achieve similar scores to highly trained Transformer models but with dramatically lower total FLOPs usage (Figure 4a). This suggests higher computational efficiency. The authors theorize this difference would be even more dramatic if models were trained from scratch with equivalent total tokens.
2. Recent Internet Data Evaluation (Table 5): The following are the results from Table 5 of the original paper:
| Model | arXiv CS↓ | arXiv Phys. ↓ | Github Python ↓ | Github C++↓ | A03 Eng ↓ | BBC news ↓ | Wiki Eng ↓ | average ↓ |
| Qwen2.5-1.5B | 8.12 | 8.65 | 4.42 | 4.40 | 11.76 | 9.58 | 9.49 | 8.06 |
| RWKV-7 1.5B | 8.25 | 8.77 | 5.57 | 5.29 | 10.93 | 9.34 | 8.97 | 8.16 |
| Llama-3.2-1B | 8.37 | 8.76 | 5.18 | 5.16 | 11.69 | 9.34 | 9.07 | 8.23 |
| SmolLM2-1.7B | 8.38 | 9.04 | 5.17 | 4.94 | 11.20 | 9.40 | 9.46 | 8.23 |
| Index-1.9B | 8.34 | 8.59 | 5.65 | 5.29 | 11.49 | 9.51 | 9.23 | 8.30 |
| stablelm-2-1.6b | 8.58 | 9.08 | 5.54 | 5.45 | 11.42 | 9.24 | 9.06 | 8.34 |
| RWKV-6 1.5B | 8.62 | 9.00 | 6.06 | 5.80 | 11.09 | 9.57 | 9.30 | 8.49 |
| RWKV-5 1.5B | 8.77 | 9.11 | 6.20 | 5.92 | 11.25 | 9.75 | 9.50 | 8.64 |
| mamba2-1.3b | 8.74 | 8.74 | 6.32 | 5.71 | 11.63 | 9.74 | 9.86 | 8.68 |
| MobileLLM-1.5B | 8.82 | 9.29 | 6.79 | 6.29 | 11.59 | 9.15 | 9.22 | 8.73 |
| mamba-1.4b-hf | 8.88 | 8.86 | 6.43 | 5.81 | 11.70 | 9.83 | 9.97 | 8.78 |
| Zamba2-1.2B | 8.57 | 9.21 | 6.91 | 7.08 | 11.39 | 9.38 | 9.26 | 8.83 |
| SmolLM-1.7B | 8.38 | 9.02 | 5.76 | 6.55 | 12.68 | 9.85 | 9.89 | 8.88 |
| MobileLLM-1B | 9.03 | 9.57 | 7.03 | 6.53 | 11.86 | 9.35 | 9.43 | 8.97 |
| RWKV-4 1.5B | 9.34 | 9.80 | 6.54 | 6.16 | 11.33 | 10.00 | 9.82 | 9.00 |
| pythia-1.4b-v0 | 9.12 | 9.20 | 6.79 | 6.15 | 12.19 | 10.20 | 10.43 | 9.15 |
| Falcon3-1B-Base | 8.60 | 9.20 | 6.92 | 7.16 | 13.04 | 10.45 | 10.75 | 9.45 |
| Llama-3.2-3B | 7.78 | 8.10 | 4.15 | 4.59 | 10.90 | 8.70 | 8.28 | 7.57 |
| Qwen2.5-3B | 7.79 | 8.25 | 4.15 | 4.12 | 11.23 | 9.15 | 8.96 | 7.66 |
| RWKV-7 2.9B | 7.90 | 8.34 | 5.16 | 4.88 | 10.48 | 8.92 | 8.47 | 7.74 |
| stablelm-3b-4e1t | 8.15 | 8.50 | 5.28 | 4.85 | 10.89 | 8.82 | 8.51 | 7.86 |
| Minitron-4B-Base | 8.09 | 8.70 | 5.13 | 4.74 | 11.05 | 9.08 | 8.90 | 7.96 |
| recurrentgemma-2b | 8.24 | 8.52 | 5.22 | 4.80 | 11.30 | 8.94 | 8.88 | 7.99 |
| RWKV-6 3B | 8.27 | 8.58 | 5.66 | 5.39 | 10.67 | 9.17 | 8.82 | 8.08 |
| gemma-2-2b | 8.39 | 8.81 | 5.36 | 5.01 | 11.35 | 8.90 | 9.03 | 8.12 |
| mamba2attn-2.7b | 8.33 | 8.29 | 5.78 | 5.22 | 11.13 | 9.28 | 9.26 | 8.18 |
| RWKV-5 3B | 8.42 | 8.70 | 5.78 | 5.51 | 10.83 | 9.36 | 9.00 | 8.23 |
| mamba2-2.7b | 8.43 | 8.37 | 5.93 | 5.34 | 11.21 | 9.37 | 9.38 | 8.29 |
| Zamba2-2.7B | 8.17 | 8.70 | 6.30 | 6.39 | 10.97 | 8.95 | 8.74 | 8.32 |
| mamba-2.8b-hf | 8.57 | 8.52 | 6.03 | 5.46 | 11.31 | 9.49 | 9.53 | 8.41 |
| RWKV-4 3B | 8.90 | 9.27 | 6.07 | 5.67 | 10.90 | 9.57 | 9.30 | 8.53 |
| pythia-2.8b-v0 | 8.72 | 8.73 | 6.29 | 5.71 | 11.66 | 9.74 | 9.82 | 8.67 |
Analysis:
- RWKV-7 Goose shows competitive performance on temporally novel data (data created after training periods), despite being trained on significantly less data than models like Qwen2.5 and Llama-3.2.
- For 3B-scale models, RWKV-7 2.9B achieves an average compression rate of 7.74%, very close to Qwen2.5-3B (7.66%) and Llama-3.2-3B (7.57%). This indicates good generalization to unseen data and robust language modeling capabilities without data leakage.
3. Associative Recall (Table 6): The following are the results from Table 6 of the original paper:
| Dim | WKV state dim | (64,4) | (128,8) | (256, 16) | (512,64) | (1024,128) | (2048,256) |
| 64 | 8192 | ✓ | ✓ | ✓ | 98.43 | 95.01 | 72.93 |
| 128 | 16384 | ✓ | ✓ | 94.97 | | | |
| 256 | 32768 | ✓ | ✓ | ✓ | ✓ | ✓ | 98.97 |
| 512 | 65536 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
Analysis:
- RWKV-7 demonstrates strong associative recall capabilities. With a WKV state dimension of 8192, it recalls 72.93% of information for 256 key-value pairs (sequence length 2048).
- The results suggest an information density of 0.547 bits per dimension, indicating efficient storage in the fixed-size state.
- Achieving >99% accuracy (indicated by '✓') for various configurations, including up to 512 key-value pairs and a 65536-dimensional WKV state, underscores its ability to effectively learn from context (a toy data-generation sketch follows).
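To make the task concrete, here is a minimal sketch (an assumption-laden illustration, not the paper's exact setup) of how a multi-query associative recall example can be generated: the model first reads key-value pairs, then must answer queries about keys seen earlier in the context.

```python
import random

def make_recall_example(num_pairs, key_vocab, value_vocab, num_queries=8, rng=random):
    """Toy multi-query associative recall: present key-value pairs, then query some keys
    (the exact format and vocabularies used in the paper are not specified here)."""
    keys = rng.sample(range(key_vocab), num_pairs)
    values = [rng.randrange(value_vocab) for _ in keys]
    context = [tok for kv in zip(keys, values) for tok in kv]  # k1 v1 k2 v2 ...
    queried = rng.sample(list(zip(keys, values)), min(num_queries, num_pairs))
    queries = [k for k, _ in queried]
    answers = [v for _, v in queried]
    return context, queries, answers

# Example roughly matching the 256-pair setting discussed above.
ctx, qs, ans = make_recall_example(num_pairs=256, key_vocab=4096, value_vocab=4096)
print(len(ctx), qs[:3], ans[:3])
```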
4. Mechanistic Architecture Design (MAD) Benchmark (Table 7): The following are the results from Table 7 of the original paper:
| Model | Compress | Fuzzy Recall | In-Context Recall | Memorize | Noisy Recall | Selective Copy | Avg |
| RWKV-7 | 44.5 | 43.2 | 100 | 89.1 | 100 | 98.8 | 79.3 |
| Transformer | 51.6 | 29.8 | 94.1 | 85.2 | 86.8 | 99.6 | 74.5 |
| Multihead Hyena | 44.8 | 14.4 | 99.0 | 89.4 | 98.6 | 93.0 | 73.2 |
| DeltaNet | 42.2 | 35.7 | 100 | 52.8 | 100 | 100 | 71.8 |
| Mamba | 52.7 | 6.7 | 90.4 | 89.5 | 90.1 | 86.3 | 69.3 |
| Hyena | 45.2 | 7.9 | 81.7 | 89.5 | 78.8 | 93.1 | 66.0 |
| GLA | 38.8 | 6.9 | 80.8 | 63.3 | 81.6 | 88.6 | 60.0 |
Analysis:
- RWKV-7 achieves the highest average score (79.3) across the six MAD tasks, outperforming Transformers, Mamba, DeltaNet, and Hyena.
- It demonstrates perfect accuracy on the In-Context Recall and Noisy Recall tasks, matching DeltaNet.
- It sets a new state of the art for Fuzzy Recall (43.2), indicating strong robustness to noisy inputs.
- Strong performance in Memorization and Selective Copying suggests an effective combination of attention-based and recurrent model strengths.
5. Long Context Experiments:
-
PG19 Loss vs. Sequence Position (Figures 5 & 6): The following figure (Figure 5 from the original paper) shows PG19 loss versus sequence position for RWKV and Mamba models trained on The Pile datasets.
Figure 5: PG19 loss versus sequence position for RWKV and Mamba models trained on The Pile datasets. The following figure (Figure 6 from the original paper) shows PG19 loss versus sequence position for RWKV7 models and predecessors trained on the World dataset.
Analysis:
- Pile-trained RWKV-7 shows more significant loss reduction on long contexts compared to its predecessors (Figure 5), demonstrating effective long-context extrapolation.
- Conversely, World-trained RWKV-7 exhibits an increasing loss trend for contexts longer than 10k (Figure 6). This is speculated to be due to inductive biases from larger datasets/models causing overfitting to specific context lengths. However, fine-tuning on long contexts can restore this capability.
-
Pass-Key Retrieval (Figure 7): The following figure (Figure 7 from the original paper) shows RWKV7-World3 pass-key retrieval evaluation.
Analysis:
- RWKV7-World3-1.5B achieves perfect retrieval up to 19,600 tokens, degrading beyond 20,600. RWKV7-World3-2.9B extends perfect retrieval up to 35,000 tokens, showing the benefits of scaling. Performance degrades around 50k tokens.
- Fine-tuning on a specially constructed dataset with 128k-token sequences further improves performance: RWKV-7 (1.5B) reliably retrieves up to 29k tokens (degrading around 40k), and RWKV-7 (2.9B) reliably retrieves up to 30k (degrading around 50k). This demonstrates the effectiveness of targeted long-context training.
6. Evaluating State Tracking Using Group Multiplication (Figure 8): The following figure (Figure 8 from the original paper) shows minimum number of layers (lower is better) required to attain validation accuracy on group multiplication problems by sequence length and group.
Analysis:
- RWKV-7 exhibits stronger state-tracking capabilities than Transformers, Mamba, and S4, requiring fewer layers to achieve high accuracy on group multiplication tasks.
- It performs slightly weaker than classical RNNs in this specific benchmark, but classical RNNs often suffer from gradient vanishing/exploding and cannot be parallelized efficiently, unlike RWKV-7.
- The results align with the theoretical prediction (Appendix D.2) that RWKV-7 can perform state tracking and recognize any regular language with a constant number of layers, highlighting its theoretical expressivity advantage over $\mathsf{TC}^0$-limited models. A sketch of the group multiplication task is given below.
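To make the benchmark concrete, here is a minimal sketch (not from the paper) of how group multiplication sequences over the symmetric group S5 can be generated: the model reads a sequence of group elements and must output the running product at each position, a task that is $\mathsf{NC}^1$-complete for non-solvable groups such as S5. The token-id encoding below is an illustrative assumption.

```python
import itertools
import random

# Elements of S5 represented as tuples: perm[i] is the image of i.
S5 = list(itertools.permutations(range(5)))

def compose(p, q):
    """Return the permutation p∘q (apply q first, then p)."""
    return tuple(p[q[i]] for i in range(5))

def make_example(seq_len, rng=random):
    """A sequence of random S5 elements and its running products (the labels)."""
    elements = [rng.choice(S5) for _ in range(seq_len)]
    prefix_products, acc = [], tuple(range(5))  # start from the identity permutation
    for g in elements:
        acc = compose(g, acc)
        prefix_products.append(acc)
    return elements, prefix_products

# Toy usage: inputs are token ids for group elements, targets are ids of prefix products.
elem_to_id = {g: i for i, g in enumerate(S5)}
xs, ys = make_example(seq_len=16)
print([elem_to_id[g] for g in xs])
print([elem_to_id[g] for g in ys])
```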
7. Speed and Memory Usage (Figure 9): The following figure (Figure 9 from the original paper) shows time vs. sequence length (H100).
Analysis:
- RWKV-7 models scale linearly with sequence length, while Flash Attention v3 scales quadratically. This makes RWKV models faster for large sequence lengths.
- The optimized RWKV-7 kernel is approximately three times faster than the official RWKV-6 kernel.
- For a sequence length of 16k, the RWKV-7 forward pass (without storing state) takes 7.9 ms, while the Flash Attention v3 forward pass takes 33.9 ms, demonstrating significant inference speed advantages.
- Memory usage is constant for single-token inference and scales linearly with chunk size for pre-fill, allowing flexible memory trade-offs.
8. Multimodal Experiments:
-
VisualRWKV-7 (Table 9, Figure 10): The following figure (Figure 10 from the original paper) shows the architecture of VisualRWKV-7. The input image is processed by three vision encoders, and the obtained features are concatenated. Afterward, they are projected through an MLP with context gating to align with the dimensions of the RWKV-7 block. Finally, the image features are concatenated with the text embeddings and fed into the RWKV-7 LLM.
Figure 10: The architecture of VisualRWKV-7. The input image is processed by three vision encoders, and the obtained features are concatenated. Afterward, they are projected through an MLP with context gating to align with the dimensions of the RWKV-7 block. Finally, the image features are concatenated with the text embeddings and fed into the RWKV-7 LLM.The following are the results from Table 9 of the original paper:
| Method | Vision Encoder | LLM | VQA | SQA | TQA | GQA |
| --- | --- | --- | --- | --- | --- | --- |
| VisualRWKV-6 | SigLIP+DINOv2+SAM | RWKV6-1.6B | 73.6 | 57.0 | 48.7 | 58.2 |
| VisualRWKV-6 | SigLIP+DINOv2+SAM | RWKV6-3.1B | 79.1 | 62.9 | 52.7 | 61.0 |
| VisualRWKV-7 | SigLIP+DINOv2+SAM | RWKV7-0.1B | 75.2 | 50.6 | 37.9 | 59.9 |
| VisualRWKV-7 | SigLIP+DINOv2+SAM | RWKV7-0.4B | 77.9 | 55.0 | 41.1 | 62.3 |
| VisualRWKV-7 | SigLIP+DINOv2+SAM | RWKV7-1.5B | 79.8 | 59.7 | 49.5 | 63.2 |
| VisualRWKV-7 | SigLIP+DINOv2+SAM | RWKV7-2.9B | 80.5 | 63.4 | 58.0 | 63.7 |

Analysis:
- VisualRWKV-7 models (e.g., 0.1B, 0.4B) outperform VisualRWKV-6 1.6B on the VQAv2 and GQA benchmarks with significantly fewer parameters, demonstrating the powerful modeling capabilities of the RWKV-7 block.
- VisualRWKV-7 2.9B outperforms VisualRWKV-6 3.1B on the out-of-domain SQA benchmark, indicating stronger generalization ability.
- A 5.3-point improvement on TextQA (TQA) for VisualRWKV-7 2.9B over VisualRWKV-6 3.1B further confirms its superior associative recall (a minimal sketch of the projection module described above follows).
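The following is a minimal PyTorch sketch of the projection step described in the Figure 10 caption: features from three vision encoders are concatenated and projected through an MLP with context gating before being prepended to the text embeddings. The encoder dimensions, gating form (a sigmoid gate), and module names are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """Concatenates features from three vision encoders and projects them to the
    LLM embedding size via an MLP with a sigmoid context gate (illustrative)."""
    def __init__(self, enc_dims=(1152, 1024, 256), llm_dim=2048):
        super().__init__()
        fused = sum(enc_dims)
        self.mlp = nn.Sequential(nn.Linear(fused, llm_dim), nn.GELU(),
                                 nn.Linear(llm_dim, llm_dim))
        self.gate = nn.Sequential(nn.Linear(fused, llm_dim), nn.Sigmoid())

    def forward(self, feats):              # feats: list of [B, N, D_i] tensors
        x = torch.cat(feats, dim=-1)       # concatenate encoder features
        return self.mlp(x) * self.gate(x)  # context-gated projection

# Toy usage: project image tokens, then prepend them to text embeddings.
B, N = 2, 196
feats = [torch.randn(B, N, d) for d in (1152, 1024, 256)]  # SigLIP/DINOv2/SAM stand-ins
img_emb = VisionProjector()(feats)                         # [B, N, 2048]
text_emb = torch.randn(B, 32, 2048)
llm_input = torch.cat([img_emb, text_emb], dim=1)          # fed into the RWKV-7 LLM
```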
-
AudioRWKV-7 (Table 10): The following are the results from Table 10 of the original paper:
| Model | #Parameters | Architecture | mAP ↑ |
| --- | --- | --- | --- |
| DeepRes (Ford et al., 2019) | 26M | CNN | 0.392 |
| HST-AT | 88.5M | Transformer | 0.433* |
| HST-AT pretrained (Chen et al., 2022) | 88.5M | Transformer | 0.429* |
| MambaOut (Yu & Wang, 2024) | 101.3M | Mamba | 0.397 |
| AudioRWKV-6 | 8.9M | RWKV6 | 0.381 |
| AudioRWKV-6 | 19.8M | RWKV6 | 0.426 |
| AudioRWKV-7 | 8.9M | RWKV7 | 0.392 |
| AudioRWKV-7 | 19.8M | RWKV7 | 0.431 |

Analysis:
- AudioRWKV-7 achieves comparable performance to CNN, Transformer, and Mamba-based architectures with a much smaller parameter count.
- It exceeds the performance of AudioRWKV-6 at both 8.9M and 19.8M parameters, demonstrating improved capabilities for audio embedding analysis.
6.2. Ablation Studies / Parameter Analysis
1. Ablation Experiments on The Pile (Table 18): The following are the results from Table 18 of the original paper:
| Model | Tokens (B) | lmb.o ppl ↓ | lmb.0 acc ↑ | hella acc_n ↑ | piqa acc ↑ | arcE acc ↑ | arcC acc ↑ | glue acc ↑ | WG acc ↑ | sciq acc↑ | avg acc |
| RWKV4-169M-Pile | 332 | 29.2 | 33.2 | 32.2 | 64.8 | 47.1 | 19.9 | 47.6 | 51.2 | 77.6 | 46.7 |
| Pythia-160M | 300 | 37.3 | 35.4 | 30.3 | 62.3 | 43.6 | 19.5 | 46.5 | 51.3 | 75.4 | 45.5 |
| Mamba-130M | 300 | 16.0 | 44.3 | 35.3 | 64.5 | 48.0 | 19.7 | 48.5 | 52.1 | 78.2 | 48.8 |
| Mamba2-130M | 300 | 16.8 | 43.9 | 35.3 | 64.9 | 47.4 | 20.9 | 45.8 | 52.6 | 81.0 | 49.0 |
| RWKV6-173M-Pile | 332 | 16.0 | 44.5 | 34.9 | 64.4 | 48.3 | 19.7 | 48.9 | 51.9 | 80.6 | 49.2 |
| RWKV7-168M-Pile | 332 | 14.2 | 45.7 | 36.9 | 65.5 | 47.9 | 19.7 | 49.1 | 52.4 | 81.6 | 49.8 |
| RWKV4-430M-Pile | 332 | 13.1 | 45.1 | 40.8 | 67.7 | 52.8 | 24.1 | 49.4 | 52.0 | 80.7 | 51.6 |
| Pythia-410M | 300 | 10.8 | 51.6 | 40.6 | 66.7 | 51.9 | 21.4 | 44.1 | 53.3 | 81.5 | 51.4 |
| Mamba-370M | 300 | 8.1 | 55.6 | 46.5 | 69.5 | 55.0 | 25.0 | 46.8 | 55.5 | 84.5 | 54.8 |
| Mamba2-370M | 300 | 8.0 | 55.9 | 46.9 | 70.5 | 54.8 | 25.1 | 48.1 | 55.4 | 85.3 | 55.2 |
| RWKV7-421M-Pile | 332 | 7.2 | 57.9 | 48.0 | 69.3 | 56.3 | 23.5 | 50.3 | 56.4 | 85.9 | 56.0 |
| RWKV4-1.5B-Pile | 332 | 7.1 | 56.4 | 52.8 | 72.2 | 60.7 | 24.9 | 50.5 | 54.3 | 85.8 | 57.2 |
| Pythia-1.4B | 300 | 6.1 | 61.7 | 52.0 | 70.8 | 60.5 | 26.1 | 47.7 | 57.5 | 86.6 | 57.9 |
| Mamba-1.4B | 300 | 5.0 | 65.0 | 59.1 | 74.2 | 65.5 | 29.8 | 46.2 | 61.4 | 87.3 | 61.1 |
| Mamba2-1.3B | 300 | 5.0 | 65.6 | 59.9 | 73.2 | 64.2 | 29.9 | 46.1 | 61.0 | 89.8 | 61.2 |
| RWKV7-1.47B-Pile | 332 | 4.8 | 67.0 | 61.8 | 73.6 | 64.9 | 30.2 | 48.0 | 64.4 | 91.1 | 62.6 |
Analysis:
- This table showcases the consistent improvements of RWKV-7 over its predecessors (RWKV-4, RWKV-6) when all models are trained on the same dataset (The Pile) under identical configurations.
- Across all three size scales (168M, 421M, 1.47B parameters), RWKV-7 achieves lower perplexity (e.g., 4.8 vs. 5.0 for Mamba2-1.3B) and higher average accuracy (e.g., 62.6 vs. 61.2 for Mamba2-1.3B) compared to RWKV-6 and often Mamba models.
- The performance gap is sustained as model size increases, suggesting that the RWKV-7 architecture scales more effectively.
2. Architecture Choice Ablations (Table 19):
Ablation studies were conducted on a small 6-layer, 768-dimension Goose model trained on minipile to validate specific design choices.
The following are the results from Table 19 of the original paper:
| Model | Training Loss Validation Loss |
| Goose | 2.834 |
| Goose, scalar decay | 2.873 |
| Goose, scalar in-context learning rate | 2.609 2.843 2.591 |
| Goose, same removal/replacement keys | 2.840 2.560 |
| Goose, no bonus term | 2.841 |
Analysis:
- The baseline Goose model (with all RWKV-7 innovations) achieves the lowest validation loss (2.560, taken from the last listed entry, which is likely the validation loss of the best model).
- Scalar Decay: Using a scalar-valued decay instead of a vector-valued one (2.873 vs. 2.834 training loss) results in a higher training loss, validating the benefit of vector-valued decay.
- Scalar In-Context Learning Rate: Switching to a scalar-valued in-context learning rate (2.843 vs. 2.834 training loss) also leads to a higher training loss, confirming the advantage of a vector-valued ICLR.
- Same Removal/Replacement Keys: Forcing the use of the same removal and replacement keys (2.840 vs. 2.834 training loss) degrades training performance, supporting the design choice to decouple these keys.
- No Bonus Term: Removing the bonus term slightly increases training loss (2.841 vs. 2.834), indicating its positive contribution.
- These ablations confirm that the novel design choices in RWKV-7 (vector-valued decay, vector-valued ICLR, decoupled keys, and bonus term) each contribute positively to the model's performance.
3. Parameter Statistics (Appendix L, Figures 17-23):
-
Removal Key Multiplier and Replacement Rate Booster: Figures 17-20 from the original paper show box plots of these two learned parameters across models, their mean values across layers for different models, and their maximum and minimum values across layers in different models.
Analysis: The box plots show the distribution of these learned parameters across models. Mean values across layers (Figures 18, 22) show how these parameters evolve and are utilized at different depths of the network. The maximum and minimum values (Figures 19, 20, 23) indicate the range of dynamic adjustments the model makes, showing significant scaling of the removal key.
- Biases of $d_t$ (Decay Precursor): Figures 21-23 from the original paper show box plots of the biases of $d_t$ across models, their mean values across layers for different models, and their maximum and minimum values across layers in different models.
  Analysis: The statistics for $d_t$ provide insights into the behavior of the decay mechanism. Their distribution and range across layers show how the model learns to control the in-context weight decay.

4. Board Game Modeling (Reversi/Othello):
-
Training Loss (Figure 13): The following figure (Figure 13 from the original paper) shows Reversi Training loss for different token types.
Figure 13: Reversi Training loss for different token types
Analysis: The training loss for different token types indicates a phased learning process:
- The model first mastered output formatting.
- It then developed board state tracking capability.
- It continuously improved evaluation accuracy throughout training. This suggests RWKV-7 can learn complex reasoning strategies in-context.
-
Win Rates with Search (Figure 14): The following figure (Figure 14 from the original paper) shows Reversi Token consumption and win rates under different search configurations.
Figure 14: Reversi Token consumption and win rates under different search configurations
Analysis: By increasing the thinking budget (max width and depth for Alpha-Beta pruning), RWKV-7 can effectively search for better strategies in Othello, demonstrating a positive test-time scaling law. This indicates that the model can leverage its state-tracking abilities for complex in-context search.

5. State Inspections (WKV Matrix RMS and Stable Rank):
-
Visualization (Figure 15): The following figure (Figure 15 from the original paper) shows visualization example of RWKV's WKV matrices.
Figure 15: Visualization example of RWKV's WKV matrices.
Analysis: Visual inspection reveals that RWKV-7's WKV states have elements consistently on the order of 1, without the thousands-order outliers seen in RWKV-5 and RWKV-6. This confirms improved numerical stability.
RMS and Stable Rank (Figure 16): The following figure (Figure 16 from the original paper) shows the global RMS and average stable rank of WKV matrices, plotted over sequence length.
Analysis:
- RWKV-7 exhibits significantly smaller RMS values for its WKV states compared to RWKV-5 and RWKV-6, further confirming better numerical stability.
- The stable rank of RWKV-7's WKV matrix is lower than that of its predecessors for contexts longer than 32. This seemingly contradictory observation (a lower stable rank implies less information, yet RWKV-7 performs better) is hypothesized to be due to enhanced information compression and utilization capabilities in RWKV-7's state evolution, allowing it to maintain important information in a more compact form.
6. Initial Token Sensitivity (Table 20): The following are the results from Table 20 of the original paper:
| Model | EOS padding | PPL | ACC (%) | Significance |
| RWKV7 World 0.1B | 0 | 357 | 9.2 | *** |
| RWKV7 World 0.1B | 1 | 16.4 | 36.6 | |
| RWKV7 World 0.4B | 0 | 42.7 | 28.9 | *** |
| RWKV7 World 0.4B | 1 | 7.25 | 48.6 | |
| SmolLM2 360M | 0 | 21.1 | 39.4 | * |
| SmolLM2 360M | 1 | 9.17 | 49.3 | |
| Qwen2.5 0.5B | 0 | 12.2 | 47.9 | NS |
| Qwen2.5 0.5B | 1 | 7.97 | 54.9 | |
Analysis:
- RWKV-7 models exhibit significant prompt sensitivity to the inclusion of the `<EOS>` token at the beginning of the input, especially smaller models.
- For RWKV7 World 0.1B, including the `<EOS>` token improves accuracy from 9.2% to 36.6% (PPL from 357 to 16.4).
- This suggests RWKV-7 struggles to retain the very first token in memory without proper state initialization or context setting.
- Interestingly, Transformer-based models like Qwen2.5 show a less pronounced impact, implying better mechanisms for attending to initial tokens.
- The authors even found that two consecutive `<EOS>` tokens can further improve performance, suggesting the model leverages these special tokens for state initialization. A minimal sketch of this padding recommendation follows.
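To illustrate the practical recommendation, here is a tiny, hypothetical sketch of prepending EOS padding before inference; the tokenizer object, its `encode` method, and the `eos_token_id` attribute are assumptions for illustration, not an API documented by the paper.

```python
def build_input_ids(prompt, tokenizer, num_eos_pad=1):
    """Prepend EOS padding token(s) so the model can use them for state initialization.
    (Hypothetical tokenizer interface; adapt to the actual tokenizer in use.)"""
    eos_id = tokenizer.eos_token_id           # assumed attribute name
    return [eos_id] * num_eos_pad + tokenizer.encode(prompt)

# Usage (hypothetical): ids = build_input_ids("Hello", tok, num_eos_pad=1)
```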
6.3. Training Details (Figure 12)
The following figure (Figure 12 from the original paper) shows the training loss curve for RWKV-7 World models. Blue line: loss; Red line: smoothed loss; Green line: actual learning rate; Dotted red line: learning rate schedule.
Analysis:
- The training loss curves for all four RWKV-7 World models demonstrate extreme stability, with no observed loss spikes. This indicates robust training dynamics for the RWKV-7 architecture.
- The cosine learning rate decay schedule and phased dynamic batch size scaling strategy appear to be effective in managing training stability and efficiency (a cosine-schedule sketch is shown below).
- The stability is attributed partly to the small AdamW $\epsilon$ value chosen, which helps mitigate sudden loss increases.
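For reference, here is a minimal sketch of a cosine learning-rate decay schedule with warmup; the peak/final learning rates, warmup length, and step counts below are illustrative assumptions, not the values used to train the RWKV-7 World models.

```python
import math

def cosine_lr(step, total_steps, lr_max=3e-4, lr_min=3e-5, warmup_steps=1000):
    """Cosine learning-rate decay with linear warmup (hyperparameters are illustrative)."""
    if step < warmup_steps:
        return lr_max * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

# Example: learning rate halfway through training.
print(cosine_lr(step=50_000, total_steps=100_000))
```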
6.4. Transition Matrix Eigenvalues and Stability (Appendix C)
- Theorem 1: When the decay entries lie in $[0, 1]$, the in-context learning rate entries lie in $[0, 1]$, and $\hat{\kappa}_t$ is a unit vector, the transition matrix is similar to a symmetric matrix, all its eigenvalues lie in $[-1, 1]$, and it admits at most one negative eigenvalue. If the transition matrix is time-independent, the update formula is guaranteed to be stable.
RWKV-7's stability. It shows that thetransition matrixinRWKV-7is acontraction matrix(spectral norm ), preventing unbounded growth of theWKV state. The ability to havenegative eigenvalues(up to one) is distinct from some priorDeltaNetvariants and hints at richer dynamics that might contribute to its enhanced expressivity, as discussed in concurrent work by Grazzi et al. (2024).
6.5. RWKV-7a for Board Game Modeling (Appendix I)
- RWKV-7a Formula:
$
\pmb{S}_t = \hat{\pmb{S}}_{t-1}\,\mathrm{diag}(w_t)\left(\pmb{I} - c\,\hat{\kappa}_t^{T}(a_t \odot \hat{\kappa}_t)\right) + \nu_t^{T} k_t
$
- This expanded formulation is useful for Othello, allowing a full range of eigenvalues (down to $-1$) when $c = 2$. This might give more control over state updates for tasks requiring fine-grained strategic adjustments; a NumPy sketch of this update follows.
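For intuition, here is a minimal NumPy sketch (an illustration, not the paper's implementation) of a single RWKV-7a-style state update for one head; dimensions, inputs, and the toy loop are arbitrary.

```python
import numpy as np

def rwkv7a_state_update(S, w, kappa_hat, a, v, k, c=1.0):
    """One step of S_t = S_{t-1} diag(w_t) (I - c * kappa_hat^T (a ⊙ kappa_hat)) + v_t^T k_t."""
    n = S.shape[0]
    transition = np.diag(w) @ (np.eye(n) - c * np.outer(kappa_hat, a * kappa_hat))
    return S @ transition + np.outer(v, k)

# Toy usage for a single 8-dimensional head.
rng = np.random.default_rng(0)
n = 8
S = np.zeros((n, n))                              # WKV state
for _ in range(4):                                # a few recurrent steps
    w = rng.uniform(0.5, 1.0, n)                  # vector-valued decay
    a = rng.uniform(0.0, 1.0, n)                  # vector-valued in-context learning rate
    kappa = rng.normal(size=n)
    kappa_hat = kappa / np.linalg.norm(kappa)     # normalized removal key
    v, k = rng.normal(size=n), rng.normal(size=n)
    S = rwkv7a_state_update(S, w, kappa_hat, a, v, k, c=1.0)
print(S.shape)
```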
7. Conclusion & Reflections
7.1. Conclusion Summary
The paper introduces RWKV-7 "Goose", a significant advancement in recurrent neural network (RNN) architectures for sequence modeling. It successfully demonstrates that RNN-based models can achieve state-of-the-art (SoTA) performance competitive with or superior to Transformer-based models in the 3 billion parameter class, particularly in multilingual tasks, despite being trained on dramatically fewer tokens. This highlights RWKV-7's exceptional parameter and data efficiency.
The core innovation lies in a newly generalized delta rule formulation, which incorporates vector-valued gating, in-context learning rates, and decoupled removal/replacement keys. These features provide RWKV-7 with a highly expressive and dynamic state evolution mechanism, allowing for fine-grained control over its internal memory. Critically, the paper provides theoretical proofs that RWKV-7 possesses computational power beyond $\mathsf{TC}^0$ (the conjectured limit of Transformers), enabling it to perform state tracking and recognize all regular languages with a constant number of layers, thereby solving $\mathsf{NC}^1$-complete problems.
Beyond its core language modeling capabilities, RWKV-7 maintains inherent RNN advantages: linear time complexity and constant memory usage per token during inference, making it highly efficient for long contexts. The release of RWKV World v3, a 3.1 trillion token multilingual corpus, and four pre-trained RWKV-7 models under the Apache 2.0 License, further supports openness, reproducibility, and adoption within the research community. The architecture also shows promising results in multimodal applications (vision and audio) and in-context search for board games, demonstrating its versatility and robustness.
7.2. Limitations & Future Work
7.2.1. Limitations
The authors acknowledge several limitations of RWKV-7 and the current research:
- Numerical Precision: Certain operators, especially the WKV7 kernel, are sensitive to numerical precision. Differences in kernel implementations and precision settings during training can impact dynamics and results, necessitating careful handling for deployment. The observation of NaN loss during training, despite overall stability, points to this sensitivity.
- Lack of Instruction Tuning and Alignment: The released RWKV-7 models are pretrained base models. They have not undergone Supervised Fine-Tuning (SFT) for instruction following or alignment with human preferences (RLHF). This limits their direct usability in real-world conversational or instruction-based applications.
- Prompt Sensitivity: RWKV-7 models exhibit significant prompt sensitivity, particularly to the presence of the special `<EOS>` token at the beginning of inputs. Omitting it can lead to degraded performance, suggesting challenges in retaining initial token information or properly initializing the state.
- Compute Resources: Training was constrained by available compute resources (a limited number of Nvidia H800 GPUs). This is less than the resources used for very large modern LLM training (e.g., DeepSeek-V3). The need to continue training from pre-existing checkpoints of earlier RWKV versions and re-use parts of the dataset might limit the full potential of RWKV-7 compared to training from scratch.
7.2.2. Future Work
The authors propose several directions for future research:
- Training Larger Models and Datasets: Scaling RWKV-7 to even larger sizes and training on more extensive datasets is a primary goal, contingent on securing additional computational resources.
- Speedup Techniques: Exploring various speed optimization techniques (e.g., a dual pipelining mechanism, Mixture-of-Experts, multi-token prediction, FP8 training) highlighted in other works like DeepSeek-V3. Many of these are orthogonal to RWKV-7's architectural optimizations and could be integrated. Further kernel-level optimizations and distributed training strategies are also planned.
- Incorporating Chain-of-Thought Reasoning: Developing reinforcement learning pipelines to integrate Chain-of-Thought (CoT) reasoning (Wei et al., 2022) into RWKV-7. The authors believe that, as a linear RNN, RWKV-7 is well-suited for efficient CoT due to its state-tracking capabilities, which could enable it to excel in multi-step logical reasoning and complex problem-solving.
7.3. Personal Insights & Critique
This paper presents a compelling case for the continued relevance and potential superiority of RNN-based architectures in the era of Transformers, particularly for efficiency and deep state-tracking. The RWKV-7 architecture, with its generalized delta rule, stands out as a thoughtful and theoretically grounded approach to overcoming long-standing RNN limitations while avoiding the quadratic scaling of Transformers.
Innovations and Strengths:
- Theoretical Expressivity: The formal proofs that RWKV-7 can recognize all regular languages and solve $\mathsf{NC}^1$-complete problems are a standout contribution. This isn't just an empirical gain; it suggests a fundamental architectural advantage over $\mathsf{TC}^0$-limited models. This expressive power for state tracking is highly valuable for tasks requiring sequential logic, programmatic execution, or complex memory operations, which Transformers often struggle with or require external tools for.
- Efficiency and Performance: Achieving SoTA multilingual and near-SoTA English performance with significantly fewer training tokens than competitors like Qwen2.5 is a testament to RWKV-7's data and parameter efficiency. Coupled with its linear inference time and constant memory usage, it positions RWKV-7 as a highly practical and scalable choice for long-context applications where Transformers become prohibitive.
- Fine-Grained State Control: The vector-valued gating and in-context learning rates, along with decoupled removal/replacement keys, represent a sophisticated evolution of the delta rule. This level of control over the WKV state likely contributes to its superior memory management and associative recall capabilities. The analogy of the RWKV-7 state as an "internal scratchpad" is very intuitive and highlights its dynamic nature.
- Open Science: The commitment to open-sourcing models, datasets, and code is commendable and crucial for fostering research and adoption of RWKV as a viable alternative.
Potential Issues and Areas for Improvement/Critique:
- Numerical Precision and Stability Trade-offs: While stability is generally good, the observed NaN loss and the sensitivity to the AdamW $\epsilon$ setting suggest that the underlying numerical stability, especially in complex CUDA kernels, is a continuous engineering challenge. The balance between maximum decay, expressivity, and numerical stability might require further exploration.
- Complexity of State Dynamics vs. Interpretability: The generalized delta rule is powerful, but its vector-valued and decoupled nature might make mechanistic interpretability more challenging than simpler delta rule variants or Transformers. Understanding why specific channels decay or learn at certain rates could be a complex research area.
- Pre-training from Checkpoints: The limitation of continuing training from older RWKV checkpoints rather than starting from scratch on the full RWKV-7 architecture and World v3 dataset is a practical constraint. It's possible that RWKV-7's full potential has not yet been unlocked, and some inductive biases from older architectures might persist. This is acknowledged by the authors but remains an area where future research could yield even stronger results.
- Prompt Sensitivity (Practical Implications): The prompt sensitivity to the `<EOS>` token is a practical hurdle. While a recommendation is given, it implies that RWKV-7's state initialization is not as robust or context-agnostic as Transformers'. This might require more careful prompt engineering or architectural modifications to make the models more user-friendly out-of-the-box.
- Role of the Transition-Matrix Setting in Expressivity Proofs: The expressivity proof relies on a particular setting of the transition matrix that differs from the one used in the released model. Although the authors explain that GroupNorm can compensate for magnitude scaling, the theoretical reliance on this setting (which enables operations like perfect swaps) for certain proofs might indicate a slight mismatch, or a need to fine-tune the model to fully exploit this theoretical capability in practice.
Transferability and Future Value:
-
Alternative to Transformers:
RWKV-7 provides a strong, efficient alternative to Transformers, especially as sequence lengths grow and computational resources become a bottleneck. Its linear scaling is a fundamental advantage that will become increasingly important.
Foundation for State-Tracking AI: The proven expressivity opens doors for
RNNsto excel in tasks that inherently require complexstate tracking, such as symbolic reasoning, code execution simulation, or advanced agents that maintain sophisticated internal states. This could be applied to domains like robotics, formal verification, or even more advancedboard game AI. -
Multimodal Potential: The successful application to
VisualRWKV-7andAudioRWKV-7highlights the architecture's versatility beyond pure text. Its efficiency can be a significant advantage inmultimodal settingswhere input sequences (e.g., high-resolution images, long audio streams) are inherently long and high-dimensional. -
Bridging Theory and Practice: The paper successfully bridges theoretical insights into computational complexity with practical, high-performing models, providing a valuable framework for understanding the capabilities of different neural network architectures.
Overall,
RWKV-7 "Goose" is a significant and exciting development. It demonstrates that with innovative architectural design and a deep understanding of recurrent dynamics, RNNs can not only compete with but theoretically surpass Transformers in specific areas of expressivity and practical efficiency.